# Symptom-Based Alerts: Putting User Experience at the Forefront

User experience is central to any digital product. Traditional monitoring & observability tools diligently flag metrics from our infrastructure & APIs, but there is *often a disconnect between these metrics and the user's real-world experience.* Engineering teams should therefore complement system observability with tracking of customer symptoms & SLOs.

### What are Symptom-based alerts?

Symptom-based alerting means monitoring the customer's “goal” or “experience”, especially when setting up alerts & SLOs. Tracking customer experience & goals is a strong, actionable way to track the behavior of the user.

> “Operations is ultimately a business problem, not just a technical one.”
> 
> — Blog by the [Google](https://cloud.google.com/blog/topics/developers-practitioners/why-focus-symptoms-not-causes) Cloud team

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1697736360112/06d1945d-e828-4370-800c-99c7786dd432.png align="center")

### Risks of skipping symptom-based alerts

Distributed systems are already hard to troubleshoot and investigate. Without symptom-based alerts, on-call engineers receive more cause-level alerts than they can triage, which doesn’t help teams much.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1697736042974/e3b7c505-d77a-4747-b7fc-5784c64beb1f.png align="center")

### **Benefits of adding symptom-based alerts:**

1. **User-Centric Approach:** With symptom-based alerts and SLOs, user experience always remains central, translating telemetry data into actionable real-world insights on what’s happening with users.
    
2. **Reduced Noise:** Traditional monitoring can flood teams with alerts, many of which might be insignificant in the context of overall system health. Symptom-based alerts focus on noticeable patterns, drastically reducing the number of irrelevant notifications.
    
3. **Immediate Impact Recognition:** Because these alerts highlight issues that directly impact user experience, teams can act proactively, mitigate problems sooner, and identify root causes faster.
    

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1697737075763/94c095c4-9c53-48dc-b20f-c580324b79ea.png align="center")

### Setting up symptom-based alerts:

Adding symptom-based alerts with custom instrumentation means defining SLOs and metrics that capture the customer experience or goal. This definition can happen at multiple points in the development lifecycle:

* As part of the design process
    
* Iterate after product/feature launch
    
* Re-iterate after product stability
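
To make "defining SLOs" a little more concrete, here is a minimal sketch in Python of what a symptom-level SLO could look like. The `SLO` class, the `checkout_completed` name, and the 99.5% target are all hypothetical and only meant for illustration; your own goals and thresholds will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A symptom-level objective, expressed from the user's point of view."""

    name: str      # hypothetical label for the user goal being tracked
    target: float  # fraction of attempts that must succeed, e.g. 0.995

    def success_ratio(self, good: int, total: int) -> float:
        # With no traffic there is nothing to violate, so treat it as healthy.
        return 1.0 if total == 0 else good / total

    def is_breached(self, good: int, total: int) -> bool:
        return self.success_ratio(good, total) < self.target

    def error_budget_remaining(self, good: int, total: int) -> float:
        """Fraction of the error budget left: 1.0 is untouched, <= 0 is used up."""
        allowed_failures = (1.0 - self.target) * total
        if allowed_failures == 0:
            return 1.0 if good == total else 0.0
        return 1.0 - (total - good) / allowed_failures


# Assumed example: 10,000 checkout attempts, 40 of which failed.
checkout = SLO(name="checkout_completed", target=0.995)
print(checkout.is_breached(good=9_960, total=10_000))
print(round(checkout.error_budget_remaining(good=9_960, total=10_000), 2))
```

In this example the objective still holds, but most of the error budget is already spent, which is the kind of signal worth revisiting after launch and again once the product is stable.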
    

While setting them up, here’s a simple framework to help you keep them actionable:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1697736383368/011e0a9b-28d9-4a58-9d77-5f98d4d588cf.png align="center")

## Mistakes to avoid while setting up alerts:

### 1\. **Only tracking individual components and not end goals:**

Tracking only individual components can leave critical workflows untracked end to end, especially those split across asynchronous steps.

**Potential blind spot:** A silently failing scheduled cron job or a failure in publishing to a queue could impact customers while going completely unnoticed by the team.
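
One way to catch a silent failure like this is to alert on the end goal going stale, rather than on any single component. The sketch below is a rough illustration in Python; the function name `is_workflow_stale`, the "invoice sent" workflow, and the one-hour threshold are assumptions made up for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_workflow_stale(
    last_success: Optional[datetime],
    now: datetime,
    max_age: timedelta,
) -> bool:
    """Return True when the end goal has not been reached recently enough.

    last_success is when the final step of the workflow last completed
    (for example, "invoice sent"), not when the job was last started.
    """
    if last_success is None:
        return True  # the goal was never reached: the silent-failure case
    return now - last_success > max_age


# Assumed scenario: invoices should go out at least once an hour.
now = datetime(2023, 10, 19, 12, 0, tzinfo=timezone.utc)
last_invoice_sent = datetime(2023, 10, 19, 9, 30, tzinfo=timezone.utc)
print(is_workflow_stale(last_invoice_sent, now, max_age=timedelta(hours=1)))
```

Because the check only looks at when the goal was last achieved, it fires whether the cron job crashed, never started, or ran but failed to publish to the queue.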

### 2\. **Relying only on auto-instrumented metrics:**

Complement the APM golden signals and infrastructure metrics with custom metrics representing your user experience.

**Potential blind spot:** The error rate of your payment service, or the distribution of the response\_status\_code, **≠** the rate of successful payments.
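
The gap between the two can be shown with a small sketch. The `PaymentAttempt` record and its `outcome` values are hypothetical; the point is only that an auto-instrumented view and a custom, goal-level metric can disagree on the same traffic.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PaymentAttempt:
    http_status: int  # what auto-instrumented APM metrics see
    outcome: str      # hypothetical business result: "succeeded", "declined" or "pending"


def http_error_rate(attempts: Iterable[PaymentAttempt]) -> float:
    """Share of requests that returned a 5xx status code."""
    items: List[PaymentAttempt] = list(attempts)
    if not items:
        return 0.0
    return sum(1 for a in items if a.http_status >= 500) / len(items)


def payment_success_rate(attempts: Iterable[PaymentAttempt]) -> float:
    """Share of attempts where the user actually managed to pay."""
    items: List[PaymentAttempt] = list(attempts)
    if not items:
        return 1.0  # no traffic, so nothing has failed
    return sum(1 for a in items if a.outcome == "succeeded") / len(items)


attempts = [
    PaymentAttempt(200, "succeeded"),
    PaymentAttempt(200, "declined"),
    PaymentAttempt(200, "pending"),
    PaymentAttempt(200, "succeeded"),
]
print(http_error_rate(attempts))       # 0.0: the service looks healthy
print(payment_success_rate(attempts))  # 0.5: half of the users could not pay
```

Every request here returned a 200, so an alert on status codes alone would stay quiet while half of the payments did not go through.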

### 3\. **Not adding tags/identifiers:**

Add identifiers to your metrics to help you identify impacted users. These tags could range from a “client name” to your user’s “device type” to the “user-id”.

**Potential blind spot:** Your overall SLOs might be well within their limits even though they have been breached significantly for a specific customer. Without the tags, it’ll be hard for your team to identify the radius of impact.
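
Here is a rough sketch of how a tag can surface that kind of hidden breach. The client names `acme` and `globex`, the event shape, and the 99% target are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, bool]  # (client_name tag, whether the user goal succeeded)


def success_ratio_by_tag(events: Iterable[Event]) -> Dict[str, float]:
    """Compute the success ratio separately for every tag value."""
    totals: Dict[str, int] = defaultdict(int)
    good: Dict[str, int] = defaultdict(int)
    for tag, succeeded in events:
        totals[tag] += 1
        if succeeded:
            good[tag] += 1
    return {tag: good[tag] / totals[tag] for tag in totals}


def breached_tags(events: Iterable[Event], target: float) -> List[str]:
    """Return the tag values whose own success ratio is below the target."""
    ratios = success_ratio_by_tag(events)
    return sorted(tag for tag, ratio in ratios.items() if ratio < target)


# Assumed traffic: one large healthy client and one small, badly impacted one.
events: List[Event] = (
    [("acme", True)] * 990 + [("globex", True)] * 5 + [("globex", False)] * 5
)
overall = sum(succeeded for _, succeeded in events) / len(events)
print(overall >= 0.99)               # the aggregate SLO still holds
print(breached_tags(events, 0.99))   # the per-client view shows who is impacted
```

The aggregate number hides the smaller client entirely, while the per-tag breakdown points straight at the radius of impact.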

### 4\. **Missing out on adding the configurations in logs:**

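Configurations are an essential lifeline of any application, and changes to them can impact your users. Adding the active configuration to your logs lets you correlate that impact with the change that caused it.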

**Potential blind spot:** A recent configuration change might have impacted your users, but the cause might go unnoticed if there’s no way to correlate your metrics with the configurations.
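
One possible way to make that correlation easy is to stamp every log line with a short fingerprint of the active configuration. The sketch below uses only the Python standard library; the helper names `config_fingerprint` and `log_line`, and the config keys, are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def config_fingerprint(config: Dict[str, Any]) -> str:
    """Short, stable hash of the active configuration."""
    # Sorting keys makes the hash independent of dictionary ordering.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def log_line(event: str, config: Dict[str, Any], **fields: Any) -> str:
    """Build a JSON log line that carries the config fingerprint."""
    record = {
        "event": event,
        "config_fingerprint": config_fingerprint(config),
        **fields,
    }
    return json.dumps(record, sort_keys=True)


# Assumed configs before and after a change to the payment timeout.
old_config = {"payment_timeout_s": 30, "retry_limit": 3}
new_config = {"payment_timeout_s": 5, "retry_limit": 3}

print(log_line("payment_failed", new_config, user_id="u-123"))
print(config_fingerprint(old_config) != config_fingerprint(new_config))
```

Grouping failures by `config_fingerprint` then shows whether a spike started with a particular configuration, without having to log the full config on every line.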

### 5\. **Using alerts as a goal, not a means to improvement:**

While improving your alerting & monitoring capabilities is critical for operational reasons, alerts are also a powerful way to identify areas of improvement in your application and make it more reliable. 😊

If you want to read more about the topic, I’d recommend this [document](https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/) authored by Rob Ewaschuk, an SRE at Google.

### About Doctor Droid:

Doctor Droid is a real-time analytics platform to help teams create and track critical product & operational metrics with smart alerts & dashboards. Here's the [link](https://drdroid.io/) to sign up and try the product!
