Symptom-Based Alerts: Putting User Experience at the Forefront

In today's digital age, user experience is key. While traditional monitoring & observability tools are diligent in flagging metrics from our infrastructure & APIs, there is often a disconnect between these metrics and the user's real-world experience. Engineering teams should complement system observability with tracking of customer symptoms & SLOs.

What are Symptom-based alerts?

Symptom-based alerting refers to monitoring the customer's “goal” / “experience”, especially when it comes to setting up alerts & SLOs. Tracking customer experience & goals is a strong and actionable way to track the behavior of the user.

“Operations is ultimately a business problem, not just a technical one.”

— Blog by the Google Cloud team

Risks of skipping symptom-based alerts

Distributed systems are already hard to troubleshoot and investigate. Sending an on-call engineer more alerts than they can investigate doesn’t help the team.

Benefits of adding symptom-based alerts:

User-Centric Approach: With symptom-based alerts and SLOs, user experience always remains central, translating telemetry data into actionable real-world insights on what’s happening with users.
Reduced Noise: Traditional monitoring can flood teams with alerts, many of which might be insignificant in the context of overall system health. Symptom-based alerts focus on noticeable patterns, drastically reducing the number of irrelevant notifications.
Immediate Impact Recognition: By highlighting issues that directly impact user experience, teams can act proactively, mitigate potential challenges, and identify root causes faster.

Setting up symptom-based alerts:

Adding symptom-based alerts with custom instrumentation means defining SLOs and metrics that capture the customer experience/goal. This definition can happen at multiple points in the development lifecycle:

As part of the design process
Iterate after product/feature launch
Re-iterate after product stability
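As a rough illustration of what "defining an SLO" on a customer goal can look like, here is a minimal sketch in Python. The function name `evaluate_slo`, the event counts, and the 99% target are all assumptions made up for this example, not values from any specific tool.

```python
def evaluate_slo(good_events, total_events, target):
    """Compare the observed success ratio (the SLI) against an SLO target.

    good_events:  number of user goals that completed successfully
    total_events: number of user goals that were attempted
    target:       SLO target as a fraction, e.g. 0.99 for 99%
    """
    if total_events == 0:
        # No traffic in the window: nothing to judge, so do not alert.
        return {"sli": None, "breached": False}
    sli = good_events / total_events
    return {"sli": sli, "breached": sli < target}


# Hypothetical window: 985 of 1000 checkouts succeeded, target is 99%.
result = evaluate_slo(good_events=985, total_events=1000, target=0.99)
print(result)
```

The same check can be revisited at each of the points listed above: drafted during design, tuned after launch, and tightened once the product is stable.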

While working on setting them up, here’s a simple framework to help you keep it actionable:

Mistakes to avoid while setting up alerts:

1. Only tracking individual components and not end goals:

This mistake could lead to missing out on tracking critical workflows that might be split across asynchronous steps.

Potential blind spot: A silently failing scheduled cron job or a failure in publishing to a queue could impact customers and go completely unnoticed by the team.
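One way to catch this kind of gap is to track the workflow end to end: record when it starts, record when the final step finishes, and flag anything that never completes. The sketch below assumes a hypothetical order workflow and a 10-minute deadline; the function name and data shapes are illustrative only.

```python
from datetime import datetime, timedelta


def find_stalled_workflows(started, completed, now, deadline):
    """Return IDs of workflows that started but did not finish in time.

    started:   dict mapping workflow ID -> datetime of the first step
    completed: set of workflow IDs whose final step was recorded
    deadline:  timedelta a workflow is allowed to take end to end
    """
    return sorted(
        wf_id
        for wf_id, started_at in started.items()
        if wf_id not in completed and now - started_at > deadline
    )


# Hypothetical data: three orders were placed, only one was confirmed.
now = datetime(2024, 1, 1, 12, 0)
started = {
    "order-1": now - timedelta(minutes=30),
    "order-2": now - timedelta(minutes=20),
    "order-3": now - timedelta(minutes=2),  # still within the deadline
}
completed = {"order-1"}

stalled = find_stalled_workflows(started, completed, now, timedelta(minutes=10))
print(stalled)
```

Here every individual component could look healthy, yet `order-2` never reached its end goal. An alert on the size of `stalled` is tied to the customer outcome rather than to any one step.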

2. Relying only on auto-instrumented metrics:

Complement the APM golden signals and infrastructure metrics with custom metrics representing your user experience.

Potential blind spot: The error rate of your payment service, or the distribution of the response_status_code, is not the same as the successful payment rate.
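To make the difference concrete, here is a small sketch that computes both numbers from the same set of requests. The `PaymentAttempt` record and its fields are assumptions for illustration; in practice the success flag would come from your own custom instrumentation.

```python
from dataclasses import dataclass


@dataclass
class PaymentAttempt:
    status_code: int         # what auto-instrumentation typically sees
    payment_succeeded: bool  # custom signal: did the user's payment go through?


def http_error_rate(attempts):
    """Fraction of requests that returned a 5xx status code."""
    if not attempts:
        return 0.0
    return sum(1 for a in attempts if a.status_code >= 500) / len(attempts)


def payment_success_rate(attempts):
    """Fraction of attempts where the payment actually completed."""
    if not attempts:
        return 0.0
    return sum(1 for a in attempts if a.payment_succeeded) / len(attempts)


# Hypothetical traffic: every request returns 200, but half the payments
# fail (for example, declined downstream while the API still responds OK).
attempts = [
    PaymentAttempt(200, True),
    PaymentAttempt(200, True),
    PaymentAttempt(200, False),
    PaymentAttempt(200, False),
]

print(http_error_rate(attempts), payment_success_rate(attempts))
```

In this made-up sample the error rate is zero while only half of the payments succeed, so an alert on status codes alone would stay silent.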

3. Not adding tags/identifiers:

Add identifiers in your metrics to help you identify impacted users — these tags could vary from a “client name” to your user’s “device type” to the “user-id”.

Potential blind spot: Your overall SLOs might be well within the limits even though they might have been breached significantly for a specific customer. Without the tags, it’ll be hard for your team to identify the radius of impact.
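The sketch below shows how a tag lets you break the same metric down per customer. The event shape, the `client` tag, the client names, and the 95% target are all hypothetical values chosen for the example.

```python
from collections import defaultdict


def success_rate_by_tag(events, tag):
    """Group events by the given tag and return the success rate per value."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for event in events:
        key = event[tag]
        totals[key] += 1
        if event["success"]:
            successes[key] += 1
    return {key: successes[key] / totals[key] for key in totals}


# Hypothetical traffic: one large healthy client, one small failing client.
events = (
    [{"client": "acme", "success": True}] * 95
    + [{"client": "globex", "success": True}] * 1
    + [{"client": "globex", "success": False}] * 4
)

SLO_TARGET = 0.95  # assumed target for this example
overall = sum(e["success"] for e in events) / len(events)
per_client = success_rate_by_tag(events, "client")
breaching = sorted(c for c, rate in per_client.items() if rate < SLO_TARGET)

print(overall, per_client, breaching)
```

The overall rate sits above the target, but the per-client view shows that `globex` is far below it. Without the `client` tag on each event, that breakdown would not be possible.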

4. Missing out on adding the configurations in logs:

Configurations are an essential lifeline of any application, and changing them can have an impact on your users.

Potential blind spot: A recent configuration change might have triggered an impact to your users, but might go unnoticed if there’s no way to correlate your metrics to the configurations.
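A simple way to make that correlation possible is to attach the active configuration to each log line. This is a minimal sketch using Python's standard `logging` and `json` modules; the `log_event` helper, the logger name, and the config keys are hypothetical.

```python
import json
import logging


def log_event(logger, message, config, **fields):
    """Emit a JSON log line that carries the active configuration.

    Returns the serialized line so callers (or tests) can inspect it.
    """
    record = {"message": message, "config": config, **fields}
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("checkout")  # hypothetical service name

# Hypothetical configuration that was live when the event happened.
active_config = {"config_version": "v42", "retry_limit": 3}

line = log_event(logger, "payment_failed", active_config, user_id="u-123")
```

With the configuration version present in every event, a spike in failures can be grouped by `config_version` to check whether it started with a particular change.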

5. Using alerts as a goal, not a means to improvement:

While it’s critical to improve alerting & monitoring capabilities for operational reasons, alerts are also a powerful way to identify areas of improvement in your application and make it more reliable. 😊

If you want to read more about the topic, I’d recommend this document authored by Rob Ewaschuk, an SRE at Google.

About Doctor Droid:

Doctor Droid is a real-time analytics platform to help teams create and track critical product & operational metrics with smart alerts & dashboards.