An Engineering Manager's Guide to Alert Operations — Part 1

This guide is aimed at Engineering Managers and Tech Leads who own a few services, are on-call for their products, and are escalated to by PMs or business teams whenever there is customer impact. It covers practices for being proactive and setting up the right alerts for you and your team.

Part 1 - Setting Up Alerts

In this section, I will provide a basic explanation of how to set up alerts in your team. I will also cover how to decide which metrics to track and provide some tactical information related to alert quality measurement.

Prerequisites for setting up alerts:

Before diving into the details, please ensure the following:

  1. You have already implemented instrumentation on key metrics for your services and have dashboards for visualization using tools like Grafana or APM tools like Datadog or New Relic (a minimal instrumentation sketch follows this list).

    1. For more information on "Golden signals" for your services, refer to this link.

    2. If you are using Kubernetes, you can follow this guide to set up Prometheus + Grafana for container-level metrics.

    3. If you use logs for error reporting, you can refer to this link for instructions on setting up log-error-based alerts without additional tools.

  2. You have access to and have integrated infrastructure-level metrics into the same toolset. This includes metrics for managed databases, cache, Kafka brokers, etc.
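
As a starting point for prerequisite 1, here is a minimal sketch of golden-signal instrumentation in a Python service using the prometheus_client library. The metric names, labels, endpoint, and port are illustrative assumptions rather than a prescribed convention.

```python
# A minimal sketch of golden-signal instrumentation with prometheus_client.
# Metric names, labels, the endpoint, and the port are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Exposed as http_requests_total on the /metrics endpoint.
REQUESTS = Counter("http_requests", "Total HTTP requests", ["endpoint", "status"])
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds", ["endpoint"]
)

def handle_checkout() -> None:
    """Pretend request handler; replace with your real request path."""
    start = time.time()
    time.sleep(random.uniform(0.01, 0.2))          # simulated work
    status = "200" if random.random() > 0.05 else "500"
    LATENCY.labels(endpoint="/checkout").observe(time.time() - start)
    REQUESTS.labels(endpoint="/checkout", status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes metrics from :8000/metrics
    while True:
        handle_checkout()
```

Once Prometheus is scraping these metrics, Grafana (or your APM tool) can chart them, and alerting rules can fire on error rate or latency thresholds.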

Getting started:

  1. Identify services that are directly in the path of your customers, both internal and external.

    1. Prioritise critical services such as order management, payment processing, authentication, and onboarding.
  2. Use the alerting toolset that your metrics platform offers:

    1. Commercial APM tools have a rich feature set for setting up alerts on all kinds of metrics.

    2. Cloud providers have built-in metrics on all infrastructure components and provide alerting on them.

      1. You can refer to this example for setting up CloudWatch alarms on your AWS resources and receiving alerts in your Slack channels (see the boto3 sketch after this list).
    3. For each component, be it your Kubernetes cluster, a microservice, or a database, identify the following:

      1. The metrics on which you want to set alerts

      2. The threshold for each metric at which you want to be notified

      3. Where you want to be notified (email, Slack channel)

  3. It is also important to be alerted when a new runtime exception is introduced in your code. Set up alerts for each new exception using error reporting tools (we use Sentry; a minimal SDK sketch follows this list).

  4. Some teams prefer a less structured approach and simply publish messages into their Slack channels or send emails when they notice issues in their code flow, such as exceptions or bad responses from services. While this is acceptable for small-scale setups, it becomes unmanageable in the long run for larger systems.
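
To make the CloudWatch example mentioned above concrete, here is a minimal boto3 sketch that creates an alarm on an RDS instance's CPU and notifies an SNS topic (which could then be routed to Slack, for example via AWS Chatbot). The instance identifier, topic ARN, and threshold are placeholder assumptions.

```python
# A minimal sketch: create a CloudWatch alarm on RDS CPU and notify an SNS topic.
# The instance identifier, SNS topic ARN, and threshold are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="orders-db-high-cpu",
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "orders-db"}],
    Statistic="Average",
    Period=300,               # evaluate 5-minute averages
    EvaluationPeriods=3,      # sustained for 15 minutes before alerting
    Threshold=80.0,           # percent CPU
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
    AlarmDescription="RDS CPU above 80% for 15 minutes; check slow queries.",
)
```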
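
For the Sentry point in item 3, here is a minimal sketch of initialising the Python SDK so that unhandled exceptions are reported; the DSN, environment, and sample rate are placeholders. The "notify me when a new issue is first seen" rule itself is then configured in Sentry's alert settings.

```python
# A minimal sketch of wiring a Python service to Sentry so new exceptions are
# reported; the DSN, environment, and sample rate are placeholders.
import sentry_sdk

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    environment="production",
    traces_sample_rate=0.1,  # optional performance tracing sample rate
)

def place_order(order_id: str) -> None:
    if not order_id:
        # Unhandled exceptions are captured automatically once the SDK is initialised.
        raise ValueError("order_id must not be empty")
```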

How do you measure your alert quality?

Your alerts are leading indicators (before your business stakeholders)

You should learn about an incident, or its potential, before customers or business teams do. For example, a drop in order volume is the customer impact, but the root cause could be users struggling to get from the landing page to checkout because the catalogue APIs are slow. An alert on the latency of those APIs would have surfaced the problem before the order numbers did.

Your alerts are leading indicators (before other team members in other services/engineering)

Don't limit alerts to just service-level metrics like error rate or latency. Consider consumer alerts for upstream components (e.g., Kafka brokers) and downstream components (e.g., databases, cache).
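
As one example of an upstream alert, the sketch below creates a Datadog monitor on Kafka consumer lag using the datadog Python package. The metric name (kafka.consumer_lag), consumer group tag, threshold, and Slack handle are assumptions based on Datadog's Kafka consumer integration; adapt them to whatever your setup actually reports.

```python
# A minimal sketch: create a Datadog metric monitor on Kafka consumer lag.
# The metric name, consumer group tag, threshold, and Slack handle are assumptions.
from datadog import api, initialize

initialize(api_key="DD_API_KEY", app_key="DD_APP_KEY")  # placeholder credentials

api.Monitor.create(
    type="metric alert",
    query=(
        "avg(last_10m):avg:kafka.consumer_lag"
        "{consumer_group:orders-service} > 50000"
    ),
    name="Kafka consumer lag high for orders-service",
    message=(
        "Consumer lag above 50k messages for 10 minutes. "
        "Check consumer throughput and broker health. @slack-orders-oncall"
    ),
    tags=["team:orders", "component:kafka"],
    options={"thresholds": {"critical": 50000}, "notify_no_data": False},
)
```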

Alerts indicate a potential impact rather than just a benchmark failure

Alerts should indicate a potential impact rather than just a failure to meet a benchmark. Avoid setting overly strict thresholds that result in frequent alerts, as this defeats the purpose of having them. Reserve stricter benchmarks for SLO reports, which can be discussed periodically with the team.
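
To make the distinction concrete, the sketch below shows the kind of arithmetic that belongs in a periodic SLO report rather than in a paging alert; the 99.5% target and request counts are made-up numbers.

```python
# A minimal sketch of SLO reporting arithmetic; the target and counts are
# made-up numbers for illustration.
def slo_report(total_requests: int, failed_requests: int, slo_target: float = 0.995):
    """Return attainment and the fraction of error budget consumed."""
    attainment = 1 - failed_requests / total_requests
    error_budget = 1 - slo_target                        # allowed failure rate
    budget_used = (failed_requests / total_requests) / error_budget
    return attainment, budget_used

attainment, budget_used = slo_report(total_requests=2_000_000, failed_requests=6_000)
print(f"Availability: {attainment:.4%}, error budget consumed: {budget_used:.0%}")
# Availability: 99.7000%, error budget consumed: 60%
```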

Actionable Alert %

Establish a clear action plan for each alert and assign accountability to the on-call team. This ensures a feedback loop to adjust thresholds and ownership as alerts are investigated over time. Ideally, alerts should diminish over time as measures are taken to address the root causes.
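
One way to track this is to tag each alert during on-call review and compute the share that required action. The sketch below assumes a simple list of reviewed alerts; the alert names are made up.

```python
# A minimal sketch of computing actionable alert %, assuming each alert has been
# tagged during on-call review as actionable or not.
from dataclasses import dataclass

@dataclass
class ReviewedAlert:
    name: str
    actionable: bool   # did the on-call engineer have to act on it?

def actionable_alert_pct(alerts: list[ReviewedAlert]) -> float:
    if not alerts:
        return 0.0
    return 100 * sum(a.actionable for a in alerts) / len(alerts)

weekly_review = [
    ReviewedAlert("checkout p99 latency", actionable=True),
    ReviewedAlert("orders-db high CPU", actionable=True),
    ReviewedAlert("staging disk usage", actionable=False),  # noise: tune or delete
    ReviewedAlert("kafka consumer lag", actionable=True),
]
print(f"Actionable alerts this week: {actionable_alert_pct(weekly_review):.0f}%")  # 75%
```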

Tips for Ensuring Alert Quality:

In the next article, I will cover in detail how you can consistently ensure high-quality alerting in your team. Here are a few nuggets in the meantime:

New deployments happen with alerts in place

Set up alerts immediately upon deploying a service, rather than adding them later. Early deployments are particularly susceptible to issues, and delaying alert setup may result in missed problems due to low customer volume.

Alerts are being fine-tuned for thresholds and receivers periodically

Regularly fine-tune alert thresholds and receivers based on investigation outcomes. This makes alerts more actionable, relevant, and representative of customer impact. As your team evolves, consider changing the ownership of receiving alerts. For more on managing alerts, refer to Part 2 of this series.

At Doctor Droid, we have developed a small application that fetches alerts from your Slack channel or metrics monitoring tool (currently supporting 9 tools, including Sentry, New Relic, Datadog, HoneyBadger, and Cloudwatch). It provides insights into the quality of your alerts. You can learn more about it here. Instructions for trying it out are provided at the end of the document.