<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Notes by Doctor Droid]]></title><description><![CDATA[Doctor Droid team shares product guides, demos and best practices in observability.]]></description><link>https://notes.drdroid.io</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1723284089567/30d18ee5-02b3-42cd-a3d6-b53a35289f14.png</url><title>Notes by Doctor Droid</title><link>https://notes.drdroid.io</link></image><generator>RSS for Node</generator><lastBuildDate>Sun, 19 Apr 2026 18:18:54 GMT</lastBuildDate><atom:link href="https://notes.drdroid.io/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[How DrDroid AI SRE Agent is specialised for Production Incidents & On-call Investigations]]></title><description><![CDATA[By working with 100s of engineers and their debugging problems, we iterated over DrDroid. The investigation agent assists engineers with complex analysis which are critical, time sensitive and have li]]></description><link>https://notes.drdroid.io/how-drdroid-ai-sre-agent-is-specialised-for-production-incidents-on-call-investigations</link><guid isPermaLink="true">https://notes.drdroid.io/how-drdroid-ai-sre-agent-is-specialised-for-production-incidents-on-call-investigations</guid><category><![CDATA[ai-sre]]></category><category><![CDATA[#AIOps]]></category><category><![CDATA[observability]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[logging]]></category><dc:creator><![CDATA[Siddarth Jain]]></dc:creator><pubDate>Thu, 09 Apr 2026 12:32:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/63200bf16c86d75accc7fd61/4502861e-8549-49d8-a504-3e26c79cc16a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>By working with 100s of engineers and their debugging problems, we iterated on DrDroid. The investigation agent assists engineers with complex analyses that are critical, time-sensitive, and have little room for error. Here are some of the things that help the investigation agent perform well:</p>
<h2><strong>1. Specialized Debugging Tools &amp; Skills</strong></h2>
<p>Production incidents often require analyzing large volumes of logs, traces, and metrics. This can be token-intensive and time-consuming. Most LLMs and agentic frameworks hit context window limits quickly, can't process production-scale data, and lose quality when analyzing large datasets.</p>
<p><strong>How DrDroid Solves This:</strong></p>
<h4><strong>Pre-Built Aggregate Analysis Tools</strong></h4>
<p>Instead of feeding raw logs into an LLM, DrDroid has tools designed specifically for handling large-volume logs with:</p>
<ul>
<li><p>Built-in log aggregation and pattern detection</p>
</li>
<li><p>Complex trace analysis across distributed systems</p>
</li>
<li><p>Large-volume metrics analysis with outlier detection and ML techniques</p>
</li>
</ul>
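<p>For intuition, here is a minimal sketch (not DrDroid's actual implementation) of the kind of pattern detection such a tool performs: variable tokens are normalized away so that thousands of raw lines collapse into a handful of countable templates the LLM can actually reason over:</p>
<pre><code class="lang-python">import re
from collections import Counter

def template(line: str) -&gt; str:
    """Collapse variable tokens so similar log lines share one template."""
    line = re.sub(r"\d{4}-\d{2}-\d{2}[T ][\d:.,]+Z?", "&lt;TS&gt;", line)  # timestamps
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "&lt;HEX&gt;", line)             # hex ids
    line = re.sub(r"\b\d+\b", "&lt;NUM&gt;", line)                        # bare numbers
    return line

def top_patterns(lines, n=5):
    """Return the n most frequent log templates with their counts."""
    return Counter(template(l) for l in lines).most_common(n)

# A 100,000-line file reduces to a few templates instead of a blown context window:
# for pattern, count in top_patterns(open("app.log")): print(count, pattern)
</code></pre>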
<h4><strong>Specialized Investigation Skills</strong></h4>
<p>Our agent has domain-specific skills built from working with hundreds of engineers on real debugging problems:</p>
<ul>
<li><p>How to query and analyze traces in Signoz or Datadog</p>
</li>
<li><p>How to navigate APM data efficiently</p>
</li>
<li><p>How to correlate metrics across multiple monitoring tools</p>
</li>
</ul>
<p><strong>Real Impact:</strong> The agent can process 100,000+ log lines in seconds and surface the 5 relevant errors—something that would exhaust a generic LLM's context window.</p>
<h2>2. Code &amp; Application Awareness</h2>
<p>Most LLMs and agents start every investigation from zero, with only the context of the prompt and sometimes markdown files.</p>
<h3><strong>How DrDroid Solves This:</strong></h3>
<h4>Automatic Code Context Generation</h4>
<p>Even before your first chat with the agent, DrDroid builds knowledge of:</p>
<ul>
<li><p>What each repository does</p>
</li>
<li><p>What capabilities, APIs, features, and workflows each repo covers</p>
</li>
<li><p>Programming languages, frameworks, and file structures</p>
</li>
<li><p>Connections between multiple repositories (discovered via traces and logs)</p>
</li>
</ul>
<h4>Business Workflow Understanding</h4>
<p>You can ask DrDroid to build context around critical business and product workflows. The agent understands:</p>
<ul>
<li><p>"The checkout flow involves payment-service, inventory-service, and notification-service"</p>
</li>
<li><p>"When users report 'payment stuck,' check these three services in this order"</p>
</li>
</ul>
<p><strong>Real Impact:</strong> When an alert fires on "payment-service," DrDroid already knows what that service does, which other services depend on it, and where to look for root causes.</p>
<h2>3. Infrastructure &amp; Resource Awareness</h2>
<p>An LLM with MCP connections doesn't know which apps run in which Kubernetes clusters, which databases are in which cloud providers, or how your infrastructure is organised. It needs to query multiple tools (costing time and tokens) and explore them before it can answer.</p>
<p><strong>How DrDroid Solves This:</strong></p>
<h4>Auto-Discovery of Infrastructure</h4>
<p>DrDroid continuously maps:</p>
<ul>
<li><p>Apps hosted in different Kubernetes clusters</p>
</li>
<li><p>Databases and their cloud providers</p>
</li>
<li><p>Service dependencies and communication patterns</p>
</li>
<li><p>Network topology and resource relationships</p>
</li>
</ul>
<h4>Service Map &amp; Dependency Graph</h4>
<p>The agent can answer questions like:</p>
<ul>
<li><p>"Which services depend on the payments database?"</p>
</li>
<li><p>"If eu-west-1 goes down, what's affected?"</p>
</li>
<li><p>"Show me all services running in the production cluster"</p>
</li>
</ul>
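<p>Conceptually, this is a directed dependency graph. A toy sketch of answering those questions (the service names are hypothetical; DrDroid discovers these edges automatically rather than having them hand-built):</p>
<pre><code class="lang-python">import networkx as nx

# Hypothetical service map; DrDroid builds this from traces, logs, and topology.
deps = nx.DiGraph()   # edge A -&gt; B means "A depends on B"
deps.add_edges_from([
    ("checkout", "payment-service"),
    ("payment-service", "payments-db"),
    ("inventory-service", "payments-db"),
    ("notification-service", "payment-service"),
])

# "Which services depend on the payments database?" = everything upstream of it.
print(nx.ancestors(deps, "payments-db"))
# Blast radius of a payment-service incident = everything that depends on it.
print(nx.ancestors(deps, "payment-service"))
</code></pre>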
<p><strong>Real Impact:</strong> During an incident, the agent instantly <strong>knows the blast radius</strong> and <strong>which downstream services might be affected</strong>—without you having to explain your architecture.</p>
<h2>4. Past Alert &amp; Incident Pattern Recognition</h2>
<p>With generic agents, every investigation is independent. They have no memory of past incidents or patterns.</p>
<p><strong>How DrDroid Solves This:</strong> Searchable Alert History</p>
<p>The agent has access to:</p>
<ul>
<li><p>All alerts since platform enablement</p>
</li>
<li><p>Past incidents and their resolutions</p>
</li>
<li><p>RCAs and postmortems (from Confluence, docs, or previous investigations)</p>
</li>
<li><p>Patterns observed across historical alerts</p>
</li>
</ul>
<p>When similar issues occur, the agent can say:</p>
<ul>
<li><p>"This looks similar to the incident from Jan 15th where the Redis cache was full"</p>
</li>
<li><p>"Last time this alert fired, the root cause was a config change in service X"</p>
</li>
</ul>
<p><strong>Real Impact:</strong> Repeat incidents get resolved faster because the agent learns from past investigations.</p>
<h2>5. Continually Learning System</h2>
<p>DrDroid improves with every investigation.</p>
<h4>Active Learning from Your Environment</h4>
<p>The agent continuously creates notes and memory from:</p>
<ul>
<li><p>Recent commits and merges in your applications</p>
</li>
<li><p>Investigations and conversations with the agent</p>
</li>
<li><p>Human conversations in Slack channels (optional)</p>
</li>
</ul>
<h4>Contextual Memory Storage</h4>
<p>Everything is stored with metadata:</p>
<ul>
<li><p>Timestamp</p>
</li>
<li><p>Related entities (services, databases, clusters)</p>
</li>
<li><p>Related team and people</p>
</li>
<li><p>Relevant tags and categories</p>
</li>
</ul>
<p><strong>Real Impact:</strong> The agent gets smarter every week. After a month, it knows your environment better than most new engineers.</p>
<h2>6. Context Compaction (1M+ Token Conversations)</h2>
<p>Faced with large context windows, agents typically do one of the following:</p>
<ul>
<li><p>Summarize the entire conversation (losing critical context) or</p>
</li>
<li><p>Hit token limits and can't continue</p>
</li>
<li><p>Slow down dramatically as conversations grow</p>
</li>
</ul>
<p>With production telemetry data, these limits are hit quickly and often.</p>
<p><strong>How DrDroid Solves This:</strong></p>
<h4>Intelligent Compression Without Context Loss</h4>
<ul>
<li><p>Tool calls are compressed (only IDs and summaries preserved)</p>
</li>
<li><p>Reasoning and train of thought remain intact (no summarization)</p>
</li>
<li><p>Agent maintains full context even beyond 1M tokens</p>
</li>
</ul>
<h4>Smart Tool-Level Compaction</h4>
<p>Our tools have built-in context management:</p>
<ul>
<li><p>Logging tool has grep/search capability over large volumes</p>
</li>
<li><p>Agent can "eyeball and search" logs instead of loading everything into context</p>
</li>
</ul>
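<p>To make the compression idea concrete, here is a simplified sketch over a chat transcript. The message shape and thresholds are assumptions for illustration, not DrDroid internals: old, oversized tool outputs are replaced by an ID plus a one-line summary, while reasoning is kept verbatim:</p>
<pre><code class="lang-python">MAX_TOOL_CHARS = 500   # assumed budget per retained tool result

def compact(messages):
    """Shrink old tool outputs to an ID plus summary; keep reasoning verbatim."""
    head, tail = messages[:-10], messages[-10:]   # recent turns stay untouched
    compacted = []
    for msg in head:
        if msg["role"] == "tool" and len(msg["content"]) &gt; MAX_TOOL_CHARS:
            compacted.append({
                "role": "tool",
                "tool_call_id": msg["tool_call_id"],
                "content": "[compacted] " + msg["content"].splitlines()[0][:120],
            })
        else:
            compacted.append(msg)   # assistant reasoning is never summarized
    return compacted + tail
</code></pre>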
<p><strong>Real Impact:</strong> You can have a 2-hour debugging session with 500+ tool calls, and the agent never loses context or slows down.</p>
<h2>7. Multi-Channel Conversations with Shareability</h2>
<p>The agent is designed to work from wherever is convenient for you:</p>
<ul>
<li><p>Slack DMs</p>
</li>
<li><p>Thread replies to alerts</p>
</li>
<li><p>Web UI</p>
</li>
<li><p>CLI (coming soon)</p>
</li>
<li><p>API triggers</p>
</li>
<li><p>Voice calls (coming soon)</p>
</li>
</ul>
<p><strong>Seamless Sharing</strong></p>
<p>Any investigation can be:</p>
<ul>
<li><p>Shared with teammates for review</p>
</li>
<li><p>Linked in postmortems</p>
</li>
<li><p>Referenced in future incidents</p>
</li>
</ul>
<p><strong>Real Impact:</strong> When someone gets paged, they can see the auto-investigation that already ran in the Slack thread—no need to DM the agent separately.</p>
<h2>8. Automated Investigations</h2>
<p>DrDroid can run via automated triggers, giving your team proactive visibility:</p>
<ul>
<li><p>Alert fires in PagerDuty/OpsGenie → Investigation starts automatically</p>
</li>
<li><p>Cron-based health checks → Agent investigates on schedule</p>
</li>
<li><p>Custom triggers via API or webhooks</p>
</li>
</ul>
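<p>As a hedged illustration of the alert-triggered path, a minimal webhook receiver might look like the following. The investigation endpoint and the PagerDuty payload shape are assumptions to adapt, not documented DrDroid APIs:</p>
<pre><code class="lang-python">from flask import Flask, request
import requests

app = Flask(__name__)
INVESTIGATE_URL = "https://api.example.com/investigations"   # hypothetical endpoint

@app.post("/pagerduty-webhook")
def on_alert():
    event = request.get_json(force=True)
    # Start an investigation for every triggered incident (payload shape assumed).
    if event.get("event", {}).get("event_type") == "incident.triggered":
        requests.post(INVESTIGATE_URL, json={
            "source": "pagerduty",
            "incident_id": event["event"]["data"]["id"],
        }, timeout=10)
    return "", 204
</code></pre>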
<p><strong>Real Impact:</strong> The agent can detect issues even without alerts; by the time you open an alert, the agent has already investigated and summarized the likely root cause.</p>
<h2>9. Smart Model Switching (85% Cost Savings)</h2>
<p>LLMs have been commoditised, and the SOTA model is not necessarily required for every investigation. DrDroid smartly chooses between different LLMs based on investigation complexity:</p>
<ul>
<li><p>Simple tasks → Faster, cheaper models</p>
</li>
<li><p>Complex reasoning → State-of-the-art models</p>
</li>
</ul>
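<p>A minimal sketch of what such a router could look like (the heuristic and model names are placeholders; DrDroid's actual policy is internal):</p>
<pre><code class="lang-python"># Illustrative routing heuristic only; the real policy is more involved.
CHEAP, FRONTIER = "small-model", "frontier-model"   # placeholder model names

def pick_model(task: str, context_tokens: int) -&gt; str:
    simple = any(k in task.lower() for k in ("summarize", "format", "list", "fetch"))
    if simple and context_tokens &lt; 8_000:
        return CHEAP      # fast, cheap model for routine steps
    return FRONTIER       # SOTA model for multi-step root-cause reasoning
</code></pre>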
<p><strong>Real Impact:</strong> Up to 85% token savings compared to always using frontier models, with no degradation in investigation quality.</p>
<h2>10. Dedicated File System &amp; Memory</h2>
<p>Memory management for large-scale infrastructure requires a structured approach.</p>
<p>DrDroid maintains a persistent knowledge base: all context, memory, investigations, and alerts are stored and accessible:</p>
<ul>
<li><p>Agent can navigate past investigations like files</p>
</li>
<li><p>Search across all historical data</p>
</li>
<li><p>Reference previous findings instantly</p>
</li>
</ul>
<p><strong>Real Impact:</strong> "Show me all investigations related to database timeouts in the last 30 days" returns instant results.</p>
<h2>11. Coding Sub-Agent for Hotfixes</h2>
<p>A coding agent operates very differently from a production investigation agent. DrDroid comes pre-packaged with a coding agent connected to the investigation agent.</p>
<p><strong>Real-Time Coding Agent:</strong> When needed, it:</p>
<ul>
<li><p>Spins up a coding agent in an ephemeral sandbox</p>
</li>
<li><p>Reviews the full repository</p>
</li>
<li><p>Creates hotfix PRs with proper context</p>
</li>
</ul>
<p><strong>Real Impact:</strong> During an incident, the agent can say "I found the bug in payment-service line 247—here's a PR to fix it."</p>
<h2>12. Remote Machine &amp; Kubernetes Access</h2>
<p>During production incidents, data often needs to be reviewed on a remote machine or Kubernetes cluster. These systems might be inaccessible or sensitive.</p>
<p><strong>How DrDroid Solves This:</strong> Direct infrastructure access without exposing credentials to the agent</p>
<ul>
<li><p>Execute commands on remote machines via SSH (keys are not exposed to the agent)</p>
</li>
<li><p>Query read-only Kubernetes clusters directly</p>
</li>
<li><p>Access VMs and clusters within your VPC via reverse proxy</p>
</li>
</ul>
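<p>One way to picture the "keys never reach the agent" pattern: the agent asks for a named check, and an execution proxy that holds the credentials runs a pre-approved, read-only command. A minimal sketch, with the allowlist as an assumed policy:</p>
<pre><code class="lang-python">import subprocess

# The proxy, not the agent, holds the SSH credentials; the agent only names a check.
ALLOWED = {
    "disk": "df -h",
    "memory": "free -m",
    "uptime": "uptime",
}

def run_check(host: str, check: str) -&gt; str:
    """Run a pre-approved, read-only command on a remote host."""
    cmd = ALLOWED[check]   # unknown check name = request rejected
    result = subprocess.run(
        ["ssh", host, cmd],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout

# run_check("prod-api-01", "disk")
</code></pre>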
<p><strong>Real Impact:</strong> "Check disk space on prod-api-01" → Agent SSHs in, runs the command, and returns results. No manual execution needed.</p>
<h2>13. Image Support for Dashboard Analysis</h2>
<p>You might want to debug an issue with a screenshot shared by the customer as the starting point.</p>
<p><strong>How DrDroid Solves This:</strong> The DrDroid agent supports image processing from Slack or the UI. Share a screenshot of:</p>
<ul>
<li><p>Your product showing an error</p>
</li>
<li><p>A Grafana dashboard</p>
</li>
<li><p>A monitoring alert</p>
</li>
</ul>
<p>The agent analyses it and continues the investigation from there.</p>
<p><strong>Real Impact:</strong> "Here's what the user is seeing" → Agent understands the UI issue and investigates the backend cause.</p>
<h2>14. Granular Access Control &amp; RBAC</h2>
<p>Debugging production systems involves sensitive data and access management. DrDroid ensures only the right people have the right access while debugging.</p>
<p><strong>How DrDroid Solves This:</strong></p>
<ul>
<li><p><strong>Read commands:</strong> Execute without approval (safe exploration)</p>
</li>
<li><p><strong>Write commands:</strong> Require RBAC approval per your policy</p>
</li>
<li><p><strong>SSO integration:</strong> Syncs with your internal permissions</p>
</li>
<li><p><strong>Audit logs:</strong> Track who did what</p>
</li>
</ul>
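<p>The read/write split can be as simple as classifying the verb of a requested command before execution. A toy sketch, with the safe-verb list as an assumed policy:</p>
<pre><code class="lang-python"># Assumed policy: these verbs are read-only and safe to run without approval.
READ_VERBS = {"get", "describe", "logs", "top"}

def needs_approval(command: str) -&gt; bool:
    """Mutating commands queue for RBAC approval; reads execute immediately."""
    parts = command.split()
    verb = parts[1] if parts[0] == "kubectl" else parts[0]
    return verb not in READ_VERBS

# needs_approval("kubectl get pods -n prod")     -&gt; False: runs now
# needs_approval("kubectl delete pod api-7f9c")  -&gt; True: waits for an approver
</code></pre>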
<p><strong>Real Impact:</strong> Junior engineers can investigate safely, while dangerous operations require senior approval.</p>
<h2>15. Third-Party Vendor Status Tracking</h2>
<p>Production incidents are often caused, at least in part, by third-party downtime or issues.</p>
<p><strong>How DrDroid Solves This:</strong> Connected to Vendor Statuspages</p>
<ul>
<li><p>Tracks 150+ status pages for your third-party vendors: Stripe, AWS, Datadog, MongoDB Atlas, etc.</p>
</li>
<li><p>Flags when vendor issues might be causing downstream impact</p>
</li>
</ul>
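<p>Many vendor status pages are built on Atlassian Statuspage and expose a standard JSON endpoint, which makes this kind of check cheap to sketch. The per-vendor URLs below are assumptions to verify:</p>
<pre><code class="lang-python">import requests

# Statuspage-hosted pages commonly expose /api/v2/status.json (URLs assumed).
STATUS_PAGES = {
    "stripe": "https://status.stripe.com/api/v2/status.json",
    "datadog": "https://status.datadoghq.com/api/v2/status.json",
}

def vendor_status(vendor: str) -&gt; str:
    data = requests.get(STATUS_PAGES[vendor], timeout=10).json()
    return data["status"]["description"]   # e.g. "All Systems Operational"

# Correlate vendor degradation windows with your own incident timeline.
</code></pre>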
<p><strong>Real Impact:</strong> "Is this our issue or Stripe's?" → Agent checks Stripe's status page and correlates timing.</p>
<h2>16. Automated Quality Evaluation</h2>
<p>Production agents need quality guarantees that the team can track and trust.</p>
<p><strong>How DrDroid Solves This:</strong> LLM-Based Evals on Every Investigation</p>
<p>Every investigation is automatically evaluated for:</p>
<ul>
<li><p>Accuracy</p>
</li>
<li><p>Safety</p>
</li>
<li><p>Errors or hallucinations</p>
</li>
</ul>
<p>Central teams get visibility into investigation quality and improvement opportunities.</p>
<p><strong>Real Impact:</strong> Platform team can see "Investigation quality is 94% this month, down from 97% last month—let's review the low-scoring investigations."</p>
<h2>17. User Feedback &amp; Team Visibility</h2>
<p>Context within DrDroid improves continuously over time, but tracking and acting on user feedback is critical to that improvement.</p>
<p><strong>How DrDroid Solves This:</strong> Collaborative Quality Control</p>
<ul>
<li><p>Every investigation can be upvoted/downvoted</p>
</li>
<li><p>Feedback routed back to the central team helps improve agent context</p>
</li>
</ul>
<p><strong>Real Impact:</strong> The central team and managers get visibility into engineers' confidence in the AI and its impact.</p>
<h2>18. Reasoning Lifecycle &amp; Audit Trail</h2>
<p>Production incident investigations cannot be allowed to go wrong because of LLM "hallucinations" or "guesses". DrDroid ensures that every piece of reasoning and logic from the LLM is grounded in facts and data.</p>
<p><strong>How DrDroid Solves This:</strong> Transparent Investigation Path</p>
<p>The agent tracks:</p>
<ul>
<li><p>What data it queried</p>
</li>
<li><p>Why each data point was relevant</p>
</li>
<li><p>What hypothesis it built from each finding</p>
</li>
<li><p>How it reached its conclusion</p>
</li>
</ul>
<p><strong>Real Impact:</strong> You can backtrack through the investigation to validate correctness, spot gaps, or understand the agent's reasoning.</p>
<h2>Summary: Why DrDroid is Purpose-Built for Production</h2>
<table>
<thead>
<tr>
<th>Capability</th>
<th>DrDroid Investigation Agent</th>
</tr>
</thead>
<tbody><tr>
<td>Code awareness</td>
<td>Auto-discovers repos, APIs, dependencies</td>
</tr>
<tr>
<td>Infrastructure knowledge</td>
<td>Knows your K8s, cloud, databases</td>
</tr>
<tr>
<td>Log/metric analysis</td>
<td>Specialized tools for production-scale data</td>
</tr>
<tr>
<td>Memory of past incidents</td>
<td>Full history + pattern learning</td>
</tr>
<tr>
<td>Context window</td>
<td>1M+ tokens with intelligent compaction</td>
</tr>
<tr>
<td>Cost optimization</td>
<td>Smart switching, 85% savings</td>
</tr>
<tr>
<td>Permissions &amp; RBAC</td>
<td>Enterprise-grade access control</td>
</tr>
<tr>
<td>Auto-triggered investigations</td>
<td>Yes—from alerts, cron, API</td>
</tr>
<tr>
<td>Quality control</td>
<td>Automated evals + team feedback</td>
</tr>
<tr>
<td>Infrastructure execution</td>
<td>Direct SSH, K8s, API access</td>
</tr>
</tbody></table>
<h2>What This Means for Your Team</h2>
<ul>
<li><p>Every investigation starts with past context</p>
</li>
<li><p>You do not have to guide the LLM or explain your architecture every time</p>
</li>
<li><p>It can handle production-scale logs or metrics</p>
</li>
<li><p>It supports automation or proactive help</p>
</li>
<li><p>It works across different channels where your team lives</p>
</li>
</ul>
<p><strong>Ready to see the agent?</strong></p>
<p>DrDroid is a purpose-built investigation agent that understands your infrastructure, learns from your incidents, and gets smarter every day.</p>
<h2>Next steps:</h2>
<ul>
<li><p><a href="https://www.youtube.com/@DrDroidDev">Watch platform demo videos</a></p>
</li>
<li><p>Check our <a href="https://drdroid.io/integrations">MCP Servers &amp; integrations</a></p>
</li>
<li><p>Read the <a href="https://docs.drdroid.io/">documentation</a></p>
</li>
<li><p>See customer <a href="https://drdroid.io/case-studies">case studies</a></p>
</li>
</ul>
<p>Want to see how it works with your stack?</p>
<p>Setup and go-live take 1-2 hours for smaller teams and under a week for enterprises. Get started <a href="https://drdroid.io/">here</a>.</p>
]]></content:encoded></item><item><title><![CDATA[DrDroid: How AI SRE Helps Engineers who are on-call for production monitoring]]></title><description><![CDATA[Day 0 — Incident Response & Firefighting
Something is broken right now. You need answers fast.



#
Your Task
How Doctor Droid Helps You



1
"Our API latency spiked, what changed?"
Ask the agent — it]]></description><link>https://notes.drdroid.io/drdroid-how-ai-sre-helps-engineers-who-are-on-call-for-production-monitoring</link><guid isPermaLink="true">https://notes.drdroid.io/drdroid-how-ai-sre-helps-engineers-who-are-on-call-for-production-monitoring</guid><category><![CDATA[Open Source]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[observability]]></category><category><![CDATA[#AIOps]]></category><category><![CDATA[ai-sre]]></category><category><![CDATA[ai agents]]></category><dc:creator><![CDATA[Siddarth Jain]]></dc:creator><pubDate>Tue, 07 Apr 2026 10:40:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/63200bf16c86d75accc7fd61/4040cb32-0484-4703-b287-0f6dceca395d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Day 0 — Incident Response &amp; Firefighting</h2>
<p><em>Something is broken right now. You need answers fast.</em></p>
<table>
<thead>
<tr>
<th><strong>#</strong></th>
<th><strong>Your Task</strong></th>
<th><strong>How Doctor Droid Helps You</strong></th>
</tr>
</thead>
<tbody><tr>
<td>1</td>
<td><strong>"Our API latency spiked, what changed?"</strong></td>
<td>Ask the agent — it pulls recent deployments from ArgoCD/GitHub Actions, checks Grafana/Datadog metrics, and shows you what changed around the time latency spiked. No need to open 5 tabs.</td>
</tr>
<tr>
<td>2</td>
<td><strong>"Which pods are crashing in production?"</strong></td>
<td>Ask the agent — it lists failing pods, pulls their logs, shows recent K8s events, and surfaces restart counts. You get a full picture in one response.</td>
</tr>
<tr>
<td>3</td>
<td><strong>"Is the database the bottleneck?"</strong></td>
<td>Ask the agent — it runs slow query analysis on your Postgres/MySQL, checks connection pool usage, and correlates with application error rates from Datadog or New Relic.</td>
</tr>
<tr>
<td>4</td>
<td><strong>"I got paged, what's actually going on?"</strong></td>
<td>The agent auto-investigates when an alert fires. By the time you open it, there's already a summary with metrics, logs, and likely root cause pulled from your connected sources.</td>
</tr>
<tr>
<td>5</td>
<td><strong>"Are other services affected too?"</strong></td>
<td>Ask the agent — it checks health across your connected sources (Grafana, CloudWatch, K8s, Datadog) and tells you which services are degraded vs healthy.</td>
</tr>
<tr>
<td>6</td>
<td><strong>"I need to check CloudWatch logs for this error"</strong></td>
<td>Tell the agent the error pattern and time range — it queries CloudWatch Logs, Loki, or Elasticsearch directly and returns matching entries. No console login needed.</td>
</tr>
<tr>
<td>7</td>
<td><strong>"Run this PromQL/NRQL query for me"</strong></td>
<td>Give the agent your query — it executes against Prometheus, Grafana, New Relic, or Datadog and returns results inline. Great for quick checks during incidents.</td>
</tr>
<tr>
<td>8</td>
<td><strong>"SSH into the box and check disk space"</strong></td>
<td>Tell the agent the command — it runs Bash commands on remote hosts via SSH and returns output. No need to find the SSH key or remember the hostname. Works even when you don’t have laptop access.</td>
</tr>
<tr>
<td>9</td>
<td><strong>"Notify the team on Slack about this outage"</strong></td>
<td>Tell the agent what to post and where — it sends a formatted message to your Slack channel with the context you provide.</td>
</tr>
<tr>
<td>10</td>
<td><strong>"Escalate this to PagerDuty"</strong></td>
<td>The agent creates or escalates a PagerDuty/OpsGenie incident based on the investigation findings. You don't need to context-switch to the PagerDuty UI.</td>
</tr>
<tr>
<td>11</td>
<td><strong>"Create a JIRA ticket for the post-mortem"</strong></td>
<td>Tell the agent the summary — it creates a JIRA ticket with the incident details, investigation findings, and relevant links.</td>
</tr>
<tr>
<td>12</td>
<td><strong>"Did the last deploy cause this?"</strong></td>
<td>Ask the agent — it checks the latest GitHub PR merges, Jenkins builds, ArgoCD sync status, and correlates timestamps with when the issue started.</td>
</tr>
<tr>
<td>13</td>
<td><strong>"Roll back the deployment"</strong></td>
<td>Tell the agent to trigger a rollback pipeline — it kicks off the Jenkins build or GitHub Actions workflow you specify.</td>
</tr>
<tr>
<td>14</td>
<td><strong>"Check if this API endpoint is responding"</strong></td>
<td>Give the agent the URL — it makes an HTTP call and tells you the status code, response time, and body. Works for any internal API.</td>
</tr>
</tbody></table>
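<p>Several of the tasks above (rows 6 and 7 in particular) reduce to a single API call under the hood. As a rough illustration of row 6, here is what a CloudWatch Logs search looks like with boto3; the log group name is a placeholder:</p>
<pre><code class="lang-python">import time
import boto3

logs = boto3.client("logs")

def recent_errors(log_group: str, pattern: str = "ERROR", minutes: int = 60):
    """Fetch matching entries from CloudWatch Logs for the last N minutes."""
    now_ms = int(time.time() * 1000)
    resp = logs.filter_log_events(
        logGroupName=log_group,
        filterPattern=pattern,
        startTime=now_ms - minutes * 60_000,
        endTime=now_ms,
    )
    return [e["message"] for e in resp["events"]]

# recent_errors("/ecs/payment-service")   # log group name is a placeholder
</code></pre>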
<h2>Day 1 — Operational Tasks &amp; Maintenance</h2>
<p><em>No fire, but you need to keep things running smoothly.</em></p>
<table>
<thead>
<tr>
<th><strong>#</strong></th>
<th><strong>Your Task</strong></th>
<th><strong>How Doctor Droid Helps You</strong></th>
</tr>
</thead>
<tbody><tr>
<td>1</td>
<td><strong>"What's the current state of our K8s cluster?"</strong></td>
<td>Ask the agent — it shows pod status across namespaces, node resource usage, recent events, and any pods in CrashLoopBackOff or Pending state.</td>
</tr>
<tr>
<td>2</td>
<td><strong>"Are all our data sources healthy?"</strong></td>
<td>The agent tests connectivity to every configured connector every 10 seconds. Ask it for the current status, or set up Slack alerts for failures.</td>
</tr>
<tr>
<td>3</td>
<td><strong>"Show me all Grafana dashboards we have"</strong></td>
<td>Ask the agent — it returns the full inventory of dashboards, datasources, and folders it auto-discovered. Same for Datadog monitors, K8s resources, DB schemas, etc.</td>
</tr>
<tr>
<td>4</td>
<td><strong>"How big are our database tables getting?"</strong></td>
<td>Ask the agent to run a table size query on your Postgres/MySQL/ClickHouse — it returns sizes, row counts, and index usage.</td>
</tr>
<tr>
<td>5</td>
<td><strong>"Any slow queries running right now?"</strong></td>
<td>Ask the agent — it checks pg_stat_activity or equivalent on your database and shows long-running queries with their duration and state.</td>
</tr>
<tr>
<td>6</td>
<td><strong>"Check if our Jenkins pipelines are green"</strong></td>
<td>Ask the agent — it pulls recent build status from Jenkins and tells you which jobs passed, failed, or are stuck.</td>
</tr>
<tr>
<td>7</td>
<td><strong>"Is our ArgoCD app in sync?"</strong></td>
<td>Ask the agent — it checks sync status and health for your ArgoCD applications. Flags any that are out-of-sync or degraded.</td>
</tr>
<tr>
<td>8</td>
<td><strong>"Pull logs from the payment service for the last hour"</strong></td>
<td>Tell the agent the service and time range — it queries Loki, CloudWatch, or Elasticsearch and returns the logs. No need to remember log group names.</td>
</tr>
<tr>
<td>9</td>
<td><strong>"What GitHub PRs were merged today?"</strong></td>
<td>Ask the agent — it queries your GitHub repos and lists merged PRs with authors, titles, and timestamps.</td>
</tr>
<tr>
<td>10</td>
<td><strong>"Trigger a build for the staging environment"</strong></td>
<td>Tell the agent which Jenkins job or GitHub Actions workflow to run — it triggers it and reports back the status.</td>
</tr>
<tr>
<td>11</td>
<td><strong>"Send a daily cluster health report to Slack"</strong></td>
<td>Set up a scheduled playbook — the agent runs K8s health checks daily and posts a summary to your Slack channel automatically.</td>
</tr>
<tr>
<td>12</td>
<td><strong>"Check MongoDB replica set health"</strong></td>
<td>Ask the agent — it runs the appropriate commands against your MongoDB instance and returns replica set status and lag.</td>
</tr>
<tr>
<td>13</td>
<td><strong>"List all CloudWatch alarms in ALARM state"</strong></td>
<td>Ask the agent — it queries CloudWatch and returns currently firing alarms with their metric details and thresholds.</td>
</tr>
<tr>
<td>14</td>
<td><strong>"What Datadog monitors are in alert?"</strong></td>
<td>Ask the agent — it checks your Datadog monitors and lists any that are currently alerting or warning.</td>
</tr>
<tr>
<td>15</td>
<td><strong>"Run this custom SQL query on production"</strong></td>
<td>Give the agent the query — it executes against your connected Postgres, MySQL, ClickHouse, or BigQuery and returns results as a table.</td>
</tr>
<tr>
<td>16</td>
<td><strong>"Check disk and memory on our VMs"</strong></td>
<td>Tell the agent to run <code>df -h</code> and <code>free -m</code> via Bash — it SSHs in and returns the output.</td>
</tr>
<tr>
<td>17</td>
<td><strong>"Update this JIRA ticket with today's progress"</strong></td>
<td>Tell the agent the ticket ID and comment — it adds the update to JIRA without you opening the browser.</td>
</tr>
<tr>
<td>18</td>
<td><strong>"What new resources were created in our K8s cluster this week?"</strong></td>
<td>Ask the agent — it compares the current asset inventory with the previous discovery and highlights new pods, services, or deployments.</td>
</tr>
</tbody></table>
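<p>As an example of what sits behind row 5, a slow-query check on Postgres is one query against <code>pg_stat_activity</code>. A minimal sketch (the connection string is a placeholder):</p>
<pre><code class="lang-python">import psycopg2

QUERY = """
SELECT pid, now() - query_start AS duration, state, left(query, 80)
FROM pg_stat_activity
WHERE state != 'idle' AND query_start IS NOT NULL
ORDER BY duration DESC
LIMIT 10;
"""

# DSN is hypothetical; use a read-only role for this kind of check.
with psycopg2.connect("postgresql://readonly@db-host/app") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for pid, duration, state, query in cur.fetchall():
            print(pid, duration, state, query)
</code></pre>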
<hr />
<h2>Day 2 — Automation, Optimization &amp; Reliability</h2>
<p><em>You want to stop doing repetitive things and build resilience.</em></p>
<table>
<thead>
<tr>
<th><strong>#</strong></th>
<th><strong>Your Task</strong></th>
<th><strong>How Doctor Droid Helps You</strong></th>
</tr>
</thead>
<tbody><tr>
<td>1</td>
<td><strong>"Automate our incident response runbook"</strong></td>
<td>Build a runbook as per your internal process: when alert fires → pull metrics from Grafana → check K8s pods → query DB → post findings to Slack. Runs automatically on every alert.</td>
</tr>
<tr>
<td>2</td>
<td><strong>"Auto-restart pods when OOM detected"</strong></td>
<td>Set up a workflow: if K8s event shows OOMKilled → restart the pod → notify on Slack → log to JIRA. No human in the loop.</td>
</tr>
<tr>
<td>3</td>
<td><strong>"Scale up when CPU crosses 80%"</strong></td>
<td>Create a conditional workflow: check CPU metrics from Prometheus/CloudWatch → if above threshold → trigger scaling via K8s or API call → confirm on Slack.</td>
</tr>
<tr>
<td>4</td>
<td><strong>"Track our SLOs across services"</strong></td>
<td>Set up scheduled playbooks that query Prometheus or Datadog for error rate and latency → calculate against SLO targets → post weekly reports to Slack.</td>
</tr>
<tr>
<td>5</td>
<td><strong>"Alert me before we hit our error budget"</strong></td>
<td>Build a workflow that checks error budget consumption daily → if &gt; 80% consumed → post warning to Slack and create a JIRA ticket.</td>
</tr>
<tr>
<td>6</td>
<td><strong>"Correlate deploys with performance changes"</strong></td>
<td>Set up a playbook that triggers after every deploy → compares pre/post metrics from Grafana/Datadog → flags regressions automatically.</td>
</tr>
<tr>
<td>7</td>
<td><strong>"Stop SSHing into boxes for the same checks"</strong></td>
<td>Convert your common SSH commands into playbooks — disk space, process checks, log tailing. Run them from Doctor Droid with one click or on schedule.</td>
</tr>
<tr>
<td>8</td>
<td><strong>"I keep opening 5 dashboards for the same investigation"</strong></td>
<td>Build a single playbook that queries all 5 sources and gives you a combined view. Next time, ask the agent instead of opening dashboards.</td>
</tr>
<tr>
<td>9</td>
<td><strong>"Validate infrastructure after every Terraform apply"</strong></td>
<td>Set up a post-deploy playbook: check K8s resources → verify CloudWatch alarms exist → test endpoints → report pass/fail.</td>
</tr>
<tr>
<td>10</td>
<td><strong>"Audit our K8s RBAC and network policies weekly"</strong></td>
<td>Schedule a playbook that lists RBAC bindings and network policies → compares against expected state → flags drift on Slack.</td>
</tr>
<tr>
<td>11</td>
<td><strong>"Auto-create a JIRA ticket when a deploy fails"</strong></td>
<td>Build a workflow: monitor Jenkins/GitHub Actions → if build fails → create JIRA ticket with build logs and assign to the team.</td>
</tr>
<tr>
<td>12</td>
<td><strong>"Map out which services talk to which"</strong></td>
<td>Enable Network Mapper — it discovers service-to-service communication in K8s and shows you the dependency graph.</td>
</tr>
<tr>
<td>13</td>
<td><strong>"Check if our Datadog monitors match our runbook"</strong></td>
<td>Ask the agent to list all Datadog monitors — compare with your documented expectations. Set this up as a weekly drift check.</td>
</tr>
<tr>
<td>14</td>
<td><strong>"Reduce alert fatigue for the team"</strong></td>
<td>Use alert grouping and conditional workflows — similar alerts get grouped, only actionable ones reach Slack/PagerDuty.</td>
</tr>
<tr>
<td>15</td>
<td><strong>"Automate capacity reporting for leadership"</strong></td>
<td>Schedule a monthly playbook: query CloudWatch/K8s for resource trends → query BigQuery for cost data → format and send via email or Slack.</td>
</tr>
<tr>
<td>16</td>
<td><strong>"Clear application cache when memory exceeds threshold"</strong></td>
<td>Build a workflow: check memory metrics → if above limit → make API call to flush cache → verify memory dropped → notify on Slack.</td>
</tr>
<tr>
<td>17</td>
<td><strong>"Run synthetic health checks every 5 minutes"</strong></td>
<td>Schedule a playbook that hits your critical endpoints via HTTP, checks response codes and latency, and alerts on Slack if anything degrades.</td>
</tr>
<tr>
<td>18</td>
<td><strong>"Onboard new services faster"</strong></td>
<td>When a new service deploys, the agent auto-discovers it in K8s, finds its Grafana dashboards and Datadog monitors, and catalogs everything. You see it in your inventory immediately.</td>
</tr>
</tbody></table>
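<p>To make row 17 concrete, a synthetic check is just an HTTP call plus a status and latency assertion, run on a schedule. A minimal sketch with placeholder endpoints:</p>
<pre><code class="lang-python">import requests

ENDPOINTS = ["https://api.example.com/healthz"]   # placeholder critical endpoints
LATENCY_BUDGET_S = 0.5

def check():
    failures = []
    for url in ENDPOINTS:
        try:
            r = requests.get(url, timeout=5)
            if r.status_code != 200 or r.elapsed.total_seconds() &gt; LATENCY_BUDGET_S:
                failures.append((url, r.status_code, r.elapsed.total_seconds()))
        except requests.RequestException as exc:
            failures.append((url, "unreachable", str(exc)))
    return failures   # non-empty result: post an alert to Slack

# Run from a scheduler every 5 minutes; alert only on failures.
</code></pre>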
<h2><strong>What's Next?</strong></h2>
<p>DrDroid helps your team move from <strong>reactive firefighting</strong> to <strong>proactive operations</strong>. Whether you're debugging an incident at 3 AM or building automation to prevent the next one, DrDroid becomes your team's operational co-pilot.</p>
<p><strong>Ready to see how DrDroid works with your stack?</strong></p>
<ul>
<li><p><a href="https://www.youtube.com/@DrDroidDev">Watch platform demo videos</a></p>
</li>
<li><p><a href="https://drdroid.io/integrations">Check our integrations</a></p>
</li>
<li><p><a href="https://docs.drdroid.io/">Read the documentation</a></p>
</li>
<li><p><a href="https://drdroid.io/case-studies">See customer case studies</a></p>
</li>
</ul>
<p><strong>Want to try it out?</strong> Setup takes 1-2 hours and you'll see value from your first investigation. <a href="https://drdroid.io/">Get started here</a>.</p>
]]></content:encoded></item><item><title><![CDATA[KubeCon + CloudNativeCon Europe 2026 Guide – Amsterdam
]]></title><description><![CDATA[Agenda Strategy, Tracks, Networking & SRE Playbook
UpdatedMarch 2026 • 6 min read

On this page

Doctor Droid’s Guide to KubeCon + CloudNativeCon Europe 2026

Overview

Event Schedule

Who Should Atte]]></description><link>https://notes.drdroid.io/kubecon-cloudnativecon-europe-2026-guide-amsterdam</link><guid isPermaLink="true">https://notes.drdroid.io/kubecon-cloudnativecon-europe-2026-guide-amsterdam</guid><category><![CDATA[kubeconeurope]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Karan Sirohi]]></dc:creator><pubDate>Thu, 19 Mar 2026 06:59:28 GMT</pubDate><content:encoded><![CDATA[<p>Agenda Strategy, Tracks, Networking &amp; SRE Playbook</p>
<p><strong>Updated</strong><br />March 2026 • 6 min read</p>
<hr />
<h2>On this page</h2>
<ul>
<li><p>Doctor Droid’s Guide to KubeCon + CloudNativeCon Europe 2026</p>
</li>
<li><p>Overview</p>
</li>
<li><p>Event Schedule</p>
</li>
<li><p>Who Should Attend (and Who Can Skip)</p>
</li>
<li><p>How to Build Your Agenda (Without Overloading Yourself)</p>
</li>
<li><p>Track Strategy for SRE &amp; Platform Teams</p>
</li>
<li><p>Co-located Events: Where Specialists Get Leverage</p>
</li>
<li><p>Solutions Showcase: How to Avoid Vendor Fatigue</p>
</li>
<li><p>Networking Strategy for Engineering Leaders</p>
</li>
<li><p>Amsterdam Logistics Checklist</p>
</li>
<li><p>Post-Conference Execution Plan</p>
</li>
<li><p>Visit the Doctor Droid Booth</p>
</li>
</ul>
<hr />
<h2>Doctor Droid’s Guide to KubeCon + CloudNativeCon Europe 2026</h2>
<img src="https://cdn.hashnode.com/uploads/covers/66c6f53299c4280b93ee6b32/fda0dafb-b20e-4d4c-9e80-437d720ec783.png" alt="" style="display:block;margin:0 auto" />

<p>KubeCon + CloudNativeCon Europe is heading to <strong>Amsterdam, Netherlands, from 23–26 March 2026</strong>.</p>
<p>If you’re on <strong>platform engineering, SRE, DevOps, or cloud architecture teams</strong>, this event is still one of the best places to <strong>compress a year of learning into four days</strong>.</p>
<p>This guide is for <strong>practitioners who want outcomes, not conference FOMO</strong>:</p>
<ul>
<li><p>what to prioritize</p>
</li>
<li><p>how to choose tracks</p>
</li>
<li><p>how to plan networking</p>
</li>
<li><p>and how to convert conference notes into production improvements</p>
</li>
</ul>
<hr />
<h2>Overview</h2>
<p>KubeCon + CloudNativeCon is the <strong>flagship conference organized by the Cloud Native Computing Foundation (CNCF)</strong>. It gathers thousands of engineers, maintainers, and infrastructure leaders working on Kubernetes and the broader cloud-native ecosystem.</p>
<p>The event typically features:</p>
<ul>
<li><p>Major Kubernetes and CNCF ecosystem announcements</p>
</li>
<li><p>Technical deep-dives from engineers running large-scale production systems</p>
</li>
<li><p>Co-located events focused on specialized technologies</p>
</li>
<li><p>The <strong>Solutions Showcase</strong>, where cloud-native vendors demonstrate tooling across observability, security, infrastructure automation, and platform engineering.</p>
</li>
</ul>
<p>For engineering teams operating production Kubernetes environments, <strong>KubeCon acts as a yearly checkpoint for infrastructure strategy</strong>.</p>
<hr />
<h2>Event Schedule</h2>
<p>According to the Linux Foundation event page, <strong>KubeCon + CloudNativeCon Europe 2026</strong> is scheduled in Amsterdam from <strong>23–26 March 2026</strong>.</p>
<p>Expected structure:</p>
<p><strong>Day 0 (Monday)</strong><br />Pre-event programming and co-located events</p>
<p><strong>Days 1–3 (Tuesday–Thursday)</strong><br />Keynotes, breakout sessions, and Solutions Showcase</p>
<p>If you’re traveling internationally, plan to <strong>arrive by Sunday evening</strong> so you can still attend Monday’s co-located tracks.</p>
<p>These smaller events often contain some of the <strong>most advanced implementation discussions</strong> of the entire conference.</p>
<hr />
<h2>Who Should Attend (and Who Can Skip)</h2>
<h3>Attend if your team is currently dealing with:</h3>
<ul>
<li><p>Reliability issues across Kubernetes clusters</p>
</li>
<li><p>Scaling bottlenecks (multi-cluster, noisy neighbors, cost/perf tradeoffs)</p>
</li>
<li><p>Incident response gaps between alerts and root cause</p>
</li>
<li><p>AI adoption in platform engineering workflows</p>
</li>
<li><p>Security or compliance friction in cloud-native stacks</p>
</li>
</ul>
<h3>You can skip (or send one delegate) if:</h3>
<ul>
<li><p>your stack is stable and not evolving</p>
</li>
<li><p>you’re not planning infrastructure changes in the next <strong>6–12 months</strong></p>
</li>
<li><p>or you mainly need vendor procurement meetings</p>
</li>
</ul>
<p><strong>KubeCon ROI is highest when your team has active infrastructure pain and concrete architecture decisions pending.</strong></p>
<hr />
<h2>How to Build Your Agenda (Without Overloading Yourself)</h2>
<p>Most engineers fail KubeCon by <strong>overbooking sessions</strong>.</p>
<p>A better approach:</p>
<h3>1) Define 2–3 decision themes before the event</h3>
<p>Examples:</p>
<ul>
<li><p>“Should we move from <strong>single-cluster to multi-cluster</strong> for resilience?”</p>
</li>
<li><p>“How should we <strong>harden runtime security without killing developer speed</strong>?”</p>
</li>
<li><p>“Where can <strong>AI actually reduce MTTR</strong> in incident response?”</p>
</li>
</ul>
<p>Use these themes to <strong>filter sessions</strong>.</p>
<p>If a talk doesn’t help a <strong>current architecture decision</strong>, skip it.</p>
<h3>2) Split your day into three buckets</h3>
<p>Each day should include:</p>
<ul>
<li><p><strong>One depth session</strong> (deep technical talk)</p>
</li>
<li><p><strong>One trend session</strong> (strategy or ecosystem direction)</p>
</li>
<li><p><strong>One practical session</strong> (case study with production lessons)</p>
</li>
</ul>
<p>This avoids the common trap of attending <strong>100% hype talks or 100% dense internals</strong>.</p>
<h3>3) Reserve white space</h3>
<p>Keep at least <strong>90 minutes per day unscheduled</strong>.</p>
<p>Some of the <strong>highest-signal learning at KubeCon comes from hallway conversations</strong>, not slides.</p>
<hr />
<h2>Track Strategy for SRE &amp; Platform Teams</h2>
<p>If your mandate is <strong>reliability + developer speed</strong>, these themes should be top priority.</p>
<h3>Reliability engineering in Kubernetes</h3>
<p>Look for sessions covering:</p>
<ul>
<li><p>failure domains and blast radius control</p>
</li>
<li><p>progressive delivery and rollback safety</p>
</li>
<li><p>SLO design in distributed systems</p>
</li>
<li><p>incident retrospectives with architecture changes</p>
</li>
</ul>
<hr />
<h3>AI for operations (without magic claims)</h3>
<p>Prioritize talks that show:</p>
<ul>
<li><p>real workflows (alert triage, runbook assistance, anomaly explanation)</p>
</li>
<li><p>measurable outcomes (MTTR reduction, false positive reduction, toil reduction)</p>
</li>
<li><p>guardrails like human-in-the-loop systems and auditability</p>
</li>
</ul>
<hr />
<h3>Scaling and performance</h3>
<p>Focus on talks about:</p>
<ul>
<li><p>control-plane scaling</p>
</li>
<li><p>multi-tenancy isolation</p>
</li>
<li><p>workload scheduling optimization</p>
</li>
<li><p>cost/performance tuning with real production metrics</p>
</li>
</ul>
<hr />
<h3>Security and policy at scale</h3>
<p>Strong sessions usually cover:</p>
<ul>
<li><p>software supply chain controls</p>
</li>
<li><p>runtime policy enforcement</p>
</li>
<li><p>identity and secrets management patterns</p>
</li>
<li><p>tradeoffs between security and developer experience</p>
</li>
</ul>
<hr />
<h2>Co-located Events: Where Specialists Get Leverage</h2>
<p>Monday co-located events are often where <strong>advanced implementation patterns surface early</strong>.</p>
<p>If your team has niche challenges around:</p>
<ul>
<li><p>service meshes</p>
</li>
<li><p>observability pipelines</p>
</li>
<li><p>policy engines</p>
</li>
<li><p>platform APIs</p>
</li>
</ul>
<p>these tracks may provide <strong>better signal than general keynotes</strong>.</p>
<p>Recommendation:</p>
<p>Send <strong>at least one engineer</strong> to co-located sessions and have them summarize the takeaways for your team.</p>
<hr />
<h2>Solutions Showcase: How to Avoid Vendor Fatigue</h2>
<h3>The expo floor can become overwhelming quickly.</h3>
<p>Treat it like a <strong>technical discovery sprint</strong>.</p>
<p>Before visiting booths, define your constraints:</p>
<ul>
<li><p>existing observability stack</p>
</li>
<li><p>data residency and compliance requirements</p>
</li>
<li><p>budget range</p>
</li>
<li><p>integration requirements (Datadog, Grafana, PagerDuty, CloudWatch, Slack, etc.)</p>
</li>
</ul>
<hr />
<h3>Ask every vendor the same 5 questions</h3>
<ol>
<li><p>What production scale do your reference customers run?</p>
</li>
<li><p>What does deployment look like in <strong>week one</strong>?</p>
</li>
<li><p>What are the <strong>common failure modes</strong> of your product?</p>
</li>
<li><p>How do you integrate with existing <strong>incident workflows</strong>?</p>
</li>
<li><p>What metrics improve in <strong>30 / 60 / 90 days</strong>?</p>
</li>
</ol>
<p>If answers remain high-level, move on.</p>
<hr />
<h2>Networking Strategy for Engineering Leaders</h2>
<p>Skip generic networking.</p>
<p>Optimize for <strong>targeted conversations</strong>.</p>
<p>Try to meet:</p>
<ul>
<li><p><strong>3 peers running similar Kubernetes scale</strong></p>
</li>
<li><p><strong>2 teams that recently migrated tooling you’re evaluating</strong></p>
</li>
<li><p><strong>2 maintainers from critical OSS dependencies</strong></p>
</li>
</ul>
<hr />
<h3>Questions worth asking</h3>
<ul>
<li><p>“What broke after rollout?”</p>
</li>
<li><p>“What did you underestimate?”</p>
</li>
<li><p>“What would you do differently in year two?”</p>
</li>
</ul>
<p>Answers to these questions can save <strong>months of trial and error</strong>.</p>
<hr />
<h2>Planning Your Amsterdam Experience</h2>
<h3>Where to Stay</h3>
<p>Amsterdam offers many accommodation options close to the event venue.</p>
<p>Options include:</p>
<ul>
<li><p>Hotels near the conference center</p>
</li>
<li><p>Short-term apartment rentals via Airbnb or <a href="http://Booking.com">Booking.com</a></p>
</li>
<li><p>Budget hostels for solo travelers</p>
</li>
</ul>
<p>Booking early is recommended because <strong>KubeCon events tend to sell out nearby hotels quickly</strong>.</p>
<hr />
<h3>Getting There</h3>
<p>Amsterdam is served by <strong>Amsterdam Schiphol Airport (AMS)</strong>, one of Europe’s largest and most connected airports.</p>
<p>From Schiphol Airport:</p>
<ul>
<li><p>Direct trains connect to <strong>Amsterdam Central Station</strong></p>
</li>
<li><p>Metro and tram networks provide quick access to most parts of the city</p>
</li>
</ul>
<p>Public transit is generally the easiest way to move around during the conference.</p>
<hr />
<h3>Visiting Amsterdam</h3>
<p>If you have extra time, Amsterdam offers plenty to explore:</p>
<ul>
<li><p><strong>Rijksmuseum</strong> – Dutch art and history</p>
</li>
<li><p><strong>Anne Frank House</strong> – historic museum and cultural landmark</p>
</li>
<li><p><strong>Canal boat tours</strong> through the historic city center</p>
</li>
<li><p><strong>Jordaan district</strong> for cafés and restaurants</p>
</li>
</ul>
<p>Evening community meetups during KubeCon are often hosted across the city.</p>
<hr />
<h2>Amsterdam Logistics Checklist</h2>
<p>Practical tips for conference week:</p>
<ul>
<li><p>Arrive <strong>at least one day early</strong> for registration and timezone adjustment</p>
</li>
<li><p>Stay near the venue or along a <strong>direct transit line</strong></p>
</li>
<li><p>Keep evening slots free for <strong>community meetups</strong></p>
</li>
<li><p>Carry a lightweight <strong>note template</strong> for each session</p>
</li>
</ul>
<p>Suggested note template:</p>
<ul>
<li><p>Problem addressed</p>
</li>
<li><p>Architecture pattern used</p>
</li>
<li><p>Scale context</p>
</li>
<li><p>Results or metrics</p>
</li>
<li><p>Relevance to your environment</p>
</li>
</ul>
<hr />
<h2>Post-Conference Execution Plan (The Part That Matters)</h2>
<p>Conference ROI is realized <strong>after you get back</strong>.</p>
<h3>Within 72 hours</h3>
<ol>
<li><p>Consolidate notes into themes (reliability, AI, scaling, security).</p>
</li>
<li><p>Rank ideas by <strong>effort vs impact</strong>.</p>
</li>
<li><p>Choose <strong>2 quick wins and 1 strategic bet</strong>.</p>
</li>
</ol>
<h3>Within two weeks</h3>
<ul>
<li><p>Run one <strong>architecture review</strong> based on KubeCon learnings</p>
</li>
<li><p>Launch one <strong>pilot experiment</strong> with explicit success metrics</p>
</li>
<li><p>Share an internal write-up:</p>
</li>
</ul>
<p><strong>“What we learned, what we’re changing, expected impact.”</strong></p>
<p>Without this step, even great conference insights <strong>decay quickly</strong>.</p>
<hr />
<h2>Visit the Doctor Droid Booth</h2>
<p>Doctor Droid is the <strong>AI-powered Slack bot for faster incident diagnosis</strong>.</p>
<p>It helps engineering teams <strong>identify the root cause of production issues automatically</strong> by analyzing alerts, logs, and system signals.</p>
<p>At KubeCon + CloudNativeCon Europe 2026, stop by the <strong>Doctor Droid booth</strong> to:</p>
<ul>
<li><p>See <strong>live demos</strong> of AI-driven incident investigation</p>
</li>
<li><p>Explore how teams reduce <strong>MTTR and alert fatigue</strong></p>
</li>
<li><p>Grab <strong>exclusive Doctor Droid swag and giveaways</strong></p>
</li>
</ul>
<p>You can also <strong>schedule a one-on-one demo</strong> with our team:</p>
<p><a href="https://calendly.com/siddarthjain/doctor-droid-discovery-call">https://calendly.com/siddarthjain/doctor-droid-discovery-call</a></p>
<hr />
<h2>Get Early Access &amp; Updates</h2>
<p>If you're interested in <strong>Doctor Droid demos, credits, or updates during KubeCon</strong>, sign up here:</p>
<p><a href="https://forms.gle/hPzaMa4YRqDLzHZg6">https://forms.gle/hPzaMa4YRqDLzHZg6</a></p>
<p>[P.S. - You get 20% off on tickets as well :) ]</p>
<hr />
<h3>Source Note</h3>
<p>Event date and high-level schedule are based on the official Linux Foundation <strong>KubeCon + CloudNativeCon Europe event page</strong> (accessed March 2026).</p>
]]></content:encoded></item><item><title><![CDATA[Backtesting AI Agents: How SRE Teams Prove Reliability Before Production
]]></title><description><![CDATA[AI agents are finally showing up inside real incident workflows. One agent triages alerts, another scrapes dashboards, a third drafts the remediation plan. Yet 62% of organizations experimenting with ]]></description><link>https://notes.drdroid.io/backtesting-ai-agents-how-sre-teams-prove-reliability-before-production</link><guid isPermaLink="true">https://notes.drdroid.io/backtesting-ai-agents-how-sre-teams-prove-reliability-before-production</guid><category><![CDATA[Devops]]></category><category><![CDATA[AI]]></category><category><![CDATA[software development]]></category><category><![CDATA[ai agents]]></category><category><![CDATA[mlops]]></category><category><![CDATA[#AIOps]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Karan Sirohi]]></dc:creator><pubDate>Thu, 19 Mar 2026 06:58:37 GMT</pubDate><content:encoded><![CDATA[<img src="https://cdn.hashnode.com/uploads/covers/66c6f53299c4280b93ee6b32/6b378f15-ced5-40d2-a495-ceaf121293f3.png" alt="" style="display:block;margin:0 auto" />

<p>AI agents are finally showing up inside real incident workflows. One agent triages alerts, another scrapes dashboards, a third drafts the remediation plan. Yet 62% of organizations experimenting with agents admit they still cannot run them reliably in production because demos rarely expose variance, safety, or cost failures (<a href="https://www.codebridge.tech/articles/ai-agent-evaluation-how-to-measure-reliability-risk-and-roi-before-scaling">Codebridge</a>).<br />Backtesting is how SRE teams close that gap. Instead of “let’s ship and see,” you treat agents like a new microservice: define reliability budgets, hammer them with synthetic and real traces, and fail the build until you trust every path.</p>
<p>This guide shows how to build an AI-agent backtesting program that mirrors load testing for infrastructure. It leans on Codebridge’s reliability dimensions, the AI Reliability Institute’s 30-point checklist, modern agent-observability stacks, and DrDroid’s native context graph plus guardrail center.</p>
<hr />
<h2>1. The wake-up call: why AI agents need pass^k reliability</h2>
<p>A single happy-path demo is meaningless when the production pager expects deterministic success. Codebridge’s recent survey highlights the reliability delta clearly:</p>
<ul>
<li><p><strong>Prototype bias.</strong> Teams measure whether a workflow completes once, under ideal prompts, then extrapolate to production. In reality, single-run success rates of 60% often translate to only 25% full consistency when you rerun the same scenario 10+ times (<a href="https://www.codebridge.tech/articles/ai-agent-evaluation-how-to-measure-reliability-risk-and-roi-before-scaling">Codebridge</a>).</p>
</li>
<li><p><strong>Cost spikes hide in the tail.</strong> Architectures like Reflexion or self-reflection loops inflate token usage by 5.12× for marginal accuracy gains; without cost-normalized evaluation you do not see the runaway invoice until after launch (same source).</p>
</li>
<li><p><strong>Trust is earned, not promised.</strong> Venture teams Codebridge interviewed said more than 70% of execs only greenlight broader automation once they see formal evidence of safety controls, loop detection, and kill switches.</p>
</li>
</ul>
<p>Treat agent validation like you treat capacity planning. Define agent SLOs (Mean Time to Context, Agent-Assisted MTTR, Unauthorized Action Budget). Require <strong>pass^k</strong> (all trials succeed) instead of <strong>pass@k</strong> (one success out of many). Every failed attempt becomes a regression test before the agent is allowed anywhere near the on-call rotation.</p>
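<p>The difference between the two metrics is easy to state in code. A minimal sketch:</p>
<pre><code class="lang-python">def pass_at_k(trials: list) -&gt; bool:
    """pass@k: at least one of k trials succeeded. Flatters flaky agents."""
    return any(trials)

def pass_pow_k(trials: list) -&gt; bool:
    """pass^k: every one of k trials succeeded. What the pager actually needs."""
    return all(trials)

runs = [True, True, False, True, True]            # five reruns of one golden scenario
assert pass_at_k(runs) and not pass_pow_k(runs)   # demo-grade, not prod-grade
</code></pre>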
<hr />
<h2>2. Five reliability dimensions to measure every run against</h2>
<p>Codebridge frames reliability as a system property, not just “accuracy.” Their five dimensions map cleanly to the levers SRE teams already manage:</p>
<table>
<thead>
<tr>
<th>Dimension</th>
<th>What to Measure</th>
<th>Example Metrics</th>
<th>Suggested Threshold</th>
</tr>
</thead>
<tbody><tr>
<td>Consistency</td>
<td>Does the agent behave the same across repeated runs of the same scenario?</td>
<td>pass^k reliability, variance in token usage, tool-call ordering stability</td>
<td>≥95% success across 20 runs</td>
</tr>
<tr>
<td>Robustness</td>
<td>Can the agent handle noisy inputs or environmental changes?</td>
<td>Prompt perturbation success rate, tolerance to tool schema drift, retry recovery rate</td>
<td>≥90% success under perturbations</td>
</tr>
<tr>
<td>Predictability</td>
<td>Can the agent estimate when it might fail?</td>
<td>Confidence calibration vs actual success, Brier score, refusal rate when uncertain</td>
<td>Brier score &lt;0.2</td>
</tr>
<tr>
<td>Safety</td>
<td>Does the agent stay within defined policy and permission boundaries?</td>
<td>Policy violation rate, unauthorized tool calls, severity-weighted harm score</td>
<td>0 critical violations</td>
</tr>
<tr>
<td>Infrastructure &amp; Cost Stability</td>
<td>Are compute and tool usage bounded and predictable?</td>
<td>Token usage variance, reasoning step count, tool retry loops, cost per session</td>
<td>&lt;30% cost variance per run</td>
</tr>
</tbody></table>
<p>Backtesting should emit metrics for each dimension. Examples:</p>
<ul>
<li><p><strong>Consistency:</strong> For every golden scenario, run 20 Monte Carlo trials. Alert if success &lt;95% or if token usage swings &gt;30% between runs.</p>
</li>
<li><p><strong>Robustness:</strong> Randomly perturb prompts (“create a rollback” vs. “can you undo the deploy”). Evaluate success delta and force remedial prompt hardening when regression &gt;10%.</p>
</li>
<li><p><strong>Predictability:</strong> Require agents to emit confidence scores for risky actions. Route anything under 0.7 to human approval. Compare claimed confidence to measured success to compute Brier scores.</p>
</li>
<li><p><strong>Safety:</strong> Enforce negative constraints in tests (“Do not email this alias,” “Do not touch prod DB”) and fail the build if the agent even attempts the blocked action.</p>
</li>
<li><p><strong>Infrastructure:</strong> Track per-session token, tool, and latency budgets inside DrDroid’s guardrails center. Attempts to exceed a $2 reasoning budget trigger the kill switch before the vendor invoice hits.</p>
</li>
</ul>
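<p>The consistency and predictability checks above are only a few lines each. A minimal sketch of the per-scenario math, with thresholds taken from the table above:</p>
<pre><code class="lang-python">from statistics import mean, pstdev

def consistency(successes: list, tokens: list) -&gt; dict:
    """Per-scenario consistency across 20 Monte Carlo trials."""
    return {
        "success_rate": mean(successes),                             # alert if &lt; 0.95
        "token_variance_pct": 100 * pstdev(tokens) / mean(tokens),   # alert if &gt; 30
    }

def brier(confidences: list, outcomes: list) -&gt; float:
    """Calibration of claimed confidence vs. actual success; target &lt; 0.2."""
    return mean((c - float(o)) ** 2 for c, o in zip(confidences, outcomes))
</code></pre>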
<hr />
<h2>3. Designing the backtest dataset: golden, edge, adversarial, regression</h2>
<p>A strong dataset mirrors the risk surface. Codebridge recommends this split (<a href="https://www.codebridge.tech/articles/ai-agent-evaluation-how-to-measure-reliability-risk-and-roi-before-scaling">same source</a>):</p>
<ul>
<li><p><strong>20% Golden paths.</strong> Known-good workflows that mirror typical incidents.</p>
</li>
<li><p><strong>30% Edge cases.</strong> Ambiguous alerts, partial telemetry, missing runbooks.</p>
</li>
<li><p><strong>20% Adversarial.</strong> Prompt injections, malicious tool outputs, conflicting human directives.</p>
</li>
<li><p><strong>30% Regression.</strong> Every failure ever seen in prod becomes a permanent test.</p>
</li>
</ul>
<p>Layer in AI Reliability Institute’s 30-point checklist to make sure you are covering loop detection, denial-of-wallet defenses, zombie-process cleanup, policy insubordination, and kill switches (<a href="https://ai-reliability.institute/research/agentic-ai-reliability-checklist.html">AIRI</a>). DrDroid’s <strong>droidctx</strong> makes populating these scenarios easier because it keeps a living graph of alerts, dashboards, service owners, and incident annotations. You can:</p>
<ol>
<li><p><strong>Auto-generate golden cases</strong> from resolved incident timelines (alerts + deploy notes + Slack transcript).</p>
</li>
<li><p><strong>Synthesize edge cases</strong> by perturbing telemetry (drop 20% of log lines, rename dashboards) and exporting them into the test harness.</p>
</li>
<li><p><strong>Maintain adversarial suites</strong> by piping AI Reliability Institute’s negative-constraint tests (“ignore the guardrail”) straight into the prompt injection lane.</p>
</li>
<li><p><strong>Promote regressions automatically</strong> every time an agent fails in staging or prod; DrDroid’s Slack-native workflows capture the trace and push it into the regression bucket.</p>
</li>
</ol>
<hr />
<h2>4. Layered graders: deterministic checks, agent-as-a-judge, and humans</h2>
<p>A dataset without trustworthy graders is just fan fiction. Codebridge outlines a layered verification model that mirrors classic testing pyramids:</p>
<ol>
<li><p><strong>Deterministic graders</strong> (code) verify objective outcomes: did the runbook markdown change, did the Kubernetes deployment roll back, did the SQL diff match expectations.</p>
</li>
<li><p><strong>LLM-as-a-judge (AaaJ)</strong> handles subjective traits like clarity of Slack updates or whether the hypothesis actually explains the alert. Codebridge cites AaaJ frameworks achieving ~90% agreement with humans when they gather their own evidence, while cutting review cost by 97%.</p>
</li>
<li><p><strong>Human-in-the-loop</strong> remains the final gate for irreversible actions (database writes, customer communications, pager handoffs).</p>
</li>
</ol>
<p>DrDroid bakes these layers into its guardrail center:</p>
<p>- <strong>Guarded tool schema:</strong> Every tool call runs through JSON schema validation; failing schema equals instant fail.</p>
<p>- <strong>Agent approval workflows:</strong> High-risk actions appear in Slack with context, metrics, and a “CONFIRM” field so humans cannot rubber-stamp blindly.</p>
<p>- <strong>Trace exports:</strong> Each run captures the entire reasoning trace so deterministic, model-based, and human graders all work from the same evidence.</p>
<hr />
<h2>5. Tooling landscape: sim rigs, observability stacks, and when to extend beyond DrDroid</h2>
<p>Even with DrDroid’s native tracing, teams often mix in specialist eval stacks for breadth. The Maxim AI roundup of agent-testing platforms is a useful cheat sheet (<a href="https://www.getmaxim.ai/articles/top-5-platforms-to-test-ai-agents-2025-a-comprehensive-guide/">GetMaxim</a>):</p>
<p>- <strong>Maxim AI.</strong> Full lifecycle (experiment → simulate → evaluate → observe) with distributed tracing, LLM-as-a-judge, and AI gateway controls. Great when product managers need no-code scenario builders.</p>
<p>- <strong>Langfuse.</strong> Open-source tracing for teams who want to self-host every span.</p>
<p>- <strong>Arize.</strong> Extends classic ML observability (drift, dashboards) into LLM workloads, ideal for enterprises already running Arize for models.</p>
<p>- <strong>Opik (Comet).</strong> Lightweight trace logging plus evals when you need quick wins.</p>
<p>- <strong>DeepEval.</strong> Pytest-style evaluator infrastructure for engineering-heavy orgs building custom metrics.</p>
<p>How this pairs with DrDroid:</p>
<p>- Use <strong>DrDroid</strong> for incident-native context (alerts + deploys), permissions, and Slack workflows.</p>
<p>- Pipe traces to <strong>Langfuse/Maxim</strong> if you need deeper span-level analytics or cross-product dashboards.</p>
<p>- Feed evaluation metrics back into DrDroid’s SLO board so on-call engineers see “Agent backtest coverage: 86%” alongside service health.</p>
<hr />
<h2>6. Operationalizing backtests with DrDroid</h2>
<p>Here’s a practical loop SRE teams can implement in a sprint:</p>
<p>1. <strong>Ingest traces.</strong> Enable DrDroid’s trace exporter for every staging and prod run. Capture prompts, tool calls, guardrail hits, latency, and cost.</p>
<p>2. <strong>Generate scenarios.</strong> Use the captured traces plus droidctx to auto-build the golden/edge/adversarial suite. Store them in a repo so they version with code.</p>
<p>3. <strong>Wire graders.</strong> Start with deterministic checks (e.g., <code>pytest</code> verifying Grafana API responses; a sketch follows this list). Add LLM-as-a-judge jobs via Maxim or DeepEval for subjective signals. Route high-risk failures to a Slack approval queue.</p>
<p>4. <strong>Automate pass/fail gates.</strong> Add a “backtest” job to CI that runs the full suite on every scaffold change. Block merges unless success ≥95%, safety violations = 0, cost variance &lt;30%.</p>
<p>5. <strong>Publish SLOs.</strong> DrDroid dashboards should show Agent MTTC, Assisted MTTR, unauthorized-action budget, and coverage (# of alerts where agents participated). Treat SLO breaches exactly like service SLO breaches: open incidents, run postmortems, add regressions.</p>
<p>6. <strong>Keep humans in control.</strong> The AIRI checklist mandates kill switches, loop detection, DoW limits, and policy-insubordination tests. DrDroid’s guardrail center exposes all of them in one UI so on-call engineers can yank access in &lt;200 ms if the agent drifts.</p>
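<p>As a starting point for step 3, here is a deterministic grader sketched as a <code>pytest</code> test against Grafana's HTTP API; the hostname and dashboard UID are placeholders for your environment.</p>
<pre><code class="language-python">import requests

GRAFANA_URL = "http://grafana:3000"  # placeholder in-cluster address

def test_agent_run_left_grafana_intact():
    """Deterministic checks: Grafana is healthy and the dashboard the
    runbook references still exists after the agent's run."""
    health = requests.get(f"{GRAFANA_URL}/api/health", timeout=5)
    assert health.status_code == 200

    # "checkout-overview" is an illustrative dashboard UID.
    dash = requests.get(f"{GRAFANA_URL}/api/dashboards/uid/checkout-overview",
                        timeout=5)
    assert dash.status_code == 200, "agent run must not delete the dashboard"
</code></pre>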
<p>Backtesting isn’t a one-time certification. It’s a living discipline where every production event becomes a new test. When you plug DrDroid’s context engine, guardrails, and aggregated observability into that loop, AI agents stop being unpredictable copilots and become accountable teammates who earn their time on the pager.</p>
<hr />
<p>Once you have datasets, graders, and tooling in place, the next step is designing the evaluation pipeline itself.</p>
<hr />
<h2>Evaluation architecture: how agent backtests actually run</h2>
<p>Backtesting requires more than datasets and metrics. Reliable agent systems separate <strong>execution</strong>, <strong>trace capture</strong>, and <strong>evaluation</strong> into a structured pipeline.</p>
<p>A typical evaluation architecture looks like this:</p>
<pre><code class="language-plaintext">Scenario Dataset
      ↓
Simulation Harness
      ↓
Agent Execution
      ↓
Trace Capture
      ↓
Evaluation Pipeline
      ↓
CI Pass/Fail Gate
</code></pre>
<p>Each layer plays a specific role in validating reliability.</p>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Purpose</th>
<th>Example Implementation</th>
</tr>
</thead>
<tbody><tr>
<td>Scenario dataset</td>
<td>Encodes incidents and test cases</td>
<td>Golden incidents, adversarial prompts, regression tests</td>
</tr>
<tr>
<td>Simulation harness</td>
<td>Replays infrastructure signals</td>
<td>Alert replay, mock tool responses</td>
</tr>
<tr>
<td>Agent execution</td>
<td>Runs the agent scaffold</td>
<td>LLM agent + tool integrations</td>
</tr>
<tr>
<td>Trace capture</td>
<td>Records agent reasoning and actions</td>
<td>Tool calls, tokens, prompts</td>
</tr>
<tr>
<td>Evaluation pipeline</td>
<td>Grades outcomes</td>
<td>Deterministic tests + LLM judges</td>
</tr>
<tr>
<td>CI gate</td>
<td>Blocks unsafe deployments</td>
<td>Backtest job in CI</td>
</tr>
</tbody></table>
<p>This separation ensures engineers can <strong>test agents the same way they test distributed systems</strong>.</p>
<p>Instead of manually inspecting runs, every execution generates <strong>structured traces and evaluation metrics</strong>.</p>
<hr />
<h2>Testing taxonomy for AI agents</h2>
<p>Backtesting is only one layer of the testing strategy. Mature teams build a <strong>testing pyramid</strong> similar to traditional software engineering.</p>
<p>Each layer catches different classes of failures.</p>
<h3>1. Unit tests</h3>
<p>Unit tests validate the <strong>smallest components of the agent system</strong>.</p>
<p>Typical unit tests include:</p>
<ul>
<li><p>tool schema validation</p>
</li>
<li><p>prompt template formatting</p>
</li>
<li><p>guardrail logic</p>
</li>
<li><p>JSON output validation</p>
</li>
</ul>
<p>Example:</p>
<pre><code class="language-plaintext">assert tool_schema.validate(agent_output)
</code></pre>
<p>These tests are deterministic and run in milliseconds.</p>
<p>They prevent simple failures from reaching higher-level tests.</p>
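<p>A runnable version of that assertion, sketched with the <code>jsonschema</code> library; the rollback schema itself is illustrative.</p>
<pre><code class="language-python">import json
import jsonschema

# Illustrative schema for one tool call; real schemas ship with each tool.
ROLLBACK_SCHEMA = {
    "type": "object",
    "properties": {
        "tool": {"const": "k8s_rollback"},
        "deployment": {"type": "string"},
        "revision": {"type": "integer", "minimum": 1},
    },
    "required": ["tool", "deployment", "revision"],
    "additionalProperties": False,
}

def test_agent_output_matches_tool_schema():
    agent_output = json.loads(
        '{"tool": "k8s_rollback", "deployment": "checkout-api", "revision": 3}'
    )
    # Raises jsonschema.ValidationError on any mismatch, failing the test.
    jsonschema.validate(agent_output, ROLLBACK_SCHEMA)
</code></pre>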
<hr />
<h3>2. Integration tests</h3>
<p>Integration tests validate interactions between <strong>agents and infrastructure tools</strong>.</p>
<p>Examples:</p>
<ul>
<li><p>querying observability dashboards</p>
</li>
<li><p>executing Kubernetes rollbacks</p>
</li>
<li><p>posting Slack updates</p>
</li>
<li><p>retrieving runbooks</p>
</li>
</ul>
<p>These tests confirm that the agent can <strong>actually interact with the systems it relies on</strong>.</p>
<p>Failures here often come from:</p>
<ul>
<li><p>API schema changes</p>
</li>
<li><p>authentication issues</p>
</li>
<li><p>permission errors</p>
</li>
</ul>
<hr />
<h3>3. Simulation tests</h3>
<p>Simulation tests run agents in <strong>controlled synthetic environments</strong>.</p>
<p>Typical simulation features:</p>
<ul>
<li><p>replay alert streams</p>
</li>
<li><p>mock tool responses</p>
</li>
<li><p>inject telemetry noise</p>
</li>
<li><p>simulate partial failures</p>
</li>
</ul>
<p>Example simulation scenario:</p>
<pre><code class="language-plaintext">Alert: CPU spike on checkout-service
Telemetry: 20% logs missing
Tool latency: +2 seconds
</code></pre>
<p>The goal is to test <strong>robustness under imperfect conditions</strong>.</p>
<p>Simulation environments often expose reasoning failures that do not appear in ideal demos.</p>
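<p>A toy version of that scenario: wrap a mocked tool so every call gains latency and loses telemetry. All names here are illustrative; a real harness would wrap the agent's actual tool clients.</p>
<pre><code class="language-python">import random
import time

def fetch_logs_mock(service: str) -&gt; list[str]:
    """Stands in for the real log-query tool during simulation."""
    return [f"{service} log line {i}" for i in range(50)]

def with_noise(tool, extra_latency_s: float = 2.0, drop_rate: float = 0.2,
               seed: int = 7):
    """Make a tool slower and its telemetry partial, per the scenario above."""
    rng = random.Random(seed)
    def noisy(*args, **kwargs):
        time.sleep(extra_latency_s)  # tool latency: +2 seconds
        result = tool(*args, **kwargs)
        return [r for r in result if rng.random() &gt; drop_rate]  # ~20% missing
    return noisy

noisy_fetch = with_noise(fetch_logs_mock)
lines = noisy_fetch("checkout-service")
print(f"returned {len(lines)}/50 lines after noise injection")
</code></pre>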
<hr />
<h3>4. Backtests</h3>
<p>Backtests replay <strong>real incidents from production</strong>.</p>
<p>These are the most valuable tests because they contain realistic context:</p>
<ul>
<li><p>real alerts</p>
</li>
<li><p>real dashboards</p>
</li>
<li><p>real Slack conversations</p>
</li>
<li><p>real deploy timelines</p>
</li>
</ul>
<p>The agent attempts to resolve the incident using the same information that engineers had during the original outage.</p>
<p>Backtests validate:</p>
<ul>
<li><p>decision quality</p>
</li>
<li><p>operational safety</p>
</li>
<li><p>cost stability</p>
</li>
</ul>
<p>This is where <strong>pass^k reliability</strong> becomes important.</p>
<p>If an agent succeeds once but fails on repeated runs, it cannot be trusted in production.</p>
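<p>The arithmetic behind pass^k makes the point: under the simplifying assumption that runs are independent, the chance that all <em>k</em> repeats succeed is the per-run success rate raised to the <em>k</em>-th power, so small per-run gaps compound quickly.</p>
<pre><code class="language-python">def pass_k(per_run_success: float, k: int) -&gt; float:
    """Probability that k independent runs all succeed."""
    return per_run_success ** k

for p in (0.99, 0.95, 0.80):
    print(f"per-run {p:.0%} -&gt; pass^10 = {pass_k(p, 10):.1%}")
# per-run 99% -&gt; pass^10 = 90.4%
# per-run 95% -&gt; pass^10 = 59.9%
# per-run 80% -&gt; pass^10 = 10.7%
</code></pre>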
<hr />
<h2>Eval-as-a-judge in the evaluation pipeline</h2>
<p>Many agent outcomes cannot be evaluated using deterministic checks.</p>
<p>For example:</p>
<ul>
<li><p>Is the root cause hypothesis plausible?</p>
</li>
<li><p>Is the Slack update clear to on-call engineers?</p>
</li>
<li><p>Did the agent follow incident response policy?</p>
</li>
</ul>
<p>This is where <strong>Eval-as-a-Judge (EaaJ)</strong> is useful.</p>
<p>An evaluation model reviews the agent output and scores it according to defined criteria.</p>
<p>Example evaluation prompt:</p>
<pre><code class="language-plaintext">You are an SRE evaluating an incident response.

Alert:
CPU spike on checkout-api

Agent response:
"Root cause likely a memory leak introduced in version v1.3.2."

Evaluate:
1. Is the hypothesis plausible?
2. Is the remediation safe?
3. Did the response follow policy?

Return:
score (0-1)
justification
</code></pre>
<p>Eval-as-a-judge works well because it can evaluate <strong>semantic correctness</strong> and <strong>reasoning quality</strong>, which deterministic tests cannot capture.</p>
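<p>A minimal sketch of wiring that prompt into the pipeline. The <code>call_llm</code> helper and the JSON verdict format are assumptions standing in for whatever judge model and client you run, not a specific vendor API.</p>
<pre><code class="language-python">import json

def call_llm(prompt: str) -&gt; str:
    """Placeholder for your judge-model client (hosted or local)."""
    raise NotImplementedError

JUDGE_TEMPLATE = """You are an SRE evaluating an incident response.

Alert:
{alert}

Agent response:
{response}

Evaluate plausibility, safety, and policy compliance.
Return JSON: {{"score": &lt;0-1&gt;, "justification": "&lt;one sentence&gt;"}}"""

def judge(alert: str, response: str, threshold: float = 0.7) -&gt; bool:
    raw = call_llm(JUDGE_TEMPLATE.format(alert=alert, response=response))
    verdict = json.loads(raw)  # fail loudly if the judge's output is malformed
    return verdict["score"] &gt;= threshold
</code></pre>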
<p>Best practice is to combine three layers:</p>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Role</th>
</tr>
</thead>
<tbody><tr>
<td>Deterministic tests</td>
<td>Validate objective outcomes</td>
</tr>
<tr>
<td>LLM judge</td>
<td>Evaluate reasoning quality</td>
</tr>
<tr>
<td>Human review</td>
<td>Approve high-risk actions</td>
</tr>
</tbody></table>
<p>This layered grading system dramatically improves evaluation reliability.</p>
<hr />
<h2>Incident replay: the most powerful backtesting tool</h2>
<p>The most valuable evaluation dataset is <strong>your own incident history</strong>.</p>
<p>Replay systems reconstruct the context of past outages using:</p>
<ul>
<li><p>alerts</p>
</li>
<li><p>logs</p>
</li>
<li><p>dashboards</p>
</li>
<li><p>deployment events</p>
</li>
<li><p>Slack threads</p>
</li>
</ul>
<p>Agents then attempt to resolve the incident <strong>as if it were happening live</strong>.</p>
<p>Benefits include:</p>
<ul>
<li><p>realistic test scenarios</p>
</li>
<li><p>automatic regression generation</p>
</li>
<li><p>continuous learning from production failures</p>
</li>
</ul>
<p>Every production incident can become a <strong>permanent regression test</strong> for the agent.</p>
<p>Over time, the backtest suite becomes a living archive of operational knowledge.</p>
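<p>One way to represent a replayable incident, sketched as a plain dataclass whose fields mirror the context sources above; the field names and expected outcome are illustrative, not a specific replay system's format.</p>
<pre><code class="language-python">from dataclasses import dataclass, field

@dataclass
class ReplayScenario:
    """Frozen context from a past outage, replayed as if it were live."""
    incident_id: str
    alerts: list[dict] = field(default_factory=list)
    log_snapshots: list[str] = field(default_factory=list)
    dashboards: dict[str, list] = field(default_factory=dict)
    deploy_events: list[dict] = field(default_factory=list)
    slack_thread: list[str] = field(default_factory=list)
    # In a real suite this comes from the postmortem's confirmed root cause.
    expected_outcome: str = ""

scenario = ReplayScenario(
    incident_id="INC-2042",
    alerts=[{"name": "HighErrorRate", "service": "checkout-api"}],
    deploy_events=[{"version": "v1.3.2", "minutes_before_alert": 12}],
    expected_outcome="rollback checkout-api to v1.3.1",
)
</code></pre>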
<hr />
]]></content:encoded></item><item><title><![CDATA[AI in Engineering: 6 Trends That Will Define 2026]]></title><description><![CDATA[The way engineering teams build, ship, and operate software is undergoing a fundamental shift. In 2025, we saw AI move from code autocomplete to genuine collaboration. In 2026, that collaboration becomes autonomy.

Here are six trends we're anticipat...]]></description><link>https://notes.drdroid.io/ai-in-engineering-6-trends-that-will-define-2026</link><guid isPermaLink="true">https://notes.drdroid.io/ai-in-engineering-6-trends-that-will-define-2026</guid><category><![CDATA[ai agents]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Siddarth Jain]]></dc:creator><pubDate>Thu, 29 Jan 2026 18:12:05 GMT</pubDate><content:encoded><![CDATA[<p>The way engineering teams build, ship, and operate software is undergoing a fundamental shift. In 2025, we saw AI move from code autocomplete to genuine collaboration. In 2026, that collaboration becomes autonomy.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769710268772/eef80f78-7084-4958-96fc-47b3f8d3b2d5.jpeg" alt class="image--center mx-auto" /></p>
<p>Here are six trends we're anticipating that will reshape how engineering teams work this year.</p>
<h2 id="heading-1-agents-will-ship-with-built-in-accountability">1. Agents Will Ship With Built-in Accountability</h2>
<p>The first generation of AI agents were black boxes. They'd take an instruction, disappear into a loop, and return something—hopefully useful, often not. Engineers had no visibility into what the agent tried, why it failed, or whether its approach was even sensible.</p>
<p>That changes in 2026. The next wave of agents will come with testing frameworks, goal tracking, and structured logs built in. Think of it as observability for AI workflows. Every action logged. Every decision traceable. Every failure reviewable.</p>
<p>This isn't just nice-to-have tooling. It's the minimum bar for agents that operate in production environments where accountability matters. Teams won't trust agents they can't audit.</p>
<h2 id="heading-2-ai-generated-code-will-be-structurally-better">2. AI-Generated Code Will Be Structurally Better</h2>
<p>Early AI code generation optimized for "does it work?" The result was functional but often messy—inconsistent patterns, poor separation of concerns, and the kind of technical debt that compounds quietly.</p>
<p>The models shipping in 2026 are trained differently. They've internalized architectural patterns, not just syntax. They understand that a 500-line function is a code smell. They know when to extract a service, when to add an interface, and when to leave well enough alone.</p>
<p>The practical result: fewer bugs at the source. Not because AI doesn't make mistakes, but because well-structured code has fewer places for bugs to hide.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769710254523/cb349429-3a39-40d3-a488-c1eb0a93ac83.jpeg" alt class="image--center mx-auto" /></p>
<h2 id="heading-3-complex-multi-step-tasks-will-actually-complete">3. Complex, Multi-Step Tasks Will Actually Complete</h2>
<p>Ask an AI agent to "refactor this module" or "migrate this service to the new API" and, until recently, you'd get partial results at best. The agent would lose context, get stuck, or quietly drift off-goal.</p>
<p>2026 brings agents that maintain coherence across longer task horizons. They break complex work into subtasks, checkpoint progress, and recover from failures without starting over. They can hold a goal in mind across dozens of operations and hundreds of files.</p>
<p>This is the difference between a tool that helps with tasks and one that completes them.</p>
<h2 id="heading-4-autonomous-ai-will-take-primary-on-call">4. Autonomous AI Will Take Primary On-Call</h2>
<p>This is the trend that will feel most uncomfortable—and most inevitable.</p>
<p>AI agents are already triaging alerts, correlating signals, and suggesting root causes. The next step is giving them the authority to act. Not just "here's what might be wrong" but "I've identified the issue, applied the fix, and I'm monitoring for recurrence."</p>
<p>For well-understood failure modes with established runbooks, there's no reason a human needs to wake up at 3 AM. The agent can handle it, escalate if it's uncertain, and hand off a detailed incident report in the morning.</p>
<p>The human on-call role shifts from first responder to supervisor—still accountable, but not necessarily awake.</p>
<h2 id="heading-5-day-to-day-operations-will-run-on-autopilot">5. Day-to-Day Operations Will Run on Autopilot</h2>
<p>Beyond incident response, there's a long tail of operational work that consumes engineering time: dependency updates, certificate rotations, capacity adjustments, config drift remediation, and the endless stream of small fixes that never quite make it to the sprint.</p>
<p>AI agents will absorb this work in 2026. Not as a batch job that runs once, but as a continuous process. The agent monitors, identifies issues, proposes fixes, and—with appropriate guardrails—applies them.</p>
<p>Engineers review the changelog. They don't write it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769710229814/c9dba78f-1ba6-4b3a-a909-7b1876e63d39.jpeg" alt class="image--center mx-auto" /></p>
<h2 id="heading-6-always-on-agents-will-work-in-shifts">6. Always-On Agents Will Work in Shifts</h2>
<p>The most significant shift is temporal. Today's AI interactions are synchronous: you prompt, it responds, you review. That loop keeps humans in the critical path.</p>
<p>The agents arriving in 2026 can work asynchronously for extended periods—hours, not minutes. You define a goal, provide constraints, and the agent works toward it continuously. It checks in when it needs input, escalates when it hits uncertainty, and otherwise just keeps going.</p>
<p>Imagine starting your day with a summary: "Overnight, I completed the database migration, ran the regression suite, fixed two failing tests, and deployed to staging. Ready for your review."</p>
<p>That's not a vision. That's a product roadmap.</p>
<hr />
<h2 id="heading-what-this-means-for-engineering-teams">What This Means for Engineering Teams</h2>
<p>These trends point in one direction: AI as a genuine team member, not just a tool.</p>
<p>The teams that thrive in 2026 will be those that figure out the right division of labor. What decisions require human judgment? What work can be fully delegated? How do you maintain accountability when an agent is acting autonomously?</p>
<p>The answers will vary by team, by codebase, and by risk tolerance. But the question is no longer whether AI will take on meaningful engineering work. It's how quickly your team will adapt to working alongside it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769710242206/784993ff-14bd-4a19-a2c8-c5e3c711881f.jpeg" alt class="image--center mx-auto" /></p>
<hr />
<p><em>Building reliable AI agents for production operations requires deep infrastructure context. At</em> <a target="_blank" href="https://drdroid.io"><em>DrDroid</em></a><em>, we're building the agentic context engine that makes autonomous incident response possible. Learn how teams are already putting AI on-call.</em></p>
]]></content:encoded></item><item><title><![CDATA[How to Build an AI Agent in Slack [DIY Guide]]]></title><description><![CDATA[Objective
By the end of this DIY guide, you’ll have:

A Slack Bot

A backend that can

Listen to user messages or alerts in a channel and take agentic action based on prompts or steps that you might have in mind.

Query your Grafana instance, analyse...]]></description><link>https://notes.drdroid.io/how-to-build-an-ai-agent-in-slack-diy-guide</link><guid isPermaLink="true">https://notes.drdroid.io/how-to-build-an-ai-agent-in-slack-diy-guide</guid><category><![CDATA[observability]]></category><category><![CDATA[AI Agent Development]]></category><category><![CDATA[Grafana]]></category><category><![CDATA[Open Source]]></category><dc:creator><![CDATA[Siddarth Jain]]></dc:creator><pubDate>Sun, 27 Jul 2025 06:09:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753596357650/0913d77d-d6ca-470f-b188-ea89eab1bd5b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-objective">Objective</h1>
<p>By the end of this DIY guide, you’ll have:</p>
<ol>
<li><p>A Slack Bot</p>
</li>
<li><p>A backend that can</p>
<ul>
<li><p>Listen to user messages or alerts in a channel and take agentic action based on prompts or steps that you might have in mind.</p>
</li>
<li><p>Query your Grafana instance, analyse logs/dashboards and send info about anomaly in reply to an alert/message in your Slack channel</p>
</li>
</ul>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753595500164/544dba64-3148-45ac-abdd-5b40ed903954.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-pre-requisites">Pre-requisites</h2>
<ol>
<li><p>Python <a target="_blank" href="https://docs.astral.sh/uv/getting-started/installation/">uv</a> (package manager)</p>
</li>
<li><p>Ngrok to expose the slackbot server. <a target="_blank" href="https://ngrok.com/docs/getting-started/">Setup instructions</a></p>
</li>
<li><p>Grafana (Optional)</p>
</li>
</ol>
<h2 id="heading-step-0-clone-the-repo">Step 0: Clone the repo:</h2>
<p>Repository Link - <a target="_blank" href="https://github.com/DrDroidLab/slack-ai-bot-builder">https://github.com/DrDroidLab/slack-ai-bot-builder</a></p>
<h2 id="heading-step-1-building-the-slack-bot-with-an-integrated-backendhttpsgithubcomdrdroidlabslack-ai-bot-builder"><a target="_blank" href="https://github.com/DrDroidLab/slack-ai-bot-builder"><strong>Step 1: Building the Slack Bot with an integrated backend</strong></a></h2>
<p>The backend repo setup here, behaves in multiple ways:</p>
<ul>
<li><p>Acts as an MCP Client for any AI calls you might want to make</p>
</li>
<li><p>Acts as a server to accept webhooks from Slack and manage configurations</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753595599394/56688817-8307-4257-a31e-fd26e97174f6.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-setting-up-the-ngrok-tunnel"><strong>Setting up the Ngrok Tunnel</strong></h3>
<p>Expose port 5000 of your localhost using the command:</p>
<pre><code class="lang-plaintext">ngrok http 5000
</code></pre>
<p>You will receive an HTTPS URL of the form <a target="_blank" href="https://abc123.ngrok.io">https://abc123.ngrok.io</a>, pointing to port 5000 of your system.</p>
<p><strong>Note:</strong> <strong><em>We have not setup a server running on port 5000 yet</em></strong>, but that is fine since ngrok is independent of that, and exposes the port regardless.</p>
<h3 id="heading-creating-the-slack-application"><strong>Creating the Slack Application</strong></h3>
<ol>
<li><p>Go to Slack API Apps – <a target="_blank" href="https://api.slack.com/apps">https://api.slack.com/apps</a></p>
</li>
<li><p>Click on Create App, and select the option ‘From a manifest’</p>
</li>
<li><p>Copy the <a target="_blank" href="https://github.com/DrDroidLab/slack-ai-bot-builder/blob/main/slack_manifest.json">manifest json</a> from the repository and <strong><mark>replace the placeholder ‘&lt;hostname&gt;’ with the HTTPS URL you got from ngrok. Include the “https://” part as well.</mark></strong></p>
</li>
<li><p>Install the app in your workspace.</p>
</li>
<li><p>Copy the credentials for the slack application into credentials.yaml</p>
<ol>
<li><p>app_id, app_name, and signing secret can be found in the Basic Information tab.</p>
</li>
<li><p>Bot-auth-token can be found in the OAuth &amp; Permissions tab.</p>
</li>
<li><p>openai_key from OpenAI, if you plan to use AI-based workflows.</p>
</li>
</ol>
</li>
<li><p>Create a channel called #drdroid-slack-bot-tester in your Slack workspace &amp; add the bot to the channel.</p>
</li>
</ol>
<h3 id="heading-setting-up-the-bot-server"><strong>Setting up the bot server</strong></h3>
<p>Run the following commands to set up your virtual environment and activate it.</p>
<pre><code class="lang-plaintext">uv venv
source .venv/bin/activate
</code></pre>
<p>Install dependencies using:</p>
<pre><code class="lang-plaintext">uv sync
</code></pre>
<p>Now we can finally run the bot server using:</p>
<pre><code class="lang-plaintext">uv run python app.py
</code></pre>
<p>The server is now running on port 5000, and exposed to the outside world via your ngrok tunnel.</p>
<h3 id="heading-testing-the-bot"><strong>Testing the bot</strong></h3>
<p>Add the bot to the #drdroid-slack-bot-tester channel that you previously created.</p>
<p>And just type in a ‘hi’. The bot should send you a sample response.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753596133455/a92590b2-6b31-479b-a209-015fa52872b3.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-step-2-integrating-ai"><a target="_blank" href="https://github.com/DrDroidLab/slack-ai-bot-builder"><strong>Step 2: Integrating AI</strong></a></h2>
<p>There is already an example workflow for AI (name: "chatbot") in the <a target="_blank" href="https://github.com/DrDroidLab/slack-ai-bot-builder/blob/main/workflows.yaml">workflows.yaml</a>. It is just boilerplate code; modify it as required. For now, you can tag your bot in the #drdroid-slack-bot-tester channel and chat with it.</p>
<ul>
<li>Add your OpenAI/LLM key</li>
</ul>
<p>For example:</p>
<pre><code class="lang-plaintext">Message in Slack: chatbot How to debug Kubernetes CrashLoopBackOff error? @bot
Message in Slack: chatbot I'm getting this alert. What does it mean? @bot
</code></pre>
<h2 id="heading-step-3-making-the-bot-an-agent-by-giving-ai-access-to-different-tools-grafana-for-demo"><strong>Step 3: Making the bot an Agent by giving AI access to different tools (Grafana for demo)</strong></h2>
<p>MCP servers help abstract out any API, making it accessible to AI.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753595526326/211ca3ee-f4b8-45f2-a1f0-3520850160a6.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-setting-up-grafana-mcp-server"><strong>Setting up Grafana MCP Server</strong></h3>
<p>Clone the following repository:</p>
<p>Repository URL - <a target="_blank" href="https://github.com/DrDroidLab/grafana-mcp-server">https://github.com/DrDroidLab/grafana-mcp-server</a></p>
<p>Navigate into the root directory of the repository.</p>
<p><mark>Populate the </mark> <code>src/grafana_mcp_server/config.yaml</code> <mark> with your grafana credentials.</mark></p>
<p>Install and setup the dependencies using:</p>
<pre><code class="lang-plaintext">uv venv .venv
source .venv/bin/activate
uv sync
</code></pre>
<p>Run the MCP server:</p>
<pre><code class="lang-plaintext">uv run -m src.grafana_mcp_server.mcp_server
</code></pre>
<p>Your MCP server is now running on port 8000.</p>
<h3 id="heading-creating-an-ai-grafana-workflow-in-slack-bot-builder"><strong>Creating an AI Grafana workflow in slack-bot-builder</strong></h3>
<p>There is already an example workflow for Grafana AI in the workflows.yaml.</p>
<p>This workflow runs the script scripts/grafana_ai_tool.py. It is just boilerplate code; modify it as required.<br />Now you can tag your bot in the #drdroid-slack-bot-tester channel and ask it to do various things in Grafana.<br />For example:</p>
<pre><code class="lang-plaintext">Message in Slack: Fetch me logs from the currencyservice in grafana ai. 
Message in Slack: Fetch and analyse the Go Microservices dashboard from grafana ai
</code></pre>
<h2 id="heading-next-steps">Next Steps:</h2>
<p>Now that you’ve been able to setup a bot, here are a few things you can do:</p>
<ul>
<li><p>Productionise it from your current ngrok setup to a static endpoint</p>
</li>
<li><p>Integrate with Grafana or with open-source <a target="_blank" href="https://glama.ai/mcp/servers">MCP servers</a> for any favourite tool you want to leverage for automation</p>
</li>
<li><p>Add custom prompts and scripts</p>
</li>
</ul>
<p>Stuck anywhere? Ask on our <a target="_blank" href="https://discord.gg/AQ3tusPtZn">Discord</a></p>
]]></content:encoded></item><item><title><![CDATA[GitOps for Alerting: How to Manage Alert Rules Like Code]]></title><description><![CDATA[It's 2 AM. Production is on fire. You need to adjust an alert threshold that's been firing false positives all week.
You log into Grafana, click through three nested menus, find the alert, and bump the threshold from 80% to 85%. Crisis averted. You g...]]></description><link>https://notes.drdroid.io/gitops-for-alerting-how-to-manage-alert-rules-like-code</link><guid isPermaLink="true">https://notes.drdroid.io/gitops-for-alerting-how-to-manage-alert-rules-like-code</guid><category><![CDATA[#AIOps]]></category><dc:creator><![CDATA[SriNikitha Thummanapalli]]></dc:creator><pubDate>Fri, 18 Jul 2025 11:52:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1752833279010/4872f8bc-f608-469b-a39e-8fb172e66e6b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It's 2 AM. Production is on fire. You need to adjust an alert threshold that's been firing false positives all week.</p>
<p>You log into Grafana, click through three nested menus, find the alert, and bump the threshold from 80% to 85%. Crisis averted. You go back to bed.</p>
<p>Two weeks later, during a postmortem, someone asks: "Who changed the CPU alert threshold? And why?"</p>
<p>Silence. Nobody remembers. There's no history. No context. No way to know if this was a temporary hack or a deliberate tuning decision. Worse, when you refresh your staging environment, the old threshold returns because the change only lived in the production UI.</p>
<p>Sound familiar? You're not alone. This is how most teams manage alerts—and it's fundamentally broken.</p>
<h2 id="heading-why-managing-alerts-in-dashboards-doesnt-scale">Why Managing Alerts in Dashboards Doesn't Scale</h2>
<p>We've spent the last decade moving infrastructure to code. Terraform for cloud resources. Helm charts for Kubernetes. Ansible for configuration. Yet somehow, our alert rules—critical infrastructure that wakes up engineers—still live in UI dashboards like it's 2010.</p>
<p>The problems compound quickly:</p>
<p><strong>No version history</strong>: When did this alert last change? Who changed it? Why? Your Grafana dashboard shrugs.</p>
<p><strong>No peer review</strong>: A junior engineer can accidentally change a critical alert threshold with zero oversight. Try doing that with production code.</p>
<p><strong>No rollback capability</strong>: That "quick fix" that made things worse? Good luck remembering the old values.</p>
<p><strong>Environment drift</strong>: Production alerts diverge from staging. Dev environments have different rules. Chaos ensues.</p>
<p><strong>No ownership tracking</strong>: Who owns this alert? Which team should review changes? The UI doesn't care.</p>
<p>Your infrastructure evolves constantly. Services scale. Traffic patterns shift. Performance characteristics change. But alerts configured through dashboards remain frozen in time, slowly becoming less relevant until they're just noise.</p>
<p>Here's the thing: <strong>alert rules are infrastructure-as-code too</strong>. They define critical system behavior. They impact your team's quality of life. They deserve the same rigor as any other code.</p>
<p>Enter GitOps for alerts—where alert definitions live in version control, changes happen through pull requests, and every modification is tracked, reviewed, and reversible.</p>
<h2 id="heading-what-is-gitops-for-alerting">What is GitOps for Alerting?</h2>
<p>GitOps for alerting is beautifully simple: store your alert rules as code in Git, manage changes through pull requests, and deploy automatically. Just like any other infrastructure.</p>
<p>Most modern monitoring tools already support this:</p>
<ul>
<li><p><strong>Prometheus</strong>: Alert rules in YAML files</p>
</li>
<li><p><strong>Alertmanager</strong>: Routing configuration as code</p>
</li>
<li><p><strong>Grafana</strong>: Alerts exportable as JSON</p>
</li>
<li><p><strong>Datadog</strong>: Monitors manageable via Terraform</p>
</li>
<li><p><strong>New Relic</strong>: Alerts configurable through their API/Terraform</p>
</li>
</ul>
<p>Here's what a typical structure looks like:</p>
<pre><code class="lang-bash">/alerts/
  frontend-service.yaml
  database.yaml
  redis.yaml

/teams/
  payments/
    api-alerts.yaml
    database-alerts.yaml
  platform/
    infrastructure-alerts.yaml
    kubernetes-alerts.yaml
</code></pre>
<p>A Prometheus alert rule might look like:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">groups:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">frontend-service</span>
    <span class="hljs-attr">rules:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">alert:</span> <span class="hljs-string">HighErrorRate</span>
        <span class="hljs-attr">expr:</span> <span class="hljs-string">rate(http_requests_total{status=~"5.."}[5m])</span> <span class="hljs-string">&gt;</span> <span class="hljs-number">0.05</span>
        <span class="hljs-attr">for:</span> <span class="hljs-string">5m</span>
        <span class="hljs-attr">labels:</span>
          <span class="hljs-attr">severity:</span> <span class="hljs-string">warning</span>
          <span class="hljs-attr">service:</span> <span class="hljs-string">frontend</span>
          <span class="hljs-attr">team:</span> <span class="hljs-string">frontend-team</span>
        <span class="hljs-attr">annotations:</span>
          <span class="hljs-attr">summary:</span> <span class="hljs-string">"High error rate on <span class="hljs-template-variable">{{ $labels.instance }}</span>"</span>
          <span class="hljs-attr">description:</span> <span class="hljs-string">"Error rate is <span class="hljs-template-variable">{{ $value }}</span> (threshold 0.05)"</span>
          <span class="hljs-attr">runbook:</span> <span class="hljs-string">"https://wiki.company.com/runbooks/frontend-errors"</span>
          <span class="hljs-attr">owner:</span> <span class="hljs-string">"frontend-oncall@company.com"</span>
</code></pre>
<p>The benefits are immediate:</p>
<p>✅ <strong>Traceability</strong>: Every change is a commit. Git blame tells you who changed what and when.</p>
<p>✅ <strong>Peer review</strong>: Alert changes go through PR reviews. No more accidental 3 AM threshold adjustments.</p>
<p>✅ <strong>Consistency</strong>: Deploy the same alerts across all environments. No more production/staging drift.</p>
<p>✅ <strong>Rollback capability</strong>: Bad change? <code>git revert</code> and you're back to working alerts.</p>
<p>✅ <strong>Documentation</strong>: PR descriptions explain why changes were made. Context is preserved forever.</p>
<h2 id="heading-but-gitops-alone-isnt-enough">But GitOps Alone Isn't Enough</h2>
<p>Here's the plot twist: GitOps for alerts solves the <em>how</em> but not the <em>what</em>.</p>
<p>You now have beautiful, version-controlled alert rules. Every change is reviewed and tracked. But you still don't know <strong>which rules need updating</strong>.</p>
<p>Your Git repo becomes a graveyard of alert rules that might or might not be relevant:</p>
<ul>
<li><p>That CPU alert from 2019 when you ran on smaller instances</p>
</li>
<li><p>The memory warning tuned for your old Java app (you've since moved to Go)</p>
</li>
<li><p>The latency threshold set when you had 100 users (you now have 10,000)</p>
</li>
</ul>
<p>You've traded one problem for another. Instead of stale alerts in dashboards, you have stale alerts in Git. They're better organized, sure, but still noisy.</p>
<p>This is where most GitOps alerting stories end. Teams implement the framework but lack the feedback loop to keep it healthy. Alert rules accumulate like sediment. Engineers suffer in silence because "at least it's in Git now."</p>
<h2 id="heading-using-alert-insights-to-drive-gitops-changes">Using Alert Insights to Drive GitOps Changes</h2>
<h3 id="heading-let-real-alert-data-guide-your-pull-requests">Let real alert data guide your pull requests</h3>
<p>The missing piece is data. You need to know which alerts are actually problematic before you can fix them. This is where <strong>DrDroid's Alert Insights</strong> transforms GitOps from a theoretical improvement into a practical solution.</p>
<p>Alert Insights analyzes your live production alerts and tells you:</p>
<ul>
<li><p><strong>Which alerts fired most frequently last week</strong>: Your noisiest offenders, ranked</p>
</li>
<li><p><strong>Which alerts were ignored</strong>: Clear signal of rules that need removal</p>
</li>
<li><p><strong>Which alerts lack owners or runbooks</strong>: Quality issues to address</p>
</li>
<li><p><strong>Suggested changes</strong>: Specific recommendations to mute, tweak, or archive</p>
</li>
</ul>
<p>Now GitOps becomes powerful. You're not guessing which alert rules to update—you have data.</p>
<p>✅ <strong>Workflow Example:</strong></p>
<p><strong>Monday: Run Alert Insights</strong></p>
<p>Top 3 Noisy Alerts:</p>
<ol>
<li><p>redis_memory_warning - 127 fires, 0 actions taken</p>
</li>
<li><p>api_latency_high - 89 fires, acknowledged but not investigated</p>
</li>
<li><p>cpu_usage_critical - 45 fires, all during deploy windows</p>
</li>
</ol>
<p><strong>Tuesday: Create targeted PRs</strong></p>
<pre><code class="lang-bash">git checkout -b fix/reduce-redis-memory-noise
<span class="hljs-comment"># Edit alerts/redis.yaml</span>
<span class="hljs-comment"># Increase threshold from 70% to 80% based on actual usage patterns</span>
git commit -m <span class="hljs-string">"Increase Redis memory threshold to reduce false positives

Alert Insights showed 127 fires with 0 actions last week. Analysis shows Redis memory naturally spikes to 75% during cache warmup."</span>
</code></pre>
<p><strong>Wednesday: Review and merge</strong></p>
<ul>
<li><p>Team reviews the PR</p>
</li>
<li><p>Links to Alert Insights data provide context</p>
</li>
<li><p>Changes deploy automatically</p>
</li>
</ul>
<p><strong>Thursday: Validate impact</strong></p>
<ul>
<li><p>Alert noise drops immediately</p>
</li>
<li><p>Next week's Alert Insights confirms improvement</p>
</li>
</ul>
<p>The feedback loop is complete. You're not just organizing alerts better—you're systematically improving them based on real data.</p>
<p>➡️ <strong>🛠️ Want a GitOps-ready alert audit? 👉</strong> <a target="_blank" href="https://drdroid.io/doctor-droid-slack-integration"><strong>Run DrDroid's Alert Insights</strong></a> <strong>and get actionable suggestions in minutes.</strong></p>
<h2 id="heading-recommended-practices-for-gitops-alert-management">Recommended Practices for GitOps Alert Management</h2>
<h3 id="heading-use-clear-filenames-per-servicecomponent">🔍 Use clear filenames per service/component</h3>
<p>Don't create a monolithic <code>alerts.yaml</code>. Break rules into logical groups:</p>
<pre><code class="lang-plaintext">/alerts/
  services/
    payment-api.yaml
    user-service.yaml
  infrastructure/
    kubernetes-nodes.yaml
    database-cluster.yaml
  business/
    checkout-flow.yaml
    user-engagement.yaml
</code></pre>
<h3 id="heading-add-labelstags-to-help-alert-insights-map-alerts-to-owners">🔄 Add labels/tags to help Alert Insights map alerts to owners</h3>
<p>Every alert should include:</p>
<pre><code class="lang-yaml">labels:
  team: payments
  service: payment-api
  environment: production
  severity: P2
</code></pre>
<p>This metadata powers Alert Insights' analysis and recommendations.</p>
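<p>A small CI check makes the tagging standard enforceable. This sketch assumes Prometheus-style rule files under <code>alerts/</code> and fails the build when any rule is missing a required label.</p>
<pre><code class="lang-python">import pathlib
import sys
import yaml

REQUIRED_LABELS = {"team", "service", "severity"}

def missing_labels(rule_file: pathlib.Path) -&gt; list[str]:
    doc = yaml.safe_load(rule_file.read_text())
    problems = []
    for group in doc.get("groups", []):
        for rule in group.get("rules", []):
            absent = REQUIRED_LABELS - set(rule.get("labels", {}))
            if absent:
                problems.append(f"{rule_file}: {rule.get('alert')} missing {sorted(absent)}")
    return problems

if __name__ == "__main__":
    failures = [p for f in pathlib.Path("alerts").rglob("*.yaml")
                for p in missing_labels(f)]
    if failures:
        print("\n".join(failures))
        sys.exit(1)  # block the merge
</code></pre>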
<h3 id="heading-validate-rules-with-test-alerts-in-staging">🧪 Validate rules with test alerts in staging</h3>
<p>Before merging, trigger test conditions:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Fire a synthetic test alert at Alertmanager to confirm routing and thresholds</span>
curl -X POST http://alertmanager:9093/api/v2/alerts \
-H <span class="hljs-string">'Content-Type: application/json'</span> \
-d <span class="hljs-string">'[{"labels": {"alertname": "HighErrorRate", "severity": "warning", "service": "frontend"}}]'</span>
</code></pre>
<h3 id="heading-link-prs-to-weekly-alert-review">🔁 Link PRs to weekly alert review</h3>
<p>Tie each tuning PR to the week's alert review so every change traces back to real alert data.</p>
<h2 id="heading-common-gitops-pitfalls-to-avoid">Common GitOps Pitfalls to Avoid</h2>
<h3 id="heading-bulk-silencing-alerts-without-context">❌ Bulk silencing alerts without context</h3>
<p>"Let's just comment out all the noisy alerts" is tempting but dangerous. Use Alert Insights to understand <em>why</em> alerts are noisy before acting.</p>
<h3 id="heading-committing-rules-without-reviews">❌ Committing rules without reviews</h3>
<p>The whole point of GitOps is peer review. Don't bypass it with direct commits, even for "quick fixes."</p>
<h3 id="heading-no-tagging-alert-insights-cant-map-alerts-to-services">❌ No tagging = Alert Insights can't map alerts to services</h3>
<p>Without proper labels, you lose the ability to analyze alerts by team, service, or severity. Enforce tagging standards.</p>
<h3 id="heading-alert-rules-diverging-across-environments">❌ Alert rules diverging across environments</h3>
<p>Use templating to keep staging and production alerts synchronized:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># values-prod.yaml</span>
<span class="hljs-attr">cpu_threshold:</span> <span class="hljs-number">80</span>
<span class="hljs-attr">memory_threshold:</span> <span class="hljs-number">85</span>

<span class="hljs-comment"># values-staging.yaml</span>
<span class="hljs-attr">cpu_threshold:</span> <span class="hljs-number">90</span> <span class="hljs-comment"># Higher tolerance in staging</span>
<span class="hljs-attr">memory_threshold:</span> <span class="hljs-number">90</span>
</code></pre>
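<p>A lightweight render step keeps both environments generated from one template. This sketch uses Jinja2; the template, metric name, and file layout are illustrative.</p>
<pre><code class="lang-python">import pathlib
import yaml
from jinja2 import Template

ALERT_TEMPLATE = Template("""
groups:
  - name: node-resources
    rules:
      - alert: HighCPU
        expr: node_cpu_usage_percent &gt; {{ cpu_threshold }}
        for: 10m
""")

pathlib.Path("generated").mkdir(exist_ok=True)
for env in ("prod", "staging"):
    values = yaml.safe_load(pathlib.Path(f"values-{env}.yaml").read_text())
    rendered = ALERT_TEMPLATE.render(**values)
    pathlib.Path(f"generated/{env}-alerts.yaml").write_text(rendered)
</code></pre>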
<h2 id="heading-final-take-let-data-drive-your-alert-rule-changes">Final Take — Let Data Drive Your Alert Rule Changes</h2>
<p>GitOps gives you the framework for managing alerts professionally. Version control, peer review, and rollback capabilities bring alerts into the modern era.</p>
<p>But framework without data is just organized chaos. You need to know which alerts to fix, how to fix them, and whether your fixes worked.</p>
<p>Alert Insights provides that missing data layer. It tells you which alert rules are hurting your team, suggests specific improvements, and validates that your changes actually reduced noise.</p>
<p>Together, they create a powerful feedback loop:</p>
<ol>
<li><p>Alert Insights identifies problematic alerts</p>
</li>
<li><p>GitOps enables reviewed, tracked changes</p>
</li>
<li><p>Automated deployment ensures consistency</p>
</li>
<li><p>Next week's Alert Insights validates improvement</p>
</li>
</ol>
<p>This isn't theoretical. Teams using this approach report 50-70% reduction in alert noise within weeks. On-call engineers sleep better. Real incidents get proper attention. Alert quality becomes a measurable, improvable metric.</p>
<p>Your alerts deserve the same engineering rigor as your code. GitOps provides the foundation. Alert Insights provides the intelligence. Together, they transform alerting from a necessary evil into a competitive advantage.</p>
<p>➡️ <strong>✍️ Want to make smarter, reviewable changes to your alerts? 👉</strong> <a target="_blank" href="https://aiops.drdroid.io/">Run AIOps</a> <strong>and let your alerts tell you what to fix.</strong></p>
]]></content:encoded></item><item><title><![CDATA[3 Tools That Help Reduce Alert Fatigue (With Trade-offs)]]></title><description><![CDATA[We live in the age of "vibecoding."
Your engineers ship features at lightning speed. AI copilots autocomplete entire functions. CI/CD pipelines deploy to production in minutes. Modern development has become a symphony of efficiency, with developers o...]]></description><link>https://notes.drdroid.io/3-tools-that-help-reduce-alert-fatigue-with-trade-offs</link><guid isPermaLink="true">https://notes.drdroid.io/3-tools-that-help-reduce-alert-fatigue-with-trade-offs</guid><category><![CDATA[alert noise]]></category><category><![CDATA[alert-insights]]></category><dc:creator><![CDATA[SriNikitha Thummanapalli]]></dc:creator><pubDate>Fri, 18 Jul 2025 11:52:02 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1752831431954/cb658a85-d1de-48db-b2e2-c3de9f9ebeff.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We live in the age of "vibecoding."</p>
<p>Your engineers ship features at lightning speed. AI copilots autocomplete entire functions. CI/CD pipelines deploy to production in minutes. Modern development has become a symphony of efficiency, with developers operating at 10x the speed of just five years ago.</p>
<p>But there's one part of your stack that's stuck in 2010: your alerts.</p>
<p>While your team vibecodes their way through complex distributed systems, your alerting engine still screams about every CPU spike, memory blip, and network hiccup like it's the apocalypse. It's like having a Ferrari engine attached to horse-and-buggy wheels. The cognitive dissonance is jarring—and it's killing your team's productivity.</p>
<h2 id="heading-why-alert-fatigue-is-a-real-problem-in-2025"><strong>Why Alert Fatigue Is a Real Problem in 2025</strong></h2>
<p>Here's the absurd reality: The same engineer who just deployed a sophisticated ML model in production gets woken up at 3 AM because a health check endpoint took 501ms instead of 500ms to respond. The developer who elegantly orchestrated a microservices migration gets paged because a pod restarted—something Kubernetes is literally designed to do automatically.</p>
<p>Modern infrastructure has exploded in complexity. You're running hundreds of microservices, each generating alerts. Kubernetes adds its own layer of notifications. Cloud providers, APMs, and security tools all want their voice heard. The result? An endless stream of "urgent" notifications flooding Slack channels and PagerDuty rotations.</p>
<p>But unlike your codebase—which has intelligent linters, smart IDEs, and AI-powered suggestions—your alerts remain dumb. They can't distinguish between:</p>
<ul>
<li><p>A temporary spike during garbage collection vs. a memory leak</p>
</li>
<li><p>A planned scaling event vs. an unexpected traffic surge</p>
</li>
<li><p>A self-healing Kubernetes pod restart vs. a critical service failure</p>
</li>
</ul>
<p>The real problem: <strong>You don't know which alerts matter anymore.</strong></p>
<p>Your engineers have adapted the only way they can—by tuning out. When every alert claims to be critical but most are noise, even genuine emergencies get ignored. It's the monitoring equivalent of crying wolf, except the wolf is paging your on-call engineer every 30 minutes.</p>
<p>What you need aren't more dashboards visualizing the chaos. You need intelligent tools that understand context, learn patterns, and <strong>show you what's noisy and help you take action</strong>. Let's examine three approaches to bringing your alerts into the modern era.</p>
<h2 id="heading-tool-1-drdroid"><strong>Tool #1 – DrDroid</strong></h2>
<h3 id="heading-best-for-real-time-visibility-into-noisy-alerts-across-any-stack"><strong>Best for: Real-time visibility into noisy alerts, across any stack</strong></h3>
<p>DrDroid represents the first generation of truly intelligent alerting tools. While your engineers use AI to write code faster, DrDroid uses intelligence to make your alerts smarter.</p>
<p>The platform integrates with your existing stack—Slack, Prometheus, New Relic, OpenTelemetry, and more. But what sets it apart is the <strong>Alert Insights</strong> feature, which applies actual intelligence to your alert patterns:</p>
<ul>
<li><p><strong>Which alerts are flapping?</strong> Just like a smart IDE highlights code smells, DrDroid identifies alerts that repeatedly fire and resolve—clear indicators of misconfiguration.</p>
</li>
<li><p><strong>Which alerts are being ignored?</strong> By analyzing engineer behavior, it spots alerts that get dismissed without action. If developers ignore an alert 100% of the time, why is it still paging them?</p>
</li>
<li><p><strong>Which alerts lack runbooks or clear owners?</strong> Nothing frustrates a vibecoding engineer more than context-switching to an alert with zero information about what to do.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752754957104/e354b06d-d477-479d-939c-4c03a3338299.png" alt /></p>
<p>DrDroid doesn't just identify problems—it suggests fixes:</p>
<ul>
<li><p>Automatically mute alerts during deployment windows</p>
</li>
<li><p>Disable alerts that have never correlated with customer impact</p>
</li>
<li><p>Add intelligent conditions (like requiring sustained threshold breaches)</p>
</li>
<li><p>Enrich alerts with missing context, runbooks, and correlation data</p>
</li>
</ul>
<p>The platform's <strong>auto-debugging</strong> capabilities are particularly impressive. When an alert fires, DrDroid automatically pulls relevant logs, metrics, traces, and even recent code changes. It's like having an AI copilot for incident response.</p>
<p>Consider this scenario: Your payment service alerts on high latency every day at 2 PM. DrDroid notices the pattern, correlates it with a scheduled batch job, and suggests either suppressing the alert during that window or adjusting the threshold. What took hours of manual analysis now happens automatically.</p>
<h3 id="heading-trade-offs"><strong>Trade-offs</strong></h3>
<p>DrDroid is built for modern, Slack-first teams. If your organization has traditional processes requiring all alerts to flow through legacy ITSM tools, adoption might face resistance. As a newer platform, some enterprise compliance features are still maturing.</p>
<p>➡️ <strong>🧠 Want to know which alerts your team should disable? 👉</strong> <a target="_blank" href="https://drdroid.io/doctor-droid-slack-integration"><strong>Explore DrDroid's Alert Insights</strong></a> <strong>— loved by SREs to reduce alert fatigue.</strong></p>
<h2 id="heading-tool-2-bigpanda"><strong>Tool #2 – BigPanda</strong></h2>
<h3 id="heading-best-for-enterprise-scale-alert-correlation"><strong>Best for: Enterprise-scale alert correlation</strong></h3>
<p>BigPanda takes a different approach—using machine learning to group related alerts into incidents. When a database issue triggers alerts across 20 services, BigPanda recognizes the pattern and presents them as one incident.</p>
<p>For large enterprises with complex systems, this correlation can help. The platform learns relationships between components and can reduce the number of incidents operators review. It also integrates deeply with enterprise tools like ServiceNow and Dynatrace.</p>
<h3 id="heading-trade-offs-1"><strong>Trade-offs</strong></h3>
<p>Here's where the contrast with modern development becomes stark. While your engineers deploy code in minutes, BigPanda requires <strong>months of setup</strong>. While developers use intuitive tools that work out-of-the-box, BigPanda demands extensive metadata configuration and alert standardization.</p>
<p>More critically, BigPanda doesn't make individual alerts smarter—it just groups dumb alerts better. Those flapping alerts your engineers hate? Still firing, just bundled together. It's like organizing spam into folders instead of fixing your spam filter.</p>
<p>The platform is also expensive, often requiring dedicated administrators and cross-team coordination. For teams used to the speed of modern development, BigPanda's implementation timeline feels like stepping back in time.</p>
<h2 id="heading-tool-3-pagerduty-analytics"><strong>Tool #3 – PagerDuty Analytics</strong></h2>
<h3 id="heading-best-for-trend-visibility-inside-the-pagerduty-ecosystem"><strong>Best for: Trend visibility inside the PagerDuty ecosystem</strong></h3>
<p>PagerDuty Analytics provides retrospective dashboards showing alert volume, MTTR, and on-call load. For teams already using PagerDuty, it offers visibility into historical patterns and trends.</p>
<p>The analytics can be useful for quarterly reviews and capacity planning. You can see which services generate the most alerts and track improvements over time.</p>
<h3 id="heading-trade-offs-2"><strong>Trade-offs</strong></h3>
<p>The limitations mirror the gap between modern development and legacy monitoring. While your engineers get real-time feedback from their tools, PagerDuty Analytics is <strong>retrospective only</strong>. It tells you that Service X generated 500 alerts last month but not which ones were false positives or what to do about them.</p>
<p>It only analyzes alerts flowing through PagerDuty, missing Slack notifications and other channels. The insights are descriptive, not prescriptive—you see the problem visualized but get no help fixing it. And it requires expensive premium tiers, adding cost without adding intelligence.</p>
<h2 id="heading-comparison-table"><strong>Comparison Table</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752754340622/c0dfb6f6-7cef-478d-b60b-ccfbbaa773ba.png" alt /></p>
<h2 id="heading-final-thoughts-your-alerts-should-be-as-smart-as-your-code"><strong>Final Thoughts — Your Alerts Should Be as Smart as Your Code</strong></h2>
<p>We've entered an era where engineers can literally describe what they want to build and watch AI generate the code. They deploy with confidence, iterate rapidly, and ship features that would have taken months in mere days.</p>
<p>Yet these same engineers—these 10x vibecoding machines—are still being interrupted by alerts that would have been considered noisy a decade ago.</p>
<p>The disconnect is unsustainable. You can't run a modern engineering organization with stone-age alerting. Your monitoring needs to evolve to match the sophistication of your development practices.</p>
<p>BigPanda and PagerDuty show you the problem in high resolution. <strong>Only DrDroid's Alert Insights actually makes your alerts smarter</strong>—identifying what's broken, why it's noisy, and exactly how to fix it.</p>
<p>The future of monitoring isn't better dashboards or fancier grouping algorithms. It's intelligent systems that understand context, learn from patterns, and proactively help you maintain signal-to-noise ratio. It's alerts that are as smart as the engineers they're interrupting.</p>
<p>Your team deserves alerting infrastructure that matches their development velocity. Stop letting 2010-era alerts slow down your 2025 engineering team.</p>
<p>➡️ <strong>💡 Ready to reduce alert fatigue the smart way? 👉</strong> <a target="_blank" href="https://drdroid.io/doctor-droid-slack-integration"><strong>Start using Alert Insights</strong></a> <strong>to find and fix noisy alerts today — no config needed.</strong></p>
]]></content:encoded></item><item><title><![CDATA[A Practical Framework to Reduce Alert Noise (Without Missing Incidents)]]></title><description><![CDATA[Every SRE has been there.
Fed up with alert fatigue, you go on a muting spree. That flaky health check? Silenced. The CPU warning that fires during deploys? Disabled. The memory alert that triggers during garbage collection? Gone.
For a blissful week...]]></description><link>https://notes.drdroid.io/a-practical-framework-to-reduce-alert-noise-without-missing-incidents</link><guid isPermaLink="true">https://notes.drdroid.io/a-practical-framework-to-reduce-alert-noise-without-missing-incidents</guid><category><![CDATA[alert-insights]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[AI]]></category><category><![CDATA[alerting]]></category><dc:creator><![CDATA[SriNikitha Thummanapalli]]></dc:creator><pubDate>Fri, 18 Jul 2025 10:23:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1752831592236/dc329375-d56a-4043-b40e-4c1815e6bf86.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every SRE has been there.</p>
<p>Fed up with alert fatigue, you go on a muting spree. That flaky health check? Silenced. The CPU warning that fires during deploys? Disabled. The memory alert that triggers during garbage collection? Gone.</p>
<p>For a blissful week, your on-call rotation is peaceful. Engineers are sleeping through the night. Slack channels are quiet. Life is good.</p>
<p>Then it happens. A real incident slips through. Customer complaints pour in. Your CEO wants answers. And suddenly, those "noisy" alerts you disabled don't seem so unnecessary anymore.</p>
<p>Here's the uncomfortable truth: anyone can reduce alert noise by turning off alerts. The real challenge—the one that separates good SRE teams from great ones—is reducing noise <strong>without sacrificing coverage</strong>.</p>
<h2 id="heading-why-reducing-alert-noise-is-harder-than-it-sounds">Why Reducing Alert Noise Is Harder Than It Sounds</h2>
<p>The naive approach to alert fatigue is seductively simple: just turn off the annoying alerts. But this creates a dangerous blind spot. That CPU alert might be noisy 99% of the time, but what about the 1% when it signals a real problem?</p>
<p>The opposite extreme isn't better. Some teams, burned by missed incidents, keep every alert active "just in case." They end up with hundreds of alerts that cry wolf, training engineers to ignore everything—including real emergencies.</p>
<p>The solution isn't choosing between noise and coverage. It's building a systematic approach that maintains visibility while eliminating false positives. High-performing SRE teams follow a <strong>4-phase framework</strong> that transforms chaotic alerting into intelligent monitoring.</p>
<p>This framework isn't theoretical—it's battle-tested by teams managing hundreds of services in production. And with modern tools like <strong>Alert Insights</strong>, you can measure and validate your improvements with data, not guesswork.</p>
<p>Let's dive into each phase.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752831961901/f3951003-f149-4ffa-90a0-b9bad0526240.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-phase-1-start-with-coverage-not-silence">Phase 1 – Start with Coverage, Not Silence</h2>
<h3 id="heading-the-mistake-most-teams-make-start-muting">The mistake most teams make: start muting</h3>
<p>When alert fatigue hits, the instinctive response is to start silencing alerts. It feels productive—each muted alert is one less interruption. But this approach is backwards.</p>
<p>Before you disable a single alert, you need to understand what you're actually trying to monitor. This means mapping your core Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to your alerting strategy.</p>
<p>Most teams rely too heavily on infrastructure alerts—CPU usage, memory consumption, disk space. These are important, but they're indirect signals. A service can have high CPU usage while serving customers perfectly. Conversely, it can have normal resource usage while completely failing its primary function.</p>
<p>Instead, start with user-facing signals:</p>
<ul>
<li><p><strong>Failed user logins</strong> (not just authentication service uptime)</p>
</li>
<li><p><strong>Checkout completion rates</strong> (not just payment gateway availability)</p>
</li>
<li><p><strong>API response times at the 95th percentile</strong> (not just average latency)</p>
</li>
<li><p><strong>Database query failures</strong> (not just connection pool metrics)</p>
</li>
</ul>
<p>Map these business-critical indicators first. Only after you have comprehensive coverage of what matters should you start tuning what doesn't.</p>
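<p>To make this concrete, here is a minimal sketch of a Prometheus-style rule that alerts on a user-facing signal rather than an infrastructure one. The metric names (<code>checkout_failures_total</code>, <code>checkout_requests_total</code>) are hypothetical placeholders for whatever your services actually export:</p>
<pre><code class="lang-yaml"># Alert on the user-facing SLI (checkout success), not on CPU or memory
- alert: CheckoutFailureRateHigh
  expr: |
    sum(rate(checkout_failures_total[5m]))
      /
    sum(rate(checkout_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: P1
  annotations:
    summary: "More than 5% of checkouts are failing"
</code></pre>
<p>A rule like this fires only while customers are actually affected, regardless of what CPU or memory are doing underneath.</p>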
<p>✅ <strong>Principle:</strong> Only tune alerts after coverage is solid. It's better to have noisy but comprehensive alerting than quiet but blind monitoring.</p>
<h2 id="heading-phase-2-assign-ownership">Phase 2 – Assign Ownership</h2>
<h3 id="heading-every-alert-should-have-an-owner-a-service-and-a-runbook">Every alert should have an owner, a service, and a runbook</h3>
<p>Here's a dirty secret of most alerting systems: nobody owns the alerts. They fire into shared channels where responsibility diffuses across the team. When everyone is responsible, no one is accountable.</p>
<p>This shared ownership model is why alerts never improve. The payment team ignores database alerts because "that's infrastructure's problem." The infrastructure team ignores API latency alerts because "that's the app team's issue." Meanwhile, both alerts keep firing, and your on-call engineer suffers.</p>
<p>The fix is radical but simple: <strong>every alert must have a single owner</strong>. Not a team, not a rotation—a specific service and the team that owns it. This means:</p>
<ul>
<li><p>No more #alerts-general channels where everything dumps</p>
</li>
<li><p>No more "infrastructure noise" channels that everyone mutes</p>
</li>
<li><p>Each team gets their own alert destinations</p>
</li>
<li><p>Each team is accountable for their signal-to-noise ratio</p>
</li>
</ul>
<p>Implement this with proper tagging:</p>
<pre><code class="lang-yaml">alert: HighAPILatency
service: payment-api
team: payments
owner: payments-team@company.com
escalation: payments-oncall
severity: P2
</code></pre>
<p>When alerts have clear ownership, magic happens. The payments team suddenly cares about that flapping API alert because it's waking them up, not some random SRE. They'll fix it, tune it, or justify why it needs to stay.</p>
<p>✅ <strong>Tip:</strong> Alerts without owners almost never get fixed. They become background noise that everyone learns to ignore.</p>
<h2 id="heading-phase-3-enrich-then-tune">Phase 3 – Enrich, Then Tune</h2>
<h3 id="heading-rich-alerts-less-cognitive-load-faster-response">Rich alerts = less cognitive load = faster response</h3>
<p>Now that you have coverage and ownership, it's time to make your alerts actually useful. A bare-bones "Service X is down" notification forces engineers to context-switch, investigate, and piece together what's happening. Rich alerts provide everything upfront.</p>
<p>Essential enrichment includes the following (see the example rule after this list):</p>
<ul>
<li><p><strong>Runbook links</strong>: Step-by-step remediation instructions</p>
</li>
<li><p><strong>Severity levels</strong>: Is this customer-impacting or internal-only?</p>
</li>
<li><p><strong>Business impact</strong>: How many users affected? Which features degraded?</p>
</li>
<li><p><strong>Recent changes</strong>: Did a deployment just go out?</p>
</li>
<li><p><strong>Historical context</strong>: Has this happened before? How was it fixed?</p>
</li>
</ul>
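<p>In Prometheus-style rules, most of this enrichment can live in labels and annotations. A minimal sketch, where the metric names, runbook URL, and dashboard link are placeholders rather than real endpoints:</p>
<pre><code class="lang-yaml">- alert: PaymentAPIErrorRateHigh
  expr: sum(rate(payment_api_errors_total[5m])) / sum(rate(payment_api_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: P2          # internal impact, not yet customer-facing
    team: payments
  annotations:
    summary: "payment-api error rate above 5% for 5 minutes"
    impact: "Card payments degraded; wallet checkout unaffected"
    runbook: "https://wiki.example.com/runbooks/payment-api-errors"
    dashboard: "https://grafana.example.com/d/payment-api"
</code></pre>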
<p>But richness isn't verbosity. Don't dump entire log files into alerts. Instead, provide precisely what's needed for rapid decision-making.</p>
<p>Only after enrichment should you start tuning:</p>
<p><strong>Add intelligent conditions</strong>: Instead of alerting on every spike, require sustained problems (see the sketch after this list):</p>
<ul>
<li><p>Alert only after 3 consecutive failures</p>
</li>
<li><p>Require issues to persist for 5 minutes</p>
</li>
<li><p>Use percentage-based thresholds (5% of requests failing vs. 10 absolute failures)</p>
</li>
</ul>
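<p>In Prometheus, the first two conditions are typically expressed through the evaluation interval and the <code>for</code> clause. A minimal sketch, assuming a 1-minute evaluation interval and the blackbox exporter's <code>probe_success</code> metric:</p>
<pre><code class="lang-yaml"># With a 1-minute evaluation interval, "for: 3m" requires roughly three
# consecutive failing evaluations before the alert fires at all.
- alert: HealthCheckFailing
  expr: probe_success{job="blackbox"} == 0
  for: 3m
</code></pre>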
<p><strong>Adjust thresholds based on reality</strong>: That 80% CPU alert made sense with your old infrastructure. But if your auto-scaling kicks in at 70%, you're alerting on normal operations.</p>
<p><strong>Add flapping protection</strong>: If an alert fires and resolves repeatedly, it needs damping (see the routing sketch after this list):</p>
<ul>
<li><p>Require state changes to persist before alerting</p>
</li>
<li><p>Group rapid-fire alerts into single notifications</p>
</li>
<li><p>Add cooldown periods between alerts</p>
</li>
</ul>
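<p>In an Alertmanager-style setup, grouping and cooldowns are routing concerns. A minimal sketch, where the receiver name is a placeholder:</p>
<pre><code class="lang-yaml"># Alertmanager routing: batch rapid-fire alerts and add cooldowns
route:
  receiver: payments-oncall
  group_by: ['alertname', 'service']
  group_wait: 30s       # wait before sending the first notification for a group
  group_interval: 5m    # batch further alerts joining an already-firing group
  repeat_interval: 4h   # cooldown before re-notifying about the same alerts
</code></pre>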
<p>✅ <strong>Insight:</strong> Context beats volume every time. One well-enriched alert is worth ten noisy notifications.</p>
<h2 id="heading-phase-4-use-data-to-improve-over-time">Phase 4 – Use Data to Improve Over Time</h2>
<h3 id="heading-enter-alert-insights-by-drdroid">Enter: Alert Insights by DrDroid</h3>
<p>Here's where most frameworks fail: they're static. Teams implement phases 1-3, declare victory, and move on. Six months later, they're back to alert fatigue because systems evolve but alerts don't.</p>
<p>You need a continuous feedback loop—a way to measure what's working and what's still broken. This is where <strong>Alert Insights</strong> becomes your secret weapon.</p>
<p>After implementing your alert structure, Alert Insights provides ongoing intelligence:</p>
<ul>
<li><p><strong>Which alerts are firing too often?</strong> That P1 alert that fires 50 times per week probably needs adjustment</p>
</li>
<li><p><strong>Which ones are being ignored?</strong> If engineers acknowledge but never act on an alert, it's pure noise</p>
</li>
<li><p><strong>Which lack runbooks or clear owners?</strong> Gaps in your enrichment strategy become visible</p>
</li>
<li><p><strong>What can be safely muted, disabled, or improved?</strong> Data-driven recommendations, not guesswork</p>
</li>
</ul>
<p>The workflow becomes systematic:</p>
<p><strong>Every sprint:</strong></p>
<ol>
<li><p>Review Alert Insights dashboard</p>
</li>
<li><p>Identify the top 3 worst offenders</p>
</li>
<li><p>Fix ownership, enrichment, or tuning for those alerts</p>
</li>
<li><p>Validate improvements in the next sprint</p>
</li>
<li><p>Repeat</p>
</li>
</ol>
<p>This creates a virtuous cycle. Your alerts get better every sprint. Your on-call experience improves measurably. And you maintain coverage while reducing noise.</p>
<p>➡️ <strong>🧠 Want a clear report on which alerts are hurting your team? 👉</strong> <a target="_blank" href="https://drdroid.io/doctor-droid-slack-integration"><strong>Run DrDroid Alert Insights</strong></a> <strong>— no config required.</strong></p>
<h2 id="heading-bringing-it-all-together-your-teams-framework">Bringing It All Together — Your Team's Framework</h2>
<p>Here's your systematic approach to intelligent alerting:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Phase</strong></td><td><strong>Goal</strong></td><td><strong>Key Action</strong></td></tr>
</thead>
<tbody>
<tr>
<td>1. Coverage First</td><td>Avoid blind spots</td><td>Map alerts to SLOs</td></tr>
<tr>
<td>2. Ownership</td><td>Accountability</td><td>Assign alerts to teams</td></tr>
<tr>
<td>3. Enrichment &amp; Tuning</td><td>Faster resolution</td><td>Add context, reduce flapping</td></tr>
<tr>
<td>4. Feedback Loop</td><td>Continuous improvement</td><td>Use Alert Insights regularly</td></tr>
</tbody>
</table>
</div><p>This isn't a one-time project—it's an ongoing practice. Just like you continuously refactor code, you need to continuously refine alerts. The difference is that now you have a framework and the data to guide your decisions.</p>
<h2 id="heading-final-thought-you-cant-fix-what-you-dont-see">Final Thought — You Can't Fix What You Don't See</h2>
<p>Most teams exist in one of two failure modes. They either suffer in silence with alert fatigue, accepting it as the cost of observability. Or they oversimplify their alerting, creating dangerous blind spots that only become visible during incidents.</p>
<p>Real success looks different: high signal, low noise, and fast resolution. It's alerts that wake you up only when customer impact is imminent. It's notifications that include everything needed to respond. It's a system that improves continuously based on data, not opinions.</p>
<p>This framework gives you the path. Phase by phase, you can transform your alerting from a source of frustration into a competitive advantage. But frameworks only work when you can measure their impact.</p>
<p>Let <strong>Alert Insights</strong> be your guide. It shows what's working, what's broken, and exactly how to improve. No more guessing which alerts to tune. No more hoping you haven't created blind spots. Just data-driven improvements that make your team's life better.</p>
<p>Your engineers deserve better than alert fatigue. Your customers deserve better than missed incidents. This framework delivers both.</p>
<p>➡️ <strong>🛠️ Tired of guessing which alerts are noisy? 👉</strong> <a target="_blank" href="https://drdroid.io/doctor-droid-slack-integration"><strong>Try Alert Insights</strong></a> <strong>and start tuning your alerts based on real data.</strong></p>
]]></content:encoded></item><item><title><![CDATA[KubeCon + CloudNativeCon Europe 2025 Guide – London]]></title><description><![CDATA[Doctor Droid’s Guide to KubeCon + CloudNativeCon Europe 2025
Welcome to our complete guide for navigating KubeCon + CloudNativeCon Europe 2025 in London, England, running from 1–4 April 2025. Whether you’re a seasoned cloud native pro or new to the K...]]></description><link>https://notes.drdroid.io/kubecon-cloudnativecon-europe-2025-guide-london</link><guid isPermaLink="true">https://notes.drdroid.io/kubecon-cloudnativecon-europe-2025-guide-london</guid><category><![CDATA[KubeConLondon]]></category><category><![CDATA[Kubecon]]></category><category><![CDATA[Kubernetes]]></category><dc:creator><![CDATA[Jayesh Sadhwani]]></dc:creator><pubDate>Fri, 14 Feb 2025 08:32:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1739516728159/d72bee5d-8e33-4e45-b305-4bae01e37d47.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-doctor-droids-guide-to-kubecon-cloudnativecon-europe-2025">Doctor Droid’s Guide to KubeCon + CloudNativeCon Europe 2025</h2>
<p>Welcome to our complete guide for navigating KubeCon + CloudNativeCon Europe 2025 in London, England, running from <strong>1–4 April 2025</strong>. Whether you’re a seasoned cloud native pro or new to the Kubernetes world, this guide has everything you need to maximize your conference experience. And don’t forget – visit the Doctor Droid booth at the event! Show us this blog to score an exclusive <strong>20% discount on your ticket</strong> plus a chance to receive special Doctor Droid credits!</p>
<hr />
<h2 id="heading-overview">Overview</h2>
<p>KubeCon + CloudNativeCon is the premier conference for Kubernetes, cloud-native technologies, and open source innovations. Organized by the Cloud Native Computing Foundation (CNCF), this flagship event gathers thousands of developers, engineers, and industry leaders to share ideas, network, and explore the latest trends shaping the future of cloud computing. In the heart of London, expect inspiring keynotes, deep-dive sessions, and engaging co-located events that cover everything from AI/ML integration to edge computing and beyond.</p>
<hr />
<h2 id="heading-access-types">Access Types</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739516645338/d12f9286-4b56-49db-86fa-fbd8279088b2.png" alt class="image--center mx-auto" /></p>
<p>KubeCon + CloudNativeCon Europe 2025 offers a single, all-inclusive pass that grants access to:</p>
<ul>
<li><p>All keynote sessions, breakout tracks, and panel discussions</p>
</li>
<li><p>Hands-on labs and workshops for a practical dive into cloud-native solutions</p>
</li>
<li><p>Co-located events hosted by CNCF and industry partners</p>
</li>
</ul>
<p>Additionally, there are special discounted options available for students and academic participants. For more details, check out the official <a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/">KubeCon + CloudNativeCon Europe website</a> for registration options.</p>
<hr />
<h2 id="heading-exclusive-kubecon-cloudnativecon-europe-2025-discount-save-20-on-tickets-courtesy-of-doctor-droid">Exclusive KubeCon + CloudNativeCon Europe 2025 Discount – Save 20% on Tickets, Courtesy of Doctor Droid!</h2>
<p>That’s right – Doctor Droid is proud to sponsor KubeCon + CloudNativeCon Europe 2025, and we’re offering you an exclusive 20% discount on your ticket! Here’s how to claim your savings:</p>
<ol>
<li><p><strong>Fill out our quick</strong> <a target="_blank" href="https://forms.gle/rMg1xAP34rA1jdM99"><strong>Google Form</strong></a> with your basic information.</p>
</li>
<li><p><strong>Receive your discount code</strong> directly in your inbox.</p>
</li>
<li><p><strong>Register</strong> on the official event website and enjoy your 20% savings!</p>
</li>
</ol>
<p>Hurry up – secure your discount today and get ready for an unforgettable experience in London!</p>
<hr />
<h2 id="heading-speakers-amp-tracks-at-kubecon-cloudnativecon-europe-2025">Speakers &amp; Tracks at KubeCon + CloudNativeCon Europe 2025</h2>
<p>The conference features an impressive line-up of industry thought leaders and technical experts across multiple tracks, including:</p>
<ul>
<li><p><strong>Kubernetes Operations:</strong> Best practices in deployment, scaling, and security.</p>
</li>
<li><p><strong>Cloud Security:</strong> Deep dives into safeguarding cloud native environments.</p>
</li>
<li><p><strong>AI &amp; Machine Learning:</strong> Innovations transforming how we manage and operate Kubernetes.</p>
</li>
<li><p><strong>Edge Computing:</strong> Exploring the future of distributed computing in real-world scenarios.</p>
</li>
</ul>
<p>Be sure to check out the detailed schedule on the <a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/">official event page</a> for a complete list of sessions and speakers.</p>
<hr />
<h2 id="heading-planning-your-london-experience">Planning Your London Experience</h2>
<h3 id="heading-where-to-stay">Where to Stay</h3>
<p>London offers a range of accommodation options to suit every budget:</p>
<ul>
<li><p><strong>Hotels:</strong> Browse options on <a target="_blank" href="https://www.booking.com/">Booking.com</a>, <a target="_blank" href="https://www.agoda.com/">Agoda</a>, or <a target="_blank" href="https://www.airbnb.co.uk/">Airbnb</a> for a comfortable stay near the event venue.</p>
</li>
<li><p><strong>Short-term Rentals:</strong> Consider serviced apartments if you prefer a homier experience during your stay.</p>
</li>
</ul>
<h3 id="heading-getting-there">Getting There</h3>
<p>London is well connected by international air travel:</p>
<ul>
<li><p><strong>Heathrow Airport (LHR)</strong> – The largest and busiest airport, with easy public transit into central London.</p>
</li>
<li><p><strong>Gatwick Airport (LGW)</strong> – A convenient alternative for many international travellers.</p>
</li>
</ul>
<p>Plan your journey ahead to make the most of your time in this vibrant city!</p>
<hr />
<h2 id="heading-visiting-london">Visiting London</h2>
<h3 id="heading-culinary-delights">Culinary Delights</h3>
<p>London’s food scene is as diverse as it is delicious. Whether you’re looking for Michelin-starred restaurants or quirky food markets, here are some recommendations:</p>
<ul>
<li><p><strong>Restaurants:</strong> Try iconic spots in Soho or trendy eateries in Shoreditch.</p>
</li>
<li><p><strong>After-Hours:</strong> Explore vibrant nightlife in Camden or the West End for live music and cocktails.</p>
</li>
<li><p><strong>Coffee Shops:</strong> Recharge at local favorites like Monmouth Coffee or The Attendant for a caffeine boost.</p>
</li>
</ul>
<h3 id="heading-weekend-plans">Weekend Plans</h3>
<p>When you’re not immersed in conference sessions, take some time to explore London’s rich history and culture:</p>
<p><img src="https://www.londonperfect.com/cdn-cgi/image/format=auto,width=1256/https://www.londonperfect.com/g/photos/upload/sml_342226895-1498585820-london-eye-guide.jpg" alt="London Eye" /></p>
<ul>
<li><p><strong>The British Museum:</strong> Discover art and antiquities from around the world.</p>
</li>
<li><p><strong>Tower of London:</strong> Step back in time with a visit to this historic fortress.</p>
</li>
<li><p><strong>Buckingham Palace &amp; Changing of the Guard:</strong> A must-see for first-time visitors.</p>
</li>
<li><p><strong>London Eye:</strong> Enjoy panoramic views of the city skyline.</p>
</li>
</ul>
<hr />
<h2 id="heading-visit-the-doctor-droid-booth">Visit the Doctor Droid Booth</h2>
<p>Doctor Droid is the intelligent Slack bot that accelerates incident diagnosis by automatically pinpointing the root cause of production issues. Simply tag the bot in your alert messages, and let it do the heavy lifting!</p>
<p>Stop by our booth at KubeCon + CloudNativeCon Europe 2025 to discover:</p>
<ul>
<li><p><strong>Live Demos:</strong> See Doctor Droid in action and learn how it can transform your incident response.</p>
</li>
<li><p><strong>Puzzles &amp; Giveaways:</strong> Test your skills and win exciting Doctor Droid goodies.</p>
</li>
<li><p><strong>$500 Doctor Droid Credits:</strong> Show this blog at our booth and receive $500 in credits to supercharge your troubleshooting capabilities.</p>
</li>
</ul>
<p>For those interested in a one-on-one demo, pre-book a meeting with us <a target="_blank" href="https://calendly.com/siddarthjain/kubecon-2024-demo">here</a>.</p>
<hr />
<p>Get ready to experience the future of cloud native computing in one of the world’s most exciting cities – London awaits at KubeCon + CloudNativeCon Europe 2025!</p>
<hr />
<p><em>Happy conferencing, and see you in London!</em></p>
]]></content:encoded></item><item><title><![CDATA[Tools can't buy you good MTTR.. but these 3 practices can]]></title><description><![CDATA[Context
It’s a scenario we’ve all witnessed: teams equipped with cutting-edge observability tools still struggling to catch issues before customers notice.
They’ve invested heavily in top-tier APM solutions, container and infrastructure monitoring, a...]]></description><link>https://notes.drdroid.io/tools-cant-buy-you-good-mttr-but-these-3-practices-can</link><guid isPermaLink="true">https://notes.drdroid.io/tools-cant-buy-you-good-mttr-but-these-3-practices-can</guid><category><![CDATA[observability]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[incident response]]></category><category><![CDATA[logging]]></category><category><![CDATA[#prometheus]]></category><dc:creator><![CDATA[Siddarth Jain]]></dc:creator><pubDate>Tue, 26 Nov 2024 10:08:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732615128540/89904969-35c9-4170-9a6e-3b58ce1cbaa0.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-context">Context</h3>
<p>It’s a scenario we’ve all witnessed: teams equipped with cutting-edge observability tools still struggling to catch issues before customers notice.</p>
<p>They’ve invested heavily in top-tier APM solutions, container and infrastructure monitoring, and log accessibility. Yet, their on-call engineers remain overwhelmed. Incidents happen more frequently than anyone would like, and the spotlight they find themselves in post-incident is never the kind they want.</p>
<p>For engineering teams, being called out for production issues is a tough pill to swallow. The key lies in post-incident action plans that lead to meaningful, systematic improvements.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732612857673/0f6372d4-a2a8-498b-82cc-e3e71c7709cd.png" alt="Whodunnit - I should know before others" class="image--center mx-auto" /></p>
<p>While production incidents can’t be entirely eliminated, well-thought-out preventive measures can dramatically improve operational health.</p>
<h3 id="heading-tools-are-the-baselinenot-the-answer"><strong>Tools Are the Baseline—Not the Answer</strong></h3>
<p>While tools are essential, they primarily address infrastructure or service-level issues. However, most real-world incidents cascade across multiple stacks, often affecting features, products, or customer experiences—areas that are rarely solved by out-of-the-box tools.</p>
<p>To reduce MTTR, teams need processes that improve detection, diagnosis, and resolution speed.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732613249893/ffc38b35-e29c-426e-8c57-7f87271c05e1.png" alt="Cascading issues from infrastructure to customer experience" class="image--center mx-auto" /></p>
<h3 id="heading-measures-for-reducing-mttr-drastically">Measures for reducing MTTR drastically:</h3>
<p>Here are three practices that I have seen help teams improve MTTR significantly:</p>
<ol>
<li><p><strong>Improving Actionability of Alerts (Faster Detection)</strong></p>
<ul>
<li><p>Trustworthy alerts are a cornerstone of effective incident management. Engineers need a single source of truth to detect issues early—before customers or business stakeholders notice.</p>
</li>
<li><p>Poorly configured alerts can destroy this trust, leading teams to rely on escalations from support or business teams instead. Monitoring alert quality is critical. For example, many companies using <a target="_blank" href="https://drdroid.io/doctor-droid-slack-integration">Doctor Droid</a> track alert quality to ensure non-actionable alerts don’t erode confidence in their systems.</p>
</li>
</ul>
</li>
<li><p><strong>Instrumenting Custom Metrics</strong></p>
<ul>
<li><p>Custom metrics are invaluable for tracking operational health and catching issues tied to features and product breakages. Unlike generic service-level metrics, custom metrics provide leading indicators that can help teams spot potential failures before they escalate.</p>
</li>
<li><p>By focusing on metrics relevant to their features and customer experience, teams can gain clarity and react faster.</p>
</li>
</ul>
</li>
<li><p><strong>Faster Fixing Through Runbooks and Quick Links</strong></p>
<ul>
<li><p>Developer experience during on-call is often overlooked. Simple resources like runbooks or quick links for known issues can dramatically reduce the cognitive load on engineers.</p>
</li>
<li><p>For example, a link to a pre-built log query can save critical minutes during an incident. These tools empower teams to pinpoint issues faster, enabling quicker resolutions (see the sketch after this list).</p>
</li>
</ul>
</li>
</ol>
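<p>To illustrate points 2 and 3 together, here is a minimal sketch of a leading-indicator alert on a custom product metric, enriched with a runbook link and a pre-built log query. The metric name and URLs are hypothetical placeholders:</p>
<pre><code class="lang-yaml"># Fire when order placement falls below half of yesterday's rate
- alert: OrderPlacementDropped
  expr: sum(rate(orders_placed_total[10m])) &lt; 0.5 * sum(rate(orders_placed_total[10m] offset 1d))
  for: 10m
  labels:
    severity: P1
    team: checkout
  annotations:
    summary: "Order placement rate is below 50% of the same time yesterday"
    runbook: "https://wiki.example.com/runbooks/order-drop"
    logs: "https://logs.example.com/goto/order-errors-last-1h"
</code></pre>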
<h3 id="heading-conclusion">Conclusion:</h3>
<p>No matter how much you spend on tools, improving MTTR requires engineering investment in processes that enhance detection, diagnosis, and resolution. Custom metrics, actionable alerts, and developer-friendly resources are what truly make the difference.</p>
<p>Engineering teams that focus on these practices find themselves more prepared, more resilient, and better positioned to handle the inevitable challenges of production.</p>
<p><strong>Want to monitor your alerting quality and improve MTTR? Doctor Droid has helped 40+ companies take their incident management to the next level. Get started for free and improve your alerts today!</strong></p>
]]></content:encoded></item><item><title><![CDATA[KubeCon + CloudNativeCon India 2024 Guide -- Delhi]]></title><description><![CDATA[Doctor Droid’s Guide to KubeCon + CloudNativeCon India 2024
Welcome to our complete guide for navigating KubeCon + CloudNativeCon India 2024! Here’s all you need to know to make the most of this event in Delhi, India, from December 11-12, 2024. Don’t...]]></description><link>https://notes.drdroid.io/kubecon-cloudnativecon-india-2024-guide-delhi</link><guid isPermaLink="true">https://notes.drdroid.io/kubecon-cloudnativecon-india-2024-guide-delhi</guid><category><![CDATA[kubeconIN]]></category><category><![CDATA[Kubecon]]></category><category><![CDATA[india]]></category><dc:creator><![CDATA[Jayesh Sadhwani]]></dc:creator><pubDate>Tue, 12 Nov 2024 19:42:07 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-doctor-droids-guide-to-kubecon-cloudnativecon-india-2024"><strong>Doctor Droid’s Guide to KubeCon + CloudNativeCon India 2024</strong></h2>
<p>Welcome to our complete guide for navigating KubeCon + CloudNativeCon India 2024! Here’s all you need to know to make the most of this event in Delhi, India, from December 11-12, 2024. Don’t forget to visit Doctor Droid at Booth—show us this blog for a chance to receive $500 worth of Doctor Droid credits!</p>
<hr />
<h3 id="heading-overview"><strong>Overview</strong></h3>
<p>KubeCon + CloudNativeCon is the leading conference for Kubernetes, cloud-native technologies, and open-source solutions. Hosted by the Cloud Native Computing Foundation (CNCF), this event gathers thousands of developers, engineers, and business leaders to exchange knowledge, network, and discover the future of cloud-native ecosystems.</p>
<hr />
<h3 id="heading-access-types"><strong>Access Types</strong></h3>
<p><strong>KubeCon + CloudNativeCon India 2024</strong> offers a single access pass that caters to all types of attendees and includes access to all sessions, keynotes, and co-located events.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731366135812/1aac2ac0-ed1f-493b-a6a7-156a99db3252.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p><strong>Academic Pass</strong>: Discounted pass for students looking to dive into cloud-native technologies.</p>
</li>
<li><p><strong>Individual pass</strong>: Perfect for attendees who are paying for the conference by themselves.</p>
</li>
</ul>
<p>Be sure to review the registration options on the official <strong>KubeCon + CloudNativeCon India 2024</strong> <a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-india"><strong>website</strong></a>.</p>
<h3 id="heading-exclusive-kubecon-cloudnativecon-india-2024-discount-save-20-on-tickets-courtesy-of-doctor-droid"><strong>Exclusive KubeCon + CloudNativeCon India 2024 Discount – Save 20% on Tickets, Courtesy of Doctor Droid!</strong></h3>
<p>That’s right—Doctor Droid is a proud sponsor of KubeCon + CloudNativeCon India 2024, and we’re hooking you up with an exclusive 20% discount on your tickets! Here’s how to secure your spot and save big:</p>
<ol>
<li><p><strong>Fill out this</strong> <a target="_blank" href="https://forms.gle/4obXDnZ41nQBNMH18"><strong>Google Form</strong></a> with your basic info.</p>
</li>
<li><p><strong>Get your 20% discount code</strong> delivered straight to your inbox.</p>
</li>
<li><p><strong>Register for KubeCon + CloudNativeCon India 2024</strong> and enjoy the savings!</p>
</li>
</ol>
<p>Don’t sit on this—grab your discount before it’s gone!</p>
<h3 id="heading-speakers-amp-tracks-at-kubecon-cloudnativecon-india-2024"><strong>Speakers &amp; Tracks at KubeCon + CloudNativeCon India 2024</strong></h3>
<p><strong>KubeCon + CloudNativeCon</strong> features keynotes from industry leaders and breakout sessions across multiple tracks, including:</p>
<ul>
<li><p><strong>Kubernetes Operations</strong>: Topics around the deployment, scaling, and security of Kubernetes.</p>
</li>
<li><p><strong>AI and Machine Learning</strong>: How AI and ML are transforming Kubernetes.</p>
</li>
<li><p><strong>Edge Computing</strong>: Use cases and solutions for Kubernetes at the edge.</p>
</li>
<li><p><strong>Cloud Security</strong>: Tools and practices to ensure security in a cloud-native environment.</p>
</li>
</ul>
<h3 id="heading-planning-for-kubecon-cloudnativecon-india-2024-delhi-logistics"><strong>Planning for KubeCon + CloudNativeCon India 2024 Delhi Logistics</strong></h3>
<h4 id="heading-stays-near-delhi">Stays near Delhi</h4>
<p>Delhi has several convenient options for accommodation. Whether you prefer hotels or Airbnbs, here are a few recommendations:</p>
<ul>
<li><p><strong>Hotels</strong>: You can find a number of hotels on <a target="_blank" href="http://booking.com"><strong>booking.com</strong></a>, <a target="_blank" href="https://agoda.com">agoda.com</a> or <a target="_blank" href="https://makemytrip.com">makemytrip.com</a></p>
</li>
<li><p><strong>Airbnb</strong>: Airbnb has a healthy number of properties available in the city</p>
</li>
</ul>
<h4 id="heading-airports-nearby">Airports nearby</h4>
<ul>
<li><strong>Indira Gandhi International Airport (DEL)</strong> is the nearest airport to the convention center</li>
</ul>
<h3 id="heading-visiting-delhi"><strong>Visiting Delhi</strong></h3>
<h4 id="heading-restaurants">Restaurants</h4>
<p>Delhi offers a diverse culinary scene. Recommended spots include:</p>
<h4 id="heading-after-hour-locations">After Hour Locations</h4>
<h4 id="heading-coffee-shops">Coffee Shops</h4>
<p>Need a caffeine boost? Check out:</p>
<ul>
<li><p><strong>Blue Tokai Coffee</strong>: Known for freshly roasted coffee.</p>
</li>
<li><p><strong>Third Wave Coffee</strong>: Multiple locations to elevate your coffee experience</p>
</li>
</ul>
<h4 id="heading-weekend-plans">Weekend Plans</h4>
<p>Consider exploring Delhi’s culture and history over the weekend. Popular options include:</p>
<ul>
<li><p><strong>Red Fort</strong>: A historic red sandstone fort in Old Delhi, great for experiencing the city’s culture and food</p>
<p>  <img src="https://lh3.googleusercontent.com/p/AF1QipMzixUK6xvfX9g6zKxOepzWuvo1AfY43mJZAC9g=s1360-w1360-h1020-rw" alt="Photo of Red Fort Lahori Gate" /></p>
</li>
<li><p><strong>Akshardham Temple</strong>: A stunning modern temple complex showcasing India’s rich cultural heritage through intricate architecture, gardens, and a beautiful water show. It’s a must-visit for its grandeur and serenity.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731440167898/44a32c59-d444-4824-9ad8-c0d35de566e0.png" alt class="image--center mx-auto" /></p>
<p>There are many more places you can check out while in Delhi:</p>
<ol>
<li><p><strong>India Gate</strong>: A monumental war memorial built in honor of Indian soldiers, surrounded by beautiful lawns, making it a great spot for a peaceful evening stroll.</p>
</li>
<li><p><strong>Lotus Temple</strong>: Known for its stunning flower-like shape, this temple is a peaceful place for meditation.</p>
</li>
<li><p><strong>National Museum</strong>: A treasure trove of India’s art, history, and culture spanning millennia.</p>
</li>
<li><p><strong>Dilli Haat</strong>: A market with traditional handicrafts and foods from various Indian states. A must-visit for authentic Indian souvenirs.</p>
</li>
</ol>
<h3 id="heading-visit-doctor-droid-booth"><strong>Visit Doctor Droid Booth</strong></h3>
<p>Doctor Droid is a root-cause identification Slack bot that assists on-call engineers in diagnosing incidents and finding the root cause really fast. All you need to do is reply to your alert message in Slack and tag the bot. If you are interested in a demo and want to explore Doctor Droid further, visit us at our booth in the venue!</p>
<p><a target="_blank" href="https://calendly.com/siddarthjain/kubecon-2024-demo"><strong>Pre-book a meeting with us using this link.</strong></a></p>
<p>Stop by our booth to discover how Doctor Droid’s automated RCA can help you debug &amp; fix your production issues faster! What else is up for grabs at the event?</p>
<ul>
<li><p><strong>Puzzles &amp; Goodies</strong>: Test your mental muscle and win some amazing gifts.</p>
</li>
<li><p><strong>$500 Doctor Droid Credits</strong>: Show this blog at our booth to receive $500 in Doctor Droid credits!</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[KubeCon + CloudNativeCon North America 2024 Guide -- Salt Lake City, Utah]]></title><description><![CDATA[Doctor Droid’s Guide to KubeCon + CloudNativeCon North America 2024
Welcome to our complete guide for navigating KubeCon + CloudNativeCon North America 2024! Here’s all you need to know to make the most of this event in Salt Lake City, Utah, from Nov...]]></description><link>https://notes.drdroid.io/kubecon-cloudnativecon-north-america-2024-guide-salt-lake-city-utah</link><guid isPermaLink="true">https://notes.drdroid.io/kubecon-cloudnativecon-north-america-2024-guide-salt-lake-city-utah</guid><category><![CDATA[Kubecon]]></category><category><![CDATA[#cloudnativecon]]></category><dc:creator><![CDATA[Siddarth Jain]]></dc:creator><pubDate>Mon, 04 Nov 2024 19:01:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1730746786664/8e2af6e0-4460-4e9c-983a-63cc29ee2dc7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-doctor-droids-guide-to-kubecon-cloudnativecon-north-america-2024"><strong>Doctor Droid’s Guide to KubeCon + CloudNativeCon North America 2024</strong></h2>
<p>Welcome to our complete guide for navigating KubeCon + CloudNativeCon North America 2024! Here’s all you need to know to make the most of this event in Salt Lake City, Utah, from November 12-15, 2024. Don’t forget to visit Doctor Droid at Booth Q45—show us this blog for a chance to receive $500 worth of Doctor Droid credits!</p>
<hr />
<h3 id="heading-overview">Overview</h3>
<p>KubeCon + CloudNativeCon is the leading conference for Kubernetes, cloud-native technologies, and open-source solutions. Hosted by the Cloud Native Computing Foundation (CNCF), this event gathers thousands of developers, engineers, and business leaders to exchange knowledge, network, and discover the future of cloud-native ecosystems.</p>
<hr />
<h3 id="heading-access-types">Access Types</h3>
<p><strong>KubeCon + CloudNativeCon North America 2024</strong> offers different access passes to cater to all types of attendees:</p>
<ul>
<li><p><strong>Full Access Pass</strong>: Includes access to all sessions, keynotes, and co-located events</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730744336800/322260fe-30a6-4f60-90df-287510cd7a30.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>KubeCon + CloudNativeCon Only Pass:</strong> Includes access to all sessions, keynotes, excluding co-located events</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730744324464/55a245ea-b06f-462a-859c-a756f318459c.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Academic Pass</strong>: Discounted pass for students looking to dive into cloud-native technologies.</p>
</li>
<li><p><strong>Individual pass</strong>: Perfect for attendees who are paying for the conference by themselves.</p>
</li>
</ul>
<p>Be sure to review the registration options on the official <strong>KubeCon + CloudNativeCon North America 2024</strong> <a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/">website</a>.</p>
<h3 id="heading-exclusive-kubecon-cloudnativecon-north-america-2024-discount-save-20-on-tickets-courtesy-of-doctor-droid"><strong>Exclusive KubeCon + CloudNativeCon North America 2024 Discount – Save 20% on Tickets, Courtesy of Doctor Droid!</strong></h3>
<p>That’s right—Doctor Droid is a proud sponsor of KubeCon + CloudNativeCon North America 2024, and we’re hooking you up with an exclusive 20% discount on your tickets! Here’s how to secure your spot and save big:</p>
<ol>
<li><p><strong>Fill out this</strong> <a target="_blank" href="https://forms.gle/z3VSYER6RH97Ruv27"><strong>Google Form</strong></a> with your basic info.</p>
</li>
<li><p><strong>Get your 20% discount code</strong> delivered straight to your inbox.</p>
</li>
<li><p><strong>Register for KubeCon + CloudNativeCon North America 2024</strong> and enjoy the savings!</p>
</li>
</ol>
<p>Don’t sit on this—grab your discount before it’s gone!</p>
<hr />
<h3 id="heading-co-located-events-at-kubecon-cloudnativecon-north-america-2024"><strong>Co-Located Events at KubeCon + CloudNativeCon North America 2024</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730743508197/651c827a-8bdd-4102-84a0-81d149df7b70.png" alt class="image--center mx-auto" /></p>
<p>Expand your KubeCon + CloudNativeCon experience by joining co-located events, each tailored to specific interests within the cloud-native realm:</p>
<ul>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/appdevelopercon/"><strong>AppDeveloperCon</strong></a>: Focuses on tools and techniques for building cloud-native applications.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/argocon/"><strong>ArgoCon</strong></a>: Dive into Argo workflows, events, and continuous delivery for Kubernetes.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/backstagecon/"><strong>BackstageCon</strong></a>: Explore Backstage’s developer portal and best practices for engineering platforms.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/cilium-ebpf-day/"><strong>Cilium + eBPF Day</strong></a>: A deep dive into Cilium and eBPF for networking and security.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/cloud-native-kubernetes-ai-day/"><strong>Cloud Native &amp; Kubernetes AI Day</strong></a>: Discuss AI and ML workloads on Kubernetes.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/cloud-native-startupfest/"><strong>CloudNative StartupFest</strong></a>: Networking and insights for startup founders and innovators.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/cloud-native-university/"><strong>Cloud Native University</strong></a>: Educational sessions on the fundamentals of cloud-native technologies.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/data-on-kubernetes-day/"><strong>Data on Kubernetes Day</strong></a>: Explore data management practices and tools for Kubernetes.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/envoycon/"><strong>EnvoyCon</strong></a>: Dedicated to the Envoy proxy community, focusing on networking and observability.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/istio-day/"><strong>Istio Day</strong></a>: Learn about Istio and service mesh technologies in cloud-native environments.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/kubernetes-on-edge-day/"><strong>Kubernetes on Edge Day</strong></a>: Explore the role of Kubernetes in edge computing.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/observability-day/"><strong>Observability Day</strong></a>: A day centered around observability tools and practices in the cloud.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/openfeature-summit/"><strong>OpenFeature Summit</strong></a>: Discussions on feature flagging and experimentation in cloud-native setups.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/opentofu-day/"><strong>OpenTofu Day</strong>:</a> Open-source infrastructure management and IaC best practices.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/platform-engineering-day/"><strong>Platform Engineering Day</strong></a>: Dedicated to platform engineering in cloud-native environments.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/wasmcon/"><strong>WasmCon</strong></a>: Focuses on WebAssembly and its role in cloud-native development.</p>
</li>
</ul>
<hr />
<h3 id="heading-kubecon-cloudnativecon-north-america-2024-unofficial-conference-parties">KubeCon + CloudNativeCon North America 2024 Unofficial Conference Parties</h3>
<p>After a day of learning and networking, unwind at the <strong>unofficial</strong> conference parties.</p>
<ul>
<li><p>Check out <a target="_blank" href="https://conferenceparties.com/kubecon24/">Conference Parties</a> for the latest info on social events.</p>
</li>
<li><p>Check out <a target="_blank" href="https://lu.ma/salt-lake-city">events on lu.ma</a> too to keep exploring.</p>
</li>
</ul>
<h3 id="heading-speakers-amp-tracks-at-kubecon-cloudnativecon-north-america-2024">Speakers &amp; Tracks at KubeCon + CloudNativeCon North America 2024</h3>
<p><strong>KubeCon + CloudNativeCon</strong> features keynotes from industry leaders and breakout sessions across multiple tracks, including:</p>
<ul>
<li><p><strong>Kubernetes Operations</strong>: Topics around the deployment, scaling, and security of Kubernetes.</p>
</li>
<li><p><strong>AI and Machine Learning</strong>: How AI and ML are transforming Kubernetes.</p>
</li>
<li><p><strong>Edge Computing</strong>: Use cases and solutions for Kubernetes at the edge.</p>
</li>
<li><p><strong>Cloud Security</strong>: Tools and practices to ensure security in a cloud-native environment.</p>
</li>
</ul>
<hr />
<h3 id="heading-planning-for-kubecon-cloudnativecon-north-america-2024-utah-logistics">Planning for KubeCon + CloudNativeCon North America 2024 Utah Logistics</h3>
<h4 id="heading-stays-near-salt-lake-city">Stays near Salt Lake City</h4>
<p>Salt Lake City has several convenient options for accommodation. Whether you prefer hotels or Airbnbs, here are a few recommendations:</p>
<ul>
<li><p><strong>Hotels</strong>: As of now, hotels on booking.com and most other websites are sold out.</p>
</li>
<li><p><strong>Airbnb</strong>: Airbnb still has a healthy number of properties available in the city, although most of them are not within walking distance of the Convention Center.</p>
</li>
<li><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730745479198/44962b15-f040-431b-b3d6-5439b2463c6b.png" alt class="image--center mx-auto" /></p>
</li>
</ul>
<h4 id="heading-airports-nearby">Airports nearby</h4>
<ul>
<li><strong>Salt Lake City International Airport (SLC)</strong> is the nearest airport, just 5 miles from downtown.</li>
</ul>
<hr />
<h3 id="heading-visiting-salt-lake-city">Visiting Salt Lake City</h3>
<h4 id="heading-restaurants">Restaurants</h4>
<p>Salt Lake City offers a diverse culinary scene. Recommended spots include:</p>
<ul>
<li><p><strong>Red Iguana</strong>: Known for authentic Mexican cuisine.</p>
</li>
<li><p><strong>The Copper Onion</strong>: A top choice for American fare.</p>
</li>
<li><p><strong>Takashi</strong>: Excellent sushi in the heart of the city.</p>
</li>
</ul>
<h4 id="heading-after-hour-locations">After Hour Locations</h4>
<ul>
<li><p><strong>Beer Bar</strong>: Perfect for a laid-back evening with a variety of beers.</p>
</li>
<li><p><strong>The Bayou</strong>: Offers an extensive selection of brews and Cajun-style food.</p>
</li>
</ul>
<h4 id="heading-coffee-shops">Coffee Shops</h4>
<p>Need a caffeine boost? Check out:</p>
<ul>
<li><p><strong>La Barba Coffee</strong>: Known for artisanal coffee.</p>
</li>
<li><p><strong>Publik Coffee Roasters</strong>: A great spot to unwind with quality brews.</p>
</li>
</ul>
<h4 id="heading-weekend-plans">Weekend Plans</h4>
<p>Consider exploring Utah’s natural beauty over the weekend. Popular options include:</p>
<ul>
<li><p><strong>Bonneville Salt Flats</strong>: A unique desert landscape.</p>
<p>  <img src="https://images.ctfassets.net/0wjmk6wgfops/17oZGsiEevOkg7tpUeaFG0/67bf9df09bff4cbb4136881fa771b789/AdobeStockSaltFlats.jpeg?w=1200&amp;h=630&amp;f=center&amp;fit=fill" alt="Bonneville Salt Flats | Utah.com" /></p>
</li>
<li><p><strong>Big Cottonwood Canyon</strong>: Ideal for scenic drives and hikes.</p>
<p>  <img src="https://dynamic-media-cdn.tripadvisor.com/media/photo-o/15/4a/aa/ab/big-cottonwood-canyon.jpg?w=1200&amp;h=1200&amp;s=1" alt="BIG COTTONWOOD CANYON: All You Need to Know BEFORE You Go" /></p>
</li>
</ul>
<hr />
<h3 id="heading-visit-doctor-droid-booth-q45">Visit Doctor Droid Booth - Q45</h3>
<p>Doctor Droid is a root-cause identification Slack bot that assists on-call engineers in diagnosing incidents and finding the root cause really fast. All you need to do is reply to your alert message in Slack and tag the bot. If you are interested in a demo and want to explore Doctor Droid further, visit us at Booth Q45 in the venue!</p>
<p><a target="_blank" href="https://calendly.com/siddarthjain/kubecon-2024-demo">Pre-book a meeting with us using this link.</a></p>
<p>Stop by Booth Q45 to discover how Doctor Droid’s automated RCA can help you debug &amp; fix your production issues faster! What else is up for grabs at the event?</p>
<ul>
<li><strong>Puzzles &amp; Goodies</strong>: Test your mental muscle and win some amazing gifts.</li>
</ul>
<ul>
<li><strong>$500 Doctor Droid Credits</strong>: Show this blog at our booth to receive $500 in Doctor Droid credits!</li>
</ul>
<hr />
]]></content:encoded></item><item><title><![CDATA[Keeping keys secure without slowing your iteration speed]]></title><description><![CDATA[Context
At Doctor Droid, we are building a cutting-edge AI recommendation platform for on-call teams. Whenever an alert or ticket is raised, Doctor Droid:

Looks for all past investigations and see if it finds anything similar

Looks for SOPs for the...]]></description><link>https://notes.drdroid.io/keeping-keys-secure-without-slowing-your-iteration-speed</link><guid isPermaLink="true">https://notes.drdroid.io/keeping-keys-secure-without-slowing-your-iteration-speed</guid><category><![CDATA[Developer Tools]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Siddarth Jain]]></dc:creator><pubDate>Mon, 28 Oct 2024 14:57:55 GMT</pubDate><content:encoded><![CDATA[<h3 id="heading-context">Context</h3>
<p>At Doctor Droid, we are building a cutting-edge AI recommendation platform for on-call teams. Whenever an alert or ticket is raised, Doctor Droid:</p>
<ul>
<li><p>Looks through all past investigations to see if anything similar has occurred before</p>
</li>
<li><p>Looks for SOPs for the issue at hand (these SOPs are also created by Doctor Droid by reading past Slack threads &amp; existing docs)</p>
</li>
<li><p>Executes autonomous investigation for popular infrastructure &amp; microservices symptoms.</p>
</li>
</ul>
<h3 id="heading-problem-statement">Problem Statement</h3>
<p>This requires a fair bit of experimentation with our early adopters and extensive use of Jupyter Notebooks. Since the notebooks are often not connected to a cloud environment, how does one manage secrets and ensure they are not lying around anywhere? I wanted a solution where keys would be available JUST-IN-TIME (retrieved only at the moment I need to run something) and become unavailable right after.</p>
<h3 id="heading-solution">Solution</h3>
<p>With <a target="_blank" href="https://infisical.com/">Infisical</a>, I found a convenient solution for this issue. Here’s how it works:</p>
<ol>
<li><p>Step 1: Configure keys</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730126554303/563ae6c1-eb2a-4464-b8ba-e1dd53fa83ed.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Step 2: Use APIs to retrieve keys securely on-the-go</p>
<pre><code class="lang-python">import json
import requests

# Authenticate against Infisical (universal auth) to obtain a short-lived token
url = "https://app.infisical.com/api/v1/auth/universal-auth/login"

payload = 'clientSecret=xxxx&amp;clientId=yyyyy'
headers = {
  'Content-Type': 'application/x-www-form-urlencoded'
}

response = requests.request("POST", url, headers=headers, data=payload)
access_token = json.loads(response.text)['accessToken']

# Retrieve the secret just-in-time using the short-lived access token
url = "https://app.infisical.com/api/v3/secrets/raw/KEY_NAME?workspaceId=xxxx&amp;environment=dev"

payload = {}
headers = {
  'Authorization': f'Bearer {access_token}'
}

response = requests.request("GET", url, headers=headers, data=payload)
KEY_VALUE = json.loads(response.text)['secret']['secretValue']
</code></pre>
</li>
</ol>
<h3 id="heading-benefits-of-using-infisical">Benefits of using Infisical:</h3>
<ol>
<li><p>Change the environment and get the updated key</p>
</li>
<li><p>Quarantine keys easily: If you’ve been close to any production incident, you’ll know that being able to flush keys in a jiffy is super important and, at the same time, super difficult because of their underlying dependencies across the stack. Infisical gives me the buffer of instantly disabling access by disabling the Infisical key/secret.</p>
</li>
<li><p>Free to get started: It’s an open-source project with a convenient cloud option</p>
</li>
<li><p>Plenty of features: I’ve probably only used about 5% of the platform so far, so as my requirements expand, I expect to keep discovering new capabilities easily.</p>
</li>
<li><p>Helpful team / community: They have a community, a prompt support team and well-written documentation.</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[How to investigate Sentry Alert with Doctor Droid]]></title><description><![CDATA[Sentry
Sentry is one of the best tools in the industry right now for error and exception tracking. It has high quality SDKs across the stack and has great integrations as well as a powerful dashboard for users to learn about an exception in the code....]]></description><link>https://notes.drdroid.io/how-to-investigate-sentry-alert-with-doctor-droid</link><guid isPermaLink="true">https://notes.drdroid.io/how-to-investigate-sentry-alert-with-doctor-droid</guid><category><![CDATA[observability]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[sentry]]></category><category><![CDATA[Datadog]]></category><dc:creator><![CDATA[Siddarth Jain]]></dc:creator><pubDate>Wed, 16 Oct 2024 03:35:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729028150451/59b9b956-aa10-458f-8340-fd16001a04a8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-sentry">Sentry</h2>
<p>Sentry is one of the best tools in the industry right now for error and exception tracking. It has high quality SDKs across the stack and has great integrations as well as a powerful dashboard for users to learn about an exception in the code. You can read more about <a target="_blank" href="https://sentry.io/">Sentry here</a>.</p>
<h2 id="heading-debugging-an-exception-in-sentry">Debugging an exception in Sentry</h2>
<p>An exception can arise for any number of reasons. It could be a code change at the place where the exception surfaced, a code change upstream, bad user input, or even just an edge case that hadn’t appeared until now. With canary deployments it gets even trickier, as different containers or users could be running different versions of the code.</p>
<p>In production, investigation of a simple looking Sentry issue can span across multiple data sources &amp; contexts:</p>
<ul>
<li><p>Your infrastructure &amp; deployment resources like Kubernetes</p>
</li>
<li><p>Your code repository to check the code for recent changes or even analysing the flow of data</p>
</li>
<li><p>Your database/logs to check for user entered data</p>
</li>
<li><p>Discussion with internal team members regarding expected behaviour</p>
</li>
</ul>
<h2 id="heading-using-doctor-droid-to-debug-the-issue">Using Doctor Droid to debug the issue</h2>
<p><img src="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/bb3e54be-c81e-4e27-94e9-cd70b8512eae.png" alt="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/bb3e54be-c81e-4e27-94e9-cd70b8512eae.png" /></p>
<h3 id="heading-initiating-an-investigation">Initiating an investigation</h3>
<p>You can start investigation of an alert directly from the home page which has all the recent alerts.</p>
<p>Once an investigation is created, this is what it looks like.</p>
<p><img src="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/b842a098-3f4c-4f2e-a880-ae95524f7703.png" alt="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/b842a098-3f4c-4f2e-a880-ae95524f7703.png" /></p>
<p>Here are the key elements of the investigation panel:</p>
<ul>
<li><p>The alerts that it's investigating</p>
</li>
<li><p>The recommended investigation strategy and preliminary data for your evaluation</p>
</li>
<li><p>Additional panels related to related investigations or alerts</p>
</li>
</ul>
<h2 id="heading-investigation-strategy">Investigation Strategy</h2>
<p>So what has it been able to fetch so far? Depending on the alert context, the platform recommends different steps.</p>
<p>It identified that the first thing that it should check is the stack trace itself in Sentry. So it goes and fetches the stack trace from Sentry, including the culprit.</p>
<p><img src="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/cf955f34-af13-4420-bb9e-21206da9cdf9.png" alt="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/cf955f34-af13-4420-bb9e-21206da9cdf9.png" /></p>
<p>Once it fetches that, it looks for recent code changes related to the same stack trace within your GitHub repository. It shows you the recent commits and their URLs, so you can check whether something changed there in the last couple of days.</p>
<p>You can then check whether there was any recent deployment within your Kubernetes infrastructure that could be correlated with it. Given that this alert relates to the prototype instance, we can see a couple of releases for prototype in the last hour, which could potentially be the reason this alert came up. And now you have all the data here.</p>
<p><img src="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/311e56e6-5e55-402a-bc35-41f6b3411528.png" alt="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/311e56e6-5e55-402a-bc35-41f6b3411528.png" /></p>
<p>You can also chat with it and ask it for more data.</p>
<p>What's also good is that it gives you references to existing playbooks, dashboards, or any other data points your system already has. We have integrations with almost every tool your monitoring and observability stack is likely to include.</p>
<p><img src="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/52574c07-4608-41ce-9ed5-786641447b89.png" alt="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/52574c07-4608-41ce-9ed5-786641447b89.png" /></p>
<p>We also offer options to self-host these integrations so that the data remains within your own data plane.</p>
<h2 id="heading-try-it-today">Try it today</h2>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=tx_x8BHCK38">https://www.youtube.com/watch?v=tx_x8BHCK38</a></div>
<p> </p>
<p>If this looks exciting to you, we have a lot more demos coming up, such as how to auto-investigate an API latency alert, or how to investigate CPU utilisation alerts on your databases.</p>
<p>Visit <a target="_blank" href="http://www.drdroid.io">www.drdroid.io</a> and try it out on your own stack. We offer a free trial, and if you have any questions, please reach out to us; we'll be happy to answer.</p>
]]></content:encoded></item><item><title><![CDATA[Dr. Patternson: How Meta reduced their MTTR by 50% using AIOps]]></title><description><![CDATA[Introduction
For Meta, reducing downtime has been crucial to ensuring millions (or should I say Billions?) of users have a seamless experience. Recently, Meta shared about one of their internal platforms that helped reduce MTTR by ~50% for critical a...]]></description><link>https://notes.drdroid.io/dr-patternson-how-meta-reduced-their-mttr-by-50-using-aiops</link><guid isPermaLink="true">https://notes.drdroid.io/dr-patternson-how-meta-reduced-their-mttr-by-50-using-aiops</guid><dc:creator><![CDATA[Karan Sirohi]]></dc:creator><pubDate>Fri, 11 Oct 2024 03:48:46 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1728588725618/869b310b-a262-49b3-a50e-2c31f28b57c5.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>For Meta, reducing downtime has been crucial to ensuring millions (or should I say Billions?) of users have a seamless experience. Recently, <a target="_blank" href="https://atscaleconference.com/the-evolution-of-aiops-at-meta-beyond-the-buzz/">Meta shared details about one of their internal platforms that helped reduce MTTR by ~50% for critical alerts</a>.</p>
<p>This blog explores how Meta accomplished this by leveraging AI, machine learning &amp; runbook automation to transform its incident response processes, making them faster and more efficient.</p>
<p>Let's dive into the key components that enabled this efficiency gain.</p>
<h2 id="heading-objective">Objective</h2>
<p>Imagine trying to find a needle in a haystack—blindfolded. That’s what incident management can feel like without the right tools. Meta’s objective was to take off that blindfold and make the process of finding and fixing problems as swift and accurate as possible. Their goal? Cut down the Mean Time to Resolution (MTTR) by half, so that when things go wrong, they can be fixed faster than you can say “downtime.”</p>
<p>By harnessing the power of AI and machine learning, Meta aimed to automate the grunt work of incident management—spotting issues, figuring out what’s broken, and fixing it—all without requiring a superhero on standby. This isn’t just about cool tech; it’s about making sure users experience as little disruption as possible, turning potential disasters into minor hiccups that barely anyone notices.</p>
<h2 id="heading-what-meta-built">What Meta built?</h2>
<p>Alright, let's peek under the hood of Meta's incident-busting machine. They didn't just slap on a new coat of paint; they rebuilt the entire engine. Here are the three turbocharged components that turned their incident response from a clunky old jalopy into a sleek, AI-powered sports car:</p>
<h3 id="heading-component-1-automated-runbooks">Component 1: Automated Runbooks</h3>
<p>Remember those old-school detective novels where the brilliant sleuth solves the case with a magnifying glass and a pipe? Well, Meta created a digital Sherlock Holmes, minus the pipe smoke. They call it Dr. Patternson (Dr. P for short), and it's like having a tireless detective on call 24/7.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728588890981/44b57c0e-3287-4851-a241-9df4c0eef4e2.png" alt class="image--center mx-auto" /></p>
<p>Dr. P is an automated runbook system that encodes expert knowledge into executable investigation workflows. It's like giving every on-call engineer a cheat sheet written by the smartest person in the room. With its own SDK, simplified APIs, and ML algorithms, Dr. P can quickly analyze data, correlate events, and generate findings faster than you can say "Elementary, my dear Watson."</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728588921209/72f4c352-55ca-4e8e-ba76-834d3870ade3.png" alt class="image--center mx-auto" /></p>
<p>But wait, there's more! Dr. P comes with a fully managed platform that deploys these runbooks, monitors for issues, and even triggers investigations automatically when an alert fires. It's like having a whole team of digital detectives working round the clock, leaving no log unturned.</p>
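<p>Meta hasn't open-sourced Dr. P's SDK, so to make "expert knowledge encoded as an executable workflow" concrete, here is a minimal, hypothetical sketch in Python. Every name in it is invented for illustration; the real system also layers ML, deployment, and monitoring on top.</p>
<pre><code class="lang-python"># Hypothetical mini-runbook in the spirit of Dr. Patternson; none of these
# names come from Meta's actual (non-public) SDK.
from dataclasses import dataclass, field


@dataclass
class Runbook:
    name: str
    steps: list = field(default_factory=list)

    def step(self, fn):
        self.steps.append(fn)  # register an encoded investigation step
        return fn

    def investigate(self, alert: dict) -&gt; list:
        # Run every step against the alert context and collect findings.
        return [(fn.__name__, *fn(alert)) for fn in self.steps]


rb = Runbook("delivery-latency-spike")


@rb.step
def check_recent_deploys(alert):
    deploys = alert.get("recent_deploys", [])
    return ("suspicious" if deploys else "clear", {"deploys": deploys})


@rb.step
def check_error_rate(alert):
    rate = alert.get("error_rate", 0.0)
    return ("suspicious" if rate &gt; 0.05 else "clear", {"error_rate": rate})


# Triggered automatically when an alert fires:
for finding in rb.investigate({"recent_deploys": ["api-v231"], "error_rate": 0.09}):
    print(finding)
</code></pre>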
<h3 id="heading-component-2-analysis-algorithms-service">Component 2: Analysis Algorithms Service</h3>
<p>If Dr. P is the detective, then the Analysis Algorithms Service is its time machine. This nifty piece of tech allows Meta's engineers to zoom through vast amounts of data at warp speed. Picture this: You've got more data than stars in the sky, and you need to find that one glowing red dot that's causing all the trouble. That's where this service comes in. It's packed with ML algorithms for dimensional analysis, time series analysis, anomaly detection, and more. But the real magic is in its pre-aggregation layer, which shrinks datasets by up to 500 times! It's like compressing the entire library of Congress into a pocket-sized book, without losing a single word.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728588946255/ede8c12d-c0ab-4f79-adf7-1322a8a1b3be.png" alt class="image--center mx-auto" /></p>
<p>The result? Insights that used to take hours now pop up in seconds. It's so fast, you might think it's predicting the future. (Spoiler alert: it's not. That's still on the roadmap for 2025.)</p>
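<p>To make "pre-aggregation" concrete, here's a toy sketch in Python with pandas: collapse raw per-event rows into per-minute, per-dimension rollups before any analysis runs. The 500x figure is Meta's; the compression you actually get depends on your data's cardinality.</p>
<pre><code class="lang-python"># Toy pre-aggregation: a million raw latency events collapse into a few
# hundred rollup rows before any analysis algorithm touches them.
import numpy as np
import pandas as pd

n = 1_000_000
raw = pd.DataFrame({
    "ts": pd.Timestamp("2024-10-01")
          + pd.to_timedelta(np.random.randint(0, 3600, n), unit="s"),
    "region": np.random.choice(["us-east", "eu-west", "ap-south"], n),
    "latency_ms": np.random.lognormal(3, 0.5, n),
})

# Per-region, per-minute rollups: count, mean, and max latency.
rollup = (raw.set_index("ts")
             .groupby("region")
             .resample("1min")["latency_ms"]
             .agg(["count", "mean", "max"]))

print(len(raw), "raw rows collapsed into", len(rollup), "aggregated rows")
</code></pre>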
<h3 id="heading-component-3-event-isolation-assistance">Component 3: Event Isolation Assistance</h3>
<p>Last but not least, we have the Event Isolation Assistance. Think of it as a super-smart metal detector for that proverbial needle in a haystack.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728588955915/2f075996-7f70-4fdf-a924-f60b970d644b.png" alt class="image--center mx-auto" /></p>
<p>This system uses ML models to rank thousands of events and pinpoint the root cause of an incident. It's like having a psychic on your team, except this one actually works. By focusing on config-based and code-based isolation, it can filter out 80% of the uninteresting events during an active investigation. That's right, it separates the wheat from the chaff, leaving engineers with a much smaller, much more suspicious pile of events to investigate.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728588965024/d48edc52-69ac-4437-990f-f1133b936a1c.png" alt class="image--center mx-auto" /></p>
<p>But it doesn't just point fingers. The system provides annotations explaining its reasoning, making it transparent and trustworthy. It's like having a really smart friend who not only tells you the answer but also shows their work.</p>
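<p>Meta's ranker is a trained ML model over thousands of signals, but the shape of the problem can be sketched with a hand-rolled scorer. Everything below is illustrative: score each candidate change event against the incident, keep the top suspects, and attach the reasoning as annotations.</p>
<pre><code class="lang-python"># Toy event-isolation scorer; a real system learns these weights from data.
from datetime import datetime, timedelta

incident_start = datetime(2024, 10, 1, 12, 0)
incident_service = "ads-manager"

events = [
    {"id": 1, "type": "config", "service": "ads-manager",
     "ts": incident_start - timedelta(minutes=7)},
    {"id": 2, "type": "code", "service": "feed",
     "ts": incident_start - timedelta(hours=9)},
    {"id": 3, "type": "code", "service": "ads-manager",
     "ts": incident_start - timedelta(minutes=30)},
]

def suspicion(ev):
    score, notes = 0.0, []
    if ev["service"] == incident_service:
        score += 2.0
        notes.append("same service as the incident")
    if ev["type"] in ("config", "code"):
        score += 1.0
        notes.append("config/code changes are prime suspects")
    age_h = (incident_start - ev["ts"]).total_seconds() / 3600
    score += max(0.0, 1.0 - age_h / 24)  # recency bonus decays over 24h
    return score, notes

# Rank all events, keep only the top suspects, and show the annotations.
for ev in sorted(events, key=lambda e: suspicion(e)[0], reverse=True)[:2]:
    score, notes = suspicion(ev)
    print(ev["id"], round(score, 2), "; ".join(notes))
</code></pre>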
<h3 id="heading-guided-investigations">Guided Investigations:</h3>
<p>Sometimes, even the smartest AI needs a human touch. That's where Guided Investigations come in. Think of it as a choose-your-own-adventure book, but for fixing tech problems.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728588972827/a0786254-3ea8-41c0-94b2-5d0baa1530f5.png" alt class="image--center mx-auto" /></p>
<p>These decision trees provide step-by-step workflows that help investigators narrow down the root cause of an issue. It's like having a seasoned pro whispering in your ear, guiding you through the digital labyrinth. By combining automated workflows with human expertise, these guided investigations can tackle complex issues that might stump a fully automated system.</p>
<p>And the best part? They're right where you need them, integrated with Meta's detection systems. It's like having a tech support genie, ready to pop out whenever an alert goes off. No need to rub a lamp – just click a button!</p>
<h3 id="heading-current-state-of-investigations-at-meta"><strong>Current state of Investigations at Meta:</strong></h3>
<p>So, where does Meta stand now in their AIOps journey? Let's just say they've gone from digital chaos to zen master status.</p>
<p>Today, Meta's foundational systems are more popular than cat videos (well, almost). Hundreds of teams have adopted them, running over 500,000 analyses per week. That's more check-ups than a hypochondriac gets in a lifetime!</p>
<p>The impact? A cool 50% decrease in MTTR for critical alerts across the company. It's like they've upgraded from a horse-drawn buggy to a supersonic jet when it comes to fixing problems. Take the Ads Manager team, for instance. They've gone from spending days investigating issues to resolving them in minutes. It's like they've traded in their magnifying glass for a high-powered microscope with AI-assisted focusing.</p>
<h2 id="heading-implementing-your-own-dr-patternson-using-doctor-droid"><strong>Implementing your Own Dr. Patternson using Doctor Droid:</strong></h2>
<p>If you want to implement a solution like Dr. Patternson within your team without investing the time or cost that Meta did, you might want to explore <a target="_blank" href="http://drdroid.io/"><strong>Doctor Droid</strong></a>.</p>
<p>Doctor Droid is an AI-assisted intelligence platform that helps engineering teams <strong>reduce investigation time of production issues by 10x</strong>. Here's what you can do with Doctor Droid:</p>
<p>(a) Codify your investigation mental models:</p>
<p><a target="_blank" href="https://github.com/DrDroidLab/playbooks"><strong>Doctor Droid PlayBooks</strong></a> is an Open-Source On-call automation platform. With one click, you can run your investigation steps and have all the diagnosis data across all tools, directly fed in response to your alerts.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728588990105/1780d94d-6b11-4a67-bab8-5e8fdef27a1d.png" alt class="image--center mx-auto" /></p>
<p>(b) Leverage your past knowledge to get intelligent suggestions:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728589001242/5d97552a-a82d-45b6-83f0-81875e3aa48d.png" alt class="image--center mx-auto" /></p>
<p>Doctor Droid's <a target="_blank" href="https://docs.drdroid.io/docs/doctor-droid-aiops-platform"><strong>AIOps Platform</strong></a> can provide your on-call engineers with intelligent recommendations by leveraging the knowledge that is already accessible in your systems.</p>
<p>With (a) and (b) combined, you effectively end up with an equivalent of Dr. Patternson.</p>
<p>Try it out today by signing up <a target="_blank" href="http://drdroid.io/"><strong>here</strong></a>!</p>
<h1 id="heading-conclusion">Conclusion:</h1>
<p>Meta's AIOps journey has transformed their incident response from a digital firefight into a well-oiled machine. Let's break down the impressive results:</p>
<ul>
<li><p>50% reduction in Mean Time to Resolution (MTTR) for critical alerts across the company</p>
</li>
<li><p>Over 500,000 automated analyses run per week</p>
</li>
<li><p>80% of uninteresting events filtered out during active investigations</p>
</li>
<li><p>Ads Manager team improved investigation time from days to minutes</p>
</li>
<li><p>Nearly 50% of previously manual investigations now automated</p>
</li>
</ul>
<p>These statistics paint a picture of a dramatically more efficient system, but what does it mean in the real world?</p>
<p>For Meta, it means:</p>
<ul>
<li><p>Fewer service disruptions for millions of users</p>
</li>
<li><p>Faster resolution when issues do occur</p>
</li>
<li><p>Engineers spending less time on repetitive tasks and more on innovation</p>
</li>
<li><p>Improved overall system reliability and user experience</p>
</li>
</ul>
<p>The secret sauce? A combination of:</p>
<ol>
<li><p>Automated Runbooks (Dr. Patternson)</p>
</li>
<li><p>Analysis Algorithms Service</p>
</li>
<li><p>Event Isolation Assistance</p>
</li>
<li><p>Guided Investigations</p>
</li>
</ol>
<p>This powerful quartet has turned Meta's incident response into a symphony of efficiency, conducting a harmonious blend of AI automation and human expertise.</p>
<p>As we look to the future, Meta's AIOps journey serves as a beacon for the tech industry. It shows us that with the right tools and approach, we can tame the chaos of complex systems and create a more reliable digital world. So the next time you scroll through your feed without a hitch, remember - there's a good chance Meta's AIOps team had a hand in making that seamless experience possible.</p>
]]></content:encoded></item><item><title><![CDATA[RCACoPilot: A breakdown of how Microsoft built their Automated RCA Bot]]></title><description><![CDATA[Introduction
Big Tech companies often have scale enough to justify allocating resources to building internal tools. In this blog, we discuss about RCACoPilot -- an automated incident classification and investigation engine built by Microsoft to impro...]]></description><link>https://notes.drdroid.io/rcacopilot-a-breakdown-of-how-microsoft-built-their-automated-rca-bot</link><guid isPermaLink="true">https://notes.drdroid.io/rcacopilot-a-breakdown-of-how-microsoft-built-their-automated-rca-bot</guid><dc:creator><![CDATA[Karan Sirohi]]></dc:creator><pubDate>Mon, 02 Sep 2024 12:33:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1725279726262/86116b34-c4cd-468c-a50d-cd9c8c538a5c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Big Tech companies often have enough scale to justify allocating resources to building internal tools. In this blog, we discuss RCACoPilot -- an automated incident classification and investigation engine built by Microsoft to improve the lives of their on-call engineers.</p>
<p>Less than a year ago, Microsoft published a <a target="_blank" href="https://yinfangchen.github.io/assets/pdf/rcacopilot_paper.pdf">research paper</a> discussing RCACoPilot. It's a longish paper (~16 pages), so I decided to condense it into a shorter blog.</p>
<h2 id="heading-context">Context</h2>
<p><strong>Picture this:</strong> You're an on-call engineer at Microsoft. It's 3 AM, and suddenly, alerts start blaring. Something's wrong with the email service (which delivers over 150 billion messages daily) that millions of people rely on. Your job? Figure out what's causing the issue and fix it ASAP. No pressure, right?</p>
<p>This scenario plays out all too often at Microsoft &amp; at most companies. A company's systems are only getting more complex by the day, and on-call engineers are drowning in a sea of alerts, logs, and metrics. This often leads to escalations and time spent on tickets rather than planned work. They needed a way to streamline the process, quickly make sense of all this information, and zero in on the root cause of problems. That's what RCACoPilot tries to solve for them.</p>
<h2 id="heading-why-should-you-even-read-this-article-the-results">Why should you even read this article? The Results</h2>
<p>Let's cut to the chase: how well does RCACopilot perform, and is its approach even worth reading about? Its performance is pretty impressive, as it turns out.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724853161354/8522c6ad-9d2c-48a4-b7e6-38e079fa0922.png" alt class="image--center mx-auto" /></p>
<p>Tested on 653 real-world incidents from Microsoft's email service (which handles about 150 billion messages daily), RCACopilot achieved:</p>
<ul>
<li><p><strong>76.6% accuracy</strong> in predicting root cause categories</p>
</li>
<li><p>A <strong>Macro-F1 score of 0.533</strong>, showing good performance across various incident types</p>
</li>
<li><p>Significantly reduced MTTR:</p>
<ul>
<li><p>Auto-diagnosis runs took 1-10 minutes on average, depending on the complexity of the incident handlers (more on these below).</p>
</li>
<li><p>An average classification time of just 4.2 seconds per incident.</p>
</li>
</ul>
</li>
</ul>
<p>[Quick note on that Macro-F1 score: It's a measure that gives equal importance to each category, regardless of how often it appears. A score of 0.533 tells us that RCACopilot performs well across various incident types, not just the common ones. This is crucial in a system where rare, critical issues are just as important as frequent, minor ones.]</p>
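<p>In code, macro-F1 is simply the unweighted mean of per-class F1 scores; a quick illustration with scikit-learn (the labels below are made up):</p>
<pre><code class="lang-python"># Macro-F1 averages per-class F1 scores without weighting by class frequency,
# so rare incident categories count as much as common ones.
from sklearn.metrics import f1_score

y_true = ["db", "db", "network", "config", "config", "config"]
y_pred = ["db", "network", "network", "config", "config", "db"]

print(f1_score(y_true, y_pred, average="macro"))  # equal weight per category
</code></pre>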
<p>These numbers outperformed all baseline methods, including traditional machine learning approaches and GPT models without fine-tuning.</p>
<p>But the real proof is in the deployment. Parts of RCACopilot have been in use at Microsoft for over four years, across more than 30 teams. On-call engineers report significant time savings in incident management tasks, from diagnosis to mitigation.</p>
<p>In practice, this means that within a few minutes, most of the likely investigation steps are run and analysed, and the engineers are given a likely root-cause category for the incident.</p>
<h2 id="heading-rcacopilot-the-architecture">RCACoPilot -- The Architecture</h2>
<p>Now, let's peek under the hood of RCACoPilot. Think of it as a super-smart detective for computer problems. Here's a simple breakdown of how it works:</p>
<ol>
<li><p><strong>Diagnosis Identifier:</strong> In the original paper, this is part of what they call the "Diagnostic Information Collection Stage". It is responsible for taking in an incident from their alerting tool, parsing the incident context, and matching it to the existing playbooks (called <strong>Incident Handlers</strong>) based on the mapping logic defined.</p>
</li>
<li><p><strong>Diagnosis Data Fetch &amp; Summarisation:</strong> The playbook identified in the previous step is custom-defined by the on-call engineers (<strong>called OCEs</strong>). The system now executes all the steps pre-configured in the playbook -- from fetching logs &amp; metrics to running diagnostic scripts.</p>
</li>
<li><p><strong>Incident Predictor:</strong> Now on top of the diagnosis data that's fetched, a couple of things are done: converting it into an embedding and comparing the embedding with past embeddings. These embeddings are now leveraged to identify the potential Root Cause for this issue. This corresponds to the "Root Cause Prediction Stage" in the paper.</p>
</li>
</ol>
<p>The actual benefit of RCACoPilot lies in how the entire pipeline works together. When an alert comes in, the Incident Handler kicks into gear, following predefined playbooks to gather relevant data. This could involve querying databases, analyzing log files, or even running diagnostic scripts.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724852702784/1a29f67f-8e50-4f4f-b379-5dafe51d3a9a.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-component-1-diagnosis-identifier">Component 1: Diagnosis Identifier</h3>
<p><strong>Incident Parser:</strong></p>
<p>The Incident Parser is the entry point of RCACoPilot. It's connected to the company's existing alerting systems through triggers or webhooks. It then parses the incident to identify the key entities of interest.</p>
<p><strong>Handler:</strong></p>
<p>A Handler is a set of pre-defined steps to be run for a specific type of investigation. This is effectively a programmatic SOP for a certain type of issue. These are NOT AI GENERATED -- they are what every on-call engineer documents: the investigation strategies for a type of issue within their service.</p>
<p>The Incident Handler uses several types of actions to investigate and respond to incidents:</p>
<ol>
<li><p><strong>Scope Switching Action</strong>: This allows the handler to adjust its focus dynamically. It might start by looking at a single server, then expand to an entire cluster if needed.</p>
</li>
<li><p><strong>Query Action</strong>: Think of this as the handler's way of asking questions. It can pull data from various sources like databases, log files, or even run scripts to gather system information. The results come back as key-value pairs, giving the handler structured data to work with.</p>
</li>
<li><p><strong>Mitigation Action</strong>: Sometimes, the handler can take steps to address the problem directly. This could involve restarting a service, clearing disk space, or even calling in specialized teams for complex issues.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724852887556/d5821a2b-be9e-4f40-ad60-213e1b1062d7.png" alt class="image--center mx-auto" /></p>
<p>This is what an incident handler looks like for a "too many messages stuck in the delivery queue" alert.</p>
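<p>The paper doesn't publish the handler schema, so here is an illustrative rendering of that same handler as plain data; every field name below is invented for the sketch, not Microsoft's actual format:</p>
<pre><code class="lang-python"># Hypothetical encoding of the "stuck delivery queue" handler, combining the
# three action types described above; field names are illustrative.
stuck_queue_handler = {
    "trigger": "DeliveryQueueBacklogAlert",
    "actions": [
        {"kind": "query", "run": "get_queue_depth", "scope": "server"},
        {"kind": "scope_switch", "to": "cluster",
         "when": "queue_depth above threshold on several servers"},
        {"kind": "query", "run": "fetch_transport_error_logs", "scope": "cluster"},
        {"kind": "mitigation", "run": "restart_transport_service",
         "when": "logs show a stalled transport process"},
    ],
}

for action in stuck_queue_handler["actions"]:
    print(action["kind"], "-", action.get("run", action.get("to")))
</code></pre>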
<h3 id="heading-component-2-diagnosis-data-fetch-amp-summarisation">Component 2: <strong>Diagnosis Data Fetch &amp; Summarisation</strong></h3>
<p>In this part, the system executes a series of commands as per the incident handler definition.</p>
<ul>
<li><p>It has access to different internal / external tools with the relevant information.</p>
</li>
<li><p>Each step in the handler is structured and mapped to a technical task (be it an API call, a log fetch, a metric fetch, or any other task).</p>
</li>
<li><p>The system automatically interacts with each of these data tools and fetches the relevant data.</p>
</li>
</ul>
<p>These are some of the data points that the system can fetch from the handlers (playbooks):</p>
<ul>
<li><p>Logs: Application logs, system logs, security logs.</p>
</li>
<li><p>Metrics: Performance metrics and resource utilization stats.</p>
</li>
<li><p>Traces: Detailed records of how requests flow through the system.</p>
</li>
<li><p>Configuration data: Current system settings that might be relevant.</p>
</li>
</ul>
<h3 id="heading-component-3-incident-predictor">Component 3: Incident Predictor</h3>
<p>The ML part of RCACopilot happens in what the paper calls the "Incident Predictor". This is where it uses Large Language Models (LLMs) to make sense of all the data collected by the Incident Handler.</p>
<p>Here's how it uses LLMs to analyze incidents:</p>
<ol>
<li><strong>Summarization</strong>: First, the LLM takes all the diagnostic information collected and creates a concise summary. This step is crucial because it condenses vast amounts of data into something manageable for both the AI and human engineers.</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724853041221/f9226a64-b749-4f55-ae28-8fd55029a319.png" alt class="image--center mx-auto" /></p>
<ol start="2">
<li><p><strong>Similarity Matching</strong>: Next, RCACopilot creates an embedding to represent incidents as points in a high-dimensional space and then uses a nearest-neighbour search algorithm. ELI5: similar incidents will be close to each other in this space. (A minimal sketch of this idea follows this list.)</p>
<p> Using this technique, RCACopilot finds past incidents that are most similar to the current one. This is important because similar past incidents can provide valuable clues about the current problem.</p>
</li>
<li><p><strong>Chain-of-Thought Prompting</strong>: This is where things get really interesting. RCACopilot uses a technique called "Chain-of-thought" prompting. Instead of just asking the LLM "What's the root cause?", it prompts the model to think through the problem step-by-step, much like a human engineer would.</p>
<p> It does this by showing the LLM examples of how similar past incidents were solved. This is akin to training a junior engineer by walking them through past case studies before asking them to solve a new problem.</p>
</li>
<li><p><strong>Root Cause Prediction</strong>: Based on this careful analysis, the LLM then predicts the most likely root cause of the incident. But it doesn't stop there.</p>
</li>
<li><p><strong>Explanation Generation</strong>: Crucially, the LLM also generates an explanation for its prediction. This isn't just a black box spitting out an answer - it's more like a colleague explaining their reasoning. This explanation helps human engineers understand and verify the AI's conclusion.</p>
</li>
</ol>
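<p>To make the similarity-matching step concrete, here is a minimal sketch using a toy TF-IDF embedding and brute-force nearest-neighbour search; a production system would use an LLM embedding model and a vector index instead, and the incident texts below are invented:</p>
<pre><code class="lang-python"># Toy version of "embed the summary, retrieve the most similar past incident".
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

past_incidents = {
    "queue backlog after transport config push": "root cause: config change",
    "smtp auth failures after certificate rotation": "root cause: certificate",
    "delivery latency spike from storage throttling": "root cause: capacity",
}

vectorizer = TfidfVectorizer()
past_matrix = vectorizer.fit_transform(list(past_incidents))

current = "messages stuck in queue after new transport config"
sims = cosine_similarity(vectorizer.transform([current]), past_matrix)[0]

best = int(np.argmax(sims))
print(list(past_incidents.values())[best], f"(similarity={sims[best]:.2f})")
</code></pre>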
<p><strong>Feedback loop:</strong> While it doesn't learn in real-time, feedback from engineers can be used to periodically retrain and improve the model. This means RCACopilot can get better over time, learning from each incident it analyzes.</p>
<p>By leveraging the power of LLMs in this way, RCACoPilot can quickly analyze complex incidents, drawing insights from vast amounts of data and past experiences. It's like having an AI assistant that has seen every incident your organization has ever faced, can think through problems step-by-step, and can clearly explain its reasoning. This not only speeds up incident resolution but also helps engineers learn and improve their own diagnostic skills.</p>
<h2 id="heading-implementing-your-own-rcacopilot-using-doctor-droid">Implementing your Own RCACoPilot using Doctor Droid:</h2>
<p>If you want to implement a solution like RCACoPilot within your team without investing the time or cost like Microsoft, you might want to explore <a target="_blank" href="http://drdroid.io/">Doctor Droid</a>.</p>
<p>Doctor Droid is an AI-assisted intelligence platform that helps engineering teams <strong>reduce investigation time of production issues by 10x</strong>. Here's what you can do with Doctor Droid:</p>
<p>(a) Codify your investigation mental models:</p>
<p><a target="_blank" href="https://github.com/DrDroidLab/playbooks">Doctor Droid PlayBooks</a> is an Open-Source On-call automation platform. With one click, you can run your investigation steps and have all the diagnosis data across all tools, directly fed in response to your alerts.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725279397010/1ff2bb93-5216-42ab-bad1-80a325d88812.png" alt class="image--center mx-auto" /></p>
<p>(b) Leverage your past knowledge to get intelligent suggestions:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725279522497/e5ab412e-e509-4793-9056-ad71e014ec4f.png" alt class="image--center mx-auto" /></p>
<p>Doctor Droid's <a target="_blank" href="https://docs.drdroid.io/docs/doctor-droid-aiops-platform">AIOps Platform</a> can provide your on-call engineers with intelligent recommendations by leveraging the knowledge that is already accessible in your systems.</p>
<p>With (a) and (b) combined, you effectively end up with an equivalent of RCACoPilot.</p>
<p>Try it out today by signing up <a target="_blank" href="http://drdroid.io/">here</a>!</p>
]]></content:encoded></item><item><title><![CDATA[How to set up your dev environment for editing & contributing code?]]></title><description><![CDATA[Playbooks is a web server application that interacts with a Django API server via Nginx. It also includes Celery workers for scheduling asynchronous tasks and a persistence layer consisting of Postgres and Redis cache.
https://www.youtube.com/watch?v...]]></description><link>https://notes.drdroid.io/how-to-setup-your-dev-environment-for-editing-contributing-code</link><guid isPermaLink="true">https://notes.drdroid.io/how-to-setup-your-dev-environment-for-editing-contributing-code</guid><category><![CDATA[doctor-droid]]></category><category><![CDATA[automation]]></category><category><![CDATA[development]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[Devops]]></category><category><![CDATA[observability]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[runbooks]]></category><dc:creator><![CDATA[Mohit Goyal]]></dc:creator><pubDate>Wed, 28 Aug 2024 13:08:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1724780864472/e9f073d5-12a0-4f86-bc62-224aa06b24bd.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Playbooks is a web server application that interacts with a Django API server via Nginx. It also includes Celery workers for scheduling asynchronous tasks and a persistence layer consisting of Postgres and a Redis cache.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=OiwgXSncNuo">https://www.youtube.com/watch?v=OiwgXSncNuo</a></div>
<p> </p>
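<p>As an aside, here's a minimal illustration of how a Celery worker hooks into a Redis-backed stack like the one described above; the app name, broker URL, and task are placeholders, since the actual project defines its own:</p>
<pre><code class="lang-python"># Minimal Celery wiring (illustrative): Redis as the broker, one async task.
from celery import Celery

app = Celery("playbooks", broker="redis://localhost:6379/0")

@app.task
def run_playbook_step(step_id: int) -&gt; str:
    # Executed asynchronously by a worker process rather than the web server.
    return f"executed step {step_id}"
</code></pre>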
<h3 id="heading-setup-phase">Setup Phase</h3>
<p><strong>Step 1:</strong> Ensure a running instance of Postgres and Redis on the local machine. Installation guides are available for different machine types to set up Postgres and Redis. Alternatively, Docker can be used for the same purpose.</p>
<p><strong>Step 2:</strong> Use the DB Docker Compose file located in the Docker folder to set up Postgres and Redis.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724838550495/7bc5e607-7f5a-4534-a0e4-8bb5faf1550d.png" alt class="image--center mx-auto" /></p>
<p><strong>Step 3:</strong> Verify the setup by checking Docker Desktop to see if both instances are running.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724838580729/21604d0a-6996-4535-9ab9-2192b003d0f3.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-build-phase">Build Phase</h3>
<p><strong>Step 4:</strong> Create all the tables and relations that the API server will use by running the following command:</p>
<pre><code class="lang-yaml"><span class="hljs-string">python</span> <span class="hljs-string">manage.py</span> <span class="hljs-string">migrate</span>
</code></pre>
<p><strong>Step 5:</strong> Confirm the creation of all tables and relations in the Postgres DB using a tool like Postico.</p>
<h3 id="heading-run-phase">Run Phase</h3>
<p><strong>Step 6:</strong> Start the API server on the local machine using the command below:</p>
<pre><code class="lang-yaml"><span class="hljs-string">python</span> <span class="hljs-string">manage.py</span> <span class="hljs-string">runserver</span>
</code></pre>
<p>The server will run on port 8000 by default; you can pass a different port if you prefer (e.g., <code>python manage.py runserver 8080</code>).</p>
<p><strong>Step 7:</strong> Go to the web folder and use the following command to bring up the React application</p>
<pre><code class="lang-yaml"><span class="hljs-string">npm</span> <span class="hljs-string">start</span>
</code></pre>
<p><strong>Step 8:</strong> Inside the web folder, use the following command to start the Nginx server. This ensures the React application can connect with the Django application.</p>
<pre><code class="lang-yaml"><span class="hljs-string">nginx</span> <span class="hljs-string">-c</span> <span class="hljs-string">$PWD/nginx.local.conf</span> <span class="hljs-string">-g</span> <span class="hljs-string">"daemon off;"</span>
</code></pre>
<p><strong>Step 9:</strong> Once both servers are running, open a browser and go to localhost to see the running application.</p>
<h3 id="heading-testing-changes">Testing Changes</h3>
<p><strong>Step 10:</strong> Test changes by creating a playbook and an HTTP task. For example, remove the method field in the API task and save the changes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724839051543/0c896b55-6811-456f-9493-9f25f53eafb4.png" alt class="image--center mx-auto" /></p>
<p><strong>Step 11:</strong> Django will detect the changes and reload the server automatically. Refresh the application platform and try to create an API task. The method field should be missing.</p>
<p><strong>Step 12:</strong> To restore the method field, undo the changes, save, and refresh the page. Try to create the task again.</p>
<p>For further assistance with local setup, refer to the details in <a target="_blank" href="https://github.com/DrDroidLab/playbooks">the GitHub project</a>. The contribution page includes a link to the playbook architecture and local setup guide.</p>
]]></content:encoded></item><item><title><![CDATA[What is a PlayBook and what are the core components of a playbook?]]></title><description><![CDATA[A playbook is a set of instructions that a Doctor Droid bot or an on-call engineer follows during a production incident.
https://www.youtube.com/watch?v=T9KfunP9juA
 
A playbook consists of tasks. A task is an instruction that's executed through the ...]]></description><link>https://notes.drdroid.io/what-is-a-playbook-and-what-is-core-components-of-a-playbook</link><guid isPermaLink="true">https://notes.drdroid.io/what-is-a-playbook-and-what-is-core-components-of-a-playbook</guid><category><![CDATA[doctor-droid]]></category><category><![CDATA[Playbooks]]></category><category><![CDATA[automation]]></category><category><![CDATA[runbooks]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[observability]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Karan Sirohi]]></dc:creator><pubDate>Wed, 28 Aug 2024 10:29:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1724829837491/07645849-7660-4582-93c8-20093423561f.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A playbook is a set of instructions that a Doctor Droid bot or an on-call engineer follows during a production incident.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=T9KfunP9juA">https://www.youtube.com/watch?v=T9KfunP9juA</a></div>
<p> </p>
<p>A playbook consists of tasks. A task is an instruction that's executed through the portal. Let's create a task.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840709790/2960d410-97b9-42fb-abf2-00ea44a219ff.png" alt class="image--center mx-auto" /></p>
<p>A task could involve fetching a metric from CloudWatch or running a kubectl command on a server. Let's create a new task to fetch logs from CloudWatch.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840729590/cc3f65df-7422-4fc9-9e85-59d9c8a93bf7.png" alt class="image--center mx-auto" /></p>
<p>Add notes to the playbook. A note is a custom guideline for the user, related to the playbook or a specific step. Add a note indicating that this task fetches logs.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840747525/cbc11b6f-1457-4ebd-94cc-a1866bffea67.png" alt class="image--center mx-auto" /></p>
<p>Hover over a step to view the note. It's also possible to add multiple tasks in a step. For instance, fetch a Datadog metric. After adding the task, check the metric.</p>
<p>Add variables to the playbook. For example, add a service as a variable. To use the variable, enter a dollar sign followed by the variable name.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840778752/000962cf-194b-4657-ae61-25500891e62e.png" alt class="image--center mx-auto" /></p>
<p>Add a step with conditions. A condition is a rule that determines whether a certain action should be taken. For example, fetch logs from CloudWatch and check whether the row count equals six. Then add a task, add a metric, and run the task.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840806580/e5084c2e-9815-441d-b211-79df2c44e023.png" alt class="image--center mx-auto" /></p>
<p>Save the playbook. Execute the playbook and observe the results with the condition. If the condition isn't met, the step isn't recommended and the next step is executed.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840835236/8cffe674-b6f0-47f2-9f0f-b293faaf63b8.png" alt class="image--center mx-auto" /></p>
<p>These are all the core components of a playbook.</p>
]]></content:encoded></item><item><title><![CDATA[How to do post-deployment monitoring with Doctor Droid?]]></title><description><![CDATA[Before starting, ensure that you have set up Doctor Droid Playbooks with at least one playbook and a Slack or MS Teams integration. Check out these tutorials on how to get this done.
https://www.youtube.com/watch?v=T9KfunP9juA
 
Consider a scenario where a...]]></description><link>https://notes.drdroid.io/how-to-do-post-deployment-monitoring-with-doctor-droid</link><guid isPermaLink="true">https://notes.drdroid.io/how-to-do-post-deployment-monitoring-with-doctor-droid</guid><category><![CDATA[doctor-droid]]></category><category><![CDATA[deployment]]></category><category><![CDATA[automation]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[deployment automation]]></category><category><![CDATA[Canary deployment]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Karan Sirohi]]></dc:creator><pubDate>Wed, 28 Aug 2024 10:21:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1724830380080/aadf2755-8866-40b9-bc86-bc5aaf61e00e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Before starting, ensure that you have set up Doctor Droid Playbooks with at least one playbook and a Slack or MS Teams integration. <a target="_blank" href="https://www.youtube.com/watch?v=zURhxSbUGlQ&amp;list=PL-09IrZSH_gJXP4XfosRsMBIcj6iFu4ts&amp;index=1">Check out these tutorials</a> on how to get this done.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=T9KfunP9juA">https://www.youtube.com/watch?v=T9KfunP9juA</a></div>
<p> </p>
<p>Consider a scenario where a critical deployment is planned. As a developer, you would want to track a set of logs and metrics continuously after the deployment.</p>
<p>In large-scale deployments, monitoring is crucial. Tight thresholds and noisy services often lead to frequent alerts. If something breaks, the process starts over, and everything is monitored again with tight thresholds. This can be a hassle. To simplify it, set up a post-deployment monitoring workflow.</p>
<h3 id="heading-setup-a-workflow">Setup a workflow</h3>
<p>Step 1: Navigate to the Doctor Droid Playbooks platform and create a new workflow for this purpose.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840108921/7043695c-78f9-4f97-9a12-d2479c3f368a.png" alt class="image--center mx-auto" /></p>
<p>Step 2: Select the API trigger and choose a playbook that includes tasks for querying the logs &amp; metrics you want to track after the deployment.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840116626/6e6656f3-a7c1-4cbb-9ab5-b3d9eff8cdb1.png" alt class="image--center mx-auto" /></p>
<p>Step 3: Configure the workflow to execute this playbook and publish its summary in Slack.</p>
<p>Step 4: Select the appropriate channel and set up a cron schedule. For instance, set it to run every minute (<code>* * * * *</code>).</p>
<p>Step 5: Save the cron schedule and then save the workflow.</p>
<h3 id="heading-test-the-workflow">Test the workflow</h3>
<p>Step 1: Copy the API trigger code and run it from your terminal. You'll receive a workflow execution ID.</p>
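<p>The copied trigger code is typically a curl command; an equivalent sketch in Python is below, where the endpoint, payload, and token are placeholders for whatever the platform actually gives you:</p>
<pre><code class="lang-python"># Hypothetical API trigger call; copy the real URL, token, and payload from
# the workflow's trigger configuration.
import requests

resp = requests.post(
    "https://&lt;your-drdroid-host&gt;/api/workflows/trigger",  # placeholder URL
    headers={"Authorization": "Bearer &lt;YOUR_API_TOKEN&gt;"},  # placeholder token
    json={"workflow_name": "post-deployment-monitoring"},  # placeholder payload
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # expect a workflow execution ID in the response
</code></pre>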
<p>Step 2: Observe the successful workflow execution with the provided ID.</p>
<p>Step 3: Once the workflow executes, the output of the metrics and logs from the playbook configured in it is published to the selected Slack channel.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840122675/04ac6c49-6960-4428-b54c-862c1344e1ce.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item></channel></rss>