
DrDroid: How AI SRE Helps Engineers Who Are On-Call for Production Monitoring

When engineers first hear about DrDroid, the most common question is: "What will my team actually USE this for?" If you're on-call for production, here's exactly how DrDroid helps: from firefighting incidents at 2 AM to automating your most repetitive runbooks.


Day 0 — Incident Response & Firefighting

Something is broken right now. You need answers fast.

| # | Your Task | How Doctor Droid Helps You |
|---|-----------|----------------------------|
| 1 | "Our API latency spiked, what changed?" | Ask the agent — it pulls recent deployments from ArgoCD/GitHub Actions, checks Grafana/Datadog metrics, and shows you what changed around the time latency spiked. No need to open 5 tabs. |
| 2 | "Which pods are crashing in production?" | Ask the agent — it lists failing pods, pulls their logs, shows recent K8s events, and surfaces restart counts. You get a full picture in one response. |
| 3 | "Is the database the bottleneck?" | Ask the agent — it runs slow query analysis on your Postgres/MySQL, checks connection pool usage, and correlates with application error rates from Datadog or New Relic. |
| 4 | "I got paged, what's actually going on?" | The agent auto-investigates when an alert fires. By the time you open it, there's already a summary with metrics, logs, and likely root cause pulled from your connected sources. |
| 5 | "Are other services affected too?" | Ask the agent — it checks health across your connected sources (Grafana, CloudWatch, K8s, Datadog) and tells you which services are degraded vs healthy. |
| 6 | "I need to check CloudWatch logs for this error" | Tell the agent the error pattern and time range — it queries CloudWatch Logs, Loki, or Elasticsearch directly and returns matching entries. No console login needed. |
| 7 | "Run this PromQL/NRQL query for me" | Give the agent your query — it executes against Prometheus, Grafana, New Relic, or Datadog and returns results inline. Great for quick checks during incidents. |
| 8 | "SSH into the box and check disk space" | Tell the agent the command — it runs Bash commands on remote hosts via SSH and returns output. No need to find the SSH key or remember the hostname. Works even when you don't have laptop access. |
| 9 | "Notify the team on Slack about this outage" | Tell the agent what to post and where — it sends a formatted message to your Slack channel with the context you provide. |
| 10 | "Escalate this to PagerDuty" | The agent creates or escalates a PagerDuty/OpsGenie incident based on the investigation findings. You don't need to context-switch to the PagerDuty UI. |
| 11 | "Create a JIRA ticket for the post-mortem" | Tell the agent the summary — it creates a JIRA ticket with the incident details, investigation findings, and relevant links. |
| 12 | "Did the last deploy cause this?" | Ask the agent — it checks the latest GitHub PR merges, Jenkins builds, ArgoCD sync status, and correlates timestamps with when the issue started. |
| 13 | "Roll back the deployment" | Tell the agent to trigger a rollback pipeline — it kicks off the Jenkins build or GitHub Actions workflow you specify. |
| 14 | "Check if this API endpoint is responding" | Give the agent the URL — it makes an HTTP call and tells you the status code, response time, and body. Works for any internal API. |
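The "did the last deploy cause this?" check above is, at its core, a timestamp correlation. As a minimal sketch (the data shapes here are hypothetical, not DrDroid's actual API), this is the kind of matching the agent performs between recent deploys and an incident start time:

```python
from datetime import datetime, timedelta

def deploys_near_spike(deploys, spike_time, window_minutes=30):
    """Return deploys that landed within `window_minutes` before the spike.

    `deploys` is a list of (service, ISO-8601 timestamp) pairs — a
    simplified stand-in for what the agent pulls from ArgoCD or
    GitHub Actions.
    """
    spike = datetime.fromisoformat(spike_time)
    window = timedelta(minutes=window_minutes)
    return [
        (service, ts)
        for service, ts in deploys
        if timedelta(0) <= spike - datetime.fromisoformat(ts) <= window
    ]

deploys = [
    ("checkout", "2024-05-01T13:40:00"),  # 20 minutes before the spike
    ("search",   "2024-05-01T09:05:00"),  # hours earlier, not a suspect
]
suspects = deploys_near_spike(deploys, "2024-05-01T14:00:00")
# `suspects` contains only the checkout deploy
```

The agent does the same kind of narrowing, but across every connected CI/CD source at once, so you never paste timestamps between tabs.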

Day 1 — Operational Tasks & Maintenance

No fire, but you need to keep things running smoothly.

| # | Your Task | How Doctor Droid Helps You |
|---|-----------|----------------------------|
| 1 | "What's the current state of our K8s cluster?" | Ask the agent — it shows pod status across namespaces, node resource usage, recent events, and any pods in CrashLoopBackOff or Pending state. |
| 2 | "Are all our data sources healthy?" | The agent tests connectivity to every configured connector every 10 seconds. Ask it for the current status, or set up Slack alerts for failures. |
| 3 | "Show me all Grafana dashboards we have" | Ask the agent — it returns the full inventory of dashboards, datasources, and folders it auto-discovered. Same for Datadog monitors, K8s resources, DB schemas, etc. |
| 4 | "How big are our database tables getting?" | Ask the agent to run a table size query on your Postgres/MySQL/ClickHouse — it returns sizes, row counts, and index usage. |
| 5 | "Any slow queries running right now?" | Ask the agent — it checks pg_stat_activity or equivalent on your database and shows long-running queries with their duration and state. |
| 6 | "Check if our Jenkins pipelines are green" | Ask the agent — it pulls recent build status from Jenkins and tells you which jobs passed, failed, or are stuck. |
| 7 | "Is our ArgoCD app in sync?" | Ask the agent — it checks sync status and health for your ArgoCD applications. Flags any that are out-of-sync or degraded. |
| 8 | "Pull logs from the payment service for the last hour" | Tell the agent the service and time range — it queries Loki, CloudWatch, or Elasticsearch and returns the logs. No need to remember log group names. |
| 9 | "What GitHub PRs were merged today?" | Ask the agent — it queries your GitHub repos and lists merged PRs with authors, titles, and timestamps. |
| 10 | "Trigger a build for the staging environment" | Tell the agent which Jenkins job or GitHub Actions workflow to run — it triggers it and reports back the status. |
| 11 | "Send a daily cluster health report to Slack" | Set up a scheduled playbook — the agent runs K8s health checks daily and posts a summary to your Slack channel automatically. |
| 12 | "Check MongoDB replica set health" | Ask the agent — it runs the appropriate commands against your MongoDB instance and returns replica set status and lag. |
| 13 | "List all CloudWatch alarms in ALARM state" | Ask the agent — it queries CloudWatch and returns currently firing alarms with their metric details and thresholds. |
| 14 | "What Datadog monitors are in alert?" | Ask the agent — it checks your Datadog monitors and lists any that are currently alerting or warning. |
| 15 | "Run this custom SQL query on production" | Give the agent the query — it executes against your connected Postgres, MySQL, ClickHouse, or BigQuery and returns results as a table. |
| 16 | "Check disk and memory on our VMs" | Tell the agent to run df -h and free -m via Bash — it SSHs in and returns the output. |
| 17 | "Update this JIRA ticket with today's progress" | Tell the agent the ticket ID and comment — it adds the update to JIRA without you opening the browser. |
| 18 | "What new resources were created in our K8s cluster this week?" | Ask the agent — it compares the current asset inventory with the previous discovery and highlights new pods, services, or deployments. |
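The last row above is an inventory diff. As a rough sketch (the tuple shape is an assumption for illustration, not DrDroid's internal asset model), the comparison between two discovery snapshots looks like this:

```python
def new_resources(previous, current):
    """Return resources present in `current` but absent from `previous`.

    Each inventory is modeled as a set of (kind, namespace, name)
    tuples — a simplified stand-in for the agent's discovered
    K8s assets.
    """
    return sorted(set(current) - set(previous))

# Last week's snapshot vs today's
prev = {("Deployment", "prod", "api"), ("Service", "prod", "api")}
curr = prev | {("Deployment", "prod", "payments")}
print(new_resources(prev, curr))
# [('Deployment', 'prod', 'payments')]
```

The agent maintains these snapshots for you on every discovery pass, so "what changed this week?" is a question you can ask rather than a script you have to maintain.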

Day 2 — Automation, Optimization & Reliability

You want to stop doing repetitive things and build resilience.

| # | Your Task | How Doctor Droid Helps You |
|---|-----------|----------------------------|
| 1 | "Automate our incident response runbook" | Build a runbook matching your internal process: when alert fires → pull metrics from Grafana → check K8s pods → query DB → post findings to Slack. Runs automatically on every alert. |
| 2 | "Auto-restart pods when OOM detected" | Set up a workflow: if K8s event shows OOMKilled → restart the pod → notify on Slack → log to JIRA. No human in the loop. |
| 3 | "Scale up when CPU crosses 80%" | Create a conditional workflow: check CPU metrics from Prometheus/CloudWatch → if above threshold → trigger scaling via K8s or API call → confirm on Slack. |
| 4 | "Track our SLOs across services" | Set up scheduled playbooks that query Prometheus or Datadog for error rate and latency → calculate against SLO targets → post weekly reports to Slack. |
| 5 | "Alert me before we hit our error budget" | Build a workflow that checks error budget consumption daily → if > 80% consumed → post warning to Slack and create a JIRA ticket. |
| 6 | "Correlate deploys with performance changes" | Set up a playbook that triggers after every deploy → compares pre/post metrics from Grafana/Datadog → flags regressions automatically. |
| 7 | "Stop SSHing into boxes for the same checks" | Convert your common SSH commands into playbooks — disk space, process checks, log tailing. Run them from Doctor Droid with one click or on schedule. |
| 8 | "I keep opening 5 dashboards for the same investigation" | Build a single playbook that queries all 5 sources and gives you a combined view. Next time, ask the agent instead of opening dashboards. |
| 9 | "Validate infrastructure after every Terraform apply" | Set up a post-deploy playbook: check K8s resources → verify CloudWatch alarms exist → test endpoints → report pass/fail. |
| 10 | "Audit our K8s RBAC and network policies weekly" | Schedule a playbook that lists RBAC bindings and network policies → compares against expected state → flags drift on Slack. |
| 11 | "Auto-create a JIRA ticket when a deploy fails" | Build a workflow: monitor Jenkins/GitHub Actions → if build fails → create JIRA ticket with build logs and assign to the team. |
| 12 | "Map out which services talk to which" | Enable Network Mapper — it discovers service-to-service communication in K8s and shows you the dependency graph. |
| 13 | "Check if our Datadog monitors match our runbook" | Ask the agent to list all Datadog monitors — compare with your documented expectations. Set this up as a weekly drift check. |
| 14 | "Reduce alert fatigue for the team" | Use alert grouping and conditional workflows — similar alerts get grouped, only actionable ones reach Slack/PagerDuty. |
| 15 | "Automate capacity reporting for leadership" | Schedule a monthly playbook: query CloudWatch/K8s for resource trends → query BigQuery for cost data → format and send via email or Slack. |
| 16 | "Clear application cache when memory exceeds threshold" | Build a workflow: check memory metrics → if above limit → make API call to flush cache → verify memory dropped → notify on Slack. |
| 17 | "Run synthetic health checks every 5 minutes" | Schedule a playbook that hits your critical endpoints via HTTP, checks response codes and latency, and alerts on Slack if anything degrades. |
| 18 | "Onboard new services faster" | When a new service deploys, the agent auto-discovers it in K8s, finds its Grafana dashboards and Datadog monitors, and catalogs everything. You see it in your inventory immediately. |
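The error-budget check in row 5 is worth making concrete. Under a standard availability SLO (for example 99.9%), the budget is the small fraction of requests you are allowed to fail; the workflow's "> 80% consumed" condition compares failures against that allowance. A minimal sketch of that arithmetic, with made-up request counts:

```python
def error_budget_consumed(total_requests, failed_requests, slo_target=0.999):
    """Fraction of the error budget consumed in the current window.

    With a 99.9% availability SLO, the budget is 0.1% of all requests.
    The return value is failures divided by that allowance, so it can
    exceed 1.0 once the budget is blown.
    """
    budget = total_requests * (1 - slo_target)
    if budget == 0:
        return 0.0
    return failed_requests / budget

# 1M requests this month, 850 failures → 85% of the budget is gone
consumed = error_budget_consumed(1_000_000, 850)
if consumed > 0.80:
    print(f"warning: {consumed:.0%} of error budget consumed")
```

In the playbook version, the request and failure counts would come from a Prometheus or Datadog query rather than literals, and the warning branch would post to Slack and open the JIRA ticket.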

What's Next?

DrDroid helps your team move from reactive firefighting to proactive operations. Whether you're debugging an incident at 3 AM or building automation to prevent the next one, DrDroid becomes your team's operational co-pilot.

Ready to see how DrDroid works with your stack?

Want to try it out? Setup takes 1-2 hours, and you'll see value from your first investigation. Get started here.