How DrDroid AI SRE Agent is specialised for Production Incidents & On-call Investigations

Here's how DrDroid's Investigation Agent is specifically engineered for incident response, alert investigation, and infrastructure troubleshooting

Published
11 min read
We iterated on DrDroid by working with hundreds of engineers on their debugging problems. The investigation agent assists engineers with complex analyses that are critical, time-sensitive, and leave little room for error. Here are some of the things that help the investigation agent perform well:

1. Specialized Debugging Tools & Skills

Production incidents often require analyzing large volumes of logs, traces, and metrics. This can be token-intensive and time-consuming. Most LLMs and agentic frameworks hit context window limits quickly, can't process production-scale data, and lose quality when analyzing large datasets.

How DrDroid Solves This:

Pre-Built Aggregate Analysis Tools

Instead of feeding raw logs into an LLM, DrDroid has tools designed specifically for handling large-volume logs with:

  • Built-in log aggregation and pattern detection

  • Complex trace analysis across distributed systems

  • Large-volume metrics analysis with outlier detection and ML techniques

Specialized Investigation Skills

Our agent has domain-specific skills built from working with hundreds of engineers on real debugging problems:

  • How to query and analyze traces in Signoz or Datadog

  • How to navigate APM data efficiently

  • How to correlate metrics across multiple monitoring tools

Real Impact: The agent can process 100,000+ log lines in seconds and surface the 5 relevant errors—something that would exhaust a generic LLM's context window.
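
To illustrate the idea of aggregating logs before they reach the model, here is a minimal sketch. It is not DrDroid's implementation: the masking rules and the names `to_template` and `top_patterns` are assumptions made for this example. Variable parts of each line (IDs, numbers) are masked so that similar lines collapse into one pattern, and only the most frequent patterns are returned.

```python
import re
from collections import Counter

# Hypothetical masking rules (assumed for illustration): long hex-like IDs
# first, then any remaining digit runs.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8,}\b"), "<HEX>"),
    (re.compile(r"\d+"), "<NUM>"),
]


def to_template(line: str) -> str:
    """Replace variable parts of a log line so similar lines share a template."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line.strip()


def top_patterns(lines, level="ERROR", k=5):
    """Return the k most frequent templates among lines containing `level`."""
    counts = Counter(to_template(line) for line in lines if level in line)
    return counts.most_common(k)
```

With this approach, the model sees a handful of `(template, count)` pairs instead of the raw lines, which is what keeps token usage bounded regardless of log volume.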

2. Code & Application Awareness

Most LLMs and agents start every investigation from zero, with only the context of the prompt and sometimes markdown files.

How DrDroid Solves This:

Automatic Code Context Generation

Even before your first chat with the agent, DrDroid builds knowledge of:

  • What each repository does

  • What capabilities, APIs, features, and workflows each repo covers

  • Programming languages, frameworks, and file structures

  • Connections between multiple repositories (discovered via traces and logs)

Business Workflow Understanding

You can ask DrDroid to build context around critical business and product workflows. The agent understands:

  • "The checkout flow involves payment-service, inventory-service, and notification-service"

  • "When users report 'payment stuck,' check these three services in this order"

Real Impact: When an alert fires on "payment-service," DrDroid already knows what that service does, which other services depend on it, and where to look for root causes.

3. Infrastructure & Resource Awareness

An LLM with MCP connections doesn't know which apps run in which Kubernetes clusters, which databases are in which cloud providers, or how your infrastructure is organised. It has to query multiple tools and explore your environment, costing time and tokens, before it can answer such questions.

How DrDroid Solves This:

Auto-Discovery of Infrastructure

DrDroid continuously maps:

  • Apps hosted in different Kubernetes clusters

  • Databases and their cloud providers

  • Service dependencies and communication patterns

  • Network topology and resource relationships

Service Map & Dependency Graph

The agent can answer questions like:

  • "Which services depend on the payments database?"

  • "If eu-west-1 goes down, what's affected?"

  • "Show me all services running in the production cluster"

Real Impact: During an incident, the agent instantly knows the blast radius and which downstream services might be affected—without you having to explain your architecture.
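
A blast-radius question like the ones above can be answered with a graph traversal once a dependency map exists. The sketch below is only an illustration under assumptions: the `depends_on` mapping, the service names, and the function `blast_radius` are hypothetical, not DrDroid's data model.

```python
from collections import deque


def blast_radius(depends_on, failed):
    """Return every service that directly or transitively depends on `failed`.

    `depends_on` maps a service name to the list of things it depends on.
    """
    # Invert the edges: for each dependency, record who depends on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, ()):
            if parent not in affected and parent != failed:
                affected.add(parent)
                queue.append(parent)
    return affected


# Hypothetical service map used only for this example.
graph = {
    "checkout": ["payment-service"],
    "payment-service": ["payments-db"],
    "notification-service": ["checkout"],
    "search": ["search-db"],
}
affected = blast_radius(graph, "payments-db")
```

Here `affected` contains `payment-service`, `checkout`, and `notification-service`, while `search` is untouched because it has no path to the failed database.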

4. Past Alert & Incident Pattern Recognition

With generic agents, every investigation is independent: the agent has no memory of past incidents or patterns.

How DrDroid Solves This: Searchable Alert History

The agent has access to:

  • All alerts since platform enablement

  • Past incidents and their resolutions

  • RCAs and postmortems (from Confluence, docs, or previous investigations)

  • Recurring patterns across alerts

When similar issues occur, the agent can say:

  • "This looks similar to the incident from Jan 15th where the Redis cache was full"

  • "Last time this alert fired, the root cause was a config change in service X"

Real Impact: Repeat incidents get resolved faster because the agent learns from past investigations.

5. Continually Learning System

DrDroid improves with every investigation.

Active Learning from Your Environment

The agent continuously creates notes and memory from:

  • Recent commits and merges in your applications

  • Investigations and conversations with the agent

  • Human conversations in Slack channels (optional)

Contextual Memory Storage

Everything is stored with metadata:

  • Timestamp

  • Related entities (services, databases, clusters)

  • Related team and people

  • Relevant tags and categories

Real Impact: The agent gets smarter every week. After a month, it knows your environment better than most new engineers.
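
As a rough sketch of what a memory entry with this metadata could look like, consider the following. The field names, the `Note` class, and the `find_notes` helper are assumptions for illustration, not DrDroid's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Note:
    """One memory entry; fields mirror the metadata listed above (names assumed)."""
    text: str
    timestamp: datetime
    entities: set = field(default_factory=set)  # services, databases, clusters
    team: str = ""
    tags: set = field(default_factory=set)


def find_notes(notes, entity, since):
    """Return notes about `entity` created at or after `since`, newest first."""
    hits = [n for n in notes if entity in n.entities and n.timestamp >= since]
    return sorted(hits, key=lambda n: n.timestamp, reverse=True)
```

Storing entities and timestamps as structured fields, rather than inside free text, is what makes a lookup such as "notes about payment-service from the last month" a simple filter.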

6. Context Compaction (1M+ Token Conversations)

Typically, agents handle conversations that outgrow the context window in one of the following ways:

  • Summarize the entire conversation (losing critical context)

  • Hit token limits and stop

  • Slow down dramatically as the conversation grows

With production telemetry data, this happens often.

How DrDroid Solves This:

Intelligent Compression Without Context Loss

  • Tool calls are compressed (only IDs and summaries preserved)

  • Reasoning and train of thought remain intact (no summarization)

  • Agent maintains full context even beyond 1M tokens

Smart Tool-Level Compaction

Our tools have built-in context management:

  • Logging tool has grep/search capability over large volumes

  • Agent can "eyeball and search" logs instead of loading everything into context

Real Impact: You can have a 2-hour debugging session with 500+ tool calls, and the agent never loses context or slows down.
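
The compression rule described above (older tool calls reduced to an ID and a summary, reasoning left verbatim) can be sketched as follows. This is a simplified illustration under assumptions: the `Message` structure, the `keep_recent` parameter, and the `compact` function are hypothetical and not DrDroid's internals.

```python
from dataclasses import dataclass


@dataclass
class Message:
    kind: str          # "reasoning" or "tool_result"
    text: str          # full content
    call_id: str = ""  # set for tool results
    summary: str = ""  # short summary of a tool result


def compact(history, keep_recent=2):
    """Shrink older tool results to '[id] summary'; never touch reasoning."""
    cutoff = len(history) - keep_recent
    compacted = []
    for index, message in enumerate(history):
        if message.kind == "tool_result" and index < cutoff:
            short = f"[{message.call_id}] {message.summary}"
            compacted.append(
                Message("tool_result", short, message.call_id, message.summary)
            )
        else:
            compacted.append(message)
    return compacted
```

Because the call ID survives compaction, the full tool output can in principle be fetched again if a later step needs it, which is why dropping the raw payload does not have to mean losing the information.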

7. Multi-Channel Conversations with Shareability

The agent is designed to work wherever is most convenient for you:

  • Slack DMs

  • Thread replies to alerts

  • Web UI

  • CLI (coming soon)

  • API triggers

  • Voice calls (coming soon)

Seamless Sharing

Any investigation can be:

  • Shared with teammates for review

  • Linked in postmortems

  • Referenced in future incidents

Real Impact: When someone gets paged, they can see the auto-investigation that already ran in the Slack thread—no need to DM the agent separately.

8. Automated Investigations

DrDroid can run on automated triggers, giving your team proactive visibility:

  • Alert fires in PagerDuty/OpsGenie → Investigation starts automatically

  • Cron-based health checks → Agent investigates on schedule

  • Custom triggers via API or webhooks

Real Impact: The agent can detect issues even without alerts. When an alert does fire, the agent has already investigated and summarized the likely root cause by the time you open it.

9. Smart Model Switching (85% Cost Savings)

LLMs have been commoditised, and the SOTA model is not necessarily required for every investigation. DrDroid chooses between different LLMs based on investigation complexity:

  • Simple tasks → Faster, cheaper models

  • Complex reasoning → State-of-the-art models

Real Impact: Up to 85% token savings compared to always using frontier models, with no degradation in investigation quality.
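
One simple way to picture complexity-based routing is a scoring function that picks a model tier. Everything in this sketch is an assumption for illustration: the `Task` fields, the weights, the threshold, and the model names are hypothetical, and DrDroid's actual routing logic is not described in this article.

```python
from dataclasses import dataclass


@dataclass
class Task:
    services_involved: int
    needs_cross_tool_correlation: bool
    matches_past_incident: bool


def choose_model(task: Task) -> str:
    """Route to a cheaper model unless the task looks complex (weights assumed)."""
    score = task.services_involved
    if task.needs_cross_tool_correlation:
        score += 3  # correlating several tools needs stronger reasoning
    if task.matches_past_incident:
        score -= 2  # a known pattern is easier to confirm
    return "frontier-model" if score >= 4 else "small-fast-model"
```

For example, a single-service alert that matches a past incident would be routed to the cheaper tier, while a multi-service investigation that needs cross-tool correlation would go to the stronger model.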

10. Dedicated File System & Memory

Memory management for large-scale infrastructure requires a structured approach.

DrDroid has a Persistent Knowledge Base: all context, memory, investigations, and alerts are stored and accessible:

  • Agent can navigate past investigations like files

  • Search across all historical data

  • Reference previous findings instantly

Real Impact: "Show me all investigations related to database timeouts in the last 30 days" returns instant results.

11. Coding Sub-Agent for Hotfixes

A coding agent operates very differently from a production investigation agent. DrDroid comes pre-packaged with a coding agent connected to the investigation agent.

Real-Time Coding Agent

When needed, DrDroid:

  • Spins up a coding agent in an ephemeral sandbox

  • Reviews the full repository

  • Creates hotfix PRs with proper context

Real Impact: During an incident, the agent can say "I found the bug in payment-service line 247—here's a PR to fix it."

12. Remote Machine & Kubernetes Access

Often, data needs to be reviewed on a remote machine or Kubernetes cluster during a production incident. These might be inaccessible or sensitive.

How DrDroid Solves This: Direct Infrastructure Access without exposing credentials to the agent

  • Execute commands on remote machines via SSH (keys are not exposed to the agent)

  • Query read-only Kubernetes clusters directly

  • Access VMs and clusters within your VPC via reverse proxy

Real Impact: "Check disk space on prod-api-01" → Agent SSHs in, runs the command, and returns results. No manual execution needed.

13. Image Support for Dashboard Analysis

You might want to debug an issue with a screenshot shared by the customer as the starting point.

How DrDroid Solves This: The DrDroid agent supports image processing from Slack or the UI. You can share a screenshot of:

  • Your product showing an error

  • A Grafana dashboard

  • A monitoring alert

The agent analyses it and continues the investigation from there.

Real Impact: "Here's what the user is seeing" → Agent understands the UI issue and investigates the backend cause.

14. Granular Access Control & RBAC

Debugging production systems involves sensitive data and access management. DrDroid ensures only the right people have the right access while debugging.

How DrDroid Solves This:

  • Read commands: Execute without approval (safe exploration)

  • Write commands: Require RBAC approval per your policy

  • SSO integration: Syncs with your internal permissions

  • Audit logs: Track who did what

Real Impact: Junior engineers can investigate safely, while dangerous operations require senior approval.
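
The read/write split can be pictured as a gate in front of command execution. The sketch below is an illustrative assumption, not DrDroid's policy engine: the allow-list and the function `requires_approval` are made up for this example, and it only recognizes a few well-known read-only `kubectl` verbs. Anything it does not recognize is treated as a write.

```python
import shlex

# Assumed allow-list for illustration; a real policy would be broader
# and configurable per team.
READ_ONLY_KUBECTL_VERBS = {"get", "describe", "logs", "top"}


def requires_approval(command: str) -> bool:
    """Return True unless the command is a recognized read-only kubectl call."""
    tokens = shlex.split(command)
    if len(tokens) >= 2 and tokens[0] == "kubectl":
        return tokens[1] not in READ_ONLY_KUBECTL_VERBS
    # Default-deny: unknown commands go through the approval flow.
    return True
```

Defaulting to "requires approval" means a gap in the allow-list costs an extra approval step rather than an unreviewed change to production.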

15. Third-Party Vendor Status Tracking

Production incidents are often partly caused by third-party downtime or issues.

How DrDroid Solves This: Connected to Vendor Statuspages

  • Tracks 150+ status pages for your third-party vendors: Stripe, AWS, Datadog, MongoDB Atlas, etc.

  • Flags when vendor issues might be causing downstream impact

Real Impact: "Is this our issue or Stripe's?" → Agent checks Stripe's status page and correlates timing.
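
Correlating timing comes down to checking whether your alert falls inside, or close to, the window of a vendor incident. The following is a minimal sketch under assumptions: the function `vendor_may_explain` and the 10-minute `slack` are hypothetical, chosen because status pages are often updated some time after an issue begins.

```python
from datetime import datetime, timedelta


def vendor_may_explain(alert_time, vendor_start, vendor_end=None,
                       slack=timedelta(minutes=10)):
    """Return True if the alert falls within the vendor incident window.

    `vendor_end=None` means the vendor incident is still ongoing.
    `slack` widens the window on both sides (assumed value).
    """
    if alert_time < vendor_start - slack:
        return False
    if vendor_end is None:
        return True
    return alert_time <= vendor_end + slack
```

A positive result is only a hint that the vendor issue is worth investigating; overlap in time does not by itself prove the vendor caused the alert.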

16. Automated Quality Evaluation

Production agents need to come with quality guarantees that the team can track and trust.

How DrDroid Solves This: LLM-Based Evals on Every Investigation

Every investigation is automatically evaluated for:

  • Accuracy

  • Safety

  • Errors or hallucinations

Central teams get visibility into investigation quality and improvement opportunities.

Real Impact: Platform team can see "Investigation quality is 94% this month, down from 97% last month—let's review the low-scoring investigations."

17. User Feedback & Team Visibility

Context within DrDroid can improve continuously over time, but only if user feedback is tracked and acted upon.

How DrDroid Solves This: Collaborative Quality Control

  • Every investigation can be upvoted/downvoted

  • Feedback flows back to the central team and helps improve agent context

Real Impact: The central team and managers get visibility into engineers' confidence in the AI and its impact on their work.

18. Reasoning Lifecycle & Audit Trail

Production incident investigations cannot afford to go wrong because of "hallucinations" or "guesses" by an LLM. DrDroid ensures that every step of the LLM's reasoning is grounded in facts and data.

How DrDroid Solves This: Transparent Investigation Path

The agent tracks:

  • What data it queried

  • Why each data point was relevant

  • What hypothesis it built from each finding

  • How it reached its conclusion

Real Impact: You can backtrack through the investigation to validate correctness, spot gaps, or understand the agent's reasoning.
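
An audit trail of this kind can be modeled as an ordered list of steps, each tying a query to the reason it was run and the hypothesis it produced. The class and field names below are assumptions for illustration only and do not reflect DrDroid's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    query: str         # what data was queried
    why_relevant: str  # why this data point mattered
    hypothesis: str    # what the agent concluded from it


@dataclass
class Investigation:
    steps: list = field(default_factory=list)
    conclusion: str = ""

    def record(self, query, why_relevant, hypothesis):
        self.steps.append(Step(query, why_relevant, hypothesis))

    def trail(self):
        """Render the investigation path as numbered, human-readable lines."""
        lines = [
            f"{i}. {s.query} -> {s.hypothesis} (because {s.why_relevant})"
            for i, s in enumerate(self.steps, 1)
        ]
        if self.conclusion:
            lines.append(f"Conclusion: {self.conclusion}")
        return lines
```

Recording the reason alongside each query is what lets a reviewer spot a step where the hypothesis does not actually follow from the data.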

Summary: Why DrDroid is Purpose-Built for Production

Capability | DrDroid Investigation Agent
Code awareness | Auto-discovers repos, APIs, dependencies
Infrastructure knowledge | Knows your K8s, cloud, databases
Log/metric analysis | Specialized tools for production-scale data
Memory of past incidents | Full history + pattern learning
Context window | 1M+ tokens with intelligent compaction
Cost optimization | Smart switching, up to 85% savings
Permissions & RBAC | Enterprise-grade access control
Auto-triggered investigations | Yes (from alerts, cron, API)
Quality control | Automated evals + team feedback
Infrastructure execution | Direct SSH, K8s, API access

What This Means for Your Team:

  • Every investigation starts with past context

  • You do not have to guide the LLM or explain your architecture every time

  • It can handle production-scale logs or metrics

  • It supports automation or proactive help

  • It works across different channels where your team lives

Ready to see the agent?

DrDroid is a purpose-built investigation agent that understands your infrastructure, learns from your incidents, and gets smarter every day.

Next steps:

Want to see how it works with your stack?

Setup and go-live take 1-2 hours for smaller teams and under 1 week for enterprises. Get started here.