Context Engine: How DrDroid's AI Agent leverages the Continuously Improving Knowledge Graph
An explainer on how a specialised debugging agent like DrDroid helps engineering teams accelerate operations

Executive Summary
“Production systems” is not a single component.
Kubernetes is a tangibly identifiable component with deterministic ways of operation. So is a database. And so is code. But “Production Systems” is a complex orchestration of different and discrete components that interact with each other to produce desired results for a business or user.
Joining together pieces, some of which are continuously changing, means that the risk of failure grows non-linearly relative to the risk of the individual components.
To tackle that, it is essential to design, iterate on and architect systems with resilience, fallbacks, reliability and visibility.
DrDroid helps engineers leverage AI effectively to automate and accelerate operations. To enable high-quality results, DrDroid maintains a stateful knowledge layer (the context engine) that gives agents context about the different components of a production system.
The need for a context engine
Engineers use DrDroid to debug production alerts and get sharp, data-backed RCAs. To ensure that the agent can run complex investigations, is grounded in data and is context-aware, we realised that a lot of reality is hidden in the day-to-day changes in the system. Here are a few examples of typical changes, ordered from most obvious to least obvious:
Code Changes: Regular changes in a service in accordance with deliverables and goals
Resource changes: A database's memory usage grows up to its limit
Indirect code changes: A third party dependency being used downstream in a service changes their API payload
Indirect resource changes: A certificate expires on one of the machines where a critical service is running
User change: A different user triggers an edge case in a product journey that never got used until now, leading to a failure for the user
None of these issues are unheard of, or shockingly new to an engineer. The problem is that the impact could be visible at a very different point in the system compared to where the issue originated.
In fact, for very different reasons, the same outcome/impact/alert might get triggered at different times.
Being able to mitigate and avoid such issues often requires more information than just the alert. Debugging them is often a combination of system knowledge, deductive reasoning, tribal knowledge, specialised skills and real-time situational context.
On top of that, engineers are expected to do this under high pressure or at unexpected times, as there are often business guarantees associated with these issues.
Components within a context engine
Our context engine is designed to proactively process and maintain an up-to-date knowledge layer about a company.
Here are some examples of what the context engine comprises:
System Knowledge:
Service Catalog: A list of every service, its related teams and repositories, monitoring context and dependencies. This can be auto-generated by fetching data across multiple tools such as an APM, ArgoCD, GitHub repositories or Kubernetes. You can further edit the details if needed.
Infrastructure inventory: A list of all your resources, from databases to VMs to Kubernetes deployments and namespaces.
Pre-instrumented metrics knowledge: The context engine is aware of all the metrics that are instrumented, as well as your dashboards and their panels. This ensures that the agent does not have to query Prometheus from first principles but can reuse your existing dashboards too!
Codebase context: The context engine continuously maintains knowledge of the capabilities, features and structure of each repository. It also stores correlations with other repositories (depending on the information available). Read more on source code security.
Third party dependencies: Within DrDroid, you can maintain a list of all 3rd parties that your application depends upon.
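To make the idea concrete, a service-catalog entry of this kind might look like the following minimal sketch. The field names and values are illustrative only, not DrDroid's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ServiceEntry:
    """One illustrative service-catalog record (hypothetical fields)."""
    name: str
    team: str
    repositories: list = field(default_factory=list)
    dashboards: list = field(default_factory=list)   # monitoring context
    depends_on: list = field(default_factory=list)   # services and third parties

# A tiny catalog keyed by service name
catalog = {
    "checkout": ServiceEntry(
        name="checkout",
        team="payments",
        repositories=["github.com/acme/checkout"],
        dashboards=["grafana:checkout-overview"],
        depends_on=["payments-db", "stripe"],
    )
}
```

An entry like this can be seeded automatically from an APM or Kubernetes and then hand-edited, which is why a plain, editable record shape is a natural fit.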

Assistance for deductive reasoning:
1. **Service & infrastructure correlations:** By leveraging the context of your inventory across infrastructure & services, and your traces/service maps, correlations between services and infrastructure are auto-processed. These can be built using your existing service maps from APM/traces or using network mappers in your Kubernetes infrastructure (available with our reverse proxy service).
https://www.youtube.com/watch?v=Bt-G0ZRaj-8&t=2s
2. **Pre-compiled log patterns and fingerprints:** Your log patterns are stored over time within the context engine, ensuring that the agent can reuse its past experience of log analysis to improve future log search queries.
3. **Pre-mapped data flows and critical user journeys:** The agent maintains context about the product/architecture and can even keep context of critical workflows across repositories. These can also be edited to add further insights and knowledge.
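Log fingerprinting of the kind described above is commonly done by masking the variable parts of a line (IDs, numbers) so that lines produced by the same code path collapse to one pattern. A minimal sketch, with regexes chosen purely for illustration:

```python
import hashlib
import re

def fingerprint(line: str) -> str:
    """Collapse a raw log line into a stable pattern fingerprint by
    masking its variable parts (hex IDs, UUIDs, numbers)."""
    pattern = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    pattern = re.sub(r"\b[0-9a-fA-F-]{32,36}\b", "<UUID>", pattern)
    pattern = re.sub(r"\d+", "<NUM>", pattern)
    return hashlib.sha1(pattern.encode()).hexdigest()[:12]

# Two lines differing only in request ID and latency share one fingerprint:
a = fingerprint("request 8741 to /cart took 532ms")
b = fingerprint("request 9902 to /cart took 18ms")
```

Stored fingerprints let an agent recognise "this is the same pattern we analysed last week" instead of re-deriving a search query from scratch.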
Tribal Knowledge:
Pre-existing runbooks and SOPs: All your existing runbooks and SOPs can be added to the context engine, either by syncing them with your wiki or by uploading/creating documents in the account.
Past Incidents & investigations: The context engine is continuously updated as the agent runs alert investigations and RCAs. Insights on correlations and identified issues are stored so the agent can easily reference and search them in the future.
Alert Patterns: The context engine stores alerts from Slack channels and webhooks, ensuring the agent can see the pattern of a specific alert and differentiate an anomaly from a regular alert.
Communication rosters: The context engine stores information about the teams and the users within them. Users can be auto-synced from Slack, whereas teams can simply be configured in the platform.
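The alert-pattern idea above, telling an anomaly apart from a routine alert, can be sketched with a simple frequency baseline. This is an illustrative toy, not DrDroid's actual model:

```python
from collections import Counter

class AlertBaseline:
    """Illustrative sketch: count historical occurrences per alert type
    and treat rarely-seen types as anomalous."""
    def __init__(self, anomaly_threshold: int = 3):
        self.history = Counter()
        self.anomaly_threshold = anomaly_threshold

    def record(self, alert_type: str) -> None:
        self.history[alert_type] += 1

    def is_anomalous(self, alert_type: str) -> bool:
        # A type seen fewer times than the threshold is unusual for this system.
        return self.history[alert_type] < self.anomaly_threshold

baseline = AlertBaseline()
for _ in range(50):
    baseline.record("HighCPU:worker")    # fires routinely
baseline.record("CertExpiry:gateway")    # seen only once
```

A real system would also weigh recency and time-of-day patterns, but even a raw count lets the agent say "this alert fires daily" versus "this is new".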
Specialised skills
Internal tools specific knowledge: You can store guidelines on how your internal CLIs/tools operate. This enables the agent to use the tools effectively, ensuring your production workflows can be replicated.
Best practices on trace analysis: Analysing traces can be taxing for any engineer and requires deep focus to identify anomalies across time and spans. The context engine has specialised tools for trace analysis, enabling it to get insights from a large volume of spans in a short duration with reduced token consumption.
Observability tools usage: Every tool has its own data structures and entity design. Best practices around using tools like SigNoz, New Relic, AWS, Kubernetes, etc. are pre-baked into the context engine.
Real-time situational context
Code Changes: The context engine continuously processes PRs, commits and merges across repositories. This ensures that the agent can get a quick glance at recent changes.
Infrastructure Changes: Additions to your infrastructure inventory, from a database in your cloud to a namespace in your Kubernetes cluster, are continuously tracked as diffs.
Ongoing issues and alerts: Alerts are processed and saved in real time so the agent can use them immediately.
Vendor downtime events: The context engine continuously queries the status pages of the third-party vendors identified in the platform. This ensures that the agent can be notified upfront during an investigation if a downstream vendor API is impacted.
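Many vendors expose a Statuspage-style `/api/v2/status.json` endpoint whose `status.indicator` field is one of `none`, `minor`, `major` or `critical`. A minimal sketch of interpreting such a payload (the fetch itself is omitted; this only parses the JSON):

```python
import json

def vendor_impacted(status_json: str) -> bool:
    """Parse a Statuspage-style status payload and treat any indicator
    beyond 'none' as a potential downstream impact."""
    indicator = json.loads(status_json).get("status", {}).get("indicator", "none")
    return indicator != "none"

healthy = '{"status": {"indicator": "none", "description": "All Systems Operational"}}'
degraded = '{"status": {"indicator": "major", "description": "Partial API outage"}}'
```

Polling these endpoints on a schedule and caching the last known indicator is enough to flag "your payment provider is reporting a major outage" at the start of an investigation.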
Where is the context engine used?
Here is how the context engine gets used in the platform:
Investigation Agent: The investigation agent is our flagship agent which can help you investigate any issue in your production system - from alert investigation & service degradation to security review & cost analysis.
Alert Classification/Triaging Agent: Just send all your alerts to DrDroid. When you receive a large stream of alerts, this agent classifies each one as noisy (and suppresses it) or lets it come through to you. It grounds every recommendation in actual system data. As you get started, it gives an alert the benefit of the doubt and treats it as critical (so that a critical issue isn't missed), but over time it learns from already-triaged alerts, queried data and user feedback. Users can also add specific "notes" at the alert-type level so that the agent doesn't have to think from first principles when it gets started.
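The triaging behaviour described above, default to critical, learn from past triage, and let explicit notes override, can be sketched roughly as follows. The function, its rules and thresholds are illustrative assumptions, not DrDroid's actual logic:

```python
def classify_alert(alert_type: str, triage_history: dict, notes: dict = None) -> str:
    """Illustrative triage sketch: benefit of the doubt by default,
    learned suppression for consistently noisy alert types."""
    if notes and alert_type in notes:
        return notes[alert_type]             # explicit user guidance wins
    past = triage_history.get(alert_type, [])
    if len(past) >= 5 and all(v == "noisy" for v in past):
        return "suppress"                    # consistently triaged as noise
    return "critical"                        # new or ambiguous types come through

# A type that has been marked noisy eight times in a row
history = {"HeartbeatMissed:cron": ["noisy"] * 8}
```

The important property is the asymmetry: suppression requires repeated evidence, while "critical" is the fallback for anything unfamiliar.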
Alert Grouping Agent: If you have 100+ relevant alerts coming in, rarely are they 100 unique issues. More often than not, they end up being 7 or 19 issues. The Grouping agent listens to your alerts and tries to group them into actionable buckets, often by root cause (found by the investigation agent) or by the impacted component and where it fits in the topology. This ensures that during an incident, or when a surge of alerts stems from the same issue, the one unrelated signal doesn't get missed.
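The bucketing strategy above, prefer a known root cause, fall back to the impacted component, can be sketched in a few lines. The alert fields here are hypothetical:

```python
from collections import defaultdict

def group_alerts(alerts: list) -> dict:
    """Illustrative grouping: bucket by root cause when an investigation
    found one, otherwise by the impacted component."""
    buckets = defaultdict(list)
    for alert in alerts:
        key = alert.get("root_cause") or alert.get("component", "unknown")
        buckets[key].append(alert)
    return dict(buckets)

alerts = [
    {"name": "HighLatency:api", "component": "api", "root_cause": "db-failover"},
    {"name": "5xxSpike:web", "component": "web", "root_cause": "db-failover"},
    {"name": "DiskFull:batch", "component": "batch"},
]
groups = group_alerts(alerts)
```

Here two symptoms of the same database failover collapse into one bucket, while the unrelated disk alert stays visible on its own.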
Continuous Improvement Agent: The core principle behind engineering operations is to react, learn and improve. Some teams get too occupied with firefighting and don't have the bandwidth to work on proactive improvements to the system. This agent suggests improvements to your alerts, flags missing observability data for any of your new services, and gives cost or security-posture recommendations.
Meta agent: This agent helps you improve context on the DrDroid platform based on ongoing activity, which will, in turn, make all the other agents more powerful.
How is the context engine implemented?
Read more about the design of the context engine and its architecture in this blog.



