DrDroid: Building the Knowledge Layer for AI SRE Reliability

An AI agent operating on production systems is only as effective as the context it can access at the moment a question is asked. Generic foundation models, however capable, do not know your service names, your dashboards, your deploy history, or your tribal knowledge. Closing that gap is not a model problem. It is a knowledge engineering problem.

This paper describes DrDroid's Context Engine: the continuously updated knowledge layer that captures the state and history of a customer's production environment, and the retrieval system that surfaces the right slice of that knowledge during an agent investigation. We cover what the Context Engine stores, how it is built and kept current, how the agent searches it, and the design choices that make this approach measurably better than RAG, embeddings, or static documentation alone.

The problem we are solving

Production engineering is not a single discipline. It is the orchestration of dozens of systems, each with its own data model, vocabulary, and failure mode. A typical mid-sized engineering team operates across:

A monitoring stack (Grafana, Prometheus, Datadog, Signoz)
A deployment pipeline (ArgoCD, Jenkins, GitHub Actions)
A cloud or hybrid infrastructure provider (AWS, GCP, Azure)
An orchestrator (Kubernetes)
An alerting and on-call layer (PagerDuty, Opsgenie)
A code platform (GitHub, GitLab)
Collaboration and ticketing (Slack, Jira, Confluence)
Databases, APMs, error trackers, analytics tools

When an alert fires, the engineer who resolves it fastest is not the one who reads the alert text most carefully. It is the one who carries an internal map of how these systems connect for this specific company. They know which dashboard is the real one. They know that payments-svc and checkout-payments are the same service under two names. They remember that a similar alert two weeks ago was caused by a Redis eviction, not a code change.

This map is built over months of on-call rotations. It is the difference between a five-minute resolution and a fifty-minute one. And until now, it has been almost impossible to give to an AI agent.

A foundation model knows what Kubernetes is. It does not know what your Kubernetes cluster looks like. RAG over your documentation helps a little, but production reality is not in your documentation. It is scattered across live tools, recent deploys, evolving log patterns, and engineers' heads.

The Context Engine is our solution to this gap. It is the knowledge layer that makes a general-purpose AI agent behave like a senior engineer who has been on your team for two years.

What the Context Engine is

A short definition first:

The Context Engine is a structured, continuously updated knowledge layer about a customer's production environment, designed to be queried by an AI agent during operational tasks.

It has two parts. The first is the knowledge itself: a catalog of what exists in the production environment, organized into typed records. The second is the retrieval system that sits on top: a hierarchical search engine, which we refer to as Dynamic Memory Retrieval (DMR), that lets the agent pull the right records at the right moment without dumping everything into a context window.

The analogy we find useful is a library. The knowledge layer is the collection of books on the shelves. DMR is the card catalog and the librarian. You need both. A library without organization is just a pile of paper. A search system with nothing to search is just an empty interface.

In practice, the Context Engine sits between your production environment and the AI agent. The goal of the layer is simple to state and harder to build: before the agent answers a question about your system, it should already have the same picture in its head that your senior engineer does.

To get there, the engine continuously builds and maintains four kinds of context. We'll walk through each one, what's inside, and how we keep it updated without anyone on your team babysitting it.

1. The Map of Your Stack

When you connect DrDroid to your existing tools, the first thing the engine builds is an inventory. Not a CMDB you fill out by hand. An actual working map pulled directly from your APM, ArgoCD, GitHub, and Kubernetes.

This map covers every service with its repos and owning team. Every Grafana and Datadog dashboard, the panels inside them, the queries those panels run. Every database, VM, K8s cluster, and namespace. Every third-party vendor your application talks to.

How we keep it current: Each source tool gets its own synchronizer. ArgoCD changes get picked up on a webhook. Kubernetes is reconciled on a short interval against the live cluster state. GitHub activity flows in through repo events. We don't trust any single tool to be authoritative on its own, so the engine resolves conflicts using a precedence order per record type. The result is a map that lags reality by minutes, not days.
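To make the precedence idea concrete, here is a minimal sketch of per-record-type conflict resolution. It is an illustration under stated assumptions, not the engine's actual implementation: the `PRECEDENCE` table, the `Observation` type, and the `resolve` function are all hypothetical names, and the specific source ordering is invented for the example.

```python
from dataclasses import dataclass

# Hypothetical precedence order per record type: earlier sources win.
PRECEDENCE = {
    "service": ["kubernetes", "argocd", "github"],
    "dashboard": ["grafana", "datadog"],
}


@dataclass(frozen=True)
class Observation:
    record_type: str  # e.g. "service"
    key: str          # e.g. the service name
    field: str        # e.g. "cluster"
    value: str
    source: str       # which synchronizer reported this value


def resolve(observations):
    """Merge observations into records; the highest-precedence source wins per field."""
    best = {}  # (record_type, key, field) -> (rank, value)
    for obs in observations:
        order = PRECEDENCE.get(obs.record_type, [])
        # Sources missing from the table rank below every listed source.
        rank = order.index(obs.source) if obs.source in order else len(order)
        slot = (obs.record_type, obs.key, obs.field)
        if slot not in best or rank < best[slot][0]:
            best[slot] = (rank, obs.value)

    records = {}
    for (record_type, key, field), (_, value) in best.items():
        records.setdefault((record_type, key), {})[field] = value
    return records


observations = [
    Observation("service", "checkout-api", "cluster", "prod-us-west", "github"),
    Observation("service", "checkout-api", "cluster", "prod-us-east", "kubernetes"),
    Observation("service", "checkout-api", "owner", "Payments", "github"),
]
records = resolve(observations)
```

In this sketch the live cluster outranks GitHub for the `cluster` field, while GitHub still supplies `owner` because no higher-precedence source reported one.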

What this changes: When an alert mentions checkout-api, the agent already knows that means the checkout-api repo on GitHub, owned by the Payments team, running in the prod-us-east cluster, monitored by the checkout-deepdive Grafana dashboard, depending on the orders database and Stripe. It doesn't have to ask. It doesn't have to query five tools to figure it out. It already has it.

2. How Your Services Actually Connect

Maps are static. Production isn't. The engine also builds a live correlation layer that captures what depends on what and which traffic flows where.

This is where most "AI for ops" tools quietly fall apart, because reconstructing a service graph from a static config is one thing and keeping it accurate while services come and go is another.
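One way to keep a dependency graph honest as services come and go is to timestamp every observed edge and ignore edges that have not been seen recently. The sketch below assumes that approach; the `ServiceGraph` class, its TTL, and the service names other than checkout-api are hypothetical and are not a description of how the engine actually does it.

```python
class ServiceGraph:
    """Dependency edges that expire unless they keep being observed."""

    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self.last_seen = {}  # (caller, callee) -> timestamp of last observation

    def observe(self, caller, callee, ts):
        # Called whenever traffic from caller to callee is observed.
        self.last_seen[(caller, callee)] = ts

    def dependencies(self, service, now):
        # Only edges observed within the TTL count as live dependencies.
        return sorted(
            callee
            for (caller, callee), ts in self.last_seen.items()
            if caller == service and now - ts <= self.ttl
        )


graph = ServiceGraph(ttl_seconds=300)
graph.observe("checkout-api", "orders-db", ts=1000)
graph.observe("checkout-api", "stripe", ts=1200)
graph.observe("checkout-api", "legacy-cart", ts=100)  # stale edge
deps = graph.dependencies("checkout-api", now=1250)
```

Here `legacy-cart` drops out of the result because its last observation is older than the TTL, which is the behavior a static config file cannot give you.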

3. Your Team's Tribal Knowledge

This is the part senior engineers care about most, because this is the part they're tired of being the only source of.

The engine stores runbooks and SOPs, synced from Confluence or Notion or uploaded directly. It stores every past investigation and RCA the agent has run, including what it queried, what it correlated, and what it concluded. It watches your alert patterns over time so it can tell a routine 7 AM spike from a real anomaly. It knows who's on the Payments team and who's on-call this week.

How we keep it current: Runbooks sync on a fixed schedule and on doc-update events where the source supports them. Investigations are written into the engine as the agent finishes them, with full traceability of which records they touched. Alert pattern fingerprints are recomputed on a rolling window, so a service that used to be quiet and is now noisy doesn't look "normal" forever just because it once was.
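As an illustration of a rolling-window baseline, the sketch below keeps a bounded history of alert counts per alert and hour of day, so old behavior ages out automatically. The `AlertBaseline` class, the window length, and the three-sigma threshold are assumptions chosen for the example, not the engine's actual fingerprinting logic.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class AlertBaseline:
    """Rolling per-hour baseline of alert counts; old days fall out of the window."""

    def __init__(self, window_days=14):
        # Each (alert, hour) pair keeps only the most recent `window_days` counts.
        self.history = defaultdict(lambda: deque(maxlen=window_days))

    def record_day(self, alert, hour, count):
        self.history[(alert, hour)].append(count)

    def is_anomalous(self, alert, hour, count, threshold=3.0):
        past = self.history[(alert, hour)]
        if len(past) < 3:
            return True  # too little history to call anything routine
        mu, sigma = mean(past), pstdev(past)
        # Floor sigma at 1.0 so a perfectly flat history is not hypersensitive.
        return abs(count - mu) > threshold * max(sigma, 1.0)


baseline = AlertBaseline(window_days=3)
for count in (1, 1, 1, 50, 50, 50):  # quiet at first, noisy later
    baseline.record_day("cpu-high", hour=7, count=count)

routine_now = not baseline.is_anomalous("cpu-high", hour=7, count=50)
quiet_is_odd = baseline.is_anomalous("cpu-high", hour=7, count=1)
```

Because the window holds only the last three days, the earlier quiet period no longer defines "normal" for this alert.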

What this changes: A new engineer on-call doesn't need someone senior to tell them "the last three times this fired, it was a config change in service X." The agent says that, because the agent has been watching.

4. What Changed in the Last Hour

Most production incidents trace back to something that changed recently. This is the freshest layer of the engine and the one that matters most during a live incident.

It tracks code changes (every PR, commit, and merge across your repos). It tracks infrastructure diffs (new namespace, renamed cluster, database added). It tracks live alerts (what's firing now, what just stopped). And it polls 150+ vendor status pages, so the agent knows if Stripe or AWS is having a bad day before you do.

How we keep it current: This is the part of the engine that's almost entirely event-driven. PRs and deploys come in on webhooks. Alerts flow in from your alerting tool directly. Vendor status pages are polled aggressively, on the order of every few minutes. The "last hour" delta is recomputed continuously, not cached.

What this changes: The agent walks into an investigation with a delta of what's different in the last hour. Most of the time, the answer is sitting somewhere in that delta.
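A minimal sketch of computing that delta, assuming change events from all sources have been normalized into one shape. The `ChangeEvent` type, the `recent_delta` function, and the sample events are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ChangeEvent:
    kind: str      # e.g. "deploy", "pr_merge", "alert", "vendor_status"
    target: str    # the service, cluster, or vendor the event concerns
    at: datetime
    summary: str


def recent_delta(events, now, window=timedelta(hours=1)):
    """Return events inside the window, newest first."""
    cutoff = now - window
    recent = [e for e in events if cutoff <= e.at <= now]
    return sorted(recent, key=lambda e: e.at, reverse=True)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [
    ChangeEvent("deploy", "checkout-api", now - timedelta(minutes=40), "new release rolled out"),
    ChangeEvent("pr_merge", "orders", now - timedelta(hours=3), "schema change merged"),
    ChangeEvent("alert", "checkout-api", now - timedelta(minutes=5), "p99 latency high"),
]
delta = recent_delta(events, now)
```

The three-hour-old merge is excluded, and the agent sees the alert and the deploy that preceded it side by side.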

How retrieval actually works

Building the knowledge is one problem. Pulling the right slice of it at the right time is another. This is the part where most teams trying to build something similar quietly run into a wall.

When we started, the obvious approach was retrieval-augmented generation. Take everything we know about a customer's environment, chunk it, embed it, do similarity search. It's the standard pattern for AI applications, and it works well for general-purpose knowledge bases.

It does not work for production infrastructure.

The reason took us some time to articulate clearly. In production engineering, a single character can mean a completely different thing.

us-east-1 and us-east-2 are different regions
checkout-prod and checkout-staging are different services
pod-abc123 and pod-abc124 are different pods
v1.4.2 and v1.4.3 might behave totally differently

Embedding models, by design, map similar-looking strings to nearby vectors, so they will score these pairs as nearly identical. They are not. During an incident, confusing them takes a small problem and turns it into a much bigger one.

The other issue is that production names don't carry semantic content. A service name is an arbitrary identifier. A cluster name is a label. There's nothing for the embedding to "understand." There's just the exact string. Keyword search works much better here, because real production questions almost always anchor on a specific named entity, and that name appears verbatim somewhere in the right record.

So our retrieval layer (we call it Dynamic Memory Retrieval) is keyword-first against the structured graph. We index entities for exact and prefix matching first. We layer embeddings in only for the record types where semantic similarity genuinely helps: runbook prose, ticket descriptions, Slack threads, postmortems. Never as the default for infrastructure records.
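The keyword-first idea can be sketched with a sorted list of entity names: try an exact match, and fall back to prefix matching only when there is no exact hit. This is an illustrative sketch, not DMR itself; `EntityIndex` and `lookup` are hypothetical names.

```python
import bisect


class EntityIndex:
    """Exact and prefix lookup over entity names; no fuzzy or semantic matching."""

    def __init__(self, names):
        self.names = sorted(set(names))
        self._name_set = set(self.names)

    def exact(self, name):
        return name in self._name_set

    def prefix(self, prefix):
        # Binary-search to the first candidate, then scan while the prefix holds.
        start = bisect.bisect_left(self.names, prefix)
        matches = []
        for name in self.names[start:]:
            if not name.startswith(prefix):
                break
            matches.append(name)
        return matches


def lookup(index, query):
    """Exact match first; prefix matches only if there is no exact hit."""
    if index.exact(query):
        return [query]
    return index.prefix(query)


index = EntityIndex(["us-east-1", "us-east-2", "checkout-prod", "checkout-staging"])
```

With this index, `lookup(index, "us-east-1")` returns only `us-east-1`, never its one-character neighbor, and a query for a region that does not exist returns nothing rather than the closest-looking name.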

A few other design choices worth flagging:

Hierarchical results. A service ranks above its log lines. A dashboard ranks above its individual panels. A runbook ranks above a casual Slack mention. The agent gets the parent record first and drills into children only when it needs to.
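One simple way to get parent-before-child ordering is to sort hits by their depth in the record hierarchy and only then by relevance score. The sketch below assumes each hit carries an optional parent ID and that the hierarchy has no cycles; the `Hit` type and `rank_hierarchically` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    record_id: str
    record_type: str
    score: float
    parent_id: Optional[str] = None


def rank_hierarchically(hits):
    """Order hits so parents precede their children; ties broken by score."""
    by_id = {h.record_id: h for h in hits}

    def depth(hit):
        # Count ancestors present in this result set (assumes no cycles).
        d = 0
        while hit.parent_id in by_id:
            hit = by_id[hit.parent_id]
            d += 1
        return d

    return sorted(hits, key=lambda h: (depth(h), -h.score))


hits = [
    Hit("log-1", "log_line", 0.9, parent_id="svc"),
    Hit("svc", "service", 0.5),
    Hit("panel-1", "panel", 0.8, parent_id="dash"),
    Hit("dash", "dashboard", 0.7),
]
ranked = [h.record_id for h in rank_hierarchically(hits)]
```

Even though the log line has the highest raw score, the service and dashboard records come first.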

Multi-pass retrieval. A single search rarely surfaces everything an investigation needs. The agent runs successive queries, each refined by what the previous one returned. A typical end-to-end investigation does between 5 and 50 retrievals against the engine, interleaved with tool calls against the underlying systems.
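The loop below sketches the shape of multi-pass retrieval. In the real system the agent decides how to refine each query; here a hypothetical `related` field on each record stands in for that judgment, and `toy_search` with its `CATALOG` is a made-up stand-in for the engine's search.

```python
def multi_pass_retrieve(search, seed_query, max_passes=50):
    """Run successive searches, each seeded by what earlier passes returned."""
    seen = {}        # record id -> record, in discovery order
    queried = set()  # queries already issued, to avoid loops
    frontier = [seed_query]
    passes = 0
    while frontier and passes < max_passes:
        query = frontier.pop(0)
        if query in queried:
            continue
        queried.add(query)
        passes += 1
        for record in search(query):
            if record["id"] not in seen:
                seen[record["id"]] = record
                # New records suggest follow-up queries for later passes.
                frontier.extend(record.get("related", []))
    return list(seen.values()), passes


# Toy catalog standing in for the knowledge layer (hypothetical data).
CATALOG = {
    "checkout-api": [{"id": "svc:checkout-api", "related": ["orders-db", "checkout-deepdive"]}],
    "orders-db": [{"id": "db:orders", "related": []}],
    "checkout-deepdive": [{"id": "dash:checkout-deepdive", "related": ["checkout-api"]}],
}


def toy_search(query):
    return CATALOG.get(query, [])


found, passes = multi_pass_retrieve(toy_search, "checkout-api")
```

Starting from a single service name, three passes surface the service, its database, and its dashboard, and the back-reference to checkout-api does not trigger a redundant fourth search.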

Short-term and long-term memory, kept separate. Long-term records describe what's always true: the list of services, the structure of the codebase, the dashboards that exist. Short-term records describe what's currently true: the deploy from 40 minutes ago, the alert firing right now, today's open incident. Mixing them confuses the agent's sense of "now" versus "always." So they're stored and queried as distinct stores.
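A minimal sketch of the separation, assuming short-term records simply expire after a time-to-live while long-term records persist. The class names, keys, and TTL value are hypothetical.

```python
class LongTermStore:
    """Records that describe what is always true; they never expire."""

    def __init__(self):
        self._records = {}

    def put(self, key, value):
        self._records[key] = value

    def get(self, key):
        return self._records.get(key)


class ShortTermStore:
    """Records that describe what is currently true; they expire after a TTL."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._records = {}  # key -> (written_at, value)

    def put(self, key, value, now):
        self._records[key] = (now, value)

    def get(self, key, now):
        entry = self._records.get(key)
        if entry is None:
            return None
        written_at, value = entry
        if now - written_at > self.ttl:
            del self._records[key]  # expired: no longer part of "now"
            return None
        return value


long_term = LongTermStore()
short_term = ShortTermStore(ttl_seconds=3600)
long_term.put("service:checkout-api", {"owner": "Payments"})
short_term.put("deploy:checkout-api", {"summary": "rolled out 40 minutes ago"}, now=0)
```

Querying the two stores separately lets the agent ask "what exists?" and "what is happening right now?" as distinct questions, and a deploy record stops answering the second one once it ages out.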

How the engine improves over time

The engine isn't static. Three things keep happening in the background.

It learns from every investigation. Each RCA the agent produces becomes a record in long-term memory. The next time something looks similar, the agent surfaces it: "This looks like the incident on March 14th where the Redis eviction policy was the cause."

It tracks your environment as it changes. New services, deleted dashboards, renamed clusters get picked up automatically. You don't re-onboard, ever. The engine that handled your environment on day 1 is materially different from the one handling it on day 90, in a good way.

It stays auditable. Every recommendation the agent makes is traceable back to which records it pulled and which tool calls it ran. If a senior engineer doesn't trust an answer, they can walk back through the reasoning step by step. This matters more than it sounds like it should, because trust is what determines whether an AI agent actually gets used.
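Traceability of this kind can be sketched as an append-only log of retrievals and tool calls attached to each investigation. The `InvestigationTrace` class, its method names, and the tool name and arguments in the example are hypothetical illustrations, not the engine's actual audit format.

```python
from dataclasses import dataclass, field


@dataclass
class InvestigationTrace:
    """Append-only log of what an investigation looked at, in order."""

    question: str
    steps: list = field(default_factory=list)

    def log_retrieval(self, query, record_ids):
        self.steps.append(
            {"kind": "retrieval", "query": query, "record_ids": list(record_ids)}
        )

    def log_tool_call(self, tool, args, result_summary):
        self.steps.append(
            {"kind": "tool_call", "tool": tool, "args": dict(args), "result": result_summary}
        )

    def records_used(self):
        # Every knowledge-layer record the investigation touched, de-duplicated.
        used = []
        for step in self.steps:
            if step["kind"] == "retrieval":
                for record_id in step["record_ids"]:
                    if record_id not in used:
                        used.append(record_id)
        return used


trace = InvestigationTrace("Why is checkout-api latency high?")
trace.log_retrieval("checkout-api", ["svc:checkout-api", "dash:checkout-deepdive"])
trace.log_tool_call("metrics_query", {"panel": "p99-latency"}, "p99 rose after the last deploy")
trace.log_retrieval("orders", ["db:orders", "svc:checkout-api"])
```

An engineer reviewing the conclusion can replay `trace.steps` in order and see exactly which records and tool results it rested on.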

One of our customers described the practical impact like this: "We went from a 90-day onboarding window for new SREs to two weeks." That's the bet: institutional knowledge living in a system instead of in three senior engineers' heads.

Conclusion

If you're building or evaluating an AI agent for production work, the single most important question is not which model it's using. It's what context the agent walks into the room with.

A foundation model with MCP servers and no context engine is going to spend the first ten tool calls of every investigation re-learning your environment, and it's going to get half of it wrong because production names look semantically similar but mean entirely different things.

A foundation model on top of a continuously maintained knowledge layer of your services, dashboards, runbooks, dependencies, and recent changes will start every investigation already pointed in roughly the right direction.

That's the whole architectural argument behind the Context Engine. Models will keep getting better. The work of building, maintaining, and retrieving the right context is the part that compounds, and it's the part most teams underestimate.

To learn more about the context layer and the knowledge layer, check out the YouTube videos on the @drdroiddev channel.

If you want to see the engine running against your stack, schedule a technical evaluation.
