<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Notes by Doctor Droid]]></title><description><![CDATA[Doctor Droid team shares product guides, demos and best practices in observability.]]></description><link>https://notes.drdroid.io</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1723284089567/30d18ee5-02b3-42cd-a3d6-b53a35289f14.png</url><title>Notes by Doctor Droid</title><link>https://notes.drdroid.io</link></image><generator>RSS for Node</generator><lastBuildDate>Sun, 19 Apr 2026 18:18:54 GMT</lastBuildDate><atom:link href="https://notes.drdroid.io/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[How DrDroid AI SRE Agent is specialised for Production Incidents & On-call Investigations]]></title><description><![CDATA[By working with 100s of engineers and their debugging problems, we iterated over DrDroid. The investigation agent assists engineers with complex analysis which are critical, time sensitive and have li]]></description><link>https://notes.drdroid.io/how-drdroid-ai-sre-agent-is-specialised-for-production-incidents-on-call-investigations</link><guid isPermaLink="true">https://notes.drdroid.io/how-drdroid-ai-sre-agent-is-specialised-for-production-incidents-on-call-investigations</guid><category><![CDATA[ai-sre]]></category><category><![CDATA[#AIOps]]></category><category><![CDATA[observability]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[logging]]></category><dc:creator><![CDATA[Siddarth Jain]]></dc:creator><pubDate>Thu, 09 Apr 2026 12:32:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/63200bf16c86d75accc7fd61/4502861e-8549-49d8-a504-3e26c79cc16a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>By working with 100s of engineers and their debugging problems, we iterated on DrDroid. The investigation agent assists engineers with complex analyses that are critical, time-sensitive, and have little room for error. Here are some of the things that help the investigation agent perform well:</p>
<h2><strong>1. Specialized Debugging Tools &amp; Skills</strong></h2>
<p>Production incidents often require analyzing large volumes of logs, traces, and metrics. This can be token-intensive and time-consuming. Most LLMs and agentic frameworks hit context window limits quickly, can't process production-scale data, and lose quality when analyzing large datasets.</p>
<p><strong>How DrDroid Solves This:</strong></p>
<h4><strong>Pre-Built Aggregate Analysis Tools</strong></h4>
<p>Instead of feeding raw logs into an LLM, DrDroid has tools designed specifically for handling large-volume logs with:</p>
<ul>
<li><p>Built-in log aggregation and pattern detection</p>
</li>
<li><p>Complex trace analysis across distributed systems</p>
</li>
<li><p>Large-volume metrics analysis with outlier detection and ML techniques</p>
</li>
</ul>
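<p>For intuition, here is a minimal sketch (not DrDroid's actual implementation) of the kind of pattern detection such a tool performs: variable tokens are normalized away so that thousands of raw lines collapse into a handful of countable templates the LLM can actually reason over:</p>
<pre><code class="lang-python">import re
from collections import Counter

def template(line: str) -&gt; str:
    """Collapse variable tokens so similar log lines share one template."""
    line = re.sub(r"\d{4}-\d{2}-\d{2}[T ][\d:.,]+Z?", "&lt;TS&gt;", line)  # timestamps
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "&lt;HEX&gt;", line)             # hex ids
    line = re.sub(r"\b\d+\b", "&lt;NUM&gt;", line)                        # bare numbers
    return line

def top_patterns(lines, n=5):
    """Return the n most frequent log templates with their counts."""
    return Counter(template(l) for l in lines).most_common(n)

# A 100,000-line file reduces to a few templates instead of a blown context window:
# for pattern, count in top_patterns(open("app.log")): print(count, pattern)
</code></pre>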
<h4><strong>Specialized Investigation Skills</strong></h4>
<p>Our agent has domain-specific skills built from working with hundreds of engineers on real debugging problems:</p>
<ul>
<li><p>How to query and analyze traces in Signoz or Datadog</p>
</li>
<li><p>How to navigate APM data efficiently</p>
</li>
<li><p>How to correlate metrics across multiple monitoring tools</p>
</li>
</ul>
<p><strong>Real Impact:</strong> The agent can process 100,000+ log lines in seconds and surface the 5 relevant errors—something that would exhaust a generic LLM's context window.</p>
<h2>2. Code &amp; Application Awareness</h2>
<p>Most LLMs and agents start every investigation from zero, with only the context of the prompt and sometimes markdown files.</p>
<h3><strong>How DrDroid Solves This:</strong></h3>
<h4>Automatic Code Context Generation</h4>
<p>Even before your first chat with the agent, DrDroid builds knowledge of:</p>
<ul>
<li><p>What each repository does</p>
</li>
<li><p>What capabilities, APIs, features, and workflows each repo covers</p>
</li>
<li><p>Programming languages, frameworks, and file structures</p>
</li>
<li><p>Connections between multiple repositories (discovered via traces and logs)</p>
</li>
</ul>
<h4>Business Workflow Understanding</h4>
<p>You can ask DrDroid to build context around critical business and product workflows. The agent understands:</p>
<ul>
<li><p>"The checkout flow involves payment-service, inventory-service, and notification-service"</p>
</li>
<li><p>"When users report 'payment stuck,' check these three services in this order"</p>
</li>
</ul>
<p><strong>Real Impact:</strong> When an alert fires on "payment-service," DrDroid already knows what that service does, which other services depend on it, and where to look for root causes.</p>
<h2>3. Infrastructure &amp; Resource Awareness</h2>
<p>An LLM with MCP connections doesn't know which apps run in which Kubernetes clusters, which databases are in which cloud providers, or how your infrastructure is organised. It needs to query multiple tools (costing time and tokens) and explore them before it can answer.</p>
<p><strong>How DrDroid Solves This:</strong></p>
<h4>Auto-Discovery of Infrastructure</h4>
<p>DrDroid continuously maps:</p>
<ul>
<li><p>Apps hosted in different Kubernetes clusters</p>
</li>
<li><p>Databases and their cloud providers</p>
</li>
<li><p>Service dependencies and communication patterns</p>
</li>
<li><p>Network topology and resource relationships</p>
</li>
</ul>
<h4>Service Map &amp; Dependency Graph</h4>
<p>The agent can answer questions like:</p>
<ul>
<li><p>"Which services depend on the payments database?"</p>
</li>
<li><p>"If eu-west-1 goes down, what's affected?"</p>
</li>
<li><p>"Show me all services running in the production cluster"</p>
</li>
</ul>
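<p>Conceptually, this is a directed dependency graph. A toy sketch of answering those questions (the service names are hypothetical; DrDroid discovers these edges automatically rather than having them hand-built):</p>
<pre><code class="lang-python">import networkx as nx

# Hypothetical service map; DrDroid builds this from traces, logs, and topology.
deps = nx.DiGraph()   # edge A -&gt; B means "A depends on B"
deps.add_edges_from([
    ("checkout", "payment-service"),
    ("payment-service", "payments-db"),
    ("inventory-service", "payments-db"),
    ("notification-service", "payment-service"),
])

# "Which services depend on the payments database?" = everything upstream of it.
print(nx.ancestors(deps, "payments-db"))
# Blast radius of a payment-service incident = everything that depends on it.
print(nx.ancestors(deps, "payment-service"))
</code></pre>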
<p><strong>Real Impact:</strong> During an incident, the agent instantly <strong>knows the blast radius</strong> and <strong>which downstream services might be affected</strong>—without you having to explain your architecture.</p>
<h2>4. Past Alert &amp; Incident Pattern Recognition</h2>
<p>With generic agents, every investigation is independent. They have no memory of past incidents or patterns.</p>
<p><strong>How DrDroid Solves This:</strong> Searchable Alert History</p>
<p>The agent has access to:</p>
<ul>
<li><p>All alerts since platform enablement</p>
</li>
<li><p>Past incidents and their resolutions</p>
</li>
<li><p>RCAs and postmortems (from Confluence, docs, or previous investigations)</p>
</li>
<li><p>Patterns observed across historical alerts</p>
</li>
</ul>
<p>When similar issues occur, the agent can say:</p>
<ul>
<li><p>"This looks similar to the incident from Jan 15th where the Redis cache was full"</p>
</li>
<li><p>"Last time this alert fired, the root cause was a config change in service X"</p>
</li>
</ul>
<p><strong>Real Impact:</strong> Repeat incidents get resolved faster because the agent learns from past investigations.</p>
<h2>5. Continually Learning System</h2>
<p>DrDroid improves with every investigation.</p>
<h4>Active Learning from Your Environment</h4>
<p>The agent continuously creates notes and memory from:</p>
<ul>
<li><p>Recent commits and merges in your applications</p>
</li>
<li><p>Investigations and conversations with the agent</p>
</li>
<li><p>Human conversations in Slack channels (optional)</p>
</li>
</ul>
<h4>Contextual Memory Storage</h4>
<p>Everything is stored with metadata:</p>
<ul>
<li><p>Timestamp</p>
</li>
<li><p>Related entities (services, databases, clusters)</p>
</li>
<li><p>Related team and people</p>
</li>
<li><p>Relevant tags and categories</p>
</li>
</ul>
<p><strong>Real Impact:</strong> The agent gets smarter every week. After a month, it knows your environment better than most new engineers.</p>
<h2>6. Context Compaction (1M+ Token Conversations)</h2>
<p>Faced with large context windows, agents typically do one of the following:</p>
<ul>
<li><p>Summarize the entire conversation (losing critical context) or</p>
</li>
<li><p>Hit token limits and can't continue</p>
</li>
<li><p>Slow down dramatically as conversations grow</p>
</li>
</ul>
<p>With production telemetry data, these limits are hit quickly and often.</p>
<p><strong>How DrDroid Solves This:</strong></p>
<h4>Intelligent Compression Without Context Loss</h4>
<ul>
<li><p>Tool calls are compressed (only IDs and summaries preserved)</p>
</li>
<li><p>Reasoning and train of thought remain intact (no summarization)</p>
</li>
<li><p>Agent maintains full context even beyond 1M tokens</p>
</li>
</ul>
<h4>Smart Tool-Level Compaction</h4>
<p>Our tools have built-in context management:</p>
<ul>
<li><p>Logging tool has grep/search capability over large volumes</p>
</li>
<li><p>Agent can "eyeball and search" logs instead of loading everything into context</p>
</li>
</ul>
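<p>To make the compression idea concrete, here is a simplified sketch over a chat transcript. The message shape and thresholds are assumptions for illustration, not DrDroid internals: old, oversized tool outputs are replaced by an ID plus a one-line summary, while reasoning is kept verbatim:</p>
<pre><code class="lang-python">MAX_TOOL_CHARS = 500   # assumed budget per retained tool result

def compact(messages):
    """Shrink old tool outputs to an ID plus summary; keep reasoning verbatim."""
    head, tail = messages[:-10], messages[-10:]   # recent turns stay untouched
    compacted = []
    for msg in head:
        if msg["role"] == "tool" and len(msg["content"]) &gt; MAX_TOOL_CHARS:
            compacted.append({
                "role": "tool",
                "tool_call_id": msg["tool_call_id"],
                "content": "[compacted] " + msg["content"].splitlines()[0][:120],
            })
        else:
            compacted.append(msg)   # assistant reasoning is never summarized
    return compacted + tail
</code></pre>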
<p><strong>Real Impact:</strong> You can have a 2-hour debugging session with 500+ tool calls, and the agent never loses context or slows down.</p>
<h2>7. Multi-Channel Conversations with Shareability</h2>
<p>The agent is designed to work from wherever is convenient for you:</p>
<ul>
<li><p>Slack DMs</p>
</li>
<li><p>Thread replies to alerts</p>
</li>
<li><p>Web UI</p>
</li>
<li><p>CLI (coming soon)</p>
</li>
<li><p>API triggers</p>
</li>
<li><p>Voice calls (coming soon)</p>
</li>
</ul>
<p><strong>Seamless Sharing</strong></p>
<p>Any investigation can be:</p>
<ul>
<li><p>Shared with teammates for review</p>
</li>
<li><p>Linked in postmortems</p>
</li>
<li><p>Referenced in future incidents</p>
</li>
</ul>
<p><strong>Real Impact:</strong> When someone gets paged, they can see the auto-investigation that already ran in the Slack thread—no need to DM the agent separately.</p>
<h2>8. Automated Investigations</h2>
<p>DrDroid can run via automated triggers, giving your team proactive visibility:</p>
<ul>
<li><p>Alert fires in PagerDuty/OpsGenie → Investigation starts automatically</p>
</li>
<li><p>Cron-based health checks → Agent investigates on schedule</p>
</li>
<li><p>Custom triggers via API or webhooks</p>
</li>
</ul>
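<p>As a hedged illustration of the alert-triggered path, a minimal webhook receiver might look like the following. The investigation endpoint and the PagerDuty payload shape are assumptions to adapt, not documented DrDroid APIs:</p>
<pre><code class="lang-python">from flask import Flask, request
import requests

app = Flask(__name__)
INVESTIGATE_URL = "https://api.example.com/investigations"   # hypothetical endpoint

@app.post("/pagerduty-webhook")
def on_alert():
    event = request.get_json(force=True)
    # Start an investigation for every triggered incident (payload shape assumed).
    if event.get("event", {}).get("event_type") == "incident.triggered":
        requests.post(INVESTIGATE_URL, json={
            "source": "pagerduty",
            "incident_id": event["event"]["data"]["id"],
        }, timeout=10)
    return "", 204
</code></pre>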
<p><strong>Real Impact:</strong> The agent can detect issues even without alerts; by the time you open an alert, the agent has already investigated and summarized the likely root cause.</p>
<h2>9. Smart Model Switching (85% Cost Savings)</h2>
<p>LLMs have been commoditised, and the SOTA model is not necessarily required for every investigation. DrDroid smartly chooses between different LLMs based on investigation complexity:</p>
<ul>
<li><p>Simple tasks → Faster, cheaper models</p>
</li>
<li><p>Complex reasoning → State-of-the-art models</p>
</li>
</ul>
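<p>A minimal sketch of what such a router could look like (the heuristic and model names are placeholders; DrDroid's actual policy is internal):</p>
<pre><code class="lang-python"># Illustrative routing heuristic only; the real policy is more involved.
CHEAP, FRONTIER = "small-model", "frontier-model"   # placeholder model names

def pick_model(task: str, context_tokens: int) -&gt; str:
    simple = any(k in task.lower() for k in ("summarize", "format", "list", "fetch"))
    if simple and context_tokens &lt; 8_000:
        return CHEAP      # fast, cheap model for routine steps
    return FRONTIER       # SOTA model for multi-step root-cause reasoning
</code></pre>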
<p><strong>Real Impact:</strong> Up to 85% token savings compared to always using frontier models, with no degradation in investigation quality.</p>
<h2>10. Dedicated File System &amp; Memory</h2>
<p>Memory management for large-scale infrastructure requires a structured approach.</p>
<p>DrDroid maintains a persistent knowledge base: all context, memory, investigations, and alerts are stored and accessible:</p>
<ul>
<li><p>Agent can navigate past investigations like files</p>
</li>
<li><p>Search across all historical data</p>
</li>
<li><p>Reference previous findings instantly</p>
</li>
</ul>
<p><strong>Real Impact:</strong> "Show me all investigations related to database timeouts in the last 30 days" returns instant results.</p>
<h2>11. Coding Sub-Agent for Hotfixes</h2>
<p>A coding agent operates very differently from a production investigation agent. DrDroid comes pre-packaged with a coding agent connected to the investigation agent.</p>
<p><strong>Real-Time Coding Agent:</strong> When needed, it:</p>
<ul>
<li><p>Spins up a coding agent in an ephemeral sandbox</p>
</li>
<li><p>Reviews the full repository</p>
</li>
<li><p>Creates hotfix PRs with proper context</p>
</li>
</ul>
<p><strong>Real Impact:</strong> During an incident, the agent can say "I found the bug in payment-service line 247—here's a PR to fix it."</p>
<h2>12. Remote Machine &amp; Kubernetes Access</h2>
<p>During production incidents, data often needs to be reviewed on a remote machine or Kubernetes cluster. These systems might be inaccessible or sensitive.</p>
<p><strong>How DrDroid Solves This:</strong> Direct infrastructure access without exposing credentials to the agent</p>
<ul>
<li><p>Execute commands on remote machines via SSH (keys are not exposed to the agent)</p>
</li>
<li><p>Query read-only Kubernetes clusters directly</p>
</li>
<li><p>Access VMs and clusters within your VPC via reverse proxy</p>
</li>
</ul>
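<p>One way to picture the "keys never reach the agent" pattern: the agent asks for a named check, and an execution proxy that holds the credentials runs a pre-approved, read-only command. A minimal sketch, with the allowlist as an assumed policy:</p>
<pre><code class="lang-python">import subprocess

# The proxy, not the agent, holds the SSH credentials; the agent only names a check.
ALLOWED = {
    "disk": "df -h",
    "memory": "free -m",
    "uptime": "uptime",
}

def run_check(host: str, check: str) -&gt; str:
    """Run a pre-approved, read-only command on a remote host."""
    cmd = ALLOWED[check]   # unknown check name = request rejected
    result = subprocess.run(
        ["ssh", host, cmd],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout

# run_check("prod-api-01", "disk")
</code></pre>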
<p><strong>Real Impact:</strong> "Check disk space on prod-api-01" → Agent SSHs in, runs the command, and returns results. No manual execution needed.</p>
<h2>13. Image Support for Dashboard Analysis</h2>
<p>You might want to debug an issue with a screenshot shared by the customer as the starting point.</p>
<p><strong>How DrDroid Solves This:</strong> The DrDroid agent supports image processing from Slack or the UI. Share a screenshot of:</p>
<ul>
<li><p>Your product showing an error</p>
</li>
<li><p>A Grafana dashboard</p>
</li>
<li><p>A monitoring alert</p>
</li>
</ul>
<p>The agent analyses it and continues the investigation from there.</p>
<p><strong>Real Impact:</strong> "Here's what the user is seeing" → Agent understands the UI issue and investigates the backend cause.</p>
<h2>14. Granular Access Control &amp; RBAC</h2>
<p>Debugging production systems involves sensitive data and access management. DrDroid ensures only the right people have the right access while debugging.</p>
<p><strong>How DrDroid Solves This:</strong></p>
<ul>
<li><p><strong>Read commands:</strong> Execute without approval (safe exploration)</p>
</li>
<li><p><strong>Write commands:</strong> Require RBAC approval per your policy</p>
</li>
<li><p><strong>SSO integration:</strong> Syncs with your internal permissions</p>
</li>
<li><p><strong>Audit logs:</strong> Track who did what</p>
</li>
</ul>
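<p>The read/write split can be as simple as classifying the verb of a requested command before execution. A toy sketch, with the safe-verb list as an assumed policy:</p>
<pre><code class="lang-python"># Assumed policy: these verbs are read-only and safe to run without approval.
READ_VERBS = {"get", "describe", "logs", "top"}

def needs_approval(command: str) -&gt; bool:
    """Mutating commands queue for RBAC approval; reads execute immediately."""
    parts = command.split()
    verb = parts[1] if parts[0] == "kubectl" else parts[0]
    return verb not in READ_VERBS

# needs_approval("kubectl get pods -n prod")     -&gt; False: runs now
# needs_approval("kubectl delete pod api-7f9c")  -&gt; True: waits for an approver
</code></pre>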
<p><strong>Real Impact:</strong> Junior engineers can investigate safely, while dangerous operations require senior approval.</p>
<h2>15. Third-Party Vendor Status Tracking</h2>
<p>Production incidents are often caused, at least in part, by third-party downtime or issues.</p>
<p><strong>How DrDroid Solves This:</strong> Connected to Vendor Statuspages</p>
<ul>
<li><p>Tracks 150+ status pages for your third-party vendors: Stripe, AWS, Datadog, MongoDB Atlas, etc.</p>
</li>
<li><p>Flags when vendor issues might be causing downstream impact</p>
</li>
</ul>
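<p>Many vendor status pages are built on Atlassian Statuspage and expose a standard JSON endpoint, which makes this kind of check cheap to sketch. The per-vendor URLs below are assumptions to verify:</p>
<pre><code class="lang-python">import requests

# Statuspage-hosted pages commonly expose /api/v2/status.json (URLs assumed).
STATUS_PAGES = {
    "stripe": "https://status.stripe.com/api/v2/status.json",
    "datadog": "https://status.datadoghq.com/api/v2/status.json",
}

def vendor_status(vendor: str) -&gt; str:
    data = requests.get(STATUS_PAGES[vendor], timeout=10).json()
    return data["status"]["description"]   # e.g. "All Systems Operational"

# Correlate vendor degradation windows with your own incident timeline.
</code></pre>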
<p><strong>Real Impact:</strong> "Is this our issue or Stripe's?" → Agent checks Stripe's status page and correlates timing.</p>
<h2>16. Automated Quality Evaluation</h2>
<p>Production agents need quality guarantees that the team can track and trust.</p>
<p><strong>How DrDroid Solves This:</strong> LLM-Based Evals on Every Investigation</p>
<p>Every investigation is automatically evaluated for:</p>
<ul>
<li><p>Accuracy</p>
</li>
<li><p>Safety</p>
</li>
<li><p>Errors or hallucinations</p>
</li>
</ul>
<p>Central teams get visibility into investigation quality and improvement opportunities.</p>
<p><strong>Real Impact:</strong> Platform team can see "Investigation quality is 94% this month, down from 97% last month—let's review the low-scoring investigations."</p>
<h2>17. User Feedback &amp; Team Visibility</h2>
<p>Context within DrDroid improves continuously over time, but tracking and acting on user feedback is critical to that improvement.</p>
<p><strong>How DrDroid Solves This:</strong> Collaborative Quality Control</p>
<ul>
<li><p>Every investigation can be upvoted/downvoted</p>
</li>
<li><p>Feedback routed back to the central team helps improve agent context</p>
</li>
</ul>
<p><strong>Real Impact:</strong> The central team and managers get visibility into engineers' confidence in the AI and its impact.</p>
<h2>18. Reasoning Lifecycle &amp; Audit Trail</h2>
<p>Production incident investigations cannot be allowed to go wrong because of LLM "hallucinations" or "guesses". DrDroid ensures that every piece of reasoning and logic from the LLM is grounded in facts and data.</p>
<p><strong>How DrDroid Solves This:</strong> Transparent Investigation Path</p>
<p>The agent tracks:</p>
<ul>
<li><p>What data it queried</p>
</li>
<li><p>Why each data point was relevant</p>
</li>
<li><p>What hypothesis it built from each finding</p>
</li>
<li><p>How it reached its conclusion</p>
</li>
</ul>
<p><strong>Real Impact:</strong> You can backtrack through the investigation to validate correctness, spot gaps, or understand the agent's reasoning.</p>
<h2>Summary: Why DrDroid is Purpose-Built for Production</h2>
<table>
<thead>
<tr>
<th>Capability</th>
<th>DrDroid Investigation Agent</th>
</tr>
</thead>
<tbody><tr>
<td>Code awareness</td>
<td>Auto-discovers repos, APIs, dependencies</td>
</tr>
<tr>
<td>Infrastructure knowledge</td>
<td>Knows your K8s, cloud, databases</td>
</tr>
<tr>
<td>Log/metric analysis</td>
<td>Specialized tools for production-scale data</td>
</tr>
<tr>
<td>Memory of past incidents</td>
<td>Full history + pattern learning</td>
</tr>
<tr>
<td>Context window</td>
<td>1M+ tokens with intelligent compaction</td>
</tr>
<tr>
<td>Cost optimization</td>
<td>Smart switching, 85% savings</td>
</tr>
<tr>
<td>Permissions &amp; RBAC</td>
<td>Enterprise-grade access control</td>
</tr>
<tr>
<td>Auto-triggered investigations</td>
<td>Yes—from alerts, cron, API</td>
</tr>
<tr>
<td>Quality control</td>
<td>Automated evals + team feedback</td>
</tr>
<tr>
<td>Infrastructure execution</td>
<td>Direct SSH, K8s, API access</td>
</tr>
</tbody></table>
<h2>What This Means for Your Team</h2>
<ul>
<li><p>Every investigation starts with past context</p>
</li>
<li><p>You do not have to guide the LLM or explain your architecture every time</p>
</li>
<li><p>It can handle production-scale logs or metrics</p>
</li>
<li><p>It supports automation or proactive help</p>
</li>
<li><p>It works across different channels where your team lives</p>
</li>
</ul>
<p><strong>Ready to see the agent?</strong></p>
<p>DrDroid is a purpose-built investigation agent that understands your infrastructure, learns from your incidents, and gets smarter every day.</p>
<h2>Next steps:</h2>
<ul>
<li><p><a href="https://www.youtube.com/@DrDroidDev">Watch platform demo videos</a></p>
</li>
<li><p>Check our <a href="https://drdroid.io/integrations">MCP Servers &amp; integrations</a></p>
</li>
<li><p>Read the <a href="https://docs.drdroid.io/">documentation</a></p>
</li>
<li><p>See customer <a href="https://drdroid.io/case-studies">case studies</a></p>
</li>
</ul>
<p>Want to see how it works with your stack?</p>
<p>Setup and go-live take 1-2 hours for smaller teams and under a week for enterprises. Get started <a href="https://drdroid.io/">here</a>.</p>
]]></content:encoded></item><item><title><![CDATA[DrDroid: How AI SRE Helps Engineers who are on-call for production monitoring]]></title><description><![CDATA[Day 0 — Incident Response & Firefighting
Something is broken right now. You need answers fast.



#
Your Task
How Doctor Droid Helps You



1
"Our API latency spiked, what changed?"
Ask the agent — it]]></description><link>https://notes.drdroid.io/drdroid-how-ai-sre-helps-engineers-who-are-on-call-for-production-monitoring</link><guid isPermaLink="true">https://notes.drdroid.io/drdroid-how-ai-sre-helps-engineers-who-are-on-call-for-production-monitoring</guid><category><![CDATA[Open Source]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[observability]]></category><category><![CDATA[#AIOps]]></category><category><![CDATA[ai-sre]]></category><category><![CDATA[ai agents]]></category><dc:creator><![CDATA[Siddarth Jain]]></dc:creator><pubDate>Tue, 07 Apr 2026 10:40:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/63200bf16c86d75accc7fd61/4040cb32-0484-4703-b287-0f6dceca395d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Day 0 — Incident Response &amp; Firefighting</h2>
<p><em>Something is broken right now. You need answers fast.</em></p>
<table>
<thead>
<tr>
<th><strong>#</strong></th>
<th><strong>Your Task</strong></th>
<th><strong>How Doctor Droid Helps You</strong></th>
</tr>
</thead>
<tbody><tr>
<td>1</td>
<td><strong>"Our API latency spiked, what changed?"</strong></td>
<td>Ask the agent — it pulls recent deployments from ArgoCD/GitHub Actions, checks Grafana/Datadog metrics, and shows you what changed around the time latency spiked. No need to open 5 tabs.</td>
</tr>
<tr>
<td>2</td>
<td><strong>"Which pods are crashing in production?"</strong></td>
<td>Ask the agent — it lists failing pods, pulls their logs, shows recent K8s events, and surfaces restart counts. You get a full picture in one response.</td>
</tr>
<tr>
<td>3</td>
<td><strong>"Is the database the bottleneck?"</strong></td>
<td>Ask the agent — it runs slow query analysis on your Postgres/MySQL, checks connection pool usage, and correlates with application error rates from Datadog or New Relic.</td>
</tr>
<tr>
<td>4</td>
<td><strong>"I got paged, what's actually going on?"</strong></td>
<td>The agent auto-investigates when an alert fires. By the time you open it, there's already a summary with metrics, logs, and likely root cause pulled from your connected sources.</td>
</tr>
<tr>
<td>5</td>
<td><strong>"Are other services affected too?"</strong></td>
<td>Ask the agent — it checks health across your connected sources (Grafana, CloudWatch, K8s, Datadog) and tells you which services are degraded vs healthy.</td>
</tr>
<tr>
<td>6</td>
<td><strong>"I need to check CloudWatch logs for this error"</strong></td>
<td>Tell the agent the error pattern and time range — it queries CloudWatch Logs, Loki, or Elasticsearch directly and returns matching entries. No console login needed.</td>
</tr>
<tr>
<td>7</td>
<td><strong>"Run this PromQL/NRQL query for me"</strong></td>
<td>Give the agent your query — it executes against Prometheus, Grafana, New Relic, or Datadog and returns results inline. Great for quick checks during incidents.</td>
</tr>
<tr>
<td>8</td>
<td><strong>"SSH into the box and check disk space"</strong></td>
<td>Tell the agent the command — it runs Bash commands on remote hosts via SSH and returns output. No need to find the SSH key or remember the hostname. Works even when you don’t have laptop access.</td>
</tr>
<tr>
<td>9</td>
<td><strong>"Notify the team on Slack about this outage"</strong></td>
<td>Tell the agent what to post and where — it sends a formatted message to your Slack channel with the context you provide.</td>
</tr>
<tr>
<td>10</td>
<td><strong>"Escalate this to PagerDuty"</strong></td>
<td>The agent creates or escalates a PagerDuty/OpsGenie incident based on the investigation findings. You don't need to context-switch to the PagerDuty UI.</td>
</tr>
<tr>
<td>11</td>
<td><strong>"Create a JIRA ticket for the post-mortem"</strong></td>
<td>Tell the agent the summary — it creates a JIRA ticket with the incident details, investigation findings, and relevant links.</td>
</tr>
<tr>
<td>12</td>
<td><strong>"Did the last deploy cause this?"</strong></td>
<td>Ask the agent — it checks the latest GitHub PR merges, Jenkins builds, ArgoCD sync status, and correlates timestamps with when the issue started.</td>
</tr>
<tr>
<td>13</td>
<td><strong>"Roll back the deployment"</strong></td>
<td>Tell the agent to trigger a rollback pipeline — it kicks off the Jenkins build or GitHub Actions workflow you specify.</td>
</tr>
<tr>
<td>14</td>
<td><strong>"Check if this API endpoint is responding"</strong></td>
<td>Give the agent the URL — it makes an HTTP call and tells you the status code, response time, and body. Works for any internal API.</td>
</tr>
</tbody></table>
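<p>Several of the tasks above (rows 6 and 7 in particular) reduce to a single API call under the hood. As a rough illustration of row 6, here is what a CloudWatch Logs search looks like with boto3; the log group name is a placeholder:</p>
<pre><code class="lang-python">import time
import boto3

logs = boto3.client("logs")

def recent_errors(log_group: str, pattern: str = "ERROR", minutes: int = 60):
    """Fetch matching entries from CloudWatch Logs for the last N minutes."""
    now_ms = int(time.time() * 1000)
    resp = logs.filter_log_events(
        logGroupName=log_group,
        filterPattern=pattern,
        startTime=now_ms - minutes * 60_000,
        endTime=now_ms,
    )
    return [e["message"] for e in resp["events"]]

# recent_errors("/ecs/payment-service")   # log group name is a placeholder
</code></pre>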
<h2>Day 1 — Operational Tasks &amp; Maintenance</h2>
<p><em>No fire, but you need to keep things running smoothly.</em></p>
<table>
<thead>
<tr>
<th><strong>#</strong></th>
<th><strong>Your Task</strong></th>
<th><strong>How Doctor Droid Helps You</strong></th>
</tr>
</thead>
<tbody><tr>
<td>1</td>
<td><strong>"What's the current state of our K8s cluster?"</strong></td>
<td>Ask the agent — it shows pod status across namespaces, node resource usage, recent events, and any pods in CrashLoopBackOff or Pending state.</td>
</tr>
<tr>
<td>2</td>
<td><strong>"Are all our data sources healthy?"</strong></td>
<td>The agent tests connectivity to every configured connector every 10 seconds. Ask it for the current status, or set up Slack alerts for failures.</td>
</tr>
<tr>
<td>3</td>
<td><strong>"Show me all Grafana dashboards we have"</strong></td>
<td>Ask the agent — it returns the full inventory of dashboards, datasources, and folders it auto-discovered. Same for Datadog monitors, K8s resources, DB schemas, etc.</td>
</tr>
<tr>
<td>4</td>
<td><strong>"How big are our database tables getting?"</strong></td>
<td>Ask the agent to run a table size query on your Postgres/MySQL/ClickHouse — it returns sizes, row counts, and index usage.</td>
</tr>
<tr>
<td>5</td>
<td><strong>"Any slow queries running right now?"</strong></td>
<td>Ask the agent — it checks pg_stat_activity or equivalent on your database and shows long-running queries with their duration and state.</td>
</tr>
<tr>
<td>6</td>
<td><strong>"Check if our Jenkins pipelines are green"</strong></td>
<td>Ask the agent — it pulls recent build status from Jenkins and tells you which jobs passed, failed, or are stuck.</td>
</tr>
<tr>
<td>7</td>
<td><strong>"Is our ArgoCD app in sync?"</strong></td>
<td>Ask the agent — it checks sync status and health for your ArgoCD applications. Flags any that are out-of-sync or degraded.</td>
</tr>
<tr>
<td>8</td>
<td><strong>"Pull logs from the payment service for the last hour"</strong></td>
<td>Tell the agent the service and time range — it queries Loki, CloudWatch, or Elasticsearch and returns the logs. No need to remember log group names.</td>
</tr>
<tr>
<td>9</td>
<td><strong>"What GitHub PRs were merged today?"</strong></td>
<td>Ask the agent — it queries your GitHub repos and lists merged PRs with authors, titles, and timestamps.</td>
</tr>
<tr>
<td>10</td>
<td><strong>"Trigger a build for the staging environment"</strong></td>
<td>Tell the agent which Jenkins job or GitHub Actions workflow to run — it triggers it and reports back the status.</td>
</tr>
<tr>
<td>11</td>
<td><strong>"Send a daily cluster health report to Slack"</strong></td>
<td>Set up a scheduled playbook — the agent runs K8s health checks daily and posts a summary to your Slack channel automatically.</td>
</tr>
<tr>
<td>12</td>
<td><strong>"Check MongoDB replica set health"</strong></td>
<td>Ask the agent — it runs the appropriate commands against your MongoDB instance and returns replica set status and lag.</td>
</tr>
<tr>
<td>13</td>
<td><strong>"List all CloudWatch alarms in ALARM state"</strong></td>
<td>Ask the agent — it queries CloudWatch and returns currently firing alarms with their metric details and thresholds.</td>
</tr>
<tr>
<td>14</td>
<td><strong>"What Datadog monitors are in alert?"</strong></td>
<td>Ask the agent — it checks your Datadog monitors and lists any that are currently alerting or warning.</td>
</tr>
<tr>
<td>15</td>
<td><strong>"Run this custom SQL query on production"</strong></td>
<td>Give the agent the query — it executes against your connected Postgres, MySQL, ClickHouse, or BigQuery and returns results as a table.</td>
</tr>
<tr>
<td>16</td>
<td><strong>"Check disk and memory on our VMs"</strong></td>
<td>Tell the agent to run <code>df -h</code> and <code>free -m</code> via Bash — it SSHs in and returns the output.</td>
</tr>
<tr>
<td>17</td>
<td><strong>"Update this JIRA ticket with today's progress"</strong></td>
<td>Tell the agent the ticket ID and comment — it adds the update to JIRA without you opening the browser.</td>
</tr>
<tr>
<td>18</td>
<td><strong>"What new resources were created in our K8s cluster this week?"</strong></td>
<td>Ask the agent — it compares the current asset inventory with the previous discovery and highlights new pods, services, or deployments.</td>
</tr>
</tbody></table>
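<p>As an example of what sits behind row 5, a slow-query check on Postgres is one query against <code>pg_stat_activity</code>. A minimal sketch (the connection string is a placeholder):</p>
<pre><code class="lang-python">import psycopg2

QUERY = """
SELECT pid, now() - query_start AS duration, state, left(query, 80)
FROM pg_stat_activity
WHERE state != 'idle' AND query_start IS NOT NULL
ORDER BY duration DESC
LIMIT 10;
"""

# DSN is hypothetical; use a read-only role for this kind of check.
with psycopg2.connect("postgresql://readonly@db-host/app") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for pid, duration, state, query in cur.fetchall():
            print(pid, duration, state, query)
</code></pre>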
<hr />
<h2>Day 2 — Automation, Optimization &amp; Reliability</h2>
<p><em>You want to stop doing repetitive things and build resilience.</em></p>
<table>
<thead>
<tr>
<th><strong>#</strong></th>
<th><strong>Your Task</strong></th>
<th><strong>How Doctor Droid Helps You</strong></th>
</tr>
</thead>
<tbody><tr>
<td>1</td>
<td><strong>"Automate our incident response runbook"</strong></td>
<td>Build a runbook as per your internal process: when alert fires → pull metrics from Grafana → check K8s pods → query DB → post findings to Slack. Runs automatically on every alert.</td>
</tr>
<tr>
<td>2</td>
<td><strong>"Auto-restart pods when OOM detected"</strong></td>
<td>Set up a workflow: if K8s event shows OOMKilled → restart the pod → notify on Slack → log to JIRA. No human in the loop.</td>
</tr>
<tr>
<td>3</td>
<td><strong>"Scale up when CPU crosses 80%"</strong></td>
<td>Create a conditional workflow: check CPU metrics from Prometheus/CloudWatch → if above threshold → trigger scaling via K8s or API call → confirm on Slack.</td>
</tr>
<tr>
<td>4</td>
<td><strong>"Track our SLOs across services"</strong></td>
<td>Set up scheduled playbooks that query Prometheus or Datadog for error rate and latency → calculate against SLO targets → post weekly reports to Slack.</td>
</tr>
<tr>
<td>5</td>
<td><strong>"Alert me before we hit our error budget"</strong></td>
<td>Build a workflow that checks error budget consumption daily → if &gt; 80% consumed → post warning to Slack and create a JIRA ticket.</td>
</tr>
<tr>
<td>6</td>
<td><strong>"Correlate deploys with performance changes"</strong></td>
<td>Set up a playbook that triggers after every deploy → compares pre/post metrics from Grafana/Datadog → flags regressions automatically.</td>
</tr>
<tr>
<td>7</td>
<td><strong>"Stop SSHing into boxes for the same checks"</strong></td>
<td>Convert your common SSH commands into playbooks — disk space, process checks, log tailing. Run them from Doctor Droid with one click or on schedule.</td>
</tr>
<tr>
<td>8</td>
<td><strong>"I keep opening 5 dashboards for the same investigation"</strong></td>
<td>Build a single playbook that queries all 5 sources and gives you a combined view. Next time, ask the agent instead of opening dashboards.</td>
</tr>
<tr>
<td>9</td>
<td><strong>"Validate infrastructure after every Terraform apply"</strong></td>
<td>Set up a post-deploy playbook: check K8s resources → verify CloudWatch alarms exist → test endpoints → report pass/fail.</td>
</tr>
<tr>
<td>10</td>
<td><strong>"Audit our K8s RBAC and network policies weekly"</strong></td>
<td>Schedule a playbook that lists RBAC bindings and network policies → compares against expected state → flags drift on Slack.</td>
</tr>
<tr>
<td>11</td>
<td><strong>"Auto-create a JIRA ticket when a deploy fails"</strong></td>
<td>Build a workflow: monitor Jenkins/GitHub Actions → if build fails → create JIRA ticket with build logs and assign to the team.</td>
</tr>
<tr>
<td>12</td>
<td><strong>"Map out which services talk to which"</strong></td>
<td>Enable Network Mapper — it discovers service-to-service communication in K8s and shows you the dependency graph.</td>
</tr>
<tr>
<td>13</td>
<td><strong>"Check if our Datadog monitors match our runbook"</strong></td>
<td>Ask the agent to list all Datadog monitors — compare with your documented expectations. Set this up as a weekly drift check.</td>
</tr>
<tr>
<td>14</td>
<td><strong>"Reduce alert fatigue for the team"</strong></td>
<td>Use alert grouping and conditional workflows — similar alerts get grouped, only actionable ones reach Slack/PagerDuty.</td>
</tr>
<tr>
<td>15</td>
<td><strong>"Automate capacity reporting for leadership"</strong></td>
<td>Schedule a monthly playbook: query CloudWatch/K8s for resource trends → query BigQuery for cost data → format and send via email or Slack.</td>
</tr>
<tr>
<td>16</td>
<td><strong>"Clear application cache when memory exceeds threshold"</strong></td>
<td>Build a workflow: check memory metrics → if above limit → make API call to flush cache → verify memory dropped → notify on Slack.</td>
</tr>
<tr>
<td>17</td>
<td><strong>"Run synthetic health checks every 5 minutes"</strong></td>
<td>Schedule a playbook that hits your critical endpoints via HTTP, checks response codes and latency, and alerts on Slack if anything degrades.</td>
</tr>
<tr>
<td>18</td>
<td><strong>"Onboard new services faster"</strong></td>
<td>When a new service deploys, the agent auto-discovers it in K8s, finds its Grafana dashboards and Datadog monitors, and catalogs everything. You see it in your inventory immediately.</td>
</tr>
</tbody></table>
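<p>To make row 17 concrete, a synthetic check is just an HTTP call plus a status and latency assertion, run on a schedule. A minimal sketch with placeholder endpoints:</p>
<pre><code class="lang-python">import requests

ENDPOINTS = ["https://api.example.com/healthz"]   # placeholder critical endpoints
LATENCY_BUDGET_S = 0.5

def check():
    failures = []
    for url in ENDPOINTS:
        try:
            r = requests.get(url, timeout=5)
            if r.status_code != 200 or r.elapsed.total_seconds() &gt; LATENCY_BUDGET_S:
                failures.append((url, r.status_code, r.elapsed.total_seconds()))
        except requests.RequestException as exc:
            failures.append((url, "unreachable", str(exc)))
    return failures   # non-empty result: post an alert to Slack

# Run from a scheduler every 5 minutes; alert only on failures.
</code></pre>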
<h2><strong>What's Next?</strong></h2>
<p>DrDroid helps your team move from <strong>reactive firefighting</strong> to <strong>proactive operations</strong>. Whether you're debugging an incident at 3 AM or building automation to prevent the next one, DrDroid becomes your team's operational co-pilot.</p>
<p><strong>Ready to see how DrDroid works with your stack?</strong></p>
<ul>
<li><p><a href="https://www.youtube.com/@DrDroidDev">Watch platform demo videos</a></p>
</li>
<li><p><a href="https://drdroid.io/integrations">Check our integrations</a></p>
</li>
<li><p><a href="https://docs.drdroid.io/">Read the documentation</a></p>
</li>
<li><p><a href="https://drdroid.io/case-studies">See customer case studies</a></p>
</li>
</ul>
<p><strong>Want to try it out?</strong> Setup takes 1-2 hours and you'll see value from your first investigation. <a href="https://drdroid.io/">Get started here</a>.</p>
]]></content:encoded></item><item><title><![CDATA[KubeCon + CloudNativeCon Europe 2026 Guide – Amsterdam
]]></title><description><![CDATA[Agenda Strategy, Tracks, Networking & SRE Playbook
UpdatedMarch 2026 • 6 min read

On this page

Doctor Droid’s Guide to KubeCon + CloudNativeCon Europe 2026

Overview

Event Schedule

Who Should Atte]]></description><link>https://notes.drdroid.io/kubecon-cloudnativecon-europe-2026-guide-amsterdam</link><guid isPermaLink="true">https://notes.drdroid.io/kubecon-cloudnativecon-europe-2026-guide-amsterdam</guid><category><![CDATA[kubeconeurope]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Karan Sirohi]]></dc:creator><pubDate>Thu, 19 Mar 2026 06:59:28 GMT</pubDate><content:encoded><![CDATA[<p>Agenda Strategy, Tracks, Networking &amp; SRE Playbook</p>
<p><strong>Updated</strong><br />March 2026 • 6 min read</p>
<hr />
<h2>On this page</h2>
<ul>
<li><p>Doctor Droid’s Guide to KubeCon + CloudNativeCon Europe 2026</p>
</li>
<li><p>Overview</p>
</li>
<li><p>Event Schedule</p>
</li>
<li><p>Who Should Attend (and Who Can Skip)</p>
</li>
<li><p>How to Build Your Agenda (Without Overloading Yourself)</p>
</li>
<li><p>Track Strategy for SRE &amp; Platform Teams</p>
</li>
<li><p>Co-located Events: Where Specialists Get Leverage</p>
</li>
<li><p>Solutions Showcase: How to Avoid Vendor Fatigue</p>
</li>
<li><p>Networking Strategy for Engineering Leaders</p>
</li>
<li><p>Amsterdam Logistics Checklist</p>
</li>
<li><p>Post-Conference Execution Plan</p>
</li>
<li><p>Visit the Doctor Droid Booth</p>
</li>
</ul>
<hr />
<h2>Doctor Droid’s Guide to KubeCon + CloudNativeCon Europe 2026</h2>
<img src="https://cdn.hashnode.com/uploads/covers/66c6f53299c4280b93ee6b32/fda0dafb-b20e-4d4c-9e80-437d720ec783.png" alt="" style="display:block;margin:0 auto" />

<p>KubeCon + CloudNativeCon Europe is heading to <strong>Amsterdam, Netherlands, from 23–26 March 2026</strong>.</p>
<p>If you’re on <strong>platform engineering, SRE, DevOps, or cloud architecture teams</strong>, this event is still one of the best places to <strong>compress a year of learning into four days</strong>.</p>
<p>This guide is for <strong>practitioners who want outcomes, not conference FOMO</strong>:</p>
<ul>
<li><p>what to prioritize</p>
</li>
<li><p>how to choose tracks</p>
</li>
<li><p>how to plan networking</p>
</li>
<li><p>and how to convert conference notes into production improvements</p>
</li>
</ul>
<hr />
<h2>Overview</h2>
<p>KubeCon + CloudNativeCon is the <strong>flagship conference organized by the Cloud Native Computing Foundation (CNCF)</strong>. It gathers thousands of engineers, maintainers, and infrastructure leaders working on Kubernetes and the broader cloud-native ecosystem.</p>
<p>The event typically features:</p>
<ul>
<li><p>Major Kubernetes and CNCF ecosystem announcements</p>
</li>
<li><p>Technical deep-dives from engineers running large-scale production systems</p>
</li>
<li><p>Co-located events focused on specialized technologies</p>
</li>
<li><p>The <strong>Solutions Showcase</strong>, where cloud-native vendors demonstrate tooling across observability, security, infrastructure automation, and platform engineering.</p>
</li>
</ul>
<p>For engineering teams operating production Kubernetes environments, <strong>KubeCon acts as a yearly checkpoint for infrastructure strategy</strong>.</p>
<hr />
<h2>Event Schedule</h2>
<p>According to the Linux Foundation event page, <strong>KubeCon + CloudNativeCon Europe 2026</strong> is scheduled in Amsterdam from <strong>23–26 March 2026</strong>.</p>
<p>Expected structure:</p>
<p><strong>Day 0 (Monday)</strong><br />Pre-event programming and co-located events</p>
<p><strong>Days 1–3 (Tuesday–Thursday)</strong><br />Keynotes, breakout sessions, and Solutions Showcase</p>
<p>If you’re traveling internationally, plan to <strong>arrive by Sunday evening</strong> so you can still attend Monday’s co-located tracks.</p>
<p>These smaller events often contain some of the <strong>most advanced implementation discussions</strong> of the entire conference.</p>
<hr />
<h2>Who Should Attend (and Who Can Skip)</h2>
<h3>Attend if your team is currently dealing with:</h3>
<ul>
<li><p>Reliability issues across Kubernetes clusters</p>
</li>
<li><p>Scaling bottlenecks (multi-cluster, noisy neighbors, cost/perf tradeoffs)</p>
</li>
<li><p>Incident response gaps between alerts and root cause</p>
</li>
<li><p>AI adoption in platform engineering workflows</p>
</li>
<li><p>Security or compliance friction in cloud-native stacks</p>
</li>
</ul>
<h3>You can skip (or send one delegate) if:</h3>
<ul>
<li><p>your stack is stable and not evolving</p>
</li>
<li><p>you’re not planning infrastructure changes in the next <strong>6–12 months</strong></p>
</li>
<li><p>or you mainly need vendor procurement meetings</p>
</li>
</ul>
<p><strong>KubeCon ROI is highest when your team has active infrastructure pain and concrete architecture decisions pending.</strong></p>
<hr />
<h2>How to Build Your Agenda (Without Overloading Yourself)</h2>
<p>Most engineers fail KubeCon by <strong>overbooking sessions</strong>.</p>
<p>A better approach:</p>
<h3>1) Define 2–3 decision themes before the event</h3>
<p>Examples:</p>
<ul>
<li><p>“Should we move from <strong>single-cluster to multi-cluster</strong> for resilience?”</p>
</li>
<li><p>“How should we <strong>harden runtime security without killing developer speed</strong>?”</p>
</li>
<li><p>“Where can <strong>AI actually reduce MTTR</strong> in incident response?”</p>
</li>
</ul>
<p>Use these themes to <strong>filter sessions</strong>.</p>
<p>If a talk doesn’t help a <strong>current architecture decision</strong>, skip it.</p>
<h3>2) Split your day into three buckets</h3>
<p>Each day should include:</p>
<ul>
<li><p><strong>One depth session</strong> (deep technical talk)</p>
</li>
<li><p><strong>One trend session</strong> (strategy or ecosystem direction)</p>
</li>
<li><p><strong>One practical session</strong> (case study with production lessons)</p>
</li>
</ul>
<p>This avoids the common trap of attending <strong>100% hype talks or 100% dense internals</strong>.</p>
<h3>3) Reserve white space</h3>
<p>Keep at least <strong>90 minutes per day unscheduled</strong>.</p>
<p>Some of the <strong>highest-signal learning at KubeCon comes from hallway conversations</strong>, not slides.</p>
<hr />
<h2>Track Strategy for SRE &amp; Platform Teams</h2>
<p>If your mandate is <strong>reliability + developer speed</strong>, these themes should be top priority.</p>
<h3>Reliability engineering in Kubernetes</h3>
<p>Look for sessions covering:</p>
<ul>
<li><p>failure domains and blast radius control</p>
</li>
<li><p>progressive delivery and rollback safety</p>
</li>
<li><p>SLO design in distributed systems</p>
</li>
<li><p>incident retrospectives with architecture changes</p>
</li>
</ul>
<hr />
<h3>AI for operations (without magic claims)</h3>
<p>Prioritize talks that show:</p>
<ul>
<li><p>real workflows (alert triage, runbook assistance, anomaly explanation)</p>
</li>
<li><p>measurable outcomes (MTTR reduction, false positive reduction, toil reduction)</p>
</li>
<li><p>guardrails like human-in-the-loop systems and auditability</p>
</li>
</ul>
<hr />
<h3>Scaling and performance</h3>
<p>Focus on talks about:</p>
<ul>
<li><p>control-plane scaling</p>
</li>
<li><p>multi-tenancy isolation</p>
</li>
<li><p>workload scheduling optimization</p>
</li>
<li><p>cost/performance tuning with real production metrics</p>
</li>
</ul>
<hr />
<h3>Security and policy at scale</h3>
<p>Strong sessions usually cover:</p>
<ul>
<li><p>software supply chain controls</p>
</li>
<li><p>runtime policy enforcement</p>
</li>
<li><p>identity and secrets management patterns</p>
</li>
<li><p>tradeoffs between security and developer experience</p>
</li>
</ul>
<hr />
<h2>Co-located Events: Where Specialists Get Leverage</h2>
<p>Monday co-located events are often where <strong>advanced implementation patterns surface early</strong>.</p>
<p>If your team has niche challenges around:</p>
<ul>
<li><p>service meshes</p>
</li>
<li><p>observability pipelines</p>
</li>
<li><p>policy engines</p>
</li>
<li><p>platform APIs</p>
</li>
</ul>
<p>these tracks may provide <strong>better signal than general keynotes</strong>.</p>
<p>Recommendation:</p>
<p>Send <strong>at least one engineer</strong> to co-located sessions and have them summarize the takeaways for your team.</p>
<hr />
<h2>Solutions Showcase: How to Avoid Vendor Fatigue</h2>
<h3>The expo floor can become overwhelming quickly.</h3>
<p>Treat it like a <strong>technical discovery sprint</strong>.</p>
<p>Before visiting booths, define your constraints:</p>
<ul>
<li><p>existing observability stack</p>
</li>
<li><p>data residency and compliance requirements</p>
</li>
<li><p>budget range</p>
</li>
<li><p>integration requirements (Datadog, Grafana, PagerDuty, CloudWatch, Slack, etc.)</p>
</li>
</ul>
<hr />
<h3>Ask every vendor the same 5 questions</h3>
<ol>
<li><p>What production scale do your reference customers run?</p>
</li>
<li><p>What does deployment look like in <strong>week one</strong>?</p>
</li>
<li><p>What are the <strong>common failure modes</strong> of your product?</p>
</li>
<li><p>How do you integrate with existing <strong>incident workflows</strong>?</p>
</li>
<li><p>What metrics improve in <strong>30 / 60 / 90 days</strong>?</p>
</li>
</ol>
<p>If answers remain high-level, move on.</p>
<hr />
<h2>Networking Strategy for Engineering Leaders</h2>
<p>Skip generic networking.</p>
<p>Optimize for <strong>targeted conversations</strong>.</p>
<p>Try to meet:</p>
<ul>
<li><p><strong>3 peers running similar Kubernetes scale</strong></p>
</li>
<li><p><strong>2 teams that recently migrated tooling you’re evaluating</strong></p>
</li>
<li><p><strong>2 maintainers from critical OSS dependencies</strong></p>
</li>
</ul>
<hr />
<h3>Questions worth asking</h3>
<ul>
<li><p>“What broke after rollout?”</p>
</li>
<li><p>“What did you underestimate?”</p>
</li>
<li><p>“What would you do differently in year two?”</p>
</li>
</ul>
<p>Answers to these questions can save <strong>months of trial and error</strong>.</p>
<hr />
<h2>Planning Your Amsterdam Experience</h2>
<h3>Where to Stay</h3>
<p>Amsterdam offers many accommodation options close to the event venue.</p>
<p>Options include:</p>
<ul>
<li><p>Hotels near the conference center</p>
</li>
<li><p>Short-term apartment rentals via Airbnb or <a href="http://Booking.com">Booking.com</a></p>
</li>
<li><p>Budget hostels for solo travelers</p>
</li>
</ul>
<p>Booking early is recommended because <strong>KubeCon events tend to sell out nearby hotels quickly</strong>.</p>
<hr />
<h3>Getting There</h3>
<p>Amsterdam is served by <strong>Amsterdam Schiphol Airport (AMS)</strong>, one of Europe’s largest and most connected airports.</p>
<p>From Schiphol Airport:</p>
<ul>
<li><p>Direct trains connect to <strong>Amsterdam Central Station</strong></p>
</li>
<li><p>Metro and tram networks provide quick access to most parts of the city</p>
</li>
</ul>
<p>Public transit is generally the easiest way to move around during the conference.</p>
<hr />
<h3>Visiting Amsterdam</h3>
<p>If you have extra time, Amsterdam offers plenty to explore:</p>
<ul>
<li><p><strong>Rijksmuseum</strong> – Dutch art and history</p>
</li>
<li><p><strong>Anne Frank House</strong> – historic museum and cultural landmark</p>
</li>
<li><p><strong>Canal boat tours</strong> through the historic city center</p>
</li>
<li><p><strong>Jordaan district</strong> for cafés and restaurants</p>
</li>
</ul>
<p>Evening community meetups during KubeCon are often hosted across the city.</p>
<hr />
<h2>Amsterdam Logistics Checklist</h2>
<p>Practical tips for conference week:</p>
<ul>
<li><p>Arrive <strong>at least one day early</strong> for registration and timezone adjustment</p>
</li>
<li><p>Stay near the venue or along a <strong>direct transit line</strong></p>
</li>
<li><p>Keep evening slots free for <strong>community meetups</strong></p>
</li>
<li><p>Carry a lightweight <strong>note template</strong> for each session</p>
</li>
</ul>
<p>Suggested note template:</p>
<ul>
<li><p>Problem addressed</p>
</li>
<li><p>Architecture pattern used</p>
</li>
<li><p>Scale context</p>
</li>
<li><p>Results or metrics</p>
</li>
<li><p>Relevance to your environment</p>
</li>
</ul>
<hr />
<h2>Post-Conference Execution Plan (The Part That Matters)</h2>
<p>Conference ROI is realized <strong>after you get back</strong>.</p>
<h3>Within 72 hours</h3>
<ol>
<li><p>Consolidate notes into themes (reliability, AI, scaling, security).</p>
</li>
<li><p>Rank ideas by <strong>effort vs impact</strong>.</p>
</li>
<li><p>Choose <strong>2 quick wins and 1 strategic bet</strong>.</p>
</li>
</ol>
<h3>Within two weeks</h3>
<ul>
<li><p>Run one <strong>architecture review</strong> based on KubeCon learnings</p>
</li>
<li><p>Launch one <strong>pilot experiment</strong> with explicit success metrics</p>
</li>
<li><p>Share an internal write-up:</p>
</li>
</ul>
<p><strong>“What we learned, what we’re changing, expected impact.”</strong></p>
<p>Without this step, even great conference insights <strong>decay quickly</strong>.</p>
<hr />
<h2>Visit the Doctor Droid Booth</h2>
<p>Doctor Droid is the <strong>AI-powered Slack bot for faster incident diagnosis</strong>.</p>
<p>It helps engineering teams <strong>identify the root cause of production issues automatically</strong> by analyzing alerts, logs, and system signals.</p>
<p>At KubeCon + CloudNativeCon Europe 2026, stop by the <strong>Doctor Droid booth</strong> to:</p>
<ul>
<li><p>See <strong>live demos</strong> of AI-driven incident investigation</p>
</li>
<li><p>Explore how teams reduce <strong>MTTR and alert fatigue</strong></p>
</li>
<li><p>Grab <strong>exclusive Doctor Droid swag and giveaways</strong></p>
</li>
</ul>
<p>You can also <strong>schedule a one-on-one demo</strong> with our team:</p>
<p><a href="https://calendly.com/siddarthjain/doctor-droid-discovery-call">https://calendly.com/siddarthjain/doctor-droid-discovery-call</a></p>
<hr />
<h2>Get Early Access &amp; Updates</h2>
<p>If you're interested in <strong>Doctor Droid demos, credits, or updates during KubeCon</strong>, sign up here:</p>
<p><a href="https://forms.gle/hPzaMa4YRqDLzHZg6">https://forms.gle/hPzaMa4YRqDLzHZg6</a></p>
<p>[P.S. - You get 20% off on tickets as well :) ]</p>
<hr />
<h3>Source Note</h3>
<p>Event date and high-level schedule are based on the official Linux Foundation <strong>KubeCon + CloudNativeCon Europe event page</strong> (accessed March 2026).</p>
]]></content:encoded></item><item><title><![CDATA[Backtesting AI Agents: How SRE Teams Prove Reliability Before Production
]]></title><description><![CDATA[AI agents are finally showing up inside real incident workflows. One agent triages alerts, another scrapes dashboards, a third drafts the remediation plan. Yet 62% of organizations experimenting with ]]></description><link>https://notes.drdroid.io/backtesting-ai-agents-how-sre-teams-prove-reliability-before-production</link><guid isPermaLink="true">https://notes.drdroid.io/backtesting-ai-agents-how-sre-teams-prove-reliability-before-production</guid><category><![CDATA[Devops]]></category><category><![CDATA[AI]]></category><category><![CDATA[software development]]></category><category><![CDATA[ai agents]]></category><category><![CDATA[mlops]]></category><category><![CDATA[#AIOps]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Karan Sirohi]]></dc:creator><pubDate>Thu, 19 Mar 2026 06:58:37 GMT</pubDate><content:encoded><![CDATA[<img src="https://cdn.hashnode.com/uploads/covers/66c6f53299c4280b93ee6b32/6b378f15-ced5-40d2-a495-ceaf121293f3.png" alt="" style="display:block;margin:0 auto" />

<p>AI agents are finally showing up inside real incident workflows. One agent triages alerts, another scrapes dashboards, a third drafts the remediation plan. Yet 62% of organizations experimenting with agents admit they still cannot run them reliably in production because demos rarely expose variance, safety, or cost failures (<a href="https://www.codebridge.tech/articles/ai-agent-evaluation-how-to-measure-reliability-risk-and-roi-before-scaling">Codebridge</a>).<br />Backtesting is how SRE teams close that gap. Instead of “let’s ship and see,” you treat agents like a new microservice: define reliability budgets, hammer them with synthetic and real traces, and fail the build until you trust every path.</p>
<p>This guide shows how to build an AI-agent backtesting program that mirrors load testing for infrastructure. It leans on Codebridge’s reliability dimensions, the AI Reliability Institute’s 30-point checklist, modern agent-observability stacks, and DrDroid’s native context graph plus guardrail center.</p>
<hr />
<h2>1. The wake-up call: why AI agents need pass^k reliability</h2>
<p>A single happy-path demo is meaningless when the production pager expects deterministic success. Codebridge’s recent survey highlights the reliability delta clearly:</p>
<ul>
<li><p><strong>Prototype bias.</strong> Teams measure whether a workflow completes once, under ideal prompts, then extrapolate to production. In reality, single-run success rates of 60% often translate to only 25% full consistency when you rerun the same scenario 10+ times (<a href="https://www.codebridge.tech/articles/ai-agent-evaluation-how-to-measure-reliability-risk-and-roi-before-scaling">Codebridge</a>).</p>
</li>
<li><p><strong>Cost spikes hide in the tail.</strong> Architectures like Reflexion or self-reflection loops inflate token usage by 5.12× for marginal accuracy gains; without cost-normalized evaluation you do not see the runaway invoice until after launch (same source).</p>
</li>
<li><p><strong>Trust is earned, not promised.</strong> Venture teams Codebridge interviewed said more than 70% of execs only greenlight broader automation once they see formal evidence of safety controls, loop detection, and kill switches.</p>
</li>
</ul>
<p>Treat agent validation like you treat capacity planning. Define agent SLOs (Mean Time to Context, Agent-Assisted MTTR, Unauthorized Action Budget). Require <strong>pass^k</strong> (all trials succeed) instead of <strong>pass@k</strong> (one success out of many). Every failed attempt becomes a regression test before the agent is allowed anywhere near the on-call rotation.</p>
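<p>The difference between the two metrics is easy to state in code. A minimal sketch:</p>
<pre><code class="lang-python">def pass_at_k(trials: list) -&gt; bool:
    """pass@k: at least one of k trials succeeded. Flatters flaky agents."""
    return any(trials)

def pass_pow_k(trials: list) -&gt; bool:
    """pass^k: every one of k trials succeeded. What the pager actually needs."""
    return all(trials)

runs = [True, True, False, True, True]            # five reruns of one golden scenario
assert pass_at_k(runs) and not pass_pow_k(runs)   # demo-grade, not prod-grade
</code></pre>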
<hr />
<h2>2. Five reliability dimensions to measure every run against</h2>
<p>Codebridge frames reliability as a system property, not just “accuracy.” Their five dimensions map cleanly to the levers SRE teams already manage:</p>
<table>
<thead>
<tr>
<th>Dimension</th>
<th>What to Measure</th>
<th>Example Metrics</th>
<th>Suggested Threshold</th>
</tr>
</thead>
<tbody><tr>
<td>Consistency</td>
<td>Does the agent behave the same across repeated runs of the same scenario?</td>
<td>pass^k reliability, variance in token usage, tool-call ordering stability</td>
<td>≥95% success across 20 runs</td>
</tr>
<tr>
<td>Robustness</td>
<td>Can the agent handle noisy inputs or environmental changes?</td>
<td>Prompt perturbation success rate, tolerance to tool schema drift, retry recovery rate</td>
<td>≥90% success under perturbations</td>
</tr>
<tr>
<td>Predictability</td>
<td>Can the agent estimate when it might fail?</td>
<td>Confidence calibration vs actual success, Brier score, refusal rate when uncertain</td>
<td>Brier score &lt;0.2</td>
</tr>
<tr>
<td>Safety</td>
<td>Does the agent stay within defined policy and permission boundaries?</td>
<td>Policy violation rate, unauthorized tool calls, severity-weighted harm score</td>
<td>0 critical violations</td>
</tr>
<tr>
<td>Infrastructure &amp; Cost Stability</td>
<td>Are compute and tool usage bounded and predictable?</td>
<td>Token usage variance, reasoning step count, tool retry loops, cost per session</td>
<td>&lt;30% cost variance per run</td>
</tr>
</tbody></table>
<p>Backtesting should emit metrics for each dimension. Examples:</p>
<ul>
<li><p><strong>Consistency:</strong> For every golden scenario, run 20 Monte Carlo trials. Alert if success &lt;95% or if token usage swings &gt;30% between runs.</p>
</li>
<li><p><strong>Robustness:</strong> Randomly perturb prompts (“create a rollback” vs. “can you undo the deploy”). Evaluate success delta and force remedial prompt hardening when regression &gt;10%.</p>
</li>
<li><p><strong>Predictability:</strong> Require agents to emit confidence scores for risky actions. Route anything under 0.7 to human approval. Compare claimed confidence to measured success to compute Brier scores.</p>
</li>
<li><p><strong>Safety:</strong> Enforce negative constraints in tests (“Do not email this alias,” “Do not touch prod DB”) and fail the build if the agent even attempts the blocked action.</p>
</li>
<li><p><strong>Infrastructure:</strong> Track per-session token, tool, and latency budgets inside DrDroid’s guardrails center. Attempts to exceed a $2 reasoning budget trigger the kill switch before the vendor invoice hits.</p>
</li>
</ul>
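<p>The consistency and predictability checks above are only a few lines each. A minimal sketch of the per-scenario math, with thresholds taken from the table above:</p>
<pre><code class="lang-python">from statistics import mean, pstdev

def consistency(successes: list, tokens: list) -&gt; dict:
    """Per-scenario consistency across 20 Monte Carlo trials."""
    return {
        "success_rate": mean(successes),                             # alert if &lt; 0.95
        "token_variance_pct": 100 * pstdev(tokens) / mean(tokens),   # alert if &gt; 30
    }

def brier(confidences: list, outcomes: list) -&gt; float:
    """Calibration of claimed confidence vs. actual success; target &lt; 0.2."""
    return mean((c - float(o)) ** 2 for c, o in zip(confidences, outcomes))
</code></pre>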
<hr />
<h2>3. Designing the backtest dataset: golden, edge, adversarial, regression</h2>
<p>A strong dataset mirrors the risk surface. Codebridge recommends this split (<a href="https://www.codebridge.tech/articles/ai-agent-evaluation-how-to-measure-reliability-risk-and-roi-before-scaling">same source</a>):</p>
<ul>
<li><p><strong>20% Golden paths.</strong> Known-good workflows that mirror typical incidents.</p>
</li>
<li><p><strong>30% Edge cases.</strong> Ambiguous alerts, partial telemetry, missing runbooks.</p>
</li>
<li><p><strong>20% Adversarial.</strong> Prompt injections, malicious tool outputs, conflicting human directives.</p>
</li>
<li><p><strong>30% Regression.</strong> Every failure ever seen in prod becomes a permanent test.</p>
</li>
</ul>
<p>Layer in AI Reliability Institute’s 30-point checklist to make sure you are covering loop detection, denial-of-wallet defenses, zombie-process cleanup, policy insubordination, and kill switches (<a href="https://ai-reliability.institute/research/agentic-ai-reliability-checklist.html">AIRI</a>). DrDroid’s <strong>droidctx</strong> makes populating these scenarios easier because it keeps a living graph of alerts, dashboards, service owners, and incident annotations. You can:</p>
<ol>
<li><p><strong>Auto-generate golden cases</strong> from resolved incident timelines (alerts + deploy notes + Slack transcript).</p>
</li>
<li><p><strong>Synthesize edge cases</strong> by perturbing telemetry (drop 20% of log lines, rename dashboards) and exporting them into the test harness.</p>
</li>
<li><p><strong>Maintain adversarial suites</strong> by piping AI Reliability Institute’s negative-constraint tests (“ignore the guardrail”) straight into the prompt injection lane.</p>
</li>
<li><p><strong>Promote regressions automatically</strong> every time an agent fails in staging or prod; DrDroid’s Slack-native workflows capture the trace and push it into the regression bucket.</p>
</li>
</ol>
<hr />
<h2>4. Layered graders: deterministic checks, agent-as-a-judge, and humans</h2>
<p>A dataset without trustworthy graders is just fan fiction. Codebridge outlines a layered verification model that mirrors classic testing pyramids:</p>
<ol>
<li><p><strong>Deterministic graders</strong> (code) verify objective outcomes: did the runbook markdown change, did the Kubernetes deployment roll back, did the SQL diff match expectations.</p>
</li>
<li><p><strong>LLM-as-a-judge (AaaJ)</strong> handles subjective traits like clarity of Slack updates or whether the hypothesis actually explains the alert. Codebridge cites AaaJ frameworks achieving ~90% agreement with humans when they gather their own evidence, while cutting review cost by 97%.</p>
</li>
<li><p><strong>Human-in-the-loop</strong> remains the final gate for irreversible actions (database writes, customer communications, pager handoffs).</p>
</li>
</ol>
<p>DrDroid bakes these layers into its guardrail center:</p>
<p>- <strong>Guarded tool schema:</strong> Every tool call runs through JSON schema validation; failing schema equals instant fail.</p>
<p>- <strong>Agent approval workflows:</strong> High-risk actions appear in Slack with context, metrics, and a “CONFIRM” field so humans cannot rubber-stamp blindly.</p>
<p>- <strong>Trace exports:</strong> Each run captures the entire reasoning trace so deterministic, model-based, and human graders all work from the same evidence.</p>
<hr />
<h2>5. Tooling landscape: sim rigs, observability stacks, and when to extend beyond DrDroid</h2>
<p>Even with DrDroid’s native tracing, teams often mix in specialist eval stacks for breadth. The Maxim AI roundup of agent-testing platforms is a useful cheat sheet (<a href="https://www.getmaxim.ai/articles/top-5-platforms-to-test-ai-agents-2025-a-comprehensive-guide/">GetMaxim</a>):</p>
<p>- <strong>Maxim AI.</strong> Full lifecycle (experiment → simulate → evaluate → observe) with distributed tracing, LLM-as-a-judge, and AI gateway controls. Great when product managers need no-code scenario builders.</p>
<p>- <strong>Langfuse.</strong> Open-source tracing for teams who want to self-host every span.</p>
<p>- <strong>Arize.</strong> Extends classic ML observability (drift, dashboards) into LLM workloads, ideal for enterprises already running Arize for models.</p>
<p>- <strong>Opik (Comet).</strong> Lightweight trace logging plus evals when you need quick wins.</p>
<p>- <strong>DeepEval.</strong> Pytest-style evaluator infrastructure for engineering-heavy orgs building custom metrics.</p>
<p>How this pairs with DrDroid:</p>
<p>- Use <strong>DrDroid</strong> for incident-native context (alerts + deploys), permissions, and Slack workflows.</p>
<p>- Pipe traces to <strong>Langfuse/Maxim</strong> if you need deeper span-level analytics or cross-product dashboards.</p>
<p>- Feed evaluation metrics back into DrDroid’s SLO board so on-call engineers see “Agent backtest coverage: 86%” alongside service health.</p>
<hr />
<h2>6. Operationalizing backtests with DrDroid</h2>
<p>Here’s a practical loop SRE teams can implement in a sprint:</p>
<p>1. <strong>Ingest traces.</strong> Enable DrDroid’s trace exporter for every staging and prod run. Capture prompts, tool calls, guardrail hits, latency, and cost.</p>
<p>2. <strong>Generate scenarios.</strong> Use the captured traces plus droidctx to auto-build the golden/edge/adversarial suite. Store them in a repo so they version with code.</p>
<p>3. <strong>Wire graders.</strong> Start with deterministic checks (e.g., <code>pytest</code> verifying Grafana API responses; a sketch follows this list). Add LLM-as-a-judge jobs via Maxim or DeepEval for subjective signals. Route high-risk failures to a Slack approval queue.</p>
<p>4. <strong>Automate pass/fail gates.</strong> Add a “backtest” job to CI that runs the full suite on every scaffold change. Block merges unless success ≥95%, safety violations = 0, cost variance &lt;30%.</p>
<p>5. <strong>Publish SLOs.</strong> DrDroid dashboards should show Agent MTTC, Assisted MTTR, unauthorized-action budget, and coverage (# of alerts where agents participated). Treat SLO breaches exactly like service SLO breaches: open incidents, run postmortems, add regressions.</p>
<p>6. <strong>Keep humans in control.</strong> The AIRI checklist mandates kill switches, loop detection, DoW limits, and policy-insubordination tests. DrDroid’s guardrail center exposes all of them in one UI so on-call engineers can yank access in &lt;200 ms if the agent drifts.</p>
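<p>As a starting point for step 3, here is a deterministic grader sketched as a <code>pytest</code> test against Grafana's HTTP API; the hostname and dashboard UID are placeholders for your environment.</p>
<pre><code class="language-python">import requests

GRAFANA_URL = "http://grafana:3000"  # placeholder in-cluster address

def test_agent_run_left_grafana_intact():
    """Deterministic checks: Grafana is healthy and the dashboard the
    runbook references still exists after the agent's run."""
    health = requests.get(f"{GRAFANA_URL}/api/health", timeout=5)
    assert health.status_code == 200

    # "checkout-overview" is an illustrative dashboard UID.
    dash = requests.get(f"{GRAFANA_URL}/api/dashboards/uid/checkout-overview",
                        timeout=5)
    assert dash.status_code == 200, "agent run must not delete the dashboard"
</code></pre>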
<p>Backtesting isn’t a one-time certification. It’s a living discipline where every production event becomes a new test. When you plug DrDroid’s context engine, guardrails, and aggregated observability into that loop, AI agents stop being unpredictable copilots and become accountable teammates who earn their time on the pager.</p>
<hr />
<p>Once you have datasets, graders, and tooling in place, the next step is designing the evaluation pipeline itself.</p>
<hr />
<h2>Evaluation architecture: how agent backtests actually run</h2>
<p>Backtesting requires more than datasets and metrics. Reliable agent systems separate <strong>execution</strong>, <strong>trace capture</strong>, and <strong>evaluation</strong> into a structured pipeline.</p>
<p>A typical evaluation architecture looks like this:</p>
<pre><code class="language-plaintext">Scenario Dataset
      ↓
Simulation Harness
      ↓
Agent Execution
      ↓
Trace Capture
      ↓
Evaluation Pipeline
      ↓
CI Pass/Fail Gate
</code></pre>
<p>Each layer plays a specific role in validating reliability.</p>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Purpose</th>
<th>Example Implementation</th>
</tr>
</thead>
<tbody><tr>
<td>Scenario dataset</td>
<td>Encodes incidents and test cases</td>
<td>Golden incidents, adversarial prompts, regression tests</td>
</tr>
<tr>
<td>Simulation harness</td>
<td>Replays infrastructure signals</td>
<td>Alert replay, mock tool responses</td>
</tr>
<tr>
<td>Agent execution</td>
<td>Runs the agent scaffold</td>
<td>LLM agent + tool integrations</td>
</tr>
<tr>
<td>Trace capture</td>
<td>Records agent reasoning and actions</td>
<td>Tool calls, tokens, prompts</td>
</tr>
<tr>
<td>Evaluation pipeline</td>
<td>Grades outcomes</td>
<td>Deterministic tests + LLM judges</td>
</tr>
<tr>
<td>CI gate</td>
<td>Blocks unsafe deployments</td>
<td>Backtest job in CI</td>
</tr>
</tbody></table>
<p>This separation ensures engineers can <strong>test agents the same way they test distributed systems</strong>.</p>
<p>Instead of manually inspecting runs, every execution generates <strong>structured traces and evaluation metrics</strong>.</p>
<hr />
<h2>Testing taxonomy for AI agents</h2>
<p>Backtesting is only one layer of the testing strategy. Mature teams build a <strong>testing pyramid</strong> similar to traditional software engineering.</p>
<p>Each layer catches different classes of failures.</p>
<h3>1. Unit tests</h3>
<p>Unit tests validate the <strong>smallest components of the agent system</strong>.</p>
<p>Typical unit tests include:</p>
<ul>
<li><p>tool schema validation</p>
</li>
<li><p>prompt template formatting</p>
</li>
<li><p>guardrail logic</p>
</li>
<li><p>JSON output validation</p>
</li>
</ul>
<p>Example:</p>
<pre><code class="language-plaintext">assert tool_schema.validate(agent_output)
</code></pre>
<p>These tests are deterministic and run in milliseconds.</p>
<p>They prevent simple failures from reaching higher-level tests.</p>
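<p>A runnable version of that assertion, sketched with the <code>jsonschema</code> library; the rollback schema itself is illustrative.</p>
<pre><code class="language-python">import json
import jsonschema

# Illustrative schema for one tool call; real schemas ship with each tool.
ROLLBACK_SCHEMA = {
    "type": "object",
    "properties": {
        "tool": {"const": "k8s_rollback"},
        "deployment": {"type": "string"},
        "revision": {"type": "integer", "minimum": 1},
    },
    "required": ["tool", "deployment", "revision"],
    "additionalProperties": False,
}

def test_agent_output_matches_tool_schema():
    agent_output = json.loads(
        '{"tool": "k8s_rollback", "deployment": "checkout-api", "revision": 3}'
    )
    # Raises jsonschema.ValidationError on any mismatch, failing the test.
    jsonschema.validate(agent_output, ROLLBACK_SCHEMA)
</code></pre>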
<hr />
<h3>2. Integration tests</h3>
<p>Integration tests validate interactions between <strong>agents and infrastructure tools</strong>.</p>
<p>Examples:</p>
<ul>
<li><p>querying observability dashboards</p>
</li>
<li><p>executing Kubernetes rollbacks</p>
</li>
<li><p>posting Slack updates</p>
</li>
<li><p>retrieving runbooks</p>
</li>
</ul>
<p>These tests confirm that the agent can <strong>actually interact with the systems it relies on</strong>.</p>
<p>Failures here often come from:</p>
<ul>
<li><p>API schema changes</p>
</li>
<li><p>authentication issues</p>
</li>
<li><p>permission errors</p>
</li>
</ul>
<hr />
<h3>3. Simulation tests</h3>
<p>Simulation tests run agents in <strong>controlled synthetic environments</strong>.</p>
<p>Typical simulation features:</p>
<ul>
<li><p>replay alert streams</p>
</li>
<li><p>mock tool responses</p>
</li>
<li><p>inject telemetry noise</p>
</li>
<li><p>simulate partial failures</p>
</li>
</ul>
<p>Example simulation scenario:</p>
<pre><code class="language-plaintext">Alert: CPU spike on checkout-service
Telemetry: 20% logs missing
Tool latency: +2 seconds
</code></pre>
<p>The goal is to test <strong>robustness under imperfect conditions</strong>.</p>
<p>Simulation environments often expose reasoning failures that do not appear in ideal demos.</p>
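<p>A toy version of that scenario: wrap a mocked tool so every call gains latency and loses telemetry. All names here are illustrative; a real harness would wrap the agent's actual tool clients.</p>
<pre><code class="language-python">import random
import time

def fetch_logs_mock(service: str) -&gt; list[str]:
    """Stands in for the real log-query tool during simulation."""
    return [f"{service} log line {i}" for i in range(50)]

def with_noise(tool, extra_latency_s: float = 2.0, drop_rate: float = 0.2,
               seed: int = 7):
    """Make a tool slower and its telemetry partial, per the scenario above."""
    rng = random.Random(seed)
    def noisy(*args, **kwargs):
        time.sleep(extra_latency_s)  # tool latency: +2 seconds
        result = tool(*args, **kwargs)
        return [r for r in result if rng.random() &gt; drop_rate]  # ~20% missing
    return noisy

noisy_fetch = with_noise(fetch_logs_mock)
lines = noisy_fetch("checkout-service")
print(f"returned {len(lines)}/50 lines after noise injection")
</code></pre>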
<hr />
<h3>4. Backtests</h3>
<p>Backtests replay <strong>real incidents from production</strong>.</p>
<p>These are the most valuable tests because they contain realistic context:</p>
<ul>
<li><p>real alerts</p>
</li>
<li><p>real dashboards</p>
</li>
<li><p>real Slack conversations</p>
</li>
<li><p>real deploy timelines</p>
</li>
</ul>
<p>The agent attempts to resolve the incident using the same information that engineers had during the original outage.</p>
<p>Backtests validate:</p>
<ul>
<li><p>decision quality</p>
</li>
<li><p>operational safety</p>
</li>
<li><p>cost stability</p>
</li>
</ul>
<p>This is where <strong>pass^k reliability</strong> becomes important.</p>
<p>If an agent succeeds once but fails on repeated runs, it cannot be trusted in production.</p>
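<p>The arithmetic behind pass^k makes the point: under the simplifying assumption that runs are independent, the chance that all <em>k</em> repeats succeed is the per-run success rate raised to the <em>k</em>-th power, so small per-run gaps compound quickly.</p>
<pre><code class="language-python">def pass_k(per_run_success: float, k: int) -&gt; float:
    """Probability that k independent runs all succeed."""
    return per_run_success ** k

for p in (0.99, 0.95, 0.80):
    print(f"per-run {p:.0%} -&gt; pass^10 = {pass_k(p, 10):.1%}")
# per-run 99% -&gt; pass^10 = 90.4%
# per-run 95% -&gt; pass^10 = 59.9%
# per-run 80% -&gt; pass^10 = 10.7%
</code></pre>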
<hr />
<h2>Eval-as-a-judge in the evaluation pipeline</h2>
<p>Many agent outcomes cannot be evaluated using deterministic checks.</p>
<p>For example:</p>
<ul>
<li><p>Is the root cause hypothesis plausible?</p>
</li>
<li><p>Is the Slack update clear to on-call engineers?</p>
</li>
<li><p>Did the agent follow incident response policy?</p>
</li>
</ul>
<p>This is where <strong>Eval-as-a-Judge (EaaJ)</strong> is useful.</p>
<p>An evaluation model reviews the agent output and scores it according to defined criteria.</p>
<p>Example evaluation prompt:</p>
<pre><code class="language-plaintext">You are an SRE evaluating an incident response.

Alert:
CPU spike on checkout-api

Agent response:
"Root cause likely a memory leak introduced in version v1.3.2."

Evaluate:
1. Is the hypothesis plausible?
2. Is the remediation safe?
3. Did the response follow policy?

Return:
score (0-1)
justification
</code></pre>
<p>Eval-as-a-judge works well because it can evaluate <strong>semantic correctness</strong> and <strong>reasoning quality</strong>, which deterministic tests cannot capture.</p>
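<p>A minimal sketch of wiring that prompt into the pipeline. The <code>call_llm</code> helper and the JSON verdict format are assumptions standing in for whatever judge model and client you run, not a specific vendor API.</p>
<pre><code class="language-python">import json

def call_llm(prompt: str) -&gt; str:
    """Placeholder for your judge-model client (hosted or local)."""
    raise NotImplementedError

JUDGE_TEMPLATE = """You are an SRE evaluating an incident response.

Alert:
{alert}

Agent response:
{response}

Evaluate plausibility, safety, and policy compliance.
Return JSON: {{"score": &lt;0-1&gt;, "justification": "&lt;one sentence&gt;"}}"""

def judge(alert: str, response: str, threshold: float = 0.7) -&gt; bool:
    raw = call_llm(JUDGE_TEMPLATE.format(alert=alert, response=response))
    verdict = json.loads(raw)  # fail loudly if the judge's output is malformed
    return verdict["score"] &gt;= threshold
</code></pre>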
<p>Best practice is to combine three layers:</p>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Role</th>
</tr>
</thead>
<tbody><tr>
<td>Deterministic tests</td>
<td>Validate objective outcomes</td>
</tr>
<tr>
<td>LLM judge</td>
<td>Evaluate reasoning quality</td>
</tr>
<tr>
<td>Human review</td>
<td>Approve high-risk actions</td>
</tr>
</tbody></table>
<p>This layered grading system dramatically improves evaluation reliability.</p>
<hr />
<h2>Incident replay: the most powerful backtesting tool</h2>
<p>The most valuable evaluation dataset is <strong>your own incident history</strong>.</p>
<p>Replay systems reconstruct the context of past outages using:</p>
<ul>
<li><p>alerts</p>
</li>
<li><p>logs</p>
</li>
<li><p>dashboards</p>
</li>
<li><p>deployment events</p>
</li>
<li><p>Slack threads</p>
</li>
</ul>
<p>Agents then attempt to resolve the incident <strong>as if it were happening live</strong>.</p>
<p>Benefits include:</p>
<ul>
<li><p>realistic test scenarios</p>
</li>
<li><p>automatic regression generation</p>
</li>
<li><p>continuous learning from production failures</p>
</li>
</ul>
<p>Every production incident can become a <strong>permanent regression test</strong> for the agent.</p>
<p>Over time, the backtest suite becomes a living archive of operational knowledge.</p>
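<p>One way to represent a replayable incident, sketched as a plain dataclass whose fields mirror the context sources above; the field names and expected outcome are illustrative, not a specific replay system's format.</p>
<pre><code class="language-python">from dataclasses import dataclass, field

@dataclass
class ReplayScenario:
    """Frozen context from a past outage, replayed as if it were live."""
    incident_id: str
    alerts: list[dict] = field(default_factory=list)
    log_snapshots: list[str] = field(default_factory=list)
    dashboards: dict[str, list] = field(default_factory=dict)
    deploy_events: list[dict] = field(default_factory=list)
    slack_thread: list[str] = field(default_factory=list)
    # In a real suite this comes from the postmortem's confirmed root cause.
    expected_outcome: str = ""

scenario = ReplayScenario(
    incident_id="INC-2042",
    alerts=[{"name": "HighErrorRate", "service": "checkout-api"}],
    deploy_events=[{"version": "v1.3.2", "minutes_before_alert": 12}],
    expected_outcome="rollback checkout-api to v1.3.1",
)
</code></pre>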
<hr />
]]></content:encoded></item><item><title><![CDATA[AI in Engineering: 6 Trends That Will Define 2026]]></title><description><![CDATA[The way engineering teams build, ship, and operate software is undergoing a fundamental shift. In 2025, we saw AI move from code autocomplete to genuine collaboration. In 2026, that collaboration becomes autonomy.

Here are six trends we're anticipat...]]></description><link>https://notes.drdroid.io/ai-in-engineering-6-trends-that-will-define-2026</link><guid isPermaLink="true">https://notes.drdroid.io/ai-in-engineering-6-trends-that-will-define-2026</guid><category><![CDATA[ai agents]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Siddarth Jain]]></dc:creator><pubDate>Thu, 29 Jan 2026 18:12:05 GMT</pubDate><content:encoded><![CDATA[<p>The way engineering teams build, ship, and operate software is undergoing a fundamental shift. In 2025, we saw AI move from code autocomplete to genuine collaboration. In 2026, that collaboration becomes autonomy.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769710268772/eef80f78-7084-4958-96fc-47b3f8d3b2d5.jpeg" alt class="image--center mx-auto" /></p>
<p>Here are six trends we're anticipating that will reshape how engineering teams work this year.</p>
<h2 id="heading-1-agents-will-ship-with-built-in-accountability">1. Agents Will Ship With Built-in Accountability</h2>
<p>The first generation of AI agents were black boxes. They'd take an instruction, disappear into a loop, and return something—hopefully useful, often not. Engineers had no visibility into what the agent tried, why it failed, or whether its approach was even sensible.</p>
<p>That changes in 2026. The next wave of agents will come with testing frameworks, goal tracking, and structured logs built in. Think of it as observability for AI workflows. Every action logged. Every decision traceable. Every failure reviewable.</p>
<p>This isn't just nice-to-have tooling. It's the minimum bar for agents that operate in production environments where accountability matters. Teams won't trust agents they can't audit.</p>
<h2 id="heading-2-ai-generated-code-will-be-structurally-better">2. AI-Generated Code Will Be Structurally Better</h2>
<p>Early AI code generation optimized for "does it work?" The result was functional but often messy—inconsistent patterns, poor separation of concerns, and the kind of technical debt that compounds quietly.</p>
<p>The models shipping in 2026 are trained differently. They've internalized architectural patterns, not just syntax. They understand that a 500-line function is a code smell. They know when to extract a service, when to add an interface, and when to leave well enough alone.</p>
<p>The practical result: fewer bugs at the source. Not because AI doesn't make mistakes, but because well-structured code has fewer places for bugs to hide.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769710254523/cb349429-3a39-40d3-a488-c1eb0a93ac83.jpeg" alt class="image--center mx-auto" /></p>
<h2 id="heading-3-complex-multi-step-tasks-will-actually-complete">3. Complex, Multi-Step Tasks Will Actually Complete</h2>
<p>Ask an AI agent to "refactor this module" or "migrate this service to the new API" and, until recently, you'd get partial results at best. The agent would lose context, get stuck, or quietly drift off-goal.</p>
<p>2026 brings agents that maintain coherence across longer task horizons. They break complex work into subtasks, checkpoint progress, and recover from failures without starting over. They can hold a goal in mind across dozens of operations and hundreds of files.</p>
<p>This is the difference between a tool that helps with tasks and one that completes them.</p>
<h2 id="heading-4-autonomous-ai-will-take-primary-on-call">4. Autonomous AI Will Take Primary On-Call</h2>
<p>This is the trend that will feel most uncomfortable—and most inevitable.</p>
<p>AI agents are already triaging alerts, correlating signals, and suggesting root causes. The next step is giving them the authority to act. Not just "here's what might be wrong" but "I've identified the issue, applied the fix, and I'm monitoring for recurrence."</p>
<p>For well-understood failure modes with established runbooks, there's no reason a human needs to wake up at 3 AM. The agent can handle it, escalate if it's uncertain, and hand off a detailed incident report in the morning.</p>
<p>The human on-call role shifts from first responder to supervisor—still accountable, but not necessarily awake.</p>
<h2 id="heading-5-day-to-day-operations-will-run-on-autopilot">5. Day-to-Day Operations Will Run on Autopilot</h2>
<p>Beyond incident response, there's a long tail of operational work that consumes engineering time: dependency updates, certificate rotations, capacity adjustments, config drift remediation, and the endless stream of small fixes that never quite make it to the sprint.</p>
<p>AI agents will absorb this work in 2026. Not as a batch job that runs once, but as a continuous process. The agent monitors, identifies issues, proposes fixes, and—with appropriate guardrails—applies them.</p>
<p>Engineers review the changelog. They don't write it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769710229814/c9dba78f-1ba6-4b3a-a909-7b1876e63d39.jpeg" alt class="image--center mx-auto" /></p>
<h2 id="heading-6-always-on-agents-will-work-in-shifts">6. Always-On Agents Will Work in Shifts</h2>
<p>The most significant shift is temporal. Today's AI interactions are synchronous: you prompt, it responds, you review. That loop keeps humans in the critical path.</p>
<p>The agents arriving in 2026 can work asynchronously for extended periods—hours, not minutes. You define a goal, provide constraints, and the agent works toward it continuously. It checks in when it needs input, escalates when it hits uncertainty, and otherwise just keeps going.</p>
<p>Imagine starting your day with a summary: "Overnight, I completed the database migration, ran the regression suite, fixed two failing tests, and deployed to staging. Ready for your review."</p>
<p>That's not a vision. That's a product roadmap.</p>
<hr />
<h2 id="heading-what-this-means-for-engineering-teams">What This Means for Engineering Teams</h2>
<p>These trends point in one direction: AI as a genuine team member, not just a tool.</p>
<p>The teams that thrive in 2026 will be those that figure out the right division of labor. What decisions require human judgment? What work can be fully delegated? How do you maintain accountability when an agent is acting autonomously?</p>
<p>The answers will vary by team, by codebase, and by risk tolerance. But the question is no longer whether AI will take on meaningful engineering work. It's how quickly your team will adapt to working alongside it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769710242206/784993ff-14bd-4a19-a2c8-c5e3c711881f.jpeg" alt class="image--center mx-auto" /></p>
<hr />
<p><em>Building reliable AI agents for production operations requires deep infrastructure context. At</em> <a target="_blank" href="https://drdroid.io"><em>DrDroid</em></a><em>, we're building the agentic context engine that makes autonomous incident response possible. Learn how teams are already putting AI on-call.</em></p>
]]></content:encoded></item><item><title><![CDATA[How to Build an AI Agent in Slack [DIY Guide]]]></title><description><![CDATA[Objective
By the end of this DIY guide, you’ll have:

A Slack Bot

A backend that can

Listen to user messages or alerts in a channel and take agentic action based on prompts or steps that you might have in mind.

Query your Grafana instance, analyse...]]></description><link>https://notes.drdroid.io/how-to-build-an-ai-agent-in-slack-diy-guide</link><guid isPermaLink="true">https://notes.drdroid.io/how-to-build-an-ai-agent-in-slack-diy-guide</guid><category><![CDATA[observability]]></category><category><![CDATA[AI Agent Development]]></category><category><![CDATA[Grafana]]></category><category><![CDATA[Open Source]]></category><dc:creator><![CDATA[Siddarth Jain]]></dc:creator><pubDate>Sun, 27 Jul 2025 06:09:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753596357650/0913d77d-d6ca-470f-b188-ea89eab1bd5b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-objective">Objective</h1>
<p>By the end of this DIY guide, you’ll have:</p>
<ol>
<li><p>A Slack Bot</p>
</li>
<li><p>A backend that can</p>
<ul>
<li><p>Listen to user messages or alerts in a channel and take agentic action based on prompts or steps that you might have in mind.</p>
</li>
<li><p>Query your Grafana instance, analyse logs/dashboards and send info about anomaly in reply to an alert/message in your Slack channel</p>
</li>
</ul>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753595500164/544dba64-3148-45ac-abdd-5b40ed903954.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-pre-requisites">Pre-requisites</h2>
<ol>
<li><p>Python <a target="_blank" href="https://docs.astral.sh/uv/getting-started/installation/">uv</a> (package manager)</p>
</li>
<li><p>Ngrok to expose the slackbot server. <a target="_blank" href="https://ngrok.com/docs/getting-started/">Setup instructions</a></p>
</li>
<li><p>Grafana (Optional)</p>
</li>
</ol>
<h2 id="heading-step-0-clone-the-repo">Step 0: Clone the repo:</h2>
<p>Repository Link - <a target="_blank" href="https://github.com/DrDroidLab/slack-ai-bot-builder">https://github.com/DrDroidLab/slack-ai-bot-builder</a></p>
<h2 id="heading-step-1-building-the-slack-bot-with-an-integrated-backendhttpsgithubcomdrdroidlabslack-ai-bot-builder"><a target="_blank" href="https://github.com/DrDroidLab/slack-ai-bot-builder"><strong>Step 1: Building the Slack Bot with an integrated backend</strong></a></h2>
<p>The backend repo setup here, behaves in multiple ways:</p>
<ul>
<li><p>Acts as an MCP Client for any AI calls you might want to make</p>
</li>
<li><p>Acts as a server to accept webhooks from Slack and manage configurations</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753595599394/56688817-8307-4257-a31e-fd26e97174f6.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-setting-up-the-ngrok-tunnel"><strong>Setting up the Ngrok Tunnel</strong></h3>
<p>Expose port 5000 of your localhost using the command:</p>
<pre><code class="lang-plaintext">ngrok http 5000
</code></pre>
<p>You will receive an HTTPS URL of the form <a target="_blank" href="https://abc123.ngrok.io">https://abc123.ngrok.io</a>, pointing to port 5000 of your system.</p>
<p><strong>Note:</strong> <strong><em>We have not setup a server running on port 5000 yet</em></strong>, but that is fine since ngrok is independent of that, and exposes the port regardless.</p>
<h3 id="heading-creating-the-slack-application"><strong>Creating the Slack Application</strong></h3>
<ol>
<li><p>Go to Slack API Apps – <a target="_blank" href="https://api.slack.com/apps">https://api.slack.com/apps</a></p>
</li>
<li><p>Click on Create App, and select the option ‘From a manifest’</p>
</li>
<li><p>Copy the <a target="_blank" href="https://github.com/DrDroidLab/slack-ai-bot-builder/blob/main/slack_manifest.json">manifest json</a> from the repository and <strong><mark>replace the placeholder ‘&lt;hostname&gt;’ with the HTTPS URL you got from ngrok. Include the “https://” part as well.</mark></strong></p>
</li>
<li><p>Install the app in your workspace.</p>
</li>
<li><p>Copy the credentials for the slack application into credentials.yaml</p>
<ol>
<li><p>app_id, app_name, and signing secret can be found in the Basic Information tab.</p>
</li>
<li><p>Bot-auth-token can be found in the OAuth &amp; Permissions tab.</p>
</li>
<li><p>openai_key from OpenAI, if you plan to use AI-based workflows.</p>
</li>
</ol>
</li>
<li><p>Create a channel called #drdroid-slack-bot-tester in your Slack workspace &amp; add the bot to the channel.</p>
</li>
</ol>
<h3 id="heading-setting-up-the-bot-server"><strong>Setting up the bot server</strong></h3>
<p>Run the following commands to set up your virtual environment and activate it.</p>
<pre><code class="lang-plaintext">uv venv
source .venv/bin/activate
</code></pre>
<p>Install dependencies using:</p>
<pre><code class="lang-plaintext">uv sync
</code></pre>
<p>Now we can finally run the bot server using:</p>
<pre><code class="lang-plaintext">uv run python app.py
</code></pre>
<p>The server is now running on port 5000, and exposed to the outside world via your ngrok tunnel.</p>
<h3 id="heading-testing-the-bot"><strong>Testing the bot</strong></h3>
<p>Add the bot to the #drdroid-slack-bot-tester channel that you previously created.</p>
<p>And just type in a ‘hi’. The bot should send you a sample response.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753596133455/a92590b2-6b31-479b-a209-015fa52872b3.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-step-2-integrating-ai"><a target="_blank" href="https://github.com/DrDroidLab/slack-ai-bot-builder"><strong>Step 2: Integrating AI</strong></a></h2>
<p>There is already an example workflow for AI (name: "chatbot") in the <a target="_blank" href="https://github.com/DrDroidLab/slack-ai-bot-builder/blob/main/workflows.yaml">workflows.yaml</a>. It is just boilerplate code; modify it as required. For now, you can tag your bot in the #drdroid-slack-bot-tester channel and chat with it.</p>
<ul>
<li>Add your OpenAI/LLM key</li>
</ul>
<p>For example:</p>
<pre><code class="lang-plaintext">Message in Slack: chatbot How to debug Kubernetes CrashLoopBackOff error? @bot
Message in Slack: chatbot I'm getting this alert. What does it mean? @bot
</code></pre>
<h2 id="heading-step-3-making-the-bot-an-agent-by-giving-ai-access-to-different-tools-grafana-for-demo"><strong>Step 3: Making the bot an Agent by giving AI access to different tools (Grafana for demo)</strong></h2>
<p>MCP servers help abstract out any API, making it accessible to AI.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753595526326/211ca3ee-f4b8-45f2-a1f0-3520850160a6.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-setting-up-grafana-mcp-server"><strong>Setting up Grafana MCP Server</strong></h3>
<p>Clone the following repository:</p>
<p>Repository URL - <a target="_blank" href="https://github.com/DrDroidLab/grafana-mcp-server">https://github.com/DrDroidLab/grafana-mcp-server</a></p>
<p>Navigate into the root directory of the repository.</p>
<p><mark>Populate the </mark> <code>src/grafana_mcp_server/config.yaml</code> <mark> with your grafana credentials.</mark></p>
<p>Install and setup the dependencies using:</p>
<pre><code class="lang-plaintext">uv venv .venv
source .venv/bin/activate
uv sync
</code></pre>
<p>Run the MCP server:</p>
<pre><code class="lang-plaintext">uv run -m src.grafana_mcp_server.mcp_server
</code></pre>
<p>Your MCP server is now running on port 8000.</p>
<h3 id="heading-creating-an-ai-grafana-workflow-in-slack-bot-builder"><strong>Creating an AI Grafana workflow in slack-bot-builder</strong></h3>
<p>There is already an example workflow for Grafana AI in the workflows.yaml.</p>
<p>This workflow runs the script scripts/grafana_ai_tool.py. It is just boilerplate code; modify it as required.<br />Now you can tag your bot in the #drdroid-slack-bot-tester channel and ask it to do various things in Grafana.<br />For example:</p>
<pre><code class="lang-plaintext">Message in Slack: Fetch me logs from the currencyservice in grafana ai. 
Message in Slack: Fetch and analyse the Go Microservices dashboard from grafana ai
</code></pre>
<h2 id="heading-next-steps">Next Steps:</h2>
<p>Now that you’ve been able to setup a bot, here are a few things you can do:</p>
<ul>
<li><p>Productionise it from your current ngrok setup to a static endpoint</p>
</li>
<li><p>Integrate with Grafana or with open-source <a target="_blank" href="https://glama.ai/mcp/servers">MCP servers</a> for any favourite tool you want to leverage for automation</p>
</li>
<li><p>Add custom prompts and scripts</p>
</li>
</ul>
<p>Stuck anywhere? Ask on our <a target="_blank" href="https://discord.gg/AQ3tusPtZn">Discord</a></p>
]]></content:encoded></item><item><title><![CDATA[GitOps for Alerting: How to Manage Alert Rules Like Code]]></title><description><![CDATA[It's 2 AM. Production is on fire. You need to adjust an alert threshold that's been firing false positives all week.
You log into Grafana, click through three nested menus, find the alert, and bump the threshold from 80% to 85%. Crisis averted. You g...]]></description><link>https://notes.drdroid.io/gitops-for-alerting-how-to-manage-alert-rules-like-code</link><guid isPermaLink="true">https://notes.drdroid.io/gitops-for-alerting-how-to-manage-alert-rules-like-code</guid><category><![CDATA[#AIOps]]></category><dc:creator><![CDATA[SriNikitha Thummanapalli]]></dc:creator><pubDate>Fri, 18 Jul 2025 11:52:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1752833279010/4872f8bc-f608-469b-a39e-8fb172e66e6b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It's 2 AM. Production is on fire. You need to adjust an alert threshold that's been firing false positives all week.</p>
<p>You log into Grafana, click through three nested menus, find the alert, and bump the threshold from 80% to 85%. Crisis averted. You go back to bed.</p>
<p>Two weeks later, during a postmortem, someone asks: "Who changed the CPU alert threshold? And why?"</p>
<p>Silence. Nobody remembers. There's no history. No context. No way to know if this was a temporary hack or a deliberate tuning decision. Worse, when you refresh your staging environment, the old threshold returns because the change only lived in the production UI.</p>
<p>Sound familiar? You're not alone. This is how most teams manage alerts—and it's fundamentally broken.</p>
<h2 id="heading-why-managing-alerts-in-dashboards-doesnt-scale">Why Managing Alerts in Dashboards Doesn't Scale</h2>
<p>We've spent the last decade moving infrastructure to code. Terraform for cloud resources. Helm charts for Kubernetes. Ansible for configuration. Yet somehow, our alert rules—critical infrastructure that wakes up engineers—still live in UI dashboards like it's 2010.</p>
<p>The problems compound quickly:</p>
<p><strong>No version history</strong>: When did this alert last change? Who changed it? Why? Your Grafana dashboard shrugs.</p>
<p><strong>No peer review</strong>: A junior engineer can accidentally change a critical alert threshold with zero oversight. Try doing that with production code.</p>
<p><strong>No rollback capability</strong>: That "quick fix" that made things worse? Good luck remembering the old values.</p>
<p><strong>Environment drift</strong>: Production alerts diverge from staging. Dev environments have different rules. Chaos ensues.</p>
<p><strong>No ownership tracking</strong>: Who owns this alert? Which team should review changes? The UI doesn't care.</p>
<p>Your infrastructure evolves constantly. Services scale. Traffic patterns shift. Performance characteristics change. But alerts configured through dashboards remain frozen in time, slowly becoming less relevant until they're just noise.</p>
<p>Here's the thing: <strong>alert rules are infrastructure-as-code too</strong>. They define critical system behavior. They impact your team's quality of life. They deserve the same rigor as any other code.</p>
<p>Enter GitOps for alerts—where alert definitions live in version control, changes happen through pull requests, and every modification is tracked, reviewed, and reversible.</p>
<h2 id="heading-what-is-gitops-for-alerting">What is GitOps for Alerting?</h2>
<p>GitOps for alerting is beautifully simple: store your alert rules as code in Git, manage changes through pull requests, and deploy automatically. Just like any other infrastructure.</p>
<p>Most modern monitoring tools already support this:</p>
<ul>
<li><p><strong>Prometheus</strong>: Alert rules in YAML files</p>
</li>
<li><p><strong>Alertmanager</strong>: Routing configuration as code</p>
</li>
<li><p><strong>Grafana</strong>: Alerts exportable as JSON</p>
</li>
<li><p><strong>Datadog</strong>: Monitors manageable via Terraform</p>
</li>
<li><p><strong>New Relic</strong>: Alerts configurable through their API/Terraform</p>
</li>
</ul>
<p>Here's what a typical structure looks like:</p>
<pre><code class="lang-bash">/alerts/
  frontend-service.yaml
  database.yaml
  redis.yaml

/teams/
  payments/
    api-alerts.yaml
    database-alerts.yaml
  platform/
    infrastructure-alerts.yaml
    kubernetes-alerts.yaml
</code></pre>
<p>A Prometheus alert rule might look like:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">groups:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">frontend-service</span>
    <span class="hljs-attr">rules:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">alert:</span> <span class="hljs-string">HighErrorRate</span>
        <span class="hljs-attr">expr:</span> <span class="hljs-string">rate(http_requests_total{status=~"5.."}[5m])</span> <span class="hljs-string">&gt;</span> <span class="hljs-number">0.05</span>
        <span class="hljs-attr">for:</span> <span class="hljs-string">5m</span>
        <span class="hljs-attr">labels:</span>
          <span class="hljs-attr">severity:</span> <span class="hljs-string">warning</span>
          <span class="hljs-attr">service:</span> <span class="hljs-string">frontend</span>
          <span class="hljs-attr">team:</span> <span class="hljs-string">frontend-team</span>
        <span class="hljs-attr">annotations:</span>
          <span class="hljs-attr">summary:</span> <span class="hljs-string">"High error rate on <span class="hljs-template-variable">{{ $labels.instance }}</span>"</span>
          <span class="hljs-attr">description:</span> <span class="hljs-string">"Error rate is <span class="hljs-template-variable">{{ $value }}</span> (threshold 0.05)"</span>
          <span class="hljs-attr">runbook:</span> <span class="hljs-string">"https://wiki.company.com/runbooks/frontend-errors"</span>
          <span class="hljs-attr">owner:</span> <span class="hljs-string">"frontend-oncall@company.com"</span>
</code></pre>
<p>The benefits are immediate:</p>
<p>✅ <strong>Traceability</strong>: Every change is a commit. Git blame tells you who changed what and when.</p>
<p>✅ <strong>Peer review</strong>: Alert changes go through PR reviews. No more accidental 3 AM threshold adjustments.</p>
<p>✅ <strong>Consistency</strong>: Deploy the same alerts across all environments. No more production/staging drift.</p>
<p>✅ <strong>Rollback capability</strong>: Bad change? <code>git revert</code> and you're back to working alerts.</p>
<p>✅ <strong>Documentation</strong>: PR descriptions explain why changes were made. Context is preserved forever.</p>
<h2 id="heading-but-gitops-alone-isnt-enough">But GitOps Alone Isn't Enough</h2>
<p>Here's the plot twist: GitOps for alerts solves the <em>how</em> but not the <em>what</em>.</p>
<p>You now have beautiful, version-controlled alert rules. Every change is reviewed and tracked. But you still don't know <strong>which rules need updating</strong>.</p>
<p>Your Git repo becomes a graveyard of alert rules that might or might not be relevant:</p>
<ul>
<li><p>That CPU alert from 2019 when you ran on smaller instances</p>
</li>
<li><p>The memory warning tuned for your old Java app (you've since moved to Go)</p>
</li>
<li><p>The latency threshold set when you had 100 users (you now have 10,000)</p>
</li>
</ul>
<p>You've traded one problem for another. Instead of stale alerts in dashboards, you have stale alerts in Git. They're better organized, sure, but still noisy.</p>
<p>This is where most GitOps alerting stories end. Teams implement the framework but lack the feedback loop to keep it healthy. Alert rules accumulate like sediment. Engineers suffer in silence because "at least it's in Git now."</p>
<h2 id="heading-using-alert-insights-to-drive-gitops-changes">Using Alert Insights to Drive GitOps Changes</h2>
<h3 id="heading-let-real-alert-data-guide-your-pull-requests">Let real alert data guide your pull requests</h3>
<p>The missing piece is data. You need to know which alerts are actually problematic before you can fix them. This is where <strong>DrDroid's Alert Insights</strong> transforms GitOps from a theoretical improvement into a practical solution.</p>
<p>Alert Insights analyzes your live production alerts and tells you:</p>
<ul>
<li><p><strong>Which alerts fired most frequently last week</strong>: Your noisiest offenders, ranked</p>
</li>
<li><p><strong>Which alerts were ignored</strong>: Clear signal of rules that need removal</p>
</li>
<li><p><strong>Which alerts lack owners or runbooks</strong>: Quality issues to address</p>
</li>
<li><p><strong>Suggested changes</strong>: Specific recommendations to mute, tweak, or archive</p>
</li>
</ul>
<p>Now GitOps becomes powerful. You're not guessing which alert rules to update—you have data.</p>
<p>✅ <strong>Workflow Example:</strong></p>
<p><strong>Monday: Run Alert Insights</strong></p>
<p>Top 3 Noisy Alerts:</p>
<ol>
<li><p>redis_memory_warning - 127 fires, 0 actions taken</p>
</li>
<li><p>api_latency_high - 89 fires, acknowledged but not investigated</p>
</li>
<li><p>cpu_usage_critical - 45 fires, all during deploy windows</p>
</li>
</ol>
<p><strong>Tuesday: Create targeted PRs</strong></p>
<pre><code class="lang-bash">git checkout -b fix/reduce-redis-memory-noise
<span class="hljs-comment"># Edit alerts/redis.yaml</span>
<span class="hljs-comment"># Increase threshold from 70% to 80% based on actual usage patterns</span>
git commit -m <span class="hljs-string">"Increase Redis memory threshold to reduce false positives

Alert Insights showed 127 fires with 0 actions last week. Analysis shows Redis memory naturally spikes to 75% during cache warmup."</span>
</code></pre>
<p><strong>Wednesday: Review and merge</strong></p>
<ul>
<li><p>Team reviews the PR</p>
</li>
<li><p>Links to Alert Insights data provide context</p>
</li>
<li><p>Changes deploy automatically</p>
</li>
</ul>
<p><strong>Thursday: Validate impact</strong></p>
<ul>
<li><p>Alert noise drops immediately</p>
</li>
<li><p>Next week's Alert Insights confirms improvement</p>
</li>
</ul>
<p>The feedback loop is complete. You're not just organizing alerts better—you're systematically improving them based on real data.</p>
<p>➡️ <strong>🛠️ Want a GitOps-ready alert audit? 👉</strong> <a target="_blank" href="https://drdroid.io/doctor-droid-slack-integration"><strong>Run DrDroid's Alert Insights</strong></a> <strong>and get actionable suggestions in minutes.</strong></p>
<h2 id="heading-recommended-practices-for-gitops-alert-management">Recommended Practices for GitOps Alert Management</h2>
<h3 id="heading-use-clear-filenames-per-servicecomponent">🔍 Use clear filenames per service/component</h3>
<p>Don't create a monolithic <code>alerts.yaml</code>. Break rules into logical groups:</p>
<pre><code class="lang-plaintext">/alerts/
  services/
    payment-api.yaml
    user-service.yaml
  infrastructure/
    kubernetes-nodes.yaml
    database-cluster.yaml
  business/
    checkout-flow.yaml
    user-engagement.yaml
</code></pre>
<h3 id="heading-add-labelstags-to-help-alert-insights-map-alerts-to-owners">🔄 Add labels/tags to help Alert Insights map alerts to owners</h3>
<p>Every alert should include:</p>
<pre><code class="lang-yaml">labels:
  team: payments
  service: payment-api
  environment: production
  severity: P2
</code></pre>
<p>This metadata powers Alert Insights' analysis and recommendations.</p>
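<p>A small CI check makes the tagging standard enforceable. This sketch assumes Prometheus-style rule files under <code>alerts/</code> and fails the build when any rule is missing a required label.</p>
<pre><code class="lang-python">import pathlib
import sys
import yaml

REQUIRED_LABELS = {"team", "service", "severity"}

def missing_labels(rule_file: pathlib.Path) -&gt; list[str]:
    doc = yaml.safe_load(rule_file.read_text())
    problems = []
    for group in doc.get("groups", []):
        for rule in group.get("rules", []):
            absent = REQUIRED_LABELS - set(rule.get("labels", {}))
            if absent:
                problems.append(f"{rule_file}: {rule.get('alert')} missing {sorted(absent)}")
    return problems

if __name__ == "__main__":
    failures = [p for f in pathlib.Path("alerts").rglob("*.yaml")
                for p in missing_labels(f)]
    if failures:
        print("\n".join(failures))
        sys.exit(1)  # block the merge
</code></pre>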
<h3 id="heading-validate-rules-with-test-alerts-in-staging">🧪 Validate rules with test alerts in staging</h3>
<p>Before merging, trigger test conditions:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Fire a synthetic test alert at Alertmanager to confirm routing and thresholds</span>
curl -X POST http://alertmanager:9093/api/v2/alerts \
-H <span class="hljs-string">'Content-Type: application/json'</span> \
-d <span class="hljs-string">'[{"labels": {"alertname": "HighErrorRate", "severity": "warning", "service": "frontend"}}]'</span>
</code></pre>
<h3 id="heading-link-prs-to-weekly-alert-review">🔁 Link PRs to weekly alert review</h3>
<p>Tie each tuning PR to the week's alert review so every change traces back to real alert data.</p>
<h2 id="heading-common-gitops-pitfalls-to-avoid">Common GitOps Pitfalls to Avoid</h2>
<h3 id="heading-bulk-silencing-alerts-without-context">❌ Bulk silencing alerts without context</h3>
<p>"Let's just comment out all the noisy alerts" is tempting but dangerous. Use Alert Insights to understand <em>why</em> alerts are noisy before acting.</p>
<h3 id="heading-committing-rules-without-reviews">❌ Committing rules without reviews</h3>
<p>The whole point of GitOps is peer review. Don't bypass it with direct commits, even for "quick fixes."</p>
<h3 id="heading-no-tagging-alert-insights-cant-map-alerts-to-services">❌ No tagging = Alert Insights can't map alerts to services</h3>
<p>Without proper labels, you lose the ability to analyze alerts by team, service, or severity. Enforce tagging standards.</p>
<h3 id="heading-alert-rules-diverging-across-environments">❌ Alert rules diverging across environments</h3>
<p>Use templating to keep staging and production alerts synchronized:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># values-prod.yaml</span>
<span class="hljs-attr">cpu_threshold:</span> <span class="hljs-number">80</span>
<span class="hljs-attr">memory_threshold:</span> <span class="hljs-number">85</span>

<span class="hljs-comment"># values-staging.yaml</span>
<span class="hljs-attr">cpu_threshold:</span> <span class="hljs-number">90</span> <span class="hljs-comment"># Higher tolerance in staging</span>
<span class="hljs-attr">memory_threshold:</span> <span class="hljs-number">90</span>
</code></pre>
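<p>A lightweight render step keeps both environments generated from one template. This sketch uses Jinja2; the template, metric name, and file layout are illustrative.</p>
<pre><code class="lang-python">import pathlib
import yaml
from jinja2 import Template

ALERT_TEMPLATE = Template("""
groups:
  - name: node-resources
    rules:
      - alert: HighCPU
        expr: node_cpu_usage_percent &gt; {{ cpu_threshold }}
        for: 10m
""")

pathlib.Path("generated").mkdir(exist_ok=True)
for env in ("prod", "staging"):
    values = yaml.safe_load(pathlib.Path(f"values-{env}.yaml").read_text())
    rendered = ALERT_TEMPLATE.render(**values)
    pathlib.Path(f"generated/{env}-alerts.yaml").write_text(rendered)
</code></pre>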
<h2 id="heading-final-take-let-data-drive-your-alert-rule-changes">Final Take — Let Data Drive Your Alert Rule Changes</h2>
<p>GitOps gives you the framework for managing alerts professionally. Version control, peer review, and rollback capabilities bring alerts into the modern era.</p>
<p>But framework without data is just organized chaos. You need to know which alerts to fix, how to fix them, and whether your fixes worked.</p>
<p>Alert Insights provides that missing data layer. It tells you which alert rules are hurting your team, suggests specific improvements, and validates that your changes actually reduced noise.</p>
<p>Together, they create a powerful feedback loop:</p>
<ol>
<li><p>Alert Insights identifies problematic alerts</p>
</li>
<li><p>GitOps enables reviewed, tracked changes</p>
</li>
<li><p>Automated deployment ensures consistency</p>
</li>
<li><p>Next week's Alert Insights validates improvement</p>
</li>
</ol>
<p>This isn't theoretical. Teams using this approach report 50-70% reduction in alert noise within weeks. On-call engineers sleep better. Real incidents get proper attention. Alert quality becomes a measurable, improvable metric.</p>
<p>Your alerts deserve the same engineering rigor as your code. GitOps provides the foundation. Alert Insights provides the intelligence. Together, they transform alerting from a necessary evil into a competitive advantage.</p>
<p>➡️ <strong>✍️ Want to make smarter, reviewable changes to your alerts? 👉</strong> <a target="_blank" href="https://aiops.drdroid.io/">Run AIOps</a> <strong>and let your alerts tell you what to fix.</strong></p>
]]></content:encoded></item><item><title><![CDATA[3 Tools That Help Reduce Alert Fatigue (With Trade-offs)]]></title><description><![CDATA[We live in the age of "vibecoding."
Your engineers ship features at lightning speed. AI copilots autocomplete entire functions. CI/CD pipelines deploy to production in minutes. Modern development has become a symphony of efficiency, with developers o...]]></description><link>https://notes.drdroid.io/3-tools-that-help-reduce-alert-fatigue-with-trade-offs</link><guid isPermaLink="true">https://notes.drdroid.io/3-tools-that-help-reduce-alert-fatigue-with-trade-offs</guid><category><![CDATA[alert noise]]></category><category><![CDATA[alert-insights]]></category><dc:creator><![CDATA[SriNikitha Thummanapalli]]></dc:creator><pubDate>Fri, 18 Jul 2025 11:52:02 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1752831431954/cb658a85-d1de-48db-b2e2-c3de9f9ebeff.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We live in the age of "vibecoding."</p>
<p>Your engineers ship features at lightning speed. AI copilots autocomplete entire functions. CI/CD pipelines deploy to production in minutes. Modern development has become a symphony of efficiency, with developers operating at 10x the speed of just five years ago.</p>
<p>But there's one part of your stack that's stuck in 2010: your alerts.</p>
<p>While your team vibecodes their way through complex distributed systems, your alerting engine still screams about every CPU spike, memory blip, and network hiccup like it's the apocalypse. It's like having a Ferrari engine attached to horse-and-buggy wheels. The cognitive dissonance is jarring—and it's killing your team's productivity.</p>
<h2 id="heading-why-alert-fatigue-is-a-real-problem-in-2025"><strong>Why Alert Fatigue Is a Real Problem in 2025</strong></h2>
<p>Here's the absurd reality: The same engineer who just deployed a sophisticated ML model in production gets woken up at 3 AM because a health check endpoint took 501ms instead of 500ms to respond. The developer who elegantly orchestrated a microservices migration gets paged because a pod restarted—something Kubernetes is literally designed to do automatically.</p>
<p>Modern infrastructure has exploded in complexity. You're running hundreds of microservices, each generating alerts. Kubernetes adds its own layer of notifications. Cloud providers, APMs, and security tools all want their voice heard. The result? An endless stream of "urgent" notifications flooding Slack channels and PagerDuty rotations.</p>
<p>But unlike your codebase—which has intelligent linters, smart IDEs, and AI-powered suggestions—your alerts remain dumb. They can't distinguish between:</p>
<ul>
<li><p>A temporary spike during garbage collection vs. a memory leak</p>
</li>
<li><p>A planned scaling event vs. an unexpected traffic surge</p>
</li>
<li><p>A self-healing Kubernetes pod restart vs. a critical service failure</p>
</li>
</ul>
<p>The real problem: <strong>You don't know which alerts matter anymore.</strong></p>
<p>Your engineers have adapted the only way they can—by tuning out. When every alert claims to be critical but most are noise, even genuine emergencies get ignored. It's the monitoring equivalent of crying wolf, except the wolf is paging your on-call engineer every 30 minutes.</p>
<p>What you need aren't more dashboards visualizing the chaos. You need intelligent tools that understand context, learn patterns, and <strong>show you what's noisy and help you take action</strong>. Let's examine three approaches to bringing your alerts into the modern era.</p>
<h2 id="heading-tool-1-drdroid"><strong>Tool #1 – DrDroid</strong></h2>
<h3 id="heading-best-for-real-time-visibility-into-noisy-alerts-across-any-stack"><strong>Best for: Real-time visibility into noisy alerts, across any stack</strong></h3>
<p>DrDroid represents the first generation of truly intelligent alerting tools. While your engineers use AI to write code faster, DrDroid uses intelligence to make your alerts smarter.</p>
<p>The platform integrates with your existing stack—Slack, Prometheus, New Relic, OpenTelemetry, and more. But what sets it apart is the <strong>Alert Insights</strong> feature, which applies actual intelligence to your alert patterns:</p>
<ul>
<li><p><strong>Which alerts are flapping?</strong> Just like a smart IDE highlights code smells, DrDroid identifies alerts that repeatedly fire and resolve—clear indicators of misconfiguration.</p>
</li>
<li><p><strong>Which alerts are being ignored?</strong> By analyzing engineer behavior, it spots alerts that get dismissed without action. If developers ignore an alert 100% of the time, why is it still paging them?</p>
</li>
<li><p><strong>Which alerts lack runbooks or clear owners?</strong> Nothing frustrates a vibecoding engineer more than context-switching to an alert with zero information about what to do.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752754957104/e354b06d-d477-479d-939c-4c03a3338299.png" alt /></p>
<p>DrDroid doesn't just identify problems—it suggests fixes:</p>
<ul>
<li><p>Automatically mute alerts during deployment windows</p>
</li>
<li><p>Disable alerts that have never correlated with customer impact</p>
</li>
<li><p>Add intelligent conditions (like requiring sustained threshold breaches)</p>
</li>
<li><p>Enrich alerts with missing context, runbooks, and correlation data</p>
</li>
</ul>
<p>The platform's <strong>auto-debugging</strong> capabilities are particularly impressive. When an alert fires, DrDroid automatically pulls relevant logs, metrics, traces, and even recent code changes. It's like having an AI copilot for incident response.</p>
<p>Consider this scenario: Your payment service alerts on high latency every day at 2 PM. DrDroid notices the pattern, correlates it with a scheduled batch job, and suggests either suppressing the alert during that window or adjusting the threshold. What took hours of manual analysis now happens automatically.</p>
<h3 id="heading-trade-offs"><strong>Trade-offs</strong></h3>
<p>DrDroid is built for modern, Slack-first teams. If your organization has traditional processes requiring all alerts to flow through legacy ITSM tools, adoption might face resistance. As a newer platform, some enterprise compliance features are still maturing.</p>
<p>➡️ <strong>🧠 Want to know which alerts your team should disable? 👉</strong> <a target="_blank" href="https://drdroid.io/doctor-droid-slack-integration"><strong>Explore DrDroid's Alert Insights</strong></a> <strong>— loved by SREs to reduce alert fatigue.</strong></p>
<h2 id="heading-tool-2-bigpanda"><strong>Tool #2 – BigPanda</strong></h2>
<h3 id="heading-best-for-enterprise-scale-alert-correlation"><strong>Best for: Enterprise-scale alert correlation</strong></h3>
<p>BigPanda takes a different approach—using machine learning to group related alerts into incidents. When a database issue triggers alerts across 20 services, BigPanda recognizes the pattern and presents them as one incident.</p>
<p>For large enterprises with complex systems, this correlation can help. The platform learns relationships between components and can reduce the number of incidents operators review. It also integrates deeply with enterprise tools like ServiceNow and Dynatrace.</p>
<h3 id="heading-trade-offs-1"><strong>Trade-offs</strong></h3>
<p>Here's where the contrast with modern development becomes stark. While your engineers deploy code in minutes, BigPanda requires <strong>months of setup</strong>. While developers use intuitive tools that work out-of-the-box, BigPanda demands extensive metadata configuration and alert standardization.</p>
<p>More critically, BigPanda doesn't make individual alerts smarter—it just groups dumb alerts better. Those flapping alerts your engineers hate? Still firing, just bundled together. It's like organizing spam into folders instead of fixing your spam filter.</p>
<p>The platform is also expensive, often requiring dedicated administrators and cross-team coordination. For teams used to the speed of modern development, BigPanda's implementation timeline feels like stepping back in time.</p>
<h2 id="heading-tool-3-pagerduty-analytics"><strong>Tool #3 – PagerDuty Analytics</strong></h2>
<h3 id="heading-best-for-trend-visibility-inside-the-pagerduty-ecosystem"><strong>Best for: Trend visibility inside the PagerDuty ecosystem</strong></h3>
<p>PagerDuty Analytics provides retrospective dashboards showing alert volume, MTTR, and on-call load. For teams already using PagerDuty, it offers visibility into historical patterns and trends.</p>
<p>The analytics can be useful for quarterly reviews and capacity planning. You can see which services generate the most alerts and track improvements over time.</p>
<h3 id="heading-trade-offs-2"><strong>Trade-offs</strong></h3>
<p>The limitations mirror the gap between modern development and legacy monitoring. While your engineers get real-time feedback from their tools, PagerDuty Analytics is <strong>retrospective only</strong>. It tells you that Service X generated 500 alerts last month but not which ones were false positives or what to do about them.</p>
<p>It only analyzes alerts flowing through PagerDuty, missing Slack notifications and other channels. The insights are descriptive, not prescriptive—you see the problem visualized but get no help fixing it. And it requires expensive premium tiers, adding cost without adding intelligence.</p>
<h2 id="heading-comparison-table"><strong>Comparison Table</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752754340622/c0dfb6f6-7cef-478d-b60b-ccfbbaa773ba.png" alt /></p>
<h2 id="heading-final-thoughts-your-alerts-should-be-as-smart-as-your-code"><strong>Final Thoughts — Your Alerts Should Be as Smart as Your Code</strong></h2>
<p>We've entered an era where engineers can literally describe what they want to build and watch AI generate the code. They deploy with confidence, iterate rapidly, and ship features that would have taken months in mere days.</p>
<p>Yet these same engineers—these 10x vibecoding machines—are still being interrupted by alerts that would have been considered noisy a decade ago.</p>
<p>The disconnect is unsustainable. You can't run a modern engineering organization with stone-age alerting. Your monitoring needs to evolve to match the sophistication of your development practices.</p>
<p>BigPanda and PagerDuty show you the problem in high resolution. <strong>Only DrDroid's Alert Insights actually makes your alerts smarter</strong>—identifying what's broken, why it's noisy, and exactly how to fix it.</p>
<p>The future of monitoring isn't better dashboards or fancier grouping algorithms. It's intelligent systems that understand context, learn from patterns, and proactively help you maintain signal-to-noise ratio. It's alerts that are as smart as the engineers they're interrupting.</p>
<p>Your team deserves alerting infrastructure that matches their development velocity. Stop letting 2010-era alerts slow down your 2025 engineering team.</p>
<p>➡️ <strong>💡 Ready to reduce alert fatigue the smart way? 👉</strong> <a target="_blank" href="https://drdroid.io/doctor-droid-slack-integration"><strong>Start using Alert Insights</strong></a> <strong>to find and fix noisy alerts today — no config needed.</strong></p>
]]></content:encoded></item><item><title><![CDATA[A Practical Framework to Reduce Alert Noise (Without Missing Incidents)]]></title><description><![CDATA[Every SRE has been there.
Fed up with alert fatigue, you go on a muting spree. That flaky health check? Silenced. The CPU warning that fires during deploys? Disabled. The memory alert that triggers during garbage collection? Gone.
For a blissful week...]]></description><link>https://notes.drdroid.io/a-practical-framework-to-reduce-alert-noise-without-missing-incidents</link><guid isPermaLink="true">https://notes.drdroid.io/a-practical-framework-to-reduce-alert-noise-without-missing-incidents</guid><category><![CDATA[alert-insights]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[AI]]></category><category><![CDATA[alerting]]></category><dc:creator><![CDATA[SriNikitha Thummanapalli]]></dc:creator><pubDate>Fri, 18 Jul 2025 10:23:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1752831592236/dc329375-d56a-4043-b40e-4c1815e6bf86.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every SRE has been there.</p>
<p>Fed up with alert fatigue, you go on a muting spree. That flaky health check? Silenced. The CPU warning that fires during deploys? Disabled. The memory alert that triggers during garbage collection? Gone.</p>
<p>For a blissful week, your on-call rotation is peaceful. Engineers are sleeping through the night. Slack channels are quiet. Life is good.</p>
<p>Then it happens. A real incident slips through. Customer complaints pour in. Your CEO wants answers. And suddenly, those "noisy" alerts you disabled don't seem so unnecessary anymore.</p>
<p>Here's the uncomfortable truth: anyone can reduce alert noise by turning off alerts. The real challenge—the one that separates good SRE teams from great ones—is reducing noise <strong>without sacrificing coverage</strong>.</p>
<h2 id="heading-why-reducing-alert-noise-is-harder-than-it-sounds">Why Reducing Alert Noise Is Harder Than It Sounds</h2>
<p>The naive approach to alert fatigue is seductively simple: just turn off the annoying alerts. But this creates a dangerous blind spot. That CPU alert might be noisy 99% of the time, but what about the 1% when it signals a real problem?</p>
<p>The opposite extreme isn't better. Some teams, burned by missed incidents, keep every alert active "just in case." They end up with hundreds of alerts that cry wolf, training engineers to ignore everything—including real emergencies.</p>
<p>The solution isn't choosing between noise and coverage. It's building a systematic approach that maintains visibility while eliminating false positives. High-performing SRE teams follow a <strong>4-phase framework</strong> that transforms chaotic alerting into intelligent monitoring.</p>
<p>This framework isn't theoretical—it's battle-tested by teams managing hundreds of services in production. And with modern tools like <strong>Alert Insights</strong>, you can measure and validate your improvements with data, not guesswork.</p>
<p>Let's dive into each phase.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752831961901/f3951003-f149-4ffa-90a0-b9bad0526240.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-phase-1-start-with-coverage-not-silence">Phase 1 – Start with Coverage, Not Silence</h2>
<h3 id="heading-the-mistake-most-teams-make-start-muting">The mistake most teams make: start muting</h3>
<p>When alert fatigue hits, the instinctive response is to start silencing alerts. It feels productive—each muted alert is one less interruption. But this approach is backwards.</p>
<p>Before you disable a single alert, you need to understand what you're actually trying to monitor. This means mapping your core Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to your alerting strategy.</p>
<p>Most teams rely too heavily on infrastructure alerts—CPU usage, memory consumption, disk space. These are important, but they're indirect signals. A service can have high CPU usage while serving customers perfectly. Conversely, it can have normal resource usage while completely failing its primary function.</p>
<p>Instead, start with user-facing signals:</p>
<ul>
<li><p><strong>Failed user logins</strong> (not just authentication service uptime)</p>
</li>
<li><p><strong>Checkout completion rates</strong> (not just payment gateway availability)</p>
</li>
<li><p><strong>API response times at the 95th percentile</strong> (not just average latency)</p>
</li>
<li><p><strong>Database query failures</strong> (not just connection pool metrics)</p>
</li>
</ul>
<p>Map these business-critical indicators first. Only after you have comprehensive coverage of what matters should you start tuning what doesn't.</p>
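<p>To make this concrete, here is a minimal sketch of a Prometheus-style rule that alerts on a user-facing signal rather than an infrastructure one. The metric names (<code>checkout_failures_total</code>, <code>checkout_requests_total</code>) are hypothetical placeholders for whatever your services actually export:</p>
<pre><code class="lang-yaml"># Alert on the user-facing SLI (checkout success), not on CPU or memory
- alert: CheckoutFailureRateHigh
  expr: |
    sum(rate(checkout_failures_total[5m]))
      /
    sum(rate(checkout_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: P1
  annotations:
    summary: "More than 5% of checkouts are failing"
</code></pre>
<p>A rule like this fires only while customers are actually affected, regardless of what CPU or memory are doing underneath.</p>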
<p>✅ <strong>Principle:</strong> Only tune alerts after coverage is solid. It's better to have noisy but comprehensive alerting than quiet but blind monitoring.</p>
<h2 id="heading-phase-2-assign-ownership">Phase 2 – Assign Ownership</h2>
<h3 id="heading-every-alert-should-have-an-owner-a-service-and-a-runbook">Every alert should have an owner, a service, and a runbook</h3>
<p>Here's a dirty secret of most alerting systems: nobody owns the alerts. They fire into shared channels where responsibility diffuses across the team. When everyone is responsible, no one is accountable.</p>
<p>This shared ownership model is why alerts never improve. The payment team ignores database alerts because "that's infrastructure's problem." The infrastructure team ignores API latency alerts because "that's the app team's issue." Meanwhile, both alerts keep firing, and your on-call engineer suffers.</p>
<p>The fix is radical but simple: <strong>every alert must have a single owner</strong>. Not a team, not a rotation—a specific service and the team that owns it. This means:</p>
<ul>
<li><p>No more #alerts-general channels where everything dumps</p>
</li>
<li><p>No more "infrastructure noise" channels that everyone mutes</p>
</li>
<li><p>Each team gets their own alert destinations</p>
</li>
<li><p>Each team is accountable for their signal-to-noise ratio</p>
</li>
</ul>
<p>Implement this with proper tagging:</p>
<pre><code class="lang-yaml">alert: HighAPILatency
service: payment-api
team: payments
owner: payments-team@company.com
escalation: payments-oncall
severity: P2
</code></pre>
<p>When alerts have clear ownership, magic happens. The payments team suddenly cares about that flapping API alert because it's waking them up, not some random SRE. They'll fix it, tune it, or justify why it needs to stay.</p>
<p>✅ <strong>Tip:</strong> Alerts without owners almost never get fixed. They become background noise that everyone learns to ignore.</p>
<h2 id="heading-phase-3-enrich-then-tune">Phase 3 – Enrich, Then Tune</h2>
<h3 id="heading-rich-alerts-less-cognitive-load-faster-response">Rich alerts = less cognitive load = faster response</h3>
<p>Now that you have coverage and ownership, it's time to make your alerts actually useful. A bare-bones "Service X is down" notification forces engineers to context-switch, investigate, and piece together what's happening. Rich alerts provide everything upfront.</p>
<p>Essential enrichment includes the following (see the example rule after this list):</p>
<ul>
<li><p><strong>Runbook links</strong>: Step-by-step remediation instructions</p>
</li>
<li><p><strong>Severity levels</strong>: Is this customer-impacting or internal-only?</p>
</li>
<li><p><strong>Business impact</strong>: How many users affected? Which features degraded?</p>
</li>
<li><p><strong>Recent changes</strong>: Did a deployment just go out?</p>
</li>
<li><p><strong>Historical context</strong>: Has this happened before? How was it fixed?</p>
</li>
</ul>
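<p>In Prometheus-style rules, most of this enrichment can live in labels and annotations. A minimal sketch, where the metric names, runbook URL, and dashboard link are placeholders rather than real endpoints:</p>
<pre><code class="lang-yaml">- alert: PaymentAPIErrorRateHigh
  expr: sum(rate(payment_api_errors_total[5m])) / sum(rate(payment_api_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: P2          # internal impact, not yet customer-facing
    team: payments
  annotations:
    summary: "payment-api error rate above 5% for 5 minutes"
    impact: "Card payments degraded; wallet checkout unaffected"
    runbook: "https://wiki.example.com/runbooks/payment-api-errors"
    dashboard: "https://grafana.example.com/d/payment-api"
</code></pre>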
<p>But richness isn't verbosity. Don't dump entire log files into alerts. Instead, provide precisely what's needed for rapid decision-making.</p>
<p>Only after enrichment should you start tuning:</p>
<p><strong>Add intelligent conditions</strong>: Instead of alerting on every spike, require sustained problems (see the sketch after this list):</p>
<ul>
<li><p>Alert only after 3 consecutive failures</p>
</li>
<li><p>Require issues to persist for 5 minutes</p>
</li>
<li><p>Use percentage-based thresholds (5% of requests failing vs. 10 absolute failures)</p>
</li>
</ul>
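<p>In Prometheus, the first two conditions are typically expressed through the evaluation interval and the <code>for</code> clause. A minimal sketch, assuming a 1-minute evaluation interval and the blackbox exporter's <code>probe_success</code> metric:</p>
<pre><code class="lang-yaml"># With a 1-minute evaluation interval, "for: 3m" requires roughly three
# consecutive failing evaluations before the alert fires at all.
- alert: HealthCheckFailing
  expr: probe_success{job="blackbox"} == 0
  for: 3m
</code></pre>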
<p><strong>Adjust thresholds based on reality</strong>: That 80% CPU alert made sense with your old infrastructure. But if your auto-scaling kicks in at 70%, you're alerting on normal operations.</p>
<p><strong>Add flapping protection</strong>: If an alert fires and resolves repeatedly, it needs damping (see the routing sketch after this list):</p>
<ul>
<li><p>Require state changes to persist before alerting</p>
</li>
<li><p>Group rapid-fire alerts into single notifications</p>
</li>
<li><p>Add cooldown periods between alerts</p>
</li>
</ul>
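<p>In an Alertmanager-style setup, grouping and cooldowns are routing concerns. A minimal sketch, where the receiver name is a placeholder:</p>
<pre><code class="lang-yaml"># Alertmanager routing: batch rapid-fire alerts and add cooldowns
route:
  receiver: payments-oncall
  group_by: ['alertname', 'service']
  group_wait: 30s       # wait before sending the first notification for a group
  group_interval: 5m    # batch further alerts joining an already-firing group
  repeat_interval: 4h   # cooldown before re-notifying about the same alerts
</code></pre>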
<p>✅ <strong>Insight:</strong> Context beats volume every time. One well-enriched alert is worth ten noisy notifications.</p>
<h2 id="heading-phase-4-use-data-to-improve-over-time">Phase 4 – Use Data to Improve Over Time</h2>
<h3 id="heading-enter-alert-insights-by-drdroid">Enter: Alert Insights by DrDroid</h3>
<p>Here's where most frameworks fail: they're static. Teams implement phases 1-3, declare victory, and move on. Six months later, they're back to alert fatigue because systems evolve but alerts don't.</p>
<p>You need a continuous feedback loop—a way to measure what's working and what's still broken. This is where <strong>Alert Insights</strong> becomes your secret weapon.</p>
<p>After implementing your alert structure, Alert Insights provides ongoing intelligence:</p>
<ul>
<li><p><strong>Which alerts are firing too often?</strong> That P1 alert that fires 50 times per week probably needs adjustment</p>
</li>
<li><p><strong>Which ones are being ignored?</strong> If engineers acknowledge but never act on an alert, it's pure noise</p>
</li>
<li><p><strong>Which lack runbooks or clear owners?</strong> Gaps in your enrichment strategy become visible</p>
</li>
<li><p><strong>What can be safely muted, disabled, or improved?</strong> Data-driven recommendations, not guesswork</p>
</li>
</ul>
<p>The workflow becomes systematic:</p>
<p><strong>Every sprint:</strong></p>
<ol>
<li><p>Review Alert Insights dashboard</p>
</li>
<li><p>Identify the top 3 worst offenders</p>
</li>
<li><p>Fix ownership, enrichment, or tuning for those alerts</p>
</li>
<li><p>Validate improvements in the next sprint</p>
</li>
<li><p>Repeat</p>
</li>
</ol>
<p>This creates a virtuous cycle. Your alerts get better every sprint. Your on-call experience improves measurably. And you maintain coverage while reducing noise.</p>
<p>➡️ <strong>🧠 Want a clear report on which alerts are hurting your team? 👉</strong> <a target="_blank" href="https://drdroid.io/doctor-droid-slack-integration"><strong>Run DrDroid Alert Insights</strong></a> <strong>— no config required.</strong></p>
<h2 id="heading-bringing-it-all-together-your-teams-framework">Bringing It All Together — Your Team's Framework</h2>
<p>Here's your systematic approach to intelligent alerting:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Phase</strong></td><td><strong>Goal</strong></td><td><strong>Key Action</strong></td></tr>
</thead>
<tbody>
<tr>
<td>1. Coverage First</td><td>Avoid blind spots</td><td>Map alerts to SLOs</td></tr>
<tr>
<td>2. Ownership</td><td>Accountability</td><td>Assign alerts to teams</td></tr>
<tr>
<td>3. Enrichment &amp; Tuning</td><td>Faster resolution</td><td>Add context, reduce flapping</td></tr>
<tr>
<td>4. Feedback Loop</td><td>Continuous improvement</td><td>Use Alert Insights regularly</td></tr>
</tbody>
</table>
</div><p>This isn't a one-time project—it's an ongoing practice. Just like you continuously refactor code, you need to continuously refine alerts. The difference is that now you have a framework and the data to guide your decisions.</p>
<h2 id="heading-final-thought-you-cant-fix-what-you-dont-see">Final Thought — You Can't Fix What You Don't See</h2>
<p>Most teams exist in one of two failure modes. They either suffer in silence with alert fatigue, accepting it as the cost of observability. Or they oversimplify their alerting, creating dangerous blind spots that only become visible during incidents.</p>
<p>Real success looks different: high signal, low noise, and fast resolution. It's alerts that wake you up only when customer impact is imminent. It's notifications that include everything needed to respond. It's a system that improves continuously based on data, not opinions.</p>
<p>This framework gives you the path. Phase by phase, you can transform your alerting from a source of frustration into a competitive advantage. But frameworks only work when you can measure their impact.</p>
<p>Let <strong>Alert Insights</strong> be your guide. It shows what's working, what's broken, and exactly how to improve. No more guessing which alerts to tune. No more hoping you haven't created blind spots. Just data-driven improvements that make your team's life better.</p>
<p>Your engineers deserve better than alert fatigue. Your customers deserve better than missed incidents. This framework delivers both.</p>
<p>➡️ <strong>🛠️ Tired of guessing which alerts are noisy? 👉</strong> <a target="_blank" href="https://drdroid.io/doctor-droid-slack-integration"><strong>Try Alert Insights</strong></a> <strong>and start tuning your alerts based on real data.</strong></p>
]]></content:encoded></item><item><title><![CDATA[KubeCon + CloudNativeCon Europe 2025 Guide – London]]></title><description><![CDATA[Doctor Droid’s Guide to KubeCon + CloudNativeCon Europe 2025
Welcome to our complete guide for navigating KubeCon + CloudNativeCon Europe 2025 in London, England, running from 1–4 April 2025. Whether you’re a seasoned cloud native pro or new to the K...]]></description><link>https://notes.drdroid.io/kubecon-cloudnativecon-europe-2025-guide-london</link><guid isPermaLink="true">https://notes.drdroid.io/kubecon-cloudnativecon-europe-2025-guide-london</guid><category><![CDATA[KubeConLondon]]></category><category><![CDATA[Kubecon]]></category><category><![CDATA[Kubernetes]]></category><dc:creator><![CDATA[Jayesh Sadhwani]]></dc:creator><pubDate>Fri, 14 Feb 2025 08:32:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1739516728159/d72bee5d-8e33-4e45-b305-4bae01e37d47.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-doctor-droids-guide-to-kubecon-cloudnativecon-europe-2025">Doctor Droid’s Guide to KubeCon + CloudNativeCon Europe 2025</h2>
<p>Welcome to our complete guide for navigating KubeCon + CloudNativeCon Europe 2025 in London, England, running from <strong>1–4 April 2025</strong>. Whether you’re a seasoned cloud native pro or new to the Kubernetes world, this guide has everything you need to maximize your conference experience. And don’t forget – visit the Doctor Droid booth at the event! Show us this blog to score an exclusive <strong>20% discount on your ticket</strong> plus a chance to receive special Doctor Droid credits!</p>
<hr />
<h2 id="heading-overview">Overview</h2>
<p>KubeCon + CloudNativeCon is the premier conference for Kubernetes, cloud-native technologies, and open source innovations. Organized by the Cloud Native Computing Foundation (CNCF), this flagship event gathers thousands of developers, engineers, and industry leaders to share ideas, network, and explore the latest trends shaping the future of cloud computing. In the heart of London, expect inspiring keynotes, deep-dive sessions, and engaging co-located events that cover everything from AI/ML integration to edge computing and beyond.</p>
<hr />
<h2 id="heading-access-types">Access Types</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739516645338/d12f9286-4b56-49db-86fa-fbd8279088b2.png" alt class="image--center mx-auto" /></p>
<p>KubeCon + CloudNativeCon Europe 2025 offers a single, all-inclusive pass that grants access to:</p>
<ul>
<li><p>All keynote sessions, breakout tracks, and panel discussions</p>
</li>
<li><p>Hands-on labs and workshops for a practical dive into cloud-native solutions</p>
</li>
<li><p>Co-located events hosted by CNCF and industry partners</p>
</li>
</ul>
<p>Additionally, there are special discounted options available for students and academic participants. For more details, check out the official <a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/">KubeCon + CloudNativeCon Europe website</a> for registration options.</p>
<hr />
<h2 id="heading-exclusive-kubecon-cloudnativecon-europe-2025-discount-save-20-on-tickets-courtesy-of-doctor-droid">Exclusive KubeCon + CloudNativeCon Europe 2025 Discount – Save 20% on Tickets, Courtesy of Doctor Droid!</h2>
<p>That’s right – Doctor Droid is proud to sponsor KubeCon + CloudNativeCon Europe 2025, and we’re offering you an exclusive 20% discount on your ticket! Here’s how to claim your savings:</p>
<ol>
<li><p><strong>Fill out our quick</strong> <a target="_blank" href="https://forms.gle/rMg1xAP34rA1jdM99"><strong>Google Form</strong></a> with your basic information.</p>
</li>
<li><p><strong>Receive your discount code</strong> directly in your inbox.</p>
</li>
<li><p><strong>Register</strong> on the official event website and enjoy your 20% savings!</p>
</li>
</ol>
<p>Hurry up – secure your discount today and get ready for an unforgettable experience in London!</p>
<hr />
<h2 id="heading-speakers-amp-tracks-at-kubecon-cloudnativecon-europe-2025">Speakers &amp; Tracks at KubeCon + CloudNativeCon Europe 2025</h2>
<p>The conference features an impressive line-up of industry thought leaders and technical experts across multiple tracks, including:</p>
<ul>
<li><p><strong>Kubernetes Operations:</strong> Best practices in deployment, scaling, and security.</p>
</li>
<li><p><strong>Cloud Security:</strong> Deep dives into safeguarding cloud native environments.</p>
</li>
<li><p><strong>AI &amp; Machine Learning:</strong> Innovations transforming how we manage and operate Kubernetes.</p>
</li>
<li><p><strong>Edge Computing:</strong> Exploring the future of distributed computing in real-world scenarios.</p>
</li>
</ul>
<p>Be sure to check out the detailed schedule on the <a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/">official event page</a> for a complete list of sessions and speakers.</p>
<hr />
<h2 id="heading-planning-your-london-experience">Planning Your London Experience</h2>
<h3 id="heading-where-to-stay">Where to Stay</h3>
<p>London offers a range of accommodation options to suit every budget:</p>
<ul>
<li><p><strong>Hotels:</strong> Browse options on <a target="_blank" href="https://www.booking.com/">Booking.com</a>, <a target="_blank" href="https://www.agoda.com/">Agoda</a>, or <a target="_blank" href="https://www.airbnb.co.uk/">Airbnb</a> for a comfortable stay near the event venue.</p>
</li>
<li><p><strong>Short-term Rentals:</strong> Consider serviced apartments if you prefer a homier experience during your stay.</p>
</li>
</ul>
<h3 id="heading-getting-there">Getting There</h3>
<p>London is well connected by international air travel:</p>
<ul>
<li><p><strong>Heathrow Airport (LHR)</strong> – The largest and busiest airport, with easy public transit into central London.</p>
</li>
<li><p><strong>Gatwick Airport (LGW)</strong> – A convenient alternative for many international travellers.</p>
</li>
</ul>
<p>Plan your journey ahead to make the most of your time in this vibrant city!</p>
<hr />
<h2 id="heading-visiting-london">Visiting London</h2>
<h3 id="heading-culinary-delights">Culinary Delights</h3>
<p>London’s food scene is as diverse as it is delicious. Whether you’re looking for Michelin-starred restaurants or quirky food markets, here are some recommendations:</p>
<ul>
<li><p><strong>Restaurants:</strong> Try iconic spots in Soho or trendy eateries in Shoreditch.</p>
</li>
<li><p><strong>After-Hours:</strong> Explore vibrant nightlife in Camden or the West End for live music and cocktails.</p>
</li>
<li><p><strong>Coffee Shops:</strong> Recharge at local favorites like Monmouth Coffee or The Attendant for a caffeine boost.</p>
</li>
</ul>
<h3 id="heading-weekend-plans">Weekend Plans</h3>
<p>When you’re not immersed in conference sessions, take some time to explore London’s rich history and culture:</p>
<p><img src="https://www.londonperfect.com/cdn-cgi/image/format=auto,width=1256/https://www.londonperfect.com/g/photos/upload/sml_342226895-1498585820-london-eye-guide.jpg" alt="London Eye" /></p>
<ul>
<li><p><strong>The British Museum:</strong> Discover art and antiquities from around the world.</p>
</li>
<li><p><strong>Tower of London:</strong> Step back in time with a visit to this historic fortress.</p>
</li>
<li><p><strong>Buckingham Palace &amp; Changing of the Guard:</strong> A must-see for first-time visitors.</p>
</li>
<li><p><strong>London Eye:</strong> Enjoy panoramic views of the city skyline.</p>
</li>
</ul>
<hr />
<h2 id="heading-visit-the-doctor-droid-booth">Visit the Doctor Droid Booth</h2>
<p>Doctor Droid is the intelligent Slack bot that accelerates incident diagnosis by automatically pinpointing the root cause of production issues. Simply tag the bot in your alert messages, and let it do the heavy lifting!</p>
<p>Stop by our booth at KubeCon + CloudNativeCon Europe 2025 to discover:</p>
<ul>
<li><p><strong>Live Demos:</strong> See Doctor Droid in action and learn how it can transform your incident response.</p>
</li>
<li><p><strong>Puzzles &amp; Giveaways:</strong> Test your skills and win exciting Doctor Droid goodies.</p>
</li>
<li><p><strong>$500 Doctor Droid Credits:</strong> Show this blog at our booth and receive $500 in credits to supercharge your troubleshooting capabilities.</p>
</li>
</ul>
<p>For those interested in a one-on-one demo, pre-book a meeting with us <a target="_blank" href="https://calendly.com/siddarthjain/kubecon-2024-demo">here</a>.</p>
<hr />
<p>Get ready to experience the future of cloud native computing in one of the world’s most exciting cities – London awaits at KubeCon + CloudNativeCon Europe 2025!</p>
<hr />
<p><em>Happy conferencing, and see you in London!</em></p>
]]></content:encoded></item><item><title><![CDATA[Tools can't buy you good MTTR.. but these 3 practices can]]></title><description><![CDATA[Context
It’s a scenario we’ve all witnessed: teams equipped with cutting-edge observability tools still struggling to catch issues before customers notice.
They’ve invested heavily in top-tier APM solutions, container and infrastructure monitoring, a...]]></description><link>https://notes.drdroid.io/tools-cant-buy-you-good-mttr-but-these-3-practices-can</link><guid isPermaLink="true">https://notes.drdroid.io/tools-cant-buy-you-good-mttr-but-these-3-practices-can</guid><category><![CDATA[observability]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[incident response]]></category><category><![CDATA[logging]]></category><category><![CDATA[#prometheus]]></category><dc:creator><![CDATA[Siddarth Jain]]></dc:creator><pubDate>Tue, 26 Nov 2024 10:08:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732615128540/89904969-35c9-4170-9a6e-3b58ce1cbaa0.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-context">Context</h3>
<p>It’s a scenario we’ve all witnessed: teams equipped with cutting-edge observability tools still struggling to catch issues before customers notice.</p>
<p>They’ve invested heavily in top-tier APM solutions, container and infrastructure monitoring, and log accessibility. Yet, their on-call engineers remain overwhelmed. Incidents happen more frequently than anyone would like, and the spotlight they find themselves in post-incident is never the kind they want.</p>
<p>For engineering teams, being called out for production issues is a tough pill to swallow. The key lies in post-incident action plans that lead to meaningful, systematic improvements.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732612857673/0f6372d4-a2a8-498b-82cc-e3e71c7709cd.png" alt="Whodunnit - I should know before others" class="image--center mx-auto" /></p>
<p>While production incidents can’t be entirely eliminated, well-thought-out preventive measures can dramatically improve operational health.</p>
<h3 id="heading-tools-are-the-baselinenot-the-answer"><strong>Tools Are the Baseline—Not the Answer</strong></h3>
<p>While tools are essential, they primarily address infrastructure or service-level issues. However, most real-world incidents cascade across multiple stacks, often affecting features, products, or customer experiences—areas that are rarely solved by out-of-the-box tools.</p>
<p>To reduce MTTR, teams need processes that improve detection, diagnosis, and resolution speed.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732613249893/ffc38b35-e29c-426e-8c57-7f87271c05e1.png" alt="Cascading issues from infrastructure to customer experience" class="image--center mx-auto" /></p>
<h3 id="heading-measures-for-reducing-mttr-drastically">Measures for reducing MTTR drastically:</h3>
<p>Here are three practices that I have seen help teams improve MTTR significantly:</p>
<ol>
<li><p><strong>Improving Actionability of Alerts (Faster Detection)</strong></p>
<ul>
<li><p>Trustworthy alerts are a cornerstone of effective incident management. Engineers need a single source of truth to detect issues early—before customers or business stakeholders notice.</p>
</li>
<li><p>Poorly configured alerts can destroy this trust, leading teams to rely on escalations from support or business teams instead. Monitoring alert quality is critical. For example, many companies using <a target="_blank" href="https://drdroid.io/doctor-droid-slack-integration">Doctor Droid</a> track alert quality to ensure non-actionable alerts don’t erode confidence in their systems.</p>
</li>
</ul>
</li>
<li><p><strong>Instrumenting Custom Metrics</strong></p>
<ul>
<li><p>Custom metrics are invaluable for tracking operational health and catching issues tied to features and product breakages. Unlike generic service-level metrics, custom metrics provide leading indicators that can help teams spot potential failures before they escalate.</p>
</li>
<li><p>By focusing on metrics relevant to their features and customer experience, teams can gain clarity and react faster.</p>
</li>
</ul>
</li>
<li><p><strong>Faster Fixing Through Runbooks and Quick Links</strong></p>
<ul>
<li><p>Developer experience during on-call is often overlooked. Simple resources like runbooks or quick links for known issues can dramatically reduce the cognitive load on engineers.</p>
</li>
<li><p>For example, a link to a pre-built log query can save critical minutes during an incident. These tools empower teams to pinpoint issues faster, enabling quicker resolutions (see the sketch after this list).</p>
</li>
</ul>
</li>
</ol>
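<p>To illustrate points 2 and 3 together, here is a minimal sketch of a leading-indicator alert on a custom product metric, enriched with a runbook link and a pre-built log query. The metric name and URLs are hypothetical placeholders:</p>
<pre><code class="lang-yaml"># Fire when order placement falls below half of yesterday's rate
- alert: OrderPlacementDropped
  expr: sum(rate(orders_placed_total[10m])) &lt; 0.5 * sum(rate(orders_placed_total[10m] offset 1d))
  for: 10m
  labels:
    severity: P1
    team: checkout
  annotations:
    summary: "Order placement rate is below 50% of the same time yesterday"
    runbook: "https://wiki.example.com/runbooks/order-drop"
    logs: "https://logs.example.com/goto/order-errors-last-1h"
</code></pre>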
<h3 id="heading-conclusion">Conclusion:</h3>
<p>No matter how much you spend on tools, improving MTTR requires engineering investment in processes that enhance detection, diagnosis, and resolution. Custom metrics, actionable alerts, and developer-friendly resources are what truly make the difference.</p>
<p>Engineering teams that focus on these practices find themselves more prepared, more resilient, and better positioned to handle the inevitable challenges of production.</p>
<p><strong>Want to monitor your alerting quality and improve MTTR? Doctor Droid has helped 40+ companies take their incident management to the next level. Get started for free and improve your alerts today!</strong></p>
]]></content:encoded></item><item><title><![CDATA[KubeCon + CloudNativeCon India 2024 Guide -- Delhi]]></title><description><![CDATA[Doctor Droid’s Guide to KubeCon + CloudNativeCon India 2024
Welcome to our complete guide for navigating KubeCon + CloudNativeCon India 2024! Here’s all you need to know to make the most of this event in Delhi, India, from December 11-12, 2024. Don’t...]]></description><link>https://notes.drdroid.io/kubecon-cloudnativecon-india-2024-guide-delhi</link><guid isPermaLink="true">https://notes.drdroid.io/kubecon-cloudnativecon-india-2024-guide-delhi</guid><category><![CDATA[kubeconIN]]></category><category><![CDATA[Kubecon]]></category><category><![CDATA[india]]></category><dc:creator><![CDATA[Jayesh Sadhwani]]></dc:creator><pubDate>Tue, 12 Nov 2024 19:42:07 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-doctor-droids-guide-to-kubecon-cloudnativecon-india-2024"><strong>Doctor Droid’s Guide to KubeCon + CloudNativeCon India 2024</strong></h2>
<p>Welcome to our complete guide for navigating KubeCon + CloudNativeCon India 2024! Here’s all you need to know to make the most of this event in Delhi, India, from December 11-12, 2024. Don’t forget to visit Doctor Droid at Booth—show us this blog for a chance to receive $500 worth of Doctor Droid credits!</p>
<hr />
<h3 id="heading-overview"><strong>Overview</strong></h3>
<p>KubeCon + CloudNativeCon is the leading conference for Kubernetes, cloud-native technologies, and open-source solutions. Hosted by the Cloud Native Computing Foundation (CNCF), this event gathers thousands of developers, engineers, and business leaders to exchange knowledge, network, and discover the future of cloud-native ecosystems.</p>
<hr />
<h3 id="heading-access-types"><strong>Access Types</strong></h3>
<p><strong>KubeCon + CloudNativeCon India 2024</strong> offers a single access pass that caters to all types of attendees and includes access to all sessions, keynotes, and co-located events.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731366135812/1aac2ac0-ed1f-493b-a6a7-156a99db3252.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p><strong>Academic Pass</strong>: Discounted pass for students looking to dive into cloud-native technologies.</p>
</li>
<li><p><strong>Individual pass</strong>: Perfect for attendees who are paying for the conference by themselves.</p>
</li>
</ul>
<p>Be sure to review the registration options on the official <strong>KubeCon + CloudNativeCon India 2024</strong> <a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-india"><strong>website</strong></a>.</p>
<h3 id="heading-exclusive-kubecon-cloudnativecon-india-2024-discount-save-20-on-tickets-courtesy-of-doctor-droid"><strong>Exclusive KubeCon + CloudNativeCon India 2024 Discount – Save 20% on Tickets, Courtesy of Doctor Droid!</strong></h3>
<p>That’s right—Doctor Droid is a proud sponsor of KubeCon + CloudNativeCon India 2024, and we’re hooking you up with an exclusive 20% discount on your tickets! Here’s how to secure your spot and save big:</p>
<ol>
<li><p><strong>Fill out this</strong> <a target="_blank" href="https://forms.gle/4obXDnZ41nQBNMH18"><strong>Google Form</strong></a> with your basic info.</p>
</li>
<li><p><strong>Get your 20% discount code</strong> delivered straight to your inbox.</p>
</li>
<li><p><strong>Register for KubeCon + CloudNativeCon India 2024</strong> and enjoy the savings!</p>
</li>
</ol>
<p>Don’t sit on this—grab your discount before it’s gone!</p>
<h3 id="heading-speakers-amp-tracks-at-kubecon-cloudnativecon-india-2024"><strong>Speakers &amp; Tracks at KubeCon + CloudNativeCon India 2024</strong></h3>
<p><strong>KubeCon + CloudNativeCon</strong> features keynotes from industry leaders and breakout sessions across multiple tracks, including:</p>
<ul>
<li><p><strong>Kubernetes Operations</strong>: Topics around the deployment, scaling, and security of Kubernetes.</p>
</li>
<li><p><strong>AI and Machine Learning</strong>: How AI and ML are transforming Kubernetes.</p>
</li>
<li><p><strong>Edge Computing</strong>: Use cases and solutions for Kubernetes at the edge.</p>
</li>
<li><p><strong>Cloud Security</strong>: Tools and practices to ensure security in a cloud-native environment.</p>
</li>
</ul>
<h3 id="heading-planning-for-kubecon-cloudnativecon-india-2024-delhi-logistics"><strong>Planning for KubeCon + CloudNativeCon India 2024 Delhi Logistics</strong></h3>
<h4 id="heading-stays-near-delhi">Stays near Delhi</h4>
<p>Delhi has several convenient options for accommodation. Whether you prefer hotels or Airbnbs, here are a few recommendations:</p>
<ul>
<li><p><strong>Hotels</strong>: You can find a number of hotels on <a target="_blank" href="http://booking.com"><strong>booking.com</strong></a>, <a target="_blank" href="https://agoda.com">agoda.com</a> or <a target="_blank" href="https://makemytrip.com">makemytrip.com</a></p>
</li>
<li><p><strong>Airbnb</strong>: Airbnb has a healthy number of properties available in the city</p>
</li>
</ul>
<h4 id="heading-airports-nearby">Airports nearby</h4>
<ul>
<li><strong>Indira Gandhi International Airport (DEL)</strong> is the nearest airport to the convention center</li>
</ul>
<h3 id="heading-visiting-delhi"><strong>Visiting Delhi</strong></h3>
<h4 id="heading-restaurants">Restaurants</h4>
<p>Delhi offers a diverse culinary scene. Recommended spots include:</p>
<h4 id="heading-after-hour-locations">After Hour Locations</h4>
<h4 id="heading-coffee-shops">Coffee Shops</h4>
<p>Need a caffeine boost? Check out:</p>
<ul>
<li><p><strong>Blue Tokai Coffee</strong>: Known for freshly roasted coffee.</p>
</li>
<li><p><strong>Third Wave Coffee</strong>: Multiple locations to elevate your coffee experience</p>
</li>
</ul>
<h4 id="heading-weekend-plans">Weekend Plans</h4>
<p>Consider exploring Delhi’s culture and history over the weekend. Popular options include:</p>
<ul>
<li><p><strong>Red Fort</strong>: A historic red sandstone fort in Old Delhi, great for experiencing the city’s culture and food</p>
<p>  <img src="https://lh3.googleusercontent.com/p/AF1QipMzixUK6xvfX9g6zKxOepzWuvo1AfY43mJZAC9g=s1360-w1360-h1020-rw" alt="Photo of Red Fort Lahori Gate" /></p>
</li>
<li><p><strong>Akshardham Temple</strong>: A stunning modern temple complex showcasing India’s rich cultural heritage through intricate architecture, gardens, and a beautiful water show. It’s a must-visit for its grandeur and serenity.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731440167898/44a32c59-d444-4824-9ad8-c0d35de566e0.png" alt class="image--center mx-auto" /></p>
<p>There are many more places you can check out while in Delhi:</p>
<ol>
<li><p><strong>India Gate</strong>: A monumental war memorial built in honor of Indian soldiers, surrounded by beautiful lawns, making it a great spot for a peaceful evening stroll.</p>
</li>
<li><p><strong>Lotus Temple</strong>: Known for its stunning flower-like shape, this temple is a peaceful place for meditation.</p>
</li>
<li><p><strong>National Museum</strong>: A treasure trove of India’s art, history, and culture spanning millennia.</p>
</li>
<li><p><strong>Dilli Haat</strong>: A market with traditional handicrafts and foods from various Indian states. A must-visit for authentic Indian souvenirs.</p>
</li>
</ol>
<h3 id="heading-visit-doctor-droid-booth"><strong>Visit Doctor Droid Booth</strong></h3>
<p>Doctor Droid is a root-cause identification Slack bot that assists on-call engineers in diagnosing incidents and finding the root cause really fast. All you need to do is reply to your alert message in Slack and tag the bot. If you are interested in a demo and want to explore Doctor Droid further, visit us at our booth in the venue!</p>
<p><a target="_blank" href="https://calendly.com/siddarthjain/kubecon-2024-demo"><strong>Pre-book a meeting with us using this link.</strong></a></p>
<p>Stop by our booth to discover how Doctor Droid’s automated RCA can help you debug &amp; fix your production issues faster! What else is up for grabs at the event?</p>
<ul>
<li><p><strong>Puzzles &amp; Goodies</strong>: Test your mental muscle and win some amazing gifts.</p>
</li>
<li><p><strong>$500 Doctor Droid Credits</strong>: Show this blog at our booth to receive $500 in Doctor Droid credits!</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[KubeCon + CloudNativeCon North America 2024 Guide -- Salt Lake City, Utah]]></title><description><![CDATA[Doctor Droid’s Guide to KubeCon + CloudNativeCon North America 2024
Welcome to our complete guide for navigating KubeCon + CloudNativeCon North America 2024! Here’s all you need to know to make the most of this event in Salt Lake City, Utah, from Nov...]]></description><link>https://notes.drdroid.io/kubecon-cloudnativecon-north-america-2024-guide-salt-lake-city-utah</link><guid isPermaLink="true">https://notes.drdroid.io/kubecon-cloudnativecon-north-america-2024-guide-salt-lake-city-utah</guid><category><![CDATA[Kubecon]]></category><category><![CDATA[#cloudnativecon]]></category><dc:creator><![CDATA[Siddarth Jain]]></dc:creator><pubDate>Mon, 04 Nov 2024 19:01:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1730746786664/8e2af6e0-4460-4e9c-983a-63cc29ee2dc7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-doctor-droids-guide-to-kubecon-cloudnativecon-north-america-2024"><strong>Doctor Droid’s Guide to KubeCon + CloudNativeCon North America 2024</strong></h2>
<p>Welcome to our complete guide for navigating KubeCon + CloudNativeCon North America 2024! Here’s all you need to know to make the most of this event in Salt Lake City, Utah, from November 12-15, 2024. Don’t forget to visit Doctor Droid at Booth Q45—show us this blog for a chance to receive $500 worth of Doctor Droid credits!</p>
<hr />
<h3 id="heading-overview">Overview</h3>
<p>KubeCon + CloudNativeCon is the leading conference for Kubernetes, cloud-native technologies, and open-source solutions. Hosted by the Cloud Native Computing Foundation (CNCF), this event gathers thousands of developers, engineers, and business leaders to exchange knowledge, network, and discover the future of cloud-native ecosystems.</p>
<hr />
<h3 id="heading-access-types">Access Types</h3>
<p><strong>KubeCon + CloudNativeCon North America 2024</strong> offers different access passes to cater to all types of attendees:</p>
<ul>
<li><p><strong>Full Access Pass</strong>: Includes access to all sessions, keynotes, and co-located events</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730744336800/322260fe-30a6-4f60-90df-287510cd7a30.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>KubeCon + CloudNativeCon Only Pass:</strong> Includes access to all sessions, keynotes, excluding co-located events</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730744324464/55a245ea-b06f-462a-859c-a756f318459c.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Academic Pass</strong>: Discounted pass for students looking to dive into cloud-native technologies.</p>
</li>
<li><p><strong>Individual pass</strong>: Perfect for attendees who are paying for the conference by themselves.</p>
</li>
</ul>
<p>Be sure to review the registration options on the official <strong>KubeCon + CloudNativeCon North America 2024</strong> <a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/">website</a>.</p>
<h3 id="heading-exclusive-kubecon-cloudnativecon-north-america-2024-discount-save-20-on-tickets-courtesy-of-doctor-droid"><strong>Exclusive KubeCon + CloudNativeCon North America 2024 Discount – Save 20% on Tickets, Courtesy of Doctor Droid!</strong></h3>
<p>That’s right—Doctor Droid is a proud sponsor of KubeCon + CloudNativeCon North America 2024, and we’re hooking you up with an exclusive 20% discount on your tickets! Here’s how to secure your spot and save big:</p>
<ol>
<li><p><strong>Fill out this</strong> <a target="_blank" href="https://forms.gle/z3VSYER6RH97Ruv27"><strong>Google Form</strong></a> with your basic info.</p>
</li>
<li><p><strong>Get your 20% discount code</strong> delivered straight to your inbox.</p>
</li>
<li><p><strong>Register for KubeCon + CloudNativeCon North America 2024</strong> and enjoy the savings!</p>
</li>
</ol>
<p>Don’t sit on this—grab your discount before it’s gone!</p>
<hr />
<h3 id="heading-co-located-events-at-kubecon-cloudnativecon-north-america-2024"><strong>Co-Located Events at KubeCon + CloudNativeCon North America 2024</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730743508197/651c827a-8bdd-4102-84a0-81d149df7b70.png" alt class="image--center mx-auto" /></p>
<p>Expand your KubeCon + CloudNativeCon experience by joining co-located events, each tailored to specific interests within the cloud-native realm:</p>
<ul>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/appdevelopercon/"><strong>AppDeveloperCon</strong></a>: Focuses on tools and techniques for building cloud-native applications.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/argocon/"><strong>ArgoCon</strong></a>: Dive into Argo workflows, events, and continuous delivery for Kubernetes.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/backstagecon/"><strong>BackstageCon</strong></a>: Explore Backstage’s developer portal and best practices for engineering platforms.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/cilium-ebpf-day/"><strong>Cilium + eBPF Day</strong></a>: A deep dive into Cilium and eBPF for networking and security.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/cloud-native-kubernetes-ai-day/"><strong>Cloud Native &amp; Kubernetes AI Day</strong></a>: Discuss AI and ML workloads on Kubernetes.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/cloud-native-startupfest/"><strong>CloudNative StartupFest</strong></a>: Networking and insights for startup founders and innovators.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/cloud-native-university/"><strong>Cloud Native University</strong></a>: Educational sessions on the fundamentals of cloud-native technologies.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/data-on-kubernetes-day/"><strong>Data on Kubernetes Day</strong></a>: Explore data management practices and tools for Kubernetes.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/envoycon/"><strong>EnvoyCon</strong></a>: Dedicated to the Envoy proxy community, focusing on networking and observability.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/istio-day/"><strong>Istio Day</strong></a>: Learn about Istio and service mesh technologies in cloud-native environments.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/kubernetes-on-edge-day/"><strong>Kubernetes on Edge Day</strong></a>: Explore the role of Kubernetes in edge computing.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/observability-day/"><strong>Observability Day</strong></a>: A day centered around observability tools and practices in the cloud.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/openfeature-summit/"><strong>OpenFeature Summit</strong></a>: Discussions on feature flagging and experimentation in cloud-native setups.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/opentofu-day/"><strong>OpenTofu Day</strong>:</a> Open-source infrastructure management and IaC best practices.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/platform-engineering-day/"><strong>Platform Engineering Day</strong></a>: Dedicated to platform engineering in cloud-native environments.</p>
</li>
<li><p><a target="_blank" href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/wasmcon/"><strong>WasmCon</strong></a>: Focuses on WebAssembly and its role in cloud-native development.</p>
</li>
</ul>
<hr />
<h3 id="heading-kubecon-cloudnativecon-north-america-2024-unofficial-conference-parties">KubeCon + CloudNativeCon North America 2024 Unofficial Conference Parties</h3>
<p>After a day of learning and networking, unwind at the <strong>unofficial</strong> conference parties.</p>
<ul>
<li><p>Check out <a target="_blank" href="https://conferenceparties.com/kubecon24/">Conference Parties</a> for the latest info on social events.</p>
</li>
<li><p>Check out <a target="_blank" href="https://lu.ma/salt-lake-city">events on lu.ma</a> too to keep exploring.</p>
</li>
</ul>
<h3 id="heading-speakers-amp-tracks-at-kubecon-cloudnativecon-north-america-2024">Speakers &amp; Tracks at KubeCon + CloudNativeCon North America 2024</h3>
<p><strong>KubeCon + CloudNativeCon</strong> features keynotes from industry leaders and breakout sessions across multiple tracks, including:</p>
<ul>
<li><p><strong>Kubernetes Operations</strong>: Topics around the deployment, scaling, and security of Kubernetes.</p>
</li>
<li><p><strong>AI and Machine Learning</strong>: How AI and ML are transforming Kubernetes.</p>
</li>
<li><p><strong>Edge Computing</strong>: Use cases and solutions for Kubernetes at the edge.</p>
</li>
<li><p><strong>Cloud Security</strong>: Tools and practices to ensure security in a cloud-native environment.</p>
</li>
</ul>
<hr />
<h3 id="heading-planning-for-kubecon-cloudnativecon-north-america-2024-utah-logistics">Planning for KubeCon + CloudNativeCon North America 2024 Utah Logistics</h3>
<h4 id="heading-stays-near-salt-lake-city">Stays near Salt Lake City</h4>
<p>Salt Lake City has several convenient options for accommodation. Whether you prefer hotels or Airbnbs, here are a few recommendations:</p>
<ul>
<li><p><strong>Hotels</strong>: As of now, hotels on booking.com and most other websites are sold out.</p>
</li>
<li><p><strong>Airbnb</strong>: Airbnb still has a healthy number of properties available in the city, although most of them are not within walking distance of the Convention Center.</p>
</li>
<li><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730745479198/44962b15-f040-431b-b3d6-5439b2463c6b.png" alt class="image--center mx-auto" /></p>
</li>
</ul>
<h4 id="heading-airports-nearby">Airports nearby</h4>
<ul>
<li><strong>Salt Lake City International Airport (SLC)</strong> is the nearest airport, just 5 miles from downtown.</li>
</ul>
<hr />
<h3 id="heading-visiting-salt-lake-city">Visiting Salt Lake City</h3>
<h4 id="heading-restaurants">Restaurants</h4>
<p>Salt Lake City offers a diverse culinary scene. Recommended spots include:</p>
<ul>
<li><p><strong>Red Iguana</strong>: Known for authentic Mexican cuisine.</p>
</li>
<li><p><strong>The Copper Onion</strong>: A top choice for American fare.</p>
</li>
<li><p><strong>Takashi</strong>: Excellent sushi in the heart of the city.</p>
</li>
</ul>
<h4 id="heading-after-hour-locations">After Hour Locations</h4>
<ul>
<li><p><strong>Beer Bar</strong>: Perfect for a laid-back evening with a variety of beers.</p>
</li>
<li><p><strong>The Bayou</strong>: Offers an extensive selection of brews and Cajun-style food.</p>
</li>
</ul>
<h4 id="heading-coffee-shops">Coffee Shops</h4>
<p>Need a caffeine boost? Check out:</p>
<ul>
<li><p><strong>La Barba Coffee</strong>: Known for artisanal coffee.</p>
</li>
<li><p><strong>Publik Coffee Roasters</strong>: A great spot to unwind with quality brews.</p>
</li>
</ul>
<h4 id="heading-weekend-plans">Weekend Plans</h4>
<p>Consider exploring Utah’s natural beauty over the weekend. Popular options include:</p>
<ul>
<li><p><strong>Bonneville Salt Flats</strong>: A unique desert landscape.</p>
<p>  <img src="https://images.ctfassets.net/0wjmk6wgfops/17oZGsiEevOkg7tpUeaFG0/67bf9df09bff4cbb4136881fa771b789/AdobeStockSaltFlats.jpeg?w=1200&amp;h=630&amp;f=center&amp;fit=fill" alt="Bonneville Salt Flats | Utah.com" /></p>
</li>
<li><p><strong>Big Cottonwood Canyon</strong>: Ideal for scenic drives and hikes.</p>
<p>  <img src="https://dynamic-media-cdn.tripadvisor.com/media/photo-o/15/4a/aa/ab/big-cottonwood-canyon.jpg?w=1200&amp;h=1200&amp;s=1" alt="BIG COTTONWOOD CANYON: All You Need to Know BEFORE You Go" /></p>
</li>
</ul>
<hr />
<h3 id="heading-visit-doctor-droid-booth-q45">Visit Doctor Droid Booth - Q45</h3>
<p>Doctor Droid is a root-cause identification Slack bot that assists on-call engineers in diagnosing incidents and finding the root cause really fast. All you need to do is reply to your alert message in Slack and tag the bot. If you are interested in a demo and want to explore Doctor Droid further, visit us at Booth Q45 in the venue!</p>
<p><a target="_blank" href="https://calendly.com/siddarthjain/kubecon-2024-demo">Pre-book a meeting with us using this link.</a></p>
<p>Stop by Booth Q45 to discover how Doctor Droid’s automated RCA can help you debug &amp; fix your production issues faster! What else is up for grabs at the event?</p>
<ul>
<li><strong>Puzzles &amp; Goodies</strong>: Test your mental muscle and win some amazing gifts.</li>
</ul>
<ul>
<li><strong>$500 Doctor Droid Credits</strong>: Show this blog at our booth to receive $500 in Doctor Droid credits!</li>
</ul>
<hr />
]]></content:encoded></item><item><title><![CDATA[Keeping keys secure without slowing your iteration speed]]></title><description><![CDATA[Context
At Doctor Droid, we are building a cutting-edge AI recommendation platform for on-call teams. Whenever an alert or ticket is raised, Doctor Droid:

Looks for all past investigations and see if it finds anything similar

Looks for SOPs for the...]]></description><link>https://notes.drdroid.io/keeping-keys-secure-without-slowing-your-iteration-speed</link><guid isPermaLink="true">https://notes.drdroid.io/keeping-keys-secure-without-slowing-your-iteration-speed</guid><category><![CDATA[Developer Tools]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Siddarth Jain]]></dc:creator><pubDate>Mon, 28 Oct 2024 14:57:55 GMT</pubDate><content:encoded><![CDATA[<h3 id="heading-context">Context</h3>
<p>At Doctor Droid, we are building a cutting-edge AI recommendation platform for on-call teams. Whenever an alert or ticket is raised, Doctor Droid:</p>
<ul>
<li><p>Looks through all past investigations to see if anything similar has occurred before</p>
</li>
<li><p>Looks for SOPs for the issue at hand (these SOPs are also created by Doctor Droid by reading past Slack threads &amp; existing docs)</p>
</li>
<li><p>Executes autonomous investigation for popular infrastructure &amp; microservices symptoms.</p>
</li>
</ul>
<h3 id="heading-problem-statement">Problem Statement</h3>
<p>This requires a fair bit of experimentation with our early adopters and extensive use of Jupyter Notebooks. Since the notebooks are often not connected to a cloud environment, how does one manage secrets and ensure they are not lying around anywhere? I wanted a solution where keys would be available JUST-IN-TIME (retrieved only at the moment I need to run something) and become unavailable right after.</p>
<h3 id="heading-solution">Solution</h3>
<p>With <a target="_blank" href="https://infisical.com/">Infisical</a>, I found a convenient solution for this issue. Here’s how it works:</p>
<ol>
<li><p>Step 1: Configure keys</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730126554303/563ae6c1-eb2a-4464-b8ba-e1dd53fa83ed.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Step 2: Use APIs to retrieve keys securely on-the-go</p>
<pre><code class="lang-python">import json
import requests

# Authenticate against Infisical (universal auth) to obtain a short-lived token
url = "https://app.infisical.com/api/v1/auth/universal-auth/login"

payload = 'clientSecret=xxxx&amp;clientId=yyyyy'
headers = {
  'Content-Type': 'application/x-www-form-urlencoded'
}

response = requests.request("POST", url, headers=headers, data=payload)
access_token = json.loads(response.text)['accessToken']

# Retrieve the secret just-in-time using the short-lived access token
url = "https://app.infisical.com/api/v3/secrets/raw/KEY_NAME?workspaceId=xxxx&amp;environment=dev"

payload = {}
headers = {
  'Authorization': f'Bearer {access_token}'
}

response = requests.request("GET", url, headers=headers, data=payload)
KEY_VALUE = json.loads(response.text)['secret']['secretValue']
</code></pre>
</li>
</ol>
<h3 id="heading-benefits-of-using-infisical">Benefits of using Infisical:</h3>
<ol>
<li><p>Change the environment and get the updated key</p>
</li>
<li><p>Quarantine keys easily: If you’ve been close to any production incident, you’ll know that being able to flush keys in a jiffy is super important and, at the same time, super difficult because of their underlying dependencies across the stack. Infisical gives me the buffer of instantly disabling access by disabling the Infisical key/secret.</p>
</li>
<li><p>Free to get started: It’s an open-source project with a convenient cloud option</p>
</li>
<li><p>Plenty of features: I’ve probably only used about 5% of the platform so far, so as my requirements expand, I expect to keep discovering new capabilities easily.</p>
</li>
<li><p>Helpful team / community: They have a community, a prompt support team and well-written documentation.</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[How to investigate Sentry Alert with Doctor Droid]]></title><description><![CDATA[Sentry
Sentry is one of the best tools in the industry right now for error and exception tracking. It has high quality SDKs across the stack and has great integrations as well as a powerful dashboard for users to learn about an exception in the code....]]></description><link>https://notes.drdroid.io/how-to-investigate-sentry-alert-with-doctor-droid</link><guid isPermaLink="true">https://notes.drdroid.io/how-to-investigate-sentry-alert-with-doctor-droid</guid><category><![CDATA[observability]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[sentry]]></category><category><![CDATA[Datadog]]></category><dc:creator><![CDATA[Siddarth Jain]]></dc:creator><pubDate>Wed, 16 Oct 2024 03:35:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729028150451/59b9b956-aa10-458f-8340-fd16001a04a8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-sentry">Sentry</h2>
<p>Sentry is one of the best tools in the industry right now for error and exception tracking. It has high quality SDKs across the stack and has great integrations as well as a powerful dashboard for users to learn about an exception in the code. You can read more about <a target="_blank" href="https://sentry.io/">Sentry here</a>.</p>
<h2 id="heading-debugging-an-exception-in-sentry">Debugging an exception in Sentry</h2>
<p>An exception can arise for any number of reasons. It could be a code change at the place where the exception surfaced, a code change upstream, bad user input, or even just an edge case that hadn’t appeared until now. With canary deployments it gets even trickier, as different containers or users could be running different versions of the code.</p>
<p>In production, investigation of a simple looking Sentry issue can span across multiple data sources &amp; contexts:</p>
<ul>
<li><p>Your infrastructure &amp; deployment resources like Kubernetes</p>
</li>
<li><p>Your code repository to check the code for recent changes or even analysing the flow of data</p>
</li>
<li><p>Your database/logs to check for user entered data</p>
</li>
<li><p>Discussion with internal team members regarding expected behaviour</p>
</li>
</ul>
<h2 id="heading-using-doctor-droid-to-debug-the-issue">Using Doctor Droid to debug the issue</h2>
<p><img src="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/bb3e54be-c81e-4e27-94e9-cd70b8512eae.png" alt="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/bb3e54be-c81e-4e27-94e9-cd70b8512eae.png" /></p>
<h3 id="heading-initiating-an-investigation">Initiating an investigation</h3>
<p>You can start investigation of an alert directly from the home page which has all the recent alerts.</p>
<p>Once an investigation is created, this is what it looks like.</p>
<p><img src="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/b842a098-3f4c-4f2e-a880-ae95524f7703.png" alt="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/b842a098-3f4c-4f2e-a880-ae95524f7703.png" /></p>
<p>Here are the key elements of the investigation panel:</p>
<ul>
<li><p>The alerts that it's investigating</p>
</li>
<li><p>The recommended investigation strategy and preliminary data for your evaluation</p>
</li>
<li><p>Additional panels related to related investigations or alerts</p>
</li>
</ul>
<h2 id="heading-investigation-strategy">Investigation Strategy</h2>
<p>So what has it been able to fetch so far? Depending on the alert context, the platform recommends different steps.</p>
<p>It identified that the first thing that it should check is the stack trace itself in Sentry. So it goes and fetches the stack trace from Sentry, including the culprit.</p>
<p><img src="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/cf955f34-af13-4420-bb9e-21206da9cdf9.png" alt="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/cf955f34-af13-4420-bb9e-21206da9cdf9.png" /></p>
<p>Once it fetches that, it looks for recent code changes related to the same stack trace within your GitHub repository. It shows you the recent commits and their URLs, so you can check whether something changed there in the last couple of days.</p>
<p>You can then check whether there was any recent deployment within your Kubernetes infrastructure that could be correlated with it. Given that this alert relates to the prototype instance, we can see a couple of releases for prototype in the last hour, which could potentially be the reason this alert came up. And now you have all the data here.</p>
<p><img src="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/311e56e6-5e55-402a-bc35-41f6b3411528.png" alt="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/311e56e6-5e55-402a-bc35-41f6b3411528.png" /></p>
<p>You can also chat with it and ask it for more data.</p>
<p>What's also good is that it gives you references to existing playbooks, dashboards, or any other data points your system already has. We have integrations with almost every tool your monitoring and observability stack is likely to include.</p>
<p><img src="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/52574c07-4608-41ce-9ed5-786641447b89.png" alt="https://usercontent.clueso.io/0c1a6fdb-c4b9-444d-8679-430918685457/3dcfaf7f-c1b7-44fe-a9f4-185e6bbe18ec/3e71f15e-81d4-460a-98c5-9e546b527c73/images/52574c07-4608-41ce-9ed5-786641447b89.png" /></p>
<p>We also offer options to self-host these integrations so that the data remains within your own data plane.</p>
<h2 id="heading-try-it-today">Try it today</h2>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=tx_x8BHCK38">https://www.youtube.com/watch?v=tx_x8BHCK38</a></div>
<p> </p>
<p>If this looks exciting to you, we have a lot more demos coming up, such as how to auto-investigate an API latency alert, or how to investigate CPU utilisation alerts on your databases.</p>
<p>Visit <a target="_blank" href="http://www.drdroid.io">www.drdroid.io</a> and try it out on your own stack. We offer a free trial, and if you have any questions, please reach out to us; we'll be happy to answer.</p>
]]></content:encoded></item><item><title><![CDATA[Dr. Patternson: How Meta reduced their MTTR by 50% using AIOps]]></title><description><![CDATA[Introduction
For Meta, reducing downtime has been crucial to ensuring millions (or should I say Billions?) of users have a seamless experience. Recently, Meta shared about one of their internal platforms that helped reduce MTTR by ~50% for critical a...]]></description><link>https://notes.drdroid.io/dr-patternson-how-meta-reduced-their-mttr-by-50-using-aiops</link><guid isPermaLink="true">https://notes.drdroid.io/dr-patternson-how-meta-reduced-their-mttr-by-50-using-aiops</guid><dc:creator><![CDATA[Karan Sirohi]]></dc:creator><pubDate>Fri, 11 Oct 2024 03:48:46 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1728588725618/869b310b-a262-49b3-a50e-2c31f28b57c5.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>For Meta, reducing downtime has been crucial to ensuring millions (or should I say Billions?) of users have a seamless experience. Recently, <a target="_blank" href="https://atscaleconference.com/the-evolution-of-aiops-at-meta-beyond-the-buzz/">Meta shared details about one of their internal platforms that helped reduce MTTR by ~50% for critical alerts</a>.</p>
<p>This blog explores how Meta accomplished this by leveraging AI, machine learning &amp; runbook automation to transform its incident response processes, making them faster and more efficient.</p>
<p>Let's dive into the key components that enabled this efficiency gain.</p>
<h2 id="heading-objective">Objective</h2>
<p>Imagine trying to find a needle in a haystack—blindfolded. That’s what incident management can feel like without the right tools. Meta’s objective was to take off that blindfold and make the process of finding and fixing problems as swift and accurate as possible. Their goal? Cut down the Mean Time to Resolution (MTTR) by half, so that when things go wrong, they can be fixed faster than you can say “downtime.”</p>
<p>By harnessing the power of AI and machine learning, Meta aimed to automate the grunt work of incident management—spotting issues, figuring out what’s broken, and fixing it—all without requiring a superhero on standby. This isn’t just about cool tech; it’s about making sure users experience as little disruption as possible, turning potential disasters into minor hiccups that barely anyone notices.</p>
<h2 id="heading-what-meta-built">What Meta built?</h2>
<p>Alright, let's peek under the hood of Meta's incident-busting machine. They didn't just slap on a new coat of paint; they rebuilt the entire engine. Here are the three turbocharged components that turned their incident response from a clunky old jalopy into a sleek, AI-powered sports car:</p>
<h3 id="heading-component-1-automated-runbooks">Component 1: Automated Runbooks</h3>
<p>Remember those old-school detective novels where the brilliant sleuth solves the case with a magnifying glass and a pipe? Well, Meta created a digital Sherlock Holmes, minus the pipe smoke. They call it Dr. Patternson (Dr. P for short), and it's like having a tireless detective on call 24/7.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728588890981/44b57c0e-3287-4851-a241-9df4c0eef4e2.png" alt class="image--center mx-auto" /></p>
<p>Dr. P is an automated runbook system that encodes expert knowledge into executable investigation workflows. It's like giving every on-call engineer a cheat sheet written by the smartest person in the room. With its own SDK, simplified APIs, and ML algorithms, Dr. P can quickly analyze data, correlate events, and generate findings faster than you can say "Elementary, my dear Watson."</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728588921209/72f4c352-55ca-4e8e-ba76-834d3870ade3.png" alt class="image--center mx-auto" /></p>
<p>But wait, there's more! Dr. P comes with a fully managed platform that deploys these runbooks, monitors for issues, and even triggers investigations automatically when an alert fires. It's like having a whole team of digital detectives working round the clock, leaving no log unturned.</p>
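<p>Meta hasn't open-sourced Dr. P's SDK, so to make "expert knowledge encoded as an executable workflow" concrete, here is a minimal, hypothetical sketch in Python. Every name in it is invented for illustration; the real system also layers ML, deployment, and monitoring on top.</p>
<pre><code class="lang-python"># Hypothetical mini-runbook in the spirit of Dr. Patternson; none of these
# names come from Meta's actual (non-public) SDK.
from dataclasses import dataclass, field


@dataclass
class Runbook:
    name: str
    steps: list = field(default_factory=list)

    def step(self, fn):
        self.steps.append(fn)  # register an encoded investigation step
        return fn

    def investigate(self, alert: dict) -&gt; list:
        # Run every step against the alert context and collect findings.
        return [(fn.__name__, *fn(alert)) for fn in self.steps]


rb = Runbook("delivery-latency-spike")


@rb.step
def check_recent_deploys(alert):
    deploys = alert.get("recent_deploys", [])
    return ("suspicious" if deploys else "clear", {"deploys": deploys})


@rb.step
def check_error_rate(alert):
    rate = alert.get("error_rate", 0.0)
    return ("suspicious" if rate &gt; 0.05 else "clear", {"error_rate": rate})


# Triggered automatically when an alert fires:
for finding in rb.investigate({"recent_deploys": ["api-v231"], "error_rate": 0.09}):
    print(finding)
</code></pre>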
<h3 id="heading-component-2-analysis-algorithms-service">Component 2: Analysis Algorithms Service</h3>
<p>If Dr. P is the detective, then the Analysis Algorithms Service is its time machine. This nifty piece of tech allows Meta's engineers to zoom through vast amounts of data at warp speed. Picture this: You've got more data than stars in the sky, and you need to find that one glowing red dot that's causing all the trouble. That's where this service comes in. It's packed with ML algorithms for dimensional analysis, time series analysis, anomaly detection, and more. But the real magic is in its pre-aggregation layer, which shrinks datasets by up to 500 times! It's like compressing the entire library of Congress into a pocket-sized book, without losing a single word.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728588946255/ede8c12d-c0ab-4f79-adf7-1322a8a1b3be.png" alt class="image--center mx-auto" /></p>
<p>The result? Insights that used to take hours now pop up in seconds. It's so fast, you might think it's predicting the future. (Spoiler alert: it's not. That's still on the roadmap for 2025.)</p>
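<p>To make "pre-aggregation" concrete, here's a toy sketch in Python with pandas: collapse raw per-event rows into per-minute, per-dimension rollups before any analysis runs. The 500x figure is Meta's; the compression you actually get depends on your data's cardinality.</p>
<pre><code class="lang-python"># Toy pre-aggregation: a million raw latency events collapse into a few
# hundred rollup rows before any analysis algorithm touches them.
import numpy as np
import pandas as pd

n = 1_000_000
raw = pd.DataFrame({
    "ts": pd.Timestamp("2024-10-01")
          + pd.to_timedelta(np.random.randint(0, 3600, n), unit="s"),
    "region": np.random.choice(["us-east", "eu-west", "ap-south"], n),
    "latency_ms": np.random.lognormal(3, 0.5, n),
})

# Per-region, per-minute rollups: count, mean, and max latency.
rollup = (raw.set_index("ts")
             .groupby("region")
             .resample("1min")["latency_ms"]
             .agg(["count", "mean", "max"]))

print(len(raw), "raw rows collapsed into", len(rollup), "aggregated rows")
</code></pre>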
<h3 id="heading-component-3-event-isolation-assistance">Component 3: Event Isolation Assistance</h3>
<p>Last but not least, we have the Event Isolation Assistance. Think of it as a super-smart metal detector for that proverbial needle in a haystack.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728588955915/2f075996-7f70-4fdf-a924-f60b970d644b.png" alt class="image--center mx-auto" /></p>
<p>This system uses ML models to rank thousands of events and pinpoint the root cause of an incident. It's like having a psychic on your team, except this one actually works. By focusing on config-based and code-based isolation, it can filter out 80% of the uninteresting events during an active investigation. That's right, it separates the wheat from the chaff, leaving engineers with a much smaller, much more suspicious pile of events to investigate.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728588965024/d48edc52-69ac-4437-990f-f1133b936a1c.png" alt class="image--center mx-auto" /></p>
<p>But it doesn't just point fingers. The system provides annotations explaining its reasoning, making it transparent and trustworthy. It's like having a really smart friend who not only tells you the answer but also shows their work.</p>
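<p>Meta's ranker is a trained ML model over thousands of signals, but the shape of the problem can be sketched with a hand-rolled scorer. Everything below is illustrative: score each candidate change event against the incident, keep the top suspects, and attach the reasoning as annotations.</p>
<pre><code class="lang-python"># Toy event-isolation scorer; a real system learns these weights from data.
from datetime import datetime, timedelta

incident_start = datetime(2024, 10, 1, 12, 0)
incident_service = "ads-manager"

events = [
    {"id": 1, "type": "config", "service": "ads-manager",
     "ts": incident_start - timedelta(minutes=7)},
    {"id": 2, "type": "code", "service": "feed",
     "ts": incident_start - timedelta(hours=9)},
    {"id": 3, "type": "code", "service": "ads-manager",
     "ts": incident_start - timedelta(minutes=30)},
]

def suspicion(ev):
    score, notes = 0.0, []
    if ev["service"] == incident_service:
        score += 2.0
        notes.append("same service as the incident")
    if ev["type"] in ("config", "code"):
        score += 1.0
        notes.append("config/code changes are prime suspects")
    age_h = (incident_start - ev["ts"]).total_seconds() / 3600
    score += max(0.0, 1.0 - age_h / 24)  # recency bonus decays over 24h
    return score, notes

# Rank all events, keep only the top suspects, and show the annotations.
for ev in sorted(events, key=lambda e: suspicion(e)[0], reverse=True)[:2]:
    score, notes = suspicion(ev)
    print(ev["id"], round(score, 2), "; ".join(notes))
</code></pre>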
<h3 id="heading-guided-investigations">Guided Investigations:</h3>
<p>Sometimes, even the smartest AI needs a human touch. That's where Guided Investigations come in. Think of it as a choose-your-own-adventure book, but for fixing tech problems.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728588972827/a0786254-3ea8-41c0-94b2-5d0baa1530f5.png" alt class="image--center mx-auto" /></p>
<p>These decision trees provide step-by-step workflows that help investigators narrow down the root cause of an issue. It's like having a seasoned pro whispering in your ear, guiding you through the digital labyrinth. By combining automated workflows with human expertise, these guided investigations can tackle complex issues that might stump a fully automated system.</p>
<p>And the best part? They're right where you need them, integrated with Meta's detection systems. It's like having a tech support genie, ready to pop out whenever an alert goes off. No need to rub a lamp – just click a button!</p>
<h3 id="heading-current-state-of-investigations-at-meta"><strong>Current state of Investigations at Meta:</strong></h3>
<p>So, where does Meta stand now in their AIOps journey? Let's just say they've gone from digital chaos to zen master status.</p>
<p>Today, Meta's foundational systems are more popular than cat videos (well, almost). Hundreds of teams have adopted them, running over 500,000 analyses per week. That's more check-ups than a hypochondriac gets in a lifetime!</p>
<p>The impact? A cool 50% decrease in MTTR for critical alerts across the company. It's like they've upgraded from a horse-drawn buggy to a supersonic jet when it comes to fixing problems. Take the Ads Manager team, for instance. They've gone from spending days investigating issues to resolving them in minutes. It's like they've traded in their magnifying glass for a high-powered microscope with AI-assisted focusing.</p>
<h2 id="heading-implementing-your-own-dr-patternson-using-doctor-droid"><strong>Implementing your Own Dr. Patternson using Doctor Droid:</strong></h2>
<p>If you want to implement a solution like Dr. Patternson within your team without investing the time or cost that Meta did, you might want to explore <a target="_blank" href="http://drdroid.io/"><strong>Doctor Droid</strong></a>.</p>
<p>Doctor Droid is an AI-assisted intelligence platform that helps engineering teams <strong>reduce investigation time of production issues by 10x</strong>. Here's what you can do with Doctor Droid:</p>
<p>(a) Codify your investigation mental models:</p>
<p><a target="_blank" href="https://github.com/DrDroidLab/playbooks"><strong>Doctor Droid PlayBooks</strong></a> is an Open-Source On-call automation platform. With one click, you can run your investigation steps and have all the diagnosis data across all tools, directly fed in response to your alerts.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728588990105/1780d94d-6b11-4a67-bab8-5e8fdef27a1d.png" alt class="image--center mx-auto" /></p>
<p>(b) Leverage your past knowledge to get intelligent suggestions:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728589001242/5d97552a-a82d-45b6-83f0-81875e3aa48d.png" alt class="image--center mx-auto" /></p>
<p>Doctor Droid's <a target="_blank" href="https://docs.drdroid.io/docs/doctor-droid-aiops-platform"><strong>AIOps Platform</strong></a> can provide your on-call engineers with intelligent recommendations by leveraging the knowledge that is already accessible in your systems.</p>
<p>With (a) and (b) combined, you effectively end up with an equivalent of Dr. Patternson.</p>
<p>Try it out today by signing up <a target="_blank" href="http://drdroid.io/"><strong>here</strong></a>!</p>
<h1 id="heading-conclusion">Conclusion:</h1>
<p>Meta's AIOps journey has transformed their incident response from a digital firefight into a well-oiled machine. Let's break down the impressive results:</p>
<ul>
<li><p>50% reduction in Mean Time to Resolution (MTTR) for critical alerts across the company</p>
</li>
<li><p>Over 500,000 automated analyses run per week</p>
</li>
<li><p>80% of uninteresting events filtered out during active investigations</p>
</li>
<li><p>Ads Manager team improved investigation time from days to minutes</p>
</li>
<li><p>Nearly 50% of previously manual investigations now automated</p>
</li>
</ul>
<p>These statistics paint a picture of a dramatically more efficient system, but what does it mean in the real world?</p>
<p>For Meta, it means:</p>
<ul>
<li><p>Fewer service disruptions for millions of users</p>
</li>
<li><p>Faster resolution when issues do occur</p>
</li>
<li><p>Engineers spending less time on repetitive tasks and more on innovation</p>
</li>
<li><p>Improved overall system reliability and user experience</p>
</li>
</ul>
<p>The secret sauce? A combination of:</p>
<ol>
<li><p>Automated Runbooks (Dr. Patternson)</p>
</li>
<li><p>Analysis Algorithms Service</p>
</li>
<li><p>Event Isolation Assistance</p>
</li>
<li><p>Guided Investigations</p>
</li>
</ol>
<p>This powerful quartet has turned Meta's incident response into a symphony of efficiency, conducting a harmonious blend of AI automation and human expertise.</p>
<p>As we look to the future, Meta's AIOps journey serves as a beacon for the tech industry. It shows us that with the right tools and approach, we can tame the chaos of complex systems and create a more reliable digital world. So the next time you scroll through your feed without a hitch, remember - there's a good chance Meta's AIOps team had a hand in making that seamless experience possible.</p>
]]></content:encoded></item><item><title><![CDATA[RCACoPilot: A breakdown of how Microsoft built their Automated RCA Bot]]></title><description><![CDATA[Introduction
Big Tech companies often have scale enough to justify allocating resources to building internal tools. In this blog, we discuss about RCACoPilot -- an automated incident classification and investigation engine built by Microsoft to impro...]]></description><link>https://notes.drdroid.io/rcacopilot-a-breakdown-of-how-microsoft-built-their-automated-rca-bot</link><guid isPermaLink="true">https://notes.drdroid.io/rcacopilot-a-breakdown-of-how-microsoft-built-their-automated-rca-bot</guid><dc:creator><![CDATA[Karan Sirohi]]></dc:creator><pubDate>Mon, 02 Sep 2024 12:33:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1725279726262/86116b34-c4cd-468c-a50d-cd9c8c538a5c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Big Tech companies often have enough scale to justify allocating resources to building internal tools. In this blog, we discuss RCACoPilot -- an automated incident classification and investigation engine built by Microsoft to improve the lives of their on-call engineers.</p>
<p>Less than a year ago, Microsoft published a <a target="_blank" href="https://yinfangchen.github.io/assets/pdf/rcacopilot_paper.pdf">research paper</a> discussing RCACoPilot. It's a longish paper (~16 pages), so I decided to condense it into a shorter blog.</p>
<h2 id="heading-context">Context</h2>
<p><strong>Picture this:</strong> You're an on-call engineer at Microsoft. It's 3 AM, and suddenly, alerts start blaring. Something's wrong with the email service (which delivers over 150 billion messages daily) that millions of people rely on. Your job? Figure out what's causing the issue and fix it ASAP. No pressure, right?</p>
<p>This scenario plays out all too often at Microsoft &amp; at most companies. A company's systems are only getting more complex by the day, and on-call engineers are drowning in a sea of alerts, logs, and metrics. This often leads to escalations and time spent on tickets rather than planned work. They needed a way to streamline the process, quickly make sense of all this information, and zero in on the root cause of problems. That's what RCACoPilot tries to solve for them.</p>
<h2 id="heading-why-should-you-even-read-this-article-the-results">Why should you even read this article? The Results</h2>
<p>Let's cut to the chase: how well does RCACopilot perform, and is its approach even worth reading about? Its performance is pretty impressive, as it turns out.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724853161354/8522c6ad-9d2c-48a4-b7e6-38e079fa0922.png" alt class="image--center mx-auto" /></p>
<p>Tested on 653 real-world incidents from Microsoft's email service (which handles about 150 billion messages daily), RCACopilot achieved:</p>
<ul>
<li><p><strong>76.6% accuracy</strong> in predicting root cause categories</p>
</li>
<li><p>A <strong>Macro-F1 score of 0.533</strong>, showing good performance across various incident types</p>
</li>
<li><p>Significantly reduced MTTR:</p>
<ul>
<li><p>Auto-diagnosis runs took 1-10 minutes on average, depending on the complexity of the incident handlers (more on these below).</p>
</li>
<li><p>An average classification time of just 4.2 seconds per incident.</p>
</li>
</ul>
</li>
</ul>
<p>[Quick note on that Macro-F1 score: It's a measure that gives equal importance to each category, regardless of how often it appears. A score of 0.533 tells us that RCACopilot performs well across various incident types, not just the common ones. This is crucial in a system where rare, critical issues are just as important as frequent, minor ones.]</p>
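<p>In code, macro-F1 is simply the unweighted mean of per-class F1 scores; a quick illustration with scikit-learn (the labels below are made up):</p>
<pre><code class="lang-python"># Macro-F1 averages per-class F1 scores without weighting by class frequency,
# so rare incident categories count as much as common ones.
from sklearn.metrics import f1_score

y_true = ["db", "db", "network", "config", "config", "config"]
y_pred = ["db", "network", "network", "config", "config", "db"]

print(f1_score(y_true, y_pred, average="macro"))  # equal weight per category
</code></pre>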
<p>These numbers outperformed all baseline methods, including traditional machine learning approaches and GPT models without fine-tuning.</p>
<p>But the real proof is in the deployment. Parts of RCACopilot have been in use at Microsoft for over four years, across more than 30 teams. On-call engineers report significant time savings in incident management tasks, from diagnosis to mitigation.</p>
<p>In practice, this means that within a few minutes, most of the likely investigation steps are run and analysed, and the engineers are given a likely root-cause category for the incident.</p>
<h2 id="heading-rcacopilot-the-architecture">RCACoPilot -- The Architecture</h2>
<p>Now, let's peek under the hood of RCACoPilot. Think of it as a super-smart detective for computer problems. Here's a simple breakdown of how it works:</p>
<ol>
<li><p><strong>Diagnosis Identifier:</strong> In the original paper, this is part of what they call the "Diagnostic Information Collection Stage". It is responsible for taking in an incident from their alerting tool, parsing the incident context, and matching it to the existing playbooks (called <strong>Incident Handlers</strong>) based on the mapping logic defined.</p>
</li>
<li><p><strong>Diagnosis Data Fetch &amp; Summarisation:</strong> The playbook identified in the previous step is custom-defined by the on-call engineers (<strong>called OCEs</strong>). The system now executes all the steps pre-configured in the playbook -- from fetching logs &amp; metrics to running diagnostic scripts.</p>
</li>
<li><p><strong>Incident Predictor:</strong> Now on top of the diagnosis data that's fetched, a couple of things are done: converting it into an embedding and comparing the embedding with past embeddings. These embeddings are now leveraged to identify the potential Root Cause for this issue. This corresponds to the "Root Cause Prediction Stage" in the paper.</p>
</li>
</ol>
<p>The actual benefit of RCACoPilot lies in how the entire pipeline works together. When an alert comes in, the Incident Handler kicks into gear, following predefined playbooks to gather relevant data. This could involve querying databases, analyzing log files, or even running diagnostic scripts.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724852702784/1a29f67f-8e50-4f4f-b379-5dafe51d3a9a.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-component-1-diagnosis-identifier">Component 1: Diagnosis Identifier</h3>
<p><strong>Incident Parser:</strong></p>
<p>The Incident Parser is the entry point of RCACoPilot. It's connected to the company's existing alerting systems through triggers or webhooks. It then parses the incident to identify the key entities of interest.</p>
<p><strong>Handler:</strong></p>
<p>A Handler is a set of pre-defined steps to be run for a specific type of investigation. This is effectively a programmatic SOP for a certain type of issue. These are NOT AI GENERATED -- they are what every on-call engineer documents: the investigation strategies for a type of issue within their service.</p>
<p>The Incident Handler uses several types of actions to investigate and respond to incidents:</p>
<ol>
<li><p><strong>Scope Switching Action</strong>: This allows the handler to adjust its focus dynamically. It might start by looking at a single server, then expand to an entire cluster if needed.</p>
</li>
<li><p><strong>Query Action</strong>: Think of this as the handler's way of asking questions. It can pull data from various sources like databases, log files, or even run scripts to gather system information. The results come back as key-value pairs, giving the handler structured data to work with.</p>
</li>
<li><p><strong>Mitigation Action</strong>: Sometimes, the handler can take steps to address the problem directly. This could involve restarting a service, clearing disk space, or even calling in specialized teams for complex issues.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724852887556/d5821a2b-be9e-4f40-ad60-213e1b1062d7.png" alt class="image--center mx-auto" /></p>
<p>This is what an incident handler looks like for a "too many messages stuck in the delivery queue" alert.</p>
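<p>The paper doesn't publish the handler schema, so here is an illustrative rendering of that same handler as plain data; every field name below is invented for the sketch, not Microsoft's actual format:</p>
<pre><code class="lang-python"># Hypothetical encoding of the "stuck delivery queue" handler, combining the
# three action types described above; field names are illustrative.
stuck_queue_handler = {
    "trigger": "DeliveryQueueBacklogAlert",
    "actions": [
        {"kind": "query", "run": "get_queue_depth", "scope": "server"},
        {"kind": "scope_switch", "to": "cluster",
         "when": "queue_depth above threshold on several servers"},
        {"kind": "query", "run": "fetch_transport_error_logs", "scope": "cluster"},
        {"kind": "mitigation", "run": "restart_transport_service",
         "when": "logs show a stalled transport process"},
    ],
}

for action in stuck_queue_handler["actions"]:
    print(action["kind"], "-", action.get("run", action.get("to")))
</code></pre>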
<h3 id="heading-component-2-diagnosis-data-fetch-amp-summarisation">Component 2: <strong>Diagnosis Data Fetch &amp; Summarisation</strong></h3>
<p>In this part, the system executes a series of commands as per the incident handler definition.</p>
<ul>
<li><p>It has access to different internal / external tools with the relevant information.</p>
</li>
<li><p>Each step in the handler is structured and mapped to a technical task (be it an API call, a log fetch, a metric fetch, or any other task).</p>
</li>
<li><p>The system automatically interacts with each of these data tools and fetches the relevant data.</p>
</li>
</ul>
<p>These are some of the data points that the system can fetch from the handlers (playbooks):</p>
<ul>
<li><p>Logs: Application logs, system logs, security logs.</p>
</li>
<li><p>Metrics: Performance metrics and resource utilization stats.</p>
</li>
<li><p>Traces: Detailed records of how requests flow through the system.</p>
</li>
<li><p>Configuration data: Current system settings that might be relevant.</p>
</li>
</ul>
<h3 id="heading-component-3-incident-predictor">Component 3: Incident Predictor</h3>
<p>The ML part of RCACopilot happens in what the paper calls the "Incident Predictor". This is where it uses Large Language Models (LLMs) to make sense of all the data collected by the Incident Handler.</p>
<p>Here's how it uses LLMs to analyze incidents:</p>
<ol>
<li><strong>Summarization</strong>: First, the LLM takes all the diagnostic information collected and creates a concise summary. This step is crucial because it condenses vast amounts of data into something manageable for both the AI and human engineers.</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724853041221/f9226a64-b749-4f55-ae28-8fd55029a319.png" alt class="image--center mx-auto" /></p>
<ol start="2">
<li><p><strong>Similarity Matching</strong>: Next, RCACopilot creates an embedding to represent incidents as points in a high-dimensional space and then uses a nearest-neighbour search algorithm. ELI5: similar incidents will be close to each other in this space. (A minimal sketch of this idea follows this list.)</p>
<p> Using this technique, RCACopilot finds past incidents that are most similar to the current one. This is important because similar past incidents can provide valuable clues about the current problem.</p>
</li>
<li><p><strong>Chain-of-Thought Prompting</strong>: This is where things get really interesting. RCACopilot uses a technique called "Chain-of-thought" prompting. Instead of just asking the LLM "What's the root cause?", it prompts the model to think through the problem step-by-step, much like a human engineer would.</p>
<p> It does this by showing the LLM examples of how similar past incidents were solved. This is akin to training a junior engineer by walking them through past case studies before asking them to solve a new problem.</p>
</li>
<li><p><strong>Root Cause Prediction</strong>: Based on this careful analysis, the LLM then predicts the most likely root cause of the incident. But it doesn't stop there.</p>
</li>
<li><p><strong>Explanation Generation</strong>: Crucially, the LLM also generates an explanation for its prediction. This isn't just a black box spitting out an answer - it's more like a colleague explaining their reasoning. This explanation helps human engineers understand and verify the AI's conclusion.</p>
</li>
</ol>
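<p>To make the similarity-matching step concrete, here is a minimal sketch using a toy TF-IDF embedding and brute-force nearest-neighbour search; a production system would use an LLM embedding model and a vector index instead, and the incident texts below are invented:</p>
<pre><code class="lang-python"># Toy version of "embed the summary, retrieve the most similar past incident".
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

past_incidents = {
    "queue backlog after transport config push": "root cause: config change",
    "smtp auth failures after certificate rotation": "root cause: certificate",
    "delivery latency spike from storage throttling": "root cause: capacity",
}

vectorizer = TfidfVectorizer()
past_matrix = vectorizer.fit_transform(list(past_incidents))

current = "messages stuck in queue after new transport config"
sims = cosine_similarity(vectorizer.transform([current]), past_matrix)[0]

best = int(np.argmax(sims))
print(list(past_incidents.values())[best], f"(similarity={sims[best]:.2f})")
</code></pre>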
<p><strong>Feedback loop:</strong> While it doesn't learn in real-time, feedback from engineers can be used to periodically retrain and improve the model. This means RCACopilot can get better over time, learning from each incident it analyzes.</p>
<p>By leveraging the power of LLMs in this way, RCACoPilot can quickly analyze complex incidents, drawing insights from vast amounts of data and past experiences. It's like having an AI assistant that has seen every incident your organization has ever faced, can think through problems step-by-step, and can clearly explain its reasoning. This not only speeds up incident resolution but also helps engineers learn and improve their own diagnostic skills.</p>
<h2 id="heading-implementing-your-own-rcacopilot-using-doctor-droid">Implementing your Own RCACoPilot using Doctor Droid:</h2>
<p>If you want to implement a solution like RCACoPilot within your team without investing the time or cost like Microsoft, you might want to explore <a target="_blank" href="http://drdroid.io/">Doctor Droid</a>.</p>
<p>Doctor Droid is an AI-assisted intelligence platform that helps engineering teams <strong>reduce investigation time of production issues by 10x</strong>. Here's what you can do with Doctor Droid:</p>
<p>(a) Codify your investigation mental models:</p>
<p><a target="_blank" href="https://github.com/DrDroidLab/playbooks">Doctor Droid PlayBooks</a> is an Open-Source On-call automation platform. With one click, you can run your investigation steps and have all the diagnosis data across all tools, directly fed in response to your alerts.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725279397010/1ff2bb93-5216-42ab-bad1-80a325d88812.png" alt class="image--center mx-auto" /></p>
<p>(b) Leverage your past knowledge to get intelligent suggestions:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725279522497/e5ab412e-e509-4793-9056-ad71e014ec4f.png" alt class="image--center mx-auto" /></p>
<p>Doctor Droid's <a target="_blank" href="https://docs.drdroid.io/docs/doctor-droid-aiops-platform">AIOps Platform</a> can provide your on-call engineers with intelligent recommendations by leveraging the knowledge that is already accessible in your systems.</p>
<p>With (a) and (b) combined, you effectively end up with an equivalent of RCACoPilot.</p>
<p>Try it out today by signing up <a target="_blank" href="http://drdroid.io/">here</a>!</p>
]]></content:encoded></item><item><title><![CDATA[How to set up your dev environment for editing & contributing code?]]></title><description><![CDATA[Playbooks is a web server application that interacts with a Django API server via Nginx. It also includes Celery workers for scheduling asynchronous tasks and a persistence layer consisting of Postgres and Redis cache.
https://www.youtube.com/watch?v...]]></description><link>https://notes.drdroid.io/how-to-setup-your-dev-environment-for-editing-contributing-code</link><guid isPermaLink="true">https://notes.drdroid.io/how-to-setup-your-dev-environment-for-editing-contributing-code</guid><category><![CDATA[doctor-droid]]></category><category><![CDATA[automation]]></category><category><![CDATA[development]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[Devops]]></category><category><![CDATA[observability]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[runbooks]]></category><dc:creator><![CDATA[Mohit Goyal]]></dc:creator><pubDate>Wed, 28 Aug 2024 13:08:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1724780864472/e9f073d5-12a0-4f86-bc62-224aa06b24bd.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Playbooks is a web server application that interacts with a Django API server via Nginx. It also includes Celery workers for scheduling asynchronous tasks and a persistence layer consisting of Postgres and a Redis cache.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=OiwgXSncNuo">https://www.youtube.com/watch?v=OiwgXSncNuo</a></div>
<p> </p>
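<p>As an aside, here's a minimal illustration of how a Celery worker hooks into a Redis-backed stack like the one described above; the app name, broker URL, and task are placeholders, since the actual project defines its own:</p>
<pre><code class="lang-python"># Minimal Celery wiring (illustrative): Redis as the broker, one async task.
from celery import Celery

app = Celery("playbooks", broker="redis://localhost:6379/0")

@app.task
def run_playbook_step(step_id: int) -&gt; str:
    # Executed asynchronously by a worker process rather than the web server.
    return f"executed step {step_id}"
</code></pre>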
<h3 id="heading-setup-phase">Setup Phase</h3>
<p><strong>Step 1:</strong> Ensure a running instance of Postgres and Redis on the local machine. Installation guides are available for different machine types to set up Postgres and Redis. Alternatively, Docker can be used for the same purpose.</p>
<p><strong>Step 2:</strong> Use the DB Docker Compose file located in the Docker folder to set up Postgres and Redis.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724838550495/7bc5e607-7f5a-4534-a0e4-8bb5faf1550d.png" alt class="image--center mx-auto" /></p>
<p><strong>Step 3:</strong> Verify the setup by checking Docker Desktop to see if both instances are running.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724838580729/21604d0a-6996-4535-9ab9-2192b003d0f3.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-build-phase">Build Phase</h3>
<p><strong>Step 4:</strong> Create all the tables and relations that the API server will use by running the following command:</p>
<pre><code class="lang-yaml"><span class="hljs-string">python</span> <span class="hljs-string">manage.py</span> <span class="hljs-string">migrate</span>
</code></pre>
<p><strong>Step 5:</strong> Confirm the creation of all tables and relations in the Postgres DB using a tool like Postico.</p>
<h3 id="heading-run-phase">Run Phase</h3>
<p><strong>Step 6:</strong> Start the API server on the local machine using the command below:</p>
<pre><code class="lang-yaml"><span class="hljs-string">python</span> <span class="hljs-string">manage.py</span> <span class="hljs-string">runserver</span>
</code></pre>
<p>The server will run on port 8000 by default; you can pass a different port if you prefer (e.g., <code>python manage.py runserver 8080</code>).</p>
<p><strong>Step 7:</strong> Go to the web folder and use the following command to bring up the React application</p>
<pre><code class="lang-yaml"><span class="hljs-string">npm</span> <span class="hljs-string">start</span>
</code></pre>
<p><strong>Step 8:</strong> Inside the web folder, use the following command to start the Nginx server. This ensures the React application can connect with the Django application.</p>
<pre><code class="lang-yaml"><span class="hljs-string">nginx</span> <span class="hljs-string">-c</span> <span class="hljs-string">$PWD/nginx.local.conf</span> <span class="hljs-string">-g</span> <span class="hljs-string">"daemon off;"</span>
</code></pre>
<p><strong>Step 9:</strong> Once both servers are running, open a browser and go to localhost to see the running application.</p>
<h3 id="heading-testing-changes">Testing Changes</h3>
<p><strong>Step 10:</strong> Test changes by creating a playbook and an HTTP task. For example, remove the method field in the API task and save the changes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724839051543/0c896b55-6811-456f-9493-9f25f53eafb4.png" alt class="image--center mx-auto" /></p>
<p><strong>Step 11:</strong> Django will detect the changes and reload the server automatically. Refresh the application platform and try to create an API task. The method field should be missing.</p>
<p><strong>Step 12:</strong> To restore the method field, undo the changes, save, and refresh the page. Try to create the task again.</p>
<p>For further assistance with local setup, refer to the details in <a target="_blank" href="https://github.com/DrDroidLab/playbooks">the GitHub project</a>. The contribution page includes a link to the playbook architecture and local setup guide.</p>
]]></content:encoded></item><item><title><![CDATA[What is a PlayBook and what are the core components of a playbook?]]></title><description><![CDATA[A playbook is a set of instructions that a Doctor Droid bot or an on-call engineer follows during a production incident.
https://www.youtube.com/watch?v=T9KfunP9juA
 
A playbook consists of tasks. A task is an instruction that's executed through the ...]]></description><link>https://notes.drdroid.io/what-is-a-playbook-and-what-is-core-components-of-a-playbook</link><guid isPermaLink="true">https://notes.drdroid.io/what-is-a-playbook-and-what-is-core-components-of-a-playbook</guid><category><![CDATA[doctor-droid]]></category><category><![CDATA[Playbooks]]></category><category><![CDATA[automation]]></category><category><![CDATA[runbooks]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[observability]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Karan Sirohi]]></dc:creator><pubDate>Wed, 28 Aug 2024 10:29:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1724829837491/07645849-7660-4582-93c8-20093423561f.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A playbook is a set of instructions that a Doctor Droid bot or an on-call engineer follows during a production incident.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=T9KfunP9juA">https://www.youtube.com/watch?v=T9KfunP9juA</a></div>
<p> </p>
<p>A playbook consists of tasks. A task is an instruction that's executed through the portal. Let's create a task.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840709790/2960d410-97b9-42fb-abf2-00ea44a219ff.png" alt class="image--center mx-auto" /></p>
<p>A task could involve fetching a metric from CloudWatch or running a kubectl command on a server. Let's create a new task to fetch logs from CloudWatch.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840729590/cc3f65df-7422-4fc9-9e85-59d9c8a93bf7.png" alt class="image--center mx-auto" /></p>
<p>Add notes to the playbook. A note is a custom guideline for the user, related to the playbook or a specific step. Add a note indicating that this task fetches logs.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840747525/cbc11b6f-1457-4ebd-94cc-a1866bffea67.png" alt class="image--center mx-auto" /></p>
<p>Hover over a step to view the note. It's also possible to add multiple tasks in a step. For instance, fetch a Datadog metric. After adding the task, check the metric.</p>
<p>Add variables to the playbook. For example, add a service as a variable. To use the variable, enter a dollar sign followed by the variable name.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840778752/000962cf-194b-4657-ae61-25500891e62e.png" alt class="image--center mx-auto" /></p>
<p>Add a step with conditions. A condition is a rule that determines whether a certain action should be taken. For example, fetch logs from CloudWatch and check whether the row count equals six. Then add a task, add a metric, and run the task.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840806580/e5084c2e-9815-441d-b211-79df2c44e023.png" alt class="image--center mx-auto" /></p>
<p>Save the playbook. Execute the playbook and observe the results with the condition. If the condition isn't met, the step isn't recommended and the next step is executed.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840835236/8cffe674-b6f0-47f2-9f0f-b293faaf63b8.png" alt class="image--center mx-auto" /></p>
<p>These are all the core components of a playbook.</p>
]]></content:encoded></item><item><title><![CDATA[How to do post-deployment monitoring with Doctor Droid?]]></title><description><![CDATA[Before starting, ensure that you have set up Doctor Droid Playbooks with at least one playbook and a Slack or MS Teams integration. Check out these tutorials on how to get this done.
https://www.youtube.com/watch?v=T9KfunP9juA
 
Consider a scenario where a...]]></description><link>https://notes.drdroid.io/how-to-do-post-deployment-monitoring-with-doctor-droid</link><guid isPermaLink="true">https://notes.drdroid.io/how-to-do-post-deployment-monitoring-with-doctor-droid</guid><category><![CDATA[doctor-droid]]></category><category><![CDATA[deployment]]></category><category><![CDATA[automation]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[deployment automation]]></category><category><![CDATA[Canary deployment]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Karan Sirohi]]></dc:creator><pubDate>Wed, 28 Aug 2024 10:21:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1724830380080/aadf2755-8866-40b9-bc86-bc5aaf61e00e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Before starting, ensure that you have set up Doctor Droid Playbooks with at least one playbook and a Slack or MS Teams integration. <a target="_blank" href="https://www.youtube.com/watch?v=zURhxSbUGlQ&amp;list=PL-09IrZSH_gJXP4XfosRsMBIcj6iFu4ts&amp;index=1">Check out these tutorials</a> on how to get this done.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=T9KfunP9juA">https://www.youtube.com/watch?v=T9KfunP9juA</a></div>
<p> </p>
<p>Consider a scenario where a critical deployment is planned. As a developer, you would want to track a set of logs and metrics continuously after the deployment.</p>
<p>In large-scale deployments, monitoring is crucial. Tight thresholds and noisy services often lead to frequent alerts. If something breaks, the process starts over, and everything is monitored again with tight thresholds. This can be a hassle. To simplify it, set up a post-deployment monitoring workflow.</p>
<h3 id="heading-setup-a-workflow">Setup a workflow</h3>
<p>Step 1: Navigate to the Doctor Droid Playbooks platform and create a new workflow for this purpose.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840108921/7043695c-78f9-4f97-9a12-d2479c3f368a.png" alt class="image--center mx-auto" /></p>
<p>Step 2: Select the API trigger and choose a playbook that includes tasks for querying the logs &amp; metrics you want to track after the deployment.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840116626/6e6656f3-a7c1-4cbb-9ab5-b3d9eff8cdb1.png" alt class="image--center mx-auto" /></p>
<p>Step 3: Configure the workflow to execute this playbook and publish its summary in Slack.</p>
<p>Step 4: Select the appropriate channel and set up a cron schedule. For instance, set it to run every minute (<code>* * * * *</code>).</p>
<p>Step 5: Save the cron schedule and then save the workflow.</p>
<h3 id="heading-test-the-workflow">Test the workflow</h3>
<p>Step 1: Copy the API trigger code and run it from your terminal. You'll receive a workflow execution ID.</p>
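<p>The copied trigger code is typically a curl command; an equivalent sketch in Python is below, where the endpoint, payload, and token are placeholders for whatever the platform actually gives you:</p>
<pre><code class="lang-python"># Hypothetical API trigger call; copy the real URL, token, and payload from
# the workflow's trigger configuration.
import requests

resp = requests.post(
    "https://&lt;your-drdroid-host&gt;/api/workflows/trigger",  # placeholder URL
    headers={"Authorization": "Bearer &lt;YOUR_API_TOKEN&gt;"},  # placeholder token
    json={"workflow_name": "post-deployment-monitoring"},  # placeholder payload
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # expect a workflow execution ID in the response
</code></pre>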
<p>Step 2: Observe the successful workflow execution with the provided ID.</p>
<p>Step 3: Once the workflow executes, the output of the metrics and logs from the playbook configured in it is published to the selected Slack channel.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724840122675/04ac6c49-6960-4428-b54c-862c1344e1ce.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item></channel></rss>