Many AI systems promise speed yet ask teams to trust an opaque process. In incident response, that's not good enough. If an AI agent were to say “root cause found” and present a solution without clearly documented reasoning and evidence, engineers would revalidate dashboards, logs, and traces and rebuild the investigation. That defeats the purpose.
Site reliability engineers need a glass-box AI SRE. It shows its work step by step and links to the exact dashboard panels, log queries, deploy diffs, and docs it used. It lives in your workflow (for example, in Slack), so you can ask follow-ups and run the next step from the thread, such as “narrow to the last two deployments” or “show only api-gateway errors.” Controls are practical and action-oriented: diagnostics run immediately, mitigations propose the next step, higher-risk changes follow your existing approval policy. Each step includes a confidence signal, and recommendations reflect it. Everything is logged for audit. That's the kind of AI teammate an on-call engineer needs at 2 a.m.
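To make that control model concrete, here is a minimal sketch in Python of how an action-tier policy might behave. The tier names and the `handle` helper are illustrative assumptions, not RunLLM's actual configuration; the real policy follows your existing approval workflow.

```python
# Hypothetical sketch of an action-tier policy. Tier names and the `handle`
# helper are illustrative assumptions, not RunLLM's actual configuration.
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    DIAGNOSTIC = "diagnostic"   # read-only checks: run immediately
    MITIGATION = "mitigation"   # propose as the next step in the thread
    HIGH_RISK = "high_risk"     # route through your existing approval policy

@dataclass
class Action:
    name: str
    tier: Tier

def handle(action: Action, approved: bool = False) -> str:
    """Decide how an action is treated based on its risk tier."""
    if action.tier is Tier.DIAGNOSTIC:
        return f"run now: {action.name}"
    if action.tier is Tier.MITIGATION:
        return f"propose in thread: {action.name}"
    # Higher-risk changes only run after explicit approval.
    return f"execute: {action.name}" if approved else f"awaiting approval: {action.name}"

print(handle(Action("query api-gateway error-rate panel", Tier.DIAGNOSTIC)))
print(handle(Action("roll back api-gateway to the previous deploy", Tier.HIGH_RISK)))
```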
1) Reasoning engineers can verify.
Each investigation includes a clear trace of how the agent reached its conclusion, with links to the exact metrics, logs, deploys, and docs it used. Engineers can drill into raw data and challenge any step. That's the core of a glass-box approach.
2) Evidence-backed causal timelines.
Instead of a one-shot guess, the agent assembles incident timelines that connect “what happened” to “why it happened,” ranking likely causes by confidence and the strength of supporting evidence (a minimal data sketch follows this list).
3) Detail on demand.
The agent leads with the bottom line and shows details only when asked. In a tense investigation, a wall of raw data slows you down. Deep links to specific dashboards, panels, and queries let on-call engineers drill down and spot-check when needed.
4) Runbook-aware investigations.
The agent reads and follows your runbooks during an investigation. It calls out which steps it used, with links, so any on-call engineer can see how the guidance was applied.
5) Runbooks that improve over time.
Operator corrections are captured and reflected in shared runbooks and rules. Fixes are documented, and repeat incidents get resolved faster. This is how continuous learning happens. The AI SRE can also recommend runbook improvements after an incident is resolved.
6) Safe autonomy, under SRE control.
Risky actions require approvals. Teams get human-in-the-loop controls, auditability, and a clear path to add automation where it is safe.
7) Works where engineers work.
Investigations happen in Slack and across the existing stack. An AI SRE should integrate cleanly with tools like PagerDuty, Grafana, Datadog, Jira, and more. Asking an SRE to work any other way slows response when it matters.
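To illustrate points 1 and 2, here is a minimal Python sketch of what an evidence-backed timeline entry might look like. The field names, URLs, and incident details are illustrative assumptions, not a documented schema.

```python
# Minimal sketch of an evidence-backed timeline entry. Field names, URLs, and
# the incident details are illustrative assumptions, not a documented schema.
from dataclasses import dataclass, field

@dataclass
class Evidence:
    kind: str        # e.g. "metric", "log_query", "deploy_diff", "runbook"
    description: str
    permalink: str   # deep link to the exact panel, query, or diff

@dataclass
class TimelineEntry:
    timestamp: str
    what_happened: str
    why_it_happened: str
    confidence: float                     # surfaced to the on-call engineer
    evidence: list[Evidence] = field(default_factory=list)

entry = TimelineEntry(
    timestamp="2024-05-01T02:13:00Z",
    what_happened="5xx rate on api-gateway jumped from 0.2% to 9%",
    why_it_happened="deploy v1.42 shortened the upstream timeout",
    confidence=0.8,
    evidence=[
        Evidence("metric", "error-rate panel", "https://grafana.example.com/d/abc?viewPanel=4"),
        Evidence("deploy_diff", "api-gateway v1.41 -> v1.42", "https://git.example.com/api-gateway/compare/v1.41...v1.42"),
    ],
)
print(entry.why_it_happened, entry.confidence)
```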
System and telemetry specialization (your SRE dialect).
The agent learns your service map and ownership, SLOs, metric and label conventions, log fields and error codes, trace spans, deploy artifacts and naming, runbook vocabulary, and common failure modes. When it investigates, it speaks your terminology and pulls the right panels, queries, diffs, and runbook steps on the first try, making its reasoning and evidence immediately useful.
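As a rough illustration, a learned “SRE dialect” for one service might look something like the sketch below. Every key and value here is an assumption for illustration only, not a required format.

```python
# Hypothetical "SRE dialect" profile the agent might learn for one service.
# Every key and value below is an assumption for illustration only.
sre_dialect = {
    "services": {
        "api-gateway": {
            "owner": "platform-team",
            "slo": {"availability": 0.999, "latency_p99_ms": 300},
            "error_code_field": "http.status_code",
            "deploy_artifact": "api-gateway:{git_sha}",
        },
    },
    "metric_conventions": {
        "service_label": "service",
        "error_rate": "sum(rate(http_requests_total{status=~'5..'}[5m]))",
    },
    "runbook_vocabulary": {"brownout": "deliberate partial degradation under load"},
    "common_failure_modes": ["connection-pool exhaustion after deploys"],
}
print(sre_dialect["services"]["api-gateway"]["slo"])
```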
Deterministic tool use across observability, change, and collaboration.
The agent does not just chat. It calls the same systems engineers use (for example, Datadog, Splunk, GCP Logging, Grafana), runs targeted queries, and returns permalinks to the exact panels, log searches, and diffs it used. It lives in Slack and supports programmatic follow-ups so next steps happen in the thread.
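A minimal sketch of what a deterministic, permalink-producing tool call could look like follows. The function, fields, and URL are hypothetical stand-ins for whatever clients your stack actually uses, not a real vendor API.

```python
# Hypothetical shape of a deterministic tool call that returns a permalink.
# The function, fields, and URL are stand-ins, not a real vendor API.
from dataclasses import dataclass
from urllib.parse import quote

@dataclass
class ToolCall:
    tool: str        # e.g. "datadog", "grafana", "gcp_logging"
    query: str       # the exact query the agent ran
    permalink: str   # link to the same view an engineer would open
    summary: str     # short, human-readable result

def query_error_logs(service: str) -> ToolCall:
    query = f"service:{service} severity>=ERROR"
    # A real integration would call the vendor's client and capture the share
    # link it returns; here we only record the shape of the result.
    permalink = f"https://logs.example.com/search?q={quote(query)}"
    return ToolCall("gcp_logging", query, permalink, f"error volume for {service}, last hour")

call = query_error_logs("api-gateway")
print(call.permalink)
```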
Agentic planning with verification and human-in-the-loop control.
Instead of a one-shot guess, the planner iterates: plan → check → refine → recommend. Every step is logged as a reasoning trace with linked evidence. Higher-risk actions follow your approval policy. Confidence signals are surfaced so suggestions reflect certainty.
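Here is a sketch of that loop under stated assumptions: the functions below are stubs standing in for the real planner, tools, and scoring, not RunLLM's implementation.

```python
# Minimal sketch of the plan -> check -> refine -> recommend loop with a
# logged reasoning trace. All functions and data are illustrative stubs.
import random

def plan(alert):
    return f"hypothesis: recent deploy caused {alert['symptom']}"

def check(hypothesis, tools):
    # Stand-in for targeted queries against observability tools.
    return [{"source": t, "permalink": f"https://{t}.example.com/q/123"} for t in tools]

def score(hypothesis, evidence):
    # Stand-in for judging how well the evidence supports the hypothesis.
    return round(random.uniform(0.5, 0.95), 2)

def refine(hypothesis, evidence):
    return hypothesis + " (narrowed to last two deploys)"

def recommend(hypothesis, trace):
    return {"recommendation": hypothesis, "confidence": trace[-1]["confidence"], "trace": trace}

def investigate(alert, tools, max_iterations=3, confidence_floor=0.7):
    trace = []                                   # every step is logged with its evidence
    hypothesis = plan(alert)
    for _ in range(max_iterations):
        evidence = check(hypothesis, tools)      # run targeted queries
        confidence = score(hypothesis, evidence)
        trace.append({"hypothesis": hypothesis, "evidence": evidence, "confidence": confidence})
        if confidence >= confidence_floor:       # confident enough to recommend
            break
        hypothesis = refine(hypothesis, evidence)
    # Higher-risk recommendations would still route through the approval policy.
    return recommend(hypothesis, trace)

result = investigate({"symptom": "elevated 5xx on api-gateway"}, ["grafana", "gcp-logging"])
print(result["recommendation"], result["confidence"])
```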
Show your work. Ask the vendor to reproduce an RCA on a recent outage and provide the complete investigation log with links to the exact panels, log queries, deploy diffs, and docs. Screenshots and summaries alone are not enough.
Detail on demand. Verify that the agent starts with a summary and expands evidence only when asked, rather than flooding the channel with raw data.
Runbook-aware. Confirm the agent reads your runbooks and calls out which steps it used, with links to those sections.
Runbooks that improve. Correct the agent once during the trial. Check that the correction applies on the next investigation and is reflected in the shared runbook or rules.
Workflow fit. Trigger an alert into Slack and see if the agent delivers findings, accepts follow-ups (“narrow to last two deploys,” “show only api-gateway errors”), and escalates with context into Jira or Service Desk.
Guardrails and audit. Test the path for higher-risk actions (for example, restart or rollback). You should see clear approvals and a complete audit log of what ran and why.
Confidence signals. Ensure each step exposes a confidence level and that recommendations reflect it, for example stricter thresholds when confidence is low.
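One simple way to test that last point is to check that recommendations become more conservative as confidence drops. The sketch below uses illustrative thresholds and a hypothetical helper, not documented defaults.

```python
# Hypothetical check for confidence-gated recommendations: the lower the
# confidence, the more conservative the suggested action. Thresholds are
# illustrative, not documented defaults.
def gate_recommendation(action: str, confidence: float) -> str:
    if confidence >= 0.9:
        return f"recommend: {action}"
    if confidence >= 0.7:
        return f"recommend with caveats: {action} (verify linked evidence first)"
    return f"do not recommend yet: gather more evidence before {action}"

print(gate_recommendation("roll back api-gateway to v1.41", 0.65))
```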
SRE leaders do not need magic. They need fewer war rooms, faster RCAs, and a team that is less burned out and more prepared for the next incident. A glass-box AI SRE helps by making every step of the investigation visible, teachable, and reusable.
If you want to see a glass-box investigation in action, bring a real incident. We'll wire your stack, run an alert-triggered investigation in Slack, and share the full reasoning trace and evidence links so your team can verify every step.
Learn more: RunLLM AI SRE