Picture this: your API is down, customers are calling, and your new AI incident tool cheerfully announces it has "rolled back deployment #847 to resolve the issue." You never saw the logs it checked, you didn't approve the rollback, and you're left wondering whether the system just resolved the incident or created a bigger one. Even if it was the right call, that isn't helpful. It's loss of control and outsized risk at the worst possible moment.
The biggest risk in adding AI to incident response isn't that it won't be smart enough. It's that it will act on its own. When AI tries to take the wheel in production, engineers are forced to choose between blind trust and firefighting to undo its mistakes. Either way, the AI becomes a liability, not a teammate. The only viable path is AI that is controllable by design.
The right answer is to build AI with clear boundaries that preserve human agency and amplify human capability. Controllable by design means the system provides verifiable facts, qualified recommendations, and suggested actions — but never executes changes without explicit permission.
To be useful, AI must behave like a respectful teammate: diligent in surfacing evidence, thoughtful in proposing hypotheses, and careful to leave the final decision in human hands. This requires both structured roles and practical guardrails.
To keep boundaries clear, the AI should play three structured roles:
1) Gather Facts — Evidence without judgment
Collect and align the raw data an engineer would check anyway, without drawing conclusions.
Example: "DB connection errors started at 14:32:15. Deployment #847 occurred at 14:31:45. Here are the relevant log lines and the deployment diff."
Avoids: Unsupported correlations or silent execution.
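To make this concrete, here is a rough sketch of what a facts-only payload might look like. The names (Fact, FactBundle), the incident id, and the log and CI links are illustrative assumptions, not a real API; the point is that the structure carries timestamps and links to raw evidence, never conclusions.

```python
# A minimal sketch of a "facts only" payload. Class names, fields, dates,
# and URLs are hypothetical placeholders for illustration.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Fact:
    source: str          # e.g. "app-logs", "deploy-history"
    timestamp: datetime  # when the event happened
    summary: str         # what was observed, with no interpretation
    link: str            # deep link to the raw evidence

@dataclass
class FactBundle:
    incident_id: str
    facts: list[Fact] = field(default_factory=list)

    def add(self, source: str, timestamp: datetime, summary: str, link: str) -> None:
        self.facts.append(Fact(source, timestamp, summary, link))

    def timeline(self) -> list[Fact]:
        # Present evidence in time order; drawing conclusions is left to humans.
        return sorted(self.facts, key=lambda f: f.timestamp)

bundle = FactBundle(incident_id="INC-2051")  # hypothetical incident id
bundle.add("app-logs", datetime(2024, 5, 1, 14, 32, 15),
           "DB connection errors started", "https://logs.example.com/query")
bundle.add("deploy-history", datetime(2024, 5, 1, 14, 31, 45),
           "Deployment #847 rolled out", "https://ci.example.com/deployments/847")
```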
2) Make Recommendations — Hypotheses with confidence
When patterns are clear, propose next steps — always with evidence and a confidence level.
Example: "High confidence (87%): recommend rolling back deployment #847. Evidence: timing correlation, error pattern matches incident #1247, and rollback is the first mitigation in the runbook."
Avoids: Prescriptive commands without rationale.
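A recommendation can be a small, inspectable object rather than a bare chat message. This sketch is illustrative (the Recommendation class and its fields are assumptions, not a real library); what matters is that the confidence score and the supporting evidence travel together with the hypothesis.

```python
# Illustrative shape for a recommendation: a hypothesis plus the evidence
# behind it and an explicit confidence score. Nothing here executes anything.
from dataclasses import dataclass

@dataclass
class Recommendation:
    hypothesis: str        # what the AI thinks is going on
    confidence: float      # 0.0 - 1.0, surfaced to the engineer as-is
    evidence: list[str]    # the facts that support the hypothesis
    suggested_action: str  # human-readable next step, never auto-run

rec = Recommendation(
    hypothesis="Deployment #847 introduced the DB connection errors",
    confidence=0.87,
    evidence=[
        "timing: errors began ~30s after the deploy",
        "error pattern matches incident #1247",
        "runbook lists rollback as the first mitigation",
    ],
    suggested_action="Roll back deployment #847",
)
```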
3) Facilitate Actions — Drafts but doesn't deploy
Write the command, but let the engineer decide whether to run it.
Example: "Suggested rollback: kubectl rollout undo deployment/api-service -n production to revert to image abc123."
Avoids: Auto-executing changes, even behind approval buttons that encourage click-through habits.
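In code terms, "facilitate" means returning a string, not spawning a shell. A minimal sketch, with a hypothetical helper name:

```python
# A sketch of "draft, don't deploy": the helper only formats a command
# for a human to copy, review, and run. It never shells out.
def draft_rollback_command(deployment: str, namespace: str) -> str:
    # Return the exact command an engineer could run after reviewing it.
    return f"kubectl rollout undo deployment/{deployment} -n {namespace}"

suggestion = draft_rollback_command("api-service", "production")
print(f"Suggested rollback (copy and run only if you approve):\n  {suggestion}")
```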
Guardrails that enforce control
Hard boundaries on what the AI can and cannot do:
Confidence signals. High confidence yields a concrete suggestion. Medium confidence prompts a clarifying question ("Have you checked p99 latency on auth-service?"). Low confidence sticks to facts.
Risk-aware approvals. For read-only queries like "show error rates for the last hour," the AI can provide prefilled snippets. For destructive actions like rollbacks, it requires explicit confirmation and suggests peer review. The AI never executes.
Evidence and accountability. Maintain a causal timeline with linked metrics, queries, and diffs. Keep an audit log of suggestions and human actions for review.
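Here is one way the guardrails above could fit together. The thresholds (0.5 and 0.8), class names, and risk categories are assumptions for illustration; the invariant is that nothing past this function ever executes a change, and every suggestion lands in the audit log.

```python
# A rough sketch of wiring the guardrails together: confidence decides how
# assertive the output is, risk level decides whether a prefilled snippet is
# offered, and everything is recorded. Thresholds and names are assumptions.
from dataclasses import dataclass
from enum import Enum

class Risk(Enum):
    READ_ONLY = "read_only"      # e.g. "show error rates for the last hour"
    DESTRUCTIVE = "destructive"  # e.g. rollbacks, restarts, config changes

@dataclass
class Suggestion:
    text: str
    risk: Risk
    confidence: float

def render(s: Suggestion, audit_log: list[str]) -> str:
    # Log the suggestion regardless of what the engineer does with it.
    audit_log.append(
        f"suggested: {s.text} (confidence={s.confidence:.2f}, risk={s.risk.value})"
    )
    if s.confidence < 0.5:
        return "Facts only: " + s.text            # low confidence: no recommendation
    if s.confidence < 0.8:
        return "Worth checking: " + s.text        # medium: clarifying question
    if s.risk is Risk.DESTRUCTIVE:
        return ("Recommended (requires explicit confirmation and peer review): "
                + s.text)                          # high confidence, still never executed
    return "Prefilled snippet: " + s.text          # high confidence, read-only
```

Whatever the exact thresholds, the property worth preserving is that confidence and risk are checked before a suggestion reaches the engineer, and the audit log captures what was proposed and when.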
Design choices that support control
Features that make the system safer and easier to use under stress, without blurring control lines:
Detail on demand. Start with a summary. Expand for log lines, queries, or full dashboards. Engineers control the depth.
Workflow integration. Instead of forcing you into a new dashboard, it works within existing tools. When an alert fires, it posts "API errors spiked 40x at 14:32, correlating with deployment #847" directly in your incident Slack channel, with clickable links to the specific Grafana panel showing the spike and the deployment diff in your CI system.
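As a sketch of what that integration might look like, here is a post to a Slack incoming webhook with deep links into Grafana and CI. The webhook URL and link targets are placeholders, not real endpoints.

```python
# A minimal sketch of meeting engineers where they already work: post the
# correlation and deep links into the incident channel via a Slack incoming
# webhook. The webhook URL and link targets below are placeholders.
import requests

def post_incident_summary(webhook_url: str, summary: str, links: dict[str, str]) -> None:
    # Slack mrkdwn renders <url|label> as a clickable link.
    lines = [summary] + [f"<{url}|{label}>" for label, url in links.items()]
    requests.post(webhook_url, json={"text": "\n".join(lines)}, timeout=5)

post_incident_summary(
    webhook_url="https://hooks.slack.com/services/XXX/YYY/ZZZ",  # placeholder
    summary="API errors spiked 40x at 14:32, correlating with deployment #847",
    links={
        "Grafana panel": "https://grafana.example.com/d/api-errors",
        "Deployment diff": "https://ci.example.com/deployments/847/diff",
    },
)
```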
What SREs want from an AI teammate is simple: help them understand, decide, and act faster without taking the wheel. Aggregate the right evidence, propose hypotheses with confidence thresholds and supporting links, recommend the next safe step, integrate with existing workflows, keep hands off production unless instructed, and leave a clear record for review.
In short: controllable by design. Facts, recommendations, and actions, supported by confidence signals, risk-aware approvals, and a clear record of evidence. Do that, and AI becomes a teammate worth paging.