Foundation models now make it possible to process massive volumes of observability and systems data. But that raw capability isn't enough on its own. Turning it into something useful for incident response requires specialization: fine-tuned modeling, rapid parallel ingestion of multimodal artifacts, workflow integration, guardrails, and more. With those layers in place, however, AI can surface likely root causes much faster, and in ways engineers can act on. And it can summarize all of that into a tidy Slack report delivered where engineers are already collaborating on the incident. Sounds amazing, right?
However, tell a seasoned SRE to trust AI during an incident, and you'll get the same look a magician gives when someone asks if they really believe in magic. For SREs, the skepticism is earned. There are many impressive AI demos that just don't deliver in production.
Picture this: your payment system is down, executives are breathing down your neck, and the verdict arrives as "the algorithm says it's a database failure." Would you start the rollback without double-checking?
Yet those same engineers, drowning in alert noise, can't keep up: nearly a third of alerts go unaddressed. Three out of four developers report burnout, and on-call duty is a big reason why. Alert fatigue is real, and engineers worn down by constant firefighting could benefit enormously from the right kind of AI assistance.
The question isn't whether AI can help. It's whether it can earn trust.
Now let’s look at data. Spoiler: there’s a lot of it.
The fundamental problem with incident response isn't that teams don't get the right data. It's that they can't tell which of it matters. In fact, too much "observability" makes things worse: the avalanche of logs, metrics, events, and traces often buries the real signal. Each monitoring tool offers only a slice of the picture, and when an incident hits, SREs face three painful realities:
The bottleneck isn't tooling or headcount—it's the cognitive overhead of correlation and triage under pressure. This is where AI can actually help, but only if teams can understand how it arrived at its conclusions.
Here's where AI has genuine advantages over traditional automation. Unlike rule-based systems that break when they encounter unexpected patterns, modern AI can:
For an SRE, that list is impressive — and most of it would be welcome help. But adoption hinges on how the AI behaves. It has to work transparently, giving engineers the ability to check its work. It's not about handing back "magic answers." It's about providing verifiable outputs, since a lot is riding on the actions an SRE takes based on that information.
Engineers already live by this principle in many contexts — code reviews, CI/CD pipelines, production changes. You trust your teammate, but you still run the tests. You trust your monitoring, but you still check the logs. The same mindset applies to AI in incident response: trust but verify.
At first, SREs will want to see how the AI reached its conclusions and spot-check the evidence. That means the system should provide:
These aren't checklists SREs will review every time. They're mechanisms that build trust. Early on, engineers will verify more often. Over time, accuracy, conservatism, and consistency in the AI's output will do the real work of building confidence.
Once trust is established, the AI becomes just another teammate: engineers rely on its recommendations without re-checking everything, but they know the audit trail is there if they need it.
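To make that concrete, here is a minimal sketch in Python of what a verifiable finding could carry. Every name in it (Evidence, Finding, the Slack-style rendering) is hypothetical rather than a description of any particular product; the point is that the hypothesis, the confidence, the reasoning steps, and the exact queries behind them travel together, so an engineer can re-run any of them.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    source: str    # e.g. the metrics store or log search tool the data came from
    query: str     # the exact query the AI ran, so a human can re-run it
    link: str      # deep link back to the raw data in that tool
    summary: str   # one-line reading of what the data shows

@dataclass
class Finding:
    hypothesis: str       # plain-language statement of the suspected cause
    confidence: str       # deliberately coarse: "low" / "medium" / "high"
    reasoning: list[str]  # ordered inference steps, one sentence each
    evidence: list[Evidence] = field(default_factory=list)

    def to_slack_text(self) -> str:
        """Render the finding as a message where every claim points back to
        a query an engineer can run themselves."""
        lines = [f"*Hypothesis:* {self.hypothesis} (confidence: {self.confidence})"]
        lines += [f"{i}. {step}" for i, step in enumerate(self.reasoning, start=1)]
        lines += [
            f"• {e.summary} (source: {e.source}, query: `{e.query}`, {e.link})"
            for e in self.evidence
        ]
        return "\n".join(lines)
```

Whether the output lands in Slack, a ticket, or a CLI matters less than the shape: claims and their evidence are never separated.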
Rebuilding a correlation engine to address these challenges makes one thing clear: every inference step has to be explicit. Not because black-box versions are necessarily wrong, but because being right isn't enough. Trust requires transparency.
So what does "trust but verify" actually look like when an AI SRE is working alongside a human during an incident?
During an incident, instead of starting with a blank Slack channel and a dozen open browser tabs, searching for the right runbook (and hoping it's there and up to date), the on-call engineer gets:
The human stays in complete control — no auto-remediation, no black-box decisions. The AI just does the grunt work of data correlation and hypothesis generation. Toil reduced. Uptime improved.
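For a sense of what that grunt work looks like, here is a toy sketch, again in Python and again with made-up names, of one narrow slice of correlation: lining up recent change events against the onset of an error-rate spike and ranking them as hypotheses, with the raw timestamps kept as evidence so a human can check the lag rather than take the ranking on faith. A real system would weigh far more signal types; the shape of the output is the point.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ChangeEvent:
    service: str
    description: str
    at: datetime

def detect_spike_onset(samples, threshold=0.05):
    """Return the first timestamp where the error rate crosses the threshold."""
    for ts, error_rate in samples:
        if error_rate >= threshold:
            return ts
    return None

def rank_suspects(samples, changes, window=timedelta(minutes=30)):
    """Rank changes by how closely they precede the spike onset, keeping the
    underlying timestamps as evidence so the correlation can be verified."""
    onset = detect_spike_onset(samples)
    if onset is None:
        return []
    candidates = [
        (onset - c.at, c)
        for c in changes
        if timedelta(0) <= onset - c.at <= window
    ]
    return [
        {
            "hypothesis": f"{c.service}: {c.description}",
            "lag_minutes": round(lag.total_seconds() / 60, 1),
            "evidence": {"spike_onset": onset.isoformat(), "change_at": c.at.isoformat()},
        }
        for lag, c in sorted(candidates, key=lambda pair: pair[0])
    ]

# Example: a (hypothetical) deploy ten minutes before an error spike ranks first.
samples = [(datetime(2024, 5, 1, 12, m), 0.01 if m < 20 else 0.12) for m in range(0, 45, 5)]
changes = [ChangeEvent("payments-api", "deploy v142", datetime(2024, 5, 1, 12, 10))]
print(rank_suspects(samples, changes))
```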
Software and systems continue to get more complex. Systems are more distributed, teams more specialized, blast radii larger. More monitoring and more process are hurting, not helping.
An AI teammate that earns trust through transparency can help teams spend less time gathering evidence and more time analyzing it, reduce false starts chasing irrelevant correlations, maintain institutional knowledge even as team members rotate, and focus human expertise where it matters most: judgment and action.
The next round of improvement in site reliability won't come from throwing more observability or bodies at the problem. It will come from AI that proves its value through transparency rather than magic.