When Your AI-Powered RCA Spews Pages of Useless Text

There’s an expression in Chinese, 废话连篇 (fèi huà lián piān), which means spewing pages of useless text. It can be pretty mean if you’re talking about a person, but let me assure you I’m talking about AI! A staff engineer at a cloud-native database company was describing the AI-powered RCA solution they were using, saying "it just goes off with a wall of text. Nobody has time to go through that during an incident. You always have to keep pushing it back on track. Most of us just let it do its thing in its own thread and ignore it. Sometimes, we check if what we did matched the agent afterwards."

A lot of reliability teams share experiences like this with me, and they usually ignore the reams of hallucinated RCAs filling up their on-call Slack channels. The worst (and also maybe the funniest?) was when one team told me they started manually copying logs into ChatGPT because it gave them better answers than the AI SRE product they were paying for.

It doesn't take much for a busy engineer to give up on a tool. Who can blame them if it’s hallucinating in the middle of an incident? The bottom line is that when it comes to incident management, most teams underestimate how hard it is for AI to deliver consistent, accurate RCAs. What gets missed most is the amount of janitorial data engineering it takes to make it work well.

Let me take you through it.

Why AI accuracy is hard

Everything comes down to context management. If you ask ChatGPT or Claude a question, you should expect a generic answer. By default, it’s not going to be specific to your workload, your environment, your architecture, or your team's priorities. The model has no persistent understanding of your service dependencies, your deployment history, or your team's roadmap.

Without deep, structured context about the specific environment, AI produces generic output. Now, generic answers can be more than good enough for lots of use cases. For example, if you're troubleshooting a known error with a well-documented fix, a generic answer will probably get you there. But in a real incident, the root cause is buried beneath layers of symptoms and you don't know where to start. There, a generic answer that could apply to any system will not get you far. You need an answer specific to your system.

Why AI Root Cause Analysis (RCA) is even harder

Most AI use cases come down to one-shot retrieval. There is a question, there is a known answer somewhere, and the model finds it and returns it. One round trip. Enterprise search, for example, works this way. The answer exists, and the agent retrieves it.

But RCA isn’t about retrieval. It’s about exploration.

When an alert fires, even experienced engineers don’t know the root cause. They start with incomplete information and work through a sequence of exploratory actions: form a hypothesis, formulate a query to validate it, interpret the result, decide where to look next.
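To make that loop concrete, here is a minimal sketch of it in Python. Everything in it is an assumption for illustration: the `Hypothesis` class, the `check` callable (which stands in for "formulate a query and interpret the result"), and the `max_steps` cap are hypothetical names, not any product's API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Hypothesis:
    description: str
    # Hypothetical: runs a query and returns True if the result supports the hypothesis.
    check: Callable[[], bool]
    # Narrower hypotheses worth exploring if this one is confirmed.
    follow_ups: List["Hypothesis"] = field(default_factory=list)


def investigate(initial: List[Hypothesis], max_steps: int = 10) -> List[str]:
    """Explore hypotheses one at a time and return the chain of confirmed findings."""
    frontier = list(initial)
    confirmed: List[str] = []
    steps = 0
    while frontier and steps < max_steps:
        hypothesis = frontier.pop(0)
        steps += 1
        if hypothesis.check():
            confirmed.append(hypothesis.description)
            # Decide where to look next: dig into the confirmed lead first.
            frontier = hypothesis.follow_ups + frontier
    return confirmed
```

The point of the sketch is the shape: each step depends on the result of the previous one, so there is no single lookup that returns the answer.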

You might wonder what's stopping teams from just providing relevant context to a generic model. All they have to do is connect Claude Code to a suite of MCP servers and give it access to Datadog, GCP logs, and their infrastructure. Then it could issue queries with some context.

While this sounds great in theory, that's a lot of data for an agent to make sense of. It needs to know how to navigate all that information and accurately parse what it means, where it's coming from, and how it has behaved historically.

Otherwise, you may as well be in a casino.

Without an understanding of available log data, metric schemas, and service dependencies, the model doesn’t know what to query or how to query it. It constructs queries against fields and labels that don’t actually exist in your environment. Those queries either return nothing or return an impossibly large volume of data the model cannot process. The investigation is broken before any answer is produced.
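One way to picture the fix is a guard that checks a proposed query against what is known to exist before it ever runs. The sketch below is hypothetical throughout: the `KNOWN_FIELDS` catalog, the stream and field names, and the 60-minute window cap are all made-up values for illustration.

```python
from typing import Dict, List, Set

# Hypothetical catalog: log stream name -> fields known to exist in it.
KNOWN_FIELDS: Dict[str, Set[str]] = {
    "checkout-service": {"status_code", "latency_ms", "region"},
}

# Assumed cap that keeps a query from returning more data than the model can process.
MAX_WINDOW_MINUTES = 60


def validate_query(stream: str, fields: List[str], window_minutes: int) -> List[str]:
    """Return a list of problems; an empty list means the query is safe to run."""
    if stream not in KNOWN_FIELDS:
        return [f"unknown stream: {stream}"]
    problems = []
    for name in fields:
        if name not in KNOWN_FIELDS[stream]:
            problems.append(f"unknown field: {name}")
    if window_minutes > MAX_WINDOW_MINUTES:
        problems.append(f"window too large: {window_minutes}m")
    return problems
```

A query on a guessed field like `http_status` gets rejected up front instead of silently returning nothing.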

Each failed query costs tokens, adds latency, and wastes money. The model will try a bunch of things, and the vast majority of those attempts will not work. By the time one does, you've likely already burned through your token budget and your engineer's patience.

What's required for accurate AI-powered RCA

Three things, and each is harder than it looks.

Knowledge curation. Customer data is scattered across different systems like Slack, Jira, Datadog, GitHub, Buildkite, and PagerDuty. Each stores different types of data in different formats. Before an incident ever fires, you need a data engineering pipeline that understands what is in each system, how the data relates across systems, and how to represent it in a way an LLM can reason over during an investigation. This is the pre-work. It’s a data processing and data engineering problem. None of it is glamorous. But without it, the model has no foundation to reason from.
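As a rough illustration of what that pre-work involves, the sketch below maps records from two differently shaped sources into one common event type and orders them on a single timeline. The payload shapes and field names (`pipeline`, `finished_at`, `service_name`, `triggered`) are invented for the example and do not reflect any vendor's real schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    source: str
    service: str
    timestamp: datetime
    summary: str


def from_deploy(raw: Dict[str, Any]) -> Event:
    # Hypothetical CI payload: epoch seconds and a pipeline name.
    when = datetime.fromtimestamp(raw["finished_at"], tz=timezone.utc)
    return Event("ci", raw["pipeline"], when, f"deployed {raw['commit']}")


def from_page(raw: Dict[str, Any]) -> Event:
    # Hypothetical paging payload: ISO 8601 timestamp and a service name.
    when = datetime.fromisoformat(raw["triggered"])
    return Event("pager", raw["service_name"], when, raw["title"])


def timeline(events: Iterable[Event], service: str) -> List[Event]:
    """Everything known about one service, in time order, regardless of source."""
    return sorted((e for e in events if e.service == service), key=lambda e: e.timestamp)
```

Two sources already need two adapters and agreement on timestamps and service names. Multiply that by every system in the list above and the size of the data engineering problem becomes clearer.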

Schema discovery. If you hired the best engineer in the world and asked them to debug a Sev1 outage on day one, they would have no idea what to do. Not because they lack skill, but because they do not yet understand the environment. They need to know what services exist, how they depend on each other, what monitoring is in place, and what the data looks like. Schema discovery is how the agent builds that understanding.

For every monitoring and telemetry tool the agent queries, it needs to know what log streams are available, what metrics exist, and what query structures are valid. And this has to happen proactively. We achieve this by scanning the entire system before any incident needs to be investigated. The agent orients itself within the environment so that when an incident fires, it already knows where to look and what kinds of queries will return meaningful results.
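A proactive scan could look something like the sketch below. It is not a description of any real integration: `TelemetrySource` is a hypothetical interface, and `InMemorySource` is a stand-in for a monitoring tool whose streams and fields would really come from that tool's own API.

```python
from typing import Dict, Iterable, List, Protocol


class TelemetrySource(Protocol):
    """Hypothetical interface; real tools expose this through their own APIs."""
    name: str

    def list_streams(self) -> List[str]: ...
    def list_fields(self, stream: str) -> List[str]: ...


class InMemorySource:
    """Stand-in for a real monitoring tool, used only for illustration."""

    def __init__(self, name: str, streams: Dict[str, List[str]]) -> None:
        self.name = name
        self._streams = streams

    def list_streams(self) -> List[str]:
        return list(self._streams)

    def list_fields(self, stream: str) -> List[str]:
        return self._streams[stream]


def build_catalog(sources: Iterable[TelemetrySource]) -> Dict[str, Dict[str, List[str]]]:
    """Scan every source up front and record which streams and fields exist."""
    catalog: Dict[str, Dict[str, List[str]]] = {}
    for source in sources:
        catalog[source.name] = {
            stream: sorted(source.list_fields(stream))
            for stream in source.list_streams()
        }
    return catalog
```

Run on a schedule rather than during an incident, a catalog like this is what lets the agent construct queries against fields that actually exist.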

The real bar

Just like a human SRE solving an unbounded, unstructured problem, an AI agent needs context. It needs to set limits, stack rank likely causes, explore smartly, test hypotheses, and work through a process of elimination. That’s the baseline for accurate RCA. Only when an AI SRE is properly supported by that foundation will it stop producing pages of useless text.