
Why LLM-Over-Logs Is the Wrong Abstraction

Log volume is not your friend, and more context is not the answer. The hard part of AI SRE is knowing what to throw away.
by Vikram Sreekanti

At this point, we all know that modern LLMs are phenomenal data processing machines. We can feed all sorts of information into our favorite model and ask it to synthesize ideas or analyze data in a way that matches exactly what we're looking for. Usually, we get pretty good results in just a couple of minutes.

Likewise, you might think that pumping a bunch of logs into an LLM would help you understand what's going on with production software and quickly debug issues during your on-call rotation. Unfortunately, that's not the case. In reality, when you take large amounts of log data and pump them through an LLM, you're rolling the dice and hoping the model figures out something useful. Realistically, the chances of getting a useful result are quite low.

That's what sets a properly built AI SRE agent apart. It's not just about writing a log query, pulling some results, and dumping them into a prompt. That gets you nowhere. Let's start with the root of the problem.

Data at scale is not your friend

Modern software systems generate incredible amounts of telemetry. Products like Datadog and Grafana are built to ingest, store, and index all of this data at scale with high efficiency. Unfortunately, LLMs aren't built for that.

When a human thinks about a "long" prompt, they might think of a few paragraphs — maybe a thousand words. In token terms, that’s a few thousand tokens. That's not what log data looks like. Logs contain massive amounts of redundant and repetitive information: timestamps, log levels, and codebase metadata. When this is related to a specific error, that context is incredibly useful; but when it’s just a regular info statement, the content is functionally useless for debugging.

Sticking all of that into an LLM and hoping for the best is the wrong approach. When you write a targeted log query—say, for two minutes of data from one particular system—you might get a few GB of data back, which translates to hundreds of millions of tokens: far beyond any model's context window. Even an aggressively truncated slice runs to hundreds of thousands of tokens. While modern LLMs can process that context size, it doesn't mean they should. For one thing, the larger your input becomes, the more the variance of the output increases. More importantly, you're fighting massive latency issues; waiting for an LLM to read 200,000 tokens takes significantly longer than running a targeted script. You're wasting time and tokens waiting for an LLM to process thousands of virtually useless pieces of information that don't contribute to finding a root cause.
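To put rough numbers on the arithmetic above, here is a back-of-envelope sketch. The 4-bytes-per-token figure is a common heuristic for English text, not an exact tokenizer measurement:

```python
# Back-of-envelope: how many tokens is a blob of plain-text logs?
# Assumes the rough heuristic of ~4 bytes of text per token.
BYTES_PER_TOKEN = 4

def approx_tokens(num_bytes: int) -> int:
    """Estimate token count for a blob of plain-text logs."""
    return num_bytes // BYTES_PER_TOKEN

gigabyte = 1_000_000_000
print(f"1 GB of logs is roughly {approx_tokens(gigabyte):,} tokens")
print(f"200k tokens is roughly {200_000 * BYTES_PER_TOKEN:,} bytes")
```

By this estimate, a gigabyte of logs is on the order of 250 million tokens, while a 200,000-token prompt corresponds to under a megabyte of raw text.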

In short: taking a few GB of logs and throwing them into an LLM might seem like an intelligent place to start, but it almost always yields terrible results.

How RunLLM handles logs

Log processing is critical to any debugging workflow, so we’ve thought long and hard about how to handle it. The answer isn't "stuff more data into the prompt." In fact, we take the opposite approach. Instead of relying solely on LLM inference, we rely on regular old data engineering.

For each log query we write, we first sample the results to see what types of information are contained within. We then spin up an agent whose responsibility is to take that log schema and determine — based on the specific task — which data points are most relevant to the debugging effort. The agent then generates a custom Python data engineering script designed to filter and aggregate those specific logs.

That log analysis agent executes the script, checks the results, and iterates on the code until it’s satisfied with the output. Only then are the processed, high-signal results fed into the LLM for final reasoning.
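As a rough illustration of the kind of script such an agent might generate, here is a minimal sketch. The field names (`level`, `service`, `message`) and the filter-then-aggregate shape are our own assumptions for illustration; a real agent derives the schema by sampling the actual query results:

```python
# Hypothetical example of an agent-generated filter/aggregate script.
# Field names are assumed, not taken from any real schema.
import json
from collections import Counter

def summarize_logs(lines: list[str], min_level: str = "ERROR") -> dict:
    """Keep only high-signal records and aggregate repeated messages."""
    levels = {"DEBUG": 0, "INFO": 1, "WARNING": 2, "ERROR": 3, "CRITICAL": 4}
    threshold = levels[min_level]
    kept = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip unparseable lines rather than failing
        if levels.get(record.get("level", "INFO"), 1) >= threshold:
            kept.append(record)
    # Collapse duplicates: one count per distinct (service, message) pair.
    counts = Counter((r.get("service"), r.get("message")) for r in kept)
    return {
        "total_scanned": len(lines),
        "kept": len(kept),
        "top_errors": counts.most_common(10),
    }
```

A script like this turns hundreds of thousands of raw lines into a handful of aggregated rows, which is what finally gets handed to the LLM.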

This serves two purposes:

  1. Speed and Efficiency: Our agent iterates very quickly. It's not trying to process hundreds of thousands of tokens over and over again, which would be incredibly slow and expensive. By using custom data engineering to filter and process the log data before feeding it into an LLM, we reduce analysis latency from minutes to seconds.
  2. Reduced Variance: We dramatically improve the quality of our root cause analyses. We all know LLMs can produce different results for the same prompt. In a high-stakes on-call setting, betting that you’ll get the "right" version of the output one out of ten times doesn't fly. Our agent deterministically processes the logs to maximize the chances of a correct conclusion.

Why the rest of your data matters too

So, have we "given up the farm" by explaining our log processing? Not quite. It turns out there is a lot more to building an AI SRE than just looking at some logs. Don't get us wrong: log data is important for homing in on the errors that indicate a root cause. But having access to log data on its own is relatively unhelpful. Looking at a log stream in isolation won't help you pinpoint an issue because it contains many different kinds of information. Correlating those errors with related data like metrics, infrastructure, and code changes is necessary to reach the right conclusion.

A mature AI SRE agent takes into account all of those data sets individually and — more importantly — understands the relationships between different data streams. With that context, the agent is able to formulate realistic hypotheses about possible issues, understand causal relationships when evaluating those hypotheses, and over time, develop an understanding of which actions it should take to most efficiently resolve an issue. 

Let’s look at an example to make this more concrete. Say that you receive a PagerDuty alert indicating a critical endpoint — perhaps the feed generation endpoint for your social media product — has an abnormally high latency. You know from extensive experience that slow feed generation has a significant impact on user engagement and revenue, so this is a huge deal. 

Looking at the logs for this service might help you evaluate a couple possibilities. The logs might show a huge spike in the number of requests or help you determine if the nature of the requests during the slowdown is anomalous. While that’s a helpful start, it won’t get you to a solution. Consider everything an engineer would have to take into account: possible code changes, upstream or downstream service behavior, infrastructure configuration limits, and noisy neighbors, to name a few. Each one of these would be reflected in a different system — CI/CD, source code, metrics, infrastructure as code, Kubernetes, etc. Relying on logs alone would not get you very far at all.
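To make "correlating across systems" concrete, here is a toy sketch of one such check: lining up the latency alert with recent deploys from CI/CD. The event shapes and timestamps are invented for illustration:

```python
# Toy correlation: which deploys landed shortly before the alert fired?
# All fields and timestamps here are hypothetical.
from datetime import datetime, timedelta

def deploys_near_alert(alert_time: datetime,
                       deploys: list[dict],
                       window: timedelta = timedelta(minutes=30)) -> list[dict]:
    """Return deploys that finished within `window` before the alert."""
    return [d for d in deploys
            if alert_time - window <= d["finished_at"] <= alert_time]

alert = datetime(2024, 5, 1, 12, 0)
deploys = [
    {"service": "feed-gen", "finished_at": datetime(2024, 5, 1, 11, 45)},
    {"service": "auth",     "finished_at": datetime(2024, 5, 1, 9, 0)},
]
suspects = deploys_near_alert(alert, deploys)
# Only the feed-gen deploy falls inside the 30-minute window.
```

An agent would run dozens of checks like this, one per data source, and weigh the results against each other rather than trusting any single stream.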

Closing the understanding gap

The promise of an AI SRE is that it can evaluate many hypotheses in parallel, ideally reducing the time it takes to get to the correct RCA. Doing that well is a lot harder than it seems at first glance. 

The general wisdom we've all learned about using LLMs is that you should provide them with more context. The more detailed you can be, the better their outputs will match your requirements. The natural extension of that is to go beyond stuffing logs into an LLM and stuff all your observability data in instead. Unfortunately, this is not going to yield results that are dramatically more useful.

Without a clear understanding of what data is available and how it fits into the bigger picture, an LLM is simply going to be taking shots in the dark and hoping that it figures out the right answer. Modern models are good enough to get it right 5-10% of the time. As models improve in coming years, you might reach 25% or even 50%. But you’re not going to be anywhere near the accuracy required to trust an agent.

The reason for this is the “understanding gap.” If you hired the smartest engineer in the world, they’d spend their first day building a mental map of the codebase before fixing a single bug – this is expected because they need to understand what they’re doing before taking action. But even after all that learning, the new engineer will be less effective than the experienced one. Every time you dump raw data into an LLM, you’re essentially forcing it to repeat that “first day” over and over again. While an agent is faster at processing than a human, you’re still running the risk that the agent makes a mistaken assumption – that it lacks sufficient “experience.” To bridge this, we need a pre-processing phase — a way for the agent to learn everything it can up front so it can act with the precision of a veteran engineer the moment a crisis hits.

That’s why we make sure RunLLM never starts from a blank slate. When you connect our agent to your observability tools, it automatically catalogs and correlates all the data you’ve provided. The agent works to understand what data streams exist, where in the codebase they’re emitted, and how they’ve been used in past investigations and incidents – all before ever trying to debug an outage. That means that when you’re relying on it in a mission critical moment, it starts from a strong baseline understanding of your world – not as a brilliant stranger seeing your systems for the first time.
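One way to picture that pre-built context is as a catalog keyed by data stream. The fields below are our own illustrative guess at what such an entry might track, not RunLLM's internal schema; they simply mirror what the prose describes:

```python
# Illustrative shape for a pre-built telemetry catalog entry.
# Fields are hypothetical: what streams exist, where they're emitted
# in the codebase, and how they've figured in past incidents.
from dataclasses import dataclass, field

@dataclass
class StreamCatalogEntry:
    name: str                     # e.g. "feed-gen request logs"
    source_system: str            # e.g. "Datadog", "Grafana"
    emitting_code_paths: list[str] = field(default_factory=list)
    related_streams: list[str] = field(default_factory=list)
    past_incidents: list[str] = field(default_factory=list)

entry = StreamCatalogEntry(
    name="feed-gen request logs",
    source_system="Datadog",
    emitting_code_paths=["services/feed/handlers.py"],
    related_streams=["feed-gen latency metrics", "k8s pod events"],
)
```

With a map like this built ahead of time, an investigation starts from known relationships between streams instead of rediscovering them under pressure.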

Wrapping Up

Building good agents is ultimately about data. An agent needs access to data that has the relevant signal – otherwise, it’s just guessing (hallucinating). But you can’t give it all the data at once – no matter how many fancy long-context blog posts you see. The hard part about AI SRE is that modern software systems generate incredible amounts of data – all of it might be useful in some context, but most of it is useless for any single issue. 

If you’re not careful about how you use that data – cleaning it, analyzing it, and putting it into context – you’re most likely going to end up with an LLM guessing in a bunch of different directions that may or may not be correct. On the other hand, if you choose an agent that uses data effectively and efficiently, it can change the way that you operate production software.
