The Code Nobody Read Is Already in Production

Ben Sigelman argues that AI-generated code is a reliability crisis in slow motion, and explains what that means for how we observe production systems.
by Peter Farago

AI is generating code faster than anyone can review it, and most of it is already in production.

Ben Sigelman has been building and studying production systems for over 20 years: at Google, where he co-created the Dapper tracing and Monarch monitoring systems, as co-founder and CEO of Lightstep, and as co-creator of OpenTelemetry. He predicts that because AI tools continue to make it cheaper to write (and rewrite) code on demand, production environments will become the primary place to evaluate whether software is correct or incorrect. To many people, that might sound insane. To Ben, it's already inevitable.

Ben argues that the natural consequence of this change is that the observability stack needs to continuously feed signals from production back into the process of writing, shipping, and rewriting code.

A Crisis Already Here

Software development is accelerating in a way that has decoupled velocity from review. Code generated by AI tools ships to production without the kind of detailed human scrutiny that production code used to require. This must be the case for the productivity gains to be real; otherwise, the bottleneck would simply move from writing code to reviewing it.

"The amount of code that is running in production – that was never really properly reviewed by anybody – is going up so fast. There's always a reliability-velocity tradeoff, and I think people are seeing the velocity and saying, 'I don't really care that much about the reliability. I just can't resist the siren song of pushing features as fast.' That to me feels like a full-on crisis for software."

The crisis is the loss of context that makes debugging possible in the first place. Back in the old days (2022), when a person wrote code, they understood its structure. When something broke, the understanding of what it was supposed to do and why it was implemented that way existed in someone’s head. With AI-generated code, the reasoning lives in the weights of a model that neither the engineer who shipped it nor the engineer investigating the incident can interrogate. The why behind a given design decision is often absent.

The only place left to understand how it will actually behave is in prod. As Ben puts it, you need to go into production to understand how the software should be written. Production stops being the place where software goes after it's been validated – it becomes the place where true validation actually happens.

Production as a Necessary Test Environment

Pre-production testing still matters. Ben expects it to become more rigorous, not less, as teams try to compensate for the loss of careful human review, and as LLMs make it easier to achieve high test coverage numbers. Still, there is a limit to what any pre-production environment can tell you.

"You absolutely cannot know if software is viable until it's also been tested in production. We need the production workload to fully qualify things, and we can never simulate that."

The reason is straightforward. Staging environments test what you anticipate. Production reveals what you didn't: not just in the sense of individual requests or transactions, but in the sense of the overall workload, including caching consequences and interference effects. The gap between staging and production has always existed, but now the velocity of AI-generated code is making that gap wider and harder to measure.

What changes is what happens next. Rewriting used to be expensive enough to justify exhaustive upfront validation. Now it isn't. As Ben explains, given how easy it is to revert and try again, it's only natural that more candidate releases will get as far as production traffic before one gets selected for mainline merging.

"Within two years, I would be surprised if we don't have multiple copies of the same software written in different ways, running concurrently for a lot of production applications. The idea that code can be evaluated as correct or incorrect without the workload is academically ridiculous. It needs to be in production to evaluate it."

What This Demands from Reliability

As production becomes the environment where software gets evaluated, the role of reliability changes. The on-call model — wait for an alert, investigate, resolve — was designed for a world where code was reviewed before it shipped and failures were exceptions. When unreviewed code ships continuously, waiting for something to break is too late.

In Ben's framing, the feedback loop from production back into the development cycle is essential to navigating the velocity-reliability tradeoff. Not a manual loop, an automated one. Ship fast, watch what happens, feed that signal back into the next version.

That loop runs continuously, not just when an incident fires. It requires something that understands what's running, what changed, and what the behavior means across reliability signals and business metrics before anyone gets paged.
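
A minimal sketch of the shape of that loop, with hypothetical stand-ins (`deploy`, `collect_signals`, and `propose_rewrite` are placeholders for real rollout tooling, an observability backend, and a code-generation step, not actual APIs):

```python
from dataclasses import dataclass

# Hypothetical sketch of an automated production-to-development loop.
# Every function here is a stand-in, not a real API.

@dataclass
class Signals:
    error_rate: float      # fraction of failed requests for this version
    p99_latency_ms: float  # tail latency observed under real traffic

def deploy(version: str) -> None:
    print(f"deploying {version}")  # stand-in for real rollout tooling

def collect_signals(version: str) -> Signals:
    # Stand-in: a real loop would query traces and metrics scoped to
    # this version's production traffic.
    return Signals(error_rate=0.002, p99_latency_ms=180.0)

def propose_rewrite(version: str, signals: Signals) -> str:
    # Stand-in: hand the production evidence to the code generator so the
    # next rewrite is grounded in the workload, not just the diff.
    return f"{version}+rewrite"

version = "v1"
for _ in range(3):  # in practice this loop never stops running
    deploy(version)
    signals = collect_signals(version)
    if signals.error_rate < 0.001 and signals.p99_latency_ms < 200.0:
        break  # the workload says this version is good: keep it
    version = propose_rewrite(version, signals)
```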

The system this era requires continuously models your production environment, detects deviation before thresholds are crossed, and investigates without waiting to be told what to look for. The on-call engineer stops being the first line of defense and becomes the decision-maker at the end of an investigation that has already happened. That is a different job. It requires different tooling. 
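
"Detects deviation before thresholds are crossed" is worth unpacking. One simple way to do that (a hedged illustration of the idea, not how any particular product works) is to compare each observation against a rolling baseline of the metric's own recent behavior, so drift is flagged long before a static alert would fire:

```python
from collections import deque
from statistics import mean, stdev

# Illustration only: flag a metric drifting away from its own recent
# baseline while it is still far below any static alert threshold.

STATIC_THRESHOLD_MS = 500.0  # a classic alert would wait for this
window = deque(maxlen=60)    # rolling baseline of recent observations

def observe(latency_ms: float) -> bool:
    drifting = False
    if len(window) >= 30:
        baseline, spread = mean(window), stdev(window)
        if spread > 0 and (latency_ms - baseline) / spread > 3.0:
            print(f"deviation: {latency_ms:.0f}ms vs baseline {baseline:.0f}ms")
            drifting = True
    window.append(latency_ms)
    return drifting

# Latency steps from ~100ms to ~150ms: nowhere near the static threshold,
# but a clear change in behavior worth investigating before anyone is paged.
for i in range(90):
    latency = (100.0 if i < 60 else 150.0) + (i % 7)
    if observe(latency):
        break  # hand off to investigation, not to a pager
```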

The teams that recognize this shift will have a significant advantage.

"The insights are already there," Ben says. "It's just a matter of developing trust and repeatability with people."

Ben Sigelman co-created the Dapper distributed tracing system and Monarch monitoring system at Google, co-founded Lightstep (acquired by ServiceNow in 2021), and co-created OpenTracing and OpenTelemetry. 
