The future of software is production

If you’ve been writing code since before 2022, that sounds incredibly reckless: What about careful code review, feedback from peers, and thorough testing? I love the idea of disciplined software engineering, but once you accept what coding agents have actually done to software development — not just making writing code faster, but fundamentally shifting where the hard work happens — shipping straight to production becomes the only intellectually honest conclusion.
Why is that the case? It’s certainly not true that coding agents have solved the process of code writing and that we’re all generating pristine code all the time. In fact, coding agents haven't solved software engineering at all. They've just shifted the burden of quality out of the editor and into production… and production environments aren’t remotely ready for it.
The thing is that no one’s talking about production right now. The conversation around coding agents has endlessly revolved around how much code we can write through an LLM – will engineers still have jobs, can PMs ship their own features, and is there a future to all white-collar work? All of it starts from the same flawed premise – that writing more code is all that matters. In my opinion, this is not an interesting topic. The ship has sailed on cheap code. Models could flatline for the next decade and the improvement we've already seen would still represent a sea change in how software gets built. The question that actually matters is what we do with all of that code we’re writing.
The operative question is how you know that the code you generated does what it’s supposed to do when it gets in front of real users with real data, and how you can minimize the number of times that it breaks horribly.
Writing is easy. Validation is hard. And it can't stay human.
While writing the first draft of this post, I had a Cursor agent running in the background to implement a feature in our product that included a database migration, two new APIs, a Slack app update, and a UI component. Just reasoning through that scope would have taken us a couple of hours before LLMs. So naturally I thought I could start writing while Cursor plugged away – I was wrong. It took under five minutes and came out to 670 lines of code; I had barely gotten through the first paragraph of the post.
Then came the hard part – actually testing whether that code worked. I had to spin up a local version of our app, start an Ngrok server, open a Slack workspace, connect a Slack channel to that server, manually send messages, and verify that the right data ended up in the right place in the database and surfaced correctly in the UI. The setup alone took longer than Cursor took to write the code. And it only told me whether the happy path worked — not whether anything subtle broke in edge cases that I hadn't thought to recreate.
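Parts of that manual loop can be collapsed into an automated happy-path check. Here’s a minimal sketch in Python, with an in-memory SQLite database and a stub handler standing in for the real Slack-to-app-to-database flow; `handle_slack_event` and the `messages` schema are hypothetical stand-ins, not our actual code:

```python
import sqlite3

# Hypothetical stand-in for the webhook handler the agent generated.
# The real flow goes through Slack, Ngrok, and our web app.
def handle_slack_event(db: sqlite3.Connection, event: dict) -> None:
    db.execute(
        "INSERT INTO messages (channel, text) VALUES (?, ?)",
        (event["channel"], event["text"]),
    )

def happy_path_check() -> bool:
    """Send one synthetic event and verify it lands in the database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE messages (channel TEXT, text TEXT)")
    handle_slack_event(db, {"channel": "#support", "text": "hello"})
    row = db.execute("SELECT channel, text FROM messages").fetchone()
    return row == ("#support", "hello")
```

Even a check this thin answers the one question that matters before shipping: did the data end up in the right place at all?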
In a nutshell, this is the gap that's growing. The timeframe for writing code has gone from days and weeks to seconds and minutes. The process of validation is comparatively unchanged. As the volume of code we ship increases by 5x, 10x, eventually 50x, the feedback loop between "Cursor finished writing the code" and "we actually know if this works" becomes a bottleneck that human review and testing cannot close. The frustration people report with tools like Cursor and Claude Code isn't really about the tools — it's about the mismatch between how fast they generate code and how slow it still is to know whether that code is any good. As long as validation is a human process, validation will always lag behind creation.
Why we should ship to production ASAP
Which brings me back to the thesis. If comprehensive pre-production validation is increasingly impossible — and I think it already is — the right response isn't to try harder to catch everything upstream. It's to change the relationship with production itself.
Most bugs that make it into production aren't the kind that take down the whole app. Those happen, and they're awful, but they're rare. What's far more common are the bugs that subtly confuse users, or issues that only surface under specific conditions that test environments rarely reproduce. Those are already slipping through today, even with human review. As code volume grows, we would happily bet that more of them will.
That’s okay. We’re actually all used to that to the extent that we don’t even notice it – just today, while using Linear, I found that there was a bug with tagging other issues through a comment. If I didn’t type fast enough, the selector would freeze. That was annoying, but I was able to work around it pretty quickly. I’m sure the Linear team will have fixed it by the time this post is published. And I think it’s safe to say that most of us would accept a few more bugs in exchange for many more features.
So: ship to production as soon as you’re sure that your code clears basic functional tests and doesn't catastrophically break anything.
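As a sketch of what that bar might look like in code, here is a hypothetical ship gate; the check names and predicates are illustrative, and the point is how deliberately low the bar is: happy-path checks only, no attempt at exhaustive pre-production validation:

```python
from typing import Callable

# Hypothetical ship gate: check names are illustrative, not a real CI config.
def should_ship(checks: dict[str, Callable[[], bool]]) -> bool:
    """Ship as soon as every basic functional check passes."""
    failures = [name for name, check in checks.items() if not check()]
    for name in failures:
        print(f"blocked by: {name}")
    return not failures

# A deliberately minimal set of gates: boot, migrate, happy path.
checks = {
    "app boots": lambda: True,
    "migration applies": lambda: True,
    "happy path works": lambda: True,
}
```

Everything beyond these gates gets validated where it was always going to be validated anyway: in production.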
There are three reasons this is the right instinct.
The most obvious one is that, given where customer expectations now are, slow velocity is a huge risk. It has always been the case that the best teams ship fast – think about how products like Facebook and Ramp were known for their breakneck pace. Now, that pace has become table stakes. If your features are blocked by multiple rounds of reviews, you’ll quickly be overtaken by a competitor who’s willing to tolerate a bit more uncertainty to get things out faster. Customer tolerance is changing rapidly, too – AI products can be so powerful that you’ll use one even if it’s only 90% reliable.
What’s more subtle is that testing and validation themselves might be getting less useful. No matter how many customers we’ve onboarded at RunLLM, we always find that some customers have an unexpected way of doing things – patterns that we would have never thought to test even if we invested all our time in that. This is the consequence of the fact that natural language is now the default interface for basically every product we might think of using. If we’re less confident in testing, we might very well want to lean into tossing the product into the deep end and seeing if it sinks or swims.
Finally, increased velocity doesn’t just mean shipping new things faster – it means recovering faster too. You might ship something messy, but Cursor can fix it just as quickly as it implemented the feature in the first place. We would argue that speed of iteration and recovery is increasingly a first-class skill for engineering teams. If you ship something, figure out where it doesn’t work, and improve quickly, you’re on a much more attractive flywheel than a team that spends all its time contemplating what should be built. Of course, you still have to know when something breaks – but that’s a different challenge. More on that below.
What's missing
As much as I love the idea of this future I’m talking about, the reality is that the tooling to support it doesn't fully exist yet. The gap isn't in writing code — it's in everything that needs to happen once that code is written.
Testing agents are the most obvious missing piece. The recent Cursor Cloud Agents release pointed in the right direction: an agent in the cloud with computer use enabled, able to spin up your application and actually interact with it. The limitation is that these agents run in sandboxes that struggle with the messy reality of most development environments — authentication flows, DNS configuration, third-party services like Slack that require their own credentials. That's not a fatal flaw – these agents are early, and they will absolutely get better. But there's significant work left before cloud agents can grapple with the complexity of real production systems in the way that a human can in a dev environment. We find ourselves increasingly relying on “use the product” as our main testing vehicle.
Predictive incident detection is the other critical gap. The current model of software reliability — set a threshold, wait for it to be breached, get paged, wake up, and debug — was designed for a world where code ships slowly and breaks in well-understood ways. It definitely wasn't designed for a world where dozens of features might ship in a day. By the time a fixed-threshold alert fires, broken code has often been in production far longer than you'd want, and if you’re shipping more, the impact of those bugs will compound. What's needed instead is an agent that's watching constantly, pattern-matching against what normal looks like, surfacing anomalies before they become incidents, and equipping you with the root cause to fix it immediately. The warning signs are almost always there early – the problem is that no one's watching carefully enough, fast enough.
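A minimal sketch of the idea, assuming a single scalar metric: instead of waiting for a fixed threshold to be breached, compare each new reading against a rolling baseline of what normal looks like and flag large deviations. A real system would watch many metrics, learn seasonality, and attach a root-cause hypothesis, but the core pattern-match is just this:

```python
import statistics

def is_anomalous(history: list[float], latest: float,
                 z_cutoff: float = 3.0) -> bool:
    """Flag a reading that deviates sharply from its recent baseline."""
    if len(history) < 2:
        return False  # not enough data to define "normal" yet
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_cutoff

# Error rate per minute: steady around 2, then a jump after a deploy.
baseline = [2.0, 1.8, 2.1, 2.2, 1.9, 2.0, 2.1, 1.8]
is_anomalous(baseline, 2.3)  # within normal variation
is_anomalous(baseline, 9.0)  # fires long before a fixed alert would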
Closely related is automated instrumentation — knowing which metrics to watch in the first place. Coding agents will add instrumentation if you ask them to, but they'll tend to track things that look technically plausible without necessarily correlating to actual user experience or business outcomes. The difference matters: A user sending 5 quick chat messages to an assistant in a minute is expected behavior. A user triggering five compute-heavy jobs in the same window probably means something is broken or confusing. An instrumentation system that understands the product and the business — not just the code — is what separates signal from noise.
Production is where software actually lives
For all the predictions about a fully democratized future where coding agents run without supervision and anyone can ship production software, the reality is that the hard part of software was never writing it. It was running it — in front of real users, with real data, under conditions that no one fully anticipated.
Production is where that complexity lives, and what we’ve done by “democratizing” code is reveal to the world just how dang hard it is to keep software running consistently and reliably at scale. And as we push more and more code into production, faster and faster, the infrastructure we've built to manage production — threshold-based alerting, manual review, human incident response — is going to buckle under the weight. That doesn’t mean infrastructure engineers should run for the hills – we know what we need to do to make the process better, but we need to invest in those tools early.
Realistically, we’re not likely to quickly reach a state where engineers are slinging code to production completely unsupervised. There will still be some testing & validation and critical components will still get reviewed by a human before going into production. But everyone should certainly be asking how far and fast they can push the boundaries – and the answer is almost certainly further than you think.
Read the Latest
.webp)



%20(2).png)

.png)
