
Beyond the Model Wars: The Real AI Race Begins

Applications, not models, will increasingly define the next phase of AI innovation

Apps, not models, are where AI success will be made or lost. (Generated by ChatGPT 4o, edited in Canva)

Everyone has seen the model benchmarks. But here’s the reality: most real-world teams are no longer picking a model based on leaderboard scores. They’re asking a different question—what’s the fastest path to getting something working in production?

While the quest for more powerful foundation models continues, the most important battle is shifting focus. The next phase of AI is not solely about who commands the smartest model; it is about who harnesses these powerful tools most effectively.

For the past 18 months, the spotlight has been intensely focused on foundation model labs. Groundbreaking models like GPT-4o, Claude 3.5, Gemini, and Qwen 3 have arrived in rapid succession. While foundational model research continues to push boundaries, a discernible shift is underway. The explosive progress in model performance is now meeting its biggest test: real-world use. Model quality is now impressive enough that the real differentiator is no longer marginal improvement to the model itself, but how it is brought to life.

Where is the action truly intensifying?

It is shifting to what happens after the model. It is in the intricate processes of how models are trained and adapted for specific domains, seamlessly integrated into complex systems, intelligently orchestrated across workflows, and reliably deployed at scale within real-world organizations. This is where the real AI race is heating up—a race focused on AI orchestration and practical implementation—and it is playing out faster than many realize.

Researchers, founders, and engineers at the frontier of AI application, such as Joey Gonzalez (UC Berkeley professor and director of Sky Computing Lab), Joe Hellerstein (UC Berkeley Jim Gray Professor of the Graduate School), DJ Patil (former U.S. Chief Data Scientist, LinkedIn and Obama Administration), Vikram Sreekanti (CEO of RunLLM), and Chenggang Wu (CTO of RunLLM, formerly Databricks and Google Brain), articulate a shared vision: the foundational model, while critical, is increasingly a powerful component within a larger, more complex system. The real challenge, and opportunity, lies in applying it well.

From Raw Power to Applied Intelligence: The Infrastructure Race

If large language models are analogous to microprocessors—powerful, general-purpose engines—then the next wave of disruptive value creation comes not solely from developing faster silicon, but from building the sophisticated software and infrastructure that uses that silicon to power real-world applications.

Joe Hellerstein articulates this precisely: “LLMs are like microprocessors. The hard part is connecting them to your business needs. That’s the software we still need to build.”

Even major model labs are increasingly investing in infrastructure plays, recognizing that the ability to effectively deploy and manage their powerful models is paramount.

“You can’t throw AI at a problem and expect it to work,” said Vikram Sreekanti, CEO of RunLLM. “Thoughtful applications are the hardest part.”

Benchmarks vs. Reality

That shift is reflected in how practitioners now evaluate models.

“New models are coming out all the time,” said Joey Gonzalez. “It’s really hard to keep up. At this point, I check LM Arena, but increasingly, that’s not my source of truth.”

What matters, he explains, is whether a model gets the job done. Benchmarks may reflect performance in artificial scenarios, but they do not capture real-world effectiveness. “We assume benchmarks are overfit now,” he said. “Take Llama 4—they tuned it to do well on LM Arena, but that doesn’t necessarily mean it works better in production.”

The best model is the one that works for your problem. And that is rarely captured by a leaderboard.

Post-Training Is Where the Magic Happens

Foundation models are trained to predict the next token. That gives you something that sounds plausible, but not something that is reliably helpful.

That reliability is earned in post-training.

Post-training is where base models are transformed through supervised fine-tuning, reinforcement learning, tool use, and feedback loops. It is how a model learns to follow instructions, admit uncertainty, escalate properly, and interact with tools.
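One simple building block behind several post-training recipes is best-of-n (rejection) sampling: generate multiple candidate responses, score each with a reward signal, and keep the best for fine-tuning data or serving. The sketch below uses a hypothetical, hand-written reward function as a stand-in for a real reward model:

```python
def toy_reward(response: str) -> float:
    """Hypothetical reward: prefer answers that cite a source and hedge."""
    score = 0.0
    if "according to" in response.lower():
        score += 1.0  # reward grounded answers
    if "not sure" in response.lower():
        score += 0.5  # reward admitting uncertainty
    return score

def best_of_n(candidates: list[str], reward) -> str:
    """Keep the highest-reward candidate; discard the rest."""
    return max(candidates, key=reward)

candidates = [
    "It just works, trust me.",
    "According to the deployment docs, restart the worker pool.",
    "I'm not sure; escalating to a human.",
]
print(best_of_n(candidates, toy_reward))
# → According to the deployment docs, restart the worker pool.
```

Even this crude loop illustrates the point: the reward signal, not the base model, is what steers behavior toward grounded, honest answers.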

As Gonzalez put it: “Post-training is where new capabilities are unlocked. It’s what turns boring, document-finishing models into helpful, reasoning companions.”

And it is where some of the most interesting research is happening, such as work showing that reinforcement learning can elicit better behavior even without perfect reward signals, simply by encouraging practice and engagement with the domain.

This is not just about making models smarter. It is about making them trustworthy.

Specialization Is the Future

Not every application needs a general-purpose genius. Most businesses need an expert who does one thing extremely well.

This is where domain specialization matters. Routing support tickets, solving a specific API error, surfacing the right log lines—these tasks do not require general intelligence. They require targeted competence, fast.

That is why techniques like domain-adaptive training, LoRA, and other parametric adaptation methods are gaining traction. As Gonzalez noted, “Being good at five tools is far more valuable than being average at a thousand.”

Fine-tuning is no longer just about pushing accuracy. It is about aligning behavior with the exact tasks, systems, and constraints that matter.
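The arithmetic behind LoRA makes the appeal of these adaptation methods concrete: instead of updating a full weight matrix W, you learn a low-rank update B @ A (rank r much smaller than the matrix dimensions), scaled by alpha / r, while W stays frozen. A minimal sketch in NumPy, with illustrative dimensions not drawn from any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, init to 0

def lora_forward(x):
    # Base path plus low-rank adapter path. Because B starts at zero,
    # the adapter contributes nothing at init and behavior matches
    # the base model exactly; training only touches A and B.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W @ x)  # identical before training

# Trainable parameters: r * (d_in + d_out) instead of d_in * d_out.
print(r * (d_in + d_out), "vs", d_in * d_out)  # → 512 vs 4096
```

An 8x reduction at these toy sizes; at real model scale the ratio is far larger, which is why a small team can afford to specialize a model per domain.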

From Demos to Systems

We are also seeing a shift in what “good” looks like. In 2023, a slick demo could raise a Series A. In 2025, that is just table stakes.

What separates a toy from a system?

  • Can it escalate when it is unsure?
  • Can it incorporate new knowledge over time?
  • Can it surface the right data from the right tool, quickly?
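The first of those properties, escalating when unsure, can be as simple as a confidence gate in front of the agent's answer. The sketch below is a placeholder: the threshold and the confidence score stand in for whatever calibration signal your system actually produces, such as a verifier or reranker score.

```python
from dataclasses import dataclass

@dataclass
class AgentResult:
    answer: str
    confidence: float  # assumed calibrated to [0, 1]

def route(result: AgentResult, threshold: float = 0.8) -> str:
    """Answer only above the confidence threshold; otherwise hand off."""
    if result.confidence >= threshold:
        return f"ANSWER: {result.answer}"
    return "ESCALATE: routing to a human with context attached"

print(route(AgentResult("Restart the ingest worker.", 0.93)))  # answers
print(route(AgentResult("Maybe a DNS issue?", 0.41)))          # escalates
```

The hard engineering work is not this gate; it is producing a confidence score trustworthy enough to put behind it.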

“AI is the roof,” said DJ Patil. “Your data—the foundation—has to be rock solid first.”

And not just data. Execution paths. Validation layers. Retrieval infrastructure. Evaluation harnesses. The entire stack.

Consider Corelight, a cybersecurity company using AI to help support engineers resolve advanced ticket escalations. What started as a model integration now runs as an orchestrated pipeline that connects documentation, ticket history, and internal tools to surface high-confidence answers in production.
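A schematic of that kind of pipeline: query several knowledge sources, merge the hits, and answer only when there is enough corroborating evidence. The source names, scores, and thresholds below are hypothetical stand-ins for real retrieval infrastructure, not Corelight's actual system.

```python
def search_docs(query):      # stub: product documentation index
    return [("docs", "sensor config guide", 0.9)]

def search_tickets(query):   # stub: historical ticket search
    return [("tickets", "similar past escalation", 0.7)]

def search_tools(query):     # stub: internal log / tool query
    return [("tools", "log lines from affected sensor", 0.6)]

def answer(query, min_evidence=2, min_score=0.5):
    """Merge hits from all sources; escalate if corroboration is thin."""
    hits = search_docs(query) + search_tickets(query) + search_tools(query)
    strong = [h for h in hits if h[2] >= min_score]
    if len(strong) < min_evidence:
        return "ESCALATE"  # not enough independent evidence to answer
    strong.sort(key=lambda h: h[2], reverse=True)
    return "; ".join(f"{src}:{title}" for src, title, _ in strong)

print(answer("sensor dropping packets"))
```

Note that the model never appears in this sketch. The orchestration logic around it, which sources to consult, how to weigh them, and when to give up, is where the system earns its trust.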

The real AI race is a systems race. It is a race to orchestrate high-trust agents that operate reliably within your workflows, under your constraints.

What Comes Next

We are not saying models do not matter. But they no longer win the game on their own.

From here, the winners will be those who:

  • Specialize with speed and precision
  • Post-train models to unlock reasoning and trust
  • Build infrastructure that is adaptable and observable
  • Obsess over operational quality and integration depth

Because having the best power drill does not matter if you are drilling into thin air.

You can win the battle and still lose the war. Ten years from now, when we look back on where value was created in AI, foundation models will be seen as a pivotal technological breakthrough. But applications—the systems, agents, and orchestration layers that bring those models to life—will represent a larger and more enduring share of the value chain.

Model development will continue to be important, but the application age is already upon us.

In previous AI waves—like expert systems in the 1980s or deep learning in the early 2010s—early excitement centered on raw capabilities. But long-term value was created by those who solved practical deployment challenges.

If you are exploring these problems, we would love to hear from you. Subscribe to The AI Frontier or visit RunLLM.