The On-Call Problem AI Can Actually Solve

Heinrich Hartmann argues AI's most valuable role in SRE isn't autonomous remediation. It's making sure on-call engineers have the context to fix incidents fast.
by
Peter Farago

The 3 a.m. Problem Isn't Technical.

Heinrich Hartmann thinks about on-call readiness the way most people think about sleep: you don't notice it until it's gone.

Ahead of an on-call shift, engineers worry about things like: "What are all the services I'm supposed to be on call for? What are the typical failures, the drills I have to have down? What are the major risks? What's the trickiest thing I could run into?"

"This will affect my sleep, actually," he says. "If I don't know those things, I could be setting myself up for a miserable late-night session with a system I barely understand. And that would be extremely bad."

The stakes are real. "If there are millions of euros going down the drain while people wait on you to resolve an issue," he says, "you don't sleep well unless you have your drills down."

And remote work made it worse. "When everyone's remote, you're no longer just picking things up from the people around you," Heinrich says. "You can be at a company for three years without having touched the service that just went down. You haven't committed a single line, and have no idea how it works."

Heinrich calls it "a knowledge management issue": Experienced engineers hold enough context to be effective at debugging. Others are hoping it’s going to be quiet during their on-call, or that the right playbook will be available.

Most conversations about AI in SRE start with the end state: self-managing systems, autonomous remediation, the end of on-call as we know it. Heinrich starts somewhere else: the engineer sitting alone with a pager, wondering whether they'll know enough to respond when it goes off.

Print the Code. Grab a Highlighter. Go to a Coffee Shop.

Heinrich remembers how he used to get familiar with a new codebase.

"Sometimes I would just print out the whole source code, like a big book," he says, "and then I would go to a coffee shop with a highlighter, and start figuring out what's in there."

Today, the printer sees a lot less action, and he usually just points an AI tool at a codebase and asks: "What's in there? Give me a rundown of the core components, which are the most important classes, how are they wired together?"

"I wasn't able to ask these questions before," he says. Code exploration used to mean grep, find, and a lot of patience. "Now it's conversational, and onboarding is faster."

On-call readiness is a codebase comprehension problem as much as an operational one. An engineer who understands how services are wired together – what depends on what, where the fragile points are, what a deployment actually changed – resolves incidents faster. The 3 a.m. page becomes less terrifying when you've spent an afternoon walking through the codebase with AI assistance, and can quickly merge a PR to fix a nit.

But codebase comprehension is only half the picture. The other half is operational knowledge: what failed before, how it was fixed, and which dashboards matter for which alerts. That knowledge often lives in Slack threads, post-mortems, and the heads of senior engineers who've been at the company long enough to have seen every failure mode at least once.

Heinrich sees both halves as the same problem. "Maybe the task gets easier if you have a copilot," he says. "But how do you absorb all that knowledge?"

The question isn't whether AI can diagnose a production issue autonomously. It's whether AI can help an engineer absorb enough context to be effective before the page fires.

Curate Knowledge. Earn Sleep.

Heinrich's vision for AI in SRE starts with documents, not dashboards or anomaly detection.

"I want a copilot that indexes company knowledge and gives me fast access," he says. "Source code would be good, but definitely recent deployments and the playbooks." He also wants to "auto-generate playbooks from past experiences, pair up with senior engineers, and co-create these documents."

The goal isn't to replace engineers with automation. It's to make the knowledge that senior engineers carry available to everyone, surfaced in context, at the moment it matters. "We want the AI to know what's relevant and surface it to engineers so they have the right context," he says. Engineers don't need to memorize every playbook. They need the right one to appear when the pager fires at 3 a.m.

Heinrich sees this as the most impactful work in the broader AI space right now: "Make sure you get the right model with the right context to the right people in the right situation." For operations, this means the on-call engineer has a powerful AI model at their side, one that knows all the relevant playbooks, the last incidents affecting that service, and the deployment diff from that afternoon.
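To make "the right context" concrete, here is a minimal sketch of what gathering it for a paged service could look like. It is an illustration under stated assumptions, not a description of any tool Heinrich uses: the `Doc` record, the `context_for_page` function, and the 24-hour deployment window are all hypothetical, and a real system would pull these records from an index of playbooks, post-mortems, and deploy logs rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Doc:
    """Hypothetical record for one piece of operational knowledge."""
    kind: str          # "playbook", "incident", or "deployment"
    service: str       # service the document belongs to
    title: str
    timestamp: datetime


def context_for_page(docs, service, now,
                     deploy_window=timedelta(hours=24), max_incidents=3):
    """Collect playbooks, recent incidents, and recent deployments for a service.

    The window and incident count are illustrative defaults, not recommendations.
    """
    for_service = [d for d in docs if d.service == service]

    # Every playbook written for the paged service.
    playbooks = [d for d in for_service if d.kind == "playbook"]

    # The most recent incidents, newest first.
    incidents = sorted(
        (d for d in for_service if d.kind == "incident"),
        key=lambda d: d.timestamp,
        reverse=True,
    )[:max_incidents]

    # Deployments that landed within the window before the page fired.
    deployments = [
        d for d in for_service
        if d.kind == "deployment"
        and timedelta(0) <= now - d.timestamp <= deploy_window
    ]

    return {"playbooks": playbooks, "incidents": incidents, "deployments": deployments}
```

The point of the sketch is the shape of the result: one bundle per page, scoped to the service that fired, which a copilot could then summarize for the engineer.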

It's also the work many teams in the space are skipping. "I haven't seen many people jump on the knowledge management side," he says. "They all go to 'let's solve real production issues with telemetry data,' which is harder."

That's the on-call problem AI can solve today. Not a system that tries to fix the incident autonomously, but a system that makes sure the on-call engineer responding knows what this service does, what broke last time, and what steps actually resolved it. It closes the distance between the senior engineer who's seen every failure mode and the mid-career engineer who's been at the company for five years but never touched three of the services on the rotation. Curate the knowledge. Earn the sleep.

Heinrich Hartmann is a Senior Principal SRE, host of the CASE Podcast, Chair of SREcon EMEA 2025, and organizer of Signals Berlin 2026, a single-track conference on reliability in the age of AI (Berlin, September 2026). Read more at his personal blog.
