Never Let a Good Incident Go to Waste

How AI turns firefighting into continuous learning

Learning from Incidents

Incidents are stressful, but they're also rich with lessons. Teams that consistently capture those lessons improve faster than those that don't. Research on team learning suggests that groups that share knowledge and review outcomes perform better and burn out less. In reliability engineering, that principle feels obvious: if every incident makes the team smarter, the team recovers faster and prevents repeat failures. The challenge is making this happen in practice.

Why Team Learning is Hard

Most SRE teams have informal learning mechanisms. A veteran engineer explains a system quirk to someone on-call. Senior engineers get pulled into major incidents, and their Slack explanations become impromptu teaching moments. But unless explicitly captured, this knowledge transfer can be fleeting. Once the thread scrolls away in Slack or the senior engineer rotates off, the lesson might vanish. In other words, the team fought the fire but the lessons may have gone to waste.

Postmortems are designed to capture these kinds of learnings, but they only work when they actually get written. Atlassian's 2022 research found that only 60% of teams consistently write them, and fewer still follow through on action items. The process can be inconvenient, rushed, and sometimes skipped. And when the process breaks down, institutional knowledge disappears with it.

The Runbook Problem

Runbooks document the steps that guide investigations during incidents. They're meant to be an invaluable reference when the pressure is on, but only if they're well maintained. Like the instructions on a fire extinguisher, they need to be there, and readable, when you reach for them.

Even when runbooks exist, keeping them fresh depends on humans consistently documenting what happened. Solving a problem in the middle of a crisis doesn't automatically mean the fix gets written down. Sometimes a new fire needs putting out, the engineer is juggling an on-call shift on top of their regular role, or writing up notes takes outsized effort. Postmortems capture some of those lessons, but without process and discipline, much of the knowledge stays in the heads of the people who were there. Google's SRE culture emphasizes structured, blameless postmortems for exactly this reason. Many teams lack that consistency—and without systematic knowledge capture, valuable lessons from incidents are lost.

The bottom line is that documentation decay is a reliability problem in its own right: a runbook nobody has updated is a runbook nobody can trust during an incident.

AI Improves Runbooks with Every Incident

Well-designed AI can effectively lighten the documentation burden. Instead of relying on engineers to remember and write up what they did during an incident, AI can observe and structure that knowledge automatically.

Here's what that looks like in practice. During an incident, an AI system can track the investigation across multiple sources: commands run in terminal sessions, queries executed in observability tools, Slack conversations where engineers discuss hypotheses, dashboards opened in sequence, and configuration changes deployed. When an engineer checks database connection pool metrics, kills a slow query, restarts the pool, and verifies recovery—the AI captures not just what happened, but the reasoning embedded in Slack threads about why they checked that metric first.
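As a rough illustration of the capture step, here is a minimal Python sketch that merges events from several sources into a single incident timeline. The `Event` type, the source names, and the sample events are all hypothetical; a real system would pull these from whatever terminal, chat, and observability integrations it has.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import chain


@dataclass(frozen=True)
class Event:
    """One observed action or message during an incident (hypothetical schema)."""
    timestamp: datetime
    source: str   # e.g. "terminal", "slack", "dashboard"
    detail: str


def build_timeline(*streams):
    """Combine per-source event lists into one list ordered by timestamp."""
    return sorted(chain.from_iterable(streams), key=lambda event: event.timestamp)


# Illustrative data only.
terminal = [
    Event(datetime(2024, 1, 1, 2, 4), "terminal", "kill slow query"),
    Event(datetime(2024, 1, 1, 2, 6), "terminal", "restart connection pool"),
]
slack = [
    Event(datetime(2024, 1, 1, 2, 1), "slack", "pool looks exhausted, checking metrics first"),
]
dashboard = [
    Event(datetime(2024, 1, 1, 2, 2), "dashboard", "opened connection pool metrics"),
]

timeline = build_timeline(terminal, slack, dashboard)
for event in timeline:
    print(event.timestamp.isoformat(), event.source, event.detail)
```

Keeping the source on every event is what lets the reasoning in a chat thread sit next to the command it explains.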

From this, the system generates a structured playbook: "Database connection pool exhausted → check connection pool metrics in DataDog → identify slow queries using pg_stat_activity → terminate blocking transactions → restart connection pool → verify recovery by monitoring active connections." The next time someone faces a similar issue—maybe a junior engineer on their first database incident at 2am—they have a proven sequence to follow, created from the actual investigation steps of someone who solved it before.
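One way to picture such a playbook as data is a trigger plus an ordered list of steps. The sketch below is an assumption about shape, not a description of any particular product: the `Playbook` class and its `from_actions` helper are hypothetical names, and the only processing shown is collapsing an action that was repeated back to back.

```python
from dataclasses import dataclass, field
from itertools import groupby


@dataclass
class Playbook:
    """A trigger condition and the ordered steps that resolved it (hypothetical shape)."""
    trigger: str
    steps: list = field(default_factory=list)

    @classmethod
    def from_actions(cls, trigger, actions):
        """Build a playbook, collapsing consecutive duplicate actions into one step."""
        return cls(trigger, [action for action, _ in groupby(actions)])

    def render(self):
        """Render the playbook as a single arrow-separated line."""
        return " → ".join([self.trigger, *self.steps])


playbook = Playbook.from_actions(
    "Database connection pool exhausted",
    [
        "check connection pool metrics",
        "check connection pool metrics",  # engineer looked twice; keep one step
        "identify slow queries using pg_stat_activity",
        "terminate blocking transactions",
        "restart connection pool",
        "verify recovery by monitoring active connections",
    ],
)
print(playbook.render())
```

Storing steps as a list rather than free text makes it straightforward for a reviewer to reorder, edit, or remove a single step later.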

This approach also produces more complete postmortems than memory alone, because it correlates incident timelines with system metrics and team communications. Even when formal write-ups get skipped, critical evidence gets preserved. Engineers can review and refine AI-generated documentation, maintaining accuracy while minimizing manual effort. Playbooks stay fresh because every new incident proposes updates to them instead of waiting for someone to find the time.
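Correlating a timeline with metrics can be as simple as asking what a metric read at the moment an action happened. The sketch below assumes metric samples arrive as time-sorted `(seconds, value)` pairs; the function name and the sample numbers are made up for illustration.

```python
from bisect import bisect_right


def metric_at(samples, t):
    """Return the most recent metric value recorded at or before time t.

    `samples` is a list of (timestamp_seconds, value) pairs sorted by timestamp.
    Returns None if no sample exists at or before t.
    """
    times = [timestamp for timestamp, _ in samples]
    index = bisect_right(times, t)
    return samples[index - 1][1] if index else None


# Illustrative data: active connections sampled once a minute.
active_connections = [(0, 10), (60, 95), (120, 20)]

# Annotate each action with the metric value at that moment.
actions = [(90, "terminate blocking transactions"), (130, "verify recovery")]
annotated = [(label, metric_at(active_connections, t)) for t, label in actions]
print(annotated)
```

Pairing each action with the metric value at that moment is what lets a postmortem show that recovery followed the fix rather than merely coinciding with it.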

Most importantly, AI can cluster related incidents, surface recurring patterns, and connect insights across tools and teams. It might notice that three seemingly different incidents over six months all traced back to the same database connection pool configuration, even though they manifested differently. What previously required constant human maintenance becomes part of continuous system improvement.
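A deliberately simple stand-in for that kind of pattern detection is to group incidents by a root-cause label assigned during review and flag any label that keeps coming back. This is plain grouping, not real clustering, and the record fields (`id`, `symptom`, `root_cause`) and the threshold are assumptions for the sake of the example.

```python
from collections import defaultdict


def recurring_root_causes(incidents, min_count=3):
    """Return root causes that appear in at least `min_count` incidents.

    Each incident is a dict with "id" and "root_cause" keys (hypothetical schema).
    The result maps each recurring root cause to the list of incident ids.
    """
    by_cause = defaultdict(list)
    for incident in incidents:
        by_cause[incident["root_cause"]].append(incident["id"])
    return {cause: ids for cause, ids in by_cause.items() if len(ids) >= min_count}


# Illustrative data: different symptoms, same underlying cause.
incidents = [
    {"id": "INC-1", "symptom": "checkout latency", "root_cause": "db pool config"},
    {"id": "INC-2", "symptom": "login timeouts", "root_cause": "db pool config"},
    {"id": "INC-3", "symptom": "disk full", "root_cause": "log rotation"},
    {"id": "INC-4", "symptom": "report job failures", "root_cause": "db pool config"},
]

print(recurring_root_causes(incidents))
```

The incidents here look unrelated by symptom, which is exactly why they are easy to miss without some form of cross-incident grouping.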

Implementation Considerations

Building this capability is non-trivial. Organizations need robust data pipelines, thoughtful AI model training, and integration across multiple tools—observability platforms, incident management systems, chat tools, and version control. Beyond the initial setup, which requires significant engineering investment, ongoing calibration and upkeep are required to maintain accuracy.

The greater challenge is often cultural. The perception that an engineer's work is being monitored can create resistance, especially in teams where surveillance or performance evaluation has eroded trust. This concern deserves serious attention.

The key is implementing this transparently and in service of the team, not management oversight. Engineers should understand exactly what's being captured (incident investigation actions) and what isn't (general work patterns or productivity metrics). Teams should have control over what gets surfaced and shared. The framing must be genuinely blameless—this isn't about evaluating individual performance, it's about capturing valuable investigation techniques that would otherwise be lost.
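One concrete way to make that boundary visible is to enforce it in code: keep only events that come from an agreed list of sources and fall inside the declared incident window. The sketch below is an assumption about how such a filter might look; the source names, the event fields, and the `in_scope` function are all hypothetical.

```python
from datetime import datetime

# Sources the team has agreed may be captured (hypothetical names).
ALLOWED_SOURCES = frozenset({"terminal", "observability", "incident_channel"})


def in_scope(event, incident_start, incident_end, allowed_sources=ALLOWED_SOURCES):
    """Return True only for events from an allowed source inside the incident window."""
    return (
        event["source"] in allowed_sources
        and incident_start <= event["timestamp"] <= incident_end
    )


start = datetime(2024, 1, 1, 2, 0)
end = datetime(2024, 1, 1, 3, 0)

events = [
    {"source": "terminal", "timestamp": datetime(2024, 1, 1, 2, 5)},          # kept
    {"source": "terminal", "timestamp": datetime(2024, 1, 1, 9, 30)},         # outside the window
    {"source": "direct_message", "timestamp": datetime(2024, 1, 1, 2, 10)},   # source not allowed
]

captured = [event for event in events if in_scope(event, start, end)]
print(len(captured))
```

Publishing the allow-list and the window rule to the team is a small step, but it turns "trust us" into something engineers can read and challenge.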

When implemented thoughtfully, the shift in perspective becomes clear: the work an engineer does during an incident can be invaluable for others who may face similar issues in the future. Capturing those lessons turns individual actions into team learning, creating leverage for the whole organization. The result is more resilient systems, faster recovery, and better uptime.

AI isn't a replacement for human judgment during incidents. Context will always matter, and engineers bring experience and intuition that can't be automated. The value of AI is making sure that the evidence, steps, and lessons from every incident are captured and shared, so teams spend less time re-solving the same problems and more time strengthening their systems.

The Compound Effect

Teams that solve the knowledge capture problem see compound benefits that become more pronounced over time. After six months, a team might have 50 automatically generated, peer-reviewed playbooks covering their most common failure modes. After a year, that library has grown to hundreds of investigation patterns, each refined through multiple real incidents.

Institutional memory becomes resilient to turnover. When a senior engineer leaves, their investigation techniques don't leave with them—they're embedded in the playbooks they generated through their work. New engineers ramp up faster because they're learning from the accumulated wisdom of the entire team, not just whoever happens to be available for questions.

The reduction in repeat incidents compounds. When the same database connection pool issue appears for the fourth time, it gets identified and resolved in minutes instead of hours, because the pattern is instantly recognized and the proven solution is immediately available. Teams spend less time firefighting the same problems and more time preventing new ones.

For engineering leaders, this means predictable on-call load, faster incident resolution, and confident knowledge transfer during team growth. For individual contributors, it reduces the stress of on-call, eliminates documentation guilt, and builds confidence that the organization won't forget what they learned the hard way.

Human engineers handle the crisis. AI ensures the learning endures. Together, they transform incidents from painful disruptions into investments in team resilience. In short: never let a good incident go to waste.