It's 9:14 p.m. when your support rep flags a customer complaint: "Checkout keeps timing out."
You glance at your dashboards. Everything is green. The monitors are quiet. No alerts have fired.
By the time your team confirms there's a problem, users have already posted screenshots on social media, but your status page still says "All Systems Operational."
Then the grim reality hits you: your customers found the outage first.
When customers discover outages before you do, it's not just an operational failure—it's a trust failure. And it's far more common than most teams want to admit.
According to the 2024 State of Observability Report by New Relic, 30% of organizations still rely on customer complaints as their primary way to detect outages. If we're being ruthlessly practical, customers are pretty good monitors. But relying on them is also a spectacular way to erode trust.
So why didn't you know? The problem certainly isn't a lack of monitoring. You've got plenty of it, arguably too much. The reality is that observability systems are very good at flagging repeatable failures but much weaker at spotting novel ones, like that new dependency that spiked latency for five minutes before any alert fired.
A new class of AI-powered detection systems is closing that gap.
If engineering teams have more monitoring tools than ever, why does it still take more than an hour to detect incidents? And how is it possible that your customers are finding issues faster than your very expensive observability stack? Fair questions, and the kind your CFO will ask during the next OpEx review.
But first, let's chase down the root cause.
Most engineering teams are drowning in alerts. In one large SaaS environment RunLLM works with, the company receives hundreds of thousands of alerts per month. The vast majority are false positives—minor blips or seasonal traffic changes that never impact users.
To survive, engineers raise thresholds and mute noisy monitors. The problem? Those early warnings—a slight latency increase here, a modest error rate bump there—get filtered out along with the noise. By the time a real signal breaks through, customers have already noticed.
As one engineering manager put it: "We have dashboards for everything, but they're all right until they're wrong."
Sound familiar?
Modern stacks generate telemetry across dozens of systems. When an incident occurs, it rarely raises its hand politely to announce itself. Maybe a database query slows slightly. Perhaps API errors tick up a subtle 2%. Or load balancer latency creeps higher.
Each monitoring tool sees its own piece, but none can reason across systems to connect these signals. Even unified platforms don’t understand the relationships between metrics. They’ll show you that several things are happening, but not that together they spell a specific failure mode.
That’s why your on-call engineers get paid the big bucks, and why alert fatigue is inevitable when every tool screams in isolation.
So let's talk about your smart engineers who get rotated onto on-call shifts. Even when they're on the trail of an issue, the right signals are scattered across disparate parts of your stack, and pulling them together takes time. Humans are single-threaded. They have to toggle between dashboards, manually assemble a timeline, cross-reference logs with metrics, and hunt for the common thread. They can see patterns machines can't, and apply judgment and intuition. But it takes a long time to form a strong hypothesis and validate it. And they're expected to do this over and over again.
That’s why, according to Catchpoint's SRE Report 2024, the average mean time to detect (MTTD) for many organizations still exceeds 60 minutes. That's the time from when something breaks to when someone knows it's broken—over an hour just to realize there's a problem, let alone understand why.
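To make the metric concrete, here's a minimal Python sketch that computes MTTD from a handful of hypothetical incident records; the timestamps and field names are illustrative, not pulled from any real system.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: when the failure actually started vs. when
# anyone (or anything) first noticed it.
incidents = [
    {"started": datetime(2024, 5, 1, 21, 2),  "detected": datetime(2024, 5, 1, 22, 14)},
    {"started": datetime(2024, 5, 8, 9, 30),  "detected": datetime(2024, 5, 8, 9, 41)},
    {"started": datetime(2024, 5, 19, 3, 15), "detected": datetime(2024, 5, 19, 4, 55)},
]

# MTTD = average of (detection time - failure start) across incidents.
delays = [i["detected"] - i["started"] for i in incidents]
mttd = sum(delays, timedelta()) / len(delays)

print(f"MTTD: {mttd}")  # MTTD: 1:01:00 -- just over an hour
```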
Every minute you don't know about an incident is a minute it's getting worse.
When you detect an issue at minute 5, it might be affecting 1,000 users. By minute 30, that number could be 50,000. By minute 60, the issue is cascading into dependent systems, creating secondary failures.
Detection delays can transform minor incidents into headline-making outages. The difference isn't the root cause. It's how long the issue ran undetected.
90% of IT leaders say outages reduce customer confidence in their organization. And that damage multiplies when customers discover the problem before you do.
Customers don't just remember your service went down. They remember that they had to tell you. They remember having to refresh your status page over and over and still seeing "All Systems Operational" even though their transactions were failing.
As one Director of Engineering at a large SaaS company told us: "Too many of our incidents are detected by customers. Our mean time to detect was 27 hours—and that's unacceptable."
The gap between when something breaks and when you know about it is where reliability—and credibility—goes to die.
Traditional monitoring can't solve detection problems because it's fundamentally reactive: you set thresholds, wait for breaches, and then investigate. Worse, this approach assumes you know what failure looks like in advance.
Modern distributed systems, however, fail in novel ways—combinations of subtle issues that have never occurred together before. Traditional monitoring can't catch these patterns. You need AI-powered detection systems that can reason over patterns, not just measure individual metrics.
To do so, you need a purpose-built detection system that correlates signals across your entire stack, learns what normal looks like for each service, and prioritizes anomalies by their business impact.
Instead of evaluating each signal independently, the detection layer looks for meaningful patterns across your entire observability stack—logs, metrics, traces, deployment events, and infrastructure data.
Let’s say your database query latency increases by 15%. Your traditional APM might not alert because it's below the threshold. When API errors tick up by 2%, your error tracking tool stays quiet for the same reason. When both happen together within 30 seconds of a deployment, that's a pattern. Unless someone manually sets up complex alert combinations, this pattern goes undetected.
The system detects relationships that humans and isolated tools miss. It notices when three weak signals together indicate an emerging problem, often catching issues before any individual metric crosses an alerting threshold.
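Here's a minimal sketch of what that cross-signal reasoning can look like, built around the deployment example above. The signal names, thresholds, and 30-second window are assumptions for illustration, not a production rule set; a real detection layer would learn these correlations rather than hard-code them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Signal:
    source: str      # e.g. "apm", "error_tracking", "deploys"
    name: str        # e.g. "db_query_latency_pct_change"
    value: float
    at: datetime

def correlate(signals: list[Signal], window: timedelta = timedelta(seconds=30)) -> list[str]:
    """Flag weak signals that co-occur shortly after a deployment, even though
    none of them would trip its own alert threshold in isolation."""
    deploys = [s for s in signals if s.source == "deploys"]
    findings = []
    for deploy in deploys:
        # Gather non-deploy signals that landed within the window after this deploy.
        nearby = [
            s for s in signals
            if s.source != "deploys" and timedelta(0) <= s.at - deploy.at <= window
        ]
        latency_bump = any(s.name == "db_query_latency_pct_change" and s.value >= 10 for s in nearby)
        error_bump = any(s.name == "api_error_rate_pct_change" and s.value >= 1 for s in nearby)
        if latency_bump and error_bump:
            findings.append(
                f"Possible regression from deploy at {deploy.at:%H:%M:%S}: "
                "latency and error rate rose together within the window"
            )
    return findings
```

Neither signal alone clears an alert threshold; only the combination, anchored to the deployment event, does.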
Static thresholds fail in two ways: set them too tight and they bury your team in false positives; set them too loose and real problems slip past until customers feel them.
The detection model avoids both by building dynamic, context-aware baselines for each service, endpoint, and dependency. It learns what "normal" looks like across time of day, day of week, seasonal traffic patterns, and deployment cycles.
When something deviates from its learned behavior, the system flags it—not because it crossed an arbitrary number, but because it's unusual for that specific service at that specific time.
This dramatically reduces false positives while catching real anomalies earlier. You're not paged when traffic spikes during a launch—only when the spike pattern itself is anomalous.
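One common way to build such baselines is a running mean and variance per service and time slot, flagging values that are unusual for that slot rather than against one global number. The sketch below assumes hour-of-week bucketing and a 3-sigma cutoff; a production system would be considerably more sophisticated.

```python
import math
from collections import defaultdict
from datetime import datetime

class Baseline:
    """Per-(service, hour-of-week) running mean/variance via Welford's algorithm."""

    def __init__(self):
        # (service, hour_of_week) -> [count, mean, M2]
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])

    def _key(self, service: str, ts: datetime):
        return (service, ts.weekday() * 24 + ts.hour)

    def observe(self, service: str, ts: datetime, value: float) -> None:
        # Update the running statistics for this service's time slot.
        stat = self.stats[self._key(service, ts)]
        stat[0] += 1
        delta = value - stat[1]
        stat[1] += delta / stat[0]
        stat[2] += delta * (value - stat[1])

    def is_anomalous(self, service: str, ts: datetime, value: float, sigmas: float = 3.0) -> bool:
        count, mean, m2 = self.stats[self._key(service, ts)]
        if count < 30:  # not enough history for this slot to judge yet
            return False
        std = math.sqrt(m2 / (count - 1))
        return std > 0 and abs(value - mean) > sigmas * std
```

Because the baseline is keyed by service and time slot, a Monday-morning traffic surge that matches last Monday's stays quiet, while the same surge at 3 a.m. on a Sunday gets flagged.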
Not all anomalies matter equally. A database query that slows by 50ms might be statistically unusual. But if it only affects an internal admin tool, it's not urgent. That same 50ms slowdown on your checkout flow? That's revenue-impacting.
The detection system connects anomalies with user-facing transactions and business-critical paths. It distinguishes between "statistically interesting" and "actively hurting customers," ensuring you investigate what matters first.
This is where understanding your system's structure becomes critical. The reasoning layer needs to know which services map to which user flows, which APIs power revenue-generating features, and which dependencies are on critical paths. AI-powered detection systems learn this underlying context and can separate these subtle differences.
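A toy illustration of that prioritization step is below. The service-to-flow map and the weights are hand-written assumptions here; the point of an AI-powered detection layer is that it learns this context from traces, routing configuration, and usage data instead.

```python
# Hypothetical mapping from services to the user flows they power, with a
# rough business-impact weight for each flow.
CRITICAL_PATHS = {
    "checkout-api":   {"flow": "checkout",       "weight": 1.0},  # revenue path
    "payments-db":    {"flow": "checkout",       "weight": 1.0},
    "search-service": {"flow": "product search", "weight": 0.6},
    "admin-tool":     {"flow": "internal admin", "weight": 0.1},
}

def prioritize(anomalies: list[dict]) -> list[dict]:
    """Rank anomalies by statistical severity weighted by business impact, so a
    50ms slowdown on checkout outranks the same slowdown on an internal tool."""
    for a in anomalies:
        path = CRITICAL_PATHS.get(a["service"], {"flow": "unknown", "weight": 0.3})
        a["flow"] = path["flow"]
        a["priority"] = a["severity"] * path["weight"]
    return sorted(anomalies, key=lambda a: a["priority"], reverse=True)

ranked = prioritize([
    {"service": "admin-tool",   "severity": 0.8},  # very unusual, low impact
    {"service": "checkout-api", "severity": 0.5},  # moderately unusual, revenue path
])
print([(a["service"], round(a["priority"], 2)) for a in ranked])
# [('checkout-api', 0.5), ('admin-tool', 0.08)]
```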
Customer-reported incidents aren't inevitable. They're a symptom of systems that measure everything but understand nothing—tools that report what's happening but can't reason about what it means.
These systems close that gap. By continuously connecting signals across your stack, learning what normal looks like, and prioritizing what matters, they detect issues before they cascade.
When you consistently detect problems before customers notice, trust compounds. Users start thinking, "They always know before we do." Detecting at minute 5 instead of minute 60 keeps problems contained and confidence intact.
We've seen MTTD drop by an order of magnitude—often from an hour to minutes—and catch issues before any customer impact occurs. You're no longer waiting for a threshold breach. You're detecting the pattern that precedes the failure. Because the only thing worse than downtime is being the last to know.