Mean Time to Resolve (MTTR) is one of the few metrics that directly connects engineering practice to business impact. Every minute a production database is degraded or unavailable, transactions fail, users leave, and revenue stops. For mid-size and large organisations, a single hour of database downtime typically costs more than $300,000.[1] For enterprises, it's north of $1 million.[2]
And yet, large incident datasets from major incident management vendors, covering over 150,000 incidents, show that more than half take over two hours to resolve. Only about 8% are resolved within 30 minutes.[3] The gap between what's possible and what actually happens in most organisations is enormous.
A caveat before we go further. John Allspaw, the former CTO of Etsy and one of the most influential voices in incident response, has argued persuasively that MTTR as a standalone metric generates very little insight.[9] Two incidents with identical resolution times can involve completely different challenges, levels of uncertainty, and quality of decision-making. The number alone doesn't tell you whether your team is getting better. He's right. What follows is not an argument that you should optimise for a number on a dashboard. It's a framework for improving the conditions under which your team investigates and resolves database incidents. A lower MTTR is a side effect of that improvement, not the goal itself.
With that framing, let's look at where the time actually goes, and what you can do to get it back.
Where the time actually goes
When you decompose a database incident timeline, the resolution itself is often the shortest part. The longest phases are detection, mobilisation, and investigation. Someone has to notice something is wrong. The right person has to be reached. That person has to orient themselves, figure out what changed, and work backwards to the contributing factors. Only then does the actual fix happen, which is frequently straightforward: kill a query, add an index, adjust a parameter, or fail over to a replica.
This is why MTTR improvements rarely come from hiring faster engineers. They come from compressing the time before the engineer starts fixing things.
The seven most common database incident types
Understanding which failures are most frequent helps you prioritise where to build investigation workflows. From practitioner post-mortems and community data, these are the patterns that appear most often.
Lock contention and deadlocks. Multiple transactions competing for the same rows, or long-running transactions holding locks that cascade into contention across the system. A single poorly-scoped transaction can escalate into hundreds of simultaneous lock waits.[5]
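The wait-for relationships behind a deadlock form a cycle, which makes detection mechanical once you've extracted who blocks whom (for example from `pg_locks`). A minimal sketch in Python, assuming you already have a pid-to-blocking-pid mapping (the pids below are illustrative):

```python
def find_wait_cycle(waits_for):
    """Detect a cycle in a wait-for graph.

    waits_for maps a blocked pid to the pid it is waiting on.
    Returns the pids forming the cycle, or None if there is no deadlock.
    """
    for start in waits_for:
        seen = []
        pid = start
        while pid in waits_for:
            if pid in seen:
                # We revisited a pid on this path: everything from its
                # first occurrence onward is the deadlock cycle.
                return seen[seen.index(pid):]
            seen.append(pid)
            pid = waits_for[pid]
    return None

# 101 waits on 202, 202 on 303, 303 on 101: a three-way deadlock.
waits = {101: 202, 202: 303, 303: 101, 404: 101}
print(find_wait_cycle(waits))  # [101, 202, 303]
```

The same traversal also answers the more common non-deadlock question: which session sits at the head of a blocking chain.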
Connection pool exhaustion. All available connections are in use and new requests start queuing. This is often a secondary symptom: slow queries hold connections longer than expected, which exhausts the pool, which makes everything else slow too. By the time the alert fires, the original cause may no longer be the most visible problem.[6]
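Because exhaustion is a lagging symptom, the useful alert fires on utilisation well before the ceiling. A hedged sketch of that leading-indicator check, with an illustrative 80% warning threshold:

```python
def pool_alert(in_use, max_size, warn_at=0.8):
    """Leading-indicator check on connection pool saturation.

    Warn as the pool approaches its ceiling, before requests queue.
    warn_at=0.8 is an illustrative default, not a recommendation.
    """
    utilisation = in_use / max_size
    if utilisation >= 1.0:
        return "critical: pool exhausted"
    if utilisation >= warn_at:
        return f"warning: pool at {utilisation:.0%}"
    return None

print(pool_alert(95, 100))  # warning: pool at 95%
```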
Query regression. A new deployment introduces a query that performs differently than expected. A missing index, a changed execution plan, or stale statistics can cause a query that ran in milliseconds to suddenly take seconds. These regressions are often progressive, degrading over minutes or hours rather than failing immediately.
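One illustrative way to catch a progressive regression is to compare recent latency samples against a pre-deploy baseline. The multiplier and absolute floor below are placeholder thresholds, not recommendations; the floor stops tiny absolute changes on fast queries from tripping the check:

```python
from statistics import median

def regressed(baseline_ms, recent_ms, factor=3.0, floor_ms=5.0):
    """Flag a query whose recent median latency regressed vs baseline.

    Fires only when the recent median exceeds both a relative multiple
    of the baseline and an absolute increase (illustrative thresholds).
    """
    base = median(baseline_ms)
    now = median(recent_ms)
    return now > max(base * factor, base + floor_ms)

print(regressed([2, 2, 3], [40, 50, 45]))  # True: 2ms -> 45ms median
```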
Resource exhaustion. CPU saturation from unbounded table scans, disk pressure from large batch operations, or memory pressure from misconfigured parameters. These tend to affect the entire instance, making it difficult to isolate the specific cause without looking at multiple signals together.
Replication lag. A secondary falls behind the primary, causing stale reads or complicating failover if the primary goes down. The cause is often upstream: a heavy write workload, a long-running transaction on the primary, or insufficient resources on the replica.
Autovacuum failure (PostgreSQL). Dead tuple bloat accumulates faster than autovacuum can clean it, degrading query plans and consuming storage. With PostgreSQL's default settings, a billion-row table can accumulate 200 million dead rows before autovacuum even triggers.[7] This is a slow-building problem that eventually crosses a threshold and becomes an urgent one.
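The arithmetic behind that number follows directly from PostgreSQL's trigger formula, `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples`, with the defaults of 50 and 0.2:

```python
def autovacuum_trigger(reltuples, scale_factor=0.2, threshold=50):
    """Dead tuples that must accumulate before autovacuum runs on a table.

    Mirrors PostgreSQL's vacuum trigger formula; the defaults shown are
    the stock values of autovacuum_vacuum_scale_factor (0.2) and
    autovacuum_vacuum_threshold (50).
    """
    return threshold + scale_factor * reltuples

# A billion-row table: ~200 million dead rows before autovacuum triggers.
print(autovacuum_trigger(1_000_000_000))  # 200000050.0
```

Lowering `autovacuum_vacuum_scale_factor` per-table on very large tables is the usual mitigation, since the default scales linearly with row count.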
Configuration drift. Instances in the same fleet with inconsistent parameter settings. Everything works until load shifts and the inconsistently-configured instance behaves differently from the rest. These are among the hardest incidents to diagnose because the configuration difference may have been introduced weeks or months earlier.
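Detecting drift is a straightforward diff across the fleet's parameter settings. A sketch, assuming you can already dump each instance's parameters into a dict (instance names and parameters below are illustrative):

```python
def config_drift(fleet):
    """Report parameters whose value differs anywhere in the fleet.

    fleet maps instance name -> {parameter: value}. Returns
    {parameter: {instance: value}} for every inconsistent parameter.
    """
    drift = {}
    params = set().union(*(cfg.keys() for cfg in fleet.values()))
    for p in sorted(params):
        values = {name: cfg.get(p) for name, cfg in fleet.items()}
        if len(set(values.values())) > 1:
            drift[p] = values
    return drift

fleet = {
    "db-1": {"work_mem": "64MB", "shared_buffers": "8GB"},
    "db-2": {"work_mem": "4MB", "shared_buffers": "8GB"},
}
print(config_drift(fleet))  # {'work_mem': {'db-1': '64MB', 'db-2': '4MB'}}
```

Run on a schedule, a check like this turns a weeks-old silent difference into a ticket instead of a 3am incident.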
A seven-step incident response framework
Every organisation's process will look slightly different, but the most effective database incident workflows share a common structure. What matters is that the structure exists before the incident does.
- Detect. Automated alerting on leading indicators: P95 latency spikes, connection pool utilisation crossing a threshold, replication lag exceeding acceptable bounds. The goal is to detect degradation before it becomes an outage. Alerts should be specific enough that the on-call engineer knows which system to look at first.
- Acknowledge. The on-call engineer confirms they've seen the alert and are investigating. This sounds trivial, but a five-minute acknowledgement SLA prevents the scenario where an alert fires and nobody acts on it because everyone assumes someone else is handling it.
- Triage. Is this user-impacting? Which service and which database instance are affected? How many users or transactions are involved? Triage determines the urgency and scope of the response. A degradation affecting one low-traffic service needs a different response than a full outage on your primary transaction database.
- Investigate. This is where most time is lost. The engineer needs to answer: what changed? For databases, that means looking at active queries (`pg_stat_activity` or `performance_schema`), lock wait graphs, recent deployment history, and resource metrics. The faster this information is surfaced, the faster the investigation converges.
- Mitigate. Restore service first, understand causes second. Kill the offending queries, increase the connection pool, fail over to a replica, or roll back the recent deployment. Mitigation is about stopping the bleeding, not about understanding why it happened.
- Resolve. Once service is restored and the contributing factors are understood, apply a fix: add the missing index, correct the query, adjust the configuration, or update the schema migration process. Note: most incidents have multiple interacting causes rather than a single root cause, so resist the temptation to stop investigating once you've found one plausible explanation.
- Post-mortem. Document what happened, what was confusing during the response, how the team made decisions under pressure, and what could improve detection next time. The post-mortem is where institutional knowledge gets created. Without it, the same failure mode will produce the same investigation delay next time.
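The Investigate step converges faster when every responder starts from the same pre-agreed queries. An illustrative PostgreSQL set, sketched as Python strings (column lists trimmed; adapt to your version and workload before relying on them):

```python
# Illustrative investigation queries for PostgreSQL. All three views
# (pg_stat_activity, pg_stat_replication) are standard system views.
INVESTIGATION_QUERIES = {
    "active_sessions": (
        "SELECT pid, state, now() - query_start AS runtime, query "
        "FROM pg_stat_activity WHERE state <> 'idle' "
        "ORDER BY runtime DESC;"
    ),
    "lock_waits": (
        "SELECT pid, wait_event_type, wait_event "
        "FROM pg_stat_activity WHERE wait_event_type = 'Lock';"
    ),
    "replication_lag": (
        "SELECT client_addr, replay_lag FROM pg_stat_replication;"
    ),
}
```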
Five techniques that actually compress MTTR
The framework above describes the process. The following techniques target the specific phases where time is most commonly wasted.
1. Pre-built investigation runbooks
The single highest-leverage MTTR improvement for most teams. For each of the seven failure modes above, write down the exact investigation steps: which queries to run, which metrics to check, what patterns point toward likely contributing factors. This externalises the knowledge that currently lives in your most experienced engineer's head. When a less experienced team member is on call at 3am, the runbook turns a 45-minute investigation into a 10-minute one.
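A runbook doesn't need special tooling; even a keyed list of steps that an alert handler can surface is a start. A minimal sketch with two illustrative entries (the step wording is an assumption, not a prescription):

```python
# Illustrative runbook structure: failure mode -> ordered steps.
RUNBOOKS = {
    "lock_contention": [
        "Check pg_locks for granted = false rows and note blocking pids",
        "Inspect the blocking session's query and transaction age",
        "If a long-lived transaction holds the locks, consider terminating it",
    ],
    "connection_pool_exhaustion": [
        "Check pool utilisation and queue depth in the application metrics",
        "Look for slow queries holding connections in pg_stat_activity",
        "Mitigate: kill the slow queries or raise the pool ceiling",
    ],
}

def runbook_for(alert_name):
    """Return the runbook for an alert, with a sane default for gaps."""
    return RUNBOOKS.get(
        alert_name,
        ["No runbook yet: start with active sessions and recent deploys"],
    )
```

Keying runbooks by the same names your alerts use means the right one can be linked, or posted, automatically when the alert fires.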
The runbook doesn't need to be perfect. It needs to exist. You can refine it after every incident through the post-mortem process.
2. Automated diagnostic collection on alert
When an alert fires, a script or tool automatically runs the standard investigation queries: active sessions, lock chains, recent query plan changes, resource utilisation, replication status. The results are posted to the incident channel before the on-call engineer even opens their laptop. Instead of starting from "what's happening?", they start from "here's what's happening, what does it mean?"
Organisations using this pattern report that it alone can cut investigation time by more than half.[4] The automation doesn't need to be sophisticated. A shell script triggered by your alerting system that queries pg_stat_activity and pg_locks and posts the results to Slack is often enough to start.
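A hedged sketch of that pattern in Python rather than shell. `run_psql` shells out to `psql` (the `-At` flags request unaligned, tuples-only output), and the DSN and webhook URL are placeholders you would supply:

```python
import json
import subprocess
import urllib.request

def run_psql(sql, dsn="postgresql://localhost/postgres"):
    # -A: unaligned output, -t: tuples only, -c: run a single command.
    out = subprocess.run(["psql", dsn, "-At", "-c", sql],
                         capture_output=True, text=True, check=True)
    return out.stdout

def format_report(alert, sections):
    """Turn {section title: query output} into one incident-channel message."""
    lines = [f"*{alert}* automated diagnostics:"]
    for title, body in sections.items():
        lines.append(f"--- {title} ---\n{body.strip() or '(no rows)'}")
    return "\n".join(lines)

def post_to_slack(webhook_url, text):
    # Slack incoming webhooks accept a JSON payload with a "text" field.
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

Wired to your alerting system's webhook, the sequence is: run each diagnostic query, format the report, post it to the incident channel.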
3. ChatOps integration for incident coordination
When an alert fires, automatically create a dedicated Slack channel (or thread) with the relevant context pre-populated: which instance, what the alert condition was, links to dashboards, and the output from the automated diagnostic collection. Everyone involved in the incident works in one place. The channel becomes the post-mortem artifact.
Teams using this approach through tools like PagerDuty and Opsgenie consistently report that MTTR for standard failure patterns drops from hours to minutes,[4] because the mobilisation and context-gathering phases that normally consume the first 30 to 60 minutes are compressed into seconds.
4. Query plan change detection
Many database incidents trace back to a query whose execution plan changed unexpectedly. A new index was added, statistics were updated, or the query planner chose a different join strategy. If you're tracking query plans over time (through tools like pg_stat_monitor or equivalent), you can detect these changes as they happen and alert before they cascade into a user-facing problem. This shifts from reactive incident response to proactive prevention.
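A crude but workable version is to fingerprint normalised plan text and compare it against the last stored fingerprint. The normalisation below (stripping cost annotations so estimate-only changes compare equal) is deliberately simplistic; real tooling normalises much more:

```python
import hashlib
import re

def plan_fingerprint(plan_text):
    """Hash a normalised EXPLAIN plan so equivalent plans compare equal.

    Stripping the (cost=... rows=... width=...) annotations means changed
    estimates alone don't count as a plan change; only the plan shape does.
    """
    normalised = re.sub(r"\(cost=[^)]*\)", "", plan_text)
    return hashlib.sha256(normalised.encode()).hexdigest()[:12]

def plan_changed(stored_fingerprint, current_plan_text):
    return stored_fingerprint != plan_fingerprint(current_plan_text)

fp = plan_fingerprint("Seq Scan on users  (cost=0.00..431.00 rows=21000 width=4)")
print(plan_changed(fp, "Index Scan using users_pkey on users  (cost=0.25..8.27 rows=1 width=4)"))  # True
```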
5. Automated remediation for known patterns
For well-understood failure modes with safe, bounded responses, automation can act without human involvement. A query exceeding a duration threshold gets terminated. A connection pool approaching capacity triggers a pre-configured scale-up. A replication lag spike initiates a pre-defined failover sequence. These are not replacements for human judgement. They are codified responses to situations where the correct action is known in advance and the blast radius is bounded.
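The "bounded" part is what makes this safe. A sketch of the decision logic for automated query termination, with illustrative thresholds and a per-sweep cap; it deliberately skips maintenance commands, which are usually long-running on purpose:

```python
def queries_to_terminate(sessions, max_runtime_s=300, max_kills=3):
    """Pick sessions that are safe to terminate automatically.

    sessions is a list of dicts with pid, runtime_s, state, and query
    (as you might assemble from pg_stat_activity). Thresholds and the
    per-sweep cap are illustrative; the cap bounds the blast radius.
    """
    candidates = [
        s for s in sessions
        if s["runtime_s"] > max_runtime_s
        and s["state"] == "active"
        and not s["query"].lstrip().upper().startswith(("VACUUM", "ANALYZE"))
    ]
    # Kill the longest-running offenders first, at most max_kills per sweep.
    candidates.sort(key=lambda s: s["runtime_s"], reverse=True)
    return [s["pid"] for s in candidates[:max_kills]]
```

In PostgreSQL the returned pids would then be passed to `pg_terminate_backend()`; the selection logic, not the kill itself, is where the safety lives.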
A word of caution here. Automated remediation works well for the incidents that match known patterns, which tend to be the easier ones. The incidents that truly hurt are the ones that don't match any pattern, where multiple things go wrong at once, or where the automated response itself interacts badly with the actual failure. Over-reliance on pattern-matching can atrophy the diagnostic skills your team needs for those novel situations. The automation handles the routine so your people stay sharp for the exceptions, not so they stop practising.
What best-in-class looks like
The teams that consistently achieve sub-60-minute MTTR share a few common traits. They have runbooks for their most common failure modes. They collect diagnostic data automatically when an alert fires. They coordinate in a single channel with context pre-populated. And they treat every post-mortem as an opportunity to improve the system rather than assign blame.
| Dimension | Typical team | Best-in-class team |
|---|---|---|
| Detection | Generic threshold alerts, often noisy | Tuned alerts on leading indicators with low false-positive rate |
| Investigation start | Engineer manually queries the database to orient | Automated diagnostics posted to incident channel on alert |
| Knowledge | Investigation approach depends on who's on call | Runbooks codify investigation steps for common patterns |
| Coordination | Scattered across DMs and ad-hoc calls | Dedicated incident channel with pre-populated context |
| Post-incident | Post-mortem sometimes happens, rarely updates process | Every incident improves a runbook or adds an alert |
Notably, none of this requires expensive tooling or exotic infrastructure. It requires discipline: the willingness to write down what you know, automate the repetitive parts of investigation, and treat every incident as a learning opportunity.
The real goal: learning, not just speed
This is where Allspaw's critique of MTTR becomes practically important, not just philosophically interesting. In his talk *Incident Analysis: How Learning is Different Than Fixing*, he draws a sharp distinction between the two activities that happen after an incident.[10] Fixing is about action items: add the missing index, increase the connection pool, update the alert threshold. Learning is about understanding how the incident unfolded, what decisions were made under pressure, what information was available and what wasn't, and how the organisation's systems and processes shaped the outcome.
Most post-incident reviews focus almost entirely on fixing. They produce a list of action items, assign owners, and move on. Allspaw's observation is that these documents are, in his words, written to be filed, not to be read. The action items get completed (or don't), but the deeper understanding of how the team actually responded never gets captured.
This matters for MTTR because the techniques above (runbooks, automated diagnostics, ChatOps) address the mechanical parts of incident response. They're necessary and high-leverage. But the harder, slower improvements come from studying how your team actually thinks during incidents. Which signals did they look at first? What hypotheses did they form and discard? Where did they get stuck, and why? These questions can't be answered by a number on a dashboard, but the answers are what turn a reactive team into a genuinely resilient one.
The practical implication: don't just measure your MTTR. After each significant incident, spend time understanding the response itself. The runbook tells your team what to check. The post-incident learning process tells you whether the runbook is right, whether it's missing something, and whether the team's mental model of the system matches how it actually behaves.
The observability connection
Organisations that invest in database reliability engineering see measurable MTTR improvements. Research shows that 64% of organisations using observability tools reported a 25% or greater improvement in MTTR, with nearly 36% seeing improvements in both detection time and resolution time simultaneously.[8]
The reason is straightforward: observability closes the gap between "something is wrong" and "here's why." Traditional monitoring tells you that CPU is at 95%. Observability tells you which queries are driving that CPU usage, when their execution plans changed, and what deployment preceded the change. The investigation step, which is where most MTTR is lost, compresses dramatically when the right signals are already correlated.
Where automation fits
As database fleets grow and the talent gap widens, the investigation and diagnostic work that consumes most of MTTR becomes a natural target for automation. Not the judgement calls (should we fail over? should we roll back the deploy?) but the information gathering that precedes them: what's running right now, what changed recently, what does the lock graph look like, which queries shifted plans.
This is the kind of work that an experienced DBRE does reflexively but that takes a less experienced engineer significant time to replicate. Tooling that automates this investigative layer doesn't replace the engineer's judgement. It gives them the information they need to exercise that judgement faster.
References
1. Cost of IT Downtime in 2025: What SMBs Need to Know — MEV. 90% of firms report that one hour of downtime costs over $300,000.
2. Cost of IT Downtime Statistics, Data & Trends (2026) — The Network Installers. Average cost exceeds $14,000/min for midsize businesses, $23,750/min for large enterprises.
3. Incident benchmark data: MTTR and other stats from over 150K incidents — r/sre. ~8% resolved in under 30 minutes; 55%+ required over 2 hours.
4. MTTR Statistics: What's Normal vs. What's Excellent — SleekOps. Average MTTR 3–5 hours; best-in-class under 1 hour. ChatOps integration reported to reduce MTTR from hours to minutes.
5. Locks, Contention, and Performance: A Technical Study — dbsnoop. Documents how lock contention incidents can cascade into hundreds of simultaneous lock waits.
6. Your Database Is the Bottleneck: How to Prove It with Load Testing — LoadForge. Connection pool exhaustion as secondary symptom of slow queries.
7. Debugging Postgres autovacuum problems: 13 tips — Citus Data. With default autovacuum_vacuum_scale_factor of 0.2, large tables accumulate massive bloat before vacuum runs.
8. Observability: Understanding Its Concept, Pillars, and Strategic Value — MGX. 64% of organisations using observability tools reported 25%+ MTTR improvement; 35.7% saw improvements in both MTTR and MTTD.
9. Moving Past Shallow Incident Data — John Allspaw, Adaptive Capacity Labs (2018). Argues that common incident measurements like MTTR generate very little insight about incidents or how to improve.
10. Incident Analysis: How *Learning* is Different Than *Fixing* — John Allspaw (2020). Distinguishes between post-incident fixing (action items) and learning (understanding how the team actually responded under pressure).