The distinction between monitoring and observability has been discussed extensively in the broader infrastructure community, but for database teams specifically, the implications are sharper than they might first appear. Monitoring tells you when pre-defined thresholds are breached. Observability gives you the ability to ask questions you didn't anticipate needing to ask. For databases, where failure modes are often subtle and emergent rather than binary, this distinction is the difference between catching a silent replication corruption early and discovering it 24 hours into a cascading outage.[1]

This article examines what the distinction means in practice for database teams, what current tooling does and doesn't cover, and where the gap between monitoring-as-practiced and observability-as-needed creates real operational risk.

73% of organisations lack full-stack observability.
2x outage cost gap: roughly $2M/hr without observability vs $1M/hr with it.
64% of organisations achieve a 25%+ MTTR improvement with observability tools.

The formal distinction, and why it matters for databases

The term "observability" has its roots in control theory, specifically the work of Rudolf Kálmán in 1960, who defined it as a measure of how well the internal states of a system can be inferred from knowledge of its external outputs.[2][3] Kálmán paired observability with controllability: knowing the state of a system is only useful if you can act on what you learn. In a database context, this pairing is instructive. Observing that a specific query plan has regressed is only valuable if you have the context to understand why and the ability to intervene.

Charity Majors, co-founder of Honeycomb and co-author of Database Reliability Engineering, draws the line clearly: monitoring addresses known unknowns, where you decide in advance what to watch and alerts fire when those thresholds are crossed. Observability addresses unknown unknowns, where you can ask arbitrary questions of your system, including questions you didn't formulate before the incident began.[4][5][6]

This is not a semantic distinction. Consider the GitHub October 2018 incident: a 43-second network maintenance window triggered a MySQL failover that left the database in a split-brain replication state. The monitoring showed the database was accessible. The dashboards said "healthy." The replication state was silently corrupted for hours. With monitoring, the question you could ask was "is the database up?" With observability, the question you'd ask is "for this specific set of user requests that are failing, what does the replication state look like, and how does it differ from the state we saw before the maintenance window?"

Hazel Weakly and Fred Hebert captured this in their 2024 definition: observability is the process through which one develops the ability to ask meaningful questions, get useful answers, and act effectively on what you learn.[7] The emphasis on "act effectively" is important. A dashboard that shows you a problem you can't diagnose is monitoring. A system that lets you trace from symptom to root cause to remediation, even for failure modes you've never encountered, is observability.

What dashboards miss: four categories

The limitations of dashboard-based monitoring become concrete when you examine the specific categories of database problems that dashboards consistently fail to surface.

Multi-dimensional correlations

A query that runs slowly only for users in one geography, on one client version, during peak load, after a specific deploy. This four-way intersection is invisible to a dashboard that shows "average query latency." The dashboard averages the slow queries in with the fast ones, and the signal disappears into the noise.[4]

This is where high cardinality becomes the defining technical differentiator. Cardinality refers to the number of unique values a dimension can take. A field like "database_engine" has low cardinality (PostgreSQL, MySQL, a handful of others). A field like "query_hash" or "user_id" has high cardinality, potentially millions of unique values. Monitoring tools pre-aggregate metrics, which forces you to choose your dimensions in advance and collapses high-cardinality fields into averages or buckets. Observability tools that store raw structured events with high-cardinality fields intact let you slice on arbitrary combinations of dimensions at query time: show me the P99 latency for this specific query hash, from this specific connection pool, on this specific replica, during this specific five-minute window after the last deploy. That kind of ad-hoc investigation is what "asking questions you didn't anticipate" looks like in practice.[4][24]
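To make the difference concrete, here is a minimal sketch in Python of query-time slicing over raw wide events. The event records, field names, and values are hypothetical; the point is that the fleet-wide average buries a regression that a four-way slice chosen after the fact surfaces immediately.

```python
import statistics

# Hypothetical raw events: one wide record per query execution, with
# high-cardinality fields (query_hash, replica) kept intact rather than
# pre-aggregated away.
events = [
    {"query_hash": "a1f3", "region": "eu-west", "client_version": "4.2.0",
     "replica": "db-3", "latency_ms": 940.0},
    {"query_hash": "a1f3", "region": "us-east", "client_version": "4.1.9",
     "replica": "db-1", "latency_ms": 12.0},
    {"query_hash": "b7c2", "region": "eu-west", "client_version": "4.2.0",
     "replica": "db-3", "latency_ms": 15.0},
]

def p99(latencies):
    """Approximate P99 from raw samples (adequate for a sketch)."""
    ordered = sorted(latencies)
    index = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return ordered[index]

def slice_latency(events, **dims):
    """Filter raw events on any combination of dimensions, then aggregate.
    The dimensions are chosen at query time, not at instrumentation time."""
    matched = [e["latency_ms"] for e in events
               if all(e.get(k) == v for k, v in dims.items())]
    return p99(matched) if matched else None

# The fleet-wide average hides the regression...
overall = statistics.mean(e["latency_ms"] for e in events)

# ...but slicing on a four-way intersection surfaces it.
regressed = slice_latency(events, query_hash="a1f3", region="eu-west",
                          client_version="4.2.0", replica="db-3")
```

A monitoring pipeline that stored only `statistics.mean` per minute could never recover the sliced view, because the raw dimensions were discarded at write time.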

Novel failure modes

Any failure condition that isn't pre-represented as a dashboard panel is invisible to monitoring. The GitLab backup failure (a silent pg_dump version mismatch), the GitHub replication corruption, and the Cloudflare SQL catalog assumption were all invisible to dashboards because nobody had anticipated these specific failure modes.[8][9][10] This is not a criticism of the teams involved. It's a structural limitation of monitoring: you can only alert on conditions you've imagined in advance.

Cross-service causality

A slow database query caused by a noisy-neighbour workload from a completely different service is a common pattern in shared infrastructure. The database dashboard shows high latency, but the root cause is elsewhere. Observability platforms that correlate telemetry across services can surface this relationship. Siloed database dashboards cannot.[11]

Gradual degradation

Dashboards with fixed thresholds miss slow-moving degradation. A query whose P95 latency increases 5% per week will trigger no alerts until it's already causing visible user impact. By the time the threshold fires, the problem has been compounding for weeks. Observability tools with trend analysis and anomaly detection can catch this pattern earlier, before it reaches the point of user-visible impact.[12][13]
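The compounding-drift pattern is easy to sketch. The following Python fragment uses hypothetical weekly P95 samples growing 5% per week: a static 500 ms threshold stays silent for months, while a simple week-over-week growth check fires early. Real anomaly detection is more sophisticated; this only illustrates why trend analysis catches what thresholds miss.

```python
# Hypothetical weekly P95 latency samples (ms), each week ~5% worse than
# the last. A static 500 ms threshold would stay silent for months.
weekly_p95 = [200 * 1.05 ** week for week in range(8)]
STATIC_THRESHOLD_MS = 500.0

def weekly_growth_rate(samples):
    """Average week-over-week growth ratio across the series."""
    ratios = [later / earlier for earlier, later in zip(samples, samples[1:])]
    return sum(ratios) / len(ratios)

growth = weekly_growth_rate(weekly_p95)
threshold_breached = weekly_p95[-1] > STATIC_THRESHOLD_MS  # still False
trend_alert = growth > 1.02  # flag sustained >2% weekly drift; fires now
```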

Observability 1.0 vs 2.0: a shift in architecture

The observability community has begun drawing a distinction between two generations of tooling, and the architectural difference matters for database teams.[5]

In the first generation, often called Observability 1.0, teams work with multiple sources of truth: metrics in a time-series database, logs in an ELK stack or similar, traces in Jaeger or Zipkin. Decisions about how to structure the data are made at write time, which means the questions you can ask are constrained by how you chose to aggregate and store the data months ago. If you didn't create a metric for a specific dimension, you can't query on it after the fact.

The second generation, Observability 2.0, moves toward a single source of truth: wide structured log events that capture many dimensions per event. Metrics, logs, and traces are all derived from the same underlying events. The critical difference is that questions can be asked at query time, rather than needing to be anticipated at write time. For database teams, this means you can investigate a query performance regression by slicing on dimensions (specific tables, specific connection pools, specific client versions, specific time windows) that you didn't know would be relevant when you instrumented the system.
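The "single source of truth" idea can be sketched in a few lines of Python. All field names here are hypothetical; the structural point is that one arbitrarily wide event is written per execution, and anything metric-shaped is derived from those events on demand rather than fixed at write time.

```python
import json
import time

def emit_wide_event(sink, **fields):
    """Append one arbitrarily wide structured event. Decisions about
    aggregation are deferred to query time."""
    event = {"timestamp": time.time(), **fields}
    sink.append(json.dumps(event))

sink = []  # stands in for an event store
emit_wide_event(sink,
                query_hash="a1f3", table="orders", connection_pool="web",
                client_version="4.2.0", deploy="2024-06-01T10:14",
                replica="db-3", rows_returned=42, latency_ms=18.7)

# A "metric" is just a query over the same events, derived on demand:
events = [json.loads(line) for line in sink]
latency_by_pool = {}
for e in events:
    latency_by_pool.setdefault(e["connection_pool"], []).append(e["latency_ms"])
```

In an Obs 1.0 setup, `latency_by_pool` would have to exist as a pre-declared metric before the incident; here it is invented during the investigation.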

In practice, many database teams operate somewhere between these two generations. They may have rich pg_stat_statements or Performance Schema data alongside traditional metrics, giving them some ability to ask ad-hoc questions while still relying on pre-configured dashboards for day-to-day monitoring. The gap between this hybrid state and full observability is where MTTR improvements tend to stall.

The current tooling landscape

The database monitoring and observability market is large, growing, and fragmented. Analyst firms project the overall observability market at tens of billions of dollars annually, and the cloud database and DBaaS market is on a trajectory from roughly $20 billion in 2024 to over $100 billion by 2035.[14][15] But market size doesn't tell you whether the tools actually solve the observability problem for database teams.

Tool | Approach | Best For | Key Limitation
Datadog | Full-stack APM + DB correlation | Enterprise, multi-service | Expensive at scale; 51.82% market share[14]
New Relic | Dev-centric APM, unified telemetry | Developer teams, APM-first | Database is secondary to app; 24% market share[14][16]
Percona PMM | Open-source, self-hosted | MySQL/PostgreSQL specialists | No fleet-level strategic intelligence
SolarWinds DPM | Query-level, auto-detection | MySQL, MongoDB on-prem | Vendor-centric; limited multi-engine support[17]
Grafana + Prometheus | Open-source, visualization | Cost-conscious, DevOps-capable | No AI-assisted analysis; requires DIY assembly
AWS CloudWatch DB Insights | Native AWS, ML anomaly detection | AWS-only estates | 500-instance limit; per-account/region only[18][12]
Honeycomb | Obs 2.0; high-cardinality structured events | Teams that need ad-hoc investigation of unknown unknowns | Smaller market share; requires instrumentation investment[24]
Dynatrace | AI-powered, full-stack | AI-driven root cause analysis | High cost; 3.38% market share[14]

An important distinction is buried in this table. By the definition of observability used in this article (the ability to ask arbitrary questions of your system at query time), many of the tools listed above are monitoring platforms with better interfaces, not observability tools. Datadog, New Relic, Grafana+Prometheus, and CloudWatch are built around pre-aggregated metrics and dashboards. They can tell you that something is wrong, but they constrain your investigation to dimensions that were anticipated at instrumentation time. Tools built on the Obs 2.0 model (Honeycomb being the most prominent example) store high-cardinality structured events and let you slice across arbitrary dimensions at query time, which is what enables the "unknown unknowns" capability the rest of this article argues for.[5][24] The distinction matters because choosing a monitoring tool and calling it "observability" creates a false sense of coverage.

Beyond that architectural split, a second pattern emerges. The tools with the largest market share (Datadog, New Relic) are APM platforms that treat database monitoring as one module among many. They're strong at correlating application-level telemetry with database performance, but they typically lack deep database-specific intelligence: they won't tell you that your autovacuum_vacuum_scale_factor needs adjusting or that a specific index is being used by one query but is slowing down writes across the table. The database-native tools (Percona PMM, SolarWinds DPM) go deeper on per-query analysis but lack the cross-service correlation that observability requires. And the cloud-native options (CloudWatch DB Insights) are locked to a single provider.

The result is that many database teams stitch together multiple tools, often running Percona PMM or pg_stat_monitor alongside Datadog or Grafana, to approximate the observability they need. This works, but it creates its own problems: alert fatigue from overlapping monitors, context-switching between tools during incidents, and gaps where no tool covers the intersection.

The effectiveness data: what actually improves

The case for observability investment is supported by published data, though with important caveats. Research from Imply and others shows that 64% of organisations using observability tools achieve a 25% or greater improvement in MTTR.[19][20] Organisations with full-stack observability face outage costs of roughly $1 million per hour, versus $2 million per hour for those without, suggesting that the investment roughly halves incident costs for the incidents that do occur.[19]

However, no published study was found that directly measures what percentage of database incidents are first detected by monitoring alerts versus user reports versus application errors. This is a genuine gap in the public literature, not a search limitation. The implication is that the common assumption, that monitoring catches incidents before users notice, is an article of faith for many teams rather than a measured reality.

Three-quarters of organisations report positive ROI from observability investments, with nearly 1 in 5 reporting 3x to 10x returns.[19] But the same research notes that observability costs are rising faster than the value organisations can extract, driven by data volume growth that forces teams into painful trade-offs about what to instrument and what to ignore. It's worth noting that this cost pressure is partly an Obs 1.0 problem: when your architecture stores metrics, logs, and traces as separate pre-aggregated streams, data volume scales with the number of things you choose to monitor, and each new dimension multiplies cost. Obs 2.0 architectures that store raw structured events and derive metrics at query time face different cost dynamics, though they bring their own trade-offs around query performance and storage.[5] For database teams specifically, this tension is acute regardless of architecture: a busy PostgreSQL instance can generate gigabytes of pg_stat_statements data per day, and the question of what to sample and what to retain is itself a significant engineering decision.

What observability-driven database teams do differently

Practitioners at scale organisations have articulated a specific pattern for database observability that goes beyond "add more dashboards."[21][22][23]

Google's SRE practice encapsulates the priority as "SLOs first, then alerts, then dashboards." The sequence matters. You define what good looks like for users first (an SLO), then build the alerting that tells you whether you're meeting it, and only then build the dashboards that help you investigate when you're not. Many database teams run this sequence in reverse: they start with the dashboards their monitoring tool ships with, add alerts when something goes wrong, and never get around to defining what "healthy" means from the user's perspective.
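"SLOs first" means "healthy" becomes a number before any alert or dashboard exists. A minimal sketch in Python, with a hypothetical latency SLO and made-up traffic: the team defines the objective, computes how much error budget remains, and only then decides what alerting and dashboards should watch that budget.

```python
# Hypothetical SLO: 99.9% of queries complete under 250 ms over the window.
SLO_TARGET = 0.999
LATENCY_SLO_MS = 250.0

def error_budget_remaining(latencies_ms, target=SLO_TARGET,
                           threshold_ms=LATENCY_SLO_MS):
    """Fraction of the window's error budget still unspent."""
    bad = sum(1 for latency in latencies_ms if latency > threshold_ms)
    budget = (1 - target) * len(latencies_ms)  # allowed bad events
    return 1.0 if budget == 0 else max(0.0, 1 - bad / budget)

# 10,000 queries in the window, 4 of which breached the latency objective.
# The budget allows 10 bad events, so 60% of the budget remains.
latencies = [20.0] * 9996 + [900.0] * 4
remaining = error_budget_remaining(latencies)
```

An alert on fast budget burn (e.g. `remaining` dropping sharply within an hour) tells you whether users are affected; a dashboard then exists to investigate why, which is the ordering Google's sequence prescribes.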

Netflix treats observability as a development tool, not just an operational one. Their engineering teams use observability data in development feedback loops, catching performance regressions before they reach production rather than waiting for post-deployment monitoring to surface them.[21] This reflects a broader shift that Charity Majors has advocated: when you have genuine observability, production becomes your primary feedback environment. Rather than trying to reproduce database performance issues in staging (which rarely has representative data volumes or query patterns), teams with rich production telemetry can deploy, observe the actual behaviour, and iterate. The prerequisite is that your observability lets you ask sufficiently precise questions, slicing by deploy version, specific query patterns, specific user cohorts, to distinguish new behaviour from existing behaviour with confidence.[4]

For database teams specifically, the shift in practice looks like this: moving from "let's add another dashboard panel" to "let's ensure we can answer any question about this database's behaviour in production." The former scales linearly with complexity (every new failure mode needs a new panel). The latter scales logarithmically (a well-instrumented system lets you investigate failures you didn't anticipate).

Teams that have made this shift tend to share a few characteristics. They run time-bucketed query analytics continuously (using pg_stat_monitor or similar) rather than sampling periodically. They build automated anomaly detection on query execution patterns rather than relying on static thresholds. And they perform post-deploy query plan comparison as a standard release gate, catching regressions before users do, rather than waiting for complaints.[23]
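The third practice, post-deploy comparison as a release gate, can be sketched as a diff of two pg_stat_statements-style snapshots. The snapshots here are hypothetical dicts of `queryid` to mean execution time in milliseconds (the field names echo PostgreSQL's `queryid` and `mean_exec_time` columns); the regression ratio is an illustrative choice, not a recommendation.

```python
REGRESSION_RATIO = 1.5  # flag queries that got >=50% slower post-deploy

def query_regressions(before, after, ratio=REGRESSION_RATIO):
    """Return queryids whose mean execution time regressed past the ratio,
    mapped to their (pre, post) timings. Queries new in `after` are skipped
    since they have no baseline."""
    flagged = {}
    for queryid, post_ms in after.items():
        pre_ms = before.get(queryid)
        if pre_ms and post_ms / pre_ms >= ratio:
            flagged[queryid] = (pre_ms, post_ms)
    return flagged

# Hypothetical snapshots taken before and after a deploy.
before = {101: 4.0, 202: 12.5, 303: 0.8}
after = {101: 4.1, 202: 31.0, 303: 0.7, 404: 2.0}  # 404 is a new query

regressed = query_regressions(before, after)  # query 202 is flagged
```

A real gate would also normalise for call volume and compare plans, not just timings, but the shape is the same: a baseline snapshot, a post-deploy snapshot, and an automated diff that fails the release before users complain.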

The gap between monitoring and understanding

The incidents explored in our cost of database incidents analysis share a common thread: the failure mode was either invisible to monitoring or visible only in retrospect. This is not because the teams lacked monitoring. GitLab had monitoring. GitHub had monitoring. Cloudflare had monitoring. The monitoring was doing exactly what it was designed to do: checking for conditions that had been anticipated. The problem was that the actual failure was not among them.

John Allspaw's framing of incidents as unplanned investments connects directly to the observability question. If incidents are investments in understanding your system, then observability is the tool that maximises the return on that investment. A team with rich, queryable telemetry can extract far more learning from an incident than a team that has to reconstruct what happened from sparse dashboard snapshots and log grep sessions.

For DBRE teams specifically, the observability gap is acute. A DBRE managing multiple database engines across multiple environments needs to answer questions that span dimensions: "is this MySQL performance pattern related to the PostgreSQL vacuum behaviour we saw last week?" That question requires correlation across engines, across time windows, and across failure modes, which is precisely what dashboard-based monitoring cannot provide.

Where this is heading

The trajectory of database observability points toward three developments. First, the convergence of APM and database-specific tooling. The gap between "application-aware but database-shallow" and "database-deep but application-blind" is narrowing, but slowly. Teams that can bridge both perspectives, understanding a database performance issue in the context of the application behaviour that's driving it, resolve incidents faster.

Second, the shift from reactive alerting to proactive analysis. Rather than waiting for a threshold to fire, observability-driven teams are building systems that continuously analyse patterns, surface anomalies, and flag regressions before they become incidents. This is the difference between a smoke alarm (monitoring) and a structural engineer who inspects the building regularly (observability).

Third, the question of who does the analysis. The tooling landscape assumes that a human is asking the questions. But the volume of telemetry generated by a modern database fleet exceeds what any team can review manually. The organisations that extract the most value from observability will be the ones that combine rich telemetry with automated, intelligent analysis: systems that can surface the signal, propose the diagnosis, and help the team act, rather than simply presenting them with more data to sift through.

References

  1. How a 43-Second Network Issue Led to a 24-Hour GitHub Outage — Bytesized Design. Analysis of the October 2018 GitHub MySQL incident, where monitoring showed "healthy" while replication state was silently corrupted.
  2. Observability (control theory) — Wikipedia. Kálmán's 1960 definition of observability as a measure of how well internal states can be inferred from external outputs.
  3. Rudolf E. Kálmán — Wikipedia. Background on the control theory origin of observability paired with controllability.
  4. Observability 101: Terminology and Concepts — Honeycomb. Charity Majors on the monitoring (known unknowns) vs observability (unknown unknowns) distinction, including high-cardinality slicing.
  5. It's Time to Version Observability: Introducing Observability 2.0 — Honeycomb. Charity Majors defines the architectural shift from multi-pillar tooling (Obs 1.0) to arbitrarily wide structured events with query-time analysis (Obs 2.0).
  6. Database Reliability Engineering — Charity Majors & Laine Campbell, O'Reilly. Foundational text on applying reliability engineering principles to database operations.
  7. Observability Is About Confidence — Hazel Weakly & Fred Hebert (2024). Defines observability as the process of developing the ability to ask meaningful questions, get useful answers, and act effectively.
  8. Postmortem of Database Outage of January 31 — GitLab. The 2017 PostgreSQL incident where a silent pg_dump backup failure was invisible to monitoring.
  9. GitHub October 2018 Incident Analysis — Bytesized Design. MySQL split-brain replication state invisible to dashboards.
  10. Cloudflare Outage on November 18, 2025 Post Mortem — Hacker News discussion. SQL catalog assumption failure invisible to pre-configured monitoring.
  11. Monitoring Distributed Systems — Google SRE Book. Principles for cross-service telemetry correlation and the limitations of siloed monitoring.
  12. CloudWatch Anomaly Detection — AWS. ML-based anomaly detection for catching gradual degradation that fixed thresholds miss.
  13. DevOps Guru for Amazon RDS — AWS. Automated anomaly detection and trend analysis for RDS database performance.
  14. Application Performance Monitoring Market Share 2026 — 6sense. Market share data: Datadog 51.82%, New Relic 24%, Dynatrace 3.38%.
  15. Cloud Database and DBaaS Market Analysis — Grand View Research. Market trajectory from $19.76B (2024) to $111.34B (2035).
  16. New Relic Platform — New Relic. Unified telemetry approach with application-first database monitoring.
  17. Database Performance Monitor — SolarWinds. Query-level auto-detection for MySQL and MongoDB with 5-8 minute setup claims.
  18. Performance Insights — AWS. Native AWS database monitoring with the 500-instance per-account/region limitation.
  19. Observability Costs Rising Faster Than Value — Imply via BusinessWire. 64% achieve 25%+ MTTR improvement; 75% report positive ROI; $1M vs $2M hourly outage cost differential.
  20. MTTR: Average vs Excellent Incident Response Times — SleekOps. Benchmark data from 150K+ incidents on monitoring-only response rates.
  21. Netflix Tech Blog — Netflix engineering on observability as a development tool, not just operational. Posts on Atlas, Edgar, and their observability platform.
  22. Service Level Objectives — Google SRE Book. The "SLOs first, then alerts, then dashboards" prioritisation for observability.
  23. Percona Monitoring and Management — Percona. Open-source database monitoring with pg_stat_monitor integration and time-bucketed query analytics.
  24. What Is Observability? — Honeycomb. The Obs 2.0 approach: high-cardinality structured events with query-time analysis, enabling investigation of unknown unknowns without pre-aggregation.