95%
of enterprise AI pilots with no measurable P&L impact (MIT, 2025)
80%+
of AI projects that fail overall (RAND Corporation)
46%
of AI POCs scrapped before reaching production

The MIT finding

In 2025, MIT published "The GenAI Divide: State of AI in Business," based on executive interviews, surveys of business leaders and employees, and analysis of 300 public AI deployments. The headline finding was stark: 95% of enterprise AI pilots delivered no measurable impact on the profit and loss statement. Billions invested, hundreds of pilots launched, and the overwhelming majority produced nothing that showed up in financial results.[1]

The finding is less surprising when you examine what those failed pilots looked like. The pattern is predictable: generic "AI solutions" bolted onto existing workflows, endless proofs of concept that demonstrated technical possibility but stalled at the integration boundary, and deployment in high-visibility areas (marketing, sales) rather than in areas where the operational problem was specific enough to measure.[2]

MIT's own analysis identified the root cause as integration failure, not model quality. The AI models worked. What didn't work was connecting them to existing enterprise systems, data pipelines, and operational workflows in a way that produced measurable outcomes. The study found a "learning gap" where both the tools and the organisations using them had not matured enough to bridge the distance between a working demo and a production system that moved a financial metric.[1]

The budget goes to the wrong place

One of MIT's more pointed observations was about where enterprise AI budgets are actually spent. Over half of generative AI spending goes to sales and marketing applications, even though back-office automation (document processing, procurement workflows, risk review) consistently shows higher ROI. The explanation is institutional: sales and marketing are where the executive sponsor sits, where the demo is easiest to build, and where the narrative about AI's potential is most compelling to a board audience.[1]

This is a familiar dynamic for anyone who has worked in infrastructure engineering. The flashy project gets funded. The operational improvement that would save real money gets deprioritised because it's harder to put in a slide deck. AI has reproduced this pattern at scale, and the 95% failure rate is partly a consequence of it.

A Fivetran survey found that 42% of enterprise AI projects were delayed or underperformed specifically due to data readiness, and a separate Capital One/Forrester study of 500 data leaders found that 73% cited data quality as the primary barrier to AI project success.[3][4] The problem is not that AI cannot work. The problem is that organisations are deploying AI into environments where the foundational data infrastructure has not been built to support it.

What the 5% have in common

The minority of AI projects that deliver measurable results share a pattern that the failed projects do not: they are specific about what they do, they have a clear baseline metric to measure against, and they are integrated into a single operational workflow rather than deployed as a general-purpose capability.

Intercom rebuilt customer support with Fin. Fin is not a general-purpose AI chatbot bolted onto Intercom's existing support workflow. It is a purpose-built AI agent designed to resolve customer support tickets autonomously. The results are measurable and specific: according to Intercom's public disclosures, Fin resolves over one million customer issues per week and handles upwards of 80% of support volume for some customers. Intercom publicly markets Fin with outcome-based pricing at $0.99 per resolved ticket, with a guarantee of up to $1M for certain enterprise customers if resolution targets are not met.[5][6]

Walmart reported $75M in annual savings through AI-optimised logistics. Not a general AI initiative. A specific application: optimising truck routes and load utilisation. According to Walmart's own disclosures, the measurable outcome was $75 million in annual savings and a 72-million-pound reduction in CO2 emissions from reduced fuel usage. The AI was trained on a narrow operational problem with clear input data (routes, loads, fuel costs) and a clear success metric (cost per delivery).[7]

What these examples share is not model sophistication. Intercom uses large language models. Walmart's logistics optimisation likely uses more conventional ML. What they share is specificity of application: one workflow, one measurable outcome, one clear integration point. They did not bolt AI onto an existing process. They rebuilt the process around what AI could do well.

The AI washing problem

Alongside the genuine failures, there is a growing category of AI projects that were marketed as AI but were not, in any meaningful sense, using AI to solve the problem they claimed. The term "AI washing" has entered the regulatory vocabulary: the CFA Institute published a formal research report on it, the FTC launched "Operation AI Comply" to prosecute AI-related deception, and 46 AI-related securities class actions have been filed in the US since 2020, with a significant proportion involving allegations of overstated AI capabilities.[8][9]

AI washing matters for enterprise buyers because it pollutes the signal about what AI actually does well. When a vendor claims "AI-powered" and the product is a rules engine with a chatbot interface, the buyer's experience of "AI" is that it doesn't work. That experience feeds the 95% narrative, even though the product in question was not really an AI application. The cynicism is partly earned and partly a consequence of a market that incentivises overclaiming.

For technical buyers evaluating AI-powered infrastructure tools, the distinction between genuine AI and marketing AI is worth interrogating. Questions worth asking: What specific model architecture does the product use? What training data does it learn from? What happens when it encounters a situation it hasn't seen before? Does it degrade gracefully, or does it produce confident-sounding nonsense? And critically: what is the measurable outcome it is designed to produce, and how is that outcome measured?

What this means for database operations

Database operations is an area where the specificity principle applies with particular clarity. The operational problem is well-defined: databases produce telemetry (metrics, logs, query plans, wait events), incidents have identifiable root causes, and resolution time is measurable. The data is structured and available. The workflow is understood. The success metric (mean time to resolution, incident frequency, cost per incident) is quantifiable.
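The quantifiability claim is easy to make concrete. A minimal sketch of computing MTTR from incident timestamps (the incident records below are invented for illustration):

```python
from datetime import datetime

# Invented incident records: (detected_at, resolved_at) pairs.
incidents = [
    (datetime(2025, 3, 1, 9, 0), datetime(2025, 3, 1, 13, 30)),    # 4.5 h
    (datetime(2025, 3, 8, 2, 15), datetime(2025, 3, 8, 5, 0)),     # 2.75 h
    (datetime(2025, 3, 20, 14, 0), datetime(2025, 3, 20, 14, 45)), # 0.75 h
]

def mttr_hours(incidents):
    """Mean time to resolution in hours across a list of incidents."""
    total_seconds = sum((end - start).total_seconds() for start, end in incidents)
    return total_seconds / len(incidents) / 3600

print(round(mttr_hours(incidents), 2))  # → 2.67
```

The same shape of calculation works for incident frequency or cost per incident: the inputs are timestamps and counts, not judgement calls, which is precisely what makes the domain measurable.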

This makes database operations a stronger candidate for AI application than many of the domains where enterprise AI budgets are currently being spent. A marketing AI that generates copy has a diffuse success metric (did revenue increase? by how much? was it the AI or the campaign strategy?). A database AI that identifies root causes has a specific one: did it identify the correct root cause, and did it do so faster than a human would have?
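That specificity can be scored directly. A sketch of evaluating such a system against labelled incidents (all field names and values below are invented):

```python
# Invented labelled incidents: ground-truth cause, the AI's diagnosis,
# and time-to-diagnosis for the AI versus a human baseline.
labelled = [
    {"true_cause": "lock_contention", "ai_cause": "lock_contention", "ai_min": 4, "human_min": 180},
    {"true_cause": "plan_regression", "ai_cause": "plan_regression", "ai_min": 6, "human_min": 240},
    {"true_cause": "replication_lag", "ai_cause": "vacuum_backlog",  "ai_min": 5, "human_min": 90},
]

# Two numbers answer the two questions in the text: was the root cause
# correct, and was it found faster than the human baseline?
accuracy = sum(i["true_cause"] == i["ai_cause"] for i in labelled) / len(labelled)
speedup = sum(i["human_min"] for i in labelled) / sum(i["ai_min"] for i in labelled)

print(f"root-cause accuracy: {accuracy:.0%}, diagnosis speedup: {speedup:.0f}x")
```

The marketing-AI equivalent of this table is hard to even construct, which is the point.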

The MTTR reduction case is illustrative. The average database incident takes 3 to 5 hours to resolve. Best-in-class teams do it in under 60 minutes. The gap is not primarily about skill. It is about the time spent on diagnostic investigation: querying pg_stat_activity, examining lock chains, correlating query plan changes with deployment events, checking replication lag, reviewing connection pool behaviour. This is structured, repeatable analytical work applied to structured data, exactly the kind of task where AI has demonstrated measurable capability when the application is specific enough.[10]
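One of those diagnostic steps can be sketched in a few lines: reducing a lock-wait snapshot to its root blocker. The blocked-to-blocking mapping below is invented; a real pipeline would derive it from pg_locks joined to pg_stat_activity.

```python
# Invented snapshot: each blocked backend PID mapped to the PID blocking it.
blocked_by = {
    4312: 4188,  # 4312 waits on 4188
    4188: 4021,  # 4188 waits on 4021
    4501: 4021,  # 4501 also waits on 4021
}

def root_blocker(pid, blocked_by):
    """Follow the wait chain from a blocked session to the session at its head."""
    seen = set()  # guard against deadlock cycles in the snapshot
    while pid in blocked_by and pid not in seen:
        seen.add(pid)
        pid = blocked_by[pid]
    return pid

print(root_blocker(4312, blocked_by))  # → 4021
```

A human runs this analysis by eyeballing query output; the structure of the task does not change when a machine does it, only the latency.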

Contrast this with a generic "AI for IT operations" (AIOps) platform that claims to handle everything from network monitoring to application performance to security alerting. The breadth of the claim is the red flag. An AI system that is genuinely good at database observability needs to understand database-specific telemetry: wait events, query plans, vacuum behaviour, replication topology, connection pool dynamics. An AI system that claims to handle databases, networks, applications, and security is almost certainly shallow in each domain, because the training data, feature engineering, and evaluation criteria are fundamentally different across those domains.

Native beats bolted-on

The distinction that separates the 5% from the 95% maps cleanly onto the distinction between AI-native and AI-augmented products. An AI-augmented product takes an existing workflow and adds AI as a feature: a dashboard with an "AI insights" panel, a monitoring tool with an "AI root cause" button, a ticketing system with AI-suggested resolutions. The AI is a layer on top of an architecture that was designed without it.

An AI-native product is designed from the beginning around what AI can do. The data pipeline is built to feed the model. The user interface is designed to present model output, not to present dashboards with an AI sidebar. The workflow assumes AI is doing the analytical work, with humans reviewing and acting on findings rather than doing the analysis themselves.

This is the same distinction that separated successful early Rails applications from failed enterprise Java ports: the products that worked were designed for the new tool, not adapted from designs that assumed the old one. Intercom's Fin works because Intercom rebuilt their support workflow around what an AI agent could do. A competitor that bolted a chatbot onto their existing ticketing system and called it "AI-powered support" would produce a qualitatively different (and worse) outcome.

For database reliability engineers evaluating AI-powered tooling, the question is whether the tool was built around database intelligence or whether AI was added as a feature to a monitoring product. The architecture of the product reveals the answer: does the tool ingest raw database telemetry and produce diagnostic findings, or does it ingest pre-aggregated metrics and produce natural language descriptions of what the metrics show? The first is AI doing diagnostic work. The second is AI describing a dashboard, and you don't need AI for that.
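Even a trivial rule illustrates the difference: a tool operating on raw telemetry emits a finding, not a restatement of a chart. A toy sketch (event names follow PostgreSQL's wait-event naming; the sample distribution is invented):

```python
from collections import Counter

# Invented sample window of raw wait-event observations.
samples = ["LWLock:WALWrite"] * 60 + ["Client:ClientRead"] * 25 + ["IO:DataFileRead"] * 15

def dominant_wait(samples, threshold=0.5):
    """Return a diagnostic finding if one wait event dominates the window, else None."""
    event, n = Counter(samples).most_common(1)[0]
    share = n / len(samples)
    return f"{event} accounts for {share:.0%} of sampled waits" if share >= threshold else None

print(dominant_wait(samples))  # → LWLock:WALWrite accounts for 60% of sampled waits
```

A genuine diagnostic system layers many such signals with learned weightings; the contrast with a tool that merely narrates a pre-built latency graph holds either way.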

Specificity as a filter

The 95% statistic is sobering, but it is also useful as a filter. If you are evaluating an AI product (for database operations or anything else), the specificity test separates products that are likely to deliver measurable value from products that are likely to join the 95%.

The questions are straightforward. What specific operational problem does this product solve? What is the measurable outcome it is designed to produce? What data does it need to produce that outcome? How is the outcome measured? What happens when the AI is wrong? And what does the product do that a well-configured Grafana dashboard with alert rules cannot?

If the answers are specific, the product has a chance of being in the 5%. If the answers are vague ("AI-powered insights," "intelligent automation," "predictive analytics"), the product is probably in the 95%. The MIT study did not find that AI doesn't work. It found that unspecific AI doesn't work. The distinction matters, and it is the distinction that should inform how teams evaluate, adopt, and measure AI-powered infrastructure tools going forward.
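The vague-answer filter above is mechanical enough to encode. A toy version, using the red-flag phrases from the text (the function and scoring are illustrative, not a real evaluation method):

```python
# Red-flag marketing phrases, taken from the examples in the text.
VAGUE_PHRASES = ("ai-powered insights", "intelligent automation", "predictive analytics")

def likely_in_the_95(answers):
    """Flag a product whose answers lean on vague marketing language.

    `answers` maps each evaluation question to the vendor's answer.
    """
    return any(
        phrase in answer.lower()
        for answer in answers.values()
        for phrase in VAGUE_PHRASES
    )

vendor = {
    "What problem does it solve?": "It delivers AI-powered insights across your stack.",
    "What is the measurable outcome?": "Intelligent automation of workflows.",
}
print(likely_in_the_95(vendor))  # → True
```

The real evaluation is a conversation, not a string match, but the principle survives the simplification: specific answers name workflows, data, and metrics; vague answers name adjectives.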

References

  1. MIT Report: 95% of Generative AI Pilots at Companies Failing — Fortune. Coverage of MIT's "The GenAI Divide: State of AI in Business 2025" study based on executive interviews, surveys of business leaders and employees, and analysis of 300 public AI deployments.
  2. AI's Wall Street Reality Check — Axios. Analysis of the MIT findings and the disconnect between AI investment levels and measurable enterprise returns.
  3. Fivetran Report: Nearly Half of Enterprise AI Projects Fail Due to Poor Data Readiness — Fivetran. Survey finding that 42% of enterprise AI projects were delayed or underperformed due to poor data readiness.
  4. The Surprising Reason Most AI Projects Fail — Informatica. CDO Insights 2025 data showing 43% of organisations citing data quality, alongside the Capital One/Forrester survey of 500 data leaders in which 73% cited data quality as the primary barrier to AI success.
  5. How Intercom Built $100M AI Agent — Mostly Metrics. Fin's growth from $1M to approaching $100M ARR, outcome-based pricing model, and resolution rate improvements.
  6. Intercom Tops $400M in Recurring Revenue as AI Agent Fin Nears $100M Milestone — Business Post. Fin resolving 1M+ issues per week and handling 80%+ of support volume.
  7. Enterprise AI Adoption Case Studies — NineTwoThree. Walmart's $75M annual savings from AI-optimised logistics, including fuel reduction and CO2 impact.
  8. AI Washing Report — CFA Institute Research. Formal research on AI washing practices in financial services and the growing regulatory response.
  9. AI Washing: The Cultural Traps That Lead to Exaggeration — California Management Review. Analysis of organisational incentives that drive AI overclaiming and 46 AI-related securities class actions since 2020.
  10. From Alert to Resolution: How Incident Response Automation Cuts MTTR and Closes Gaps — PagerDuty. Automation workflows reducing MTTR from hours to minutes for standard failure patterns through structured diagnostic automation.