We ran an informal poll asking database and SRE teams where they actually coordinate during a critical database incident. The results were lopsided but not surprising:

  Slack / Teams Channel (Chat): 58%
  Zoom / Google Meet (Video): 33%
  Physical War Room (Office): 8%
  Jira / Ticket Comments: 0%

The 0% for Jira is the detail that jumps out. Ticketing systems are where incidents are documented after resolution, not where they are coordinated during resolution. The fact that nobody selected it confirms something practitioners already know: the tools designed for incident tracking and the tools actually used for incident coordination are different systems entirely.

The war room moved into chat

The physical war room, a conference room where responders gathered around screens during a major incident, was the default coordination model through the 2000s and into the 2010s. The Y2K era popularised the format for coordinated software patching, and it persisted in operations teams for two decades because it solved a real problem: getting the right people looking at the same data at the same time, with the lowest possible communication latency.[1]

The shift to virtual coordination was already underway before 2020 as distributed teams became more common, but the pandemic accelerated it dramatically. Physical war rooms require physical proximity, and once teams proved they could coordinate incident response remotely, there was little operational reason to go back. The 8% of respondents still using physical war rooms likely represent co-located teams, financial services firms with regulatory requirements for on-site incident handling, or organisations where the most senior database expertise happens to sit in the same office.[2]

What replaced the physical war room was not a single tool but a pattern: ChatOps. The term was coined by GitHub in 2013 to describe their approach of running operations through chat, but the practice has evolved well beyond its origins. In a modern ChatOps incident workflow, the alert fires in PagerDuty or Opsgenie, a dedicated Slack channel is auto-created with a naming convention like #incident-db-20260326, on-call responders are automatically invited, and the channel becomes the coordination surface for the duration of the incident.[3]
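The auto-created channel step above can be sketched in a few lines. This is purely illustrative: real platforms such as PagerDuty and incident.io generate these names themselves, and the `incident_channel_name` helper and its exact convention are assumptions, not any vendor's API.

```python
from datetime import date

def incident_channel_name(service: str, opened: date) -> str:
    """Build a channel name following the #incident-<service>-<YYYYMMDD>
    convention described above. Hypothetical helper for illustration."""
    # Slack channel names must be lowercase, at most 80 characters,
    # with no spaces or periods; truncate defensively.
    name = f"incident-{service.lower()}-{opened.strftime('%Y%m%d')}"
    return name[:80]

print(incident_channel_name("db", date(2026, 3, 26)))  # incident-db-20260326
```

The convention matters less than its consistency: a predictable name lets bots, search, and post-incident tooling find the channel without human routing.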

Why chat won

Chat-based coordination won the 58% majority for reasons that go beyond convenience. Several properties of chat make it structurally better suited to incident coordination than the alternatives.

Asynchronous participation. Not everyone needed for an incident is available simultaneously, particularly for database incidents that span time zones. A chat channel allows an engineer in London to post diagnostic findings that an engineer in San Francisco picks up two hours later without repeating the investigation. A video call requires synchronous presence; anyone who joins late has to be caught up verbally, consuming time from the people who should be investigating.

Automatic audit trail. Chat produces a timestamped record of who said what, when. This is not a minor advantage. Post-incident reviews depend on reconstructing the timeline of the investigation: when was the alert acknowledged, when was the root cause hypothesised, when was the mitigation applied, what was tried and discarded. A Slack channel provides this timeline by default. A video call provides it only if someone is designated to take notes, and the quality of those notes depends on that person's ability to listen and write simultaneously under pressure.

Tool integration. This is the structural advantage that separates chat from video. A Slack channel can receive automated messages from monitoring systems (Datadog, Grafana, CloudWatch), display query results from diagnostic bots, trigger runbook actions through slash commands, and surface relevant dashboards as links. The chat channel becomes a coordination surface that connects people and systems. A video call connects people to people but requires screen sharing to bring in system data, and screen sharing gives everyone the same view rather than allowing each responder to investigate independently.

Scalability. A chat channel can accommodate 5 or 50 participants without degradation. Participants can mute the channel if it's not relevant to their current investigation thread, then catch up on the timeline when they have a finding to share. A video call with more than 6 to 8 active participants becomes unmanageable: people talk over each other, microphones need muting and unmuting, and the signal-to-noise ratio drops.

The incident management platform layer

The 58% chat number understates the sophistication of what's actually happening in those channels, because a growing category of incident management platforms now orchestrate the entire incident lifecycle from within Slack or Teams.

| Platform | Approach | Notable capability |
| --- | --- | --- |
| incident.io | Declare and manage incidents entirely within Slack; AI-generated summaries and timelines | Used by 1,500+ teams including Netflix and Etsy; automated investigations in 1 to 2 minutes[4] |
| PagerDuty | Alert routing with auto-created Slack channels; responders acknowledge and resolve from chat | ChatOps integration reported to reduce MTTR from hours to minutes for standard patterns[5] |
| FireHydrant | Automated Slack channel creation; acts as incident scribe, recording chatter and attachments | Comprehensive lifecycle management targeting mid-to-large teams[6] |
| Rootly | AI-native platform triggering automation within Slack: channels, responder pulls, role assignment | Deep workflow automation with a focus on smaller, faster-moving teams[6] |

What these platforms share is a design philosophy: the incident channel is the source of truth, and the platform orchestrates workflows around it. This is a significant departure from earlier incident management tools that expected responders to leave their coordination surface (chat, phone, or room) and go to a separate web interface to update status, assign roles, or log timeline events. The friction of context-switching during an incident is not trivial. Platforms that eliminate it by meeting responders where they already are (in Slack) have a structural advantage over those that require responders to go somewhere else.

The 33%: when video makes sense

A third of respondents coordinate via Zoom or Google Meet, and there are legitimate reasons for this. Video calls provide higher-bandwidth communication than chat for certain incident phases, particularly the early triage phase where the scope and severity of the incident are unclear and rapid verbal exchange helps establish shared understanding faster than typed messages.

The pattern that works well in practice is a hybrid: a Slack channel as the persistent coordination surface, with a video bridge opened for the first 15 to 30 minutes of a major incident when synchronous discussion accelerates triage. Google's SRE handbook describes incident management roles (Incident Commander, Communications Lead, Operations Lead) that were originally designed for verbal coordination, adapted from the Incident Command System established by firefighters in 1968.[7] The Incident Commander role in particular benefits from voice communication during the triage phase because it requires rapid decision-making about severity, scope, and resource allocation.

The risk with video-first coordination is that it creates information loss. Diagnostic findings shared verbally on a call are available to the people on the call at that moment and to nobody else. If the incident extends beyond the initial call, if a database expert joins two hours in, if the post-incident review needs to reconstruct the timeline, the verbal findings are gone unless someone captured them. Teams that default to video often compensate by designating a scribe, but this is a manual workaround for a problem that chat solves by default.

What this means for database incidents specifically

Database incidents have properties that make chat-based coordination particularly well suited. The investigation workflow for a database incident is typically diagnostic rather than creative: check pg_stat_activity for active queries, examine lock wait graphs, review connection pool utilisation, look for recent schema changes or deployment events, compare query plans against baselines. These are structured investigation steps that produce data, and data is better communicated as text and screenshots in a chat channel than as verbal descriptions on a video call.[8]

Consider the difference between these two coordination modes during a PostgreSQL lock contention incident:

Video call: "I'm looking at pg_stat_activity and there's a query that's been running for 47 minutes, it's a SELECT on the orders table with a join to order_items, the PID is 28451, and it's waiting on a lock held by PID 28390 which appears to be an ALTER TABLE that started at 14:23." The responder hearing this has to hold all of that information in working memory while simultaneously formulating their response.

Chat channel: The responder pastes the query output, the lock chain, and the relevant pg_locks rows. Everyone in the channel can see the data, refer back to it, and investigate independently. Five minutes later, another responder pastes the deployment log showing the ALTER TABLE was triggered by a migration in the 14:20 release. The timeline builds itself.
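The pasted lock data can be reduced to a blocking chain mechanically. A sketch assuming simplified (pid, blocked_by) pairs rather than real pg_locks rows, which require a more involved join (PostgreSQL's `pg_blocking_pids()` does the heavy lifting in practice); it also assumes no deadlock cycles, which would loop:

```python
def blocking_chains(locks):
    """Given simplified (pid, blocked_by) pairs, where blocked_by is
    None for a chain head, walk each backend up to its root blocker."""
    blocked_by = dict(locks)
    chains = {}
    for pid in blocked_by:
        chain = [pid]
        # Follow the blocked-by edges until we reach a head.
        while blocked_by.get(chain[-1]) is not None:
            chain.append(blocked_by[chain[-1]])
        chains[pid] = chain
    return chains

# The incident from the example: 28451 (SELECT) waits on 28390 (ALTER TABLE).
chains = blocking_chains([(28451, 28390), (28390, None)])
print(chains[28451])  # [28451, 28390]
```

Posting the computed chain, rather than prose describing it, is what lets a responder joining late act on the finding immediately.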

The chat-based approach produces better outcomes for database incidents because database investigation is fundamentally about data, and chat is a better medium for sharing, referencing, and building on data than voice.

The incident command structure in a chat-first world

The Google SRE framework defines three incident roles: Incident Commander (IC), Communications Lead (CL), and Operations Lead (OL). These roles were designed for large-scale incidents requiring coordinated response across multiple teams.[7] In a chat-first world, these roles still apply, but their communication medium changes in ways that affect how they function.

The IC in a chat channel can pin messages, set channel topics to reflect current status, and use threaded replies to separate investigation streams. The CL role becomes partially automated: incident management platforms like incident.io generate status updates and stakeholder communications from the channel activity. The OL role benefits from the ability to issue commands (literal Slack slash commands or bot invocations) that trigger diagnostic runbooks, database health checks, or automated remediation steps directly within the coordination channel.

For database reliability engineers, this means that the on-call engineer's effectiveness during an incident depends not just on their database knowledge but on their ability to operate within a chat-based incident workflow: knowing which diagnostic output to surface, how to format it for readability in a channel, and how to communicate findings in a way that builds the investigation timeline rather than creating noise.

The diagnostic gap in the chat channel

There is, however, a gap in the current chat-based incident workflow that the incident management platforms have not fully addressed. The platforms are excellent at coordination: creating channels, assigning roles, tracking timelines, generating summaries. What they are less good at is investigation: the actual diagnostic work of figuring out why the database is misbehaving.

In a typical database incident channel, the investigation phase still depends on individual engineers running manual diagnostic queries, interpreting the results, and posting their findings. The incident management platform orchestrates the people. The observability tooling (Datadog, Grafana, the database's own metrics) provides the raw data. But the gap between "here is a PagerDuty alert and an auto-created Slack channel with the right people in it" and "here is the root cause and recommended mitigation" is still filled primarily by human expertise and manual investigation.

This is the gap that affects incident cost and MTTR. The coordination is fast. The investigation is slow. An engineer who receives a PagerDuty alert at 3am, joins the auto-created Slack channel, and sees the right dashboards linked is in a better starting position than an engineer who had to manually create a channel and find the right people. But they still need to run the diagnostic queries, interpret the results, correlate the findings with recent changes, and formulate a mitigation plan. The coordination tooling saved minutes. The investigation still takes hours.

The next evolution of incident tooling, and the area where we see the most opportunity, is closing that diagnostic gap: not just getting the right people into a channel faster, but giving them answers faster once they're there. Automated root cause analysis that runs the diagnostic investigation in parallel with the human coordination, surfacing findings into the incident channel as structured data rather than requiring engineers to run manual queries under pressure.
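One concrete piece of that automated investigation is change correlation: checking what shipped shortly before the incident began. A sketch under stated assumptions; the event shape and the 30-minute window are invented, and a real investigator would pull events from deployment and migration logs:

```python
from datetime import datetime, timedelta

def recent_changes(incident_start, events, window=timedelta(minutes=30)):
    """Return change events that landed within `window` before the
    incident started, the correlation finding an automated investigator
    would post into the channel as structured data."""
    return [e for e in events
            if incident_start - window <= e["at"] <= incident_start]

events = [
    {"at": datetime(2026, 3, 26, 14, 20),
     "what": "release 412: migration runs ALTER TABLE on orders"},
    {"at": datetime(2026, 3, 26, 9, 0),
     "what": "config change: connection pool size"},
]
hits = recent_changes(datetime(2026, 3, 26, 14, 31), events)
print([e["what"] for e in hits])
```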

The zero percent

The 0% for Jira/ticket comments deserves a final note because it highlights a broader point about incident tooling design. Jira is an excellent tool for tracking work, but it was designed for asynchronous project management, not real-time incident coordination. The comment model (sequential, threaded, requiring page navigation) is structurally wrong for the communication pattern of an active incident, which requires rapid, overlapping, time-sensitive exchanges between multiple participants.

Teams that try to coordinate incidents through ticketing systems typically discover this quickly and migrate to chat or video. The 0% result suggests that this migration has already happened across the practitioner population we surveyed. If your organisation's incident response process still routes through ticket comments, that's worth examining, not because Jira is bad, but because the communication model it provides is mismatched with what incident coordination actually requires.

References

  1. The New War Room: Cybersecurity in the Modern Era — Dark Reading. History of physical war rooms and the shift to virtual coordination, including Y2K-era origins and the impact of distributed teams.
  2. How to Set Up an IT War Room — Xurrent. Guide to IT war room configuration covering both physical and virtual setups, with guidance on when each model applies.
  3. Slack Integration Guide — PagerDuty Knowledge Base. Channel connections, incident surfacing in Slack, and in-channel acknowledgement and resolution workflows.
  4. Incident Response in Slack — incident.io. Slack-native incident declaration, AI-generated summaries and timelines, used by 1,500+ teams including Netflix and Etsy.
  5. From Alert to Resolution: How Incident Response Automation Cuts MTTR and Closes Gaps — PagerDuty. Automation workflows reducing MTTR from hours to minutes for standard failure patterns.
  6. 5 Best Slack-Native Incident Management Platforms for 2025 — incident.io. Comparative analysis of FireHydrant, Rootly, and other Slack-native incident management tools.
  7. Managing Incidents — Google SRE Book. Incident Command System roles (IC, CL, OL), coordination principles, and the origins in firefighting incident management from 1968.
  8. Incident Response — Google SRE Workbook. Practical incident response procedures including communication protocols and diagnostic workflows.
  9. Making a Virtual War Room: The Journey to ChatOps — Robert Barron, Medium. First-hand account of migrating from physical war rooms to chat-based incident coordination.
  10. Developing Incident Response Team Communication and Coordination Practices — Sygnia. Research on structured contact priority systems delivering 60%+ reduction in MTTA and MTTR.