
The Incident Response Playbook Every Engineering Team Needs

May 10, 2026 · 12 min read

Most engineering teams have incident response even if they've never written a playbook. Someone gets paged, they investigate, they fix it, they move on. The question is whether that process is fast and repeatable — or improvised and exhausting every single time.

This is the playbook. Five phases, clear ownership, and the specific things that move MTTR from 40 minutes to under 10.

What to read and when

  • What incident response actually is
  • Phase 1: Detect — know before your users do
  • Phase 2: Triage — is this real, and how bad?
  • Phase 3: Diagnose — find the cause, not just the symptom
  • Phase 4: Fix — act without making it worse
  • Phase 5: Learn — the post-mortem
  • On-call setup that doesn't burn people out
  • Communication during an incident
  • MTTR targets worth aiming for

What incident response actually is

Incident response is the set of actions between 'something is wrong' and 'it's fixed and we understand why.' It's not a team, not a tool, not a process document sitting in Confluence — it's a set of practiced habits that reduce the time between detection and resolution.

What incident response is not: a blame session, a ticket, or a 2-hour Zoom call where nobody knows who's driving. The best incident response is almost invisible — fast, calm, and it leaves behind a short document that prevents the next incident.

Phase 1: Detect — know before your users do

Good detection means you hear about an incident from your alarms before a user files a ticket. Bad detection means you find out on Twitter. The gap between the two is almost always alarm coverage.

Every production service needs at minimum: a CPU alarm (warn at 80%, critical at 95%), a memory alarm (warn at 85%, critical at 95%), an error rate alarm (warn at 5%, critical at 10%), and a latency alarm (P99 over your SLO threshold). For ECS, add a running task count alarm — a task crash that doesn't breach CPU or memory thresholds will still show up as running < desired.
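
If you want to script this rather than click through the console, here's a minimal boto3 sketch for the warn-level CPU alarm — the cluster, service, and SNS topic names are placeholders, and the same call pattern covers the memory, error rate, and latency alarms.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Warn-level CPU alarm for an ECS service (names and ARN are hypothetical).
# Repeat with MetricName="MemoryUtilization" etc. for the other alarms.
cloudwatch.put_metric_alarm(
    AlarmName="api-service-cpu-warn",
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "prod"},
        {"Name": "ServiceName", "Value": "api-service"},
    ],
    Statistic="Average",
    Period=60,                      # evaluate 1-minute datapoints
    EvaluationPeriods=5,            # 5 consecutive breaches before alarming
    Threshold=80.0,                 # warn at 80%, as above
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-warn"],
)
```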

If more than 20% of your alarms turn out to be false positives or noise, you have an alert fatigue problem — not a coverage problem. Fix thresholds before adding more alarms.

Detection also means knowing what changed. A deploy that happened 10 minutes before an alarm is almost always the cause. Wire your deployment pipeline to push an event or annotation to your monitoring system — that single data point cuts diagnosis time in half.
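
One way to wire that up, sketched with boto3 and EventBridge — the event source and detail schema here are assumptions to adapt to your own pipeline and monitoring setup:

```python
import json
from datetime import datetime, timezone

import boto3

events = boto3.client("events")

# Emit a deploy marker that monitoring/diagnosis tooling can correlate
# with alarm timestamps. Source and detail fields are hypothetical.
events.put_events(
    Entries=[
        {
            "Source": "ci.pipeline",
            "DetailType": "deployment",
            "Detail": json.dumps({
                "service": "api-service",
                "version": "v2026.05.10-1",
                "deployed_at": datetime.now(timezone.utc).isoformat(),
            }),
            "EventBusName": "default",
        }
    ]
)
```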

Related reading

  • → The Complete AWS CloudWatch Alarm Setup Guide
  • → Woken up by a CloudWatch alarm with no context

Phase 2: Triage — is this real, and how bad?

Triage answers two questions in under 2 minutes: is this a real incident, and what severity is it? Every minute spent on a P3 at 2am that should have waited until morning is a minute of sleep debt that compounds across the team.

| Severity | Criteria | Target response | Who acts |
| --- | --- | --- | --- |
| P1 — Critical | Full outage, data loss, or direct revenue impact | Acknowledge within 5 min, mitigate within 30 min | On-call + escalate immediately |
| P2 — High | Partial outage, >10% error rate, major feature down | Acknowledge within 15 min, mitigate within 2 hours | On-call engineer |
| P3 — Medium | Degraded performance, <5% error rate, non-critical feature | Next business hours if overnight | On-call at discretion |
| P4 — Low | Cosmetic issue, no user impact | Within 24 hours | Normal sprint prioritisation |

The most important triage question is: is this self-healing? If the metric breached threshold for 90 seconds and is already trending back to normal, it's a P3 worth investigating tomorrow — not a 3am war room. Check the graph before waking anyone up.
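
That check is also easy to script. A minimal boto3 sketch, assuming an ECS CPU alarm — it pulls the last 15 minutes of datapoints and reports whether the metric breached but is already back under threshold:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

def is_self_healing(threshold: float = 80.0) -> bool:
    """True if the metric breached but the latest datapoints are already
    back under the threshold (hypothetical ECS CPU alarm)."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ECS",
        MetricName="CPUUtilization",
        Dimensions=[
            {"Name": "ClusterName", "Value": "prod"},
            {"Name": "ServiceName", "Value": "api-service"},
        ],
        StartTime=now - timedelta(minutes=15),
        EndTime=now,
        Period=60,
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    if len(points) < 3:
        return False  # not enough data to call it
    breached = any(p["Average"] > threshold for p in points)
    recovered = all(p["Average"] < threshold for p in points[-2:])
    return breached and recovered
```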

Phase 3: Diagnose — find the cause, not just the symptom

This is the phase that takes the longest and where most time is wasted. The symptom is in the alarm. The cause is in the logs, the recent deploys, and the related metrics — and gathering all of that manually, at 3am, takes 20–40 minutes for most engineers.

Effective diagnosis requires six pieces of context: the metric trend for the 45 minutes before the breach, every other metric on the same resource, recent log errors and exit codes, what changed in the past 2 hours (deploys, config, IAM), current resource state (is it still failing?), and whether this has happened before. When you have all six, you can usually identify the cause in under 2 minutes.
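
As an illustration of what gathering that context looks like, here's a compressed boto3 sketch covering three of the six pieces — metric trend, recent log errors, and recent changes. The log group and service names are hypothetical:

```python
from datetime import datetime, timedelta, timezone

import boto3

now = datetime.now(timezone.utc)

# 1. Metric trend for the 45 minutes before the breach.
cw = boto3.client("cloudwatch")
trend = cw.get_metric_statistics(
    Namespace="AWS/ECS", MetricName="CPUUtilization",
    Dimensions=[{"Name": "ClusterName", "Value": "prod"},
                {"Name": "ServiceName", "Value": "api-service"}],
    StartTime=now - timedelta(minutes=45), EndTime=now,
    Period=60, Statistics=["Average"],
)["Datapoints"]

# 2. Recent log errors (log group name is hypothetical).
logs = boto3.client("logs")
errors = logs.filter_log_events(
    logGroupName="/ecs/api-service",
    filterPattern="ERROR",
    startTime=int((now - timedelta(minutes=45)).timestamp() * 1000),
)["events"]

# 3. What changed in the past 2 hours, via CloudTrail.
trail = boto3.client("cloudtrail")
changes = trail.lookup_events(
    StartTime=now - timedelta(hours=2), EndTime=now,
    LookupAttributes=[{"AttributeKey": "EventName",
                       "AttributeValue": "UpdateService"}],
)["Events"]
```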

Common causes by service type

| Service | Alarm | Most common cause | First thing to check |
| --- | --- | --- | --- |
| ECS | CPUUtilization > 80% | Underprovisioned for a traffic spike, or memory leak causing swap | Running vs desired task count, recent deploy |
| ECS | RunningTaskCount < desired | OOM kill (exit code 137) or crash loop | CloudWatch Logs for exit code, memory watermark |
| RDS | DatabaseConnections > threshold | Connection pool exhaustion — app not returning connections | Application logs for connection timeout errors |
| Lambda | Throttles > 0 | Concurrency limit hit — too many simultaneous invocations | Reserved concurrency setting, upstream invocation rate |
| ALB | HTTPCode_Target_5XX_Count spike | Application error — upstream or deploy-related | Target group health, recent ECS deploy, app error logs |

The single biggest time saver in the diagnose phase: have all six pieces of context waiting for you when the alarm fires, rather than gathering them manually. That's the difference between a 3-minute diagnosis and a 35-minute one.
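
For the ECS rows above, here's a minimal boto3 sketch that checks running vs desired count and pulls the exit codes of recently stopped tasks — cluster and service names are placeholders:

```python
import boto3

ecs = boto3.client("ecs")
CLUSTER, SERVICE = "prod", "api-service"  # hypothetical names

# Running vs desired task count.
svc = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])["services"][0]
print(f"running={svc['runningCount']} desired={svc['desiredCount']}")

# Exit codes of recently stopped tasks — 137 usually means an OOM kill.
stopped = ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE,
                         desiredStatus="STOPPED")["taskArns"]
if stopped:
    tasks = ecs.describe_tasks(cluster=CLUSTER, tasks=stopped[:5])["tasks"]
    for task in tasks:
        for container in task["containers"]:
            print(container["name"], container.get("exitCode"),
                  task.get("stoppedReason"))
```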

ConvOps automates the entire diagnose phase — when an alarm fires, it pulls metric trends, log errors, CloudTrail changes, and resource state before the notification reaches you. The message you wake up to already contains the cause. You're reviewing a conclusion, not starting an investigation.

Related reading

  • → How to find root cause in AWS CloudWatch alerts without an SRE team
  • → Woken up by a CloudWatch alarm with no context — what to do

Phase 4: Fix — act without making it worse

The fastest fix is not always the right one. The goal is to restore service with the lowest risk of making things worse. That usually means: prefer rollback over hotfix, prefer scale-out over restart, prefer cautious over fast.

Fix-first options, in order of risk

  1. Scale out — add ECS tasks, increase Lambda concurrency, scale RDS read replicas. Low risk, fast, buys time to diagnose properly.
  2. Rollback the last deploy — if a deploy happened in the 30 minutes before the incident, roll it back first and ask questions later. Correct >60% of the time.
  3. Restart the failing component — ECS service restart, RDS reboot, Lambda cold start flush. Medium risk: resolves the symptom, may not fix the cause.
  4. Failover — promote read replica, switch to secondary region, redirect traffic. High risk: verify before executing, confirm with a second engineer.
  5. Hotfix — deploy a targeted code change under pressure. Highest risk. Use only when rollback is not possible and the cause is confirmed.

One rule worth enforcing: no remediation action runs without being stated out loud (or in Slack) first. 'I'm going to restart the api-service ECS service — anyone object?' takes 10 seconds and prevents the second incident that happens when someone else is simultaneously rolling back the deploy.
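
For the rollback option (step 2 above), an ECS rollback is a single update_service call once you know the previous task definition revision. A hedged sketch — it assumes the previous revision is still registered, and the names are placeholders:

```python
import boto3

ecs = boto3.client("ecs")
CLUSTER, SERVICE = "prod", "api-service"  # hypothetical names

# Find the currently deployed task definition, e.g. "api-service:42".
svc = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])["services"][0]
family, revision = svc["taskDefinition"].rsplit("/", 1)[-1].split(":")

# Point the service at the previous revision. ECS keeps old revisions
# registered (unless deregistered), so this redeploys the last known-good one.
previous = f"{family}:{int(revision) - 1}"
ecs.update_service(cluster=CLUSTER, service=SERVICE, taskDefinition=previous)
print(f"Rolling back {SERVICE} to {previous}")
```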

Phase 5: Learn — the blameless post-mortem

The post-mortem is not a blame session. It's the moment where an incident becomes an improvement. Teams that skip it reliably repeat the same incidents — often within 90 days.

What to write

  • Impact: how many users affected, for how long, measurable revenue or SLA impact
  • Timeline: when it was detected, when triage completed, when cause was identified, when fix was applied, when service restored — with exact timestamps
  • Root cause: the specific technical failure, not the human action
  • Contributing factors: the conditions that made the failure possible (missing alarm, no staging load test, outdated runbook)
  • Action items: specific, owned, time-bound — not 'improve monitoring' but 'add RDS connection count alarm, owner: [name], due: [date]'

What not to write

  • Who made a mistake — the post-mortem is about systems, not people
  • Vague action items — 'improve observability' is not an action item
  • A narrative of what everyone did minute-by-minute — the timeline handles that
  • Speculation about what might have happened — only confirmed causes

The 5 Whys in practice

The 5 Whys is the fastest tool for finding the real root cause rather than the surface symptom. Start with the observable failure and ask why five times.

  1. Why did the service go down? → ECS task crashed (exit code 137 — OOM kill)
  2. Why did it OOM? → Memory exceeded the 512MB container limit during peak load
  3. Why did memory exceed the limit? → New endpoint in last deploy loads an entire dataset into memory per request
  4. Why wasn't this caught in staging? → Staging has no load test; the memory limit was not reviewed in the PR
  5. Why not? → No PR checklist item for memory impact of data-loading code paths

Action items from this post-mortem: add memory profiling to the staging load test, add a PR checklist item for endpoints that load datasets, increase the ECS memory limit for this service by 50% as an immediate safeguard.

Related reading

  • → The real cost of a 1-hour AWS outage — why post-mortems pay
  • → MTTR under 5 minutes: what actually moves the needle

On-call setup that doesn't burn people out

The goal of an on-call rotation is to ensure someone who can act is always reachable — without destroying anyone's sleep for months at a time. Both parts of that sentence matter equally.

Rotation design

  • Weekly rotations work well for teams of 4 or more — each person is on call for one week, then off for N-1 weeks
  • Daily rotations burn people out faster than weekly — the context switch cost is too high
  • Always have a primary and a secondary. The secondary exists to be escalated to — not to be on call simultaneously
  • Shadow rotations for new engineers: they shadow the primary for 2–4 weeks before carrying a pager alone
  • Never put two engineers from the same service on-call at the same time — if that service has a P1, you want one person investigating and one available to escalate
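
A weekly rotation with a primary and secondary reduces to a few lines of arithmetic. A minimal sketch, assuming the roster order doubles as the escalation order:

```python
from datetime import date

ROSTER = ["alice", "bob", "carol", "dana"]  # hypothetical 4-person team

def on_call(today: date) -> tuple[str, str]:
    """Weekly rotation: primary this week, secondary is next week's primary.
    (Ignores the small repeat/skip at ISO year boundaries.)"""
    week = today.isocalendar()[1]
    primary = ROSTER[week % len(ROSTER)]
    secondary = ROSTER[(week + 1) % len(ROSTER)]
    return primary, secondary

print(on_call(date(2026, 5, 10)))
```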

Escalation paths

Define escalation before an incident. During a P1 at 2am is not the time to figure out who to call next.

  1. On-call engineer — first responder, owns triage and initial diagnosis
  2. Secondary on-call — escalated if primary is unreachable or needs a second opinion within 15 minutes
  3. Service owner or team lead — escalated for P1s that aren't resolving within 30 minutes
  4. Engineering manager — escalated for P1s approaching 1 hour or with customer-facing impact requiring communication
  5. CTO/VP — for data loss, security incidents, or multi-hour outages affecting all users
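
Defining that path ahead of time can be as simple as a data structure your paging tooling walks through. This sketch is illustrative — the roles mirror the list above, and the timings are yours to adapt:

```python
from dataclasses import dataclass

@dataclass
class EscalationStep:
    role: str
    escalate_after_min: int  # unresolved time before moving to the next step

# Mirrors the P1 path above; names and timings are placeholders.
P1_ESCALATION = [
    EscalationStep("on-call engineer", 15),
    EscalationStep("secondary on-call", 15),
    EscalationStep("service owner / team lead", 30),
    EscalationStep("engineering manager", 30),
    EscalationStep("CTO / VP", 0),  # last resort, no further escalation
]

def next_contact(minutes_elapsed: int) -> str:
    """Return who should be engaged given unresolved time so far."""
    total = 0
    for step in P1_ESCALATION:
        total += step.escalate_after_min
        if minutes_elapsed < total or step.escalate_after_min == 0:
            return step.role
    return P1_ESCALATION[-1].role
```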

Communication during an incident

Two audiences during an incident need different things: your team needs situational awareness and clear ownership; your users need honesty and cadence. Mixing up the formats for these two audiences is one of the most common mistakes.

Internal communication

  • Create a dedicated Slack channel immediately: #incident-YYYY-MM-DD. Don't use #eng-general — the noise breaks focus
  • Post an initial message within 5 minutes: what's affected, who's investigating, when the next update is
  • Update every 15 minutes during active P1s, even if the update is 'still investigating, no change'
  • Assign an incident commander — one person whose only job is coordination, not investigation. On a 3-person team, this can be the same person as the investigator if necessary, but name the role
  • When the fix is applied, post a resolution message with the cause and the fix before closing the channel
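
Posting the initial message and the 15-minute updates is easy to script against a Slack incoming webhook. A minimal sketch — the webhook URL and message format here are assumptions:

```python
import requests

# Incoming webhook for the incident channel (hypothetical URL).
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def post_update(status: str, owner: str, next_update_min: int = 15) -> None:
    """Post a structured incident update to the dedicated channel."""
    text = (
        f":rotating_light: *Incident update*\n"
        f"Status: {status}\n"
        f"Investigating: {owner}\n"
        f"Next update in {next_update_min} min"
    )
    requests.post(WEBHOOK_URL, json={"text": text}, timeout=5)

post_update("Elevated 5xx on api-service, rollback in progress", "alice")
```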

External communication

  • Update your status page within 10 minutes of declaring a P1 — before customers ask
  • Keep external language factual and non-technical: 'We are investigating increased error rates on [service]. Engineers are engaged.' Not 'ECS task crash causing 5xx.'
  • Update every 30 minutes — even if it's 'investigation ongoing, no resolution yet'
  • Resolution message: state what happened, when it was fixed, what you're doing to prevent recurrence. One paragraph maximum
  • Never speculate about causes in a public update — only confirmed facts

MTTR targets worth aiming for

MTTR (mean time to resolution) is the single most useful incident health metric. It captures detection, diagnosis, and fix in one number. Here are realistic targets by maturity level.

| Severity | Baseline (no playbook) | Good | Excellent |
| --- | --- | --- | --- |
| P1 — Critical | 60–120 min | < 30 min | < 15 min |
| P2 — High | 2–4 hours | < 1 hour | < 30 min |
| P3 — Medium | 1–3 days | < 8 hours | < 4 hours |

The biggest lever on MTTR is diagnosis time — specifically, how long it takes to go from 'alarm fired' to 'cause identified.' For most teams this is 20–40 minutes of manual investigation. Getting that to under 2 minutes — by having context arrive with the alarm — is what moves P1 MTTR from 60 minutes to under 15.

Track MTTR per severity level, not as a single average. A team that resolves 50 P3s in 30 minutes each but takes 4 hours on every P1 has a very different problem to solve than the MTTR average suggests.
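
Computing per-severity MTTR from your incident records takes a few lines. A minimal sketch over hypothetical (severity, minutes-to-resolve) records:

```python
from collections import defaultdict
from statistics import mean

# (severity, minutes from detection to resolution) — illustrative data.
incidents = [
    ("P1", 14), ("P1", 42), ("P2", 55),
    ("P3", 240), ("P3", 180), ("P2", 25),
]

by_severity = defaultdict(list)
for severity, minutes in incidents:
    by_severity[severity].append(minutes)

# Per-severity MTTR, never a single blended average.
for severity in sorted(by_severity):
    times = by_severity[severity]
    print(f"{severity}: MTTR = {mean(times):.0f} min over {len(times)} incidents")
```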

Frequently asked questions

What is an incident response playbook?

An incident response playbook is a documented set of steps covering how your team detects, triages, diagnoses, fixes, and learns from incidents. It defines severity levels, ownership, escalation paths, and communication templates — so that when something breaks at 3am, nobody has to figure out the process from scratch.

What are the 5 phases of incident response?

Detect (know before your users do), Triage (severity and ownership in under 2 minutes), Diagnose (find the cause, not just the symptom), Fix (restore service with minimum risk), and Learn (post-mortem that prevents recurrence). Most teams are weakest in the Diagnose phase — that's where the most time is lost.

How do you run a blameless post-mortem?

Focus on systems, not people. Document what happened and why the system allowed it to happen, not who made a mistake. The root cause is always a systemic gap — missing alarm, no load test, unclear ownership — not a person's decision. Action items should be specific, owned, and time-bound. Vague items like 'improve observability' don't prevent the next incident.

What severity levels should I use for incidents?

Four levels work well for most teams: P1 (full outage or revenue impact, < 5 minute acknowledgement), P2 (partial outage or >10% error rate, < 15 minutes), P3 (degraded performance, next business hours if overnight), P4 (cosmetic, normal sprint). The key distinction between P1 and P2 is whether it's causing direct, measurable customer impact right now.

What is a realistic MTTR target for a production AWS service?

For P1 incidents: under 30 minutes is good, under 15 minutes is excellent. Most teams without a formal playbook average 60–120 minutes on P1s. The biggest single lever is diagnosis time — getting from 'alarm fired' to 'cause identified' in under 2 minutes instead of 20–40 minutes cuts P1 MTTR by more than any other change.

Related reading

  • → Woken up by a CloudWatch alarm with no context
  • → MTTR under 5 minutes: what actually moves the needle
  • → The real cost of a 1-hour AWS outage
  • → How to find root cause in AWS CloudWatch alerts without an SRE

Still debugging incidents manually?

ConvOps does this automatically — root cause in under 60 seconds, delivered to WhatsApp or Slack.

Try ConvOps free → · See a live demo