
The Incident Response Playbook Every Engineering Team Needs

May 10, 2026 · 12 min read

Most engineering teams have incident response even if they've never written a playbook. Someone gets paged, they investigate, they fix it, they move on. The question is whether that process is fast and repeatable — or improvised and exhausting every single time.

This is the playbook. Five phases, clear ownership, and the specific things that move MTTR from 40 minutes to under 10.

What to read and when

  • What incident response actually is
  • Phase 1: Detect — know before your users do
  • Phase 2: Triage — is this real, and how bad?
  • Phase 3: Diagnose — find the cause, not just the symptom
  • Phase 4: Fix — act without making it worse
  • Phase 5: Learn — the post-mortem
  • On-call setup that doesn't burn people out
  • Communication during an incident
  • MTTR targets worth aiming for

What incident response actually is

Incident response is the set of actions between 'something is wrong' and 'it's fixed and we understand why.' It's not a team, not a tool, not a process document sitting in Confluence — it's a set of practiced habits that reduce the time between detection and resolution.

What incident response is not: a blame session, a ticket, or a 2-hour Zoom call where nobody knows who's driving. The best incident response is almost invisible — fast, calm, and it leaves behind a short document that prevents the next incident.

Phase 1: Detect — know before your users do

Good detection means you hear about an incident from your alarms before a user files a ticket. Bad detection means you find out on Twitter. The gap between the two is almost always alarm coverage.

Every production service needs at minimum: a CPU alarm (warn at 80%, critical at 95%), a memory alarm (warn at 85%, critical at 95%), an error rate alarm (warn at 5%, critical at 10%), and a latency alarm (P99 over your SLO threshold). For ECS, add a running task count alarm — a task crash that doesn't breach CPU or memory thresholds will still show up as running < desired.
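
If you want to script this rather than click through the console, here's a minimal boto3 sketch for the warn-level CPU alarm — the cluster, service, and SNS topic names are placeholders, and the same call pattern covers the memory, error rate, and latency alarms.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Warn-level CPU alarm for an ECS service (names and ARN are hypothetical).
# Repeat with MetricName="MemoryUtilization" etc. for the other alarms.
cloudwatch.put_metric_alarm(
    AlarmName="api-service-cpu-warn",
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "prod"},
        {"Name": "ServiceName", "Value": "api-service"},
    ],
    Statistic="Average",
    Period=60,                      # evaluate 1-minute datapoints
    EvaluationPeriods=5,            # 5 consecutive breaches before alarming
    Threshold=80.0,                 # warn at 80%, as above
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-warn"],
)
```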

If more than 20% of your alarms turn out to be false positives or noise, you have an alert fatigue problem — not a coverage problem. Fix thresholds before adding more alarms.

Detection also means knowing what changed. A deploy that happened 10 minutes before an alarm is almost always the cause. Wire your deployment pipeline to push an event or annotation to your monitoring system — that single data point cuts diagnosis time in half.
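
One way to wire that up, sketched with boto3 and EventBridge — the event source and detail schema here are assumptions to adapt to your own pipeline and monitoring setup:

```python
import json
from datetime import datetime, timezone

import boto3

events = boto3.client("events")

# Emit a deploy marker that monitoring/diagnosis tooling can correlate
# with alarm timestamps. Source and detail fields are hypothetical.
events.put_events(
    Entries=[
        {
            "Source": "ci.pipeline",
            "DetailType": "deployment",
            "Detail": json.dumps({
                "service": "api-service",
                "version": "v2026.05.10-1",
                "deployed_at": datetime.now(timezone.utc).isoformat(),
            }),
            "EventBusName": "default",
        }
    ]
)
```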

Related reading

  • → The Complete AWS CloudWatch Alarm Setup Guide
  • → Woken up by a CloudWatch alarm with no context

Phase 2: Triage — is this real, and how bad?

Triage answers two questions in under 2 minutes: is this a real incident, and what severity is it? Every minute spent on a P3 at 2am that should have waited until morning is a minute of sleep debt that compounds across the team.

| Severity | Criteria | Target response | Who acts |
| --- | --- | --- | --- |
| P1 — Critical | Full outage, data loss, or direct revenue impact | Acknowledge within 5 min, mitigate within 30 min | On-call + escalate immediately |
| P2 — High | Partial outage, >10% error rate, major feature down | Acknowledge within 15 min, mitigate within 2 hours | On-call engineer |
| P3 — Medium | Degraded performance, <5% error rate, non-critical feature | Next business hours if overnight | On-call at discretion |
| P4 — Low | Cosmetic issue, no user impact | Within 24 hours | Normal sprint prioritisation |

The most important triage question is: is this self-healing? If the metric breached threshold for 90 seconds and is already trending back to normal, it's a P3 worth investigating tomorrow — not a 3am war room. Check the graph before waking anyone up.
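
That check is also easy to script. A minimal boto3 sketch, assuming an ECS CPU alarm — it pulls the last 15 minutes of datapoints and reports whether the metric breached but is already back under threshold:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

def is_self_healing(threshold: float = 80.0) -> bool:
    """True if the metric breached but the latest datapoints are already
    back under the threshold (hypothetical ECS CPU alarm)."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ECS",
        MetricName="CPUUtilization",
        Dimensions=[
            {"Name": "ClusterName", "Value": "prod"},
            {"Name": "ServiceName", "Value": "api-service"},
        ],
        StartTime=now - timedelta(minutes=15),
        EndTime=now,
        Period=60,
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    if len(points) < 3:
        return False  # not enough data to call it
    breached = any(p["Average"] > threshold for p in points)
    recovered = all(p["Average"] < threshold for p in points[-2:])
    return breached and recovered
```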

Phase 3: Diagnose — find the cause, not just the symptom

This is the phase that takes the longest and where most time is wasted. The symptom is in the alarm. The cause is in the logs, the recent deploys, and the related metrics — and gathering all of that manually, at 3am, takes 20–40 minutes for most engineers.

Effective diagnosis requires six pieces of context: the metric trend for the 45 minutes before the breach, every other metric on the same resource, recent log errors and exit codes, what changed in the past 2 hours (deploys, config, IAM), current resource state (is it still failing?), and whether this has happened before. When you have all six, you can usually identify the cause in under 2 minutes.
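
As an illustration of what gathering that context looks like, here's a compressed boto3 sketch covering three of the six pieces — metric trend, recent log errors, and recent changes. The log group and service names are hypothetical:

```python
from datetime import datetime, timedelta, timezone

import boto3

now = datetime.now(timezone.utc)

# 1. Metric trend for the 45 minutes before the breach.
cw = boto3.client("cloudwatch")
trend = cw.get_metric_statistics(
    Namespace="AWS/ECS", MetricName="CPUUtilization",
    Dimensions=[{"Name": "ClusterName", "Value": "prod"},
                {"Name": "ServiceName", "Value": "api-service"}],
    StartTime=now - timedelta(minutes=45), EndTime=now,
    Period=60, Statistics=["Average"],
)["Datapoints"]

# 2. Recent log errors (log group name is hypothetical).
logs = boto3.client("logs")
errors = logs.filter_log_events(
    logGroupName="/ecs/api-service",
    filterPattern="ERROR",
    startTime=int((now - timedelta(minutes=45)).timestamp() * 1000),
)["events"]

# 3. What changed in the past 2 hours, via CloudTrail.
trail = boto3.client("cloudtrail")
changes = trail.lookup_events(
    StartTime=now - timedelta(hours=2), EndTime=now,
    LookupAttributes=[{"AttributeKey": "EventName",
                       "AttributeValue": "UpdateService"}],
)["Events"]
```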

Common causes by service type

| Service | Alarm | Most common cause | First thing to check |
| --- | --- | --- | --- |
| ECS | CPUUtilization > 80% | Underprovisioned for a traffic spike, or memory leak causing swap | Running vs desired task count, recent deploy |
| ECS | RunningTaskCount < desired | OOM kill (exit code 137) or crash loop | CloudWatch Logs for exit code, memory watermark |
| RDS | DatabaseConnections > threshold | Connection pool exhaustion — app not returning connections | Application logs for connection timeout errors |
| Lambda | Throttles > 0 | Concurrency limit hit — too many simultaneous invocations | Reserved concurrency setting, upstream invocation rate |
| ALB | HTTPCode_Target_5XX_Count spike | Application error — upstream or deploy-related | Target group health, recent ECS deploy, app error logs |

The single biggest time saver in the diagnose phase: have all six pieces of context waiting for you when the alarm fires, rather than gathering them manually. That's the difference between a 3-minute diagnosis and a 35-minute one.
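
For the ECS rows above, here's a minimal boto3 sketch that checks running vs desired count and pulls the exit codes of recently stopped tasks — cluster and service names are placeholders:

```python
import boto3

ecs = boto3.client("ecs")
CLUSTER, SERVICE = "prod", "api-service"  # hypothetical names

# Running vs desired task count.
svc = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])["services"][0]
print(f"running={svc['runningCount']} desired={svc['desiredCount']}")

# Exit codes of recently stopped tasks — 137 usually means an OOM kill.
stopped = ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE,
                         desiredStatus="STOPPED")["taskArns"]
if stopped:
    tasks = ecs.describe_tasks(cluster=CLUSTER, tasks=stopped[:5])["tasks"]
    for task in tasks:
        for container in task["containers"]:
            print(container["name"], container.get("exitCode"),
                  task.get("stoppedReason"))
```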

ConvOps automates the entire diagnose phase — when an alarm fires, it pulls metric trends, log errors, CloudTrail changes, and resource state before the notification reaches you. The message you wake up to already contains the cause. You're reviewing a conclusion, not starting an investigation.

Related reading

  • → How to find root cause in AWS CloudWatch alerts without an SRE team
  • → Woken up by a CloudWatch alarm with no context — what to do

Phase 4: Fix — act without making it worse

The fastest fix is not always the right one. The goal is to restore service with the lowest risk of making things worse. That usually means: prefer rollback over hotfix, prefer scale-out over restart, prefer cautious over fast.

Fix-first options, in order of risk

  1. Scale out — add ECS tasks, increase Lambda concurrency, scale RDS read replicas. Low risk, fast, buys time to diagnose properly.
  2. Rollback the last deploy — if a deploy happened in the 30 minutes before the incident, roll it back first and ask questions later. Correct >60% of the time.
  3. Restart the failing component — ECS service restart, RDS reboot, Lambda cold start flush. Medium risk: resolves the symptom, may not fix the cause.
  4. Failover — promote read replica, switch to secondary region, redirect traffic. High risk: verify before executing, confirm with a second engineer.
  5. Hotfix — deploy a targeted code change under pressure. Highest risk. Use only when rollback is not possible and the cause is confirmed.

One rule worth enforcing: no remediation action runs without being stated out loud (or in Slack) first. 'I'm going to restart the api-service ECS service — anyone object?' takes 10 seconds and prevents the second incident that happens when someone else is simultaneously rolling back the deploy.
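
For the rollback option (step 2 above), an ECS rollback is a single update_service call once you know the previous task definition revision. A hedged sketch — it assumes the previous revision is still registered, and the names are placeholders:

```python
import boto3

ecs = boto3.client("ecs")
CLUSTER, SERVICE = "prod", "api-service"  # hypothetical names

# Find the currently deployed task definition, e.g. "api-service:42".
svc = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])["services"][0]
family, revision = svc["taskDefinition"].rsplit("/", 1)[-1].split(":")

# Point the service at the previous revision. ECS keeps old revisions
# registered (unless deregistered), so this redeploys the last known-good one.
previous = f"{family}:{int(revision) - 1}"
ecs.update_service(cluster=CLUSTER, service=SERVICE, taskDefinition=previous)
print(f"Rolling back {SERVICE} to {previous}")
```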

Phase 5: Learn — the blameless post-mortem

The post-mortem is not a blame session. It's the moment where an incident becomes an improvement. Teams that skip it reliably repeat the same incidents — often within 90 days.

What to write

  • Impact: how many users affected, for how long, measurable revenue or SLA impact
  • Timeline: when it was detected, when triage completed, when cause was identified, when fix was applied, when service restored — with exact timestamps
  • Root cause: the specific technical failure, not the human action
  • Contributing factors: the conditions that made the failure possible (missing alarm, no staging load test, outdated runbook)
  • Action items: specific, owned, time-bound — not 'improve monitoring' but 'add RDS connection count alarm, owner: [name], due: [date]'

What not to write

  • Who made a mistake — the post-mortem is about systems, not people
  • Vague action items — 'improve observability' is not an action item
  • A narrative of what everyone did minute-by-minute — the timeline handles that
  • Speculation about what might have happened — only confirmed causes

The 5 Whys in practice

The 5 Whys is the fastest tool for finding the real root cause rather than the surface symptom. Start with the observable failure and ask why five times.

  1. Why did the service go down? → ECS task crashed (exit code 137 — OOM kill)
  2. Why did it OOM? → Memory exceeded the 512MB container limit during peak load
  3. Why did memory exceed the limit? → New endpoint in last deploy loads an entire dataset into memory per request
  4. Why wasn't this caught in staging? → Staging has no load test; the memory limit was not reviewed in the PR
  5. Why not? → No PR checklist item for memory impact of data-loading code paths

Action items from this post-mortem: add memory profiling to the staging load test, add a PR checklist item for endpoints that load datasets, increase the ECS memory limit for this service by 50% as an immediate safeguard.

Related reading

  • → The real cost of a 1-hour AWS outage — why post-mortems pay
  • → MTTR under 5 minutes: what actually moves the needle

On-call setup that doesn't burn people out

The goal of an on-call rotation is to ensure someone who can act is always reachable — without destroying anyone's sleep for months at a time. Both parts of that sentence matter equally.

Rotation design

  • Weekly rotations work well for teams of 4 or more — each person is on call for one week, then off for N-1 weeks
  • Daily rotations burn people out faster than weekly — the context switch cost is too high
  • Always have a primary and a secondary. The secondary exists to be escalated to — not to be on call simultaneously
  • Shadow rotations for new engineers: they shadow the primary for 2–4 weeks before carrying a pager alone
  • Never put two engineers from the same service on-call at the same time — if that service has a P1, you want one person investigating and one available to escalate
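
A weekly rotation with a primary and secondary reduces to a few lines of arithmetic. A minimal sketch, assuming the roster order doubles as the escalation order:

```python
from datetime import date

ROSTER = ["alice", "bob", "carol", "dana"]  # hypothetical 4-person team

def on_call(today: date) -> tuple[str, str]:
    """Weekly rotation: primary this week, secondary is next week's primary.
    (Ignores the small repeat/skip at ISO year boundaries.)"""
    week = today.isocalendar()[1]
    primary = ROSTER[week % len(ROSTER)]
    secondary = ROSTER[(week + 1) % len(ROSTER)]
    return primary, secondary

print(on_call(date(2026, 5, 10)))
```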

Escalation paths

Define escalation before an incident. During a P1 at 2am is not the time to figure out who to call next.

  1. On-call engineer — first responder, owns triage and initial diagnosis
  2. Secondary on-call — escalated if primary is unreachable or needs a second opinion within 15 minutes
  3. Service owner or team lead — escalated for P1s that aren't resolving within 30 minutes
  4. Engineering manager — escalated for P1s approaching 1 hour or with customer-facing impact requiring communication
  5. CTO/VP — for data loss, security incidents, or multi-hour outages affecting all users
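
Defining that path ahead of time can be as simple as a data structure your paging tooling walks through. This sketch is illustrative — the roles mirror the list above, and the timings are yours to adapt:

```python
from dataclasses import dataclass

@dataclass
class EscalationStep:
    role: str
    escalate_after_min: int  # unresolved time before moving to the next step

# Mirrors the P1 path above; names and timings are placeholders.
P1_ESCALATION = [
    EscalationStep("on-call engineer", 15),
    EscalationStep("secondary on-call", 15),
    EscalationStep("service owner / team lead", 30),
    EscalationStep("engineering manager", 30),
    EscalationStep("CTO / VP", 0),  # last resort, no further escalation
]

def next_contact(minutes_elapsed: int) -> str:
    """Return who should be engaged given unresolved time so far."""
    total = 0
    for step in P1_ESCALATION:
        total += step.escalate_after_min
        if minutes_elapsed < total or step.escalate_after_min == 0:
            return step.role
    return P1_ESCALATION[-1].role
```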

Communication during an incident

Two audiences during an incident need different things: your team needs situational awareness and clear ownership; your users need honesty and cadence. Mixing up the formats for these two audiences is one of the most common mistakes.

Internal communication

  • Create a dedicated Slack channel immediately: #incident-YYYY-MM-DD. Don't use #eng-general — the noise breaks focus
  • Post an initial message within 5 minutes: what's affected, who's investigating, when the next update is
  • Update every 15 minutes during active P1s, even if the update is 'still investigating, no change'
  • Assign an incident commander — one person whose only job is coordination, not investigation. On a 3-person team, this can be the same person as the investigator if necessary, but name the role
  • When the fix is applied, post a resolution message with the cause and the fix before closing the channel
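
Posting the initial message and the 15-minute updates is easy to script against a Slack incoming webhook. A minimal sketch — the webhook URL and message format here are assumptions:

```python
import requests

# Incoming webhook for the incident channel (hypothetical URL).
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def post_update(status: str, owner: str, next_update_min: int = 15) -> None:
    """Post a structured incident update to the dedicated channel."""
    text = (
        f":rotating_light: *Incident update*\n"
        f"Status: {status}\n"
        f"Investigating: {owner}\n"
        f"Next update in {next_update_min} min"
    )
    requests.post(WEBHOOK_URL, json={"text": text}, timeout=5)

post_update("Elevated 5xx on api-service, rollback in progress", "alice")
```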

External communication

  • Update your status page within 10 minutes of declaring a P1 — before customers ask
  • Keep external language factual and non-technical: 'We are investigating increased error rates on [service]. Engineers are engaged.' Not 'ECS task crash causing 5xx.'
  • Update every 30 minutes — even if it's 'investigation ongoing, no resolution yet'
  • Resolution message: state what happened, when it was fixed, what you're doing to prevent recurrence. One paragraph maximum
  • Never speculate about causes in a public update — only confirmed facts

MTTR targets worth aiming for

MTTR (mean time to resolution) is the single most useful incident health metric. It captures detection, diagnosis, and fix in one number. Here are realistic targets by maturity level.

| Severity | Baseline (no playbook) | Good | Excellent |
| --- | --- | --- | --- |
| P1 — Critical | 60–120 min | < 30 min | < 15 min |
| P2 — High | 2–4 hours | < 1 hour | < 30 min |
| P3 — Medium | 1–3 days | < 8 hours | < 4 hours |

The biggest lever on MTTR is diagnosis time — specifically, how long it takes to go from 'alarm fired' to 'cause identified.' For most teams this is 20–40 minutes of manual investigation. Getting that to under 2 minutes — by having context arrive with the alarm — is what moves P1 MTTR from 60 minutes to under 15.

Track MTTR per severity level, not as a single average. A team that resolves 50 P3s in 30 minutes each but takes 4 hours on every P1 has a very different problem to solve than the MTTR average suggests.
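
Computing per-severity MTTR from your incident records takes a few lines. A minimal sketch over hypothetical (severity, minutes-to-resolve) records:

```python
from collections import defaultdict
from statistics import mean

# (severity, minutes from detection to resolution) — illustrative data.
incidents = [
    ("P1", 14), ("P1", 42), ("P2", 55),
    ("P3", 240), ("P3", 180), ("P2", 25),
]

by_severity = defaultdict(list)
for severity, minutes in incidents:
    by_severity[severity].append(minutes)

# Per-severity MTTR, never a single blended average.
for severity in sorted(by_severity):
    times = by_severity[severity]
    print(f"{severity}: MTTR = {mean(times):.0f} min over {len(times)} incidents")
```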

Frequently asked questions

What is an incident response playbook?

An incident response playbook is a documented set of steps covering how your team detects, triages, diagnoses, fixes, and learns from incidents. It defines severity levels, ownership, escalation paths, and communication templates — so that when something breaks at 3am, nobody has to figure out the process from scratch.

What are the 5 phases of incident response?

Detect (know before your users do), Triage (severity and ownership in under 2 minutes), Diagnose (find the cause, not just the symptom), Fix (restore service with minimum risk), and Learn (post-mortem that prevents recurrence). Most teams are weakest in the Diagnose phase — that's where the most time is lost.

How do you run a blameless post-mortem?

Focus on systems, not people. Document what happened and why the system allowed it to happen, not who made a mistake. The root cause is always a systemic gap — missing alarm, no load test, unclear ownership — not a person's decision. Action items should be specific, owned, and time-bound. Vague items like 'improve observability' don't prevent the next incident.

What severity levels should I use for incidents?

Four levels work well for most teams: P1 (full outage or revenue impact, < 5 minute acknowledgement), P2 (partial outage or >10% error rate, < 15 minutes), P3 (degraded performance, next business hours if overnight), P4 (cosmetic, normal sprint). The key distinction between P1 and P2 is whether it's causing direct, measurable customer impact right now.

What is a realistic MTTR target for a production AWS service?

For P1 incidents: under 30 minutes is good, under 15 minutes is excellent. Most teams without a formal playbook average 60–120 minutes on P1s. The biggest single lever is diagnosis time — getting from 'alarm fired' to 'cause identified' in under 2 minutes instead of 20–40 minutes cuts P1 MTTR by more than any other change.

Related reading

  • → Woken up by a CloudWatch alarm with no context
  • → MTTR under 5 minutes: what actually moves the needle
  • → The real cost of a 1-hour AWS outage
  • → How to find root cause in AWS CloudWatch alerts without an SRE

Still debugging incidents manually?

ConvOps does this automatically — root cause in under 60 seconds, delivered to WhatsApp or Slack.

Try ConvOps free → · See a live demo