
MTTR under 5 minutes: what actually moves the needle for small engineering teams

April 21, 2026 · 5 min read

MTTR — Mean Time To Recovery — is the average time from incident detection to service restoration. For most small teams it sits between 15 and 45 minutes. Getting it under 5 minutes is achievable without headcount, but only if you attack the right bottlenecks.

Before optimizing anything, measure your current MTTR. Create a spreadsheet with three columns: incident detected at, service restored at, duration. Do this for the last 10 incidents. Most teams discover their MTTR is longer than they think, and that a small number of slow incidents are pulling the average up.
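
If the spreadsheet lives in a CSV export, a few lines of Python will give you the baseline. A minimal sketch, assuming a file named incidents.csv with ISO-8601 timestamp columns called detected_at and restored_at (both names are placeholders):

# Baseline MTTR from a CSV export of the incident spreadsheet.
# Assumes ISO-8601 timestamps in columns named detected_at and restored_at.
import csv
from datetime import datetime
from statistics import mean, median

durations = []
with open("incidents.csv", newline="") as f:
    for row in csv.DictReader(f):
        detected = datetime.fromisoformat(row["detected_at"])
        restored = datetime.fromisoformat(row["restored_at"])
        durations.append((restored - detected).total_seconds() / 60)

outliers = [d for d in durations if d > 20]
print(f"Incidents:    {len(durations)}")
print(f"Mean MTTR:    {mean(durations):.1f} min")
print(f"Median MTTR:  {median(durations):.1f} min")
print(f"Over 20 min:  {len(outliers)} incident(s)")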

What kills MTTR on small teams

Bottleneck | Typical time lost | Root fix
Time to awareness | 5–10 min | Cut alert noise so every alert demands immediate action
Time to context | 10–15 min | Attach the diagnosis to the alert — not after the engineer is already awake
Time to decision | 3–8 min | Pre-approve the 5 most common fixes; no 2am approval calls
Time to action | 3–5 min | Pin the exact commands — one fill-in-the-blank away from running
Slow rollback | 5–10 min | Use ECS task definition revisions — rollback takes 30–60 s, not 8 min

The three changes below attack the biggest of these — in order of impact.

Change 1: Cut alert noise until every alert demands action

Alert fatigue is an MTTR killer because it trains engineers to delay response. If 40% of your alerts resolve themselves without any action, engineers learn to wait and see before responding. That wait costs 5–10 minutes at the start of every real incident.

Audit your last 30 CloudWatch alarms. For each one, answer: did we take any action on this alert, or did it resolve on its own? If the answer is 'no action' more than twice, delete the alarm or raise the threshold.
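
If clicking through 30 alarms in the console is the blocker, a short boto3 script can pull the firing counts for you. A rough sketch, not a drop-in audit tool: it only counts how often each alarm entered the ALARM state over the last 30 days; whether anyone acted on those alerts is still a judgment you make against your incident notes.

# Sketch: how often did each CloudWatch alarm fire in the last 30 days?
# Alarms that fire often but never led to action are deletion candidates.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
start = datetime.now(timezone.utc) - timedelta(days=30)

paginator = cloudwatch.get_paginator("describe_alarms")
for page in paginator.paginate():
    for alarm in page["MetricAlarms"]:
        history = cloudwatch.describe_alarm_history(
            AlarmName=alarm["AlarmName"],
            HistoryItemType="StateUpdate",
            StartDate=start,
            EndDate=datetime.now(timezone.utc),
            MaxRecords=100,
        )
        fired = [
            item for item in history["AlarmHistoryItems"]
            if "to ALARM" in item["HistorySummary"]
        ]
        print(f"{alarm['AlarmName']}: fired {len(fired)} times in 30 days")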

A team that responds to 5 alerts per week in under 2 minutes beats a team receiving 50 alerts and ignoring half. Coverage breadth matters less than response quality.

Change 2: Deliver context at alert time, not after

The standard flow: alert fires → engineer opens AWS console → opens CloudWatch → searches logs → figures out what's wrong → decides on a fix. This takes 10–20 minutes.

The fast flow: alert fires → engineer receives the alert with root cause already identified → engineer makes a decision and acts. This takes 2–3 minutes.

Getting to the fast flow means investigation needs to happen before the human is involved — not after. Either build your own alert enrichment (a Lambda that pulls logs and recent deploy info when an alarm fires and attaches the result to the notification) or use a tool that does it. The key is that the diagnosis arrives with the alert, not as a result of it.
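
A homegrown version can stay small. The sketch below assumes an SNS-subscribed Lambda, a single log group and ECS service named in environment variables, and a Slack-style incoming webhook for delivery; all of those names are placeholders, and a production version would want error handling and pagination.

# Sketch of a self-built enrichment Lambda, subscribed to the SNS topic the
# CloudWatch alarm notifies. Log group, cluster, service, and webhook URL
# come from environment variables: wire in your own.
import json
import os
import time
import urllib.request
import boto3

logs = boto3.client("logs")
ecs = boto3.client("ecs")

LOG_GROUP = os.environ["LOG_GROUP"]        # e.g. /ecs/your-service
CLUSTER = os.environ["ECS_CLUSTER"]
SERVICE = os.environ["ECS_SERVICE"]
WEBHOOK_URL = os.environ["WEBHOOK_URL"]    # Slack incoming webhook, for example

def handler(event, context):
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])

    # Last 15 minutes of ERROR lines from the service's log group.
    errors = logs.filter_log_events(
        logGroupName=LOG_GROUP,
        startTime=int((time.time() - 15 * 60) * 1000),
        filterPattern="ERROR",
        limit=10,
    )["events"]

    # When the service was last deployed; recent deploys are prime suspects.
    service = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])["services"][0]
    primary = next(d for d in service["deployments"] if d["status"] == "PRIMARY")
    last_deploy = primary["createdAt"].isoformat()

    text = (
        f"*{alarm['AlarmName']}* is in ALARM\n"
        f"Last deploy: {last_deploy}\n"
        "Recent errors:\n" + "\n".join(e["message"].strip() for e in errors[:5])
    )
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)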

Change 3: Pre-approve the common fixes

Write down the 5 fixes you apply most often during incidents. For most teams this list includes:

  • Restart a specific ECS service or task
  • Roll back to the previous task definition revision
  • Scale out the ECS service by +2 tasks
  • Increase Lambda reserved concurrency temporarily
  • Clear a specific Redis key that has grown too large

For each fix, write the exact command. Pre-approve each one — no need to wake up a second person to confirm. The on-call engineer should be able to run any of these without a phone call.

# Roll back ECS service to a previous task definition
# Replace PREV with the last known-good revision number
aws ecs update-service \
  --cluster your-cluster-name \
  --service your-service-name \
  --task-definition your-task-family:PREV

# Scale out ECS service by 2 tasks (fill in current count + 2)
aws ecs update-service \
  --cluster your-cluster-name \
  --service your-service-name \
  --desired-count NEW_COUNT

The previous revision number is always visible in ECS → Task Definitions → your task family. At 2am, having this command ready to paste (with only the revision number to fill in) removes 3–4 minutes of fumbling.
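
If you would rather not click through the console at all, the same lookup works from a script. A small boto3 sketch, with the task family name as a placeholder:

# Sketch: print the last few task definition revisions for a family, so the
# previous known-good revision is one glance away.
import boto3

ecs = boto3.client("ecs")
revisions = ecs.list_task_definitions(
    familyPrefix="your-task-family",
    sort="DESC",
    maxResults=5,
)["taskDefinitionArns"]

for arn in revisions:
    print(arn)  # e.g. arn:aws:ecs:...:task-definition/your-task-family:42

Running it before the incident, or pasting its output into the runbook, means the PREV placeholder above is already filled in when the page arrives.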

The deployment rollback floor

If your normal deploy takes 8 minutes (build → push → update service → health check cycle), your rollback takes 8 minutes too. That's a hard floor on MTTR for deploy-caused incidents.

To get under this floor: use ECS task definition revision rollbacks instead of deploying new code. Rolling back a task definition takes 30–60 seconds because the image is already in ECR. This only works for code-only changes — if the deploy also changed infrastructure, a task definition rollback isn't sufficient.

For Lambda: use function aliases and version routing. Point an alias at the previous version. A version swap is instantaneous. This pattern works regardless of deployment tooling.
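
Assuming traffic already flows through an alias (called live here, which is a placeholder for whatever your alias is named), the rollback is a single call. A minimal boto3 sketch:

# Sketch: point the "live" alias back at the previous published version.
# Function name, alias name, and version number are placeholders.
import boto3

lambda_client = boto3.client("lambda")
lambda_client.update_alias(
    FunctionName="your-function-name",
    Name="live",
    FunctionVersion="41",  # the last known-good published version
)

Publishing a numbered version on every deploy is what makes this work; if you deploy straight to $LATEST, there is no previous version to point the alias at.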

How to measure progress

After implementing these changes, track MTTR per incident for 4–6 weeks. You're looking for two things: median going down, and outliers (incidents taking more than 20 minutes) shrinking in frequency. Outliers are almost always incidents where one of the three factors above broke down — alert noise caused a late response, context wasn't available, or the fix required approval.

A spreadsheet with incident start time, resolution time, and a 1-line root cause note is enough. You don't need incident management software until you're having more than 3–4 incidents per week.

Related reading

  • → How to find root cause in AWS CloudWatch alerts without an SRE team
  • → How to set up on-call rotations when your team is 3 engineers

Frequently asked questions

What is a realistic MTTR target for a 3–5 person engineering team?

Under 5 minutes is achievable for most incident types with the right process changes — not headcount. The average small team starts at 15–45 minutes. The three highest-impact changes are: cutting alert noise so every alert demands action, delivering diagnosis with the alert rather than after, and pre-approving the 5 most common fixes so no approval call is needed at 2am.

How do I measure MTTR accurately?

Create a spreadsheet with three columns: incident detected at, service restored at, duration. Fill it in for every incident, even minor ones. Do this for the last 10 incidents before changing anything — most teams discover their actual MTTR is higher than they think, and that 2–3 slow outliers are pulling the average up significantly.

How fast is an ECS task definition rollback compared to a full redeploy?

An ECS task definition revision rollback takes 30–60 seconds because the container image is already in ECR — no build, no push, just switching which revision the service runs. A full redeploy typically takes 6–12 minutes. For code-only changes, always prefer the revision rollback to hit a lower MTTR floor.

What fixes should be pre-approved for on-call engineers?

The 5 most commonly pre-approved fixes for small teams: restart a specific ECS service or task, roll back to the previous task definition revision, scale out the ECS service by +2 tasks, increase Lambda reserved concurrency temporarily, and clear a specific Redis key. Pre-approval means the on-call engineer can run any of these without a phone call — this alone removes 3–8 minutes from MTTR.

What is the difference between MTTR and MTTD?

MTTD (Mean Time To Detect) measures how long from incident start to when the on-call engineer is aware. MTTR (Mean Time To Recovery) measures from incident start to service restoration. For most small teams, MTTD is under 5 minutes with CloudWatch alarms wired to SNS. MTTR is the harder number — it's where investigation, decision, and fix time live.


Still debugging incidents manually?

ConvOps does this automatically — root cause in under 60 seconds, delivered to WhatsApp or Slack.

Try ConvOps free → See a live demo