How to set up on-call rotations when your team is 3 engineers
On-call with 3 engineers is not glamorous. You're covering 168 hours per week with 3 people who are also writing features, reviewing PRs, and handling everything else. Here's how to make it work without burning people out.
The math first
A weekly rotation means each engineer is primary on-call for 1 week every 3 weeks — roughly 33% of weeks. That sounds like a lot, but the actual burden depends on incident frequency. If you're having 1 meaningful alert per night, that's 7 interruptions per week. If you're having 10, the rotation schedule is the least of your problems.
Define response time expectations before setting up the rotation:
| Severity | Criteria | Response time |
|---|---|---|
| Sev-1 | Service completely down — users cannot use the product | 15 minutes, 24/7 |
| Sev-2 | Degraded performance — partial failures, elevated errors, slow responses | 1 hour |
| Sev-3 | Minor issue, no direct customer impact | Next business day |
Most 3am alerts don't need a 15-minute response. Getting this severity calibration right is the most important step — it determines whether your on-call engineer actually sleeps.
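One lightweight way to encode the calibration is a naming convention on the alarms themselves, so routing logic can read severity straight off the alarm name. The `sev1-`/`sev2-`/`sev3-` prefix scheme below is a team convention assumed for this sketch, not anything CloudWatch enforces:

```python
# Map an alarm name to a severity and response expectation.
# Assumes alarms are named with a "sev1-"/"sev2-"/"sev3-" prefix --
# a team convention for this sketch, not a CloudWatch feature.
SEVERITY_POLICY = {
    "sev1": {"page_at_night": True,  "response": "15 minutes, 24/7"},
    "sev2": {"page_at_night": False, "response": "1 hour"},
    "sev3": {"page_at_night": False, "response": "next business day"},
}

def classify(alarm_name: str) -> dict:
    prefix = alarm_name.split("-", 1)[0].lower()
    # Unprefixed alarms default to Sev-2: they page, but not at 3am.
    return SEVERITY_POLICY.get(prefix, SEVERITY_POLICY["sev2"])
```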
Primary and backup model
The simplest structure that doesn't leave you exposed:
- Primary on-call: the first person paged. Expected to respond within 15 minutes for Sev-1.
- Secondary on-call (backup): paged automatically if primary doesn't acknowledge within 15 minutes. Also the person primary calls when they need help.
Sample rotation for engineers A, B, and C:
Week 1: Primary = A, Secondary = B
Week 2: Primary = B, Secondary = C
Week 3: Primary = C, Secondary = A
Week 4: repeat
Secondary is always one position ahead in the rotation. The secondary never has to do anything unless the primary misses the alert — but knowing there's a safety net changes how comfortably the primary engineer sleeps.
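Because the rotation is deterministic, you can compute it rather than maintain it by hand. A minimal sketch (the start date is a placeholder):

```python
from datetime import date

ENGINEERS = ["A", "B", "C"]
ROTATION_START = date(2024, 1, 1)  # a Monday; placeholder -- use your real start date

def on_call(today: date) -> tuple[str, str]:
    """Return (primary, secondary) for the week containing `today`."""
    week = (today - ROTATION_START).days // 7
    primary = ENGINEERS[week % len(ENGINEERS)]
    secondary = ENGINEERS[(week + 1) % len(ENGINEERS)]  # next week's primary
    return primary, secondary

print(on_call(date.today()))
```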
Tools that work at this scale
| Tool | Free tier | Cost | Best for |
|---|---|---|---|
| PagerDuty | Up to 5 users — schedules, escalation, phone/SMS/email/app | Free for ≤5 users; ~$20–25/user/month paid | Teams that need escalation policies and phone-call fallback |
| OpsGenie (Atlassian) | Up to 5 users — similar to PagerDuty free | Free for ≤5 users; comparable paid tiers | Teams already on Jira — tighter Atlassian integration |
| Twilio forwarding number | No free tier — pay per use | ~$1/month + usage (~$5–15 total) | Teams that just need call/SMS forwarding with no escalation automation |
For 3 engineers the free tiers of PagerDuty or OpsGenie cover everything you need. Set up two schedules (primary and secondary) and wire them into an escalation policy that pages the primary first and the secondary after 15 minutes without acknowledgment. Connect your SNS topics via the AWS integration.
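If you'd rather wire SNS to PagerDuty yourself instead of using the built-in AWS integration, a minimal sketch of a Lambda forwarding CloudWatch alarm notifications to PagerDuty's Events API v2 looks roughly like this (the routing key comes from your PagerDuty service; the hardcoded severity is a simplification):

```python
import json
import os
import urllib.request

# Integration (routing) key from your PagerDuty service -- placeholder.
ROUTING_KEY = os.environ["PAGERDUTY_ROUTING_KEY"]
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def handler(event, context):
    # SNS delivers the CloudWatch alarm as a JSON string in the message body.
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    body = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "dedup_key": alarm["AlarmName"],  # re-fires update the same incident
        "payload": {
            "summary": f'{alarm["AlarmName"]}: {alarm["NewStateReason"]}',
            "source": alarm.get("AlarmArn", "cloudwatch"),
            "severity": "critical",  # could be derived from the sev- prefix above
        },
    }
    req = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```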
The weekly handoff
A 5-minute handoff at the start of each on-call week prevents half the problems the incoming engineer will face. Cover these five things:
- Alerts that fired last week and were suppressed or resolved — so the incoming engineer isn't surprised by a recurring alarm
- Services in a known degraded state (higher than normal latency, flapping metrics that haven't hit the alarm threshold)
- Deploys scheduled this week that might cause alerts
- Runbook gaps noticed during the last shift — things that aren't documented that should be
- Anything in the AWS console that looks unusual but isn't alarming yet
The handoff doesn't need to be a meeting. A Slack message covering these five points at the start of the shift is sufficient. But it has to happen every week without exception.
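To make the message impossible to forget, you can post the template automatically at the start of each shift. A sketch using a Slack incoming webhook (the webhook URL is a placeholder):

```python
import json
import os
import urllib.request

# Incoming webhook URL for your team channel -- placeholder.
WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

HANDOFF_TEMPLATE = """*On-call handoff*
1. Alerts fired/suppressed last week:
2. Services in a known degraded state:
3. Deploys scheduled this week:
4. Runbook gaps noticed last shift:
5. Anything unusual but not alarming yet:"""

def post_handoff():
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": HANDOFF_TEMPLATE}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

post_handoff()  # run from cron/EventBridge at the start of each on-call week
```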
Writing the runbook
A runbook tells the on-call engineer what to do when a specific alarm fires. It doesn't need to be comprehensive — it needs to cover the 80% of incidents you actually have.
A runbook entry that actually works at 2am:
```
## api-service: CPUUtilization > 80%

First check: ECS console → api-service → Events tab

Recent deploy in the last 2 hours?
  Yes → roll back:
    aws ecs update-service --cluster prod --service api-service \
      --task-definition api-service:PREV_REVISION
  No → check ALB RequestCount for traffic spike
    Traffic spike → scale out: change desired count from N to N+2
    No spike → check memory (if also high, GC pressure — restart the service)

If service is completely down (0 healthy ALB targets):
  Restart: ECS → Tasks → select stopped tasks → force new deployment in service

Escalate to @on-call-lead if not resolved in 20 minutes.
```

Write a runbook entry for each alarm you have. Start with the alarms that fired most in the last 90 days. After resolving an incident, update the runbook before closing the incident ticket.
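CloudWatch's alarm history can tell you which alarms fire most, though it only retains a limited window (about two weeks); for the full 90-day view, your paging tool's analytics is the better source. A sketch with boto3, assuming default AWS credentials and region:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

import boto3

# Count how often each alarm entered the ALARM state, to decide which
# runbook entries to write first. CloudWatch keeps roughly two weeks of
# alarm history, so this is a recent-window approximation.
cloudwatch = boto3.client("cloudwatch")
start = datetime.now(timezone.utc) - timedelta(days=14)

counts = Counter()
paginator = cloudwatch.get_paginator("describe_alarm_history")
for page in paginator.paginate(HistoryItemType="StateUpdate", StartDate=start):
    for item in page["AlarmHistoryItems"]:
        if "to ALARM" in item["HistorySummary"]:
            counts[item["AlarmName"]] += 1

for name, n in counts.most_common(10):
    print(f"{n:4d}  {name}")
```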
On-call compensation
If engineers are on-call and the work isn't acknowledged in any way, you will have an attrition problem within 12 months. On-call is a real burden, and it needs to be recognized.
Common models at small companies:
| Model | How it works | Notes |
|---|---|---|
| Weekly stipend | $150–$300/week while primary on-call | Most common — predictable, easy to budget, no tracking overhead |
| Comp time | 30+ min nighttime incident → equivalent time off | Requires honest tracking; good for cash-constrained startups |
| Performance review recognition | On-call counted positively in reviews | Risky — depends on managers following through; don't use as the only compensation |
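On the comp-time model, "honest tracking" can be as lightweight as a shared log and a running total. A sketch (the log structure is made up for illustration):

```python
from datetime import datetime

# Each entry: (incident start, minutes spent, whether it was at night).
# Policy from the table above: nighttime incidents of 30+ minutes earn
# equivalent time off. The structure here is illustrative, not prescriptive.
incidents = [
    (datetime(2024, 3, 4, 2, 15), 45, True),
    (datetime(2024, 3, 6, 14, 0), 20, False),
    (datetime(2024, 3, 9, 3, 40), 90, True),
]

comp_minutes = sum(
    minutes for _, minutes, at_night in incidents if at_night and minutes >= 30
)
print(f"Comp time owed: {comp_minutes} minutes")  # 135
```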
At 3 engineers you can't afford to lose one. Be explicit about how on-call is compensated before the rotation starts, not after someone burns out.
The signal that the rotation is broken
If any given engineer is getting paged more than 2–3 times per night on average during their on-call week, the system is broken and the rotation structure isn't the fix. The fix is one of three things: reducing incident frequency (better infrastructure, better code, better alerts), adding another engineer to extend the rotation, or automating the most common remediations so they resolve without human involvement (sketched below).
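For the automation option, the scale-out branch of the runbook above is a natural first candidate. A sketch of a Lambda that bumps the ECS desired count when the CPU alarm fires (cluster and service names are placeholders):

```python
import json

import boto3

ecs = boto3.client("ecs")
CLUSTER, SERVICE = "prod", "api-service"  # placeholders
MAX_DESIRED = 8  # hard ceiling so a flapping alarm can't scale forever

def handler(event, context):
    # Triggered by the CPUUtilization alarm via SNS.
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    if alarm["NewStateValue"] != "ALARM":
        return
    svc = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])["services"][0]
    desired = svc["desiredCount"]
    if desired + 2 <= MAX_DESIRED:
        # The "N to N+2" step from the runbook, without waking anyone.
        ecs.update_service(cluster=CLUSTER, service=SERVICE, desiredCount=desired + 2)
```

ECS Service Auto Scaling can do this natively with a target-tracking policy; the Lambda form is only worth it when the remediation is more bespoke than a desired-count bump.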
A healthy on-call rotation should feel manageable, not heroic. If it feels heroic, something upstream needs to change.
Frequently asked questions
How should a 3-person engineering team set up on-call rotations?
Use a weekly primary/secondary model: one engineer is primary on-call for the full week, a second is backup (paged automatically if primary doesn't acknowledge in 15 minutes). In a 3-person team, each engineer is primary 1 week out of every 3. PagerDuty and OpsGenie both offer free tiers for up to 5 users — use either to automate escalation to secondary.
What should an on-call runbook include?
A runbook entry that works at 2am includes: the alarm name, a first-check instruction, a decision tree (if X then do Y, if not X then check Z), the exact commands for the most common fixes, and an escalation instruction. Cover the 80% of incidents you actually have — don't write runbooks for hypothetical scenarios. Update the runbook after every resolved incident.
What severity levels should a small engineering team use?
Three levels work well at small scale: Sev-1 (service completely down — 15 min response, 24/7), Sev-2 (degraded performance, partial failures — 1 hour response), Sev-3 (minor issue, no direct customer impact — next business day). The most important decision is what qualifies as Sev-1 — define it as 'users cannot use the product' to keep Sev-1 rare and Sev-1 response credible.
How should startups compensate on-call engineers?
The most common model is a weekly stipend of $150–$300 while primary on-call. Comp time (equivalent time off for nighttime incidents over 30 minutes) works for cash-constrained startups but requires honest tracking. Performance review recognition alone is risky — it's not tangible enough to offset interrupted sleep. Be explicit about compensation before the rotation starts, not after someone burns out.
What is the sign that an on-call rotation is broken?
If any engineer is getting paged more than 2–3 times per night on average during their on-call week, the rotation structure isn't the fix — incident frequency is too high. Fix by reducing alert noise (audit which alarms resolve without action and raise thresholds), adding an engineer to extend the rotation, or automating the most common remediations. The rotation schedule is the last thing to change.
Still debugging incidents manually?
ConvOps does this automatically — root cause in under 60 seconds, delivered to WhatsApp or Slack.