How to set up on-call rotations when your team is 3 engineers
On-call with 3 engineers is not glamorous. You're covering 168 hours per week with 3 people who are also writing features, reviewing PRs, and handling everything else. Here's how to make it work without burning people out.
The math first
A weekly rotation means each engineer is primary on-call for 1 week every 3 weeks — roughly 33% of weeks. That sounds like a lot, but the actual burden depends on incident frequency. If you're having 1 meaningful alert per night, that's 7 interruptions per week. If you're having 10, the rotation schedule is the least of your problems.
Define response time expectations before setting up the rotation:
| Severity | Criteria | Response time |
|---|---|---|
| Sev-1 | Service completely down — users cannot use the product | 15 minutes, 24/7 |
| Sev-2 | Degraded performance — partial failures, elevated errors, slow responses | 1 hour |
| Sev-3 | Minor issue, no direct customer impact | Next business day |
Most 3am alerts don't need a 15-minute response. Getting this severity calibration right is the most important step — it determines whether your on-call engineer actually sleeps.
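One lightweight way to encode the calibration is a naming convention on the alarms themselves, so routing logic can read severity straight off the alarm name. The `sev1-`/`sev2-`/`sev3-` prefix scheme below is a team convention assumed for this sketch, not anything CloudWatch enforces:

```python
# Map an alarm name to a severity and response expectation.
# Assumes alarms are named with a "sev1-"/"sev2-"/"sev3-" prefix --
# a team convention for this sketch, not a CloudWatch feature.
SEVERITY_POLICY = {
    "sev1": {"page_at_night": True,  "response": "15 minutes, 24/7"},
    "sev2": {"page_at_night": False, "response": "1 hour"},
    "sev3": {"page_at_night": False, "response": "next business day"},
}

def classify(alarm_name: str) -> dict:
    prefix = alarm_name.split("-", 1)[0].lower()
    # Unprefixed alarms default to Sev-2: they page, but not at 3am.
    return SEVERITY_POLICY.get(prefix, SEVERITY_POLICY["sev2"])
```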
Primary and backup model
The simplest structure that doesn't leave you exposed:
- Primary on-call: the first person paged. Expected to respond within 15 minutes for Sev-1.
- Secondary on-call (backup): paged automatically if primary doesn't acknowledge within 15 minutes. Also the person primary calls when they need help.
Sample rotation for engineers A, B, and C:
Week 1: Primary = A, Secondary = B
Week 2: Primary = B, Secondary = C
Week 3: Primary = C, Secondary = A
Week 4: repeat
Secondary is always one position ahead in the rotation. The secondary never has to do anything unless the primary misses the alert — but knowing there's a safety net changes how comfortably the primary engineer sleeps.
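Because the rotation is deterministic, you can compute it rather than maintain it by hand. A minimal sketch (the start date is a placeholder):

```python
from datetime import date

ENGINEERS = ["A", "B", "C"]
ROTATION_START = date(2024, 1, 1)  # a Monday; placeholder -- use your real start date

def on_call(today: date) -> tuple[str, str]:
    """Return (primary, secondary) for the week containing `today`."""
    week = (today - ROTATION_START).days // 7
    primary = ENGINEERS[week % len(ENGINEERS)]
    secondary = ENGINEERS[(week + 1) % len(ENGINEERS)]  # next week's primary
    return primary, secondary

print(on_call(date.today()))
```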
Tools that work at this scale
| Tool | Free tier | Cost | Best for |
|---|---|---|---|
| PagerDuty | Up to 5 users — schedules, escalation, phone/SMS/email/app | Free for ≤5 users; ~$20–25/user/month paid | Teams that need escalation policies and phone-call fallback |
| OpsGenie (Atlassian) | Up to 5 users — similar to PagerDuty free | Free for ≤5 users; comparable paid tiers | Teams already on Jira — tighter Atlassian integration |
| Twilio forwarding number | No free tier — pay per use | ~$1/month + usage (~$5–15 total) | Teams that just need call/SMS forwarding with no escalation automation |
For 3 engineers the free tiers of PagerDuty or OpsGenie cover everything you need. Set up two schedules (primary and secondary) and wire them into an escalation policy that pages the primary first and the secondary after 15 minutes without acknowledgment. Connect your SNS topics via the AWS integration.
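If you'd rather wire SNS to PagerDuty yourself instead of using the built-in AWS integration, a minimal sketch of a Lambda forwarding CloudWatch alarm notifications to PagerDuty's Events API v2 looks roughly like this (the routing key comes from your PagerDuty service; the hardcoded severity is a simplification):

```python
import json
import os
import urllib.request

# Integration (routing) key from your PagerDuty service -- placeholder.
ROUTING_KEY = os.environ["PAGERDUTY_ROUTING_KEY"]
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def handler(event, context):
    # SNS delivers the CloudWatch alarm as a JSON string in the message body.
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    body = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "dedup_key": alarm["AlarmName"],  # re-fires update the same incident
        "payload": {
            "summary": f'{alarm["AlarmName"]}: {alarm["NewStateReason"]}',
            "source": alarm.get("AlarmArn", "cloudwatch"),
            "severity": "critical",  # could be derived from the sev- prefix above
        },
    }
    req = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```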
The weekly handoff
A 5-minute handoff at the start of each on-call week prevents half the problems the incoming engineer will face. Cover these five things:
- Alerts that fired last week and were suppressed or resolved — so the incoming engineer isn't surprised by a recurring alarm
- Services in a known degraded state (higher than normal latency, flapping metrics that haven't hit the alarm threshold)
- Deploys scheduled this week that might cause alerts
- Runbook gaps noticed during the last shift — things that aren't documented that should be
- Anything in the AWS console that looks unusual but isn't alarming yet
The handoff doesn't need to be a meeting. A Slack message covering these five points at the start of the shift is sufficient. But it has to happen every week without exception.
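To make the message impossible to forget, you can post the template automatically at the start of each shift. A sketch using a Slack incoming webhook (the webhook URL is a placeholder):

```python
import json
import os
import urllib.request

# Incoming webhook URL for your team channel -- placeholder.
WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

HANDOFF_TEMPLATE = """*On-call handoff*
1. Alerts fired/suppressed last week:
2. Services in a known degraded state:
3. Deploys scheduled this week:
4. Runbook gaps noticed last shift:
5. Anything unusual but not alarming yet:"""

def post_handoff():
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": HANDOFF_TEMPLATE}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

post_handoff()  # run from cron/EventBridge at the start of each on-call week
```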
Writing the runbook
A runbook tells the on-call engineer what to do when a specific alarm fires. It doesn't need to be comprehensive — it needs to cover the 80% of incidents you actually have.
A runbook entry that actually works at 2am:
```
## api-service: CPUUtilization > 80%

First check: ECS console → api-service → Events tab

Recent deploy in the last 2 hours?
  Yes → roll back:
    aws ecs update-service --cluster prod --service api-service \
      --task-definition api-service:PREV_REVISION
  No → check ALB RequestCount for traffic spike
    Traffic spike → scale out: change desired count from N to N+2
    No spike → check memory (if also high, GC pressure — restart the service)

If service is completely down (0 healthy ALB targets):
  Restart: ECS → Tasks → select stopped tasks → force new deployment in service

Escalate to @on-call-lead if not resolved in 20 minutes.
```

Write a runbook entry for each alarm you have. Start with the alarms that fired most in the last 90 days. After resolving an incident, update the runbook before closing the incident ticket.
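CloudWatch's alarm history can tell you which alarms fire most, though it only retains a limited window (about two weeks); for the full 90-day view, your paging tool's analytics is the better source. A sketch with boto3, assuming default AWS credentials and region:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

import boto3

# Count how often each alarm entered the ALARM state, to decide which
# runbook entries to write first. CloudWatch keeps roughly two weeks of
# alarm history, so this is a recent-window approximation.
cloudwatch = boto3.client("cloudwatch")
start = datetime.now(timezone.utc) - timedelta(days=14)

counts = Counter()
paginator = cloudwatch.get_paginator("describe_alarm_history")
for page in paginator.paginate(HistoryItemType="StateUpdate", StartDate=start):
    for item in page["AlarmHistoryItems"]:
        if "to ALARM" in item["HistorySummary"]:
            counts[item["AlarmName"]] += 1

for name, n in counts.most_common(10):
    print(f"{n:4d}  {name}")
```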
On-call compensation
If engineers are on-call and the work isn't acknowledged in any way, you will have an attrition problem within 12 months. On-call is a real burden, and it needs to be recognized.
Common models at small companies:
| Model | How it works | Notes |
|---|---|---|
| Weekly stipend | $150–$300/week while primary on-call | Most common — predictable, easy to budget, no tracking overhead |
| Comp time | 30+ min nighttime incident → equivalent time off | Requires honest tracking; good for cash-constrained startups |
| Performance review recognition | On-call counted positively in reviews | Risky — depends on managers following through; don't use as the only compensation |
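On the comp-time model, "honest tracking" can be as lightweight as a shared log and a running total. A sketch (the log structure is made up for illustration):

```python
from datetime import datetime

# Each entry: (incident start, minutes spent, whether it was at night).
# Policy from the table above: nighttime incidents of 30+ minutes earn
# equivalent time off. The structure here is illustrative, not prescriptive.
incidents = [
    (datetime(2024, 3, 4, 2, 15), 45, True),
    (datetime(2024, 3, 6, 14, 0), 20, False),
    (datetime(2024, 3, 9, 3, 40), 90, True),
]

comp_minutes = sum(
    minutes for _, minutes, at_night in incidents if at_night and minutes >= 30
)
print(f"Comp time owed: {comp_minutes} minutes")  # 135
```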
At 3 engineers you can't afford to lose one. Be explicit about how on-call is compensated before the rotation starts, not after someone burns out.
The signal that the rotation is broken
If any given engineer is getting paged more than 2–3 times per night on average during their on-call week, the system is broken and the rotation structure isn't the fix. The fix is one of three things: reducing incident frequency (better infrastructure, better code, better alerts), adding another engineer to extend the rotation, or automating the most common remediations so they resolve without human involvement (sketched below).
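For the automation option, the scale-out branch of the runbook above is a natural first candidate. A sketch of a Lambda that bumps the ECS desired count when the CPU alarm fires (cluster and service names are placeholders):

```python
import json

import boto3

ecs = boto3.client("ecs")
CLUSTER, SERVICE = "prod", "api-service"  # placeholders
MAX_DESIRED = 8  # hard ceiling so a flapping alarm can't scale forever

def handler(event, context):
    # Triggered by the CPUUtilization alarm via SNS.
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    if alarm["NewStateValue"] != "ALARM":
        return
    svc = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])["services"][0]
    desired = svc["desiredCount"]
    if desired + 2 <= MAX_DESIRED:
        # The "N to N+2" step from the runbook, without waking anyone.
        ecs.update_service(cluster=CLUSTER, service=SERVICE, desiredCount=desired + 2)
```

ECS Service Auto Scaling can do this natively with a target-tracking policy; the Lambda form is only worth it when the remediation is more bespoke than a desired-count bump.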
A healthy on-call rotation should feel manageable, not heroic. If it feels heroic, something upstream needs to change.
Frequently asked questions
How should a 3-person engineering team set up on-call rotations?
Use a weekly primary/secondary model: one engineer is primary on-call for the full week, a second is backup (paged automatically if primary doesn't acknowledge in 15 minutes). In a 3-person team, each engineer is primary 1 week out of every 3. PagerDuty and OpsGenie both offer free tiers for up to 5 users — use either to automate escalation to secondary.
What should an on-call runbook include?
A runbook entry that works at 2am includes: the alarm name, a first-check instruction, a decision tree (if X then do Y, if not X then check Z), the exact commands for the most common fixes, and an escalation instruction. Cover the 80% of incidents you actually have — don't write runbooks for hypothetical scenarios. Update the runbook after every resolved incident.
What severity levels should a small engineering team use?
Three levels work well at small scale: Sev-1 (service completely down — 15 min response, 24/7), Sev-2 (degraded performance, partial failures — 1 hour response), Sev-3 (minor issue, no direct customer impact — next business day). The most important decision is what qualifies as Sev-1 — define it as 'users cannot use the product' to keep Sev-1 rare and Sev-1 response credible.
How should startups compensate on-call engineers?
The most common model is a weekly stipend of $150–$300 while primary on-call. Comp time (equivalent time off for nighttime incidents over 30 minutes) works for cash-constrained startups but requires honest tracking. Performance review recognition alone is risky — it's not tangible enough to offset interrupted sleep. Be explicit about compensation before the rotation starts, not after someone burns out.
What is the sign that an on-call rotation is broken?
If any engineer is getting paged more than 2–3 times per night on average during their on-call week, the rotation structure isn't the fix — incident frequency is too high. Fix by reducing alert noise (audit which alarms resolve without action and raise thresholds), adding an engineer to extend the rotation, or automating the most common remediations. The rotation schedule is the last thing to change.
Still debugging incidents manually?
ConvOps does this automatically — root cause in under 60 seconds, delivered to WhatsApp or Slack.