How to find root cause in AWS CloudWatch alerts without an SRE team
CloudWatch fires an alarm. CPUUtilization > 80% on your api-service. It's 2am. Your SRE team doesn't exist. Here's the investigation workflow that tells you why it happened — not just that it did.
Step 1: Read the alarm notification carefully
The raw alarm payload carries more signal than it first appears. Before opening any dashboard, extract these fields:
| Field | What it tells you | Urgency signal |
|---|---|---|
| Namespace | Which AWS service — ECS, EC2, Lambda, RDS | Scopes which metrics and logs to open first |
| Dimensions | The specific resource — ServiceName, ClusterName, FunctionName | The exact thing to investigate |
| StateChangeTime | When the threshold was crossed | Your incident start time for every subsequent query |
| MetricValue | How far over the threshold you are | 81% → investigate in the morning. 99% → act now |
| AlarmDescription | Team-written context and runbook pointers, if present | Read this before opening any console |
If CPU is at 81% for 5 minutes, you might investigate in the morning. If it's at 99% and climbing, you act immediately. The MetricValue tells you the urgency.
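If the alarm reaches you through SNS (email, a Slack webhook, PagerDuty), the notification body is JSON and you can pull these fields out in a few lines. A minimal sketch, assuming the standard CloudWatch-alarm-to-SNS payload shape; note that the exact metric value is embedded in the NewStateReason text rather than exposed as a top-level field:

```python
import json

def summarize_alarm(sns_message: str) -> dict:
    """Pull the triage fields out of a CloudWatch alarm SNS notification."""
    alarm = json.loads(sns_message)
    trigger = alarm.get("Trigger", {})
    return {
        "alarm": alarm.get("AlarmName"),
        "description": alarm.get("AlarmDescription"),   # runbook pointers, if your team wrote any
        "namespace": trigger.get("Namespace"),           # which AWS service to open first
        "dimensions": trigger.get("Dimensions", []),     # the exact resource to investigate
        "state_change": alarm.get("StateChangeTime"),    # incident start time for every later query
        "reason": alarm.get("NewStateReason"),           # contains the datapoint value vs the threshold
    }
```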
Step 2: Look at the alarm graph before you look at anything else
Open CloudWatch → Alarms → click the alarm name → look at the graph. Zoom out to 3–6 hours. Two patterns matter:
- Sudden spike: CPU was 20%, now it's 95%. Something changed. Look for a deploy, a config change, or a traffic spike.
- Gradual ramp: CPU has been climbing steadily over the past 2 hours. Resource exhaustion — memory leak, growing queue, accumulating connections, or traffic growth.
The shape of the curve tells you whether to look for a trigger event (spike) or a slow-building problem (ramp). Note the exact timestamp when the curve started climbing — you'll use it in every subsequent step.
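If you'd rather pull the curve from a terminal than the console, here's a sketch using boto3's get_metric_data; the cluster and service names are placeholders:

```python
import boto3
from datetime import datetime, timedelta, timezone

# Pull 6 hours of CPUUtilization for one ECS service so you can eyeball
# spike vs ramp from a terminal. Cluster and service names are placeholders.
cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_data(
    StartTime=end - timedelta(hours=6),
    EndTime=end,
    ScanBy="TimestampAscending",
    MetricDataQueries=[{
        "Id": "cpu",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/ECS",
                "MetricName": "CPUUtilization",
                "Dimensions": [
                    {"Name": "ClusterName", "Value": "prod-cluster"},
                    {"Name": "ServiceName", "Value": "api-service"},
                ],
            },
            "Period": 300,   # 5-minute buckets
            "Stat": "Average",
        },
    }],
)

series = resp["MetricDataResults"][0]
for ts, value in zip(series["Timestamps"], series["Values"]):
    print(ts.isoformat(), round(value, 1))
```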
Step 3: Find the first error using CloudWatch Logs Insights
Go to CloudWatch → Logs Insights. Select the log group for the affected service. For ECS it's usually /ecs/your-service-name. For Lambda it's /aws/lambda/your-function-name.
Run this query first — it counts errors per minute and shows you exactly when they started:
filter @message like /ERROR/ or @message like /Exception/ or @message like /exception/
| stats count() as errors by bin(1m)
The minute where the error count spikes is your incident start time. Now query that specific window for the actual messages:
fields @timestamp, @message, @logStream
| filter @message like /ERROR/ or @message like /Exception/
| sort @timestamp asc
| limit 50
The first error message is usually the one that matters. Stack traces further down are often consequences, not causes.
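Both queries can also be run from a terminal. A sketch using boto3's start_query and get_query_results; the log group name and the 3-hour window are placeholders to adjust to your incident:

```python
import time
from datetime import datetime, timedelta, timezone

import boto3

# Run the error-count query without opening the console.
logs = boto3.client("logs")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=3)

query_id = logs.start_query(
    logGroupName="/ecs/api-service",
    startTime=int(start.timestamp()),
    endTime=int(end.timestamp()),
    queryString=(
        "filter @message like /ERROR/ or @message like /Exception/ "
        "| stats count() as errors by bin(1m)"
    ),
)["queryId"]

# Logs Insights queries run asynchronously; poll until the query finishes.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})
```

The results come back as rows of field/value pairs, one per minute bucket, so the spike is easy to spot even in plain text.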
Step 4: Check for recent changes
Most production incidents trace back to something that changed. In order of likelihood:
- A deployment (code change)
- A configuration change (environment variables, task definition update)
- A dependency failure (another service or external API your code calls)
- A traffic spike (external — your code is fine, you're under-resourced)
- A scheduled task or cron job that ran at the same time
In ECS: go to the cluster → Services → your service → Events tab. Every deployment, scaling event, and task restart is logged here with timestamps. Compare against your incident start time.
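A sketch that pulls the same events without clicking through the console; cluster and service names are placeholders:

```python
import boto3

# List the most recent ECS service events and the running vs desired task count.
ecs = boto3.client("ecs")
service = ecs.describe_services(
    cluster="prod-cluster", services=["api-service"]
)["services"][0]

print("running:", service["runningCount"], "desired:", service["desiredCount"])
for event in service["events"][:10]:   # most recent events come first
    print(event["createdAt"].isoformat(), event["message"])
```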
In Lambda: check the function → Versions. The latest version timestamp tells you when it was last deployed. Check the function → Configuration → Environment variables for recent edits.
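The same check for Lambda, again as a sketch with a placeholder function name:

```python
import boto3

# Show when each published version of the function was deployed, plus the
# current environment variables.
lam = boto3.client("lambda")

versions = lam.list_versions_by_function(FunctionName="api-handler")["Versions"]
for v in sorted(versions, key=lambda v: v["LastModified"])[-3:]:
    print(v["Version"], v["LastModified"])

config = lam.get_function_configuration(FunctionName="api-handler")
print(config.get("Environment", {}).get("Variables", {}))
```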
Step 5: Service-specific investigation
ECS — high CPU
- Check running vs desired task count on the service detail page. If tasks are crash-looping and restarting, CPU spikes during each restart cycle.
- Check ALB → Target Groups → your group → Targets tab. Are targets unhealthy? Click on an unhealthy target to see the health check failure reason (see the sketch after this list).
- Check the ALB RequestCount metric for the same time window — did traffic actually spike, or is the same traffic consuming more CPU than it used to?
- If traffic is stable and no deploy happened, the running code has a problem. Check memory too — GC pressure from high memory often shows up as high CPU.
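The ALB target health check in that list is easy to script. A sketch using boto3's describe_target_health with a placeholder target group ARN:

```python
import boto3

# Print each target's health state and the health check failure reason.
elbv2 = boto3.client("elbv2")
health = elbv2.describe_target_health(
    TargetGroupArn="arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/api-service/abc123"
)

for desc in health["TargetHealthDescriptions"]:
    state = desc["TargetHealth"]
    print(
        desc["Target"]["Id"],
        state["State"],                  # healthy / unhealthy / draining / ...
        state.get("Reason", ""),         # e.g. Target.ResponseCodeMismatch, Target.Timeout
        state.get("Description", ""),
    )
```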
Lambda — high duration or errors
filter @type = "REPORT"
| stats max(@duration) as maxDuration, avg(@duration) as avgDuration, count(@duration) as invocations by bin(5 min)
| sort @timestamp ascIf maxDuration equals your configured timeout exactly, functions are timing out. If avgDuration climbs gradually, something is slower — a downstream API, a database query, or a cold-start problem after a fresh deploy.
RDS — high CPU
- Check the DatabaseConnections metric alongside CPU. If both are high, connection pool exhaustion is causing the CPU overhead — the database is processing more connection setup/teardown than actual queries.
- Enable Performance Insights if it isn't already on (the default 7-day retention tier is free on most instance classes, and it's built for this exact scenario). The top SQL view shows you which query is consuming the most DB time.
- Check the slow query log: set slow_query_log=1 and long_query_time=1 in your RDS parameter group (MySQL/MariaDB; for PostgreSQL use log_min_duration_statement), then enable slow query log export to CloudWatch Logs in the instance settings so Logs Insights can query it, as sketched below.
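If you'd rather make those changes from a script than the console, a sketch with placeholder identifiers (the boto3 calls are modify_db_parameter_group and modify_db_instance):

```python
import boto3

# Flip the slow query log on and ship it to CloudWatch Logs. Parameter group and
# instance identifiers are placeholders; parameters shown are for MySQL/MariaDB.
rds = boto3.client("rds")

rds.modify_db_parameter_group(
    DBParameterGroupName="api-service-mysql8",
    Parameters=[
        {"ParameterName": "slow_query_log", "ParameterValue": "1", "ApplyMethod": "immediate"},
        {"ParameterName": "long_query_time", "ParameterValue": "1", "ApplyMethod": "immediate"},
    ],
)

# Without this, the slow query log stays on the instance and never reaches CloudWatch.
rds.modify_db_instance(
    DBInstanceIdentifier="api-service-db",
    CloudwatchLogsExportConfiguration={"EnableLogTypes": ["slowquery"]},
)
```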
Common root causes and what they look like
| Symptom pattern | Likely root cause |
|---|---|
| CPU spike + deploy 10 min prior | New code is more CPU-intensive, or a connection pool setting changed in the new config |
| CPU gradual ramp + stable traffic | Memory pressure causing GC pauses — common in Java and Node.js |
| 5xx errors + ALB target health failures | Service crashing on startup or failing health checks after a bad deploy |
| Memory climbs + stable CPU | Memory leak — connection objects or event listeners not being released |
| CPU spike at the same time every day | A cron job or scheduled ECS task competing for resources with your API |
What to do when you still don't know after 15 minutes
If you've gone through steps 1–4 and the cause is still unclear, you have two options:
- Roll back the most recent deploy. This usually resolves the incident even if you don't know the root cause yet. Investigate the root cause the next morning with more context and less pressure.
- Scale out if resources are the bottleneck (CPU/memory at limit with no recent deploy). This buys time to investigate without continued downtime.
The goal at 2am is service restoration. Root cause analysis happens at 10am. Don't let perfect be the enemy of resolved.
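For an ECS service, both escape hatches are a single update_service call with boto3. A sketch with placeholder names; the task definition revision to roll back to is whatever was running before the deploy:

```python
import boto3

# The two 2am escape hatches for an ECS service. Names and the task definition
# revision are placeholders; run one of the two calls, not both.
ecs = boto3.client("ecs")

# Option 1: roll back by pointing the service at the previous task definition revision.
ecs.update_service(
    cluster="prod-cluster",
    service="api-service",
    taskDefinition="api-service:41",   # the revision that was running before the bad deploy
    forceNewDeployment=True,
)

# Option 2: scale out to buy investigation time.
ecs.update_service(
    cluster="prod-cluster",
    service="api-service",
    desiredCount=6,
)
```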
Building the runbook you wish you had
After resolving each incident, write down: which alarm fired, what you checked, what you found, what fixed it. After 5–6 incidents, patterns emerge and investigation time drops from 20 minutes to 4. The runbook should be pinned in your team's Slack channel — not buried in a wiki you'll never open at 2am.
Frequently asked questions
How do I find the root cause of a CloudWatch alarm without an SRE team?
Start by reading the alarm payload for MetricValue and StateChangeTime, then check the alarm graph for a spike vs gradual ramp pattern. Run a CloudWatch Logs Insights query filtering for ERROR or Exception in the affected log group starting from the incident timestamp. Check for recent deploys in ECS events or Lambda versions. In most cases the root cause is in the first error message logged at the time the curve started climbing.
What CloudWatch Logs Insights query should I run first during an incident?
Run: filter @message like /ERROR/ or @message like /Exception/ | stats count() as errors by bin(1m) — set the time range to your incident window. This shows error volume per minute so you can pinpoint the exact minute the incident started, then drill into that window for the specific error messages.
What does it mean when an ECS CPU alarm fires at 94% vs 80%?
At 94% CPUUtilization, tasks are running close to their CPU allocation and the container runtime starts throttling them: the kernel limits CPU access when a container consistently exceeds its allocated share, and request latency climbs soon after. Just over the 80% threshold, you have time to investigate. At 94%, act immediately: check for a recent deploy first, then ALB traffic spike, then memory pressure.
Should I roll back or scale out when I can't find root cause at 2am?
Roll back first if there was a recent deploy — it's the fastest path to service restoration regardless of root cause. If there was no recent deploy and CPU/memory is at the limit, scale out by +2 tasks to buy investigation time. Root cause analysis can wait until morning. At 2am the goal is service restoration, not perfect understanding.
How long should incident investigation take for a small team without an SRE?
A well-structured investigation following steps 1–4 (alarm data, graph shape, Logs Insights, recent changes) should yield a root cause or strong hypothesis in under 15 minutes for common incident patterns. If you've gone through all four steps and the cause is still unclear, roll back and investigate with fresh eyes the next morning.
Still debugging incidents manually?
ConvOps does this automatically — root cause in under 60 seconds, delivered to WhatsApp or Slack.