How to find root cause in AWS CloudWatch alerts without an SRE team
CloudWatch fires an alarm. CPUUtilization > 80% on your api-service. It's 2am. Your SRE team doesn't exist. Here's the investigation workflow that tells you why it happened — not just that it did.
Step 1: Read the alarm notification carefully
The raw alarm payload carries more signal than it first appears. Before opening any dashboard, extract these fields:
| Field | What it tells you | Urgency signal |
|---|---|---|
| Namespace | Which AWS service — ECS, EC2, Lambda, RDS | Scopes which metrics and logs to open first |
| Dimensions | The specific resource — ServiceName, ClusterName, FunctionName | The exact thing to investigate |
| StateChangeTime | When the threshold was crossed | Your incident start time for every subsequent query |
| MetricValue | How far over the threshold you are | 81% → investigate in the morning. 99% → act now |
| AlarmDescription | Team-written context and runbook pointers, if present | Read this before opening any console |
If CPU is at 81% for 5 minutes, you might investigate in the morning. If it's at 99% and climbing, you act immediately. The MetricValue tells you the urgency.
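If the alarm reaches you through SNS (email, a Slack webhook, PagerDuty), the notification body is JSON and you can pull these fields out in a few lines. A minimal sketch, assuming the standard CloudWatch-alarm-to-SNS payload shape; note that the exact metric value is embedded in the NewStateReason text rather than exposed as a top-level field:

```python
import json

def summarize_alarm(sns_message: str) -> dict:
    """Pull the triage fields out of a CloudWatch alarm SNS notification."""
    alarm = json.loads(sns_message)
    trigger = alarm.get("Trigger", {})
    return {
        "alarm": alarm.get("AlarmName"),
        "description": alarm.get("AlarmDescription"),   # runbook pointers, if your team wrote any
        "namespace": trigger.get("Namespace"),           # which AWS service to open first
        "dimensions": trigger.get("Dimensions", []),     # the exact resource to investigate
        "state_change": alarm.get("StateChangeTime"),    # incident start time for every later query
        "reason": alarm.get("NewStateReason"),           # contains the datapoint value vs the threshold
    }
```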
Step 2: Look at the alarm graph before you look at anything else
Open CloudWatch → Alarms → click the alarm name → look at the graph. Zoom out to 3–6 hours. Two patterns matter:
- Sudden spike: CPU was 20%, now it's 95%. Something changed. Look for a deploy, a config change, or a traffic spike.
- Gradual ramp: CPU has been climbing steadily over the past 2 hours. Resource exhaustion — memory leak, growing queue, accumulating connections, or traffic growth.
The shape of the curve tells you whether to look for a trigger event (spike) or a slow-building problem (ramp). Note the exact timestamp when the curve started climbing — you'll use it in every subsequent step.
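If you'd rather pull the curve from a terminal than the console, here's a sketch using boto3's get_metric_data; the cluster and service names are placeholders:

```python
import boto3
from datetime import datetime, timedelta, timezone

# Pull 6 hours of CPUUtilization for one ECS service so you can eyeball
# spike vs ramp from a terminal. Cluster and service names are placeholders.
cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_data(
    StartTime=end - timedelta(hours=6),
    EndTime=end,
    ScanBy="TimestampAscending",
    MetricDataQueries=[{
        "Id": "cpu",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/ECS",
                "MetricName": "CPUUtilization",
                "Dimensions": [
                    {"Name": "ClusterName", "Value": "prod-cluster"},
                    {"Name": "ServiceName", "Value": "api-service"},
                ],
            },
            "Period": 300,   # 5-minute buckets
            "Stat": "Average",
        },
    }],
)

series = resp["MetricDataResults"][0]
for ts, value in zip(series["Timestamps"], series["Values"]):
    print(ts.isoformat(), round(value, 1))
```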
Step 3: Find the first error using CloudWatch Logs Insights
Go to CloudWatch → Logs Insights. Select the log group for the affected service. For ECS it's usually /ecs/your-service-name. For Lambda it's /aws/lambda/your-function-name.
Run this query first — it counts errors per minute and shows you exactly when they started:
filter @message like /ERROR/ or @message like /Exception/ or @message like /exception/
| stats count() as errors by bin(1m)
The minute where the error count spikes is your incident start time. Now query that specific window for the actual messages:
fields @timestamp, @message, @logStream
| filter @message like /ERROR/ or @message like /Exception/
| sort @timestamp asc
| limit 50
The first error message is usually the one that matters. Stack traces further down are often consequences, not causes.
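Both queries can also be run from a terminal. A sketch using boto3's start_query and get_query_results; the log group name and the 3-hour window are placeholders to adjust to your incident:

```python
import time
from datetime import datetime, timedelta, timezone

import boto3

# Run the error-count query without opening the console.
logs = boto3.client("logs")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=3)

query_id = logs.start_query(
    logGroupName="/ecs/api-service",
    startTime=int(start.timestamp()),
    endTime=int(end.timestamp()),
    queryString=(
        "filter @message like /ERROR/ or @message like /Exception/ "
        "| stats count() as errors by bin(1m)"
    ),
)["queryId"]

# Logs Insights queries run asynchronously; poll until the query finishes.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})
```

The results come back as rows of field/value pairs, one per minute bucket, so the spike is easy to spot even in plain text.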
Step 4: Check for recent changes
Most production incidents trace back to something that changed. In order of likelihood:
- A deployment (code change)
- A configuration change (environment variables, task definition update)
- A dependency failure (another service or external API your code calls)
- A traffic spike (external — your code is fine, you're under-resourced)
- A scheduled task or cron job that ran at the same time
In ECS: go to the cluster → Services → your service → Events tab. Every deployment, scaling event, and task restart is logged here with timestamps. Compare against your incident start time.
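A sketch that pulls the same events without clicking through the console; cluster and service names are placeholders:

```python
import boto3

# List the most recent ECS service events and the running vs desired task count.
ecs = boto3.client("ecs")
service = ecs.describe_services(
    cluster="prod-cluster", services=["api-service"]
)["services"][0]

print("running:", service["runningCount"], "desired:", service["desiredCount"])
for event in service["events"][:10]:   # most recent events come first
    print(event["createdAt"].isoformat(), event["message"])
```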
In Lambda: check the function → Versions. The latest version timestamp tells you when it was last deployed. Check the function → Configuration → Environment variables for recent edits.
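The same check for Lambda, again as a sketch with a placeholder function name:

```python
import boto3

# Show when each published version of the function was deployed, plus the
# current environment variables.
lam = boto3.client("lambda")

versions = lam.list_versions_by_function(FunctionName="api-handler")["Versions"]
for v in sorted(versions, key=lambda v: v["LastModified"])[-3:]:
    print(v["Version"], v["LastModified"])

config = lam.get_function_configuration(FunctionName="api-handler")
print(config.get("Environment", {}).get("Variables", {}))
```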
Step 5: Service-specific investigation
ECS — high CPU
- Check running vs desired task count on the service detail page. If tasks are crash-looping and restarting, CPU spikes during each restart cycle.
- Check ALB → Target Groups → your group → Targets tab. Are targets unhealthy? Click on an unhealthy target to see the health check failure reason (see the sketch after this list).
- Check the ALB RequestCount metric for the same time window — did traffic actually spike, or is the same traffic consuming more CPU than it used to?
- If traffic is stable and no deploy happened, the running code has a problem. Check memory too — GC pressure from high memory often shows up as high CPU.
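The ALB target health check in that list is easy to script. A sketch using boto3's describe_target_health with a placeholder target group ARN:

```python
import boto3

# Print each target's health state and the health check failure reason.
elbv2 = boto3.client("elbv2")
health = elbv2.describe_target_health(
    TargetGroupArn="arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/api-service/abc123"
)

for desc in health["TargetHealthDescriptions"]:
    state = desc["TargetHealth"]
    print(
        desc["Target"]["Id"],
        state["State"],                  # healthy / unhealthy / draining / ...
        state.get("Reason", ""),         # e.g. Target.ResponseCodeMismatch, Target.Timeout
        state.get("Description", ""),
    )
```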
Lambda — high duration or errors
filter @type = "REPORT"
| stats max(@duration) as maxDuration, avg(@duration) as avgDuration, count(@duration) as invocations by bin(5 min)
| sort @timestamp ascIf maxDuration equals your configured timeout exactly, functions are timing out. If avgDuration climbs gradually, something is slower — a downstream API, a database query, or a cold-start problem after a fresh deploy.
RDS — high CPU
- Check the DatabaseConnections metric alongside CPU. If both are high, connection pool exhaustion is causing the CPU overhead — the database is processing more connection setup/teardown than actual queries.
- Enable Performance Insights if it isn't already on (the default 7-day retention tier is free on most instance classes, and it's built for this exact scenario). The top SQL view shows you which query is consuming the most DB time.
- Check the slow query log: set slow_query_log=1 and long_query_time=1 in your RDS parameter group (MySQL/MariaDB; for PostgreSQL use log_min_duration_statement), then enable slow query log export to CloudWatch Logs in the instance settings so Logs Insights can query it, as sketched below.
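If you'd rather make those changes from a script than the console, a sketch with placeholder identifiers (the boto3 calls are modify_db_parameter_group and modify_db_instance):

```python
import boto3

# Flip the slow query log on and ship it to CloudWatch Logs. Parameter group and
# instance identifiers are placeholders; parameters shown are for MySQL/MariaDB.
rds = boto3.client("rds")

rds.modify_db_parameter_group(
    DBParameterGroupName="api-service-mysql8",
    Parameters=[
        {"ParameterName": "slow_query_log", "ParameterValue": "1", "ApplyMethod": "immediate"},
        {"ParameterName": "long_query_time", "ParameterValue": "1", "ApplyMethod": "immediate"},
    ],
)

# Without this, the slow query log stays on the instance and never reaches CloudWatch.
rds.modify_db_instance(
    DBInstanceIdentifier="api-service-db",
    CloudwatchLogsExportConfiguration={"EnableLogTypes": ["slowquery"]},
)
```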
Common root causes and what they look like
| Symptom pattern | Likely root cause |
|---|---|
| CPU spike + deploy 10 min prior | New code is more CPU-intensive, or a connection pool setting changed in the new config |
| CPU gradual ramp + stable traffic | Memory pressure causing GC pauses — common in Java and Node.js |
| 5xx errors + ALB target health failures | Service crashing on startup or failing health checks after a bad deploy |
| Memory climbs + stable CPU | Memory leak — connection objects or event listeners not being released |
| CPU spike at the same time every day | A cron job or scheduled ECS task competing for resources with your API |
What to do when you still don't know after 15 minutes
If you've gone through steps 1–4 and the cause is still unclear, you have two options:
- Roll back the most recent deploy. This usually resolves the incident even if you don't know the root cause yet. Investigate the root cause the next morning with more context and less pressure.
- Scale out if resources are the bottleneck (CPU/memory at limit with no recent deploy). This buys time to investigate without continued downtime.
The goal at 2am is service restoration. Root cause analysis happens at 10am. Don't let perfect be the enemy of resolved.
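For an ECS service, both escape hatches are a single update_service call with boto3. A sketch with placeholder names; the task definition revision to roll back to is whatever was running before the deploy:

```python
import boto3

# The two 2am escape hatches for an ECS service. Names and the task definition
# revision are placeholders; run one of the two calls, not both.
ecs = boto3.client("ecs")

# Option 1: roll back by pointing the service at the previous task definition revision.
ecs.update_service(
    cluster="prod-cluster",
    service="api-service",
    taskDefinition="api-service:41",   # the revision that was running before the bad deploy
    forceNewDeployment=True,
)

# Option 2: scale out to buy investigation time.
ecs.update_service(
    cluster="prod-cluster",
    service="api-service",
    desiredCount=6,
)
```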
Building the runbook you wish you had
After resolving each incident, write down: which alarm fired, what you checked, what you found, what fixed it. After 5–6 incidents, patterns emerge and investigation time drops from 20 minutes to 4. The runbook should be pinned in your team's Slack channel — not buried in a wiki you'll never open at 2am.
Frequently asked questions
How do I find the root cause of a CloudWatch alarm without an SRE team?
Start by reading the alarm payload for MetricValue and StateChangeTime, then check the alarm graph for a spike vs gradual ramp pattern. Run a CloudWatch Logs Insights query filtering for ERROR or Exception in the affected log group starting from the incident timestamp. Check for recent deploys in ECS events or Lambda versions. In most cases the root cause is in the first error message logged at the time the curve started climbing.
What CloudWatch Logs Insights query should I run first during an incident?
Run: filter @message like /ERROR/ or @message like /Exception/ | stats count() as errors by bin(1m) — set the time range to your incident window. This shows error volume per minute so you can pinpoint the exact minute the incident started, then drill into that window for the specific error messages.
What does it mean when an ECS CPU alarm fires at 94% vs 80%?
At 94% CPUUtilization, tasks are running close to their CPU allocation and the container runtime starts throttling them: the kernel limits CPU access when a container consistently exceeds its allocated share, and request latency climbs soon after. Just over the 80% threshold, you have time to investigate. At 94%, act immediately: check for a recent deploy first, then ALB traffic spike, then memory pressure.
Should I roll back or scale out when I can't find root cause at 2am?
Roll back first if there was a recent deploy — it's the fastest path to service restoration regardless of root cause. If there was no recent deploy and CPU/memory is at the limit, scale out by +2 tasks to buy investigation time. Root cause analysis can wait until morning. At 2am the goal is service restoration, not perfect understanding.
How long should incident investigation take for a small team without an SRE?
A well-structured investigation following steps 1–4 (alarm data, graph shape, Logs Insights, recent changes) should yield a root cause or strong hypothesis in under 15 minutes for common incident patterns. If you've gone through all four steps and the cause is still unclear, roll back and investigate with fresh eyes the next morning.
Still debugging incidents manually?
ConvOps does this automatically — root cause in under 60 seconds, delivered to WhatsApp or Slack.