
How to find root cause in AWS CloudWatch alerts without an SRE team

April 28, 2026·7 min read

CloudWatch fires an alarm. CPUUtilization > 80% on your api-service. It's 2am. Your SRE team doesn't exist. Here's the investigation workflow that tells you why it happened — not just that it did.

Step 1: Read the alarm notification carefully

The raw alarm payload carries more signal than it appears to. Before opening any dashboard, extract these fields:

  • Namespace: which AWS service (ECS, EC2, Lambda, RDS). Scopes which metrics and logs to open first.
  • Dimensions: the specific resource (ServiceName, ClusterName, FunctionName). The exact thing to investigate.
  • StateChangeTime: when the threshold was crossed. Your incident start time for every subsequent query.
  • MetricValue: how far over the threshold you are. 81% means investigate in the morning; 99% means act now.
  • AlarmDescription: team-written context and runbook pointers, if present. Read this before opening any console.

If CPU is at 81% for 5 minutes, you might investigate in the morning. If it's at 99% and climbing, you act immediately. The MetricValue tells you the urgency.
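
If you'd rather pull those fields from a terminal than the console, here's a minimal boto3 sketch. The alarm name is a hypothetical placeholder; note that the DescribeAlarms API exposes StateChangeTime as StateUpdatedTimestamp and folds the metric value into StateReason.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical alarm name; substitute the one from your notification.
resp = cloudwatch.describe_alarms(AlarmNames=["api-service-cpu-high"])
for alarm in resp["MetricAlarms"]:
    print("Namespace:  ", alarm.get("Namespace"))           # which AWS service to open first
    print("Dimensions: ", alarm.get("Dimensions"))          # the exact resource to investigate
    print("StateChange:", alarm["StateUpdatedTimestamp"])   # incident start time
    print("StateReason:", alarm["StateReason"])             # includes the datapoint value that crossed the threshold
    print("Description:", alarm.get("AlarmDescription"))    # team-written context, if present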

Step 2: Look at the alarm graph before you look at anything else

Open CloudWatch → Alarms → click the alarm name → look at the graph. Zoom out to 3–6 hours. Two patterns matter:

  • Sudden spike: CPU was 20%, now it's 95%. Something changed. Look for a deploy, a config change, or a traffic spike.
  • Gradual ramp: CPU has been climbing steadily over the past 2 hours. Resource exhaustion — memory leak, growing queue, accumulating connections, or traffic growth.

The shape of the curve tells you whether to look for a trigger event (spike) or a slow-building problem (ramp). Note the exact timestamp when the curve started climbing — you'll use it in every subsequent step.
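
You can do the same spike-vs-ramp check from a script by pulling the raw datapoints. A rough boto3 sketch, assuming an ECS service metric and hypothetical cluster and service names:

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "prod-cluster"},  # placeholder
        {"Name": "ServiceName", "Value": "api-service"},   # placeholder
    ],
    StartTime=now - timedelta(hours=6),
    EndTime=now,
    Period=300,                 # 5-minute datapoints over the last 6 hours
    Statistics=["Average"],
)

points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
for p in points:
    print(p["Timestamp"].strftime("%H:%M"), f'{p["Average"]:.1f}%')

# Crude shape check: compare the last 30 minutes against the earlier baseline.
if len(points) > 6:
    baseline = sum(p["Average"] for p in points[:-6]) / (len(points) - 6)
    recent = sum(p["Average"] for p in points[-6:]) / 6
    print(f"baseline ~{baseline:.0f}%, last 30 min ~{recent:.0f}%")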

Step 3: Find the first error using CloudWatch Logs Insights

Go to CloudWatch → Logs Insights. Select the log group for the affected service. For ECS it's usually /ecs/your-service-name. For Lambda it's /aws/lambda/your-function-name.

Run this query first — it counts errors per minute and shows you exactly when they started:

filter @message like /ERROR/ or @message like /Exception/ or @message like /exception/
| stats count(*) as errors by bin(1m) as minute
| sort minute asc

The minute where the error count spikes is your incident start time. Now query that specific window for the actual messages:

fields @timestamp, @message, @logStream
| filter @message like /ERROR/ or @message like /Exception/
| sort @timestamp asc
| limit 50

Set the Logs Insights time range to match your incident window. Querying the full day buries the relevant errors in noise from earlier shifts.

The first error message is usually the one that matters. Stack traces further down are often consequences, not causes.
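
If you'd rather run the errors-per-minute query from a script (handy for pasting results into an incident channel), here's a minimal boto3 sketch. The log group name and the three-hour window are placeholder assumptions.

import time
from datetime import datetime, timedelta, timezone
import boto3

logs = boto3.client("logs")
now = datetime.now(timezone.utc)

query = """
filter @message like /ERROR/ or @message like /Exception/
| stats count(*) as errors by bin(1m) as minute
| sort minute asc
"""

start = logs.start_query(
    logGroupName="/ecs/api-service",   # placeholder log group
    startTime=int((now - timedelta(hours=3)).timestamp()),
    endTime=int(now.timestamp()),
    queryString=query,
)

# Poll until the query finishes, then print errors per minute.
while True:
    result = logs.get_query_results(queryId=start["queryId"])
    if result["status"] not in ("Scheduled", "Running"):
        break
    time.sleep(1)

for row in result["results"]:
    print({cell["field"]: cell["value"] for cell in row})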

Step 4: Check for recent changes

Most production incidents trace back to something that changed. In order of likelihood:

  1. A deployment (code change)
  2. A configuration change (environment variables, task definition update)
  3. A dependency failure (another service or external API your code calls)
  4. A traffic spike (external — your code is fine, you're under-resourced)
  5. A scheduled task or cron job that ran at the same time

In ECS: go to the cluster → Services → your service → Events tab. Every deployment, scaling event, and task restart is logged here with timestamps. Compare against your incident start time.

In Lambda: check the function → Versions. The latest version timestamp tells you when it was last deployed. Check the function → Configuration → Environment variables for recent edits.
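
The same change history is available from a script. A minimal boto3 sketch with hypothetical cluster, service, and function names:

import boto3

ecs = boto3.client("ecs")
awslambda = boto3.client("lambda")

# ECS: recent events and active deployments for the affected service.
svc = ecs.describe_services(cluster="prod-cluster", services=["api-service"])["services"][0]
for event in svc["events"][:10]:          # newest events first
    print(event["createdAt"], event["message"])
for deployment in svc["deployments"]:
    print(deployment["status"], deployment["taskDefinition"], deployment["createdAt"])

# Lambda: when each published version was deployed.
versions = awslambda.list_versions_by_function(FunctionName="api-handler")["Versions"]
for v in versions:
    print(v["Version"], v["LastModified"])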

Step 5: Service-specific investigation

ECS — high CPU

  • Check running vs desired task count on the service detail page. If tasks are crash-looping and restarting, CPU spikes during each restart cycle.
  • Check ALB → Target Groups → your group → Targets tab. Are targets unhealthy? Click on an unhealthy target to see the health check failure reason.
  • Check the ALB RequestCount metric for the same time window — did traffic actually spike, or is the same traffic consuming more CPU than it used to?
  • If traffic is stable and no deploy happened, the running code has a problem. Check memory too — GC pressure from high memory often shows up as high CPU.
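
A minimal boto3 sketch of the first two checks in this list, with a placeholder cluster, service, and target group ARN:

import boto3

ecs = boto3.client("ecs")
elbv2 = boto3.client("elbv2")

# Running vs desired task count: a persistent gap or constant churn suggests crash-looping.
svc = ecs.describe_services(cluster="prod-cluster", services=["api-service"])["services"][0]
print("running:", svc["runningCount"], "desired:", svc["desiredCount"])

# Target health: Description explains why a failing health check is failing.
health = elbv2.describe_target_health(
    TargetGroupArn="arn:aws:elasticloadbalancing:region:account:targetgroup/api/abc123"  # placeholder
)
for t in health["TargetHealthDescriptions"]:
    state = t["TargetHealth"]["State"]
    if state != "healthy":
        print(t["Target"]["Id"], state, t["TargetHealth"].get("Description"))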

Lambda — high duration or errors

filter @type = "REPORT"
| stats max(@duration) as maxDuration, avg(@duration) as avgDuration, count(@duration) as invocations by bin(5 min)
| sort @timestamp asc

If maxDuration equals your configured timeout exactly, functions are timing out. If avgDuration climbs gradually, something is slower — a downstream API, a database query, or a cold-start problem after a fresh deploy.
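
To confirm the timeout case, compare maxDuration against the function's configured timeout. A small sketch, assuming a hypothetical function name and a maxDuration value pasted in from the query results:

import boto3

awslambda = boto3.client("lambda")
cfg = awslambda.get_function_configuration(FunctionName="api-handler")  # placeholder name

timeout_ms = cfg["Timeout"] * 1000   # Timeout is configured in seconds; @duration is reported in ms
max_duration_ms = 30_000             # paste the maxDuration value from the Logs Insights result

if max_duration_ms >= timeout_ms:
    print(f'invocations are hitting the configured {cfg["Timeout"]}s timeout')
else:
    print(f'slow but not timing out: {max_duration_ms / 1000:.1f}s of a {cfg["Timeout"]}s limit')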

RDS — high CPU

  • Check the DatabaseConnections metric alongside CPU. If both are high, connection pool exhaustion is causing the CPU overhead — the database is processing more connection setup/teardown than actual queries.
  • Enable Performance Insights if it isn't already on (the default 7-day retention tier is free on supported instance classes, and it's worth having for exactly this scenario). The top SQL view shows you which query is consuming the most DB time.
  • Check the slow query log. For MySQL and MariaDB, set slow_query_log=1 and long_query_time=1 in your RDS parameter group and enable the slow query log export on the instance so it's published to CloudWatch Logs; for PostgreSQL, set log_min_duration_statement instead.
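
The first check is quick to script. A boto3 sketch that pulls both metrics for a hypothetical DB instance identifier:

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

def series(metric_name):
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric_name,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "prod-db"}],  # placeholder
        StartTime=now - timedelta(hours=3),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    return sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])

cpu = series("CPUUtilization")
conns = series("DatabaseConnections")

# If both climb together, suspect connection churn rather than a single expensive query.
for c, n in zip(cpu, conns):
    print(c["Timestamp"].strftime("%H:%M"), f'cpu {c["Average"]:.0f}%', f'conns {n["Average"]:.0f}')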

Common root causes and what they look like

  • CPU spike + deploy 10 min prior: new code is more CPU-intensive, or a connection pool setting changed in the new config.
  • CPU gradual ramp + stable traffic: memory pressure causing GC pauses, common in Java and Node.js.
  • 5xx errors + ALB target health failures: service crashing on startup or failing health checks after a bad deploy.
  • Memory climbs + stable CPU: memory leak, typically connection objects or event listeners not being released.
  • CPU spike at the same time every day: a cron job or scheduled ECS task competing for resources with your API.

What to do when you still don't know after 15 minutes

If you've gone through steps 1–4 and the cause is still unclear, you have two options:

  1. Roll back the most recent deploy. This fixes the incident even if you don't know the root cause yet. Investigate the root cause the next morning with more context and less pressure.
  2. Scale out if resources are the bottleneck (CPU/memory at limit with no recent deploy). This buys time to investigate without continued downtime.
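
For ECS, both options come down to a single update_service call. A minimal sketch with placeholder cluster, service, task definition revision, and counts:

import boto3

ecs = boto3.client("ecs")

# Option 1: roll back by redeploying the previous task definition revision.
ecs.update_service(
    cluster="prod-cluster",
    service="api-service",
    taskDefinition="api-service:41",   # the revision that was running before the bad deploy
)

# Option 2: scale out to buy investigation time when resources are the bottleneck.
ecs.update_service(
    cluster="prod-cluster",
    service="api-service",
    desiredCount=6,                    # current desired count plus 2
)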

The goal at 2am is service restoration. Root cause analysis happens at 10am. Don't let perfect be the enemy of resolved.

Building the runbook you wish you had

After resolving each incident, write down: which alarm fired, what you checked, what you found, what fixed it. After 5–6 incidents, patterns emerge and investigation time drops from 20 minutes to 4. The runbook should be pinned in your team's Slack channel — not buried in a wiki you'll never open at 2am.

Related reading

  • → The Complete AWS CloudWatch Alarm Setup Guide
  • → MTTR under 5 minutes: what actually moves the needle for small teams

Frequently asked questions

How do I find the root cause of a CloudWatch alarm without an SRE team?

Start by reading the alarm payload for MetricValue and StateChangeTime, then check the alarm graph for a spike vs gradual ramp pattern. Run a CloudWatch Logs Insights query filtering for ERROR or Exception in the affected log group starting from the incident timestamp. Check for recent deploys in ECS events or Lambda versions. In most cases the root cause is in the first error message logged at the time the curve started climbing.

What CloudWatch Logs Insights query should I run first during an incident?

Run: filter @message like /ERROR/ or @message like /Exception/ | stats count(*) as errors by bin(1m) as minute | sort minute asc, with the time range set to your incident window. This shows error volume per minute so you can pinpoint the exact minute the incident started, then drill into that window for the specific error messages.

What does it mean when an ECS CPU alarm fires at 94% vs 80%?

At 94% CPUUtilization, ECS tasks are being CPU-throttled by the container runtime — the kernel limits CPU access when a container consistently exceeds its allocated share. Latency spikes typically begin within 90 seconds of sustained throttling. At 81%, you have time to investigate. At 94%, act immediately: check for a recent deploy first, then ALB traffic spike, then memory pressure.

Should I roll back or scale out when I can't find root cause at 2am?

Roll back first if there was a recent deploy — it's the fastest path to service restoration regardless of root cause. If there was no recent deploy and CPU/memory is at the limit, scale out by +2 tasks to buy investigation time. Root cause analysis can wait until morning. At 2am the goal is service restoration, not perfect understanding.

How long should incident investigation take for a small team without an SRE?

A well-structured investigation following steps 1–4 (alarm data, graph shape, Logs Insights, recent changes) should yield a root cause or strong hypothesis in under 15 minutes for common incident patterns. If you've gone through all four steps and the cause is still unclear, roll back and investigate with fresh eyes the next morning.
