The 12 CloudWatch alarms every small AWS team should have
It's 2:19am. Your RDS database stopped accepting writes 47 minutes ago. FreeStorageSpace hit zero at 1:32am. Every insert since then returned a read-only error. Users started seeing failures at 1:33am. You find out at 8:47am from a customer email. You had CPU alarms. You had Lambda error alarms. You had no disk space alarm. This is the 12-alarm list I run on every AWS account I care about.
Why most teams either over-alarm or under-alarm
The reflex answer is "monitor everything." AWS docs list 60+ metrics across ECS, EC2, RDS, Lambda, and ALB. A typical starter alarm guide will suggest 20-30 alarms. That's not wrong — but it misses the operational reality of a 5-person team where the same engineer who writes code is also on call.
When every alarm feels equally urgent, none of them are. I've watched on-call engineers silence their phones after three false positives in a week. The team then finds out about real incidents from users. The goal isn't comprehensive coverage. It's a small set of alarms where every trigger represents something worth waking up for — or at minimum, something worth investigating that day. These 12 cover the failure modes that actually take services down.
The 12 alarms at a glance
| # | Metric | Namespace | Threshold | Statistic | Severity |
|---|---|---|---|---|---|
| 1 | HealthyHostCount | AWS/ApplicationELB | ≤ 0 | Minimum | CRITICAL |
| 2 | HTTPCode_Target_5XX_Count | AWS/ApplicationELB | > 10/min | Sum | WARN |
| 3 | TargetResponseTime | AWS/ApplicationELB | > 2s (p99) | p99 | WARN |
| 4 | CPUUtilization | AWS/ECS | > 80% | Average | WARN |
| 5 | MemoryUtilization | AWS/ECS | > 85% | Average | WARN |
| 6 | FreeStorageSpace | AWS/RDS | < 5 GB | Average | WARN |
| 7 | FreeStorageSpace | AWS/RDS | < 1 GB | Average | CRITICAL |
| 8 | DatabaseConnections | AWS/RDS | > 80% of max_connections | Average | WARN |
| 9 | Errors | AWS/Lambda | > 5/min | Sum | WARN |
| 10 | StatusCheckFailed | AWS/EC2 | > 0 | Maximum | CRITICAL |
| 11 | ApproximateAgeOfOldestMessage | AWS/SQS | > 600s | Maximum | WARN |
| 12 | EstimatedCharges | AWS/Billing | > 2× monthly avg | Maximum | WARN |
Setting them up, one group at a time
Step 1: Availability — these page you immediately
Alarm 1 is the most important on this list. HealthyHostCount ≤ 0 means your ALB has no healthy targets — the service is returning 503 to every user. Set TreatMissingData to "breaching." If your ECS tasks crash completely and stop publishing metrics, you want this alarm to fire, not stay in OK state. One evaluation period of 60 seconds is enough. Don't wait 5 minutes to confirm you're down.
Alarm 2 catches application-level failures: 5XX errors reaching users. The threshold of 10 per minute with 2 evaluation periods (2 minutes sustained) filters out transient errors while catching real breakage. If your traffic is low — under 50 requests per minute — drop the threshold to > 2.
```yaml
Parameters:
  AlbFullName:
    Type: String
    Description: ALB full name from the ARN - everything after "loadbalancer/"
  TargetGroupFullName:
    Type: String
    Description: Target group full name from the ARN - everything after "targetgroup/"
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  # Alarm 1: no healthy targets behind the ALB
  AlbNoHealthyHosts:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AlbFullName}-no-healthy-hosts"
      AlarmDescription: ALB healthy host count is zero - service is returning 503 to all users
      Namespace: AWS/ApplicationELB
      MetricName: HealthyHostCount
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbFullName
        - Name: TargetGroup
          Value: !Ref TargetGroupFullName
      Statistic: Minimum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: LessThanOrEqualToThreshold
      # Missing data counts as a failure, so a fully crashed service still fires
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  # Alarm 2: sustained 5XX errors from the targets
  Alb5xxErrors:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AlbFullName}-5xx-errors"
      AlarmDescription: Application 5XX errors above 10/min for 2 consecutive minutes
      Namespace: AWS/ApplicationELB
      MetricName: HTTPCode_Target_5XX_Count
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbFullName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      # No 5XX data points means no errors, which is the healthy case
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]
```
Step 2: Resource pressure — these give you lead time
Alarms 3-5 give you a warning window before things break. CPU at 80% sustained for 15 minutes (3 × 5-minute periods) gives you time to scale out before latency degrades. Setting the threshold at 95% is too late — by then latency has already spiked.
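The CPU rule above could be written in the same CloudFormation style as alarms 1 and 2. This is a minimal sketch, not the author's template: the `ClusterName` and `ServiceName` parameters, the `EcsCpuHigh` logical ID, and the alarm name format are all hypothetical names chosen for illustration.

```yaml
Parameters:
  ClusterName:
    Type: String
    Description: ECS cluster name (hypothetical parameter for this sketch)
  ServiceName:
    Type: String
    Description: ECS service name (hypothetical parameter for this sketch)
  SnsTopicArn:
    Type: String
    Description: SNS topic that receives alarm notifications

Resources:
  # Alarm 4: sustained CPU pressure on the ECS service
  EcsCpuHigh:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-cpu-high"
      AlarmDescription: ECS service CPU above 80% for 15 minutes
      Namespace: AWS/ECS
      MetricName: CPUUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 300            # 5-minute periods
      EvaluationPeriods: 3   # 3 x 5 minutes = 15 minutes sustained
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      # TreatMissingData is left at its default here; the fully-down case
      # is assumed to be covered by the HealthyHostCount alarm.
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]
```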
For TargetResponseTime (alarm 3), use the p99 statistic, not Average. A service averaging 180ms with p99 at 6 seconds is serving 1% of its requests in 6 seconds or more; at 1,000 requests per second, that is roughly 10 slow requests every second. Average hides this entirely. Memory at 85% gives you a 10-15 minute window before ECS starts killing tasks with exit code 137.
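A sketch of alarms 3 and 5 follows. In CloudFormation, percentile statistics go in `ExtendedStatistic` rather than `Statistic`. The evaluation windows (3 × 60 seconds for both alarms) are assumptions, since the article only fixes the thresholds; the parameter names and logical IDs are likewise hypothetical.

```yaml
Parameters:
  AlbFullName:
    Type: String
    Description: ALB full name from the ARN, everything after "loadbalancer/"
  ClusterName:
    Type: String
    Description: ECS cluster name (hypothetical parameter for this sketch)
  ServiceName:
    Type: String
    Description: ECS service name (hypothetical parameter for this sketch)
  SnsTopicArn:
    Type: String
    Description: SNS topic that receives alarm notifications

Resources:
  # Alarm 3: tail latency, measured at p99 instead of Average
  AlbLatencyP99:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AlbFullName}-latency-p99"
      AlarmDescription: p99 target response time above 2 seconds
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbFullName
      ExtendedStatistic: p99   # replaces Statistic for percentiles
      Period: 60
      EvaluationPeriods: 3     # assumed window, tune to your traffic
      Threshold: 2             # TargetResponseTime is reported in seconds
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching   # no requests means no latency data
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  # Alarm 5: memory pressure before tasks are killed
  EcsMemoryHigh:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-memory-high"
      AlarmDescription: ECS service memory above 85%
      Namespace: AWS/ECS
      MetricName: MemoryUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 60
      EvaluationPeriods: 3     # assumed window, short enough to keep lead time
      Threshold: 85
      ComparisonOperator: GreaterThanThreshold
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]
```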
Step 3: Database — the silent killers
Alarms 6 and 7 are the same metric at two severity levels. FreeStorageSpace is the silent killer because MySQL and PostgreSQL on RDS stop accepting writes the moment disk is full — no graceful degradation, just immediate failure on every INSERT. The threshold value is in bytes: 5 GB = 5,368,709,120 bytes, 1 GB = 1,073,741,824 bytes.
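The two storage alarms might look like the sketch below, with the byte values from the paragraph above used as thresholds. The `DbInstanceId` parameter, the logical IDs, and the single 5-minute evaluation period are assumptions made for illustration.

```yaml
Parameters:
  DbInstanceId:
    Type: String
    Description: RDS DB instance identifier (hypothetical parameter for this sketch)
  SnsTopicArn:
    Type: String
    Description: SNS topic that receives alarm notifications

Resources:
  # Alarm 6: early warning at 5 GB free
  RdsStorageWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-free-storage-warn"
      AlarmDescription: RDS free storage below 5 GB
      Namespace: AWS/RDS
      MetricName: FreeStorageSpace
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 1     # assumed, the article does not state a window
      Threshold: 5368709120    # 5 GB expressed in bytes
      ComparisonOperator: LessThanThreshold
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  # Alarm 7: same metric, critical at 1 GB free
  RdsStorageCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-free-storage-critical"
      AlarmDescription: RDS free storage below 1 GB - writes will fail soon
      Namespace: AWS/RDS
      MetricName: FreeStorageSpace
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 1
      Threshold: 1073741824    # 1 GB expressed in bytes
      ComparisonOperator: LessThanThreshold
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]
```

In practice the critical alarm would probably point at a paging topic and the warning at a lower-urgency one; a single topic is used here only to keep the sketch short.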
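Alarm 8 (DatabaseConnections) threshold depends on your instance class. Approximate max_connections defaults for common types: db.t3.micro = 87 (threshold: 69), db.t3.medium = 341 (threshold: 272), db.t3.large = 648 (threshold: 518), db.r5.large = 1365 (threshold: 1092). The default also varies by engine and parameter group, so confirm the real value on your instance (`SHOW VARIABLES LIKE 'max_connections'` on MySQL, `SHOW max_connections` on PostgreSQL) before picking a threshold. Once max_connections is exhausted, new connection attempts fail immediately, with no queuing.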
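Because CloudFormation cannot look up max_connections for you, one option is to pass the precomputed 80% value in as a parameter. In this sketch the `ConnectionThreshold` parameter (defaulting to the db.t3.medium figure above), the `DbInstanceId` parameter, and the 2 × 5-minute window are all assumptions.

```yaml
Parameters:
  DbInstanceId:
    Type: String
    Description: RDS DB instance identifier (hypothetical parameter for this sketch)
  ConnectionThreshold:
    Type: Number
    Default: 272
    Description: 80% of max_connections for your instance class (272 fits db.t3.medium)
  SnsTopicArn:
    Type: String
    Description: SNS topic that receives alarm notifications

Resources:
  # Alarm 8: connection pool approaching max_connections
  RdsConnectionsHigh:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-connections-high"
      AlarmDescription: RDS connections above 80% of max_connections
      Namespace: AWS/RDS
      MetricName: DatabaseConnections
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2     # assumed window
      Threshold: !Ref ConnectionThreshold
      ComparisonOperator: GreaterThanThreshold
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]
```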
Step 4: Lambda, EC2, SQS, and billing
Lambda Errors at > 5 per minute is deliberately higher than zero. Every Lambda function generates transient errors — cold start timeouts, rate limit retries, misconfigured event sources. Alarming at > 0 creates noise. At > 5 per minute sustained for 2 minutes, something is actually broken.
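Expressed as a template, the Lambda rule could look roughly like this. The `FunctionName` parameter and the `LambdaErrors` logical ID are hypothetical; the threshold and the two 1-minute periods come from the paragraph above.

```yaml
Parameters:
  FunctionName:
    Type: String
    Description: Lambda function name (hypothetical parameter for this sketch)
  SnsTopicArn:
    Type: String
    Description: SNS topic that receives alarm notifications

Resources:
  # Alarm 9: sustained Lambda errors, not single transient failures
  LambdaErrors:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-errors"
      AlarmDescription: Lambda errors above 5/min for 2 consecutive minutes
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      # Assumption: a function that is not invoked reports no data,
      # and that should not count as a failure.
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]
```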
For SQS (alarm 11), use the Maximum statistic, not Average. If one message has been stuck for 20 minutes while 99% of messages process normally, Average hides it. Maximum catches the stuck message. The billing alarm (alarm 12) only works if you create it in us-east-1 — billing metrics are only published there. Set the threshold at 2× your average monthly spend.
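Here is one way alarms 11 and 12 might be declared. CloudFormation has no arithmetic, so the sketch assumes you compute 2× your average monthly spend yourself and pass it as the hypothetical `BillingThresholdUsd` parameter. It also assumes billing alerts are enabled in the account's billing preferences and, for brevity, that the queue lives in us-east-1 alongside the billing alarm.

```yaml
Parameters:
  QueueName:
    Type: String
    Description: SQS queue name (hypothetical parameter for this sketch)
  BillingThresholdUsd:
    Type: Number
    Description: 2x your average monthly spend in USD, computed by hand
  SnsTopicArn:
    Type: String
    Description: SNS topic in the same region as these alarms

Resources:
  # Alarm 11: a message has been stuck for more than 10 minutes
  SqsOldestMessageAge:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${QueueName}-oldest-message-age"
      AlarmDescription: Oldest SQS message is older than 600 seconds
      Namespace: AWS/SQS
      MetricName: ApproximateAgeOfOldestMessage
      Dimensions:
        - Name: QueueName
          Value: !Ref QueueName
      Statistic: Maximum       # Average would hide a single stuck message
      Period: 300
      EvaluationPeriods: 1     # assumed window
      Threshold: 600
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching   # assumption: an idle queue is normal
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  # Alarm 12: deploy in us-east-1, the only region with billing metrics.
  # If the queue is elsewhere, move the SQS alarm to its own template.
  BillingHigh:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: estimated-charges-high
      AlarmDescription: Estimated charges above 2x average monthly spend
      Namespace: AWS/Billing
      MetricName: EstimatedCharges
      Dimensions:
        - Name: Currency
          Value: USD
      Statistic: Maximum
      Period: 21600            # 6 hours; billing data updates a few times a day
      EvaluationPeriods: 1
      Threshold: !Ref BillingThresholdUsd
      ComparisonOperator: GreaterThanThreshold
      AlarmActions: [!Ref SnsTopicArn]
```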
Step 5: Audit your existing alarm state
Before adding new alarms, check what you already have. INSUFFICIENT_DATA alarms usually mean the metric is not being reported — the resource was deleted, renamed, or the dimension name is wrong. This command lists them all so you can clean up before adding more.
```bash
aws cloudwatch describe-alarms \
  --state-value INSUFFICIENT_DATA \
  --query "MetricAlarms[*].{Name:AlarmName,Namespace:Namespace,Metric:MetricName}" \
  --output table
```
Step 6: After alarm 2 fires, find the root error
When the 5XX alarm fires, run this query in CloudWatch Logs Insights against your application's log group. Set the time range to cover 30 minutes before the StateChangeTime in the alarm notification. This groups identical error messages so you see the most frequent error first, not 500 lines of the same stack trace.
```
fields @timestamp, @message
| filter @message like /(?i)(error|exception|failed)/
| stats count() as occurrences by @message
| sort occurrences desc
| limit 25
```
The most frequent error message is usually the root cause. Five hundred instances of the same NullPointerException is one bug. Two different errors appearing equally often usually indicates a config problem touching multiple code paths.
Four ways teams get this wrong
TreatMissingData: missing on availability alarms
This is the most dangerous misconfiguration. If your ECS service crashes completely and stops publishing metrics, an alarm with TreatMissingData: missing (the default) moves to INSUFFICIENT_DATA instead of ALARM, so its AlarmActions never fire. With notBreaching it simply stays in OK. For any alarm where no data means something is wrong (HealthyHostCount, StatusCheckFailed, any always-on service metric), set TreatMissingData: breaching.
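Applied to alarm 10, that advice might translate into the sketch below. The `InstanceId` parameter and the `Ec2StatusCheckFailed` logical ID are hypothetical. One trade-off to be aware of: with `breaching`, an instance that is deliberately stopped will most likely trigger the alarm too, since it no longer reports status checks.

```yaml
Parameters:
  InstanceId:
    Type: AWS::EC2::Instance::Id
    Description: EC2 instance to watch (hypothetical parameter for this sketch)
  SnsTopicArn:
    Type: String
    Description: SNS topic that receives alarm notifications

Resources:
  # Alarm 10: instance or system status check failed
  Ec2StatusCheckFailed:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-status-check-failed"
      AlarmDescription: EC2 status check failed
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 1     # a single failed check is already actionable
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      # No data from an always-on instance is itself a failure signal
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]
```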
Average instead of p99 for latency
p99 and Average are two statistics of the same TargetResponseTime metric, and they can tell very different stories. A service averaging 200ms with p99 at 8 seconds is making 1% of its requests wait 8 seconds or longer; at 1,000-1,500 requests per second, that is roughly 10-15 slow requests every second. Average will never show this. If you care about user experience at the tail, alarm on p99.
Missing OKActions
When an alarm transitions from ALARM back to OK, do you know? If you only set AlarmActions and not OKActions, you get notified when something breaks but not when it recovers. An engineer shouldn't be debugging something that already fixed itself. Set OKActions to the same SNS topic as AlarmActions.
EvaluationPeriods: 1 on every alarm
EvaluationPeriods: 1 means a single anomalous data point triggers the alarm. For CPU, memory, and latency alarms, use 2 or 3 — the condition needs to be sustained. For HealthyHostCount = 0 and StatusCheckFailed, 1 is appropriate. The failure mode isn't using too many periods — it's applying the same value to every alarm without thinking about what one data point above threshold actually means for that metric.
What ConvOps does differently
Doing this manually is fine. ConvOps does it automatically. Here's what's different: when any of these 12 alarms fires, ConvOps immediately runs the Logs Insights query, correlates the timestamp with recent ECS deployments, and sends a root cause hypothesis to WhatsApp or Slack before you've opened your laptop. You still own the fix. We cut the time between "alarm fires" and "you understand what broke" from 20-40 minutes to under 90 seconds.
Frequently asked questions
How many CloudWatch alarms should a small AWS team have?
Start with 12-15 alarms covering your ALB, ECS service, and RDS instance. The goal isn't comprehensive coverage — it's a small set where every alarm represents something worth acting on. More alarms create more noise; noise creates fatigue; fatigue means real incidents get missed.
What is the right CPU threshold for an ECS CloudWatch alarm?
80% with 3 evaluation periods of 5 minutes each — meaning sustained CPU above 80% for 15 minutes triggers the alarm. Don't set the threshold at 95%: by the time you've sustained 95%, latency has already degraded and you've lost your response window.
Why is my CloudWatch alarm showing INSUFFICIENT_DATA?
INSUFFICIENT_DATA means CloudWatch isn't receiving data for the metric. Common causes: the resource was deleted or renamed (alarm dimension no longer matches), the ECS service has zero running tasks, or the metric was never published. Run `aws cloudwatch describe-alarms --state-value INSUFFICIENT_DATA` to list affected alarms, then verify the resource in the dimension field still exists.
What happens when RDS FreeStorageSpace hits zero?
MySQL and PostgreSQL on RDS stop accepting writes immediately — all INSERT, UPDATE, and DELETE statements return errors. The instance does not automatically expand storage unless you have storage autoscaling enabled. To recover: enable autoscaling in the RDS console, or manually increase allocated storage, which triggers a storage modification and a brief performance impact.
Should I set TreatMissingData to breaching or notBreaching?
Use breaching for alarms where no data means something is wrong: HealthyHostCount, StatusCheckFailed, or any metric from a service that should always be running. Use notBreaching for alarms where a quiet metric is normal — 5XX count at 3am is zero, not missing. Getting this wrong is the most common reason availability alarms don't fire when services go down.