The 12 CloudWatch alarms every small AWS team should have

May 20, 2026 · 9 min read

It's 2:19am. Your RDS database stopped accepting writes 47 minutes ago. FreeStorageSpace hit zero at 1:32am. Every insert since then returned a read-only error. Users started seeing failures at 1:33am. You find out at 8:47am from a customer email. You had CPU alarms. You had Lambda error alarms. You had no disk space alarm. This is the 12-alarm list I run on every AWS account I care about.

Why most teams either over-alarm or under-alarm

The reflex answer is "monitor everything." AWS docs list 60+ metrics across ECS, EC2, RDS, Lambda, and ALB. A typical starter alarm guide will suggest 20-30 alarms. That's not wrong — but it misses the operational reality of a 5-person team where the same engineer who writes code is also on call.

When every alarm feels equally urgent, none of them are. I've watched on-call engineers silence their phones after three false positives in a week. The team then finds out about real incidents from users. The goal isn't comprehensive coverage. It's a small set of alarms where every trigger represents something worth waking up for — or at minimum, something worth investigating that day. These 12 cover the failure modes that actually take services down.

The 12 alarms at a glance

#	Metric	Namespace	Threshold	Statistic	Severity
1	HealthyHostCount	AWS/ApplicationELB	≤ 0	Minimum	CRITICAL
2	HTTPCode_Target_5XX_Count	AWS/ApplicationELB	> 10/min	Sum	WARN
3	TargetResponseTime	AWS/ApplicationELB	> 2s (p99)	p99	WARN
4	CPUUtilization	AWS/ECS	> 80%	Average	WARN
5	MemoryUtilization	AWS/ECS	> 85%	Average	WARN
6	FreeStorageSpace	AWS/RDS	< 5 GB	Average	WARN
7	FreeStorageSpace	AWS/RDS	< 1 GB	Average	CRITICAL
8	DatabaseConnections	AWS/RDS	> 80% of max_connections	Average	WARN
9	Errors	AWS/Lambda	> 5/min	Sum	WARN
10	StatusCheckFailed	AWS/EC2	> 0	Maximum	CRITICAL
11	ApproximateAgeOfOldestMessage	AWS/SQS	> 600s	Maximum	WARN
12	EstimatedCharges	AWS/Billing	> 2× monthly avg	Maximum	WARN

Setting them up, one group at a time

Step 1: Availability — these page you immediately

Alarm 1 is the most important on this list. HealthyHostCount ≤ 0 means your ALB has no healthy targets — the service is returning 503 to every user. Set TreatMissingData to "breaching." If your ECS tasks crash completely and stop publishing metrics, you want this alarm to fire, not stay in OK state. One evaluation period of 60 seconds is enough. Don't wait 5 minutes to confirm you're down.

Alarm 2 catches application-level failures: 5XX errors reaching users. The threshold of 10 per minute with 2 evaluation periods (2 minutes sustained) filters out transient errors while catching real breakage. If your traffic is low — under 50 requests per minute — drop the threshold to > 2.

Decision rule: if HealthyHostCount = 0 AND 5XX count is high, the service is completely unavailable — check ECS task state and ALB target registration. If HealthyHostCount > 0 AND 5XX count is high, the infrastructure is running but the application is broken — open your application logs.

Parameters:
  AlbFullName:
    Type: String
    Description: ALB full name from the ARN — everything after "loadbalancer/"
  TargetGroupFullName:
    Type: String
    Description: Target group full name from the ARN — everything after "targetgroup/"
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  AlbNoHealthyHosts:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AlbFullName}-no-healthy-hosts"
      AlarmDescription: ALB healthy host count is zero - service is returning 503 to all users
      Namespace: AWS/ApplicationELB
      MetricName: HealthyHostCount
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbFullName
        - Name: TargetGroup
          Value: !Ref TargetGroupFullName
      Statistic: Minimum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: LessThanOrEqualToThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  Alb5xxErrors:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AlbFullName}-5xx-errors"
      AlarmDescription: Application 5XX errors above 10/min for 2 consecutive minutes
      Namespace: AWS/ApplicationELB
      MetricName: HTTPCode_Target_5XX_Count
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbFullName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

Step 2: Resource pressure — these give you lead time

Alarms 3-5 give you a warning window before things break. CPU at 80% sustained for 15 minutes (3 × 5-minute periods) gives you time to scale out before latency degrades. Setting the threshold at 95% is too late — by then latency has already spiked.

For TargetResponseTime (alarm 3), use the p99 statistic, not Average. A service averaging 180ms with p99 at 6 seconds is making 1% of its requests wait 6 seconds or longer. At 1,000 requests per second, that is 10 slow requests every second. Average hides this entirely. Memory at 85% gives you a 10-15 minute window before ECS starts killing tasks with exit code 137.
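Alarms 3 to 5 could be written in the same CloudFormation style as the availability alarms. This is a sketch under assumptions, not a tested template: EcsClusterName and EcsServiceName are hypothetical parameter names, AlbFullName and SnsTopicArn are reused from the Step 1 template, and the periods on the latency and memory alarms are guesses (the table fixes only the threshold and statistic). Note that percentile statistics go in ExtendedStatistic rather than Statistic.

```yaml
# Merge these entries into the Parameters and Resources sections of the Step 1 template.
Parameters:
  EcsClusterName:          # hypothetical parameter name
    Type: String
  EcsServiceName:          # hypothetical parameter name
    Type: String

Resources:
  AlbP99Latency:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AlbFullName}-p99-latency"
      AlarmDescription: p99 target response time above 2 seconds
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbFullName
      ExtendedStatistic: p99           # percentiles use ExtendedStatistic, not Statistic
      Period: 60                       # assumption: not specified in the table
      EvaluationPeriods: 3             # assumption: sustained for 3 minutes
      Threshold: 2                     # TargetResponseTime is reported in seconds
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching   # no traffic means no latency data
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  EcsCpuHigh:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${EcsServiceName}-cpu-high"
      AlarmDescription: ECS service CPU above 80% for 15 minutes
      Namespace: AWS/ECS
      MetricName: CPUUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref EcsClusterName
        - Name: ServiceName
          Value: !Ref EcsServiceName
      Statistic: Average
      Period: 300                      # 3 x 5-minute periods, as described above
      EvaluationPeriods: 3
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  EcsMemoryHigh:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${EcsServiceName}-memory-high"
      AlarmDescription: ECS service memory above 85%
      Namespace: AWS/ECS
      MetricName: MemoryUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref EcsClusterName
        - Name: ServiceName
          Value: !Ref EcsServiceName
      Statistic: Average
      Period: 60                       # assumption: shorter than CPU to preserve the 10-15 minute window
      EvaluationPeriods: 3
      Threshold: 85
      ComparisonOperator: GreaterThanThreshold
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]
```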

Step 3: Database — the silent killers

Alarms 6 and 7 are the same metric at two severity levels. FreeStorageSpace is the silent killer because MySQL and PostgreSQL on RDS stop accepting writes the moment disk is full — no graceful degradation, just immediate failure on every INSERT. The threshold value is in bytes: 5 GB = 5,368,709,120 bytes, 1 GB = 1,073,741,824 bytes.

FreeStorageSpace is measured in bytes in CloudWatch, not gigabytes. A threshold of 5000 will not give you a 5 GB warning. It will alarm only when fewer than 5,000 bytes (about 5 KB) remain. Use 5368709120 for 5 GB and 1073741824 for 1 GB.

Alarm 8 (DatabaseConnections) threshold depends on your instance class. Approximate default max_connections for common types: db.t3.micro = 87 (threshold: 69), db.t3.medium = 341 (threshold: 272), db.t3.large = 648 (threshold: 518), db.r5.large = 1365 (threshold: 1092). The real value is derived from instance memory and differs by engine, so confirm it in your parameter group before picking a threshold. Once max_connections is exhausted, new connection attempts fail immediately, with no queuing.
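The three database alarms might look like this in CloudFormation. Treat it as a sketch: DbInstanceId and DbConnectionsThreshold are hypothetical parameter names, SnsTopicArn is reused from the Step 1 template, the default of 272 is the db.t3.medium figure quoted above, and the periods are assumptions since the table does not specify them.

```yaml
# Merge these entries into the Parameters and Resources sections of the Step 1 template.
Parameters:
  DbInstanceId:              # hypothetical parameter name: the RDS DBInstanceIdentifier
    Type: String
  DbConnectionsThreshold:    # hypothetical: 80% of max_connections for your instance class
    Type: Number
    Default: 272             # db.t3.medium figure from the text

Resources:
  RdsStorageLowWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-storage-low-warn"
      AlarmDescription: RDS free storage below 5 GB
      Namespace: AWS/RDS
      MetricName: FreeStorageSpace
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300                  # assumption
      EvaluationPeriods: 1         # assumption
      Threshold: 5368709120        # 5 GB expressed in bytes
      ComparisonOperator: LessThanThreshold
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  RdsStorageLowCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-storage-low-critical"
      AlarmDescription: RDS free storage below 1 GB - writes will fail when it reaches zero
      Namespace: AWS/RDS
      MetricName: FreeStorageSpace
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300                  # assumption
      EvaluationPeriods: 1         # assumption
      Threshold: 1073741824        # 1 GB expressed in bytes
      ComparisonOperator: LessThanThreshold
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  RdsConnectionsHigh:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-connections-high"
      AlarmDescription: RDS connections above 80% of max_connections
      Namespace: AWS/RDS
      MetricName: DatabaseConnections
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300                  # assumption
      EvaluationPeriods: 2         # assumption: sustained for 10 minutes
      Threshold: !Ref DbConnectionsThreshold
      ComparisonOperator: GreaterThanThreshold
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]
```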

Step 4: Lambda, EC2, SQS, and billing

Lambda Errors at > 5 per minute is deliberately higher than zero. Every Lambda function generates transient errors — cold start timeouts, rate limit retries, misconfigured event sources. Alarming at > 0 creates noise. At > 5 per minute sustained for 2 minutes, something is actually broken.
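As a sketch, the Lambda alarm described above could be written as follows, with the EC2 status check alarm (alarm 10 in the table) alongside it. LambdaFunctionName and Ec2InstanceId are hypothetical parameter names, SnsTopicArn is reused from the Step 1 template, and the 60-second period on the EC2 alarm is an assumption.

```yaml
# Merge these entries into the Parameters and Resources sections of the Step 1 template.
Parameters:
  LambdaFunctionName:      # hypothetical parameter name
    Type: String
  Ec2InstanceId:           # hypothetical parameter name
    Type: String

Resources:
  LambdaErrors:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${LambdaFunctionName}-errors"
      AlarmDescription: Lambda errors above 5/min for 2 consecutive minutes
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref LambdaFunctionName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching   # a function that is not invoked reports no errors
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  Ec2StatusCheckFailed:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${Ec2InstanceId}-status-check-failed"
      AlarmDescription: EC2 instance or system status check failed
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed
      Dimensions:
        - Name: InstanceId
          Value: !Ref Ec2InstanceId
      Statistic: Maximum
      Period: 60                       # assumption
      EvaluationPeriods: 1             # one failed check is enough to page
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: breaching      # no data from an always-on instance means trouble
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]
```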

For SQS (alarm 11), use the Maximum statistic, not Average. If one message has been stuck for 20 minutes while 99% of messages process normally, Average hides it. Maximum catches the stuck message. The billing alarm (alarm 12) only works if you create it in us-east-1 — billing metrics are only published there. Set the threshold at 2× your average monthly spend.
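The SQS and billing alarms could be sketched like this. SqsQueueName, BillingThresholdUsd, and BillingSnsTopicArn are hypothetical parameter names, and the periods are assumptions. The billing alarm belongs in a stack deployed to us-east-1, which is why it takes its own topic parameter here: an alarm generally notifies a topic in its own region. Billing alerts also need to be enabled in the account's billing preferences before EstimatedCharges is published.

```yaml
# SQS alarm: merge into the Step 1 template (SnsTopicArn comes from there).
Parameters:
  SqsQueueName:            # hypothetical parameter name
    Type: String

Resources:
  SqsOldestMessageAge:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${SqsQueueName}-oldest-message-age"
      AlarmDescription: Oldest SQS message has waited more than 600 seconds
      Namespace: AWS/SQS
      MetricName: ApproximateAgeOfOldestMessage
      Dimensions:
        - Name: QueueName
          Value: !Ref SqsQueueName
      Statistic: Maximum               # catches the single stuck message
      Period: 300                      # assumption
      EvaluationPeriods: 1             # assumption
      Threshold: 600
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching   # assumption: an idle queue may report no data
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]
---
# Billing alarm: a separate template, deployed in us-east-1.
Parameters:
  BillingThresholdUsd:     # hypothetical: 2x your average monthly spend, in USD
    Type: Number
  BillingSnsTopicArn:      # hypothetical: an SNS topic in us-east-1
    Type: String

Resources:
  BillingEstimatedCharges:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: estimated-charges-high
      AlarmDescription: Estimated charges above 2x average monthly spend
      Namespace: AWS/Billing
      MetricName: EstimatedCharges
      Dimensions:
        - Name: Currency
          Value: USD
      Statistic: Maximum
      Period: 21600                    # assumption: 6 hours, billing data updates slowly
      EvaluationPeriods: 1
      Threshold: !Ref BillingThresholdUsd
      ComparisonOperator: GreaterThanThreshold
      AlarmActions: [!Ref BillingSnsTopicArn]
```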

Step 5: Audit your existing alarm state

Before adding new alarms, check what you already have. INSUFFICIENT_DATA alarms usually mean the metric is not being reported — the resource was deleted, renamed, or the dimension name is wrong. This command lists them all so you can clean up before adding more.

aws cloudwatch describe-alarms \
  --state-value INSUFFICIENT_DATA \
  --query "MetricAlarms[*].{Name:AlarmName,Namespace:Namespace,Metric:MetricName}" \
  --output table

Step 6: After alarm 2 fires, find the root error

When the 5XX alarm fires, run this query in CloudWatch Logs Insights against your application's log group. Set the time range to cover 30 minutes before the StateChangeTime in the alarm notification. This groups identical error messages so you see the most frequent error first, not 500 lines of the same stack trace.

fields @timestamp, @message
| filter @message like /(?i)(error|exception|failed)/
| stats count() as occurrences by @message
| sort occurrences desc
| limit 25

The most frequent error message is usually the root cause. Five hundred instances of the same NullPointerException is one bug. Two different errors appearing equally often usually indicates a config problem touching multiple code paths.

Four ways teams get this wrong

TreatMissingData: missing on availability alarms

This is the most dangerous misconfiguration. If your ECS service crashes completely and stops publishing metrics, an alarm with TreatMissingData: missing (the default) moves to INSUFFICIENT_DATA instead of ALARM, so its AlarmActions never fire. For any alarm where no data means something is wrong (HealthyHostCount, StatusCheckFailed, any always-on service metric), set TreatMissingData: breaching.

Average instead of p99 for latency

p99 and Average are two statistics of the same TargetResponseTime metric, and they can tell very different stories. A service averaging 200ms with p99 at 8 seconds is making 1% of its requests wait 8 seconds or longer. At 1,000 requests per second, that is 10 slow requests every second. Average will never show this. If you care about user experience at the tail, alarm on p99.

Missing OKActions

When an alarm transitions from ALARM back to OK, do you know? If you only set AlarmActions and not OKActions, you get notified when something breaks but not when it recovers. An engineer shouldn't be debugging something that already fixed itself. Set OKActions to the same SNS topic as AlarmActions.

EvaluationPeriods: 1 on every alarm

EvaluationPeriods: 1 means a single anomalous data point triggers the alarm. For CPU, memory, and latency alarms, use 2 or 3 — the condition needs to be sustained. For HealthyHostCount = 0 and StatusCheckFailed, 1 is appropriate. The failure mode isn't using too many periods — it's applying the same value to every alarm without thinking about what one data point above threshold actually means for that metric.

What ConvOps does differently

Doing this manually is fine. ConvOps does it automatically. Here's what's different: when any of these 12 alarms fires, ConvOps immediately runs the Logs Insights query, correlates the timestamp with recent ECS deployments, and sends a root cause hypothesis to WhatsApp or Slack before you've opened your laptop. You still own the fix. We cut the time between "alarm fires" and "you understand what broke" from 20-40 minutes to under 90 seconds.

Frequently asked questions

How many CloudWatch alarms should a small AWS team have?

Start with 12-15 alarms covering your ALB, ECS service, and RDS instance. The goal isn't comprehensive coverage — it's a small set where every alarm represents something worth acting on. More alarms create more noise; noise creates fatigue; fatigue means real incidents get missed.

What is the right CPU threshold for an ECS CloudWatch alarm?

80% with 3 evaluation periods of 5 minutes each — meaning sustained CPU above 80% for 15 minutes triggers the alarm. Don't set the threshold at 95%: by the time you've sustained 95%, latency has already degraded and you've lost your response window.

Why is my CloudWatch alarm showing INSUFFICIENT_DATA?

INSUFFICIENT_DATA means CloudWatch isn't receiving data for the metric. Common causes: the resource was deleted or renamed (alarm dimension no longer matches), the ECS service has zero running tasks, or the metric was never published. Run `aws cloudwatch describe-alarms --state-value INSUFFICIENT_DATA` to list affected alarms, then verify the resource in the dimension field still exists.

What happens when RDS FreeStorageSpace hits zero?

MySQL and PostgreSQL on RDS stop accepting writes immediately — all INSERT, UPDATE, and DELETE statements return errors. The instance does not automatically expand storage unless you have storage autoscaling enabled. To recover: enable autoscaling in the RDS console, or manually increase allocated storage, which triggers a storage modification and a brief performance impact.

Should I set TreatMissingData to breaching or notBreaching?

Use breaching for alarms where no data means something is wrong: HealthyHostCount, StatusCheckFailed, or any metric from a service that should always be running. Use notBreaching for alarms where a quiet metric is normal — 5XX count at 3am is zero, not missing. Getting this wrong is the most common reason availability alarms don't fire when services go down.
