{ ConvOps }
© 2026 ConvOps. All rights reserved.


The 12 CloudWatch alarms every small AWS team should have

May 20, 2026 · 9 min read

It's 2:19am. Your RDS database stopped accepting writes 47 minutes ago. FreeStorageSpace hit zero at 1:32am. Every insert since then returned a read-only error. Users started seeing failures at 1:33am. You find out at 8:47am from a customer email. You had CPU alarms. You had Lambda error alarms. You had no disk space alarm. This is the 12-alarm list I run on every AWS account I care about.

Why most teams either over-alarm or under-alarm

The reflex answer is "monitor everything." AWS docs list 60+ metrics across ECS, EC2, RDS, Lambda, and ALB. A typical starter alarm guide will suggest 20-30 alarms. That's not wrong — but it misses the operational reality of a 5-person team where the same engineer who writes code is also on call.

When every alarm feels equally urgent, none of them are. I've watched on-call engineers silence their phones after three false positives in a week. The team then finds out about real incidents from users. The goal isn't comprehensive coverage. It's a small set of alarms where every trigger represents something worth waking up for — or at minimum, something worth investigating that day. These 12 cover the failure modes that actually take services down.

The 12 alarms at a glance

#  | Metric                        | Namespace          | Threshold                | Statistic | Severity
1  | HealthyHostCount              | AWS/ApplicationELB | ≤ 0                      | Minimum   | CRITICAL
2  | HTTPCode_Target_5XX_Count     | AWS/ApplicationELB | > 10/min                 | Sum       | WARN
3  | TargetResponseTime            | AWS/ApplicationELB | > 2s                     | p99       | WARN
4  | CPUUtilization                | AWS/ECS            | > 80%                    | Average   | WARN
5  | MemoryUtilization             | AWS/ECS            | > 85%                    | Average   | WARN
6  | FreeStorageSpace              | AWS/RDS            | < 5 GB                   | Average   | WARN
7  | FreeStorageSpace              | AWS/RDS            | < 1 GB                   | Average   | CRITICAL
8  | DatabaseConnections           | AWS/RDS            | > 80% of max_connections | Average   | WARN
9  | Errors                        | AWS/Lambda         | > 5/min                  | Sum       | WARN
10 | StatusCheckFailed             | AWS/EC2            | > 0                      | Maximum   | CRITICAL
11 | ApproximateAgeOfOldestMessage | AWS/SQS            | > 600s                   | Maximum   | WARN
12 | EstimatedCharges              | AWS/Billing        | > 2× monthly avg         | Maximum   | WARN

Setting them up, one group at a time

Step 1: Availability — these page you immediately

Alarm 1 is the most important on this list. HealthyHostCount ≤ 0 means your ALB has no healthy targets, so users are almost certainly getting 5XX errors on every request. Set TreatMissingData to "breaching." The ALB reports this metric only while the target group has registered targets, so if every ECS task crashes and is deregistered, the data points simply stop. You want the alarm to fire in that case, not stay in OK state. One evaluation period of 60 seconds is enough. Don't wait 5 minutes to confirm you're down.

Alarm 2 catches application-level failures: 5XX errors reaching users. The threshold of 10 per minute with 2 evaluation periods (2 minutes sustained) filters out transient errors while catching real breakage. If your traffic is low — under 50 requests per minute — drop the threshold to > 2.

Decision rule: if HealthyHostCount = 0 AND 5XX count is high, the service is completely unavailable, so check ECS task state and ALB target registration. If HealthyHostCount > 0 AND 5XX count is high, the infrastructure is running but the application is broken, so open your application logs.

Both alarms as a CloudFormation template:

Parameters:
  AlbFullName:
    Type: String
    Description: ALB full name from the ARN — everything after "loadbalancer/"
  TargetGroupFullName:
    Type: String
    Description: Target group full name from the ARN — everything after "targetgroup/"
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  AlbNoHealthyHosts:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AlbFullName}-no-healthy-hosts"
      AlarmDescription: ALB healthy host count is zero - service is returning 503 to all users
      Namespace: AWS/ApplicationELB
      MetricName: HealthyHostCount
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbFullName
        - Name: TargetGroup
          Value: !Ref TargetGroupFullName
      Statistic: Minimum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: LessThanOrEqualToThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  Alb5xxErrors:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AlbFullName}-5xx-errors"
      AlarmDescription: Application 5XX errors above 10/min for 2 consecutive minutes
      Namespace: AWS/ApplicationELB
      MetricName: HTTPCode_Target_5XX_Count
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbFullName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

Step 2: Resource pressure — these give you lead time

Alarms 3-5 give you a warning window before things break. CPU at 80% sustained for 15 minutes (3 × 5-minute periods) gives you time to scale out before latency degrades. Setting the threshold at 95% is too late — by then latency has already spiked.

For TargetResponseTime (alarm 3), use the p99 statistic, not Average. A service averaging 180ms with p99 at 6 seconds is making 1% of requests wait 6 seconds or longer. At 1,000 requests per second that is about 10 slow requests every second, and Average hides this entirely. Memory at 85% gives you a 10-15 minute window before ECS starts killing tasks with exit code 137.
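
Alarms 3-5 could be written along the same lines as the two availability alarms. This is a sketch rather than a tested template: the logical IDs, the EcsClusterName and EcsServiceName parameters, and the evaluation windows for latency and memory are assumptions (the text only fixes the 15-minute CPU window). Percentile statistics go in ExtendedStatistic rather than Statistic.

```yaml
# Sketch only: parameter names, logical IDs, and the latency/memory
# evaluation windows are assumptions, not values from the article.
Parameters:
  AlbFullName:
    Type: String
  EcsClusterName:
    Type: String
  EcsServiceName:
    Type: String
  SnsTopicArn:
    Type: String

Resources:
  AlbP99Latency:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: p99 target response time above 2 seconds
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbFullName
      ExtendedStatistic: p99      # percentiles are not valid in Statistic
      Period: 60
      EvaluationPeriods: 3        # assumed: 3 minutes sustained
      Threshold: 2                # TargetResponseTime is reported in seconds
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  EcsCpuHigh:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: ECS service CPU above 80 percent for 15 minutes
      Namespace: AWS/ECS
      MetricName: CPUUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref EcsClusterName
        - Name: ServiceName
          Value: !Ref EcsServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3        # 3 x 5-minute periods, as described above
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  EcsMemoryHigh:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: ECS service memory above 85 percent
      Namespace: AWS/ECS
      MetricName: MemoryUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref EcsClusterName
        - Name: ServiceName
          Value: !Ref EcsServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2        # assumed: 10 minutes sustained
      Threshold: 85
      ComparisonOperator: GreaterThanThreshold
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]
```

The latency alarm here uses only the LoadBalancer dimension, so it covers every target group behind the ALB. Add a TargetGroup dimension if you want one alarm per service.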

Step 3: Database — the silent killers

Alarms 6 and 7 are the same metric at two severity levels. FreeStorageSpace is the silent killer because MySQL and PostgreSQL on RDS stop accepting writes the moment disk is full — no graceful degradation, just immediate failure on every INSERT. The threshold value is in bytes: 5 GB = 5,368,709,120 bytes, 1 GB = 1,073,741,824 bytes.

FreeStorageSpace is measured in bytes in CloudWatch, not gigabytes. A threshold of 5000 will not give you a 5 GB warning; it will alarm only when fewer than 5,000 bytes (about 5 KB) remain. Use 5368709120 for 5 GB and 1073741824 for 1 GB.

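Alarm 8 (DatabaseConnections) threshold depends on your instance class. Approximate max_connections defaults for common types: db.t3.micro = 87 (threshold: 69), db.t3.medium = 341 (threshold: 272), db.t3.large = 648 (threshold: 518), db.r5.large = 1365 (threshold: 1092). The default is derived from instance memory and differs by engine and parameter group, so confirm the real value on your instance (SHOW max_connections in PostgreSQL, SHOW VARIABLES LIKE 'max_connections' in MySQL) before setting the threshold. Once max_connections is exhausted, new connection attempts fail immediately, with no queuing.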
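
A possible CloudFormation version of alarms 6-8 is sketched here. The parameter names, logical IDs, periods, and evaluation counts are assumptions. CloudFormation cannot compute 80% of max_connections for you, so the connection threshold is passed in as a number; the default of 272 is the db.t3.medium figure from the list above.

```yaml
# Sketch only: parameter names, logical IDs, periods, and evaluation
# counts are assumptions. Storage thresholds are in bytes.
Parameters:
  DbInstanceId:
    Type: String
    Description: RDS DB instance identifier
  ConnectionsThreshold:
    Type: Number
    Default: 272                  # 80% of 341 (db.t3.medium in the list above)
  SnsTopicArn:
    Type: String

Resources:
  RdsStorageWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: RDS free storage below 5 GB
      Namespace: AWS/RDS
      MetricName: FreeStorageSpace
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 1
      Threshold: 5368709120       # 5 GB expressed in bytes
      ComparisonOperator: LessThanThreshold
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  RdsStorageCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: RDS free storage below 1 GB - writes will fail soon
      Namespace: AWS/RDS
      MetricName: FreeStorageSpace
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 1
      Threshold: 1073741824       # 1 GB expressed in bytes
      ComparisonOperator: LessThanThreshold
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  RdsConnectionsHigh:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: RDS connections above 80 percent of max_connections
      Namespace: AWS/RDS
      MetricName: DatabaseConnections
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2        # assumed: 10 minutes sustained
      Threshold: !Ref ConnectionsThreshold
      ComparisonOperator: GreaterThanThreshold
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]
```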

Step 4: Lambda, EC2, SQS, and billing

Lambda Errors at > 5 per minute is deliberately higher than zero. Almost every busy Lambda function generates some transient errors, such as cold start timeouts and failed calls to rate-limited dependencies. Alarming at > 0 creates noise. At > 5 per minute sustained for 2 minutes, something is actually broken.
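
The Lambda alarm, together with the EC2 status check from the table (alarm 10), might look like the sketch below. The FunctionName and InstanceId parameters and the logical IDs are assumptions. The EC2 alarm follows the article's own guidance for always-on metrics: one evaluation period and TreatMissingData set to breaching.

```yaml
# Sketch only: parameter names and logical IDs are assumptions.
Parameters:
  FunctionName:
    Type: String
  InstanceId:
    Type: AWS::EC2::Instance::Id
  SnsTopicArn:
    Type: String

Resources:
  LambdaErrors:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Lambda errors above 5 per minute for 2 consecutive minutes
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching   # no invocations means no data, not a failure
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  Ec2StatusCheckFailed:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: EC2 instance or system status check failed
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: breaching      # a silent instance is treated as a failing one
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]
```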

For SQS (alarm 11), use the Maximum statistic, not Average. If one message has been stuck for 20 minutes while 99% of messages process normally, Average hides it. Maximum catches the stuck message. The billing alarm (alarm 12) only works if you create it in us-east-1, because billing metrics are published only there, and only after billing alerts are enabled in the account's billing preferences. Set the threshold at 2× your average monthly spend.
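
A sketch of alarms 11 and 12 follows. The parameter names, logical IDs, and periods are assumptions, and the billing threshold is supplied as a plain number because CloudFormation cannot derive it from your spend history. Since the billing alarm has to live in us-east-1, this template would need to be deployed there, or the billing alarm moved to its own stack in that region.

```yaml
# Sketch only: parameter names, logical IDs, and periods are assumptions.
# The billing alarm must be created in us-east-1.
Parameters:
  QueueName:
    Type: String
  BillingThresholdUsd:
    Type: Number
    Description: Twice your average monthly spend, in USD
  SnsTopicArn:
    Type: String

Resources:
  SqsOldestMessageAge:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Oldest SQS message has waited more than 600 seconds
      Namespace: AWS/SQS
      MetricName: ApproximateAgeOfOldestMessage
      Dimensions:
        - Name: QueueName
          Value: !Ref QueueName
      Statistic: Maximum          # Maximum surfaces a single stuck message
      Period: 300
      EvaluationPeriods: 1
      Threshold: 600
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  BillingEstimatedCharges:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Estimated charges above twice the average monthly spend
      Namespace: AWS/Billing
      MetricName: EstimatedCharges
      Dimensions:
        - Name: Currency
          Value: USD
      Statistic: Maximum
      Period: 21600               # 6 hours; billing data arrives only a few times a day
      EvaluationPeriods: 1
      Threshold: !Ref BillingThresholdUsd
      ComparisonOperator: GreaterThanThreshold
      AlarmActions: [!Ref SnsTopicArn]
```

EstimatedCharges is a month-to-date total that resets each month, so this alarm trips once cumulative spend for the current month passes the threshold.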

Step 5: Audit your existing alarm state

Before adding new alarms, check what you already have. INSUFFICIENT_DATA alarms usually mean the metric is not being reported — the resource was deleted, renamed, or the dimension name is wrong. This command lists them all so you can clean up before adding more.

aws cloudwatch describe-alarms \
  --state-value INSUFFICIENT_DATA \
  --query "MetricAlarms[*].{Name:AlarmName,Namespace:Namespace,Metric:MetricName}" \
  --output table

Step 6: After alarm 2 fires, find the root error

When the 5XX alarm fires, run this query in CloudWatch Logs Insights against your application's log group. Set the time range to cover 30 minutes before the StateChangeTime in the alarm notification. This groups identical error messages so you see the most frequent error first, not 500 lines of the same stack trace.

fields @timestamp, @message
| filter @message like /(?i)(error|exception|failed)/
| stats count() as occurrences by @message
| sort occurrences desc
| limit 25

The most frequent error message is usually the root cause. Five hundred instances of the same NullPointerException is one bug. Two different errors appearing equally often usually indicates a config problem touching multiple code paths.

Four ways teams get this wrong

TreatMissingData: missing on availability alarms

This is the most dangerous misconfiguration. If your ECS service crashes completely and stops publishing metrics, an alarm with TreatMissingData: missing (the default) does not go to ALARM. It drifts to INSUFFICIENT_DATA instead, and AlarmActions never fire. For any alarm where no data means something is wrong (HealthyHostCount, StatusCheckFailed, any always-on service metric), set TreatMissingData: breaching.

Average instead of p99 for latency

p99 and Average are two statistics of the same TargetResponseTime metric, and they tell very different stories. A service averaging 200ms with p99 at 8 seconds is making 1% of requests wait 8 seconds or more. At 1,000 requests per second that is about 10 requests every second. Average will never show this. If you care about user experience at the tail, alarm on p99.

Missing OKActions

When an alarm transitions from ALARM back to OK, do you know? If you only set AlarmActions and not OKActions, you get notified when something breaks but not when it recovers. An engineer shouldn't be debugging something that already fixed itself. Set OKActions to the same SNS topic as AlarmActions.

EvaluationPeriods: 1 on every alarm

EvaluationPeriods: 1 means a single anomalous data point triggers the alarm. For CPU, memory, and latency alarms, use 2 or 3 — the condition needs to be sustained. For HealthyHostCount = 0 and StatusCheckFailed, 1 is appropriate. The failure mode isn't using too many periods — it's applying the same value to every alarm without thinking about what one data point above threshold actually means for that metric.

What ConvOps does differently

Doing this manually is fine. ConvOps does it automatically. Here's what's different: when any of these 12 alarms fires, ConvOps immediately runs the Logs Insights query, correlates the timestamp with recent ECS deployments, and sends a root cause hypothesis to WhatsApp or Slack before you've opened your laptop. You still own the fix. We cut the time between "alarm fires" and "you understand what broke" from 20-40 minutes to under 90 seconds.

Related reading

  • → The Complete AWS CloudWatch Alarm Setup Guide
  • → How to find root cause in AWS CloudWatch alerts without an SRE team
  • → MTTR under 5 minutes: what actually moves the needle for small engineering teams

Frequently asked questions

How many CloudWatch alarms should a small AWS team have?

Start with 12-15 alarms covering your ALB, ECS service, and RDS instance. The goal isn't comprehensive coverage — it's a small set where every alarm represents something worth acting on. More alarms create more noise; noise creates fatigue; fatigue means real incidents get missed.

What is the right CPU threshold for an ECS CloudWatch alarm?

80% with 3 evaluation periods of 5 minutes each — meaning sustained CPU above 80% for 15 minutes triggers the alarm. Don't set the threshold at 95%: by the time you've sustained 95%, latency has already degraded and you've lost your response window.

Why is my CloudWatch alarm showing INSUFFICIENT_DATA?

INSUFFICIENT_DATA means CloudWatch isn't receiving data for the metric. Common causes: the resource was deleted or renamed (alarm dimension no longer matches), the ECS service has zero running tasks, or the metric was never published. Run `aws cloudwatch describe-alarms --state-value INSUFFICIENT_DATA` to list affected alarms, then verify the resource in the dimension field still exists.

What happens when RDS FreeStorageSpace hits zero?

MySQL and PostgreSQL on RDS stop accepting writes immediately — all INSERT, UPDATE, and DELETE statements return errors. The instance does not automatically expand storage unless you have storage autoscaling enabled. To recover: enable autoscaling in the RDS console, or manually increase allocated storage, which triggers a storage modification and a brief performance impact.

Should I set TreatMissingData to breaching or notBreaching?

Use breaching for alarms where no data means something is wrong: HealthyHostCount, StatusCheckFailed, or any metric from a service that should always be running. Use notBreaching for alarms where a quiet metric is normal: the ALB publishes no 5XX data points at all when there are no errors, so a silent 5XX metric at 3am means healthy, not broken. Getting this wrong is the most common reason availability alarms don't fire when services go down.

