The 5 CloudWatch alarms most startups accidentally create that are just noise
You got paged 47 times in May. 38 of those resolved before you opened the console. That's not on-call — that's alert fatigue in its final stage, where the alarm has cried wolf so many times that when the real incident happens, the notification sits in the thread for 11 minutes before anyone looks at it.
Most of these alarms were created from documentation examples, Terraform registry modules, or copied from a colleague's working setup. The individual settings look defensible. The problem is almost always two parameters that the AWS console and most guides don't emphasise: DatapointsToAlarm and the choice between absolute counts versus rates.
The 5 noisy alarms at a glance
| Alarm | Noise pattern | Root cause |
|---|---|---|
| ECS/EC2 CPUUtilization > 80% | Fires on every traffic burst, resolves in under 2 minutes | DatapointsToAlarm = 1 — one 60-second data point above threshold is enough |
| Lambda Errors > 0 (or > 1) | Fires on timeouts, cold start failures, and transient downstream errors | Absolute threshold ignores invocation volume — 1 error in 50,000 invocations fires the alarm |
| ALB TargetResponseTime > 500ms (average) | Fires on every deploy as new tasks register and serve their first requests slowly | The Average statistic (a mean, not a percentile) and DatapointsToAlarm = 1 make this alarm fire on expected variance |
| RDS FreeableMemory < 500 MB (static bytes) | Fires every read-heavy period as the buffer pool fills, recovers overnight when load drops | Static byte threshold ignores that RDS actively uses memory for buffer cache |
| SQS ApproximateNumberOfMessagesVisible > N | Fires every time a batch job drops messages into the queue — before any consumer processes them | Queue depth spikes are expected; message age is what signals a real backlog |
1. ECS/EC2 CPUUtilization > 80% with DatapointsToAlarm = 1
This is the most common noisy CloudWatch alarm in startup AWS accounts. ECS tasks spike CPU on every burst of traffic, every garbage collection cycle, every batch of requests above your P50 baseline. With DatapointsToAlarm = 1, a single 60-second data point above 80% fires the alarm — even if CPU drops to 42% on the next sample.
aws cloudwatch describe-alarms \
  --alarm-names "api-service-HighCPUUtilization" \
  --query 'MetricAlarms[0].{Period:Period,EvalPeriods:EvaluationPeriods,DatapointsToAlarm:DatapointsToAlarm,Threshold:Threshold}'
# What you get back on a typical startup alarm:
# {
#   "Period": 60,
#   "EvalPeriods": 1,
#   "DatapointsToAlarm": 1,
#   "Threshold": 80.0
# }
Period 60, EvalPeriods 1, DatapointsToAlarm 1: any 60-second window where average CPU is above 80% fires the alarm. Your ECS service crosses 80% on a busy Tuesday morning and pages you before the load balancer's health checks even notice. The fix is to require sustained CPU pressure, not a single elevated sample.
Resources:
  ECSApiServiceCPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: "api-service-HighCPUUtilization"
      Namespace: AWS/ECS
      MetricName: CPUUtilization
      Dimensions:
        - Name: ServiceName
          Value: api-service
        - Name: ClusterName
          Value: production
      Statistic: Average
      Period: 60
      EvaluationPeriods: 5
      DatapointsToAlarm: 3 # was: 1. Requires 3 of 5 minutes above threshold.
      Threshold: 85
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
2. Lambda Errors > 0 (or > 1) on an absolute count
Lambda Error alarms with a threshold of 0 or 1 are the second most common source of noise. Every Lambda function that calls an external API, queries a database, or reads from S3 will occasionally throw an error — a timeout, a rate limit, a transient DNS failure. At scale, individual errors are expected. A threshold of 0 defines an incident as 'any error ever occurred.'
The deeper problem: Lambda Errors is an absolute count, not a rate. Three errors in 10 invocations is a 30% error rate — a real problem. Three errors in 50,000 invocations is 0.006% — noise. A threshold on the raw count fires for both.
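To make the arithmetic concrete, here is a minimal shell sketch that turns the two counts into a percentage. The function name `error_rate_pct` is hypothetical and the inputs are the counts used above; it only illustrates the calculation and does not query CloudWatch.

```shell
# error_rate_pct ERRORS INVOCATIONS
# Prints the error rate as a percentage with three decimals.
# Hypothetical helper for illustration; not part of any AWS tooling.
error_rate_pct() {
  awk -v e="$1" -v n="$2" 'BEGIN {
    # With zero invocations there is no meaningful rate to report.
    if (n + 0 == 0) { print "n/a"; exit }
    printf "%.3f\n", (e / n) * 100
  }'
}

error_rate_pct 3 10      # prints 30.000 (a real problem)
error_rate_pct 3 50000   # prints 0.006 (noise)
```

Both cases have the same raw count of 3, so an alarm on the count cannot tell them apart; only the ratio can.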
aws cloudwatch describe-alarms \
  --alarm-names "prod-api-processor-Errors" \
  --query 'MetricAlarms[0].{Threshold:Threshold,EvalPeriods:EvaluationPeriods,DatapointsToAlarm:DatapointsToAlarm,Period:Period}'
# Typical noisy config:
# {
#   "Threshold": 0.0,
#   "EvalPeriods": 1,
#   "DatapointsToAlarm": 1,
#   "Period": 60
# }
The fix is metric math: divide Errors by Invocations to get an error rate, then alarm when that rate exceeds 5% for two of three consecutive 5-minute evaluation periods.
Resources:
  LambdaErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: "prod-api-processor-ErrorRate-High"
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 3
      DatapointsToAlarm: 2
      Threshold: 5
      TreatMissingData: notBreaching
      Metrics:
        - Id: error_rate
          Expression: "(errors / invocations) * 100"
          Label: ErrorRatePercent
          ReturnData: true
        - Id: errors
          MetricStat:
            Metric:
              Namespace: AWS/Lambda
              MetricName: Errors
              Dimensions:
                - Name: FunctionName
                  Value: prod-api-order-processor
            Period: 300
            Stat: Sum
          ReturnData: false
        - Id: invocations
          MetricStat:
            Metric:
              Namespace: AWS/Lambda
              MetricName: Invocations
              Dimensions:
                - Name: FunctionName
                  Value: prod-api-order-processor
            Period: 300
            Stat: Sum
          ReturnData: false
3. ALB TargetResponseTime > 500ms on the Average statistic
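ALB TargetResponseTime alarms are easy to misconfigure because the AWS console defaults to the Average statistic. Average is the arithmetic mean, not the median (p50) and not a tail percentile, so an average of 480ms can sit under the threshold while a meaningful share of requests is far slower. More critically, an alarm on average latency fires on every ECS deploy.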
During a deployment, new ECS tasks register behind the ALB and process their first requests with cold JVM startup, cold caches, and fresh connections to RDS. Latency spikes for 2–4 minutes on every deploy. With DatapointsToAlarm = 1 on a 60-second period, your deployment pipeline triggers the alarm before any user has noticed anything.
aws cloudwatch describe-alarms \
  --alarm-names "prod-alb-HighLatency" \
  --query 'MetricAlarms[0].{Statistic:Statistic,Threshold:Threshold,Period:Period,DatapointsToAlarm:DatapointsToAlarm}'
# Noisy config:
# {
#   "Statistic": "Average",
#   "Threshold": 0.5,
#   "Period": 60,
#   "DatapointsToAlarm": 1
# }
Resources:
  ALBLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: "prod-alb-p99Latency-High"
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Dimensions:
        - Name: LoadBalancer
          Value: app/prod-alb/1a2b3c4d5e6f7a8b
      ExtendedStatistic: p99 # was: Statistic: Average (the mean)
      Period: 60
      EvaluationPeriods: 3
      DatapointsToAlarm: 2 # was: 1
      Threshold: 2 # seconds; calibrate to your p99 SLO
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
4. RDS FreeableMemory below a static byte threshold
RDS FreeableMemory drops under normal database load. When the instance is executing queries, it allocates buffer pool memory for frequently-read pages, indexes, and query results. This memory shows as used — it's not leaked, it's a database doing its job efficiently. A static threshold of 500 MB fires every morning when business-day query load starts.
aws cloudwatch describe-alarms \
  --alarm-names "prod-rds-LowFreeableMemory" \
  --query 'MetricAlarms[0].{Threshold:Threshold,EvalPeriods:EvaluationPeriods,DatapointsToAlarm:DatapointsToAlarm}'
# Noisy config:
# {
#   "Threshold": 524288000, # 500 MB in bytes; someone guessed this number
#   "EvalPeriods": 1,
#   "DatapointsToAlarm": 1
# }
The alarm clears overnight when traffic drops and the buffer pool shrinks. You investigate in the morning, see normal memory patterns, and move on. Real memory exhaustion on RDS is FreeableMemory that keeps dropping without recovering, accompanied by rising SwapUsage. Buffer pool pressure that recovers overnight is not an incident.
Resources:
  RDSLowMemoryAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: "prod-rds-LowFreeableMemory"
      Namespace: AWS/RDS
      MetricName: FreeableMemory
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: prod-postgres-primary
      Statistic: Average
      Period: 300 # 5-minute periods, not 60s
      EvaluationPeriods: 5
      DatapointsToAlarm: 3 # was: 1
      # db.t3.medium = 4 GiB total → 10% = 429,496,730 bytes
      # db.r6g.large = 16 GiB total → 10% = 1,717,986,918 bytes
      Threshold: 429496730 # Set to 10% of your instance's total RAM
      ComparisonOperator: LessThanThreshold
      TreatMissingData: notBreaching
5. SQS ApproximateNumberOfMessagesVisible above a static count
ApproximateNumberOfMessagesVisible counts messages currently waiting for consumers. This number spikes whenever a producer sends messages — regardless of whether consumers are running. A batch job that drops 500 messages into the queue will trigger an alarm at threshold = 10 before a single consumer has polled.
The metric this alarm should use is ApproximateAgeOfOldestMessage: how long the oldest message has been waiting. If consumers are processing, the oldest message won't survive long. If the age is growing past your processing SLA, consumers have stopped — and that's an actual incident.
aws cloudwatch describe-alarms \
  --alarm-names "prod-egress-queue-Depth" \
  --query 'MetricAlarms[0].{MetricName:MetricName,Threshold:Threshold,DatapointsToAlarm:DatapointsToAlarm}'
# Noisy config:
# {
#   "MetricName": "ApproximateNumberOfMessagesVisible",
#   "Threshold": 10.0,
#   "DatapointsToAlarm": 1
# }
Resources:
  SQSOldestMessageAgeAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: "prod-egress-queue-OldestMessageAge-High"
      Namespace: AWS/SQS
      MetricName: ApproximateAgeOfOldestMessage # was: ApproximateNumberOfMessagesVisible
      Dimensions:
        - Name: QueueName
          Value: prod-egress-queue
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 2
      DatapointsToAlarm: 2
      Threshold: 300 # seconds; calibrate to your processing SLA
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
The common thread: DatapointsToAlarm = 1 and the wrong metric
Every alarm above has the same structural failure: it fires immediately on a single data point, or it measures a metric that spikes legitimately during normal operations. Real production problems persist. A CPU spike that resolves in 90 seconds was a traffic burst that auto-scaled away. A Lambda error that doesn't recur is noise. Latency during a deploy is expected.
The two-question test for any alarm you're about to create: first, what does sustained anomalous behaviour actually look like for this metric? Second, does my DatapointsToAlarm setting require evidence of that sustained behaviour before firing? If the answer to the second question is 'no — one data point is enough,' you've just scheduled a 3am page for something that will have resolved before you finish reading the notification.
Frequently asked questions
What are noisy CloudWatch alarms?
Noisy CloudWatch alarms fire frequently on conditions that don't represent real incidents — CPU spikes that self-heal, Lambda errors that are expected at scale, latency during normal deployments. The term 'noisy' refers to a high ratio of false positive alerts to real incidents, which causes alert fatigue: engineers start ignoring or silencing alarms, eventually including the real ones.
What is DatapointsToAlarm in CloudWatch and why does it matter?
DatapointsToAlarm is the number of data points within the evaluation window (the most recent EvaluationPeriods data points) that must breach the threshold before the alarm transitions to ALARM state. If EvaluationPeriods is 5 and DatapointsToAlarm is 1, a single data point above threshold fires the alarm, even if the other 4 are normal. Setting DatapointsToAlarm to 3 with EvaluationPeriods at 5 requires sustained anomalous behaviour before alerting, filtering out transient spikes. If DatapointsToAlarm is omitted, it defaults to EvaluationPeriods, so every data point in the window must breach.
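The M-of-N rule can be sketched in a few lines of shell. This is an illustration of the rule as described here, not CloudWatch's actual evaluation logic (which also has to deal with missing data); the function name `m_of_n_state` and the CPU samples are made up for the example.

```shell
# m_of_n_state THRESHOLD M N DATAPOINT...
# Prints ALARM if at least M of the most recent N data points are above
# THRESHOLD, otherwise OK. Hypothetical helper for illustration only.
m_of_n_state() {
  threshold=$1; m=$2; n=$3
  shift 3
  printf '%s\n' "$@" | awk -v t="$threshold" -v m="$m" -v n="$n" '
    { v[NR] = $1 }                       # collect data points in order
    END {
      # Only the last N points form the evaluation window.
      start = (NR > n + 0) ? NR - n + 1 : 1
      breaching = 0
      for (i = start; i <= NR; i++)
        if (v[i] + 0 > t + 0) breaching++
      if (breaching >= m + 0) print "ALARM"; else print "OK"
    }'
}

# One spike to 91% in five one-minute CPU samples:
m_of_n_state 80 1 5 42 45 91 44 43   # 1 of 5: prints ALARM
m_of_n_state 80 3 5 42 45 91 44 43   # 3 of 5: prints OK
# Sustained pressure, three samples above 80%:
m_of_n_state 80 3 5 42 88 91 86 43   # 3 of 5: prints ALARM
```

The same five samples produce a page or silence depending only on M, which is why the setting deserves as much attention as the threshold itself.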
How do I stop my Lambda error alarm from firing constantly?
Switch from an absolute error count threshold to an error rate using metric math (Errors / Invocations × 100). Set the threshold at 5% (or your acceptable error rate), with EvaluationPeriods = 3 and DatapointsToAlarm = 2. This catches sustained Lambda failures without triggering on occasional transient errors that every function experiences at scale.
Why does my RDS FreeableMemory alarm fire during the day and clear overnight?
RDS actively uses available memory for buffer pool cache — storing frequently-read index pages, query results, and data blocks. During business-hours query load, the buffer pool fills and FreeableMemory drops. Overnight when traffic subsides, the buffer pool shrinks and memory is returned. Fix: set the threshold to 10% of your instance's total RAM with DatapointsToAlarm = 3 and a 5-minute period, and add a separate SwapUsage > 0 alarm for real memory exhaustion.
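For the 10% figure, a small shell sketch can derive the byte threshold from the instance's RAM. The helper name `ten_percent_bytes` is hypothetical; the memory sizes are the ones quoted in this article, so check them against your own instance class before using the result.

```shell
# ten_percent_bytes RAM_GIB
# Prints 10% of the given RAM size in bytes, rounded to the nearest byte.
# Hypothetical helper; RAM_GIB is the instance's total memory in GiB.
ten_percent_bytes() {
  awk -v gib="$1" 'BEGIN { printf "%.0f\n", gib * 1024 * 1024 * 1024 / 10 }'
}

ten_percent_bytes 4    # db.t3.medium, 4 GiB: prints 429496730
ten_percent_bytes 16   # db.r6g.large, 16 GiB: prints 1717986918
```

The output is the value to put in the alarm's Threshold field, which CloudWatch compares against FreeableMemory in bytes.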
What CloudWatch metric should I use for SQS monitoring instead of message count?
Use ApproximateAgeOfOldestMessage instead of ApproximateNumberOfMessagesVisible. Message count spikes whenever producers send messages — before consumers have had a chance to process them. Message age grows when consumers are stuck or crashed. An alarm on age > 300 seconds for a queue that should process in under 60 seconds signals a real consumer failure without firing every time a batch job runs.
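A toy simulation shows why the two metrics behave so differently. It assumes a single batch of 500 messages arriving at minute 0, a depth threshold of 10, and an age threshold of 300 seconds, as in the examples in this article; the drain rate of 120 messages per minute and the function name `simulate` are hypothetical. It checks each minute as a single data point, which is simpler than a real alarm's evaluation window.

```shell
# simulate RATE_PER_MIN
# Prints two fields: the first minute a depth>10 check would breach and the
# first minute an age>300s check would breach ("never" if it does not).
# Toy model for illustration; all numbers are assumptions.
simulate() {
  rate=$1
  batch=500; depth_limit=10; age_limit=300
  depth_fired=never; age_fired=never
  t=0
  while [ "$t" -le 10 ]; do
    depth=$(( batch - rate * t ))
    if [ "$depth" -lt 0 ]; then depth=0; fi
    # The oldest message arrived at minute 0, so it is t minutes old
    # for as long as any backlog remains.
    if [ "$depth" -gt 0 ]; then age=$(( t * 60 )); else age=0; fi
    if [ "$depth_fired" = never ] && [ "$depth" -gt "$depth_limit" ]; then
      depth_fired=$t
    fi
    if [ "$age_fired" = never ] && [ "$age" -gt "$age_limit" ]; then
      age_fired=$t
    fi
    t=$(( t + 1 ))
  done
  echo "$depth_fired $age_fired"
}

simulate 120   # healthy consumers: prints "0 never"
simulate 0     # stalled consumers: prints "0 6"
```

With healthy consumers the depth check breaches at minute 0 while the age check never does; with stalled consumers the age check breaches once the oldest message passes 300 seconds.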