The 5 CloudWatch alarms most startups accidentally create that are just noise

June 3, 2026 · 9 min read

You got paged 47 times in May. 38 of those resolved before you opened the console. That's not on-call — that's alert fatigue in its final stage, where the alarm has cried wolf so many times that when the real incident happens, the notification sits in the thread for 11 minutes before anyone looks at it.

Most of these alarms were created from documentation examples, Terraform registry modules, or copied from a colleague's working setup. The individual settings look defensible. The problem is almost always two parameters that the AWS console and most guides don't emphasise: DatapointsToAlarm and the choice between absolute counts versus rates.

The 5 noisy alarms at a glance

Alarm	Noise pattern	Root cause
ECS/EC2 CPUUtilization > 80%	Fires on every traffic burst, resolves in under 2 minutes	DatapointsToAlarm = 1 — one 60-second data point above threshold is enough
Lambda Errors > 0 (or > 1)	Fires on timeouts, cold start failures, and transient downstream errors	Absolute threshold ignores invocation volume — 1 error in 50,000 invocations fires the alarm
ALB TargetResponseTime > 500ms (average)	Fires on every deploy as new tasks register and serve their first requests slowly	Average statistic and DatapointsToAlarm = 1 make this alarm fire on expected variance
RDS FreeableMemory < 500 MB (static bytes)	Fires every read-heavy period as the buffer pool fills, recovers overnight when load drops	Static byte threshold ignores that RDS actively uses memory for buffer cache
SQS ApproximateNumberOfMessagesVisible > N	Fires every time a batch job drops messages into the queue — before any consumer processes them	Queue depth spikes are expected; message age is what signals a real backlog

1. ECS/EC2 CPUUtilization > 80% with DatapointsToAlarm = 1

This is the most common noisy CloudWatch alarm in startup AWS accounts. ECS tasks spike CPU on every burst of traffic, every garbage collection cycle, every batch of requests above your P50 baseline. With DatapointsToAlarm = 1, a single 60-second data point above 80% fires the alarm — even if CPU drops to 42% on the next sample.

aws cloudwatch describe-alarms \
  --alarm-names "api-service-HighCPUUtilization" \
  --query 'MetricAlarms[0].{Period:Period,EvalPeriods:EvaluationPeriods,DatapointsToAlarm:DatapointsToAlarm,Threshold:Threshold}'

# What you get back on a typical startup alarm:
# {
#   "Period": 60,
#   "EvalPeriods": 1,
#   "DatapointsToAlarm": 1,
#   "Threshold": 80.0
# }

Period 60, EvalPeriods 1, DatapointsToAlarm 1: any 60-second window where average CPU is above 80% fires the alarm. Your ECS service crosses 80% on a busy Tuesday morning and pages you before the load balancer's health checks even notice. The fix is to require sustained CPU pressure, not a single elevated sample.

Resources:
  ECSApiServiceCPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: "api-service-HighCPUUtilization"
      Namespace: AWS/ECS
      MetricName: CPUUtilization
      Dimensions:
        - Name: ServiceName
          Value: api-service
        - Name: ClusterName
          Value: production
      Statistic: Average
      Period: 60
      EvaluationPeriods: 5
      DatapointsToAlarm: 3     # was: 1. Requires 3 of 5 minutes above threshold.
      Threshold: 85            # was: 80
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching

Rule: ECS and EC2 CPU alarms should require 3 of 5 evaluation periods above threshold — 5 minutes of evidence, not 60 seconds. A traffic burst that self-heals in 90 seconds will not trigger this alarm. Real CPU exhaustion will.
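To make the difference concrete, here is a minimal Python sketch of M-of-N evaluation. It is a simplified model rather than CloudWatch's actual evaluator: it ignores missing data and the INSUFFICIENT_DATA state, and the function name and the CPU series are made up for illustration.

```python
from typing import List, Sequence


def alarm_states(
    datapoints: Sequence[float],
    threshold: float,
    evaluation_periods: int,
    datapoints_to_alarm: int,
) -> List[bool]:
    """Return, per data point, whether an M-of-N alarm would be in ALARM.

    Simplified model: look at the most recent `evaluation_periods` points
    and count how many are strictly above `threshold`.
    """
    states = []
    for i in range(len(datapoints)):
        window = datapoints[max(0, i - evaluation_periods + 1): i + 1]
        breaching = sum(1 for value in window if value > threshold)
        states.append(breaching >= datapoints_to_alarm)
    return states


# Hypothetical one-minute average CPU (%): a two-minute burst, then normal.
burst = [42, 45, 91, 88, 47, 44, 43, 41]
# Hypothetical sustained pressure: CPU stays high for several minutes.
sustained = [42, 45, 91, 93, 95, 92, 94, 96]

noisy = alarm_states(burst, threshold=80, evaluation_periods=1, datapoints_to_alarm=1)
tuned = alarm_states(burst, threshold=85, evaluation_periods=5, datapoints_to_alarm=3)
tuned_sustained = alarm_states(
    sustained, threshold=85, evaluation_periods=5, datapoints_to_alarm=3
)

print("1 of 1 fires on the burst:    ", any(noisy))
print("3 of 5 fires on the burst:    ", any(tuned))
print("3 of 5 fires when sustained:  ", any(tuned_sustained))
```

In this model the 1-of-1 alarm fires on the burst, the 3-of-5 alarm stays quiet through it, and the 3-of-5 alarm still fires on the third consecutive high minute of the sustained series.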

2. Lambda Errors > 0 (or > 1) on an absolute count

Lambda Error alarms with a threshold of 0 or 1 are the second most common source of noise. Every Lambda function that calls an external API, queries a database, or reads from S3 will occasionally throw an error — a timeout, a rate limit, a transient DNS failure. At scale, individual errors are expected. A threshold of 0 defines an incident as 'any error ever occurred.'

The deeper problem: Lambda Errors is an absolute count, not a rate. Three errors in 10 invocations is a 30% error rate — a real problem. Three errors in 50,000 invocations is 0.006% — noise. A threshold on the raw count fires for both.

aws cloudwatch describe-alarms \
  --alarm-names "prod-api-processor-Errors" \
  --query 'MetricAlarms[0].{Threshold:Threshold,EvalPeriods:EvaluationPeriods,DatapointsToAlarm:DatapointsToAlarm,Period:Period}'

# Typical noisy config:
# {
#   "Threshold": 0.0,
#   "EvalPeriods": 1,
#   "DatapointsToAlarm": 1,
#   "Period": 60
# }

The fix is metric math: divide Errors by Invocations to get an error rate, then alarm when that rate exceeds 5% for two of three consecutive 5-minute evaluation periods.

Resources:
  LambdaErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: "prod-api-processor-ErrorRate-High"
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 3
      DatapointsToAlarm: 2
      Threshold: 5
      TreatMissingData: notBreaching
      Metrics:
        - Id: error_rate
          Expression: "(errors / invocations) * 100"
          Label: ErrorRatePercent
          ReturnData: true
        - Id: errors
          MetricStat:
            Metric:
              Namespace: AWS/Lambda
              MetricName: Errors
              Dimensions:
                - Name: FunctionName
                  Value: prod-api-order-processor
            Period: 300
            Stat: Sum
          ReturnData: false
        - Id: invocations
          MetricStat:
            Metric:
              Namespace: AWS/Lambda
              MetricName: Invocations
              Dimensions:
                - Name: FunctionName
                  Value: prod-api-order-processor
            Period: 300
            Stat: Sum
          ReturnData: false

Switch from absolute error counts to error rate. This alarm fires when the error rate exceeds 5% for 2 of 3 five-minute periods — catching real Lambda failure modes without waking you up for the occasional transient timeout.
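The arithmetic behind that expression is easy to check by hand. The sketch below mirrors the (errors / invocations) * 100 expression in plain Python; the helper names are hypothetical, and returning None for zero invocations is an assumption of this sketch (it stands in for "no data point in this period"), not a statement about how CloudWatch evaluates the expression.

```python
from typing import Optional


def error_rate_percent(errors: int, invocations: int) -> Optional[float]:
    """Error rate as a percentage, or None when nothing was invoked."""
    if invocations == 0:
        return None  # assumption: treat as "no data point" for this period
    return 100 * errors / invocations


def breaches(errors: int, invocations: int, threshold_percent: float = 5.0) -> bool:
    """True when the period's error rate is above the alarm threshold."""
    rate = error_rate_percent(errors, invocations)
    return rate is not None and rate > threshold_percent


low_volume = error_rate_percent(3, 10)        # three errors in 10 invocations
high_volume = error_rate_percent(3, 50_000)   # three errors in 50,000 invocations

print("3 / 10     ->", low_volume, "% breach:", breaches(3, 10))
print("3 / 50,000 ->", high_volume, "% breach:", breaches(3, 50_000))
print("0 / 0      ->", error_rate_percent(0, 0), "breach:", breaches(0, 0))
```

The same three errors breach the 5% threshold at low volume and are nowhere near it at high volume, which is the distinction a raw count cannot make.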

3. ALB TargetResponseTime > 500ms on average latency

ALB TargetResponseTime alarms are easy to misconfigure because the AWS console defaults to the Average statistic. Average is the arithmetic mean, not the median (p50), and it hides tail latency: an average of 480ms says little about how slow your slowest requests are. More critically, an average-latency alarm with a tight threshold fires on every ECS deploy.

During a deployment, new ECS tasks register behind the ALB and process their first requests with cold JVM startup, cold caches, and fresh connections to RDS. Latency spikes for 2–4 minutes on every deploy. With DatapointsToAlarm = 1 on a 60-second period, your deployment pipeline triggers the alarm before any user has noticed anything.

aws cloudwatch describe-alarms \
  --alarm-names "prod-alb-HighLatency" \
  --query 'MetricAlarms[0].{Statistic:Statistic,Threshold:Threshold,Period:Period,DatapointsToAlarm:DatapointsToAlarm}'

# Noisy config:
# {
#   "Statistic": "Average",
#   "Threshold": 0.5,
#   "Period": 60,
#   "DatapointsToAlarm": 1
# }

Resources:
  ALBLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: "prod-alb-p99Latency-High"
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Dimensions:
        - Name: LoadBalancer
          Value: app/prod-alb/1a2b3c4d5e6f7a8b
      ExtendedStatistic: p99          # was: Statistic: Average
      Period: 60
      EvaluationPeriods: 3
      DatapointsToAlarm: 2            # was: 1
      Threshold: 2                    # seconds — calibrate to your p99 SLO
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching

Use p99, not average, for latency alarms. Set the threshold at your SLO boundary — if your p99 SLO is 2 seconds, alarm at 2. Average latency of 480ms can coexist with a p99 of 4 seconds that is actively failing 1% of your users.
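A small worked example shows how the mean, the median, and p99 can diverge on the same data. The latency sample below is invented for illustration, and Python's statistics.quantiles is used as a stand-in: CloudWatch's own percentile calculation may interpolate differently.

```python
import statistics

# Hypothetical sample: 100 request latencies (seconds) in one period.
# 95 requests are fast, 5 are very slow.
latencies = [0.30] * 95 + [4.0] * 5

mean = statistics.fmean(latencies)
median = statistics.median(latencies)
# quantiles(n=100) returns the 99 cut points p1..p99; index 98 is p99.
p99 = statistics.quantiles(latencies, n=100, method="inclusive")[98]

print(f"mean   = {mean:.3f} s")
print(f"median = {median:.3f} s")
print(f"p99    = {p99:.3f} s")

# An Average > 0.5 alarm stays quiet while p99 is twice a 2-second SLO.
print("average alarm (> 0.5 s) breaches:", mean > 0.5)
print("p99 alarm (> 2 s) breaches:      ", p99 > 2)
```

With this sample the mean sits just under the 500ms threshold even though 5% of requests take 4 seconds, so only the p99 alarm reflects what those users experience.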

4. RDS FreeableMemory below a static byte threshold

RDS FreeableMemory drops under normal database load. When the instance is executing queries, it allocates buffer pool memory for frequently-read pages, indexes, and query results. This memory shows as used — it's not leaked, it's a database doing its job efficiently. A static threshold of 500 MB fires every morning when business-day query load starts.

aws cloudwatch describe-alarms \
  --alarm-names "prod-rds-LowFreeableMemory" \
  --query 'MetricAlarms[0].{Threshold:Threshold,EvalPeriods:EvaluationPeriods,DatapointsToAlarm:DatapointsToAlarm}'

# Noisy config:
# {
#   "Threshold": 524288000,   # 500 MB in bytes — someone guessed this number
#   "EvalPeriods": 1,
#   "DatapointsToAlarm": 1
# }

The alarm clears overnight when traffic drops and the buffer pool shrinks. You investigate, see normal memory patterns, and close the page without changing anything. Real memory exhaustion on RDS is FreeableMemory that keeps dropping without recovering, accompanied by rising SwapUsage. Buffer pool pressure that recovers overnight is not an incident.

Resources:
  RDSLowMemoryAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: "prod-rds-LowFreeableMemory"
      Namespace: AWS/RDS
      MetricName: FreeableMemory
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: prod-postgres-primary
      Statistic: Average
      Period: 300                     # 5-minute periods, not 60s
      EvaluationPeriods: 5
      DatapointsToAlarm: 3            # was: 1
      # db.t3.medium = 4 GiB total  → 10% = 429,496,730 bytes
      # db.r6g.large = 16 GiB total → 10% = 1,717,986,918 bytes
      Threshold: 429496730            # Set to 10% of your instance's total RAM
      ComparisonOperator: LessThanThreshold
      TreatMissingData: notBreaching

Set the threshold to 10% of your instance's total memory in bytes, not a static 500 MB that was guessed from a blog post. A db.t3.medium has 4 GiB total (threshold: 429,496,730 bytes). A db.r6g.large has 16 GiB (threshold: 1,717,986,918 bytes). Also add a separate SwapUsage > 0 alarm — swap usage is the real signal of memory exhaustion, not buffer pool pressure.
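The byte values are simple to derive for any instance class. This sketch reproduces the two thresholds quoted above; the helper name is hypothetical, and the memory sizes are the ones given in this article, so check your own instance class before reusing them.

```python
GIB = 1024 ** 3  # bytes in one GiB


def freeable_memory_threshold(total_gib: float, fraction: float = 0.10) -> int:
    """Bytes corresponding to `fraction` of an instance's total memory."""
    return round(total_gib * GIB * fraction)


# Total memory per instance class, as quoted in the article.
instance_memory_gib = {
    "db.t3.medium": 4,
    "db.r6g.large": 16,
}

thresholds = {
    name: freeable_memory_threshold(gib)
    for name, gib in instance_memory_gib.items()
}

for name, value in thresholds.items():
    print(f"{name}: Threshold = {value}")
```

Deriving the number from the instance size also means the threshold can be recomputed when you resize the instance, instead of leaving behind a value tuned for the old one.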

5. SQS ApproximateNumberOfMessagesVisible above a static count

ApproximateNumberOfMessagesVisible counts messages currently waiting for consumers. This number spikes whenever a producer sends messages — regardless of whether consumers are running. A batch job that drops 500 messages into the queue will trigger an alarm at threshold = 10 before a single consumer has polled.

The metric this alarm should use is ApproximateAgeOfOldestMessage: how long the oldest message has been waiting. If consumers are processing, the oldest message won't survive long. If the age is growing past your processing SLA, consumers have stopped — and that's an actual incident.

aws cloudwatch describe-alarms \
  --alarm-names "prod-egress-queue-Depth" \
  --query 'MetricAlarms[0].{MetricName:MetricName,Threshold:Threshold,DatapointsToAlarm:DatapointsToAlarm}'

# Noisy config:
# {
#   "MetricName": "ApproximateNumberOfMessagesVisible",
#   "Threshold": 10.0,
#   "DatapointsToAlarm": 1
# }

Resources:
  SQSOldestMessageAgeAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: "prod-egress-queue-OldestMessageAge-High"
      Namespace: AWS/SQS
      MetricName: ApproximateAgeOfOldestMessage  # was: ApproximateNumberOfMessagesVisible
      Dimensions:
        - Name: QueueName
          Value: prod-egress-queue
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 2
      DatapointsToAlarm: 2
      Threshold: 300                  # seconds — calibrate to your processing SLA
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching

ApproximateAgeOfOldestMessage > 300 seconds for a queue that should process in under 60 seconds tells you consumers have stopped. Queue depth > 10 before consumers have polled tells you your batch job is running correctly.
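The contrast between depth and age can be seen in a toy simulation. This is a deliberately simplified model (strict FIFO, one sample per minute, a fixed consumer capacity), and the function name and all numbers are illustrative assumptions rather than SQS behaviour.

```python
from collections import deque
from typing import Dict, List, Tuple


def simulate(
    arrivals: Dict[int, int], per_minute_capacity: int, minutes: int
) -> List[Tuple[int, int, int]]:
    """Return (minute, depth, oldest_age_seconds) samples for a toy queue."""
    queue = deque()  # each entry is the minute at which a message arrived
    samples = []
    for minute in range(minutes):
        queue.extend([minute] * arrivals.get(minute, 0))
        depth = len(queue)
        oldest_age = (minute - queue[0]) * 60 if queue else 0
        samples.append((minute, depth, oldest_age))
        # Consumers drain up to their capacity before the next sample.
        for _ in range(min(per_minute_capacity, len(queue))):
            queue.popleft()
    return samples


batch = {0: 500}  # a batch job drops 500 messages at minute 0

healthy = simulate(batch, per_minute_capacity=250, minutes=10)
stalled = simulate(batch, per_minute_capacity=0, minutes=10)

healthy_max_depth = max(depth for _, depth, _ in healthy)
healthy_max_age = max(age for _, _, age in healthy)
stalled_max_age = max(age for _, _, age in stalled)

print("healthy: max depth", healthy_max_depth, "max age", healthy_max_age, "s")
print("stalled: max age", stalled_max_age, "s")
print("depth > 10 fires on healthy run:", healthy_max_depth > 10)
print("age > 300 fires on healthy run: ", healthy_max_age > 300)
print("age > 300 fires on stalled run: ", stalled_max_age > 300)
```

In the healthy run the depth alarm fires immediately while the age never approaches 300 seconds; in the stalled run the age keeps climbing and crosses the threshold, which is the case worth a page.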

The common thread: DatapointsToAlarm = 1 and the wrong metric

Every alarm above has the same structural failure: it fires immediately on a single data point, or it measures a metric that spikes legitimately during normal operations. Real production problems sustain — a CPU spike that resolves in 90 seconds was a traffic burst that auto-scaled away. A Lambda error that doesn't recur is noise. Latency during a deploy is expected.

The two-question test for any alarm you're about to create: first, what does sustained anomalous behaviour actually look like for this metric? Second, does my DatapointsToAlarm setting require evidence of that sustained behaviour before firing? If the answer to the second question is 'no — one data point is enough,' you've just scheduled a 3am page for something that will have resolved before you finish reading the notification.
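The second question can also be asked of alarms that already exist. The sketch below scans JSON shaped like the output of aws cloudwatch describe-alarms and lists alarms that fire on a single data point. The function name and the sample payload are hypothetical, and treating a missing DatapointsToAlarm as equal to EvaluationPeriods is an assumption of this sketch that is worth verifying against your own output.

```python
import json
from typing import List


def single_datapoint_alarms(describe_alarms_output: dict) -> List[str]:
    """Names of metric alarms that need only one breaching data point."""
    flagged = []
    for alarm in describe_alarms_output.get("MetricAlarms", []):
        periods = alarm.get("EvaluationPeriods", 1)
        # Assumption: when DatapointsToAlarm is absent, every evaluated
        # data point must breach, so it behaves like EvaluationPeriods.
        needed = alarm.get("DatapointsToAlarm", periods)
        if needed == 1:
            flagged.append(alarm["AlarmName"])
    return flagged


# Hypothetical payload, trimmed to the fields this check reads.
sample = json.loads("""
{
  "MetricAlarms": [
    {"AlarmName": "api-service-HighCPUUtilization",
     "EvaluationPeriods": 1, "DatapointsToAlarm": 1},
    {"AlarmName": "prod-alb-p99Latency-High",
     "EvaluationPeriods": 3, "DatapointsToAlarm": 2},
    {"AlarmName": "prod-egress-queue-Depth",
     "EvaluationPeriods": 1}
  ]
}
""")

flagged = single_datapoint_alarms(sample)
for name in flagged:
    print("fires on a single data point:", name)
```

To run it against a real account, save the CLI output with --output json to a file and pass the parsed contents to the function; the result is a review list, not a verdict, since some alarms are meant to fire on one data point.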

Frequently asked questions

What are noisy CloudWatch alarms?

Noisy CloudWatch alarms fire frequently on conditions that don't represent real incidents — CPU spikes that self-heal, Lambda errors that are expected at scale, latency during normal deployments. The term 'noisy' refers to a high ratio of false positive alerts to real incidents, which causes alert fatigue: engineers start ignoring or silencing alarms, eventually including the real ones.

What is DatapointsToAlarm in CloudWatch and why does it matter?

DatapointsToAlarm is the number of data points, out of the most recent EvaluationPeriods data points, that must breach the threshold before the alarm transitions to ALARM state. If EvaluationPeriods is 5 and DatapointsToAlarm is 1, a single data point above threshold fires the alarm, even if the other 4 are normal. Setting DatapointsToAlarm to 3 with EvaluationPeriods at 5 requires sustained anomalous behaviour before alerting, filtering out transient spikes.

How do I stop my Lambda error alarm from firing constantly?

Switch from an absolute error count threshold to an error rate using metric math (Errors / Invocations × 100). Set the threshold at 5% (or your acceptable error rate), with EvaluationPeriods = 3 and DatapointsToAlarm = 2. This catches sustained Lambda failures without triggering on occasional transient errors that every function experiences at scale.

Why does my RDS FreeableMemory alarm fire during the day and clear overnight?

RDS actively uses available memory for buffer pool cache — storing frequently-read index pages, query results, and data blocks. During business-hours query load, the buffer pool fills and FreeableMemory drops. Overnight when traffic subsides, the buffer pool shrinks and memory is returned. Fix: set the threshold to 10% of your instance's total RAM with DatapointsToAlarm = 3 and a 5-minute period, and add a separate SwapUsage > 0 alarm for real memory exhaustion.

What CloudWatch metric should I use for SQS monitoring instead of message count?

Use ApproximateAgeOfOldestMessage instead of ApproximateNumberOfMessagesVisible. Message count spikes whenever producers send messages — before consumers have had a chance to process them. Message age grows when consumers are stuck or crashed. An alarm on age > 300 seconds for a queue that should process in under 60 seconds signals a real consumer failure without firing every time a batch job runs.
