Why your CloudWatch alarm fired and resolved in 90 seconds (and why that's still a problem)

May 25, 2026 · 7 min read

It's 2:17am. Your phone buzzes — HighCPUUtilization on api-service. You unlock, open the CloudWatch console, and by the time it loads, the alarm is already green. Resolved. You go back to sleep. Forty minutes later it fires again. Resolved before you can do anything. By morning it has done this six times. Six SNS notifications. Nobody looked at any of them.

Why a self-resolving alarm isn't automatically harmless

The misconception is that if an alarm resolved on its own, there's nothing to investigate. What it actually means is that the metric dipped below your threshold long enough to flip state — nothing more.

Whether that matters depends entirely on what the metric is and how the alarm is configured. A CPUUtilization alarm that spikes to 82% for 45 seconds and resolves is most likely a configuration problem: with DatapointsToAlarm at 1 and a 60-second period, a single data point above the threshold fires the alarm. A database connection alarm that fires and resolves three times in 15 minutes is telling you your app is cycling through connection pool exhaustion, recovering briefly, then hitting the ceiling again. Same flapping pattern, completely different severity.

There's also something subtle about the timing. A flapping alarm tends to clear before you've had time to look at it, and that quick resolution creates a false sense of safety. "It went green, we're fine" is easy to tell yourself at 3am. But if the same alarm fires again 40 minutes later and resolves again, you now have two occurrences of the same pattern, and you're no more informed than you were the first time.

The dangerous part: CloudWatch only sends SNS notifications on state transitions. A flapping alarm generates more pages per hour than a stuck alarm. Alert fatigue sets in fast. Engineers start ignoring the channel — including the one notification that won't resolve on its own.

How to diagnose a flapping alarm

Step 1: Measure the transition pattern, not the individual alert

Before deciding whether to act, you need to see the full ALARM→OK→ALARM history for this alarm. The metric graph in the console makes it hard to count state changes, so use the API to get the actual state history:

aws cloudwatch describe-alarm-history \
  --alarm-name "api-service-HighCPUUtilization" \
  --history-item-type StateUpdate \
  --start-date 2026-05-24T00:00:00Z \
  --end-date 2026-05-25T00:00:00Z \
  --output json | jq '.AlarmHistoryItems[] | {timestamp: .Timestamp, summary: .HistorySummary}'

Count the ALARM→OK cycles. If the alarm has self-resolved more than 5 times in 14 days, the problem is almost certainly configuration — the threshold sits too close to the normal operating range. If it fired 6 times in the last 4 hours, something is actively cycling through a failure mode.

Decision rule: If the alarm self-resolved 3 or more times within 15 minutes, treat it as flapping and investigate the alarm config first. If it fired once and stayed in ALARM for more than 15 minutes, treat it as a real incident and investigate the service.
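The decision rule above can be sketched in Python against the jq output from the describe-alarm-history command. This is a minimal sketch, not a definitive implementation: it assumes each HistorySummary reads like "Alarm updated from OK to ALARM", and the function names (`parse_transitions`, `max_self_resolves`, `is_flapping`) are hypothetical helpers introduced here for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumption: HistorySummary looks like "Alarm updated from OK to ALARM".
_SUMMARY = re.compile(r"from (\w+) to (\w+)")


def parse_transitions(items):
    """Turn {timestamp, summary} items into sorted (time, old, new) tuples."""
    out = []
    for item in items:
        match = _SUMMARY.search(item["summary"])
        if match is None:
            continue  # skip summaries that do not describe a state change
        ts = datetime.fromisoformat(item["timestamp"].replace("Z", "+00:00"))
        out.append((ts, match.group(1), match.group(2)))
    return sorted(out)


def max_self_resolves(transitions, window=timedelta(minutes=15)):
    """Largest number of ALARM->OK transitions inside any sliding window."""
    resolves = [ts for ts, old, new in transitions
                if old == "ALARM" and new == "OK"]
    best = 0
    start = 0
    for end, ts in enumerate(resolves):
        # Shrink the window from the left until it spans at most `window`.
        while ts - resolves[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


def is_flapping(transitions, min_cycles=3):
    """Apply the rule: 3 or more self-resolves within 15 minutes."""
    return max_self_resolves(transitions) >= min_cycles
```

Feed it the parsed JSON array from step 1 (for example via `json.load`), and check `is_flapping(parse_transitions(items))` before deciding whether to look at the alarm config or the service first.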

Step 2: Check if the threshold is calibrated correctly

The most common structural cause of flapping is a threshold that sits inside the normal operating range. Pull 14-day statistics for the metric and compare them to your threshold:

aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ServiceName,Value=api-service Name=ClusterName,Value=production \
  --start-time 2026-05-11T00:00:00Z \
  --end-time 2026-05-25T00:00:00Z \
  --period 86400 \
  --statistics Average Maximum \
  --output table

If the daily maximum routinely lands within a few points of your threshold, or above it (say, peaks of 78 to 82% against an 80% threshold), the alarm will fire on every modest traffic spike. You can raise the threshold to 90% or switch to anomaly detection, but it is usually better to fix the DatapointsToAlarm setting first.
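To make the comparison concrete, here is a small Python sketch that summarizes how close the daily Maximum values from get-metric-statistics come to the alarm threshold. The function name `threshold_headroom` and the 5-point `margin` are assumptions chosen for illustration, not values from CloudWatch.

```python
def threshold_headroom(daily_maxima, threshold, margin=5.0):
    """Summarize how close daily peaks come to an alarm threshold.

    margin is a hypothetical "too close" band, in percentage points.
    """
    if not daily_maxima:
        raise ValueError("need at least one daily maximum")
    peak = max(daily_maxima)
    return {
        "peak": peak,
        # Days on which at least one sample exceeded the threshold.
        "days_over": sum(1 for m in daily_maxima if m > threshold),
        # Days that stayed under the threshold but within the margin.
        "days_near": sum(1 for m in daily_maxima
                         if threshold - margin <= m <= threshold),
        # Negative headroom means the peak already crossed the threshold.
        "headroom": threshold - peak,
    }
```

A result with several `days_near` or `days_over` and little or negative `headroom` suggests the threshold sits inside the normal operating range.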

Step 3: Fix DatapointsToAlarm — the most common root cause

Check your current alarm configuration:

aws cloudwatch describe-alarms \
  --alarm-names "api-service-HighCPUUtilization" \
  --query 'MetricAlarms[0].{Period:Period,EvalPeriods:EvaluationPeriods,DatapointsToAlarm:DatapointsToAlarm,Threshold:Threshold,TreatMissing:TreatMissingData}'

A DatapointsToAlarm of 1 means a single 60-second data point above threshold triggers the alarm. One brief CPU spike and you're paged at 2am. Fix it in CloudFormation:

Resources:
  ApiServiceCPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: api-service-HighCPUUtilization
      Namespace: AWS/ECS
      MetricName: CPUUtilization
      Statistic: Average        # Statistic or ExtendedStatistic is required; Average is an assumption
      Dimensions:
        - Name: ServiceName
          Value: api-service
        - Name: ClusterName
          Value: production
      Period: 60
      EvaluationPeriods: 3
      DatapointsToAlarm: 2      # require 2 of 3 data points above threshold
      Threshold: 85
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching

DatapointsToAlarm: 2 with EvaluationPeriods: 3 means at least two of the three most recent data points must breach the threshold before the alarm fires. The breaching points don't have to be consecutive. A spike confined to a single data point doesn't trigger it. The alarm stays green until the problem is sustained, which is what you want.
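The effect of the 2-of-3 setting can be illustrated with a simplified simulation. This is a sketch of M-out-of-N evaluation under stated assumptions: it ignores missing data and the INSUFFICIENT_DATA state, and it is not CloudWatch's exact algorithm. The function names and the sample CPU series are made up for illustration.

```python
def alarm_states(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return one state ("OK" or "ALARM") per data point.

    Simplified model: ALARM when at least `datapoints_to_alarm` of the last
    `evaluation_periods` data points are above the threshold.
    """
    states = []
    for i in range(len(datapoints)):
        window = datapoints[max(0, i - evaluation_periods + 1): i + 1]
        breaching = sum(1 for value in window if value > threshold)
        states.append("ALARM" if breaching >= datapoints_to_alarm else "OK")
    return states


def count_alarm_entries(states):
    """Count OK -> ALARM transitions, i.e. how many times you get paged."""
    previous = "OK"
    entries = 0
    for state in states:
        if state == "ALARM" and previous == "OK":
            entries += 1
        previous = state
    return entries


# Hypothetical per-minute CPU averages: two lone spikes, then a sustained rise.
cpu = [60, 88, 62, 61, 90, 63, 60, 87, 89, 91, 64]
pages_1_of_1 = count_alarm_entries(alarm_states(cpu, 85, 1, 1))
pages_2_of_3 = count_alarm_entries(alarm_states(cpu, 85, 3, 2))
```

With this series, the 1-of-1 configuration enters ALARM three times, while the 2-of-3 configuration enters ALARM once, on the sustained rise only.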

One thing I don't know from this config alone: whether 85% is the right threshold for your service. That depends on how your application behaves at high CPU — some services handle 90% fine for minutes at a time; others start dropping requests at 75%. The CloudFormation snippet above gives you a stable alarm that won't flap on brief spikes, but you still need to validate the threshold itself against real load test data or 14-day historical peaks. The fix for flapping is not always to raise the threshold — sometimes the threshold is correct and the evaluation window is what's wrong.

Step 4: Look at the logs during each transition

Flapping tells you the metric is crossing a boundary repeatedly. Logs tell you why. Run this Logs Insights query against your service log group, scoped to ±5 minutes around each ALARM transition from step 1:

fields @timestamp, @message
| filter @message like /ERROR|error|exception|timeout|connection refused|WARN/
| stats count(*) as error_count by bin(1m)
| sort @timestamp asc
| limit 60

If error counts spike at every ALARM transition, the metric is faithfully reporting a real recurring problem. If the logs are clean during each transition, the metric is oscillating on its own because of a threshold or evaluation-period misconfiguration.

Decision rule:

Logs show error spikes at every ALARM transition → real problem cycling through a failure mode; investigate the cycle cause (connection pool exhaustion, GC pressure, scheduled job, autoscaling lag).

Logs are clean → the threshold or evaluation config is wrong; recalibrate before the next occurrence.

Logs show errors but always the same type → check whether one downstream API or one slow query is the single trigger driving the metric.
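One way to apply this rule without eyeballing each transition is to line up the per-minute error counts from the Logs Insights query with the OK→ALARM timestamps from step 1. The sketch below is an illustration only: `classify_flap`, the `min_errors` cutoff, and the "mixed" outcome are assumptions added here, and "any errors within ±5 minutes" is a cruder test than a true spike detection.

```python
from datetime import datetime, timedelta


def classify_flap(alarm_times, error_bins, window=timedelta(minutes=5), min_errors=1):
    """Classify a flapping alarm from its transitions and nearby error counts.

    alarm_times: datetimes of OK -> ALARM transitions.
    error_bins:  dict mapping each 1-minute bin (datetime) to its error count.
    min_errors:  hypothetical cutoff for "errors were present" near a transition.
    """
    if not alarm_times:
        return "no transitions"
    with_errors = 0
    for fired_at in alarm_times:
        # Sum the error counts in the bins within +/- window of the transition.
        nearby = sum(count for minute, count in error_bins.items()
                     if abs(minute - fired_at) <= window)
        if nearby >= min_errors:
            with_errors += 1
    if with_errors == len(alarm_times):
        return "real recurring problem"
    if with_errors == 0:
        return "threshold or evaluation config"
    return "mixed"  # not covered by the rule above; inspect by hand
```

In practice you would set `min_errors` well above your service's background error rate, so that routine warnings don't count as a spike.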

Step 5: Classify and act

Pattern	What it usually means	Fix
Fires and resolves in <2 min, clean logs	Threshold too low or DatapointsToAlarm=1	Raise threshold; set DatapointsToAlarm ≥2
Fires 3–6 times in 15 min, error spikes each time	Real recurring problem: connection pool, GC pause, cron collision	Fix the underlying problem; the alarm config is probably correct
Fires once, stays in ALARM 15+ min	Actual incident	Treat as incident; investigate root cause immediately

Common failure modes

Suppressing the alarm entirely. When an alarm becomes noisy, the instinct is to disable it. That's almost always wrong. If the metric matters, you want to know when it's elevated — you just want fewer, better-timed notifications. Fix the evaluation config; don't silence the signal.

Setting a longer Period instead of fixing DatapointsToAlarm. Changing Period from 60 to 300 will reduce flapping, but it also means a real problem won't trigger the alarm until about 5 minutes after it starts. A 2-of-3 DatapointsToAlarm setting on 60-second periods filters the same brief spikes while still firing after roughly two breaching minutes.

Not investigating what's driving the cycle. Flapping alarms often reveal a real architectural problem — a connection pool exactly at capacity, a Lambda with memory set too low, an RDS instance undersized for the actual load pattern. The alarm stops flapping when the metric stabilizes. The metric stabilizes when you fix the underlying resource. Ignoring the flap means it comes back.

Treating all flapping alarms the same. A Lambda Errors alarm that fires twice in 5 minutes is more likely to be a real problem than a CPUUtilization alarm doing the same. Error-rate metrics are precise — the errors either happened or they didn't. Utilization metrics are aggregates that naturally oscillate. Apply more skepticism to utilization alarms; more urgency to error-rate alarms.
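To put rough numbers on the Period versus DatapointsToAlarm trade-off described in the failure modes above, the following sketch computes a lower bound on detection time for a sustained breach. It assumes the breach starts on a period boundary and ignores metric delivery and evaluation delays, so real detection will be somewhat slower; `min_seconds_to_alarm` is a hypothetical helper.

```python
def min_seconds_to_alarm(period_seconds, datapoints_to_alarm):
    """Lower bound on time from breach start to ALARM for a sustained breach.

    Assumes the breach begins exactly at a period boundary and that every
    data point after that breaches; delivery delays are ignored.
    """
    if period_seconds <= 0 or datapoints_to_alarm <= 0:
        raise ValueError("period and datapoints_to_alarm must be positive")
    return period_seconds * datapoints_to_alarm


longer_period = min_seconds_to_alarm(300, 1)   # Period: 300, 1 of 1
two_of_three = min_seconds_to_alarm(60, 2)     # Period: 60, 2 of 3
one_of_one = min_seconds_to_alarm(60, 1)       # Period: 60, 1 of 1
```

Under these assumptions the 2-of-3 alarm on 60-second periods needs about 120 seconds of breaching data, compared with 300 seconds for a single 300-second period and 60 seconds for the flapping-prone 1-of-1 setup.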

The automated version

Doing this manually is fine. Running the CLI commands, pulling the transition history, recalibrating the alarm — that works. ConvOps does the same tracking automatically, at alert time. When a flapping alarm crosses 3 fire-and-resolve cycles within 15 minutes, ConvOps sends one notification — "this alarm is flapping, I'm watching it" — and suppresses the rest. If it stays continuously in ALARM for more than 15 minutes despite the cycling, it escalates with one message and starts the automated log investigation. If the alarm goes quiet for 3 hours, the flap history resets — the next fire is treated as a fresh incident, not a continuation of the old pattern. No alarm is silently dropped; the noise-vs-incident distinction happens without you having to be awake to make that call.

Frequently asked questions

What is a flapping CloudWatch alarm?

A flapping CloudWatch alarm repeatedly transitions between ALARM and OK state without staying in either state long enough to indicate a sustained incident. It typically means the metric oscillates near the threshold — either because the threshold is misconfigured, DatapointsToAlarm is set to 1, or an intermittent real problem is triggering the metric in cycles.

How do I stop a CloudWatch alarm from flapping?

The most common fix is setting DatapointsToAlarm to at least 2 with EvaluationPeriods set to 3. The metric must then breach the threshold in at least 2 of the 3 most recent data points before the alarm fires, which filters out single-data-point spikes. If the alarm continues to flap after fixing the evaluation config, the threshold itself is likely set too close to the normal operating range.

Is a self-resolving CloudWatch alarm always a false alarm?

Not always. A self-resolving alarm can indicate a real problem that temporarily fixes itself — for example, an application that exhausts database connections, backs off, and reconnects repeatedly. If you see error spikes in your logs at each ALARM transition, the alarm is working correctly and you have a cycling failure mode that needs investigation.

What is DatapointsToAlarm in CloudWatch?

DatapointsToAlarm controls how many data points within the evaluation window must breach the threshold before the alarm fires. If EvaluationPeriods is 3 and DatapointsToAlarm is 2, the alarm fires when at least 2 of the 3 most recent data points breach the threshold; they don't have to be consecutive. When DatapointsToAlarm is omitted, it defaults to the value of EvaluationPeriods, so an alarm with EvaluationPeriods set to 1 fires on any single data point above the threshold. That single-data-point setup is the most common cause of flapping alarms.

Still debugging incidents manually?

ConvOps does this automatically — root cause in under 60 seconds, delivered to WhatsApp or Slack.

Try ConvOps free →
