Composite CloudWatch alarms: stop getting paged for things that aren't incidents

May 15, 2026 · 10 min read

Your ECS service CPU hits 83%. CloudWatch pages you at 2:47am. You investigate for 11 minutes before realising it's the nightly analytics export — a batch job that's been running every night for 6 months and has never affected users. You silence the alarm. A week later, a real incident happens, CPU hits 91%, and you ignore the page because you think it's the batch job again.

This is the core failure mode of single-metric alerting: every threshold violation looks the same, so your team stops treating them seriously. Composite CloudWatch alarms are the fix.

What is a composite CloudWatch alarm?
Why single-metric alarms create noise your team learns to ignore
How composite alarm rule expressions work
Building a composite alarm for an ECS service (CloudFormation + Terraform)
The ActionsSuppressor: suppressing alerts during deployments
Common composite alarm patterns
What composite alarms can't do

What is a composite CloudWatch alarm?

A composite CloudWatch alarm combines the states of multiple existing alarms using a rule expression. It fires only when that expression evaluates to true — for example, when CPU is high AND latency is high AND error rate is rising. Unlike a metric alarm, a composite alarm doesn't monitor a metric directly; it monitors the states of other alarms.

Rule expressions use three state functions, ALARM(), OK(), and INSUFFICIENT_DATA(), joined with AND, OR, and NOT operators. You can nest parentheses for precedence. When the expression evaluates to true, the composite alarm enters ALARM state and fires its configured actions: SNS notifications, Lambda functions, or Systems Manager OpsItems and incidents. Composite alarms cannot perform EC2 or Auto Scaling actions; those stay on metric alarms.

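Composite alarms are priced separately from metric alarms: at the time of writing, $0.50 per composite alarm per month in us-east-1, compared with $0.10 for a standard-resolution metric alarm. The child alarms still exist and still have their own costs, so the composite is an extra line item, though a small one. Check the CloudWatch pricing page for current rates.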

Why single-metric alarms create noise your team learns to ignore

A typical 10-service ECS stack with 4–5 metrics monitored per service can generate 30–60 individual alarm events per week. The overwhelming majority of those events are not incidents. They're batch jobs, traffic bursts, deploys that momentarily spike CPU, and Lambda cold-start clusters that never actually affect users.

The problem compounds over time. Teams that receive too many false pages do one of two things: they raise thresholds so high the alarm is nearly useless, or they unconsciously start treating pages as informational noise rather than action triggers. Both outcomes mean a real incident will be missed.

Alarm fires	What it usually means	What you actually need to check
ECS CPU > 80%	Nightly batch job, blue/green deploy warmup, traffic spike	Is latency or error rate also affected?
RDS connections > 80%	App restart, migration script, connection pool bug	Is query latency also elevated?
Lambda throttles > 0	Burst limit hit momentarily, resolves in seconds	Are retries exhausted? Is a downstream service affected?
ALB 5xx count > 5	Single bad deploy request, health check during deploy	Is the rate sustained? Is CPU or latency also elevated?
SQS queue depth > 100	Consumer restart, scheduled batch pause	Has the queue been growing for more than 5 minutes?

Composite alarms let you encode this reasoning directly into the alerting layer. Instead of asking your on-call engineer to make the 'is this real?' judgement at 3am, you make that judgement once in a rule expression and let the system apply it consistently.

How composite alarm rule expressions work

A rule expression is a string that references child alarm names using three state functions and three logical operators. The composite alarm enters ALARM state when the expression is true.

State functions

Function	Evaluates to true when…	Typical use
ALARM("alarm-name")	The child alarm is in ALARM state	The standard case — require this alarm to be firing
OK("alarm-name")	The child alarm is in OK state	Suppression logic — fire when a health check is passing but something else is wrong
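INSUFFICIENT_DATA("alarm-name")	The child alarm is in INSUFFICIENT_DATA state (too few datapoints to evaluate)	Detecting gaps in telemetry from a service that should always be reporting (requires TreatMissingData: missing on the child)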

Logical operators

AND — both conditions must be true
OR — either condition must be true
NOT — inverts the condition
Parentheses — standard precedence grouping

A rule expression can be up to 10,240 characters and reference up to 100 child alarms. Child alarms can be metric alarms or other composite alarms — you can build nested composite alarms up to four levels deep.

# Page only when CPU is high AND (latency is degraded OR errors are elevated)
ALARM("api-service-cpu-high") AND
(ALARM("api-service-p99-latency-high") OR ALARM("api-service-5xx-rate-high"))

# Page when any of three critical services is degraded
ALARM("api-service-degraded") OR
ALARM("worker-service-degraded") OR
ALARM("auth-service-degraded")

# Page when queue is deep AND the consumer is running (not just stopped)
ALARM("payment-queue-depth-high") AND NOT ALARM("payment-consumer-stopped")
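As a rough sketch of how one of these expressions attaches to a resource, the second example could become a platform-level composite in CloudFormation. The logical ID PlatformDegraded and the alarm name platform-degraded are illustrative, and the three per-service composites are assumed to exist already, either in this template or elsewhere in the same account and Region.

```yaml
Resources:
  # Hypothetical platform-wide composite built from per-service composites.
  PlatformDegraded:
    Type: AWS::CloudWatch::CompositeAlarm
    Properties:
      AlarmName: platform-degraded
      AlarmDescription: At least one critical service composite is in ALARM.
      AlarmRule: >-
        ALARM("api-service-degraded") OR
        ALARM("worker-service-degraded") OR
        ALARM("auth-service-degraded")
      AlarmActions:
        - YOUR_SNS_TOPIC_ARN
```

If the per-service composites already page, pointing this alarm at a separate, lower-urgency topic avoids a second page for the same incident.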

Building a composite alarm for an ECS service

Here's a production-ready composite alarm for an ECS service behind an ALB. It pages only when CPU is elevated AND users are seeing degraded performance — either high p99 latency or elevated 5xx rate.

Three child alarms feed into the composite: CPUUtilization from ECS, p99 TargetResponseTime from the ALB target group, and HTTPCode_Target_5XX_Count from the ALB. The composite rule requires CPU to be in alarm AND at least one user-facing metric to be in alarm. A CPU spike alone — the batch job case — never fires the composite.

Resources:
  # Child alarm 1: ECS CPU
  EcsCpuAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: api-service-cpu-high
      Namespace: AWS/ECS
      MetricName: CPUUtilization
      Dimensions:
        - Name: ClusterName
          Value: YOUR_CLUSTER_NAME
        - Name: ServiceName
          Value: api-service
      Statistic: Average
      Period: 60
      EvaluationPeriods: 3
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching

  # Child alarm 2: ALB p99 latency
  AlbLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: api-service-p99-latency-high
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Dimensions:
        - Name: TargetGroup
          Value: YOUR_TARGET_GROUP_SUFFIX
        - Name: LoadBalancer
          Value: YOUR_ALB_SUFFIX
      ExtendedStatistic: p99
      Period: 60
      EvaluationPeriods: 3
      Threshold: 1
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching

  # Child alarm 3: ALB 5xx rate
  Alb5xxAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: api-service-5xx-rate-high
      Namespace: AWS/ApplicationELB
      MetricName: HTTPCode_Target_5XX_Count
      Dimensions:
        - Name: TargetGroup
          Value: YOUR_TARGET_GROUP_SUFFIX
        - Name: LoadBalancer
          Value: YOUR_ALB_SUFFIX
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching

  # Composite alarm: page only when users are affected
  ApiServiceDegraded:
    Type: AWS::CloudWatch::CompositeAlarm
    DependsOn:
      - EcsCpuAlarm
      - AlbLatencyAlarm
      - Alb5xxAlarm
    Properties:
      AlarmName: api-service-degraded
      AlarmDescription: >-
        CPU is high AND (p99 latency > 1s OR 5xx rate > 10/min).
        CPU alone does not page — only when users are affected.
      AlarmRule: >-
        ALARM("api-service-cpu-high") AND
        (ALARM("api-service-p99-latency-high") OR ALARM("api-service-5xx-rate-high"))
      AlarmActions:
        - YOUR_SNS_TOPIC_ARN
      OKActions:
        - YOUR_SNS_TOPIC_ARN

Remove AlarmActions (alarm_actions in Terraform) from the child alarms and let only the composite alarm trigger notifications. If child alarms also have notification actions, you'll get paged by the child alarm AND the composite alarm for the same incident.
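One optional refinement, sketched under the assumption that the child alarms live in the same template as the composite: build the rule with !Sub so each alarm name comes from the child resource itself. For AWS::CloudWatch::Alarm, referencing the logical ID yields the alarm name, so a renamed child cannot silently drift from the rule, and CloudFormation works out the creation order without an explicit DependsOn.

```yaml
Resources:
  ApiServiceDegraded:
    Type: AWS::CloudWatch::CompositeAlarm
    Properties:
      AlarmName: api-service-degraded
      # ${LogicalId} resolves to each child's AlarmName (its Ref value),
      # which also gives CloudFormation an implicit dependency on the child.
      AlarmRule: !Sub >-
        ALARM("${EcsCpuAlarm}") AND
        (ALARM("${AlbLatencyAlarm}") OR ALARM("${Alb5xxAlarm}"))
      AlarmActions:
        - YOUR_SNS_TOPIC_ARN
      OKActions:
        - YOUR_SNS_TOPIC_ARN
```

The trade-off is readability: the literal-name version is easier to compare against the console, while this one is harder to break during a rename.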

The ActionsSuppressor: suppressing alerts during deployments

ActionsSuppressor is the most underused feature in composite alarms. It lets you define a 'suppressor alarm' — when that alarm is in ALARM state, the composite alarm's actions are silenced even if the composite would otherwise fire. The composite alarm still enters ALARM state and is visible in the console; it just doesn't send notifications.

The canonical use case: you have a deployment alarm that is in ALARM whenever a CodeDeploy deployment or an ECS rolling update is in progress. Wire that as the ActionsSuppressor on your composite service alarm. CPU and latency spikes during the 90-second deploy window stop paging your on-call engineer, with no manual maintenance window to set up.

Resources:
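  # A metric alarm that goes to ALARM during deployments.
  # Have the deploy pipeline publish DeploymentActive=1 while a deploy runs,
  # for example from a CodeDeploy event via EventBridge. A deploy script can
  # also call aws cloudwatch set-alarm-state ..., but that override only lasts
  # until the alarm's next evaluation, so it suits testing more than real windows.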
  DeploymentInProgress:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: api-service-deploying
      Namespace: ConvOps/Deployments
      MetricName: DeploymentActive
      Dimensions:
        - Name: ServiceName
          Value: api-service
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold
      TreatMissingData: notBreaching

  ApiServiceDegraded:
    Type: AWS::CloudWatch::CompositeAlarm
    Properties:
      AlarmName: api-service-degraded
      AlarmRule: >-
        ALARM("api-service-cpu-high") AND
        (ALARM("api-service-p99-latency-high") OR ALARM("api-service-5xx-rate-high"))
      AlarmActions:
        - YOUR_SNS_TOPIC_ARN
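      # Suppress actions while a deploy is in progress.
      # !Ref returns the alarm name and makes CloudFormation create the suppressor first.
      ActionsSuppressor: !Ref DeploymentInProgress
      # Max seconds to wait for the suppressor to enter ALARM before running actions
      ActionsSuppressorWaitPeriod: 120
      # Max seconds to keep holding actions after the suppressor leaves ALARM
      ActionsSuppressorExtensionPeriod: 60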

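ActionsSuppressorWaitPeriod (120 seconds above) is the maximum time the composite waits for the suppressor alarm to enter ALARM after the composite itself goes into ALARM. If the suppressor has not fired by then, the composite runs its actions, so this value is also the worst-case delay added to a real page. ActionsSuppressorExtensionPeriod (60 seconds) is the maximum time the composite keeps holding its actions after the suppressor leaves ALARM; if the composite is still in ALARM when that period ends, the actions run. That gives the service time to settle once a deploy finishes.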

Common composite alarm patterns

Pattern	Rule expression	What it solves
User-impact gate	ALARM("cpu-high") AND (ALARM("latency-high") OR ALARM("error-rate-high"))	Stops paging for resource spikes that don't affect users
Any-of-N services down	ALARM("svc-a-degraded") OR ALARM("svc-b-degraded") OR ALARM("svc-c-degraded")	Single 'platform health' alarm that fires if any critical service degrades
Queue stuck with live consumer	ALARM("queue-depth-high") AND NOT ALARM("consumer-stopped")	Distinguishes a stuck consumer from a paused/scaled-down consumer
Database under pressure	ALARM("rds-connections-high") AND ALARM("rds-latency-high")	Connection count alone spikes during restarts; only page when queries are slow too
Lambda at capacity	ALARM("lambda-throttles-high") AND ALARM("lambda-errors-high")	Throttles during a burst resolve in seconds; only page when errors are also elevated
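To make one row of the table concrete, here is a minimal sketch of the "Database under pressure" pattern in CloudFormation. The thresholds, the choice of ReadLatency as the latency signal, the composite name database-under-pressure, and the YOUR_DB_INSTANCE_ID placeholder are all assumptions for illustration; tune them to your instance class and workload.

```yaml
Resources:
  RdsConnectionsAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: rds-connections-high
      Namespace: AWS/RDS
      MetricName: DatabaseConnections
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: YOUR_DB_INSTANCE_ID
      Statistic: Average
      Period: 60
      EvaluationPeriods: 3
      Threshold: 400          # assumption: roughly 80% of a 500-connection limit
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching

  RdsLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: rds-latency-high
      Namespace: AWS/RDS
      MetricName: ReadLatency
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: YOUR_DB_INSTANCE_ID
      Statistic: Average
      Period: 60
      EvaluationPeriods: 3
      Threshold: 0.05         # seconds (50 ms); assumption
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching

  # Pages only when connections are high AND queries are slow.
  DatabaseUnderPressure:
    Type: AWS::CloudWatch::CompositeAlarm
    DependsOn:
      - RdsConnectionsAlarm
      - RdsLatencyAlarm
    Properties:
      AlarmName: database-under-pressure
      AlarmRule: >-
        ALARM("rds-connections-high") AND ALARM("rds-latency-high")
      AlarmActions:
        - YOUR_SNS_TOPIC_ARN
```

Neither child has AlarmActions, so a connection spike during an app restart shows up in the console without paging anyone.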

What composite alarms can't do

Composite alarms are powerful but have real limits worth knowing before you design your alerting architecture.

No metric math in the rule expression: a composite evaluates alarm states only, never metric values. If you need arithmetic across metrics, put it in a metric math alarm (one that uses MetricDataQueries) and reference that alarm as a child.
No cross-account or cross-Region child alarms: all child alarms must be in the same AWS account and Region as the composite. For multi-account monitoring you need to replicate alarms or use CloudWatch cross-account dashboards.
100 child alarms per composite — the hard limit. For large services with many metrics, build intermediate composite alarms and combine them.
Composite alarms don't collect data — they have no metrics or history you can graph. The alarm state history shows when it entered ALARM, not metric values.
Actions fire on the composite, not the child — if you need per-metric actions (e.g. auto-scaling on CPU specifically), keep those actions on the child alarms. Just remove the notification actions from children to avoid double-paging.
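The last point can be sketched as a child alarm that keeps a scaling action but carries no notification action. YOUR_SCALE_OUT_POLICY_ARN is a hypothetical placeholder for the ARN of an Application Auto Scaling step scaling policy defined elsewhere; the rest mirrors the CPU alarm from the ECS example.

```yaml
Resources:
  EcsCpuAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: api-service-cpu-high
      Namespace: AWS/ECS
      MetricName: CPUUtilization
      Dimensions:
        - Name: ClusterName
          Value: YOUR_CLUSTER_NAME
        - Name: ServiceName
          Value: api-service
      Statistic: Average
      Period: 60
      EvaluationPeriods: 3
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        # Scaling stays on the child. No SNS topic is listed here,
        # so this alarm scales the service but never pages on its own.
        - YOUR_SCALE_OUT_POLICY_ARN
```

The composite can still reference api-service-cpu-high by name; the scaling action and the paging decision stay independent.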

The setup that stops the 3am batch job page

The pattern that eliminates most false pages follows the same structure regardless of service type. Start with three child alarms: a resource utilisation alarm (CPU, memory, connections), a latency alarm (p99 response time or processing time), and an error rate alarm (5xx, DLQ depth, function errors). Build a composite that requires the resource alarm to be firing AND at least one user-facing alarm to be firing.

Remove notification actions from the child alarms. Only the composite sends pages. The child alarms stay visible in the console for investigation context — you can still see which individual metric triggered — but your on-call only gets paged when the system has determined an actual incident is in progress.

For more detail on what to monitor on each AWS service type, see the alarm configurations in the CloudWatch alarm setup guide below.

Frequently asked questions

What is a composite CloudWatch alarm?

A composite CloudWatch alarm aggregates the states of multiple existing CloudWatch alarms using a rule expression with AND, OR, and NOT logic. It enters ALARM state only when its rule evaluates to true — for example, when CPU is high AND latency is high. Unlike a metric alarm, it doesn't monitor a metric directly; it monitors the states of child alarms.

Can composite alarms reference other composite alarms?

Yes. Composite alarms can reference both metric alarms and other composite alarms as children, up to four levels of nesting. This lets you build hierarchical alerting — per-service composite alarms feeding into a platform-wide composite alarm.

Do composite alarms cost more than regular alarms?

Yes, slightly. At the time of writing, a composite alarm costs $0.50/alarm/month in us-east-1, versus $0.10/alarm/month for a standard-resolution metric alarm. Each child alarm also incurs its own standard alarm cost. Check the CloudWatch pricing page for current rates.

How do I suppress composite alarm notifications during maintenance?

Use the ActionsSuppressor field on the composite alarm. Point it at a 'maintenance active' alarm and drive that alarm into ALARM for the duration of your maintenance window, for example by publishing a custom metric from a CodeDeploy/EventBridge event. The CLI command aws cloudwatch set-alarm-state also works for quick tests, but the override only lasts until the alarm's next evaluation. While the suppressor is in ALARM, the composite alarm enters ALARM state as usual but doesn't fire its configured actions.

Should I put alarm actions on child alarms or only on the composite alarm?

Put notification actions only on the composite alarm. If child alarms also have SNS or other paging actions, you'll receive duplicate notifications for the same incident: one from the child alarm firing and one from the composite alarm firing. Keep child alarms free of notification actions; they're for investigation context in the console. Non-paging actions such as auto scaling can stay on the child.

