Composite CloudWatch alarms: stop getting paged for things that aren't incidents
Your ECS service CPU hits 83%. CloudWatch pages you at 2:47am. You investigate for 11 minutes before realising it's the nightly analytics export — a batch job that's been running every night for 6 months and has never affected users. You silence the alarm. A week later, a real incident happens, CPU hits 91%, and you ignore the page because you think it's the batch job again.
This is the core failure mode of single-metric alerting: every threshold violation looks the same, so your team stops treating them seriously. Composite CloudWatch alarms are the fix.
- What is a composite CloudWatch alarm?
- Why single-metric alarms create noise your team learns to ignore
- How composite alarm rule expressions work
- Building a composite alarm for an ECS service (CloudFormation)
- The ActionsSuppressor: suppressing alerts during deployments
- Common composite alarm patterns
- What composite alarms can't do
What is a composite CloudWatch alarm?
A composite CloudWatch alarm combines the states of multiple existing alarms using a rule expression. It fires only when that expression evaluates to true — for example, when CPU is high AND latency is high AND error rate is rising. Unlike a metric alarm, a composite alarm doesn't monitor a metric directly; it monitors the states of other alarms.
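Rule expressions use three state functions (ALARM(), OK(), INSUFFICIENT_DATA()) joined with AND, OR, and NOT operators. You can nest parentheses for precedence. When the expression evaluates to true, the composite alarm enters ALARM state and runs its configured actions. These are notification-style actions such as publishing to an SNS topic or invoking a Lambda function; unlike metric alarms, composite alarms can't perform EC2 or Auto Scaling actions, so those stay on the child alarms.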
Why single-metric alarms create noise your team learns to ignore
A typical 10-service ECS stack with 4–5 metrics monitored per service generates 30–60 individual alarm events per week. The overwhelming majority of those events are not incidents. They're batch jobs, traffic bursts, deploys that momentarily spike CPU, and Lambda cold-start clusters that never actually affect users.
The problem compounds over time. Teams that receive too many false pages do one of two things: they raise thresholds so high the alarm is nearly useless, or they unconsciously start treating pages as informational noise rather than action triggers. Both outcomes mean a real incident will be missed.
| Alarm fires | What it usually means | What you actually need to check |
|---|---|---|
| ECS CPU > 80% | Nightly batch job, blue/green deploy warmup, traffic spike | Is latency or error rate also affected? |
| RDS connections > 80% | App restart, migration script, connection pool bug | Is query latency also elevated? |
| Lambda throttles > 0 | Burst limit hit momentarily, resolves in seconds | Are retries exhausted? Is a downstream service affected? |
| ALB 5xx count > 5 | Single bad deploy request, health check during deploy | Is the rate sustained? Is CPU or latency also elevated? |
| SQS queue depth > 100 | Consumer restart, scheduled batch pause | Has the queue been growing for more than 5 minutes? |
Composite alarms let you encode this reasoning directly into the alerting layer. Instead of asking your on-call engineer to make the 'is this real?' judgement at 3am, you make that judgement once in a rule expression and let the system apply it consistently.
How composite alarm rule expressions work
A rule expression is a string that references child alarm names using three state functions and three logical operators. The composite alarm enters ALARM state when the expression is true.
State functions
| Function | Evaluates to true when… | Typical use |
|---|---|---|
| ALARM("alarm-name") | The child alarm is in ALARM state | The standard case — require this alarm to be firing |
| OK("alarm-name") | The child alarm is in OK state | Suppression logic — fire when a health check is passing but something else is wrong |
| INSUFFICIENT_DATA("alarm-name") | The child alarm is in INSUFFICIENT_DATA state (not enough data to evaluate) | Detecting gaps in telemetry from a service that should always be reporting |
Logical operators
- AND — both conditions must be true
- OR — either condition must be true
- NOT — inverts the condition
- Parentheses — standard precedence grouping
A rule expression can be up to 10,240 characters and reference up to 100 child alarms. Child alarms can be metric alarms or other composite alarms — you can build nested composite alarms up to four levels deep.
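To make the evaluation semantics concrete, here is a minimal sketch of a local evaluator for rule expressions. It is not how CloudWatch implements evaluation; it only mimics the state functions and operators described above (plus the TRUE and FALSE constants that CloudWatch also accepts) so you can test a rule against hypothetical child-alarm states before deploying it. The precedence used here (NOT, then AND, then OR) is an assumption; in real rules, use parentheses rather than relying on it.

```python
import re

# Tokens: state functions, operators, constants, parentheses, quoted alarm names.
_TOKEN = re.compile(
    r'\s*(ALARM|OK|INSUFFICIENT_DATA|AND|OR|NOT|TRUE|FALSE|\(|\)|"[^"]*")'
)
_STATES = ("ALARM", "OK", "INSUFFICIENT_DATA")


def tokenize(rule):
    """Split a rule expression into tokens, rejecting anything unrecognised."""
    rule = rule.strip()
    tokens, pos = [], 0
    while pos < len(rule):
        match = _TOKEN.match(rule, pos)
        if match is None:
            raise ValueError(f"unexpected text at offset {pos}: {rule[pos:pos + 20]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(rule, states):
    """Evaluate a rule against a dict of {alarm_name: state}."""
    tokens = tokenize(rule)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        token = peek()
        if token is None or (expected is not None and token != expected):
            raise ValueError(f"expected {expected or 'a token'}, got {token!r}")
        pos += 1
        return token

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            take()
            right = parse_and()
            value = value or right
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            take()
            right = parse_not()
            value = value and right
        return value

    def parse_not():
        if peek() == "NOT":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        token = take()
        if token == "(":
            value = parse_or()
            take(")")
            return value
        if token in ("TRUE", "FALSE"):
            return token == "TRUE"
        if token in _STATES:
            take("(")
            name = take()
            if not name.startswith('"'):
                raise ValueError(f"expected a quoted alarm name, got {name!r}")
            take(")")
            # The function is true when the child is in exactly that state.
            return states[name[1:-1]] == token
        raise ValueError(f"unexpected token {token!r}")

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"trailing token {peek()!r}")
    return result


RULE = '''ALARM("api-service-cpu-high") AND
(ALARM("api-service-p99-latency-high") OR ALARM("api-service-5xx-rate-high"))'''

batch_job = {
    "api-service-cpu-high": "ALARM",
    "api-service-p99-latency-high": "OK",
    "api-service-5xx-rate-high": "OK",
}
incident = dict(batch_job, **{"api-service-p99-latency-high": "ALARM"})

print("batch job pages:", evaluate(RULE, batch_job))
print("incident pages:", evaluate(RULE, incident))
```

The two dictionaries stand in for the nightly batch job (CPU alone) and a real incident (CPU plus latency); only the second makes the rule true.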
    # Page only when CPU is high AND (latency is degraded OR errors are elevated)
    ALARM("api-service-cpu-high") AND
    (ALARM("api-service-p99-latency-high") OR ALARM("api-service-5xx-rate-high"))

    # Page when any of three critical services is degraded
    ALARM("api-service-degraded") OR
    ALARM("worker-service-degraded") OR
    ALARM("auth-service-degraded")

    # Page when queue is deep AND the consumer is running (not just stopped)
    ALARM("payment-queue-depth-high") AND NOT ALARM("payment-consumer-stopped")

Building a composite alarm for an ECS service
Here's a production-ready composite alarm for an ECS service behind an ALB. It pages only when CPU is elevated AND users are seeing degraded performance — either high p99 latency or elevated 5xx rate.
Three child alarms feed into the composite: CPUUtilization from ECS, p99 TargetResponseTime from the ALB target group, and HTTPCode_Target_5XX_Count from the ALB. The composite rule requires CPU to be in alarm AND at least one user-facing metric to be in alarm. A CPU spike alone — the batch job case — never fires the composite.
    Resources:
      # Child alarm 1: ECS CPU
      EcsCpuAlarm:
        Type: AWS::CloudWatch::Alarm
        Properties:
          AlarmName: api-service-cpu-high
          Namespace: AWS/ECS
          MetricName: CPUUtilization
          Dimensions:
            - Name: ClusterName
              Value: YOUR_CLUSTER_NAME
            - Name: ServiceName
              Value: api-service
          Statistic: Average
          Period: 60
          EvaluationPeriods: 3
          Threshold: 80
          ComparisonOperator: GreaterThanThreshold
          TreatMissingData: notBreaching

      # Child alarm 2: ALB p99 latency
      AlbLatencyAlarm:
        Type: AWS::CloudWatch::Alarm
        Properties:
          AlarmName: api-service-p99-latency-high
          Namespace: AWS/ApplicationELB
          MetricName: TargetResponseTime
          Dimensions:
            - Name: TargetGroup
              Value: YOUR_TARGET_GROUP_SUFFIX
            - Name: LoadBalancer
              Value: YOUR_ALB_SUFFIX
          ExtendedStatistic: p99
          Period: 60
          EvaluationPeriods: 3
          Threshold: 1
          ComparisonOperator: GreaterThanThreshold
          TreatMissingData: notBreaching

      # Child alarm 3: ALB 5xx rate
      Alb5xxAlarm:
        Type: AWS::CloudWatch::Alarm
        Properties:
          AlarmName: api-service-5xx-rate-high
          Namespace: AWS/ApplicationELB
          MetricName: HTTPCode_Target_5XX_Count
          Dimensions:
            - Name: TargetGroup
              Value: YOUR_TARGET_GROUP_SUFFIX
            - Name: LoadBalancer
              Value: YOUR_ALB_SUFFIX
          Statistic: Sum
          Period: 60
          EvaluationPeriods: 2
          Threshold: 10
          ComparisonOperator: GreaterThanThreshold
          TreatMissingData: notBreaching

      # Composite alarm: page only when users are affected
      ApiServiceDegraded:
        Type: AWS::CloudWatch::CompositeAlarm
        DependsOn:
          - EcsCpuAlarm
          - AlbLatencyAlarm
          - Alb5xxAlarm
        Properties:
          AlarmName: api-service-degraded
          AlarmDescription: >-
            CPU is high AND (p99 latency > 1s OR 5xx rate > 10/min).
            CPU alone does not page; only when users are affected.
          AlarmRule: >-
            ALARM("api-service-cpu-high") AND
            (ALARM("api-service-p99-latency-high") OR ALARM("api-service-5xx-rate-high"))
          AlarmActions:
            - YOUR_SNS_TOPIC_ARN
          OKActions:
            - YOUR_SNS_TOPIC_ARN

The ActionsSuppressor: suppressing alerts during deployments
ActionsSuppressor is the most underused feature in composite alarms. It lets you define a 'suppressor alarm' — when that alarm is in ALARM state, the composite alarm's actions are silenced even if the composite would otherwise fire. The composite alarm still enters ALARM state and is visible in the console; it just doesn't send notifications.
The canonical use case: you have a deployment alarm that is in ALARM whenever a CodeDeploy deployment or an ECS rolling update is in progress. Set that alarm as the ActionsSuppressor on your composite service alarm. CPU and latency spikes during the 90-second deploy window then stop paging your on-call engineer, with no manual maintenance window to set up.
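    Resources:
      # A metric alarm that goes to ALARM during deployments.
      # Wire this from a CodeDeploy event via EventBridge, or set it manually
      # via a deploy script: aws cloudwatch set-alarm-state ...
      # Note: set-alarm-state is temporary. A metric alarm returns to its
      # evaluated state at the next evaluation, so publishing the
      # DeploymentActive metric for the duration of the deploy is more durable.
      DeploymentInProgress:
        Type: AWS::CloudWatch::Alarm
        Properties:
          AlarmName: api-service-deploying
          Namespace: ConvOps/Deployments
          MetricName: DeploymentActive
          Dimensions:
            - Name: ServiceName
              Value: api-service
          Statistic: Maximum
          Period: 60
          EvaluationPeriods: 1
          Threshold: 1
          ComparisonOperator: GreaterThanOrEqualToThreshold
          TreatMissingData: notBreaching

      # The three child alarms from the previous template are assumed to exist.
      ApiServiceDegraded:
        Type: AWS::CloudWatch::CompositeAlarm
        # The suppressor is referenced by name, so make the ordering explicit
        DependsOn: DeploymentInProgress
        Properties:
          AlarmName: api-service-degraded
          AlarmRule: >-
            ALARM("api-service-cpu-high") AND
            (ALARM("api-service-p99-latency-high") OR ALARM("api-service-5xx-rate-high"))
          AlarmActions:
            - YOUR_SNS_TOPIC_ARN
          # Suppress actions when a deploy is in progress
          ActionsSuppressor: api-service-deploying
          ActionsSuppressorWaitPeriod: 120
          ActionsSuppressorExtensionPeriod: 60

ActionsSuppressorWaitPeriod (120 seconds above) is the maximum time the composite waits, after it changes state, for the suppressor alarm to enter ALARM. It covers the case where the composite trips slightly before the deployment alarm does. If the suppressor has not fired by then, the composite runs its actions, so pages from this composite can be delayed by up to the wait period. ActionsSuppressorExtensionPeriod (60 seconds) is the maximum time the composite keeps waiting after the suppressor leaves ALARM, which gives the service time to settle once the deploy finishes. If the composite is still in ALARM when that period ends, its actions run.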
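The DeploymentActive metric used by the suppressor alarm has to be published by something. One option, sketched below, is a small helper called from the deploy pipeline. The function names are hypothetical; the namespace, metric name, and dimension are taken from the template above. The `cloudwatch` argument is assumed to be a boto3 CloudWatch client (or anything exposing a compatible `put_metric_data` method), so the block itself does not import boto3.

```python
from datetime import datetime, timezone


def deployment_marker(service_name, active, namespace="ConvOps/Deployments"):
    """Build put_metric_data arguments for the DeploymentActive marker metric."""
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": "DeploymentActive",
                "Dimensions": [{"Name": "ServiceName", "Value": service_name}],
                "Timestamp": datetime.now(timezone.utc),
                # 1 while a deploy is running, 0 once it has finished
                "Value": 1.0 if active else 0.0,
                "Unit": "None",
            }
        ],
    }


def publish_deployment_marker(cloudwatch, service_name, active):
    """Send one datapoint; call repeatedly while the deploy is running."""
    cloudwatch.put_metric_data(**deployment_marker(service_name, active))
```

In a deploy script this might be used as `publish_deployment_marker(boto3.client("cloudwatch"), "api-service", active=True)`. Because the alarm above has a 60-second period and treats missing data as not breaching, the marker needs to be re-published at least once a minute for as long as the deploy runs; otherwise the suppressor drops back to OK mid-deploy.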
Common composite alarm patterns
| Pattern | Rule expression | What it solves |
|---|---|---|
| User-impact gate | ALARM("cpu-high") AND (ALARM("latency-high") OR ALARM("error-rate-high")) | Stops paging for resource spikes that don't affect users |
| Any-of-N services down | ALARM("svc-a-degraded") OR ALARM("svc-b-degraded") OR ALARM("svc-c-degraded") | Single 'platform health' alarm that fires if any critical service degrades |
| Queue stuck with live consumer | ALARM("queue-depth-high") AND NOT ALARM("consumer-stopped") | Distinguishes a stuck consumer from a paused/scaled-down consumer |
| Database under pressure | ALARM("rds-connections-high") AND ALARM("rds-latency-high") | Connection count alone spikes during restarts; only page when queries are slow too |
| Lambda at capacity | ALARM("lambda-throttles-high") AND ALARM("lambda-errors-high") | Throttles during a burst resolve in seconds; only page when errors are also elevated |
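Before adopting a pattern it helps to enumerate every combination of child states and see which ones page. The sketch below does that for the user-impact gate from the table, modelling the rule as a plain Python function rather than calling CloudWatch; the function and variable names are illustrative only.

```python
from itertools import product


def user_impact_gate(cpu_high, latency_high, errors_high):
    """ALARM("cpu-high") AND (ALARM("latency-high") OR ALARM("error-rate-high"))"""
    return cpu_high and (latency_high or errors_high)


def truth_table(rule, names):
    """Return (states, pages) for every combination of child alarm states."""
    rows = []
    for values in product([False, True], repeat=len(names)):
        rows.append((dict(zip(names, values)), rule(*values)))
    return rows


NAMES = ("cpu-high", "latency-high", "error-rate-high")

for states, pages in truth_table(user_impact_gate, NAMES):
    firing = [name for name, on in states.items() if on] or ["(none)"]
    print(f"{'PAGE ' if pages else 'quiet'}  {', '.join(firing)}")
```

Three of the eight combinations page, and all three include cpu-high. The table also makes the trade-off visible: latency or errors without a CPU signal stay quiet under this rule, so a separate alarm may be needed if that case matters for your service.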
What composite alarms can't do
Composite alarms are powerful but have real limits worth knowing before you design your alerting architecture.
- No metric math in child alarms used as composite inputs — metric math alarms (those using MetricDataQueries) cannot be child alarms of a composite alarm. Use standard metric alarms instead.
- No cross-account or cross-Region child alarms: all child alarms must be in the same AWS account and Region as the composite. For multi-account monitoring you need to replicate alarms or use CloudWatch cross-account dashboards.
- 100 child alarms per composite — the hard limit. For large services with many metrics, build intermediate composite alarms and combine them.
- Composite alarms don't collect data — they have no metrics or history you can graph. The alarm state history shows when it entered ALARM, not metric values.
- Actions fire on the composite, not the child — if you need per-metric actions (e.g. auto-scaling on CPU specifically), keep those actions on the child alarms. Just remove the notification actions from children to avoid double-paging.
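For the 100-child limit, the workaround of intermediate composites can be planned mechanically. The sketch below is one way to do it, assuming an "any child degraded" OR rule: it splits a long list of child alarm names into groups, emits one intermediate composite per group, and a parent that ORs the groups together. The naming scheme (`<parent>-group-N`) and function names are hypothetical; the two limits are the ones quoted in this article.

```python
MAX_CHILDREN = 100      # child alarms per composite
MAX_RULE_CHARS = 10240  # maximum rule expression length


def any_of(names):
    """OR together ALARM() checks for the given alarm names."""
    rule = " OR ".join(f'ALARM("{name}")' for name in names)
    if len(rule) > MAX_RULE_CHARS:
        raise ValueError("rule too long; use smaller groups or shorter names")
    return rule


def plan_composites(parent, children, group_size=MAX_CHILDREN):
    """Return {composite_name: rule}, adding intermediate groups if needed."""
    if not children:
        raise ValueError("need at least one child alarm")
    if len(children) <= group_size:
        return {parent: any_of(children)}
    plan = {}
    for start in range(0, len(children), group_size):
        group = f"{parent}-group-{start // group_size + 1}"
        plan[group] = any_of(children[start:start + group_size])
    if len(plan) > group_size:
        raise ValueError("too many groups for one parent; add another level")
    # The parent references only the intermediate composites.
    plan[parent] = any_of(list(plan))
    return plan


names = [f"svc-{i}-degraded" for i in range(250)]
for name, rule in plan_composites("platform-degraded", names).items():
    print(name, "->", rule.count("ALARM("), "children")
```

Intermediate composites are returned before the parent, which is also the order in which they would need to be created.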
The setup that stops the 3am batch job page
The pattern that eliminates most false pages follows the same structure regardless of service type. Start with three child alarms: a resource utilisation alarm (CPU, memory, connections), a latency alarm (p99 response time or processing time), and an error rate alarm (5xx, DLQ depth, function errors). Build a composite that requires the resource alarm to be firing AND at least one user-facing alarm to be firing.
Remove notification actions from the child alarms. Only the composite sends pages. The child alarms stay visible in the console for investigation context — you can still see which individual metric triggered — but your on-call only gets paged when the system has determined an actual incident is in progress.
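Since the pattern is the same for every service, the rule and the composite definition can be generated rather than hand-written. The sketch below builds the rule expression and the keyword arguments for the CloudWatch `PutCompositeAlarm` API (`put_composite_alarm` in boto3). The helper names and the SNS topic ARN are placeholders, not values from a real account.

```python
def alarm(name):
    """Rule fragment that is true when the named child alarm is in ALARM."""
    return f'ALARM("{name}")'


def user_impact_rule(resource_alarm, user_facing_alarms):
    """Resource alarm AND at least one user-facing alarm."""
    if not user_facing_alarms:
        raise ValueError("need at least one user-facing alarm")
    impact = " OR ".join(alarm(name) for name in user_facing_alarms)
    return f"{alarm(resource_alarm)} AND ({impact})"


def composite_alarm_request(name, resource_alarm, user_facing_alarms, sns_topic_arn):
    """Keyword arguments for put_composite_alarm; only the composite notifies."""
    return {
        "AlarmName": name,
        "AlarmRule": user_impact_rule(resource_alarm, user_facing_alarms),
        "AlarmActions": [sns_topic_arn],
        "OKActions": [sns_topic_arn],
    }


request = composite_alarm_request(
    name="api-service-degraded",
    resource_alarm="api-service-cpu-high",
    user_facing_alarms=["api-service-p99-latency-high", "api-service-5xx-rate-high"],
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:oncall-pages",  # placeholder
)
print(request["AlarmRule"])
```

With boto3 installed and credentials configured, the request could be applied with `boto3.client("cloudwatch").put_composite_alarm(**request)`. The child alarms must already exist and, per the pattern above, should carry no notification actions of their own.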
For more detail on what to monitor on each AWS service type, see the alarm configurations in the CloudWatch alarm setup guide below.
Frequently asked questions
What is a composite CloudWatch alarm?
A composite CloudWatch alarm aggregates the states of multiple existing CloudWatch alarms using a rule expression with AND, OR, and NOT logic. It enters ALARM state only when its rule evaluates to true — for example, when CPU is high AND latency is high. Unlike a metric alarm, it doesn't monitor a metric directly; it monitors the states of child alarms.
Can composite alarms reference other composite alarms?
Yes. Composite alarms can reference both metric alarms and other composite alarms as children, up to four levels of nesting. This lets you build hierarchical alerting — per-service composite alarms feeding into a platform-wide composite alarm.
Do composite alarms cost more than regular alarms?
Yes, slightly. At the time of writing, a composite alarm is billed at $0.50/alarm/month in us-east-1, compared with $0.10/alarm/month for a standard-resolution metric alarm. Each child alarm also incurs its own standard alarm cost. Check the CloudWatch pricing page for current figures in your Region.
How do I suppress composite alarm notifications during maintenance?
Use the ActionsSuppressor field on the composite alarm. Point it at a 'maintenance active' alarm and drive that alarm into ALARM state during your maintenance window, ideally by publishing a marker metric from your deploy pipeline or from a CodeDeploy/EventBridge event. Setting the state manually via the CLI (aws cloudwatch set-alarm-state) also works, but only temporarily: a metric alarm returns to its evaluated state at the next evaluation. While the suppressor is in ALARM, the composite alarm enters ALARM state as usual but doesn't fire its configured actions.
Should I put alarm actions on child alarms or only on the composite alarm?
Put notification actions only on the composite alarm. If child alarms also notify via SNS, you'll receive duplicate notifications for the same incident: one from the child alarm firing and one from the composite alarm firing. Non-notification actions tied to a single metric, such as Auto Scaling on CPU, can stay on the child alarm. Otherwise keep child alarms action-free; they're for investigation context in the console, not for notification delivery.