CloudWatch metric math: how to build alarms no static threshold can match
A static CloudWatch alarm on Lambda Errors > 5 fires whenever five errors occur — whether that's 5 out of 10 invocations (50% error rate) or 5 out of 100,000 (0.005%). One is a crisis. The other is background noise. Metric math lets you alarm on the ratio, the rate, or the derived signal — the thing that actually determines whether users are affected.
What is CloudWatch metric math?
Metric math lets you define a CloudWatch alarm on a mathematical expression derived from one or more raw metrics. The expression is evaluated at the alarm's period, and the result is compared against the threshold. You get all standard CloudWatch alarm features — DatapointsToAlarm, missing data treatment, SNS actions — on the derived value, not on the raw count.
| Static alarm | Metric math alarm |
|---|---|
| Lambda Errors > 5 | (Errors / Invocations) * 100 > 5% |
| RDS DatabaseConnections > 80 | (DatabaseConnections / max_connections) * 100 > 80%, with max_connections hardcoded per instance class |
| SQS ApproximateNumberOfMessagesVisible > 1000 | IF(FILL(NumberOfMessagesDeleted, 0) == 0 AND ApproximateNumberOfMessagesVisible > 0, 1, 0) > 0 |
| ALB HTTPCode_ELB_5XX_Count > 10 | (HTTPCode_ELB_5XX_Count + HTTPCode_Target_5XX_Count) / RequestCount * 100 > 1% |
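As a rough mental model of how an alarm evaluates the derived value described above, the sketch below applies a threshold and an M-out-of-N rule to an already-computed expression series. It is a simplification for illustration only: `evaluate_alarm` and its arguments are hypothetical names, `None` stands for a missing data point, and missing points are simply counted as not breaching (the `notBreaching` treatment). CloudWatch's real evaluation has more nuance around missing data.

```python
from typing import Optional, Sequence


def evaluate_alarm(
    derived: Sequence[Optional[float]],
    threshold: float,
    evaluation_periods: int,
    datapoints_to_alarm: int,
) -> str:
    """Return "ALARM" or "OK" for the most recent evaluation window."""
    # Only the last N periods of the derived series are considered.
    window = derived[-evaluation_periods:]
    # A missing point (None) is treated as not breaching in this model.
    breaching = sum(1 for v in window if v is not None and v > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Error-rate percentages per period; None marks a period with no data.
error_rate = [1.2, 7.5, None, 9.0]
state = evaluate_alarm(error_rate, threshold=5, evaluation_periods=3, datapoints_to_alarm=2)
print(state)  # ALARM: 7.5 and 9.0 both exceed 5 within the last three periods
```

The point of the model is that every comparison happens on the derived percentage, never on the raw error count.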
Pattern 1: Lambda error rate
Alarm when more than 5% of Lambda invocations return errors — regardless of traffic volume. A function getting 10 requests/minute and a function getting 10,000 requests/minute both trigger at the same error rate.

    LambdaErrorRateAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: lambda-error-rate-high
        AlarmDescription: Lambda error rate above 5% over 5 minutes
        Metrics:
          - Id: errors
            MetricStat:
              Metric:
                Namespace: AWS/Lambda
                MetricName: Errors
                Dimensions:
                  - Name: FunctionName
                    Value: !Ref MyFunction
              Period: 300
              Stat: Sum
            ReturnData: false
          - Id: invocations
            MetricStat:
              Metric:
                Namespace: AWS/Lambda
                MetricName: Invocations
                Dimensions:
                  - Name: FunctionName
                    Value: !Ref MyFunction
              Period: 300
              Stat: Sum
            ReturnData: false
          # Only the expression returns data; the raw metrics are inputs.
          - Id: errorRate
            Expression: "(errors / invocations) * 100"
            Label: ErrorRate
            ReturnData: true
        ComparisonOperator: GreaterThanThreshold
        Threshold: 5
        EvaluationPeriods: 3
        DatapointsToAlarm: 2
        TreatMissingData: notBreaching

`TreatMissingData: notBreaching` keeps the alarm from firing while the function receives no traffic. When a metric has no data points in a period, CloudWatch returns missing data and the expression evaluates to missing as well. A period with Invocations = 0 would be a divide-by-zero, which also comes back as missing data rather than NaN, so in both cases the alarm sees a missing point and treats it as within threshold.
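The missing-data behaviour is easier to see with a small model. This Python sketch mimics the expression for a single period, using `None` for a missing data point; the function names are illustrative, and the divide-by-zero handling encodes the behaviour described above rather than anything queried from CloudWatch.

```python
from typing import Optional


def error_rate_pct(errors: Optional[float], invocations: Optional[float]) -> Optional[float]:
    """Model "(errors / invocations) * 100" for one period; None = missing data."""
    if errors is None or invocations is None:
        return None  # no data points in the period, so no result
    if invocations == 0:
        return None  # divide-by-zero is modelled as missing, not NaN
    return errors / invocations * 100


def is_breaching(value: Optional[float], threshold: float = 5.0) -> bool:
    """With TreatMissingData: notBreaching, a missing result never breaches."""
    return value is not None and value > threshold


# Idle period, zero-invocation period, healthy period, failing period.
for errors, invocations in [(None, None), (0, 0), (1, 200), (30, 200)]:
    rate = error_rate_pct(errors, invocations)
    print(errors, invocations, rate, is_breaching(rate))
```

Only the last case breaches: the first two produce no value at all, and the third is well under 5%.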
Pattern 2: RDS connection saturation
Alarm when RDS has used more than 80% of its max_connections limit. The limit varies by instance class: a db.t3.micro has max_connections = 66, a db.r6g.large has 1365. A raw count threshold of `> 50` would be sensible for the first and pure noise for the second.
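The arithmetic behind that comparison, as a short Python sketch. The two limits are the ones quoted above; `saturation_pct` is an illustrative helper, not an AWS API.

```python
def saturation_pct(connections: int, max_connections: int) -> float:
    """Share of the connection limit currently in use, as a percentage."""
    return connections / max_connections * 100


# max_connections values quoted in the text above.
LIMITS = {"db.t3.micro": 66, "db.r6g.large": 1365}

for instance_class, limit in LIMITS.items():
    pct = saturation_pct(50, limit)
    print(f"{instance_class}: 50 connections = {pct:.1f}% of {limit}")
# db.t3.micro: 50 connections = 75.8% of 66
# db.r6g.large: 50 connections = 3.7% of 1365
```

The same raw count sits just under the 80% mark on the small instance and is negligible on the large one, which is why the alarm should be on the percentage.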

    RDSConnectionSaturationAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: rds-connection-saturation-high
        AlarmDescription: RDS using over 80% of max_connections
        Metrics:
          - Id: connections
            MetricStat:
              Metric:
                Namespace: AWS/RDS
                MetricName: DatabaseConnections
                Dimensions:
                  - Name: DBInstanceIdentifier
                    Value: !Ref MyDBInstance
              Period: 60
              Stat: Maximum
            ReturnData: false
          # Correct approach: hardcode the known max_connections for your
          # instance class (66 here, for a db.t3.micro).
          - Id: saturation
            Expression: "(connections / 66) * 100"
            Label: ConnectionSaturationPct
            ReturnData: true
        ComparisonOperator: GreaterThanThreshold
        Threshold: 80
        EvaluationPeriods: 2
        DatapointsToAlarm: 2
        TreatMissingData: notBreaching

The denominator is a constant because RDS does not publish max_connections as a CloudWatch metric (`MaximumUsedTransactionIDs` measures something unrelated and cannot stand in for it). Update the constant whenever you change instance class or override the parameter.
Pattern 3: ALB 5xx error rate
ALB publishes two distinct 5xx metrics: `HTTPCode_ELB_5XX_Count` (errors generated by the load balancer itself — 504 gateway timeout, 502 bad gateway) and `HTTPCode_Target_5XX_Count` (errors returned by your backend). A meaningful error rate alarm needs both.

    ALB5xxRateAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: alb-5xx-error-rate-high
        AlarmDescription: ALB 5xx error rate above 1% over 5 minutes
        Metrics:
          - Id: elbErrors
            MetricStat:
              Metric:
                Namespace: AWS/ApplicationELB
                MetricName: HTTPCode_ELB_5XX_Count
                Dimensions:
                  - Name: LoadBalancer
                    Value: !GetAtt MyLoadBalancer.LoadBalancerFullName
              Period: 300
              Stat: Sum
            ReturnData: false
          - Id: targetErrors
            MetricStat:
              Metric:
                Namespace: AWS/ApplicationELB
                MetricName: HTTPCode_Target_5XX_Count
                Dimensions:
                  - Name: LoadBalancer
                    Value: !GetAtt MyLoadBalancer.LoadBalancerFullName
              Period: 300
              Stat: Sum
            ReturnData: false
          - Id: requests
            MetricStat:
              Metric:
                Namespace: AWS/ApplicationELB
                MetricName: RequestCount
                Dimensions:
                  - Name: LoadBalancer
                    Value: !GetAtt MyLoadBalancer.LoadBalancerFullName
              Period: 300
              Stat: Sum
            ReturnData: false
          # FILL each error metric so a period with only one kind of 5xx
          # still produces a value instead of missing data.
          - Id: errorRate
            Expression: "((FILL(elbErrors, 0) + FILL(targetErrors, 0)) / requests) * 100"
            Label: Total5xxRatePct
            ReturnData: true
        ComparisonOperator: GreaterThanThreshold
        Threshold: 1
        EvaluationPeriods: 3
        DatapointsToAlarm: 2
        TreatMissingData: notBreaching

The two 5xx metrics are only reported for periods in which they are nonzero. Without `FILL(..., 0)`, a period containing target errors but no load balancer errors (or the reverse) would make the whole expression evaluate to missing, and the alarm would stay silent.
Pattern 4: SQS consumer stopped
The standard SQS alarm watches `ApproximateNumberOfMessagesVisible`, but that spikes whenever producers send a batch, even when consumers are healthy. A better signal: detect when messages are waiting in the queue but none are being deleted.

    SQSConsumerStoppedAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: sqs-consumer-stopped
        AlarmDescription: SQS messages visible but none being deleted; consumer may be down
        Metrics:
          - Id: deletes
            MetricStat:
              Metric:
                Namespace: AWS/SQS
                MetricName: NumberOfMessagesDeleted
                Dimensions:
                  - Name: QueueName
                    Value: !GetAtt MyQueue.QueueName
              Period: 300
              Stat: Sum
            ReturnData: false
          - Id: visible
            MetricStat:
              Metric:
                Namespace: AWS/SQS
                MetricName: ApproximateNumberOfMessagesVisible
                Dimensions:
                  - Name: QueueName
                    Value: !GetAtt MyQueue.QueueName
              Period: 300
              Stat: Maximum
            ReturnData: false
          # 1 when nothing was deleted while messages were waiting, else 0.
          - Id: consumerStopped
            Expression: "IF(FILL(deletes, 0) == 0 AND visible > 0, 1, 0)"
            Label: ConsumerStopped
            ReturnData: true
        ComparisonOperator: GreaterThanThreshold
        Threshold: 0
        EvaluationPeriods: 2
        DatapointsToAlarm: 2
        TreatMissingData: notBreaching

`FILL(deletes, 0)` substitutes 0 when `NumberOfMessagesDeleted` has no data for the period, so a period with no recorded deletes is treated as zero deletes. The `IF()` expression returns 1 when the consumer appears stopped (no deletes while messages are visible) and 0 otherwise. With `Threshold: 0` and `GreaterThanThreshold`, the alarm fires when the result is 1.
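To check the logic of that expression against a few cases, here is a small Python model of one evaluation period. `fill` and `consumer_stopped` are illustrative names, `None` stands for a missing data point, and the handling of a missing `visible` value is a modelling assumption.

```python
from typing import Optional


def fill(value: Optional[float], default: float) -> float:
    """FILL(metric, default): replace a missing data point with a default."""
    return default if value is None else value


def consumer_stopped(deletes: Optional[float], visible: Optional[float]) -> Optional[int]:
    """Model IF(FILL(deletes, 0) == 0 AND visible > 0, 1, 0) for one period."""
    if visible is None:
        return None  # assumption: an unfilled missing input leaves the result missing
    return 1 if fill(deletes, 0) == 0 and visible > 0 else 0


print(consumer_stopped(deletes=None, visible=12))  # 1: backlog, nothing deleted
print(consumer_stopped(deletes=40, visible=12))    # 0: consumer is working
print(consumer_stopped(deletes=None, visible=0))   # 0: idle queue, nothing to do
```

The third case is the one a plain queue-depth alarm cannot distinguish from the first: both have no deletes, but only one has a backlog.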
Pattern 5: Compound CPU and latency
Some incidents only matter when two conditions are true simultaneously. ECS CPU above 85% is worth knowing about — but it's only an active user-facing problem if latency is also elevated. Alerting on the compound condition cuts noise on normal scale-out events.

    CompoundCPULatencyAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: ecs-cpu-and-latency-high
        AlarmDescription: ECS CPU above 85% while ALB p99 latency is above 2s
        Metrics:
          - Id: cpu
            MetricStat:
              Metric:
                Namespace: AWS/ECS
                MetricName: CPUUtilization
                Dimensions:
                  # !Ref on an ECS service returns its ARN; the dimension needs the name.
                  - Name: ServiceName
                    Value: !GetAtt MyService.Name
                  - Name: ClusterName
                    Value: !Ref MyCluster
              Period: 60
              Stat: Average
            ReturnData: false
          - Id: latency
            MetricStat:
              Metric:
                Namespace: AWS/ApplicationELB
                MetricName: TargetResponseTime
                Dimensions:
                  - Name: LoadBalancer
                    Value: !GetAtt MyLoadBalancer.LoadBalancerFullName
              Period: 60
              Stat: p99
            ReturnData: false
          # TargetResponseTime is in seconds, so "latency > 2" means above 2s.
          - Id: compound
            Expression: "IF(cpu > 85 AND latency > 2, 1, 0)"
            Label: CPUAndLatencyHigh
            ReturnData: true
        ComparisonOperator: GreaterThanThreshold
        Threshold: 0
        EvaluationPeriods: 3
        DatapointsToAlarm: 2
        TreatMissingData: notBreaching

The `IF()` expression evaluates the boolean condition and returns 1 (true) or 0 (false) as a numeric value. With `ComparisonOperator: GreaterThanThreshold` and `Threshold: 0`, the alarm fires when the result is 1. This is the usual pattern for expressing a compound boolean condition in a single CloudWatch alarm.
When NOT to use metric math
Metric math adds operational complexity — the expression must be correct, and a division-by-zero or missing metric can produce unexpected alarm states. Avoid it when a simpler alarm achieves the same goal.
- Absolute capacity limits: alarm on `FreeableMemory < 512MB` directly — no math needed, the threshold is absolute.
- Simple count thresholds with consistent traffic: if your service handles exactly 1000 req/min with no variance, a raw error count threshold is fine.
- When you need per-alarm DatapointsToAlarm tuning on each metric: a metric math expression collapses to one alarm, losing the ability to tune each metric's anomaly sensitivity independently.
- Latency percentiles you want from X-Ray rather than ALB: X-Ray traces have more context; use a separate X-Ray-sourced alarm rather than forcing ALB TargetResponseTime through metric math.
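For contrast, the first bullet needs nothing more than a single-metric alarm. A minimal Python sketch, assuming the boto3 `put_metric_alarm` call; the alarm name, `db_instance_id`, period and evaluation settings are placeholders, and the threshold is converted on the assumption that "512MB" means 512 MiB and that `FreeableMemory` is reported in bytes.

```python
def freeable_memory_alarm(db_instance_id: str, min_free_mib: int = 512) -> dict:
    """Keyword arguments for a plain, single-metric CloudWatch alarm."""
    return {
        "AlarmName": f"{db_instance_id}-freeable-memory-low",
        "Namespace": "AWS/RDS",
        "MetricName": "FreeableMemory",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,
        # FreeableMemory is in bytes, so convert MiB to bytes.
        "Threshold": float(min_free_mib * 1024 * 1024),
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "missing",
    }


params = freeable_memory_alarm("my-db")
print(params["Threshold"])  # 536870912.0
```

With credentials configured, `boto3.client("cloudwatch").put_metric_alarm(**params)` would create it. There is no `Metrics` list and no expression to get wrong.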
Metric math functions reference
| Function | Syntax | Use case |
|---|---|---|
| IF | `IF(condition, trueVal, falseVal)` | Boolean conditions; returns numeric 0/1 for GreaterThanThreshold alarms |
| FILL | `FILL(metric, fillValue)` | Substitute a value (usually 0) when a metric has no data points in the period |
| RATE | `RATE(metric)` | Converts a cumulative count to a per-second rate — useful for metrics that reset on restart |
| SUM | `SUM([m1, m2, m3])` | Sum across multiple metrics — e.g. errors across multiple Lambda functions |
| MIN / MAX | `MIN([m1, m2])` / `MAX([m1, m2])` | Aggregate across a fleet — e.g. lowest FreeableMemory across all RDS read replicas |
| METRICS() | `METRICS("substring")` | Returns every metric in the request whose Id contains the given string, e.g. `SUM(METRICS("errors"))`; with no argument it returns all metrics |
| AVG | `AVG([m1, m2])` | Average across metric array — e.g. average CPU across all ECS tasks |
| ABS | `ABS(m)` | Absolute value — useful when metric values can be negative |
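Of the functions in the table, `RATE` is the least obvious. The Python sketch below models it on the assumption that it is the difference between consecutive data points divided by the seconds between them; `rate` and the sample counter values are illustrative.

```python
from typing import List, Optional, Tuple


def rate(points: List[Tuple[float, float]]) -> List[Optional[float]]:
    """Per-second rate of change between consecutive (timestamp_seconds, value) points."""
    if not points:
        return []
    result: List[Optional[float]] = [None]  # the first point has no predecessor
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        result.append((v1 - v0) / (t1 - t0))
    return result


# A cumulative counter sampled every 60 seconds that resets before the third sample.
counter = [(0, 100), (60, 160), (120, 10)]
print(rate(counter))  # [None, 1.0, -2.5]
```

Note the negative value where the counter resets; an alarm on a rate like this may want a guard such as `IF(RATE(m) < 0, 0, RATE(m))` so a restart does not register as a drop.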
Frequently asked questions
Can I use metric math in a CloudWatch alarm without writing CloudFormation?
Yes. In the CloudWatch console, when creating or editing an alarm, choose 'Select metric' then switch to the 'Math expression' tab. You can reference any existing metric by its ID and type a metric math expression directly. The console validates the expression and shows a preview of the computed value before you save the alarm.
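If you would rather script it than click through the console, the same kind of alarm can be created through the API. A minimal Python sketch, assuming the boto3 `put_metric_alarm` call and mirroring a Lambda error-rate alarm; the helper names and the `my-function` value are placeholders.

```python
def lambda_error_rate_alarm(function_name: str, threshold_pct: float = 5.0) -> dict:
    """Keyword arguments for put_metric_alarm with a metric math expression."""

    def stat(metric_id: str, metric_name: str) -> dict:
        # One raw input metric; ReturnData False because only the expression is alarmed on.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{function_name}-error-rate-high",
        "Metrics": [
            stat("errors", "Errors"),
            stat("invocations", "Invocations"),
            {
                "Id": "errorRate",
                "Expression": "(errors / invocations) * 100",
                "Label": "ErrorRate",
                "ReturnData": True,
            },
        ],
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold_pct,
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 2,
        "TreatMissingData": "notBreaching",
    }


def create_alarm(cloudwatch_client, function_name: str) -> None:
    """Create the alarm using any client that exposes put_metric_alarm."""
    cloudwatch_client.put_metric_alarm(**lambda_error_rate_alarm(function_name))
```

With credentials configured, usage would be `create_alarm(boto3.client("cloudwatch"), "my-function")`. Keeping the parameter construction separate from the API call makes the expression easy to unit test.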
What happens when a metric math expression has a divide-by-zero?
CloudWatch treats the result as missing data for that evaluation period. The alarm's TreatMissingData setting determines what happens next: 'notBreaching' (stays OK), 'breaching' (triggers ALARM), 'ignore' (stays in current state), or 'missing' (transitions to INSUFFICIENT_DATA). Always set TreatMissingData to 'notBreaching' for rate calculations that can have zero denominators.
How do I alarm on the error rate across multiple Lambda functions?
Define each function's Errors and Invocations as separate metrics (e.g. errors1, invocations1, errors2, invocations2), then write the expression: `(SUM([errors1, errors2]) / SUM([invocations1, invocations2])) * 100`. This gives you a fleet-wide error rate alarm on a single threshold, without managing one alarm per function. Alternatively, publish a custom metric that aggregates at the application level.
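A short Python sketch of that expression and what it computes. The helper names are illustrative, and the metric IDs follow the `errors1` / `invocations1` naming used above.

```python
from typing import Sequence


def fleet_error_rate_expression(function_count: int) -> str:
    """Build the metric math expression for N functions."""
    errors = ", ".join(f"errors{i}" for i in range(1, function_count + 1))
    invocations = ", ".join(f"invocations{i}" for i in range(1, function_count + 1))
    return f"(SUM([{errors}]) / SUM([{invocations}])) * 100"


def fleet_error_rate_pct(errors: Sequence[float], invocations: Sequence[float]) -> float:
    """What the expression evaluates to for one period."""
    return sum(errors) / sum(invocations) * 100


print(fleet_error_rate_expression(2))
# (SUM([errors1, errors2]) / SUM([invocations1, invocations2])) * 100

# One period: function 1 has 2 errors in 1000 calls, function 2 has 50 in 100.
print(f"{fleet_error_rate_pct([2, 50], [1000, 100]):.2f}%")  # 4.73%
```

The worked numbers show a trade-off of the fleet-wide form: it weights by traffic, so a low-traffic function failing half its calls can stay under a 5% fleet threshold.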
What is the FILL function used for in CloudWatch metric math?
FILL substitutes a specified value when a metric has no data points in an evaluation period. Without FILL, a missing metric in a math expression causes the entire expression to evaluate as missing. The most common use is `FILL(metric, 0)`, which treats periods of no activity as zero rather than missing data. Use it on metrics that can go quiet while the rest of the expression still has data (e.g. SQS deletes during an idle period). Be careful when filling a denominator with 0: that only trades a missing value for a divide-by-zero.
Does CloudWatch metric math work with custom metrics?
Yes. Custom metrics published to CloudWatch (via the PutMetricData API or the CloudWatch agent) are fully supported in metric math expressions. Reference them the same way as AWS service metrics: by namespace, metric name, and dimensions. The main constraint for an alarm is that all metrics in the expression must share the same Period; they are also normally read from the same AWS account and Region as the alarm.
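As an illustration, the Python sketch below shows the two halves side by side: the arguments for publishing a custom metric with boto3's `put_metric_data`, and the matching entry for an alarm's `Metrics` list. The namespace, dimension, and metric names are hypothetical.

```python
NAMESPACE = "MyApp/Checkout"  # hypothetical custom namespace
DIMENSIONS = [{"Name": "Service", "Value": "checkout-api"}]  # hypothetical dimension


def publish_kwargs(metric_name: str, value: float) -> dict:
    """Keyword arguments for cloudwatch.put_metric_data(**kwargs)."""
    return {
        "Namespace": NAMESPACE,
        "MetricData": [
            {
                "MetricName": metric_name,
                "Dimensions": DIMENSIONS,
                "Value": value,
                "Unit": "Count",
            }
        ],
    }


def metric_query(query_id: str, metric_name: str) -> dict:
    """An input entry for an alarm's Metrics list, referencing the same metric."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": NAMESPACE,
                "MetricName": metric_name,
                "Dimensions": DIMENSIONS,
            },
            "Period": 300,
            "Stat": "Sum",
        },
        "ReturnData": False,
    }


# Inputs for an expression such as "(failed / attempts) * 100".
queries = [metric_query("failed", "PaymentsFailed"), metric_query("attempts", "PaymentsAttempted")]
```

The namespace, metric name, and dimensions in the query must match what was published exactly, otherwise the alarm reads an empty metric and the expression evaluates to missing.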