CloudWatch metric math: how to build alarms no static threshold can match

June 3, 2026 · 11 min read

A static CloudWatch alarm on Lambda Errors > 5 fires as soon as a period contains more than five errors, whether that is 6 out of 10 invocations (a 60% error rate) or 6 out of 100,000 (0.006%). One is a crisis. The other is background noise. Metric math lets you alarm on the ratio, the rate, or another derived signal: the thing that actually determines whether users are affected.
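To make the contrast concrete, here is a minimal Python sketch that applies both alarm styles to the same raw error count at two traffic levels. The function names and the sample numbers are illustrative assumptions, not CloudWatch APIs; the code only mirrors the arithmetic.

```python
def error_rate_pct(errors: int, invocations: int) -> float:
    """Mirror the metric math expression (errors / invocations) * 100."""
    return errors / invocations * 100


def static_alarm_breaches(errors: int, threshold: int = 5) -> bool:
    """Static alarm: Errors > threshold, with no regard for traffic."""
    return errors > threshold


def rate_alarm_breaches(errors: int, invocations: int, threshold_pct: float = 5.0) -> bool:
    """Metric math alarm: error rate in percent > threshold_pct."""
    return error_rate_pct(errors, invocations) > threshold_pct


# Same raw error count, very different traffic (illustrative numbers).
for errors, invocations in [(6, 10), (6, 100_000)]:
    print(
        f"{errors}/{invocations}: "
        f"static={static_alarm_breaches(errors)} "
        f"rate={rate_alarm_breaches(errors, invocations)} "
        f"({error_rate_pct(errors, invocations):.3f}%)"
    )
```

The static check breaches in both cases, while the ratio check breaches only for the low-traffic case where most requests are failing.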

A metric math alarm evaluates exactly one expression. In the alarm's Metrics list, exactly one entry sets `ReturnData: true`; every other metric or intermediate expression feeds into it with `ReturnData: false`. The result of that one expression is compared against the threshold exactly as a raw metric would be.

What is CloudWatch metric math?

Metric math lets you define a CloudWatch alarm on a mathematical expression derived from one or more raw metrics. The expression is evaluated at the alarm's period, and the result is compared against the threshold. You get all standard CloudWatch alarm features — DatapointsToAlarm, missing data treatment, SNS actions — on the derived value, not on the raw count.

| Static alarm | Metric math alarm |
| --- | --- |
| Lambda Errors > 5 | (Errors / Invocations) * 100 > 5% |
| RDS DatabaseConnections > 80 | (DatabaseConnections / 66) * 100 > 80%, where 66 is the instance's max_connections |
| SQS ApproximateNumberOfMessagesVisible > 1000 | IF(FILL(NumberOfMessagesDeleted, 0) == 0 AND ApproximateNumberOfMessagesVisible > 0, 1, 0) > 0 |
| ALB HTTPCode_ELB_5XX_Count > 10 | (HTTPCode_ELB_5XX_Count + HTTPCode_Target_5XX_Count) / RequestCount * 100 > 1% |

Pattern 1: Lambda error rate

Alarm when more than 5% of Lambda invocations return errors — regardless of traffic volume. A function getting 10 requests/minute and a function getting 10,000 requests/minute both trigger at the same error rate.

LambdaErrorRateAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: lambda-error-rate-high
    AlarmDescription: Lambda error rate above 5% over 5 minutes
    Metrics:
      - Id: errors
        MetricStat:
          Metric:
            Namespace: AWS/Lambda
            MetricName: Errors
            Dimensions:
              - Name: FunctionName
                Value: !Ref MyFunction
          Period: 300
          Stat: Sum
        ReturnData: false
      - Id: invocations
        MetricStat:
          Metric:
            Namespace: AWS/Lambda
            MetricName: Invocations
            Dimensions:
              - Name: FunctionName
                Value: !Ref MyFunction
          Period: 300
          Stat: Sum
        ReturnData: false
      - Id: errorRate
        Expression: "(errors / invocations) * 100"
        Label: ErrorRate
        ReturnData: true
    ComparisonOperator: GreaterThanThreshold
    Threshold: 5
    EvaluationPeriods: 3
    DatapointsToAlarm: 2
    TreatMissingData: notBreaching

`TreatMissingData: notBreaching` keeps the alarm from firing while the function receives no traffic. When a metric has no data points in a period, CloudWatch returns missing data for it, so the expression evaluates to missing rather than to NaN or a divide-by-zero error. `notBreaching` then tells the alarm to treat those missing periods as healthy.
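The interaction between `EvaluationPeriods: 3`, `DatapointsToAlarm: 2`, and `notBreaching` can be sketched in a few lines of Python. This is a simplified model under stated assumptions: `error_rate` and `alarm_fires` are hypothetical helpers, a missing value is represented as `None`, and CloudWatch's real handling of missing data has more cases than this.

```python
from typing import Optional, Sequence


def error_rate(errors: Optional[float], invocations: Optional[float]) -> Optional[float]:
    """Return the error rate in percent, or None (missing) when it cannot be computed."""
    if errors is None or invocations is None or invocations == 0:
        return None
    return errors / invocations * 100


def alarm_fires(
    rates: Sequence[Optional[float]],
    threshold: float = 5.0,
    evaluation_periods: int = 3,
    datapoints_to_alarm: int = 2,
) -> bool:
    """Simplified M-out-of-N check over the most recent periods.

    Missing values (None) are counted as not breaching, which is the
    assumed effect of TreatMissingData: notBreaching.
    """
    window = rates[-evaluation_periods:]
    breaching = sum(1 for r in window if r is not None and r > threshold)
    return breaching >= datapoints_to_alarm


# (errors, invocations) per 5-minute period; None means no data point was published.
periods = [(0, 200), (None, None), (30, 300), (12, 150)]
rates = [error_rate(e, i) for e, i in periods]
print(rates)
print("alarm fires:", alarm_fires(rates))
```

In this sample the idle period contributes nothing, and the two most recent periods both exceed 5%, so two of the last three data points breach and the modeled alarm fires.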

Pattern 2: RDS connection saturation

Alarm when RDS has used more than 80% of its max_connections limit. The limit varies by instance class (and depends on the engine and parameter group): a db.t3.micro has max_connections = 66, while a db.r6g.large has 1365. A raw count threshold of `> 50` would be reasonable for the first and would fire far too early for the second.

RDSConnectionSaturationAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: rds-connection-saturation-high
    AlarmDescription: RDS using over 80% of max_connections
    Metrics:
      - Id: connections
        MetricStat:
          Metric:
            Namespace: AWS/RDS
            MetricName: DatabaseConnections
            Dimensions:
              - Name: DBInstanceIdentifier
                Value: !Ref MyDBInstance
          Period: 60
          Stat: Maximum
        ReturnData: false
      - Id: saturation
        # 66 is max_connections for a db.t3.micro; replace it with the limit for your instance class
        Expression: "(connections / 66) * 100"
        Label: ConnectionSaturationPct
        ReturnData: true
    ComparisonOperator: GreaterThanThreshold
    Threshold: 80
    EvaluationPeriods: 2
    DatapointsToAlarm: 2
    TreatMissingData: notBreaching

RDS does not publish a MaxConnections metric. The actual max_connections value is an instance parameter, so look it up once per instance class and hardcode it as a constant in the expression, as `(connections / 66) * 100` does above for a db.t3.micro. Re-check the value whenever you resize the instance. AWS documents the default parameter values by instance class.
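A short Python sketch of the arithmetic shows why the hardcoded constant matters. It uses the two max_connections values quoted in this section; treat them as assumptions to verify against your own engine and parameter group. The helper names are hypothetical.

```python
# max_connections values quoted in the text; verify them for your own
# engine and parameter group before relying on them.
MAX_CONNECTIONS = {"db.t3.micro": 66, "db.r6g.large": 1365}


def saturation_pct(connections: int, instance_class: str) -> float:
    """Mirror the expression (connections / <max_connections>) * 100."""
    return connections / MAX_CONNECTIONS[instance_class] * 100


def raw_count_for(pct: float, instance_class: str) -> float:
    """Connection count at which the instance reaches the given saturation."""
    return MAX_CONNECTIONS[instance_class] * pct / 100


for instance_class in MAX_CONNECTIONS:
    print(
        f"{instance_class}: 50 connections = "
        f"{saturation_pct(50, instance_class):.1f}% saturated, "
        f"80% is reached at {raw_count_for(80, instance_class):.0f} connections"
    )
```

Fifty connections is roughly three quarters of the small instance's limit but under 4% of the large one's, which is why the same raw threshold cannot serve both.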

Pattern 3: ALB 5xx error rate

ALB publishes two distinct 5xx metrics: `HTTPCode_ELB_5XX_Count` (errors generated by the load balancer itself — 504 gateway timeout, 502 bad gateway) and `HTTPCode_Target_5XX_Count` (errors returned by your backend). A meaningful error rate alarm needs both.

ALB5xxRateAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: alb-5xx-error-rate-high
    AlarmDescription: ALB 5xx error rate above 1% over 5 minutes
    Metrics:
      - Id: elbErrors
        MetricStat:
          Metric:
            Namespace: AWS/ApplicationELB
            MetricName: HTTPCode_ELB_5XX_Count
            Dimensions:
              - Name: LoadBalancer
                Value: !GetAtt MyLoadBalancer.LoadBalancerFullName
          Period: 300
          Stat: Sum
        ReturnData: false
      - Id: targetErrors
        MetricStat:
          Metric:
            Namespace: AWS/ApplicationELB
            MetricName: HTTPCode_Target_5XX_Count
            Dimensions:
              - Name: LoadBalancer
                Value: !GetAtt MyLoadBalancer.LoadBalancerFullName
          Period: 300
          Stat: Sum
        ReturnData: false
      - Id: requests
        MetricStat:
          Metric:
            Namespace: AWS/ApplicationELB
            MetricName: RequestCount
            Dimensions:
              - Name: LoadBalancer
                Value: !GetAtt MyLoadBalancer.LoadBalancerFullName
          Period: 300
          Stat: Sum
        ReturnData: false
      - Id: errorRate
        Expression: "((elbErrors + targetErrors) / requests) * 100"
        Label: Total5xxRatePct
        ReturnData: true
    ComparisonOperator: GreaterThanThreshold
    Threshold: 1
    EvaluationPeriods: 3
    DatapointsToAlarm: 2
    TreatMissingData: notBreaching
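As a worked example of why the expression sums both 5xx metrics, the Python sketch below uses made-up traffic numbers in which neither source crosses the 1% threshold alone but the combined rate does. The function names are illustrative, not part of any AWS SDK.

```python
def rate_pct(errors: int, requests: int) -> float:
    """Share of requests that failed, in percent."""
    return errors / requests * 100


def total_5xx_rate_pct(elb_errors: int, target_errors: int, requests: int) -> float:
    """Mirror the expression ((elbErrors + targetErrors) / requests) * 100."""
    return rate_pct(elb_errors + target_errors, requests)


# Illustrative 5-minute window: 10,000 requests, 60 errors generated by the
# load balancer itself and 70 returned by the targets behind it.
requests, elb_errors, target_errors = 10_000, 60, 70
threshold_pct = 1.0

print("ELB-only rate breaches:   ", rate_pct(elb_errors, requests) > threshold_pct)
print("target-only rate breaches:", rate_pct(target_errors, requests) > threshold_pct)
print(
    "combined rate breaches:   ",
    total_5xx_rate_pct(elb_errors, target_errors, requests) > threshold_pct,
)
```

An alarm on either metric alone would stay quiet here even though more than 1% of requests are failing.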

Pattern 4: SQS consumer stopped

The standard SQS alarm watches `ApproximateNumberOfMessagesVisible` — but that spikes whenever producers send a batch, even when consumers are healthy. A better signal: detect when messages are being enqueued but none are being processed.

SQSConsumerStoppedAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: sqs-consumer-stopped
    AlarmDescription: SQS messages visible but no messages being deleted — consumer may be down
    Metrics:
      - Id: deletes
        MetricStat:
          Metric:
            Namespace: AWS/SQS
            MetricName: NumberOfMessagesDeleted
            Dimensions:
              - Name: QueueName
                Value: !GetAtt MyQueue.QueueName
          Period: 300
          Stat: Sum
        ReturnData: false
      - Id: visible
        MetricStat:
          Metric:
            Namespace: AWS/SQS
            MetricName: ApproximateNumberOfMessagesVisible
            Dimensions:
              - Name: QueueName
                Value: !GetAtt MyQueue.QueueName
          Period: 300
          Stat: Maximum
        ReturnData: false
      - Id: consumerStopped
        Expression: "IF(FILL(deletes, 0) == 0 AND visible > 0, 1, 0)"
        Label: ConsumerStopped
        ReturnData: true
    ComparisonOperator: GreaterThanThreshold
    Threshold: 0
    EvaluationPeriods: 2
    DatapointsToAlarm: 2
    TreatMissingData: notBreaching

`FILL(deletes, 0)` substitutes 0 when `NumberOfMessagesDeleted` has no data — which happens when no messages were deleted in the period. The `IF()` expression returns 1 when the consumer is stopped (no deletes AND messages exist), and 0 otherwise. With `Threshold: 0` and `GreaterThanThreshold`, the alarm fires when the result is 1.
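The per-period behaviour of this expression can be traced with a small Python model. It is a sketch under assumptions: `fill` and `consumer_stopped` are hypothetical helpers that imitate `FILL` and `IF`, `None` stands for a period with no data point, and a missing `visible` value is assumed to make the result missing.

```python
from typing import List, Optional


def fill(series: List[Optional[float]], value: float) -> List[float]:
    """Imitate FILL(metric, value): replace missing data points with value."""
    return [value if v is None else v for v in series]


def consumer_stopped(
    deletes: List[Optional[float]], visible: List[Optional[float]]
) -> List[Optional[int]]:
    """Imitate IF(FILL(deletes, 0) == 0 AND visible > 0, 1, 0) per period."""
    result: List[Optional[int]] = []
    for d, v in zip(fill(deletes, 0), visible):
        if v is None:
            result.append(None)  # assumption: a missing operand yields a missing result
        else:
            result.append(1 if d == 0 and v > 0 else 0)
    return result


# Four 5-minute periods: the consumer goes quiet in periods 2 and 3
# while messages keep piling up, then recovers.
deletes = [120, None, None, 80]
visible = [15, 40, 95, 10]
print(consumer_stopped(deletes, visible))  # [0, 1, 1, 0]
```

With `EvaluationPeriods: 2` and `DatapointsToAlarm: 2`, the two consecutive 1s in the middle are what would move the alarm into ALARM.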

Pattern 5: Compound CPU and latency

Some incidents only matter when two conditions are true simultaneously. ECS CPU above 85% is worth knowing about — but it's only an active user-facing problem if latency is also elevated. Alerting on the compound condition cuts noise on normal scale-out events.

CompoundCPULatencyAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: ecs-cpu-and-latency-high
    AlarmDescription: ECS CPU above 85% while ALB p99 latency is above 2s
    Metrics:
      - Id: cpu
        MetricStat:
          Metric:
            Namespace: AWS/ECS
            MetricName: CPUUtilization
            Dimensions:
              - Name: ServiceName
                Value: !Ref MyService
              - Name: ClusterName
                Value: !Ref MyCluster
          Period: 60
          Stat: Average
        ReturnData: false
      - Id: latency
        MetricStat:
          Metric:
            Namespace: AWS/ApplicationELB
            MetricName: TargetResponseTime
            Dimensions:
              - Name: LoadBalancer
                Value: !GetAtt MyLoadBalancer.LoadBalancerFullName
          Period: 60
          Stat: p99
        ReturnData: false
      - Id: compound
        Expression: "IF(cpu > 85 AND latency > 2, 1, 0)"
        Label: CPUAndLatencyHigh
        ReturnData: true
    ComparisonOperator: GreaterThanThreshold
    Threshold: 0
    EvaluationPeriods: 3
    DatapointsToAlarm: 2
    TreatMissingData: notBreaching

The `IF()` expression evaluates the boolean condition and returns 1 (true) or 0 (false) as a numeric value. With `ComparisonOperator: GreaterThanThreshold` and `Threshold: 0`, the alarm fires when the result is 1. This is the canonical pattern for any compound boolean alarm condition in CloudWatch.
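To illustrate the noise reduction, this Python sketch compares a CPU-only condition with the compound one over a handful of invented one-minute data points. The thresholds match the alarm above; the data and the `compound` helper are assumptions for illustration.

```python
def compound(cpu_pct: float, latency_s: float) -> int:
    """Imitate IF(cpu > 85 AND latency > 2, 1, 0)."""
    return 1 if cpu_pct > 85 and latency_s > 2 else 0


# (CPUUtilization %, p99 TargetResponseTime in seconds) per minute.
# The first two minutes model a scale-out: CPU is hot but users are fine.
datapoints = [(91.0, 0.4), (93.0, 0.6), (95.0, 2.8), (96.0, 3.1), (70.0, 2.5)]

cpu_only = [1 if cpu > 85 else 0 for cpu, _ in datapoints]
both = [compound(cpu, latency) for cpu, latency in datapoints]

print("cpu-only breaches:", cpu_only)  # [1, 1, 1, 1, 0]
print("compound breaches:", both)      # [0, 0, 1, 1, 0]
```

The compound signal ignores the hot-but-healthy minutes and only flags the periods where high CPU coincides with slow responses.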

When NOT to use metric math

Metric math adds operational complexity — the expression must be correct, and a division-by-zero or missing metric can produce unexpected alarm states. Avoid it when a simpler alarm achieves the same goal.

  • Absolute capacity limits: alarm on `FreeableMemory < 512MB` directly — no math needed, the threshold is absolute.
  • Simple count thresholds with consistent traffic: if your service handles exactly 1000 req/min with no variance, a raw error count threshold is fine.
  • When you need per-metric DatapointsToAlarm tuning: a metric math expression collapses to one alarm, so you lose the ability to tune each metric's sensitivity independently.
  • Latency percentiles you want from X-Ray rather than ALB: X-Ray traces have more context; use a separate X-Ray-sourced alarm rather than forcing ALB TargetResponseTime through metric math.

Metric math functions reference

| Function | Syntax | Use case |
| --- | --- | --- |
| IF | `IF(condition, trueVal, falseVal)` | Boolean conditions; returns numeric 0/1 for GreaterThanThreshold alarms |
| FILL | `FILL(metric, fillValue)` | Substitute a value (usually 0) when a metric has no data points in the period |
| RATE | `RATE(metric)` | Per-second rate of change between consecutive data points; useful for cumulative counters |
| SUM | `SUM([m1, m2, m3])` | Sum across multiple metrics, e.g. errors across multiple Lambda functions |
| MIN / MAX | `MIN([m1, m2])` / `MAX([m1, m2])` | Aggregate across a fleet, e.g. lowest FreeableMemory across all RDS read replicas |
| METRICS() | `METRICS('pattern')` | Reference every metric in the alarm whose Id contains the given string, e.g. `SUM(METRICS('errors'))` |
| AVG | `AVG([m1, m2])` | Average across a metric array, e.g. average CPU across all ECS tasks |
| ABS | `ABS(m)` | Absolute value; useful when metric values can be negative |

CloudWatch evaluates metric math expressions server-side at alarm evaluation time, independently of any console graph. Before deploying, graph the same metrics and expression in the CloudWatch Metrics console to see the computed value over your intended window.
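Of the functions above, RATE is the least obvious, so here is a minimal Python sketch of the calculation it performs: the difference between consecutive data points divided by the seconds between them. The `rate` helper and the sample counter are illustrative assumptions, and the first point is modeled as missing because it has no predecessor.

```python
from typing import List, Optional, Tuple


def rate(points: List[Tuple[float, float]]) -> List[Optional[float]]:
    """Per-second rate of change between consecutive (timestamp_s, value) points."""
    rates: List[Optional[float]] = [None]  # first point has nothing to compare against
    for (t_prev, v_prev), (t_cur, v_cur) in zip(points, points[1:]):
        rates.append((v_cur - v_prev) / (t_cur - t_prev))
    return rates


# A cumulative counter sampled every 60 seconds; it resets before the last sample.
counter = [(0, 100), (60, 160), (120, 400), (180, 10)]
print(rate(counter))  # [None, 1.0, 4.0, -6.5]
```

In this sketch a counter reset shows up as a negative rate, so an alarm built on a rate like this may want to combine it with `IF` or `ABS` depending on what the reset should mean.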

Related reading

  • → The 5 CloudWatch alarms most startups accidentally create that are just noise
  • → The Complete AWS CloudWatch Alarm Setup Guide
  • → The 12 CloudWatch alarms every small AWS team should have

Frequently asked questions

Can I use metric math in a CloudWatch alarm without writing CloudFormation?

Yes. In the CloudWatch console, when creating or editing an alarm, choose 'Select metric' then switch to the 'Math expression' tab. You can reference any existing metric by its ID and type a metric math expression directly. The console validates the expression and shows a preview of the computed value before you save the alarm.

What happens when a metric math expression has a divide-by-zero?

CloudWatch treats the result as missing data for that evaluation period. The alarm's TreatMissingData setting determines what happens next: 'notBreaching' (stays OK), 'breaching' (triggers ALARM), 'ignore' (stays in current state), or 'missing' (transitions to INSUFFICIENT_DATA). Always set TreatMissingData to 'notBreaching' for rate calculations that can have zero denominators.

How do I alarm on the error rate across multiple Lambda functions?

Define each function's Errors and Invocations as separate metrics (e.g. errors1, invocations1, errors2, invocations2), then write the expression: `(SUM([errors1, errors2]) / SUM([invocations1, invocations2])) * 100`. This gives you a fleet-wide error rate alarm on a single threshold, without managing one alarm per function. Alternatively, publish a custom metric that aggregates at the application level.
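One design point worth seeing in numbers: dividing the summed errors by the summed invocations weights each function by its traffic, which is not the same as averaging per-function rates. The Python sketch below uses invented figures and hypothetical helper names to show the difference.

```python
from typing import List, Tuple


def fleet_error_rate_pct(functions: List[Tuple[int, int]]) -> float:
    """Mirror (SUM([errors...]) / SUM([invocations...])) * 100."""
    total_errors = sum(errors for errors, _ in functions)
    total_invocations = sum(invocations for _, invocations in functions)
    return total_errors / total_invocations * 100


def mean_of_rates_pct(functions: List[Tuple[int, int]]) -> float:
    """Unweighted average of each function's own error rate, for comparison."""
    return sum(e / i * 100 for e, i in functions) / len(functions)


# (errors, invocations): one busy healthy function, one quiet failing function.
functions = [(50, 100_000), (5, 10)]
print(f"fleet-wide rate: {fleet_error_rate_pct(functions):.3f}%")
print(f"mean of rates:   {mean_of_rates_pct(functions):.3f}%")
```

Here the fleet-wide rate stays far below a 5% threshold even though the quiet function fails half the time, so a fleet alarm of this shape complements per-function alarms on critical functions rather than replacing them.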

What is the FILL function used for in CloudWatch metric math?

FILL substitutes a specified value when a metric has no data points in an evaluation period. Without FILL, a missing metric in a math expression causes the entire expression to evaluate as missing. The most common use is `FILL(metric, 0)`, which treats periods of no activity as zero rather than missing data. Use it on a numerator or on a metric you compare against a value (e.g. SQS deletes during an idle period). Filling a denominator with 0 only trades a missing value for a divide-by-zero.

Does CloudWatch metric math work with custom metrics?

Yes. Custom metrics published to CloudWatch (via the PutMetricData API or the CloudWatch agent) are fully supported in metric math expressions. Reference them the same way as AWS service metrics: by namespace, metric name, and dimensions. The main constraints are that all metrics in the expression must share the same Period and must be in the same region as the alarm, and in the same AWS account unless you have set up cross-account access.

Nitesh

Founder, ConvOps

Published

June 2026

Updated

June 2026
