
The Complete AWS CloudWatch Alarm Setup Guide

March 31, 2026 · 15 min read

This guide covers every CloudWatch alarm your AWS infrastructure needs — ECS, EC2, RDS, Lambda, ALB, API Gateway, SQS, DynamoDB, ElastiCache, and cost alerts. All examples are given as CloudFormation (YAML); each alarm maps directly to Terraform's aws_cloudwatch_metric_alarm resource with the same property values.

Setup parameters

Replace these placeholder values before deploying any alarm:

  • YOUR_SNS_TOPIC_ARN — the ARN of the SNS topic that sends your notifications
  • YOUR_CLUSTER_NAME / YOUR_SERVICE_NAME — your ECS cluster and service names
  • YOUR_INSTANCE_ID — your EC2 instance ID
  • YOUR_DB_INSTANCE_ID — your RDS instance identifier
  • YOUR_FUNCTION_NAME — your Lambda function name
  • YOUR_ALB_SUFFIX — everything after loadbalancer/ in your ALB ARN
  • YOUR_TARGET_GROUP_SUFFIX — the targetgroup/... portion at the end of your target group ARN
  • YOUR_API_NAME / YOUR_STAGE — your API Gateway name and stage
  • YOUR_QUEUE_NAME — your SQS queue name
  • YOUR_TABLE_NAME — your DynamoDB table name
  • YOUR_CACHE_CLUSTER_ID — your ElastiCache cluster ID
  • YOUR_MONTHLY_BUDGET — your monthly budget in USD
After deploying, AWS sends a confirmation email to your SNS subscription address. Click 'Confirm subscription' in that email — alarms won't deliver until you do.

SNS topic

AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  AlertEmail:
    Type: String
    Description: Email address to receive CloudWatch alerts

Resources:
  AlertsTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: infra-alerts
      Subscription:
        - Protocol: email
          Endpoint: !Ref AlertEmail

Outputs:
  SnsTopicArn:
    Value: !Ref AlertsTopic
    Description: Use this ARN as YOUR_SNS_TOPIC_ARN in all alarm snippets below

1. ECS — Elastic Container Service

Detect container saturation and task crashes before user impact occurs.

Metric | Threshold | Period | Eval Periods | Severity | Why It Matters
CPUUtilization | > 80% | 5 min | 2 | WARN | Sustained CPU pressure — scale before saturation
CPUUtilization | > 95% | 5 min | 2 | CRITICAL | Tasks CPU-throttled; latency spikes imminent
MemoryUtilization | > 85% | 5 min | 2 | WARN | Memory pressure building; OOM kill possible
MemoryUtilization | > 95% | 5 min | 2 | CRITICAL | Near OOM; task will be killed and restarted
RunningTaskCount | < desired count | 1 min | 1 | CRITICAL | Tasks crashed and not recovering

Parameters:
  ClusterName:
    Type: String
    Default: YOUR_CLUSTER_NAME
  ServiceName:
    Type: String
    Default: YOUR_SERVICE_NAME
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN
  DesiredTaskCount:
    Type: Number
    Default: 2

Resources:
  EcsCpuWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-cpu-warn"
      AlarmDescription: ECS CPU utilization above 80% for 10 minutes
      Namespace: AWS/ECS
      MetricName: CPUUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  EcsCpuCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-cpu-critical"
      AlarmDescription: ECS CPU above 95% - tasks are throttled
      Namespace: AWS/ECS
      MetricName: CPUUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 95
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  EcsMemoryWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-memory-warn"
      AlarmDescription: ECS memory utilization above 85%
      Namespace: AWS/ECS
      MetricName: MemoryUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 85
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  EcsMemoryCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-memory-critical"
      AlarmDescription: ECS memory utilization above 95% - OOM kill imminent
      Namespace: AWS/ECS
      MetricName: MemoryUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 95
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  EcsRunningTasksCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-tasks-critical"
      AlarmDescription: Running task count below desired - service may be down
      # RunningTaskCount is published by Container Insights (namespace
      # ECS/ContainerInsights), not AWS/ECS - enable Container Insights on the
      # cluster or this alarm will never receive data
      Namespace: ECS/ContainerInsights
      MetricName: RunningTaskCount
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 60
      EvaluationPeriods: 1
      Threshold: !Ref DesiredTaskCount
      ComparisonOperator: LessThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]

2. EC2 — Elastic Compute Cloud

Catch hardware failures and unresponsive instances that Auto Scaling may miss.

Metric | Threshold | Period | Eval Periods | Severity | Why It Matters
CPUUtilization | > 85% | 5 min | 3 | WARN | Sustained high CPU; investigate before saturation
CPUUtilization | > 95% | 5 min | 2 | CRITICAL | Instance at capacity; requests will queue or fail
StatusCheckFailed | > 0 | 1 min | 2 | CRITICAL | Instance or system check failing
StatusCheckFailed_System | > 0 | 1 min | 2 | CRITICAL | AWS hardware issue — instance may need recovery
NetworkIn | < 1000 bytes/period | 5 min | 3 | WARN | Traffic dropped to near-zero

Parameters:
  InstanceId:
    Type: String
    Default: YOUR_INSTANCE_ID
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  Ec2CpuWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-cpu-warn"
      AlarmDescription: EC2 CPU above 85% for 15 minutes
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 85
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  Ec2CpuCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-cpu-critical"
      AlarmDescription: EC2 CPU above 95% for 10 minutes
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 95
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  Ec2StatusCheckFailed:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-status-check-failed"
      AlarmDescription: EC2 status check failed - instance may be unresponsive
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]

  Ec2StatusCheckFailedSystem:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-status-check-system"
      AlarmDescription: EC2 system status check failed - AWS hardware issue
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed_System
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]

  Ec2NetworkInDrop:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-network-in-drop"
      AlarmDescription: EC2 NetworkIn near zero - traffic may have stopped
      Namespace: AWS/EC2
      MetricName: NetworkIn
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 3
      Threshold: 1000
      ComparisonOperator: LessThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]

3. RDS — Relational Database Service

These alarms give you a 10–30 minute warning window before a database failure. The DatabaseConnections threshold depends on your instance class; typical defaults:

Instance Class | max_connections | 80% threshold
db.t3.micro | 87 | 69
db.t3.small | 171 | 136
db.t3.medium | 341 | 272
db.t3.large | 648 | 518
db.r5.large | 1365 | 1092
db.r5.xlarge | 2730 | 2184
db.r5.2xlarge | 5460 | 4368

Metric | Threshold | Period | Eval Periods | Severity | Why It Matters
CPUUtilization | > 80% | 5 min | 3 | WARN | DB under CPU load; queries slowing down
DatabaseConnections | > 80% of max | 5 min | 2 | WARN | Connection pool filling
FreeStorageSpace | < 10 GB | 5 min | 2 | WARN | Disk filling; DB will stop accepting writes
FreeStorageSpace | < 2 GB | 5 min | 1 | CRITICAL | Critically low disk
ReplicaLag | > 300 s | 1 min | 2 | WARN | Read replica falling behind
FreeableMemory | < 256 MB | 5 min | 3 | WARN | Low memory; buffer pool shrinking

Parameters:
  DbInstanceId:
    Type: String
    Default: YOUR_DB_INSTANCE_ID
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN
  MaxConnectionsThreshold:
    Type: Number
    Default: 272
    Description: 80% of max_connections for your instance class (see table above)

Resources:
  RdsCpuWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-cpu-warn"
      AlarmDescription: RDS CPU above 80% for 15 minutes
      Namespace: AWS/RDS
      MetricName: CPUUtilization
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsConnectionsWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-connections-warn"
      AlarmDescription: RDS connections above 80% of max_connections
      Namespace: AWS/RDS
      MetricName: DatabaseConnections
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: !Ref MaxConnectionsThreshold
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsDiskWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-disk-warn"
      AlarmDescription: RDS free storage below 10 GB
      Namespace: AWS/RDS
      MetricName: FreeStorageSpace
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 10737418240
      ComparisonOperator: LessThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsDiskCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-disk-critical"
      AlarmDescription: RDS free storage critically low (below 2 GB)
      Namespace: AWS/RDS
      MetricName: FreeStorageSpace
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 1
      Threshold: 2147483648
      ComparisonOperator: LessThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsReplicaLag:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-replica-lag"
      AlarmDescription: RDS read replica lag above 5 minutes
      Namespace: AWS/RDS
      # ReplicaLag is only emitted by read replicas, so set DbInstanceId to the
      # replica's identifier (not the primary's) for this alarm
      MetricName: ReplicaLag
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 60
      EvaluationPeriods: 2
      Threshold: 300
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsFreeMemoryWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-memory-warn"
      AlarmDescription: RDS freeable memory below 256 MB
      Namespace: AWS/RDS
      MetricName: FreeableMemory
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 268435456
      ComparisonOperator: LessThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

4. Lambda

Detect errors, throttles, and runaway executions before they consume budget.

Set the Duration alarm threshold to 80% of your function's configured timeout. For example, if your timeout is 30 seconds, set threshold to 24000ms (24 seconds). You must set this manually — there is no automatic way to reference the function timeout in a CloudWatch alarm.
Metric | Threshold | Period | Eval Periods | Severity | Why It Matters
Errors | > 0 | 1 min | 1 | WARN | Any function error
Errors | > 5 | 1 min | 2 | CRITICAL | Repeated errors — function may be broken
Throttles | > 0 | 1 min | 2 | WARN | Requests being dropped
Duration | > 80% of timeout | 1 min | 2 | WARN | Function nearing timeout
ConcurrentExecutions | > 800 | 1 min | 2 | WARN | Approaching account concurrency limit

Parameters:
  FunctionName:
    Type: String
    Default: YOUR_FUNCTION_NAME
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN
  DurationThresholdMs:
    Type: Number
    Default: 24000
    Description: |
      80% of your function timeout in ms.
      e.g. 30s timeout -> 24000ms, 15s timeout -> 12000ms, 5s timeout -> 4000ms

Resources:
  LambdaErrorsWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-errors-warn"
      AlarmDescription: Lambda function errors detected
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  LambdaErrorsCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-errors-critical"
      AlarmDescription: Lambda function errors above 5 - may be completely broken
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  LambdaThrottlesWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-throttles"
      AlarmDescription: Lambda throttles detected - requests being dropped
      Namespace: AWS/Lambda
      MetricName: Throttles
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  LambdaDurationWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-duration-warn"
      AlarmDescription: !Sub "Lambda duration above 80% of timeout (${DurationThresholdMs}ms)"
      Namespace: AWS/Lambda
      MetricName: Duration
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      ExtendedStatistic: p99
      Period: 60
      EvaluationPeriods: 2
      Threshold: !Ref DurationThresholdMs
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  LambdaConcurrencyWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-concurrency-warn"
      AlarmDescription: Lambda concurrent executions above 800 (80% of default limit 1000)
      Namespace: AWS/Lambda
      MetricName: ConcurrentExecutions
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 800
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

5. ALB — Application Load Balancer

Catch backend failures and unhealthy targets immediately.

To find your ALB suffix: go to EC2 → Load Balancers, click your ALB, and copy the ARN. The suffix is everything after loadbalancer/ (e.g. app/my-alb/abc123def456). The target group suffix works the same way: it is the targetgroup/my-targets/abc123 portion at the end of the target group ARN.
Metric | Threshold | Period | Eval Periods | Severity | Why It Matters
HTTPCode_Target_5XX_Count | > 0 | 1 min | 2 | WARN | Backend returning server errors
HTTPCode_Target_5XX_Count | > 10 | 1 min | 2 | CRITICAL | High rate of 5XX errors
TargetResponseTime | > 2 s | 5 min | 3 | WARN | Slow responses — users experiencing latency
TargetResponseTime | > 5 s | 5 min | 2 | CRITICAL | Very slow responses
UnHealthyHostCount | > 0 | 1 min | 2 | CRITICAL | Targets failing health checks
RejectedConnectionCount | > 0 | 1 min | 2 | WARN | ALB at max connections

Parameters:
  AlbSuffix:
    Type: String
    Default: YOUR_ALB_SUFFIX
    Description: e.g. app/my-alb/abc123def456 (after "loadbalancer/" in the ARN)
  TargetGroupSuffix:
    Type: String
    Default: YOUR_TARGET_GROUP_SUFFIX
    Description: e.g. targetgroup/my-targets/abc123 (the end of the target group ARN)
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  Alb5xxWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-5xx-warn-${AlbSuffix}"
      AlarmDescription: ALB backend 5XX errors detected
      Namespace: AWS/ApplicationELB
      MetricName: HTTPCode_Target_5XX_Count
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  Alb5xxCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-5xx-critical-${AlbSuffix}"
      AlarmDescription: ALB backend 5XX errors above 10 per minute
      Namespace: AWS/ApplicationELB
      MetricName: HTTPCode_Target_5XX_Count
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  AlbLatencyWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-latency-warn-${AlbSuffix}"
      AlarmDescription: ALB target response time above 2 seconds
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      ExtendedStatistic: p99
      Period: 300
      EvaluationPeriods: 3
      Threshold: 2
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  AlbLatencyCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-latency-critical-${AlbSuffix}"
      AlarmDescription: ALB target response time above 5 seconds
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      ExtendedStatistic: p99
      Period: 300
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  AlbUnhealthyHosts:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-unhealthy-hosts-${AlbSuffix}"
      AlarmDescription: ALB unhealthy target count above zero
      Namespace: AWS/ApplicationELB
      MetricName: UnHealthyHostCount
      # UnHealthyHostCount is published per (TargetGroup, LoadBalancer) pair,
      # so both dimensions are required for the alarm to receive data
      Dimensions:
        - Name: TargetGroup
          Value: !Ref TargetGroupSuffix
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  AlbRejectedConnections:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-rejected-connections-${AlbSuffix}"
      AlarmDescription: ALB rejected connections - load balancer at max capacity
      Namespace: AWS/ApplicationELB
      MetricName: RejectedConnectionCount
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

6. API Gateway

Detect integration failures and slow backends before the 29-second timeout.

Detecting a sudden drop in request Count requires comparing the current value to a rolling baseline, which a static-threshold alarm can't express. Use CloudWatch Anomaly Detection (a sketch follows the template below) or external monitoring for this alarm.
Metric | Threshold | Period | Eval Periods | Severity | Why It Matters
5XXError | > 5 | 1 min | 2 | WARN | Backend integration errors
4XXError | > 50 per 5 min | 5 min | 3 | WARN | High client error rate
Latency | > 3000 ms (p99) | 5 min | 3 | WARN | Slow backend responses
Latency | > 10000 ms | 5 min | 2 | CRITICAL | Near 29s timeout

Parameters:
  ApiName:
    Type: String
    Default: YOUR_API_NAME
  Stage:
    Type: String
    Default: prod
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  ApiGw5xxWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApiName}-${Stage}-5xx-warn"
      AlarmDescription: API Gateway 5XX errors above 5 per minute
      Namespace: AWS/ApiGateway
      MetricName: 5XXError
      Dimensions:
        - Name: ApiName
          Value: !Ref ApiName
        - Name: Stage
          Value: !Ref Stage
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  ApiGw4xxWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApiName}-${Stage}-4xx-warn"
      AlarmDescription: API Gateway 4XX errors above 50 per 5 minutes
      Namespace: AWS/ApiGateway
      MetricName: 4XXError
      Dimensions:
        - Name: ApiName
          Value: !Ref ApiName
        - Name: Stage
          Value: !Ref Stage
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 3
      Threshold: 50
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  ApiGwLatencyWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApiName}-${Stage}-latency-warn"
      AlarmDescription: API Gateway p99 latency above 3 seconds
      Namespace: AWS/ApiGateway
      MetricName: Latency
      Dimensions:
        - Name: ApiName
          Value: !Ref ApiName
        - Name: Stage
          Value: !Ref Stage
      ExtendedStatistic: p99
      Period: 300
      EvaluationPeriods: 3
      Threshold: 3000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  ApiGwLatencyCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApiName}-${Stage}-latency-critical"
      AlarmDescription: API Gateway latency above 10 seconds - near 29s timeout
      Namespace: AWS/ApiGateway
      MetricName: Latency
      Dimensions:
        - Name: ApiName
          Value: !Ref ApiName
        - Name: Stage
          Value: !Ref Stage
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 10000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
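
For the request-drop case, here is a minimal anomaly detection sketch that reuses this section's ApiName, Stage, and SnsTopicArn parameters. The band width of 2 standard deviations and the 3 evaluation periods are starting points to tune, not recommendations, and the model needs some metric history before the band is meaningful:

  ApiGwCountAnomaly:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApiName}-${Stage}-count-anomaly"
      AlarmDescription: API Gateway request count below the expected band - traffic may have dropped
      ComparisonOperator: LessThanLowerThreshold
      EvaluationPeriods: 3
      # the alarm compares the metric to this expression instead of a fixed Threshold
      ThresholdMetricId: band
      TreatMissingData: breaching
      Metrics:
        - Id: requests
          MetricStat:
            Metric:
              Namespace: AWS/ApiGateway
              MetricName: Count
              Dimensions:
                - Name: ApiName
                  Value: !Ref ApiName
                - Name: Stage
                  Value: !Ref Stage
            Period: 300
            Stat: Sum
          ReturnData: true
        - Id: band
          # 2 = band width in standard deviations; widen it to reduce noise
          Expression: ANOMALY_DETECTION_BAND(requests, 2)
          Label: Expected request volume
          ReturnData: true
      AlarmActions: [!Ref SnsTopicArn]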

7. SQS — Simple Queue Service

Detect queue backups and processing failures before they accumulate.

Detecting a sudden drop in NumberOfMessagesSent likewise requires comparing to a rolling baseline. Use a CloudWatch Anomaly Detection alarm for this (the API Gateway section above sketches the pattern); the snippets below cover threshold-based alarms only.
Metric | Threshold | Period | Eval Periods | Severity | Why It Matters
ApproximateNumberOfMessagesVisible | > 1000 | 5 min | 3 | WARN | Queue building up
ApproximateNumberOfMessagesVisible | > 10000 | 5 min | 2 | CRITICAL | Severe queue backup
ApproximateAgeOfOldestMessage | > 300 s | 5 min | 2 | WARN | Messages sitting unprocessed
ApproximateAgeOfOldestMessage | > 900 s | 5 min | 2 | CRITICAL | Messages 15+ minutes old

Parameters:
  QueueName:
    Type: String
    Default: YOUR_QUEUE_NAME
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  SqsQueueDepthWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${QueueName}-depth-warn"
      AlarmDescription: SQS queue depth above 1000 - consumers may be lagging
      Namespace: AWS/SQS
      # Visible = messages waiting in the queue; NotVisible counts in-flight messages
      MetricName: ApproximateNumberOfMessagesVisible
      Dimensions:
        - Name: QueueName
          Value: !Ref QueueName
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 3
      Threshold: 1000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  SqsQueueDepthCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${QueueName}-depth-critical"
      AlarmDescription: SQS queue depth above 10000 - severe consumer failure
      Namespace: AWS/SQS
      MetricName: ApproximateNumberOfMessagesVisible
      Dimensions:
        - Name: QueueName
          Value: !Ref QueueName
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 10000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  SqsMessageAgeWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${QueueName}-age-warn"
      AlarmDescription: SQS oldest message age above 5 minutes
      Namespace: AWS/SQS
      MetricName: ApproximateAgeOfOldestMessage
      Dimensions:
        - Name: QueueName
          Value: !Ref QueueName
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 300
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  SqsMessageAgeCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${QueueName}-age-critical"
      AlarmDescription: SQS oldest message age above 15 minutes - SLA breach
      Namespace: AWS/SQS
      MetricName: ApproximateAgeOfOldestMessage
      Dimensions:
        - Name: QueueName
          Value: !Ref QueueName
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 900
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

8. DynamoDB

Catch throttling and capacity issues before they impact applications.

If you use on-demand (PAY_PER_REQUEST) mode, skip the ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits alarms: there is no provisioned limit to alarm against. Keep ThrottledRequests and SystemErrors for all modes. Note that DynamoDB publishes ThrottledRequests and SystemErrors per TableName + Operation pair; if an alarm dimensioned on TableName alone sits in INSUFFICIENT_DATA, add an Operation dimension or use the per-table ReadThrottleEvents and WriteThrottleEvents metrics instead.
Metric | Threshold | Period | Eval Periods | Severity | Why It Matters
SystemErrors | > 0 | 1 min | 2 | CRITICAL | AWS-side DynamoDB errors
UserErrors | > 0 | 5 min | 3 | WARN | Client-side errors
ConsumedReadCapacityUnits | > 80% of provisioned | 5 min | 2 | WARN | Read capacity filling up
ConsumedWriteCapacityUnits | > 80% of provisioned | 5 min | 2 | WARN | Write capacity filling up
ThrottledRequests | > 0 | 5 min | 2 | WARN | Requests being throttled

Parameters:
  TableName:
    Type: String
    Default: YOUR_TABLE_NAME
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN
  ProvisionedReadCapacity:
    Type: Number
    Default: 100
    Description: Your table's provisioned RCU (skip for on-demand mode)
  ProvisionedWriteCapacity:
    Type: Number
    Default: 100
    Description: Your table's provisioned WCU (skip for on-demand mode)

Resources:
  DynamoDbSystemErrors:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${TableName}-system-errors"
      AlarmDescription: DynamoDB system errors detected - possible AWS service issue
      Namespace: AWS/DynamoDB
      MetricName: SystemErrors
      Dimensions:
        - Name: TableName
          Value: !Ref TableName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  DynamoDbUserErrors:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${TableName}-user-errors"
      AlarmDescription: DynamoDB user errors - bad requests or auth issues
      Namespace: AWS/DynamoDB
      MetricName: UserErrors
      # UserErrors is an account-level metric with no TableName dimension, so
      # this alarm fires on bad requests to any table in the Region
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 3
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  DynamoDbThrottledRequests:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${TableName}-throttled"
      AlarmDescription: DynamoDB throttled requests - requests being delayed
      Namespace: AWS/DynamoDB
      MetricName: ThrottledRequests
      Dimensions:
        - Name: TableName
          Value: !Ref TableName
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
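
The table above lists consumed-capacity alarms, and the ProvisionedReadCapacity and ProvisionedWriteCapacity parameters exist for them, but a CloudFormation Threshold can't multiply two parameters together. One workaround, sketched below for the read side, is a metric-math alarm that computes utilization as a percentage (the write alarm is identical with ConsumedWriteCapacityUnits and ProvisionedWriteCapacity):

  DynamoDbReadCapacityWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${TableName}-read-capacity-warn"
      AlarmDescription: Consumed read capacity above 80% of provisioned
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      Metrics:
        - Id: consumed
          MetricStat:
            Metric:
              Namespace: AWS/DynamoDB
              MetricName: ConsumedReadCapacityUnits
              Dimensions:
                - Name: TableName
                  Value: !Ref TableName
            Period: 300
            Stat: Sum
          ReturnData: false
        - Id: pct
          # Sum over 300s divided by (provisioned RCU x 300) = utilization;
          # !Sub inlines the parameter because expressions can't reference it
          Expression: !Sub "100 * consumed / (${ProvisionedReadCapacity} * 300)"
          Label: Read capacity utilization (%)
          ReturnData: true
      AlarmActions: [!Ref SnsTopicArn]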

9. ElastiCache (Redis)

Monitor cache health, hit rates, and replication status to prevent backend overload.

Metric | Threshold | Period | Eval Periods | Severity | Why It Matters
CPUUtilization | > 80% | 5 min | 2 | WARN | Redis single-threaded; causes latency spikes
FreeableMemory | < 100 MB | 5 min | 2 | WARN | Redis evicting keys
CacheHitRate | < 0.8 (80%) | 5 min | 3 | WARN | Cache not effective
CurrConnections | > 1000 | 5 min | 2 | WARN | High connection count
ReplicationLag | > 60 s | 1 min | 2 | WARN | Replica falling behind primary

Parameters:
  CacheClusterId:
    Type: String
    Default: YOUR_CACHE_CLUSTER_ID
  CacheNodeId:
    Type: String
    Default: "0001"
    Description: ElastiCache host metrics are per node; single-node clusters use 0001
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  RedisCpuWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${CacheClusterId}-cpu-warn"
      AlarmDescription: ElastiCache CPU above 80%
      Namespace: AWS/ElastiCache
      MetricName: CPUUtilization
      # host-level metrics need both CacheClusterId and CacheNodeId to match data
      Dimensions:
        - Name: CacheClusterId
          Value: !Ref CacheClusterId
        - Name: CacheNodeId
          Value: !Ref CacheNodeId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RedisFreeMemoryWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${CacheClusterId}-memory-warn"
      AlarmDescription: ElastiCache freeable memory below 100 MB - keys may be evicted
      Namespace: AWS/ElastiCache
      MetricName: FreeableMemory
      Dimensions:
        - Name: CacheClusterId
          Value: !Ref CacheClusterId
        - Name: CacheNodeId
          Value: !Ref CacheNodeId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 104857600
      ComparisonOperator: LessThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RedisCacheHitRateWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${CacheClusterId}-hit-rate-warn"
      AlarmDescription: ElastiCache cache hit rate below 80% - DB taking excessive load
      Namespace: AWS/ElastiCache
      MetricName: CacheHitRate
      Dimensions:
        - Name: CacheClusterId
          Value: !Ref CacheClusterId
        - Name: CacheNodeId
          Value: !Ref CacheNodeId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 0.8
      ComparisonOperator: LessThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RedisCurrConnectionsWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${CacheClusterId}-connections-warn"
      AlarmDescription: ElastiCache connections above 1000
      Namespace: AWS/ElastiCache
      MetricName: CurrConnections
      Dimensions:
        - Name: CacheClusterId
          Value: !Ref CacheClusterId
        - Name: CacheNodeId
          Value: !Ref CacheNodeId
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 1000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RedisReplicationLagWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${CacheClusterId}-replication-lag"
      AlarmDescription: ElastiCache replication lag above 60 seconds
      Namespace: AWS/ElastiCache
      MetricName: ReplicationLag
      Dimensions:
        # point CacheClusterId at a replica node; the primary reports no lag
        - Name: CacheClusterId
          Value: !Ref CacheClusterId
        - Name: CacheNodeId
          Value: !Ref CacheNodeId
      Statistic: Average
      Period: 60
      EvaluationPeriods: 2
      Threshold: 60
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

10. Cost & Budget Alerts

These alerts use AWS Budgets and Cost Explorer rather than CloudWatch: the resources are AWS::Budgets::Budget plus AWS::CE::AnomalyMonitor and AWS::CE::AnomalySubscription in CloudFormation (aws_budgets_budget and the aws_ce_anomaly_* resources in Terraform). They still deliver alerts to email or SNS.

Alert Type | Threshold | Type | Severity | Why It Matters
Monthly spend actual | 80% of budget | ACTUAL | WARN | Early warning to review usage
Monthly spend actual | 100% of budget | ACTUAL | CRITICAL | Budget exceeded
Monthly spend forecasted | 100% of budget | FORECASTED | WARN | Projected to exceed budget
Anomaly detection | $50 above expected | ANOMALY | WARN | Unusual spending pattern

Parameters:
  MonthlyBudgetAmount:
    Type: Number
    Default: 100
    Description: Monthly AWS budget in USD
  AlertEmail:
    Type: String
    Default: you@yourcompany.com
    Description: Email for budget alerts

Resources:
  MonthlyBudget:
    Type: AWS::Budgets::Budget
    Properties:
      Budget:
        BudgetName: monthly-aws-budget
        BudgetType: COST
        TimeUnit: MONTHLY
        BudgetLimit:
          Amount: !Ref MonthlyBudgetAmount
          Unit: USD
      NotificationsWithSubscribers:
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 80
            ThresholdType: PERCENTAGE
          Subscribers:
            - SubscriptionType: EMAIL
              Address: !Ref AlertEmail
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 100
            ThresholdType: PERCENTAGE
          Subscribers:
            - SubscriptionType: EMAIL
              Address: !Ref AlertEmail
        - Notification:
            NotificationType: FORECASTED
            ComparisonOperator: GREATER_THAN
            Threshold: 100
            ThresholdType: PERCENTAGE
          Subscribers:
            - SubscriptionType: EMAIL
              Address: !Ref AlertEmail

  CostAnomalyMonitor:
    Type: AWS::CE::AnomalyMonitor
    Properties:
      MonitorName: aws-cost-anomaly-monitor
      MonitorType: DIMENSIONAL
      MonitorDimension: SERVICE

  CostAnomalySubscription:
    Type: AWS::CE::AnomalySubscription
    Properties:
      SubscriptionName: cost-anomaly-alerts
      MonitorArnList:
        - !GetAtt CostAnomalyMonitor.MonitorArn
      Subscribers:
        - Address: !Ref AlertEmail
          Type: EMAIL
      Threshold: 50
      Frequency: DAILY

Related reading

  • How to find root cause in AWS CloudWatch alerts without an SRE team
  • CloudWatch vs Datadog for startups: what you actually need

Frequently asked questions

What CloudWatch alarms should every ECS service have?

Every ECS service needs five alarms at minimum: CPUUtilization > 80% (WARN) and > 95% (CRITICAL) with 5-minute periods, MemoryUtilization > 85% (WARN) and > 95% (CRITICAL), and RunningTaskCount < desired count (immediate signal of task crashes). Wire all five to an SNS topic that sends to your on-call channel. Without the RunningTaskCount alarm, crashed tasks can go undetected until a user reports an error.

How do I create a CloudWatch alarm that sends to Slack or WhatsApp?

The path is: CloudWatch Alarm → SNS Topic → Lambda function → Slack/WhatsApp API. Create an SNS topic, subscribe a Lambda to it, and set the Lambda to call the Slack webhook or WhatsApp Business API. The Lambda receives the raw alarm payload and can optionally enrich it with CloudWatch Logs Insights data before sending — making the notification contain a diagnosis rather than just a raw metric value.
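
A minimal sketch of that SNS-to-Lambda wiring in CloudFormation, assuming the notifier function already exists (NotifierFunctionArn is a placeholder; the function code that posts to Slack or WhatsApp is not shown):

Parameters:
  NotifierFunctionArn:
    Type: String
    Description: ARN of your existing notifier Lambda (placeholder)

Resources:
  SlackSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      TopicArn: YOUR_SNS_TOPIC_ARN
      Protocol: lambda
      Endpoint: !Ref NotifierFunctionArn

  # SNS cannot invoke the function without this explicit permission
  SnsInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref NotifierFunctionArn
      Action: lambda:InvokeFunction
      Principal: sns.amazonaws.com
      SourceArn: YOUR_SNS_TOPIC_ARN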

How many CloudWatch alarms does a typical AWS production environment need?

A typical production environment with ECS, RDS, ALB, and Lambda needs approximately 15–25 alarms: 5 per ECS service (CPU warn/critical, memory warn/critical, task count), 4 for RDS (CPU, connections, storage, replica lag), 3 for ALB (5xx error rate, P99 latency, request spike), 3 for Lambda (errors, duration, throttles), and 1–2 budget alerts. More than 30 alarms for a small team usually indicates noise, not coverage.

What is the difference between CloudFormation and Terraform for CloudWatch alarm setup?

Both produce identical CloudWatch alarms — the difference is toolchain. CloudFormation (YAML) is native AWS and requires no additional tools; it's the default choice if you have no existing IaC. Terraform (HCL) requires the CLI and state management but integrates better with multi-cloud or non-AWS resources. Choose whichever your team already uses — the alarm configuration is functionally identical.

What threshold should I use for ECS CPU alarms?

Use 80% for WARN and 95% for CRITICAL, both evaluated over 2 periods of 5 minutes each (10 minutes total). This prevents false positives from momentary CPU spikes during deploys or cold starts. Near 95% CPUUtilization, ECS tasks are CPU-throttled and latency climbs quickly; requiring 95% sustained for 10 minutes confirms a real problem rather than a transient spike.

Still debugging incidents manually?

ConvOps does this automatically — root cause in under 60 seconds, delivered to WhatsApp or Slack.

Try ConvOps free → · See a live demo