The Complete AWS CloudWatch Alarm Setup Guide
This guide covers every CloudWatch alarm your AWS infrastructure needs — ECS, EC2, RDS, Lambda, ALB, API Gateway, SQS, DynamoDB, ElastiCache, and cost alerts. All examples use CloudFormation YAML; each alarm translates one-to-one to a Terraform aws_cloudwatch_metric_alarm resource with the same property names.
Setup parameters
Replace these placeholder values before deploying any alarm:
- YOUR_SNS_TOPIC_ARN — the ARN of the SNS topic that sends your notifications
- YOUR_CLUSTER_NAME / YOUR_SERVICE_NAME — your ECS cluster and service names
- YOUR_INSTANCE_ID — your EC2 instance ID
- YOUR_DB_INSTANCE_ID — your RDS instance identifier
- YOUR_FUNCTION_NAME — your Lambda function name
- YOUR_ALB_SUFFIX — everything after loadbalancer/ in your ALB ARN
- YOUR_API_NAME / YOUR_STAGE — your API Gateway name and stage
- YOUR_QUEUE_NAME — your SQS queue name
- YOUR_TABLE_NAME — your DynamoDB table name
- YOUR_CACHE_CLUSTER_ID — your ElastiCache cluster ID
- YOUR_MONTHLY_BUDGET — your monthly budget in USD
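The only placeholder that regularly trips people up is YOUR_ALB_SUFFIX. A quick sanity check in Python (the ARN below is a made-up example):

```python
# Extract YOUR_ALB_SUFFIX from a full ALB ARN: it is everything after "loadbalancer/".
arn = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123def456"
alb_suffix = arn.split("loadbalancer/", 1)[1]
print(alb_suffix)  # app/my-alb/abc123def456
```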
SNS topic
AWSTemplateFormatVersion: '2010-09-09'
Parameters:
AlertEmail:
Type: String
Description: Email address to receive CloudWatch alerts
Resources:
AlertsTopic:
Type: AWS::SNS::Topic
Properties:
TopicName: infra-alerts
Subscription:
- Protocol: email
Endpoint: !Ref AlertEmail
Outputs:
SnsTopicArn:
Value: !Ref AlertsTopic
Description: Use this ARN as YOUR_SNS_TOPIC_ARN in all alarm snippets below

1. ECS — Elastic Container Service
Detect container saturation and task crashes before user impact occurs.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| CPUUtilization | > 80% | 5 min | 2 | WARN | Sustained CPU pressure — scale before saturation |
| CPUUtilization | > 95% | 5 min | 2 | CRITICAL | Tasks CPU-throttled; latency spikes imminent |
| MemoryUtilization | > 85% | 5 min | 2 | WARN | Memory pressure building; OOM kill possible |
| MemoryUtilization | > 95% | 5 min | 2 | CRITICAL | Near OOM; task will be killed and restarted |
| RunningTaskCount | < desired count | 1 min | 1 | CRITICAL | Tasks crashed and not recovering (requires Container Insights) |
Parameters:
ClusterName:
Type: String
Default: YOUR_CLUSTER_NAME
ServiceName:
Type: String
Default: YOUR_SERVICE_NAME
SnsTopicArn:
Type: String
Default: YOUR_SNS_TOPIC_ARN
DesiredTaskCount:
Type: Number
Default: 2
Resources:
EcsCpuWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${ServiceName}-cpu-warn"
AlarmDescription: ECS CPU utilization above 80% for 10 minutes
Namespace: AWS/ECS
MetricName: CPUUtilization
Dimensions:
- Name: ClusterName
Value: !Ref ClusterName
- Name: ServiceName
Value: !Ref ServiceName
Statistic: Average
Period: 300
EvaluationPeriods: 2
Threshold: 80
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
OKActions: [!Ref SnsTopicArn]
EcsCpuCritical:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${ServiceName}-cpu-critical"
AlarmDescription: ECS CPU above 95% - tasks are throttled
Namespace: AWS/ECS
MetricName: CPUUtilization
Dimensions:
- Name: ClusterName
Value: !Ref ClusterName
- Name: ServiceName
Value: !Ref ServiceName
Statistic: Average
Period: 300
EvaluationPeriods: 2
Threshold: 95
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
EcsMemoryWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${ServiceName}-memory-warn"
AlarmDescription: ECS memory utilization above 85%
Namespace: AWS/ECS
MetricName: MemoryUtilization
Dimensions:
- Name: ClusterName
Value: !Ref ClusterName
- Name: ServiceName
Value: !Ref ServiceName
Statistic: Average
Period: 300
EvaluationPeriods: 2
Threshold: 85
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
EcsMemoryCritical:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${ServiceName}-memory-critical"
AlarmDescription: ECS memory utilization above 95% - OOM kill imminent
Namespace: AWS/ECS
MetricName: MemoryUtilization
Dimensions:
- Name: ClusterName
Value: !Ref ClusterName
- Name: ServiceName
Value: !Ref ServiceName
Statistic: Average
Period: 300
EvaluationPeriods: 2
Threshold: 95
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
EcsRunningTasksCritical:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${ServiceName}-tasks-critical"
AlarmDescription: Running task count below desired - service may be down (requires Container Insights)
Namespace: ECS/ContainerInsights # RunningTaskCount is published by Container Insights, not the AWS/ECS namespace
MetricName: RunningTaskCount
Dimensions:
- Name: ClusterName
Value: !Ref ClusterName
- Name: ServiceName
Value: !Ref ServiceName
Statistic: Average
Period: 60
EvaluationPeriods: 1
Threshold: !Ref DesiredTaskCount
ComparisonOperator: LessThanThreshold
TreatMissingData: breaching
AlarmActions: [!Ref SnsTopicArn]

2. EC2 — Elastic Compute Cloud
Catch hardware failures and unresponsive instances that Auto Scaling may miss.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| CPUUtilization | > 85% | 5 min | 3 | WARN | Sustained high CPU; investigate before saturation |
| CPUUtilization | > 95% | 5 min | 2 | CRITICAL | Instance at capacity; requests will queue or fail |
| StatusCheckFailed | > 0 | 1 min | 2 | CRITICAL | Instance or system check failing |
| StatusCheckFailed_System | > 0 | 1 min | 2 | CRITICAL | AWS hardware issue — instance may need recovery |
| NetworkIn | < 1000 bytes/period | 5 min | 3 | WARN | Traffic dropped to near-zero |
Parameters:
InstanceId:
Type: String
Default: YOUR_INSTANCE_ID
SnsTopicArn:
Type: String
Default: YOUR_SNS_TOPIC_ARN
Resources:
Ec2CpuWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${InstanceId}-cpu-warn"
AlarmDescription: EC2 CPU above 85% for 15 minutes
Namespace: AWS/EC2
MetricName: CPUUtilization
Dimensions:
- Name: InstanceId
Value: !Ref InstanceId
Statistic: Average
Period: 300
EvaluationPeriods: 3
Threshold: 85
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
OKActions: [!Ref SnsTopicArn]
Ec2CpuCritical:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${InstanceId}-cpu-critical"
AlarmDescription: EC2 CPU above 95% for 10 minutes
Namespace: AWS/EC2
MetricName: CPUUtilization
Dimensions:
- Name: InstanceId
Value: !Ref InstanceId
Statistic: Average
Period: 300
EvaluationPeriods: 2
Threshold: 95
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
Ec2StatusCheckFailed:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${InstanceId}-status-check-failed"
AlarmDescription: EC2 status check failed - instance may be unresponsive
Namespace: AWS/EC2
MetricName: StatusCheckFailed
Dimensions:
- Name: InstanceId
Value: !Ref InstanceId
Statistic: Maximum
Period: 60
EvaluationPeriods: 2
Threshold: 0
ComparisonOperator: GreaterThanThreshold
TreatMissingData: breaching
AlarmActions: [!Ref SnsTopicArn]
Ec2StatusCheckFailedSystem:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${InstanceId}-status-check-system"
AlarmDescription: EC2 system status check failed - AWS hardware issue
Namespace: AWS/EC2
MetricName: StatusCheckFailed_System
Dimensions:
- Name: InstanceId
Value: !Ref InstanceId
Statistic: Maximum
Period: 60
EvaluationPeriods: 2
Threshold: 0
ComparisonOperator: GreaterThanThreshold
TreatMissingData: breaching
AlarmActions: [!Ref SnsTopicArn]
Ec2NetworkInDrop:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${InstanceId}-network-in-drop"
AlarmDescription: EC2 NetworkIn near zero - traffic may have stopped
Namespace: AWS/EC2
MetricName: NetworkIn
Dimensions:
- Name: InstanceId
Value: !Ref InstanceId
Statistic: Sum
Period: 300
EvaluationPeriods: 3
Threshold: 1000
ComparisonOperator: LessThanThreshold
TreatMissingData: breaching
AlarmActions: [!Ref SnsTopicArn]

3. RDS — Relational Database Service
Get a 10–30 minute warning window before database failures occur.
| Instance Class | max_connections | 80% threshold |
|---|---|---|
| db.t3.micro | 87 | 69 |
| db.t3.small | 171 | 136 |
| db.t3.medium | 341 | 272 |
| db.t3.large | 648 | 518 |
| db.r5.large | 1365 | 1092 |
| db.r5.xlarge | 2730 | 2184 |
| db.r5.2xlarge | 5460 | 4368 |
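The values above follow the MySQL/MariaDB engine default max_connections = {DBInstanceClassMemory/12582880}. A rough check, assuming nominal instance memory (the real DBInstanceClassMemory value differs slightly, so expect deviations of a few connections for some classes):

```python
# Approximate MySQL/MariaDB default on RDS:
#   max_connections = DBInstanceClassMemory / 12582880
# Nominal instance memory (GiB) is used here; actual DBInstanceClassMemory
# differs slightly, so results can be off by a few connections.
def approx_max_connections(memory_gib: float) -> int:
    return int(memory_gib * 1024**3 / 12582880)

def warn_threshold(memory_gib: float) -> int:
    # 80% of max_connections, the alarm threshold used below
    return int(approx_max_connections(memory_gib) * 0.8)

print(approx_max_connections(4))  # db.t3.medium (4 GiB) -> 341
print(warn_threshold(4))          # 80% threshold -> 272
```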
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| CPUUtilization | > 80% | 5 min | 3 | WARN | DB under CPU load; queries slowing down |
| DatabaseConnections | > 80% of max | 5 min | 2 | WARN | Connection pool filling |
| FreeStorageSpace | < 10 GB | 5 min | 2 | WARN | Disk filling; DB will stop accepting writes |
| FreeStorageSpace | < 2 GB | 5 min | 1 | CRITICAL | Critically low disk |
| ReplicaLag | > 300 s | 1 min | 2 | WARN | Read replica falling behind |
| FreeableMemory | < 256 MB | 5 min | 3 | WARN | Low memory; buffer pool shrinking |
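CloudWatch reports FreeStorageSpace and FreeableMemory in raw bytes, which is why the thresholds in the template below look opaque. They reduce to simple arithmetic:

```python
# Convert human-readable sizes to the byte thresholds CloudWatch expects.
def gib_to_bytes(gib: int) -> int:
    return gib * 1024**3

def mib_to_bytes(mib: int) -> int:
    return mib * 1024**2

print(gib_to_bytes(10))   # 10737418240 -> the 10 GB disk-warn threshold
print(gib_to_bytes(2))    # 2147483648  -> the 2 GB disk-critical threshold
print(mib_to_bytes(256))  # 268435456   -> the 256 MB memory-warn threshold
```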
Parameters:
DbInstanceId:
Type: String
Default: YOUR_DB_INSTANCE_ID
SnsTopicArn:
Type: String
Default: YOUR_SNS_TOPIC_ARN
MaxConnectionsThreshold:
Type: Number
Default: 272
Description: 80% of max_connections for your instance class (see table above)
Resources:
RdsCpuWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${DbInstanceId}-cpu-warn"
AlarmDescription: RDS CPU above 80% for 15 minutes
Namespace: AWS/RDS
MetricName: CPUUtilization
Dimensions:
- Name: DBInstanceIdentifier
Value: !Ref DbInstanceId
Statistic: Average
Period: 300
EvaluationPeriods: 3
Threshold: 80
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
RdsConnectionsWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${DbInstanceId}-connections-warn"
AlarmDescription: RDS connections above 80% of max_connections
Namespace: AWS/RDS
MetricName: DatabaseConnections
Dimensions:
- Name: DBInstanceIdentifier
Value: !Ref DbInstanceId
Statistic: Average
Period: 300
EvaluationPeriods: 2
Threshold: !Ref MaxConnectionsThreshold
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
RdsDiskWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${DbInstanceId}-disk-warn"
AlarmDescription: RDS free storage below 10 GB
Namespace: AWS/RDS
MetricName: FreeStorageSpace
Dimensions:
- Name: DBInstanceIdentifier
Value: !Ref DbInstanceId
Statistic: Average
Period: 300
EvaluationPeriods: 2
Threshold: 10737418240
ComparisonOperator: LessThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
RdsDiskCritical:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${DbInstanceId}-disk-critical"
AlarmDescription: RDS free storage critically low (below 2 GB)
Namespace: AWS/RDS
MetricName: FreeStorageSpace
Dimensions:
- Name: DBInstanceIdentifier
Value: !Ref DbInstanceId
Statistic: Average
Period: 300
EvaluationPeriods: 1
Threshold: 2147483648
ComparisonOperator: LessThanThreshold
TreatMissingData: breaching
AlarmActions: [!Ref SnsTopicArn]
RdsReplicaLag:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${DbInstanceId}-replica-lag"
AlarmDescription: RDS read replica lag above 5 minutes
Namespace: AWS/RDS
MetricName: ReplicaLag
Dimensions:
- Name: DBInstanceIdentifier
Value: !Ref DbInstanceId
Statistic: Average
Period: 60
EvaluationPeriods: 2
Threshold: 300
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
RdsFreeMemoryWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${DbInstanceId}-memory-warn"
AlarmDescription: RDS freeable memory below 256 MB
Namespace: AWS/RDS
MetricName: FreeableMemory
Dimensions:
- Name: DBInstanceIdentifier
Value: !Ref DbInstanceId
Statistic: Average
Period: 300
EvaluationPeriods: 3
Threshold: 268435456
ComparisonOperator: LessThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]

4. Lambda
Detect errors, throttles, and runaway executions before they consume budget.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| Errors | > 0 | 1 min | 1 | WARN | Any function error |
| Errors | > 5 | 1 min | 2 | CRITICAL | Repeated errors — function may be broken |
| Throttles | > 0 | 1 min | 2 | WARN | Requests being dropped |
| Duration | > 80% of timeout | 1 min | 2 | WARN | Function nearing timeout |
| ConcurrentExecutions | > 800 | 1 min | 2 | WARN | Approaching account concurrency limit |
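The "80% of timeout" Duration threshold is just arithmetic on the function timeout, converted to milliseconds:

```python
# Duration alarms fire before the function hits its timeout:
# threshold = timeout (s) x 1000 (ms/s) x 0.8
def duration_threshold_ms(timeout_seconds: int, fraction: float = 0.8) -> int:
    return int(timeout_seconds * 1000 * fraction)

print(duration_threshold_ms(30))  # 24000
print(duration_threshold_ms(15))  # 12000
print(duration_threshold_ms(5))   # 4000
```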
Parameters:
FunctionName:
Type: String
Default: YOUR_FUNCTION_NAME
SnsTopicArn:
Type: String
Default: YOUR_SNS_TOPIC_ARN
DurationThresholdMs:
Type: Number
Default: 24000
Description: |
80% of your function timeout in ms.
e.g. 30s timeout -> 24000ms, 15s timeout -> 12000ms, 5s timeout -> 4000ms
Resources:
LambdaErrorsWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${FunctionName}-errors-warn"
AlarmDescription: Lambda function errors detected
Namespace: AWS/Lambda
MetricName: Errors
Dimensions:
- Name: FunctionName
Value: !Ref FunctionName
Statistic: Sum
Period: 60
EvaluationPeriods: 1
Threshold: 0
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
OKActions: [!Ref SnsTopicArn]
LambdaErrorsCritical:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${FunctionName}-errors-critical"
AlarmDescription: Lambda function errors above 5 - may be completely broken
Namespace: AWS/Lambda
MetricName: Errors
Dimensions:
- Name: FunctionName
Value: !Ref FunctionName
Statistic: Sum
Period: 60
EvaluationPeriods: 2
Threshold: 5
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
LambdaThrottlesWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${FunctionName}-throttles"
AlarmDescription: Lambda throttles detected - requests being dropped
Namespace: AWS/Lambda
MetricName: Throttles
Dimensions:
- Name: FunctionName
Value: !Ref FunctionName
Statistic: Sum
Period: 60
EvaluationPeriods: 2
Threshold: 0
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
LambdaDurationWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${FunctionName}-duration-warn"
AlarmDescription: !Sub "Lambda duration above 80% of timeout (${DurationThresholdMs}ms)"
Namespace: AWS/Lambda
MetricName: Duration
Dimensions:
- Name: FunctionName
Value: !Ref FunctionName
ExtendedStatistic: p99
Period: 60
EvaluationPeriods: 2
Threshold: !Ref DurationThresholdMs
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
LambdaConcurrencyWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${FunctionName}-concurrency-warn"
AlarmDescription: Lambda concurrent executions above 800 (80% of default limit 1000)
Namespace: AWS/Lambda
MetricName: ConcurrentExecutions
Dimensions:
- Name: FunctionName
Value: !Ref FunctionName
Statistic: Maximum
Period: 60
EvaluationPeriods: 2
Threshold: 800
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]

5. ALB — Application Load Balancer
Catch backend failures and unhealthy targets immediately.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| HTTPCode_Target_5XX_Count | > 0 | 1 min | 2 | WARN | Backend returning server errors |
| HTTPCode_Target_5XX_Count | > 10 | 1 min | 2 | CRITICAL | High rate of 5XX errors |
| TargetResponseTime | > 2 s | 5 min | 3 | WARN | Slow responses — users experiencing latency |
| TargetResponseTime | > 5 s | 5 min | 2 | CRITICAL | Very slow responses |
| UnHealthyHostCount | > 0 | 1 min | 2 | CRITICAL | Targets failing health checks |
| RejectedConnectionCount | > 0 | 1 min | 2 | WARN | ALB at max connections |
Parameters:
AlbSuffix:
Type: String
Default: YOUR_ALB_SUFFIX
Description: e.g. app/my-alb/abc123def456 (after "loadbalancer/" in the ARN)
SnsTopicArn:
Type: String
Default: YOUR_SNS_TOPIC_ARN
Resources:
Alb5xxWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "alb-5xx-warn-${AlbSuffix}"
AlarmDescription: ALB backend 5XX errors detected
Namespace: AWS/ApplicationELB
MetricName: HTTPCode_Target_5XX_Count
Dimensions:
- Name: LoadBalancer
Value: !Ref AlbSuffix
Statistic: Sum
Period: 60
EvaluationPeriods: 2
Threshold: 0
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
OKActions: [!Ref SnsTopicArn]
Alb5xxCritical:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "alb-5xx-critical-${AlbSuffix}"
AlarmDescription: ALB backend 5XX errors above 10 per minute
Namespace: AWS/ApplicationELB
MetricName: HTTPCode_Target_5XX_Count
Dimensions:
- Name: LoadBalancer
Value: !Ref AlbSuffix
Statistic: Sum
Period: 60
EvaluationPeriods: 2
Threshold: 10
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
AlbLatencyWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "alb-latency-warn-${AlbSuffix}"
AlarmDescription: ALB target response time above 2 seconds
Namespace: AWS/ApplicationELB
MetricName: TargetResponseTime
Dimensions:
- Name: LoadBalancer
Value: !Ref AlbSuffix
ExtendedStatistic: p99
Period: 300
EvaluationPeriods: 3
Threshold: 2
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
AlbLatencyCritical:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "alb-latency-critical-${AlbSuffix}"
AlarmDescription: ALB target response time above 5 seconds
Namespace: AWS/ApplicationELB
MetricName: TargetResponseTime
Dimensions:
- Name: LoadBalancer
Value: !Ref AlbSuffix
ExtendedStatistic: p99
Period: 300
EvaluationPeriods: 2
Threshold: 5
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
AlbUnhealthyHosts:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "alb-unhealthy-hosts-${AlbSuffix}"
AlarmDescription: ALB unhealthy target count above zero
Namespace: AWS/ApplicationELB
MetricName: UnHealthyHostCount
Dimensions:
- Name: TargetGroup
Value: YOUR_TARGETGROUP_SUFFIX # e.g. targetgroup/my-tg/abc123 - UnHealthyHostCount requires both TargetGroup and LoadBalancer dimensions
- Name: LoadBalancer
Value: !Ref AlbSuffix
Statistic: Maximum
Period: 60
EvaluationPeriods: 2
Threshold: 0
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
AlbRejectedConnections:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "alb-rejected-connections-${AlbSuffix}"
AlarmDescription: ALB rejected connections - load balancer at max capacity
Namespace: AWS/ApplicationELB
MetricName: RejectedConnectionCount
Dimensions:
- Name: LoadBalancer
Value: !Ref AlbSuffix
Statistic: Sum
Period: 60
EvaluationPeriods: 2
Threshold: 0
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]

6. API Gateway
Detect integration failures and slow backends before the 29-second timeout.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| 5XXError | > 5 count | 1 min | 2 | WARN | Backend integration errors |
| 4XXError | > 50 per 5 min | 5 min | 3 | WARN | High client error rate |
| Latency | > 3000 ms p99 | 5 min | 3 | WARN | Slow backend responses |
| Latency | > 10000 ms | 5 min | 2 | CRITICAL | Near 29s timeout |
Parameters:
ApiName:
Type: String
Default: YOUR_API_NAME
Stage:
Type: String
Default: prod
SnsTopicArn:
Type: String
Default: YOUR_SNS_TOPIC_ARN
Resources:
ApiGw5xxWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${ApiName}-${Stage}-5xx-warn"
AlarmDescription: API Gateway 5XX errors above 5 per minute
Namespace: AWS/ApiGateway
MetricName: 5XXError
Dimensions:
- Name: ApiName
Value: !Ref ApiName
- Name: Stage
Value: !Ref Stage
Statistic: Sum
Period: 60
EvaluationPeriods: 2
Threshold: 5
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
OKActions: [!Ref SnsTopicArn]
ApiGw4xxWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${ApiName}-${Stage}-4xx-warn"
AlarmDescription: API Gateway 4XX errors above 50 per 5 minutes
Namespace: AWS/ApiGateway
MetricName: 4XXError
Dimensions:
- Name: ApiName
Value: !Ref ApiName
- Name: Stage
Value: !Ref Stage
Statistic: Sum
Period: 300
EvaluationPeriods: 3
Threshold: 50
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
ApiGwLatencyWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${ApiName}-${Stage}-latency-warn"
AlarmDescription: API Gateway p99 latency above 3 seconds
Namespace: AWS/ApiGateway
MetricName: Latency
Dimensions:
- Name: ApiName
Value: !Ref ApiName
- Name: Stage
Value: !Ref Stage
ExtendedStatistic: p99
Period: 300
EvaluationPeriods: 3
Threshold: 3000
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
ApiGwLatencyCritical:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${ApiName}-${Stage}-latency-critical"
AlarmDescription: API Gateway latency above 10 seconds - near 29s timeout
Namespace: AWS/ApiGateway
MetricName: Latency
Dimensions:
- Name: ApiName
Value: !Ref ApiName
- Name: Stage
Value: !Ref Stage
Statistic: Average
Period: 300
EvaluationPeriods: 2
Threshold: 10000
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]

7. SQS — Simple Queue Service
Detect queue backups and processing failures before they accumulate.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| ApproximateNumberOfMessagesVisible | > 1000 | 5 min | 3 | WARN | Queue building up |
| ApproximateNumberOfMessagesVisible | > 10000 | 5 min | 2 | CRITICAL | Severe queue backup |
| ApproximateAgeOfOldestMessage | > 300 s | 5 min | 2 | WARN | Messages sitting unprocessed |
| ApproximateAgeOfOldestMessage | > 900 s | 5 min | 2 | CRITICAL | Messages 15+ minutes old |
Parameters:
QueueName:
Type: String
Default: YOUR_QUEUE_NAME
SnsTopicArn:
Type: String
Default: YOUR_SNS_TOPIC_ARN
Resources:
SqsQueueDepthWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${QueueName}-depth-warn"
AlarmDescription: SQS queue depth above 1000 - consumers may be lagging
Namespace: AWS/SQS
MetricName: ApproximateNumberOfMessagesVisible # Visible = waiting in the queue; NotVisible counts only in-flight messages
Dimensions:
- Name: QueueName
Value: !Ref QueueName
Statistic: Maximum
Period: 300
EvaluationPeriods: 3
Threshold: 1000
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
OKActions: [!Ref SnsTopicArn]
SqsQueueDepthCritical:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${QueueName}-depth-critical"
AlarmDescription: SQS queue depth above 10000 - severe consumer failure
Namespace: AWS/SQS
MetricName: ApproximateNumberOfMessagesVisible
Dimensions:
- Name: QueueName
Value: !Ref QueueName
Statistic: Maximum
Period: 300
EvaluationPeriods: 2
Threshold: 10000
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
SqsMessageAgeWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${QueueName}-age-warn"
AlarmDescription: SQS oldest message age above 5 minutes
Namespace: AWS/SQS
MetricName: ApproximateAgeOfOldestMessage
Dimensions:
- Name: QueueName
Value: !Ref QueueName
Statistic: Maximum
Period: 300
EvaluationPeriods: 2
Threshold: 300
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
SqsMessageAgeCritical:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${QueueName}-age-critical"
AlarmDescription: SQS oldest message age above 15 minutes - SLA breach
Namespace: AWS/SQS
MetricName: ApproximateAgeOfOldestMessage
Dimensions:
- Name: QueueName
Value: !Ref QueueName
Statistic: Maximum
Period: 300
EvaluationPeriods: 2
Threshold: 900
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]

8. DynamoDB
Catch throttling and capacity issues before they impact applications.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| SystemErrors | > 0 | 1 min | 2 | CRITICAL | AWS-side DynamoDB errors |
| UserErrors | > 0 | 5 min | 3 | WARN | Client-side errors |
| ConsumedReadCapacityUnits | > 80% of provisioned | 5 min | 2 | WARN | Read capacity filling up |
| ConsumedWriteCapacityUnits | > 80% of provisioned | 5 min | 2 | WARN | Write capacity filling up |
| ThrottledRequests | > 0 | 5 min | 2 | WARN | Requests being throttled |
Parameters:
TableName:
Type: String
Default: YOUR_TABLE_NAME
SnsTopicArn:
Type: String
Default: YOUR_SNS_TOPIC_ARN
ProvisionedReadCapacity:
Type: Number
Default: 100
Description: Your table's provisioned RCU (skip for on-demand mode)
ProvisionedWriteCapacity:
Type: Number
Default: 100
Description: Your table's provisioned WCU (skip for on-demand mode)
Resources:
DynamoDbSystemErrors:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${TableName}-system-errors"
AlarmDescription: DynamoDB system errors detected - possible AWS service issue
Namespace: AWS/DynamoDB
MetricName: SystemErrors
Dimensions:
- Name: TableName
Value: !Ref TableName
Statistic: Sum
Period: 60
EvaluationPeriods: 2
Threshold: 0
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
DynamoDbUserErrors:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${TableName}-user-errors"
AlarmDescription: DynamoDB user errors - bad requests or auth issues
Namespace: AWS/DynamoDB
MetricName: UserErrors
Dimensions:
- Name: TableName
Value: !Ref TableName
Statistic: Sum
Period: 300
EvaluationPeriods: 3
Threshold: 0
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
DynamoDbThrottledRequests:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${TableName}-throttled"
AlarmDescription: DynamoDB throttled requests - requests being delayed
Namespace: AWS/DynamoDB
MetricName: ThrottledRequests
Dimensions:
- Name: TableName
Value: !Ref TableName
Statistic: Sum
Period: 300
EvaluationPeriods: 2
Threshold: 0
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
DynamoDbReadCapacityWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${TableName}-read-capacity-warn"
AlarmDescription: DynamoDB consumed read capacity above 80% of provisioned (omit for on-demand tables)
Namespace: AWS/DynamoDB
MetricName: ConsumedReadCapacityUnits
Dimensions:
- Name: TableName
Value: !Ref TableName
Statistic: Sum
Period: 300
EvaluationPeriods: 2
Threshold: 24000 # ConsumedReadCapacityUnits is summed over the period, so use 0.8 x ProvisionedReadCapacity x 300 (here 0.8 x 100 x 300); CloudFormation cannot compute this inline, so adjust for your RCU
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
DynamoDbWriteCapacityWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${TableName}-write-capacity-warn"
AlarmDescription: DynamoDB consumed write capacity above 80% of provisioned (omit for on-demand tables)
Namespace: AWS/DynamoDB
MetricName: ConsumedWriteCapacityUnits
Dimensions:
- Name: TableName
Value: !Ref TableName
Statistic: Sum
Period: 300
EvaluationPeriods: 2
Threshold: 24000 # 0.8 x ProvisionedWriteCapacity x 300 (here 0.8 x 100 x 300); adjust for your WCU
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]

9. ElastiCache (Redis)
Monitor cache health, hit rates, and replication status to prevent backend overload.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| CPUUtilization | > 80% | 5 min | 2 | WARN | Redis single-threaded; causes latency spikes |
| FreeableMemory | < 100 MB | 5 min | 2 | WARN | Redis evicting keys |
| CacheHitRate | < 0.8 (80%) | 5 min | 3 | WARN | Cache not effective |
| CurrConnections | > 1000 | 5 min | 2 | WARN | High connection count |
| ReplicationLag | > 60 s | 1 min | 2 | WARN | Replica falling behind primary |
Parameters:
CacheClusterId:
Type: String
Default: YOUR_CACHE_CLUSTER_ID
SnsTopicArn:
Type: String
Default: YOUR_SNS_TOPIC_ARN
Resources:
RedisCpuWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${CacheClusterId}-cpu-warn"
AlarmDescription: ElastiCache CPU above 80%
Namespace: AWS/ElastiCache
MetricName: CPUUtilization
Dimensions:
- Name: CacheClusterId
Value: !Ref CacheClusterId
Statistic: Average
Period: 300
EvaluationPeriods: 2
Threshold: 80
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
RedisFreeMemoryWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${CacheClusterId}-memory-warn"
AlarmDescription: ElastiCache freeable memory below 100 MB - keys may be evicted
Namespace: AWS/ElastiCache
MetricName: FreeableMemory
Dimensions:
- Name: CacheClusterId
Value: !Ref CacheClusterId
Statistic: Average
Period: 300
EvaluationPeriods: 2
Threshold: 104857600
ComparisonOperator: LessThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
RedisCacheHitRateWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${CacheClusterId}-hit-rate-warn"
AlarmDescription: ElastiCache cache hit rate below 80% - DB taking excessive load
Namespace: AWS/ElastiCache
MetricName: CacheHitRate
Dimensions:
- Name: CacheClusterId
Value: !Ref CacheClusterId
Statistic: Average
Period: 300
EvaluationPeriods: 3
Threshold: 0.8 # CacheHitRate is hits / (hits + misses); if your datapoints read 0-100 rather than 0-1, use 80 instead
ComparisonOperator: LessThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
RedisCurrConnectionsWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${CacheClusterId}-connections-warn"
AlarmDescription: ElastiCache connections above 1000
Namespace: AWS/ElastiCache
MetricName: CurrConnections
Dimensions:
- Name: CacheClusterId
Value: !Ref CacheClusterId
Statistic: Maximum
Period: 300
EvaluationPeriods: 2
Threshold: 1000
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
RedisReplicationLagWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${CacheClusterId}-replication-lag"
AlarmDescription: ElastiCache replication lag above 60 seconds
Namespace: AWS/ElastiCache
MetricName: ReplicationLag
Dimensions:
- Name: CacheClusterId
Value: !Ref CacheClusterId
Statistic: Average
Period: 60
EvaluationPeriods: 2
Threshold: 60
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]

10. Cost & Budget Alerts
These use AWS Budgets, not CloudWatch. The resources are AWS::Budgets::Budget (CloudFormation) and aws_budgets_budget (Terraform), not CloudWatch alarms. They still send alerts to email or SNS.
| Alert Type | Threshold | Type | Severity | Why It Matters |
|---|---|---|---|---|
| Monthly spend actual | 80% of budget | ACTUAL | WARN | Early warning to review usage |
| Monthly spend actual | 100% of budget | ACTUAL | CRITICAL | Budget exceeded |
| Monthly spend forecasted | 100% of budget | FORECASTED | WARN | Projected to exceed budget |
| Anomaly detection | $50 above expected | ANOMALY | WARN | Unusual spending pattern |
Parameters:
MonthlyBudgetAmount:
Type: Number
Default: 100
Description: Monthly AWS budget in USD
AlertEmail:
Type: String
Default: you@yourcompany.com
Description: Email for budget alerts
Resources:
MonthlyBudget:
Type: AWS::Budgets::Budget
Properties:
Budget:
BudgetName: monthly-aws-budget
BudgetType: COST
TimeUnit: MONTHLY
BudgetLimit:
Amount: !Ref MonthlyBudgetAmount
Unit: USD
NotificationsWithSubscribers:
- Notification:
NotificationType: ACTUAL
ComparisonOperator: GREATER_THAN
Threshold: 80
ThresholdType: PERCENTAGE
Subscribers:
- SubscriptionType: EMAIL
Address: !Ref AlertEmail
- Notification:
NotificationType: ACTUAL
ComparisonOperator: GREATER_THAN
Threshold: 100
ThresholdType: PERCENTAGE
Subscribers:
- SubscriptionType: EMAIL
Address: !Ref AlertEmail
- Notification:
NotificationType: FORECASTED
ComparisonOperator: GREATER_THAN
Threshold: 100
ThresholdType: PERCENTAGE
Subscribers:
- SubscriptionType: EMAIL
Address: !Ref AlertEmail
CostAnomalyMonitor:
Type: AWS::CE::AnomalyMonitor
Properties:
MonitorName: aws-cost-anomaly-monitor
MonitorType: DIMENSIONAL
MonitorDimension: SERVICE
CostAnomalySubscription:
Type: AWS::CE::AnomalySubscription
Properties:
SubscriptionName: cost-anomaly-alerts
MonitorArnList:
- !GetAtt CostAnomalyMonitor.MonitorArn
Subscribers:
- Address: !Ref AlertEmail
Type: EMAIL
Threshold: 50
Frequency: DAILY

Frequently asked questions
What CloudWatch alarms should every ECS service have?
Every ECS service needs five alarms at minimum: CPUUtilization > 80% (WARN) and > 95% (CRITICAL) with 5-minute periods, MemoryUtilization > 85% (WARN) and > 95% (CRITICAL), and RunningTaskCount < desired count (immediate signal of task crashes). Wire all five to an SNS topic that sends to your on-call channel. Without the RunningTaskCount alarm, crashed tasks can go undetected until a user reports an error.
How do I create a CloudWatch alarm that sends to Slack or WhatsApp?
The path is: CloudWatch Alarm → SNS Topic → Lambda function → Slack/WhatsApp API. Create an SNS topic, subscribe a Lambda to it, and set the Lambda to call the Slack webhook or WhatsApp Business API. The Lambda receives the raw alarm payload and can optionally enrich it with CloudWatch Logs Insights data before sending — making the notification contain a diagnosis rather than just a raw metric value.
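The middle hop of that pipeline can be a small Lambda like the sketch below. The webhook URL is a hypothetical placeholder; the AlarmName, NewStateValue, and NewStateReason fields are part of the real SNS message CloudWatch alarms publish:

```python
import json
import urllib.request

# Hypothetical URL; substitute your own Slack incoming-webhook endpoint.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def format_alarm(alarm: dict) -> dict:
    """Turn a CloudWatch alarm message (parsed from SNS) into a Slack payload."""
    state = alarm["NewStateValue"]
    emoji = ":red_circle:" if state == "ALARM" else ":large_green_circle:"
    return {"text": f"{emoji} {alarm['AlarmName']} is {state}: {alarm['NewStateReason']}"}

def handler(event, context):
    # SNS delivers the alarm JSON as a string in Records[0].Sns.Message.
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(format_alarm(alarm)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

Subscribe the Lambda to the alerts SNS topic; the same shape works for any chat API by swapping the payload format.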
How many CloudWatch alarms does a typical AWS production environment need?
A typical production environment with ECS, RDS, ALB, and Lambda needs approximately 15–25 alarms: 5 per ECS service (CPU warn/critical, memory warn/critical, task count), 4 for RDS (CPU, connections, storage, replica lag), 3 for ALB (5xx error rate, P99 latency, request spike), 3 for Lambda (errors, duration, throttles), and 1–2 budget alerts. More than 30 alarms for a small team usually indicates noise, not coverage.
What is the difference between CloudFormation and Terraform for CloudWatch alarm setup?
Both produce identical CloudWatch alarms — the difference is toolchain. CloudFormation (YAML) is native AWS and requires no additional tools; it's the default choice if you have no existing IaC. Terraform (HCL) requires the CLI and state management but integrates better with multi-cloud or non-AWS resources. Choose whichever your team already uses — the alarm configuration is functionally identical.
What threshold should I use for ECS CPU alarms?
Use 80% for WARN and 95% for CRITICAL, both evaluated over 2 periods of 5 minutes each (10 minutes total). This prevents false positives from momentary CPU spikes during deploys or cold starts. Near 95% CPUUtilization, ECS tasks are typically CPU-throttled and latency climbs quickly; the CRITICAL alarm at 95% over 10 minutes confirms a sustained problem rather than a transient spike.
Still debugging incidents manually?
ConvOps does this automatically — root cause in under 60 seconds, delivered to WhatsApp or Slack.