CloudWatch Metric Guide

AWS/EC2/StatusCheckFailed · Count

StatusCheckFailed: Amazon EC2 CloudWatch metric

Q: Which CloudWatch namespace does StatusCheckFailed belong to?

StatusCheckFailed is published in the AWS/EC2 CloudWatch namespace, with a unit of Count. See AWS documentation: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/viewing_metrics_with_cloudwatch.html

StatusCheckFailed combines the results of both the instance status check (instance software and network configuration) and the system status check (underlying AWS host hardware). A value of 1 means at least one of these checks has failed.
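As a minimal sketch of how the namespace, metric name, and unit fit together in a query, the following builds the keyword arguments for boto3's CloudWatch `get_metric_statistics` call. The helper names `status_check_query` and `any_failure` are hypothetical, as are the 15-minute default window and the choice of the Maximum statistic.

```python
from datetime import datetime, timedelta, timezone


def status_check_query(instance_id, minutes=15, now=None):
    """Build kwargs for CloudWatch get_metric_statistics (hypothetical helper)."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,               # seconds per datapoint
        "Statistics": ["Maximum"],  # assumption: Maximum is 1 if any check failed in the period
        "Unit": "Count",
    }


def any_failure(datapoints):
    """Return True if any datapoint in a Datapoints list reports a failed check."""
    return any(dp.get("Maximum", 0) > 0 for dp in datapoints)
```

With AWS credentials configured, `boto3.client("cloudwatch").get_metric_statistics(**status_check_query(instance_id))` returns a response whose `Datapoints` list can be passed to `any_failure`.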


What it measures

About StatusCheckFailed

Namespace	AWS/EC2
Metric name	StatusCheckFailed
Unit	Count
AWS docs	Official Amazon EC2 metrics reference

Why this metric matters

StatusCheckFailed = 1 is an emergency signal, not a warning. Unlike CPU or memory metrics that indicate stress, a failed status check means the instance itself has a fundamental problem — either a kernel panic, a hardware failure on the underlying host, or a network misconfiguration that makes the instance unreachable.

AWS distinguishes two types of failures: instance status failures (problems in the OS, which a reboot or user intervention can often fix) and system status failures (problems with the underlying AWS hardware, which require AWS to move the instance to a healthy host). Both types result in StatusCheckFailed = 1. Both require immediate action. Neither resolves without intervention or AWS action.
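To tell the two failure types apart in practice, one option is to inspect the per-check statuses that the EC2 `describe_instance_status` API returns. The sketch below classifies a single entry from the `InstanceStatuses` list of that response. The function name `classify_status`, the returned labels, and the rule that a system impairment takes precedence are assumptions made for illustration.

```python
def classify_status(instance_status):
    """Classify one entry of describe_instance_status()['InstanceStatuses'].

    Returns "system", "instance", "ok", or "unknown" (labels are illustrative).
    """
    system = instance_status["SystemStatus"]["Status"]
    instance = instance_status["InstanceStatus"]["Status"]
    if system == "impaired":
        # Assumption: report the host-level problem first, since an OS-level
        # fix cannot help while the underlying host is unhealthy.
        return "system"
    if instance == "impaired":
        return "instance"
    if system == "ok" and instance == "ok":
        return "ok"
    # Covers states such as "initializing" and "insufficient-data".
    return "unknown"
```

A result of "instance" points toward a reboot or OS-level repair, while "system" points toward waiting for AWS recovery or stopping and starting the instance.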

Recommended alarm threshold for StatusCheckFailed

Recommended threshold

> 0 — alarm immediately

StatusCheckFailed has no gradual scale: per AWS documentation it is either passing (0) or failing (1). A value of 1 means the instance status check, the system status check, or both have failed, and ConvOps recommends treating it as a failure that must be addressed immediately. Set this alarm with a 1-minute evaluation period and the minimum number of evaluation periods to minimize time-to-alert. In a production environment, StatusCheckFailed = 1 always warrants investigation.
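A minimal sketch of this recommendation, expressed as the parameters for boto3's CloudWatch `put_metric_alarm` call, is shown here. The helper names, the alarm naming scheme, the Maximum statistic, and the SNS topic used as the alarm action are assumptions, not part of the recommendation itself.

```python
def status_check_alarm(instance_id, sns_topic_arn):
    """Build put_metric_alarm kwargs for a '> 0' StatusCheckFailed alarm."""
    return {
        "AlarmName": f"StatusCheckFailed-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",       # assumption: one failing datapoint should trip the alarm
        "Period": 60,                 # 1-minute evaluation period
        "EvaluationPeriods": 1,       # minimum number of evaluation periods
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",  # alarm when the value is > 0
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params):
    """Send the alarm definition to CloudWatch (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without boto3 installed

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameter builder separate from the API call makes the alarm definition easy to review or unit test before anything is created in an account.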

Is your StatusCheckFailed alarm already set up correctly?

The free ConvOps Audit scans your CloudWatch setup and flags missing or misconfigured alarms — including StatusCheckFailed — in 5 minutes.


Common failures that show up in StatusCheckFailed

When StatusCheckFailed reaches an alarm threshold, these are the most common root causes — in order of how often ConvOps sees them across customer AWS accounts.

System status check failure: the underlying AWS host hardware has failed. On supported instance types AWS automatically recovers the instance onto healthy hardware; otherwise a stop and start is needed to move it to a new host. Either way, the recovery window causes downtime
Kernel panic or OS crash — the instance OS has crashed; a hard reboot initiated from the EC2 console or via API is required
Network interface misconfiguration: a change to the instance's network interface or its OS-level network configuration (for example interface or routing settings) has left the instance unable to respond to status checks
Corrupted disk — the instance's EBS root volume has encountered I/O errors, causing the OS to mount the filesystem in read-only mode and fail status checks
Memory exhaustion — the instance ran out of memory, the OOM killer terminated critical OS processes, and the instance is effectively non-functional without a reboot
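Several of the OS-level causes above leave recognizable messages in the instance's console output, which can be retrieved with the EC2 `get_console_output` API. The sketch below scans that text for a few well-known Linux kernel messages. The pattern list is illustrative only and far from exhaustive, the function name `guess_causes` is hypothetical, and real output varies by OS and kernel version.

```python
import re

# Illustrative patterns only; an empty result does not rule any cause out.
CAUSE_PATTERNS = [
    ("kernel panic", re.compile(r"Kernel panic - not syncing")),
    ("memory exhaustion", re.compile(r"Out of memory: Kill(ed)? process")),
    ("corrupted disk", re.compile(r"Remounting filesystem read-only|I/O error")),
]


def guess_causes(console_output):
    """Return the names of all causes whose pattern appears in the console text."""
    return [name for name, pattern in CAUSE_PATTERNS if pattern.search(console_output)]
```

Host hardware failures and network misconfiguration usually leave no trace in the console output, so this kind of scan only helps narrow down instance-level failures.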

How ConvOps debugs StatusCheckFailed alarms

When StatusCheckFailed triggers an alarm, ConvOps Diagnose reads CloudWatch Logs, CloudTrail (recent API calls, deploys, config changes), and the current resource state in parallel. It correlates these with AWS/EC2 metrics on the same resource — giving you a plain-English root cause with numbered fix options, sent to WhatsApp or Slack, usually within 60 seconds of the alarm firing.

Before any anomaly in StatusCheckFailed reaches you as a proactive alert (via ConvOps Watch), it passes through 9 verification checks: a Recovery check (did the metric self-heal?), an AWS Status check (is AWS itself having an incident?), a Deploy check (was there a recent Lambda update, ECS deploy, or RDS parameter change in the last 120 minutes?), a Quota check, an Infrastructure check, a Security check, a Flap check (has this metric been anomalous more than 5 times in the last 24 hours?), a TLS check, and a Vulnerability check. Only anomalies that pass all relevant checks reach you — with full context attached.
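The Flap check is the simplest of these to illustrate. The sketch below is not ConvOps's implementation; it only shows one way to count anomalies inside a sliding 24-hour window using the limit of 5 quoted above. The function name `is_flapping` and its parameters are hypothetical.

```python
from datetime import timedelta


def is_flapping(anomaly_times, now, max_count=5, window=timedelta(hours=24)):
    """Return True if more than max_count anomalies fall within the window ending at now.

    anomaly_times: iterable of datetime objects, all naive or all timezone-aware
    in the same way as `now`.
    """
    recent = [t for t in anomaly_times if now - window <= t <= now]
    return len(recent) > max_count
```

A metric flagged as flapping would presumably be handled differently from a first-time anomaly, for example by suppressing repeat alerts.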

ConvOps Watch

Detects StatusCheckFailed anomalies with z-score against 30-day time-bucketed baselines. 9 verification checks before any alert.

ConvOps Diagnose

When a StatusCheckFailed alarm fires, reads logs, CloudTrail, and resource state. Sends root cause + fix options to WhatsApp or Slack.

ConvOps Audit

Scans your CloudWatch setup for missing or misconfigured StatusCheckFailed alarms. Free, 5-minute read-only scan.
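The z-score detection mentioned for ConvOps Watch can be illustrated with a short sketch. This is a generic version of the technique, not ConvOps's implementation: the bucketing scheme (weekday and hour of day) and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def bucket_key(ts):
    """Assumed bucketing: one bucket per (weekday, hour of day)."""
    return (ts.weekday(), ts.hour)


def build_baseline(history):
    """history: iterable of (datetime, value) pairs, e.g. the last 30 days."""
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[bucket_key(ts)].append(value)
    # Store the mean and population standard deviation per bucket.
    return {key: (mean(vals), pstdev(vals)) for key, vals in buckets.items()}


def z_score(baseline, ts, value):
    """Return the z-score of value against its time bucket, or None without a baseline."""
    stats = baseline.get(bucket_key(ts))
    if stats is None:
        return None
    mu, sigma = stats
    if sigma == 0:
        # A flat baseline makes any deviation infinitely anomalous.
        return 0.0 if value == mu else float("inf")
    return (value - mu) / sigma
```

For a binary metric such as StatusCheckFailed, a healthy instance's baseline is all zeros, so any value of 1 falls into the zero-variance branch. That is consistent with the static `> 0` threshold recommended earlier.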


Related Amazon EC2 metrics

StatusCheckFailed rarely fails in isolation. These metrics tend to correlate — monitor them together for complete Amazon EC2 coverage.

CPUUtilization, NetworkIn, NetworkOut

FAQ

Frequently asked questions about StatusCheckFailed

Common questions about setting up CloudWatch alarms for StatusCheckFailed in Amazon EC2.

What is the recommended CloudWatch alarm threshold for StatusCheckFailed?

> 0: alarm immediately. StatusCheckFailed has no gradual scale: per AWS documentation it is either passing (0) or failing (1). A value of 1 means the instance status check, the system status check, or both have failed, and ConvOps recommends treating it as a failure that must be addressed immediately. Set this alarm with a 1-minute evaluation period and the minimum number of evaluation periods to minimize time-to-alert. In a production environment, StatusCheckFailed = 1 always warrants investigation.

Which CloudWatch namespace does StatusCheckFailed belong to?

StatusCheckFailed is published in the AWS/EC2 namespace with a unit of Count. You can find it in the CloudWatch console under "Metrics" → "AWS/EC2". See the Amazon EC2 CloudWatch metrics reference in the AWS documentation.

Does ConvOps automatically create CloudWatch alarms for StatusCheckFailed?

ConvOps does not create alarms for you by default — it debugs the alarms you already have (or identifies missing ones). The free ConvOps Audit scans your CloudWatch setup and tells you which Amazon EC2 resources are missing a StatusCheckFailed alarm. ConvOps Watch then monitors StatusCheckFailed using z-score anomaly detection against a 30-day baseline, running 9 verification checks before alerting you.

Can I use ConvOps without already having a StatusCheckFailed alarm set up?

Yes. ConvOps Watch monitors StatusCheckFailed independently of your CloudWatch alarm configuration — it reads the metric directly from CloudWatch every 5 minutes on the Growth plan. If you run the free Audit first, it will tell you which resources need a StatusCheckFailed alarm and provide the copy-paste AWS CLI command to create it.