How is this different from just setting a CloudWatch alarm?

A CloudWatch alarm fires every time a metric crosses a static threshold — and stays fired until the metric recovers. ConvOps detects anomalies against a time-bucketed baseline (so Tuesday 3 PM is compared to other Tuesday 3 PM readings), then runs 9 verification checks before deciding whether to tell you. The result: far fewer false positives, and each alert arrives with context already attached.

How long does detection take?

ConvOps collects metrics every 5 minutes via an EventBridge cron. Detection latency is typically under 6 minutes from when the anomaly starts.

What does "self-healed" mean exactly?

After detecting an anomaly, check 1 re-fetches the metric from CloudWatch right now. If the value is back within 1.5σ of its baseline mean, the spike resolved itself and the alert is suppressed. This is what catches the 3 AM CPU blips that would have woken you up for nothing.
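As a rough illustration of the comparison described above, here is a minimal sketch. The function name, its parameters, and the handling of a zero-variance baseline are assumptions made for the example, not ConvOps's actual implementation; only the 1.5σ figure comes from this page.

```python
def is_self_healed(current_value: float,
                   baseline_mean: float,
                   baseline_std: float,
                   threshold_sigma: float = 1.5) -> bool:
    """Return True when the re-fetched value sits within threshold_sigma
    standard deviations of the baseline mean (boundary treated as inclusive,
    which is an assumption)."""
    if baseline_std == 0:
        # Assumed edge case: with no variance in the baseline, only an
        # exact return to the mean counts as recovered.
        return current_value == baseline_mean
    return abs(current_value - baseline_mean) <= threshold_sigma * baseline_std


# Hypothetical numbers: a baseline of 20% CPU with a standard deviation of 8.
recovered = is_self_healed(18.0, 20.0, 8.0)      # back near baseline
still_spiking = is_self_healed(92.0, 20.0, 8.0)  # far outside 1.5 sigma
```

With these made-up figures the first reading would be suppressed and the second would continue through the pipeline.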

Does ConvOps replace my CloudWatch alarms?

No — the two pipelines are complementary. ConvOps Watch is proactive detection: it finds anomalies on metrics that have no alarm set. Your existing CloudWatch alarms continue to fire as normal, and ConvOps Diagnose enriches those alarm notifications with root-cause context. If an anomaly is already covered by a CloudWatch alarm, Watch suppresses the duplicate.

What is the daily digest?

Metrics that are flapping — firing and recovering more than 5 times in 24 hours — get routed to a daily digest email (sent at 09:00 UTC) instead of generating a full alert each time. You still know about the pattern; you don't get paged for it.
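The routing rule above can be sketched in a few lines. The names `route_alert`, `FLAP_LIMIT`, and `FLAP_WINDOW` are hypothetical, and the exact window boundaries are an assumption; the "more than 5 times in 24 hours" rule is the only part taken from this page.

```python
from datetime import datetime, timedelta, timezone

FLAP_LIMIT = 5                      # "more than 5 times" means 6 or more
FLAP_WINDOW = timedelta(hours=24)


def route_alert(anomaly_times: list[datetime], now: datetime) -> str:
    """Count anomalies for one metric inside the trailing 24-hour window
    and pick a destination."""
    cutoff = now - FLAP_WINDOW
    recent = [t for t in anomaly_times if cutoff < t <= now]
    return "daily_digest" if len(recent) > FLAP_LIMIT else "full_alert"


now = datetime(2026, 5, 28, 3, 30, tzinfo=timezone.utc)
five_events = [now - timedelta(hours=h) for h in range(1, 6)]
six_events = five_events + [now - timedelta(hours=6)]
```

Under these assumptions, five occurrences still produce a full alert, which matches the sample alert further down this page, while a sixth occurrence moves the metric to the digest.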

Does Watch work without any existing CloudWatch alarms?

Yes. Watch is proactive: it monitors raw CloudWatch metrics whether or not you have alarms configured. The ConvOps Audit product helps you figure out which alarms you should have — and can deploy them in one click.

Is my AWS account safe?

ConvOps connects with a read-only IAM role. The verification pipeline never writes to your account — it only reads CloudWatch metrics, CloudTrail events, and service health data. You can revoke the role in under 30 seconds from your AWS IAM console. SOC 2 Type II audit in progress.

ConvOps Watch · 24/7 AWS Debugging

The alert that didn't reach you.

Most alerts you get are noise — a metric blip, a self-healing spike, an AWS-side outage you can't fix. ConvOps Watch debugs every anomaly through 9 verification checks before deciding whether to wake you up. The ones that reach you matter. Most teams pair Watch with ConvOps Diagnose, which fires when CloudWatch alarms trigger.

Start free trial → · Run free audit first →

Included in all plans. $49/mo Growth tier. No per-host or per-metric pricing.

The problem with alerting

Your CloudWatch alarms cry wolf.

You set up alarms because something broke once. Now they fire for transient spikes, self-healing metrics, and AWS outages you can't control. You start silencing the noisy ones. The next real incident slips through. That's how outages happen.

LAMBDA · 02:47 UTC

The metric that recovered itself.

CPU spiked to 92% for 60 seconds, then dropped to 18%. The alarm fired anyway. You woke up. There was nothing to fix.

↳ check_1 would have caught this — metric self-healed before the alert sent

ECS · 14:04 UTC

The deploy that explained itself.

Latency doubled at 14:02. Your alarm fired at 14:04. You spent 20 minutes debugging before someone said "oh yeah, I deployed at 14:01."

↳ check_3 would have caught this — CloudTrail deploy event at 14:01

RDS · us-east-1 event

The AWS-side outage.

EBS degradation in us-east-1. Your RDS metrics went wild. Alarms fired. There was nothing you could do — AWS was already on it.

↳ check_2 would have caught this — AWS Health event active for this region

The 9-check verification pipeline

Every anomaly earns the alert.

All 9 checks run in parallel the moment an anomaly is detected. The two suppress checks decide whether to wake you up. The six context checks give Claude AI everything it needs to explain what happened. The one digest check routes chronic flapping metrics to a daily summary instead of your midnight alert stream.
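One way to picture "run in parallel, then group by outcome" is the sketch below. It uses Python's asyncio with stub checks standing in for the real AWS calls; `CheckResult`, `make_stub`, `run_checks`, and `summarise` are all hypothetical names, and the real pipeline may be structured quite differently.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    kind: str          # "suppress", "context", or "digest"
    triggered: bool
    note: str = ""


def make_stub(name: str, kind: str, triggered: bool, note: str = ""):
    """Build a fake check; a real one would call CloudWatch, CloudTrail, etc."""
    async def check(anomaly: dict) -> CheckResult:
        await asyncio.sleep(0)  # stand-in for network I/O
        return CheckResult(name, kind, triggered, note)
    return check


async def run_checks(checks, anomaly: dict) -> list[CheckResult]:
    # gather() schedules every check concurrently and keeps input order.
    return await asyncio.gather(*(check(anomaly) for check in checks))


def summarise(results: list[CheckResult]) -> dict:
    """Group results the way the three legend categories do."""
    return {
        "suppress": any(r.triggered for r in results if r.kind == "suppress"),
        "digest": any(r.triggered for r in results if r.kind == "digest"),
        "context": [r.note for r in results
                    if r.kind == "context" and r.triggered],
    }


checks = [
    make_stub("self_healed", "suppress", False),
    make_stub("recent_deploy", "context", True, "deploy at 14:01"),
    make_stub("flapping", "digest", False),
]
results = asyncio.run(run_checks(checks, {"metric": "TargetResponseTime"}))
summary = summarise(results)
```

In this toy run nothing suppresses the alert, nothing routes it to the digest, and one context note is collected for the explanation step.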

Alert suppressed

Context forwarded to AI

Routed to daily digest

Self-healed

Re-fetches the metric from CloudWatch right now. If it's back within 1.5σ of baseline, the spike is over.

Alert suppressed — nothing to fix

SUPPRESS

AWS Health

Checks AWS Service Health Dashboard for events affecting this region/service.

Alert suppressed — AWS is already aware

SUPPRESS

Recent deploy

Looks at CloudTrail for Lambda updates, ECS deploys, and RDS parameter changes in the last 120 minutes.

Context added — this may be post-deploy behaviour

CONTEXT

Service quota

Checks if any service quota (concurrency limits, throughput, etc.) is at or near its limit.

Context added — quota exhaustion suspected

CONTEXT

State change

Checks for resource state changes: auto-scaling events, instance restarts, and config updates in the last 60 minutes.

Context added — infrastructure change detected

CONTEXT

Security finding

Queries GuardDuty and Security Hub for active findings on this resource.

Context added — active security finding present

CONTEXT

Flapping

Counts how many times this exact metric has been anomalous in the last 24 hours. If >5 times: flapping.

Routed to daily digest — not a full alert

DIGEST

Certificate expiry

For ALB and API Gateway resources, checks if any TLS certificate expires within 30 days.

Context added — cert expiry approaching

CONTEXT

Inspector findings

Checks Amazon Inspector for active vulnerability findings on this resource.

Context added — Inspector finding active

CONTEXT

All checks run cross-account via a read-only IAM role. Nothing executes. Nothing is modified. The role can be revoked in under 30 seconds from your AWS IAM console.
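Two of the checks above, Recent deploy and Certificate expiry, come down to simple time-window comparisons. The sketch below assumes the relevant timestamps have already been read from CloudTrail and from the certificate; the function and constant names are hypothetical, and only the 120-minute and 30-day windows come from this page.

```python
from datetime import datetime, timedelta, timezone

DEPLOY_LOOKBACK = timedelta(minutes=120)
CERT_WARNING = timedelta(days=30)


def recent_deploys(event_times: list[datetime],
                   detected_at: datetime) -> list[datetime]:
    """Keep deploy events that fall inside the lookback window
    ending at the moment the anomaly was detected."""
    earliest = detected_at - DEPLOY_LOOKBACK
    return [t for t in event_times if earliest <= t <= detected_at]


def cert_expiring(not_after: datetime, now: datetime) -> bool:
    """True when the certificate expires within the warning window.
    An already-expired certificate also returns True."""
    return not_after - now <= CERT_WARNING


detected = datetime(2026, 5, 28, 14, 4, tzinfo=timezone.utc)
deploys = [
    datetime(2026, 5, 28, 14, 1, tzinfo=timezone.utc),  # 3 minutes earlier
    datetime(2026, 5, 28, 11, 0, tzinfo=timezone.utc),  # outside the window
]
matches = recent_deploys(deploys, detected)
```

Here only the 14:01 event would be attached as context, mirroring the ECS example earlier on this page.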

A real ConvOps Watch alert

This is what makes it through.

Every alert that reaches you has passed all 9 checks — meaning ConvOps already knows it's real, already knows it didn't self-heal, already knows it's not a deploy or an AWS outage.

By the time your phone buzzes, ConvOps has also read the logs, correlated VPC flow data, checked CloudTrail, and written you a numbered action list. This one was caught with no CloudWatch alarm configured — pure baseline detection.

1 min detection to alert

0 CloudWatch alarms needed

HIGH · ConvOps ML Alert

Messages stuck in egress queue — consumer appears to have stopped

SQS · arn:aws:sqs:eu-central-1:123456789012:prod-egress-queue

DETECTED

28 May 2026 03:30 UTC

DIAGNOSED

28 May 2026 03:31 UTC

ACCOUNT

123456789012

ENVIRONMENT

Production

No CloudWatch alarm was configured for this metric — ConvOps caught it through baseline analysis.

What we observed

ApproximateNumberOfMessagesNotVisible rose to 4 (z-score 10) against a baseline of 0, at 03:30 UTC.
ApproximateAgeOfOldestMessage hit 484 seconds (z-score 10) at the same timestamp — messages are aging without being processed.
A prior anomaly at 03:25 UTC showed 2 in-flight messages and oldest-message age of 181 seconds, confirming the issue is escalating.
VPC flow logs show 10 active packet-reject spikes, ranging from 100 to 2,671 rejected packets, all currently active.
An AssumeRole event for WorkflowIAMRole occurred at 03:14 UTC — 16 minutes before the anomaly was first detected.

What we checked before alerting

Metric persistence: still anomalous at re-check (value 3.8, baseline 0.0) — confirmed not a transient spike.
Deployments: no UpdateService, UpdateFunctionCode, or equivalent CloudTrail events in the last 2 hours — not deployment-related.
Service quotas: all checked quotas below 80% utilisation — quota exhaustion ruled out.
Infrastructure changes: no scaling events, task restarts, or config changes in the last 60 minutes.
Security tooling: GuardDuty and Security Hub are not enabled — automated threat detection unavailable.
Flapping history: 5 occurrences in 24 hours — this is not a chronic noisy metric; treated as a genuine new event.
Vulnerabilities: Amazon Inspector found no active findings for this resource.

What to check first

1. Check the consumer (Lambda or ECS service) processing this queue: in CloudWatch Metrics, inspect NumberOfMessagesSent, NumberOfMessagesDeleted, and ApproximateNumberOfMessagesVisible for prod-egress-queue over the window 03:00–03:40 UTC to confirm whether deletions stopped.
2. Review CloudWatch Logs for the consumer service between 03:10–03:35 UTC. Run a Logs Insights query: fields @timestamp, @message | filter @message like /ERROR|error|timeout|refused|reject/ | sort @timestamp desc | limit 50
3. Investigate the VPC packet-reject findings: identify which security group or NACL is dropping traffic and whether the consumer's outbound or inbound connections to the SQS endpoint are affected. Check the VPC flow logs for the consumer's ENI around 03:10–03:30 UTC.
4. Examine the AssumeRole event at 03:14 UTC for WorkflowIAMRole: verify in CloudTrail whether subsequent API calls from that role succeeded or returned AccessDenied errors, which could indicate a permissions failure blocking message processing.

Recommended action

Check the queue consumer's logs and CloudWatch metrics from 03:10–03:35 UTC for errors or stopped processing.

🔧 Prevent this anomaly permanently

Run this in CloudShell to create a CloudWatch alarm. Once in place, ConvOps routes future occurrences through the reactive pipeline automatically.

aws cloudwatch put-metric-alarm \
  --alarm-name "prod-egress-queue-inflight-high" \
  --metric-name ApproximateNumberOfMessagesNotVisible \
  --namespace AWS/SQS --statistic Average \
  --period 300 --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --datapoints-to-alarm 2 \
  --treat-missing-data notBreaching \
  --dimensions Name=QueueName,Value=prod-egress-queue

How anomalies are detected

Baseline-aware. Every metric, every hour.

A CloudWatch alarm compares a metric to a fixed number. ConvOps compares it to what that metric normally looks like at this exact time, on this day of the week — built from weeks of historical readings.

Tuesday 3 PM traffic is compared to previous Tuesday 3 PM readings, not a flat 24-hour average. 7 days × 24 hours = 168 time buckets per metric. When the current value is ≥ 3.0 standard deviations from that bucket's baseline, it becomes an anomaly candidate — and enters the 9-check pipeline.
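The bucketing and threshold logic described above can be sketched as follows. This is an illustration under stated assumptions: the helper names are invented, the population standard deviation is one of several reasonable choices, and deviations in either direction are treated as anomalous. The 168 buckets and the 3.0 and 4.0 thresholds are the figures given on this page.

```python
from datetime import datetime, timezone
from statistics import mean, pstdev

ANOMALY_Z = 3.0
HIGH_CONFIDENCE_Z = 4.0


def bucket_index(ts: datetime) -> int:
    """Map a timestamp to one of 168 buckets: weekday (Mon=0) * 24 + hour."""
    return ts.weekday() * 24 + ts.hour


def z_score(value: float, history: list[float]) -> float:
    """Standard deviations between value and the bucket's historical mean."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Assumed edge case: a flat baseline makes any deviation extreme.
        return 0.0 if value == mu else float("inf")
    return (value - mu) / sigma


def classify(value: float, history: list[float]) -> str:
    z = abs(z_score(value, history))
    if z >= HIGH_CONFIDENCE_Z:
        return "high_confidence"
    if z >= ANOMALY_Z:
        return "anomaly"
    return "normal"


# Hypothetical readings from previous Tuesdays at 3 PM UTC.
tuesday_3pm = datetime(2026, 5, 26, 15, 0, tzinfo=timezone.utc)
history = [100.0, 102.0, 98.0, 101.0, 99.0]
bucket = bucket_index(tuesday_3pm)
label = classify(105.0, history)
```

With this made-up history the mean is 100 and the standard deviation is about 1.41, so a reading of 105 lands between the two thresholds and would become an anomaly candidate.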

5 min

Collection cadence

z ≥ 3.0

Anomaly threshold

z ≥ 4.0

High-confidence threshold

1.5σ

Self-heal threshold

168

Time buckets per metric

10

AWS service types

The decision tree

What happens when an anomaly is detected.

Every candidate anomaly follows the same path. The 9 checks run first. Then three additional gates at dispatch — existing alarm coverage, AI confidence, and a 7-day dedup window — catch anything the verification pass missed.

ANOMALY DETECTED

check_1

Self-healed?

Metric back within 1.5σ of baseline

YES→SUPPRESSED

check_7

Flapping?

This metric fired more than 5 times in the last 24h

YES→DAILY DIGEST

Claude AI analysis

Context from checks 2–6, 8–9 forwarded

CloudWatch alarm exists?

Reactive pipeline already owns this metric

YES→SUPPRESSED

Confidence = low?

Insufficient evidence for a finding

YES→SUPPRESSED

Alerted in last 7 days?

Dedup window active for this resource + metric

YES→SUPPRESSED

✓ FULL ANOMALY ALERT sent

Email + Slack. Every gate passed.
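Read top to bottom, the tree above is an ordered series of early exits. The sketch below follows the gates exactly as drawn; the `Candidate` fields and the confidence labels are assumptions, and the AWS Health suppress check is left out because the tree does not show it.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    self_healed: bool
    flapping: bool
    alarm_exists: bool
    confidence: str            # "low", "medium", or "high" (labels assumed)
    alerted_within_7_days: bool


def dispatch(c: Candidate) -> str:
    """Walk the gates in the order shown in the decision tree."""
    if c.self_healed:
        return "SUPPRESSED"          # check_1
    if c.flapping:
        return "DAILY DIGEST"        # check_7
    # AI analysis would run here, using context from the remaining checks.
    if c.alarm_exists:
        return "SUPPRESSED"          # reactive pipeline owns this metric
    if c.confidence == "low":
        return "SUPPRESSED"          # insufficient evidence
    if c.alerted_within_7_days:
        return "SUPPRESSED"          # dedup window still active
    return "FULL ANOMALY ALERT"


sqs_example = Candidate(self_healed=False, flapping=False, alarm_exists=False,
                        confidence="high", alerted_within_7_days=False)
outcome = dispatch(sqs_example)
```

Because the exits are ordered, a flapping metric that has also self-healed is suppressed rather than sent to the digest in this sketch; whether the real system resolves that overlap the same way is not stated on this page.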

Service coverage

Ten service types. No agents.

ConvOps reads CloudWatch metrics via a cross-account read-only IAM role. Nothing is deployed inside your account — no agents, no sidecars, no exporters. One CloudFormation stack. One role. Done.

10 service types

34 metrics tracked

Service	Metrics tracked
AWS Lambda	Duration (p99), Errors, Throttles, ConcurrentExecutions
Amazon RDS	FreeableMemory, CPUUtilization, DatabaseConnections, ReadLatency, WriteLatency
Amazon DynamoDB	ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, ThrottledRequests, SuccessfulRequestLatency, SystemErrors
Amazon ECS	CPUUtilization, MemoryUtilization, RunningTaskCount
Application Load Balancer	TargetResponseTime, HTTPCode_Target_5XX_Count, UnHealthyHostCount
Network Load Balancer	UnHealthyHostCount, ActiveFlowCount
Amazon EC2	CPUUtilization, NetworkIn, NetworkOut, StatusCheckFailed
Amazon ElastiCache	CacheMisses, CurrConnections, CPUUtilization, FreeableMemory
Amazon SQS	ApproximateAgeOfOldestMessage, ApproximateNumberOfMessagesNotVisible, NumberOfMessagesSent
AWS Billing	EstimatedCharges (per service)

Pricing

Watch is included in every plan.

No per-metric pricing. No per-host pricing. Watch runs across all monitored accounts on your plan — Growth starts at $49/mo and covers up to 5 AWS accounts.

See full pricing →

Watch questions.

Answers to the questions we hear most about anomaly detection, the 9-check pipeline, and how Watch fits alongside your existing CloudWatch alarms.

Glossary

Terms used on this page.

Plain-English definitions for the detection and verification terms used throughout this page.

Recovery check: Re-fetches the metric from CloudWatch and tests whether it has returned to within 1.5 standard deviations of its baseline. If yes, the alert is suppressed — the spike resolved itself.
Z-score: A statistical measure of how many standard deviations a data point is from the mean. ConvOps Watch uses z ≥ 3 as the anomaly threshold and z ≥ 4 as the high-confidence threshold.
Time-bucketed baseline: A separate metric baseline computed for every day-of-week × hour-of-day combination, allowing detection to account for traffic patterns. Tuesday 3 PM traffic is compared to previous Tuesday 3 PM readings, not a flat 24-hour average. Each metric gets 168 time buckets (7 days × 24 hours).
Flap check: Counts how many times a metric has been anomalous in the last 24 hours. If more than 5, the metric is flapping and gets routed to a daily digest email instead of a real-time alert. You still see the pattern; you don't get paged for it repeatedly.
9-check pipeline: ConvOps Watch's pre-alert verification system. Every detected anomaly runs through 9 parallel checks — recovery, AWS Health, recent deploy, service quota, state change, security finding, flapping, certificate expiry, and Inspector findings — before any alert is sent.
Hygiene Score: ConvOps Audit's 0–100 score reflecting the quality of a CloudWatch alarm setup. Starts at 100. Points are deducted for noisy alarms, permanently suppressed alarms, missing critical alarms, unmonitored resources, and active security findings.
CloudTrail: AWS's audit log for API calls. ConvOps Watch's deploy check (check 3) reads CloudTrail to detect recent code or configuration changes that may explain the anomaly.

The alert that should have woken you up. It will.

Individual is free forever. Connect your AWS account in under 10 minutes. The 9-check pipeline starts running immediately — no alarms to configure, no dashboards to build.


Individual is free forever. Growth $49/mo. Cancel any time.