ConvOps Watch · 24/7 AWS Debugging
The alert that didn't reach you.
Most alerts you get are noise — a metric blip, a self-healing spike, an AWS-side outage you can't fix. ConvOps Watch debugs every anomaly through 9 verification checks before deciding whether to wake you up. The ones that reach you matter. Most teams pair Watch with ConvOps Diagnose, which fires when CloudWatch alarms trigger.
Included in all plans. $49/mo Growth tier. No per-host or per-metric pricing.
The problem with alerting
Your CloudWatch alarms cry wolf.
You set up alarms because something broke once. Now they fire for transient spikes, self-healing metrics, and AWS outages you can't control. You start silencing the noisy ones. The next real incident slips through. That's how outages happen.
The metric that recovered itself.
CPU spiked to 92% for 60 seconds, then dropped to 18%. The alarm fired anyway. You woke up. There was nothing to fix.
↳ check_1 would have caught this — metric self-healed before the alert sent
The deploy that explained itself.
Latency doubled at 14:02. Your alarm fired at 14:04. You spent 20 minutes debugging before someone said "oh yeah, I deployed at 14:01."
↳ check_3 would have caught this — CloudTrail deploy event at 14:01
The AWS-side outage.
EBS degradation in us-east-1. Your RDS metrics went wild. Alarms fired. There was nothing you could do — AWS was already on it.
↳ check_2 would have caught this — AWS Health event active for this region
The 9-check verification pipeline
Every anomaly earns the alert.
All 9 checks run in parallel the moment an anomaly is detected. The two suppress checks decide whether to wake you up. The six context checks give Claude AI everything it needs to explain what happened. The single digest check routes chronically flapping metrics to a daily summary instead of your midnight alert stream.
1. Self-healed. Re-fetches the metric from CloudWatch right now. If it's back within 1.5σ of baseline, the spike is over. Result: alert suppressed, nothing to fix.
2. AWS Health. Checks the AWS Service Health Dashboard for events affecting this region or service. Result: alert suppressed, AWS is already aware.
3. Recent deploy. Looks in CloudTrail for Lambda updates, ECS deploys, and RDS parameter changes in the last 120 minutes. Result: context added, this may be post-deploy behaviour.
4. Service quota. Checks whether any service quota (concurrency limits, throughput, etc.) is at or near its limit. Result: context added, quota exhaustion suspected.
5. State change. Checks for resource state changes (auto-scaling events, instance restarts, config updates) in the last 60 minutes. Result: context added, infrastructure change detected.
6. Security finding. Queries GuardDuty and Security Hub for active findings on this resource. Result: context added, active security finding present.
7. Flapping. Counts how many times this exact metric has been anomalous in the last 24 hours. More than 5 times counts as flapping. Result: routed to daily digest, not a full alert.
8. Certificate expiry. For ALB and API Gateway resources, checks whether any TLS certificate expires within 30 days. Result: context added, cert expiry approaching.
9. Inspector findings. Checks AWS Inspector for active vulnerability findings on this resource. Result: context added, Inspector finding active.
All checks run cross-account via a read-only IAM role. Nothing executes. Nothing is modified. The role can be revoked in under 30 seconds from your AWS IAM console.
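To make the check mechanics concrete, here is a minimal Python sketch of three of the checks (self-healed, flapping, certificate expiry) run in parallel and grouped by role. It is an illustration under stated assumptions, not ConvOps's implementation: the names `CheckResult`, `run_checks`, and `fired_by_role` are hypothetical, the readings are made up, and the real checks call AWS APIs rather than taking numbers as arguments.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from datetime import date


@dataclass
class CheckResult:
    name: str
    role: str    # "suppress", "context", or "digest"
    fired: bool  # True when the check found what it looks for


def self_healed(current: float, mean: float, sigma: float, limit: float = 1.5) -> bool:
    """Self-healed check: the re-fetched value is within `limit` sigma of baseline."""
    return abs(current - mean) <= limit * sigma


def is_flapping(anomalies_last_24h: int, limit: int = 5) -> bool:
    """Flapping check: more than `limit` anomalies in the last 24 hours."""
    return anomalies_last_24h > limit


def cert_expiring(expires: date, today: date, window_days: int = 30) -> bool:
    """Certificate check: expiry is at most `window_days` days away (or already past)."""
    return (expires - today).days <= window_days


def run_checks(checks):
    """Run zero-argument check callables concurrently; keep submission order."""
    with ThreadPoolExecutor(max_workers=max(1, len(checks))) as pool:
        futures = [pool.submit(check) for check in checks]
        return [future.result() for future in futures]


def fired_by_role(results):
    """Group the names of fired checks by role."""
    grouped = {"suppress": [], "context": [], "digest": []}
    for result in results:
        if result.fired:
            grouped[result.role].append(result.name)
    return grouped


# Hypothetical inputs: three of the nine checks, with invented readings.
checks = [
    lambda: CheckResult("self_healed", "suppress", self_healed(3.8, 0.0, 0.5)),
    lambda: CheckResult("flapping", "digest", is_flapping(5)),
    lambda: CheckResult("cert_expiry", "context",
                        cert_expiring(date(2026, 6, 20), date(2026, 5, 28))),
]
outcome = fired_by_role(run_checks(checks))
print(outcome)
```

With these inputs only the certificate check fires: the metric is still outside 1.5σ, and 5 anomalies in 24 hours is not more than 5, so nothing is suppressed or sent to the digest.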
A real ConvOps Watch alert
This is what makes it through.
Every alert that reaches you has passed all 9 checks — meaning ConvOps already knows it's real, already knows it didn't self-heal, already knows it's not a deploy or an AWS outage.
By the time your phone buzzes, ConvOps has also read the logs, correlated VPC flow data, checked CloudTrail, and written you a numbered action list. This one was caught with no CloudWatch alarm configured — pure baseline detection.
Messages stuck in egress queue — consumer appears to have stopped
SQS · arn:aws:sqs:eu-central-1:123456789012:prod-egress-queue
DETECTED
28 May 2026 03:30 UTC
DIAGNOSED
28 May 2026 03:31 UTC
ACCOUNT
123456789012
ENVIRONMENT
Production
No CloudWatch alarm was configured for this metric — ConvOps caught it through baseline analysis.
What we observed
- ApproximateNumberOfMessagesNotVisible rose to 4 (z-score 10) against a baseline of 0, at 03:30 UTC.
- ApproximateAgeOfOldestMessage hit 484 seconds (z-score 10) at the same timestamp — messages are aging without being processed.
- A prior anomaly at 03:25 UTC showed 2 in-flight messages and oldest-message age of 181 seconds, confirming the issue is escalating.
- VPC flow logs show 10 active packet-reject spikes, ranging from 100 to 2,671 rejected packets, all currently active.
- An AssumeRole event for WorkflowIAMRole occurred at 03:14 UTC — 16 minutes before the anomaly was first detected.
What we checked before alerting
- Metric persistence: still anomalous at re-check (value 3.8, baseline 0.0) — confirmed not a transient spike.
- Deployments: no UpdateService, UpdateFunctionCode, or equivalent CloudTrail events in the last 2 hours — not deployment-related.
- Service quotas: all checked quotas below 80% utilisation — quota exhaustion ruled out.
- Infrastructure changes: no scaling events, task restarts, or config changes in the last 60 minutes.
- Security tooling: GuardDuty and Security Hub are not enabled — automated threat detection unavailable.
- Flapping history: 5 occurrences in 24 hours — this is not a chronic noisy metric; treated as a genuine new event.
- Vulnerabilities: AWS Inspector found no active findings for this resource.
What to check first
1. Check the consumer (Lambda or ECS service) processing this queue: in CloudWatch Metrics, inspect NumberOfMessagesSent, NumberOfMessagesDeleted, and ApproximateNumberOfMessagesVisible for prod-egress-queue over the window 03:00–03:40 UTC to confirm whether deletions stopped.
2. Review CloudWatch Logs for the consumer service between 03:10–03:35 UTC. Run a Logs Insights query: fields @timestamp, @message | filter @message like /ERROR|error|timeout|refused|reject/ | sort @timestamp desc | limit 50
3. Investigate the VPC packet-reject findings: identify which security group or NACL is dropping traffic and whether the consumer's outbound or inbound connections to the SQS endpoint are affected. Check the VPC flow logs for the consumer's ENI around 03:10–03:30 UTC.
4. Examine the AssumeRole event at 03:14 UTC for WorkflowIAMRole: verify in CloudTrail whether subsequent API calls from that role succeeded or returned AccessDenied errors, which could indicate a permissions failure blocking message processing.
Recommended action
Check the queue consumer's logs and CloudWatch metrics from 03:10–03:35 UTC for errors or stopped processing.
🔧 Prevent this anomaly permanently
Run this in CloudShell to create a CloudWatch alarm. Once in place, ConvOps routes future occurrences through the reactive pipeline automatically.
aws cloudwatch put-metric-alarm \
  --alarm-name "prod-egress-queue-inflight-high" \
  --metric-name ApproximateNumberOfMessagesNotVisible \
  --namespace AWS/SQS --statistic Average \
  --period 300 --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --datapoints-to-alarm 2 \
  --treat-missing-data notBreaching \
  --dimensions Name=QueueName,Value=prod-egress-queue
How anomalies are detected
Baseline-aware. Every metric, every hour.
A CloudWatch alarm compares a metric to a fixed number. ConvOps compares it to what that metric normally looks like at this exact time, on this day of the week — built from weeks of historical readings.
Tuesday 3 PM traffic is compared to previous Tuesday 3 PM readings, not a flat 24-hour average. 7 days × 24 hours = 168 time buckets per metric. When the current value is ≥ 3.0 standard deviations from that bucket's baseline, it becomes an anomaly candidate — and enters the 9-check pipeline.
- Collection cadence: 5 min
- Anomaly threshold: z ≥ 3.0
- High-confidence threshold: z ≥ 4.0
- Self-heal threshold: 1.5σ
- Time buckets per metric: 168
- AWS service types: 10
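The bucketing and thresholds described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the production detector: the function names are hypothetical, the history values are invented, population standard deviation is an arbitrary choice, and returning "unknown" when a bucket has no history or zero variance is a simplification this page does not describe.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

ANOMALY_Z = 3.0          # anomaly threshold quoted on this page
HIGH_CONFIDENCE_Z = 4.0  # high-confidence threshold quoted on this page


def bucket_index(ts: datetime) -> int:
    """Map a timestamp to one of 168 buckets (7 days x 24 hours)."""
    return ts.weekday() * 24 + ts.hour


def build_baseline(readings):
    """Return {bucket: (mean, std)} from (timestamp, value) pairs."""
    grouped = defaultdict(list)
    for ts, value in readings:
        grouped[bucket_index(ts)].append(value)
    return {bucket: (mean(vals), pstdev(vals)) for bucket, vals in grouped.items()}


def z_score(ts: datetime, value: float, baseline):
    """Return the z-score against the matching bucket, or None if undefined."""
    stats = baseline.get(bucket_index(ts))
    if stats is None or stats[1] == 0:
        return None  # assumption: no history or zero variance means no score
    mu, sigma = stats
    return (value - mu) / sigma


def classify(ts: datetime, value: float, baseline) -> str:
    """Label a reading 'unknown', 'normal', 'anomaly', or 'high-confidence'."""
    z = z_score(ts, value, baseline)
    if z is None:
        return "unknown"
    if abs(z) >= HIGH_CONFIDENCE_Z:
        return "high-confidence"
    if abs(z) >= ANOMALY_Z:
        return "anomaly"
    return "normal"


# Made-up history: four earlier Tuesdays at 15:00, values invented for the demo.
history = [
    (datetime(2026, 4, 28, 15, 0), 100.0),
    (datetime(2026, 5, 5, 15, 0), 102.0),
    (datetime(2026, 5, 12, 15, 0), 98.0),
    (datetime(2026, 5, 19, 15, 0), 100.0),
]
baseline = build_baseline(history)
now = datetime(2026, 5, 26, 15, 0)  # the next Tuesday, same hour
for value in (101.0, 105.0, 106.0):
    print(value, classify(now, value, baseline))
```

Because each reading is only compared with its own day-and-hour bucket, a value that is routine on Tuesday afternoon can still be flagged at 3 AM on Sunday, which a single flat threshold cannot express.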
The decision tree
What happens when an anomaly is detected.
Every candidate anomaly follows the same path. The 9 checks run first. Then three additional gates at dispatch — existing alarm coverage, AI confidence, and a 7-day dedup window — catch anything the verification pass missed.
ANOMALY DETECTED
1. Self-healed? Metric back within 1.5σ of baseline.
2. Flapping? This metric was anomalous more than 5 times in the last 24h.
3. Claude AI analysis. Context from checks 2–6, 8–9 forwarded.
4. CloudWatch alarm exists? Reactive pipeline already owns this metric.
5. Confidence = low? Insufficient evidence for a finding.
6. Alerted in last 7 days? Dedup window active for this resource + metric.
✓ FULL ANOMALY ALERT sent. Email + Slack. Every gate passed.
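Read as code, the tree above is a chain of early exits. The following Python sketch follows the gates in the order drawn. The `Anomaly` fields and the outcome labels are hypothetical names chosen for illustration, the confidence value is assumed to be a simple string, and the AWS Health suppress check is left out because it does not appear in the tree as drawn.

```python
from dataclasses import dataclass


@dataclass
class Anomaly:
    self_healed: bool             # metric back within 1.5 sigma of baseline
    flapping: bool                # anomalous more than 5 times in the last 24h
    alarm_exists: bool            # a CloudWatch alarm already covers this metric
    confidence: str               # hypothetical labels: "low", "medium", "high"
    alerted_in_last_7_days: bool  # same resource + metric already alerted


def route(anomaly: Anomaly) -> str:
    """Walk the gates in order and return the first outcome that applies."""
    if anomaly.self_healed:
        return "suppressed"
    if anomaly.flapping:
        return "daily_digest"
    # The AI analysis step sits here in the tree; its result feeds `confidence`.
    if anomaly.alarm_exists:
        return "reactive_pipeline"
    if anomaly.confidence == "low":
        return "no_alert_low_confidence"
    if anomaly.alerted_in_last_7_days:
        return "no_alert_deduplicated"
    return "full_alert"


# Hypothetical anomaly resembling the SQS example: no alarm, not flapping.
sqs_anomaly = Anomaly(
    self_healed=False,
    flapping=False,
    alarm_exists=False,
    confidence="high",
    alerted_in_last_7_days=False,
)
print(route(sqs_anomaly))
```

Ordering matters in this sketch: a metric that is both flapping and already covered by an alarm goes to the digest, because the flapping gate is evaluated first.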
Service coverage
Ten service types. No agents.
ConvOps reads CloudWatch metrics via a cross-account read-only IAM role. Nothing is deployed inside your account — no agents, no sidecars, no exporters. One CloudFormation stack. One role. Done.
| Service | Metrics tracked |
|---|---|
| AWS Lambda | Duration (p99), Errors, Throttles, ConcurrentExecutions |
| Amazon RDS | FreeableMemory, CPUUtilization, DatabaseConnections, ReadLatency, WriteLatency |
| Amazon DynamoDB | ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, ThrottledRequests, SuccessfulRequestLatency, SystemErrors |
| Amazon ECS | CPUUtilization, MemoryUtilization, RunningTaskCount |
| Application Load Balancer | TargetResponseTime, HTTPCode_Target_5XX_Count, UnHealthyHostCount |
| Network Load Balancer | UnHealthyHostCount, ActiveFlowCount |
| Amazon EC2 | CPUUtilization, NetworkIn, NetworkOut, StatusCheckFailed |
| Amazon ElastiCache | CacheMisses, CurrConnections, CPUUtilization, FreeableMemory |
| Amazon SQS | ApproximateAgeOfOldestMessage, ApproximateNumberOfMessagesNotVisible, NumberOfMessagesSent |
| AWS Billing | EstimatedCharges (per service) |
Pricing
Watch is included in every plan.
No per-metric pricing. No per-host pricing. Watch runs across all monitored accounts on your plan — Growth starts at $49/mo and covers up to 5 AWS accounts.
Glossary
Terms used on this page.
Plain-English definitions for the detection and verification terms used throughout this page.
- Recovery check: Re-fetches the metric from CloudWatch and tests whether it has returned to within 1.5 standard deviations of its baseline. If it has, the alert is suppressed because the spike resolved itself.
- Z-score: A statistical measure of how many standard deviations a data point is from the mean. ConvOps Watch uses z ≥ 3 as the anomaly threshold and z ≥ 4 as the high-confidence threshold.
- Time-bucketed baseline: A separate metric baseline computed for every day-of-week × hour-of-day combination, allowing detection to account for traffic patterns. Tuesday 2 PM traffic is compared to previous Tuesday 2 PM readings, not a flat 24-hour average. Each metric gets 168 time buckets (7 days × 24 hours).
- Flap check: Counts how many times a metric has been anomalous in the last 24 hours. If more than 5, the metric is flapping and gets routed to a daily digest email instead of a real-time alert. You still see the pattern; you don't get paged for it repeatedly.
- 9-check pipeline: ConvOps Watch's pre-alert verification system. Every detected anomaly runs through 9 parallel checks (recovery, AWS Health, recent deploy, service quota, state change, security finding, flapping, certificate expiry, and Inspector findings) before any alert is sent.
- Hygiene Score: ConvOps Audit's 0–100 score reflecting the quality of a CloudWatch alarm setup. Starts at 100. Points are deducted for noisy alarms, permanently suppressed alarms, missing critical alarms, unmonitored resources, and active security findings.
- CloudTrail: AWS's audit log for API calls. ConvOps Watch's deploy check (check 3) reads CloudTrail to detect recent code or configuration changes that may explain the anomaly.
The alert that should have woken you up. It will.
Individual is free forever. Connect your AWS account in under 10 minutes. The 9-check pipeline starts running immediately — no alarms to configure, no dashboards to build.
Growth is $49/mo. Cancel any time.