ConvOps Watch · 24/7 AWS Debugging

The alert that didn't reach you.

Most alerts you get are noise — a metric blip, a self-healing spike, an AWS-side outage you can't fix. ConvOps Watch debugs every anomaly through 9 verification checks before deciding whether to wake you up. The ones that reach you matter. Most teams pair Watch with ConvOps Diagnose, which fires when CloudWatch alarms trigger.

Included in all plans. $49/mo Growth tier. No per-host or per-metric pricing.

The problem with alerting

Your CloudWatch alarms cry wolf.

You set up alarms because something broke once. Now they fire for transient spikes, self-healing metrics, and AWS outages you can't control. You start silencing the noisy ones. The next real incident slips through. That's how outages happen.

LAMBDA · 02:47 UTC

The metric that recovered itself.

CPU spiked to 92% for 60 seconds, then dropped to 18%. The alarm fired anyway. You woke up. There was nothing to fix.

check_1 would have caught this: the metric self-healed before the alert was sent

ECS · 14:04 UTC

The deploy that explained itself.

Latency doubled at 14:02. Your alarm fired at 14:04. You spent 20 minutes debugging before someone said "oh yeah, I deployed at 14:01."

check_3 would have caught this — CloudTrail deploy event at 14:01

RDS · us-east-1 event

The AWS-side outage.

EBS degradation in us-east-1. Your RDS metrics went wild. Alarms fired. There was nothing you could do — AWS was already on it.

check_2 would have caught this — AWS Health event active for this region

The 9-check verification pipeline

Every anomaly earns the alert.

All 9 checks run in parallel the moment an anomaly is detected. The two suppress checks decide whether to wake you up. The six context checks give Claude AI everything it needs to explain what happened. The one digest check routes chronic flapping metrics to a daily summary instead of your midnight alert stream.
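As a rough illustration of the fan-out described above, here is a minimal Python sketch that runs nine checks concurrently and groups the ones that triggered by kind. The check names, the `CheckResult` type, and the `probes` mapping are hypothetical stand-ins made up for this example; the real checks call AWS APIs and are not shown on this page.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CheckResult:
    name: str
    kind: str        # "suppress", "context", or "digest"
    triggered: bool


# (name, kind) for the nine checks, in the order they are numbered on this page.
# The identifiers are illustrative, not ConvOps internals.
CHECKS = [
    ("self_healed", "suppress"),
    ("aws_health", "suppress"),
    ("recent_deploy", "context"),
    ("service_quota", "context"),
    ("state_change", "context"),
    ("security_finding", "context"),
    ("flapping", "digest"),
    ("certificate_expiry", "context"),
    ("inspector_findings", "context"),
]


def run_checks(probes: dict[str, Callable[[], bool]]) -> list[CheckResult]:
    """Submit every probe to a thread pool, then collect results in check order."""
    with ThreadPoolExecutor(max_workers=len(CHECKS)) as pool:
        futures = [(name, kind, pool.submit(probes[name])) for name, kind in CHECKS]
        return [CheckResult(name, kind, fut.result()) for name, kind, fut in futures]


def group_triggered(results: list[CheckResult]) -> dict[str, list[str]]:
    """Group the names of triggered checks under their kind."""
    grouped: dict[str, list[str]] = {"suppress": [], "context": [], "digest": []}
    for result in results:
        if result.triggered:
            grouped[result.kind].append(result.name)
    return grouped
```

A thread pool suits this shape of work because each check mostly waits on a network call; the grouping step is what separates suppress, context, and digest outcomes.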

SUPPRESS: alert suppressed
CONTEXT: context forwarded to AI
DIGEST: routed to daily digest

01 · Self-healed · SUPPRESS

Re-fetches the metric from CloudWatch right now. If it's back within 1.5σ of baseline, the spike is over.

Alert suppressed: nothing to fix

02 · AWS Health · SUPPRESS

Checks the AWS Health Dashboard for events affecting this region and service.

Alert suppressed: AWS is already aware

03 · Recent deploy · CONTEXT

Looks at CloudTrail for Lambda updates, ECS deploys, and RDS parameter changes in the last 120 minutes.

Context added: this may be post-deploy behaviour

04 · Service quota · CONTEXT

Checks if any service quota (concurrency limits, throughput, etc.) is at or near its limit.

Context added: quota exhaustion suspected

05 · State change · CONTEXT

Checks for resource state changes in the last 60 minutes: auto-scaling events, instance restarts, config updates.

Context added: infrastructure change detected

06 · Security finding · CONTEXT

Queries GuardDuty and Security Hub for active findings on this resource.

Context added: active security finding present

07 · Flapping · DIGEST

Counts how many times this exact metric has been anomalous in the last 24 hours. More than 5 times counts as flapping.

Routed to daily digest: not a full alert

08 · Certificate expiry · CONTEXT

For ALB and API Gateway resources, checks if any TLS certificate expires within 30 days.

Context added: cert expiry approaching

09 · Inspector findings · CONTEXT

Checks Amazon Inspector for active vulnerability findings on this resource.

Context added: Inspector finding active

All checks run cross-account via a read-only IAM role. Nothing executes. Nothing is modified. The role can be revoked in under 30 seconds from your AWS IAM console.
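Three of the checks above come down to simple threshold tests, and a small Python sketch makes the numbers concrete. It uses only the thresholds stated on this page (1.5σ, more than 5 anomalies in 24 hours, 30 days to expiry). The function names, the inclusive boundaries, and the zero-deviation handling are assumptions for illustration, not the ConvOps implementation.

```python
from datetime import datetime, timedelta, timezone

SELF_HEAL_SIGMA = 1.5                    # check 1: back within 1.5σ of baseline
FLAP_LIMIT = 5                           # check 7: more than 5 anomalies...
FLAP_WINDOW = timedelta(hours=24)        # ...in the last 24 hours
CERT_WARN_WINDOW = timedelta(days=30)    # check 8: expires within 30 days


def self_healed(current: float, baseline_mean: float, baseline_std: float) -> bool:
    """True if the re-fetched value is back within 1.5σ of the baseline mean."""
    if baseline_std == 0:
        # Assumption: with a perfectly flat baseline, only an exact match counts.
        return current == baseline_mean
    return abs(current - baseline_mean) <= SELF_HEAL_SIGMA * baseline_std


def is_flapping(anomaly_times: list[datetime], now: datetime) -> bool:
    """True if the metric was anomalous more than FLAP_LIMIT times in the window."""
    recent = [t for t in anomaly_times if now - FLAP_WINDOW <= t <= now]
    return len(recent) > FLAP_LIMIT


def cert_expiring(not_after: datetime, now: datetime) -> bool:
    """True if the certificate expires within the warning window (or already has)."""
    return not_after - now <= CERT_WARN_WINDOW
```

With these definitions, exactly 5 anomalies in 24 hours does not count as flapping; the sixth one does.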

A real ConvOps Watch alert

This is what makes it through.

Every alert that reaches you has passed all 9 checks — meaning ConvOps already knows it's real, already knows it didn't self-heal, already knows it's not a deploy or an AWS outage.

By the time your phone buzzes, ConvOps has also read the logs, correlated VPC flow data, checked CloudTrail, and written you a numbered action list. This one was caught with no CloudWatch alarm configured — pure baseline detection.

1 min detection to alert
0 CloudWatch alarms needed
HIGH · ConvOps ML Alert

Messages stuck in egress queue — consumer appears to have stopped

SQS · arn:aws:sqs:eu-central-1:123456789012:prod-egress-queue

DETECTED

28 May 2026 03:30 UTC

DIAGNOSED

28 May 2026 03:31 UTC

ACCOUNT

123456789012

ENVIRONMENT

Production

No CloudWatch alarm was configured for this metric — ConvOps caught it through baseline analysis.

What we observed

  • ApproximateNumberOfMessagesNotVisible rose to 4 (z-score 10) against a baseline of 0, at 03:30 UTC.
  • ApproximateAgeOfOldestMessage hit 484 seconds (z-score 10) at the same timestamp — messages are aging without being processed.
  • A prior anomaly at 03:25 UTC showed 2 in-flight messages and oldest-message age of 181 seconds, confirming the issue is escalating.
  • VPC flow logs show 10 active packet-reject spikes, ranging from 100 to 2,671 rejected packets, all currently active.
  • An AssumeRole event for WorkflowIAMRole occurred at 03:14 UTC, 11 minutes before the prior anomaly at 03:25 UTC and 16 minutes before the 03:30 UTC detection.

What we checked before alerting

  • Metric persistence: still anomalous at re-check (value 3.8, baseline 0.0) — confirmed not a transient spike.
  • Deployments: no UpdateService, UpdateFunctionCode, or equivalent CloudTrail events in the last 2 hours — not deployment-related.
  • Service quotas: all checked quotas below 80% utilisation — quota exhaustion ruled out.
  • Infrastructure changes: no scaling events, task restarts, or config changes in the last 60 minutes.
  • Security tooling: GuardDuty and Security Hub are not enabled — automated threat detection unavailable.
  • Flapping history: 5 occurrences in 24 hours — this is not a chronic noisy metric; treated as a genuine new event.
  • Vulnerabilities: Amazon Inspector found no active findings for this resource.

What to check first

  1. Check the consumer (Lambda or ECS service) processing this queue: in CloudWatch Metrics, inspect NumberOfMessagesSent, NumberOfMessagesDeleted, and ApproximateNumberOfMessagesVisible for prod-egress-queue over the window 03:00–03:40 UTC to confirm whether deletions stopped.
  2. Review CloudWatch Logs for the consumer service between 03:10–03:35 UTC. Run a Logs Insights query: fields @timestamp, @message | filter @message like /ERROR|error|timeout|refused|reject/ | sort @timestamp desc | limit 50
  3. Investigate the VPC packet-reject findings: identify which security group or NACL is dropping traffic and whether the consumer's outbound or inbound connections to the SQS endpoint are affected. Check the VPC flow logs for the consumer's ENI around 03:10–03:30 UTC.
  4. Examine the AssumeRole event at 03:14 UTC for WorkflowIAMRole: verify in CloudTrail whether subsequent API calls from that role succeeded or returned AccessDenied errors, which could indicate a permissions failure blocking message processing.

Recommended action

Check the queue consumer's logs and CloudWatch metrics from 03:10–03:35 UTC for errors or stopped processing.

🔧 Prevent this anomaly permanently

Run this in CloudShell to create a CloudWatch alarm. Once in place, ConvOps routes future occurrences through the reactive pipeline automatically.

aws cloudwatch put-metric-alarm \
  --alarm-name "prod-egress-queue-inflight-high" \
  --metric-name ApproximateNumberOfMessagesNotVisible \
  --namespace AWS/SQS --statistic Average \
  --period 300 --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --datapoints-to-alarm 2 \
  --treat-missing-data notBreaching \
  --dimensions Name=QueueName,Value=prod-egress-queue

How anomalies are detected

Baseline-aware. Every metric, every hour.

A CloudWatch alarm compares a metric to a fixed number. ConvOps compares it to what that metric normally looks like at this exact time, on this day of the week — built from weeks of historical readings.

Tuesday 3 PM traffic is compared to previous Tuesday 3 PM readings, not a flat 24-hour average. 7 days × 24 hours = 168 time buckets per metric. When the current value is ≥ 3.0 standard deviations from that bucket's baseline, it becomes an anomaly candidate — and enters the 9-check pipeline.
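The bucketing described above can be sketched in a few lines of Python. This is a minimal illustration, assuming readings arrive as (timestamp, value) pairs and that the baseline is the mean and population standard deviation of each bucket. The function names and the zero-deviation handling are hypothetical choices for this example, not ConvOps internals.

```python
from __future__ import annotations

from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean, pstdev

ANOMALY_Z = 3.0  # anomaly threshold stated on this page

Bucket = tuple[int, int]  # (day of week 0-6, hour of day 0-23): 7 x 24 = 168 buckets


def bucket_of(ts: datetime) -> Bucket:
    """Map a timestamp to its day-of-week x hour-of-day bucket."""
    return ts.weekday(), ts.hour


def build_baselines(
    readings: list[tuple[datetime, float]],
) -> dict[Bucket, tuple[float, float]]:
    """Compute (mean, standard deviation) per bucket from historical readings."""
    buckets: dict[Bucket, list[float]] = defaultdict(list)
    for ts, value in readings:
        buckets[bucket_of(ts)].append(value)
    return {b: (mean(vals), pstdev(vals)) for b, vals in buckets.items()}


def z_score(value: float, baseline: tuple[float, float]) -> float:
    """Absolute distance from the bucket mean, in standard deviations."""
    mu, sigma = baseline
    if sigma == 0:
        # Assumption: a flat baseline makes any deviation infinitely unusual.
        return 0.0 if value == mu else float("inf")
    return abs(value - mu) / sigma


def is_anomaly_candidate(
    ts: datetime, value: float, baselines: dict[Bucket, tuple[float, float]]
) -> bool:
    """True if the value is at least ANOMALY_Z deviations from its bucket baseline."""
    baseline = baselines.get(bucket_of(ts))
    if baseline is None:
        return False  # no history for this bucket yet
    return z_score(value, baseline) >= ANOMALY_Z
```

Because each reading is only ever compared with its own bucket, a busy Tuesday afternoon is judged against other Tuesday afternoons rather than against quiet overnight hours.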

Collection cadence: 5 min
Anomaly threshold: z ≥ 3.0
High-confidence threshold: z ≥ 4.0
Self-heal threshold: 1.5σ
Time buckets per metric: 168
AWS service types: 10

The decision tree

What happens when an anomaly is detected.

Every candidate anomaly follows the same path. The 9 checks run first. Then three additional gates at dispatch — existing alarm coverage, AI confidence, and a 7-day dedup window — catch anything the verification pass missed.

ANOMALY DETECTED

check_1

Self-healed?

Metric back within 1.5σ of baseline

YES → SUPPRESSED

check_7

Flapping?

This metric was anomalous more than 5 times in the last 24h

YES → DAILY DIGEST

Claude AI analysis

Context from checks 2–6, 8–9 forwarded

CloudWatch alarm exists?

Reactive pipeline already owns this metric

YES → SUPPRESSED

Confidence = low?

Insufficient evidence for a finding

YES → SUPPRESSED

Alerted in last 7 days?

Dedup window active for this resource + metric

YES → SUPPRESSED

✓ FULL ANOMALY ALERT sent

Email + Slack. Every gate passed.
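Read top to bottom, the tree above is a chain of early exits. The Python sketch below mirrors the gates in the order they are drawn. The `Candidate` fields, the confidence labels, and the returned strings are hypothetical names chosen for illustration, and the AI analysis step is represented only by the confidence value it would produce.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

DEDUP_WINDOW = timedelta(days=7)  # dedup window stated on this page


@dataclass
class Candidate:
    self_healed: bool                  # check_1 outcome
    flapping: bool                     # check_7 outcome
    alarm_exists: bool                 # a CloudWatch alarm already covers this metric
    confidence: str                    # assumed labels: "low", "medium", "high"
    last_alerted: Optional[datetime]   # last full alert for this resource + metric


def route(c: Candidate, now: datetime) -> str:
    """Walk the gates in the order shown in the tree and return the outcome."""
    if c.self_healed:
        return "suppressed"
    if c.flapping:
        return "daily_digest"
    # The AI analysis would run here, using the forwarded context,
    # and would set c.confidence.
    if c.alarm_exists:
        return "suppressed"
    if c.confidence == "low":
        return "suppressed"
    if c.last_alerted is not None and now - c.last_alerted < DEDUP_WINDOW:
        return "suppressed"
    return "full_alert"
```

Ordering the cheap boolean gates first means a self-healed or flapping metric never reaches the analysis step in this sketch.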

Service coverage

Ten service types. No agents.

ConvOps reads CloudWatch metrics via a cross-account read-only IAM role. Nothing is deployed inside your account — no agents, no sidecars, no exporters. One CloudFormation stack. One role. Done.

10 service types
34 metrics tracked

Service: Metrics tracked
AWS Lambda: Duration (p99), Errors, Throttles, ConcurrentExecutions
Amazon RDS: FreeableMemory, CPUUtilization, DatabaseConnections, ReadLatency, WriteLatency
Amazon DynamoDB: ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, ThrottledRequests, SuccessfulRequestLatency, SystemErrors
Amazon ECS: CPUUtilization, MemoryUtilization, RunningTaskCount
Application Load Balancer: TargetResponseTime, HTTPCode_Target_5XX_Count, UnHealthyHostCount
Network Load Balancer: UnHealthyHostCount, ActiveFlowCount
Amazon EC2: CPUUtilization, NetworkIn, NetworkOut, StatusCheckFailed
Amazon ElastiCache: CacheMisses, CurrConnections, CPUUtilization, FreeableMemory
Amazon SQS: ApproximateAgeOfOldestMessage, ApproximateNumberOfMessagesNotVisible, NumberOfMessagesSent
AWS Billing: EstimatedCharges (per service)

Pricing

Watch is included in every plan.

No per-metric pricing. No per-host pricing. Watch runs across all monitored accounts on your plan — Growth starts at $49/mo and covers up to 5 AWS accounts.

Watch questions.

Answers to the questions we hear most about anomaly detection, the 9-check pipeline, and how Watch fits alongside your existing CloudWatch alarms.

Glossary

Terms used on this page.

Plain-English definitions for the detection and verification terms used throughout this page.

Recovery check
Re-fetches the metric from CloudWatch and tests whether it has returned to within 1.5 standard deviations of its baseline. If yes, the alert is suppressed — the spike resolved itself.
Z-score
A statistical measure of how many standard deviations a data point is from the mean. ConvOps Watch uses z ≥ 3 as the anomaly threshold and z ≥ 4 as the high-confidence threshold.
Time-bucketed baseline
A separate metric baseline computed for every day-of-week × hour-of-day combination, allowing detection to account for traffic patterns. Tuesday 2 PM traffic is compared to previous Tuesday 2 PM readings — not a flat 24-hour average. Each metric gets 168 time buckets (7 days × 24 hours).
Flap check
Counts how many times a metric has been anomalous in the last 24 hours. If more than 5, the metric is flapping and gets routed to a daily digest email instead of a real-time alert. You still see the pattern; you don't get paged for it repeatedly.
9-check pipeline
ConvOps Watch's pre-alert verification system. Every detected anomaly runs through 9 parallel checks — recovery, AWS Health, recent deploy, service quota, state change, security finding, flapping, certificate expiry, and Inspector findings — before any alert is sent.
Hygiene Score
ConvOps Audit's 0–100 score reflecting the quality of a CloudWatch alarm setup. Starts at 100. Points are deducted for noisy alarms, permanently suppressed alarms, missing critical alarms, unmonitored resources, and active security findings.
CloudTrail
AWS's audit log for API calls. ConvOps Watch's deploy check (check 3) reads CloudTrail to detect recent code or configuration changes that may explain the anomaly.

The alert that should have woken you up. It will.

Individual is free forever. Connect your AWS account in under 10 minutes. The 9-check pipeline starts running immediately — no alarms to configure, no dashboards to build.

Individual is free forever. Growth $49/mo. Cancel any time.