CloudWatch Logs Insights queries: the practical library for ECS, Lambda, RDS, and EC2

June 3, 2026·10 min read

CloudWatch Logs Insights has fast query execution and deep log access, but the syntax is dense and the AWS docs bury the useful patterns in example blocks you have to modify yourself. This is the library that's actually organised by investigation scenario — what's going wrong, which service, and what you need to know right now.

All queries assume you have set the Logs Insights time range to the incident window: start 5 minutes before the alarm fired and end 10 minutes after it resolved. A tight window cuts noise, keeps result counts meaningful, and reduces the amount of data scanned.
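If you run queries through the API rather than the console, the padded window has to be passed as epoch seconds. A minimal sketch of that arithmetic, assuming you already know when the alarm fired and resolved (the function name and the sample timestamps are hypothetical):

```python
from datetime import datetime, timedelta, timezone


def incident_window(alarm_fired, resolved,
                    lead=timedelta(minutes=5), tail=timedelta(minutes=10)):
    """Return (start, end) in epoch seconds, padded around the incident."""
    start = alarm_fired - lead
    end = resolved + tail
    return int(start.timestamp()), int(end.timestamp())


# Hypothetical incident: alarm at 02:00 UTC, resolved at 02:30 UTC.
fired = datetime(2026, 6, 3, 2, 0, tzinfo=timezone.utc)
resolved = datetime(2026, 6, 3, 2, 30, tzinfo=timezone.utc)
start_ts, end_ts = incident_window(fired, resolved)
print(start_ts, end_ts)
```

Using timezone-aware datetimes avoids the common mistake of mixing local time with the UTC timestamps shown in alarm history.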

Lambda queries

Lambda writes to its own log group (`/aws/lambda/<function-name>`). Logs Insights automatically discovers `@type`, `@requestId`, `@duration`, `@maxMemoryUsed`, `@memorySize`, and `@initDuration` from the platform's `REPORT` lines, so you can use them directly without a `parse` statement.

Find all errors in a time window

fields @timestamp, @message, @requestId
| filter @message like /(?i)(error|exception|traceback)/
| sort @timestamp desc
| limit 50

Case-insensitive regex catches Python tracebacks, Node.js Error objects, and Java exceptions in the same query. The `@requestId` field links the error line to the full invocation in X-Ray.

Find timeouts

fields @timestamp, @requestId, @message
| filter @message like /Task timed out/
| sort @timestamp desc
| limit 20

Lambda writes a `Task timed out after X.XX seconds` line for every timeout, which is distinct from function errors. The timeout line itself does not carry `@duration`; that value sits on the invocation's `REPORT` line, so look it up by `@requestId`. If `@duration` is consistently close to your configured timeout (e.g. 29.8s with a 30s limit), the function is regularly hitting the wall.
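The same check can be sketched offline against exported log lines. This assumes the timeout message format quoted above; the sample line, the request ID, and the 95% threshold are illustrative choices, not AWS defaults.

```python
import re

# Matches the platform message "Task timed out after X.XX seconds".
TIMEOUT_RE = re.compile(r"Task timed out after (\d+(?:\.\d+)?) seconds")


def timeout_seconds(message):
    """Return the reported timeout in seconds, or None if the line is not a timeout."""
    match = TIMEOUT_RE.search(message)
    return float(match.group(1)) if match else None


def near_timeout(duration_ms, configured_timeout_s, threshold=0.95):
    """True when an invocation used at least `threshold` of its configured timeout."""
    return duration_ms >= configured_timeout_s * 1000 * threshold


# Hypothetical log line for illustration.
sample = "2026-06-03T02:01:07.123Z req-123 Task timed out after 30.03 seconds"
print(timeout_seconds(sample))
print(near_timeout(29800, 30))
```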

Identify cold starts

fields @timestamp, @requestId, @duration, @initDuration
| filter ispresent(@initDuration)
| sort @initDuration desc
| limit 20

`@initDuration` is only present on cold-start invocations. Sorting by descending init duration shows which invocations had the longest container init — useful for diagnosing provisioned concurrency gaps or large deployment packages.

Find invocations approaching memory limit

fields @timestamp, @requestId, @maxMemoryUsed, @memorySize
| filter @maxMemoryUsed / @memorySize > 0.9
| sort @maxMemoryUsed desc
| limit 20

Memory usage above 90% of the configured limit is an early out-of-memory warning. The function hasn't been killed yet, but a larger payload or heavier load is likely to push it over. Filter at 0.85 if you want earlier warning.

Duration percentiles across all invocations

stats count() as invocations,
  pct(@duration, 50) as p50,
  pct(@duration, 95) as p95,
  pct(@duration, 99) as p99,
  max(@duration) as maxDuration
by bin(5m)

This query gives you a latency distribution over time — useful for identifying whether a latency spike was sudden (one 5m bucket) or a gradual drift. A p99 spike with a stable p50 usually indicates an upstream dependency issue.
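To see why a tail spike leaves the median untouched, here is a small sketch using a nearest-rank percentile. The durations are made up, and the percentile method is an assumption for illustration; Logs Insights may compute `pct` differently.

```python
import math


def pct(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical durations in ms for two 5-minute buckets.
baseline = [100] * 98 + [120, 150]
spike = [100] * 98 + [4000, 8000]

for name, bucket in (("baseline", baseline), ("spike", spike)):
    print(name, pct(bucket, 50), pct(bucket, 99))
```

Only two slow invocations out of a hundred are enough to move p99 by an order of magnitude while p50 stays flat, which is why the two are worth plotting together.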

ECS queries

ECS task logs land in the log group configured in your task definition's `logConfiguration`. With `awslogs` driver, each container writes to a log stream named `<prefix>/<container-name>/<task-id>`. Point the query at the correct log group.

Find all application errors

fields @timestamp, @message, @logStream
| filter @message like /(?i)(error|exception|fatal|panic)/
| sort @timestamp desc
| limit 50

Detect OOM kills (exit code 137)

fields @timestamp, @message, @logStream
| filter @message like /exit code 137/
  or @message like /OutOfMemoryError/
  or @message like /Killed/
| sort @timestamp desc
| limit 20

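Exit code 137 is 128 + 9: the process received SIGKILL. On ECS the usual cause is the container exceeding its memory limit, but a container that ignores SIGTERM and is force-killed after the stop timeout also exits with 137. This is different from a graceful shutdown (exit code 0) or an application crash (exit code 1). If the task's stopped reason confirms an out-of-memory kill, the fix is more memory or a memory leak investigation.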
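The arithmetic behind these codes can be sketched with Python's standard `signal` module: codes above 128 conventionally mean "terminated by signal (code - 128)". The function name is hypothetical, and the sketch assumes a Unix-like system where these signals are defined.

```python
import signal


def describe_exit_code(code):
    """Translate a container exit code into a short human-readable cause."""
    if code == 0:
        return "clean exit"
    if code > 128:
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            # Not a signal number known to this platform.
            name = f"signal {code - 128}"
        return f"killed by {name}"
    return f"application error (exit {code})"


for code in (0, 1, 137, 143):
    print(code, describe_exit_code(code))
```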

Last log lines before a container stopped

fields @timestamp, @message
| filter @logStream like /<your-task-id>/
| sort @timestamp desc
| limit 100

Replace `<your-task-id>` with the stopped task ID from the ECS console. The last 100 lines before the container terminated usually contain the root cause. Combine with CloudTrail for the StopTask API call if you need to know who or what triggered it.

Error count by 5-minute window

filter @message like /(?i)error/
| stats count() as errorCount by bin(5m)
| sort bin(5m) asc

This gives you a timeline of error frequency — useful for correlating with a deployment, a traffic spike, or an upstream dependency failure. A sudden step-change in errorCount at a specific 5-minute bucket points to the cause.
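Conceptually, `bin(5m)` floors each timestamp to the start of its 5-minute window before counting. A rough Python equivalent, assuming events are `(epoch_seconds, message)` pairs (the sample events are invented):

```python
import re
from collections import Counter

ERROR_RE = re.compile(r"error", re.IGNORECASE)


def error_counts_by_bin(events, bin_seconds=300):
    """Count error lines per time bucket, returned in ascending bucket order."""
    counts = Counter()
    for ts, message in events:
        if ERROR_RE.search(message):
            # Floor the timestamp to the start of its bucket.
            counts[ts - ts % bin_seconds] += 1
    return sorted(counts.items())


base = 1_780_000_200  # hypothetical epoch second on a 5-minute boundary
events = [
    (base + 10, "GET /health ok"),
    (base + 20, "Error: upstream reset"),
    (base + 299, "error connecting to db"),
    (base + 300, "ERROR retry budget exhausted"),
    (base + 900, "request completed"),
]
print(error_counts_by_bin(events))
```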

RDS queries

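RDS logs require explicit enablement: set the logging parameters in the DB parameter group and turn on log exports so the logs are published to CloudWatch Logs. For PostgreSQL, set `log_min_duration_statement = 1000` to log statements that take 1 second or longer. For MySQL, enable the slow query log and, if you need it, the general log. The log group is typically `/aws/rds/instance/<db-identifier>/postgresql` or `/aws/rds/instance/<db-identifier>/slowquery`.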

Find slow queries over 1 second

fields @timestamp, @message
| parse @message "duration: * ms" as durationMs
| filter durationMs > 1000
| sort durationMs desc
| limit 30

When `log_min_duration_statement` is set, PostgreSQL writes `duration: 1234.567 ms` for every statement that runs at least that long. The `parse` statement extracts the value into a field you can filter and sort on. Queries over 5000ms (5 seconds) are often missing an index or doing a sequential scan on a large table, so check the plan with `EXPLAIN` before changing anything.
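The glob in `parse @message "duration: * ms"` behaves roughly like a capture group. A small sketch of the same extraction over exported log lines; the sample lines are invented and the exact line prefix depends on your `log_line_prefix` setting.

```python
import re

DURATION_RE = re.compile(r"duration: (\d+(?:\.\d+)?) ms")


def slow_statements(lines, threshold_ms=1000.0):
    """Return (duration_ms, line) pairs above the threshold, slowest first."""
    hits = []
    for line in lines:
        match = DURATION_RE.search(line)
        if match and float(match.group(1)) > threshold_ms:
            hits.append((float(match.group(1)), line))
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


# Hypothetical PostgreSQL log lines.
lines = [
    "2026-06-03 02:01:07 UTC:app@orders:LOG:  duration: 1234.567 ms  statement: SELECT 1",
    "2026-06-03 02:01:09 UTC:app@orders:LOG:  duration: 5210.004 ms  statement: UPDATE t SET x = 1",
    "2026-06-03 02:01:10 UTC:app@orders:LOG:  connection authorized: user=app",
]
for duration_ms, line in slow_statements(lines):
    print(duration_ms, line)
```

Converting the captured text to a float before comparing matters: compared as strings, "5210.004" and "999.9" would sort in the wrong order.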

Find connection errors

fields @timestamp, @message
| filter @message like /(?i)(connection refused|too many connections|remaining connection slots|pg_hba.conf)/
| sort @timestamp desc
| limit 20

`remaining connection slots are reserved for non-replication superuser connections` means you've hit `max_connections`. `pg_hba.conf` errors mean authentication is failing. Both are distinct from application-level query errors and need different fixes.

Detect lock waits and deadlocks

fields @timestamp, @message
| filter @message like /(?i)(deadlock|lock wait timeout|lock not available|waiting for .* lock)/
| sort @timestamp desc
| limit 20

Lock contention is a common cause of p99 latency spikes that don't show up in CPU or connection metrics. If you see deadlock messages during a deployment, it usually means two transactions are writing to the same rows in different order.

API Gateway queries

API Gateway access logs must be enabled manually in the Stage settings. Set the log format to JSON so every field is queryable. The log group name is set by you — typically `/aws/apigateway/<api-name>`.

Find 5xx errors

fields @timestamp, @message
| filter @message like /"status":5/
| sort @timestamp desc
| limit 50

This filter assumes a JSON access log format in which `status` is logged as an unquoted number; if your format quotes the value, match `/"status":"5/` instead. For a CLF-style format, use `filter @message like / 5[0-9][0-9] /` to match HTTP 5xx codes between whitespace.

Slowest endpoints by p99 latency

fields @timestamp, @message
| parse @message '"path":"*"' as path
| parse @message '"responseLatency":*,' as latencyMs
| stats pct(latencyMs, 99) as p99, count() as requests by path
| sort p99 desc
| limit 20

This identifies which routes are slow, not just that the API is slow overall. A single route at p99 = 8000ms while all others are under 200ms narrows the investigation to one Lambda function or one database query.

Error count by HTTP status code

fields @timestamp, @message
| parse @message '"status":*,' as statusCode
| filter statusCode >= 400
| stats count() as count by statusCode
| sort count desc

Splitting errors by status code separates client errors (4xx, usually bad input or auth failures) from server errors (5xx, your problem). A spike in 429s means clients are hitting your throttling or quota limits, or those limits are set too low. With a Lambda integration, a spike in 502s usually means the function returned a malformed response or failed outright.
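The grouping step can be sketched in a few lines of Python, assuming you already have the status codes as integers (the sample list is invented):

```python
from collections import Counter


def summarize_status_codes(codes):
    """Count error responses per status code and per class (4xx vs 5xx)."""
    errors = [code for code in codes if code >= 400]
    by_code = Counter(errors)
    by_class = Counter(f"{code // 100}xx" for code in errors)
    return by_code.most_common(), dict(by_class)


# Hypothetical status codes from one incident window.
codes = [200, 200, 404, 429, 429, 429, 502, 500]
by_code, by_class = summarize_status_codes(codes)
print(by_code[0], by_class)
```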

Logs Insights syntax patterns to know

| Pattern | Syntax | When to use |
| --- | --- | --- |
| Regex filter | `filter @message like /pattern/` | Case-sensitive regex match anywhere in the message; add `(?i)` for case-insensitive |
| Field extraction | `parse @message "prefix * suffix" as field` | Extracts a value between fixed strings into a queryable field |
| Presence check | `filter ispresent(@fieldName)` | Keeps only rows where a parsed or discovered field exists |
| Time bucketing | `by bin(5m)` | Groups results into time windows such as 1m, 5m, 15m, 1h |
| Percentiles | `pct(@duration, 99)` | P99 latency; works on any numeric field |
| Multi-group query | Select multiple log groups in the query console | Query /aws/lambda/fn and /aws/apigateway/api simultaneously |

Logs Insights scans compressed log data. Query cost scales with the volume of data scanned, not the number of results returned. Narrow your time window and log group selection before running broad regex filters on high-volume production log groups.
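A back-of-the-envelope estimate makes the effect of the time window concrete. The per-GB price below is an assumption for illustration only (check current CloudWatch pricing for your region), and the ingest rate is invented.

```python
def estimated_scan_cost(gb_ingested_per_hour, window_minutes, price_per_gb=0.005):
    """Rough query cost in USD: data in the time window multiplied by an assumed per-GB price."""
    scanned_gb = gb_ingested_per_hour * window_minutes / 60
    return scanned_gb * price_per_gb


# Hypothetical log group ingesting 120 GB per hour.
print(estimated_scan_cost(120, 45))       # 45-minute incident window
print(estimated_scan_cost(120, 24 * 60))  # full day
```

Under these assumptions the full-day query scans 32 times as much data as the incident window, and the cost grows by the same factor.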

Related reading

  • → How to find root cause in AWS CloudWatch alerts without an SRE team
  • → The 5 CloudWatch alarms most startups accidentally create that are just noise
  • → The Complete AWS CloudWatch Alarm Setup Guide

Frequently asked questions

What is the CloudWatch Logs Insights query limit?

A Logs Insights query returns at most 10,000 log events; the scan itself is not capped at that number. Use `limit` to control result set size. The query time range and log group selection control how much data is scanned, so narrower windows are faster and cheaper. Concurrent queries are capped by an account-level quota, so check the CloudWatch Logs quotas page for the current value.

How do I query multiple log groups at once in Logs Insights?

In the CloudWatch console, select multiple log groups from the log group selector before running the query — the query runs across all selected groups simultaneously. You can also use `SOURCE '/aws/lambda/fn1', '/aws/lambda/fn2'` syntax in some regions. Results include `@logGroup` and `@logStream` fields to identify which log group each line came from.
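Outside the console, the same multi-group query can be started through the API. A minimal sketch, assuming a boto3 CloudWatch Logs client is passed in: `start_query` and `get_query_results` are the real client methods, while the helper name, the polling defaults, and the example log group names are hypothetical.

```python
import time


def run_insights_query(logs_client, log_groups, query, start_ts, end_ts,
                       poll_seconds=1.0, max_polls=60):
    """Start a query across several log groups and poll until it finishes.

    `logs_client` is expected to behave like boto3.client("logs").
    Returns one dict per result row, mapping field name to value.
    """
    started = logs_client.start_query(
        logGroupNames=list(log_groups),
        startTime=start_ts,  # epoch seconds
        endTime=end_ts,      # epoch seconds
        queryString=query,
    )
    query_id = started["queryId"]
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not complete in time")


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# rows = run_insights_query(
#     boto3.client("logs"),
#     ["/aws/lambda/fn1", "/aws/lambda/fn2"],
#     "fields @timestamp, @logStream, @message | limit 10",
#     start_ts, end_ts,
# )
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real account.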

Why does my CloudWatch Logs Insights query return no results?

The three most common causes: (1) the time window doesn't overlap with the log events you expect, so check the time range selector; (2) the IAM identity running the query lacks permissions such as `logs:StartQuery` and `logs:GetQueryResults` on the log group; (3) the log group name is wrong. Log groups must be selected explicitly, and wildcards in the log group name are only supported with SOURCE syntax. Run a bare `fields @timestamp, @message | limit 10` with no filter to confirm log data exists.

How do I extract a field from a JSON log line in Logs Insights?

CloudWatch Logs Insights auto-parses JSON log events: if your log line is valid JSON, you can reference its fields directly without a `parse` statement. For `{"level":"error","duration":1234}`, you can write `filter level = "error"` and `stats pct(duration, 99)` without any parse. Nested fields are addressed with dot notation (for example `http.status`); fall back to `parse` with a glob or regex only when the JSON is embedded inside a longer, non-JSON line.
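As a rough mental model of field discovery, JSON keys become field names, with nested keys joined by dots. The sketch below imitates that flattening in Python; it illustrates the naming scheme only, not the service's actual implementation, and the sample event is invented.

```python
import json


def discover_fields(raw_event):
    """Flatten a JSON log event into dot-notation field names."""
    def walk(value, prefix, out):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{prefix}.{key}" if prefix else key, out)
        elif isinstance(value, list):
            for index, child in enumerate(value):
                walk(child, f"{prefix}.{index}" if prefix else str(index), out)
        else:
            out[prefix] = value
        return out

    return walk(json.loads(raw_event), "", {})


# Hypothetical structured log line.
event = '{"level":"error","duration":1234,"http":{"status":502,"path":"/orders"}}'
print(discover_fields(event))
```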

What's the difference between filter and parse in Logs Insights?

`filter` selects which log events to include — it's a WHERE clause that reduces the result set. `parse` extracts a substring from `@message` into a named field — it doesn't filter, it adds a column. Use `parse` first to create the field, then `filter` or `stats` to work with it. A `parse` that matches no events still returns all events unless you also `filter ispresent(extractedField)`.
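The difference is easy to see in a toy model: `parse` adds a column to every row, and `filter` removes rows. The helper names and sample rows below are hypothetical and only mimic the semantics described above.

```python
import re


def parse_step(rows, pattern, field):
    """Mimic `parse`: add a column to every row, never drop a row."""
    regex = re.compile(pattern)
    parsed = []
    for row in rows:
        match = regex.search(row["@message"])
        parsed.append({**row, field: match.group(1) if match else None})
    return parsed


def filter_step(rows, predicate):
    """Mimic `filter`: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


rows = [
    {"@message": "duration: 1500 ms"},
    {"@message": "connection authorized"},
]
parsed = parse_step(rows, r"duration: (\d+) ms", "durationMs")
kept = filter_step(parsed, lambda row: row["durationMs"] is not None)
print(len(parsed), len(kept))
```

The last `filter_step` call plays the role of `filter ispresent(durationMs)`: without it, the row that did not match the pattern stays in the result with an empty column.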

Nitesh

Founder, ConvOps

Published

June 2026

Updated

June 2026
