CloudWatch Logs Insights queries: the practical library for ECS, Lambda, RDS, and EC2
CloudWatch Logs Insights has fast query execution and deep log access, but the syntax is dense and the AWS docs bury the useful patterns in example blocks you have to modify yourself. This is the library that's actually organised by investigation scenario — what's going wrong, which service, and what you need to know right now.
Lambda queries
Lambda writes to its own log group (`/aws/lambda/<function-name>`). Logs Insights automatically discovers `@type`, `@duration`, `@maxMemoryUsed`, `@memorySize`, and `@initDuration` from each invocation's `REPORT` line, so you can use them directly without a `parse` statement.
Find all errors in a time window
fields @timestamp, @message, @requestId
| filter @message like /(?i)(error|exception|traceback)/
| sort @timestamp desc
| limit 50
Case-insensitive regex catches Python tracebacks, Node.js Error objects, and Java exceptions in the same query. The `@requestId` field ties the error line to the rest of that invocation's log lines, and to its X-Ray trace if tracing is enabled.
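Before running the filter against a large log group, it can help to check the pattern locally. The sketch below applies the same regex with Python's `re` module to a few invented log lines. The sample lines and the helper name `matching_lines` are assumptions for illustration, and Python's regex flavour may differ from Logs Insights in edge cases.

```python
import re

# Same pattern as the query's filter; (?i) makes the match case-insensitive.
ERROR_PATTERN = re.compile(r"(?i)(error|exception|traceback)")

# Hypothetical log lines, invented for illustration only.
SAMPLE_LINES = [
    "Traceback (most recent call last):",
    "2024-01-01T00:00:00Z ERROR failed to fetch item",
    "java.lang.IllegalStateException: not ready",
    "START RequestId: abc Version: $LATEST",
]


def matching_lines(lines):
    """Return the lines that the filter pattern would keep."""
    return [line for line in lines if ERROR_PATTERN.search(line)]


if __name__ == "__main__":
    for line in matching_lines(SAMPLE_LINES):
        print(line)
```

The `START` line is dropped because it contains none of the three keywords. A line such as `no errors found` would still match, which is the usual trade-off of a broad substring pattern.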
Find timeouts
fields @timestamp, @requestId, @duration
| filter @message like /Task timed out/
| sort @timestamp desc
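| limit 20
Lambda writes a `Task timed out after X.XX seconds` line for every timeout, which is distinct from a function error. `@duration` is populated on the invocation's `REPORT` line rather than on the timeout line itself, so check the `REPORT` lines as well: if `@duration` is consistently close to your configured timeout (e.g. 29.8s with a 30s limit), the function is regularly hitting the wall.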
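If you export the matching lines for further analysis, the timeout value can be pulled out of the message text. This is a minimal sketch that assumes the message follows the `Task timed out after X.XX seconds` shape described above. The function name and sample messages are hypothetical.

```python
import re

# Matches the timeout line and captures the number of seconds.
TIMEOUT_RE = re.compile(r"Task timed out after (\d+(?:\.\d+)?) seconds")


def timeout_seconds(message):
    """Return the reported timeout in seconds, or None if the line is not a timeout."""
    match = TIMEOUT_RE.search(message)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    # Hypothetical message, invented for illustration.
    print(timeout_seconds("2024-01-01T00:00:30Z 1a2b Task timed out after 30.03 seconds"))
```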
Identify cold starts
fields @timestamp, @requestId, @duration, @initDuration
| filter ispresent(@initDuration)
| sort @initDuration desc
| limit 20
`@initDuration` is only present on cold-start invocations. Sorting by descending init duration shows which invocations had the longest container init, which is useful for diagnosing provisioned concurrency gaps or large deployment packages.
Find invocations approaching memory limit
fields @timestamp, @requestId, @maxMemoryUsed, @memorySize
| filter @maxMemoryUsed / @memorySize > 0.9
| sort @maxMemoryUsed desc
| limit 20
Memory usage above 90% of the configured limit is a pre-OOM warning. The function hasn't been killed yet, but it may be under load. Filter at 0.85 if you want earlier warning.
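The filter is a plain ratio check, which the sketch below reproduces in Python. It assumes both fields are reported in bytes (AWS's sample queries divide them by 1000 twice to get megabytes). The function names and the example values are illustrative assumptions.

```python
BYTES_PER_MB = 1000 * 1000  # assumption: decimal megabytes, as in AWS's sample queries


def memory_utilisation(max_used_bytes, memory_size_bytes):
    """Fraction of configured memory that the invocation actually used."""
    return max_used_bytes / memory_size_bytes


def is_near_limit(max_used_bytes, memory_size_bytes, threshold=0.9):
    """Mirror of the query's filter: True when usage exceeds the threshold."""
    return memory_utilisation(max_used_bytes, memory_size_bytes) > threshold


if __name__ == "__main__":
    used, size = 120_000_000, 128_000_000  # hypothetical values
    print(f"{used / BYTES_PER_MB:.0f} MB of {size / BYTES_PER_MB:.0f} MB "
          f"({memory_utilisation(used, size):.1%})")
```

Lowering `threshold` to 0.85 corresponds to the earlier-warning variant of the filter.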
Duration percentiles across all invocations
filter @type = "REPORT"
| stats count() as invocations,
pct(@duration, 50) as p50,
pct(@duration, 95) as p95,
pct(@duration, 99) as p99,
max(@duration) as maxDuration
by bin(5m)
This query gives you a latency distribution over time, which is useful for identifying whether a latency spike was sudden (one 5m bucket) or a gradual drift. Filtering to `REPORT` lines makes `count()` count invocations rather than every log line. A p99 spike with a stable p50 often indicates an upstream dependency issue.
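To make the bucketing concrete, here is a small Python sketch of what grouping by `bin(5m)` and taking percentiles amounts to. It uses the nearest-rank percentile definition, which is an assumption: the exact method Logs Insights uses is not stated here, so results on small buckets may differ slightly. The function names and sample data are hypothetical.

```python
import math
from collections import defaultdict

BIN_SECONDS = 300  # 5 minutes, matching bin(5m)


def percentile(values, p):
    """Nearest-rank percentile (assumed definition, see the note above)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_by_bin(samples):
    """Group (epoch_seconds, duration_ms) pairs into 5-minute buckets."""
    buckets = defaultdict(list)
    for timestamp, duration_ms in samples:
        bucket_start = timestamp - timestamp % BIN_SECONDS
        buckets[bucket_start].append(duration_ms)
    return {
        start: {
            "invocations": len(durations),
            "p50": percentile(durations, 50),
            "p99": percentile(durations, 99),
            "max": max(durations),
        }
        for start, durations in sorted(buckets.items())
    }


if __name__ == "__main__":
    # Hypothetical samples: one slow outlier in the first bucket.
    samples = [(0, 100), (10, 110), (20, 120), (299, 5000), (300, 105)]
    for start, row in latency_by_bin(samples).items():
        print(start, row)
```

In the sample data the first bucket has a p50 of 110 ms and a p99 of 5000 ms, which is the "stable p50, spiking p99" shape described above.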
ECS queries
ECS task logs land in the log group configured in your task definition's `logConfiguration`. With `awslogs` driver, each container writes to a log stream named `<prefix>/<container-name>/<task-id>`. Point the query at the correct log group.
Find all application errors
fields @timestamp, @message, @logStream
| filter @message like /(?i)(error|exception|fatal|panic)/
| sort @timestamp desc
| limit 50
Detect OOM kills (exit code 137)
fields @timestamp, @message, @logStream
| filter @message like /exit code 137/
or @message like /OutOfMemoryError/
or @message like /Killed/
| sort @timestamp desc
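| limit 20
Exit code 137 means the process received SIGKILL (128 + signal 9). The most common cause is the kernel OOM killer terminating a container that exceeded its memory limit, but ECS also sends SIGKILL when a container ignores SIGTERM past its stop timeout. This is different from a graceful shutdown (exit code 0) or an application crash (exit code 1). If the 137 coincides with memory climbing towards the limit, the fix is more memory or a memory leak investigation.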
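Exit codes above 128 conventionally encode `128 + signal number`, which is how 137 maps to signal 9. The sketch below decodes a code using Python's standard `signal` module on a POSIX system. The helper name `describe_exit_code` is hypothetical.

```python
import signal


def describe_exit_code(code):
    """Explain a container exit code using the 128 + signal convention."""
    if code > 128:
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            # Not a known signal number on this platform.
            return f"exited with status {code}"
        return f"killed by {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    for code in (0, 1, 137, 143):
        print(code, "->", describe_exit_code(code))
```

Code 143 decodes to SIGTERM, which is what a container that shuts down on the first stop request typically reports.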
Last log lines before a container stopped
fields @timestamp, @message
| filter @logStream like /<your-task-id>/
| sort @timestamp desc
| limit 100
Replace `<your-task-id>` with the stopped task ID from the ECS console. The last 100 lines before the container terminated usually contain the root cause. Combine with CloudTrail for the StopTask API call if you need to know who or what triggered it.
Error count by 5-minute window
filter @message like /(?i)error/
| stats count() as errorCount by bin(5m)
| sort bin(5m) asc
This gives you a timeline of error frequency, which is useful for correlating with a deployment, a traffic spike, or an upstream dependency failure. A sudden step-change in errorCount at a specific 5-minute bucket points to the cause.
RDS queries
RDS logs require explicit enablement: publish the relevant log types to CloudWatch Logs from the instance's log exports setting. For PostgreSQL, set `log_min_duration_statement = 1000` in the parameter group to log queries that run for 1 second or longer. For MySQL, enable the slow query log and, if needed, the general log. The log group is typically `/aws/rds/instance/<db-identifier>/postgresql` or `/aws/rds/instance/<db-identifier>/slowquery`.
Find slow queries over 1 second
fields @timestamp, @message
| parse @message "duration: * ms" as durationMs
| filter durationMs > 1000
| sort durationMs desc
| limit 30
With `log_min_duration_statement` set, PostgreSQL writes a `duration: 1234.567 ms` line for each statement that runs at least that long. The `parse` statement extracts the value into a field you can filter and sort on. Queries over 5000ms (5 seconds) often point to a missing index or a sequential scan on a large table.
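The glob form of `parse` can be thought of as a template in which each `*` captures the text between the surrounding literals. The Python sketch below imitates that behaviour by turning the glob into a regex. It illustrates the idea under that assumption and is not a description of how Logs Insights implements `parse`. The function names and sample messages are hypothetical.

```python
import re


def glob_to_regex(pattern):
    """Turn a parse-style glob into a regex: literals are escaped, each * captures lazily."""
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("(.*?)".join(literal_parts))


def glob_parse(pattern, message):
    """Return the captured values, or None when the pattern does not match."""
    match = glob_to_regex(pattern).search(message)
    return list(match.groups()) if match else None


if __name__ == "__main__":
    # Hypothetical PostgreSQL log line, invented for illustration.
    line = "2024-01-01 00:00:00 UTC:LOG:  duration: 1234.567 ms  statement: SELECT 1"
    captured = glob_parse("duration: * ms", line)
    print(captured)
    print(float(captured[0]) > 1000)
```

The captured value is text until it is converted, which is why the sketch calls `float` before comparing. A message that does not contain the literals yields `None`, mirroring the way an unmatched `parse` leaves the field empty.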
Find connection errors
fields @timestamp, @message
| filter @message like /(?i)(connection refused|too many connections|remaining connection slots|pg_hba.conf)/
| sort @timestamp desc
| limit 20
`remaining connection slots are reserved for non-replication superuser connections` means you've hit `max_connections`. `pg_hba.conf` errors mean authentication is failing. Both are distinct from application-level query errors and need different fixes.
Detect lock waits and deadlocks
fields @timestamp, @message
| filter @message like /(?i)(deadlock|lock wait timeout|lock not available|waiting for .* lock)/
| sort @timestamp desc
| limit 20
Lock contention is a common cause of p99 latency spikes that don't show up in CPU or connection metrics. If you see deadlock messages during a deployment, it usually means two transactions are writing to the same rows in different order.
API Gateway queries
API Gateway access logs must be enabled manually in the Stage settings. Set the log format to JSON so every field is queryable. The log group name is set by you — typically `/aws/apigateway/<api-name>`.
Find 5xx errors
fields @timestamp, @message
| filter @message like /"status":5/
| sort @timestamp desc
| limit 50
If your access log format uses JSON with a numeric status field, filter on the status field string pattern. For the CLF format, use `filter @message like / 5[0-9][0-9] /` to match HTTP 5xx codes between whitespace.
Slowest endpoints by p99 latency
fields @timestamp, @message
| parse @message '"path":"*"' as path
| parse @message '"responseLatency":*,' as latencyMs
| stats pct(latencyMs, 99) as p99, count() as requests by path
| sort p99 desc
| limit 20
This identifies which routes are slow, not just that the API is slow overall. A single route at p99 = 8000ms while all others are under 200ms narrows the investigation to one Lambda function or one database query.
Error count by HTTP status code
fields @timestamp, @message
| parse @message '"status":*,' as statusCode
| filter statusCode >= 400
| stats count() as count by statusCode
| sort count desc
Splitting errors by status code separates client errors (4xx, usually bad input or auth failures) from server errors (5xx, your problem). A spike in 429s means clients are being rate-limited or your throttle config is too aggressive. A spike in 502s usually means the Lambda integration is returning an invalid response format.
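For access logs exported to a file, the same breakdown can be reproduced offline. This sketch assumes each line is a JSON object with a `status` field, as in the queries above. The function name and the sample lines are invented for illustration.

```python
import json
from collections import Counter


def error_counts(lines):
    """Count responses with status >= 400, most frequent first."""
    counts = Counter()
    for line in lines:
        status = int(json.loads(line)["status"])
        if status >= 400:
            counts[status] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical JSON access log lines.
    sample = [
        '{"path":"/orders","status":200,"responseLatency":35}',
        '{"path":"/orders","status":502,"responseLatency":29000}',
        '{"path":"/users","status":502,"responseLatency":120}',
        '{"path":"/users","status":429,"responseLatency":2}',
    ]
    for status, count in error_counts(sample):
        print(status, count)
```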
Logs Insights syntax patterns to know
| Pattern | Syntax | When to use |
|---|---|---|
| Regex filter | `filter @message like /pattern/` | Case-sensitive substring match; add `(?i)` for case-insensitive |
| Field extraction | `parse @message "prefix * suffix" as field` | Extracts a value between fixed strings into a queryable field |
| Presence check | `filter ispresent(@fieldName)` | Filters to only rows where a parsed/injected field exists |
| Time bucketing | `by bin(5m)` | Groups results into time windows — 1m, 5m, 15m, 1h |
| Percentiles | `pct(@duration, 99)` | P99 latency; works on any numeric field |
| Multi-log-group query | Select multiple log groups in the query console | Query /aws/lambda/fn and /aws/apigateway/api simultaneously |
Frequently asked questions
What is the CloudWatch Logs Insights query limit?
A Logs Insights query returns at most 10,000 log events; that cap applies to results, not to the data scanned. Use `limit` to control result set size. The query time range and log group selection control how much data is scanned, so narrower windows are faster and cheaper. Concurrent queries per account are also capped by a service quota (10 at the time of writing).
How do I query multiple log groups at once in Logs Insights?
In the CloudWatch console, select multiple log groups from the log group selector before running the query. The query then runs across all selected groups simultaneously. You can also use `SOURCE '/aws/lambda/fn1', '/aws/lambda/fn2'` syntax in some regions. Results include `@log` and `@logStream` fields to identify which log group and stream each line came from.
Why does my CloudWatch Logs Insights query return no results?
The three most common causes: (1) the time window doesn't overlap with the log events you expect, so check the time range selector; (2) the IAM principal running the query lacks permission on the log group (for example `logs:StartQuery` and `logs:GetQueryResults`); (3) the log group name is wrong. Log groups must be selected explicitly, and wildcards in the log group name are only supported with SOURCE syntax. Run a bare `fields @timestamp, @message | limit 10` with no filter to confirm log data exists.
How do I extract a field from a JSON log line in Logs Insights?
CloudWatch Logs Insights auto-parses JSON log events: if your log line is valid JSON, you can reference top-level fields directly without a `parse` statement. For `{"level":"error","duration":1234}`, you can write `filter level = "error"` and `stats pct(duration, 99)` without any parse. Nested JSON fields are addressed with dot notation (for example `request.path`); use `parse` with a glob or regex only for values that are not exposed as discovered fields.
What's the difference between filter and parse in Logs Insights?
`filter` selects which log events to include — it's a WHERE clause that reduces the result set. `parse` extracts a substring from `@message` into a named field — it doesn't filter, it adds a column. Use `parse` first to create the field, then `filter` or `stats` to work with it. A `parse` that matches no events still returns all events unless you also `filter ispresent(extractedField)`.