Debugging Slow Postgres Queries With AI and EXPLAIN ANALYZE

A query that ran in 12ms in staging took 4 seconds in production, and the only difference was data volume. The on-call engineer pasted the query into a chatbot, got back “add an index on the WHERE column,” added it, and nothing changed — because the planner was already using a different index and the real problem was a nested loop over a bad row estimate. That is the trap with slow-query debugging: the obvious fix is usually wrong, and the actual answer lives inside the EXPLAIN plan. AI is genuinely good at reading those plans out loud and pointing at the expensive node. It is not good at knowing your data distribution or your prod load. So I let it narrate; I verify on a replica.

This post is the workflow I actually use when a query is slow and someone wants it fixed before lunch.

Get a real plan, not a guess

Never debug from the query text alone. EXPLAIN shows estimates; EXPLAIN (ANALYZE, BUFFERS) actually runs the query and shows you real timings, real row counts, and how much data came from cache versus disk. That last part — BUFFERS — is the one people skip and the one that tells you whether you are I/O bound.

EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT TEXT)
SELECT o.id, o.total, c.email
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at >= now() - interval '7 days'
  AND o.status = 'pending'
ORDER BY o.created_at DESC
LIMIT 50;

Run it on a replica or a recent restore, never blindly on prod if the query mutates anything. Capture the full output. The two numbers you hunt for first: where actual rows diverges wildly from estimated rows (a planner misestimate), and any node with a high actual time that loops many times.

Hand the plan to AI as a translator

This is where AI earns its place. Plan output is dense, and reading it fluently takes practice. I paste the whole plan in and ask for a structured read.

Here is EXPLAIN (ANALYZE, BUFFERS) output from Postgres 16. Walk me through it bottom-up. Identify the single most expensive node by actual total time, tell me whether the row estimates are accurate, flag any sequential scans on large tables, and explain whether the sort or the join is the bottleneck. Do not suggest fixes yet — just explain what the plan is doing.

Asking it to not fix anything yet is deliberate. I want the explanation before the prescription, because the explanation is verifiable and the prescription often isn’t. A good answer will say something like “the bottleneck is a Seq Scan on orders returning 2M rows that get filtered down to 800, and the estimate was off by 100x, which suggests stale statistics.” Now I have a hypothesis I can test.

Test the cheap fixes first

Before adding indexes, rule out stale statistics. A 100x misestimate is frequently just an autovacuum that hasn’t run an ANALYZE recently.

ANALYZE orders;
-- then re-run the EXPLAIN ANALYZE and compare estimates

If the estimates snap back to reality and the plan changes, you are done — no schema change, no index, no migration. I have closed more “slow query” tickets with a plain ANALYZE than with anything clever. Check when stats were last gathered:

SELECT relname, last_analyze, last_autoanalyze, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname IN ('orders', 'customers');

Find the pattern, not just this one query

One slow query is an anecdote. pg_stat_statements shows you the queries that actually cost you across the whole workload, which is where you should spend effort.

SELECT
  substr(query, 1, 80) AS query,
  calls,
  round(total_exec_time::numeric, 1) AS total_ms,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

The query someone is complaining about may not even be in the top 20. Optimizing a query that runs twice a day while ignoring one that runs 4,000 times a minute is a classic misuse of an afternoon. I feed this table to AI too, asking it to group similar query shapes and rank by total time, not mean.

When an index really is the answer

If the plan shows a sequential scan over a large table to satisfy a selective predicate, an index is legitimate. Build it without locking writes:

CREATE INDEX CONCURRENTLY idx_orders_status_created
ON orders (status, created_at DESC)
WHERE status = 'pending';

A partial index here is intentional — most rows aren’t pending, so the index stays small and hot. I ask AI to justify the column order and the partial predicate, then I prove it on the replica by re-running the same EXPLAIN (ANALYZE, BUFFERS) and confirming the node changed from a seq scan to an index scan and the total time dropped. If only one of those two things happened, I keep digging. Index design is deep enough to deserve its own treatment; I keep a running set of plan-reading prompts in the prompt library.

The discipline that keeps you out of trouble

Every change gets the same loop: capture the before-plan, make one change, capture the after-plan on the same replica with the same data, compare actual times. If the numbers don’t move, revert — no matter how confident the model sounded. AI shortens the time between “this is slow” and “I understand why,” which is most of the work. It does not get to decide what ships. For more on Postgres performance and the broader operational picture, browse the rest of the Postgres guides, and if your instance is internet-adjacent, read up on hardening Redis and Postgres against exposure before you tune anything.