Tuning RDS and Aurora Performance With AI

The page that woke me up said “database CPU at 95%,” and the on-call instinct in that moment is almost always wrong: scale up the instance, go back to sleep, deal with it tomorrow. I’ve done it. It’s expensive and it doesn’t fix anything, because a runaway query on a db.r6g.large will happily saturate a db.r6g.4xlarge too. The actual job is to find which queries are eating the CPU and why their plans are bad — and that’s slow, fiddly analysis that AI accelerates enormously, provided you feed it the database’s own diagnostic data instead of vibes.

Where I draw the line: AI reads Performance Insights output and EXPLAIN plans and proposes specific index or query changes with its reasoning. I run those changes on a replica first and verify the plan actually improved before anything touches production. The model is a very well-read DBA who can’t see your database; you’re the one with the keys.

Get the database to tell you what hurts

RDS Performance Insights already ranks load by wait event and by SQL digest — that’s the data you want, not a guess. Pull the top SQL by load:

aws pi get-resource-metrics \
  --service-type RDS \
  --identifier db-ABCDEFG \
  --metric-queries '[{"Metric":"db.load.avg","GroupBy":{"Group":"db.sql_tokenized","Limit":5}}]' \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) --period-in-seconds 60 \
  --query 'MetricList[].{dim:Key.Dimensions,vals:DataPoints[-1].Value}'

That gives you the handful of query shapes carrying the load and the wait events they’re stuck on (CPU, IO:DataFileRead, Lock:transactionid, etc.). The wait event is the most important clue and the one people skip.

Make AI read the wait events, not just the SQL

Paste the top SQL plus their dominant wait events and ask the model to reason about the category of problem first:

Here are the top 5 SQL digests by db.load.avg from RDS Performance Insights, each with its dominant wait event. Before suggesting fixes, classify each: CPU-bound (likely a missing index or bad plan → look at EXPLAIN), IO-bound on reads (working set bigger than buffer cache, or a full scan), or lock contention (transaction/row locks → look at transaction scope). Tell me which one to investigate first for the biggest win and what evidence to pull next.

The reply kept me from chasing the wrong thing:

The top digest is SELECT ... FROM orders WHERE customer_id = ? AND status = ? with wait event CPU at 60% of total load. CPU-bound on a filtered SELECT strongly implies a missing composite index causing a scan. Investigate this one first — it’s the majority of your CPU. The #2 digest is IO:DataFileRead on a reporting query; lower priority and probably acceptable. Pull EXPLAIN (ANALYZE, BUFFERS) for the #1 query next.

So the fix isn’t a bigger instance — it’s one index. But I don’t add it on the model’s say-so.

EXPLAIN is the ground truth — feed it the real plan

Run the actual plan on a read replica (never ANALYZE a mutating query on primary), and hand the plan to AI:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 4821 AND status = 'pending';

Below is a Postgres EXPLAIN ANALYZE plan. Identify the most expensive node, whether it’s doing a Seq Scan that should be an Index Scan, whether the row estimates are far off from actual (stale stats), and propose the minimal index that fixes it. Give me the exact CREATE INDEX statement and explain why this column order.

The model reads the Seq Scan on orders (actual rows=2.1M) node, notices the filter selectivity, and proposes a composite index with the right column order:

CREATE INDEX CONCURRENTLY idx_orders_customer_status
  ON orders (customer_id, status);

It correctly puts customer_id first (high selectivity, equality) and explains the reasoning. That column-order judgment is where a lot of hand-rolled indexes go wrong, and it’s exactly the kind of well-trodden pattern the model is reliable on. The CONCURRENTLY keyword — so the index build doesn’t lock the table — is something I always double-check it included, because forgetting it on a hot table is a self-inflicted outage.

Verify the plan actually changed before production

This is the non-negotiable step. Create the index on the replica, re-run the same EXPLAIN ANALYZE, and confirm the Seq Scan became an Index Scan and the cost dropped. Only then promote the change to production. AI’s prediction that the index helps is a hypothesis; the replica’s new plan is proof. I’ve seen “obvious” indexes get ignored by the planner because of low selectivity or correlated columns — the model can’t know that without the plan, but the replica tells you immediately.

Parameter groups: let AI explain, you decide the risk

For instance-level tuning (e.g. shared_buffers, work_mem, max_connections), AI is great at explaining what a parameter does and the risk of changing it, which beats reading the docs cold. But these are blast-radius changes, so the human owns the call:

aws rds describe-db-parameters \
  --db-parameter-group-name prod-aurora-pg15 \
  --query "Parameters[?ParameterName=='work_mem' || ParameterName=='shared_buffers']"

Ask it: “We’re seeing temp-file IO on sort-heavy queries — is bumping work_mem the right lever, what’s the per-connection memory risk at our max_connections, and what would you change it to?” It’ll reason about work_mem being allocated per sort node per connection, which is the trap people fall into. You take that reasoning, size it against your real connection count, and apply it during a maintenance window.

The bottom line

The RDS panic move — scale up and pray — fails because it treats a query problem as a capacity problem. AI is a fast, well-read partner for the real work: reading Performance Insights wait events, decoding EXPLAIN plans, and proposing the minimal index or parameter change with sound reasoning. But it’s reasoning over diagnostics, not querying your database. So every suggestion gets validated on a replica with a real before/after plan, and every blast-radius parameter change stays a human decision made during a window.

This evidence-first habit runs through the rest of the AWS guides, and the Performance Insights and EXPLAIN prompts I lean on are in the prompts collection.