Tuning Cloud SQL With AI: Slow Queries, Flags, and

A Cloud SQL Postgres instance started timing out under normal load, and the team’s reflex was to bump the machine size. We did. It helped for a week, then the timeouts came back, now on a bigger, more expensive instance. The real problem was a single unindexed query doing a sequential scan on a 12-million-row table, plus a connection count creeping toward the instance limit because nobody had pooled. Throwing hardware at a query problem is the most common Cloud SQL mistake I see, and it’s expensive. The instance gives you Query Insights, database flags, and connection metrics — plenty of signal — but reading an EXPLAIN ANALYZE plan or knowing which of 200 flags matters is specialist knowledge. That’s where I bring in AI: to read the plan, to know the flag, while I verify and decide what actually ships.

Start with Query Insights, not the machine type

Cloud SQL Query Insights surfaces your most expensive queries. Pull the worst offender’s plan and hand it to the model — reading an execution plan is one of the things AI does genuinely well:

EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM orders
WHERE customer_id = $1 AND status = 'pending'
ORDER BY created_at DESC LIMIT 50;

Prompt: “Here is a Postgres EXPLAIN ANALYZE plan (JSON) from Cloud SQL. Identify the most expensive node, tell me whether it’s a sequential scan that should be an index scan, and write the exact CREATE INDEX statement that would help — including column order and whether a partial or covering index is better here. Explain why the column order is what it is.”

The model reads the plan, finds the seq scan on orders, and proposes a composite index. Column order matters — equality predicates before the range/sort column — and AI gets that right when prompted to justify it:

CREATE INDEX CONCURRENTLY idx_orders_customer_status_created
ON orders (customer_id, status, created_at DESC);

I use CONCURRENTLY on production so the build doesn’t lock writes, and I verify the new plan actually uses the index with another EXPLAIN before declaring victory. AI proposes the index; the planner confirms whether it helps. Sometimes the planner ignores an index for low-selectivity predicates, and only re-running EXPLAIN tells you that.

Database flags: know which one, know the trade-off

Cloud SQL exposes tuning through database flags, and there are a lot of them. The skill is knowing which flag addresses your symptom and what it costs. I describe the symptom and the instance:

Prompt: “I have a Cloud SQL Postgres instance, 4 vCPU / 16GB RAM, mostly read-heavy OLTP with occasional reporting queries. Connection counts are high and I see frequent temp file usage in the logs. Recommend specific database flag values for work_mem, shared_buffers, max_connections, and effective_cache_size. For each, state the trade-off and why your value fits 16GB. Don’t set work_mem so high that concurrent queries exhaust RAM.”

That last clause matters because work_mem is per-operation, not per-connection — set it carelessly with high concurrency and you OOM the instance. The model knows this trade-off, but I make it state the math so I can check it:

gcloud sql instances patch my-pg-instance \
  --database-flags=work_mem=16384,max_connections=200,effective_cache_size=12288MB

I apply flags in a maintenance window because some require a restart, and I change one category at a time so I can attribute any regression. AI tells me the value; I own the rollout discipline.

The connection problem is usually pooling

High connection counts on Cloud SQL are rarely solved by raising max_connections — each connection costs memory, and Postgres degrades with too many. The fix is a pooler. I have AI evaluate whether I need one and configure it:

Prompt: “My Cloud SQL Postgres instance peaks near its max_connections limit. Traffic comes from a Cloud Run service that opens a connection per request. Should I use Cloud SQL’s built-in PgBouncer-style pooling, a sidecar pooler, or app-level pooling? Recommend pool sizing given the instance has 4 vCPU, and explain why raising max_connections is the wrong first move.”

The model correctly steers away from raising the limit and toward a transaction-mode pool sized to the instance — which is the right call and the one teams resist because raising a number feels easier than adding a pooler.

Let AI read the slow query log in bulk

Insights shows the top queries; the slow query log has the long tail. I pull it and have AI cluster:

gcloud logging read \
  'resource.type="cloudsql_database"
   AND resource.labels.database_id="my-proj:my-pg-instance"
   AND textPayload:"duration:"' \
  --limit=200 --format=json

Prompt: “Here are 200 Cloud SQL slow-query log lines. Normalize the queries (replace literals with placeholders), group identical query shapes, and rank by total time consumed and by call count. Tell me which query shape is worth indexing first — the one eating the most total time, not just the slowest single run.”

Ranking by total time, not worst single execution, is the insight that decides where to spend effort. A query that runs 10,000 times at 50ms hurts more than one that runs once at 3s, and AI sorts that out cleanly across 200 lines.

Where the human draws the line

Cloud SQL tuning with AI works because plans, flags, and logs are all structured text that a model reads expertly — far faster than I parse a JSON execution plan by hand. But the model can’t see my live load, can’t run the EXPLAIN itself, and doesn’t carry the consequences of a flag that OOMs production. So the loop is fixed: AI reads the plan and names the flag or index, I verify it against the real planner and the real instance, and I own the rollout. The right index from a hallucinated plan is still the wrong index.

The reusable prompts are in my prompts collection, and the GCP with AI series covers the services that talk to your database. Stop resizing the instance. Read the plan.

Tuning Cloud SQL With AI: Slow Queries, Flags, and Connections

Start with Query Insights, not the machine type

Database flags: know which one, know the trade-off

The connection problem is usually pooling

Let AI read the slow query log in bulk

Where the human draws the line

Download the Free 500-Prompt DevOps AI Toolkit