Debugging Cloud CDN and Cloud DNS With AI: Caching and Resolution

Two reflexes cause more edge incidents than they fix: flushing the entire CDN cache when content looks stale, and assuming a DNS change should take effect immediately. The first triggers a cache-miss storm that stampedes your origin; the second sends people chasing a “broken” change that was simply waiting out an old TTL. Both come from skipping the evidence and jumping to the lever. Cloud CDN and Cloud DNS problems are almost always readable from headers and records — the Age and Cache-Control on a CDN response, the zone records and dig output for DNS — and reading that evidence is exactly where AI saves time. I’ve stopped letting anyone on my team flush a cache before they can tell me what the response headers say.

CDN: why isn’t it caching?

A low cache hit rate has two usual causes, and both are visible in the config and the response headers rather than in vibes. Either the origin’s headers don’t permit caching, or the cache key is fragmented so badly that nothing gets reused.

curl -sI https://example.com/static/app.js | grep -iE '^(age|cache-control|x-cache):'
gcloud compute backend-services describe web-backend --global \
  --format="yaml(cdnPolicy, enableCDN)"

Prompt: “Our Cloud CDN hit rate is low. Here are the response headers (Age, Cache-Control, X-Cache) and the backend’s cdnPolicy including cacheMode and the cache key policy. Tell me whether the origin headers actually permit caching and match the cacheMode, and whether the cache key includes anything high-cardinality (query strings, cookies) that’s fragmenting the cache. Point at the specific cause.”

The two findings the model surfaces most: an origin sending Cache-Control: no-store or private while the team expected caching, and a cache key that includes a per-user tracking query parameter so every request is unique and nothing is ever a hit. The first is a header fix at the origin or a cacheMode change; the second is removing the param from the cache key policy.
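The first of those checks is mechanical enough to sketch. The helper below gives a first-pass read of a Cache-Control value; the name classify_cache_control is hypothetical, and the rules are deliberately simplified. The real outcome also depends on the backend's cacheMode (FORCE_CACHE_ALL, for instance, overrides origin directives), so treat the result as a hint, not as Cloud CDN's actual decision logic.

```shell
# Hypothetical helper: first-pass read of a Cache-Control value.
# Simplified on purpose; the real outcome also depends on cacheMode,
# and a value like max-age=0 still counts as "permits-caching" here.
classify_cache_control() {
  local cc
  # Lowercase the value so matching is case-insensitive.
  cc=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$cc" in
    *no-store*|*private*|*no-cache*) echo "blocks-caching" ;;
    *s-maxage=*|*max-age=*|*public*) echo "permits-caching" ;;
    *) echo "no-directive" ;;
  esac
}

# Example against a live response (the URL is a placeholder):
# classify_cache_control "$(curl -sI https://example.com/static/app.js \
#   | tr -d '\r' | awk -F': ' 'tolower($1)=="cache-control"{print $2}')"
```

A "blocks-caching" result on an asset the team expected to be cached is the header fix described above; "no-directive" is where the cacheMode decides.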

Prompt: “Our cache key includes the full query string, and we have a utm_source parameter that varies from link to link but doesn’t change the response. Show me how to set the cache key policy to exclude tracking parameters while still keying on the ones that change content, and explain the hit-rate impact.”
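For reference, one way that change can look in gcloud, as a sketch rather than a drop-in fix: keep the query string in the cache key but exclude the tracking parameters from it. web-backend is the backend service from the earlier describe command; the parameter list is an assumption to replace with the parameters you actually see, and the flag names are worth confirming against your gcloud version.

```shell
# Keep query strings in the cache key, but drop tracking parameters from it.
# The parameter names below are illustrative; list the ones in your own traffic.
gcloud compute backend-services update web-backend --global \
  --cache-key-include-query-string \
  --cache-key-query-string-blacklist=utm_source,utm_medium,utm_campaign

# Confirm what the cache key includes after the change.
gcloud compute backend-services describe web-backend --global \
  --format="yaml(cdnPolicy.cacheKeyPolicy)"
```

With the tracking parameters out of the key, requests that differ only in those parameters collapse onto one cache entry, which is where the hit-rate gain comes from.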

CDN: stale content without a stampede

When the CDN serves content that’s too old, the fix is the right TTL plus a targeted invalidation — by path, not a full flush. A broad invalidation forces every edge to miss simultaneously and slams the origin.

gcloud compute url-maps invalidate-cdn-cache web-map \
  --path="/static/app.js"
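If a larger invalidation is unavoidable, a rough way to stagger it is to walk the path list one request at a time with a pause between requests. The function name invalidate_paths and the INVALIDATE_CMD override are hypothetical, and the pause length is an assumption to tune against your origin's capacity. Cloud CDN also rate-limits invalidation requests, so check the current quota before choosing the interval.

```shell
# Hypothetical helper: invalidate paths one at a time, pausing between
# requests so edges do not all return to the origin at the same moment.
# Usage: invalidate_paths URL_MAP PAUSE_SECONDS PATH...
invalidate_paths() {
  local url_map="$1" pause="$2" p
  shift 2
  for p in "$@"; do
    # INVALIDATE_CMD is an assumed override for dry runs (for example "echo").
    ${INVALIDATE_CMD:-gcloud} compute url-maps invalidate-cdn-cache "$url_map" \
      --path="$p"
    sleep "$pause"
  done
}

# Dry run:  INVALIDATE_CMD=echo invalidate_paths web-map 0 /static/app.js /static/app.css
# Real run: invalidate_paths web-map 60 /static/app.js /static/app.css
```

The dry run prints the commands it would issue, which is a cheap way to review the path list before anything reaches the edge.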

Prompt: “We deployed a new app.js but edges are still serving the old one. Recommend the right TTL for versioned static assets versus HTML, and give me a path-scoped invalidation for just the changed files. Explain the origin-load risk if I invalidate /* instead, and how to stagger a large invalidation if I really need one.”

The model reliably steers toward the narrowest invalidation, because the origin-stampede risk scales with how much you invalidate at once. For versioned assets I’d rather fix the cache-busting (a content hash in the filename) than invalidate at all.
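A minimal sketch of that cache-busting step, assuming GNU sha256sum is available (on macOS, shasum -a 256 is the usual substitute). The helper name hashed_name and the eight-character hash length are my own choices for illustration, not a convention any build tool requires.

```shell
# Hypothetical helper: print a content-hashed name for a file,
# e.g. app.js -> app.<first 8 hex chars of its SHA-256>.js
hashed_name() {
  local file="$1" base ext hash
  base="${file%.*}"    # everything before the last dot
  ext="${file##*.}"    # extension after the last dot
  hash=$(sha256sum "$file" | cut -c1-8)
  printf '%s.%s.%s\n' "$base" "$hash" "$ext"
}

# Example: cp app.js "$(hashed_name app.js)", then reference the new name in the HTML.
```

Because the name changes whenever the content does, the old object can keep its long TTL and simply stops being requested; only the HTML that references it needs a short TTL.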

DNS: resolution and propagation

For Cloud DNS, NXDOMAIN or wrong answers usually trace to the record set, the zone’s delegation, or a split between public and private zones resolving the query differently. Slow “propagation” is usually just the old TTL not having expired.

dig +short example.com @ns-cloud-a1.googledomains.com
gcloud dns record-sets list --zone=my-zone --name=example.com.

Prompt: “Users get NXDOMAIN for a subdomain we just added in Cloud DNS. Here are the zone’s record sets and the NS delegation. Check whether the record exists, whether the delegation is correct, and whether a private zone could be shadowing the public answer for internal clients (split-horizon). Tell me the specific cause and fix.”
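The delegation part of that check can be done by hand before the prompt is even sent. The sketch below compares the name servers the parent zone delegates to against the ones Cloud DNS assigned to the zone; same_ns_set and normalize_ns are hypothetical helper names, my-zone and example.com are the placeholders used earlier, and the gcloud output handling in the comments is an assumption to verify on your own setup.

```shell
# Hypothetical helper: normalize a whitespace-separated name-server list
# (lowercase, no trailing dots, sorted, de-duplicated).
normalize_ns() {
  printf '%s\n' $1 | tr '[:upper:]' '[:lower:]' | sed 's/\.$//' | sort -u
}

# Succeeds only if both lists contain the same name servers.
same_ns_set() {
  [ "$(normalize_ns "$1")" = "$(normalize_ns "$2")" ]
}

# Example (needs network and gcloud access, so left as comments):
# delegated=$(dig +short NS example.com)
# assigned=$(gcloud dns managed-zones describe my-zone \
#   --format="value(nameServers)" | tr ';' ' ')
# same_ns_set "$delegated" "$assigned" || echo "delegation mismatch"
#
# Split-horizon check: list private zones that could shadow the public answer.
# gcloud dns managed-zones list --filter="visibility=private" \
#   --format="table(name, dnsName)"
```

A mismatch points at the registrar or parent zone rather than at the record set, which narrows the prompt considerably.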

Prompt: “We changed an A record an hour ago and some clients still resolve the old IP. The record’s previous TTL was 3600. Explain why the change won’t fully propagate until the old TTL expires from resolver caches, and why lowering the TTL only helps future changes — the reduction itself has to wait out the old TTL. Tell me when to expect full propagation.”

That TTL reasoning resolves most “DNS is broken” tickets. A change is gated by the old record’s TTL sitting in resolver caches, so lowering the TTL before a planned cutover only helps if you do it at least one full old-TTL interval ahead, giving the old, higher TTL time to age out.
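The arithmetic behind "when will it propagate" is small enough to write down. This sketch assumes resolvers honor the TTL they were given, which most but not all do; the function name seconds_until_propagated is hypothetical.

```shell
# Hypothetical helper: worst-case seconds until resolver caches have dropped
# the old record, assuming resolvers honor the TTL.
# Args: change time (epoch seconds), old TTL (seconds), current time (epoch seconds).
seconds_until_propagated() {
  local changed_at="$1" old_ttl="$2" now="$3" remaining
  remaining=$(( changed_at + old_ttl - now ))
  # Never report a negative wait; 0 means the old TTL has fully aged out.
  if [ "$remaining" -lt 0 ]; then
    remaining=0
  fi
  echo "$remaining"
}

# Example (GNU date syntax): a change made 30 minutes ago with an old TTL of 3600
# seconds_until_propagated "$(date -d '30 minutes ago' +%s)" 3600 "$(date +%s)"
#
# To see what a given resolver still holds, the second column of the answer
# is the TTL remaining in that resolver's cache:
# dig +noall +answer example.com A @8.8.8.8
```

For the ticket above, a change made an hour ago against a TTL of 3600 is right at the worst-case boundary, so any client still seeing the old IP after that is worth investigating as something other than caching.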

The honest division of labor

AI is fast at the evidence-reading these problems require: interpreting cache headers against a cacheMode, spotting a high-cardinality cache key, reconciling a DNS change against the TTL that gates it. Those are deterministic relationships, which is why the model is dependable on them. What it can’t see is your origin’s real capacity or which paths are safe to invalidate broadly — so it tells me the cause and the narrowest fix, and I decide whether a wider invalidation is worth the origin load.

The rule I hold to: read the headers and records before reaching for a lever, and invalidate the smallest path set that solves the problem. The reusable prompts live in my prompts library, and the GCP with AI series covers the layer beneath the edge, including load balancer backend and health-check debugging for when the CDN is fine but the origin behind it isn’t. The edge is predictable once you stop flushing and start reading.
