Debugging VPC Firewall and Routing on GCP With AI

A VM couldn’t reach an internal load balancer that was clearly healthy. Ping from a neighboring instance worked. The app’s own connectivity test failed every time. I’d been staring at it for forty minutes before I remembered that GCP VPC firewall rules have priorities, that there are two implied rules you never see in the console, and that a higher-priority deny somewhere can shadow the allow you’re looking at. GCP networking fails quietly: no error, just packets that don’t arrive. That silence is what makes it a strong AI debugging target — I can dump the full rule set and route table and ask the model to trace the path, instead of holding the whole evaluation order in my head.

Get the full picture before guessing

The mistake is debugging from the console one rule at a time. Pull everything at once so the model sees what you see:

# All firewall rules sorted by priority (lower number wins)
gcloud compute firewall-rules list \
  --filter="network=prod-vpc" \
  --sort-by=priority \
  --format="table(name, priority, direction, sourceRanges.list(), sourceTags.list(), allowed[].map().firewall_rule().list(), denied[].map().firewall_rule().list(), targetTags.list(), disabled)"

# Routes defined in the VPC (network-wide, not filtered to a single instance)
gcloud compute routes list --filter="network=prod-vpc" \
  --format="table(name, destRange, nextHopGateway, nextHopInstance, nextHopIp, nextHopNetwork, priority)"

Then paste both into the model with the actual question.

Prompt: “Below are all firewall rules for prod-vpc (sorted by priority) and the route table. A VM with network tag app-tier at 10.20.1.5 cannot open TCP 443 to an internal load balancer at 10.20.4.9. Trace egress from the VM and ingress to the LB. Tell me which specific rule allows or blocks this flow, accounting for the two implied GCP rules (default-deny ingress, default-allow egress) and priority ordering. If nothing explicitly allows it, say so.”

The model walks the evaluation: egress is allowed by the implied allow-egress rule, ingress needs an explicit allow that targets the load balancer’s backend VMs, and the highest-priority matching ingress rule was a deny at priority 900 that I’d forgotten existed, sitting above my allow at priority 1000. Lower number wins. That’s the bug, and it’s exactly the kind of off-by-priority mistake humans miss when scanning rules top to bottom.
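To make that evaluation order concrete, here is a minimal Python sketch of how an ingress decision falls out of priorities. It is a simplified model, not GCP's implementation: it matches on source ranges and target tags only (no source tags or service accounts), and the rule names and CIDR ranges are made up for illustration.

```python
import ipaddress
from dataclasses import dataclass

IMPLIED_PRIORITY = 65535  # priority of GCP's implied rules


@dataclass
class Rule:
    name: str
    priority: int
    action: str                           # "allow" or "deny"
    protocol: str                         # e.g. "tcp", or "all"
    ports: tuple = ()                     # empty means every port
    source_ranges: tuple = ("0.0.0.0/0",)
    target_tags: tuple = ()               # empty means every instance


def matches(rule, src_ip, protocol, port, dest_tags):
    """True when an ingress rule applies to this packet and destination VM."""
    if rule.protocol not in ("all", protocol):
        return False
    if rule.ports and port not in rule.ports:
        return False
    if rule.target_tags and not set(rule.target_tags) & set(dest_tags):
        return False
    addr = ipaddress.ip_address(src_ip)
    return any(addr in ipaddress.ip_network(r) for r in rule.source_ranges)


def evaluate_ingress(rules, src_ip, protocol, port, dest_tags):
    """Return the deciding rule: lowest priority number wins,
    and a deny beats an allow at the same priority."""
    implied = Rule("implied-deny-ingress", IMPLIED_PRIORITY, "deny", "all")
    candidates = [r for r in list(rules) + [implied]
                  if matches(r, src_ip, protocol, port, dest_tags)]
    return min(candidates, key=lambda r: (r.priority, r.action != "deny"))


# Hypothetical rules mirroring the incident: a forgotten deny above the allow.
rules = [
    Rule("allow-app", 1000, "allow", "tcp", (443,), ("10.20.1.0/24",), ("ilb-backend",)),
    Rule("deny-legacy", 900, "deny", "tcp", (443,), ("10.20.0.0/16",), ("ilb-backend",)),
]
winner = evaluate_ingress(rules, "10.20.1.5", "tcp", 443, ["ilb-backend"])
print(winner.name, winner.action, winner.priority)  # prints: deny-legacy deny 900
```

With no matching rules at all, the function falls through to the implied deny, which is the "if nothing explicitly allows it, say so" case from the prompt.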

Let Connectivity Tests be the ground truth

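AI reasons about the config I pasted, but GCP’s Connectivity Tests evaluate the path against everything the platform actually enforces, including rules the model can’t see because they never made it into the paste, such as hierarchical firewall policies. I use both: AI to form a hypothesis fast, the test to confirm it against reality.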

gcloud network-management connectivity-tests create app-to-ilb \
  --source-instance=projects/my-proj/zones/us-central1-a/instances/app-vm-1 \
  --destination-ip-address=10.20.4.9 \
  --destination-port=443 \
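  --protocol=TCP

# Once the analysis finishes, export the result as JSON
gcloud network-management connectivity-tests describe app-to-ilb --format=json > app-to-ilb.json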

Then feed the test’s JSON trace back to the model:

Prompt: “This is the JSON result of a GCP Connectivity Test. It dropped at a step. Read reachabilityDetails.traces, find the step in each trace that caused the drop, tell me the exact cause in one sentence, and give me the precise gcloud command to fix it without widening the rule more than necessary.”
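Before pasting a long result, it can help to pull out just the dropping steps. The sketch below assumes the result follows the Connectivity Tests REST shape (`reachabilityDetails.traces[].steps[]` with `causesDrop` and `drop.cause`); the sample dictionary is hand-written for illustration, not real output, so check the field names against your own JSON.

```python
def find_drops(test_result):
    """Collect (trace index, step description, drop cause) for dropping steps."""
    drops = []
    traces = test_result.get("reachabilityDetails", {}).get("traces", [])
    for i, trace in enumerate(traces):
        for step in trace.get("steps", []):
            if step.get("causesDrop"):
                cause = step.get("drop", {}).get("cause", "UNKNOWN")
                drops.append((i, step.get("description", ""), cause))
    return drops


# Hand-written stand-in for a test result (assumed shape, not real output).
sample = {
    "reachabilityDetails": {
        "result": "UNREACHABLE",
        "traces": [
            {
                "steps": [
                    {"description": "Initial state", "causesDrop": False},
                    {
                        "description": "Final state: packet dropped",
                        "causesDrop": True,
                        "drop": {"cause": "FIREWALL_RULE"},
                    },
                ]
            }
        ],
    }
}
print(find_drops(sample))
```

For a real result you would load the exported file with `json.load` and pass the resulting dictionary in.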

I insist on “without widening more than necessary” because the lazy fix is a 0.0.0.0/0 allow, which trades a connectivity bug for a security hole. The model will happily suggest the broad fix unless you tell it not to.
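A cheap guard against the broad fix is to check a suggested rule's source ranges before it goes anywhere near production. This is a minimal sketch for IPv4 ranges; the /16 threshold is an arbitrary assumption of mine, so pick whatever matches your own address plan.

```python
import ipaddress


def too_broad(source_ranges, min_prefix_len=16):
    """Return the ranges whose prefix is shorter than min_prefix_len."""
    return [
        r for r in source_ranges
        if ipaddress.ip_network(r, strict=False).prefixlen < min_prefix_len
    ]


# A suggested rule that mixes a scoped range with an open one.
flagged = too_broad(["0.0.0.0/0", "10.20.1.0/24"])
print(flagged)  # prints: ['0.0.0.0/0']
```

An empty list means nothing crossed the threshold; anything else is a range I want justified before it ships.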

Have AI write the corrective rule with the right scope

Once the cause is clear, the model drafts a tightly scoped rule that I review before applying. Network tags beat IP ranges for tier-to-tier rules because they survive re-IPing:

gcloud compute firewall-rules create allow-app-to-ilb \
  --network=prod-vpc \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:443 \
  --source-tags=app-tier \
  --target-tags=ilb-backend \
  --priority=850 \
  --enable-logging

Note the priority: 850, not 950. The existing deny sits at 900 and the lower number wins, so an allow at 950 would still be shadowed by it; the new rule needs a number below 900 to be evaluated first. This is the detail to check by hand: I ask the model to state which existing rule its new rule must out-prioritize, and I verify that number myself. I never trust an AI-chosen priority without seeing the neighbor it has to beat.
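The comparison itself is small enough to write down. This sketch encodes the two facts that matter: a lower number is evaluated first, and at equal priority a deny takes precedence over an allow. The function name is my own, not part of any GCP tooling.

```python
def beats(new_priority, new_action, other_priority, other_action):
    """True when the new rule is evaluated ahead of the other rule."""
    if new_priority != other_priority:
        return new_priority < other_priority
    # Same priority: a deny takes precedence over an allow.
    return new_action == "deny" and other_action == "allow"


# An allow has to sit at a lower number than the deny at 900 to win.
print(beats(850, "allow", 900, "deny"))  # prints: True
print(beats(950, "allow", 900, "deny"))  # prints: False
print(beats(900, "allow", 900, "deny"))  # prints: False
```

The last case is the one that bites: matching the deny's priority is not enough, because the tie goes to the deny.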

Turn on logging and let AI read the firehose

Firewall rule logging produces a lot of noise. AI is good at turning that into a summary of what’s actually being denied:

Prompt: “Here are 200 GCP firewall log entries (JSON). Group them by rule_details.reference and disposition. Show me the top denied flows by count, with source IP and destination port, so I can tell which denies are protecting me and which are blocking legitimate traffic that needs a rule.”

That report tells me whether a deny is doing its job or quietly breaking something — the question you actually care about, surfaced from logs you’d never read line by line.
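For a sanity check on the model's summary, the same grouping is a few lines of Python. The sketch assumes each entry carries the firewall log fields under `jsonPayload` (`disposition`, `rule_details.reference`, `connection.src_ip`, `connection.dest_port`); the sample entries and rule names are invented for illustration.

```python
from collections import Counter


def top_denied(entries, limit=10):
    """Count denied flows by (rule reference, source IP, destination port)."""
    counts = Counter()
    for entry in entries:
        payload = entry.get("jsonPayload", {})
        if payload.get("disposition") != "DENIED":
            continue
        conn = payload.get("connection", {})
        key = (
            payload.get("rule_details", {}).get("reference", "unknown"),
            conn.get("src_ip"),
            conn.get("dest_port"),
        )
        counts[key] += 1
    return counts.most_common(limit)


def make_entry(disposition, reference, src_ip, dest_port):
    """Build a minimal, made-up log entry for the example."""
    return {
        "jsonPayload": {
            "disposition": disposition,
            "rule_details": {"reference": reference},
            "connection": {"src_ip": src_ip, "dest_port": dest_port},
        }
    }


entries = [
    make_entry("DENIED", "network:prod-vpc/firewall:deny-legacy", "10.20.1.5", 443),
    make_entry("DENIED", "network:prod-vpc/firewall:deny-legacy", "10.20.1.5", 443),
    make_entry("ALLOWED", "network:prod-vpc/firewall:allow-app", "10.20.1.6", 443),
]
for (reference, src_ip, port), count in top_denied(entries):
    print(count, reference, src_ip, port)
```

If the script and the model disagree on the top denied flow, I trust neither until I have looked at the raw entries.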

The division of labor

The model is excellent at the parts that are pure mechanical reasoning over config: priority ordering, implied rules, route precedence, log aggregation. It is not a substitute for the data-plane truth that Connectivity Tests give you, and it has no idea what your security posture should be. So I let it form hypotheses and draft rules, I confirm with the platform’s own tooling, and I personally own every priority number and source range that ships.

For the prompts I reuse across networking incidents, see my prompts library, and the rest of the GCP with AI series for adjacent problems. A VPC that fails silently doesn’t have to stay a mystery — it just needs the full rule set in front of a reader patient enough to trace it, and that reader can be AI as long as you stay the one who decides.
