Debugging VPC Connectivity With AI: Routes, NACLs, and Security Groups
Connection timed out, no logs, no clues. Here's how to use AI to reason through VPC routing, NACLs, and security groups so you find the broken layer fast.
A teammate pinged me with the most boring and most maddening AWS error there is: “the app can’t reach the database, it just times out.” No connection refused, no DNS error, just a timeout. In VPC land, a timeout almost always means a packet left somewhere and never came back, and the cause could be in any of six places: the route table, the network ACL on either subnet, the security group on either side, or a missing NAT or internet gateway. The classic way to debug this is to open six console tabs and squint. I’ve started using AI to do the squinting, and it’s cut my time on these from an hour to about ten minutes.
The important framing: AI is not poking your network. It can’t see whether a packet actually flowed. What it can do is reason about layered config faster and more completely than I do under pressure — as long as I feed it the real config and treat its conclusion as a hypothesis I then confirm with a real test.
Dump the actual config, not your mental model
The number one reason these bugs survive is that people debug their belief about the network instead of the network. So step one is always to pull the ground truth for the specific path. Say the app runs in subnet-aaa (instance i-app, security group sg-app) and the DB runs in subnet-bbb (instance i-db, security group sg-db), listening on port 5432.
# Route table for the app subnet
# (an empty result means the subnet has no explicit association and uses the
# VPC's main route table; filter on vpc-id and association.main=true instead)
aws ec2 describe-route-tables \
  --filters Name=association.subnet-id,Values=subnet-aaa \
  --query 'RouteTables[].Routes' --output json

# NACLs for both subnets
aws ec2 describe-network-acls \
  --filters Name=association.subnet-id,Values=subnet-aaa,subnet-bbb \
  --query 'NetworkAcls[].{id:NetworkAclId,entries:Entries}' --output json

# Security groups on both ends
aws ec2 describe-security-groups \
  --group-ids sg-app sg-db \
  --query 'SecurityGroups[].{id:GroupId,in:IpPermissions,out:IpPermissionsEgress}' \
  --output json
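That’s three JSON blobs (one route table, two NACLs, two security groups) that together describe whether a TCP connection from app to DB is even possible. A human has to hold all of it in their head at once, including the fact that NACLs are stateless and security groups are stateful, which is the single most common thing people get wrong. Stateless means the NACL evaluates the reply as a brand-new packet that needs its own rule; stateful means the security group lets replies to an allowed connection through automatically.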
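To make the stateful half of that distinction concrete, here is a minimal Python sketch of how I think about a security group ingress check. It assumes the `IpPermissions` shape that `describe-security-groups` returns; the function name `sg_allows_ingress` and the sample IP address are my own, purely for illustration, and it ignores IPv6 ranges and prefix lists.

```python
import ipaddress


def sg_allows_ingress(permissions, port, source_ip=None, source_sg=None, protocol="tcp"):
    """Return True if any ingress permission admits the connection.

    `permissions` is the IpPermissions list from describe-security-groups.
    Security groups are allow-only and stateful, so one matching ingress
    rule is enough; the reply traffic needs no rule of its own.
    """
    for perm in permissions:
        proto = perm.get("IpProtocol")
        if proto not in ("-1", protocol):
            continue
        # Protocol "-1" means all traffic and carries no port range.
        if proto != "-1" and not (perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535)):
            continue
        # Match on a referenced security group (e.g. "5432 from sg-app").
        if source_sg and any(p.get("GroupId") == source_sg for p in perm.get("UserIdGroupPairs", [])):
            return True
        # Otherwise match on an IPv4 CIDR range.
        if source_ip and any(
            ipaddress.ip_address(source_ip) in ipaddress.ip_network(r["CidrIp"])
            for r in perm.get("IpRanges", [])
        ):
            return True
    return False


# Hypothetical DB security group: 5432 allowed from sg-app only.
db_ingress = [{
    "IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
    "IpRanges": [], "UserIdGroupPairs": [{"GroupId": "sg-app"}],
}]
print(sg_allows_ingress(db_ingress, 5432, source_sg="sg-app"))    # True
print(sg_allows_ingress(db_ingress, 5432, source_ip="10.0.1.25")) # False
```

Note there is no "return path" argument at all. That is the whole point of stateful: if the forward connection is admitted, the reply is too.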
Make AI reason through every layer in order
Paste the JSON in and give it a prompt that forces a layer-by-layer walk rather than a vibe-based answer.
I have a connectivity timeout from an app instance to a database on TCP 5432. Below is the route table for the app subnet, the NACLs for both subnets, and the security groups for both instances. Walk the packet path in order: (1) app SG egress, (2) app subnet route table, (3) app subnet NACL outbound, (4) DB subnet NACL inbound, (5) DB SG ingress, then the return path. Remember NACLs are STATELESS so the return traffic needs an explicit outbound/inbound ephemeral-port rule. For each layer, state ALLOW or BLOCK and why. End with the single most likely broken layer.
The “NACLs are stateless” reminder earns its keep constantly. The model came back with this, and it was right:
Path analysis:
- app SG egress: ALLOW (0.0.0.0/0 all ports)
- app route table: ALLOW (local route covers 10.0.0.0/16)
- app NACL outbound: ALLOW (rule 100, all traffic)
- DB NACL inbound: ALLOW (rule 100 permits 5432 from 10.0.0.0/16)
- DB SG ingress: ALLOW (5432 from sg-app)
- RETURN path, DB NACL outbound: only permits 5432, NOT the ephemeral range 1024–65535. The response packet to the app’s ephemeral source port is BLOCKED here.
Most likely broken layer: DB subnet NACL outbound rules. Add an outbound allow for TCP 1024–65535.
That ephemeral-port-on-a-stateless-NACL bug is exactly the kind of thing I overlook at 2pm on a Friday, because the forward path looks perfect and my eyes glaze over the return path.
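If you want to see why the return path fails without trusting anyone's eyes, the NACL logic is small enough to sketch. NACL rules are evaluated in ascending rule-number order and the first match wins. The Python below is a rough illustration under that rule, assuming the `Entries` shape from `describe-network-acls`; the function name `nacl_decision`, the sample entries, and the app IP are hypothetical, and it only handles IPv4 TCP/UDP.

```python
import ipaddress

PROTOCOL_NUMBERS = {"tcp": "6", "udp": "17"}


def nacl_decision(entries, egress, port, remote_ip, protocol="tcp"):
    """Return "allow" or "deny" for one packet crossing a NACL.

    `remote_ip` is the source for inbound packets and the destination
    for outbound packets; `port` is the packet's destination port.
    """
    proto_num = PROTOCOL_NUMBERS[protocol]
    for entry in sorted(entries, key=lambda e: e["RuleNumber"]):
        if entry["Egress"] != egress:
            continue
        if entry["Protocol"] not in ("-1", proto_num):
            continue
        if "CidrBlock" not in entry:  # skip IPv6-only entries in this sketch
            continue
        if ipaddress.ip_address(remote_ip) not in ipaddress.ip_network(entry["CidrBlock"]):
            continue
        port_range = entry.get("PortRange")
        if port_range and not (port_range["From"] <= port <= port_range["To"]):
            continue
        return entry["RuleAction"]  # first matching rule wins
    return "deny"


# Hypothetical DB subnet NACL: 5432 allowed in both directions, nothing else.
db_nacl = [
    {"RuleNumber": 100, "Egress": False, "Protocol": "6", "RuleAction": "allow",
     "CidrBlock": "10.0.0.0/16", "PortRange": {"From": 5432, "To": 5432}},
    {"RuleNumber": 100, "Egress": True, "Protocol": "6", "RuleAction": "allow",
     "CidrBlock": "10.0.0.0/16", "PortRange": {"From": 5432, "To": 5432}},
    {"RuleNumber": 32767, "Egress": False, "Protocol": "-1", "RuleAction": "deny",
     "CidrBlock": "0.0.0.0/0"},
    {"RuleNumber": 32767, "Egress": True, "Protocol": "-1", "RuleAction": "deny",
     "CidrBlock": "0.0.0.0/0"},
]
app_ip = "10.0.1.25"  # assumed app address inside 10.0.0.0/16

forward = nacl_decision(db_nacl, egress=False, port=5432, remote_ip=app_ip)
reply = nacl_decision(db_nacl, egress=True, port=49152, remote_ip=app_ip)
print(forward, reply)  # allow deny
```

The forward packet targets 5432 and is allowed. The reply targets the app's ephemeral port (49152 here), matches nothing but the default deny, and vanishes. That is the timeout.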
Confirm with a real test — never trust the reasoning alone
Here’s the discipline: the model gave me a hypothesis, not a fact. Before I touch a single NACL rule, I confirm with VPC Reachability Analyzer, which actually models the path AWS-side:
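# 1. Define the path from app to DB on TCP 5432; prints a path ID (nip-...)
aws ec2 create-network-insights-path \
  --source i-app --destination i-db \
  --destination-port 5432 --protocol tcp \
  --query 'NetworkInsightsPath.NetworkInsightsPathId' --output text

# 2. Start an analysis with the path ID from step 1; prints an analysis ID (nia-...)
aws ec2 start-network-insights-analysis \
  --network-insights-path-id nip-0123456789 \
  --query 'NetworkInsightsAnalysis.NetworkInsightsAnalysisId' --output text

# 3. Read the result; the analysis runs asynchronously, so wait for status "succeeded"
aws ec2 describe-network-insights-analyses \
  --network-insights-analysis-ids nia-0123456789 \
  --query 'NetworkInsightsAnalyses[0].{status:Status,path:NetworkPathFound,blocked:Explanations}'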
If NetworkPathFound is false, the Explanations array tells you the exact component and rule that blocks it. When that matches the model’s conclusion, I have two independent sources agreeing and I fix with confidence. When they disagree, the analyzer wins: it evaluates the configuration the way AWS does, while the model is only reading JSON.
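The "do the two sources agree?" check can be made mechanical. Here is a small Python sketch that takes one analysis object (the first element of `NetworkInsightsAnalyses`, saved as JSON) and compares the blocking components against the resource ID the model blamed. The field names `Acl`, `SecurityGroup`, `RouteTable`, and `Component` reflect my understanding of the Explanations shape, and the sample explanation code is illustrative, so check both against your real output. The function name `summarize_analysis` is my own.

```python
def summarize_analysis(analysis, hypothesis_id=None):
    """Summarize a Reachability Analyzer result and compare it to a hypothesis.

    `analysis` is one element of NetworkInsightsAnalyses.
    `hypothesis_id` is the resource ID the model named (e.g. a NACL ID).
    """
    blockers = []
    for explanation in analysis.get("Explanations", []):
        # Take the first component field that is present on this explanation.
        for key in ("Acl", "SecurityGroup", "RouteTable", "Component"):
            component = explanation.get(key)
            if component and component.get("Id"):
                blockers.append((explanation.get("ExplanationCode"), component["Id"]))
                break
    blocked_ids = {component_id for _, component_id in blockers}
    agrees = (hypothesis_id in blocked_ids) if hypothesis_id else None
    return {
        "path_found": analysis.get("NetworkPathFound", False),
        "blockers": blockers,
        "agrees_with_model": agrees,
    }


# Illustrative sample, not real analyzer output.
sample = {
    "NetworkPathFound": False,
    "Explanations": [{"ExplanationCode": "SUBNET_ACL_RESTRICTION", "Acl": {"Id": "acl-db"}}],
}
result = summarize_analysis(sample, hypothesis_id="acl-db")
print(result)
```

If `agrees_with_model` comes back False, I go with the analyzer's blockers and treat the model's answer as a miss worth understanding.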
Fix narrowly, then re-verify
The fix is one targeted rule, not a “let me just open it up” panic edit:
# Allow replies from the DB subnet to ephemeral ports, inside the VPC only
aws ec2 create-network-acl-entry \
  --network-acl-id acl-db --rule-number 110 \
  --protocol tcp --port-range From=1024,To=65535 \
  --cidr-block 10.0.0.0/16 --egress --rule-action allow
Then re-run the analysis. NetworkPathFound: true is the receipt. I’ve watched people “fix” this by adding 0.0.0.0/0 all traffic to the NACL, which works and also quietly turns a stateless firewall into a no-op. AI is useful here too: ask it to review your proposed rule for over-broadness before you apply it.
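A cheap guard against the panic edit is a pre-flight check you run on the proposed rule before applying it. The sketch below is my own rough heuristic in Python, not anything AWS provides: the function name `review_nacl_rule` and the specific warnings are assumptions about what "too broad" means for this case.

```python
import ipaddress


def review_nacl_rule(cidr, protocol, port_from=None, port_to=None):
    """Return a list of warnings for an over-broad NACL allow rule."""
    warnings = []
    if ipaddress.ip_network(cidr).prefixlen == 0:
        warnings.append("CIDR covers every address")
    if protocol in ("-1", "all"):
        warnings.append("rule covers every protocol")
    elif port_from is None or port_to is None:
        warnings.append("no port range given")
    elif port_from == 0 and port_to == 65535:
        warnings.append("port range covers every port")
    return warnings


# The narrow fix from above: no warnings.
print(review_nacl_rule("10.0.0.0/16", "tcp", 1024, 65535))  # []
# The "just open it up" edit: flagged twice.
print(review_nacl_rule("0.0.0.0/0", "-1"))
```

An empty list does not mean the rule is correct, only that it is not obviously a no-op firewall. The Reachability Analyzer re-run is still the receipt.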
The internet-bound variant: NAT, IGW, and DNS
Subnet-to-subnet timeouts are one family; “my private instance can’t reach the internet” is the other, and the layers are different enough that you should tell the model which problem you have. For an egress-to-internet failure, gather the route table plus whether the instance has a public IP and whether DNS is even resolving:
# Does the private subnet route 0.0.0.0/0 to a NAT gateway?
aws ec2 describe-route-tables \
  --filters Name=association.subnet-id,Values=subnet-private \
  --query 'RouteTables[].Routes[?DestinationCidrBlock==`0.0.0.0/0`]'

# Is DNS resolution enabled on the VPC?
# (describe-vpc-attribute takes one attribute per call; run it again with
# --attribute enableDnsHostnames to check hostnames)
aws ec2 describe-vpc-attribute --vpc-id vpc-abc --attribute enableDnsSupport
Then ask the model to reason about the egress path specifically: a private instance needs a 0.0.0.0/0 route to a NAT gateway (not an internet gateway — that’s for public subnets), and the NAT itself must live in a public subnet with an IGW route. The classic self-inflicted bug is putting the NAT gateway in the same private subnet it’s supposed to serve, which the model catches immediately because the routing is circular. It’s equally good at distinguishing “no route to internet” from “route is fine but DNS is broken” — two symptoms that both present as a timeout but have nothing to do with each other. The same Reachability Analyzer confirmation applies: set the destination to an external IP and let AWS tell you exactly where the path dies.
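The routing half of that reasoning is simple enough to sketch in Python. This assumes a flat list of route objects as they appear under `Routes` in `describe-route-tables` output (the query above returns one list per route table, so flatten it first). The function names and the sample gateway IDs are hypothetical.

```python
def classify_default_route(routes):
    """Classify where 0.0.0.0/0 points: nat, igw, blackhole, other, or none."""
    for route in routes:
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("State") == "blackhole":
            return "blackhole"  # the target was deleted; traffic is dropped
        if route.get("NatGatewayId"):
            return "nat"
        if route.get("GatewayId", "").startswith("igw-"):
            return "igw"
        return "other"
    return "none"


def nat_egress_looks_sane(private_routes, nat_subnet_routes):
    """Private subnet must default to a NAT; the NAT's subnet must default to an IGW."""
    return (classify_default_route(private_routes) == "nat"
            and classify_default_route(nat_subnet_routes) == "igw")


local = {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"}
private_routes = [local, {"DestinationCidrBlock": "0.0.0.0/0",
                          "NatGatewayId": "nat-0abc", "State": "active"}]
public_routes = [local, {"DestinationCidrBlock": "0.0.0.0/0",
                         "GatewayId": "igw-0abc", "State": "active"}]

print(nat_egress_looks_sane(private_routes, public_routes))   # True
# NAT placed in the private subnet it serves: its own default route is the NAT.
print(nat_egress_looks_sane(private_routes, private_routes))  # False
```

This only covers routing. A passing check with a still-failing `curl` is a strong hint to look at DNS next.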
Where this leaves you
AI is a fast, tireless reader of layered network config, and it knows the stateless-vs-stateful gotchas better than a tired human does. But it is reasoning about config, not observing packets — so the workflow is always: real config in, layer-by-layer hypothesis out, Reachability Analyzer to confirm, narrow fix, re-verify. That keeps you in control of the actual change while letting AI do the part that’s just careful reading.
If VPC and networking debugging is a recurring tax for you, the rest of the AWS guides cover adjacent failure modes, and I keep reusable troubleshooting prompts like the one above in the prompts collection.