Designing AWS Transit Gateway Architectures With AI

Transit Gateway looks simple in the marketing diagram: one hub, many spokes, done. Then you actually build it and discover that the entire behavior of the thing lives in the difference between two words that sound nearly interchangeable: association and propagation. Get those backwards and you’ll have an attachment that can send traffic but whose replies have no route back, or worse, a default route table that quietly stitches every VPC to every other VPC, including the ones that were supposed to be isolated. I’ve debugged enough “why can prod reach the sandbox” tickets to treat TGW route tables as the most security-sensitive routing surface in a multi-account network.

AI is a strong collaborator here precisely because the design is so mechanical and so easy to get subtly wrong. The model can draft the route table layout, generate the Terraform, and explain which attachment associates to which table and what propagates where. What it cannot do is know your trust boundaries. So I let AI draft the segmentation and the IaC, and I personally verify every association-and-propagation pair against the isolation rules the network is supposed to enforce.

The two-word model that runs everything

Every TGW attachment does two independent things. It associates with at most one route table, and that table is the one the gateway consults for traffic arriving from the attachment, so it decides where the attachment can send. And it propagates its own routes into zero or more route tables, which decides who can reach it. Once that clicks, segmentation is just a matter of controlling which attachments propagate into which tables.

A clean segmented design usually means disabling the default association and propagation on the gateway itself, then building purpose-named route tables by hand.

resource "aws_ec2_transit_gateway" "hub" {
  description                     = "central-hub"
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"
  auto_accept_shared_attachments  = "disable"
  dns_support                     = "enable"
}

resource "aws_ec2_transit_gateway_route_table" "prod" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  tags = { Name = "rt-prod" }
}

resource "aws_ec2_transit_gateway_route_table" "shared_services" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  tags = { Name = "rt-shared-services" }
}

Disabling the defaults is the single most important decision in the whole build. Leave them enabled and every new attachment is associated with the one default route table and propagates its routes into it, which is the textbook recipe for accidental any-to-any reachability.
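The isolation pattern in this article also relies on a dev table, rt-dev, which the snippet above does not define. Here is a minimal sketch of it, together with a VPC attachment that opts out of the default table explicitly. The resource name "dev" and the variables var.dev_vpc_id and var.dev_tgw_subnet_ids are placeholders I am assuming for illustration, and the sketch assumes the attachment is created in the same account as the gateway.

```hcl
# Placeholder inputs (hypothetical names, not from the original configuration).
variable "dev_vpc_id" { type = string }
variable "dev_tgw_subnet_ids" { type = list(string) }

# Assumed: a dev table that mirrors rt-prod.
resource "aws_ec2_transit_gateway_route_table" "dev" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  tags = { Name = "rt-dev" }
}

# Hypothetical dev attachment against the hub defined earlier.
resource "aws_ec2_transit_gateway_vpc_attachment" "dev" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  vpc_id             = var.dev_vpc_id
  subnet_ids         = var.dev_tgw_subnet_ids

  # Stated explicitly so the attachment never depends on a gateway-level default.
  transit_gateway_default_route_table_association = false
  transit_gateway_default_route_table_propagation = false

  tags = { Name = "attach-dev" }
}
```

With both flags false, the attachment has no association and no propagation until separate association and propagation resources are written for it, which is the point: every reachability decision is an explicit line in the plan.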

Segmentation through propagation, not security groups

The pattern I reach for most is a hub-and-spoke where production and non-production are fully isolated from each other but both reach a shared-services VPC (think DNS resolvers, a central egress appliance, identity). You express that entirely through where attachments associate and propagate.

# Prod VPC attaches, associates to the prod table...
resource "aws_ec2_transit_gateway_route_table_association" "prod_assoc" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.prod.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}

# ...and propagates ONLY into shared-services, so dev never learns prod's CIDR.
resource "aws_ec2_transit_gateway_route_table_propagation" "prod_to_shared" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.prod.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.shared_services.id
}

The shared-services attachment, by contrast, propagates into both the prod and dev tables, because everyone is allowed to reach it. Prod and dev never propagate into each other’s tables, so there is no route between them — full stop. This is the part I most want AI to reason about out loud, because a missing or extra propagation resource is invisible in a diagram but decisive in production.
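The shared-services side of that paragraph might look like the sketch below. It assumes an attachment named aws_ec2_transit_gateway_vpc_attachment.shared_services and a dev table named aws_ec2_transit_gateway_route_table.dev, neither of which appears in the snippets above, so treat both names as placeholders.

```hcl
# Shared-services sends traffic using its own table...
resource "aws_ec2_transit_gateway_route_table_association" "shared_assoc" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.shared_services.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.shared_services.id
}

# ...and advertises its CIDR into each spoke table that is allowed to reach it.
resource "aws_ec2_transit_gateway_route_table_propagation" "shared_to_prod" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.shared_services.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}

# Assumed dev table; dev's own association and propagation would mirror prod's.
resource "aws_ec2_transit_gateway_route_table_propagation" "shared_to_dev" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.shared_services.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.dev.id
}
```

The isolation lives in what is absent: there is no propagation resource pairing the prod attachment with the dev table, or the dev attachment with the prod table. A code review that greps for route_table_propagation resources sees the whole reachability matrix.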

Make the AI defend the isolation

When I hand AI a proposed attachment-to-table matrix, I don’t ask “is this right.” I ask it to prove a specific reachability claim, because that forces it to trace propagation rather than pattern-match.

Prompt: Prod attachment associates to rt-prod and propagates into rt-shared-services only. Dev attachment associates to rt-dev and propagates into rt-shared-services only. Shared-services attachment associates to rt-shared-services and propagates into rt-prod and rt-dev. Can a host in dev reach a host in prod over the Transit Gateway? Walk the routes.

Reply: No. For dev to reach prod, two things must both be true. First, dev’s traffic must have a route to the prod CIDR — that requires prod’s attachment to propagate into rt-dev, which dev’s attachment is associated with. It does not; prod propagates only into rt-shared-services. So rt-dev has no entry for the prod CIDR and the packet is dropped at the gateway. Second, even if the forward route existed, the return path requires dev’s CIDR to be present in rt-prod, which is also absent. The design isolates prod and dev correctly. One caveat I cannot verify from this matrix: confirm that neither VPC has a static route or a peering connection outside the Transit Gateway that bypasses this segmentation, and confirm the subnet route tables point the relevant CIDRs at the TGW in the first place.

That last sentence is why I keep the model on a leash and also why it’s useful — it correctly flags that TGW route tables aren’t the whole story. VPC subnet route tables and out-of-band peering can defeat the segmentation, and only a human who knows the full topology can clear that.
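The subnet route table caveat sits outside the gateway's own configuration, so it is easy to forget. As a rough sketch, assuming a prod private subnet route table and a shared-services CIDR passed in as variables (both variable names are placeholders of mine), the VPC-side route might look like this:

```hcl
# Placeholder inputs (hypothetical names, not from the original configuration).
variable "prod_private_route_table_id" { type = string }
variable "shared_services_cidr" { type = string }

# VPC-side route: without it, prod traffic for shared-services never reaches
# the Transit Gateway, whatever the TGW route tables say.
resource "aws_route" "prod_to_shared_services" {
  route_table_id         = var.prod_private_route_table_id
  destination_cidr_block = var.shared_services_cidr
  transit_gateway_id     = aws_ec2_transit_gateway.hub.id

  # The route can only be created once the VPC is attached to the gateway.
  depends_on = [aws_ec2_transit_gateway_vpc_attachment.prod]
}
```

Keeping the destination as narrow as the shared-services CIDR means the subnet table and the TGW table tell the same story, which makes the reachability proof easier to check by hand.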

Inspecting the live state

Terraform tells you intent; the gateway tells you reality. After any change I diff the propagated routes against what I expect, and AI is handy for summarizing the dump into a reachability matrix I can eyeball.

aws ec2 search-transit-gateway-routes \
  --transit-gateway-route-table-id tgw-rtb-0prod1234567890 \
  --filters "Name=state,Values=active" \
  --query 'Routes[].{Cidr:DestinationCidrBlock,Type:Type,Attachment:TransitGatewayAttachments[0].TransitGatewayAttachmentId}' \
  --output table

If a CIDR shows up in rt-prod that I didn’t intend, the Type column tells me whether it came from a propagation or a static route, and either way I have to hunt it down before anything else ships.

Sharing the gateway across accounts

In a multi-account org the gateway lives in a central networking account and is shared to spoke accounts via Resource Access Manager. The spoke account then creates its own VPC attachment against the shared gateway. I keep auto_accept_shared_attachments disabled so the networking account explicitly accepts each attachment. That acceptance is the chokepoint where a human confirms a new account is supposed to join.

aws ram create-resource-share \
  --name tgw-share-core \
  --resource-arns arn:aws:ec2:us-east-1:111122223333:transit-gateway/tgw-0abc123def456 \
  --principals arn:aws:organizations::111122223333:ou/o-exampleorgid/ou-examplerootid-exampleouid

Sharing to an OU rather than individual account IDs keeps the share stable as accounts come and go, but it also means any new account in that OU can request an attachment — which is exactly why manual acceptance stays on. AI will cheerfully suggest auto-accept to “reduce friction,” and that’s a suggestion I override every time.
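If the acceptance step is itself managed in Terraform, it could be sketched as below, applied from the networking account. The variable var.spoke_attachment_id is a placeholder I am assuming for the ID of the pending attachment a spoke account has requested. The human gate then becomes the review of the change that adds this resource.

```hcl
# Placeholder input (hypothetical name): the pending attachment created by a spoke account.
variable "spoke_attachment_id" { type = string }

# Runs in the central networking account, which owns the gateway.
resource "aws_ec2_transit_gateway_vpc_attachment_accepter" "spoke" {
  transit_gateway_attachment_id = var.spoke_attachment_id

  # Keep the accepted attachment out of any default table; its association
  # and propagations are added afterwards as explicit resources.
  transit_gateway_default_route_table_association = false
  transit_gateway_default_route_table_propagation = false

  tags = { Name = "attach-spoke-accepted" }
}
```

Accepting an attachment this way grants no reachability on its own. Until association and propagation resources exist for it, the new account is attached but cannot route anywhere.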

The division of labor

Transit Gateway is the rare design where AI’s mechanical strength and its blind spot are both maximized. It drafts the route tables, writes the propagation resources, and traces reachability claims faster and more patiently than I will. But it does not know which networks must never talk, and it cannot see the out-of-band paths that bypass the gateway. So it drafts; I verify every association-and-propagation pair against the trust boundaries, and I own the RAM acceptance gate.

For the broader account and connectivity picture, more networking-focused pieces live under the AWS category, and I keep the reachability-proof prompts from this workflow in the prompt library so the same isolation questions get asked of every new attachment.