You are a senior AWS DNS and reliability engineer. You reason about Route 53 by separating the routing policy (which record wins) from health checks (which targets are eligible) from resolver caching (why clients still hit the old answer), and you confirm answers with `dig` against the Route 53 nameservers before changing records.

I will provide:
- The hosted zone, record name, and the routing policy in use (simple, failover, latency, weighted, geolocation, geoproximity, multivalue): [RECORDS_AND_POLICY]
- The record set details (values/alias targets, set identifiers, weights, regions, TTLs): [RECORD_SETS]
- The health check config (type HTTP/HTTPS/TCP/calculated/CloudWatch, endpoint, interval, failure threshold, string match, inverted): [HEALTH_CHECKS]
- The symptom (traffic to wrong region, no failover, failover that won't fail back, an endpoint marked unhealthy, intermittent resolution): [SYMPTOM]
- Observed `dig` answers and Route 53 health-check status / CloudWatch metrics: [DIG_AND_STATUS]

Do the following, numbered:
1. Confirm which routing policy governs the record and restate its selection rule: failover returns primary while healthy, then secondary; latency returns the lowest-latency region from the client; weighted distributes by weight ratio; geolocation/geoproximity by client location; multivalue returns up to eight healthy records. Name the rule before judging the answer.
2. Verify health-check coverage. Each routing record that should be conditional MUST be associated with a health check (or be an alias with `EvaluateTargetHealth` enabled), or Route 53 treats it as always healthy. Identify any record that is missing its association, a classic reason failover never triggers.
3. Inspect the health check itself: the endpoint and path actually return 2xx/3xx, the string match (if used) appears in the first 5120 bytes of the response body, the failure threshold and 30s/10s request interval match the desired detection time, HTTPS checks can complete the TLS handshake (Route 53 health checkers do not validate the certificate, so an expired or invalid cert does not by itself fail the check), and "inverted" is set correctly. Distinguish a genuinely down target from a misconfigured check.
4. For failover that won't fail back, confirm the primary's health check recovered and that no manual or calculated check is pinning it unhealthy; for failover that won't trigger, confirm the primary record is associated with a failing check, not just defined.
5. Account for caching. The record TTL plus downstream resolver caching means clients keep the prior answer until the TTL expires, so fast failover requires a low TTL (e.g., 60s). Query the Route 53 nameservers directly with `dig @<ns> <name>` to see the authoritative answer versus what a cached resolver returns.
6. For latency/geolocation surprises, confirm there is a record for the client's region (geolocation needs a default record or some clients get NODATA) and that latency is measured AWS-region-to-client, not literal geographic distance.

Output as: (a) the governing policy and its selection rule, (b) the authoritative `dig` answer versus the expected answer, (c) the health-check or association gap that explains the wrong/unhealthy result, (d) the minimal record or health-check fix, (e) a verification step (`dig @<route53-ns>` plus watch the health-check status flip).

Keep TTLs and failure thresholds deliberate: do not drop TTL to a few seconds globally without weighing query cost and resolver load. Review every record-set and health-check change in a staging zone or with a tested plan before applying to the production hosted zone.

Why this prompt works

Route 53 problems are deceptive because three independent systems decide what a client ultimately connects to: the routing policy picks a candidate record, health checks decide which candidates are eligible, and resolver caching governs how long the old answer survives. An engineer who sees traffic going to the wrong region naturally suspects the routing policy, when the real cause is often a health check that was never associated with the record or a TTL that is still serving the previous answer. This prompt separates those three systems and requires the model to name the policy’s selection rule before judging any answer, so the diagnosis starts from how Route 53 actually chooses rather than from a guess.
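As a rough illustration of how the first two layers compose, the sketch below models weighted routing in plain Python: health checks first narrow the set of eligible records, and only then does the weight ratio apply. It is a simplified model, not Route 53's implementation. The `WeightedRecord` class, the `expected_shares` function, and the set identifiers are hypothetical names chosen for the example, weights are assumed to be positive, and the fallback to all records when none is healthy reflects documented Route 53 behavior.

```python
from dataclasses import dataclass


@dataclass
class WeightedRecord:
    set_id: str            # hypothetical set identifier
    weight: int            # assumed positive in this sketch
    healthy: bool = True   # outcome of the associated health check


def expected_shares(records: list[WeightedRecord]) -> dict[str, float]:
    """Return the expected fraction of answers each record receives."""
    # Layer 1 (health checks): only healthy records are eligible.
    eligible = [r for r in records if r.healthy]
    # If no record is healthy, all records are treated as healthy,
    # because something still has to be returned.
    if not eligible:
        eligible = list(records)
    # Layer 2 (routing policy): weight divided by the sum of eligible weights.
    total = sum(r.weight for r in eligible)
    return {r.set_id: r.weight / total for r in eligible}


records = [WeightedRecord("us-east-1", 3), WeightedRecord("eu-west-1", 1)]
print(expected_shares(records))   # {'us-east-1': 0.75, 'eu-west-1': 0.25}

records[0].healthy = False        # the us-east-1 health check fails
print(expected_shares(records))   # {'eu-west-1': 1.0}
```

The same weights produce a different traffic split once a health check changes eligibility, which is why the selection rule has to be stated alongside the health status rather than on its own.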

The health-check association gap is one of the most common and most damaging failover bugs. A primary failover record that is defined but not associated with a health check (or an alias record without `EvaluateTargetHealth` enabled) is treated by Route 53 as permanently healthy, so the secondary never receives traffic no matter how dead the primary is. By forcing an explicit check of the association and of the health check's own configuration (string match within the first 5120 bytes, failure threshold, inverted flag, and a working TLS handshake for HTTPS checks, since Route 53 does not validate the certificate itself), the prompt catches the "failover that never triggers" and "failover that won't fail back" cases that otherwise surface only during a real outage.
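The association gap can be made concrete with a small model. The sketch below is a simplification under stated assumptions: `FailoverRecord`, `is_eligible`, `failover_answer`, the health check ID `hc-primary`, and the documentation-range IP addresses are all hypothetical, and alias records with `EvaluateTargetHealth` are not modeled. It encodes two documented behaviors: a record with no associated health check counts as healthy, and the primary is returned when both records are unhealthy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailoverRecord:
    role: str                              # "PRIMARY" or "SECONDARY"
    value: str                             # illustrative IP address
    health_check_id: Optional[str] = None  # None means no association


def is_eligible(record: FailoverRecord, check_status: dict[str, bool]) -> bool:
    # A record with no associated health check is always treated as healthy.
    if record.health_check_id is None:
        return True
    return check_status.get(record.health_check_id, False)


def failover_answer(primary: FailoverRecord, secondary: FailoverRecord,
                    check_status: dict[str, bool]) -> str:
    if is_eligible(primary, check_status):
        return primary.value
    if is_eligible(secondary, check_status):
        return secondary.value
    # Both unhealthy: the primary is returned.
    return primary.value


status = {"hc-primary": False}  # the primary endpoint is actually down
secondary = FailoverRecord("SECONDARY", "198.51.100.20")

# Health check exists but was never associated with the record: no failover.
unassociated = FailoverRecord("PRIMARY", "192.0.2.10")
print(failover_answer(unassociated, secondary, status))  # 192.0.2.10

# Same check, now associated with the primary record: failover triggers.
associated = FailoverRecord("PRIMARY", "192.0.2.10", "hc-primary")
print(failover_answer(associated, secondary, status))    # 198.51.100.20
```

The two calls differ only in whether the failing check is attached to the primary record, which is the gap step 2 of the prompt asks the model to look for.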

Caching is where correct fixes still appear to fail. Because the record TTL plus downstream resolver caching keeps clients on the prior answer, a perfectly correct record change can look broken for minutes. The prompt insists on querying the Route 53 nameservers directly with dig to separate the authoritative answer from a stale cached one, and it treats TTL as a deliberate reliability-versus-cost tradeoff rather than a value to slam to a few seconds. That keeps the engineer making informed decisions and verifying the authoritative result before trusting any change in production.
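To see why a correct change can look broken, it may help to model a caching resolver with an explicit clock. The sketch below is a toy model, not a real resolver: the `CachingResolver` class, the `zone` dictionary standing in for the authoritative nameserver, the record name, and the addresses are all assumptions made for illustration, and real resolvers may also clamp or otherwise adjust TTLs.

```python
class CachingResolver:
    """Toy recursive resolver that caches answers until their TTL expires."""

    def __init__(self, zone: dict[str, tuple[str, int]]):
        self.zone = zone  # stands in for the authoritative server
        self.cache: dict[str, tuple[str, float]] = {}

    def resolve(self, name: str, now: float) -> str:
        entry = self.cache.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]  # cached answer, possibly stale
        value, ttl = self.zone[name]  # ask the "authoritative" source
        self.cache[name] = (value, now + ttl)
        return value


zone = {"app.example.com": ("192.0.2.10", 60)}  # (answer, TTL in seconds)
resolver = CachingResolver(zone)

print(resolver.resolve("app.example.com", now=0))   # 192.0.2.10

# At t=10 the authoritative answer changes (for example, failover fires).
zone["app.example.com"] = ("198.51.100.20", 60)

print(zone["app.example.com"][0])                   # 198.51.100.20 (authoritative)
print(resolver.resolve("app.example.com", now=30))  # 192.0.2.10 (stale cache)
print(resolver.resolve("app.example.com", now=60))  # 198.51.100.20 (TTL expired)
```

Reading `zone` directly plays the role of `dig @<ns> <name>` here: it shows the authoritative answer, while the resolver keeps serving the old one until the cached entry expires.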
