Automation Retry-Budget and Timeout Topology Design Prompt
Design end-to-end timeout and retry budgets across a multi-hop automation chain (trigger to queue to worker to downstream API) so retries do not stack into retry storms, exceed the caller's deadline, or hammer a degraded dependency.
- Target user: Platform engineers tuning resilient automation pipelines
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior reliability engineer who has traced a cascading outage to nested retries that multiplied one failed request into thousands of downstream calls.

I will provide:
- The hops in the automation chain and each hop's current timeout and retry config
- The end-to-end deadline the trigger expects and any SLA/SLO on the action
- Which downstream dependencies are idempotent and which are not
- Known failure modes (transient blips, sustained degradation, rate limits)

Your job:
1. **Deadline propagation**: set a total time budget at the trigger and show how each hop receives a shrinking remaining deadline, so no inner hop outlives the caller.
2. **Retry budget allocation**: assign a bounded retry count and backoff (with jitter) per hop, and prove the worst-case attempt tree stays within the deadline rather than exploding multiplicatively.
3. **Where NOT to retry**: identify hops that must fail fast (non-idempotent writes, callers that have already timed out) and replace blind retries with a single attempt plus escalation.
4. **Circuit breaking and load shedding**: define breaker thresholds and a shed/queue-drop policy so a sustained dependency failure stops generating new attempts.
5. **Retry storm prevention**: add a token-bucket or adaptive retry budget at the client so retries are capped as a fraction of total traffic, not per request.
6. **Dead-letter and human handoff**: specify when an exhausted request goes to a DLQ or pages an operator instead of looping forever.
7. **Validation**: describe a fault-injection test plan (latency, errors, partial failure) that proves the topology holds under each failure mode.

Output as: a per-hop budget table (hop | timeout | max retries | backoff | retry? y/n), a worst-case attempt-tree calculation, and a fault-injection test matrix.

For any non-idempotent hop, require that retries be disabled or guarded by an idempotency key, and document the manual reconciliation step for a request that may have partially applied before timeout.
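
The worst-case attempt-tree calculation the prompt asks for can be sketched in Python. This is a minimal illustration, not a definitive implementation. The `Hop` fields, the hop names, and every timeout and retry number are hypothetical. It assumes capped exponential backoff (jitter can only shorten a delay, so the cap is a safe upper bound) and that each hop waits synchronously on the next one, so an inner hop's full retry sequence has to fit inside one attempt of its caller.

```python
from dataclasses import dataclass, replace
from math import prod


@dataclass(frozen=True)
class Hop:
    name: str
    timeout_s: float       # per-attempt timeout
    max_retries: int       # retries in addition to the first attempt
    backoff_base_s: float  # base delay for exponential backoff
    backoff_cap_s: float   # upper bound on any single backoff delay


def worst_case_attempts(hops):
    """Calls reaching the last hop if every hop exhausts its retries."""
    return prod(h.max_retries + 1 for h in hops)


def worst_case_hop_time_s(hop):
    """Upper bound on time spent in one hop: all attempts plus all backoffs."""
    backoff = sum(
        min(hop.backoff_cap_s, hop.backoff_base_s * 2 ** i)
        for i in range(hop.max_retries)
    )
    return (hop.max_retries + 1) * hop.timeout_s + backoff


def budget_violations(hops, deadline_s):
    """Return a list of problems; an empty list means the budgets nest."""
    problems = []
    limit = deadline_s  # time the enclosing caller is willing to wait
    for hop in hops:
        total = worst_case_hop_time_s(hop)
        if total > limit:
            problems.append(
                f"{hop.name}: worst case {total:.1f}s exceeds {limit:.1f}s"
            )
        # Inner hops must finish within a single attempt of this hop.
        limit = hop.timeout_s
    return problems


# Hypothetical budgeted chain: one retry per hop, shrinking timeouts.
chain = [
    Hop("trigger", timeout_s=30.0, max_retries=0, backoff_base_s=0.0, backoff_cap_s=0.0),
    Hop("queue", timeout_s=12.0, max_retries=1, backoff_base_s=1.0, backoff_cap_s=4.0),
    Hop("worker", timeout_s=5.0, max_retries=1, backoff_base_s=0.5, backoff_cap_s=2.0),
    Hop("downstream_api", timeout_s=2.0, max_retries=1, backoff_base_s=0.2, backoff_cap_s=1.0),
]

# Hypothetical "before" config: three retries stacked at every inner hop.
stacked = [h if h.name == "trigger" else replace(h, max_retries=3) for h in chain]

if __name__ == "__main__":
    for label, hops in (("budgeted", chain), ("stacked", stacked)):
        print(label, "attempts:", worst_case_attempts(hops))
        for problem in budget_violations(hops, deadline_s=30.0):
            print("  ", problem)
```

With these illustrative numbers the budgeted chain produces at most 2 × 2 × 2 = 8 downstream calls and every hop fits inside its caller's per-attempt timeout, while the stacked configuration produces 4 × 4 × 4 = 64 calls and breaks the budget at every inner hop. The same multiplication is what the per-hop budget table should make visible.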