Automation Retry-Budget and Timeout Topology Design Prompt

Design end-to-end timeout and retry budgets across a multi-hop automation chain (trigger to queue to worker to downstream API) so retries do not stack into retry storms, exceed the caller's deadline, or hammer a degraded dependency.

Target user: Platform engineers tuning resilient automation pipelines
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior reliability engineer who has traced a cascading outage to nested retries that multiplied one failed request into thousands of downstream calls.

I will provide:
- The hops in the automation chain and each hop's current timeout and retry config
- The end-to-end deadline the trigger expects and any SLA/SLO on the action
- Which downstream dependencies are idempotent and which are not
- Known failure modes (transient blips, sustained degradation, rate limits)

Your job:

1. **Deadline propagation** — set a total time budget at the trigger and show how each hop receives a shrinking remaining-deadline, so no inner hop outlives the caller.
2. **Retry budget allocation** — assign a bounded retry count and backoff (with jitter) per hop, and prove the worst-case attempt tree stays within the deadline rather than exploding multiplicatively.
3. **Where NOT to retry** — identify hops that must fail fast (non-idempotent writes, already-timed-out callers) and replace blind retries with a single attempt plus escalation.
4. **Circuit breaking and load shedding** — define breaker thresholds and a shed/queue-drop policy so a sustained dependency failure stops generating new attempts.
5. **Retry storm prevention** — add a token-bucket or adaptive retry budget at the client so retries are capped as a fraction of total traffic, not per-request.
6. **Dead-letter and human handoff** — specify when an exhausted request goes to a DLQ or pages an operator instead of looping forever.
7. **Validation** — describe a fault-injection test plan (latency, errors, partial failure) that proves the topology holds under each failure mode.
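
As a reference point for items 1 and 5, here is a minimal in-process Python sketch of a shrinking deadline and a client-side retry budget. The class names (`Deadline`, `RetryBudget`), the `reserve_s` parameter, and the default ratio are assumptions made for illustration, not part of any specific library. Across real process boundaries a hop would typically forward the remaining time rather than a local clock value.

```python
import time


class Deadline:
    """Total time budget set at the trigger (hypothetical helper)."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        # Never negative: zero means the caller has already given up.
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_hop(self, hop_cap_s, reserve_s=0.0):
        # A hop may use at most its own cap, and never more than what is
        # left after reserving time for the caller to handle the response.
        return max(0.0, min(hop_cap_s, self.remaining() - reserve_s))


class RetryBudget:
    """Token bucket that caps retries at roughly `ratio` of first attempts."""

    def __init__(self, ratio=0.1, max_tokens=10.0):
        self.ratio = ratio
        self.max_tokens = max_tokens
        self.tokens = 0.0

    def record_request(self):
        # Every first attempt earns a fraction of one retry.
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_acquire_retry(self):
        # A retry spends one whole token; with none left, fail fast.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In this sketch a hop calls `timeout_for_hop` immediately before each attempt, so a retry late in the budget gets a shorter timeout or none at all. The budget is shared by all requests on a client, which is what keeps retries proportional to traffic instead of to the number of failing requests.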

Output as: a per-hop budget table (hop | timeout | max retries | backoff | retry? y/n), a worst-case attempt-tree calculation, and a fault-injection test matrix.
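
The worst-case attempt-tree calculation can be sketched in Python as follows. This is an illustration under stated assumptions: the `Hop` fields, the exponential backoff with a cap, and the example numbers are hypothetical and do not describe any real pipeline. It assumes each hop retries independently, so attempts multiply down the chain.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Hop:
    name: str
    timeout_s: float           # per-attempt timeout
    max_retries: int           # retries after the first attempt
    backoff_base_s: float = 0.0
    backoff_cap_s: float = 0.0


def worst_case_attempts(hops):
    # Nested retries multiply: each hop runs (1 + retries) attempts for
    # every single attempt made by the hop above it.
    return prod(1 + h.max_retries for h in hops)


def worst_case_hop_time(hop):
    # Upper bound for one hop: every attempt times out, and every backoff
    # (exponential, capped) takes its maximum value even with jitter.
    backoffs = sum(
        min(hop.backoff_cap_s, hop.backoff_base_s * 2 ** i)
        for i in range(hop.max_retries)
    )
    return (1 + hop.max_retries) * hop.timeout_s + backoffs


# Hypothetical chain, outermost hop first.
chain = [
    Hop("queue redelivery", timeout_s=30.0, max_retries=2),
    Hop("worker", timeout_s=10.0, max_retries=3,
        backoff_base_s=0.5, backoff_cap_s=4.0),
    Hop("api client", timeout_s=2.0, max_retries=3,
        backoff_base_s=0.125, backoff_cap_s=1.0),
]

total_calls = worst_case_attempts(chain)     # 3 * 4 * 4 = 48
worker_time = worst_case_hop_time(chain[1])  # 4 * 10 + (0.5 + 1 + 2) = 43.5
```

With these made-up numbers, one trigger can produce 48 downstream calls, and the worker's 43.5 s worst case exceeds the 30 s attempt of the hop above it. That is the kind of violation the budget table should rule out.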

For any non-idempotent hop, require that retries be disabled or guarded by an idempotency key, and document the manual reconciliation step for a request that may have partially applied before timeout.
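
One way to picture the idempotency-key guard is the minimal Python sketch below. `IdempotentExecutor` and `apply_once` are hypothetical names, and the in-memory dictionary stands in for what would need to be a durable store with an atomic check-and-set in a real system.

```python
class IdempotentExecutor:
    """Runs an operation at most once per idempotency key (in-memory sketch)."""

    def __init__(self):
        self._results = {}

    def apply_once(self, key, operation):
        # A retry carrying a key that already completed returns the
        # recorded result instead of applying the write a second time.
        if key in self._results:
            return self._results[key]
        result = operation()
        # Recorded only after success: if the operation raises, nothing is
        # stored, and whether it partially applied is unknown to this guard.
        self._results[key] = result
        return result


# Usage: the same key is sent on the first attempt and on every retry.
calls = []
executor = IdempotentExecutor()
first = executor.apply_once("order-123", lambda: calls.append("charge") or "ok")
retry = executor.apply_once("order-123", lambda: calls.append("charge") or "ok")
```

The guard only covers requests that completed and were recorded. A request that timed out partway through still needs the manual reconciliation step, because the guard cannot tell whether the write landed.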
