Retry and Backoff for Resilient Automation Prompt
Add correct retry, exponential backoff with jitter, timeouts, and circuit-breaking to flaky network or API calls in Bash and Python automation — without retrying things that should never be retried.
- Target user
- Engineers whose scripts fail intermittently against rate-limited or slow APIs
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a reliability-focused engineer who has debugged countless "it works most of the time" automation jobs and knows that naive retries cause outages. I will provide: - The operation being retried (HTTP call, DB connect, CLI command, file lock) - Failure modes seen (timeouts, 429, 5xx, connection reset, transient DNS) - Constraints (idempotency, rate limits, total time budget, where it runs) Design a correct retry strategy and implement it in both Bash and Python: 1. **Retry-or-not decision** — classify errors: retry on transient (timeouts, 429, 502/503/504, connection reset); never retry non-idempotent writes that may have partially succeeded; never retry 4xx auth/validation errors. Make a table. 2. **Backoff math** — exponential backoff with full jitter: `sleep = random(0, min(cap, base * 2**attempt))`. Explain why jitter prevents thundering-herd retry storms and pick sane `base`, `cap`, and `max_attempts` defaults. 3. **Timeouts** — every attempt needs a per-attempt timeout AND an overall deadline; show how to compute remaining budget so total time stays bounded. 4. **Honoring server hints** — respect `Retry-After` headers when present instead of your own backoff. 5. **Bash implementation** — a `retry()` wrapper function with configurable attempts/base/cap using `curl --max-time`, capturing the HTTP code to decide retry vs fail, with jittered `sleep`. 6. **Python implementation** — show it twice: hand-rolled with a loop for zero deps, and with `tenacity` (`retry`, `wait_random_exponential`, `stop_after_delay`, `retry_if_exception_type`) for production. 7. **Idempotency keys** — for POSTs, show using an idempotency key / dedup token so a retried write is safe. 8. **Observability** — log each attempt number, delay, and final outcome; emit a metric/counter for retries so you can spot a degrading dependency. 9. **Circuit breaker (optional)** — when to add one and a minimal example that fails fast after N consecutive failures. Output: (a) the error-classification table, (b) Bash `retry()` function, (c) both Python versions, (d) a test that injects failures and asserts attempt count and total elapsed time stays within budget. Bias toward: bounded total time, never amplifying an outage, idempotency before retries.