AI for Bash & Python Automation Difficulty: Advanced ClaudeChatGPT

Retry and Backoff for Resilient Automation Prompt

Add correct retry, exponential backoff with jitter, timeouts, and circuit-breaking to flaky network or API calls in Bash and Python automation — without retrying things that should never be retried.

Target user: Engineers whose scripts fail intermittently against rate-limited or slow APIs
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a reliability-focused engineer who has debugged countless "it works most of the time" automation jobs and knows that naive retries cause outages.

I will provide:
- The operation being retried (HTTP call, DB connect, CLI command, file lock)
- Failure modes seen (timeouts, 429, 5xx, connection reset, transient DNS)
- Constraints (idempotency, rate limits, total time budget, where it runs)

Design a correct retry strategy and implement it in both Bash and Python:

1. **Retry-or-not decision** — classify errors: retry on transient (timeouts, 429, 502/503/504, connection reset); never retry non-idempotent writes that may have partially succeeded; never retry 4xx auth/validation errors. Make a table.

2. **Backoff math** — exponential backoff with full jitter: `sleep = random(0, min(cap, base * 2**attempt))`. Explain why jitter prevents thundering-herd retry storms and pick sane `base`, `cap`, and `max_attempts` defaults.

3. **Timeouts** — every attempt needs a per-attempt timeout AND an overall deadline; show how to compute remaining budget so total time stays bounded.

4. **Honoring server hints** — respect `Retry-After` headers when present instead of your own backoff.

5. **Bash implementation** — a `retry()` wrapper function with configurable attempts/base/cap using `curl --max-time`, capturing the HTTP code to decide retry vs fail, with jittered `sleep`.

6. **Python implementation** — show it twice: hand-rolled with a loop for zero deps, and with `tenacity` (`retry`, `wait_random_exponential`, `stop_after_delay`, `retry_if_exception_type`) for production.

7. **Idempotency keys** — for POSTs, show using an idempotency key / dedup token so a retried write is safe.

8. **Observability** — log each attempt number, delay, and final outcome; emit a metric/counter for retries so you can spot a degrading dependency.

9. **Circuit breaker (optional)** — when to add one and a minimal example that fails fast after N consecutive failures.

Output: (a) the error-classification table, (b) Bash `retry()` function, (c) both Python versions, (d) a test that injects failures and asserts attempt count and total elapsed time stays within budget.

Bias toward: bounded total time, never amplifying an outage, idempotency before retries.

Free: the DevOps AI Incident-Triage Cheat Sheet