Skip to content
CloudOps
Newsletter
All prompts
AI for Bash & Python Automation Difficulty: Advanced ClaudeChatGPT

Retry and Backoff for Resilient Automation Prompt

Add correct retry, exponential backoff with jitter, timeouts, and circuit-breaking to flaky network or API calls in Bash and Python automation — without retrying things that should never be retried.

Target user
Engineers whose scripts fail intermittently against rate-limited or slow APIs
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a reliability-focused engineer who has debugged countless "it works most of the time" automation jobs and knows that naive retries cause outages.

I will provide:
- The operation being retried (HTTP call, DB connect, CLI command, file lock)
- Failure modes seen (timeouts, 429, 5xx, connection reset, transient DNS)
- Constraints (idempotency, rate limits, total time budget, where it runs)

Design a correct retry strategy and implement it in both Bash and Python:

1. **Retry-or-not decision** — classify errors: retry on transient (timeouts, 429, 502/503/504, connection reset); never retry non-idempotent writes that may have partially succeeded; never retry 4xx auth/validation errors. Make a table.

2. **Backoff math** — exponential backoff with full jitter: `sleep = random(0, min(cap, base * 2**attempt))`. Explain why jitter prevents thundering-herd retry storms and pick sane `base`, `cap`, and `max_attempts` defaults.

3. **Timeouts** — every attempt needs a per-attempt timeout AND an overall deadline; show how to compute remaining budget so total time stays bounded.

4. **Honoring server hints** — respect `Retry-After` headers when present instead of your own backoff.

5. **Bash implementation** — a `retry()` wrapper function with configurable attempts/base/cap using `curl --max-time`, capturing the HTTP code to decide retry vs fail, with jittered `sleep`.

6. **Python implementation** — show it twice: hand-rolled with a loop for zero deps, and with `tenacity` (`retry`, `wait_random_exponential`, `stop_after_delay`, `retry_if_exception_type`) for production.

7. **Idempotency keys** — for POSTs, show using an idempotency key / dedup token so a retried write is safe.

8. **Observability** — log each attempt number, delay, and final outcome; emit a metric/counter for retries so you can spot a degrading dependency.

9. **Circuit breaker (optional)** — when to add one and a minimal example that fails fast after N consecutive failures.

Output: (a) the error-classification table, (b) Bash `retry()` function, (c) both Python versions, (d) a test that injects failures and asserts attempt count and total elapsed time stays within budget.

Bias toward: bounded total time, never amplifying an outage, idempotency before retries.
Newsletter

Free: the DevOps AI Incident-Triage Cheat Sheet

Subscribe and we’ll send you the one-page cheat sheet — plus weekly AI prompts, automation ideas, and tool reviews for infrastructure engineers. One email a week. No spam, unsubscribe anytime.

  • AI Incident-Triage Cheat Sheet (PDF)
  • Access to 1,603 DevOps AI prompts
  • One practical workflow email per week