Retry and Backoff Patterns for Reliable Automation Scripts

A surprising number of “incidents” are nothing but a transient blip that a single retry would have absorbed. The API returned a 503 for two seconds during a deploy. A DNS lookup timed out once. The database was mid-failover for 800 milliseconds. Your script saw one failure, gave up, and paged someone.

In 25 years of automating infrastructure, adding sensible retry-with-backoff to scripts has quietly eliminated more false-positive pages than almost anything else. But naive retries cause their own outages. Here is how to do it right.

Not everything should be retried

The first rule of retries: only retry things that are actually transient. Retrying a permanent failure just wastes time and hammers a struggling service.

Retry: timeouts, connection refused, 429 (rate limited), 502/503/504, DNS failures.
Do not retry: 400 (bad request), 401/403 (auth), 404 (not found), validation errors. These will fail identically every time.

A script that retries a 401 five times is a script that takes five times longer to tell you your token is wrong.

Exponential backoff, not fixed sleep

A fixed sleep 1 between retries is better than nothing but still hammers a recovering service. Exponential backoff doubles the wait each attempt — 1s, 2s, 4s, 8s — giving an overloaded backend room to breathe.

Here is the bash version I use:

retry() {
  local max=5 delay=1 attempt=1
  until "$@"; do
    if (( attempt >= max )); then
      echo "failed after $max attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

retry curl -fsS https://api.internal/health

retry takes any command, runs it, and on failure backs off exponentially up to five attempts. The curl -f flag is important — it makes curl exit non-zero on HTTP errors so the retry loop actually triggers.

Add jitter or risk a thundering herd

Here is the subtle one that bites people at scale. If fifty machines all fail at the same instant and all back off on the exact same 1-2-4-8 schedule, they all retry simultaneously — a synchronized stampede that knocks the service back down the moment it recovers. This is a thundering herd, and pure exponential backoff causes it.

The fix is jitter: randomize each delay. In bash:

delay=$(( delay * 2 ))
jitter=$(( RANDOM % delay ))
sleep $(( delay + jitter ))

Now the fifty machines spread their retries across a window instead of slamming the service at the same tick. At any real scale, jitter is not optional.

The Python version

Python lets you express this more cleanly, including a cap on the maximum delay:

import random
import time

def retry(fn, *, max_attempts=5, base=1.0, cap=30.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError as e:
            if attempt == max_attempts:
                raise
            delay = min(cap, base * 2 ** (attempt - 1))
            delay += random.uniform(0, delay)  # full jitter
            time.sleep(delay)

The cap matters: without it, exponential growth means attempt 10 waits 512 seconds. Capping the delay keeps backoff bounded while still spreading load. Note it only catches TransientError — a permanent failure propagates immediately instead of being retried into the ground.

Use a library when you can

For real projects, reach for tenacity rather than rolling your own:

from tenacity import retry, stop_after_attempt, wait_exponential_jitter

@retry(stop=stop_after_attempt(5),
       wait=wait_exponential_jitter(initial=1, max=30))
def fetch_config():
    resp = httpx.get("https://api.internal/config", timeout=5)
    resp.raise_for_status()
    return resp.json()

tenacity gives you backoff, jitter, attempt limits, and the ability to retry only on specific exceptions — all declaratively. It has the edge cases handled that a hand-rolled loop usually misses.

Always bound the total time

Retries plus backoff can blow up your total runtime. A five-attempt loop with exponential backoff can take a minute, and if that script runs inside a deploy with its own timeout, you have a problem. Always cap either the number of attempts or a total deadline, and prefer a deadline for anything time-sensitive:

deadline = time.monotonic() + 30  # 30s total budget
while time.monotonic() < deadline:
    ...

Set per-request timeouts too. A retry around a curl with no --max-time can hang forever on attempt one and never reach the retry logic at all.

Where AI helps

When I am adding resilience to an existing script, I describe the failure modes and let AI draft the retry wrapper:

“Wrap this API call in a retry with exponential backoff and full jitter, max 5 attempts, 30s cap. Only retry on timeouts and HTTP 429/502/503/504. Do NOT retry 4xx auth or validation errors. Add a per-request timeout.”

Being explicit about what not to retry is the key part of the prompt — left unconstrained, models happily retry everything, including the 401 that will never succeed. I keep these resilience prompts in my prompt library.

The short list

Retry only genuinely transient failures; let permanent ones fail fast.
Use exponential backoff, not a fixed sleep.
Add jitter — always, at any scale.
Cap the delay and bound total time with a deadline.
Set per-request timeouts so a single hang does not defeat the whole scheme.
Use tenacity in Python rather than hand-rolling.

Get this right and a huge class of 2am pages quietly disappears. For more reliability patterns, see our automation guides.