AI for Bash & Python Automation · By James Joyner IV · 10 min read

Timeouts and Watchdogs in Python with signal.alarm and SIGALRM

Bound blocking calls and build watchdogs in Python using signal.alarm and SIGALRM. Covers the main-thread and Unix-only caveats and when subprocess timeouts win.

  • #automation
  • #ai
  • #python
  • #signals
  • #reliability

Every ops engineer eventually inherits a script that hangs forever on a flaky DNS lookup, a database driver that ignores its own timeout setting, or a legacy library call that has no timeout parameter at all. You cannot patch the library, you cannot add timeout= to a function that does not accept one, and the cron job that wraps it has been silently stacking up hung processes for weeks. This is where signal.alarm earns its place in the toolkit. It lets you impose a hard wall-clock deadline on a block of Python code that offers no timeout of its own. It is also one of the most caveat-laden corners of the standard library, and an AI assistant will cheerfully hand you a version that works perfectly in the model’s head and breaks the moment you run it off the main thread. Draft with the AI, verify against the caveats.

How SIGALRM Actually Works

The mechanism is simple. You ask the kernel to deliver a SIGALRM signal after N seconds with signal.alarm(N). You register a handler that runs when that signal arrives. Inside the handler you raise an exception, which unwinds the stack out of whatever blocking call was running and lands you back in your own try/except. The key insight is that the signal interrupts blocking system calls made from C code, which thread-based timeout machinery cannot reach, and that is why it works on stubborn library calls. One limit applies: CPython runs the Python-level handler only once control returns to the interpreter, so a C routine that keeps computing, or that retries internally after being interrupted, will not raise until it returns.

import signal
from contextlib import contextmanager

class Timeout(Exception):
    pass

@contextmanager
def time_limit(seconds: int):
    def _handler(signum, frame):
        raise Timeout(f"operation exceeded {seconds}s")

    old = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)               # cancel the pending alarm
        signal.signal(signal.SIGALRM, old)  # restore prior handler

That finally block is not optional decoration. signal.alarm(0) cancels the timer so a slow-but-successful call does not get a stray SIGALRM delivered a second later into unrelated code. Restoring the previous handler keeps your context manager from leaking global state into the rest of the program. Both are the kind of cleanup an AI draft routinely omits because the happy path appears to work without them.

Using It on a Blocking Call

The whole point is wrapping a call you cannot otherwise bound.

import socket

def resolve_with_deadline(host: str, seconds: int = 3) -> str:
    try:
        with time_limit(seconds):
            return socket.gethostbyname(host)   # can hang on bad resolvers
    except Timeout as exc:
        # chain the original exception so the traceback shows where the deadline hit
        raise RuntimeError(f"DNS lookup for {host} timed out") from exc

Run this in a cron-driven health check and a wedged resolver costs you three seconds and a clean error instead of an accumulating pile of stuck processes, provided the resolver call hands control back to the interpreter when the signal lands. The improvement is not subtle, and it is the most common legitimate use of signal.alarm in ops code.

The Caveats That Will Bite You

This is where the standard library’s sharp edges live, and where I have watched generated code fail in production.

It only works on the main thread. signal.signal can only be called from the main thread of the main interpreter. If your script uses a ThreadPoolExecutor, a background worker thread, or any framework that runs your code off-thread, calling this context manager raises ValueError: signal only works in main thread of the main interpreter (older Python versions end the message at "main thread"). There is no workaround within the signal API. If your code runs in worker threads, you need a different timeout strategy entirely.
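A cheap guard makes this failure explicit instead of surprising. The following is a minimal sketch, not a library API: on_main_thread and probe_from_worker are hypothetical helper names, and the probe registers a SIGINT handler only because SIGINT exists on every platform.

```python
import signal
import threading


def on_main_thread() -> bool:
    """Return True when the caller is running on the process's main thread."""
    return threading.current_thread() is threading.main_thread()


def probe_from_worker() -> dict:
    """Show what a worker thread sees when it tries to register a handler."""
    result = {}

    def _worker():
        result["on_main_thread"] = on_main_thread()
        try:
            # signal.signal refuses to run off the main thread and raises
            # before it changes any handler.
            signal.signal(signal.SIGINT, signal.default_int_handler)
        except ValueError as exc:
            result["error"] = str(exc)

    worker = threading.Thread(target=_worker)
    worker.start()
    worker.join()
    return result
```

Calling on_main_thread() before entering time_limit lets a tool fall back to another strategy rather than crash on the first request.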

It is Unix-only. signal.alarm and SIGALRM do not exist on Windows. Code that relies on them is not portable, and if you ship tooling that teammates run on Windows laptops without WSL, it will fail with an AttributeError the first time it touches signal.SIGALRM or signal.alarm. Guard accordingly or pick a cross-platform approach.
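One way to guard is a feature check at import time. This is a sketch under the assumption that the tool has a subprocess-based fallback available; HAS_SIGALRM and pick_timeout_strategy are hypothetical names introduced for illustration.

```python
import signal

# True only on platforms that expose both the signal number and the timer call.
HAS_SIGALRM = hasattr(signal, "SIGALRM") and hasattr(signal, "alarm")


def pick_timeout_strategy() -> str:
    """Return a label for the timeout mechanism this platform can support."""
    return "sigalrm" if HAS_SIGALRM else "subprocess"
```

Checking with hasattr avoids hard-coding platform names and fails closed on any system that lacks the call.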

Granularity is whole seconds. signal.alarm takes an integer count of seconds. If you need sub-second deadlines, signal.setitimer(signal.ITIMER_REAL, 0.5) gives you fractional timers using the same SIGALRM delivery, which is worth knowing because the AI will often reach for alarm even when you asked for 500 milliseconds.
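A fractional variant of the earlier context manager might look like the sketch below. It assumes Unix and the main thread, like the original; time_limit_precise is a hypothetical name, and the Timeout class is redefined here only so the block stands alone.

```python
import signal
import time
from contextlib import contextmanager


class Timeout(Exception):
    pass


@contextmanager
def time_limit_precise(seconds: float):
    """Like time_limit, but accepts fractional seconds via setitimer."""
    def _handler(signum, frame):
        raise Timeout(f"operation exceeded {seconds}s")

    old = signal.signal(signal.SIGALRM, _handler)
    # A delay of 0 would disarm the timer, so callers must pass a positive value.
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm any pending timer
        signal.signal(signal.SIGALRM, old)       # restore prior handler
```

With this in place, `with time_limit_precise(0.5): ...` bounds the block at roughly 500 milliseconds.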

Prompt: “Add a 5-second timeout to this function using signal.alarm.” The assistant produced a clean context-manager version, but it never mentioned that the calling code ran inside a worker thread three frames up. Dropped in, it raised ValueError: signal only works in main thread on the first request. The code was textbook-correct in isolation and wrong for the program it lived in, which is precisely the failure mode you verify for rather than trust.

Nesting and Re-entrancy

There is exactly one alarm timer per process. You cannot nest two time_limit blocks and expect both deadlines to fire independently, because the inner signal.alarm overwrites the outer one’s countdown and the inner finally cancels the outer alarm on the way out. If you genuinely need nested deadlines, track the remaining time yourself and reset the alarm to the tighter bound, or step back and reconsider whether the design wants something other than process-global signals.
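Tracking the remaining time can be sketched as follows. It relies on signal.alarm returning the seconds left on the alarm it replaces. nested_time_limit is a hypothetical name, the whole-second rounding is a simplification, and the approach assumes Unix and the main thread.

```python
import math
import signal
import time
from contextlib import contextmanager


class Timeout(Exception):
    pass


@contextmanager
def nested_time_limit(seconds: int):
    """Time limit that respects, and later restores, an enclosing alarm."""
    started = time.monotonic()
    # alarm(0) cancels any pending alarm and returns its remaining seconds
    # (0 when no alarm was pending).
    outer_remaining = signal.alarm(0)
    # Never allow the inner block to outlive the enclosing deadline.
    effective = min(seconds, outer_remaining) if outer_remaining else seconds

    def _handler(signum, frame):
        raise Timeout(f"operation exceeded {effective}s")

    old = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(effective)
    try:
        yield
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old)
        if outer_remaining:
            left = outer_remaining - (time.monotonic() - started)
            # Re-arm the enclosing deadline. alarm() needs a positive integer,
            # so an already-expired outer deadline fires up to a second late.
            signal.alarm(max(1, math.ceil(left)))
```

Even in this small form the bookkeeping is fiddly, which is a fair argument for reconsidering the design before nesting signal-based deadlines.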

When to Prefer a Subprocess Timeout

Here is the judgment call that separates a robust tool from a clever one. If the risky work can be isolated into its own process, a subprocess timeout is almost always the better choice, and signal.alarm becomes a fallback rather than the first instinct.

import subprocess

def run_bounded(cmd: list[str], seconds: int = 30) -> str:
    try:
        out = subprocess.run(
            cmd, capture_output=True, text=True,
            timeout=seconds, check=True,
        )
        return out.stdout
    except subprocess.TimeoutExpired as exc:
        # subprocess.run has already killed the child by the time this is raised
        raise RuntimeError(f"{cmd[0]} exceeded {seconds}s and was killed") from exc

The subprocess approach wins on three counts. It works on any thread because the timeout is enforced by waiting on the child, not by signal delivery into your interpreter. It actually kills the runaway child process, where signal.alarm merely unwinds your stack and can leave whatever the interrupted call was holding (a half-open socket, a lock, partially updated library state) in limbo inside your own process. And it is portable. The equivalent at the shell is timeout 30 some-command, and reaching for that, or its Python wrapper, is the right move whenever the dangerous work is a discrete command rather than an inline call. The signal approach is reserved for the case where you truly cannot fork the work out: an in-process library call with no timeout knob, running on the main thread, on Unix.
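The thread-safety point is easy to check for yourself. The sketch below runs a deliberately slow child from a worker thread, where signal.signal would have refused to run. timeout_from_worker is a hypothetical name, and the sleeping child stands in for any hung command.

```python
import subprocess
import sys
import threading


def timeout_from_worker(seconds: float = 0.5) -> str:
    """Enforce a subprocess timeout from a non-main thread."""
    outcome = {}

    def _worker():
        # A child that would sleep far past the deadline.
        cmd = [sys.executable, "-c", "import time; time.sleep(30)"]
        try:
            subprocess.run(cmd, timeout=seconds, check=True)
            outcome["result"] = "finished"
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising this.
            outcome["result"] = "timed out"

    worker = threading.Thread(target=_worker)
    worker.start()
    worker.join()
    return outcome["result"]
```

Using sys.executable keeps the example independent of whatever commands happen to be installed on the host.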

A Watchdog Pattern

You can also use the same machinery as a periodic watchdog by re-arming the alarm inside the handler. The handler checks whether progress was made since the last tick and aborts if the program appears wedged. This is heavier than it looks and easy to get subtly wrong, so for long-running supervision most ops teams reach for a dedicated heartbeat or process-monitor pattern instead of hand-rolling re-armed signals.
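For a sense of what the hand-rolled version involves, here is a minimal sketch. Watchdog, beat, and Wedged are hypothetical names. It uses signal.setitimer rather than signal.alarm so the interval can be fractional, and it carries the same main-thread and Unix-only restrictions as everything above.

```python
import signal


class Wedged(Exception):
    """Raised when no progress was recorded between two watchdog ticks."""


class Watchdog:
    def __init__(self, interval: float):
        self.interval = interval
        self._progress = 0   # bumped by the supervised code
        self._seen = 0       # progress value observed at the last tick
        self._old = None

    def beat(self) -> None:
        """Called by the supervised code to report that it is still moving."""
        self._progress += 1

    def _tick(self, signum, frame):
        if self._progress == self._seen:
            raise Wedged(f"no progress in the last {self.interval}s")
        self._seen = self._progress
        # Re-arm the one-shot timer for the next check.
        signal.setitimer(signal.ITIMER_REAL, self.interval)

    def __enter__(self):
        self._old = signal.signal(signal.SIGALRM, self._tick)
        signal.setitimer(signal.ITIMER_REAL, self.interval)
        return self

    def __exit__(self, exc_type, exc, tb):
        signal.setitimer(signal.ITIMER_REAL, 0)      # disarm
        signal.signal(signal.SIGALRM, self._old)     # restore prior handler
        return False
```

The supervised loop calls beat() on each unit of work; if a full interval passes with no beat, the next tick raises Wedged into whatever the main thread was doing.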

The throughline is the same discipline that makes any AI-assisted automation trustworthy: let the model draft the structure, then check it against the three things it cannot know about your environment, which are the thread it runs on, the OS it ships to, and whether the work should have been a subprocess in the first place. For the related patterns, see python-safe-subprocess-wrapper for bounding external commands, python-signal-graceful-shutdown for handling SIGTERM cleanly, and python-process-watchdog-auto-restart for supervision. More of the surrounding tooling lives in the bash and Python automation category.
