CPU-Bound Ops Work in Python: concurrent.futures

You wrote a script to checksum ten thousand artifacts, sharded the work across a ThreadPoolExecutor, and watched it run no faster than the single-threaded version. This is the moment most ops engineers meet the Global Interpreter Lock the hard way. Threads in CPython take turns holding one lock to execute Python bytecode, so for work that is genuinely CPU-bound, adding threads just adds context-switching overhead. The fix is to run the work in separate processes, each with its own interpreter and its own GIL, and concurrent.futures.ProcessPoolExecutor makes that almost as easy as the thread version. Almost. The pickling, the chunking, and the exception handling all have sharp edges, and an AI assistant will happily generate code that looks correct and silently swallows failures. Let the AI draft it, then verify it does what you think.

Threads vs Processes: Knowing Which Problem You Have

The first decision is diagnostic, not architectural. If your bottleneck is waiting on the network or disk, threads are fine, because the GIL is released during blocking I/O. If your bottleneck is the CPU actually executing Python bytecode, threads will not help and processes will. Parsing, regex over large bodies of text, image resizing in pure Python: these are CPU-bound and hold the GIL. Hashing and compression are CPU-bound too, but measure before assuming the worst, because some C-backed modules release the GIL while they work (hashlib does so for updates larger than 2047 bytes). A quick way to confirm is to watch per-core CPU utilization while the threaded script runs. One pegged core with the rest idle means you are GIL-bound and processes are the answer.
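
If you would rather have a number than a glance at a monitor, one rough option is to compare CPU time with wall-clock time for a single call. The sketch below assumes the work runs in the current process; cpu_fraction and the two toy workloads are illustrative names, not part of any library.

```python
import time


def cpu_fraction(fn, *args):
    """Run fn(*args) once; return (result, CPU seconds / wall seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time consumed by this process
    result = fn(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, (cpu / wall if wall > 0 else 0.0)


def busy(n: int) -> int:
    """Toy CPU-bound task: pure-Python arithmetic in a loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def idle(seconds: float) -> float:
    """Toy waiting task: the process sleeps, as it would on blocking I/O."""
    time.sleep(seconds)
    return seconds
```

A ratio close to 1 (as you would expect from cpu_fraction(busy, 2_000_000)) points at CPU-bound work; a ratio close to 0 (cpu_fraction(idle, 0.2)) points at waiting, where threads are enough.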

import hashlib
from pathlib import Path

def checksum(path: str) -> tuple[str, str]:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return path, h.hexdigest()

This checksum function is the unit of work we will parallelize. Note that it is a plain top-level function that takes simple, picklable arguments and returns a picklable result. That constraint is not optional with processes, and it shapes everything below.
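
A cheap way to check that constraint before anything reaches a pool is to try pickling the object yourself. This is a minimal sketch; is_picklable is a hypothetical helper name, and the sample path is made up.

```python
import pickle
import threading


def is_picklable(obj) -> bool:
    """Return True if obj survives pickle.dumps, as pool arguments must."""
    try:
        pickle.dumps(obj)
    except Exception:  # PicklingError, TypeError, AttributeError, ...
        return False
    return True


# Plain data and importable functions pickle (functions travel by name).
print(is_picklable(("build/app.tar.gz", "sha256")))
print(is_picklable(len))
# Lambdas and OS-level resources such as locks do not.
print(is_picklable(lambda p: p))
print(is_picklable(threading.Lock()))
```

Running a check like this against a sample argument and a sample return value catches most payload problems before the first worker starts.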

map: The Clean Path for Uniform Work

When every task is roughly the same shape and you do not care about results until they are all in, executor.map is the simplest correct option. It preserves input order, which is often exactly what you want when you are joining results back against a list of inputs.

from concurrent.futures import ProcessPoolExecutor

def checksum_all(paths: list[str], workers: int = 4) -> dict[str, str]:
    results: dict[str, str] = {}
    with ProcessPoolExecutor(max_workers=workers) as ex:
        for path, digest in ex.map(checksum, paths, chunksize=64):
            results[path] = digest
    return results

The chunksize argument is the detail people skip, and it matters enormously when you have many small tasks. At the default chunksize of 1, each item is pickled and shipped to a worker individually, and the inter-process overhead can dwarf the actual work. Batching 64 paths per dispatch amortizes that cost. A reasonable starting heuristic is max(1, len(items) // (workers * 4)), then tune from there. Too small and you pay the serialization tax on every item; too large and you lose load balancing, because one slow chunk ties up a worker while the others sit idle.
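
The heuristic is small enough to write down. The sketch below is one way to do it; pick_chunksize and chunk_count are hypothetical helpers, and the factor of 4 is only the starting point described above.

```python
import math


def pick_chunksize(n_items: int, workers: int, chunks_per_worker: int = 4) -> int:
    """Starting chunksize: aim for roughly chunks_per_worker chunks per worker."""
    if workers < 1 or chunks_per_worker < 1:
        raise ValueError("workers and chunks_per_worker must be >= 1")
    # ProcessPoolExecutor.map rejects chunksize < 1, so clamp small inputs to 1.
    return max(1, n_items // (workers * chunks_per_worker))


def chunk_count(n_items: int, chunksize: int) -> int:
    """How many chunks the pool will have to hand out."""
    return math.ceil(n_items / chunksize)
```

For 10,000 paths on 4 workers this gives 625 paths per chunk and 16 chunks, so every worker has several chunks to pull from. Checking chunk_count against the worker count is a quick sanity test for any hardcoded value.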

Prompt: “Rewrite this ThreadPoolExecutor checksum script to use ProcessPoolExecutor, and explain how you chose chunksize.” The assistant returned a working map version but hardcoded chunksize=1000 with no justification. For 3,000 files across 4 workers that produced only three chunks, so one worker never received any work and the other three could not rebalance when a chunk ran long. The missing explanation was the tell that the number was a guess, which is exactly why you ask for the reasoning, not just the code.

submit and as_completed: When You Need Results as They Land

map is the wrong tool when tasks have wildly different durations or when you want to act on results the moment they arrive rather than in submission order. For that, submit each task to get a Future, then drain them with as_completed.

from concurrent.futures import ProcessPoolExecutor, as_completed

def checksum_streaming(paths: list[str], workers: int = 4) -> dict[str, str]:
    results: dict[str, str] = {}
    with ProcessPoolExecutor(max_workers=workers) as ex:
        futures = {ex.submit(checksum, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                _, digest = fut.result()
                results[path] = digest
            except Exception as exc:
                print(f"FAILED {path}: {exc!r}")
    return results

The dictionary mapping Future -> input is the idiomatic pattern, because a completed future does not carry its original arguments. When something fails, you want to know which path blew up, and futures[fut] gives you that.

Exception Propagation Is the Part That Bites

Here is the trap that catches everyone. With ProcessPoolExecutor, an exception raised inside a worker does not crash your program immediately. It is captured, pickled, shipped back, and re-raised only when you call future.result() or iterate the map generator. If you never read the result, the exception vanishes and your script reports success on a job that half-failed.

# WRONG: exceptions in workers are silently lost
with ProcessPoolExecutor() as ex:
    for fut in [ex.submit(checksum, p) for p in paths]:
        pass  # never calling .result(), so failures disappear

# RIGHT: surface every failure, decide per-item whether to continue
with ProcessPoolExecutor() as ex:
    futures = {ex.submit(checksum, p): p for p in paths}
    failures = []
    for fut in as_completed(futures):
        try:
            fut.result()
        except Exception as exc:
            failures.append((futures[fut], exc))
    if failures:
        raise RuntimeError(f"{len(failures)} tasks failed: {failures[:3]}")
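
The map path needs its own care: the iterator re-raises the first worker exception it reaches, and once it has raised, the remaining results are out of reach. One way around that, sketched here under the assumption that you want to keep going past bad files, is to catch inside the worker and return the failure as data. checksum_or_error and checksum_all_tolerant are hypothetical names.

```python
import hashlib
from concurrent.futures import ProcessPoolExecutor


def checksum(path: str) -> tuple[str, str]:
    """Same unit of work as earlier: stream the file through SHA-256."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return path, h.hexdigest()


def checksum_or_error(path: str):
    """Top-level wrapper (so it pickles); returns (path, digest, error)."""
    try:
        _, digest = checksum(path)
        return path, digest, None
    except Exception as exc:
        # Hand the failure back as a string so the map iterator keeps going.
        return path, None, repr(exc)


def checksum_all_tolerant(paths, workers=4, chunksize=64):
    """Return (results, failures) dicts keyed by path."""
    results, failures = {}, {}
    with ProcessPoolExecutor(max_workers=workers) as ex:
        for path, digest, error in ex.map(checksum_or_error, paths,
                                          chunksize=chunksize):
            if error is None:
                results[path] = digest
            else:
                failures[path] = error
    return results, failures
```

The caller still has to look at failures and decide whether a non-empty dict should fail the job; the wrapper only guarantees that nothing is dropped on the way back.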

There is a second class of failure that per-task exception handling cannot rescue: if a worker process is killed by the OS, for example by the OOM killer under memory pressure, the pool is marked broken, every outstanding future fails with BrokenProcessPool, and the executor accepts no further work. That is a signal to lower max_workers or shrink your per-task memory footprint, not to add a retry loop. This is the kind of subtle correctness issue an AI draft will not anticipate, because the happy-path version it generates is never exercised under memory pressure.
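
Because BrokenProcessPool is an ordinary exception subclass, a blanket except Exception will file it alongside per-file errors unless you check for it first. The sketch below shows one way to keep the two apart; drain is a hypothetical helper that takes the same Future-to-input dictionary used earlier.

```python
from concurrent.futures import as_completed
from concurrent.futures.process import BrokenProcessPool


def drain(futures):
    """Collect results from a {Future: label} dict.

    Returns (results, failures, pool_broken), where pool_broken is True if
    any future failed because a worker process died.
    """
    results, failures, pool_broken = {}, [], False
    for fut in as_completed(futures):
        label = futures[fut]
        try:
            results[label] = fut.result()
        except BrokenProcessPool as exc:
            # A worker died abruptly; this is a capacity problem, not a bad input.
            pool_broken = True
            failures.append((label, exc))
        except Exception as exc:
            # An ordinary exception raised by the task itself.
            failures.append((label, exc))
    return results, failures, pool_broken
```

When pool_broken comes back True, the useful response is usually to exit non-zero with a message about worker count or memory, since every remaining task in that pool has failed the same way.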

The Picklability Constraint

Everything you pass into a worker and everything you return must be picklable. Lambdas, closures, open file handles, database connections, and locally defined functions are not. If you find yourself wanting to pass a configured client object, restructure so the worker constructs it from plain config inside the process. The initializer and initargs parameters of the executor exist exactly for expensive per-process setup like opening a connection pool once per worker rather than once per task.

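import logging
import os
from concurrent.futures import ProcessPoolExecutor

def _init_worker(log_level: str) -> None:
    # Runs once in each worker process, before it takes its first task.
    logging.basicConfig(level=log_level)
    logging.info("worker %d initialized", os.getpid())

# checksum and paths are the function and input list from the earlier examples.
with ProcessPoolExecutor(max_workers=4,
                         initializer=_init_worker,
                         initargs=("INFO",)) as ex:
    list(ex.map(checksum, paths, chunksize=64))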

Putting It in an Ops Pipeline

In practice you wrap this behind a CLI that picks the pool type based on the workload, and you set max_workers to os.cpu_count() for CPU-bound jobs rather than the much higher counts you would use for I/O fan-out. Guard the entry point with if __name__ == "__main__": on platforms that spawn rather than fork (the default on Windows and macOS), or each child process will re-import your script and try to build a pool of its own. Depending on the Python version that shows up as runaway process creation or as a RuntimeError and a BrokenProcessPool, and either way the traceback points nowhere near the missing line. Since the guard costs nothing under fork, add it unconditionally.
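
A minimal sketch of that wiring follows. choose_pool is a hypothetical helper, the "cpu" and "io" labels are assumptions about how your CLI names workloads, and the thread count simply mirrors ThreadPoolExecutor's own default sizing rather than any tuned value.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def choose_pool(workload: str):
    """Return (executor class, worker count) for a 'cpu' or 'io' workload."""
    cpus = os.cpu_count() or 1
    if workload == "cpu":
        # One process per core: more would only compete for the same CPUs.
        return ProcessPoolExecutor, cpus
    if workload == "io":
        # Assumption: reuse the sizing ThreadPoolExecutor applies by default.
        return ThreadPoolExecutor, min(32, cpus + 4)
    raise ValueError(f"unknown workload: {workload!r}")


def main() -> None:
    pool_cls, workers = choose_pool("cpu")
    with pool_cls(max_workers=workers) as ex:
        # pow is a builtin, so it pickles by name without extra setup.
        squares = list(ex.map(pow, range(5), [2] * 5))
    print(squares)


if __name__ == "__main__":
    # Without this guard, spawned children re-import the script and rerun main().
    main()
```

Keeping the choice in one function means the rest of the script only ever sees the Executor interface, so switching a job from threads to processes is a one-word change at the call site.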

The discipline that makes process pools reliable is the same one that makes any AI-assisted automation reliable: have the model draft the structure, then verify the three things it routinely gets wrong, which are chunk sizing, exception draining, and the picklability of your payload. If you want a deeper reference on choosing between pool types and tuning worker counts, see python-concurrent-futures-pool-picker and python-multiprocessing-cpu-batch-worker. For the broader pattern of moving shell pipelines into structured Python so you can parallelize them safely, migrate-bash-to-python is a good starting point, and the rest of the bash and Python automation category covers the surrounding tooling.
