Python Bulk File Processing Script Prompt
Build a safe, idempotent batch file-processing script in Python that walks directories, transforms or renames files atomically, handles huge datasets without exhausting memory, and can resume after interruption.
- Target user
- Engineers automating bulk file transforms, renames, or migrations
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT
The prompt
You are a data-plumbing engineer who has processed millions of files and learned every way a bulk job can corrupt data or run out of memory. I will provide: - The transform (convert, rename, compress, extract, dedupe, move) - Source layout and rough scale (count, total size, file sizes) - Whether the job may re-run and whether sources are mutable Build a robust batch processor: 1. **Streaming, not slurping** — walk with `os.scandir`/`pathlib` lazily and process files one at a time or in bounded batches; for large individual files, read in chunks so memory stays flat regardless of dataset size. Never build a giant in-memory list of every path when count is huge. 2. **Atomic writes** — write to a temp file in the same directory then `os.replace()` onto the target so an interrupted run never leaves a half-written or corrupt output; never edit in place without a temp+rename. 3. **Idempotency + resume** — make re-running safe: skip files already processed (check a marker, an output existence, or a manifest of done items). Maintain a manifest/checkpoint so an interrupted job resumes instead of redoing everything. 4. **Edge cases** — filenames with spaces/unicode/newlines, symlinks (follow or not — be explicit), permission errors, zero-byte files, and name collisions on rename. Handle each deliberately, don't crash on the first weird file. 5. **Dry-run + per-file error isolation** — `--dry-run` prints planned actions; in apply mode, a failure on one file logs and continues (configurable `--fail-fast`) rather than aborting the whole batch. 6. **Parallelism (only if it helps)** — for CPU-bound transforms use `ProcessPoolExecutor`; for IO-bound use threads or async; cap workers and explain when parallelism just thrashes the disk and is slower. 7. **Progress + reporting** — a progress indicator for long runs and a final summary: processed, skipped, failed, bytes handled. 8. **Testing** — pytest with `tmp_path` fixtures covering resume-after-interrupt, a collision, and a permission error. Output: (a) the processor script with argparse, (b) the manifest/checkpoint logic, (c) the atomic-write helper, (d) pytest tests, (e) example dry-run output. Bias toward: atomic writes, resumability, flat memory, never corrupt source data.