AI for Bash & Python Automation

Python Bulk File Processing Script Prompt

Build a safe, idempotent batch file-processing script in Python that walks directories, transforms or renames files atomically, handles huge datasets without exhausting memory, and can resume after interruption.

Target user: Engineers automating bulk file transforms, renames, or migrations
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a data-plumbing engineer who has processed millions of files and learned every way a bulk job can corrupt data or run out of memory.

I will provide:
- The transform (convert, rename, compress, extract, dedupe, move)
- Source layout and rough scale (count, total size, file sizes)
- Whether the job may re-run and whether sources are mutable

Build a robust batch processor:

1. **Streaming, not slurping**: walk lazily with `os.scandir` or `pathlib` and process files one at a time or in bounded batches; read large individual files in chunks so memory stays flat regardless of dataset size. Never build an in-memory list of every path when the file count is huge.

2. **Atomic writes**: write to a temp file in the same directory as the target, then `os.replace()` it onto the target, so an interrupted run never leaves a half-written or corrupt output. Never edit in place without temp-and-rename.

3. **Idempotency + resume**: make re-running safe by skipping files already processed (check a marker file, the output's existence, or a manifest of completed items). Maintain a manifest/checkpoint so an interrupted job resumes instead of redoing everything.

4. **Edge cases**: filenames with spaces, unicode, or newlines; symlinks (state explicitly whether they are followed); permission errors; zero-byte files; and name collisions on rename. Handle each deliberately rather than crashing on the first unusual file.

5. **Dry-run + per-file error isolation**: `--dry-run` prints planned actions without changing anything; in apply mode, a failure on one file is logged and the batch continues, unless `--fail-fast` is set, in which case the first failure aborts the run.

6. **Parallelism (only if it helps)**: for CPU-bound transforms use `ProcessPoolExecutor`; for IO-bound work use threads or async. Cap the worker count, and explain when parallelism only thrashes the disk and ends up slower than a serial run.

7. **Progress + reporting**: a progress indicator for long runs and a final summary of files processed, skipped, and failed, plus bytes handled.

8. **Testing**: pytest tests using the `tmp_path` fixture that cover resume-after-interrupt, a name collision, and a permission error.
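
For reference, the lazy walk and chunked read from item 1 might look roughly like the sketch below. The names `iter_files` and `sha256_chunked` and the 1 MiB chunk size are illustrative assumptions, not requirements of the prompt. Symlinks are skipped unless the caller opts in, and hashing stands in for whatever transform the job actually applies.

```python
import hashlib
import os
from typing import Iterator


def iter_files(root: str, follow_symlinks: bool = False) -> Iterator[str]:
    """Yield regular-file paths under root lazily (hypothetical helper).

    Only the stack of pending directories is held in memory, never the
    full list of files. With follow_symlinks=False, symlinks are skipped;
    enabling it can loop forever on cyclic links, so it is opt-in.
    """
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=follow_symlinks):
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=follow_symlinks):
                        yield entry.path
        except PermissionError:
            # An unreadable directory is skipped rather than aborting the walk.
            continue


def sha256_chunked(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so memory use stays flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Because `iter_files` is a generator, a caller can start processing the first file before the walk has finished, and a zero-byte file simply yields the hash of empty input.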

Output: (a) the processor script with argparse, (b) the manifest/checkpoint logic, (c) the atomic-write helper, (d) pytest tests, (e) example dry-run output.
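
One possible shape for the atomic-write helper in (c) is sketched below, assuming the transform can be expressed as an iterable of byte chunks. The name `atomic_write` and the `.tmp-` / `.part` naming are hypothetical. The temp file is created in the target's own directory because `os.replace()` is only atomic within a single filesystem.

```python
import os
import tempfile
from typing import Iterable


def atomic_write(target: str, chunks: Iterable[bytes]) -> None:
    """Write chunks to a temp file beside target, then rename it into place.

    Hypothetical helper: if anything fails before the rename, the temp file
    is removed and any existing target is left untouched.
    """
    directory = os.path.dirname(os.path.abspath(target))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())  # push data to disk before the rename
        os.replace(tmp_path, target)  # atomic swap on the same filesystem
    except BaseException:
        # Clean up the partial temp file, then re-raise the original error.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Note that `tempfile.mkstemp` creates the file readable and writable only by the owner, so a real script may want to copy the source file's permissions onto the temp file before the rename.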

Bias toward: atomic writes, resumability, flat memory, never corrupt source data.
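
To make the resumability goal concrete, here is a minimal sketch of a manifest-driven loop, assuming an append-only JSON Lines manifest. The names `load_done` and `run_batch`, the record fields, and the summary keys are all illustrative. JSON encoding keeps paths containing spaces or newlines on a single manifest line. Failed files are recorded but retried on the next run.

```python
import json
import os
from typing import Callable, Dict, Iterable, Set


def load_done(manifest_path: str) -> Set[str]:
    """Return the paths recorded as successfully processed."""
    done: Set[str] = set()
    if not os.path.exists(manifest_path):
        return done
    with open(manifest_path, "r", encoding="utf-8") as fh:
        for line in fh:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # blank line or torn write from an interrupted run
            if isinstance(record, dict) and record.get("status") == "ok":
                done.add(record["path"])
    return done


def run_batch(
    paths: Iterable[str],
    manifest_path: str,
    process: Callable[[str], None],
    fail_fast: bool = False,
) -> Dict[str, int]:
    """Process each path once, appending one manifest record per file."""
    done = load_done(manifest_path)
    had_content = os.path.exists(manifest_path) and os.path.getsize(manifest_path) > 0
    summary = {"processed": 0, "skipped": 0, "failed": 0}
    with open(manifest_path, "a", encoding="utf-8") as manifest:
        if had_content:
            manifest.write("\n")  # terminate a possibly torn last line
        for path in paths:
            if path in done:
                summary["skipped"] += 1
                continue
            try:
                process(path)
                status = "ok"
                summary["processed"] += 1
            except Exception:
                if fail_fast:
                    raise
                status = "failed"  # isolate the error and keep going
                summary["failed"] += 1
            manifest.write(json.dumps({"path": path, "status": status}) + "\n")
            manifest.flush()
    return summary
```

This sketch holds the set of completed paths in memory, which is fine for moderate counts; for very large jobs, an on-disk index such as a SQLite table or a plain output-existence check may be a better fit.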
