Skip to content
CloudOps
Newsletter
All prompts
AI for Bash & Python Automation Difficulty: Intermediate ClaudeChatGPT

Python Bulk File Processing Script Prompt

Build a safe, idempotent batch file-processing script in Python that walks directories, transforms or renames files atomically, handles huge datasets without exhausting memory, and can resume after interruption.

Target user
Engineers automating bulk file transforms, renames, or migrations
Difficulty
Intermediate
Tools
Claude, ChatGPT

The prompt

You are a data-plumbing engineer who has processed millions of files and learned every way a bulk job can corrupt data or run out of memory.

I will provide:
- The transform (convert, rename, compress, extract, dedupe, move)
- Source layout and rough scale (count, total size, file sizes)
- Whether the job may re-run and whether sources are mutable

Build a robust batch processor:

1. **Streaming, not slurping** — walk with `os.scandir`/`pathlib` lazily and process files one at a time or in bounded batches; for large individual files, read in chunks so memory stays flat regardless of dataset size. Never build a giant in-memory list of every path when count is huge.

2. **Atomic writes** — write to a temp file in the same directory then `os.replace()` onto the target so an interrupted run never leaves a half-written or corrupt output; never edit in place without a temp+rename.

3. **Idempotency + resume** — make re-running safe: skip files already processed (check a marker, an output existence, or a manifest of done items). Maintain a manifest/checkpoint so an interrupted job resumes instead of redoing everything.

4. **Edge cases** — filenames with spaces/unicode/newlines, symlinks (follow or not — be explicit), permission errors, zero-byte files, and name collisions on rename. Handle each deliberately, don't crash on the first weird file.

5. **Dry-run + per-file error isolation** — `--dry-run` prints planned actions; in apply mode, a failure on one file logs and continues (configurable `--fail-fast`) rather than aborting the whole batch.

6. **Parallelism (only if it helps)** — for CPU-bound transforms use `ProcessPoolExecutor`; for IO-bound use threads or async; cap workers and explain when parallelism just thrashes the disk and is slower.

7. **Progress + reporting** — a progress indicator for long runs and a final summary: processed, skipped, failed, bytes handled.

8. **Testing** — pytest with `tmp_path` fixtures covering resume-after-interrupt, a collision, and a permission error.

Output: (a) the processor script with argparse, (b) the manifest/checkpoint logic, (c) the atomic-write helper, (d) pytest tests, (e) example dry-run output.

Bias toward: atomic writes, resumability, flat memory, never corrupt source data.
Newsletter

Free: the DevOps AI Incident-Triage Cheat Sheet

Subscribe and we’ll send you the one-page cheat sheet — plus weekly AI prompts, automation ideas, and tool reviews for infrastructure engineers. One email a week. No spam, unsubscribe anytime.

  • AI Incident-Triage Cheat Sheet (PDF)
  • Access to 1,603 DevOps AI prompts
  • One practical workflow email per week