Python File Integrity and Dedup Tool Prompt
Build a Python tool that hashes a directory tree, detects duplicate and corrupted files, and verifies integrity against a stored manifest — for backups, artifact stores, and data-pipeline outputs.
- Target user: Engineers managing large file trees, backups, or artifacts
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior engineer who has built data-integrity tooling for backup systems and artifact stores. You know that "trust me, the copy worked" is how silent corruption ships to production.

I will provide:
- The directory tree(s) involved and rough scale (file count, total size)
- Whether I need dedup detection, corruption detection, or both
- Whether a baseline manifest already exists or must be created
- Performance constraints (single run vs nightly cron over millions of files)

Your job:
1. **Hashing core**: walk the tree and hash file contents in fixed-size chunks (don't read whole files into memory), defaulting to `blake2b` or `sha256`. Record path, size, mtime, and digest. Stream so a 50 GB file never blows up RAM.
2. **Manifest format**: emit a stable, sorted, line-oriented manifest (JSONL or `<digest> <relpath>` like `sha256sum`) that diffs cleanly in git and can be read back in. Store relative paths so the manifest is portable.
3. **Verify mode**: given an existing manifest, re-hash and report each file as OK, MODIFIED (digest changed), MISSING (in manifest, not on disk), or NEW (on disk, not in manifest). Exit non-zero if any file is MODIFIED or MISSING so cron can alert.
4. **Dedup mode**: group files by digest to find exact duplicates; pre-filter by size first so files with a unique size are never hashed (a big speedup). Report wasted bytes and offer a dry-run plan to hardlink or delete duplicates. Never delete without `--apply`, and always keep one canonical copy.
5. **Performance**: use a process or thread pool for hashing (hashing is CPU-light and usually I/O-bound; choose the right pool and explain why); show progress with `rich` or `tqdm`; make the size pre-filter the default in dedup mode.
6. **Safety**: `--dry-run` is the default for any destructive dedup action; require `--apply` plus a confirmation; never follow symlinks unless asked; skip special files.
7. **Output**: a human-readable summary plus `--json` for pipelines; clear exit codes (0 clean, 1 integrity failure, 2 usage error).
8. **Tests**: pytest covering a round-trip create→verify (clean), a flipped byte (MODIFIED), a deleted file (MISSING), and dedup detection on two identical files.

Output: (a) the tool with create/verify/dedup subcommands, (b) the manifest format spec, (c) the pytest suite, (d) an example cron line for nightly verify.

Bias toward streaming, dry-run-by-default, and alertable exit codes.
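
As a rough illustration of what items 1, 3, and 4 ask for, a minimal sketch of chunked hashing, verify classification, and size-prefiltered dedup might look like the following. It assumes `sha256` and an in-memory dict as the manifest; the names `CHUNK_SIZE`, `hash_file`, `iter_files`, `build_manifest`, `verify`, and `find_duplicates` are hypothetical and chosen only for this example, and a real tool would add the CLI, progress reporting, and exit codes described above.

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path

# Assumed default: 1 MiB per read. Tune for your storage.
CHUNK_SIZE = 1024 * 1024


def hash_file(path, algo="sha256", chunk_size=CHUNK_SIZE):
    """Hash a file in fixed-size chunks so memory use stays flat."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def iter_files(root):
    """Return sorted relative POSIX paths of regular files under root.

    Symlinks and special files (FIFOs, sockets, devices) are skipped.
    """
    root = Path(root)
    found = []
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            p = Path(dirpath) / name
            if p.is_symlink() or not p.is_file():
                continue
            found.append(p.relative_to(root).as_posix())
    return sorted(found)


def build_manifest(root, algo="sha256"):
    """Map each relative path to its digest, in sorted path order."""
    root = Path(root)
    return {rel: hash_file(root / rel, algo) for rel in iter_files(root)}


def verify(root, manifest, algo="sha256"):
    """Classify paths as OK, MODIFIED, MISSING, or NEW against a manifest."""
    current = build_manifest(root, algo)
    report = {"OK": [], "MODIFIED": [], "MISSING": [], "NEW": []}
    for rel, digest in manifest.items():
        if rel not in current:
            report["MISSING"].append(rel)
        elif current[rel] != digest:
            report["MODIFIED"].append(rel)
        else:
            report["OK"].append(rel)
    report["NEW"] = sorted(set(current) - set(manifest))
    return report


def find_duplicates(root, algo="sha256"):
    """Group exact duplicates by digest, hashing only files that share a size."""
    root = Path(root)
    by_size = defaultdict(list)
    for rel in iter_files(root):
        by_size[(root / rel).stat().st_size].append(rel)
    by_digest = defaultdict(list)
    for rels in by_size.values():
        if len(rels) < 2:
            continue  # a file with a unique size cannot have a duplicate
        for rel in rels:
            by_digest[hash_file(root / rel, algo)].append(rel)
    return {d: sorted(r) for d, r in by_digest.items() if len(r) > 1}
```

A cron wrapper could then exit with status 1 whenever `verify` returns a non-empty `MODIFIED` or `MISSING` list, matching the exit codes in item 7.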