AI for Bash & Python Automation Difficulty: Intermediate ClaudeChatGPT

Python File Integrity and Dedup Tool Prompt

Build a Python tool that hashes a directory tree, detects duplicate and corrupted files, and verifies integrity against a stored manifest — for backups, artifact stores, and data-pipeline outputs.

Target user: Engineers managing large file trees, backups, or artifacts
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior engineer who has built data-integrity tooling for backup systems and artifact stores. You know that "trust me, the copy worked" is how silent corruption ships to production.

I will provide:
- The directory tree(s) involved and rough scale (file count, total size)
- Whether I need dedup detection, corruption detection, or both
- Whether a baseline manifest already exists or must be created
- Performance constraints (single run vs nightly cron over millions of files)

Your job:

1. **Hashing core** — walk the tree, hash file contents in fixed-size chunks (don't read whole files into memory), default to `blake2b` or `sha256`. Record path, size, mtime, and digest. Stream so a 50 GB file never blows up RAM.

2. **Manifest format** — emit a stable, sorted, line-oriented manifest (JSONL or `<digest>  <relpath>` like `sha256sum`) that diffs cleanly in git and is re-readable. Store relative paths so the manifest is portable.

3. **Verify mode** — given an existing manifest, re-hash and report: OK, MODIFIED (digest changed), MISSING (in manifest, not on disk), NEW (on disk, not in manifest). Exit non-zero if any MODIFIED/MISSING so cron can alert.

4. **Dedup mode** — group files by digest to find exact duplicates; optionally pre-filter by size to skip hashing unique-sized files (a big speedup). Report wasted bytes and offer a dry-run plan to hardlink or delete duplicates — never delete without `--apply` and keeping one canonical copy.

5. **Performance** — use a process/thread pool for hashing (hashing is CPU-light, I/O-bound — choose the right pool and explain why); show progress with `rich`/`tqdm`; make the size-prefilter for dedup the default.

6. **Safety** — `--dry-run` is the default for any destructive dedup action; require `--apply` plus a confirmation; never follow symlinks unless asked; skip special files.

7. **Output** — human summary plus `--json` for pipelines; clear exit codes (0 clean, 1 integrity failure, 2 usage).

8. **Tests** — pytest covering: round-trip create→verify (clean), a flipped byte (MODIFIED), a deleted file (MISSING), and dedup detection on two identical files.

Output: (a) the tool with create/verify/dedup subcommands, (b) the manifest format spec, (c) the pytest suite, (d) example cron line for nightly verify. Bias toward streaming, dry-run-by-default, and alertable exit codes.

Free: the DevOps AI Incident-Triage Cheat Sheet