Python Object Storage Sync Script Prompt
Write a resumable one-way sync to S3-compatible object storage — checksum-based change detection, multipart uploads, concurrency, dry-run, and delete-extras guardrails — without shelling out to the aws CLI.
- Target user: Engineers building backup, artifact, and asset-publishing automation
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior engineer who has written object-storage sync tooling that moves terabytes nightly without re-uploading unchanged files or silently deleting the wrong bucket. Build a one-way local-to-object-storage sync in Python.

I will provide:
- The provider (AWS S3, MinIO, R2, or GCS via its S3-compatible API) and endpoint/region
- Source directory layout and roughly how it changes between runs
- Whether the destination should mirror exactly (delete extras) or only add/update
- Object size distribution and how much bandwidth/concurrency is acceptable

Your job:
1. **Use the SDK, not the CLI**: `boto3` with a configurable `endpoint_url` so the same code targets AWS, MinIO, and R2. Read credentials from the standard chain (env, profile, IAM role); never hardcode keys.
2. **Detect changes correctly**: compare local files to remote objects by size first, then by checksum. Explain S3's ETag caveats (multipart ETags are not plain MD5) and prefer storing your own content hash in object metadata so change detection stays correct across multipart thresholds. Skip unchanged files entirely.
3. **Upload efficiently**: use `upload_file` with `TransferConfig` so large files switch to multipart automatically with a tuned threshold and concurrency; set content-type, cache-control, and a content-hash metadata tag.
4. **Parallelize safely**: use a bounded `ThreadPoolExecutor`. boto3 sessions and resources are not thread-safe; low-level clients generally are, so either share one client or create a session and client per worker via a thread-local. Make the worker count configurable and back off on throttling (`SlowDown`/503) with jittered retries.
5. **Guard --delete**: mirroring must require an explicit `--delete` flag, support `--dry-run` showing every add/update/delete, and refuse to delete more than a configurable percentage of existing objects without a `--force` override (the classic "empty prefix wipes the bucket" footgun).
6. **Be resumable**: uploads are idempotent by key, so a crashed run can simply be re-run and will skip already-matching objects; ensure incomplete multipart uploads are aborted (for example via `abort_multipart_upload` or a bucket lifecycle rule) so they do not accrue storage charges.
7. **Report**: print and log counts for scanned, uploaded, skipped-unchanged, deleted, failed, bytes moved, and elapsed time, as a single summary line fit for a cron email.

Output: (a) the sync module with change-detection and transfer config, (b) the parallel executor with throttling backoff, (c) the dry-run and delete-guard logic, (d) a pytest suite using a MinIO/moto fixture covering add, update, skip, and delete-guard paths.

Be opinionated: SDK over CLI, hash-in-metadata over ETag guessing, dry-run by default for deletes.
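For orientation only (this is not part of the prompt text): the change-detection rule in step 2 might come out roughly like the sketch below. It assumes the remote state has already been fetched into a plain dict, and the metadata key name `sha256` and the helper names `file_sha256` and `decide` are hypothetical choices made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Optional

HASH_KEY = "sha256"  # hypothetical metadata key; any consistent name works


def file_sha256(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def decide(path: Path, remote: Optional[dict]) -> str:
    """Return 'add', 'update', or 'skip' for one local file.

    `remote` is assumed to look like {"size": int, "metadata": {...}},
    for example built from a head_object response; None means the key
    does not exist at the destination.
    """
    if remote is None:
        return "add"
    if path.stat().st_size != remote["size"]:
        return "update"  # cheap size check first, no hashing needed
    stored = remote.get("metadata", {}).get(HASH_KEY)
    if stored is None:
        return "update"  # no trusted hash recorded, so re-upload once
    return "skip" if stored == file_sha256(path) else "update"
```

Comparing against a hash the tool wrote itself avoids depending on the ETag, whose value for multipart uploads is not the MD5 of the object body.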
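Again outside the prompt itself: a minimal sketch of what step 3 is asking for, assuming `boto3` is installed and a client has been created elsewhere (for instance with `boto3.client("s3", endpoint_url=...)`). The helper names, the threshold and concurrency numbers, and the `sha256` metadata key are illustrative assumptions, not tuned recommendations.

```python
import mimetypes
from pathlib import Path


def build_extra_args(path: Path, sha256_hex: str,
                     cache_control: str = "max-age=3600") -> dict:
    """Build the ExtraArgs dict passed to upload_file (values are illustrative)."""
    content_type, _ = mimetypes.guess_type(path.name)
    return {
        "ContentType": content_type or "application/octet-stream",
        "CacheControl": cache_control,
        "Metadata": {"sha256": sha256_hex},  # hypothetical key name
    }


def upload(client, path: Path, bucket: str, key: str, sha256_hex: str) -> None:
    """Upload one file; files above the threshold are sent as multipart."""
    # Imported lazily so the pure helper above works without boto3 installed.
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,  # assumed value, tune per workload
        multipart_chunksize=16 * 1024 * 1024,  # assumed value
        max_concurrency=4,                     # threads per single transfer
    )
    client.upload_file(
        str(path), bucket, key,
        ExtraArgs=build_extra_args(path, sha256_hex),
        Config=config,
    )
```

Note that `max_concurrency` applies within a single transfer, so the effective thread count is roughly that value multiplied by the number of sync workers.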
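The delete guard in step 5 is easiest to reason about as a pure function over two key sets. The sketch below is one possible shape, not a prescribed one; `plan_deletes`, `DeleteGuardError`, and the 10% default are hypothetical names and values chosen for illustration.

```python
class DeleteGuardError(RuntimeError):
    """Raised when a run would delete more objects than the guard allows."""


def plan_deletes(local_keys: set, remote_keys: set, *, delete: bool = False,
                 max_delete_pct: float = 10.0, force: bool = False) -> list:
    """Return the remote keys to delete, or raise if the guard trips.

    Without delete=True nothing is ever removed. With it, the share of
    existing remote objects that would be deleted must stay at or below
    max_delete_pct unless force=True.
    """
    if not delete:
        return []
    extras = sorted(remote_keys - local_keys)
    if not extras:
        return []
    pct = 100.0 * len(extras) / len(remote_keys)
    if pct > max_delete_pct and not force:
        raise DeleteGuardError(
            f"refusing to delete {len(extras)} of {len(remote_keys)} objects "
            f"({pct:.1f}% > {max_delete_pct:.1f}%); pass --force to override"
        )
    return extras
```

A dry run can call the same function and print the returned list instead of issuing deletes, so the guard is exercised identically in both modes.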
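For the throttling backoff in step 4, a hand-rolled wrapper could look like the sketch below. It assumes exceptions carry a botocore-style `response` dict (as `ClientError` does); `with_backoff`, `is_throttle`, and the set of retryable codes are illustrative assumptions. The `sleep` and `rng` parameters exist only so the behaviour can be tested without real delays.

```python
import random
import time

# Assumed set of retryable error codes; adjust per provider.
RETRYABLE_CODES = {"SlowDown", "ServiceUnavailable", "RequestTimeout"}


def is_throttle(exc: Exception) -> bool:
    """True if the exception looks like a throttling or 503 response."""
    response = getattr(exc, "response", None) or {}
    code = response.get("Error", {}).get("Code", "")
    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
    return code in RETRYABLE_CODES or status == 503


def with_backoff(fn, *, attempts: int = 5, base: float = 0.5, cap: float = 30.0,
                 sleep=time.sleep, rng=random.random):
    """Call fn(), retrying throttled failures with jittered exponential delay."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if not is_throttle(exc) or attempt == attempts - 1:
                raise  # non-retryable error, or retries exhausted
            # Full jitter: random fraction of the capped exponential delay.
            sleep(rng() * min(cap, base * 2 ** attempt))
```

botocore also ships its own retry handling, configurable through `botocore.config.Config(retries={"max_attempts": ..., "mode": "adaptive"})`, which may make a custom wrapper unnecessary for many workloads.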