Python Paginated REST API Extractor Prompt
Build a Python script that walks a paginated REST API, handles rate limits and retries, and writes normalized records to JSONL incrementally and resumably.
- Target user: Data and platform engineers pulling bulk data from APIs
- Difficulty: Advanced
- Tools: Claude, Copilot
The prompt
You are a senior data integration engineer who writes API extractors that are polite to upstream services and safe to interrupt and resume.

I will provide:
- The base URL, auth method, and the pagination style (cursor, page number, or Link header)
- The fields I need and the output path
- Any documented rate limits

Your job:
1. **CLI**: use `argparse` for the base URL, output file, and page size, plus `--resume` and `--dry-run` flags.
2. **Authenticate from env**: read tokens from environment variables; never log full credentials.
3. **Paginate generically**: implement a generator that yields pages until exhausted, abstracting the pagination style behind one function.
4. **Respect limits**: honor `Retry-After` and rate-limit headers; back off exponentially with jitter on 429 and 5xx, but fail fast on other 4xx responses such as 401/403.
5. **Stream output**: append each normalized record as a JSONL line and flush, so memory stays bounded and partial output survives a crash.
6. **Resume**: record the last successful cursor/page to a checkpoint file so `--resume` continues without duplicating already-fetched records.
7. **Dry-run**: `--dry-run` fetches only the first page and prints the shape of its records.

Output as: (a) the type-annotated script, (b) a sample checkpoint file, (c) a cron example.

Always honor `Retry-After` and checkpoint progress so an interrupted extract resumes without re-hammering the API.
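
For orientation, here is a minimal Python sketch of the two rules the prompt closes on: choosing a retry delay that honors `Retry-After`, and checkpointing after each page is written. It is not the script the prompt is expected to produce. The function names (`retry_delay`, `save_checkpoint`, `load_checkpoint`, `write_page`), the `next_cursor` checkpoint key, and the default `base` and `cap` values are all assumptions made for illustration, and only the delta-seconds form of `Retry-After` is parsed.

```python
import json
import os
import random
import tempfile
from typing import Any, Dict, Iterable, Optional


def retry_delay(status: int, attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 60.0) -> Optional[float]:
    """Return seconds to wait before retrying, or None to fail fast."""
    if not (status == 429 or 500 <= status < 600):
        return None  # e.g. 401/403: retrying will not help
    if retry_after is not None:
        try:
            # Delta-seconds form only; the HTTP-date form is not handled here.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # unparseable header: fall back to computed backoff
    # Exponential backoff with full jitter, capped at `cap` seconds.
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def save_checkpoint(path: str, cursor: Optional[str]) -> None:
    """Write the checkpoint atomically via a temp file and os.replace."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"next_cursor": cursor}, fh)  # hypothetical key name
    os.replace(tmp, path)


def load_checkpoint(path: str) -> Optional[str]:
    """Return the saved cursor, or None when no checkpoint exists yet."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh).get("next_cursor")
    except FileNotFoundError:
        return None


def write_page(out_path: str, checkpoint_path: str,
               records: Iterable[Dict[str, Any]],
               next_cursor: Optional[str]) -> None:
    """Append one page of records as JSONL, then advance the checkpoint."""
    with open(out_path, "a", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
        out.flush()
        os.fsync(out.fileno())
    # Checkpoint only after the page is durably on disk.
    save_checkpoint(checkpoint_path, next_cursor)
```

With this ordering, a crash between the page write and the checkpoint update means the page is fetched and appended again on resume, so a real extractor would still need to deduplicate on a record key or truncate the output back to the last checkpointed offset.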