Python Memory-Bounded JSONL Stream Processor Prompt

Process multi-gigabyte JSONL or newline-delimited log files in constant memory with Python generators — filter, transform, and aggregate without loading the file into RAM or choking on malformed lines.

Target user: Data and platform engineers crunching large log/event exports on memory-constrained boxes
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a Python data-plumbing engineer who routinely processes JSONL files larger than available RAM without ever loading them whole.

I will provide:
- A sample of the JSONL/NDJSON input and its rough size
- The transform/filter/aggregation needed
- Output format (JSONL, CSV, summary stats) and any sort/group requirement
- Constraints (memory ceiling, gzip input, must handle corrupt lines)

Your job:

1. **Streaming spine** — read line-by-line with a generator pipeline (`open(...)` iterates lazily; for gzip use `gzip.open(..., "rt")`). Never `.read()` or `.readlines()` the whole file. Each stage (`parse → filter → transform → format`) is its own generator so memory stays O(1) in file size.

2. **Resilient parsing** — wrap `json.loads` per line; on a malformed line, increment a counter and route the raw line to a dead-letter file rather than crashing. Report a final "N processed, M skipped" summary and exit non-zero if the skip rate exceeds a threshold.

3. **Bounded aggregation** — for group-by/dedup that would normally need the whole dataset in memory, recommend the right tool: `collections.Counter`/`defaultdict` when key cardinality is small, an on-disk approach (`sqlite3`, sorted external merge, or pre-sorting with the OS `sort`) when it's large. Be explicit about which case applies.

4. **Backpressure and chunking** — when writing downstream (DB, API), batch in fixed-size chunks with `itertools.islice` so you never build an unbounded list.

5. **Performance** — note when `orjson`/`ujson` is worth it, when to bypass full JSON parsing with a cheap substring pre-filter, and how to parallelize across line ranges only if ordering allows.

6. **CLI shape** — accept stdin or a path (so it composes in shell pipelines), support `--gzip`, `--limit`, and stream JSONL to stdout for chaining with `jq`.

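7. **Testing** — fixtures with valid, malformed, empty, and huge-line inputs; assert that peak memory stays flat as the input line count grows (for example, by measuring with `tracemalloc`), and verify that malformed lines are routed to the dead-letter file.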

Output as: (a) the generator-pipeline script, (b) the malformed-line dead-letter handling, (c) the small-vs-large aggregation decision guide, (d) the test fixtures.

Bias toward: lazy generators end-to-end, never crashing on one bad line, and clear memory-complexity claims.
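
For orientation, here is a minimal sketch of the kind of script the prompt asks for, covering the streaming spine, dead-letter routing, and fixed-size batching (steps 1, 2, and 4). It is an illustration, not the model's actual output: the names `parse_lines`, `batched`, and `stats` are hypothetical, the sample records are invented, and in-memory `io.StringIO` objects stand in for a real input file (`open(path)` or `gzip.open(path, "rt")`) and a real dead-letter file. It also assumes every valid line decodes to a JSON object.

```python
import io
import json
from itertools import islice


def parse_lines(lines, dead_letter, stats):
    """Yield one record per valid JSON line; route bad lines to dead_letter."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored, not counted as errors
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            stats["skipped"] += 1
            # Keep the raw text so the bad line can be inspected later.
            dead_letter.write(raw if raw.endswith("\n") else raw + "\n")
            continue
        stats["processed"] += 1
        yield record


def batched(records, size):
    """Yield lists of at most `size` records without materializing the input."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical sample input: three valid records, one corrupt line, one blank.
source = io.StringIO(
    '{"level": "error", "ms": 12}\n'
    'not json at all\n'
    '{"level": "info", "ms": 3}\n'
    '\n'
    '{"level": "error", "ms": 40}\n'
)
dead_letter = io.StringIO()
stats = {"processed": 0, "skipped": 0}

# Each stage is a generator, so only one line and one batch are held at a time.
records = parse_lines(source, dead_letter, stats)
errors = (r for r in records if r.get("level") == "error")
batches = list(batched(errors, 1))  # list() only to inspect the tiny sample

print(f"{stats['processed']} processed, {stats['skipped']} skipped")
```

In a real run the batches would be consumed one at a time (written to a database or API) rather than collected into a list, and the skip counter would drive the non-zero exit code described in step 2.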
