Processing Huge Files with awk and Streaming, Not RAM
When a log file is bigger than your memory, loading it into a list is the wrong move. Here's how to stream multi-gigabyte files with awk and Python generators.
- #bash
- #python
- #awk
- #streaming
- #performance
- #data
There’s a moment every ops engineer hits: a log file or data export that’s 12GB, you need to extract something from it, and your first instinct — open it, read it into a list, process the list — gets the box OOM-killed. The file is bigger than RAM, and that approach was never going to work.
The fix is a mindset, not a tool: stream the data, don’t materialize it. Process one record at a time, hold a tiny bit of state, and let the bytes flow through. awk does this natively, Python does it with generators, and once it clicks you’ll stop loading files into memory for good.
Why awk is built for this
awk’s entire execution model is streaming. It reads input line by line, runs your program against each line, and never holds the whole file. A 100-byte awk one-liner will happily chew through a 50GB file using a few megabytes of memory. That’s not a trick — it’s the design.
The structure is pattern { action }. For each line, if the pattern matches, run the action:
# Sum response sizes (field 10) for HTTP 500s in an access log, in constant memory
# (bytes + 0 prints 0 instead of a blank line when nothing matches)
awk '$9 == 500 { bytes += $10 } END { print bytes + 0 }' access.log
$9 and $10 are the ninth and tenth whitespace-separated fields. bytes accumulates across lines, and the END block runs once after the last line. No matter how big access.log is, this uses essentially no memory beyond the current line.
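If you think in Python more readily than awk, here's a rough sketch of the same pattern, action, END shape. The function name sum_bytes_for_status and the sample lines are made up for illustration, and it assumes the same layout as the one-liner (status in field 9, size in field 10).

```python
def sum_bytes_for_status(lines, status="500"):
    """Rough mirror of: awk '$9 == 500 { bytes += $10 } END { print bytes }'."""
    total = 0                        # the only state carried between lines
    for line in lines:               # one record at a time, like awk's main loop
        fields = line.split()        # whitespace split, like awk's default FS
        if len(fields) < 10:
            continue                 # short or malformed line: skip it
        # awk fields are 1-based, Python lists are 0-based: $9 is fields[8]
        if fields[8] == status and fields[9].isdigit():
            total += int(fields[9])  # a "-" size (no body) is skipped
    return total                     # plays the role of the END block

# Hypothetical sample lines in a common access-log layout
sample = [
    '10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /a HTTP/1.1" 500 120',
    '10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /b HTTP/1.1" 200 900',
    '10.0.0.3 - - [01/Jan/2024:00:00:03 +0000] "GET /a HTTP/1.1" 500 30',
]
print(sum_bytes_for_status(sample))  # prints 150
```

Because the function takes any iterable of lines, you could pass it an open file object and it would still hold only one line plus the running total.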
Real streaming aggregations with awk
The power shows up when you aggregate into associative arrays — awk’s built-in hash maps:
# Count requests per status code
awk '{ count[$9]++ } END { for (code in count) print code, count[code] }' access.log
# Top 10 client IPs by request volume
awk '{ ip[$1]++ } END { for (a in ip) print ip[a], a }' access.log \
| sort -rn | head -10
# Count requests slower than 1 second, by endpoint (assumes the last field is the request time in seconds)
awk '$NF > 1.0 { slow[$7]++ } END { for (e in slow) print slow[e], e }' access.log \
| sort -rn
The only memory you use is the size of the distinct keys — usually tiny relative to the file. A billion log lines with a few hundred status/endpoint combinations still fits in kilobytes of state. That’s the streaming aggregation pattern: bounded memory regardless of input size.
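For multi-gigabyte files this usually gets you an answer much faster than loading the whole file into pandas, which has to parse and hold every row first. And awk ships with practically every Unix-like system, so there's nothing to install.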
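To see the "state is only as big as the distinct keys" point in isolation, here's a small sketch using a Counter as the associative array. The helper names count_by_field and synthetic_lines are hypothetical, and the generated lines are a simplified stand-in for a real log.

```python
from collections import Counter

def count_by_field(lines, index):
    """Count lines by the value of one whitespace-separated field (0-based)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if index < len(fields):      # skip lines that are too short
            counts[fields[index]] += 1
    return counts

def synthetic_lines(n):
    """Hypothetical stand-in for a huge log: yields n lines, never stores them."""
    statuses = ("200", "404", "500")
    for i in range(n):
        yield f"client{i % 250} GET /page {statuses[i % 3]}"

by_status = count_by_field(synthetic_lines(300_000), index=3)
print(len(by_status), "distinct keys for 300000 lines")
print(sorted(by_status.items()))
# prints:
# 3 distinct keys for 300000 lines
# [('200', 100000), ('404', 100000), ('500', 100000)]
```

The counter ends up with three entries no matter how large n gets. Switch index to 0 and it grows to 250 entries, which is the same trade-off the awk arrays make.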
The Python equivalent: generators and lazy iteration
When the logic gets too gnarly for awk — JSON parsing, complex conditionals, calling out to other code — switch to Python, but keep the streaming discipline. The cardinal rule:
# WRONG: loads the entire file into a list, then into another list
lines = open("huge.log").readlines()  # 12GB in RAM, OOM
errors = [l for l in lines if "ERROR" in l]

# RIGHT: iterate the file object lazily, one line at a time
def error_lines(path):
    with open(path) as f:
        for line in f:  # file objects are lazy iterators
            if "ERROR" in line:
                yield line  # produced on demand, never stored
A file object in Python is already a lazy iterator: for line in f reads one line at a time. readlines() and read() with no size argument are the two calls that defeat this by pulling everything into memory. Avoid them on big files.
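One case the line-by-line loop doesn't cover is a file with no newlines, or one enormous line, because a single "line" can then be the whole file. A minimal sketch for that case reads fixed-size chunks with read(size). The function name count_byte and the 1 MiB default are assumptions for illustration, and it only counts a single byte value so a match can never straddle two chunks.

```python
import io

def count_byte(stream, target=b"\n", chunk_size=1024 * 1024):
    """Count occurrences of a single byte, holding one chunk at a time."""
    if len(target) != 1:
        raise ValueError("target must be exactly one byte")
    total = 0
    # iter(callable, sentinel) keeps calling stream.read(chunk_size)
    # until it returns b"", which signals end of file for binary streams
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        total += chunk.count(target)
    return total

# In-memory stand-in for a file opened with open(path, "rb")
data = io.BytesIO(b"a\nb\nc\n")
print(count_byte(data, chunk_size=2))  # prints 3
```

Peak memory here is one chunk, so chunk_size is the knob: bigger chunks mean fewer read calls, smaller ones mean a smaller footprint.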
Chain generators to build a pipeline where nothing is materialized until the very end:
import json
from collections import Counter

def parse_json_logs(path):
    with open(path) as f:
        for line in f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines, keep streaming
            if isinstance(rec, dict):  # ignore valid JSON that isn't an object
                yield rec

def slow_requests(records, threshold_ms=1000):
    for rec in records:
        if rec.get("duration_ms", 0) > threshold_ms:
            yield rec

# Nothing is read until we start consuming the final generator
records = parse_json_logs("requests.jsonl")
slow = slow_requests(records)

by_endpoint = Counter()
for rec in slow:  # the pipeline runs lazily, line by line
    by_endpoint[rec.get("endpoint", "<unknown>")] += 1

for endpoint, n in by_endpoint.most_common(10):
    print(n, endpoint)
Each line flows through parse_json_logs → slow_requests → the counter, then is discarded. Peak memory is one line plus the Counter’s bounded keys — whether the file is 1MB or 1TB.
Combine them: awk to thin, Python to think
The pragmatic move on a truly huge file is to let awk do the cheap, high-volume filtering and hand Python the small remainder for complex logic:
# awk slashes 50GB to the 0.1% of lines that matter, Python parses those
awk '$9 >= 500' access.log | python3 analyze_errors.py
Read the stream in Python with sys.stdin:
import sys

for line in sys.stdin:  # still lazy: one line at a time from the pipe
    ...
This plays to each tool’s strength: awk’s raw line-throughput for the firehose, Python’s expressiveness for the trickle.
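As a sketch of what the Python side might look like, here's one possible body for analyze_errors.py. The contents of that script aren't specified above, so everything here is an assumption: the function name count_errors, the choice to group by status and path, and the field positions (status in field 9, path in field 7). The demo feeds it a small in-memory sample so it runs standalone.

```python
from collections import Counter

def count_errors(lines):
    """Count (status, path) pairs from already-filtered access-log lines."""
    counts = Counter()
    for line in lines:               # works on any iterable of lines
        fields = line.split()
        if len(fields) < 9:
            continue                 # malformed line: skip, keep streaming
        status, path = fields[8], fields[6]
        counts[(status, path)] += 1
    return counts

# Hypothetical sample of what awk '$9 >= 500' might pass through
sample = [
    '10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /api/pay HTTP/1.1" 502 0',
    '10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /api/pay HTTP/1.1" 502 0',
    '10.0.0.3 - - [01/Jan/2024:00:00:03 +0000] "POST /login HTTP/1.1" 500 12',
    'garbage line',
]
for (status, path), n in count_errors(sample).most_common(10):
    print(n, status, path)
# prints:
# 2 502 /api/pay
# 1 500 /login
```

In the real script you would pass sys.stdin instead of sample. Since the function only iterates, it consumes the pipe one line at a time.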
Pitfalls to avoid
- readlines(), read(), f.read().split("\n"): all of these materialize the file. They’re the usual culprits behind OOM.
- Building a giant list “to process later.” If you append every record, you’ve reinvented loading the file. Aggregate as you go.
- sort on the whole file in memory. GNU sort actually spills to disk, but a naive Python sorted(all_lines) won’t. Sort thinned data, or use sort itself.
- Unbounded dictionaries. Streaming protects you only if your state stays small. A dict keyed on something high-cardinality (request IDs, timestamps) grows with the file and brings back the OOM.
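When what you want is a "top N" rather than a full ordering, one way around the sorting pitfall is heapq.nlargest, which accepts any iterable. A minimal sketch, assuming records are dicts with a duration_ms key as in the JSON example (the function name slowest and the sample values are made up):

```python
import heapq

def slowest(records, n=3):
    """Return the n records with the largest duration_ms.

    Fed a generator, CPython's heapq.nlargest keeps roughly n candidates
    in a heap instead of sorting the whole input.
    """
    return heapq.nlargest(n, records, key=lambda rec: rec["duration_ms"])

# Hypothetical durations; a generator stands in for a streamed file
durations = [120, 950, 30, 1800, 400, 2200]
stream = ({"duration_ms": d} for d in durations)

top = slowest(stream, n=2)
print([rec["duration_ms"] for rec in top])  # prints [2200, 1800]
```

This keeps state proportional to n rather than to the file, which is the same bounded-state rule the list above is getting at.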
The whole discipline is one idea: data flows through, it doesn’t pile up. Keep your state bounded and you can process a file ten times the size of your RAM on a tiny box without breaking a sweat.
For more on wrangling data in the shell and Python, see the Bash & Python automation guides or start from a prompt.
Field numbers and delimiters in the awk examples assume a standard whitespace-delimited log. Adjust $N and set FS to match your actual format before running.