Processing Huge Files with awk and Streaming, Not RAM

There’s a moment every ops engineer hits: a log file or data export that’s 12GB, you need to extract something from it, and your first instinct — open it, read it into a list, process the list — gets the box OOM-killed. The file is bigger than RAM, and that approach was never going to work.

The fix is a mindset, not a tool: stream the data, don’t materialize it. Process one record at a time, hold a tiny bit of state, and let the bytes flow through. awk does this natively, Python does it with generators, and once it clicks you’ll stop loading files into memory for good.

Why awk is built for this

awk’s entire execution model is streaming. It reads input line by line, runs your program against each line, and never holds the whole file. A 100-byte awk one-liner will happily chew through a 50GB file using a few megabytes of memory. That’s not a trick — it’s the design.

The structure is pattern { action }. For each line, if the pattern matches, run the action:

# Sum response sizes (field 10) for HTTP 500s in an access log, in constant memory
# "+ 0" forces numeric output, so a run with no matches prints 0 rather than a blank line
awk '$9 == 500 { bytes += $10 } END { print bytes + 0 }' access.log

$9 and $10 are the ninth and tenth whitespace-separated fields. bytes accumulates across lines, and the END block runs once after the last line. No matter how big access.log is, this uses essentially no memory beyond the current line.
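For readers who think in Python, the one-liner's execution model can be sketched as an explicit loop. This is only an illustration of what awk does implicitly: the function name `sum_bytes_for_status` and the sample lines are made up for the example, and the string comparison on the status field is a simplification of awk's numeric comparison.

```python
def sum_bytes_for_status(lines, status="500"):
    """Rough Python mirror of: awk '$9 == 500 { bytes += $10 } END { print bytes }'."""
    total = 0                            # the only state carried between lines
    for line in lines:                   # one record at a time, like awk's main loop
        fields = line.split()            # default awk-style whitespace splitting
        if len(fields) < 10:
            continue                     # line too short to have $9 and $10
        if fields[8] == status:          # $9 (awk fields are 1-based, lists 0-based)
            if fields[9].isdigit():      # $10; a "-" placeholder adds nothing
                total += int(fields[9])
    return total                         # what the END block would print


# Hypothetical sample data in a common access-log layout.
sample = [
    '10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /a HTTP/1.1" 500 120',
    '10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /b HTTP/1.1" 200 999',
    '10.0.0.3 - - [01/Jan/2024:00:00:03 +0000] "GET /c HTTP/1.1" 500 30',
    '10.0.0.4 - - [01/Jan/2024:00:00:04 +0000] "GET /d HTTP/1.1" 500 -',
]
print(sum_bytes_for_status(sample))
```

Passing an open file object instead of `sample` keeps the same constant-memory behaviour, since the loop only ever sees one line at a time.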

Real streaming aggregations with awk

The power shows up when you aggregate into associative arrays — awk’s built-in hash maps:

# Count requests per status code
awk '{ count[$9]++ } END { for (code in count) print code, count[code] }' access.log

# Top 10 client IPs by request volume
awk '{ ip[$1]++ } END { for (a in ip) print ip[a], a }' access.log \
  | sort -rn | head -10

# Slow requests (over 1 second) per endpoint; assumes the last field ($NF) is the request time in seconds
awk '$NF > 1.0 { slow[$7]++ } END { for (e in slow) print slow[e], e }' access.log \
  | sort -rn

The only memory you use grows with the number of distinct keys, which is usually tiny relative to the file. A billion log lines with a few hundred status/endpoint combinations still fit in kilobytes of state. That’s the streaming aggregation pattern: memory bounded by key cardinality, not by input size.
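The bounded-state claim is easy to check empirically. The sketch below streams synthetic log lines from a generator and counts them by status code; the number of keys held stays fixed while the number of lines grows. The generator, its field layout, and both function names are invented for the illustration.

```python
from collections import Counter
from itertools import cycle, islice


def synthetic_log(n):
    """Yield n made-up access-log lines one at a time; no list is ever built."""
    statuses = cycle(["200", "200", "200", "404", "500"])
    for i, status in enumerate(islice(statuses, n)):
        yield f'10.0.0.{i % 250} - - [ts +0000] "GET /item/{i % 7} HTTP/1.1" {status} 512'


def count_by_field(lines, field_number):
    """Count occurrences of the given 1-based field, like awk's count[$N]++."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= field_number:
            counts[fields[field_number - 1]] += 1
    return counts


small = count_by_field(synthetic_log(1_000), 9)
large = count_by_field(synthetic_log(100_000), 9)
print(len(small), len(large))   # same number of keys despite 100x more input
print(large.most_common())
```

Counting by field 1 (the client address) instead would hold up to 250 keys with this synthetic data, which is the same trade-off the top-IPs one-liner makes on a real log.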

For multi-gigabyte files this is typically much faster than loading the whole file into pandas, and awk is preinstalled on virtually every Unix-like system, so there’s nothing to install.

The Python equivalent: generators and lazy iteration

When the logic gets too gnarly for awk (JSON parsing, complex conditionals, calling out to other code), switch to Python, but keep the streaming discipline. The cardinal rule: iterate over the file, never load it.

# WRONG — loads the entire file into a list, then into another list
lines = open("huge.log").readlines()      # 12GB in RAM, OOM
errors = [l for l in lines if "ERROR" in l]

# RIGHT — iterate the file object lazily, one line at a time
def error_lines(path):
    with open(path) as f:
        for line in f:                    # file objects are lazy iterators
            if "ERROR" in line:
                yield line                # produced on demand, never stored

A file object in Python is already a lazy iterator: for line in f reads one line at a time. readlines() and a no-argument read() defeat this by pulling everything into memory, as does list(f). Avoid them on big files.
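What blows up is read() with no argument. read(size) returns at most size bytes (or characters in text mode), which is the usual way to stream a file that has no convenient line breaks, such as a minified single-line dump. A minimal sketch, with the function name and chunk sizes chosen arbitrarily for the example:

```python
import os
import tempfile
from functools import partial


def count_newlines(path, chunk_size=1 << 20):
    """Count newline bytes while holding at most chunk_size bytes at a time."""
    total = 0
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps calling f.read(chunk_size)
        # until it returns b"", which signals end of file.
        for chunk in iter(partial(f.read, chunk_size), b""):
            total += chunk.count(b"\n")
    return total


# Small throwaway file so the sketch runs as-is.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"one\ntwo\nthree\n" * 1000)
try:
    print(count_newlines(tmp.name, chunk_size=64))
finally:
    os.unlink(tmp.name)
```

Counting a single byte value is safe across chunk boundaries. Logic that looks for multi-byte patterns would need to carry the tail of each chunk over to the next, which this sketch does not attempt.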

Chain generators to build a pipeline where nothing is materialized until the very end:

import json

def parse_json_logs(path):
    with open(path) as f:
        for line in f:
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue          # skip malformed lines, keep streaming

def slow_requests(records, threshold_ms=1000):
    for rec in records:
        if rec.get("duration_ms", 0) > threshold_ms:
            yield rec

# Nothing is read until we start consuming the final generator
records = parse_json_logs("requests.jsonl")
slow = slow_requests(records)

from collections import Counter
by_endpoint = Counter()
for rec in slow:                  # the pipeline runs lazily, line by line
    by_endpoint[rec.get("endpoint", "unknown")] += 1   # don't crash mid-stream on a record without the field

for endpoint, n in by_endpoint.most_common(10):
    print(n, endpoint)

Each line flows through parse_json_logs → slow_requests → the counter, then is discarded. Peak memory is one line plus the Counter’s bounded keys — whether the file is 1MB or 1TB.
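One way to convince yourself the pipeline really is lazy is to count how many lines get pulled from the source. The sketch below uses variants of the two stages that take any iterable of lines rather than a path, plus a hypothetical `counted` wrapper and a made-up data source; it then asks for only the first three slow records.

```python
import json
from itertools import islice


def counted(lines, stats):
    """Pass lines through unchanged while recording how many were pulled."""
    for line in lines:
        stats["pulled"] += 1
        yield line


def parse_json_lines(lines):
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue                      # skip malformed lines, keep streaming


def slow_requests(records, threshold_ms=1000):
    for rec in records:
        if rec.get("duration_ms", 0) > threshold_ms:
            yield rec


# Made-up source: a million JSON lines, generated on demand rather than stored.
source = (json.dumps({"endpoint": "/x", "duration_ms": 500 + i})
          for i in range(1_000_000))

stats = {"pulled": 0}
pipeline = slow_requests(parse_json_lines(counted(source, stats)))
pulled_before = stats["pulled"]           # still 0: building the pipeline reads nothing

first_three = list(islice(pipeline, 3))   # consume only three results
print(pulled_before, stats["pulled"], [r["duration_ms"] for r in first_three])
```

Only the lines needed to produce three results are ever read; the rest of the million are never generated. The same islice trick is a cheap way to smoke-test a pipeline against the head of a huge file before committing to a full run.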

Combine them: awk to thin, Python to think

The pragmatic move on a truly huge file is to let awk do the cheap, high-volume filtering and hand Python the small remainder for complex logic:

# awk slashes 50GB to the 0.1% of lines that matter, Python parses those
awk '$9 >= 500' access.log | python3 analyze_errors.py

Read the stream in Python with sys.stdin:

import sys
for line in sys.stdin:            # still lazy — one line at a time from the pipe
    ...

This plays to each tool’s strength: awk’s raw line-throughput for the firehose, Python’s expressiveness for the trickle.
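The source does not show what analyze_errors.py contains, so the following is only a guess at its shape: a sketch that counts the awk-filtered lines by status and request path, assuming the same field positions as the awk examples. The `summarize` function and the sample lines are hypothetical.

```python
import io
from collections import Counter


def summarize(stream):
    """Count (status, path) pairs from awk-filtered access-log lines."""
    counts = Counter()
    for line in stream:                       # lazy: one line at a time from the pipe
        fields = line.split()
        if len(fields) < 9:
            continue                          # skip lines that lack a status field
        counts[(fields[8], fields[6])] += 1   # ($9 status, $7 request path)
    return counts


# Stand-in for the pipe so the sketch runs on its own;
# in the real script this would be summarize(sys.stdin).
fake_stdin = io.StringIO(
    '10.0.0.1 - - [ts +0000] "GET /checkout HTTP/1.1" 502 0\n'
    '10.0.0.2 - - [ts +0000] "GET /checkout HTTP/1.1" 502 0\n'
    '10.0.0.3 - - [ts +0000] "POST /login HTTP/1.1" 500 17\n'
)
for (status, path), n in summarize(fake_stdin).most_common():
    print(n, status, path)
```

Because `summarize` accepts any iterable of lines, the same function works on a pipe, an open file, or a list in a unit test.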

Pitfalls to avoid

readlines(), read(), f.read().split("\n") — all of these materialize the file. They’re the usual culprits behind OOM.
Building a giant list “to process later.” If you append every record, you’ve reinvented loading the file. Aggregate as you go.
Sorting the whole file in memory. GNU sort spills to temporary files on disk when the input outgrows its buffer, but Python’s sorted(all_lines) has to hold every line in RAM. Sort the thinned data, or pipe through sort itself.
Unbounded dictionaries. Streaming protects you only if your state stays small. A dict keyed on something high-cardinality (request IDs, timestamps) grows with the file and brings back the OOM.
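When the point of a sort is only to find the top few records, the full sort can often be skipped. The standard library's heapq.nlargest accepts any iterable and tracks only k candidates at a time, so it works directly on a generator. A minimal sketch, with the `slowest` helper and the synthetic records invented for the example:

```python
import heapq


def slowest(records, k=10):
    """Return the k records with the largest duration_ms, slowest first."""
    # nlargest keeps only k candidates at a time, so memory is O(k), not O(n).
    return heapq.nlargest(k, records, key=lambda rec: rec["duration_ms"])


# Made-up stream of records, generated lazily.
records = ({"id": i, "duration_ms": (i * 37) % 1009} for i in range(100_000))
for rec in slowest(records, k=3):
    print(rec["id"], rec["duration_ms"])
```

This covers "top N by value" only. A true full ordering of a file larger than RAM still calls for an external sort such as sort itself.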

The whole discipline is one idea: data flows through, it doesn’t pile up. Keep your state bounded and you can process a file ten times the size of your RAM on a tiny box without breaking a sweat.


Field numbers and delimiters in the awk examples assume a standard whitespace-delimited log. Adjust $N and set FS to match your actual format before running.