Profiling Linux Performance with perf and an AI Copilot

The first time I ran perf record on a misbehaving service, I got a wall of hex addresses and symbol names like __memmove_avx_unaligned_erms and felt no closer to understanding why CPU was pinned at 100%. perf is the single most powerful sampling profiler on Linux, but its output assumes you already know what you’re looking at. These days I treat an AI assistant as a fast junior engineer sitting next to me: it can name those symbols, group them into a story, and propose hypotheses far quicker than I can grep through them. I still own the conclusions, but the grunt work of decoding stacks is delegated. Here’s the workflow.

Installing perf and getting symbols right

On Ubuntu and its derivatives, perf ships in the linux-tools packages, which are built per kernel version. A mismatch between the installed tools and the running kernel is the number one reason people get garbage output.

sudo apt install linux-tools-$(uname -r) linux-tools-common
perf --version

You also need symbols and a way to unwind the stack, or frames collapse into [unknown] or bare hex addresses. For your own binaries, build with frame pointers (-fno-omit-frame-pointer) or with DWARF debug info (-g). For distro packages, install the matching -dbg/-dbgsym packages.

sudo perf record -F 99 -a --call-graph dwarf -- sleep 30

That samples the whole system at 99 Hz for 30 seconds with DWARF-based call graphs. The odd frequency of 99 avoids lock-stepping with periodic timers.
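A quick sanity check on sample volume: the most samples a capture can produce is frequency times duration times CPU count. The sketch below is only that arithmetic. The function name max_samples and the 8-CPU figure are made up for illustration, and real captures often come in lower because idle CPUs can contribute few samples.

```shell
# Upper bound on samples: frequency (Hz) x duration (s) x number of CPUs.
# Hypothetical helper, not part of perf.
max_samples() {
    freq=$1; secs=$2; cpus=$3
    echo $((freq * secs * cpus))
}

# 99 Hz for 30 s on an assumed 8-CPU machine
max_samples 99 30 8   # prints 23760
```

If a profile has far fewer samples than this bound, percentages on small functions are noisy and worth treating with suspicion.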

Capturing a profile of a single hot process

When you’ve already identified the offending PID from top, scope the capture to it rather than the whole box.

perf record -F 199 -g -p 4821 -- sleep 20
perf report --stdio | head -60

perf report --stdio is the text view that you can actually paste into an AI chat. The interactive TUI is nice for humans but useless for sharing context.

Pro Tip: Always capture for a fixed duration with -- sleep N rather than Ctrl-C. A clean, bounded sample is reproducible, and reproducibility is what lets you compare a “before” and “after” profile honestly.

Turning stacks into flame graphs

Flame graphs are the format both humans and models reason about best. Brendan Gregg’s scripts turn folded stacks into an SVG.

perf script > out.perf
git clone https://github.com/brendangregg/FlameGraph
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > flame.svg

The out.folded file is plain text: one collapsed stack and its sample count per line. That text is exactly what you hand to an AI. It can read 50,000 folded lines far faster than you can scroll a flame graph, and it can total samples per function for you.
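Because the folded format is so regular, the per-function totals can also be computed deterministically, which gives you numbers to check the model's arithmetic against. Here is a minimal sketch, assuming the usual "frame;frame;frame count" layout. The function name leaf_totals and the sample stacks are invented for illustration, and it counts self time only (samples where the function is the last frame).

```shell
# Sum samples by leaf (last) frame of each folded stack and print
# "count percent function", largest first. Hypothetical helper.
leaf_totals() {
    awk '{
        count = $NF
        stack = $0
        sub(/ [0-9]+$/, "", stack)       # drop the trailing " count"
        n = split(stack, frames, ";")    # frames[n] is the leaf frame
        self[frames[n]] += count
        total += count
    }
    END {
        for (f in self)
            printf "%d %.1f%% %s\n", self[f], 100 * self[f] / total, f
    }' "$@" | LC_ALL=C sort -k1,1nr -k3
}

# Made-up sample data standing in for out.folded
leaf_totals <<'EOF'
python;handle;json_encode;__memmove_avx_unaligned_erms 40
python;handle;json_encode 25
python;handle;gc_collect 20
python;worker;futex_wait 15
EOF
```

On a real capture this would be run as leaf_totals out.folded | head -20.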

Letting AI decode the symbols

Here’s where the copilot earns its keep. Paste the top of perf report --stdio or a slice of the folded stacks and ask a focused question. A prompt I reuse:

“This is folded perf output from a Python web service pinned at 100% CPU. Group the stacks by what they’re actually doing (GC, JSON serialization, lock contention, syscalls), give me the percentage of samples in each bucket, and list the three functions worth investigating first. Don’t suggest code changes yet.”

The model will name __memmove_avx_unaligned_erms as a large memory copy, spot that 40% of your samples sit under a JSON encoder, and flag a futex cluster as lock contention. That is an hour of manual decoding compressed into seconds. You then verify each claim against the actual flame graph — the AI is pattern-matching symbol names, and it can confidently misread an inlined frame, so you check before you act. I keep a folder of these decoding prompts in my prompt library so I’m not rewriting them at 2am.
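One way to do that verification is a crude bucketing pass of your own over the same folded stacks. The sketch below assigns each stack to the first bucket whose pattern matches anywhere in it and prints each bucket's share. The patterns (futex, gc_collect, json, syscall) are my guesses at useful symbol fragments, not a canonical list, and the function name bucket_stacks and the sample stacks are hypothetical.

```shell
# Assign each folded stack to the first matching bucket and print the
# percentage of samples per bucket. Patterns are illustrative assumptions.
bucket_stacks() {
    awk '{
        s = tolower($0)
        if      (s ~ /futex/)            b = "lock contention"
        else if (s ~ /gc_collect|_pygc/) b = "gc"
        else if (s ~ /json/)             b = "json"
        else if (s ~ /syscall/)          b = "syscalls"
        else                             b = "other"
        sum[b] += $NF
        total  += $NF
    }
    END {
        for (b in sum) printf "%.1f%% %s\n", 100 * sum[b] / total, b
    }' "$@" | LC_ALL=C sort -rn
}

# Made-up sample data
bucket_stacks <<'EOF'
python;handle;json_dumps;encoder_listencode_dict 40
python;handle;json_dumps;__memmove_avx_unaligned_erms 10
python;gc_collect_main 20
python;worker;do_syscall_64;__x64_sys_futex;futex_wait 15
python;read;do_syscall_64;__x64_sys_read 10
python;handle;route_match 5
EOF
```

The order of the checks matters: futex stacks also pass through syscall entry frames, so the lock bucket is tested first. If the model's percentages are far from these, one of the two is misreading the stacks.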

Differential profiling to confirm a fix

The honest way to prove a change worked is to compare two profiles. perf diff does this natively.

perf record -F 199 -g -o before.data -p 4821 -- sleep 20
# deploy your change; if the service restarts, substitute its new PID for 4821 below
perf record -F 199 -g -o after.data -p 4821 -- sleep 20
perf diff before.data after.data

The output shows the delta in sample share per symbol. Hand both perf report dumps to the AI and ask it to summarize what moved. It’s good at spotting that your hot function dropped from 38% to 4% while a different one crept up — the kind of regression you’d otherwise miss.
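perf diff stays the primary tool here, but the same comparison can be sketched over two folded files as a cross-check on the model's summary. This assumes both captures were folded with stackcollapse-perf.pl. The function name folded_diff and the sample stacks are hypothetical, and the sketch compares self-time share per leaf function rather than absolute sample counts.

```shell
# Compare leaf-function sample share between two folded files.
# Prints "delta before% -> after% function", biggest drop first.
# Assumes neither file is empty. Hypothetical helper.
folded_diff() {
    awk '
    function leaf(line,   n, frames) {
        sub(/ [0-9]+$/, "", line)        # drop the trailing " count"
        n = split(line, frames, ";")
        return frames[n]
    }
    FNR == 1 { file++ }                  # 1 = before, 2 = after
    {
        f = leaf($0)
        cnt[file, f] += $NF
        total[file]  += $NF
        seen[f] = 1
    }
    END {
        for (f in seen) {
            b = total[1] ? 100 * cnt[1, f] / total[1] : 0
            a = total[2] ? 100 * cnt[2, f] / total[2] : 0
            printf "%.1f %.1f%% -> %.1f%% %s\n", a - b, b, a, f
        }
    }' "$1" "$2" | LC_ALL=C sort -n
}

# Made-up sample data written to temporary files
tmp=$(mktemp -d)
printf 'svc;handle;parse 38\nsvc;handle;render 62\n' > "$tmp/before.folded"
printf 'svc;handle;parse 4\nsvc;handle;render 76\nsvc;handle;alloc 20\n' > "$tmp/after.folded"
folded_diff "$tmp/before.folded" "$tmp/after.folded"
rm -rf "$tmp"
```

Shares always sum to 100%, so when one function shrinks the others grow even if they did no extra work. A function like alloc appearing from nothing is the signal worth chasing, not render's relative increase.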

Beyond CPU: cache misses and context switches

perf isn’t only a CPU sampler. Hardware counters expose why a workload is slow even when CPU looks busy.

perf stat -d -p 4821 -- sleep 10

This prints instructions per cycle, cache miss rates, and branch mispredictions. A low IPC with high cache-miss percentage usually means a memory-bound workload, not a compute-bound one — a completely different fix. Ask the AI to interpret the perf stat block; it knows the rules of thumb (IPC under ~0.5 on a modern core is suspicious) and will explain them in context.
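The two ratios are simple enough to recompute by hand from the raw counters, which is a useful check on whatever the model tells you. A minimal sketch follows; the function names ipc and miss_pct and the counter values are invented for illustration.

```shell
# Instructions per cycle from raw counter values. Hypothetical helper.
ipc() {
    awk -v insn="$1" -v cyc="$2" 'BEGIN {
        if (cyc == 0) { print "n/a"; exit }
        printf "%.2f\n", insn / cyc
    }'
}

# Cache miss percentage: misses as a share of loads. Hypothetical helper.
miss_pct() {
    awk -v miss="$1" -v loads="$2" 'BEGIN {
        if (loads == 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * miss / loads
    }'
}

# Made-up counters: 4.2 billion instructions over 10.5 billion cycles
ipc 4200000000 10500000000      # prints 0.40
miss_pct 120000000 400000000    # prints 30.0
```

With these assumed numbers, an IPC of 0.40 alongside a 30% miss rate fits the memory-bound pattern described above.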

Profiling off-CPU time, not just on-CPU

The biggest blind spot in a CPU flame graph is everything that happens while your process isn’t running. A request that takes 800ms but only burns 40ms of CPU spends the rest blocked (on a lock, a disk read, a network call), and a standard perf record shows you almost none of that. Off-CPU profiling fills the gap by tracing scheduler events instead of sampling CPU cycles.

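# the sched_stat_* tracepoints only fire when scheduler statistics are enabled
sudo sysctl kernel.sched_schedstats=1
sudo perf record -e sched:sched_switch -e sched:sched_stat_sleep \
  -g -p 4821 -- sleep 20
sudo perf script | head -40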

This captures where the process went to sleep and for how long. The output is even harder to read than a CPU profile, which is precisely why I hand it to the AI. A prompt that works:

“This is off-CPU perf data from a request handler that’s slow but uses little CPU. Tell me whether the time is going to lock contention, disk I/O, or network waits, and which stack dominates. Don’t propose fixes yet.”

The model will spot a cluster of futex_wait stacks and call out lock contention, or a run of block-device sleeps pointing at slow storage. Combining the on-CPU and off-CPU pictures is how you get the whole latency story rather than just the compute half — and it’s a synthesis the AI does well, because it can hold both profiles in context at once and reconcile them. As always, you verify each claim against the raw stacks before you believe it.

Pro Tip: When a service is “slow” but CPU looks idle, reach for off-CPU profiling first. Nine times out of ten the answer is a lock, a synchronous disk write, or a chatty downstream call — none of which a CPU flame graph will ever show you.
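A rough way to decide which profile to reach for is to compare wall-clock time with CPU time for one request. The sketch below is just that subtraction, using the 800ms and 40ms figures from earlier. The function name off_cpu_share is hypothetical.

```shell
# Time spent blocked = wall-clock time minus CPU time, also shown as a
# share of wall time. Hypothetical helper; inputs are in milliseconds.
off_cpu_share() {
    awk -v wall="$1" -v cpu="$2" 'BEGIN {
        printf "%d ms off-CPU (%.0f%% of wall time)\n",
               wall - cpu, 100 * (wall - cpu) / wall
    }'
}

off_cpu_share 800 40   # prints: 760 ms off-CPU (95% of wall time)
```

When the off-CPU share is that high, a CPU flame graph can explain at most the remaining few percent of the latency.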

Keeping it safe

A few hard rules I never break. perf needs elevated privileges and can expose kernel addresses, so run captures from a low-privilege ops account and use sudo only for the specific command. Never paste a profile into a chat tool while it still contains hostnames, internal IPs, or environment dumps from /proc/<pid>/environ; strip those first. And the AI never touches the box: it reads exported text files only. It is a fast junior engineer that suggests where to dig. It does not get prod credentials, it does not run perf for you, and it does not get the final word on what to change. If your team wants this loop tracked and reproducible, our code-review dashboard keeps the profile-to-fix reasoning attached to the change.
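The stripping step can be partly automated. Here is a minimal sketch of a scrubbing filter that masks IPv4 addresses and one named host before text leaves the machine. The function name scrub_profile, the placeholder tokens, and the sample lines are all made up for illustration. It does not catch IPv6 addresses, secrets in environment dumps, or anything else, so it supplements a manual read rather than replacing it.

```shell
# Mask IPv4 addresses and a given hostname in text read from stdin.
# Hypothetical helper. The hostname is used as a regex, so it must not
# contain "/" and any dots in it match any character.
scrub_profile() {
    host=${1:?usage: scrub_profile HOSTNAME}
    sed -E \
        -e 's/([0-9]{1,3}\.){3}[0-9]{1,3}/<ip>/g' \
        -e "s/$host/<host>/g"
}

# Made-up sample lines
scrub_profile web-7 <<'EOF'
# hostname : web-7
upstream 10.0.12.5:8443 timed out
EOF
```

The IPv4 pattern is deliberately greedy and will also mask four-part version strings, which is the safer direction to be wrong in. A typical use would be perf report --stdio | scrub_profile "$(hostname)" > report.txt, followed by a manual read of report.txt before pasting.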

Conclusion

perf went from a tool I avoided to one I reach for first, and the difference was offloading symbol decoding to an AI copilot while keeping every decision in human hands. Capture a bounded sample, fold it, let the model bucket the stacks and name the suspects, then verify against the flame graph and prove the fix with perf diff.