Linux `perf` & Flame Graph Profiling Prompt
Profile a Linux process with `perf record` and generate flame graphs to find CPU hotspots, off-CPU waits, and frequent stack patterns.
- Target user: Performance engineers and senior developers
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior performance engineer who has profiled hundreds of production processes with `perf` and Brendan Gregg's FlameGraph tooling. You can tell on-CPU bottlenecks from off-CPU waits, and you read flame graphs like sheet music.

I will provide:
- The process / workload to profile (pid, binary, language runtime)
- The symptom (high CPU, slow response, latency spike, unknown bottleneck)
- System info: kernel version, distro, whether debug symbols are available
- Which app-specific tools are installed (perf-map-agent for JVM, py-spy for Python, etc.)
- Output of `perf record` / `perf report` if already run; if not, tell me exactly which command to run

Your job:

1. **Choose the right profile type**:
   - **On-CPU (CPU time spent)**: `perf record -F 99 -p <pid> -g -- sleep 30`. The classic flame graph.
   - **Off-CPU (time waiting)**: `perf record -e sched:sched_switch -p <pid> -g -- sleep 30` plus extra processing. Reveals lock waits and I/O blocks.
   - **Wakeup chains** (who woke this thread?): `perf record -e sched:sched_switch,sched:sched_waking`
   - **Cache misses, branch misses**: `perf stat -e cache-misses,...` for microarchitectural hotspots
   - **Custom tracepoints / kprobes / uprobes**: for specific functions
2. **Symbol resolution** (flame graphs need symbols):
   - **Compiled C/C++**: needs `-g` debug info OR debuginfo packages (`dnf debuginfo-install`, `apt install <pkg>-dbgsym`)
   - **JVM**: use perf-map-agent or `async-profiler` to emit `/tmp/perf-<pid>.map`
   - **Node.js**: `--perf-basic-prof` flag or the `0x` tool
   - **Python**: prefer `py-spy` (no perf needed) or `austin`
   - **Go**: usually fine out of the box; frame pointers are enabled by default on amd64 (since Go 1.7)
   - **Stripped binaries**: symbol resolution fails; install debuginfo or rebuild
3. **Capture cleanly**:
   - Frequency: `-F 99` (99 Hz) is standard. It is high enough to catch hot frames, cheap enough to keep overhead low, and deliberately not 100 Hz so that samples do not fall in lockstep with periodic timer activity.
   - Duration: 30-60s for steady-state; longer for sporadic issues.
   - `-g` enables call graphs (stack traces). Use `--call-graph dwarf` if frame pointers are missing (slower, larger files).
   - `-a` for system-wide; `-p <pid>` for one process; `-t <tid>` for one thread.
4. **Generate the flame graph**:
   - `perf script > out.perf`
   - `./stackcollapse-perf.pl out.perf > out.folded`
   - `./flamegraph.pl out.folded > out.svg`
   - For off-CPU: use `offcputime` (bcc) or post-process the `sched_switch` events from `perf script`
5. **Read the flame graph**:
   - **Width** = time spent (NOT vertical depth). Wider boxes = more samples.
   - **Vertical** = call stack depth. The bottom is the root frame (typically `_start` or `entry_*`); the top is the leaf.
   - **Color** carries no meaning by default (randomized warm hues). Use `--colors` for type hints.
   - Look for **wide plateaus** = hot leaf functions. Look for **wide forks** = branches consuming time.
6. **Common findings**:
   - Wide plateau at a libc function (e.g., `memcpy`, `malloc`) → frequent caller; the issue is the caller, not the lib
   - JVM stack stops at an `interpreter` frame → perf-map-agent not attached; attach it and re-capture
   - All "kernel" boxes at top with no userspace context → frame pointers missing in the app; use `--call-graph dwarf`
   - Off-CPU graph dominated by `futex_wait` → lock contention; see the related contention prompt
   - On-CPU graph mostly `[unknown]` → stripped binary or no debug info; install debuginfo packages
7. **Mark DESTRUCTIVE / high-overhead**:
   - `perf record -a` on a busy server can add significant overhead; brief runs only
   - `perf record --call-graph dwarf` writes lots of data; verify disk space first
   - `perf probe` adds kprobes; harmless, but they persist until removed
   - For container workloads, `perf` runs on the host and has to resolve binaries and map files inside the container's mount namespace; newer `perf` versions handle this, otherwise run `perf` from inside the container

---

Workload: [pid, binary, language runtime]
Symptom: [CPU-heavy / slow / latency / unknown]
Kernel + distro: [DESCRIBE]
Debug symbols available? [yes / partial / no]
Existing perf output (if any):
```
[PASTE `perf report` head]
```
Why this prompt works
`perf` has dozens of subcommands and events, and a flame graph requires both a capture step and a post-processing step. Most engineers stop at `perf top` because deeper use is confusing. This prompt provides a recipe per language and explains how to read the resulting graph.
How to use it
- Pick on-CPU vs off-CPU first. They answer different questions.
- Ensure symbol resolution. A flame graph of `[unknown]` boxes is useless. Install debuginfo or use language-specific tooling.
- 30-60s captures. Longer doesn’t add information for steady-state issues.
- Don’t profile something with no load. No traffic = empty flame graph.
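One way to check symbol resolution before rendering anything is to measure how many samples contain an `[unknown]` frame. The sketch below assumes a folded-stack file of the kind `stackcollapse-perf.pl` writes (one `frame;frame;...;leaf count` line per stack). The `unknown_share` helper and the sample data are illustrative and not part of any tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of samples whose stack contains [unknown].
# Input: folded stacks, one "frame1;frame2;...;leaf count" line per stack.
unknown_share() {
  awk '
    {
      count = $NF                      # last field is the sample count
      total += count
      if (index($0, "[unknown]") > 0) unknown += count
    }
    END {
      if (total == 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * unknown / total
    }
  ' "$1"
}

# Made-up sample data standing in for real out.folded content.
sample=$(mktemp)
cat > "$sample" <<'EOF'
myapp;main;parse_request;json_decode 60
myapp;[unknown];[unknown] 30
myapp;main;send_reply;__memcpy_avx_unaligned 10
EOF

unknown_share "$sample"   # prints 30.0
```

If the share is high, it is probably worth fixing symbols (debuginfo, frame pointers, or a runtime map file) before spending time on the graph itself.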
Setup
# Install
sudo apt install linux-tools-common linux-tools-$(uname -r) -y # Ubuntu
sudo dnf install perf -y # RHEL
# FlameGraph
git clone https://github.com/brendangregg/FlameGraph.git
export PATH="$PWD/FlameGraph:$PATH"
# Permissions
sudo sysctl -w kernel.perf_event_paranoid=1 # allow non-root profiling of your own processes (user + kernel samples)
sudo sysctl -w kernel.kptr_restrict=0 # allow kernel symbol resolution
# Debug symbols (Ubuntu; requires the ddebs repository to be enabled)
sudo apt install linux-image-$(uname -r)-dbgsym -y
sudo apt install <package>-dbgsym -y
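Before capturing, it can help to confirm what the current `kernel.perf_event_paranoid` level permits. The following is a minimal sketch: `describe_paranoid` is a hypothetical helper, the level descriptions paraphrase the kernel sysctl documentation, and levels above 2 are treated as a distro-specific extension.

```shell
#!/usr/bin/env bash
# Hypothetical helper: explain a kernel.perf_event_paranoid value.
# Level meanings paraphrase the kernel sysctl documentation; some distros
# (for example Debian and Ubuntu) add stricter levels above 2.
describe_paranoid() {
  local level=$1
  if   [ "$level" -le -1 ]; then echo "unrestricted"
  elif [ "$level" -eq 0 ];  then echo "cpu-wide events allowed, raw tracepoints restricted"
  elif [ "$level" -eq 1 ];  then echo "own processes only, user and kernel samples"
  elif [ "$level" -eq 2 ];  then echo "own processes only, user-space samples"
  else                           echo "distro-specific lockdown, use root or suitable capabilities"
  fi
}

# Read the live value if the file exists; fall back to 2 (the upstream default).
current=$(cat /proc/sys/kernel/perf_event_paranoid 2>/dev/null || echo 2)
echo "perf_event_paranoid=$current: $(describe_paranoid "$current")"
```

The recipes in this document run `perf` under `sudo`, so the setting mainly matters when profiling as an unprivileged user.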
Recipe: on-CPU flame graph (compiled language)
# Capture
sudo perf record -F 99 -p <pid> --call-graph dwarf -- sleep 30 # use plain -g instead if the binary has frame pointers
# Process
sudo perf script > out.perf
stackcollapse-perf.pl out.perf > out.folded
flamegraph.pl out.folded > flame.svg
# View
xdg-open flame.svg
Recipe: JVM
# Option A: perf-map-agent (https://github.com/jvm-profiling-tools/perf-map-agent)
# Start the JVM with -XX:+PreserveFramePointer so perf can walk JIT frames, then attach:
java -cp <perf-map-agent> net.virtualvoid.perf.AttachOnce <pid>
# Then standard perf record -p <pid>
# Option B: async-profiler (preferred, simpler)
./asprof -d 30 -f flame.html <pid>
# Or wall-clock mode, which also samples threads that are waiting (off-CPU):
./asprof -d 30 -e wall -f flame.html <pid>
Recipe: Python
# py-spy (no perf needed; attaching to a running process usually requires root)
pip install py-spy
sudo py-spy record -o flame.svg --pid <pid> --duration 30
# Or austin (similar)
sudo austin -p <pid> -i 100ms -o profile.txt
austin2speedscope profile.txt profile.speedscope.json # converter takes input and output paths
Recipe: Node.js
# Run with --perf-basic-prof so V8 writes JIT symbols to /tmp/perf-<pid>.map
node --perf-basic-prof app.js
sudo perf record -F 99 -p <pid> -g -- sleep 30
# Or use 0x (no perf needed)
npx 0x -- node app.js
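Both the JVM and Node.js recipes depend on the runtime writing a `/tmp/perf-<pid>.map` file that `perf` can read. A quick sanity check before recording could look like the sketch below. `perf_map_symbols` is a hypothetical helper, the directory argument exists only so the example can run against a scratch directory, and the map entries are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report how many JIT symbols a perf map file holds.
# perf looks for /tmp/perf-<pid>.map; the directory is a parameter here
# only so the example can run against a scratch directory.
perf_map_symbols() {
  local pid=$1 dir=${2:-/tmp}
  local map="$dir/perf-$pid.map"
  if [ ! -s "$map" ]; then
    echo "missing"
    return 1
  fi
  # Each line is "<start-hex> <size-hex> <symbol name>".
  awk 'NF >= 3 { n++ } END { print n + 0 }' "$map"
}

# Made-up map file standing in for what the runtime would write.
scratch=$(mktemp -d)
cat > "$scratch/perf-4242.map" <<'EOF'
7f3a10000000 40 LazyCompile:*handleRequest /srv/app.js:12
7f3a10000040 1c LazyCompile:~parseBody /srv/app.js:48
EOF

perf_map_symbols 4242 "$scratch"          # prints 2
perf_map_symbols 9999 "$scratch" || true  # prints missing
```

A result of `missing` suggests the agent or flag is not active for that pid, in which case JIT frames will likely not resolve to function names.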
Recipe: Off-CPU (where is time spent waiting?)
# eBPF (preferred, low overhead)
sudo apt install bpfcc-tools # Ubuntu
sudo /usr/sbin/offcputime-bpfcc -df -p <pid> 30 > out.stacks
flamegraph.pl --color=io --title="Off-CPU Time" out.stacks > offcpu.svg
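The folded output from `offcputime -f` can also be summarized without rendering an SVG. The sketch below assumes each line ends with the blocked time in microseconds and that `-d` inserted a `-` frame between user and kernel stacks. `offcpu_share` is a hypothetical helper and the stacks are invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: share of off-CPU time in stacks matching a pattern.
# Input: folded output of offcputime -f (stack, then microseconds blocked).
offcpu_share() {
  awk -v pat="$1" '
    { us = $NF; total += us; if ($0 ~ pat) hit += us }
    END { if (total > 0) printf "%.1f\n", 100 * hit / total; else print "0.0" }
  ' "$2"
}

# Invented stacks standing in for real out.stacks content.
stacks=$(mktemp)
cat > "$stacks" <<'EOF'
myapp;worker_loop;pthread_mutex_lock;-;do_futex;futex_wait;schedule;finish_task_switch 700000
myapp;worker_loop;read;-;vfs_read;pipe_read;schedule;finish_task_switch 200000
myapp;main;epoll_wait;-;do_epoll_wait;schedule_hrtimeout_range;schedule;finish_task_switch 100000
EOF

offcpu_share 'futex_wait' "$stacks"   # prints 70.0
offcpu_share ';read;' "$stacks"       # prints 20.0
```

Idle waits such as `epoll_wait` in an event loop often account for a large share of off-CPU time and are usually not the problem, so it may be worth excluding them before judging the rest.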
Reading a flame graph
The overall shape of the graph tells you where to look first:
- Wide and shallow: a small number of hot leaf functions consuming time. Optimize those.
- Tall and narrow: deep call stacks; the leaf is small but the path is long. Usually less actionable.
- A wide plateau in libc/syscalls at the top: the issue is the caller, not the library.
- A tower of “interpreter” frames (JVM, Python without py-spy): symbol resolution failed; can’t see real code.
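The "wide plateau" idea can be checked numerically by summing samples per leaf frame in the folded stacks. This is a rough sketch, assuming the `frame;...;leaf count` format from `stackcollapse-perf.pl`. `top_leaves` is a hypothetical helper and the stacks are sample data.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rank leaf frames (the top boxes of a flame graph)
# by sample count. Input: folded stacks, "frame;...;leaf count" per line.
top_leaves() {
  awk '
    {
      count = $NF
      stack = $0
      sub(/ [0-9]+$/, "", stack)        # drop the trailing count
      n = split(stack, frames, ";")
      leaf[frames[n]] += count          # last frame is the leaf
    }
    END { for (f in leaf) printf "%d %s\n", leaf[f], f }
  ' "$1" | sort -k1,1nr | head -n "${2:-10}"
}

# Sample data standing in for real out.folded content.
folded=$(mktemp)
cat > "$folded" <<'EOF'
myapp;main;handle;json_decode 50
myapp;main;handle;render;__memcpy_avx_unaligned 20
myapp;main;log_line;__memcpy_avx_unaligned 25
myapp;main;handle;render 10
EOF

top_leaves "$folded" 3
# prints:
# 50 json_decode
# 45 __memcpy_avx_unaligned
# 10 render
```

Here `__memcpy_avx_unaligned` ranks second only after its two call paths are merged. In the graph it would appear as two narrower boxes under `render` and `log_line`, which is why the callers are the place to look.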
Common patterns
| Plateau | Likely meaning |
|---|---|
| `_raw_spin_lock` | Hot kernel lock; check IRQ/scheduler |
| `__memcpy_avx` | Lots of buffer copying; review callers |
| `do_softirq` | Network or block IRQ work; check NIC pinning |
| `futex_wait` (in off-CPU) | Lock contention |
| `read` / `write` (in off-CPU) | Synchronous I/O dominating |
| `[unknown]` | Missing debug info |
| Interpreter / JIT boxes | JVM without perf-map-agent |
Common findings this catches
- JSON parsing dominates the flame graph → switch parser, batch input, or precompile.
- `memcpy` in a network path → zero-copy opportunity (`splice`, `sendfile`).
- `futex_wait` in off-CPU → lock contention; see linux-context-switch-lock-contention.
- Deep stacks in a Spring app calling reflection-heavy methods → caching opportunity.
- `__schedule` in CPU samples → too many threads, frequent preemption.
When to escalate
- High-overhead production profiling that affects users: switch to eBPF-based tooling (`bpftrace`, `bcc`).
- Heap profiling needs (which `perf` doesn’t do well): use language-specific tools (jmap/jhsdb, py-spy dump, pprof for Go).
- Kernel-only hot paths that match upstream regressions: file a kernel bug with the flame graph.
Related prompts
- Linux Context Switch & Lock Contention Diagnosis Prompt: Diagnose context-switch storms, futex contention, kernel-level lock waits, and CPU scheduling pathologies that masquerade as 'app is slow.'
- Linux High Load & CPU Runaway Investigation Prompt: Diagnose high load average, CPU saturation, run-queue pressure, IRQ storms, and steal time on Linux servers; distinguish user CPU vs system CPU vs I/O wait vs steal.
- Linux strace / Syscall Debugging Prompt: Use strace, ltrace, ftrace, and bpftrace to find why an app hangs, what files it touches, why a binary fails on a new system, and which syscall actually returns the error.