You are a senior performance engineer who has profiled hundreds of production processes with `perf` and Brendan Gregg's FlameGraph tooling. You can tell on-CPU bottlenecks from off-CPU waits and you read flame graphs like sheet music.

I will provide:
- The process / workload to profile (pid, binary, language runtime)
- The symptom (high CPU, slow response, latency spike, unknown bottleneck)
- System info: kernel version, distro, whether debug symbols are available
- For app-specific tools (perf-map-agent for JVM, py-spy for Python, etc.): what's installed
- Output of `perf record` / `perf report` if already run; if not, ask me to run a specific command

Your job:

1. **Choose the right profile type**:
   - **On-CPU (CPU time spent)**: `perf record -F 99 -p <pid> -g -- sleep 30`. The classic flame graph.
   - **Off-CPU (time waiting)**: `perf record -e sched:sched_switch -p <pid> -g -- sleep 30` plus extra processing. Reveals lock waits and I/O blocks.
   - **Wakeup chains**: who woke this thread? `perf record -e sched:sched_switch,sched:sched_waking`
   - **Cache misses, branch misses**: `perf stat -e cache-misses,...` for microarchitectural hotspots
   - **Custom tracepoints / kprobes / uprobes**: for specific functions

2. **Symbol resolution** (flame graphs need symbols):
   - **Compiled C/C++**: needs `-g` debug info OR debuginfo packages (`dnf debuginfo-install`, `apt install <pkg>-dbgsym`)
   - **JVM**: use perf-map-agent or `async-profiler` to emit `/tmp/perf-<pid>.map`; run the JVM with `-XX:+PreserveFramePointer` so `perf` can walk JIT-compiled frames
   - **Node.js**: `--perf-basic-prof` flag or the `0x` tool
   - **Python**: prefer `py-spy` (no perf needed) or `austin`
   - **Go**: usually fine out of the box; frame pointers matter and have been on by default on amd64 since Go 1.7
   - **Stripped binaries**: symbol resolution fails; install debuginfo or rebuild

3. **Capture cleanly**:
   - Frequency: `-F 99` (99 Hz) is standard. It is frequent enough to catch hot frames at low overhead, and it is deliberately not 100 Hz so that samples do not fall in lockstep with periodic timer activity.
   - Duration: 30-60s for steady-state; longer for sporadic issues.
   - `-g` enables call graphs (stack traces). Use `--call-graph dwarf` if frame pointers are missing (slower, larger files).
   - `-a` for system-wide; `-p <pid>` for one process; `-t <tid>` for one thread.

4. **Generate the flame graph**:
   - `perf script > out.perf`
   - `./stackcollapse-perf.pl out.perf > out.folded`
   - `./flamegraph.pl out.folded > out.svg`
   - For off-CPU: use `offcputime` (bcc/bpftrace) or `perf script` with sched switches

5. **Read the flame graph**:
   - **Width** = time spent (NOT vertical depth). Wider boxes = more samples.
   - **Vertical** = call stack depth. Bottom is the root frame (typically `_start` or `entry_*`); top is the leaf.
   - **Color** is meaningless by default (random warm hues). Use `--colors` for type hints.
   - Look for **wide plateaus** = hot leaf functions. Look for **wide forks** = branches consuming time.

6. **Common findings**:
   - Wide plateau at a libc function (e.g., `memcpy`, `malloc`) → frequent caller; the issue is the caller, not the lib
   - JVM stack stops at an `interpreter` frame → no perf map for JIT-compiled code; attach perf-map-agent and run the JVM with `-XX:+PreserveFramePointer`
   - Kernel frames with no userspace frames beneath them → frame pointers missing in the app; use `--call-graph dwarf`
   - Off-CPU graph dominated by `futex_wait` → lock contention; see related contention prompt
   - On-CPU graph mostly `[unknown]` → stripped binary / no debug info; install debuginfo packages

7. **Mark DESTRUCTIVE / high-overhead**:
   - `perf record -a` on a busy server can add significant overhead; brief runs only
   - `perf record --call-graph dwarf` writes lots of data; verify disk space
   - `perf probe` adds kprobes; harmless, but they persist until removed with `perf probe -d`
   - For container workloads, `perf` runs on the host and symbol lookup must reach the container's binaries; newer `perf` versions handle mount namespaces (`perf record --namespaces` records namespace events), otherwise run it from inside the container

---

Workload: [pid, binary, language runtime]
Symptom: [CPU-heavy / slow / latency / unknown]
Kernel + distro: [DESCRIBE]
Debug symbols available? [yes / partial / no]
Existing perf output (if any):
```
[PASTE `perf report` head]
```

Why this prompt works

perf has dozens of subcommands and events; flame graphs require both capture and processing. Most engineers stop at perf top because deeper use is confusing. This prompt provides a recipe per language and explains how to read the resulting graph.

How to use it

Pick on-CPU vs off-CPU first. They answer different questions.
Ensure symbol resolution. A flame graph of [unknown] boxes is useless. Install debuginfo or use language-specific tooling.
30-60s captures. Longer doesn’t add information for steady-state issues.
Don’t profile something with no load. No traffic = empty flame graph.
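
One way to confirm the target is actually busy before capturing is to sample its CPU time from `/proc`. The sketch below is an illustration only, not part of the recipes on this page: the helper names `cpu_ticks` and `cpu_busy_pct` are hypothetical, and it assumes a Linux `/proc` filesystem and whole-second intervals.

```shell
# Total CPU ticks (user + system) a process has consumed so far.
cpu_ticks() {
    stat=$(cat "/proc/$1/stat") || return 1
    rest=${stat##*) }        # drop "pid (comm) "; comm may contain spaces
    set -- $rest             # now $1 = state, so utime/stime are $12/$13
    echo $(( ${12} + ${13} ))
}

# Percent of one CPU used by <pid> over <seconds> (default 5, integers only).
cpu_busy_pct() {
    pid=$1
    secs=${2:-5}
    hz=$(getconf CLK_TCK)    # ticks per second
    before=$(cpu_ticks "$pid") || return 1
    sleep "$secs"
    after=$(cpu_ticks "$pid") || return 1
    echo $(( (after - before) * 100 / (hz * secs) ))
}
```

Calling `cpu_busy_pct <pid> 5` prints the share of one CPU the process used over five seconds. A value near 0 suggests an on-CPU capture would come back nearly empty, and that an off-CPU profile may be the better question to ask.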

Setup

# Install
sudo apt install linux-tools-common linux-tools-$(uname -r) -y         # Ubuntu
sudo dnf install perf -y                                                 # RHEL

# FlameGraph
git clone https://github.com/brendangregg/FlameGraph.git
export PATH="$PWD/FlameGraph:$PATH"

# Permissions
sudo sysctl -w kernel.perf_event_paranoid=1     # non-root may profile its own processes (user + kernel); system-wide still needs root
sudo sysctl -w kernel.kptr_restrict=0           # allow kernel symbol resolution

# Debug symbols (Ubuntu; -dbgsym packages require the ddebs.ubuntu.com repository to be enabled)
sudo apt install linux-image-$(uname -r)-dbgsym -y
sudo apt install <package>-dbgsym -y
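
Before loosening `kernel.perf_event_paranoid`, it can help to see what the current value already permits. The following is a minimal sketch: the function name `describe_paranoid` is hypothetical, levels -1 through 2 follow the upstream kernel sysctl documentation, and the meaning of 3 and above is an assumption based on distro patches (Debian and Ubuntu ship such values).

```shell
# Explain what a kernel.perf_event_paranoid value permits for unprivileged users.
# Restrictions are cumulative: each level keeps the limits of the levels below it.
describe_paranoid() {
    level=$1
    if   [ "$level" -le -1 ]; then echo "unrestricted"
    elif [ "$level" -eq 0 ];  then echo "raw tracepoints denied; CPU-wide and kernel profiling allowed"
    elif [ "$level" -eq 1 ];  then echo "own processes only (user + kernel); CPU-wide events denied"
    elif [ "$level" -eq 2 ];  then echo "own processes, user space only; kernel profiling denied"
    else                           echo "unprivileged perf disabled (distro-specific level, assumed)"
    fi
}

# Report the current setting, if readable, before changing it.
if [ -r /proc/sys/kernel/perf_event_paranoid ]; then
    current=$(cat /proc/sys/kernel/perf_event_paranoid)
    echo "perf_event_paranoid=$current: $(describe_paranoid "$current")"
fi
```

If every capture is run with `sudo`, the setting does not need to change at all; lowering it mainly matters when profiling as a regular user.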

Recipe: on-CPU flame graph (compiled language)

# Capture (--call-graph dwarf already implies -g; use plain -g instead if the binary has frame pointers)
sudo perf record -F 99 -p <pid> --call-graph dwarf -- sleep 30

# Process
sudo perf script > out.perf
stackcollapse-perf.pl out.perf > out.folded
flamegraph.pl out.folded > flame.svg

# View
xdg-open flame.svg

Recipe: JVM

# Option A: perf-map-agent (https://github.com/jvm-profiling-tools/perf-map-agent)
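# Start the JVM with -XX:+PreserveFramePointer so perf can walk JIT-compiled frames,
# then attach the agent to write /tmp/perf-<pid>.map:
java -cp <perf-map-agent> net.virtualvoid.perf.AttachOnce <pid>
# Then run the standard capture: perf record -F 99 -p <pid> -g -- sleep 30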

# Option B: async-profiler (preferred, simpler)
./asprof -d 30 -f flame.html <pid>
# Or wall-clock mode (samples threads whether running or blocked; the closest thing to off-CPU):
./asprof -d 30 -e wall -f flame.html <pid>

Recipe: Python

# py-spy (no perf needed; attaching to an already-running pid usually requires root)
pip install py-spy
sudo py-spy record -o flame.svg --pid <pid> --duration 30

# Or austin (similar)
sudo austin -p <pid> -i 100ms -o profile.txt
austin2speedscope profile.txt profile.speedscope.json     # from the austin-python package; takes input and output paths

Recipe: Node.js

# Start Node so V8 writes /tmp/perf-<pid>.map for JIT-compiled functions
node --perf-basic-prof app.js
sudo perf record -F 99 -p <pid> -g -- sleep 30

# Or use 0x (no perf needed)
npx 0x -- node app.js

Recipe: Off-CPU (where is time spent waiting?)

# eBPF (preferred, low overhead)
sudo apt install bpfcc-tools                            # Ubuntu
sudo /usr/sbin/offcputime-bpfcc -df -p <pid> 30 > out.stacks
flamegraph.pl --color=io --countname=us --title="Off-CPU Time" out.stacks > offcpu.svg     # offcputime reports microseconds, not samples

Reading a flame graph

Start with the overall shape. Is the graph wide and shallow, or narrow and tall?

Wide and shallow: a small number of hot leaf functions consuming time. Optimize those.
Tall and narrow: deep call stacks; the leaf is small but the path is long. Usually less actionable.
A wide plateau in libc/syscalls at the top: the issue is the caller, not the library.
A tower of “interpreter” frames (JVM, Python without py-spy): symbol resolution failed; can’t see real code.
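
The same reading can be done numerically on the folded file (`out.folded` in the on-CPU recipe), where each line is a semicolon-joined stack followed by a sample count. The sketch below totals samples per leaf frame, which roughly corresponds to the plateaus along the top edge of the graph. The function name `top_leaves` is hypothetical, and the code assumes standard `awk`, `sort`, and `head`.

```shell
# Rank leaf frames by sample count in a folded-stack file.
# Folded format, one line per unique stack: "root;child;...;leaf <count>"
top_leaves() {
    # $1 = folded file, $2 = number of rows to print (default 10)
    awk '{
        count = $NF                      # sample count is the last field
        stack = $0
        sub(/ [0-9]+$/, "", stack)       # drop the trailing count
        n = split(stack, frames, ";")
        leaf[frames[n]] += count         # credit the samples to the leaf frame
        total += count
    }
    END {
        for (f in leaf)
            printf "%d %.1f%% %s\n", leaf[f], 100 * leaf[f] / total, f
    }' "$1" | sort -k1,1nr | head -n "${2:-10}"
}
```

Running `top_leaves out.folded 5` lists the five leaf frames with the most samples and their share of the total. A library function leading this list is a prompt to look at its callers in the graph, not at the library itself.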

Common patterns

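| Plateau | Likely meaning |
| --- | --- |
| `_raw_spin_lock` | Contended kernel spinlock; the frames beneath it show which code path takes the lock |
| `__memcpy_avx` | Lots of buffer copying; review callers |
| `do_softirq` | Softirq work, usually network or block I/O; check IRQ affinity and NIC queue pinning |
| `futex_wait` (in off-CPU) | Lock contention |
| `read` / `write` (in off-CPU) | Synchronous I/O dominating |
| `[unknown]` | Missing symbols or debug info, or stacks that could not be unwound |
| `Interpreter` / `JIT` boxes | JVM without perf-map-agent |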
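
To put a number on any of these patterns, the folded file can be scanned for stacks that contain a given frame. This is a minimal sketch with a hypothetical function name, `frame_share`; it does a plain substring match, so `memcpy` also matches `__memcpy_avx`.

```shell
# Percentage of all samples whose stack contains a given frame name.
frame_share() {
    # $1 = folded file, $2 = fixed string to look for (not a regex)
    awk -v needle="$2" '{
        count = $NF                      # sample count is the last field
        stack = $0
        sub(/ [0-9]+$/, "", stack)       # match against frames only, not the count
        total += count
        if (index(stack, needle) > 0) hit += count
    }
    END { if (total > 0) printf "%.1f\n", 100 * hit / total }' "$1"
}
```

For example, `frame_share out.folded '[unknown]'` prints the percentage of samples with at least one unresolved frame. A high value suggests fixing symbol resolution before interpreting anything else in the graph.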

Common findings this catches

JSON parsing dominates the flame graph → switch parser, batch input, or precompile.
memcpy in a network path → zero-copy opportunity (splice, sendfile).
futex_wait in off-CPU → lock contention; see linux-context-switch-lock-contention.
Deep stacks in a Spring app calling reflection-heavy methods → caching opportunity.
__schedule in CPU samples → too many threads, frequent preemption.

When to escalate

Profiling overhead that affects production users: switch to eBPF-based tooling (bpftrace, bcc).
Heap profiling needs (which perf doesn’t do well): use language-specific tools (jmap/jhsdb for the JVM, tracemalloc or memray for Python, pprof for Go).
Kernel-only hot paths that match upstream regressions: file a kernel bug with the flame graph.

Linux `perf` & Flame Graph Profiling Prompt

Related prompts

Linux Context Switch & Lock Contention Diagnosis Prompt

Linux High Load & CPU Runaway Investigation Prompt

Linux strace / Syscall Debugging Prompt
