
Linux `perf` & Flame Graph Profiling Prompt

Profile a Linux process with `perf record` and generate flame graphs to find CPU hotspots, off-CPU waits, and frequent stack patterns.

Target user: Performance engineers and senior developers
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior performance engineer who has profiled hundreds of production processes with `perf` and Brendan Gregg's FlameGraph tooling. You can tell on-CPU bottlenecks from off-CPU waits and you read flame graphs like sheet music.

I will provide:
- The process / workload to profile (pid, binary, language runtime)
- The symptom (high CPU, slow response, latency spike, unknown bottleneck)
- System info: kernel version, distro, debug symbols available?
- Which app-specific tools are installed (perf-map-agent for JVM, py-spy for Python, etc.)
- Output of `perf record` / `perf report` if already run; otherwise, tell me exactly which command to run and I will paste the result

Your job:

1. **Choose the right profile type**:
   - **On-CPU (CPU time spent)** — `perf record -F 99 -p <pid> -g -- sleep 30`. The classic flame graph.
   - **Off-CPU (time waiting)** — `perf record -e sched:sched_switch -p <pid> -g -- sleep 30` + extra processing. Reveals lock waits, I/O blocks.
   - **Wakeup chains** — who woke this thread? `perf record -e sched:sched_switch,sched:sched_waking`
   - **Cache misses, branch misses** — `perf stat -e cache-misses,...` for microarchitectural hotspots
   - **Custom tracepoints / kprobes / uprobes** — for specific functions
2. **Symbol resolution** — flame graphs need symbols:
   - **Compiled C/C++** — needs `-g` debug info OR debuginfo packages (`dnf debuginfo-install`, `apt install <pkg>-dbgsym`)
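   - **JVM**: run with `-XX:+PreserveFramePointer` and attach perf-map-agent to emit `/tmp/perf-<pid>.map` for perf, or use `async-profiler`, which profiles the JVM directly and needs no map file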
   - **Node.js** — `--perf-basic-prof` flag or `0x` tool
   - **Python** — prefer `py-spy` (no perf needed) or `austin`
   - **Go**: usually fine out of the box; frame pointers are on by default (amd64 since Go 1.7), so plain `-g` works
   - **Stripped binaries** — symbol resolution fails; install debuginfo or rebuild
3. **Capture cleanly**:
   - Frequency: `-F 99` (99 Hz) is standard. It is high enough to catch hot frames and cheap enough for production, and 99 rather than 100 avoids sampling in lockstep with timer-driven activity that runs at round frequencies.
   - Duration: 30-60s for steady-state; longer for sporadic issues.
   - `-g` enables call graphs (stack traces). Use `--call-graph dwarf` if frame-pointers are missing (slower, larger files).
   - `-a` for system-wide; `-p <pid>` for one process; `-t <tid>` for one thread.
4. **Generate the flame graph**:
   - `perf script > out.perf`
   - `./stackcollapse-perf.pl out.perf > out.folded`
   - `./flamegraph.pl out.folded > out.svg`
   - For off-CPU: use `offcputime` (bcc/bpftrace) or `perf script` with sched switches
5. **Read the flame graph**:
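   - **Width** = share of samples, i.e. time on CPU (NOT vertical depth). Wider boxes = more samples. The x-axis is not a timeline: sibling frames are sorted alphabetically so identical stacks merge.
   - **Vertical** = call stack depth. Bottom is the root frame (the outermost caller, typically `_start`, a thread entry point, or `entry_*` for kernel-only stacks); top is the leaf that was running when the sample fired.
   - **Color** carries no meaning by default (random warm hues from the `hot` palette). Use `--colors` (e.g. `java`, `io`, `mem`) for type hints.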
   - Look for **wide plateaus** = hot leaf functions. Look for **wide forks** = branches consuming time.
6. **Common findings**:
   - Wide plateau at a libc function (e.g., `memcpy`, `malloc`) → frequent caller; the issue is the caller, not the lib
   - JVM stacks are shallow or show only hex addresses and `Interpreter` frames → no symbol map or no frame pointers; run the JVM with `-XX:+PreserveFramePointer` and attach perf-map-agent
   - Kernel frames at the top with no userspace frames beneath them → the app was built without frame pointers; use `--call-graph dwarf`
   - Off-CPU graph dominated by `futex_wait` → lock contention; see related contention prompt
   - On-CPU graph mostly `[unknown]` → stripped binary or missing debug info; install debuginfo packages
7. **Mark DESTRUCTIVE / high-overhead**:
   - `perf record -a` on a busy server can add significant overhead; brief runs only
   - `perf record --call-graph dwarf` writes lots of data — verify disk space
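   - `perf probe` adds kprobes/uprobes; harmless, but they persist until removed with `perf probe -d <name>`
   - For container workloads, `perf` usually runs on the host, where `-p <pid>` takes the host-side pid. Add `--namespaces` to record namespace events, or filter a system-wide capture by cgroup with `-a -G <cgroup>`. Symbol resolution may still need access to the container's binaries, so running perf inside the container is the fallback.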

---

Workload: [pid, binary, language runtime]
Symptom: [CPU-heavy / slow / latency / unknown]
Kernel + distro: [DESCRIBE]
Debug symbols available? [yes / partial / no]
Existing perf output (if any):
```
[PASTE `perf report` head]
```

Why this prompt works

perf has dozens of subcommands and events; flame graphs require both capture and processing. Most engineers stop at perf top because deeper use is confusing. This prompt provides a recipe per language and explains how to read the resulting graph.

How to use it

  1. Pick on-CPU vs off-CPU first. They answer different questions.
  2. Ensure symbol resolution. A flame graph of [unknown] boxes is useless. Install debuginfo or use language-specific tooling.
  3. 30-60s captures. Longer doesn’t add information for steady-state issues.
  4. Don’t profile something with no load. No traffic = empty flame graph.

Setup

# Install
sudo apt install linux-tools-common linux-tools-$(uname -r) -y         # Ubuntu
sudo dnf install perf -y                                                 # RHEL

# FlameGraph
git clone https://github.com/brendangregg/FlameGraph.git
export PATH="$PWD/FlameGraph:$PATH"

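# Permissions (both settings reset at reboot unless added to /etc/sysctl.d/)
sudo sysctl -w kernel.perf_event_paranoid=1     # let non-root users profile their own processes, including kernel samples
sudo sysctl -w kernel.kptr_restrict=0           # expose kernel symbol addresses so perf can resolve them

# Debug symbols (Ubuntu; the -dbgsym packages come from the ddebs.ubuntu.com repository, which must be enabled first)
sudo apt install linux-image-$(uname -r)-dbgsym -y
sudo apt install <package>-dbgsym -y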
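
Before capturing, it can help to confirm that the tools from the setup steps are actually reachable. The following is a minimal sketch, not part of the original page: `missing_tools` and `paranoid_level` are hypothetical helper names, and the script assumes a bash-compatible shell.

```shell
# missing_tools NAME...: print each command that is not on PATH.
# (Hypothetical helper for illustration; not part of perf or FlameGraph.)
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# paranoid_level: print the current perf_event_paranoid value,
# or "unavailable" if the kernel does not expose it.
paranoid_level() {
  local f=/proc/sys/kernel/perf_event_paranoid
  if [ -r "$f" ]; then cat "$f"; else echo "unavailable"; fi
}
```

A typical call would be `missing_tools perf stackcollapse-perf.pl flamegraph.pl`; empty output means all three are on PATH.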

Recipe: on-CPU flame graph (compiled language)

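# Capture (--call-graph dwarf already enables stack capture; use plain -g instead if the binary has frame pointers)
sudo perf record -F 99 -p <pid> --call-graph dwarf -- sleep 30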

# Process
sudo perf script > out.perf
stackcollapse-perf.pl out.perf > out.folded
flamegraph.pl out.folded > flame.svg

# View
xdg-open flame.svg
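
The folded file is plain text, so the hottest leaf frames can be listed without opening the SVG. This is a rough sketch under stated assumptions: `top_leaves` is a hypothetical helper name, the input is assumed to be in the `frame;frame;leaf COUNT` layout that `stackcollapse-perf.pl` writes with integer counts, and a bash-compatible shell with awk is assumed.

```shell
# top_leaves FILE [N]: print the N leaf frames with the most samples
# as "samples<TAB>percent<TAB>frame". (Hypothetical helper for illustration.)
top_leaves() {
  local file="$1" n="${2:-10}"
  awk '
    {
      count = $NF                      # sample count is the last field
      stack = $0
      sub(/ [0-9]+$/, "", stack)       # drop the trailing count
      k = split(stack, frames, ";")    # frames run root -> leaf
      leaf[frames[k]] += count
      total += count
    }
    END {
      for (f in leaf)
        printf "%d\t%.1f%%\t%s\n", leaf[f], 100 * leaf[f] / total, f
    }
  ' "$file" | sort -t "$(printf '\t')" -k1,1nr -k3,3 | head -n "$n"
}
```

For example, `top_leaves out.folded 5` lists the five widest plateaus, which is a quick cross-check against what the flame graph shows.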

Recipe: JVM

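# Option A: perf-map-agent (https://github.com/jvm-profiling-tools/perf-map-agent)
# Start the JVM with -XX:+PreserveFramePointer so perf can walk Java stacks.
# Then, with the JVM running, write /tmp/perf-<pid>.map:
java -cp <perf-map-agent> net.virtualvoid.perf.AttachOnce <pid>
# Then run the standard capture: perf record -F 99 -p <pid> -g -- sleep 30

# Option B: async-profiler (preferred, simpler; no perf or map file needed)
./asprof -d 30 -f flame.html <pid>
# Or wall-clock mode, which also samples blocked threads and so approximates off-CPU time:
./asprof -d 30 -e wall -f flame.html <pid>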

Recipe: Python

# py-spy (no perf needed; attaching to a running pid usually requires root)
pip install py-spy
sudo py-spy record -o flame.svg --pid <pid> --duration 30

# Or austin (similar); austin2speedscope ships in the austin-python package
sudo austin -p <pid> -i 100ms -o profile.txt
austin2speedscope profile.txt profile.speedscope.json

Recipe: Node.js

# Start Node with --perf-basic-prof so V8 writes /tmp/perf-<pid>.map for JIT-compiled frames
node --perf-basic-prof app.js
sudo perf record -F 99 -p <pid> -g -- sleep 30

# Or use 0x (no perf needed)
npx 0x -- node app.js

Recipe: Off-CPU (where is time spent waiting?)

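# eBPF (preferred, low overhead)
sudo apt install bpfcc-tools                            # Ubuntu
# -f emits folded stacks, -d separates kernel and user frames; values are microseconds off-CPU
sudo /usr/sbin/offcputime-bpfcc -df -p <pid> 30 > out.stacks
flamegraph.pl --colors=io --title="Off-CPU Time" --countname=us out.stacks > offcpu.svg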

Reading a flame graph

Compare a graph that is wide and shallow with one that is narrow and tall:

  • Wide and shallow: a small number of hot leaf functions consuming time. Optimize those.
  • Tall and narrow: deep call stacks; the leaf is small but the path is long. Usually less actionable.
  • A wide plateau in libc/syscalls at the top: the issue is the caller, not the library.
  • A tower of “interpreter” frames (JVM, Python without py-spy): symbol resolution failed; can’t see real code.
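
The difference between a wide frame and a wide plateau can be made concrete from the folded stacks: a frame's width is the share of samples whose stack contains it (inclusive), while its plateau is the share where it is the leaf (self). The sketch below assumes the folded `frame;frame;leaf COUNT` layout with integer counts; `frame_widths` is a hypothetical helper name, and the frame name is matched exactly.

```shell
# frame_widths FILE NAME: print inclusive and self share of samples for one frame.
# (Hypothetical helper for illustration; NAME must match the frame text exactly.)
frame_widths() {
  local file="$1" name="$2"
  awk -v name="$name" '
    {
      count = $NF
      stack = $0
      sub(/ [0-9]+$/, "", stack)          # drop the trailing count
      total += count
      k = split(stack, frames, ";")       # frames run root -> leaf
      if (frames[k] == name) self += count
      for (i = 1; i <= k; i++)
        if (frames[i] == name) { incl += count; break }   # count each stack once
    }
    END {
      if (total > 0)
        printf "inclusive=%.1f%% self=%.1f%%\n", 100 * incl / total, 100 * self / total
    }
  ' "$file"
}
```

A frame with high inclusive but near-zero self share is a wide box low in the graph whose time is spent in its callees; a frame with high self share is a plateau worth optimizing directly.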

Common patterns

| Plateau | Likely meaning |
|---|---|
| `_raw_spin_lock` | Hot kernel lock; check IRQ/scheduler |
| `__memcpy_avx` | Lots of buffer copying; review callers |
| `do_softirq` | Network or block IRQ work; check NIC pinning |
| `futex_wait` (in off-CPU) | Lock contention |
| `read` / `write` (in off-CPU) | Synchronous I/O dominating |
| `[unknown]` | Missing debug info |
| Interpreter / JIT boxes | JVM without perf-map-agent |
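
A first-pass triage for several of these patterns can be scripted against the folded file. The following is only a sketch: `hot_patterns` is a hypothetical helper name, the pattern list is a hand-picked subset of the plateaus above, and matching is by plain substring (so `__memcpy` also counts `__memcpy_avx_unaligned`).

```shell
# hot_patterns FILE: for each known plateau name, print the share of samples
# whose stack mentions it. (Hypothetical helper; substring match on the whole stack.)
hot_patterns() {
  local file="$1" pat
  for pat in _raw_spin_lock __memcpy do_softirq futex_wait '[unknown]'; do
    awk -v pat="$pat" '
      {
        count = $NF                       # sample count is the last field
        total += count
        if (index($0, pat) > 0) hit += count
      }
      END { if (total > 0) printf "%-16s %5.1f%%\n", pat, 100 * hit / total }
    ' "$file"
  done
}
```

Running `hot_patterns out.folded` gives one line per pattern; a large share is a cue to look at that region of the graph, not a diagnosis on its own.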

Common findings this catches

  • JSON parsing dominates the flame graph → switch parser, batch input, or precompile.
  • memcpy in a network path → zero-copy opportunity (splice, sendfile).
  • futex_wait in off-CPU → lock contention; see linux-context-switch-lock-contention.
  • Deep stacks in a Spring app calling reflection-heavy methods → caching opportunity.
  • __schedule in CPU samples → too many threads, frequent preemption.

When to escalate

  • High-overhead production profiling that affects users — switch to eBPF-based tooling (bpftrace, bcc).
  • Heap profiling needs (which perf doesn’t do well): use language-specific tools (jmap/jhsdb for the JVM, tracemalloc for Python, pprof for Go).
  • Kernel-only hot paths that match upstream regressions — file a kernel bug with the flame graph.
