ftrace & kprobe Dynamic Kernel Tracing Prompt

Drive a structured ftrace / kprobe investigation to trace kernel function latency, follow a syscall through the kernel, and answer 'why is this call slow inside the kernel' without recompiling or rebooting.

Target user: Linux admins and kernel-adjacent SREs debugging in-kernel latency
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a kernel tracing specialist who reaches for ftrace and kprobes before bpftrace because they are built into most distribution kernels, are driven entirely through tracefs at `/sys/kernel/tracing`, and need no extra toolchain.

I will provide:
- Kernel version (`uname -r`) and whether `CONFIG_FUNCTION_TRACER` / `CONFIG_DYNAMIC_FTRACE` / `CONFIG_KPROBES` are set
- The symptom (a syscall, ioctl, or filesystem op that is intermittently slow)
- Any candidate kernel functions or subsystems you suspect
- Constraints (production box, can't reboot, tracefs may be the only interface)

Walk me through this, command-by-command:

1. **Confirm the interface** — `mount | grep tracefs`; if tracefs is not mounted at `/sys/kernel/tracing`, fall back to the debugfs path `/sys/kernel/debug/tracing`. Check `available_tracers` and `available_filter_functions` so we only target tracers and symbols that actually exist.

2. **Function & function_graph tracing** — set `current_tracer`, use `set_ftrace_filter` / `set_ftrace_notrace` to scope to one subsystem, and read `trace`. Show how `function_graph` exposes per-function duration and the call tree, and how to set `tracing_thresh` (in microseconds) to capture only slow calls.

3. **Per-PID and per-CPU scoping** — `set_ftrace_pid`, `tracing_cpumask`, and why unbounded tracing will swamp the ring buffer (`buffer_size_kb`, the `overrun` counter in `per_cpu/cpuN/stats`, and the entries-in-buffer/entries-written gap in the `trace` header).

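4. **kprobes for arguments** — register a dynamic probe via `kprobe_events` to capture function arguments (`p:` probe) and return values (`r:` kretprobe with `$retval`), naming the exact register syntax for the arch (`%di`/`%si` on x86-64) or the portable `$arg1` form where the kernel supports it, then enable the event under `events/kprobes/` and read the results from `trace` or `trace_pipe`.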

5. **Latency tracers** — when to switch to `irqsoff`, `preemptoff`, or `wakeup_rt` to chase scheduling/latency rather than a specific function.

6. **Correlate to userspace** — tie kernel timestamps back to the offending PID and the userspace stack so the finding is actionable.

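7. **Tear down cleanly** — reset `current_tracer` to `nop`, clear filters, `set_ftrace_pid`, and `tracing_thresh`, disable each kprobe event before clearing `kprobe_events` (removing an enabled probe fails), and restore `tracing_on`; leaving probes armed has measurable overhead.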

For every step give the exact echo/cat into tracefs, what a healthy vs pathological trace looks like, and the overhead. End with a root-cause statement and the single trace excerpt that proves it.

Bias toward: minimal blast radius, always-clean teardown, and reproducible one-liners over GUI tooling.
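Example of the kind of answer to expect (not part of the prompt): steps 2, 3, and 7 might reduce to something like the sketch below. It is a minimal illustration, not a vetted production script. The function names `trace_slow_calls` and `teardown_tracing` and the `TRACEFS` override variable are hypothetical, and the sketch assumes a kernel built with `CONFIG_FUNCTION_GRAPH_TRACER` where writing an empty line clears `set_ftrace_filter` and `set_ftrace_pid`.

```shell
#!/bin/sh
# Sketch only: record calls to ONE kernel function that exceed a latency
# threshold using function_graph, then always restore the tracer state.
# Hypothetical names: trace_slow_calls, teardown_tracing, TRACEFS.

# Undo everything trace_slow_calls touches. Safe to call more than once.
teardown_tracing() {
    t="${TRACEFS:-/sys/kernel/tracing}"
    echo 0   > "$t/tracing_on"          # stop recording first
    echo nop > "$t/current_tracer"      # detach the tracer
    echo 0   > "$t/tracing_thresh"      # remove the latency threshold
    echo     > "$t/set_ftrace_filter"   # clear the function filter
    echo     > "$t/set_ftrace_pid"      # clear the PID filter
    echo 1   > "$t/tracing_on"          # restore the default "on" state
}

# usage: trace_slow_calls FUNC THRESH_US SECONDS [PID]
trace_slow_calls() {
    t="${TRACEFS:-/sys/kernel/tracing}"
    func=$1
    thresh_us=$2
    secs=$3
    pid=${4:-}

    # Refuse symbols ftrace cannot hook. Entries for module functions
    # carry a " [module]" suffix, so match the first field only.
    if ! grep -Eq "^${func}( |\$)" "$t/available_filter_functions"; then
        echo "not traceable: $func" >&2
        return 1
    fi

    trap teardown_tracing INT TERM      # clean up even if interrupted

    echo 0 > "$t/tracing_on"
    echo   > "$t/trace"                 # drop stale entries
    echo "$func" > "$t/set_ftrace_filter"
    if [ -n "$pid" ]; then
        echo "$pid" > "$t/set_ftrace_pid"
    fi
    echo "$thresh_us" > "$t/tracing_thresh"   # microseconds
    echo function_graph > "$t/current_tracer"

    echo 1 > "$t/tracing_on"
    sleep "$secs"                       # bounded capture window
    echo 0 > "$t/tracing_on"

    cat "$t/trace"                      # slow calls, with durations
    teardown_tracing
    trap - INT TERM
}
```

Run as root, a call such as `trace_slow_calls vfs_read 100 5 1234` would capture five seconds of `vfs_read` calls slower than 100 µs made by PID 1234; the function name, threshold, and PID here are placeholders. Filtering to a single function and a single PID before enabling the tracer keeps the ring buffer from filling with unrelated entries, and the bounded `sleep` plus unconditional teardown keeps the blast radius small.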
