Linux strace / Syscall Debugging Prompt
Use strace, ltrace, ftrace, and bpftrace to find why an app hangs, what files it touches, why a binary fails on a new system, and which syscall actually returns the error.
- Target user: Linux sysadmins and developers debugging at the syscall layer
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior Linux engineer who can read `strace` output the way other engineers read application logs. You know when `strace` is right, when `ltrace` is better, and when only `bpftrace` will work without crippling the production process.

I will provide:
- The symptom (app hangs, "permission denied" with no useful error, slow startup, file not found, binary crashes on one system but not another)
- The process (already running pid OR command to launch)
- Privilege available (can run as root? as the user?)
- Production sensitivity (can the process tolerate strace overhead?)

Your job:

1. **Choose the right tracer**:
   - **`strace`**: syscalls (open, read, write, mmap, ...); slows the process 2-100×
   - **`ltrace`**: library calls (libc, OpenSSL, etc.); slow; less reliable on modern binaries
   - **`perf trace`**: kernel-level, lower overhead via tracepoints
   - **`bpftrace` / `bcc` tools**: eBPF; lowest overhead; needs root and kernel support
   - **`ftrace`**: kernel-level tracing via `/sys/kernel/debug/tracing` (or `/sys/kernel/tracing` on newer kernels)
2. **For "app hangs"**:
   - `strace -p <pid>` → see the current syscall it's blocked in
   - `cat /proc/<pid>/stack` → kernel stack of the thread
   - `cat /proc/<pid>/wchan` → short kernel function name
   - Common blockers: `futex` (lock wait), `read` (waiting on FD), `epoll_wait` (event loop idle), `connect` (slow handshake)
3. **For "file not found" or "permission denied"**:
   - `strace -e openat ./command` → see every open attempt with paths
   - `strace -e openat -f ./command` → follow children too
   - Reveals: wrong paths, missing config, wrong UID's home, /lib vs /lib64 in container
4. **For "binary fails on new system"**:
   - `strace -e openat ./binary 2>&1 | grep ENOENT` → missing libs/configs
   - Run `ldd ./binary` first; `strace` also catches dynamically loaded plugins that `ldd` cannot see
5. **For slow startup**:
   - `strace -c ./command` → summary of call counts, errors, and total time per syscall
   - `strace -tt -e trace=file ./command` → timestamped trace of file ops (a bare `-e stat` can miss `newfstatat`/`statx` on newer glibc)
   - Reveals: stat-ing 1000 files at startup, slow DNS, slow TLS handshake
6. **For production processes**:
   - **Avoid `strace`** if possible; it adds 2-100× overhead per syscall
   - Use `perf trace -p <pid>` (lower overhead) or eBPF tools
   - `opensnoop-bpfcc`, `execsnoop-bpfcc`, `tcpconnect-bpfcc`, `biolatency-bpfcc` for targeted views
   - **Attach briefly** if you must: `strace -p <pid> -e ...` then Ctrl-C ASAP
7. **Strace flag cheatsheet**:
   - `-p <pid>`: attach to a running process
   - `-f`: follow child processes
   - `-e <expr>`: filter (e.g., `-e openat` or `-e trace=network`)
   - `-c`: summary only at exit
   - `-tt`: microsecond timestamps
   - `-T`: show time spent in each syscall
   - `-s <N>`: string length (default 32; raise for full reads)
   - `-o <file>`: write to file
   - `-y`: translate FDs to paths
8. **For library calls** (`ltrace`):
   - Hooking modern binaries is fragile (PLT entries vary)
   - Static binaries don't trace at all
   - Use `frida-trace` or `ltrace -e <symbol>` for specific symbols

Mark DESTRUCTIVE: attaching strace to a critical production process (overhead can cause timeouts/cascading failures), trying to trace a process that has dropped privileges (PTRACE may fail), tracing systemd's PID 1 (system instability).

---

Symptom: [DESCRIBE]
Process: [pid OR command to launch]
Privilege: [root / regular user]
Production sensitivity: [tolerable overhead / live customer-facing / dev env]
What you've already tried: [DESCRIBE]
Why this prompt works
strace output is a wall of text, and many engineers stop reading after the first few lines. But for “app hangs,” “missing file,” or “permission denied without context,” it shows the exact syscall and arguments that failed. This prompt matches the tracer to the scenario and to how much overhead the process can tolerate.
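The tool choice can be read as a small decision table keyed on the prompt's "production sensitivity" field. The sketch below is only an illustration: `choose_tracer` is a hypothetical helper name, the three labels (`dev`, `tolerable`, `live`) are shorthand for the prompt's options, and the suggested commands are starting points rather than a fixed policy.

```shell
#!/bin/sh
# Hypothetical helper: suggest a tracer command from the sensitivity level.
# Usage: choose_tracer <dev|tolerable|live> <pid>
choose_tracer() {
    sensitivity=$1
    pid=$2
    case "$sensitivity" in
        dev)
            # Overhead does not matter: full trace, children included, with timings.
            echo "strace -f -tt -T -p $pid" ;;
        tolerable)
            # Some overhead is acceptable: filter early to keep the attach cheap.
            echo "strace -e trace=file -p $pid" ;;
        live)
            # Customer-facing: prefer tracepoint-based tracing over ptrace.
            echo "perf trace -p $pid" ;;
        *)
            echo "unknown sensitivity: $sensitivity" >&2
            return 1 ;;
    esac
}

choose_tracer live 1234
```

The helper only prints a suggestion; it never attaches to anything, so it is safe to run anywhere.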
How to use it
- Pick the tool by scenario — strace for “what is it doing,” ltrace for “what library call failed,” bpftrace for production observability.
- Filter early. strace -e openat gives a focused view; full strace is overwhelming.
- For production, minimize duration. Attach, capture, detach within seconds.
- For permission errors, look for EACCES/EPERM in the trace output.
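When a trace has already been saved with `-o`, a quick tally of failures by syscall and errno is often a faster first pass than scrolling. The following is a minimal sketch, assuming strace's default line format without `-t`/`-tt` timestamp prefixes; `errno_summary` is a hypothetical name and the sample lines are invented for illustration.

```shell
#!/bin/sh
# Sketch: count failed syscalls in a saved strace log by syscall name and errno.
# Assumes default strace formatting with no timestamp prefix.
errno_summary() {
    awk '
        / = -1 E[A-Z0-9]+/ {
            name = $0
            sub(/^\[pid +[0-9]+\] +/, "", name)   # prefix added by -f on the terminal
            sub(/^[0-9]+ +/, "", name)            # prefix added by -f together with -o
            sub(/\(.*/, "", name)                 # keep only the syscall name
            match($0, / = -1 E[A-Z0-9]+/)
            err = substr($0, RSTART + 6, RLENGTH - 6)
            count[name " " err]++
        }
        END { for (k in count) print count[k], k }
    ' "$1" | sort -rn
}

# Hypothetical sample lines, invented for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
openat(AT_FDCWD, "/etc/myapp.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/app/.myapp.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/var/lib/myapp/state", O_RDWR) = -1 EACCES (Permission denied)
read(3, "ok\n", 4096) = 3
EOF
errno_summary "$sample"
rm -f "$sample"
```

On a real capture, something like `strace -f -o /tmp/trace.txt ./command` followed by `errno_summary /tmp/trace.txt` would give the same kind of table.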
Useful commands
# strace basics
sudo strace -p <pid> # attach (Ctrl-C to detach)
sudo strace -p <pid> -e openat # only open syscalls
sudo strace -p <pid> -e trace=network # network calls
sudo strace -p <pid> -f -o /tmp/trace.txt # follow children, to file
sudo strace -c -p <pid> # summary at exit (Ctrl-C)
sudo strace -tt -T -p <pid> # timestamps + duration
# At launch
strace -e openat ./command
strace -ff -o trace.log ./command # one file per PID
strace -e trace=signal ./command # only signals
# Filter to common need
strace -e trace=file -p <pid> # all FS-related
strace -e trace=desc -p <pid> # FD-related (read/write/close)
strace -e trace=process -p <pid> # fork/exec/clone
strace -e trace=network -p <pid> # socket/connect/accept
strace -y -p <pid> # translate FDs to paths
# What is a hung process doing?
sudo cat /proc/<pid>/stack
sudo cat /proc/<pid>/wchan
sudo cat /proc/<pid>/status | grep State
sudo strace -p <pid> # see current syscall
# ltrace (library calls; modern binaries often fail to hook)
sudo ltrace -p <pid>
sudo ltrace -e malloc+free+strlen ./command
# perf trace (lower overhead than strace)
sudo perf trace -p <pid>
sudo perf trace --no-syscalls --event 'syscalls:sys_enter_openat' -p <pid>
# bpftrace one-liners
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
sudo bpftrace -e 'tracepoint:syscalls:sys_exit_openat /args->ret < 0/ { printf("FAIL: %s %d\n", comm, args->ret); }'
# bcc tools (more polished)
sudo /usr/share/bcc/tools/opensnoop -p <pid>
sudo /usr/share/bcc/tools/execsnoop
sudo /usr/share/bcc/tools/tcpconnect
sudo /usr/share/bcc/tools/biolatency 5
sudo /usr/share/bcc/tools/funccount 'vfs_*'
sudo /usr/share/bcc/tools/stackcount -p <pid> -K do_softirq
# ftrace (very low overhead, kernel-side)
sudo trace-cmd record -e sched_switch sleep 5
sudo trace-cmd report
Common scenarios
"App can't find config"
strace -e openat -f ./app 2>&1 | grep ENOENT
# Shows every "No such file" — usually the missing config path
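Dynamic linkers and language runtimes probe many paths that are expected to be absent, so the raw ENOENT list can be noisy. One way to make repeated lookups stand out is to count them per path. This is a sketch under the assumption that the path is the first double-quoted string on the line (true for `openat`, `access`, and `stat`-style calls); `missing_paths` is a hypothetical name and the sample lines are invented.

```shell
#!/bin/sh
# Sketch: list paths that failed with ENOENT in a saved strace log, most frequent first.
missing_paths() {
    # Keep only ENOENT lines, pull out the first double-quoted string (the path),
    # then count duplicates.
    grep ' = -1 ENOENT' "$1" |
        sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' |
        sort | uniq -c | sort -rn
}

# Hypothetical sample lines, invented for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
openat(AT_FDCWD, "/etc/myapp.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/libfoo.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/myapp.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
EOF
missing_paths "$sample"
rm -f "$sample"
```

Paths that contain escaped quotes will be cut short by the `sed` pattern; for a first look at a missing config that limitation rarely matters.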
"Permission denied" with no app log
strace -e openat,access -f ./app 2>&1 | grep -E "EACCES|EPERM"
# Shows the exact path and operation that failed
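Once the trace names the failing path, the cause is frequently a parent directory that lacks search (execute) permission rather than the file itself. A minimal sketch for inspecting every component follows; `walk_perms` is a hypothetical name, and the `-c` format flag assumes GNU coreutils (or BusyBox) `stat`.

```shell
#!/bin/sh
# Sketch: print mode, owner, and group for a path and each of its parent directories.
walk_perms() {
    p=$1
    while :; do
        stat -c '%A %U:%G %n' "$p" 2>/dev/null || echo "(cannot stat) $p"
        # Stop at the filesystem root, or at "." for relative paths.
        if [ "$p" = "/" ] || [ "$p" = "." ]; then
            break
        fi
        p=$(dirname "$p")
    done
}

walk_perms /etc/passwd
```

`namei -l <path>` from util-linux prints a similar listing where it is installed. Neither approach shows SELinux denials, which need the audit log.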
"App hangs"
# Identify state first
sudo cat /proc/<pid>/wchan
# Then attach
sudo strace -p <pid>
# Common: futex (lock), read (FD blocked), epoll_wait (idle)
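For a multi-threaded process, the main thread's `wchan` alone can mislead, because the thread that holds the lock may be a different one. The sketch below reads per-thread state from `/proc/<pid>/task` without attaching any tracer, so it adds no overhead to the target. `hang_snapshot` is a hypothetical name; `wchan` and `syscall` may be unreadable without root for other users' processes, in which case the field prints as `?`.

```shell
#!/bin/sh
# Sketch: one line per thread with scheduler state, wait channel, and syscall number.
# Reads only /proc; nothing is attached to the target.
hang_snapshot() {
    pid=$1
    [ -d "/proc/$pid/task" ] || { echo "no such process: $pid" >&2; return 1; }
    for t in /proc/"$pid"/task/*; do
        tid=${t##*/}
        state=$(awk '/^State:/ { print $2 }' "$t/status" 2>/dev/null)
        wchan=$(cat "$t/wchan" 2>/dev/null)
        # First field of .../syscall is the syscall number, "running" if the
        # thread is on CPU, or -1 if it is blocked outside a syscall.
        sysno=$(cut -d' ' -f1 "$t/syscall" 2>/dev/null)
        echo "tid=$tid state=${state:-?} wchan=${wchan:-?} syscall=${sysno:-?}"
    done
}

# Demonstration against the current shell.
hang_snapshot "$$"
```

Many threads in state `S` with the same `wchan` and one thread in `D` or `R` is a common shape for lock contention; the odd thread out is usually the one worth tracing.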
Slow startup profiling
strace -c ./app # summary at exit shows top syscalls by time
# Or:
strace -tt -e openat ./app 2>&1 | head -100
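The `-c` summary shows totals, but a single slow call (a DNS lookup that times out, for example) is easier to spot by sorting individual calls by duration. This sketch assumes the trace was captured with `-T`, which appends the time spent in each call as a trailing `<seconds>` field; `slowest_calls` is a hypothetical name and the sample lines are invented.

```shell
#!/bin/sh
# Sketch: print the N slowest syscalls from a trace captured with `strace -T`.
# Usage: slowest_calls <tracefile> [N]   (N defaults to 10)
slowest_calls() {
    n=${2:-10}
    awk '
        match($0, /<[0-9]+\.[0-9]+>$/) {
            # Strip the angle brackets and put the duration first for sorting.
            secs = substr($0, RSTART + 1, RLENGTH - 2)
            print secs, $0
        }
    ' "$1" | sort -rn | head -n "$n"
}

# Hypothetical sample lines, invented for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 <0.000015>
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.0.2")}, 16) = 0 <0.000040>
poll([{fd=3, events=POLLIN}], 1, 5000) = 0 (Timeout) <5.005112>
read(4, "data", 4096) = 4 <0.000020>
EOF
slowest_calls "$sample" 2
rm -f "$sample"
```

A capture such as `strace -f -T -o /tmp/startup.trace ./app` would feed it; unfinished or resumed lines from `-f` carry no duration on the first half and are skipped.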
Find which library call failed
ltrace -e '*' ./app 2>&1 | tail -50
# ltrace often shows nothing on modern binaries linked with -z now; if so, fall back to strace
Production-safe peek at a syscall
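# Use perf trace (lower overhead) or eBPF; cap the capture with timeout
# (perf trace --duration N only filters to calls slower than N ms)
sudo timeout -s INT 10 perf trace -p <pid>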
sudo /usr/share/bcc/tools/opensnoop -p <pid>
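Relying on a human to press Ctrl-C quickly is the weak point of any brief attach. A small wrapper around `timeout(1)` can enforce the limit instead. This is a sketch, not a hardened tool: `timeboxed` is a hypothetical name, and it assumes a `timeout` that supports `-s` and `-k` (GNU coreutils does). `sleep` stands in for the tracer so the example runs anywhere.

```shell
#!/bin/sh
# Sketch: run a command for at most N seconds, then interrupt it as Ctrl-C would.
# -k 2 sends SIGKILL two seconds later as a backstop if SIGINT is ignored.
# Usage: timeboxed <seconds> <command> [args...]
timeboxed() {
    secs=$1
    shift
    timeout -k 2 -s INT "$secs" "$@"
}

# Stand-in demonstration: the 30-second sleep is cut off after about 1 second.
timeboxed 1 sleep 30
echo "exit status: $?"
```

For a tracer that needs root, it is safer to put `sudo` in front of `timeout` itself (as in `sudo timeout -s INT 10 strace -p <pid>`), since an unprivileged `timeout` may not be allowed to signal a root-owned child.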
Common findings this catches
- openat("/etc/myapp.conf", ...) = -1 ENOENT → config missing.
- connect(... 2.3.4.5:443) = -1 ETIMEDOUT → network reach issue, not the app.
- Hung process in futex(FUTEX_WAIT, ...) → lock contention; investigate other threads.
- stat() on hundreds of paths at startup → JVM/Python loading every classpath/site-packages dir. Cache or trim.
- read(fd, ...) blocked on a socket → upstream slow; correlate with target.
- mmap failing with ENOMEM → memory pressure or vm.max_map_count too low (common JVM issue: set sysctl -w vm.max_map_count=262144).
- access() returning EACCES before open → app pre-checking; SELinux or POSIX perms.
When to escalate
- Production hang requiring extensive tracing → move to eBPF tools to avoid amplifying the problem.
- Trace evidence of a kernel-side bug (specific syscall returning impossible value) — file kernel bug with reproducer.
- Suspected userspace tracer compatibility (ltrace on a relocated binary failing) — switch tools rather than fight it.
Related prompts
- Linux Context Switch & Lock Contention Diagnosis Prompt: Diagnose context-switch storms, futex contention, kernel-level lock waits, and CPU scheduling pathologies that masquerade as 'app is slow.'
- Linux High Load & CPU Runaway Investigation Prompt: Diagnose high load average, CPU saturation, run-queue pressure, IRQ storms, and steal time on Linux servers, distinguishing user CPU vs system CPU vs I/O wait vs steal.
- Linux `perf` & Flame Graph Profiling Prompt: Profile a Linux process with `perf record` and generate flame graphs to find CPU hotspots, off-CPU waits, and frequent stack patterns.