Debugging Linux Processes with strace and ltrace (and AI)

Some bugs don’t show up in the logs because the process never gets far enough to log. It hangs on startup, fails with a generic “permission denied,” or silently does nothing. When the application’s own output is a dead end, you go one level down and watch the system calls it makes. strace shows you every syscall a process issues — every file it opens, every socket it touches, every EACCES it hits — and there’s no hiding from it.

The catch is that strace output is a firehose. A process doing nothing interesting still emits thousands of lines. This is where AI has become a real part of my workflow: it reads the trace far faster than I can and points at the one openat() that returned ENOENT among ten thousand boring lines. It’s a fast junior engineer pattern-matching against a haystack. It reads the trace; I decide what to do about the box, because the fix usually mutates production.

strace versus ltrace — pick the right lens

They sound similar and aren’t. strace traces system calls (the kernel boundary). ltrace traces library calls (functions like malloc or getenv). For “why can’t this process open a file / bind a port / talk to the network,” you want strace. For “what’s this thing doing inside libc,” ltrace. Ninety percent of the time it’s strace:

strace -f -e trace=file ./myapp        # only file-related syscalls
strace -f -e trace=network ./myapp     # sockets, connect, bind
strace -f -p 12345                     # attach to a running PID

The -f follows forked children, which is essential because the syscall you care about is often in a child the parent spawned. The -e trace= filter is what makes the output readable; tracing everything buries the signal. I keep these strace recipes with my other Linux admin prompts.

The single most useful invocation

When a process “just fails” with no useful message, this is where I start:

strace -f -e trace=file -o /tmp/trace.log ./myapp
grep -E 'ENOENT|EACCES|EPERM' /tmp/trace.log

That grep finds the missing files and permission failures, which together explain most startup failures. A config file at the wrong path, a socket directory that doesn’t exist, a cert file the service user can’t read: they all surface here as openat(... ) = -1 ENOENT or EACCES.

Pro Tip: Save the trace to a file with -o instead of watching it scroll. A 50,000-line trace is unreadable live, but grep against the file finds the one failing syscall in a second, and the file is exactly what you hand to an AI for a second opinion.
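As a rough sketch of taking that grep one step further, the helper below groups failures by errno and path so repeated misses stand out. The name summarize_failures is made up for illustration, and it assumes strace’s default output format, where a failed call ends in " = -1 ERRNO (description)" and the first quoted string on the line is the path of interest. Some ENOENT lines are routine (the dynamic loader probing library search paths, for instance), so treat the counts as leads rather than verdicts.

```shell
# summarize_failures LOGFILE  (hypothetical helper, not part of strace)
# Counts failed syscalls by errno and first quoted path, most frequent first.
summarize_failures() {
  awk '
    / = -1 (ENOENT|EACCES|EPERM) / {
      # errno name: the word right after " = -1 "
      errno = $0
      sub(/.* = -1 /, "", errno)
      sub(/ .*/, "", errno)
      # path: first double-quoted string on the line, if any
      path = "?"
      if (match($0, /"[^"]*"/)) path = substr($0, RSTART + 1, RLENGTH - 2)
      count[errno " " path]++
    }
    END { for (k in count) print count[k], k }
  ' "$1" | sort -rn
}
```

Called as summarize_failures /tmp/trace.log, it prints one line per distinct failure in the form "count ERRNO path".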

Let AI read the haystack

This is the part that genuinely changed how fast I debug. Paste a trimmed trace into your assistant:

Here’s an strace excerpt from a process that exits immediately with no error message. Identify the syscall that’s failing, what it was trying to do, and the most likely root cause. The app is a Go binary on Ubuntu 24.04.

Models are usually good at this. A typical answer spots that the binary tried to openat() a config at /etc/myapp/config.yaml, got ENOENT, and exited, and says in one sentence what would have taken you ten minutes of scrolling. For a hang, give it the last few syscalls before the process froze, and it’ll often identify a blocking read() on a socket or a flock() waiting on a lock another process holds.
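Trimming the trace before pasting is the step that is easy to do badly by hand. A minimal sketch, assuming a trace saved with -o in strace’s default format: the two function names are hypothetical, and the default line counts are arbitrary starting points rather than recommendations.

```shell
# Hypothetical helpers for trimming a saved trace before pasting it into a prompt.

# failure_excerpt LOGFILE [MAX_LINES]
# Every failed syscall plus the two lines before it, capped at MAX_LINES.
failure_excerpt() {
  grep -E -B2 ' = -1 E[A-Z0-9]+ ' "$1" | tail -n "${2:-60}"
}

# hang_excerpt LOGFILE [LINES]
# The last few syscalls recorded before the process stopped making progress.
hang_excerpt() {
  tail -n "${2:-20}" "$1"
}
```

For a process that exits immediately, failure_excerpt /tmp/trace.log is the one to paste; for a freeze, hang_excerpt /tmp/hang.log.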

The incident response helper is built around exactly this loop — feed it symptoms and trace excerpts, get back a structured investigation path. And the code review tool is useful when the root cause turns out to be in a script you control and you want a fix reviewed before it ships.

Tracing a hung process safely

When something’s wedged in production, you want to look without making it worse. Attaching strace to a live PID adds overhead and, on a latency-sensitive process, can change its behavior — so be deliberate:

strace -f -p 12345 -e trace=network -o /tmp/hang.log
# watch for a few seconds, then Ctrl-C to detach
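The “few seconds” can be enforced rather than remembered. The sketch below wraps the attach in coreutils timeout, which sends SIGINT (the signal Ctrl-C sends) when the time is up. The trace_for name, the STRACE override variable, and the /tmp/hang.PID.log output path are all assumptions made for this example.

```shell
# trace_for SECONDS PID [extra strace args...]  (hypothetical wrapper)
# Attaches for a bounded time, then sends SIGINT, the same signal Ctrl-C sends.
# STRACE is an assumed override hook so the wrapper can be dry-run with a stub.
trace_for() {
  secs=$1 pid=$2
  shift 2
  timeout -s INT "$secs" "${STRACE:-strace}" -f -p "$pid" \
    -o "/tmp/hang.$pid.log" "$@"
}
```

For example, trace_for 5 12345 -e trace=network attaches for five seconds. When the limit is hit, timeout exits with status 124, so a non-zero status here is expected rather than a sign that something broke.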

strace detaches cleanly on Ctrl-C and leaves the process running. If you’d rather not perturb the process at all, a single snapshot of where it’s stuck is cheaper:

sudo cat /proc/12345/stack     # kernel stack (root only): what it's blocked in
ls -l /proc/12345/fd           # open file descriptors
cat /proc/12345/wchan          # the kernel function it's waiting on

Those /proc reads cost next to nothing and are often enough on their own: a wchan of unix_stream_read_generic tells you it’s blocked reading a unix socket. Hand any of this to the AI to interpret. It reads /proc entries fluently and explains what a wchan value means without you reaching for kernel source.
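To hand those reads over in one paste, they can be gathered into a single labelled block. This is a sketch only: proc_snapshot is a hypothetical name, and the fallback text assumes that an unreadable entry usually means missing privileges, although the process may simply have exited.

```shell
# proc_snapshot PID  (hypothetical helper)
# Prints the /proc reads above as one labelled block, ready to paste.
# Reads that fail (no permission, process gone) are marked, not fatal.
proc_snapshot() {
  pid=$1
  for f in wchan stack; do
    printf '== /proc/%s/%s ==\n' "$pid" "$f"
    cat "/proc/$pid/$f" 2>/dev/null || printf '(unreadable; may need root)'
    printf '\n'
  done
  printf '== /proc/%s/fd ==\n' "$pid"
  ls -l "/proc/$pid/fd" 2>/dev/null || printf '(unreadable; may need root)\n'
}
```

Running proc_snapshot 12345 > /tmp/snap.txt gives a file that sits alongside the trace excerpt in a prompt.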

Permissions and SELinux/AppArmor wrinkles

A syscall returning EACCES isn’t always plain Unix permissions. On RHEL it might be SELinux denying access even though the file mode looks fine; on Ubuntu it might be AppArmor. strace shows you the EACCES, but the reason lives in the audit log or the kernel log:

sudo ausearch -m avc -ts recent       # SELinux denials
sudo dmesg | grep -i apparmor          # AppArmor denials

When strace says EACCES but ls -l says the permissions are fine, that mismatch is the tell. Give both the strace line and the audit output to the AI and ask it to reconcile them — it’s good at recognizing “the file mode is fine, so this is a mandatory access control denial, not a DAC one.” You then fix the policy properly rather than reaching for chmod 777 or setenforce 0.
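One check is worth doing before settling on a MAC denial: ordinary permissions apply to every directory on the path, not only the file, and a missing search (x) bit on any parent also produces EACCES. A minimal sketch, assuming GNU coreutils stat and readlink; dac_walk is a hypothetical name.

```shell
# dac_walk PATH  (hypothetical helper; assumes GNU coreutils stat and readlink)
# Prints mode and owner for PATH and every parent directory up to /.
dac_walk() {
  p=$(readlink -f -- "$1") || return 1
  while :; do
    stat -c '%A %U:%G %n' -- "$p"
    [ "$p" = "/" ] && break
    p=$(dirname -- "$p")
  done
}
```

If every line of dac_walk /etc/myapp/config.yaml looks traversable for the service user, the “mode is fine, so it’s MAC” reading is on firmer ground, and the walk output is one more thing to paste next to the audit lines.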

Keep the model out of the production loop

The trace, the /proc reads, the audit logs — all read-only context you can safely share. What the AI never gets is a way to act on the box. It tells you the failing syscall and the likely fix; restarting the service, fixing a permission, or editing an AppArmor profile is a human running a reviewed command. That separation matters most precisely when you’re debugging something already on fire, because the temptation to let an “obvious” suggested fix run unattended is highest exactly when you’re most stressed and most likely to skip the review.

I keep my strace one-liners and the trace-interpretation prompts in the prompt packs and prompts library, so the next time a process fails silently I’m grepping for ENOENT within seconds instead of remembering the flags.

Conclusion

When logs go quiet, strace and ltrace show you the truth at the syscall and library boundary, and /proc gives you a nearly free peek at a hung process. The traces are dense by nature, and that density is exactly what AI is good at digesting: it finds the one failing syscall in a wall of output and names the likely cause. Let it read; you keep the keyboard for anything that touches the running system.