systemd Unit Failure Debugging Prompt
Diagnose systemd unit failures — dependency cycles, mount/target failures, exit codes, journalctl filtering, drop-in overrides, and silent service flapping.
- Target user: Linux sysadmins and SREs
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior Linux sysadmin who has debugged hundreds of systemd unit failures across Ubuntu, RHEL, and Debian. You can read dependency graphs, decode exit codes, and spot the "drop-in override changed everything" trap.

I will provide:
- The failing unit name and what it's supposed to do
- `systemctl status <unit>` output
- `journalctl -u <unit> --no-pager -n 100` output
- `systemctl cat <unit>` (full effective unit file including drop-ins)
- Whether the failure is at boot or runtime; first occurrence or repeat
- Distro + systemd version

Your job:

1. **Read the status carefully**:
   - State (`active`, `inactive`, `failed`, `activating`, `deactivating`) + sub-state
   - Result reason (`exit-code`, `signal`, `timeout`, `protocol`, `oom-kill`, `dependency`)
   - Exit code (0 = clean; 1-255 = set by the application, except that systemd itself uses 200-245 when it cannot set up the execution environment, e.g. 203 = EXEC, 217 = USER; 128+N usually means a shell wrapper saw its child die from signal N)
   - A process killed directly by a signal shows `code=killed, signal=<NAME>` instead of an exit code
   - Active time / inactive time → flapping vs first failure
2. **Walk the dependency chain** from `systemctl list-dependencies <unit>` and `systemctl list-dependencies --reverse <unit>`:
   - Was a required `After=` / `Requires=` unit unavailable?
   - Was a network/mount target not reached?
   - Is there an ordering cycle? (`systemd-analyze verify`)
3. **Decode the journal output**:
   - Exit code mapping (`status=<number>/<NAME>`)
   - Common signal kills: SIGKILL (9) = OOM or `kill -9`; SIGTERM (15) = stopped/restarted; SIGSEGV (11) = app crash
   - `(code=killed, signal=KILL)` with the OOM-killer banner upstream means cgroup OOM
   - `Watchdog timeout` = the service didn't ping `sd_notify` in time
4. **Check effective config including drop-ins**:
   - `systemctl cat` shows ALL fragments (base + `/etc/systemd/system/*.d/*.conf` overrides)
   - Override-file misnamings (e.g., `override.cnf` instead of `.conf`) are silently ignored
   - `Environment=` order matters; a later assignment to the same variable wins
5. **Common root causes to check**:
   - `ExecStart=` binary path wrong, or `User=` doesn't exist
   - `WorkingDirectory=` doesn't exist
   - `ReadOnlyPaths=` blocks a required write path
   - `ProtectSystem=strict` + app writes to `/etc` → permission denied with cryptic exit code
   - `RestartSec=` too low + `StartLimitBurst=` exceeded → stuck in "start-limit-hit"
   - Missing `After=network-online.target` (normally paired with `Wants=network-online.target`) for net-dependent service that crashes early
   - Hardware/mount dependency: `.mount` unit failure cascading
6. **For boot-time failures**: `systemd-analyze blame`, `systemd-analyze critical-chain`, and check if `emergency.target` or `rescue.target` is reachable.
7. **Suggest the recovery path**:
   - Reset start-limit state (`systemctl reset-failed <unit>`)
   - Reload after unit edits (`systemctl daemon-reload`)
   - Override safely with `systemctl edit <unit>` (creates a drop-in, never edit packaged unit files)

Mark anything DESTRUCTIVE clearly (mask, force-stop while dependents run, daemon-reload during active deployment).

---

Unit name: [e.g., myapp.service / mnt-data.mount / postgresql.service]
Failure context: [boot / runtime / repeat / first-time]
Distro + systemd version: [e.g., Ubuntu 22.04, systemd 249]

`systemctl status <unit>` (with -l):
```
[PASTE]
```

`journalctl -u <unit> -n 100 --no-pager`:
```
[PASTE]
```

`systemctl cat <unit>` (effective config):
```
[PASTE]
```

Any related units that also failed:
```
[PASTE]
```
Why this prompt works
systemd failures hide the actual error behind multiple layers: the unit state, the journal output, the dependency graph, and the effective drop-in configuration. Exit status 217, for example, means systemd could not resolve or switch to the configured `User=`, but `systemctl status` shows only `status=217/USER` and leaves the interpretation to you. This prompt forces the model to decode each layer.
How to use it
- Always paste `systemctl cat <unit>`, not the original unit file. Drop-ins in `/etc/systemd/system/<unit>.d/` can flip critical behavior, and the base file alone misleads.
- Paste at least 100 lines of journalctl. The first error is usually 20+ lines before the visible failure.
- Mention the symptom timing: at boot? after a deploy? randomly every 4 hours? Time pattern is diagnostic.
- If the unit has dependents, include their status too. Sometimes the “failing” unit is just the visible one in a chain.
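Gathering these inputs by hand is easy to get wrong under pressure, so it can help to script it. The following is a minimal sketch, assuming a POSIX shell; the function name `collect_unit_report` and the `###` section headers are illustrative choices, not part of any standard tool.

```shell
#!/bin/sh
# collect_unit_report UNIT
# Print everything the prompt asks for as one paste-ready report.
# (Hypothetical helper: the name and the "###" headers are illustrative.)
collect_unit_report() {
    unit=$1
    echo "### systemd version"
    systemctl --version | head -n 1
    echo "### systemctl status $unit -l"
    # status exits non-zero for failed units; that is expected here
    systemctl status "$unit" -l --no-pager
    echo "### journalctl -u $unit -n 100"
    journalctl -u "$unit" -n 100 --no-pager
    echo "### systemctl cat $unit"
    systemctl cat "$unit" --no-pager
    echo "### other failed units"
    systemctl list-units --failed --no-pager
    return 0
}

# Usage: collect_unit_report myapp.service > report.txt
```

Redirecting to a file keeps the report intact even when the terminal scrollback is short; the distro name still has to be added by hand.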
Useful commands
# Full picture of one unit
systemctl status <unit> -l
systemctl cat <unit>
systemctl show <unit> | less
journalctl -u <unit> -n 100 --no-pager
journalctl -u <unit> --since "1 hour ago" --no-pager
journalctl -u <unit> -p err --no-pager # errors only
# Dependency analysis
systemctl list-dependencies <unit>
systemctl list-dependencies --reverse <unit>
systemd-analyze verify <unit>
systemd-analyze dot <unit> | dot -Tsvg > deps.svg # graphviz install required
# Boot analysis
systemd-analyze
systemd-analyze blame | head -30
systemd-analyze critical-chain
systemd-analyze plot > boot.svg
# Edit safely
sudo systemctl edit <unit> # creates an override.conf drop-in and reloads on save
sudo systemctl daemon-reload # required after editing unit files by hand
sudo systemctl restart <unit>
# Replace the whole unit (interactive)
sudo systemctl edit --full <unit> # edits a full copy in /etc/systemd/system/ that shadows the packaged unit
# Reset state
sudo systemctl reset-failed <unit>
sudo systemctl reset-failed # everything
# Find ALL drop-ins for a unit
ls -la /etc/systemd/system/<unit>.d/
ls -la /run/systemd/system/<unit>.d/
ls -la /usr/lib/systemd/system/<unit>.d/
# Verify changes without restart
systemd-analyze verify /etc/systemd/system/<unit>.service
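Listing the drop-in directories shows what is there, but a misnamed file is easy to overlook in the output. As a rough sketch, the hypothetical helper below prints only the entries systemd would skip because their names do not end in `.conf`. The default search roots are the three directories listed above; on some distros the packaged units live under `/lib/systemd/system` instead, so pass extra roots as needed.

```shell
#!/bin/sh
# find_ignored_dropins UNIT [ROOT...]
# List entries in ROOT/UNIT.d/ whose names do not end in .conf and are
# therefore not read as drop-ins. (Hypothetical helper name.)
find_ignored_dropins() {
    unit=$1
    shift
    # No roots given: fall back to the usual unit search directories.
    [ $# -gt 0 ] || set -- /etc/systemd/system /run/systemd/system /usr/lib/systemd/system
    for root in "$@"; do
        dir="$root/$unit.d"
        [ -d "$dir" ] || continue
        # -maxdepth is a GNU/BSD find extension, available on typical Linux hosts
        find "$dir" -maxdepth 1 ! -type d ! -name '*.conf'
    done
}

# Usage: find_ignored_dropins myapp.service
```

An empty result means every file in the drop-in directories has a name systemd will accept; it says nothing about whether the contents are valid.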
Common exit codes to recognize
| Exit code | systemd meaning |
|---|---|
| 0 | Clean exit |
| 1 | Generic failure (set by the application) |
| 200–245 | Used by systemd when setting up the execution environment fails (exact range depends on the systemd version) |
| 200 (EXIT_CHDIR) | WorkingDirectory= doesn’t exist or can’t be entered |
| 203 (EXIT_EXEC) | ExecStart= binary not found / not executable |
| 204 (EXIT_MEMORY) | memory setup failed |
| 208 (EXIT_STDIN) | stdin setup failed |
| 209 (EXIT_STDOUT) | stdout setup failed |
| 210 (EXIT_CHROOT) | RootDirectory= failure |
| 216 (EXIT_GROUP) | Group= doesn’t exist or can’t be applied |
| 217 (EXIT_USER) | User= doesn’t exist or can’t be applied |
| 226 (EXIT_NAMESPACE) | namespace setup failed |
| 232 (EXIT_ADDRESS_FAMILIES) | RestrictAddressFamilies= could not be applied |
Signal-killed codes (`code=killed, signal=<NAME>`) are separate:
- SIGTERM (15) → systemd asked to stop
- SIGKILL (9) → cgroup OOM or `kill -9` (check `dmesg` for OOM banner)
- SIGSEGV (11) → app bug
- SIGABRT (6) → `assert()` failure
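When a bare number turns up in a log line or a monitoring alert, a small lookup saves a trip to the man page. The sketch below covers only a handful of codes, with names taken from the exit-code table in systemd.exec(5); the function name `describe_exit_status` and the wording of each description are illustrative.

```shell
#!/bin/sh
# describe_exit_status CODE
# Name a numeric exit status as systemd would label it in status=<N>/<NAME>.
# Only a subset is listed; see systemd.exec(5) for the full table.
describe_exit_status() {
    case $1 in
        0)   echo "0: clean exit" ;;
        200) echo "200/CHDIR: WorkingDirectory= could not be entered" ;;
        203) echo "203/EXEC: ExecStart= binary missing or not executable" ;;
        204) echo "204/MEMORY: memory setup failed" ;;
        208) echo "208/STDIN: stdin setup failed" ;;
        209) echo "209/STDOUT: stdout setup failed" ;;
        210) echo "210/CHROOT: RootDirectory= could not be applied" ;;
        216) echo "216/GROUP: Group= could not be resolved or applied" ;;
        217) echo "217/USER: User= could not be resolved or applied" ;;
        226) echo "226/NAMESPACE: namespace setup failed" ;;
        232) echo "232/ADDRESS_FAMILIES: RestrictAddressFamilies= could not be applied" ;;
        20[0-9]|21[0-9]|22[0-9]|23[0-9]|24[0-5])
             echo "$1: in the range systemd uses for setup failures, see systemd.exec(5)" ;;
        *)   echo "$1: set by the application itself" ;;
    esac
}

# Usage: describe_exit_status 217
```

Codes outside the 200 to 245 range fall through to the last branch, because their meaning is defined by the application rather than by systemd.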
Common findings this catches
- Exit code 217 → `User=appuser` set but the user doesn’t exist on this host (forgot to add in your deploy).
- `(code=killed, signal=KILL)` with no OOM banner → external `kill -9`; check who/what (audit logs).
- Service flapping with “start-limit-hit” → `Restart=always` + crash; counter exceeded `StartLimitBurst`. Fix the app, then `reset-failed`.
- `(code=exited, status=203/EXEC)` → `ExecStart=` path doesn’t exist or isn’t executable. Common after a package downgrade.
- Unit state doesn’t match the app (reported “active” with nothing running, or stopped right after start) → `Type=` mismatch: a daemon that forks into the background needs `Type=forking`, while a foreground process needs `Type=simple`.
- Watchdog timeouts → `WatchdogSec=` set but the app doesn’t call `sd_notify()` with `WATCHDOG=1` in time.
- `ProtectSystem=strict` + permission denied → app needs to write somewhere outside its allowed paths; add `ReadWritePaths=/var/lib/myapp`.
- Drop-in override file ignored → wrong filename extension (`override.conf` is correct; `override.cnf` is silently ignored).
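Several of these findings can be read straight from unit properties rather than from the human-oriented status text. The following sketch reads `KEY=VALUE` lines as printed by `systemctl show` and prints one hint per pattern it recognises. The property names `Result`, `ExecMainStatus`, and `NRestarts` are real service properties; the function name and the hint wording are assumptions made for illustration.

```shell
#!/bin/sh
# triage_unit_properties
# Read KEY=VALUE lines (as printed by `systemctl show`) on stdin and print
# a short hint for each recognised failure pattern. (Hypothetical helper.)
triage_unit_properties() {
    result=""
    status=""
    restarts=0
    while IFS='=' read -r key value; do
        case $key in
            Result)         result=$value ;;
            ExecMainStatus) status=$value ;;
            NRestarts)      restarts=$value ;;
        esac
    done
    case $result in
        start-limit-hit) echo "start limit hit: fix the crash, then run systemctl reset-failed" ;;
        oom-kill)        echo "killed by the OOM killer: check memory limits and dmesg" ;;
        watchdog)        echo "watchdog timeout: the service missed its WATCHDOG=1 keep-alive" ;;
        timeout)         echo "start or stop timed out: check TimeoutStartSec= and TimeoutStopSec=" ;;
        signal)          echo "main process was killed by a signal: check the journal for which one" ;;
        exit-code)       echo "main process exited with status $status" ;;
    esac
    # NRestarts counts automatic restarts triggered by Restart=
    [ "$restarts" -gt 0 ] 2>/dev/null && echo "restarted $restarts times: possible flapping"
    return 0
}

# Usage:
#   systemctl show -p Result,ExecMainStatus,NRestarts myapp.service | triage_unit_properties
```

The hints are deliberately coarse; they point at which finding in the list above to check first rather than replacing the journal.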
Safe override pattern
sudo systemctl edit myapp.service
Add only the lines you want to override or extend (note the [Service] header):
[Service]
# Empty ExecStart= clears the inherited value before adding the new one
ExecStart=
ExecStart=/usr/local/bin/myapp --new-flag
Environment=DEBUG=true
TimeoutStartSec=120
Then:
sudo systemctl daemon-reload
sudo systemctl restart myapp.service
sudo systemctl status myapp.service
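`systemctl edit` opens an editor, which does not suit scripted deploys. A non-interactive equivalent is to write the drop-in file yourself and then reload. The sketch below assumes the conventional `override.conf` filename and takes the override text on stdin; the function name and the optional root-directory argument (there so it can be tried in a scratch directory) are illustrative.

```shell
#!/bin/sh
# write_dropin UNIT [ROOT]
# Write the override read from stdin to ROOT/UNIT.d/override.conf and print
# the resulting path. ROOT defaults to /etc/systemd/system.
# (Hypothetical helper; the ROOT argument exists only to allow dry runs.)
write_dropin() {
    unit=$1
    root=${2:-/etc/systemd/system}
    dir="$root/$unit.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/override.conf" || return 1
    echo "$dir/override.conf"
}

# Usage (as root), followed by the reload and restart steps:
#   write_dropin myapp.service <<'EOF'
#   [Service]
#   ExecStart=
#   ExecStart=/usr/local/bin/myapp --new-flag
#   EOF
#   systemctl daemon-reload && systemctl restart myapp.service
```

Unlike `systemctl edit`, writing the file by hand does not reload the manager, so the `daemon-reload` step is required before the restart.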
When to escalate
- Boot stuck in `emergency.target` with no obvious failed unit: engage console access, do not reboot blindly.
- Failed `*.mount` unit on a critical filesystem: coordinate with storage; do not edit `/etc/fstab` over a hung session.
- A unit failure that correlates with a kernel taint in `dmesg`: likely driver/hardware; pull in platform team.
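For the kernel-taint case, the current taint state is also exposed as a bitmask in `/proc/sys/kernel/tainted`, which is easier to hand to a platform team than a scrolled-off `dmesg` line. The sketch below decodes only a few of the documented bits; the function name `taint_flags` and the choice of which bits to report are illustrative.

```shell
#!/bin/sh
# taint_flags VALUE
# Decode a few bits of the kernel taint bitmask (the full list is in the
# kernel's tainted-kernels documentation). (Hypothetical helper name.)
taint_flags() {
    v=$1
    [ "$v" -eq 0 ] && { echo "not tainted"; return 0; }
    [ $(( v & 1 ))    -ne 0 ] && echo "P: proprietary module loaded"
    [ $(( v & 16 ))   -ne 0 ] && echo "M: machine check exception reported"
    [ $(( v & 128 ))  -ne 0 ] && echo "D: kernel oops or BUG occurred"
    [ $(( v & 512 ))  -ne 0 ] && echo "W: kernel warning issued"
    [ $(( v & 4096 )) -ne 0 ] && echo "O: out-of-tree module loaded"
    return 0
}

# Usage: taint_flags "$(cat /proc/sys/kernel/tainted)"
```

A non-zero value with none of these five bits set prints nothing, which means the taint comes from a flag this sketch does not cover.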