
systemd Unit Failure Debugging Prompt

Diagnose systemd unit failures — dependency cycles, mount/target failures, exit codes, journalctl filtering, drop-in overrides, and silent service flapping.

Target user: Linux sysadmins and SREs
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior Linux sysadmin who has debugged hundreds of systemd unit failures across Ubuntu, RHEL, and Debian. You can read dependency graphs, decode exit codes, and spot the "drop-in override changed everything" trap.

I will provide:
- The failing unit name and what it's supposed to do
- `systemctl status <unit>` output
- `journalctl -u <unit> --no-pager -n 100` output
- `systemctl cat <unit>` (full effective unit file including drop-ins)
- Whether the failure is at boot or runtime; first occurrence or repeat
- Distro + systemd version

Your job:

1. **Read the status carefully**:
   - State (`active`, `inactive`, `failed`, `activating`, `deactivating`) + sub-state
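   - Result reason (`exit-code`, `signal`, `core-dump`, `timeout`, `watchdog`, `protocol`, `oom-kill`, `start-limit-hit`, `resources`; a start job can also fail with result `dependency`)
   - Exit code (0 = clean; 1-199 = normally set by the application; 200-245 = set by systemd when process setup fails, e.g. `217/USER`; signal kills appear as `code=killed, signal=<NAME>`, and as 128+N only when a shell wrapper reports them)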
   - Active time / inactive time → flapping vs first failure
2. **Walk the dependency chain** from `systemctl list-dependencies <unit>` and `systemctl list-dependencies --reverse <unit>`:
   - Did a `Requires=` / `BindsTo=` unit fail or stop? (`After=` only orders startup; it does not pull the other unit in)
   - Was a network/mount target not reached?
   - Is there an ordering cycle? (`systemd-analyze verify`)
3. **Decode the journal output**:
   - Exit code mapping (`status=<number>/<NAME>`, e.g. `status=203/EXEC`)
   - Common signal kills: SIGKILL (9) = OOM or `kill -9`; SIGTERM (15) = stopped/restarted; SIGSEGV (11) = app crash
   - `(code=killed, signal=KILL)` with the OOM-killer banner upstream means cgroup OOM
   - `Watchdog timeout` = the service didn't send `WATCHDOG=1` via `sd_notify` within `WatchdogSec=`
4. **Check effective config including drop-ins**:
   - `systemctl cat` shows ALL fragments (base + `/etc/systemd/system/*.d/*.conf` overrides)
   - Override-file misnamings (e.g., `override.cnf` instead of `.conf`) are silently ignored
   - `Environment=` order matters; later wins
5. **Common root causes to check**:
   - `ExecStart=` binary path wrong, or `User=` doesn't exist
   - `WorkingDirectory=` doesn't exist
   - `ReadOnlyPaths=` blocks a required write path
   - `ProtectSystem=strict` + app writes to `/etc` → "Read-only file system" (EROFS) or permission errors, often surfacing only as a generic exit code
   - `RestartSec=` too low + `StartLimitBurst=` exceeded → stuck in "start-limit-hit"
   - Missing `Wants=network-online.target` + `After=network-online.target` for a net-dependent service that crashes early
   - Hardware/mount dependency: `.mount` unit failure cascading
6. **For boot-time failures**: `systemd-analyze blame`, `systemd-analyze critical-chain`, and check if `emergency.target` or `rescue.target` is reachable.
7. **Suggest the recovery path**:
   - Reset start-limit state (`systemctl reset-failed <unit>`)
   - Reload after unit edits (`systemctl daemon-reload`)
   - Override safely with `systemctl edit <unit>` (creates a drop-in, never edit packaged unit files)

Mark anything DESTRUCTIVE clearly (mask, force-stop while dependents run, daemon-reload during active deployment).

---

Unit name: [e.g., myapp.service / mnt-data.mount / postgresql.service]
Failure context: [boot / runtime / repeat / first-time]
Distro + systemd version: [e.g., Ubuntu 22.04, systemd 249]
`systemctl status <unit>` (with -l):
```
[PASTE]
```
`journalctl -u <unit> -n 100 --no-pager`:
```
[PASTE]
```
`systemctl cat <unit>` (effective config):
```
[PASTE]
```
Any related units that also failed:
```
[PASTE]
```

Why this prompt works

systemd failures hide the actual error behind multiple layers: the unit state, the journal output, the dependency graph, and the effective drop-in configuration. Exit status 217 means “user does not exist”, but systemctl status only shows the terse status=217/USER and never spells that out. This prompt forces the model to decode each layer.
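
As a rough illustration of reading the first layer, the sketch below condenses `systemctl show` output into one line. The property names (ActiveState, SubState, Result, ExecMainStatus, NRestarts) are real systemd properties; the function name `summarize_unit_state` is hypothetical and only meant as a starting point.

```bash
# Sketch: condense `systemctl show` KEY=VALUE output into one diagnostic line.
# summarize_unit_state is a hypothetical helper name, not a systemd tool.
summarize_unit_state() {
    # Reads KEY=VALUE lines on stdin, for example from:
    #   systemctl show myapp.service -p ActiveState -p SubState \
    #       -p Result -p ExecMainStatus -p NRestarts
    awk -F= '
        { v[$1] = substr($0, length($1) + 2) }   # keep values that contain "="
        END {
            printf "state=%s/%s result=%s exit_status=%s restarts=%s\n",
                v["ActiveState"], v["SubState"], v["Result"],
                v["ExecMainStatus"], v["NRestarts"]
        }'
}
```

A property that is missing from the input simply prints as an empty field, so the line stays readable on older systemd versions.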

How to use it

  1. Always paste systemctl cat <unit> — not the original unit file. Drop-ins in /etc/systemd/system/<unit>.d/ can flip critical behavior and the base file alone misleads.
  2. Paste at least 100 lines of journalctl. The first error is usually 20+ lines before the visible failure.
  3. Mention the symptom timing: at boot? after a deploy? randomly every 4 hours? Time pattern is diagnostic.
  4. If the unit has dependents, include their status too. Sometimes the “failing” unit is just the visible one in a chain.
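
The steps above can be bundled into one paste-ready report. This is a minimal sketch: `collect_unit_report` is a hypothetical helper name, and it runs only the commands already listed in this article plus `systemctl list-units --state=failed`. The distro line still has to be added by hand.

```bash
# Sketch: gather the inputs the prompt asks for into one report.
# collect_unit_report is a hypothetical helper name.
collect_unit_report() {
    unit=$1
    echo "Unit name: $unit"
    echo "systemd version: $(systemctl --version | head -n 1)"
    echo "--- systemctl status $unit -l"
    # status exits non-zero for failed units; do not let that abort the report
    systemctl status "$unit" -l --no-pager 2>&1 || true
    echo "--- journalctl -u $unit -n 100 --no-pager"
    journalctl -u "$unit" -n 100 --no-pager 2>&1 || true
    echo "--- systemctl cat $unit"
    systemctl cat "$unit" --no-pager 2>&1 || true
    echo "--- other failed units"
    systemctl list-units --state=failed --no-pager 2>&1 || true
}
# Usage: collect_unit_report myapp.service > report.txt
```

Review the report before pasting it anywhere: `Environment=` lines and journal output can contain credentials.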

Useful commands

# Full picture of one unit
systemctl status <unit> -l
systemctl cat <unit>
systemctl show <unit> | less
journalctl -u <unit> -n 100 --no-pager
journalctl -u <unit> --since "1 hour ago" --no-pager
journalctl -u <unit> -p err --no-pager     # errors only

# Dependency analysis
systemctl list-dependencies <unit>
systemctl list-dependencies --reverse <unit>
systemd-analyze verify <unit>
systemd-analyze dot <unit> | dot -Tsvg > deps.svg   # graphviz install required

# Boot analysis
systemd-analyze
systemd-analyze blame | head -30
systemd-analyze critical-chain
systemd-analyze plot > boot.svg

# Edit safely
sudo systemctl edit <unit>          # creates an override.conf drop-in and reloads systemd on save
sudo systemctl daemon-reload        # MANDATORY after editing unit files by hand
sudo systemctl restart <unit>

# Replace the whole unit (interactive)
sudo systemctl edit --full <unit>   # edits a full copy in /etc/systemd/system that shadows the packaged unit

# Reset state
sudo systemctl reset-failed <unit>
sudo systemctl reset-failed         # everything

# Find ALL drop-ins for a unit
ls -la /etc/systemd/system/<unit>.d/
ls -la /run/systemd/system/<unit>.d/
ls -la /usr/lib/systemd/system/<unit>.d/

# Check a unit file for errors without restarting
systemd-analyze verify /etc/systemd/system/<unit>
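
The drop-in listing can also flag files that systemd will skip because their names do not end in `.conf`. The sketch below assumes the three directories shown above; `check_dropins` and its optional root prefix (handy for trying it against a scratch directory) are hypothetical, and type-level drop-ins such as `service.d/` are not covered.

```bash
# Sketch: list drop-in files for a unit and flag names systemd ignores.
# check_dropins and the optional root prefix are hypothetical.
check_dropins() {
    unit=$1
    root=${2:-}
    for base in /etc/systemd/system /run/systemd/system /usr/lib/systemd/system; do
        dir="$root$base/$unit.d"
        [ -d "$dir" ] || continue
        for f in "$dir"/*; do
            [ -e "$f" ] || continue      # empty directory: the glob did not match
            case "$f" in
                *.conf) echo "OK      $f" ;;
                *)      echo "IGNORED $f (name does not end in .conf)" ;;
            esac
        done
    done
}
# Usage: check_dropins myapp.service
```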

Common exit codes to recognize

| Exit code | systemd meaning |
| --- | --- |
| 0 | Clean exit |
| 1 | Generic failure (set by the application) |
| 200–245 | Reserved by systemd: setting up the execution environment failed before the app’s own code ran |
| 200 (EXIT_CHDIR) | WorkingDirectory= doesn’t exist or can’t be entered |
| 203 (EXIT_EXEC) | ExecStart= binary not found / not executable |
| 204 (EXIT_MEMORY) | memory setup failed (memory shortage) |
| 208 (EXIT_STDIN) | stdin redirect failed |
| 209 (EXIT_STDOUT) | stdout redirect failed |
| 210 (EXIT_CHROOT) | RootDirectory= failure |
| 216 (EXIT_GROUP) | Group= doesn’t exist |
| 217 (EXIT_USER) | User= doesn’t exist |
| 226 (EXIT_NAMESPACE) | namespace setup failed |
| 232 (EXIT_ADDRESS_FAMILIES) | RestrictAddressFamilies= blocked |
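
For scripts that post-process `ExecMainStatus`, a small lookup can translate the number. This sketch covers only a subset of the exit-code table in systemd.exec(5), with the numbers as documented there; `decode_systemd_exit` is a hypothetical helper name.

```bash
# Sketch: map a numeric exit status to systemd's symbolic name.
# Covers a subset of the systemd.exec(5) table; decode_systemd_exit
# is a hypothetical helper name.
decode_systemd_exit() {
    case "$1" in
        0)   echo "success" ;;
        200) echo "EXIT_CHDIR" ;;
        203) echo "EXIT_EXEC" ;;
        204) echo "EXIT_MEMORY" ;;
        208) echo "EXIT_STDIN" ;;
        209) echo "EXIT_STDOUT" ;;
        210) echo "EXIT_CHROOT" ;;
        216) echo "EXIT_GROUP" ;;
        217) echo "EXIT_USER" ;;
        226) echo "EXIT_NAMESPACE" ;;
        232) echo "EXIT_ADDRESS_FAMILIES" ;;
        *)   echo "application-defined or not in this subset" ;;
    esac
}
# Usage: decode_systemd_exit "$(systemctl show myapp.service -p ExecMainStatus --value)"
```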

Signal-killed codes (code=killed, signal=<NAME>) are separate:

  • SIGTERM (15) → systemd asked to stop
  • SIGKILL (9) → cgroup OOM or kill -9 (check dmesg for OOM banner)
  • SIGSEGV (11) → app bug
  • SIGABRT (6) → assert() failure
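
To separate exit statuses from signal kills mechanically, the termination details can be pulled out of `systemctl status` text. This is a sketch that recognises only the `(code=..., status=...)` and `(code=..., signal=...)` forms quoted in this article; `parse_termination` and `signal_hint` are hypothetical helper names, and the hints simply restate the list above.

```bash
# Sketch: extract termination details from `systemctl status` output on stdin.
# Prints e.g. "exited 203/EXEC" or "killed KILL". Hypothetical helper.
parse_termination() {
    sed -n \
        -e 's/.*(code=\([a-z]*\), status=\([^)]*\)).*/\1 \2/p' \
        -e 's/.*(code=\([a-z]*\), signal=\([A-Z0-9+]*\)).*/\1 \2/p'
}

# Sketch: turn a signal name into the hint from the list above.
signal_hint() {
    case "$1" in
        TERM) echo "systemd asked the service to stop" ;;
        KILL) echo "cgroup OOM or kill -9; check dmesg for an OOM banner" ;;
        SEGV) echo "application bug (crash)" ;;
        ABRT) echo "assert() failure" ;;
        *)    echo "no hint in this sketch" ;;
    esac
}
# Usage: systemctl status myapp.service --no-pager | parse_termination
```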

Common findings this catches

  • Exit code 217 (status=217/USER) → User=appuser is set but the user doesn’t exist on this host (the deploy never created it).
  • (code=killed, signal=KILL) with no OOM banner → an external kill -9, or systemd escalating after a stop timeout; check who or what sent it (audit logs).
  • Service flapping, then “start-limit-hit” → Restart=always plus a crashing app; the restart counter exceeded StartLimitBurst=. Fix the app, then run systemctl reset-failed.
  • (code=exited, status=203/EXEC) → the ExecStart= path doesn’t exist or isn’t executable. Common after a package downgrade.
  • Unit stops right after a clean start, or reports “active” while the app is gone → Type= mismatch, e.g. an app that forks into the background running under Type=simple instead of Type=forking.
  • Watchdog timeouts → WatchdogSec= is set but the app doesn’t call sd_notify() with WATCHDOG=1.
  • ProtectSystem=strict + “Read-only file system” or permission errors → the app needs to write outside its allowed paths; add ReadWritePaths=/var/lib/myapp.
  • Drop-in override file ignored → wrong filename extension (override.conf is correct; override.cnf is silently ignored).
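
The ProtectSystem= finding can be pre-screened from `systemctl cat` output. The sketch below is a heuristic only: it treats ReadWritePaths= and the StateDirectory=, RuntimeDirectory=, CacheDirectory=, and LogsDirectory= directives as evidence of a writable path, and `check_write_paths` is a hypothetical helper name.

```bash
# Sketch: flag ProtectSystem=strict with no writable path declared.
# Reads `systemctl cat <unit>` text on stdin. Heuristic, hypothetical helper.
check_write_paths() {
    awk -F= '
        /^ProtectSystem=/ { protect = $2 }     # a later assignment wins
        /^(ReadWritePaths|StateDirectory|RuntimeDirectory|CacheDirectory|LogsDirectory)=./ { writable = 1 }
        END {
            if (protect == "strict" && !writable)
                print "WARN: ProtectSystem=strict with no writable path declared"
            else
                print "OK: no strict-without-writable-path pattern found"
        }'
}
# Usage: systemctl cat myapp.service --no-pager | check_write_paths
```

A warning here does not prove the app writes anywhere; it only marks the unit as worth a closer look.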

Safe override pattern

sudo systemctl edit myapp.service

Add only the lines you want to override or extend (note the [Service] header):

[Service]
# Empty ExecStart= clears the inherited value before adding the new one
ExecStart=
ExecStart=/usr/local/bin/myapp --new-flag

Environment=DEBUG=true
TimeoutStartSec=120

Then:

sudo systemctl daemon-reload        # already done by systemctl edit; required after manual file edits
sudo systemctl restart myapp.service
sudo systemctl status myapp.service
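
Where an interactive editor is not an option (a deploy script, for example), the same drop-in can be written as a file. This is a sketch under assumptions: `write_dropin` and its optional root prefix are hypothetical, and the directives mirror the example above. Unlike `systemctl edit`, a hand-written file needs an explicit `daemon-reload`.

```bash
# Sketch: create the drop-in without an interactive editor.
# write_dropin and the optional root prefix are hypothetical.
write_dropin() {
    unit=$1
    root=${2:-}
    dir="$root/etc/systemd/system/$unit.d"
    mkdir -p "$dir"
    # The file name must end in .conf or systemd ignores it
    cat > "$dir/override.conf" <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/local/bin/myapp --new-flag
TimeoutStartSec=120
EOF
    echo "$dir/override.conf"
}
# On a real host, as root:
#   write_dropin myapp.service && systemctl daemon-reload && systemctl restart myapp.service
```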

When to escalate

  • Boot stuck in emergency.target with no obvious failed unit — engage console access, do not reboot blindly.
  • Failed *.mount unit on a critical filesystem — coordinate with storage; do not edit /etc/fstab over a hung session.
  • A unit failure that correlates with a kernel taint in dmesg — likely driver/hardware; pull in platform team.
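
Before pulling in the platform team, it can help to confirm the taint state directly. The sketch below reads `/proc/sys/kernel/tainted` and decodes a handful of bits as listed in the kernel's tainted-kernels documentation; `check_taint` is a hypothetical helper, and the optional file argument exists only so it can be tried against a sample value.

```bash
# Sketch: report whether the kernel is tainted and decode a few bits.
# check_taint is a hypothetical helper; only four bits are decoded here.
check_taint() {
    file=${1:-/proc/sys/kernel/tainted}
    value=$(cat "$file")
    if [ "$value" -eq 0 ]; then
        echo "kernel not tainted"
        return 0
    fi
    echo "kernel tainted (value=$value)"
    [ $(( value & 16 ))    -ne 0 ] && echo "  bit 4:  machine check exception"
    [ $(( value & 128 ))   -ne 0 ] && echo "  bit 7:  kernel oops or BUG"
    [ $(( value & 512 ))   -ne 0 ] && echo "  bit 9:  kernel warning issued"
    [ $(( value & 16384 )) -ne 0 ] && echo "  bit 14: soft lockup"
    return 0
}
# Usage: check_taint
```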
