Debugging systemd Services That Won't Start (With AI Help)
A failed systemd unit, the commands that actually tell you why, and how to use AI to read the noise so you fix the right thing the first time.
A service won’t start. systemctl status shows a red dot, the app is down, and somebody is asking in Slack why the deploy “didn’t work.” After 25 years of this, I can tell you the cause is almost always one of five things — and the trick is reading the right output in the right order instead of guessing.
Here’s the workflow I actually use, plus where AI saves real time.
Start with status, but don’t stop there
systemctl status myapp is the first look, not the answer:
systemctl status myapp.service
You’re scanning for three things: the Active line (failed, activating, inactive?), the Main PID exit code, and the last few log lines systemd inlines. An exit code like status=203/EXEC already tells you the binary path is wrong or non-executable before you read another line.
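To make that exit-code scan concrete, here is a minimal sketch that maps a few of systemd's own exit statuses to the directive worth checking first. The function name explain_exit_status is hypothetical, and the table covers only a handful of the codes listed in systemd.exec(5), so treat it as a starting point rather than a complete decoder.

```shell
# Hypothetical helper: translate a systemd-generated exit status
# (the number in "status=203/EXEC") into the first thing to check.
# Only a few codes are covered; systemd.exec(5) lists the rest.
explain_exit_status() {
    case "$1" in
        200) echo "CHDIR: WorkingDirectory= is missing or not accessible" ;;
        203) echo "EXEC: ExecStart= binary is missing or not executable" ;;
        216) echo "GROUP: Group= could not be resolved or applied" ;;
        217) echo "USER: User= could not be resolved or applied" ;;
        *)   echo "not in this table: read the journal for the unit" ;;
    esac
}

# Example: decode the code from the Main PID line of systemctl status.
explain_exit_status 203
```

If you would rather not read the number off the status output, systemctl show -p ExecMainStatus --value myapp.service should print it on its own.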
But status only shows a handful of log lines. The real story is in the journal.
Read the full journal for the unit
journalctl -u myapp.service -n 100 --no-pager
Add -b to scope it to the current boot, or --since "10 min ago" to get just this restart attempt. If the service is flapping — start, fail, restart, repeat — use -f and trigger a systemctl restart in another pane to watch a clean cycle.
This is the moment AI earns its keep. Service logs at startup are noisy: stack traces, library warnings, deprecation spam. Paste the journal block into a model and ask:
“This systemd service fails on start. Here are the last 100 journal lines. What is the actual failure, ignoring warnings, and what’s the most likely root cause?”
The model is very good at separating the one fatal line from forty cosmetic ones. I keep a few of these in my Linux prompts so I’m not retyping them.
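If the journal block is too long to paste, a rough pre-filter can trim it first. This is only a sketch: strip_noise is a hypothetical name and the pattern is an assumption about what your app's cosmetic lines look like. Keep the unfiltered journal to hand, because a line labelled as a warning is sometimes the real cause.

```shell
# Hypothetical pre-filter: drop lines that look like warnings or
# deprecation notices so the fatal line is easier to spot.
# The pattern is an assumption; tune it to your app's log format.
strip_noise() {
    grep -viE 'deprecat|warn'
}

# Usage (reads journal text on stdin):
#   journalctl -u myapp.service -n 100 --no-pager | strip_noise
```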
The five usual suspects
After thousands of these, the cause is nearly always one of:
1. Wrong ExecStart path or permissions
status=203/EXEC means systemd could not execute the ExecStart binary; status=200/CHDIR means it could not change into WorkingDirectory. Check the unit:
systemctl cat myapp.service
Verify the binary exists, is executable, and the WorkingDirectory is real.
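Those checks can be scripted in a few lines. The sketch below assumes you copy the two paths out of the unit yourself; check_exec_path is a hypothetical name, and /var/lib/myapp in the usage comment is a placeholder. It tests as whichever user runs it, so run it as the service user for a meaningful answer.

```shell
# Hypothetical helper: confirm the ExecStart binary and the
# WorkingDirectory from the unit actually exist on disk.
check_exec_path() {
    bin="$1"
    workdir="$2"
    rc=0
    if [ ! -x "$bin" ]; then
        echo "not executable or missing: $bin"
        rc=1
    fi
    if [ ! -d "$workdir" ]; then
        echo "missing directory: $workdir"
        rc=1
    fi
    return "$rc"
}

# Usage, with the paths copied from `systemctl cat myapp.service`:
#   check_exec_path /usr/local/bin/myapp /var/lib/myapp
```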
2. Missing environment or config
The app starts then immediately exits non-zero. Check EnvironmentFile= points at a file that exists and is readable by the service user.
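One way to test the "readable by the service user" part is to attempt the read as that user. This is a sketch that assumes sudo is available; readable_by is a hypothetical name and the file path in the usage comment is a placeholder for whatever EnvironmentFile= points at.

```shell
# Hypothetical helper: can the service user read this file?
# Uses `sudo -u`, so it needs sudo rights on the host.
readable_by() {
    svc_user="$1"
    file="$2"
    if sudo -u "$svc_user" test -r "$file"; then
        echo "ok: $svc_user can read $file"
    else
        echo "FAIL: $svc_user cannot read $file"
        return 1
    fi
}

# Usage; the path is an assumption, use the one from EnvironmentFile=:
#   readable_by myapp /etc/myapp/myapp.env
```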
3. Permissions on the runtime user
User=myapp can’t read its config, write its PID file, or bind below port 1024. AmbientCapabilities=CAP_NET_BIND_SERVICE fixes the port case cleanly.
4. Dependency ordering
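The service starts before the database or network is ready. For the network case, After=network-online.target plus Wants=network-online.target is the usual fix; After=network.target alone is not enough, because it doesn’t wait for an actual route. If the dependency is another unit on the same host, such as a local database, add an After= (and usually a Wants=) for that unit as well.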
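A drop-in keeps the ordering fix separate from the packaged unit file. The sketch below writes one; write_network_dropin and the file name 10-network-online.conf are hypothetical, and it assumes the usual drop-in directory layout under /etc/systemd/system.

```shell
# Hypothetical helper: write a drop-in that orders the unit after
# network-online.target. Pass the unit's drop-in directory, normally
# /etc/systemd/system/<unit>.d (needs root to write there).
write_network_dropin() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir" || return 1
    cat >"$dropin_dir/10-network-online.conf" <<'EOF'
[Unit]
After=network-online.target
Wants=network-online.target
EOF
}

# Usage, followed by a reload so systemd picks up the new file:
#   write_network_dropin /etc/systemd/system/myapp.service.d
#   systemctl daemon-reload
```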
5. The unit file isn’t loaded
You edited the file and forgot:
systemctl daemon-reload
If your change “did nothing,” this is why nine times out of ten.
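systemd can tell you whether a reload is pending, which beats guessing. A minimal sketch, assuming a systemctl recent enough to support --value; needs_reload is a hypothetical name.

```shell
# Hypothetical helper: ask systemd whether the unit file on disk has
# changed since it was last loaded.
needs_reload() {
    [ "$(systemctl show -p NeedDaemonReload --value "$1")" = "yes" ]
}

# Usage:
#   needs_reload myapp.service && echo "run: systemctl daemon-reload"
```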
Validate before you restart
Before you restart for the fifth time, sanity-check the unit:
systemd-analyze verify myapp.service
It catches unknown or misspelled directives, missing executables, and ordering problems that you’d otherwise discover by trial and error.
Run the ExecStart by hand
When the journal is ambiguous, cut systemd out of the loop. Grab the exact ExecStart line and run it as the service user:
sudo -u myapp /usr/local/bin/myapp --config /etc/myapp/config.yaml
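Now you see the app’s real stdout/stderr with nothing swallowed. This single step resolves more “mysterious” startup failures than anything else, because it removes the question of whether systemd or the app is at fault. One caveat: a manual run does not apply the unit’s Environment=, EnvironmentFile=, or sandboxing directives, so a command that works by hand but fails under systemd points back at the unit file.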
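When you want to keep that output for later, for example to paste into a prompt, a small wrapper can save it together with the exit code. This is a sketch; run_capture and the log path in the usage comment are hypothetical.

```shell
# Hypothetical helper: run a command, save stdout and stderr to a
# file, append the exit code, and return that same code.
run_capture() {
    log_file="$1"
    shift
    "$@" >"$log_file" 2>&1
    status=$?
    printf 'exit=%s\n' "$status" >>"$log_file"
    return "$status"
}

# Usage with the ExecStart line from the unit:
#   run_capture /tmp/myapp-manual.log \
#       sudo -u myapp /usr/local/bin/myapp --config /etc/myapp/config.yaml
```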
A reusable triage prompt
When I hand a failure to AI, I give it everything at once and constrain the output:
“Here is systemctl cat, the last 80 journalctl -u lines, and the output of running ExecStart manually. Tell me: (1) the single fatal error, (2) the most likely fix, (3) the exact commands to verify, read-only only. Don’t suggest restarting until I’ve confirmed the cause.”
That last constraint matters. Left alone, models love to suggest systemctl restart as step one — the same anti-pattern I warn about for incident triage. Confirm first, restart once.
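Collecting the first two inputs for that prompt is easy to script. The sketch below only runs read-only commands; build_triage_bundle is a hypothetical name, and the manual ExecStart output still has to be appended by hand.

```shell
# Hypothetical helper: print the unit definition and the last 80
# journal lines under labelled headers, ready to paste into a prompt.
build_triage_bundle() {
    unit="$1"
    echo "=== systemctl cat $unit ==="
    systemctl cat "$unit" 2>&1
    echo "=== journalctl -u $unit (last 80 lines) ==="
    journalctl -u "$unit" -n 80 --no-pager 2>&1
}

# Usage:
#   build_triage_bundle myapp.service > /tmp/myapp-triage.txt
```

Review the bundle before sharing it, since unit files and logs can contain secrets.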
Don’t let it edit unit files blindly
One caution: AI will happily rewrite your whole unit file. Don’t paste that back without reading it. Models frequently “helpfully” add Restart=always to a oneshot, or drop a Type= directive that changes startup semantics. Treat its unit edits as a draft you review line by line.
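A diff makes that review faster. The sketch below surfaces only changes to a few directives that alter startup behaviour; risky_unit_changes is a hypothetical name and the directive list is an assumption, so read the full diff as well.

```shell
# Hypothetical helper: compare the current unit file with a proposed
# one and show only added or removed lines for selected directives.
risky_unit_changes() {
    diff -u "$1" "$2" | grep -E '^[+-](Type|Restart|User|ExecStart)='
}

# Usage; proposed.service is wherever you saved the model's draft:
#   risky_unit_changes /etc/systemd/system/myapp.service proposed.service
```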
The fix that prevents the next one
Once it’s running, add a guardrail so the next failure is louder and self-healing:
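# Start-rate limits belong in [Unit]; under [Service] systemd
# ignores them with an "Unknown key" warning.
[Unit]
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
Restart=on-failure
RestartSec=5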
That restarts on crashes but gives up after three failures in a minute instead of hammering a broken service forever — which is what fills your disk with logs and hides the real error.
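Once the limit trips, systemd records it in the unit's Result property, so you can tell "gave up" apart from an ordinary failure. A minimal sketch, assuming a systemctl that supports --value; hit_start_limit is a hypothetical name.

```shell
# Hypothetical helper: did the unit stop restarting because it hit
# its start limit?
hit_start_limit() {
    [ "$(systemctl show -p Result --value "$1")" = "start-limit-hit" ]
}

# Usage; reset-failed clears the counter after you have fixed the cause:
#   hit_start_limit myapp.service && echo "start limit hit"
#   systemctl reset-failed myapp.service
```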
The takeaway
systemd failures feel opaque but they’re shallow: status for the headline, journal for the story, manual ExecStart for the truth. Let AI compress the log-reading and propose the fix, but keep the human on daemon-reload and restart. Read the command before you run it, confirm the cause before you change state, and you’ll close these in minutes instead of a frustrated half hour.
AI suggestions are assistive, not authoritative. Verify every command against your own system before running it.