Debugging systemd Services That Won't Start (With AI Help)

A service won’t start. systemctl status shows a red dot, the app is down, and somebody is asking in Slack why the deploy “didn’t work.” After 25 years of this, I can tell you the cause is almost always one of five things — and the trick is reading the right output in the right order instead of guessing.

Here’s the workflow I actually use, plus where AI saves real time.

Start with status, but don’t stop there

systemctl status myapp is the first look, not the answer:

systemctl status myapp.service

You’re scanning for three things: the Active line (failed, activating, inactive?), the Main PID exit code, and the last few log lines systemd inlines. An exit code like status=203/EXEC already tells you the binary path is wrong or non-executable before you read another line.
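
If you'd rather pull those fields out directly than eyeball them, here is a minimal sketch. systemctl show and the property names are real; the function names explain_exec_status and unit_headline are my own, and the hint text is my paraphrase of the exit-code table in systemd.exec(5), not official wording.

```shell
# Map a few systemd-specific exit statuses to a hint.
# Codes come from systemd.exec(5); the hint wording is an illustration.
explain_exec_status() {
  case "$1" in
    200) echo "CHDIR: WorkingDirectory= could not be entered" ;;
    203) echo "EXEC: ExecStart= binary missing or not executable" ;;
    217) echo "USER: User= could not be resolved" ;;
    226) echo "NAMESPACE: sandbox or mount namespace setup failed" ;;
    *)   echo "no hint here: check systemd.exec(5) or the app's own exit codes" ;;
  esac
}

# Print only the headline fields for a unit, without the log tail.
unit_headline() {
  systemctl show -p ActiveState -p SubState -p Result -p ExecMainStatus "$1"
}
```

Run unit_headline myapp.service, then feed the ExecMainStatus number to explain_exec_status.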

But status only shows a handful of log lines. The real story is in the journal.

Read the full journal for the unit

journalctl -u myapp.service -n 100 --no-pager

Add -b to scope it to the current boot, or --since "10 min ago" to get just this restart attempt. If the service is flapping — start, fail, restart, repeat — use -f and trigger a systemctl restart in another pane to watch a clean cycle.

This is the moment AI earns its keep. Service logs at startup are noisy: stack traces, library warnings, deprecation spam. Paste the journal block into a model and ask:

“This systemd service fails on start. Here are the last 100 journal lines. What is the actual failure, ignoring warnings, and what’s the most likely root cause?”

The model is very good at separating the one fatal line from forty cosmetic ones. I keep a few of these in my Linux prompts so I’m not retyping them.
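
If the journal is too long to paste, a rough pre-filter can shrink it first. This is a sketch under an assumption: that your app's cosmetic lines contain the words "warning" or "deprecat". The function name drop_noise is mine. Check that the filter isn't eating the real error before you trust it.

```shell
# Drop lines that look like warnings or deprecation notices; keep the rest.
# The patterns are a guess at typical noise, so adjust them for your app.
drop_noise() {
  grep -viE 'deprecat|warning'
}

# Usage (assumes the unit is called myapp.service):
#   journalctl -u myapp.service -b -n 100 --no-pager | drop_noise
```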

The five usual suspects

After thousands of these, the cause is nearly always one of:

1. Wrong ExecStart path or permissions

Look for status=203/EXEC (systemd could not execute the binary) or status=200/CHDIR (it could not change into WorkingDirectory). Check the unit:

systemctl cat myapp.service

Verify the binary exists, is executable, and the WorkingDirectory is real.
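
Those three checks fit in a small helper. A minimal sketch: check_exec is a hypothetical name, and it tests as whoever runs it, so run it as the service user if permissions are the suspect.

```shell
# Usage: check_exec /path/to/binary /path/to/working/dir
# Prints the first problem found, or "ok" if all three checks pass.
check_exec() {
  [ -e "$1" ] || { echo "missing: $1"; return 1; }
  [ -x "$1" ] || { echo "not executable: $1"; return 1; }
  [ -d "$2" ] || { echo "no such WorkingDirectory: $2"; return 1; }
  echo "ok"
}
```

Feed it the ExecStart binary and the WorkingDirectory= value from systemctl cat.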

2. Missing environment or config

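The app starts, then immediately exits non-zero. Check that EnvironmentFile= points at a file that exists, and that any config file the app opens itself is readable by the service user. (For system units, systemd reads the EnvironmentFile= before dropping privileges, so a root-only file is fine there.)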
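
To list the environment files a unit references, you can pull them out of the systemctl cat output. A small sketch: env_files is my own name, and it assumes one path per EnvironmentFile= line.

```shell
# Read unit text on stdin and print each EnvironmentFile= path.
# A leading "-" (which marks the file as optional) is stripped.
env_files() {
  sed -n 's/^EnvironmentFile=-\{0,1\}//p'
}

# Usage (assumes the unit is called myapp.service):
#   systemctl cat myapp.service | env_files | while read -r f; do
#     [ -e "$f" ] || echo "missing: $f"
#   done
```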

3. Permissions on the runtime user

User=myapp can’t read its config, write its PID file, or bind to a port below 1024. AmbientCapabilities=CAP_NET_BIND_SERVICE fixes the port case cleanly.

4. Dependency ordering

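The service starts before the database or network is ready. For the network case, After=network-online.target plus Wants=network-online.target is the usual fix; After=network.target alone is not enough, because it doesn’t wait for an actual route. For a database on the same host, order against its unit the same way, with After= plus Wants= or Requires=.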
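
A quick way to confirm that both directives are really present, drop-ins included, is to grep the systemctl cat output. A minimal sketch; has_network_online is a hypothetical helper name.

```shell
# Read unit text on stdin; succeed only if network-online.target appears
# in both an After= line and a Wants= line.
has_network_online() {
  awk '
    /^After=.*network-online\.target/ { a = 1 }
    /^Wants=.*network-online\.target/ { w = 1 }
    END { exit !(a && w) }
  '
}

# Usage (assumes the unit is called myapp.service):
#   systemctl cat myapp.service | has_network_online || echo "ordering incomplete"
```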

5. The unit file isn’t loaded

You edited the file and forgot:

systemctl daemon-reload

If your change “did nothing,” this is why nine times out of ten.
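
systemd will tell you when it is in this state: the unit exposes a NeedDaemonReload property. A small sketch that turns it into a readable hint; reload_hint is my own name.

```shell
# Read "systemctl show -p NeedDaemonReload" output on stdin and print a hint.
reload_hint() {
  if grep -q '^NeedDaemonReload=yes$'; then
    echo "unit file changed on disk: run systemctl daemon-reload"
  else
    echo "loaded unit matches disk"
  fi
}

# Usage (assumes the unit is called myapp.service):
#   systemctl show -p NeedDaemonReload myapp.service | reload_hint
```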

Validate before you restart

Before you restart for the fifth time, sanity-check the unit:

systemd-analyze verify myapp.service

It catches typos and unknown directives, a missing ExecStart= or executable, and bad ordering that you’d otherwise discover by trial and error.

Run the ExecStart by hand

When the journal is ambiguous, cut systemd out of the loop. Grab the exact ExecStart line and run it as the service user:

sudo -u myapp /usr/local/bin/myapp --config /etc/myapp/config.yaml

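Now you see the app’s real stdout/stderr with nothing swallowed. This single step resolves more “mysterious” startup failures than anything else, because it separates the two suspects: if the command fails by hand, the app or its config is at fault; if it works by hand but fails under systemd, look at what the unit adds (Environment=, WorkingDirectory=, sandboxing options), none of which sudo -u reproduces.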
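
To grab the command without copy-paste mistakes, you can extract it from the unit text. A sketch under stated assumptions: exec_start_cmd is a hypothetical name, it strips only the "-", "+", "!" and ":" prefixes, and it does not handle the "@" prefix or line continuations.

```shell
# Read unit text on stdin and print each non-empty ExecStart= command,
# with the special prefixes "-", "+", "!" and ":" removed.
exec_start_cmd() {
  sed -n 's/^ExecStart=[-+!:]*\(..*\)$/\1/p'
}

# Usage (assumes the unit is called myapp.service):
#   systemctl cat myapp.service | exec_start_cmd
```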

A reusable triage prompt

When I hand a failure to AI, I give it everything at once and constrain the output:

“Here is systemctl cat, the last 80 journalctl -u lines, and the output of running ExecStart manually. Tell me: (1) the single fatal error, (2) the most likely fix, (3) the exact commands to verify, read-only only. Don’t suggest restarting until I’ve confirmed the cause.”
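
Collecting the first two inputs is easy to script. A minimal sketch: triage_bundle is my own name, both commands are read-only, and the manual ExecStart output still has to be pasted in by hand.

```shell
# Print the unit definition and the last 80 journal lines under clear headers,
# ready to paste into a prompt. Nothing here changes system state.
triage_bundle() {
  unit="$1"
  echo "=== systemctl cat $unit ==="
  systemctl cat "$unit" 2>&1 || true
  echo "=== journalctl -u $unit -n 80 ==="
  journalctl -u "$unit" -n 80 --no-pager 2>&1 || true
}

# Usage (assumes the unit is called myapp.service):
#   triage_bundle myapp.service > /tmp/myapp-triage.txt
```

Skim the bundle for secrets before it leaves your machine.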

That last constraint matters. Left alone, models love to suggest systemctl restart as step one — the same anti-pattern I warn about for incident triage. Confirm first, restart once.

Don’t let it edit unit files blindly

One caution: AI will happily rewrite your whole unit file. Don’t paste that back without reading it. Models frequently “helpfully” add Restart=always to a oneshot, or drop a Type= directive that changes startup semantics. Treat its unit edits as a draft you review line by line.

The fix that prevents the next one

Once it’s running, add a guardrail so the next failure is louder and self-healing:

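[Unit]
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
Restart=on-failure
RestartSec=5

That restarts on crashes but gives up after three start attempts within 60 seconds instead of hammering a broken service forever, which is what fills your disk with logs and hides the real error. The StartLimit* settings go in [Unit]; current systemd does not recognize StartLimitIntervalSec= under [Service], so the limit silently never applies there.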

The takeaway

systemd failures feel opaque but they’re shallow: status for the headline, journal for the story, manual ExecStart for the truth. Let AI compress the log-reading and propose the fix, but keep the human on daemon-reload and restart. Read the command before you run it, confirm the cause before you change state, and you’ll close these in minutes instead of a frustrated half hour.

AI suggestions are assistive, not authoritative. Verify every command against your own system before running it.