The Most Common Linux Server Problems (and How to Fix Them)

The most common Linux server problems are: a full disk or exhausted inodes, high load average, the OOM killer killing processes, runaway processes, a full /boot, failed systemd services, SSH lockouts, DNS resolution failures, clock drift, read-only root filesystems, zombie processes, “port already in use” errors, slow disk I/O, permission-denied errors, and package/dependency conflicts. Nearly all of them are diagnosable in under a minute with a handful of built-in tools — df, top, dmesg, journalctl, ss, free, systemctl, and iostat — and most have a fix you can apply immediately and a one-line change to stop them recurring.

Here’s the short list of what we’ll cover, roughly in the order I hit them on real incidents:

Disk full / inode exhaustion
High load average
The OOM killer killing your processes
A runaway process eating CPU
A full /boot partition
A failed systemd service
SSH lockout (you can’t log in)
DNS resolution failure
Clock / time drift
Read-only root filesystem
Zombie / defunct processes
“Address already in use” (port conflict)
Slow disk I/O
Permission denied
Package / dependency hell

After more than a decade of carrying a pager across Ubuntu, RHEL, Rocky, and Debian fleets, these fifteen account for the overwhelming majority of “the server is broken” tickets. For each one below I give you the symptom, the diagnosis commands (real, copy-pasteable), the fix, and the prevention so it doesn’t page you again. If you want the whole troubleshooting toolkit in one place, the Linux Admins hub collects every deep-dive we’ve written.

Disk Full or Inode Exhaustion

Symptom: Writes fail with No space left on device, services crash on startup, logs stop, databases go read-only — even though df -h sometimes shows free space.

Diagnosis:

df -h            # space used per filesystem
df -i            # INODES used per filesystem — check this when df -h shows free space
du -xh --max-depth=1 / 2>/dev/null | sort -rh | head -20   # biggest dirs on the root fs
du -xh --max-depth=1 /var | sort -rh | head -20            # /var is the usual culprit
lsof +L1         # files deleted but still held open by a process (space not reclaimed)

The classic trap: df -h says 60% used but writes still fail. That usually means inodes are exhausted, not bytes: millions of tiny files (mail queues, PHP sessions, cache fragments) have used up every inode while blocks remain free. df -i is the tell.

Fix: Find and remove the bloat. If a process still holds a deleted file open (common when nginx or an app keeps writing to a log that was rotated out from under it), the space isn’t freed until you restart the process or truncate the file:

# Truncate a runaway log without restarting the writer:
truncate -s 0 /var/log/some-app/huge.log
# Or for a deleted-but-open file, find the PID from lsof +L1 and:
systemctl restart the-service
journalctl --vacuum-size=500M     # cap systemd journal

Prevention: Configure logrotate, cap the journal in /etc/systemd/journald.conf (SystemMaxUse=1G), set a disk-usage alert at 80%, and put /var on its own partition so a runaway log can’t take down /.
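
The 80% alert doesn’t need a monitoring stack to get started. Here’s a minimal sketch, assuming GNU df; over_threshold is a hypothetical helper name, not part of any package. It reads df -P or df -iP output and prints every mount at or above a limit, so the same check covers bytes and inodes.

```shell
# over_threshold LIMIT: read `df -P` (or `df -iP`) output on stdin and print
# "mountpoint percent" for every filesystem at or above LIMIT percent.
# (Hypothetical helper; mount points containing spaces are not handled.)
over_threshold() {
    awk -v limit="$1" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)        # "83%" becomes "83"; a "-" stays "-"
            if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0) print $6, pct
        }'
}

# Usage (bytes first, then inodes):
#   df -P  | over_threshold 80
#   df -iP | over_threshold 80
```

Run from cron or a systemd timer, any output at all is your alert condition.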

High Load Average

Symptom: Everything feels sluggish; uptime shows a load average several times higher than your CPU count.

Diagnosis:

uptime                  # the three load numbers: 1, 5, 15-minute averages
nproc                   # number of CPUs — compare load against this
top                     # press '1' to see per-CPU; watch %us, %sy, %wa, %id
vmstat 1 5              # 'r' = runnable, 'b' = blocked; high 'wa' = I/O wait

Load average on Linux counts processes that are runnable plus processes in uninterruptible sleep (usually waiting on I/O). So high load with low CPU usually means you’re I/O-bound, not CPU-bound; look at the wa (I/O wait) column in top/vmstat. High load with %us near 100% is genuine CPU saturation.

Fix: Identify the heaviest consumers with top (sort by CPU with P) or ps aux --sort=-%cpu | head. If it’s I/O wait, jump to the slow-disk-I/O section. Renice or throttle non-critical batch jobs:

renice +10 -p <PID>
ionice -c3 -p <PID>     # idle I/O class for a noisy batch process

Prevention: Capacity-plan against load relative to nproc, move batch jobs off peak hours, and alert on load > 1.5 × nproc sustained for several minutes rather than on a single spike.
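
The 1.5 × nproc rule is easy to encode. A minimal sketch, with load_alert as a made-up helper name: it takes a load figure and a CPU count and succeeds only when the load is above the threshold. The usage line feeds it the 5-minute average from /proc/loadavg so a single spike doesn’t trip it; the 1.5 multiplier is the one suggested above, not a universal constant.

```shell
# load_alert LOAD NCPU: exit 0 when LOAD is greater than 1.5 x NCPU,
# exit 1 otherwise. (Hypothetical helper; threshold taken from the text above.)
load_alert() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { exit !(load + 0 > 1.5 * ncpu) }'
}

# Usage: field 2 of /proc/loadavg is the 5-minute average.
#   load_alert "$(cut -d' ' -f2 /proc/loadavg)" "$(nproc)" && echo "load is high"
```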

The OOM Killer Killing Your Processes

Symptom: A process (often your database or app) vanishes with no clean shutdown. Restarts seem random.

Diagnosis:

dmesg -T | grep -i -E 'killed process|out of memory|oom'
journalctl -k | grep -i oom
free -h                 # is swap exhausted too?
# Per-cgroup memory pressure on systemd services:
systemctl status <service>      # look for "Result: oom-kill" or a non-zero exit

The kernel’s Out-Of-Memory killer triggers when the system can’t allocate memory and there’s nothing left to reclaim. dmesg records exactly which process it sacrificed and its oom_score.
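
To pull just the victims out of a noisy kernel log, a small filter helps. This sketch assumes the common “Killed process PID (name)” wording used by recent kernels; oom_victims is a hypothetical helper, and older or vendor kernels may phrase the line differently.

```shell
# oom_victims: read kernel log text on stdin and print "PID NAME" for every
# line that reports a process killed by the OOM killer.
# (Hypothetical helper; assumes the "Killed process PID (name)" log wording.)
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Usage:
#   dmesg -T | oom_victims
#   journalctl -k --no-pager | oom_victims
```

If the same name shows up repeatedly, that service is the one to cap or right-size first.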

Fix: Immediately, restart the killed service. Longer term, cap memory per service so the offender gets killed instead of innocent neighbors, or add swap as a buffer:

# Give a systemd unit a hard memory ceiling:
systemctl edit myapp     # in the override file that opens, add a [Service] section containing: MemoryMax=2G
# Add emergency swap:
fallocate -l 4G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile

Prevention: Right-size workloads, set MemoryMax/MemoryHigh on services, tune vm.overcommit_memory deliberately, and protect critical processes with a negative oom_score_adj.

A Runaway Process Eating CPU

Symptom: One core (or all of them) pinned at 100%, fans screaming, latency spiking.

Diagnosis:

top                              # press 'P' to sort by CPU
ps -eo pid,ppid,cmd,%cpu,%mem --sort=-%cpu | head
pidstat 1 5                      # per-process CPU over time (from sysstat)
strace -p <PID>                  # what syscall is it spinning on?

Fix: If it’s safe to kill, do it; if it’s a fork bomb or wedged loop, stop it gracefully first:

kill -TERM <PID>      # ask nicely
kill -KILL <PID>      # if it ignores you

For a fork bomb, you may need to kill -STOP the parent to freeze the tree before cleanup. Set per-user nproc limits in /etc/security/limits.conf to prevent recurrence.
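
The ask-nicely-then-force sequence can be wrapped so the escalation isn’t left to memory. A sketch only: stop_pid is a hypothetical helper, and the 10-second default grace period is an arbitrary assumption, so tune it to how long your process needs for a clean shutdown.

```shell
# stop_pid PID [GRACE]: send TERM, wait up to GRACE seconds (default 10),
# then send KILL if the process still exists.
# (Hypothetical helper; uses the plain variables pid and grace, since POSIX sh
# has no local variables.)
stop_pid() {
    pid=$1
    grace=${2:-10}
    kill -TERM "$pid" 2>/dev/null || return 0     # already gone (or not ours to signal)
    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0    # exited after TERM
        sleep 1
        grace=$((grace - 1))
    done
    kill -KILL "$pid" 2>/dev/null                 # grace period is over
    return 0
}

# Usage:
#   stop_pid 12345        # TERM, then KILL after 10 seconds
#   stop_pid 12345 30     # give it 30 seconds first
```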

Prevention: Add resource limits (CPUQuota= in systemd), watchdog timers, and alerting on sustained single-process CPU. For interactive debugging, Warp’s AI terminal is handy for explaining unfamiliar strace/perf output on the fly.

A Full /boot Partition

Symptom: Kernel updates fail (apt/dnf errors about no space in /boot), or the box won’t boot after an update.

Diagnosis:

df -h /boot
dpkg --list | grep linux-image       # Debian/Ubuntu: installed kernels
rpm -qa | grep kernel                # RHEL/Rocky: installed kernels
uname -r                             # the kernel you're CURRENTLY running — never remove this one

/boot is often a small dedicated partition (a few hundred MB), and every kernel update leaves the old image behind. Three or four kernels later it’s full.

Fix: Remove old kernels — never the running one:

# Ubuntu/Debian:
apt autoremove --purge
# RHEL/Rocky/Fedora (keep 2 most recent):
dnf remove $(dnf repoquery --installonly --latest-limit=-2 -q)

Prevention: Set installonly_limit=2 in /etc/dnf/dnf.conf, or rely on Ubuntu’s apt autoremove. Monitor /boot separately — it fills silently.

A Failed systemd Service

Symptom: An app or daemon isn’t responding; systemctl reports failed or it’s stuck restarting.

Diagnosis:

systemctl status myapp.service          # state, last exit code, recent log lines
journalctl -u myapp.service -n 100 --no-pager     # full recent logs for this unit
journalctl -u myapp.service -p err -b              # errors since last boot
systemctl list-units --state=failed                # everything that's failed

The exit code and the journal almost always tell you exactly why: bad config, missing dependency, permission issue, or port conflict.

Fix: Read the journal, fix the root cause (config typo, missing file, wrong user), then:

systemctl daemon-reload     # if you edited the unit file
systemctl restart myapp.service
systemctl reset-failed myapp.service    # clear the failed state

Prevention: Add Restart=on-failure with a sane RestartSec, set StartLimitIntervalSec/StartLimitBurst so a crash-loop doesn’t hammer the box, and validate config in CI before deploy.
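
As a sketch of what that restart policy might look like as a drop-in: write_restart_dropin is a hypothetical helper, and the numbers (a 5-second delay, at most 5 starts in 300 seconds) are illustrative assumptions rather than recommendations for your workload.

```shell
# write_restart_dropin DIR: create DIR if needed and write restart.conf into it.
# The StartLimit* settings belong in [Unit]; Restart/RestartSec in [Service].
# (Hypothetical helper; all values are example numbers.)
write_restart_dropin() {
    mkdir -p "$1" || return 1
    cat > "$1/restart.conf" <<'EOF'
[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=5
EOF
}

# Usage (as root; myapp.service is the placeholder unit used above):
#   write_restart_dropin /etc/systemd/system/myapp.service.d
#   systemctl daemon-reload
```

With these example values a crash-looping unit gives up after five starts in five minutes and lands in the failed state, where systemctl list-units --state=failed will show it.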

SSH Lockout — You Can’t Log In

Symptom: Permission denied (publickey), connection refused, or Too many authentication failures — and it’s the only way into the box.

Diagnosis (from a working session or console):

sshd -t                                  # test sshd config syntax BEFORE restarting
journalctl -u ssh -n 50                   # (or sshd on RHEL) — auth failures, config errors
tail -f /var/log/auth.log                 # Debian/Ubuntu auth log (/var/log/secure on RHEL/Rocky)
ss -tulpn | grep :22                      # is sshd even listening?
fail2ban-client status sshd               # are you banned by fail2ban?

Common causes: wrong key permissions (~/.ssh/authorized_keys must be 600, ~/.ssh 700), PermitRootLogin no blocking a root login, a botched sshd_config edit, a firewall rule, or fail2ban banning your own IP.

Fix: Use the cloud provider’s serial/VNC console or a recovery instance. The cardinal rule: always run sshd -t and keep an existing session open before restarting sshd. Unban yourself with fail2ban-client set sshd unbanip <IP>. Fix permissions:

chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
restorecon -Rv ~/.ssh    # RHEL/Rocky: fix SELinux context

Prevention: Keep a break-glass console access path, never restart sshd without a second open session, and add your management IPs to a fail2ban ignoreip allowlist.
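
The key-permission check is worth scripting because it’s easy to get wrong under pressure. A minimal sketch, assuming GNU stat (for -c '%a'); check_ssh_perms is a hypothetical helper that compares against the 700/600 modes given above.

```shell
# check_ssh_perms DIR: print "ok" if DIR is mode 700 and DIR/authorized_keys
# is mode 600; otherwise print the modes found and return 1.
# Returns 2 if either path cannot be inspected.
# (Hypothetical helper; relies on GNU stat.)
check_ssh_perms() {
    dir_mode=$(stat -c '%a' "$1") || return 2
    key_mode=$(stat -c '%a' "$1/authorized_keys") || return 2
    if [ "$dir_mode" = 700 ] && [ "$key_mode" = 600 ]; then
        echo "ok"
    else
        echo "bad: dir=$dir_mode authorized_keys=$key_mode"
        return 1
    fi
}

# Usage:
#   check_ssh_perms ~/.ssh
#   check_ssh_perms /home/deploy/.ssh    # "deploy" is just an example user
```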

DNS Resolution Failure

Symptom: Temporary failure in name resolution, package installs and outbound API calls fail, but ping 8.8.8.8 works fine (IP routing is OK, names aren’t).

Diagnosis:

cat /etc/resolv.conf                  # which resolvers is the system using?
resolvectl status                     # systemd-resolved: per-link DNS config
dig example.com                       # full resolution path and timing
dig @1.1.1.1 example.com              # does an external resolver work? isolates local vs upstream
getent hosts example.com              # what the system's NSS actually returns

If ping to an IP works but name lookups fail, it’s DNS. If dig @1.1.1.1 works but the default resolver doesn’t, your configured resolver is the problem.
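
That comparison can be run against every configured resolver instead of just the default. A small sketch: list_nameservers is a hypothetical helper that pulls the resolver IPs out of resolv.conf-style text, and the usage loop tests each one with a short timeout. On systemd-resolved hosts the only entry is often the local stub (127.0.0.53), in which case resolvectl status is the place to see the real upstreams.

```shell
# list_nameservers: read resolv.conf-style text on stdin and print one
# resolver IP per line. (Hypothetical helper.)
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }'
}

# Usage: query each configured resolver directly to find the broken one.
#   list_nameservers < /etc/resolv.conf | while read -r ns; do
#       if dig +short +time=2 +tries=1 @"$ns" example.com >/dev/null; then
#           echo "$ns ok"
#       else
#           echo "$ns FAILED"
#       fi
#   done
```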

Fix: Point at a working resolver. On systemd-resolved systems, /etc/resolv.conf is a symlink — don’t hand-edit it; configure the resolver properly:

resolvectl dns eth0 1.1.1.1 8.8.8.8
systemctl restart systemd-resolved

Prevention: Configure redundant resolvers, monitor DNS latency, and beware of DHCP overwriting /etc/resolv.conf on reboot.

Clock / Time Drift

Symptom: TLS handshakes fail (certificate is not yet valid), Kerberos/auth breaks, logs across hosts don’t line up, and replication/quorum systems misbehave.

Diagnosis:

timedatectl status                    # is NTP synchronized? what's the offset?
chronyc tracking                      # chrony: offset, stratum, last sync
chronyc sources -v                    # which time sources, are they reachable?
date                                  # quick sanity check

Fix:

systemctl enable --now chronyd        # (or systemd-timesyncd)
chronyc makestep                      # force an immediate step correction
timedatectl set-ntp true

Prevention: Run a time daemon on every host, point at consistent NTP sources (ideally internal), and alert on offset above a few hundred milliseconds.
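
For the offset alert, the number you want is on the “System time” line of chronyc tracking. A sketch under that assumption: chrony_offset_ms is a hypothetical helper that prints the offset in milliseconds, and the 300 ms threshold in the usage line is only an example of “a few hundred milliseconds”.

```shell
# chrony_offset_ms: read `chronyc tracking` output on stdin and print the
# magnitude of the system clock offset in milliseconds (chrony reports the
# direction separately, as "fast" or "slow"). (Hypothetical helper.)
chrony_offset_ms() {
    awk -F': *' '/^System time/ { split($2, f, " "); printf "%.3f\n", f[1] * 1000 }'
}

# Usage: warn when the offset exceeds an example threshold of 300 ms.
#   off=$(chronyc tracking | chrony_offset_ms)
#   awk -v o="$off" 'BEGIN { exit !(o + 0 > 300) }' && echo "clock offset ${off} ms"
```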

Read-Only Root Filesystem

Symptom: Suddenly nothing can write: Read-only file system errors everywhere, services failing to log or save state.

Diagnosis:

mount | grep ' / '                    # is / mounted 'ro'?
dmesg -T | grep -i -E 'ext4|xfs|i/o error|remount|read-only'
cat /proc/mounts | grep ' / '

The kernel remounts a filesystem read-only when it detects corruption or repeated I/O errors — a safety mechanism. So a read-only root is frequently a symptom of failing storage, not a config mistake.
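
A plain grep for ro is unreliable as a scripted check, because mount options such as errors=remount-ro also contain the string. A sketch that inspects the option list properly; root_is_readonly is a hypothetical helper that reads /proc/mounts-style text on stdin.

```shell
# root_is_readonly: read /proc/mounts-style text on stdin; exit 0 if the
# last entry mounted on / has "ro" as one of its options, exit 1 otherwise.
# (Hypothetical helper.)
root_is_readonly() {
    awk '$2 == "/" { opts = $4 }
         END { exit !(("," opts ",") ~ /,ro,/) }'
}

# Usage:
#   root_is_readonly < /proc/mounts && echo "root filesystem is READ-ONLY"
```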

Fix: Check the disk health first (dmesg, smartctl -a /dev/sda). If the filesystem is genuinely damaged, run fsck from a recovery/rescue boot (never on a mounted root). To remount read-write once you’ve confirmed it’s safe:

mount -o remount,rw /

Prevention: Monitor disk SMART health, watch dmesg for I/O errors, and treat any unexpected remount-ro as a hardware investigation, not just a remount.

Zombie / Defunct Processes

Symptom: ps shows processes in state Z marked <defunct>; in large numbers they can exhaust the PID table.

Diagnosis:

ps aux | awk '$8 ~ /^Z/ { print }'    # list zombies
ps -eo pid,ppid,state,cmd | awk '$3 == "Z"' # zombies and their PARENT pids (matches the state column only)
top                                    # the 'zombie' count in the summary line

A zombie is a finished child whose parent never called wait() to reap its exit status. The zombie itself uses no resources except a PID slot — the bug is in the parent.
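
Since the parent is the thing to fix, it helps to group zombies by parent PID. A minimal sketch; zombies_by_parent is a hypothetical helper that expects ps output with the columns pid, ppid, state in that order.

```shell
# zombies_by_parent: read `ps -eo pid,ppid,state,comm` output on stdin and
# print "PPID COUNT" for each parent that has zombie children, busiest first.
# (Hypothetical helper.)
zombies_by_parent() {
    awk 'NR > 1 && $3 ~ /^Z/ { n[$2]++ }
         END { for (p in n) print p, n[p] }' |
        sort -k2,2nr -k1,1n
}

# Usage:
#   ps -eo pid,ppid,state,comm | zombies_by_parent
#   ps -o pid,cmd -p <PPID>      # then look at the worst parent
```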

Fix: You can’t kill a zombie (it’s already dead). Signal the parent to reap it, or restart the parent:

kill -CHLD <PPID>     # nudge the parent to reap
# If that fails, restart or kill the parent; init/systemd then reaps the orphans.

Prevention: Fix the parent program to reap children. In containers, run a proper init (tini, --init) so PID 1 reaps zombies.

“Address Already in Use” (Port Conflict)

Symptom: A service won’t start: bind: Address already in use or Failed to listen on 0.0.0.0:8080.

Diagnosis:

ss -tulpn | grep :8080                # what's listening on the port (and its PID)
ss -tulpn | grep LISTEN               # everything currently listening
lsof -iTCP:8080 -sTCP:LISTEN          # alternative, shows process detail
fuser 8080/tcp                        # PID(s) using the TCP port

Either another instance is already running, a previous instance didn’t shut down cleanly, or the socket is in TIME_WAIT.
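
A bare grep :8080 also matches ports like :80800 and peer addresses, so for scripting it’s safer to match the local-address column exactly. A sketch, assuming the column layout of ss -tulpn (local address in the fifth field); port_owner is a hypothetical helper.

```shell
# port_owner PORT: read `ss -tulpn` output on stdin and print the first
# owning PID from each socket whose LOCAL port is exactly PORT.
# (Hypothetical helper; assumes the -tulpn column layout, where the local
# address is field 5. With other ss flags the columns shift.)
port_owner() {
    awk -v port="$1" '
        $5 ~ (":" port "$") && match($0, /pid=[0-9]+/) {
            print substr($0, RSTART + 4, RLENGTH - 4)
        }' | sort -un
}

# Usage (run as root, or ss will not show other users' processes):
#   ss -tulpn | port_owner 8080
```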

Fix: Stop the conflicting process, or change the port:

kill <PID>                            # the process from ss -tulpn
# or, if it's a stale unit:
systemctl restart conflicting.service

If it’s TIME_WAIT churn on a busy service, enable SO_REUSEADDR in the app (most servers already do).

Prevention: Use systemd socket activation or ExecStartPre checks, document port assignments, and avoid running two copies of the same service.

Slow Disk I/O

Symptom: High load with low CPU, sluggish databases, long fsync times, processes stuck in D (uninterruptible sleep) state.

Diagnosis:

iostat -xz 1 5                        # %util near 100% and high await = saturated disk
iotop -oP                             # which processes are doing the I/O right now
ps -eo pid,state,cmd | awk '$2 ~ /D/' # processes blocked on I/O
dmesg -T | grep -i -E 'i/o error|ata|nvme'   # hardware errors

In iostat -xz, watch %util (how busy the device is), await (average wait per I/O in ms), and aqu-sz (queue depth). A %util pinned at 100% with rising await means the disk is the bottleneck.
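
The iostat column order changes between sysstat versions, so a script shouldn’t hard-code field numbers. This sketch finds the %util column from the header line instead; busy_disks is a hypothetical helper, and the default limit of 90 is an arbitrary assumption.

```shell
# busy_disks [LIMIT]: read `iostat -x` output on stdin and print
# "device %util" for every device line at or above LIMIT (default 90).
# The %util column is located by name from each "Device" header line.
# (Hypothetical helper.)
busy_disks() {
    awk -v limit="${1:-90}" '
        $1 ~ /^Device/ {
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && $col + 0 >= limit + 0 { print $1, $col }'
}

# Usage:
#   iostat -xz 1 5 | busy_disks 90
```

Keep in mind that the first iostat report covers the time since boot, so the later samples are the ones that describe what’s happening right now.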

Fix: Find and throttle the heavy writer with ionice -c3, move hot data to faster storage (NVMe/SSD), or add caching. For databases, check whether you’re fsync-bound and consider a write-back cache or faster journal device.

Prevention: Provision IOPS for the workload, monitor await/%util trends, and separate noisy batch I/O onto its own device.

Permission Denied

Symptom: A command, service, or web app fails with Permission denied even though the file “looks” readable.

Diagnosis:

ls -l /path/to/file                   # owner, group, mode bits
ls -ld /path/to/dir                   # directory must be traversable (x bit) all the way down
id <user>                             # what groups is the user actually in?
namei -l /full/path/to/file           # walks EVERY component's permissions — invaluable
getfacl /path/to/file                 # POSIX ACLs that override basic mode bits
# RHEL/Rocky/Fedora — don't forget SELinux:
ls -Z /path/to/file
ausearch -m avc -ts recent            # recent SELinux denials

On RHEL-family systems, the file permissions can be perfect and access still fails because of an SELinux context mismatch — ausearch -m avc reveals it. On any system, an unreadable parent directory (missing x) blocks access to a perfectly readable file inside it; namei -l shows you exactly where the path breaks.
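
When a path is deep, scanning namei -l by eye gets tedious. As a rough heuristic (not a full permission check, since the caller may still get in as owner or through a group), this sketch flags directory components that “other” users can’t traverse. not_world_traversable is a hypothetical helper.

```shell
# not_world_traversable: read `namei -l` output on stdin and print the name
# of each directory component whose "other" execute bit is not set.
# (Hypothetical helper; a heuristic only, it ignores owner/group access,
# ACLs, and SELinux.)
not_world_traversable() {
    awk '$1 ~ /^d/ && substr($1, 10, 1) !~ /[xt]/ { print $4 }'
}

# Usage:
#   namei -l /var/www/html/index.html | not_world_traversable
```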

Fix: Correct ownership/mode, add the user to the right group, or fix the SELinux context:

chown user:group /path/to/file && chmod 640 /path/to/file
restorecon -Rv /path/to/dir          # reset SELinux to the policy default
setsebool -P httpd_can_network_connect on    # example: allow a known-good SELinux action

Prevention: Use groups instead of broad 777, keep SELinux enforcing with proper contexts rather than disabling it, and audit permissions with least-privilege reviews.

Package / Dependency Hell

Symptom: apt/dnf refuses to install: held broken packages, conflicting versions, unmet dependencies, or a half-configured package left in the package database.

Diagnosis:

# Debian/Ubuntu:
apt-get check
dpkg --audit                          # half-installed/half-configured packages
apt-cache policy <pkg>                # which versions/repos are available
# RHEL/Rocky/Fedora:
dnf check
rpm -Va                               # verify installed packages against their metadata
dnf repoquery --duplicates            # duplicate package versions

Fix: Repair the broken state, then resolve conflicts deliberately:

# Debian/Ubuntu:
dpkg --configure -a
apt-get -f install                    # fix broken dependencies
# RHEL/Rocky:
dnf distro-sync                       # reconcile to consistent versions
dnf history undo last                 # roll back the last transaction

Prevention: Pin critical package versions, avoid mixing third-party repos that fight over the same packages, test upgrades in staging, and use dnf history / snapshots so you can roll back.

Triage Faster with AI

Once you’ve gathered output from the commands above, the slow part isn’t fixing the problem — it’s deciding which of these fifteen you’re actually looking at when the symptoms overlap (high load could be CPU, OOM thrash, or slow I/O; a failed service could be a port conflict, a permission issue, or DNS).

This is exactly where an LLM shines as a triage partner. The workflow I use on real incidents: paste the raw, unedited output of uptime, free -h, df -h, df -i, the relevant journalctl -u <service> -n 50, and dmesg -T | tail -40 into the model and ask for ranked hypotheses plus the single next command to confirm each one. You stay in control — the model proposes, you run the commands and decide.
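
Gathering that output by hand is error-prone at 3am, so I’d script it. A minimal sketch: collect_diag is a hypothetical helper that runs the commands listed above and labels each section; commands that fail or need root just leave their error text in the bundle. Read the result before pasting it anywhere, because logs can contain hostnames and tokens.

```shell
# collect_diag [SERVICE]: print labelled output of the basic triage commands,
# plus the last 50 journal lines for SERVICE if one is given.
# (Hypothetical helper; failures are captured in the output, not fatal.)
collect_diag() {
    for cmd in "uptime" "free -h" "df -h" "df -i" "dmesg -T | tail -40"; do
        printf '### %s\n' "$cmd"
        sh -c "$cmd" 2>&1 || true
        printf '\n'
    done
    if [ -n "${1:-}" ]; then
        printf '### journalctl -u %s -n 50\n' "$1"
        journalctl -u "$1" -n 50 --no-pager 2>&1 || true
    fi
}

# Usage:
#   collect_diag myapp.service > /tmp/triage.txt    # review, redact, then paste
```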

A few ways to wire this into your workflow:

The free incident-response tool on the dashboard takes your pasted diagnostics and returns ranked likely causes and the next diagnostic step — built for exactly this paste-and-triage loop.
The Linux Admin Prompt Pack bundles battle-tested prompts for each problem class above (OOM analysis, disk forensics, SELinux denials, systemd unit debugging), so you’re not rewriting the same prompt at 3am. It pairs directly with everything in this article.
Our general prompt library has reusable templates for log analysis and command explanation.
If you live in the terminal, Warp’s AI features and Claude both explain unfamiliar output (cryptic strace, perf, or kernel messages) inline without copy-pasting elsewhere.

The one rule: never paste secrets (keys, tokens, passwords, internal hostnames you’d rather not leak) into any tool, and always verify a suggested command before you run it as root.

FAQ

How do I find what’s filling my disk? Start with df -h to find the full filesystem, then du -xh --max-depth=1 / | sort -rh | head -20 to drill into the biggest directories (the -x keeps it on one filesystem). If df -h shows free space but writes still fail, run df -i to check for inode exhaustion, and lsof +L1 to catch deleted files still held open by a process — space those occupy isn’t freed until the process is restarted or the file truncated.

Why is my Linux server slow? Check uptime for load average and compare it to nproc. Then top to see whether you’re CPU-bound (%us high), I/O-bound (%wa high), or memory-starved (swapping in free -h). High load with low CPU almost always means slow disk I/O — confirm with iostat -xz 1 and look for %util near 100% with rising await. Memory pressure shows up as swap thrashing in vmstat 1.

How do I see why a process was killed? Run dmesg -T | grep -i 'killed process' or journalctl -k | grep -i oom. The kernel’s OOM killer logs exactly which process it terminated and why when the system runs out of memory. For a systemd service, systemctl status <service> will show Result: oom-kill or a non-zero exit code.

What does “load average” actually mean on Linux? It’s the average number of processes that are either running or waiting in uninterruptible I/O sleep, over 1, 5, and 15 minutes. Unlike other Unixes, Linux counts I/O-blocked processes, which is why a fully idle CPU can still show high load — that’s your signal to investigate disk I/O rather than CPU.

How do I check what’s listening on a port? ss -tulpn lists every listening TCP/UDP socket with the owning process and PID. To target one port: ss -tulpn | grep :8080 or lsof -iTCP:8080 -sTCP:LISTEN. This is the fastest way to resolve an “address already in use” error — it tells you exactly which PID to stop.

Conclusion

Linux server problems feel chaotic in the moment, but the same fifteen issues account for the vast majority of real incidents — and every one of them is diagnosable with built-in tools you already have. Internalize the first command for each (df -i, dmesg, journalctl -u, ss -tulpn, iostat -xz) and you’ll identify the root cause in the first minute instead of guessing. Pair that muscle memory with an AI triage step for the ambiguous cases, fix the immediate problem, then apply the one-line prevention so it never pages you twice.

Bookmark the Linux Admins hub for the deeper dives on each topic, grab the Linux Admin Prompt Pack if you want the prompts ready to go, and keep the incident-response tool open the next time the pager goes off.
