NFS Mount Troubleshooting Prompt
Diagnose NFS mount failures — stale file handles, hung mounts (D-state processes), soft vs hard semantics, autofs misbehavior, Kerberos (sec=krb5) errors.
- Target user: Linux sysadmins managing NFS clients and servers
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior Linux storage engineer with deep experience operating NFS (NFSv3 and NFSv4, with or without Kerberos) and recovering clients from stuck mounts in production.

I will provide:
- The symptom (mount fails, mount hangs forever, "stale file handle," I/O hangs after working for hours, permission denied even though IDs match)
- NFS version (`v3`, `v4`, `v4.1`, `v4.2`)
- Security flavor (`sys`, `krb5`, `krb5i`, `krb5p`)
- Client and server distros and kernel versions
- The mount line (from `/etc/fstab` or `autofs` map), with full options
- Output of: `mount | grep <path>`, `nfsstat -m`, `dmesg | grep -i nfs`, `rpcinfo -p <server>`
- For hangs: `ps auxf | grep -E "[D] "` (D-state processes), and `cat /proc/<pid>/stack` for one stuck process

Your job:

1. **Classify the symptom**:
   - **Mount-time failure** → server reachable? portmapper running? export visible? auth flavor mismatch?
   - **Hung mount / D-state processes** → soft vs hard? `intr` deprecated; modern hangs need careful recovery
   - **Stale file handle (`ESTALE`)** → server-side file deleted/replaced while client held it; or server FH-database changed (e.g., FS recreated)
   - **Permission denied with matching UIDs** → idmapd / nfs4 uid mapping mismatch, or root_squash/no_root_squash
   - **Kerberos errors** → keytab, clock skew, principal name, `gssproxy` / `rpc.gssd`
   - **Performance / I/O wait** → mount options (`rsize`/`wsize`, `actimeo`), backend storage saturation, MTU

2. **Walk the path** from client to server:
   - Client kernel NFS module loaded?
   - `rpc.statd` (v3) or `nfsidmap` (v4)?
   - Network reachable to server on `2049/tcp` (NFSv4) or RPC-discovered ports (NFSv3)?
   - Server `exportfs -v` shows the expected export with the expected client mask?
   - For Kerberos: clock skew < 5 min? both have keytabs with `nfs/<host>` principals?

3. **For hung mounts**, explain the recovery options in safest-first order:
   - **Soft mounts** (`soft,timeo=`): I/O returns EIO after timeout. Process can recover.
   - **Hard mounts** (default): I/O blocks forever. Process is in D state and is `kill -9` resistant until the server returns OR the mount is force-unmounted.
   - `umount -f` (force, NFS-aware): asks the kernel to fail in-flight I/O. Sometimes works.
   - `umount -l` (lazy): detaches from the namespace; existing handles keep blocking. Use to free the path, not to recover the process.
   - Last resort: reboot.

4. **Decode mount options** the user has set, calling out dangerous combinations:
   - `soft` without `timeo=` / `retrans=`: fails too fast under a transient blip
   - `hard,nointr` (legacy; on modern kernels `intr` is ignored anyway)
   - `noac`: disables attribute caching; severe performance hit; use only when required (shared write-heavy workloads)
   - `actimeo=0`: same problem; rarely needed
   - `rsize=`/`wsize=` mismatched to MTU: fragmentation; defaults usually best
   - `sec=sys` over the internet: credentials are unauthenticated; client UID is trusted

5. **For NFSv4 idmapping issues**: explain `/etc/idmapd.conf`, `Domain=`, and where the "nobody:nogroup" trap comes from (mismatched Domain).

6. **Mark DESTRUCTIVE actions**: force-umount of a mount with running writers, rebooting an NFS server with active clients, deleting server-side files with handles still held, restarting `nfs-server` mid-day.

---

Symptom: [DESCRIBE]
NFS version + sec: [v4.1, sys]
Client / Server distros: [e.g., Ubuntu 22.04 client, RHEL 8 server]

Mount line (`/etc/fstab` or autofs map):
```
[PASTE]
```

`mount | grep <mountpoint>`:
```
[PASTE]
```

`nfsstat -m`:
```
[PASTE]
```

`dmesg | grep -i nfs` (recent):
```
[PASTE]
```

`rpcinfo -p <server>`:
```
[PASTE]
```

Server-side: `exportfs -v` and `nfsstat -s` (if you have access):
```
[PASTE]
```

Hung process info (if applicable):
```
[PASTE: ps + /proc/<pid>/stack]
```
Why this prompt works
NFS failures are weird because they happen across two kernels (client and server) plus a network plus possibly Kerberos. “Stale file handle” sounds informative but is opaque to most engineers. Models tend to suggest “remount the share” as a panacea — useless if the mount is hung and processes are in D state. This prompt forces a layered walk and respects the irreversibility of certain NFS operations.
How to use it
- State the NFS version. v3 and v4 are different protocols with different debugging tools (`rpcinfo` matters for v3; v4 only listens on 2049/tcp).
- Specify `sec=` flavor. Kerberos NFS adds 50% more failure modes (KDC, keytab, clock skew).
- For hangs, list the D-state PIDs and their stack: `cat /proc/<pid>/stack` shows whether they're stuck in `rpc_wait_bit_killable` (server unreachable) vs `nfs_wait_on_request` (server slow vs hung).
- From the server side, include `exportfs -v` if you have access; half of mount failures are export permission mismatches.
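Gathering these inputs by hand is error-prone when a mount is already misbehaving, so the following is a minimal sketch of a collector that prints each output under a header ready to paste into the prompt. The names `run_section` and `nfs_report` are hypothetical helpers invented for this example, and the sketch assumes GNU `timeout` is installed so that a command blocked on a dead server cannot stall the whole report.

```shell
#!/usr/bin/env bash
# Sketch only: run_section and nfs_report are hypothetical helper names,
# not part of nfs-utils. Assumes GNU coreutils `timeout` is available.

# Print a header, then the command's output. Never abort if the command is
# missing, fails, or blocks; a broken NFS setup is exactly when that happens.
run_section() {
    local title=$1; shift
    printf '### %s\n' "$title"
    if ! timeout 10 "$@" 2>&1; then
        printf '(no result: command failed, timed out, or is unavailable)\n'
    fi
    printf '\n'
}

# Collect the client-side outputs the prompt asks for.
nfs_report() {
    local server=$1 mountpoint=$2
    run_section "mount | grep $mountpoint" \
        sh -c 'mount | grep -F -- "$1"' sh "$mountpoint"
    run_section "nfsstat -m" nfsstat -m
    run_section "dmesg | grep -i nfs" \
        sh -c 'dmesg | grep -i nfs | tail -n 50'
    run_section "rpcinfo -p $server" rpcinfo -p "$server"
}

# Example (hypothetical server name and path):
#   nfs_report nfs01.example.com /mnt/data > nfs-report.txt
```

The 10-second limit is an arbitrary choice for the sketch; raise it if `rpcinfo` is slow but working on your network.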
Useful commands
Client side
# Mount info
mount | grep nfs
findmnt -t nfs,nfs4
nfsstat -m # per-mount options, version, server
# Recent NFS kernel messages
dmesg -T | grep -i nfs
journalctl -k --since "1 hour ago" | grep -i nfs
# Test connectivity to server
rpcinfo -p <server> # NFSv3 (lists all RPC services)
showmount -e <server> # list exports (v3-style)
nc -vz <server> 2049 # NFSv4 main port
nc -vz <server> 111 # portmapper / rpcbind
# Hung process forensics
ps auxf | awk '$8 ~ /D/' # all D-state procs
sudo cat /proc/<pid>/stack # what kernel function it's waiting in
sudo cat /proc/<pid>/wchan # short form
# Force unmount (last resort, see safety notes)
sudo umount -f /mnt/nfs # force; asks the kernel to fail in-flight I/O
sudo umount -l /mnt/nfs # lazy; detaches namespace but keeps handles
# NFSv4 idmapping
cat /etc/idmapd.conf
sudo nfsidmap -c # clear idmap cache
ls -ln /mnt/nfs/somefile # numeric UIDs (rules out idmap as cause)
# Kerberos NFS
klist # has user got a ticket?
sudo klist -k # host keytab
sudo systemctl status rpc-gssd
sudo journalctl -u rpc-gssd --since "1 hour ago"
Server side
# Exports
sudo exportfs -v
sudo cat /etc/exports
sudo exportfs -r # reload after editing /etc/exports (safe)
# Statistics
sudo nfsstat -s # server-side counters
sudo nfsstat -o nfs # per-version operation counters
# Who's connected (v4)
sudo ss -tn state established '( sport = :2049 )' # clients connect to local port 2049
# Kerberos
sudo klist -k # service keytab
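The hung-process commands in the client-side list can be chained so that every D-state PID is printed next to the kernel function it is waiting in. This is a rough sketch, not a standard tool: `filter_dstate` and `show_wchan` are hypothetical helper names, and reading `/proc/<pid>/wchan` may return `0` or nothing without root on some kernels.

```shell
#!/usr/bin/env bash
# Sketch only: filter_dstate and show_wchan are hypothetical helper names.

# Read "PID STAT COMMAND" lines on stdin and print the PIDs whose state
# starts with D (uninterruptible sleep).
filter_dstate() {
    awk '$2 ~ /^D/ { print $1 }'
}

# For each PID on stdin, print the PID and its kernel wait channel.
# Prints "?" when the file cannot be read (process gone, no permission).
show_wchan() {
    local pid
    while read -r pid; do
        printf '%s\t%s\n' "$pid" "$(cat "/proc/$pid/wchan" 2>/dev/null || echo '?')"
    done
}

# Usage on a live client:
#   ps -eo pid=,stat=,comm= | filter_dstate | show_wchan
```

Using `ps -eo pid=,stat=,comm=` instead of `ps auxf` keeps the state in a fixed column, which makes the filter independent of username or command-line width.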
Common findings this catches
- Mount hangs at boot → `_netdev` missing from the fstab options, so the mount is tried before the network is up. Add `_netdev` and `x-systemd.requires=network-online.target`.
- Mount fails with `mount.nfs: access denied by server` → the export pattern doesn't include this client's IP; check the server's `/etc/exports`.
- NFSv4 mount shows `nobody:nogroup` for every file → idmapd `Domain=` mismatch between client and server, or NFSv4 ACL/ID mapping not active.
- `ESTALE` storm after server restart → server lost its FSID-based file handles. Client must `umount && mount` (or use a persistent FSID via `fsid=` in exports).
- Hung writes after working for hours → server-side OOM or paused storage. A client hard mount blocks indefinitely; check the server.
- Kerberos error `Server not found in Kerberos database` → the `nfs/<host>` service principal is missing from the KDC database or the keytab, often because of a hostname or DNS mismatch. Test with `kinit -k` from the client.
- `actimeo=0` causes 10× attr lookups → revert to the default `actimeo=` settings (3-60 seconds) unless you have a known shared-write requirement.
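The findings above amount to a lookup from an error string to a first place to look. A minimal sketch of that lookup follows; `nfs_hint` is a hypothetical function name, the matched strings are the ones quoted in the list, and real messages may be worded differently across distributions.

```shell
#!/usr/bin/env bash
# Sketch only: nfs_hint is a hypothetical helper. It maps a pasted error
# line to the matching finding from the list above.
nfs_hint() {
    case $1 in
        *"access denied by server"*)
            echo "export: client not matched by the server's /etc/exports; check exportfs -v" ;;
        *[Ss]"tale file handle"*|*ESTALE*)
            echo "stale-handle: server file handles changed; remount, consider fsid= in exports" ;;
        *"Server not found in Kerberos database"*)
            echo "kerberos: check the nfs/<host> principal and hostname resolution" ;;
        *nobody*)
            echo "idmap: compare Domain= in /etc/idmapd.conf on client and server" ;;
        *)
            echo "unknown: walk the path from client to server" ;;
    esac
}

# Example:
#   nfs_hint "mount.nfs: access denied by server while mounting srv:/export"
```

The order of the patterns matters only where one message could match two of them; here the most specific strings come first and the catch-all comes last.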
Recommended baseline mount options
For most internal workloads on a stable LAN, a hard NFSv4.2 mount over TCP with otherwise default options:
<server>:/export /mnt/data nfs4 _netdev,rw,hard,nodev,nosuid,nfsvers=4.2,proto=tcp,sec=sys,bg 0 0
For internet-spanning NFS or unreliable links, consider:
<server>:/export /mnt/data nfs4 _netdev,rw,soft,timeo=600,retrans=2,nfsvers=4.2,proto=tcp,sec=krb5p,bg 0 0
For automounter-managed mounts, prefer autofs with the same option set and --timeout=60 so unused mounts unmount cleanly.
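Before adopting an option set, it can help to check it against the risky combinations the prompt calls out. The sketch below does that for a comma-separated option string; `lint_nfs_opts` is a hypothetical helper written for this example, and its warnings paraphrase the prompt rather than any official guidance.

```shell
#!/usr/bin/env bash
# Sketch only: lint_nfs_opts is a hypothetical helper, not part of nfs-utils.
# It takes the option field of an fstab or autofs entry and prints one
# warning per risky option; it prints nothing when no rule matches.
lint_nfs_opts() {
    local opts=",$1,"   # pad with commas so each option matches as ",opt,"

    case $opts in
        *,soft,*)
            # soft mounts depend on timeo= and retrans= to decide when to give up
            case $opts in
                *,timeo=*,retrans=*|*,retrans=*,timeo=*) ;;
                *) echo "soft: set both timeo= and retrans= explicitly" ;;
            esac ;;
    esac
    case $opts in *,noac,*)      echo "noac: attribute caching disabled, severe performance hit" ;; esac
    case $opts in *,actimeo=0,*) echo "actimeo=0: attribute caching disabled, rarely needed" ;; esac
    case $opts in *,sec=sys,*)   echo "sec=sys: client UID is trusted, avoid on untrusted networks" ;; esac
    case $opts in *,intr,*|*,nointr,*) echo "intr/nointr: ignored on modern kernels" ;; esac
    return 0
}

# Example with the LAN baseline above:
#   lint_nfs_opts "_netdev,rw,hard,nodev,nosuid,nfsvers=4.2,proto=tcp,sec=sys,bg"
```

The `sec=sys` warning fires for the LAN baseline as well; on a trusted internal network that is an accepted trade-off rather than an error.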
Hang recovery decision tree
Process in D state on hard NFS mount
├── Server reachable now? (ping, nc -vz <server> 2049)
│   ├── Yes → I/O should resume; wait briefly, check `nfsstat -m` retries
│   └── No → continue
│
└── Critical to recover process?
    ├── No → wait for server; log incident
    └── Yes → try `umount -f /mnt/<path>` from a NEW shell
        ├── Succeeds → in-flight I/O fails with an error; remount once the server is back
        └── Hangs → `umount -l` frees the path; reboot to recover hung process
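The same tree can be written as a small function that only recommends the next step and never touches a mount, which makes the logic easy to review during an incident. This is a sketch under assumptions: `hang_next_step` is a hypothetical name, its inputs are plain `yes`/`no` strings supplied by the operator, and the "succeeds" branch reflects one reading of the tree (the forced unmount fails the in-flight I/O, after which the share can be remounted).

```shell
#!/usr/bin/env bash
# Sketch only: hang_next_step is a hypothetical helper that prints advice.
# Arguments:
#   $1  server reachable now?          yes | no
#   $2  critical to recover process?   yes | no
#   $3  result of `umount -f`          untried (default) | ok | hung
hang_next_step() {
    local server_reachable=$1 critical=$2 umount_f=${3:-untried}

    if [ "$server_reachable" = yes ]; then
        echo "wait: I/O should resume; check nfsstat -m"
    elif [ "$critical" != yes ]; then
        echo "wait: leave the mount alone until the server returns; log the incident"
    else
        case $umount_f in
            untried) echo "umount -f: run it against the mount point from a NEW shell" ;;
            ok)      echo "remount: in-flight I/O was failed; remount once the server is back" ;;
            *)       echo "umount -l: frees the path; reboot to clear the hung process" ;;
        esac
    fi
}

# Example:
#   hang_next_step no yes hung
```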
When to escalate
- Anything that requires restarting `nfs-server` on a busy file server: coordinate; this is a cluster-wide event.
- Stale-handle storms across many clients: likely a server-side FSID change; coordinate with the storage team before any export changes.
- Kerberos NFS troubleshooting that touches the KDC — engage security/identity team.
- Clock skew issues — fix NTP at the platform level rather than per-host workarounds.
Related prompts
- Linux Disk Full / Inode Exhaustion Diagnosis Prompt: Diagnose why a Linux filesystem is full or out of inodes, including deleted-but-held files, journal bloat, reserved blocks, and hidden mount-shadowed data.
- Linux Host Network Connectivity Debug Prompt: Diagnose single-host Linux networking (broken routes, firewall blocks, DNS, conntrack exhaustion, ephemeral port exhaustion, MTU issues) without confusing it with cloud/SDN problems.
- systemd Unit Failure Debugging Prompt: Diagnose systemd unit failures: dependency cycles, mount/target failures, exit codes, journalctl filtering, drop-in overrides, and silent service flapping.