You are a senior Linux storage engineer with deep experience operating NFS (NFSv3 and NFSv4, with or without Kerberos) and recovering clients from stuck mounts in production.

I will provide:
- The symptom (mount fails, mount hangs forever, "stale file handle," I/O hangs after working for hours, permission denied even though IDs match)
- NFS version (`v3`, `v4`, `v4.1`, `v4.2`)
- Security flavor (`sys`, `krb5`, `krb5i`, `krb5p`)
- Client and server distros and kernel versions
- The mount line (from `/etc/fstab` or `autofs` map) with full options
- Output of: `mount | grep <path>`, `nfsstat -m`, `dmesg | grep -i nfs`, `rpcinfo -p <server>`
- For hangs: `ps auxf | grep -E "[D] "` (D-state processes), and `cat /proc/<pid>/stack` for one stuck process

Your job:

1. **Classify the symptom**:
   - **Mount-time failure** → server reachable? portmapper running? export visible? auth flavor mismatch?
   - **Hung mount / D-state processes** → soft vs hard? `intr` deprecated; modern hangs need careful recovery
   - **Stale file handle (`ESTALE`)** → server-side file deleted/replaced while client held it; or server FH-database changed (e.g., FS recreated)
   - **Permission denied with matching UIDs** → idmapd / nfs4 uid mapping mismatch, or root_squash/no_root_squash
   - **Kerberos errors** → keytab, clock skew, principal name, `gssproxy` / `rpc.gssd`
   - **Performance / I/O wait** → mount options (`rsize`/`wsize`, `actimeo`), backend storage saturation, MTU
2. **Walk the path** from client to server:
   - Client kernel NFS module loaded?
   - `rpc.statd` (v3) or `nfsidmap` (v4)?
   - Network reachable to server on `2049/tcp` (NFSv4) or RPC-discovered ports (NFSv3)?
   - Server `exportfs -v` shows the expected export with the expected client mask?
   - For Kerberos: clock skew < 5 min? both have keytabs with `nfs/<host>` principals?
3. **For hung mounts**, explain the recovery options in safest-first order:
   - **Soft mounts** (`soft,timeo=`): I/O returns EIO after timeout. Process can recover.
   - **Hard mounts** (default): I/O blocks forever. Process is in D state and is `kill -9` resistant until the server returns OR the mount is force-unmounted.
   - `umount -f` (force, NFS-aware): asks the kernel to fail in-flight I/O. Sometimes works.
   - `umount -l` (lazy): detaches from the namespace; existing handles keep blocking. Use to free the path, not to recover the process.
   - Last resort: reboot.
4. **Decode mount options** the user has set, calling out dangerous combinations:
   - `soft` without `timeo=` / `retrans=`: fails too fast under transient blip
   - `hard,nointr` (legacy; on modern kernels `intr` is ignored anyway)
   - `noac`: disables attribute caching; severe performance hit; use only when required (shared write-heavy workloads)
   - `actimeo=0`: same problem; rarely needed
   - `rsize=`/`wsize=` mismatched to MTU: fragmentation; defaults usually best
   - `sec=sys` over the internet: credentials are unauthenticated; client UID is trusted
5. **For NFSv4 idmapping issues**: explain `/etc/idmapd.conf`, `Domain=`, and where the "nobody:nogroup" trap comes from (mismatched Domain).
6. **Mark DESTRUCTIVE actions**: force-umount of a mount with running writers, rebooting an NFS server with active clients, deleting server-side files with handles still held, restarting `nfs-server` mid-day.

---

Symptom: [DESCRIBE]
NFS version + sec: [v4.1, sys]
Client / Server distros: [e.g., Ubuntu 22.04 client, RHEL 8 server]

Mount line (`/etc/fstab` or autofs map):
```
[PASTE]
```

`mount | grep <mountpoint>`:
```
[PASTE]
```

`nfsstat -m`:
```
[PASTE]
```

`dmesg | grep -i nfs` (recent):
```
[PASTE]
```

`rpcinfo -p <server>`:
```
[PASTE]
```

Server-side: `exportfs -v` and `nfsstat -s` (if you have access):
```
[PASTE]
```

Hung process info (if applicable):
```
[PASTE: ps + /proc/<pid>/stack]
```

Why this prompt works

NFS failures are weird because they happen across two kernels (client and server) plus a network plus possibly Kerberos. “Stale file handle” sounds informative but is opaque to most engineers. Models tend to suggest “remount the share” as a panacea — useless if the mount is hung and processes are in D state. This prompt forces a layered walk and respects the irreversibility of certain NFS operations.

How to use it

State the NFS version. v3 and v4 are different protocols with different debugging tools (rpcinfo matters for v3; v4 needs only 2049/tcp).
Specify the sec= flavor. Kerberos adds a whole extra class of failure modes (KDC, keytab, clock skew).
For hangs, list the D-state PIDs and their stacks. cat /proc/<pid>/stack shows whether they’re stuck in rpc_wait_bit_killable (server unreachable) or nfs_wait_on_request (server reachable but slow or hung).
From the server side, include exportfs -v if you have access; export permission mismatches are a common cause of mount failures.
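
The stack check described above can be scripted for quick triage. The sketch below is only a rough helper: the function name `classify_nfs_stack` and its output labels are hypothetical, and the mapping from wait function to likely cause simply follows the rule of thumb in this section rather than any kernel guarantee.

```shell
# Hypothetical helper: read a /proc/<pid>/stack dump on stdin and print a
# coarse hint based on which kernel wait function appears in it.
# The body runs in a subshell so no variables leak into the caller.
classify_nfs_stack() (
  stack=$(cat)
  case $stack in
    *rpc_wait_bit_killable*) echo "rpc-wait: no RPC reply yet, server is probably unreachable" ;;
    *nfs_wait_on_request*)   echo "request-wait: server reachable but slow or hung" ;;
    *nfs*|*rpc*)             echo "other-nfs: in NFS/RPC code, inspect the stack by hand" ;;
    *)                       echo "not-nfs: no NFS or RPC frames found" ;;
  esac
)
```

On a live client this would be fed with something like `sudo cat /proc/<pid>/stack | classify_nfs_stack`. Treat the result as a hint for what to paste into the prompt, not as a diagnosis.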

Useful commands

Client side

# Mount info
mount | grep nfs
findmnt -t nfs,nfs4
nfsstat -m       # per-mount options, version, server

# Recent NFS kernel messages
dmesg -T | grep -i nfs
journalctl -k --since "1 hour ago" | grep -i nfs

# Test connectivity to server
rpcinfo -p <server>           # NFSv3 (lists all RPC services)
showmount -e <server>         # list exports (v3-style)
nc -vz <server> 2049          # NFSv4 main port
nc -vz <server> 111           # portmapper / rpcbind

# Hung process forensics
ps auxf | awk '$8 ~ /D/'      # all D-state procs
sudo cat /proc/<pid>/stack    # what kernel function it's waiting in
sudo cat /proc/<pid>/wchan    # short form

# Force unmount (last resort, see safety notes)
sudo umount -f /mnt/nfs       # force; signals NFS to fail in-flight
sudo umount -l /mnt/nfs       # lazy; detaches namespace but keeps handles

# NFSv4 idmapping
cat /etc/idmapd.conf
sudo nfsidmap -c              # clear idmap cache
ls -ln /mnt/nfs/somefile      # numeric UIDs (rules out idmap as cause)

# Kerberos NFS
klist                         # has user got a ticket?
sudo klist -k                 # host keytab
sudo systemctl status rpc-gssd
sudo journalctl -u rpc-gssd --since "1 hour ago"
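
To gather the hung-process details in one pass, the D-state filter and the stack dump can be chained. This is a minimal sketch: `d_state_pids` is a hypothetical helper name, and it assumes the procps `ps -eo pid=,stat=,comm=` column layout (PID, state, command).

```shell
# Hypothetical helper: read "PID STAT COMMAND" lines on stdin (the layout of
# `ps -eo pid=,stat=,comm=`) and print the PIDs whose state starts with D.
d_state_pids() {
  awk '$2 ~ /^D/ { print $1 }'
}

# Example use on a live client (commented out because it needs root):
# ps -eo pid=,stat=,comm= | d_state_pids | while read -r pid; do
#   echo "== $pid =="
#   sudo cat "/proc/$pid/stack"
# done
```

Matching on `^D` rather than a bare `D` keeps states such as `D+` and `Ds` while ignoring any other column that happens to contain the letter.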

Server side

# Exports
sudo exportfs -v
sudo cat /etc/exports
sudo exportfs -r              # reload after editing /etc/exports (safe)

# Statistics
sudo nfsstat -s               # server-side counters
sudo nfsstat -o nfs           # per-version operation counters

# Who's connected (v4): established TCP sessions to the local NFS port
sudo ss -tn state established '( sport = :2049 )'

# Kerberos
sudo klist -k                 # service keytab
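
When comparing the client mask shown by `exportfs -v` with a client's address, it helps to check the arithmetic rather than eyeball it. The sketch below covers only IPv4 CIDR masks (exports can also use hostnames, wildcards, and netgroups, which it ignores). The function names `ip_to_int` and `ip_in_cidr` are hypothetical.

```shell
# Hypothetical helpers, IPv4 only; octets must not have leading zeros.

# Convert a dotted-quad address to a 32-bit integer.
ip_to_int() (
  IFS=.
  set -- $1            # split the address into four positional parameters
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeed (exit 0) if address $1 lies inside CIDR block $2, e.g. 192.168.1.0/24.
# A bare address without /bits is treated as a /32.
ip_in_cidr() (
  case $2 in
    */*) net=${2%/*}; bits=${2#*/} ;;
    *)   net=$2;      bits=32 ;;
  esac
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
)
```

For example, `ip_in_cidr 192.168.2.57 192.168.1.0/24 || echo "client is outside the export mask"` would flag a client that sits one subnet over from the exported range.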

Common findings this catches

Mount hangs at boot → _netdev is missing from the fstab options, so the mount is attempted before the network is up. Add _netdev and x-systemd.requires=network-online.target.
Mount fails with mount.nfs: access denied by server → the export's client pattern doesn’t include this client’s IP. Compare the server's /etc/exports and exportfs -v output against the client address.
NFSv4 mount shows nobody:nogroup for every file → idmapd Domain= mismatch between client and server, or NFSv4 ID mapping is not active.
ESTALE storm after server restart → the file handles the clients hold no longer match what the server issues, typically because the export's FSID changed. Clients must umount and mount again; pin the FSID with fsid= in /etc/exports to prevent a repeat.
Hung writes after working for hours → server-side problem (OOM, or paused or saturated storage). A hard-mounted client blocks indefinitely; check the server.
Kerberos error: Server not found in Kerberos database → the nfs/<host> service principal is missing from the KDC or the keytab, or the client resolves the server under a different hostname. Test with kinit -k from the client.
actimeo=0 floods the server with attribute lookups → revert to the default attribute-cache timeouts (3-60 seconds) unless you have a known shared-write requirement.
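
Several of these findings can be caught before mounting by linting the option string. The sketch below is a hypothetical helper (`check_nfs_opts` is not part of any NFS package) that encodes only the option warnings named in this document; it is not an exhaustive or authoritative check.

```shell
# Hypothetical lint: print one warning per risky pattern in a comma-separated
# NFS option string (the fourth fstab field). No output means nothing flagged.
check_nfs_opts() (
  opts=",$1,"          # pad with commas so every option matches as ,opt,
  case $opts in
    *,_netdev,*) ;;
    *) echo "missing _netdev: mount may be attempted before the network is up" ;;
  esac
  case $opts in
    *,soft,*)
      case $opts in *,timeo=*) ;; *) echo "soft without explicit timeo=" ;; esac
      case $opts in *,retrans=*) ;; *) echo "soft without explicit retrans=" ;; esac
      ;;
  esac
  case $opts in
    *,noac,*|*,actimeo=0,*) echo "attribute caching disabled (noac or actimeo=0)" ;;
  esac
  case $opts in
    *,intr,*|*,nointr,*) echo "intr/nointr is legacy and ignored on modern kernels" ;;
  esac
)
```

Calling `check_nfs_opts 'rw,soft,actimeo=0'` prints four warnings (missing `_netdev`, `soft` without `timeo=`, `soft` without `retrans=`, and disabled attribute caching).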

Recommended baseline mount options

For most internal workloads on a stable LAN, use NFSv4.2 over TCP, hard-mounted, with otherwise default options:

<server>:/export  /mnt/data  nfs4  _netdev,rw,hard,nodev,nosuid,nfsvers=4.2,proto=tcp,sec=sys,bg  0  0

For internet-spanning NFS or unreliable links, consider:

<server>:/export  /mnt/data  nfs4  _netdev,rw,soft,timeo=600,retrans=2,nfsvers=4.2,proto=tcp,sec=krb5p,bg  0  0

For automounter-managed mounts, prefer autofs with the same option set and --timeout=60 so unused mounts unmount cleanly.

Hang recovery decision tree

Process in D state on hard NFS mount
├── Server reachable now? (ping, nc -vz <server> 2049)
│   ├── Yes  → I/O should resume; wait briefly, check `nfsstat -m` retries
│   └── No   → continue
│
├── Critical to recover process?
│   ├── No   → wait for server; log incident
│   └── Yes  → try `umount -f /mnt/<path>` from a NEW shell
│             ├── Succeeds → in-flight I/O fails with an error; remount once the server is back
│             └── Hangs   → `umount -l` frees the path; reboot to recover hung process
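
The tree can also be written down as a small function for a runbook. This is a sketch under stated assumptions: `next_step` and its argument convention are hypothetical, and the "succeeds" outcome reflects one reading of the tree (the forced unmount fails in-flight I/O, after which the share is mounted again).

```shell
# Hypothetical encoding of the decision tree.
# Usage: next_step <server_reachable: yes|no> <process_critical: yes|no> [umount_f: ok|hung]
next_step() {
  if [ "$1" = yes ]; then
    echo "wait: I/O should resume; watch retries in nfsstat -m"
  elif [ "$2" != yes ]; then
    echo "wait: let the server come back; log the incident"
  else
    case ${3:-untried} in
      untried) echo "force: run umount -f on the mount point from a new shell" ;;
      ok)      echo "remount: in-flight I/O has been failed; mount again once the server is back" ;;
      hung)    echo "lazy: umount -l frees the path; reboot to clear the hung process" ;;
      *)       echo "error: third argument must be ok or hung"; return 2 ;;
    esac
  fi
}
```

For instance, `next_step no yes hung` prints the lazy-unmount-then-reboot step, the last resort in the tree.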

When to escalate

Anything that requires restarting nfs-server on a busy file server — coordinate; this is a cluster-wide event.
Stale-handle storms across many clients — likely server-side FSID change; coordinate with storage team before any export changes.
Kerberos NFS troubleshooting that touches the KDC — engage security/identity team.
Clock skew issues — fix NTP at the platform level rather than per-host workarounds.

NFS Mount Troubleshooting Prompt

Related prompts

Linux Disk Full / Inode Exhaustion Diagnosis Prompt

Linux Host Network Connectivity Debug Prompt

systemd Unit Failure Debugging Prompt
