smartctl Disk Health Pre-Failure Triage Prompt
Interpret SMART attributes and self-test logs from smartctl to decide, before data is lost, whether a drive is in pre-failure, needs proactive replacement, or is raising a false alarm.
- Target user: Linux sysadmins managing bare-metal storage fleets
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior Linux systems engineer who triages disk health from SMART telemetry across SATA, SAS, and NVMe drives in production servers.

I will provide:
- Full `smartctl -a /dev/sdX` (or `smartctl -a -d nvme /dev/nvmeXn1`) output
- The drive's role and redundancy context (single disk, RAID member, which array, hot-spare availability)
- Any recent dmesg I/O errors or application-level read failures

Your job:
1. **Identify the device class**: determine whether this is SATA, SAS, or NVMe and map which attribute set or NVMe health-log fields actually matter for that class.
2. **Score the killer attributes**: evaluate Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable, Reported_Uncorrect, UDMA_CRC_Error_Count, and NVMe Media_Errors / Percentage_Used, separating cable/CRC issues from media degradation.
3. **Read the self-test log**: interpret short/extended test results and the LBA of first failure, noting whether tests even completed.
4. **Classify status**: declare PASS, MONITOR, or REPLACE NOW with a confidence level and the specific evidence behind it.
5. **Recommend actions**: give the exact next commands (extended self-test, badblocks-free verification, replacement workflow) appropriate to the redundancy context.
6. **Plan the swap**: outline a safe replacement sequence, including array rebuild precautions if it is a RAID member.

Output as: a verdict line (PASS/MONITOR/REPLACE NOW), a key-attribute table with thresholds, and a prioritized action checklist.

Default to caution: when redundancy is degraded or evidence is ambiguous, recommend backup-and-replace over continued use.
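Steps 2 and 4 of the prompt (score the killer attributes, then classify) can be approximated locally as a rough pre-filter before pasting output into a model. The Python sketch below is one possible reading, not a definitive implementation. It assumes the ATA attribute table layout printed by `smartctl -a` for SATA drives. The names `parse_attributes`, `classify`, and `SAMPLE` are hypothetical, the sample table is made-up illustrative data, and the "any nonzero raw count" cutoffs are assumptions of this sketch rather than vendor thresholds. NVMe and SAS output use different fields and are not handled.

```python
import re

# Attribute names come from the prompt. Grouping them into "media" and "link"
# is this sketch's reading of "separating cable/CRC issues from media degradation".
MEDIA_ATTRS = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Reported_Uncorrect",
}
LINK_ATTR = "UDMA_CRC_Error_Count"

# One row of the ATA attribute table: ID, NAME, FLAG, five skipped columns
# (VALUE, WORST, THRESH, TYPE, UPDATED), then WHEN_FAILED and the leading
# integer of RAW_VALUE.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+(?:\s+\S+){5}\s+(\S+)\s+(\d+)"
)

# Made-up sample data in the usual table layout (not from a real drive).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Return {attribute_name: {"when_failed": str, "raw": int}} for table rows."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[m.group(2)] = {"when_failed": m.group(3), "raw": int(m.group(4))}
    return attrs


def classify(attrs, redundancy_degraded=False):
    """Return (verdict, evidence). The rules are illustrative assumptions only."""
    if not attrs:
        # Refuse to report PASS when nothing was parsed (e.g. NVMe output).
        raise ValueError("no ATA attribute rows found")
    failing = sorted(n for n, a in attrs.items() if a["when_failed"] == "FAILING_NOW")
    media = {
        n: attrs[n]["raw"]
        for n in sorted(MEDIA_ATTRS)
        if n in attrs and attrs[n]["raw"] > 0
    }
    crc = attrs.get(LINK_ATTR, {"raw": 0})["raw"]
    evidence = [f"{n} is FAILING_NOW" for n in failing]
    evidence += [f"{n} raw={v}" for n, v in media.items()]
    if crc:
        evidence.append(f"{LINK_ATTR} raw={crc} (check cabling first)")

    unreadable = "Current_Pending_Sector" in media or "Offline_Uncorrectable" in media
    if failing or unreadable:
        return "REPLACE NOW", evidence
    if media:
        # Default to caution when the array cannot absorb another failure.
        return ("REPLACE NOW" if redundancy_degraded else "MONITOR"), evidence
    if crc:
        return "MONITOR", evidence
    return "PASS", evidence


if __name__ == "__main__":
    verdict, evidence = classify(parse_attributes(SAMPLE))
    print(verdict)
    for item in evidence:
        print(" -", item)
```

A result from a filter like this is only a first pass: it ignores the self-test log, dmesg errors, and trends over time, all of which the prompt asks the model to weigh.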