
Linux mdraid Software RAID Recovery Prompt

Recover from degraded or failed mdraid arrays — failed disk, missing member, resync stuck, replacing drives without losing data.

Target user
Linux sysadmins managing software RAID
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior Linux storage engineer who has recovered countless mdraid arrays — RAID 1/5/6/10 — through disk failures, controller resets, and accidental member removal. You know that the wrong command turns a recoverable degraded array into permanent data loss.

I will provide:
- The array state (`cat /proc/mdstat`)
- `mdadm --detail /dev/md<N>`
- `mdadm --examine /dev/<member>` on each suspected good and bad member
- Recent output of `dmesg | grep -E "md/|md:"`
- SMART status on members (`smartctl -a /dev/<dev>`)
- The symptom and what I've already tried

Your job:

1. **Identify the array state** from `mdstat`:
   - `[UU]` = both members up (RAID1)
   - `[U_]` = one member missing/failed (degraded but operational)
   - `[__]` = both members missing (down)
   - Resync rate, percentage, ETA
   - Bitmap status (internal bitmap accelerates resync)
2. **For a failed member**:
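   - Confirm with `mdadm --detail` and `mdadm --examine`. A member already marked faulty should be `mdadm --remove`d; a failing member that is still active must be marked with `mdadm --fail` first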
   - Replace the disk physically (or attach a new one)
   - Partition the new disk identically (`sfdisk -d /dev/old | sfdisk /dev/new`)
   - `mdadm --add /dev/md<N> /dev/<newdev>` → triggers resync
   - For RAID5/6: longer resync; one more failure during resync = data loss (RAID5) or further degraded (RAID6)
3. **For "Event count mismatch"** on examine:
   - All members have an event count; assembly picks the highest-event group
   - Stale members (low event) can be re-added with `--re-add` (uses bitmap, fast) or `--add` (full resync)
   - **NEVER `mdadm --create`** on existing members "to fix" — this writes new metadata and is functionally reformatting
4. **For totally-down array (won't assemble)**:
   - `mdadm --assemble --scan` to try auto
   - `mdadm --assemble --force /dev/md<N> /dev/<m1> /dev/<m2>` — force assembly tolerating event mismatch. Risky.
   - If --force succeeds, the array is degraded; let it resync but capture data first if possible
   - **Last resort**: `mdadm --create --assume-clean ...` — but ONLY with exactly the same order, layout, chunk size, metadata version. Get this wrong and data is gone.
5. **For RAID resync stuck or slow**:
   - `cat /proc/sys/dev/raid/speed_limit_min` (default 1000 kB/s — too low for modern disks)
   - Raise with `echo 50000 > /proc/sys/dev/raid/speed_limit_min`
   - `speed_limit_max` caps the max; raise if storage can handle
6. **Pre-failure detection**:
   - SMART `Reallocated_Sector_Ct`, `Current_Pending_Sector` > 0 — disk failing
   - `dmesg` showing block read errors on a single device — pre-failure
   - Schedule replacement before mdadm flags the disk
7. **For RAID 5/6 with concurrent failures**: stop, get a disk image (`ddrescue`), THEN attempt force-assembly with the image — don't risk further failure on the original.

Mark DESTRUCTIVE clearly: `--create` on existing array, `--zero-superblock` on a member, write operations during a force-assembled degraded array, partitioning the wrong disk.

---

Array: [e.g., /dev/md0 — RAID1, 2 members; /dev/md1 — RAID5, 4 members]
Symptom: [DESCRIBE]
`cat /proc/mdstat`:
```
[PASTE]
```
`mdadm --detail /dev/md<N>`:
```
[PASTE]
```
`mdadm --examine /dev/<each-member>`:
```
[PASTE]
```
Recent dmesg (md / block errors):
```
[PASTE]
```
SMART status: [PASTE relevant counters]

Why this prompt works

mdraid recovery is high-stakes: the difference between --assemble (recovery) and --create (data loss) is one word, and forums sometimes recommend the wrong one. This prompt enforces a state-inventory-first walkthrough.

How to use it

  1. Capture state with read-only commands first. mdstat, --detail, --examine. No writes.
  2. Image failing disks BEFORE further operations. Use ddrescue to copy onto a healthy replacement, then operate on the copy.
  3. Confirm device names every time. lsblk -f then re-confirm before any partition or assemble command.
  4. Slow down. Many recoveries are botched by hurrying.
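
A minimal sketch of step 1, assuming a Bash shell with `mdadm` and `lsblk` installed: it saves the output of read-only commands into a directory so the state can be pasted into the prompt later. The function name `capture_md_state` and the output file names are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Snapshot mdraid state using read-only commands only.
# Hypothetical helper: the name and the output file layout are assumptions.
capture_md_state() {
    local outdir="$1"
    mkdir -p "$outdir" || return 1

    # Save one command's stdout and stderr to <outdir>/<name>.txt.
    # A failing or missing command is recorded in the file, not treated as fatal.
    _md_save() {
        local name="$1"; shift
        "$@" > "$outdir/$name.txt" 2>&1 || true
    }

    _md_save mdstat        cat /proc/mdstat
    _md_save lsblk         lsblk -f
    _md_save detail-scan   mdadm --detail --scan
    _md_save examine-scan  mdadm --examine --scan
    _md_save dmesg-md      sh -c 'dmesg | grep -E "md/|md:|I/O error" | tail -n 200'
}
```

Called from a root shell as `capture_md_state /root/md-snapshot`, it leaves one text file per command. Per-member `mdadm --examine /dev/<member>` output still has to be collected by hand once the member names are confirmed.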

Useful commands

# Status (read-only — safe)
cat /proc/mdstat
mdadm --detail /dev/md<N>
mdadm --examine /dev/<member>
mdadm --examine --scan
lsblk -f                              # confirm devices

# SMART
sudo smartctl -a /dev/sda | grep -E "Reallocated|Pending|Offline_Unc|Power_On_Hours"
sudo smartctl -t short /dev/sda       # initiate short self-test

# Block errors in dmesg
dmesg | grep -E "I/O error|sector|md/raid|md:" | head -50

# Assemble (normal start; add --readonly to start the array read-only, with no resync)
sudo mdadm --assemble --scan
sudo mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1

# Force-assemble (event mismatch — use carefully)
sudo mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1

# Add/remove members
sudo mdadm --fail /dev/md0 /dev/sdb1               # mark as failed
sudo mdadm --remove /dev/md0 /dev/sdb1             # remove from array
sudo mdadm --add /dev/md0 /dev/sdc1                # add replacement (triggers resync)
sudo mdadm --re-add /dev/md0 /dev/sdb1             # if member was briefly missing (uses bitmap)

# Resync speed
cat /proc/sys/dev/raid/speed_limit_min             # default 1000 — too low
cat /proc/sys/dev/raid/speed_limit_max
echo 50000 | sudo tee /proc/sys/dev/raid/speed_limit_min
echo 200000 | sudo tee /proc/sys/dev/raid/speed_limit_max

# Stop array (when intact)
sudo umount /mnt/<path>
sudo mdadm --stop /dev/md0

# Replace disk procedure
sudo smartctl -a /dev/sdb                          # verify before
sudo mdadm --fail /dev/md0 /dev/sdb1
sudo mdadm --remove /dev/md0 /dev/sdb1
# Physically replace, then:
sudo sfdisk -d /dev/sda | sudo sfdisk /dev/sdc     # clone partition table from healthy member (sda)
sudo mdadm --add /dev/md0 /dev/sdc1                # resync starts

# Image a failing disk (BEFORE further ops)
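sudo apt install gddrescue                         # Debian/Ubuntu package name; provides ddrescue
sudo ddrescue -f -d -r3 /dev/<failing> /dev/<replacement> ddrescue.log   # -f is required when the output is a block device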
# Then operate on /dev/<replacement>

# Bitmap (accelerates re-add)
sudo mdadm --grow /dev/md0 --bitmap=internal       # add internal bitmap
sudo mdadm --grow /dev/md0 --bitmap=none           # remove (faster I/O, slower recovery)
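
The event-count comparison that the prompt relies on can be done before pasting anything. The sketch below assumes v1.x metadata, where `mdadm --examine` prints a `/dev/...:` header line followed by an `Events : N` line for each member. `event_counts` is a hypothetical helper name, not part of mdadm.

```shell
#!/usr/bin/env bash
# Read concatenated `mdadm --examine` output on stdin and print
# "<device> <events> <current|STALE>" for each member, sorted by device.
# Assumes the v1.x layout: a "/dev/xxx:" header, then an "Events : N" line.
event_counts() {
    awk '
        # Remember which member the following lines belong to.
        /^\/dev\//                  { dev = $1; sub(/:$/, "", dev) }
        # Record the event count and track the highest one seen.
        $1 == "Events" && $2 == ":" { ev[dev] = $3; if ($3 + 0 > max) max = $3 + 0 }
        END {
            for (d in ev)
                printf "%s %s %s\n", d, ev[d], (ev[d] + 0 < max ? "STALE" : "current")
        }
    ' | sort
}
```

A possible invocation, with illustrative member names: `for m in /dev/sda1 /dev/sdb1; do sudo mdadm --examine "$m"; done | event_counts`. Members flagged `STALE` are the candidates for `--re-add` or `--add`; the output is only a reading aid and does not decide which command is safe.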

Replace-failing-disk procedure (RAID1, online)

# 1. Identify failing disk (SMART or dmesg)
sudo smartctl -a /dev/sdb | grep -E "Reallocated|Pending"

# 2. Mark failed
sudo mdadm --manage /dev/md0 --fail /dev/sdb1

# 3. Remove from array
sudo mdadm --manage /dev/md0 --remove /dev/sdb1

# 4. (Schedule physical swap; for hot-swap, pull /dev/sdb)

# 5. Confirm new disk appears (e.g., /dev/sdc)
lsblk

# 6. Clone partition table from healthy member
sudo sfdisk -d /dev/sda | sudo sfdisk /dev/sdc

# 7. Add new member
sudo mdadm --manage /dev/md0 --add /dev/sdc1

# 8. Watch resync
watch cat /proc/mdstat
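
`watch` in step 8 needs someone at the terminal. As a sketch of a scriptable alternative, the hypothetical `md_status` function below reduces `/proc/mdstat` text to one line per array: name, member map, and any running recovery, resync, reshape, or check percentage. It assumes the usual mdstat layout, in which the member map such as `[U_]` follows the array line.

```shell
#!/usr/bin/env bash
# Summarise /proc/mdstat-style text from stdin as "<array> <member map> <progress>".
# Hypothetical helper; progress is "idle" when no percentage is reported.
md_status() {
    awk '
        # Array header line, e.g. "md0 : active raid1 sdc1[2] sda1[0]".
        /^md[^ ]* : / { name = $1; order[++n] = name; state[name] = "?"; prog[name] = "idle" }
        # Member map, e.g. "[U_]" or "[UUUU]".
        name != "" && match($0, /\[[U_]+\]/) { state[name] = substr($0, RSTART, RLENGTH) }
        # Progress, e.g. "recovery = 12.6%".
        name != "" && match($0, /(recovery|resync|reshape|check) *= *[0-9.]+%/) {
            prog[name] = substr($0, RSTART, RLENGTH)
            gsub(/ /, "", prog[name])
        }
        END { for (i = 1; i <= n; i++) print order[i], state[order[i]], prog[order[i]] }
    '
}
```

Run as `md_status < /proc/mdstat`. Piping the result through `grep -E '\[[U_]*_[U_]*\]'` lists only the arrays that still have a missing member.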

Recover a force-assembled array

# After --assemble --force succeeds with event mismatch:
# 1. Filesystem check (read-only first)
sudo fsck.ext4 -nf /dev/md0      # ext4 read-only check
sudo xfs_repair -n /dev/md0      # XFS read-only check

# 2. If checks pass, mount read-only and COPY data immediately
sudo mount -o ro /dev/md0 /mnt/recovery
sudo rsync -aHAX /mnt/recovery/ /backup/      # root is needed to read every file and keep ownership and xattrs

# 3. Only after backup, allow writes
sudo umount /mnt/recovery
sudo mount /dev/md0 /mnt/data
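
Before the copy in step 2, it can be worth confirming that the mount really is read-only. The sketch below checks the options column of `/proc/mounts`; `mount_is_ro` is a hypothetical helper, and it assumes the mount point contains no spaces (the kernel escapes those as `\040`).

```shell
#!/usr/bin/env bash
# Succeed (exit 0) only if <mountpoint> is listed with the "ro" option.
# The optional second argument is a mounts table; it defaults to /proc/mounts.
mount_is_ro() {
    local mnt="$1" table="${2:-/proc/mounts}"
    awk -v m="$mnt" '
        # Field 2 is the mount point, field 4 the comma-separated options.
        # With stacked mounts the last matching entry wins.
        $2 == m {
            found = 1; ro = 0
            n = split($4, opt, ",")
            for (i = 1; i <= n; i++) if (opt[i] == "ro") ro = 1
        }
        END { exit (found && ro) ? 0 : 1 }
    ' "$table"
}
```

For example, `mount_is_ro /mnt/recovery && sudo rsync -aHAX /mnt/recovery/ /backup/` runs the copy only when the check passes.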

Common findings this catches

  • [U_] for weeks with no replacement — degraded RAID1 has no redundancy. Replace the failed member ASAP.
  • All members show Faulty in --examine but disks are physically OK — likely controller reset; reboot, try --assemble.
  • Resync at 1000 kB/s on modern disks: that is the speed_limit_min default; raise to 50000+.
  • RAID5 with two failures simultaneously → image both disks with ddrescue, then attempt assembly from images. Don’t risk further wear on originals.
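  • “mismatch_cnt” non-zero after scrub → copies or parity disagree somewhere. On RAID1/10 this can be benign (for example, swap or data modified while in flight), and md cannot tell which copy is correct; investigate disk health.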
  • mdraid superblock 0.90 (legacy) in an old array: limited to 28 member devices and, for RAID levels 1 and above, 2 TB per member. Upgrade carefully.
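
The pre-failure counters named in the prompt can be screened mechanically when investigating disk health. This is a sketch, assuming the ATA attribute table printed by `smartctl -A`, where the raw value is the tenth column; NVMe and SAS devices report health differently and are not covered. `smart_warnings` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Read `smartctl -A` output on stdin and print a WARN line for each
# pre-failure counter whose raw value is above zero.
# Assumes the ATA attribute table layout (RAW_VALUE in column 10).
smart_warnings() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 {
            printf "WARN %s raw=%s\n", $2, $10
        }
    '
}
```

A possible invocation is `sudo smartctl -A /dev/sdb | smart_warnings`. No output means none of the three counters is above zero; it does not prove the disk is healthy.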

When to escalate

  • Multi-disk failure on a critical RAID5/6 — call a data-recovery service before attempting more operations.
  • Hardware controller errors in dmesg — replace controller before further trust in the array.
  • mdraid + LUKS + LVM stack failing — recover bottom-up (mdraid first, then LUKS, then LVM); each layer’s own debug applies.
  • Suspected silent corruption (mismatch_cnt rising) — engage storage team; consider migrating to a checksumming FS (btrfs/ZFS) for future.
