You are a senior Linux storage engineer who has recovered countless mdraid arrays (RAID 1/5/6/10) through disk failures, controller resets, and accidental member removal. You know that the wrong command turns a recoverable degraded array into permanent data loss.

I will provide:
- The array state (`cat /proc/mdstat`)
- `mdadm --detail /dev/md<N>`
- `mdadm --examine /dev/<member>` on each suspected good and bad member
- Recent `dmesg | grep -E "md/|md:"` output
- SMART status on members (`smartctl -a /dev/<dev>`)
- The symptom and what I've already tried

Your job:

1. **Identify the array state** from `mdstat`:
   - `[UU]` = both members up (RAID1)
   - `[U_]` = one member missing/failed (degraded but operational)
   - `[__]` = both members missing (down)
   - Resync rate, percentage, ETA
   - Bitmap status (internal bitmap accelerates resync)
2. **For a failed member**:
   - Confirm with `mdadm --examine`. A "Faulty" or missing member should be `mdadm --remove`d
   - Replace the disk physically (or attach a new one)
   - Partition the new disk identically (`sfdisk -d /dev/old | sfdisk /dev/new`)
   - `mdadm --add /dev/md<N> /dev/<newdev>` → triggers resync
   - For RAID5/6: longer resync; one more failure during resync = data loss (RAID5) or further degraded (RAID6)
3. **For "Event count mismatch"** on examine:
   - All members have an event count; assembly picks the highest-event group
   - Stale members (low event) can be re-added with `--re-add` (uses bitmap, fast) or `--add` (full resync)
   - **NEVER `mdadm --create`** on existing members "to fix": this writes new metadata and is functionally reformatting
4. **For totally-down array (won't assemble)**:
   - `mdadm --assemble --scan` to try auto
   - `mdadm --assemble --force /dev/md<N> /dev/<m1> /dev/<m2>`: force assembly tolerating event mismatch. Risky.
   - If `--force` succeeds, the array is degraded; let it resync but capture data first if possible
   - **Last resort**: `mdadm --create --assume-clean ...`, but ONLY with exactly the same order, layout, chunk size, metadata version. Get this wrong and data is gone.
5. **For RAID resync stuck or slow**:
   - `cat /proc/sys/dev/raid/speed_limit_min` (default 1000 kB/s, too low for modern disks)
   - Raise with `echo 50000 > /proc/sys/dev/raid/speed_limit_min`
   - `speed_limit_max` caps the max; raise if storage can handle it
6. **Pre-failure detection**:
   - SMART `Reallocated_Sector_Ct`, `Current_Pending_Sector` > 0: disk failing
   - `dmesg` showing block read errors on a single device: pre-failure
   - Schedule replacement before mdadm flags the disk
7. **For RAID 5/6 with concurrent failures**: stop, get a disk image (`ddrescue`), THEN attempt force-assembly with the image. Don't risk further failure on the original.

Mark DESTRUCTIVE clearly: `--create` on existing array, `--zero-superblock` on a member, write operations during a force-assembled degraded array, partitioning the wrong disk.

---

Array: [e.g., /dev/md0: RAID1, 2 members; /dev/md1: RAID5, 4 members]
Symptom: [DESCRIBE]

`cat /proc/mdstat`:
```
[PASTE]
```

`mdadm --detail /dev/md<N>`:
```
[PASTE]
```

`mdadm --examine /dev/<each-member>`:
```
[PASTE]
```

Recent dmesg (md / block errors):
```
[PASTE]
```

SMART status: [PASTE relevant counters]

Why this prompt works

mdraid recovery is high-stakes: the difference between --assemble (recovery) and --create (data loss) is one word, and forums sometimes recommend the wrong one. This prompt enforces a state-inventory-first walkthrough.

How to use it

Capture state with read-only commands first. mdstat, --detail, --examine. No writes.
Image failing disks BEFORE further operations — ddrescue to a healthy replacement, then operate on the image.
Confirm device names every time. lsblk -f then re-confirm before any partition or assemble command.
Slow down. Many recoveries are botched by hurrying.
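
The first step above (capture state with read-only commands) can be scripted so that nothing is typed by hand under pressure. The following is a minimal sketch, not a definitive tool: the function name `capture_state` and the output file names are assumptions made for illustration, and the member devices in the example are placeholders.

```shell
# Sketch only: capture_state and its file names are illustrative, not part of mdadm.
capture_state() {
    cs_outdir="$1"
    shift
    mkdir -p "$cs_outdir" || return 1

    # Every command below only reads state; failures are logged to the file, not fatal.
    cat /proc/mdstat       > "$cs_outdir/mdstat.txt"       2>&1 || true
    lsblk -f               > "$cs_outdir/lsblk.txt"        2>&1 || true
    mdadm --examine --scan > "$cs_outdir/examine-scan.txt" 2>&1 || true

    # Remaining arguments are member devices to examine one by one.
    for cs_dev in "$@"; do
        # /dev/sda1 becomes examine-dev-sda1.txt
        cs_name=$(printf '%s' "$cs_dev" | tr '/' '-')
        mdadm --examine "$cs_dev" > "$cs_outdir/examine${cs_name}.txt" 2>&1 || true
    done
}

# Example (as root, so mdadm can read the superblocks):
#   capture_state /root/raid-state /dev/sda1 /dev/sdb1
```

Writing the snapshot to a disk that is not part of the affected array keeps the capture itself from touching the members.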

Useful commands

# Status (read-only — safe)
cat /proc/mdstat
mdadm --detail /dev/md<N>
mdadm --examine /dev/<member>
mdadm --examine --scan
lsblk -f                              # confirm devices

# SMART
sudo smartctl -a /dev/sda | grep -E "Reallocated|Pending|Offline_Unc|Power_On_Hours"
sudo smartctl -t short /dev/sda       # initiate short self-test

# Block errors in dmesg
dmesg | grep -E "I/O error|sector|md/raid|md:" | head -50

# Assemble (NOT read-only by default: a normal assemble can start a resync; add --readonly to block writes)
sudo mdadm --assemble --scan
sudo mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1

# Force-assemble (event mismatch — use carefully)
sudo mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1

# Add/remove members
sudo mdadm --fail /dev/md0 /dev/sdb1               # mark as failed
sudo mdadm --remove /dev/md0 /dev/sdb1             # remove from array
sudo mdadm --add /dev/md0 /dev/sdc1                # add replacement (triggers resync)
sudo mdadm --re-add /dev/md0 /dev/sdb1             # if member was briefly missing (uses bitmap)

# Resync speed
cat /proc/sys/dev/raid/speed_limit_min             # default 1000 — too low
cat /proc/sys/dev/raid/speed_limit_max
echo 50000 | sudo tee /proc/sys/dev/raid/speed_limit_min
echo 200000 | sudo tee /proc/sys/dev/raid/speed_limit_max

# Stop array (when intact)
sudo umount /mnt/<path>
sudo mdadm --stop /dev/md0

# Replace disk procedure
sudo smartctl -a /dev/sdb                          # verify before
sudo mdadm --fail /dev/md0 /dev/sdb1
sudo mdadm --remove /dev/md0 /dev/sdb1
# Physically replace, then:
sudo sfdisk -d /dev/sda | sudo sfdisk /dev/sdc     # clone partition table from healthy member (sda)
sudo mdadm --add /dev/md0 /dev/sdc1                # resync starts

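# Image a failing disk (BEFORE further ops)
sudo apt install gddrescue                         # Debian/Ubuntu package name
# -f is required when the output is a block device; it OVERWRITES /dev/<replacement> (DESTRUCTIVE)
sudo ddrescue -f -d -r3 /dev/<failing> /dev/<replacement> ddrescue.log
# Then operate on /dev/<replacement>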

# Bitmap (accelerates re-add)
sudo mdadm --grow /dev/md0 --bitmap=internal       # add internal bitmap
sudo mdadm --grow /dev/md0 --bitmap=none           # remove (faster I/O, slower recovery)
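
The SMART grep above still leaves the reading of the numbers to a human. As a rough sketch, the filter below prints only the pre-failure counters whose raw value is non-zero. `smart_warnings` is a hypothetical helper name, and it assumes the classic ATA attribute table printed by `smartctl -A`, where the raw value is the tenth column; NVMe and SAS drives report health in a different format and are not covered.

```shell
# Sketch only: smart_warnings is an illustrative helper, not part of smartmontools.
# Reads `smartctl -A` output on stdin; prints "ATTRIBUTE RAW_VALUE" for non-zero counters.
smart_warnings() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && ($10 + 0) > 0 {
            print $2, $10
        }
    '
}

# Example:
#   sudo smartctl -A /dev/sda | smart_warnings
```

Empty output means none of the three counters is above zero; it does not prove the disk is healthy.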

Replace-failing-disk procedure (RAID1, online)

# 1. Identify failing disk (SMART or dmesg)
sudo smartctl -a /dev/sdb | grep -E "Reallocated|Pending"

# 2. Mark failed
sudo mdadm --manage /dev/md0 --fail /dev/sdb1

# 3. Remove from array
sudo mdadm --manage /dev/md0 --remove /dev/sdb1

# 4. (Schedule physical swap; for hot-swap, pull /dev/sdb)

# 5. Confirm new disk appears (e.g., /dev/sdc)
lsblk

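# 6. Clone partition table from healthy member
#    (on GPT this also copies the disk and partition GUIDs; sgdisk -G /dev/sdc randomizes them afterwards)
sudo sfdisk -d /dev/sda | sudo sfdisk /dev/sdc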

# 7. Add new member
sudo mdadm --manage /dev/md0 --add /dev/sdc1

# 8. Watch resync
watch cat /proc/mdstat
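
For step 8, a script or alert may need the resync progress as a single line rather than a `watch` screen. The sketch below pulls the operation, percentage, and ETA out of mdstat-formatted text. `resync_status` is a hypothetical name, and the parsing assumes the usual progress line layout (`recovery =  8.5% (...) finish=120.3min speed=...`); it prints nothing when no resync is running.

```shell
# Sketch only: resync_status is an illustrative helper.
# Reads /proc/mdstat (or a file given as $1) and prints "ARRAY OPERATION PERCENT ETA".
resync_status() {
    awk '
        /^md[^ ]* :/ { array = $1 }
        / (resync|recovery|reshape|check) =/ {
            op = ""; pct = ""; eta = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^(resync|recovery|reshape|check)$/) op = $i
                else if ($i ~ /%$/) { pct = $i; sub(/^=/, "", pct) }
                else if ($i ~ /^finish=/) eta = $i
            }
            print array, op, pct, eta
        }
    ' "${1:-/proc/mdstat}"
}

# Example: poll once a minute until the resync is done.
#   while [ -n "$(resync_status)" ]; do resync_status; sleep 60; done
```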

Recover a force-assembled array

# After --assemble --force succeeds with event mismatch:
# 1. Filesystem check (read-only first)
sudo fsck.ext4 -nf /dev/md0      # ext4 read-only check
sudo xfs_repair -n /dev/md0      # XFS read-only check

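# 2. If checks pass, mount read-only and COPY data immediately
#    (plain "ro" may still replay the journal; ro,noload on ext4 or ro,norecovery on XFS avoids that write)
sudo mount -o ro /dev/md0 /mnt/recovery
sudo rsync -aHAX /mnt/recovery/ /backup/       # root is needed to preserve owners, ACLs, and xattrs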

# 3. Only after backup, allow writes
sudo umount /mnt/recovery
sudo mount /dev/md0 /mnt/data
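
How far apart the event counts are is what decides whether a forced assembly is a small risk or a large one, so it helps to see them side by side. This sketch extracts them from `mdadm --examine` output. `event_counts` is a hypothetical helper, and it assumes the usual layout in which each device section starts with a `/dev/...:` line and contains an `Events : N` field.

```shell
# Sketch only: event_counts is an illustrative helper, not part of mdadm.
# Reads concatenated `mdadm --examine` output on stdin; prints "DEVICE EVENTS" per member.
event_counts() {
    awk '
        /^\/dev\/[^ ]*:$/ { dev = $1; sub(/:$/, "", dev) }
        $1 == "Events" && $2 == ":" { print dev, $3 }
    '
}

# Example:
#   for d in /dev/sda1 /dev/sdb1; do sudo mdadm --examine "$d"; done | event_counts
```

A member that is only a few events behind was probably dropped moments before the array stopped; a large gap means that member holds old data.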

Common findings this catches

[U_] for weeks with no replacement — degraded RAID1 has no redundancy. Replace the failed member ASAP.
All members show Faulty in --examine but disks are physically OK — likely controller reset; reboot, try --assemble.
Resync at 1000 kB/s on modern disks → speed_limit_min default; raise to 50000+.
RAID5 with two failures simultaneously → image both disks with ddrescue, then attempt assembly from images. Don’t risk further wear on originals.
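“mismatch_cnt” non-zero after scrub → possible silent corruption. md cannot tell which copy is correct (a RAID1 repair simply copies from one member), so investigate disk health. Small counts on RAID1/RAID10 can also be benign, for example from swap.
mdraid superblock 0.90 (legacy) in an old array: limited features (at most 28 members, 2 TB per member on redundant levels, superblock stored at the end of the device). Upgrade carefully.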
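
The first finding, an array left at `[U_]` for weeks, is easy to catch with a scheduled check. A minimal sketch, assuming the standard mdstat status line that ends in something like `[2/1] [U_]`; `degraded_arrays` is a hypothetical name, and how the result is reported (mail, monitoring hook) is left open.

```shell
# Sketch only: degraded_arrays is an illustrative helper.
# Reads /proc/mdstat (or a file given as $1) and prints "ARRAY [STATUS]" for degraded arrays.
degraded_arrays() {
    awk '
        /^md[^ ]* :/ { array = $1 }
        # The status line ends with e.g. [2/1] [U_]; "_" marks a missing slot.
        /\[[0-9]+\/[0-9]+\] \[[U_]+\]/ && $NF ~ /_/ { print array, $NF }
    ' "${1:-/proc/mdstat}"
}

# Example (cron): complain only when something is degraded.
#   out="$(degraded_arrays)"; [ -z "$out" ] || echo "DEGRADED: $out"
```

`mdadm --monitor` provides the same alerting as a daemon; the function above is only meant for ad hoc checks and simple scripts.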

When to escalate

Multi-disk failure on a critical RAID5/6 — call a data-recovery service before attempting more operations.
Hardware controller errors in dmesg — replace controller before further trust in the array.
mdraid + LUKS + LVM stack failing — recover bottom-up (mdraid first, then LUKS, then LVM); each layer’s own debug applies.
Suspected silent corruption (mismatch_cnt rising) — engage storage team; consider migrating to a checksumming FS (btrfs/ZFS) for future.

Related prompts

ext4 Filesystem Corruption Recovery Prompt

Linux Disk Full / Inode Exhaustion Diagnosis Prompt

LVM Troubleshooting Prompt
