AI for Linux Admins · By James Joyner IV · 9 min read

Managing Software RAID with mdadm: Building, Monitoring, and Recovering

Software RAID with mdadm is rock-solid when you understand it. Here's how to build arrays, monitor health, and recover from a failed disk without losing data.

  • #linux
  • #raid
  • #mdadm
  • #storage
  • #disks
  • #recovery

Hardware RAID controllers are great until the controller dies and you discover your array is in a proprietary format no other card can read. Linux software RAID via mdadm has none of that lock-in: the metadata is open, the arrays are portable between machines, and it’s been battle-tested for two decades. I’ve recovered more data from mdadm arrays than from any hardware controller. Here’s how to build, watch, and rescue them.

Choosing a RAID level

Quick reality check before you create anything:

  • RAID 1 (mirror): two or more disks, each holding a full copy; a two-disk mirror survives one disk loss. Simple, and my default for OS/boot and small critical volumes.
  • RAID 10 (mirror + stripe): speed and redundancy, conventionally four or more disks. My default for databases.
  • RAID 5: one disk's worth of parity, distributed across all members. Needs 3+ disks and survives one failure, but rebuilds are slow and hard on the surviving disks with large modern drives. Acceptable for archival, risky for big arrays.
  • RAID 6: two disks' worth of distributed parity. Needs 4+ disks and survives two failures; the sane choice over RAID 5 for large-capacity arrays.
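
To put numbers on the trade-off, here is a rough sketch of the usable-capacity arithmetic for equal-sized disks. The usable_capacity function is a hypothetical helper of mine, not part of mdadm; it ignores md metadata overhead and assumes RAID 10 keeps two copies of every block on an even number of disks.

```shell
#!/bin/sh
# usable_capacity LEVEL DISKS SIZE_GB
# Prints the approximate usable capacity in GB for DISKS equal-sized members.
# Hypothetical helper for illustration; not part of mdadm.
usable_capacity() {
    level=$1 disks=$2 size=$3
    case "$level" in
        1)  echo "$size" ;;                    # every member holds a full copy
        10) echo $(( disks / 2 * size )) ;;    # assumes two copies, even disk count
        5)  echo $(( (disks - 1) * size )) ;;  # one disk's worth of parity
        6)  echo $(( (disks - 2) * size )) ;;  # two disks' worth of parity
        *)  echo "unsupported level: $level" >&2; return 1 ;;
    esac
}

usable_capacity 1 2 4000     # 2 x 4 TB mirror
usable_capacity 10 4 4000    # 4 x 4 TB RAID 10
usable_capacity 5 4 4000     # 4 x 4 TB RAID 5
usable_capacity 6 6 4000     # 6 x 4 TB RAID 6
```

With four 4 TB disks, RAID 5 gives you about 12 TB against RAID 10's 8 TB, which is exactly why people are tempted by it despite the rebuild risk.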

RAID is not a backup. It protects against disk failure, not rm -rf, corruption, or fire. Keep backups regardless.

Building an array

Say you have two blank disks, /dev/sdb and /dev/sdc, for a mirror (creating the array destroys any existing data on them):

sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

Watch the initial sync:

cat /proc/mdstat

/proc/mdstat is the heartbeat of software RAID — it shows every array, its disks, sync progress, and which members are missing. You’ll learn to read it at a glance.

Then make a filesystem and mount it:

sudo mkfs.ext4 /dev/md0
sudo mkdir -p /data              # the mount point must exist first
sudo mount /dev/md0 /data

Make the array reassemble on boot

This is the step people forget, and then the array doesn’t come up after a reboot. Save the array definition to the config and rebuild the initramfs:

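# Debian/Ubuntu path shown; the RHEL family uses /etc/mdadm.conf.
# tee -a appends, so run this once or you will end up with duplicate ARRAY lines.
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
sudo update-initramfs -u     # Debian/Ubuntu
# or: sudo dracut -f         # RHEL family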

Add an /etc/fstab entry by filesystem UUID (get it from blkid /dev/md0), not by /dev/md0, since an array can come up under a different device name (for example /dev/md127) on another boot or another machine.
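
As a minimal sketch of what that entry looks like, the hypothetical fstab_line helper below just formats the line. The all-zero UUID is a placeholder, and the nofail option is my own assumption (it lets the machine finish booting if the array is missing); drop it if you would rather the boot stop.

```shell
#!/bin/sh
# fstab_line UUID MOUNTPOINT [FSTYPE]
# Prints an /etc/fstab entry that mounts a filesystem by UUID.
# Hypothetical helper for illustration; it only formats text.
fstab_line() {
    printf 'UUID=%s  %s  %s  defaults,nofail  0  2\n' "$1" "$2" "${3:-ext4}"
}

# Placeholder UUID. On a real system, read it with:
#   sudo blkid -s UUID -o value /dev/md0
fstab_line 00000000-0000-0000-0000-000000000000 /data
```

Paste the printed line into /etc/fstab, then run sudo mount -a to confirm it parses and mounts before you trust it across a reboot.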

Monitoring: catch a failure before it’s a disaster

The whole point of RAID is surviving a disk failure — but only if you notice the first failure before the second one kills you. Set up email alerts:

# In /etc/mdadm/mdadm.conf
MAILADDR you@example.com

mdadm's monitor daemon emails on Fail, FailSpare, DegradedArray, and SparesMissing events. Test it:

sudo mdadm --monitor --scan --test --oneshot

For health at a glance:

sudo mdadm --detail /dev/md0     # state, per-disk status, event count
cat /proc/mdstat

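A healthy mirror shows [UU]. A degraded one shows [U_] or [_U]; the underscore marks the missing member, and it is the alarm. Also schedule periodic SMART checks (sudo smartctl -a /dev/sdb): RAID covers a dead disk, but a disk that is throwing read errors can cause trouble well before md marks it failed, and SMART attributes such as reallocated and pending sector counts show that early.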
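
If you want a second line of defence alongside the email alerts, the underscore check is easy to script. This is a rough sketch, not an mdadm feature: degraded_arrays is a hypothetical helper, and the sample text only imitates the usual /proc/mdstat layout rather than coming from a real host.

```shell
#!/bin/sh
# degraded_arrays [FILE]
# Prints the name of every md array whose status bracket contains "_".
# Reads /proc/mdstat by default; pass "-" to read standard input.
# Hypothetical helper for illustration; not part of mdadm.
degraded_arrays() {
    awk '
        # Array header line, e.g. "md0 : active raid1 ..."
        $1 ~ /^md/ && $2 == ":" { name = $1 }
        # Status line ends in a bracket such as [UU], [U_] or [_U]
        /blocks/ && $NF ~ /^\[[U_]*_[U_]*\]$/ { print name }
    ' "${1:-/proc/mdstat}"
}

# Illustrative sample in the usual /proc/mdstat layout (not from a real host).
sample='Personalities : [raid1]
md0 : active raid1 sdc[1] sdb[0]
      976630464 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sde[1](F) sdd[0]
      976630464 blocks super 1.2 [2/1] [U_]

unused devices: <none>'

printf '%s\n' "$sample" | degraded_arrays -    # prints: md1
```

Called with no argument from cron, an empty result means every array is healthy; anything printed is worth a page.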

Recovering from a failed disk

Here’s the part that matters at 2am. A disk failed and the array is degraded but still serving data. The drill:

1. Identify the failed member:

sudo mdadm --detail /dev/md0
cat /proc/mdstat        # the [U_] tells you which slot is down

2. Mark it failed and remove it (if not already):

sudo mdadm /dev/md0 --fail /dev/sdc
sudo mdadm /dev/md0 --remove /dev/sdc

3. Physically replace the drive, confirm the new disk's device name with lsblk (it may not come back as /dev/sdc), then add it:

sudo mdadm /dev/md0 --add /dev/sdc

The array immediately starts rebuilding onto the new disk. Watch it:

cat /proc/mdstat       # shows recovery percentage and ETA
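
If you would rather log the rebuild than stare at it, the progress line can be pulled out with a little awk. Treat this as a sketch under assumptions: rebuild_progress is a hypothetical helper, the sample imitates the usual /proc/mdstat progress line, and the numbers in it are made up.

```shell
#!/bin/sh
# rebuild_progress [FILE]
# For each array that is syncing, prints: NAME ACTION PERCENT ETA
# Reads /proc/mdstat by default; pass "-" to read standard input.
# Hypothetical helper for illustration; not part of mdadm.
rebuild_progress() {
    awk '
        # Array header line, e.g. "md0 : active raid1 ..."
        $1 ~ /^md/ && $2 == ":" { name = $1 }
        # Progress line, e.g. "[==>...]  recovery = 12.6% (...) finish=..."
        $2 ~ /^(recovery|resync|reshape|check)$/ && $3 == "=" {
            eta = "unknown"
            for (i = 5; i <= NF; i++)
                if ($i ~ /^finish=/) eta = substr($i, 8)
            print name, $2, $4, eta
        }
    ' "${1:-/proc/mdstat}"
}

# Illustrative sample (made-up numbers, usual /proc/mdstat layout).
sample='md0 : active raid1 sdc[2] sdb[0]
      976630464 blocks super 1.2 [2/1] [U_]
      [==>..................]  recovery = 12.6% (123456789/976630464) finish=133.7min speed=106384K/sec'

printf '%s\n' "$sample" | rebuild_progress -   # prints: md0 recovery 12.6% 133.7min
```

No output means nothing is rebuilding, which is what you want to see once the array is back to [UU].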

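Do not pull or fail another disk during a rebuild, and avoid rebooting until it finishes. A degraded RAID 5 has no redundancy left (the same is true of a RAID 6 that has already lost two members), so one more failure in that window kills the array for good. For RAID 5 on large drives, rebuilds can take many hours; plan for it.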

Hot spares: automate the swap

For arrays you can’t babysit, add a hot spare. mdadm automatically pulls it in when a disk fails, starting the rebuild without you:

sudo mdadm --create /dev/md0 --level=5 --raid-devices=3 --spare-devices=1 \
  /dev/sdb /dev/sdc /dev/sdd /dev/sde

Now a single failure triggers an automatic rebuild onto /dev/sde, buying you time to replace the dead disk on your own schedule.

Growing and reshaping

mdadm can grow arrays online — add a disk and expand:

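sudo mdadm --add /dev/md0 /dev/sdf            # new disk joins as a spare first
sudo mdadm --grow /dev/md0 --raid-devices=4   # starts the reshape
sudo mdadm --wait /dev/md0                    # block until the reshape finishes
sudo resize2fs /dev/md0                       # then grow the filesystem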

Reshaping is powerful but slow and risky on a live array — have backups and ideally do it during a maintenance window.

A recovery checklist worth saving

  1. cat /proc/mdstat — what’s degraded?
  2. mdadm --detail /dev/mdX — which member, what state?
  3. --fail then --remove the bad disk.
  4. Replace hardware, --add the new disk.
  5. Watch /proc/mdstat to 100% before relaxing.
  6. Confirm [UU] and re-test monitoring alerts.
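
One way to rehearse the checklist is to have the shell print the exact commands for a given array before anything is run. The sketch below does only that: replace_plan is a hypothetical helper that prints text and never touches a disk, and the device names in the example are the ones used earlier in this guide.

```shell
#!/bin/sh
# replace_plan ARRAY FAILED_DEV NEW_DEV
# Prints the disk-replacement commands in order. It runs nothing.
# Hypothetical helper for illustration; not part of mdadm.
replace_plan() {
    array=$1 bad=$2 new=$3
    printf '%s\n' \
        "cat /proc/mdstat" \
        "sudo mdadm --detail $array" \
        "sudo mdadm $array --fail $bad" \
        "sudo mdadm $array --remove $bad" \
        "sudo mdadm $array --add $new" \
        "cat /proc/mdstat"
}

replace_plan /dev/md0 /dev/sdc /dev/sdc
```

Read the printed plan against mdadm --detail before pasting any line of it; the point is to catch a wrong device name while it is still just text.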

Where AI helps

mdadm --detail and /proc/mdstat output is terse and easy to misread when you’re stressed and a customer’s data is on the line. Pasting it into a model and asking “which physical disk failed, is the array still serving data, and what’s the exact safe recovery sequence” turns cryptic status into a clear next step. I keep a few Linux admin prompts for exactly these storage-recovery moments.

Software RAID has saved my data more times than I can count, but only because I treated monitoring as non-negotiable and rehearsed the recovery before I needed it. Build the array, wire up the alerts, and practice a disk swap on a test box once — so the real one is muscle memory.

Generated commands and configs are assistive, not authoritative. Always verify against your own systems before applying changes in production.
