mismatch_cnt, RAID1, and a clever fix

This past weekend my computer showed an ominous error:

Jan 1 04:06:16 lister mdadm[8317]: RebuildFinished event detected on md device /dev/md/0, component device mismatches found: 9856 (on raid level 1)

Huh, that doesn't look particularly good. Mismatches between drives tend to lead to bad things.

Once the initial panic subsided, I checked online.

"please explain mismatch_cnt so I can sleep better at night" seemed promising for an explanation of what was going on. I read through it. "Aha! Swap files cause this problem!" I exclaimed to myself. "That has to be the culprit."

Except my machine didn't have a swap file on a RAID1 device.

Shoot. That would have been an easy explanation.

Much like looking up symptoms online, the more I searched the more my heart sank. Data corruption seemed the most obvious cause.

I checked the SMART data on all of the drives. Nothing looked amiss there.
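
Something along these lines will pull the same SMART data if you want to do the check yourself (this assumes smartmontools is installed and that the mirror members are /dev/sda and /dev/sdb; substitute your own devices):

# Full SMART report (health, error log, attributes) for each member
smartctl -a /dev/sda
smartctl -a /dev/sdb

# Optionally start the drive's own extended self-test and check back later
smartctl -t long /dev/sda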

I re-ran the checks:

echo check > /sys/block/md0/md/sync_action

and ran:

watch cat /sys/block/md0/md/mismatch_cnt

My heart sank as I watched the counter tick up and up.

So I pulled out my copy of Spinrite and ran a read-test on all of my drives. Spinrite said there was nothing wrong with the hardware. So it had to be something up with the software RAID1 itself.
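
If you don't have a copy of Spinrite, a read-only badblocks pass is one Linux-native way to do a similar surface scan (this is a substitute suggestion, not what I ran; the device names are examples):

# Non-destructive, read-only scan of each drive, with progress output
badblocks -sv /dev/sda
badblocks -sv /dev/sdb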

I kept reading that the problem could be one of two things. Either: 1) there was a memory-mapped file that wasn't in sync between the drives, or 2) the free space on the drives didn't 100% match. I was still holding out hope that it wasn't 3) corrupted data.

At this point I could have done:

echo repair > /sys/block/md0/md/sync_action

but that thought scared me. I had one of the drives on this machine corrupt a VirtualBox instance before when I mirrored the bitmaps between them (one drive had issues, and suddenly both copies of the file had issues). Add to that the somewhat scary notion that it's anyone's guess which side of a mismatch becomes the canonical version, and that path seemed like a certain way to ensure something got corrupted.

I'm not currently able to find the exact article, but someone mentioned that one way to test whether it's just the free space of the RAID that differs is to run dd if=/dev/zero of=foo bs=8K and let it fill up the disk. The reasoning is that the free space gets overwritten and set to a known quantity on both drives.

Note: If you decide to do this make sure that you do this in single-user mode (shutdown now) rather than with a running system. Filling up a filesystem while things are running can make your machine very cranky.
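
On a systemd-based system, something like this should get you to a minimal single-user environment (an assumption about your init setup; adjust for whatever your distribution uses):

# Drop to rescue (single-user) mode
systemctl rescue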

So I ran dd if=/dev/zero of=foo bs=8K as root and let it fill up the remaining space on the disk. I then ran sync;sync;sync to make sure everything was flushed before removing the file (and then re-ran sync;sync;sync again).
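
Pieced together, the whole sequence looked roughly like this. The mount point and filename are placeholders (assume the filesystem on /dev/md0 is mounted at /mnt/raid):

# Fill the free space with zeros; dd ending with "No space left on device" is expected
cd /mnt/raid
dd if=/dev/zero of=foo bs=8K

# Flush everything out to both halves of the mirror
sync; sync; sync

# Remove the fill file and flush again
rm foo
sync; sync; sync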

I booted the machine into multi-user mode and re-ran the scrubbing check. I kept an eye on the progress and the mismatch_cnt variable.
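
If you want to watch the same thing, it's roughly this (with the usual caveat that your array may not be md0):

# Kick off another scrub of the array
echo check > /sys/block/md0/md/sync_action

# Watch the scrub progress and the running mismatch count together
watch 'cat /proc/mdstat /sys/block/md0/md/mismatch_cnt'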

When it held steady at 0, I breathed a sigh of relief.

Moral of the story: the free space on a RAID1 array can get out of sync (especially after a power outage, or if you have memory-mapped files like swap files on it). You can run a "repair" on it, but you risk corruption if the mismatches point at real data. You may want to instead fill the free space with a known-good file and see if that clears it out.

Hope this helps someone else who runs into this. If I re-run into the original article that mentioned this nugget I'll update with a link.

