Recovering from a Bad Resilver?

on 26.09.2011 07:40:49 by kenn

I managed to get mdadm to resilver the wrong drive of a 5-drive RAID5
array. I stopped the resilver at less than 1% complete, but the damage is
done: the array won't mount and fsck -n spits out a zillion errors. I'm
in the process of purchasing two 2T drives so I can dd a copy of the array
and attempt to recover the files. Here's what I plan to do:

(1) fsck a copy of the array. Who knows.
(2) Run photorec on the entire array, and use md5sum checksums to give the
recovered files their names back (a cron job used to run md5sum against the
raid5, and I have a 2010 copy of its output; see the sketch below).
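
For (2), the renaming step I'm picturing looks roughly like this. recup_dir.*
is where photorec drops its output; the checksum-list path and the destination
directory are just placeholders:

#!/bin/sh
# Rough sketch: give photorec's recovered files their old names back by
# matching md5sums against the saved 2010 list (lines look like
# "<md5sum>  <path>").  SUMS and DEST are placeholder paths.
SUMS=/root/md3-md5sums-2010.txt
DEST=/mnt/recovered
for f in recup_dir.*/*; do
    sum=$(md5sum "$f" | awk '{print $1}')
    # look up the original path recorded for this checksum
    # (paths containing spaces would need more care than this)
    orig=$(awk -v s="$sum" '$1 == s {print $2; exit}' "$SUMS")
    if [ -n "$orig" ]; then
        mkdir -p "$DEST/$(dirname "$orig")"
        cp -p "$f" "$DEST/$orig"
    fi
done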

Both options seem sucky. Only 1% of the drive should be corrupt. Any
other ideas?

Thanks,
Kenn

P.S. Details:

/dev/md3 is a 5 x WD 750G in a raid5 array - /dev/hde1 /dev/hdi1 /dev/sde1
/dev/hdk1 /dev/hdg1

/dev/sde dropped out. My guess was a loose SATA cable, since it wasn't
seated fully. I ran a full smartctl -t offline /dev/sde; it found and
marked 37 unreadable sectors, and I decided to try out the drive again
before replacing it.
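
For reference, the kind of read-only recheck I'd run on it afterwards is just
the self-test log plus the sector-related SMART attributes (attribute names
vary a bit between drives):

# read-only: self-test results and the sector-health attributes for /dev/sde
smartctl -l selftest /dev/sde
smartctl -A /dev/sde | grep -Ei 'realloc|pending|uncorrect'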

I added /dev/sde1 back into the array and it resilvered over the next day.
Everything was fine for a couple days.

Then I decided to fsck my array just for good measure. It wouldn't
unmount. I thought sde was the issue, so I tried to take it out of the
array (remove and then fail), but /proc/mdstat wouldn't show it out of
the array. So I removed the array from fstab and rebooted, and then sde
was out of the array and the array was unmounted.
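
In hindsight, the order that normally works is to fail the member first and
then remove it, i.e. something like:

# mark the member faulty, then pull it out of the running array
mdadm /dev/md3 --fail /dev/sde1
mdadm /dev/md3 --remove /dev/sde1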

I wanted to force another resilver on sde, so I used fdisk to delete sde's
raid partition, created two small partitions, used newfs to format them as
ext3, then deleted them and re-created an empty partition for sde's raid
partition. Then I ran mdadm --zero-superblock to wipe sde's old raid
metadata. The resilver onto this fresh sde was supposed to show whether
the drive was fully working or needed replacement.
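
In hindsight, the usual way to kick off that kind of test rebuild, without
touching the rest of the array, would have been something along these lines:

# wipe any stale md metadata off the re-created partition ...
mdadm --zero-superblock /dev/sde1
# ... then hot-add it to the existing array, which triggers a full rebuild onto it
mdadm /dev/md3 --add /dev/sde1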

Then I went to add sde back into the array: I stopped the array and
re-created it, and this is probably where I went wrong. First I tried:

# mdadm --create /dev/md3 --level=5 --raid-devices=5 \
      /dev/hde1 /dev/hdi1 missing /dev/hdk1 /dev/hdg1

and this worked fine. Note that sde1 is still marked as missing. The array
mounted and unmounted fine. So I stopped the array and added sde1 back
in:

# mdadm --create /dev/md3 --level=5 --raid-devices=5 \
      /dev/hde1 /dev/hdi1 /dev/sde1 /dev/hdk1 /dev/hdg1

This started the array up... but /proc/mdstat showed a drive other than
sde1 out of the array and a resilver already running. OH NO! So I stopped
the array and tried to re-create it with sde1 as missing:

# mdadm --create /dev/md3 --level=5 --raid-devices=5 \
      /dev/hde1 /dev/hdi1 missing /dev/hdk1 /dev/hdg1

The array was created, but it won't mount and fsck -n says lots of nasty
things.
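
The only read-only sanity check I can think of at this point is to dump what
the re-created array actually looks like, superblock by superblock:

# read-only: per-member metadata plus the assembled array's layout
mdadm --examine /dev/hde1 /dev/hdi1 /dev/sde1 /dev/hdk1 /dev/hdg1
mdadm --detail /dev/md3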

I don't have a 3-terabyte drive handy, and my motherboard won't support
drives over 2T, so I'm gonna purchase two 2T drives, raid0 them, and then
see what I can recover out of my failed /dev/md3.
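
Roughly, with /dev/md4, /dev/sdX1 and /dev/sdY1 standing in for whatever the
scratch array and the two new partitions end up being called:

# stripe the two new 2T drives into a ~4T scratch device
mdadm --create /dev/md4 --level=0 --raid-devices=2 /dev/sdX1 /dev/sdY1
# block-copy the broken array onto it, padding instead of dying on read errors
dd if=/dev/md3 of=/dev/md4 bs=1M conv=noerror,sync
# then run all the fsck / photorec experiments against the copy
fsck -n /dev/md4

If read errors turn out to be a problem during the copy, ddrescue is probably
the gentler tool for that step.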
