raid10 failure(s)
On 14.02.2011 17:07:34, Mark Keisler wrote:

Sorry in advance for the long email :)
I had a RAID10 array set up using 4 WD 1TB Caviar Black drives (SATA3)
on a 64-bit 2.6.36 kernel with mdadm 3.1.4. I noticed last night that
one drive had faulted out of the array. It had a bunch of errors like
so:
Feb 8 03:39:48 samsara kernel: [41330.835285] ata3.00: exception
Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 8 03:39:48 samsara kernel: [41330.835288] ata3.00: irq_stat 0x40000008
Feb 8 03:39:48 samsara kernel: [41330.835292] ata3.00: failed
command: READ FPDMA QUEUED
Feb 8 03:39:48 samsara kernel: [41330.835297] ata3.00: cmd
60/f8:00:f8:9a:45/00:00:04:00:00/40 tag 0 ncq 126976 in
Feb 8 03:39:48 samsara kernel: [41330.835297] res
41/40:00:70:9b:45/00:00:04:00:00/40 Emask 0x409 (media error)
Feb 8 03:39:48 samsara kernel: [41330.835300] ata3.00: status: { DRDY ERR }
Feb 8 03:39:48 samsara kernel: [41330.835301] ata3.00: error: { UNC }
Feb 8 03:39:48 samsara kernel: [41330.839776] ata3.00: configured for UDMA/133
Feb 8 03:39:48 samsara kernel: [41330.839788] ata3: EH complete
.....
Feb 8 03:39:58 samsara kernel: [41340.423236] sd 2:0:0:0: [sdc]
Unhandled sense code
Feb 8 03:39:58 samsara kernel: [41340.423238] sd 2:0:0:0: [sdc]
Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 8 03:39:58 samsara kernel: [41340.423240] sd 2:0:0:0: [sdc]
Sense Key : Medium Error [current] [descriptor]
Feb 8 03:39:58 samsara kernel: [41340.423243] Descriptor sense data
with sense descriptors (in hex):
Feb 8 03:39:58 samsara kernel: [41340.423244] 72 03 11 04 00
00 00 0c 00 0a 80 00 00 00 00 00
Feb 8 03:39:58 samsara kernel: [41340.423249] 04 45 9b 70
Feb 8 03:39:58 samsara kernel: [41340.423251] sd 2:0:0:0: [sdc] Add.
Sense: Unrecovered read error - auto reallocate failed
Feb 8 03:39:58 samsara kernel: [41340.423254] sd 2:0:0:0: [sdc] CDB:
Read(10): 28 00 04 45 9a f8 00 00 f8 00
Feb 8 03:39:58 samsara kernel: [41340.423259] end_request: I/O error,
dev sdc, sector 71670640
Feb 8 03:39:58 samsara kernel: [41340.423262] md/raid10:md0: sdc1:
rescheduling sector 143332600
.....
Feb 8 03:40:10 samsara kernel: [41351.940796] md/raid10:md0: read
error corrected (8 sectors at 2168 on sdc1)
Feb 8 03:40:10 samsara kernel: [41351.954972] md/raid10:md0: sdb1:
redirecting sector 143332600 to another mirror
and so on until:
Feb 8 03:55:01 samsara kernel: [42243.609414] md/raid10:md0: sdc1:
Raid device exceeded read_error threshold [cur 21:max 20]
Feb 8 03:55:01 samsara kernel: [42243.609417] md/raid10:md0: sdc1:
Failing raid device
Feb 8 03:55:01 samsara kernel: [42243.609419] md/raid10:md0: Disk
failure on sdc1, disabling device.
Feb 8 03:55:01 samsara kernel: [42243.609420] <1>md/raid10:md0:
Operation continuing on 3 devices.
Feb 8 03:55:01 samsara kernel: [42243.609423] md/raid10:md0: sdb1:
redirecting sector 143163888 to another mirror
Feb 8 03:55:01 samsara kernel: [42243.609650] md/raid10:md0: sdb1:
redirecting sector 143164416 to another mirror
Feb 8 03:55:01 samsara kernel: [42243.610095] md/raid10:md0: sdb1:
redirecting sector 143164664 to another mirror
Feb 8 03:55:01 samsara kernel: [42243.633814] RAID10 conf printout:
Feb 8 03:55:01 samsara kernel: [42243.633817] --- wd:3 rd:4
Feb 8 03:55:01 samsara kernel: [42243.633820] disk 0, wo:0, o:1, dev:sdb1
Feb 8 03:55:01 samsara kernel: [42243.633821] disk 1, wo:1, o:0, dev:sdc1
Feb 8 03:55:01 samsara kernel: [42243.633823] disk 2, wo:0, o:1, dev:sdd1
Feb 8 03:55:01 samsara kernel: [42243.633824] disk 3, wo:0, o:1, dev:sde1
Feb 8 03:55:01 samsara kernel: [42243.645880] RAID10 conf printout:
Feb 8 03:55:01 samsara kernel: [42243.645883] --- wd:3 rd:4
Feb 8 03:55:01 samsara kernel: [42243.645885] disk 0, wo:0, o:1, dev:sdb1
Feb 8 03:55:01 samsara kernel: [42243.645887] disk 2, wo:0, o:1, dev:sdd1
Feb 8 03:55:01 samsara kernel: [42243.645888] disk 3, wo:0, o:1, dev:sde1
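If I'm reading that right, the "[cur 21:max 20]" means sdc1 hit md's
per-device limit for corrected read errors (the default of 20 matches
the "max" in the message). I believe recent kernels expose that limit
in sysfs, though the attribute name may be spelled slightly differently
on 2.6.36:

# cat /sys/block/md0/md/max_read_errors

and smartctl should show whether the drive has actually started eating
into its spare sectors, e.g.:

# smartctl -a /dev/sdc | grep -iE 'Reallocated_Sector|Current_Pending|Offline_Uncorrect'

(those are the usual ATA SMART attribute names; they may differ a bit
on these WD drives).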
This seemed weird, since the machine is only a week or two old. I
powered down to open it up and get the serial number off the drive for
an RMA. When I powered back up, mdadm had automatically removed the
drive from the RAID. Fine. The array had already been running on just
3 disks since the 8th. For some reason, I thought I'd add the drive
back into the array to see if it failed out again, figuring that worst
case I'd just be back to a degraded RAID10. So I added it back in, ran
mdadm --detail a little while later to check on it, and found this:
samsara log # mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Sat Feb 5 22:00:52 2011
Raid Level : raid10
Array Size : 1953519104 (1863.02 GiB 2000.40 GB)
Used Dev Size : 976759552 (931.51 GiB 1000.20 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Update Time : Mon Feb 14 00:04:46 2011
State : clean, FAILED, recovering
Active Devices : 2
Working Devices : 2
Failed Devices : 2
Spare Devices : 0
Layout : near=2
Chunk Size : 256K
Rebuild Status : 99% complete
Name : samsara:0 (local to host samsara)
UUID : 26804ec8:a20a4365:bc7d5b4e:653ade03
Events : 30348
Number Major Minor RaidDevice State
0 8 17 0 faulty spare rebuilding /dev/sdb1
1 8 33 1 faulty spare rebuilding /dev/sdc1
2 8 49 2 active sync /dev/sdd1
3 8 65 3 active sync /dev/sde1
samsara log # exit
It had also faulted drive 0 (sdb1) during the rebuild:
[ 1177.064359] RAID10 conf printout:
[ 1177.064362] --- wd:2 rd:4
[ 1177.064365] disk 0, wo:1, o:0, dev:sdb1
[ 1177.064367] disk 1, wo:1, o:0, dev:sdc1
[ 1177.064368] disk 2, wo:0, o:1, dev:sdd1
[ 1177.064370] disk 3, wo:0, o:1, dev:sde1
[ 1177.073325] RAID10 conf printout:
[ 1177.073328] --- wd:2 rd:4
[ 1177.073330] disk 0, wo:1, o:0, dev:sdb1
[ 1177.073332] disk 2, wo:0, o:1, dev:sdd1
[ 1177.073333] disk 3, wo:0, o:1, dev:sde1
[ 1177.073340] RAID10 conf printout:
[ 1177.073341] --- wd:2 rd:4
[ 1177.073342] disk 0, wo:1, o:0, dev:sdb1
[ 1177.073343] disk 2, wo:0, o:1, dev:sdd1
[ 1177.073344] disk 3, wo:0, o:1, dev:sde1
[ 1177.083323] RAID10 conf printout:
[ 1177.083326] --- wd:2 rd:4
[ 1177.083329] disk 2, wo:0, o:1, dev:sdd1
[ 1177.083330] disk 3, wo:0, o:1, dev:sde1
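My best guess as to why sdb1 went too: in this 4-disk near=2 layout,
disk 0 (sdb1) and disk 1 (sdc1) are mirror partners, so rebuilding
sdc1 meant reading essentially every sector of sdb1, and any latent
unreadable sectors on sdb1 that normal use never touched would have
surfaced right then. I'll pull the SMART data for sdb as well before
blaming md:

# smartctl -a /dev/sdb

and once (if?) I get the array back, I gather a periodic scrub would
catch this sort of thing earlier, something like:

# echo check > /sys/block/md0/md/sync_action
# cat /sys/block/md0/md/mismatch_cnt

unless I'm misunderstanding how sync_action works.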
So the RAID ended up being marked "clean, FAILED." Gee, glad it's
clean at least ;). I'm wondering wtf went wrong and whether it actually
makes sense that I had a double disk failure like that. I can't even
force-assemble the array anymore:
# mdadm --assemble --verbose --force /dev/md0
mdadm: looking for devices for /dev/md0
mdadm: cannot open device /dev/sde1: Device or resource busy
mdadm: /dev/sde1 has wrong uuid.
mdadm: cannot open device /dev/sdd1: Device or resource busy
mdadm: /dev/sdd1 has wrong uuid.
mdadm: cannot open device /dev/sdc1: Device or resource busy
mdadm: /dev/sdc1 has wrong uuid.
mdadm: cannot open device /dev/sdb1: Device or resource busy
mdadm: /dev/sdb1 has wrong uuid.
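I'm guessing the "Device or resource busy" / "wrong uuid" complaints
are just because the dead md0 is still half-assembled and holding the
partitions open, so mdadm can't even read the superblocks. Unless
someone tells me this is a terrible idea, my rough plan (device names
assuming nothing got renumbered across the reboot) is:

# cat /proc/mdstat
# mdadm --stop /dev/md0
# mdadm --examine /dev/sd[bcde]1
# mdadm --assemble --force --verbose /dev/md0 /dev/sd[bcde]1

i.e. stop the stale array, compare the event counts and device states
in the superblocks, and then let --force try to pull in the freshest
set of members.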
Am I totally SOL? Thanks for any suggestions or things to try.
--
Mark Keisler