RAID10 failure(s)
on 14.02.2011 17:09:07 by Mark Keisler
Sorry in advance for the long email :)
I had a RAID10 array set up using 4 WD 1TB Caviar Black drives (SATA3)
on a 64-bit 2.6.36 kernel with mdadm 3.1.4.  I noticed last night
that one drive had faulted out of the array. It had a bunch of errors
like so:
Feb 8 03:39:48 samsara kernel: [41330.835285] ata3.00: exception
Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 8 03:39:48 samsara kernel: [41330.835288] ata3.00: irq_stat 0x40000008
Feb 8 03:39:48 samsara kernel: [41330.835292] ata3.00: failed
command: READ FPDMA QUEUED
Feb 8 03:39:48 samsara kernel: [41330.835297] ata3.00: cmd
60/f8:00:f8:9a:45/00:00:04:00:00/40 tag 0 ncq 126976 in
Feb 8 03:39:48 samsara kernel: [41330.835297] res
41/40:00:70:9b:45/00:00:04:00:00/40 Emask 0x409 (media error)
Feb 8 03:39:48 samsara kernel: [41330.835300] ata3.00: status: { DRDY ERR }
Feb 8 03:39:48 samsara kernel: [41330.835301] ata3.00: error: { UNC }
Feb 8 03:39:48 samsara kernel: [41330.839776] ata3.00: configured for UDMA/133
Feb 8 03:39:48 samsara kernel: [41330.839788] ata3: EH complete
.....
Feb 8 03:39:58 samsara kernel: [41340.423236] sd 2:0:0:0: [sdc]
Unhandled sense code
Feb 8 03:39:58 samsara kernel: [41340.423238] sd 2:0:0:0: [sdc]
Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 8 03:39:58 samsara kernel: [41340.423240] sd 2:0:0:0: [sdc]
Sense Key : Medium Error [current] [descriptor]
Feb 8 03:39:58 samsara kernel: [41340.423243] Descriptor sense data
with sense descriptors (in hex):
Feb 8 03:39:58 samsara kernel: [41340.423244] 72 03 11 04 00
00 00 0c 00 0a 80 00 00 00 00 00
Feb 8 03:39:58 samsara kernel: [41340.423249] 04 45 9b 70
Feb 8 03:39:58 samsara kernel: [41340.423251] sd 2:0:0:0: [sdc] Add.
Sense: Unrecovered read error - auto reallocate failed
Feb 8 03:39:58 samsara kernel: [41340.423254] sd 2:0:0:0: [sdc] CDB:
Read(10): 28 00 04 45 9a f8 00 00 f8 00
Feb 8 03:39:58 samsara kernel: [41340.423259] end_request: I/O error,
dev sdc, sector 71670640
Feb 8 03:39:58 samsara kernel: [41340.423262] md/raid10:md0: sdc1:
rescheduling sector 143332600
.....
Feb 8 03:40:10 samsara kernel: [41351.940796] md/raid10:md0: read
error corrected (8 sectors at 2168 on sdc1)
Feb 8 03:40:10 samsara kernel: [41351.954972] md/raid10:md0: sdb1:
redirecting sector 143332600 to another mirror
and so on until:
Feb 8 03:55:01 samsara kernel: [42243.609414] md/raid10:md0: sdc1:
Raid device exceeded read_error threshold [cur 21:max 20]
Feb 8 03:55:01 samsara kernel: [42243.609417] md/raid10:md0: sdc1:
Failing raid device
Feb 8 03:55:01 samsara kernel: [42243.609419] md/raid10:md0: Disk
failure on sdc1, disabling device.
Feb 8 03:55:01 samsara kernel: [42243.609420] <1>md/raid10:md0:
Operation continuing on 3 devices.
Feb 8 03:55:01 samsara kernel: [42243.609423] md/raid10:md0: sdb1:
redirecting sector 143163888 to another mirror
Feb 8 03:55:01 samsara kernel: [42243.609650] md/raid10:md0: sdb1:
redirecting sector 143164416 to another mirror
Feb 8 03:55:01 samsara kernel: [42243.610095] md/raid10:md0: sdb1:
redirecting sector 143164664 to another mirror
Feb 8 03:55:01 samsara kernel: [42243.633814] RAID10 conf printout:
Feb 8 03:55:01 samsara kernel: [42243.633817] --- wd:3 rd:4
Feb 8 03:55:01 samsara kernel: [42243.633820] disk 0, wo:0, o:1, dev:sdb1
Feb 8 03:55:01 samsara kernel: [42243.633821] disk 1, wo:1, o:0, dev:sdc1
Feb 8 03:55:01 samsara kernel: [42243.633823] disk 2, wo:0, o:1, dev:sdd1
Feb 8 03:55:01 samsara kernel: [42243.633824] disk 3, wo:0, o:1, dev:sde1
Feb 8 03:55:01 samsara kernel: [42243.645880] RAID10 conf printout:
Feb 8 03:55:01 samsara kernel: [42243.645883] --- wd:3 rd:4
Feb 8 03:55:01 samsara kernel: [42243.645885] disk 0, wo:0, o:1, dev:sdb1
Feb 8 03:55:01 samsara kernel: [42243.645887] disk 2, wo:0, o:1, dev:sdd1
Feb 8 03:55:01 samsara kernel: [42243.645888] disk 3, wo:0, o:1, dev:sde1
This seemed weird as the machine is only a week or two old.  I powered
down to open it up and get the serial number off the drive for an RMA.
When I powered back up, mdadm had automatically removed the drive from
the RAID.  Fine.  The RAID had already been running on just 3 disks
since the 8th.  For some reason I thought to add the drive back into
the array to see whether it would fail out again, figuring the worst
case was being back to a degraded RAID10.  So I added it back in, ran
mdadm --detail a little while later to check on it, and found this:
samsara log # mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Sat Feb 5 22:00:52 2011
Raid Level : raid10
Array Size : 1953519104 (1863.02 GiB 2000.40 GB)
Used Dev Size : 976759552 (931.51 GiB 1000.20 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Update Time : Mon Feb 14 00:04:46 2011
State : clean, FAILED, recovering
Active Devices : 2
Working Devices : 2
Failed Devices : 2
Spare Devices : 0
Layout : near=2
Chunk Size : 256K
Rebuild Status : 99% complete
Name : samsara:0 (local to host samsara)
UUID : 26804ec8:a20a4365:bc7d5b4e:653ade03
Events : 30348
Number Major Minor RaidDevice State
0 8 17 0 faulty spare rebuilding /dev/sdb1
1 8 33 1 faulty spare rebuilding /dev/sdc1
2 8 49 2 active sync /dev/sdd1
3 8 65 3 active sync /dev/sde1
samsara log # exit
It had also faulted drive 0 during the rebuild.
[ 1177.064359] RAID10 conf printout:
[ 1177.064362] --- wd:2 rd:4
[ 1177.064365] disk 0, wo:1, o:0, dev:sdb1
[ 1177.064367] disk 1, wo:1, o:0, dev:sdc1
[ 1177.064368] disk 2, wo:0, o:1, dev:sdd1
[ 1177.064370] disk 3, wo:0, o:1, dev:sde1
[ 1177.073325] RAID10 conf printout:
[ 1177.073328] --- wd:2 rd:4
[ 1177.073330] disk 0, wo:1, o:0, dev:sdb1
[ 1177.073332] disk 2, wo:0, o:1, dev:sdd1
[ 1177.073333] disk 3, wo:0, o:1, dev:sde1
[ 1177.073340] RAID10 conf printout:
[ 1177.073341] --- wd:2 rd:4
[ 1177.073342] disk 0, wo:1, o:0, dev:sdb1
[ 1177.073343] disk 2, wo:0, o:1, dev:sdd1
[ 1177.073344] disk 3, wo:0, o:1, dev:sde1
[ 1177.083323] RAID10 conf printout:
[ 1177.083326] --- wd:2 rd:4
[ 1177.083329] disk 2, wo:0, o:1, dev:sdd1
[ 1177.083330] disk 3, wo:0, o:1, dev:sde1
So the RAID ended up being marked "clean, FAILED."  Gee, glad it is
clean at least ;).  I'm wondering what went wrong and whether a double
disk failure like that actually makes sense.  I can't even
force-assemble the array anymore:
# mdadm --assemble --verbose --force /dev/md0
mdadm: looking for devices for /dev/md0
mdadm: cannot open device /dev/sde1: Device or resource busy
mdadm: /dev/sde1 has wrong uuid.
mdadm: cannot open device /dev/sdd1: Device or resource busy
mdadm: /dev/sdd1 has wrong uuid.
mdadm: cannot open device /dev/sdc1: Device or resource busy
mdadm: /dev/sdc1 has wrong uuid.
mdadm: cannot open device /dev/sdb1: Device or resource busy
mdadm: /dev/sdb1 has wrong uuid.
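Side note: the "Device or resource busy" errors usually mean the kernel has
already grabbed the partitions in a half-assembled array, so a forced assemble
would normally need the array stopped first and the members listed explicitly.
Roughly (just a sketch):
# mdadm --stop /dev/md0
# mdadm --assemble --force --verbose /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1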
Am I totally SOL? Thanks for any suggestions or things to try.
--
Mark
Tact is the ability to tell a man he has an open mind when he has a
hole in his head.
Re: RAID10 failure(s)
on 14.02.2011 21:33:03 by Mark Keisler
Sorry for the double post on the original.
I also left out the fact that I rebooted after drive 0 reported a fault
as well, and now mdadm won't start the array at all.  I'm not sure how
to tell which members were in the two RAID0 groups.  I would think that
if I still have one intact RAID0 pair from the RAID10, I should be able
to recover somehow.  I'm just not sure whether that was drives 0 and 2,
1 and 3, or 0 and 1, 2 and 3.
Anyway, the drives do still show the correct array UUID when queried
with mdadm -E, but they disagree about the state of the array:
# mdadm -E /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 | grep 'Array State'
Array State : AAAA ('A' == active, '.' == missing)
Array State : .AAA ('A' == active, '.' == missing)
Array State : ..AA ('A' == active, '.' == missing)
Array State : ..AA ('A' == active, '.' == missing)
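The event counters are worth comparing too when members disagree like this;
for instance (same devices as above):
# mdadm -E /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 | grep -E 'Events|Array State'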
sdc still shows a recovery offset, too:
/dev/sdb1:
Data Offset : 2048 sectors
Super Offset : 8 sectors
/dev/sdc1:
Data Offset : 2048 sectors
Super Offset : 8 sectors
Recovery Offset : 2 sectors
/dev/sdd1:
Data Offset : 2048 sectors
Super Offset : 8 sectors
/dev/sde1:
Data Offset : 2048 sectors
Super Offset : 8 sectors
I did some searching on the "READ FPDMA QUEUED" error message that my
drive was reporting and found that there seems to be a correlation
between that error and having AHCI (NCQ in particular) enabled.
I've now set my BIOS back to Native IDE (which was the default anyway)
instead of AHCI for the SATA setting.  I'm hoping that was the issue.
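An alternative, presumably, would be to keep AHCI and just disable NCQ per
drive via sysfs instead of dropping back to IDE mode; a sketch:
# cat /sys/block/sdc/device/queue_depth
# echo 1 > /sys/block/sdc/device/queue_depth
Setting the queue depth to 1 effectively turns NCQ off for that drive.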
Still wondering if there is some magic to be done to get at my data again :)
--
Mark
Tact is the ability to tell a man he has an open mind when he has a
hole in his head.
Re: RAID10 failure(s)
on 14.02.2011 23:29:03 by Stan Hoeppner
Mark Keisler put forth on 2/14/2011 2:33 PM:
> Still wondering if there is some magic to be done to get at my data again :)
>>
>> Am I totally SOL? Thanks for any suggestions or things to try.
>>
>> --
>> Mark
>> Tact is the ability to tell a man he has an open mind when he has a
>> hole in his head.
Interesting, and ironically appropriate, sig, Mark.
No magic is required. Simply wipe each disk by writing all zeros with dd. You
can do all 4 in parallel. This will take a while with 1TB drives. If there are
still SATA/NCQ/etc issues they should pop up while wiping the drives. If not,
and all dd operations complete successfully, simply create a new RAID 10 array
and format it with your favorite filesystem.
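Something along these lines would do it (destructive, so double-check the
device names; they're assumed here):
# for d in sdb sdc sdd sde; do dd if=/dev/zero of=/dev/$d bs=1M & done; wait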
Then restore all your files from your backups.[1]
[1] Tact is the ability to tell a man he has an open mind when he has a hole in
his head.
--
Stan
Re: RAID10 failure(s)
on 14.02.2011 23:48:02 by NeilBrown
On Mon, 14 Feb 2011 14:33:03 -0600 Mark Keisler wrote:
> I did some searching on the "READ FPDMA QUEUED" error message that my
> drive was reporting and have found that there seems to be a
> correlation between that and having AHCI (NCQ in particular) enabled.
> I've now set my BIOS back to Native IDE (which was the default anyway)
> instead of AHCI for the SATA setting. I'm hoping that was the issue.
>
> Still wondering if there is some magic to be done to get at my data again :)
No need for magic here .. but you better stand back, as
I'm going to try ... Science.
(or is that Engineering...)
mdadm -S /dev/md0
mdadm -C /dev/md0 -l10 -n4 -c256 missing /dev/sdc1 /dev/sdd1 /dev/sde1
mdadm --wait /dev/md0
mdadm /dev/md0 --add /dev/sdb1
(but be really sure that the devices really are working before you try this).
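For example (device names assumed), SMART long self-tests on each member,
reviewed before the re-create, are a reasonable sanity check:
# smartctl -t long /dev/sdb        (repeat for sdc, sdd, sde)
# smartctl -l selftest /dev/sdb
# smartctl -H -A /dev/sdb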
BTW, for a near=2, Raid-disks=4 arrangement, the first and second devices
contain the same data, and the third and fourth devices also contain the
same data as each other (but obviously different to the first and second).
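Pictorially, the near=2 layout over four members looks roughly like this
(chunk numbers are illustrative only):
            sdb1   sdc1   sdd1   sde1
stripe 0:    C0     C0     C1     C1
stripe 1:    C2     C2     C3     C3
stripe 2:    C4     C4     C5     C5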
NeilBrown
Re: RAID10 failure(s)
on 15.02.2011 00:08:45 by Mark Keisler
On Mon, Feb 14, 2011 at 4:48 PM, NeilBrown wrote:
> (but be really sure that the devices really are working before you try this).
>
> BTW, for a near=2, Raid-disks=4 arrangement, the first and second devices
> contain the same data, and the third and fourth devices also contain the
> same data as each other (but obviously different to the first and second).
>
> NeilBrown
Ah, that's the kind of info that I was looking for. So, the third and
fourth disks are a complete RAID0 set and the entire RAID10 should be
able to rebuild from them if I replace the first two disks with new
ones (hence being sure the devices are working)? Or I need to hope
the originals will hold up to a rebuild?
Thanks for the info, Neil, and all your work in FOSS :)
Re: RAID10 failure(s)
on 15.02.2011 00:20:07 by NeilBrown
On Mon, 14 Feb 2011 17:08:45 -0600 Mark Keisler wrote:
> Ah, that's the kind of info that I was looking for.  So, the third and
> fourth disks are a complete RAID0 set and the entire RAID10 should be
> able to rebuild from them if I replace the first two disks with new
> ones (hence being sure the devices are working)?  Or I need to hope
> the originals will hold up to a rebuild?
No.
Third and fourth are like a RAID1 set, not a RAID0 set.
First and second are a RAID1 pair.  Third and fourth are a RAID1 pair.
First and third
first and fourth
second and third
second and fourth
can each be seen as a RAID0 pair which contains all of the data.
NeilBrown
Re: RAID10 failure(s)
on 15.02.2011 01:40:53 by Stan Hoeppner
Mark Keisler put forth on 2/14/2011 4:39 PM:
> Well, that was completely unhelpful and devoid of any information.
> Backups don't keep a RAID from failing and that's what my question was
> about. I don't want to spend all of my time rebuilding an array and
> restoring from backup every week.
"Backups don't keep a RAID from failing" -- good sig material ;)
It seems you lack the sense of humor implied by your signature. Given that
fact, I can understand the defensiveness. However, note that there are some
very helpful suggestions in my reply. IIRC, your question wasn't "how to keep a
RAID from failing" but "why did one drive in my RAID 10 fail, and then another
during rebuild". My suggestions could help you answer the first, possibly the
second.
You still don't know if the first dropped drive is actually bad or not. Zeroing
it with dd may very well help to inform you if there is a real problem with it.
Checking your logs and smart data during/afterward may/should tell you.
Zeroing all of them with dd gives you a clean slate for further troubleshooting.
You don't currently have a full backup. While the reminder of such may have
irritated you, it is nonetheless very relevant, and useful, especially for other
list OPs not donning a Jimmy hat. RAID is not a replacement for a proper backup
procedure. You (re)discovered that fact here, or you simply believe, foolishly,
the opposite.
My reply was full of useful information. Apparently just not useful to someone
who wants to cut corners without having to face the potential negative consequences.
RAID won't save you from massive filesystem corruption. A proper backup can.
And if this scenario would have turned dire (or still does) it could save you
here as well. Again, you need a proper backup solution.
--
Stan
Backups don't keep a RAID from failing. --Mark Keisler
Re: RAID10 failure(s)
on 15.02.2011 01:49:03 by Mark Keisler
On Mon, Feb 14, 2011 at 5:20 PM, NeilBrown wrote:
> Third and fourth are like a RAID1 set, not a RAID0 set.
>
> First and second are a RAID1 pair.  Third and fourth are a RAID1 pair.
>
> First and third
> first and fourth
> second and third
> second and fourth
>
> can each be seen as a RAID0 pair which contains all of the data.
>
> NeilBrown
Oh, duh, I was thinking in 0+1 terms instead of 10.  I'm still wondering why
you made mention of "but be really sure that the devices really are
working before you try this."  If trying to bring the RAID back fails,
I'm just back to not having access to the data, which is where I am now
:).
Re: RAID10 failure(s)
on 15.02.2011 01:57:44 by NeilBrown
On Mon, 14 Feb 2011 18:49:03 -0600 Mark Keisler wrote:
> Oh, duh, was thinking in 0+1 instead of 10. I'm still wondering why
> you made mention of "but be really sure that the devices really are
> working before you try this." If trying to bring the RAID back fails,
> I'm just back to not having access to the data which is where I am now
> :).
If you try reconstructing the array before you are sure you have resolved the
original problem (be it a BIOS setting, bad cables, a dodgy controller or even
a bad disk drive) then you risk compounding your problems, and at the very
least you are likely to waste time.
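A full sequential read of every member, for example, will often shake out
cable, controller, and media problems without writing anything (a sketch,
device names assumed):
# for d in sdb sdc sdd sde; do dd if=/dev/$d of=/dev/null bs=1M & done; wait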
Sometimes people are in such a hurry to get access to their data that they
cut corners to their detriment. I don't know if you are such a person, but
I mentioned it anyway just in case.
NeilBrown
Re: RAID10 failure(s)
on 15.02.2011 18:47:53 by Mark Keisler
On Mon, Feb 14, 2011 at 6:57 PM, NeilBrown wrote:
> If you try reconstructing the array before you are sure you have resolved the
> original problem (be it a BIOS setting, bad cables, a dodgy controller or even
> a bad disk drive) then you risk compounding your problems, and at the very
> least you are likely to waste time.
>
> NeilBrown
After checking things over, SMART tests were showing quite a few
Offline_Uncorrectable sectors and a high Current_Pending_Sector count on
the two drives that had failed out of the array.  Based on that, I
figured I had nothing to lose in trying to create the array again.  I
just went with the array in a degraded state on 3 drives, and was able
to activate the volumes on it and get off the part of the data that
wasn't backed up yet, before it failed again.
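For reference, the attributes in question can be pulled with something like
this (device name is just an example):
# smartctl -A /dev/sdb | grep -E 'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable'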
Stan's dd zeroing idea also confirmed it, judging by its output and the logs:
# dd if=/dev/zero of=/dev/sdb
dd: writing to `/dev/sdb': Input/output error
9368201+0 records in
9368200+0 records out
4796518400 bytes (4.8 GB) copied, 229.128 s, 20.9 MB/s
So, RMA of drives, keep smartd running, rebuild the array, load some
data and monitor :). Thanks for the help guys.
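For the monitoring part, a minimal smartd.conf entry along these lines should
do (the self-test schedule and mail address here are just placeholders):
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m root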