Recovering failed array

on 22.09.2011 20:07:33 by Alex

Hi,

I have a RAID5 array that has died and I need help recovering it.
Somehow two of the four partitions in the array have failed. The
server was completely dead, and had very little recognizable
information on the console before it was rebooted. I believe they were
kernel messages, but it wasn't a panic.

I'm able to read data from all four disks (using dd) but can't figure
out how to try and reassemble it. Here is some information I've
obtained by booting from a rescue CDROM and the mdadm.conf from
backup.
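
Something along these lines is what I mean by reading with dd (a rough
sketch from the rescue shell; device names as above, any read error
would show up here):

for d in /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2; do
    echo "== $d =="
    dd if="$d" of=/dev/null bs=1M || echo "read error on $d"
done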

Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] [linear]
md1 : inactive sda2[0] sdd2[4](S) sdb2[1]
205820928 blocks super 1.1

md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
255988 blocks super 1.0 [4/4] [UUUU]


# mdadm --add /dev/md1 /dev/sdd2
mdadm: Cannot open /dev/sdd2: Device or resource busy

# mdadm --run /dev/md1
mdadm: failed to run array /dev/md1: Input/output error

I've tried "--assemble --scan" and it also provides an IO error.

mdadm.conf:
# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md0 level=raid1 num-devices=4
UUID=9406b71d:8024a882:f17932f6:98d4df18
ARRAY /dev/md1 level=raid5 num-devices=4
UUID=f5bb8db9:85f66b43:32a8282a:fb664152

Any ideas greatly appreciated.
Thanks,
Alex

Re: Recovering failed array

on 22.09.2011 23:52:30 by Phil Turmel

Hi Alex,

More information please....

On 09/22/2011 02:07 PM, Alex wrote:
> Hi,
>
> I have a RAID5 array that has died and I need help recovering it.
> Somehow two of the four partitions in the array have failed. The
> server was completely dead, and had very little recognizable
> information on the console before it was rebooted. I believe they were
> kernel messages, but it wasn't a panic.
>
> I'm able to read data from all four disks (using dd) but can't figure
> out how to try and reassemble it. Here is some information I've
> obtained by booting from a rescue CDROM and the mdadm.conf from
> backup.
>
> Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] [linear]
> md1 : inactive sda2[0] sdd2[4](S) sdb2[1]
> 205820928 blocks super 1.1
>
> md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
> 255988 blocks super 1.0 [4/4] [UUUU]
>
>
> # mdadm --add /dev/md1 /dev/sdd2
> mdadm: Cannot open /dev/sdd2: Device or resource busy
>
> # mdadm --run /dev/md1
> mdadm: failed to run array /dev/md1: Input/output error
>
> I've tried "--assemble --scan" and it also provides an IO error.
>
> mdadm.conf:
> # mdadm.conf written out by anaconda
> MAILADDR root
> AUTO +imsm +1.x -all
> ARRAY /dev/md0 level=raid1 num-devices=4
> UUID=9406b71d:8024a882:f17932f6:98d4df18
> ARRAY /dev/md1 level=raid5 num-devices=4
> UUID=f5bb8db9:85f66b43:32a8282a:fb664152

Please show the output of "lsdrv" [1] and then "mdadm -D /dev/md[01]", and also "mdadm -E /dev/sd[abcd][12]"

(From within your rescue environment.) Some errors are likely, but get what you can.
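
Something like this from the rescue shell captures it all in one pass
(the output file names are only a suggestion):

mdadm -D /dev/md0 /dev/md1 > md-detail.txt 2>&1
mdadm -E /dev/sd[abcd][12] > md-examine.txt 2>&1
./lsdrv > lsdrv.txt 2>&1    # lsdrv downloaded from the link below and made executable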

Phil

[1] http://github.com/pturmel/lsdrv

Re: Recovering failed array

on 23.09.2011 00:39:10 by Alex

Hi,

>> Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] [linear]
>> md1 : inactive sda2[0] sdd2[4](S) sdb2[1]
>>       205820928 blocks super 1.1
>>
>> md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
>>       255988 blocks super 1.0 [4/4] [UUUU]
>>
>>
>> # mdadm --add /dev/md1 /dev/sdd2
>> mdadm: Cannot open /dev/sdd2: Device or resource busy
>>
>> # mdadm --run /dev/md1
>> mdadm: failed to run array /dev/md1: Input/output error
>>
>> I've tried "--assemble --scan" and it also provides an IO error.
>>
>> mdadm.conf:
>> # mdadm.conf written out by anaconda
>> MAILADDR root
>> AUTO +imsm +1.x -all
>> ARRAY /dev/md0 level=raid1 num-devices=4
>> UUID=9406b71d:8024a882:f17932f6:98d4df18
>> ARRAY /dev/md1 level=raid5 num-devices=4
>> UUID=f5bb8db9:85f66b43:32a8282a:fb664152
>
> Please show the output of "lsdrv" [1] and then "mdadm -D /dev/md[01]", and also "mdadm -E /dev/sd[abcd][12]"
>
> (From within your rescue environment.)  Some errors are likely, but get what you can.

Great, thanks for your offer to help. Great program you've written.
I've included the output here:

# mdadm -E /dev/sd[abcd][12]
http://pastebin.com/3JcBjiV6

# When I booted into the rescue CD again, it mounted md0 as md127
http://pastebin.com/yXnzzL6K

# lsdrv output (also included below)
http://pastebin.com/JkpgVNL4

The md1 array appeared as md125 and was inactive, so "mdadm -D" didn't
work. Here is how the arrays now appear. Obviously sdb2 should be part
of md125, not sitting in its own array as it does below.

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
[raid4] [raid10]
md125 : inactive sda2[0](S) sdc2[4](S)
137213952 blocks super 1.1

md126 : inactive sdb2[1](S)
68606976 blocks super 1.1

md127 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
255988 blocks super 1.0 [4/4] [UUUU]

unused devices:

lsdrv output:

PCI [pata_amd] 00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)
├─scsi 0:0:0:0 MATSHITA DVD-ROM SR-8178 {MATSHITADVD-ROM_SR-8178_}
│  └─sr0: [11:0] (iso9660) 352.23m 'sysrcd-2.31'
│     └─Mounted as /dev/sr0 @ /livemnt/boot
└─scsi 1:x:x:x [Empty]
PCI [aic79xx] 02:0a.0 SCSI storage controller: Adaptec AIC-7902 U320 (rev 10)
└─scsi 2:x:x:x [Empty]
PCI [aic79xx] 02:0a.1 SCSI storage controller: Adaptec AIC-7902 U320 (rev 10)
├─scsi 3:0:0:0 FUJITSU MAW3073NC {DAM9PA1001LL}
│  └─sda: [8:0] Partitioned (dos) 68.49g
│     ├─sda1: [8:1] MD raid1 (0/4) 250.00m md127 clean in_sync 'dbserv.guardiandigital.com:0' {9406b71d-8024-a882-f179-32f698d4df18}
│     │  └─md127: [9:127] (ext4) 249.99m {a99a461a-ff72-4bc0-9ccc-095b8a26f5e2}
│     ├─sda2: [8:2] MD raid5 (none/4) 65.43g md125 inactive spare 'dbserv.guardiandigital.com:1' {f5bb8db9-85f6-6b43-32a8-282afb664152}
│     │  └─md125: [9:125] Empty/Unknown 0.00k
│     └─sda3: [8:3] (swap) 2.82g {0c2eeeb1-fc35-4e43-9432-21cb005f8e05}
├─scsi 3:0:1:0 FUJITSU MAW3073NC {DAM9PA1001L5}
│  └─sdb: [8:16] Partitioned (dos) 68.49g
│     ├─sdb1: [8:17] MD raid1 (1/4) 250.00m md127 clean in_sync 'dbserv.guardiandigital.com:0' {9406b71d-8024-a882-f179-32f698d4df18}
│     ├─sdb2: [8:18] MD raid5 (none/4) 65.43g md126 inactive spare 'dbserv.guardiandigital.com:1' {f5bb8db9-85f6-6b43-32a8-282afb664152}
│     │  └─md126: [9:126] Empty/Unknown 0.00k
│     └─sdb3: [8:19] (swap) 2.82g {e36083c8-59a1-437b-8f93-6c624a8b0b90}
├─scsi 3:0:2:0 FUJITSU MAW3073NC {DAM9PA1001LD}
│  └─sdc: [8:32] Partitioned (dos) 68.49g
│     ├─sdc1: [8:33] MD raid1 (2/4) 250.00m md127 clean in_sync 'dbserv.guardiandigital.com:0' {9406b71d-8024-a882-f179-32f698d4df18}
│     ├─sdc2: [8:34] MD raid5 (none/4) 65.43g md125 inactive spare 'dbserv.guardiandigital.com:1' {f5bb8db9-85f6-6b43-32a8-282afb664152}
│     └─sdc3: [8:35] (swap) 2.82g {fe4c0314-7e61-475e-9034-1b90b23e817a}
└─scsi 3:0:3:0 FUJITSU MAW3073NC {DAM9PA1001LA}
   └─sdd: [8:48] Partitioned (dos) 68.49g
      ├─sdd1: [8:49] MD raid1 (3/4) 250.00m md127 clean in_sync 'dbserv.guardiandigital.com:0' {9406b71d-8024-a882-f179-32f698d4df18}
      ├─sdd2: [8:50] MD raid5 (4) 65.43g inactive 'dbserv.guardiandigital.com:1' {f5bb8db9-85f6-6b43-32a8-282afb664152}
      └─sdd3: [8:51] (swap) 2.82g {deaf299e-e988-4581-aba0-d061aa36914a}
Other Block Devices
├─loop0: [7:0] (squashfs) 288.07m
│  └─Mounted as /dev/loop0 @ /livemnt/squashfs
├─ram0: [1:0] Empty/Unknown 16.00m
├─ram1: [1:1] Empty/Unknown 16.00m
├─ram2: [1:2] Empty/Unknown 16.00m
├─ram3: [1:3] Empty/Unknown 16.00m
├─ram4: [1:4] Empty/Unknown 16.00m
├─ram5: [1:5] Empty/Unknown 16.00m
├─ram6: [1:6] Empty/Unknown 16.00m
├─ram7: [1:7] Empty/Unknown 16.00m
├─ram8: [1:8] Empty/Unknown 16.00m
├─ram9: [1:9] Empty/Unknown 16.00m
├─ram10: [1:10] Empty/Unknown 16.00m
├─ram11: [1:11] Empty/Unknown 16.00m
├─ram12: [1:12] Empty/Unknown 16.00m
├─ram13: [1:13] Empty/Unknown 16.00m
├─ram14: [1:14] Empty/Unknown 16.00m
└─ram15: [1:15] Empty/Unknown 16.00m

Re: Recovering failed array

on 23.09.2011 06:15:12 by NeilBrown

On Thu, 22 Sep 2011 18:39:10 -0400 Alex wrote:

> Hi,
> 
> >> Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] [linear]
> >> md1 : inactive sda2[0] sdd2[4](S) sdb2[1]
> >>       205820928 blocks super 1.1
> >>
> >> md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
> >>       255988 blocks super 1.0 [4/4] [UUUU]
> >>
> >>
> >> # mdadm --add /dev/md1 /dev/sdd2
> >> mdadm: Cannot open /dev/sdd2: Device or resource busy
> >>
> >> # mdadm --run /dev/md1
> >> mdadm: failed to run array /dev/md1: Input/output error
> >>
> >> I've tried "--assemble --scan" and it also provides an IO error.
> >>
> >> mdadm.conf:
> >> # mdadm.conf written out by anaconda
> >> MAILADDR root
> >> AUTO +imsm +1.x -all
> >> ARRAY /dev/md0 level=raid1 num-devices=4
> >> UUID=9406b71d:8024a882:f17932f6:98d4df18
> >> ARRAY /dev/md1 level=raid5 num-devices=4
> >> UUID=f5bb8db9:85f66b43:32a8282a:fb664152
> >
> > Please show the output of "lsdrv" [1] and then "mdadm -D /dev/md[01]", and also "mdadm -E /dev/sd[abcd][12]"
> >
> > (From within your rescue environment.)  Some errors are likely, but get what you can.
> 
> Great, thanks for your offer to help. Great program you've written.
> I've included the output here:
> 
> # mdadm -E /dev/sd[abcd][12]
> http://pastebin.com/3JcBjiV6
> 
> # When I booted into the rescue CD again, it mounted md0 as md127
> http://pastebin.com/yXnzzL6K
> 

Hmmm ... looks like a bit of a mess. Two devices that should be active
array members appear to be spares. I suspect you tried to --add them when you
shouldn't have. Newer versions of mdadm stop you from doing that, but older
versions don't. You only --add a device that you want to be a spare, not a
device that you think is part of the array.

All of the devices think that device 2 (the third in the array) should exist
and be working, but no device claims to be it. Presumably it is /dev/sdc2.
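
You can see this for yourself in the -E output by pulling out the role
each member claims, roughly (field names as printed for 1.x metadata):

for d in /dev/sd[abcd]2; do
    echo "== $d =="
    mdadm -E "$d" | grep -E 'Device Role|Array State|Events'
done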


You will need to recreate the array.
i.e.

mdadm -S /dev/md1
or
mdadm -S /dev/md125 /dev/md126

or whatever md arrays claim to be holding any of the 4 devices according
to /proc/mdstat.

Then

mdadm -C /dev/md1 -e 1.1 --level 5 -n 4 --chunk 512 --assume-clean \
/dev/sda2 /dev/sdb2 /dev/sdc2 missing

This will just re-write the metadata and assemble the array. It won't change
the data.
Then "fsck -n /dev/md1" and make sure it looks good.
If it does: good.
If not, try again with sdd2 in place of sdc2.
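
Spelled out, that retry would be roughly (only if the first fsck is
unhappy; sdd2 simply takes the slot sdc2 had):

mdadm -S /dev/md1
mdadm -C /dev/md1 -e 1.1 --level 5 -n 4 --chunk 512 --assume-clean \
   /dev/sda2 /dev/sdb2 /dev/sdd2 missing
fsck -n /dev/md1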

Once you are happy that you can see your data, you can add the other device
as a spare and it will rebuild.
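
e.g. roughly:

mdadm /dev/md1 --add /dev/sdd2   # whichever partition was left out of the create
cat /proc/mdstat                 # recovery progress shows up here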

You don't really need the --assume-clean above because a degraded RAID5 is
always assumed to be clean, but it is good practice to use --assume-clean
whenever re-creating an array which has real data on it.

Good luck,
NeilBrown


Re: Recovering failed array

on 25.09.2011 00:39:19 by Alex

Hi guys,

>> Great, thanks for your offer to help. Great program you've written.
>> I've included the output here:
>>
>> # mdadm -E /dev/sd[abcd][12]
>> http://pastebin.com/3JcBjiV6
>>
>> # When I booted into the rescue CD again, it mounted md0 as md127
>> http://pastebin.com/yXnzzL6K
>>
>
> Hmmm ... looks like a bit of a mess. Two devices that should be active
> array members appear to be spares. I suspect you tried to --add them when you
> shouldn't have. Newer versions of mdadm stop you from doing that, but older
> versions don't. You only --add a device that you want to be a spare, not a
> device that you think is part of the array.

Yes, that is what I did.

> All of the devices think that device 2 (the third in the array) should exist
> and be working, but no device claims to be it. Presumably it is /dev/sdc2.
>
>
> You will need to recreate the array.
> i.e.
>
>  mdadm -S /dev/md1
> or
>  mdadm -S /dev/md125 /dev/md126
>
> or whatever md arrays claim to be holding any of the 4 devices according
> to /proc/mdstat.
>
> Then
>
>  mdadm -C /dev/md1 -e 1.1 --level 5 -n 4 --chunk 512 --assume-clean \
>    /dev/sda2 /dev/sdb2 /dev/sdc2 missing
>
> This will just re-write the metadata and assemble the array. It won't change
> the data.
> Then "fsck -n /dev/md1" and make sure it looks good.
> If it does: good.
> If not, try again with sdd2 in place of sdc2.
>
> Once you are happy that you can see your data, you can add the other device
> as a spare and it will rebuild.

Just wanted to let you know that these instructions worked perfectly,
and thanks for all your help.

Any idea how this would have happened in the first place? Somehow two
of the three RAID5 devices failed at once. I've checked all the disks
with the Fujitsu disk scanner, and it found no physical errors. I also
don't think the disks were disturbed in any way while they were
operating.
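
I'm also thinking of checking the drives' own SMART error and self-test
logs (assuming smartmontools is on the rescue CD), to see whether the
members dropped for media reasons or something transient like a cable or
controller glitch, roughly:

for d in /dev/sd[abcd]; do
    echo "== $d =="
    smartctl -H -l error -l selftest "$d"
done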

Thanks again,
Alex