Need help recovering RAID5 array

on 05.08.2011 17:27:06 by Stephen Muskiewicz

Hello,

I'm hoping to figure out how I can recover a RAID5 array that suddenly
won't start after one of our servers took a power hit.
I'm fairly confident that all the individual disks of the RAID are OK
and that I can recover my data (without having to resort to asking my
sysadmin to fetch the backup tapes), but despite my extensive Googling
and reviewing the list archives and mdadm manpage, so far nothing I've
tried has worked. Hopefully I am just missing something simple.

Background: The server is a Sun X4500 (thumper) running CentOS 5.5. I
have confirmed using the (Sun provided) "hd" utilities that all of the
individual disks are online and none of the device names appear to have
changed from before the power outage. There are also two other RAID5
arrays as well as the /dev/md0 RAID1 OS mirror on the same box that did
come back cleanly (these have ext3 filesystems on them; the one that
failed to come up is just a raw partition used via iSCSI, if that makes
any difference). The array that didn't come back is /dev/md/51, the
ones that did are /dev/md/52 and /dev/md/53. I have confirmed that all
three device files do exist in /dev/md. (/dev/md51 is also a symlink to
/dev/md/51, as are /dev/md52 and /dev/md53 for the working arrays). We
also did quite a bit of testing on the box before we deployed the arrays
and haven't seen this problem before now; previously all of the arrays
came back online as expected. Of course it has also been about 7 months
since the box last went down, but I don't think there were any major
changes since then.

When I boot the system (tried this twice including a hard power down
just to be sure), I see "mdadm: No suitable drives found for /dev/md51".
Again, the other 2 arrays come up just fine. I have checked that the
array is listed in /etc/mdadm.conf.
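For reference, the entry for this array in my mdadm.conf is along these lines:

ARRAY /dev/md/51 level=raid5 num-devices=8 spares=2 name=tsongas_archive uuid=41aa414e:cfe1a5ae:3768e4ef:0084904e

so mdadm should already know the name and UUID at boot time.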

(I will apologize for a lack of specific mdadm output in my details
below, the network people have conveniently (?) picked this weekend to
upgrade the network in our campus building and I am currently unable to
access the server until they are done!)

"mdadm --detail /dev/md/51" does (as expected?) display: "mdadm: md
device /dev/md51 does not appear to be active"

I have done an "mdadm --examine" on each of the drives in the array and
each one shows a state of "clean" with a status of "U" (and all of the
other drives in the sequence shown as "u"). The array name and UUID
value look good and the "update time" appears to be about when the
server lost power. All the checksums read "correct" as well. So I'm
confident all the individual drives are there and OK.
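For reference, I checked them with something along these lines (the device
list is just what I believe the members to be):

for d in /dev/sd[a-j]1; do
    echo "== $d =="
    mdadm --examine "$d" | egrep 'State|UUID|Update Time|Checksum'
done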

I do have the original mdadm command used to construct the array.
(There are 8 active disks in the array plus 2 spares.) I am using
version 1.0 metadata with the -N arg to provide a name for each array.
So I used this command with the assemble option (but without the -N or
-u options):

mdadm -A /dev/md/51 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
/dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1

But this just gave the "no suitable drives found" message.

I retried the mdadm command using the -N and -u options but in
both cases saw the same result.
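Roughly, those variants looked like the following, with the array name and
UUID taken from my mdadm.conf (the glob expands to the same ten partitions
listed above):

mdadm -A /dev/md/51 -N tsongas_archive /dev/sd[a-j]1
mdadm -A /dev/md/51 -u 41aa414e:cfe1a5ae:3768e4ef:0084904e /dev/sd[a-j]1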

One odd thing that I noticed was that when I ran:
mdadm --detail --scan

The output *does* display all three arrays, but the name of the arrays
shows up as "ARRAY /dev/md/<name>" rather than the "ARRAY /dev/md/NN"
that I would expect (and that is in my /etc/mdadm.conf file). Not sure
if this has anything to do with the problem or not. There are no
/dev/md/<name> device files or symlinks on the system.

I *think* my next step based on the various posts I've read would be to
try the same mdadm -A command with --force, but I'm a little wary of
that and want to make sure I actually understand what I'm doing so I
don't screw up the array entirely and lose all my data! I'm not sure
whether I should be giving it *all* of the drives as args, including the
spares, or just pass it the active drives. Should I use the
--raid-devices and/or --spare-devices options? Anything else I should
include or not include?

Thanks in advance for any advice you can provide. I won't be able to
test until Monday morning but it would be great to be armed with things
to try so I can hopefully get back up and running soon and minimize all
of those "When will the network share be back up?" questions that I'm
already anticipating getting.

Cheers,
-steve



Re: Need help recovering RAID5 array

on 06.08.2011 03:29:10 by NeilBrown

On Fri, 5 Aug 2011 11:27:06 -0400 Stephen Muskiewicz
wrote:

> Hello,
>
> I'm hoping to figure out how I can recover a RAID5 array that suddenly
> won't start after one of our servers took a power hit.
> I'm fairly confident that all the individual disks of the RAID are OK
> and that I can recover my data (without having to resort to asking my
> sysadmin to fetch the backup tapes), but despite my extensive Googling
> and reviewing the list archives and mdadm manpage, so far nothing I've
> tried has worked. Hopefully I am just missing something simple.
>
> Background: The server is a Sun X4500 (thumper) running CentOS 5.5. I
> have confirmed using the (Sun provided) "hd" utilities that all of the
> individual disks are online and none of the device names appear to have
> changed from before the power outage. There are also two other RAID5
> arrays as well as the /dev/md0 RAID1 OS mirror on the same box that did
> come back cleanly (these have ext3 filesystems on them, the one that
> failed to come up is just a raw partition used via iSCSI if that makes
> any difference.) The array that didn't come back is /dev/md/51, the
> ones that did are /dev/md/52 and /dev/md/53. I have confirmed that all
> three device files do exist in /dev/md. (/dev/md51 is also a symlink to
> /dev/md/51, as are /dev/md52 and /dev/md53 for the working arrays). We
> also did quite a bit of testing on the box before we deployed the arrays
> and haven't seen this problem before now, previously all of the arrays
> came back online as expected. Of course it has also been about 7 months
> since the box has gone down but I don't think there were any major
> changes since then.
>
> When I boot the system (tried this twice including a hard power down
> just to be sure), I see "mdadm: No suitable drives found for /dev/md51".
> Again the other 2 arrays come up just fine. I have checked that the
> array is listed in /etc/mdadm.conf
>
> (I will apologize for a lack of specific mdadm output in my details
> below, the network people have conveniently (?) picked this weekend to
> upgrade the network in our campus building and I am currently unable to
> access the server until they are done!)
>
> "mdadm --detail /dev/md/51" does (as expected?) display: "mdadm: md
> device /dev/md51 does not appear to be active"
>
> I have done an "mdadm --examine" on each of the drives in the array and
> each one shows a state of "clean" with a status of "U" (and all of the
> other drives in the sequence shown as "u"). The array name and UUID
> value look good and the "update time" appears to be about when the
> server lost power. All the checksums read "correct" as well. So I'm
> confident all the individual drives are there and OK.
>
> I do have the original mdadm command used to construct the array.
> (There are 8 active disks in the array plus 2 spares.) I am using
> version 1.0 metadata with the -N arg to provide a name for each array.
> So I used this command with the assemble option (but without the -N or
> -u) options:
>
> mdadm -A /dev/md/51 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
> /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1
>
> But this just gave the "no suitable drives found" message.
>
> I retried the mdadm command using -N and -u options but in
> both cases saw the same result.
>
> One odd thing that I noticed was that when I ran an:
> mdadm --detail --scan
>
> The output *does* display all three arrays, but the name of the arrays
> shows up as "ARRAY /dev/md/<name>" rather than the "ARRAY /dev/md/NN"
> that I would expect (and that is in my /etc/mdadm.conf file). Not sure
> if this has anything to do with the problem or not. There are no
> /dev/md/<name> device files or symlinks on the system.

So maybe the only problem is that the names are missing from /dev/md/ ???

When you can access the server again, could you report:

cat /proc/mdstat
grep md /proc/partitions
ls -l /dev/md*

and maybe
mdadm -Ds
mdadm -Es
cat /etc/mdadm.conf

just for completeness.


It certainly looks like your data is all there but maybe not appearing
exactly where you expect it.



>
> I *think* my next step based on the various posts I've read would be to
> try the same mdadm -A command with --force, but I'm a little wary of
> that and want to make sure I actually understand what I'm doing so I
> don't screw up the array entirely and lose all my data! I'm not sure if
> I should be giving it *all* of the drives as an arg, including the
> spares or should I just pass it the active drives? Should I use the
> --raid-devices and/or --spare-devices options? Anything else I should
> include or not include?

When you do a "-A --force" you do give it all the drives that might be part
of the array so it has maximum information.
--spare-devices and --raid-devices are not meaningful with --assemble.
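So, as a rough sketch (using the device names from your first mail; substitute
whatever the real names turn out to be):

mdadm -A /dev/md/51 --force /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 \
      /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1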


NeilBrown


>
> Thanks in advance to any advice you can provide. I won't be able to
> test until Monday morning but it would be great to be armed with things
> to try so I can hopefully get back up and running soon and minimize all
> of those "When will the network share be back up?" questions that I'm
> already anticipating getting.
>
> Cheers,
> -steve
>
>


RE: Need help recovering RAID5 array

on 08.08.2011 19:41:34 by Stephen Muskiewicz

> -----Original Message-----
> From: NeilBrown [mailto:neilb@suse.de]
> Sent: Friday, August 05, 2011 9:29 PM
> To: Muskiewicz, Stephen C
> Cc: linux-raid@vger.kernel.org
> Subject: Re: Need help recovering RAID5 array
>
> On Fri, 5 Aug 2011 11:27:06 -0400 Stephen Muskiewicz
> wrote:
>
> > Hello,
> >
> > I'm hoping to figure out how I can recover a RAID5 array that
> suddenly
> > won't start after one of our servers took a power hit.
> > I'm fairly confident that all the individual disks of the RAID are OK
> > and that I can recover my data (without having to resort to asking my
> > sysadmin to fetch the backup tapes), but despite my extensive
> Googling
> > and reviewing the list archives and mdadm manpage, so far nothing
> I've
> > tried has worked. Hopefully I am just missing something simple.
> >
> > Background: The server is a Sun X4500 (thumper) running CentOS 5.5.
> I
> > have confirmed using the (Sun provided) "hd" utilities that all of
> the
> > individual disks are online and none of the device names appear to
> have
> > changed from before the power outage. There are also two other RAID5
> > arrays as well as the /dev/md0 RAID1 OS mirror on the same box that
> did
> > come back cleanly (these have ext3 filesystems on them, the one that
> > failed to come up is just a raw partition used via iSCSI if that
> makes
> > any difference.) The array that didn't come back is /dev/md/51, the
> > ones that did are /dev/md/52 and /dev/md/53. I have confirmed that
> all
> > three device files do exist in /dev/md. (/dev/md51 is also a symlink
> to
> > /dev/md/51, as are /dev/md52 and /dev/md53 for the working arrays).
> We
> > also did quite a bit of testing on the box before we deployed the
> arrays
> > and haven't seen this problem before now, previously all of the
> arrays
> > came back online as expected. Of course it has also been about 7
> months
> > since the box has gone down but I don't think there were any major
> > changes since then.
> >
> > When I boot the system (tried this twice including a hard power down
> > just to be sure), I see "mdadm: No suitable drives found for
> /dev/md51".
> > Again the other 2 arrays come up just fine. I have checked that
> the
> > array is listed in /etc/mdadm.conf
> >
> > (I will apologize for a lack of specific mdadm output in my details
> > below, the network people have conveniently (?) picked this weekend
> to
> > upgrade the network in our campus building and I am currently unable
> to
> > access the server until they are done!)
> >
> > "mdadm --detail /dev/md/51" does (as expected?) display: "mdadm: md
> > device /dev/md51 does not appear to be active"
> >
> > I have done an "mdadm --examine" on each of the drives in the array
> and
> > each one shows a state of "clean" with a status of "U" (and all of
> the
> > other drives in the sequence shown as "u"). The array name and UUID
> > value look good and the "update time" appears to be about when the
> > server lost power. All the checksums read "correct" as well. So I'm
> > confident all the individual drives are there and OK.
> >
> > I do have the original mdadm command used to construct the array.
> > (There are 8 active disks in the array plus 2 spares.) I am using
> > version 1.0 metadata with the -N arg to provide a name for each
> array.
> > So I used this command with the assemble option (but without the -N
> or
> > -u) options:
> >
> > mdadm -A /dev/md/51 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
> > /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1
> >
> > But this just gave the "no suitable drives found" message.
> >
> > I retried the mdadm command using -N and -u options but
> in
> > both cases saw the same result.
> >
> > One odd thing that I noticed was that when I ran an:
> > mdadm --detail --scan
> >
> > The output *does* display all three arrays, but the name of the
> arrays
> > shows up as "ARRAY /dev/md/" rather than the "ARRAY
> > /dev/md/NN" that I would expect (and that is in my /etc/mdadm.conf
> > file). Not sure if this has anything to do with the problem or not.
> > There are no /dev/md/ device files or symlinks on the
> system.
>
> So maybe the only problem is that the names are missing from /dev/md/
> ???

I tried creating a symlink /dev/md/tsongas_archive to /dev/md/51 but still got the "no suitable drives" error when trying to assemble (using either /dev/md/51 or /dev/md/tsongas_archive).

>
> When you can access the server again, could you report:
>
> cat /proc/mdstat
> grep md /proc/partitions
> ls -l /dev/md*
>
> and maybe
> mdadm -Ds
> mdadm -Es
> cat /etc/mdadm.conf
>
> just for completeness.
>
>
> It certainly looks like your data is all there but maybe not appearing
> exactly where you expect it.
>

Here it all is:

[root@libthumper1 ~]# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md53 : active raid5 sdae1[0] sds1[8](S) sdai1[9](S) sdk1[10] sdam1[6] sdo1[5] sdau1[4] sdaq1[3] sdw1[2] sdaa1[1]
3418686208 blocks super 1.0 level 5, 128k chunk, algorithm 2 [8/8] [UUUUUUUU]

md52 : active raid5 sdad1[0] sdf1[11](S) sdz1[10](S) sdb1[12] sdn1[8] sdj1[7] sdal1[6] sdah1[5] sdat1[4] sdap1[3] sdv1[2] sdr1[1]
4395453696 blocks super 1.0 level 5, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]

md0 : active raid1 sdac2[0] sdy2[1]
480375552 blocks [2/2] [UU]

unused devices: <none>

[root@libthumper1 ~]# grep md /proc/partitions
9 0 480375552 md0
9 52 4395453696 md52
9 53 3418686208 md53


[root@libthumper1 ~]# ls -l /dev/md*
brw-r----- 1 root disk 9, 0 Aug 4 15:25 /dev/md0
lrwxrwxrwx 1 root root 5 Aug 4 15:25 /dev/md51 -> md/51

lrwxrwxrwx 1 root root 5 Aug 4 15:25 /dev/md52 -> md/52

lrwxrwxrwx 1 root root 5 Aug 4 15:25 /dev/md53 -> md/53


/dev/md:
total 0
brw-r----- 1 root disk 9, 51 Aug 4 15:25 51
brw-r----- 1 root disk 9, 52 Aug 4 15:25 52
brw-r----- 1 root disk 9, 53 Aug 4 15:25 53

[root@libthumper1 ~]# mdadm -Ds
ARRAY /dev/md0 level=raid1 num-devices=2 metadata=0.90 UUID=e30f5b25:6dc28a02:1b03ab94:da5913ed
ARRAY /dev/md52 level=raid5 num-devices=10 metadata=1.00 spares=2 name=vmware_storage UUID=c436b591:01a4be5f:2736d7dd:3b97d872
ARRAY /dev/md53 level=raid5 num-devices=8 metadata=1.00 spares=2 name=backup_mirror UUID=9bb89570:675f47be:2fe2f481:ebc33388

[root@libthumper1 ~]# mdadm -Es
ARRAY /dev/md2 level=raid1 num-devices=6 UUID=d08b45a4:169e4351:02cff74a:c70fcb00
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=e30f5b25:6dc28a02:1b03ab94:da5913ed
ARRAY /dev/md/tsongas_archive level=raid5 metadata=1.0 num-devices=8 UUID=41aa414e:cfe1a5ae:3768e4ef:0084904e name=tsongas_archive
ARRAY /dev/md/vmware_storage level=raid5 metadata=1.0 num-devices=10 UUID=c436b591:01a4be5f:2736d7dd:3b97d872 name=vmware_storage
ARRAY /dev/md/backup_mirror level=raid5 metadata=1.0 num-devices=8 UUID=9bb89570:675f47be:2fe2f481:ebc33388 name=backup_mirror

[root@libthumper1 ~]# cat /etc/mdadm.conf

# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR sysadmins
MAILFROM root@libthumper1.uml.edu
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=e30f5b25:6dc28a02:1b03ab94:da5913ed
ARRAY /dev/md/51 level=raid5 num-devices=8 spares=2 name=tsongas_archive uuid=41aa414e:cfe1a5ae:3768e4ef:0084904e
ARRAY /dev/md/52 level=raid5 num-devices=10 spares=2 name=vmware_storage uuid=c436b591:01a4be5f:2736d7dd:3b97d872
ARRAY /dev/md/53 level=raid5 num-devices=8 spares=2 name=backup_mirror uuid=9bb89570:675f47be:2fe2f481:ebc33388

It looks like the md51 device isn't appearing in /proc/partitions; I'm not sure why that is.

I also just noticed the /dev/md2 that appears in the mdadm -Es output, not sure what that is but I don't recognize it as anything that was previously on that box. (There is no /dev/md2 device file). Not sure if that is related at all or just a red herring...

For good measure, here's some actual mdadm -E output for the specific drives (I won't include all as they all seem to be about the same):

[root@libthumper1 ~]# mdadm -E /dev/sd[qui]1
/dev/sdi1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x0
Array UUID : 41aa414e:cfe1a5ae:3768e4ef:0084904e
Name : tsongas_archive
Creation Time : Thu Feb 24 11:43:37 2011
Raid Level : raid5
Raid Devices : 8

Avail Dev Size : 976767728 (465.76 GiB 500.11 GB)
Array Size : 6837372416 (3260.31 GiB 3500.73 GB)
Used Dev Size : 976767488 (465.76 GiB 500.10 GB)
Super Offset : 976767984 sectors
State : clean
Device UUID : 750e6410:661d4838:0a5f7581:7c110cf1

Update Time : Thu Aug 4 06:41:23 2011
Checksum : 20bb0567 - correct
Events : 18446744073709551615

Layout : left-symmetric
Chunk Size : 128K
Array Slot : 5 (0, 1, 2, 3, 4, 5, 6, 7)
Array State : uuuuuUuu

/dev/sdq1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x0
Array UUID : 41aa414e:cfe1a5ae:3768e4ef:0084904e
Name : tsongas_archive
Creation Time : Thu Feb 24 11:43:37 2011
Raid Level : raid5
Raid Devices : 8

Avail Dev Size : 976767728 (465.76 GiB 500.11 GB)
Array Size : 6837372416 (3260.31 GiB 3500.73 GB)
Used Dev Size : 976767488 (465.76 GiB 500.10 GB)
Super Offset : 976767984 sectors
State : clean
Device UUID : 3a1b81cc:8b03dec1:ce27abeb:33598b7b

Update Time : Thu Aug 4 06:41:23 2011
Checksum : 5b2308c8 - correct
Events : 18446744073709551615
Layout : left-symmetric
Chunk Size : 128K

Array Slot : 0 (0, 1, 2, 3, 4, 5, 6, 7)
Array State : Uuuuuuuu

/dev/sdu1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x0
Array UUID : 41aa414e:cfe1a5ae:3768e4ef:0084904e
Name : tsongas_archive
Creation Time : Thu Feb 24 11:43:37 2011
Raid Level : raid5
Raid Devices : 8

Avail Dev Size : 976767728 (465.76 GiB 500.11 GB)
Array Size : 6837372416 (3260.31 GiB 3500.73 GB)
Used Dev Size : 976767488 (465.76 GiB 500.10 GB)
Super Offset : 976767984 sectors
State : clean
Device UUID : df0c9e89:bb801e58:c17c0adf:57625ef7

Update Time : Thu Aug 4 06:41:23 2011
Checksum : 1db2d5b5 - correct
Events : 18446744073709551615

Layout : left-symmetric
Chunk Size : 128K

Array Slot : 1 (0, 1, 2, 3, 4, 5, 6, 7)
Array State : uUuuuuuu

Is that huge number for the event count perhaps a problem?

>
>
> >
> > I *think* my next step based on the various posts I've read would be
> to
> > try the same mdadm -A command with --force, but I'm a little wary of
> > that and want to make sure I actually understand what I'm doing so I
> > don't screw up the array entirely and lose all my data! I'm not sure
> if
> > I should be giving it *all* of the drives as an arg, including the
> > spares or should I just pass it the active drives? Should I use the
> > --raid-devices and/or --spare-devices options? Anything else I
> should
> > include or not include?
>
> When you do a "-A --force" you do give it all they drives that might be
> part
> of the array so it has maximum information.
> --spare-devices and --raid-devices are not meaningful with --assemble.
>
>

OK, so I tried with --force and here's what I got. (BTW, the device names are different from my original email since I didn't have access to the server before; I used the real device names exactly as when I originally created the array, sorry for any confusion.)

mdadm -A /dev/md/51 --force /dev/sdq1 /dev/sdu1 /dev/sdao1 /dev/sdas1 /dev/sdag1 /dev/sdi1 /dev/sdm1 /dev/sda1 /dev/sdak1 /dev/sde1

mdadm: forcing event count in /dev/sdq1(0) from -1 upto -1
mdadm: forcing event count in /dev/sdu1(1) from -1 upto -1
mdadm: forcing event count in /dev/sdao1(2) from -1 upto -1
mdadm: forcing event count in /dev/sdas1(3) from -1 upto -1
mdadm: forcing event count in /dev/sdag1(4) from -1 upto -1
mdadm: forcing event count in /dev/sdi1(5) from -1 upto -1
mdadm: forcing event count in /dev/sdm1(6) from -1 upto -1
mdadm: forcing event count in /dev/sda1(7) from -1 upto -1
mdadm: failed to RUN_ARRAY /dev/md/51: Input/output error

Additionally I got a bunch of messages on the console, first was:

Kicking non-fresh sdak1 from array

This was repeated for each device, *except* the first drive (/dev/sdq1) and the last spare (/dev/sde1).

After those messages came the following (sorry if not exact, I had to retype it as cut/paste from the KVM console wasn't working):

raid5: not enough operational devices for md51 (7/8 failed)
RAID5 conf printout:
--- rd:8 wd:1 fd:7
disk 0, o:1, dev:sdq1

After this, here's the output of mdadm --detail /dev/md/51:

/dev/md/51:
Version : 1.00
Creation Time : Thu Feb 24 11:43:37 2011
Raid Level : raid5
Used Dev Size : 488383744 (465.76 GiB 500.10 GB)
Raid Devices : 8
Total Devices : 1
Preferred Minor : 51
Persistence : Superblock is persistent

Update Time : Thu Aug 4 06:41:23 2011
State : active, degraded, Not Started
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 128K

Name : tsongas_archive
UUID : 41aa414e:cfe1a5ae:3768e4ef:0084904e
Events : 18446744073709551615

Number Major Minor RaidDevice State
0 65 1 0 active sync /dev/sdq1
1 0 0 1 removed
2 0 0 2 removed
3 0 0 3 removed
4 0 0 4 removed
5 0 0 5 removed
6 0 0 6 removed
7 0 0 7 removed


So even with --force, the results don't look very promising. Could it have something to do with the "non-fresh" messages or the really large event count?

Anything further I can try, aside from going to fetch the tape backups? :-0

Thanks much!
-steve



Re: Need help recovering RAID5 array

on 09.08.2011 01:12:14 by NeilBrown

On Mon, 8 Aug 2011 17:41:34 +0000 "Muskiewicz, Stephen C"
wrote:

> I tried creating a symlink /dev/md/tsongas_archive to /dev/md/51 but still got the "no suitable drives" error when trying to assemble (using both /dev/md/51 or /dev/md/tsongas_archive)
>
> >
> > When you can access the server again, could you report:
> >
> > cat /proc/mdstat
> > grep md /proc/partitions
> > ls -l /dev/md*
> >
> > and maybe
> > mdadm -Ds
> > mdadm -Es
> > cat /etc/mdadm.conf
> >
> > just for completeness.
> >
> >
> > It certainly looks like your data is all there but maybe not appearing
> > exactly where you expect it.
> >
>
> Here is all is:
>
> [root@libthumper1 ~]# cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md53 : active raid5 sdae1[0] sds1[8](S) sdai1[9](S) sdk1[10] sdam1[6] sdo1[5] sdau1[4] sdaq1[3] sdw1[2] sdaa1[1]
> 3418686208 blocks super 1.0 level 5, 128k chunk, algorithm 2 [8/8] [UUUUUUUU]
>
> md52 : active raid5 sdad1[0] sdf1[11](S) sdz1[10](S) sdb1[12] sdn1[8] sdj1[7] sdal1[6] sdah1[5] sdat1[4] sdap1[3] sdv1[2] sdr1[1]
> 4395453696 blocks super 1.0 level 5, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
>
> md0 : active raid1 sdac2[0] sdy2[1]
> 480375552 blocks [2/2] [UU]
>
> unused devices:
>
> [root@libthumper1 ~]# grep md /proc/partitions
> 9 0 480375552 md0
> 9 52 4395453696 md52
> 9 53 3418686208 md53
>
>
> [root@libthumper1 ~]# ls -l /dev/md*
> brw-r----- 1 root disk 9, 0 Aug 4 15:25 /dev/md0
> lrwxrwxrwx 1 root root 5 Aug 4 15:25 /dev/md51 -> md/51
>
> lrwxrwxrwx 1 root root 5 Aug 4 15:25 /dev/md52 -> md/52
>
> lrwxrwxrwx 1 root root 5 Aug 4 15:25 /dev/md53 -> md/53
>
>
> /dev/md:
> total 0
> brw-r----- 1 root disk 9, 51 Aug 4 15:25 51
> brw-r----- 1 root disk 9, 52 Aug 4 15:25 52
> brw-r----- 1 root disk 9, 53 Aug 4 15:25 53
>
> [root@libthumper1 ~]# mdadm -Ds
> ARRAY /dev/md0 level=raid1 num-devices=2 metadata=0.90 UUID=e30f5b25:6dc28a02:1b03ab94:da5913ed
> ARRAY /dev/md52 level=raid5 num-devices=10 metadata=1.00 spares=2 name=vmware_storage UUID=c436b591:01a4be5f:2736d7dd:3b97d872
> ARRAY /dev/md53 level=raid5 num-devices=8 metadata=1.00 spares=2 name=backup_mirror UUID=9bb89570:675f47be:2fe2f481:ebc33388
>
> [root@libthumper1 ~]# mdadm -Es
> ARRAY /dev/md2 level=raid1 num-devices=6 UUID=d08b45a4:169e4351:02cff74a:c70fcb00
> ARRAY /dev/md0 level=raid1 num-devices=2 UUID=e30f5b25:6dc28a02:1b03ab94:da5913ed
> ARRAY /dev/md/tsongas_archive level=raid5 metadata=1.0 num-devices=8 UUID=41aa414e:cfe1a5ae:3768e4ef:0084904e name=tsongas_archive
> ARRAY /dev/md/vmware_storage level=raid5 metadata=1.0 num-devices=10 UUID=c436b591:01a4be5f:2736d7dd:3b97d872 name=vmware_storage
> ARRAY /dev/md/backup_mirror level=raid5 metadata=1.0 num-devices=8 UUID=9bb89570:675f47be:2fe2f481:ebc33388 name=backup_mirror
>
> [root@libthumper1 ~]# cat /etc/mdadm.conf
>
> # mdadm.conf written out by anaconda
> DEVICE partitions
> MAILADDR sysadmins
> MAILFROM root@libthumper1.uml.edu
> ARRAY /dev/md0 level=raid1 num-devices=2 uuid=e30f5b25:6dc28a02:1b03ab94:da5913ed
> ARRAY /dev/md/51 level=raid5 num-devices=8 spares=2 name=tsongas_archive uuid=41aa414e:cfe1a5ae:3768e4ef:0084904e
> ARRAY /dev/md/52 level=raid5 num-devices=10 spares=2 name=vmware_storage uuid=c436b591:01a4be5f:2736d7dd:3b97d872
> ARRAY /dev/md/53 level=raid5 num-devices=8 spares=2 name=backup_mirror uuid=9bb89570:675f47be:2fe2f481:ebc33388
>
> It looks like the md51 device isn't appearing in /proc/partitions, not sure why that is?
>
> I also just noticed the /dev/md2 that appears in the mdadm -Es output, not sure what that is but I don't recognize it as anything that was previously on that box. (There is no /dev/md2 device file). Not sure if that is related at all or just a red herring...
>
> For good measure, here's some actual mdadm -E output for the specific drives (I won't include all as they all seem to be about the same):
>
> [root@libthumper1 ~]# mdadm -E /dev/sd[qui]1
> /dev/sdi1:
> Magic : a92b4efc
> Version : 1.0
> Feature Map : 0x0
> Array UUID : 41aa414e:cfe1a5ae:3768e4ef:0084904e
> Name : tsongas_archive
> Creation Time : Thu Feb 24 11:43:37 2011
> Raid Level : raid5
> Raid Devices : 8
>
> Avail Dev Size : 976767728 (465.76 GiB 500.11 GB)
> Array Size : 6837372416 (3260.31 GiB 3500.73 GB)
> Used Dev Size : 976767488 (465.76 GiB 500.10 GB)
> Super Offset : 976767984 sectors
> State : clean
> Device UUID : 750e6410:661d4838:0a5f7581:7c110cf1
>
> Update Time : Thu Aug 4 06:41:23 2011
> Checksum : 20bb0567 - correct
> Events : 18446744073709551615

....

>
> Is that huge number for the event count perhaps a problem?

Could be. That number is 0xffff,ffff,ffff,ffff, i.e. 2^64-1.
It cannot get any bigger than that.

> >
>
> OK so I tried with the --force and here's what I got (BTW the device names are different from my original email since I didn't have access to the server before, but I used the real device names exactly as when I originally created the array, sorry for any confusion)
>
> mdadm -A /dev/md/51 --force /dev/sdq1 /dev/sdu1 /dev/sdao1 /dev/sdas1 /dev/sdag1 /dev/sdi1 /dev/sdm1 /dev/sda1 /dev/sdak1 /dev/sde1
>
> mdadm: forcing event count in /dev/sdq1(0) from -1 upto -1
> mdadm: forcing event count in /dev/sdu1(1) from -1 upto -1
> mdadm: forcing event count in /dev/sdao1(2) from -1 upto -1
> mdadm: forcing event count in /dev/sdas1(3) from -1 upto -1
> mdadm: forcing event count in /dev/sdag1(4) from -1 upto -1
> mdadm: forcing event count in /dev/sdi1(5) from -1 upto -1
> mdadm: forcing event count in /dev/sdm1(6) from -1 upto -1
> mdadm: forcing event count in /dev/sda1(7) from -1 upto -1
> mdadm: failed to RUN_ARRAY /dev/md/51: Input/output error

and sometimes "2^64-1" looks like "-1".
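(A quick way to see that aliasing from a shell, at least with bash's 64-bit
signed arithmetic, where the hex form of 2^64-1 wraps around to -1:

   echo $(( 0xffffffffffffffff ))    # prints -1 here

Same bit pattern, just interpreted as signed instead of unsigned.)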

We just need to replace that "-1" with a more useful number.

It looks like the "--force" might have made a little bit of a mess but we
should be able to recover it.

Could you:
apply the following patch and build a new 'mdadm'.
mdadm -S /dev/md/51
mdadm -A /dev/md/51 --update=summaries -vv \
     /dev/sdq1 /dev/sdu1 /dev/sdao1 /dev/sdas1 /dev/sdag1 /dev/sdi1 /dev/sdm1 /dev/sda1 /dev/sdak1 /dev/sde1

and if that doesn't work, repeat the same two commands but add "--force" to
the second. Make sure you keep the "-vv" in both cases.

then report the results.

I wonder how the event count got that high. There aren't enough seconds
since the birth of the universe for it to have happened naturally...


Thanks,
NeilBrown

diff --git a/super1.c b/super1.c
index 35e92a3..4a3341a 100644
--- a/super1.c
+++ b/super1.c
@@ -803,6 +803,8 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
 			__le64_to_cpu(sb->data_size));
 	} else if (strcmp(update, "_reshape_progress")==0)
 		sb->reshape_position = __cpu_to_le64(info->reshape_progress);
+	else if (strcmp(update, "summaries") == 0)
+		sb->events = __cpu_to_le64(4);
 	else
 		rv = -1;
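For anyone following along, a rough sketch of one way to do the patch-and-build
step without touching the distro mdadm (the tarball location and file names are
just illustrative):

   cd /root
   wget http://www.kernel.org/pub/linux/utils/raid/mdadm/mdadm-3.2.2.tar.gz
   tar xzf mdadm-3.2.2.tar.gz
   cd mdadm-3.2.2
   patch -p1 < ../summaries-events.patch   # the diff above, saved to a file
   make
   ./mdadm -V    # sanity check; then use this ./mdadm binary rather than /sbin/mdadm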


Re: Need help recovering RAID5 array

on 09.08.2011 04:29:10 by Stephen Muskiewicz

On 8/8/2011 7:12 PM, NeilBrown wrote:
>> [root@libthumper1 ~]# cat /proc/mdstat
>> Personalities : [raid1] [raid6] [raid5] [raid4]
>> md53 : active raid5 sdae1[0] sds1[8](S) sdai1[9](S) sdk1[10] sdam1[6] sdo1[5] sdau1[4] sdaq1[3] sdw1[2] sdaa1[1]
>> 3418686208 blocks super 1.0 level 5, 128k chunk, algorithm 2 [8/8] [UUUUUUUU]
>>
>> md52 : active raid5 sdad1[0] sdf1[11](S) sdz1[10](S) sdb1[12] sdn1[8] sdj1[7] sdal1[6] sdah1[5] sdat1[4] sdap1[3] sdv1[2] sdr1[1]
>> 4395453696 blocks super 1.0 level 5, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
>>
>> md0 : active raid1 sdac2[0] sdy2[1]
>> 480375552 blocks [2/2] [UU]
>>
>> unused devices:
>>
>> [root@libthumper1 ~]# grep md /proc/partitions
>> 9 0 480375552 md0
>> 9 52 4395453696 md52
>> 9 53 3418686208 md53
>>
>>
>> [root@libthumper1 ~]# ls -l /dev/md*
>> brw-r----- 1 root disk 9, 0 Aug 4 15:25 /dev/md0
>> lrwxrwxrwx 1 root root 5 Aug 4 15:25 /dev/md51 -> md/51
>>
>> lrwxrwxrwx 1 root root 5 Aug 4 15:25 /dev/md52 -> md/52
>>
>> lrwxrwxrwx 1 root root 5 Aug 4 15:25 /dev/md53 -> md/53
>>
>>
>> /dev/md:
>> total 0
>> brw-r----- 1 root disk 9, 51 Aug 4 15:25 51
>> brw-r----- 1 root disk 9, 52 Aug 4 15:25 52
>> brw-r----- 1 root disk 9, 53 Aug 4 15:25 53
>>
>> [root@libthumper1 ~]# mdadm -Ds
>> ARRAY /dev/md0 level=raid1 num-devices=2 metadata=0.90 UUID=e30f5b25:6dc28a02:1b03ab94:da5913ed
>> ARRAY /dev/md52 level=raid5 num-devices=10 metadata=1.00 spares=2 name=vmware_storage UUID=c436b591:01a4be5f:2736d7dd:3b97d872
>> ARRAY /dev/md53 level=raid5 num-devices=8 metadata=1.00 spares=2 name=backup_mirror UUID=9bb89570:675f47be:2fe2f481:ebc33388
>>
>> [root@libthumper1 ~]# mdadm -Es
>> ARRAY /dev/md2 level=raid1 num-devices=6 UUID=d08b45a4:169e4351:02cff74a:c70fcb00
>> ARRAY /dev/md0 level=raid1 num-devices=2 UUID=e30f5b25:6dc28a02:1b03ab94:da5913ed
>> ARRAY /dev/md/tsongas_archive level=raid5 metadata=1.0 num-devices=8 UUID=41aa414e:cfe1a5ae:3768e4ef:0084904e name=tsongas_archive
>> ARRAY /dev/md/vmware_storage level=raid5 metadata=1.0 num-devices=10 UUID=c436b591:01a4be5f:2736d7dd:3b97d872 name=vmware_storage
>> ARRAY /dev/md/backup_mirror level=raid5 metadata=1.0 num-devices=8 UUID=9bb89570:675f47be:2fe2f481:ebc33388 name=backup_mirror
>>
>> [root@libthumper1 ~]# cat /etc/mdadm.conf
>>
>> # mdadm.conf written out by anaconda
>> DEVICE partitions
>> MAILADDR sysadmins
>> MAILFROM root@libthumper1.uml.edu
>> ARRAY /dev/md0 level=raid1 num-devices=2 uuid=e30f5b25:6dc28a02:1b03ab94:da5913ed
>> ARRAY /dev/md/51 level=raid5 num-devices=8 spares=2 name=tsongas_archive uuid=41aa414e:cfe1a5ae:3768e4ef:0084904e
>> ARRAY /dev/md/52 level=raid5 num-devices=10 spares=2 name=vmware_storage uuid=c436b591:01a4be5f:2736d7dd:3b97d872
>> ARRAY /dev/md/53 level=raid5 num-devices=8 spares=2 name=backup_mirror uuid=9bb89570:675f47be:2fe2f481:ebc33388
>>
>> It looks like the md51 device isn't appearing in /proc/partitions, not sure why that is?
>>
>> I also just noticed the /dev/md2 that appears in the mdadm -Es output, not sure what that is but I don't recognize it as anything that was previously on that box. (There is no /dev/md2 device file). Not sure if that is related at all or just a red herring...
>>
>> For good measure, here's some actual mdadm -E output for the specific drives (I won't include all as they all seem to be about the same):
>>
>> [root@libthumper1 ~]# mdadm -E /dev/sd[qui]1
>> /dev/sdi1:
>> Magic : a92b4efc
>> Version : 1.0
>> Feature Map : 0x0
>> Array UUID : 41aa414e:cfe1a5ae:3768e4ef:0084904e
>> Name : tsongas_archive
>> Creation Time : Thu Feb 24 11:43:37 2011
>> Raid Level : raid5
>> Raid Devices : 8
>>
>> Avail Dev Size : 976767728 (465.76 GiB 500.11 GB)
>> Array Size : 6837372416 (3260.31 GiB 3500.73 GB)
>> Used Dev Size : 976767488 (465.76 GiB 500.10 GB)
>> Super Offset : 976767984 sectors
>> State : clean
>> Device UUID : 750e6410:661d4838:0a5f7581:7c110cf1
>>
>> Update Time : Thu Aug 4 06:41:23 2011
>> Checksum : 20bb0567 - correct
>> Events : 18446744073709551615
> ...
>
>> Is that huge number for the event count perhaps a problem?
> Could be. That number is 0xffff,ffff,ffff,ffff. i.e.2^64-1.
> It cannot get any bigger than that.
>
>> OK so I tried with the --force and here's what I got (BTW the device names are different from my original email since I didn't have access to the server before, but I used the real device names exactly as when I originally created the array, sorry for any confusion)
>>
>> mdadm -A /dev/md/51 --force /dev/sdq1 /dev/sdu1 /dev/sdao1 /dev/sdas1 /dev/sdag1 /dev/sdi1 /dev/sdm1 /dev/sda1 /dev/sdak1 /dev/sde1
>>
>> mdadm: forcing event count in /dev/sdq1(0) from -1 upto -1
>> mdadm: forcing event count in /dev/sdu1(1) from -1 upto -1
>> mdadm: forcing event count in /dev/sdao1(2) from -1 upto -1
>> mdadm: forcing event count in /dev/sdas1(3) from -1 upto -1
>> mdadm: forcing event count in /dev/sdag1(4) from -1 upto -1
>> mdadm: forcing event count in /dev/sdi1(5) from -1 upto -1
>> mdadm: forcing event count in /dev/sdm1(6) from -1 upto -1
>> mdadm: forcing event count in /dev/sda1(7) from -1 upto -1
>> mdadm: failed to RUN_ARRAY /dev/md/51: Input/output error
> and sometimes "2^64-1" looks like "-1".
>
> We just need to replace that "-1" with a more useful number.
>
> It looks the the "--force" might have made a little bit of a mess but we
> should be able to recover it.
>
> Could you:
> apply the following patch and build a new 'mdadm'.
> mdadm -S /dev/md/51
> mdadm -A /dev/md/51 --update=summaries
> -vv /dev/sdq1 /dev/sdu1 /dev/sdao1 /dev/sdas1 /dev/sdag1 /dev/sdi1 /dev/sdm1 /dev/sda1 /dev/sdak1 /dev/sde1
>
> and if that doesn't work, repeat the same two commands but add "--force" to
> the second. Make sure you keep the "-vv" in both cases.
>
> then report the results.
>

Well it looks like the first try didn't work, but adding the --force
seems to have done the trick! Here are the results:

[root@libthumper1 ~]# /root/mdadm -V
mdadm - v3.2.2 - 17th June 2011

[root@libthumper1 ~]# /root/mdadm -S /dev/md/51
mdadm: stopped /dev/md/51

[root@libthumper1 ~]# /root/mdadm -A /dev/md/51 --update=summaries -vv \
> /dev/sdq1 /dev/sdu1 /dev/sdao1 /dev/sdas1 /dev/sdag1 /dev/sdi1
/dev/sdm1 \
> /dev/sda1 /dev/sdak1 /dev/sde1
mdadm: looking for devices for /dev/md/51
mdadm: /dev/sdq1 is identified as a member of /dev/md/51, slot 0.
mdadm: /dev/sdu1 is identified as a member of /dev/md/51, slot 1.
mdadm: /dev/sdao1 is identified as a member of /dev/md/51, slot 2.
mdadm: /dev/sdas1 is identified as a member of /dev/md/51, slot 3.
mdadm: /dev/sdag1 is identified as a member of /dev/md/51, slot 4.
mdadm: /dev/sdi1 is identified as a member of /dev/md/51, slot 5.
mdadm: /dev/sdm1 is identified as a member of /dev/md/51, slot 6.
mdadm: /dev/sda1 is identified as a member of /dev/md/51, slot 7.
mdadm: /dev/sdak1 is identified as a member of /dev/md/51, slot -1.
mdadm: /dev/sde1 is identified as a member of /dev/md/51, slot -1.
mdadm: added /dev/sdq1 to /dev/md/51 as 0
mdadm: added /dev/sdu1 to /dev/md/51 as 1
mdadm: added /dev/sdao1 to /dev/md/51 as 2
mdadm: added /dev/sdas1 to /dev/md/51 as 3
mdadm: added /dev/sdag1 to /dev/md/51 as 4
mdadm: added /dev/sdi1 to /dev/md/51 as 5
mdadm: added /dev/sdm1 to /dev/md/51 as 6
mdadm: added /dev/sda1 to /dev/md/51 as 7
mdadm: added /dev/sde1 to /dev/md/51 as -1
mdadm: added /dev/sdak1 to /dev/md/51 as -1
mdadm: /dev/md/51 assembled from 0 drives and 2 spares - not enough to
start the array.

[root@libthumper1 ~]# /root/mdadm --detail /dev/md/51
mdadm: md device /dev/md/51 does not appear to be active.

[root@libthumper1 ~]# /root/mdadm -S /dev/md/51
mdadm: stopped /dev/md/51

[root@libthumper1 ~]# /root/mdadm -A /dev/md/51 --force
--update=summaries -vv
/dev/sdq1 /dev/sdu1 /dev/sdao1 /dev/sdas1 /dev/sdag1 /dev/sdi1 /dev/sdm1
/dev/sda1 /dev/sdak1 /dev/sde1
mdadm: looking for devices for /dev/md/51
mdadm: /dev/sdq1 is identified as a member of /dev/md/51, slot 0.
mdadm: /dev/sdu1 is identified as a member of /dev/md/51, slot 1.
mdadm: /dev/sdao1 is identified as a member of /dev/md/51, slot 2.
mdadm: /dev/sdas1 is identified as a member of /dev/md/51, slot 3.
mdadm: /dev/sdag1 is identified as a member of /dev/md/51, slot 4.
mdadm: /dev/sdi1 is identified as a member of /dev/md/51, slot 5.
mdadm: /dev/sdm1 is identified as a member of /dev/md/51, slot 6.
mdadm: /dev/sda1 is identified as a member of /dev/md/51, slot 7.
mdadm: /dev/sdak1 is identified as a member of /dev/md/51, slot -1.
mdadm: /dev/sde1 is identified as a member of /dev/md/51, slot -1.
mdadm: added /dev/sdu1 to /dev/md/51 as 1
mdadm: added /dev/sdao1 to /dev/md/51 as 2
mdadm: added /dev/sdas1 to /dev/md/51 as 3
mdadm: added /dev/sdag1 to /dev/md/51 as 4
mdadm: added /dev/sdi1 to /dev/md/51 as 5
mdadm: added /dev/sdm1 to /dev/md/51 as 6
mdadm: added /dev/sda1 to /dev/md/51 as 7
mdadm: added /dev/sdak1 to /dev/md/51 as -1
mdadm: added /dev/sde1 to /dev/md/51 as -1
mdadm: added /dev/sdq1 to /dev/md/51 as 0
mdadm: /dev/md/51 has been started with 8 drives and 2 spares.

[root@libthumper1 ~]# /root/mdadm --detail /dev/md/51
/dev/md/51:
Version : 1.0
Creation Time : Thu Feb 24 11:43:37 2011
Raid Level : raid5
Array Size : 3418686208 (3260.31 GiB 3500.73 GB)
Used Dev Size : 488383744 (465.76 GiB 500.10 GB)
Raid Devices : 8
Total Devices : 10
Persistence : Superblock is persistent

Update Time : Thu Aug 4 06:41:23 2011
State : clean
Active Devices : 8
Working Devices : 10
Failed Devices : 0
Spare Devices : 2

Layout : left-symmetric
Chunk Size : 128K

Name : tsongas_archive
UUID : 41aa414e:cfe1a5ae:3768e4ef:0084904e
Events : 4

Number Major Minor RaidDevice State
0 65 1 0 active sync /dev/sdq1
1 65 65 1 active sync /dev/sdu1
2 66 129 2 active sync /dev/sdao1
3 66 193 3 active sync /dev/sdas1
4 66 1 4 active sync /dev/sdag1
5 8 129 5 active sync /dev/sdi1
6 8 193 6 active sync /dev/sdm1
7 8 1 7 active sync /dev/sda1

8 66 65 - spare /dev/sdak1
9 8 65 - spare /dev/sde1

So it looks like I'm in business again! Many thanks!

This does lead to a question: do you recommend (and is it safe on CentOS
5.5?) that I use the updated version of mdadm (3.2.2 with your patch)
going forward in place of the CentOS version (2.6.9)?

> I wonder how the event count got that high. There aren't enough seconds
> since the birth of the universe of it to have happened naturally...
>
Any chance it might be related to these kernel messages? I just noticed
(guess I should be paying more attention to my logs) that there are tons
of these messages repeated in my /var/log/messages file. However, as far
as the RAID arrays themselves go, we haven't seen any problems while they
are running, so I'm not sure what's causing these or whether they are just
insignificant. Again, this is speculation on my part, but given the huge event
count from mdadm and the number of these messages it might seem that
they are somehow related....

Jul 31 04:02:13 libthumper1 kernel: program diskmond is using a
deprecated SCSI
ioctl, please convert it to SG_IO
Jul 31 04:02:26 libthumper1 last message repeated 47 times
Jul 31 04:12:11 libthumper1 kernel: md: bug in file drivers/md/md.c,
line 1659
Jul 31 04:12:11 libthumper1 kernel:
Jul 31 04:12:11 libthumper1 kernel: md: **********************************
Jul 31 04:12:11 libthumper1 kernel: md: * <COMPLETE RAID STATE PRINTOUT> *
Jul 31 04:12:11 libthumper1 kernel: md: **********************************
Jul 31 04:12:11 libthumper1 kernel: md53:

Jul 31 04:12:11 libthumper1 kernel: md: rdev sdk1, SZ:488383744 F:0 S:1
DN:10
Jul 31 04:12:11 libthumper1 kernel: md: rdev superblock:
Jul 31 04:12:11 libthumper1 kernel: md: SB: (V:1.0.0)
ID: CT:81f4e22f
Jul 31 04:12:11 libthumper1 kernel: md: L-2009873429 S1801675106
ND:1834971253 RD:1869771369 md114 LO:65536 CS:196610
Jul 31 04:12:11 libthumper1 kernel: md: UT:00000000 ST:0
AD:976767728 WD:0 FD:976767984 SD:0 CSUM:00000000 E:00000000
Jul 31 04:12:11 libthumper1 kernel: D 0: DISK
Jul 31 04:12:11 libthumper1 kernel: D 1: DISK
Jul 31 04:12:11 libthumper1 kernel: D 2: DISK
Jul 31 04:12:11 libthumper1 kernel: D 3: DISK
Jul 31 04:12:11 libthumper1 kernel: md: THIS: DISK
Jul 31 04:12:11 libthumper1 kernel: md: rdev superblock:
Jul 31 04:12:11 libthumper1 kernel: md: SB: (V:1.0.0)
ID: CT:81f4e22f
Jul 31 04:12:11 libthumper1 kernel: md: L-2009873429 S1801675106
ND:1834971253 RD:1869771369 md114 LO:65536 CS:196610
Jul 31 04:12:11 libthumper1 kernel: md: UT:00000000 ST:0
AD:976767728 WD:0 FD:976767984 SD:0 CSUM:00000000 E:00000000



Of course given how old the CentOS mdadm is, maybe by updating it I'll
be fixing this problem as well?
If not, I'd be willing to help delve deeper if it's something worth
investigating.

Again, Thanks a ton for all your help and quick replies!

Cheers!
-steve

> Thanks,
> NeilBrown
>
> diff --git a/super1.c b/super1.c
> index 35e92a3..4a3341a 100644
> --- a/super1.c
> +++ b/super1.c
> @@ -803,6 +803,8 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
> __le64_to_cpu(sb->data_size));
> } else if (strcmp(update, "_reshape_progress")==0)
> sb->reshape_position = __cpu_to_le64(info->reshape_progress);
> + else if (strcmp(update, "summaries") == 0)
> + sb->events = __cpu_to_le64(4);
> else
> rv = -1;
>


Re: Need help recovering RAID5 array

on 09.08.2011 04:55:49 by NeilBrown

On Mon, 8 Aug 2011 22:29:10 -0400 Stephen Muskiewicz
wrote:
>
> Well it looks like the first try didn't work, but adding the --force
> seems to have done the trick! Here's the results:
>

snip

>
> So it looks like I'm in business again! Many thanks!

Great!

>
> This does lead to a question: Do you recommend (and is it safe on CentOS
> 5.5?) for me to use the updated (3.2.2 with your patch) version of mdadm
> going forward in place of the CentOS version (2.6.9)?

I wouldn't keep that patch. It was a little hack to get your array working
again. I wouldn't recommend using it without expert advice...

Other than that ... 3.2.2 certainly fixes bugs and adds features over 2.6.9,
but maybe it adds some bugs too... I would say that it is safe, but probably
not really necessary.
i.e. up to you :-)

>
> > I wonder how the event count got that high. There aren't enough seconds
> > since the birth of the universe of it to have happened naturally...
> >
> Any chance it might be related to these kernel messages? I just noticed
> (guess I should be paying more attention to my logs) that there are tons
> of these messages repeated in my /var/log/messages file. However as far
> as the RAID arrays themselves, we haven't seen any problems while they
> are running so I'm not sure what's causing these or whether they are
> insignificant. Again, speculation on my part but given the huge event
> count from mdadm and the number of these messages it might seem that
> they are somehow related....
>
> Jul 31 04:02:13 libthumper1 kernel: program diskmond is using a
> deprecated SCSI
> ioctl, please convert it to SG_IO
> Jul 31 04:02:26 libthumper1 last message repeated 47 times
> Jul 31 04:12:11 libthumper1 kernel: md: bug in file drivers/md/md.c,
> line 1659

I need to know the exact kernel version to find out what this line is.... I
could guess but I would probably be wrong.

> Jul 31 04:12:11 libthumper1 kernel:
> Jul 31 04:12:11 libthumper1 kernel: md: **********************************
> Jul 31 04:12:11 libthumper1 kernel: md: * *
> Jul 31 04:12:11 libthumper1 kernel: md: **********************************
> Jul 31 04:12:11 libthumper1 kernel: md53:
>
> Jul 31 04:12:11 libthumper1 kernel: md: rdev sdk1, SZ:488383744 F:0 S:1
> DN:10
> Jul 31 04:12:11 libthumper1 kernel: md: rdev superblock:
> Jul 31 04:12:11 libthumper1 kernel: md: SB: (V:1.0.0)
> ID: CT:81f4e22f
> Jul 31 04:12:11 libthumper1 kernel: md: L-2009873429 S1801675106
> ND:1834971253 RD:1869771369 md114 LO:65536 CS:196610
> Jul 31 04:12:11 libthumper1 kernel: md: UT:00000000 ST:0
> AD:976767728 WD:0 FD:976767984 SD:0 CSUM:00000000 E:00000000
> Jul 31 04:12:11 libthumper1 kernel: D 0: DISK
> Jul 31 04:12:11 libthumper1 kernel: D 1: DISK
> Jul 31 04:12:11 libthumper1 kernel: D 2: DISK
> Jul 31 04:12:11 libthumper1 kernel: D 3: DISK
> Jul 31 04:12:11 libthumper1 kernel: md: THIS: DISK
> Jul 31 04:12:11 libthumper1 kernel: md: rdev superblock:
> Jul 31 04:12:11 libthumper1 kernel: md: SB: (V:1.0.0)
> ID: CT:81f4e22f
> Jul 31 04:12:11 libthumper1 kernel: md: L-2009873429 S1801675106
> ND:1834971253 RD:1869771369 md114 LO:65536 CS:196610
> Jul 31 04:12:11 libthumper1 kernel: md: UT:00000000 ST:0
> AD:976767728 WD:0 FD:976767984 SD:0 CSUM:00000000 E:00000000
>
>

Did it really start repeating at this point? I would have expected a bit
more first.

So if you get me the kernel version and confirm that this really is all in the
logs except for identical repeats, I'll see if I can figure out what might
have caused it - and then if it could be related to your original problem.

>
> Of course given how old the CentOS mdadm is, maybe by updating it I'll
> be fixing this problem as well?

In general running newer code should be safer and easier to support. Don't
know if it would fix this problem yet though.


NeilBrown




> If not, I'd be willing to help delve deeper if it's something worth
> investigating.
>
> Again, Thanks a ton for all your help and quick replies!
>
> Cheers!
> -steve
>
> > Thanks,
> > NeilBrown
> >
> > diff --git a/super1.c b/super1.c
> > index 35e92a3..4a3341a 100644
> > --- a/super1.c
> > +++ b/super1.c
> > @@ -803,6 +803,8 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
> > __le64_to_cpu(sb->data_size));
> > } else if (strcmp(update, "_reshape_progress")==0)
> > sb->reshape_position = __cpu_to_le64(info->reshape_progress);
> > + else if (strcmp(update, "summaries") == 0)
> > + sb->events = __cpu_to_le64(4);
> > else
> > rv = -1;
> >


Re: Need help recovering RAID5 array

on 09.08.2011 13:38:51 by Phil Turmel

On 08/08/2011 10:55 PM, NeilBrown wrote:
> On Mon, 8 Aug 2011 22:29:10 -0400 Stephen Muskiewicz wrote:
>> This does lead to a question: Do you recommend (and is it safe on CentOS
>> 5.5?) for me to use the updated (3.2.2 with your patch) version of mdadm
>> going forward in place of the CentOS version (2.6.9)?
>
> I wouldn't kept that patch. It was a little hack to get your array working
> again. I wouldn't recommend using it without expert advice...
>
> Other than that ... 3.2.2 certainly fixes bug and adds features over 2.6.9,
> but maybe it adds some bugs too... I would say that it is safe, but probably
> not really necessary.
> i.e. up to you :-)

Let me add a reason to stick with 2.6.9: the newer mdadm has different
defaults for metadata reserved space. If all hell breaks loose, and you find
you need to do "mdadm --create --assume-clean" or some variant as part of your
recovery efforts, you'll need the older version to get an identical layout.
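For concreteness, the kind of command I mean would look roughly like the
following, built from the parameters reported earlier in the thread (1.0
metadata, level 5, 8 data devices, 128K chunk, left-symmetric). Treat it as a
last-resort sketch only: the device order must exactly match the original
creation order, it rewrites every superblock, and it should not be run
without expert advice:

   mdadm --create /dev/md/51 --assume-clean --metadata=1.0 --level=5 \
         --chunk=128 --layout=left-symmetric --raid-devices=8 \
         --name=tsongas_archive \
         /dev/sdq1 /dev/sdu1 /dev/sdao1 /dev/sdas1 /dev/sdag1 /dev/sdi1 /dev/sdm1 /dev/sda1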

Phil

RE: Need help recovering RAID5 array

on 09.08.2011 16:47:33 by Stephen Muskiewicz

> -----Original Message-----
> From: NeilBrown [mailto:neilb@suse.de]
> Sent: Monday, August 08, 2011 10:56 PM
> To: Muskiewicz, Stephen C
> Cc: linux-raid@vger.kernel.org
> Subject: Re: Need help recovering RAID5 array
>
> >
> > This does lead to a question: Do you recommend (and is it safe on
> CentOS
> > 5.5?) for me to use the updated (3.2.2 with your patch) version of
> mdadm
> > going forward in place of the CentOS version (2.6.9)?
>
> I wouldn't kept that patch. It was a little hack to get your array
> working
> again. I wouldn't recommend using it without expert advice...
>
> Other than that ... 3.2.2 certainly fixes bug and adds features over
> 2.6.9,
> but maybe it adds some bugs too... I would say that it is safe, but
> probably
> not really necessary.
> i.e. up to you :-)
>

OK, I'll probably stick with 2.6.9 for now and focus on getting our other thumper server updated to CentOS 6 then. Oh yeah and getting the UPS control software so it actually shuts down the box cleanly so this hopefully doesn't happen again! ;-)

> >
> > > I wonder how the event count got that high. There aren't enough
> seconds
> > > since the birth of the universe of it to have happened naturally...
> > >
> > Any chance it might be related to these kernel messages? I just
> noticed
> > (guess I should be paying more attention to my logs) that there are
> tons
> > of these messages repeated in my /var/log/messages file. However as
> far
> > as the RAID arrays themselves, we haven't seen any problems while
> they
> > are running so I'm not sure what's causing these or whether they are
> > insignificant. Again, speculation on my part but given the huge
> event
> > count from mdadm and the number of these messages it might seem that
> > they are somehow related....
> >
> > Jul 31 04:02:13 libthumper1 kernel: program diskmond is using a
> > deprecated SCSI
> > ioctl, please convert it to SG_IO
> > Jul 31 04:02:26 libthumper1 last message repeated 47 times
> > Jul 31 04:12:11 libthumper1 kernel: md: bug in file drivers/md/md.c,
> > line 1659
>
> I need to know the exact kernel version to find out what this line
> is.... I
> could guess but I would probably be wrong.
>
> > Jul 31 04:12:11 libthumper1 kernel:
> > Jul 31 04:12:11 libthumper1 kernel: md:
> **********************************
> > Jul 31 04:12:11 libthumper1 kernel: md: * > PRINTOUT> *
> > Jul 31 04:12:11 libthumper1 kernel: md:
> **********************************
> > Jul 31 04:12:11 libthumper1 kernel: md53:
> >
> > Jul 31 04:12:11 libthumper1 kernel: md: rdev sdk1, SZ:488383744 F:0
> S:1
> > DN:10
> > Jul 31 04:12:11 libthumper1 kernel: md: rdev superblock:
> > Jul 31 04:12:11 libthumper1 kernel: md: SB: (V:1.0.0)
> > ID: CT:81f4e22f
> > Jul 31 04:12:11 libthumper1 kernel: md: L-2009873429 S1801675106
> > ND:1834971253 RD:1869771369 md114 LO:65536 CS:196610
> > Jul 31 04:12:11 libthumper1 kernel: md: UT:00000000 ST:0
> > AD:976767728 WD:0 FD:976767984 SD:0 CSUM:00000000 E:00000000
> > Jul 31 04:12:11 libthumper1 kernel: D 0: DISK > 1,S:-1>
> > Jul 31 04:12:11 libthumper1 kernel: D 1: DISK > 1,S:-1>
> > Jul 31 04:12:11 libthumper1 kernel: D 2: DISK > 1,S:-1>
> > Jul 31 04:12:11 libthumper1 kernel: D 3: DISK > 1,S:-1>
> > Jul 31 04:12:11 libthumper1 kernel: md: THIS:
> DISK
> > Jul 31 04:12:11 libthumper1 kernel: md: rdev superblock:
> > Jul 31 04:12:11 libthumper1 kernel: md: SB: (V:1.0.0)
> > ID: CT:81f4e22f
> > Jul 31 04:12:11 libthumper1 kernel: md: L-2009873429 S1801675106
> > ND:1834971253 RD:1869771369 md114 LO:65536 CS:196610
> > Jul 31 04:12:11 libthumper1 kernel: md: UT:00000000 ST:0
> > AD:976767728 WD:0 FD:976767984 SD:0 CSUM:00000000 E:00000000
> >
> >
>
> Did it really start repeating at this point? I would have expected a
> bit
> more first.
>
> So if you get me kernel version and confirm that this really is all in
> the
> logs except for identical repeats, I'll see if I can figure out what
> might
> have caused it - and then if it could be related to your original
> problem.
>

Yes, you're right, there is quite a bit more info in the logs in between the "bug in file ... line 1659" messages. It looks to be a state dump for each device in the array. I'll save the bandwidth and not paste all of that in here unless you need it. But I have confirmed that all of the bug lines are for the same line number (approx 60000 occurrences in the old backup of the messages file alone):

libthumper1 kernel: md: bug in file drivers/md/md.c, line 1659
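(FWIW, that rough count came from something along the lines of:

   grep -c 'md: bug in file drivers/md/md.c' /var/log/messages*

so it is definitely not a one-off.)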

Here's the kernel version and RPM info:

[root@libthumper1 ~]# uname -a
Linux libthumper1.uml.edu 2.6.18-194.32.1.el5 #1 SMP Wed Jan 5 17:52:25 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

[root@libthumper1 ~]# rpm -qi kernel-2.6.18-194.32.1.el5
Name : kernel Relocations: (not relocatable)
Version : 2.6.18 Vendor: CentOS
Release : 194.32.1.el5 Build Date: Wed 05 Jan 2011 08:44:05 PM EST
Install Date: Tue 25 Jan 2011 03:13:55 PM EST Build Host: builder10.centos.org
Group : System Environment/Kernel Source RPM: kernel-2.6.18-194.32.1.el5.src.rpm
Size : 96513754 License: GPLv2
Signature : DSA/SHA1, Thu 06 Jan 2011 07:16:03 AM EST, Key ID a8a447dce8562897
URL : http://www.kernel.org/


Let me know if I can provide any other useful info.

Again, many thanks for all your help!

Cheers,
-steve


