#1: FailSpare event?

Posted on 2007-01-11 23:11:52 by Mike

Can someone tell me what this means please? I just received this in
an email from one of my servers:


From: mdadm monitoring [root@$DOMAIN.com]
To: root@$DOMAIN.com
Subject: FailSpare event on /dev/md2:$HOST.$DOMAIN.com

This is an automatically generated mail message from mdadm
running on $HOST.$DOMAIN.com

A FailSpare event had been detected on md device /dev/md2.

It could be related to component device /dev/sde2.

Faithfully yours, etc.

On this machine I execute:

$ cat /proc/mdstat
Personalities : [raid5] [raid4] [raid1]
md0 : active raid1 sdf1[2](S) sde1[3](S) sdd1[4](S) sdc1[5](S) sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md1 : active raid1 sdf3[2](S) sde3[3](S) sdd3[4](S) sdc3[5](S) sdb3[1] sda3[0]
3068288 blocks [2/2] [UU]

md2 : active raid5 sdf2[4] sde2[5](F) sdd2[3] sdc2[2] sdb2[1] sda2[0]
560732160 blocks level 5, 256k chunk, algorithm 2 [5/5] [UUUUU]

unused devices: <none>


Does the email message mean drive sde2[5] has failed? I know the sde2 refers
to the second partition of /dev/sde. Here is the partition table

[root@elo ~]# fdisk -l /dev/sde

Disk /dev/sde: 146.8 GB, 146815733760 bytes
255 heads, 63 sectors/track, 17849 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1   *           1          13      104391   fd  Linux raid autodetect
/dev/sde2              14       17465   140183190   fd  Linux raid autodetect
/dev/sde3           17466       17847     3068415   fd  Linux raid autodetect

I have partition 2 of drive sde as one of the raid devices for md2. Does the (S)
on sde3[3](S) mean the device is a spare for md1 and the same for md0?

Mike

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html


#2: Re: FailSpare event?

Posted on 2007-01-11 23:23:41 by NeilBrown

On Thursday January 11, mikee@mikee.ath.cx wrote:
> Can someone tell me what this means please? I just received this in
> an email from one of my servers:
>
.....

>
> A FailSpare event had been detected on md device /dev/md2.
>
> It could be related to component device /dev/sde2.

It means that mdadm has just noticed that /dev/sde2 is a spare and is faulty.

You would normally expect this if the array is rebuilding a spare and
a write to the spare fails. However...

>
> md2 : active raid5 sdf2[4] sde2[5](F) sdd2[3] sdc2[2] sdb2[1] sda2[0]
> 560732160 blocks level 5, 256k chunk, algorithm 2 [5/5] [UUUUU]

That isn't the case here - your array doesn't need rebuilding.
Possibly a superblock-update failed. Possibly mdadm only just started
monitoring the array and the spare has been faulty for some time.

>
> Does the email message mean drive sde2[5] has failed? I know the sde2 refers
> to the second partition of /dev/sde. Here is the partition table

It means that md thinks sde2 cannot be trusted. To find out why you
would need to look at kernel logs for IO errors.

>
> I have partition 2 of drive sde as one of the raid devices for md2. Does the (S)
> on sde3[3](S) mean the device is a spare for md1 and the same for md0?
>

Yes, (S) means the device is spare. You don't have (S) next to sde2
on md2 because (F) (failed) overrides (S).
You can tell by the position [5], that it isn't part of the array
(being a 5 disk array, the active positions are 0,1,2,3,4).

NeilBrown
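Neil's decoding of those mdstat entries can be sketched mechanically. A minimal parser (an illustrative sketch, not anything mdadm ships; it only assumes the `name[slot](FLAG)` token format shown above):

```python
import re

# Parse one /proc/mdstat component like "sde2[5](F)" or "sdb1[1]".
TOKEN = re.compile(r"(?P<dev>\w+)\[(?P<slot>\d+)\](?:\((?P<flag>[SF])\))?")

def parse(token):
    m = TOKEN.fullmatch(token)
    return m["dev"], int(m["slot"]), m["flag"]   # flag None = no (S)/(F) marker

line = "sdf2[4] sde2[5](F) sdd2[3] sdc2[2] sdb2[1] sda2[0]"
raid_disks = 5               # md2 is a 5-disk RAID5: active slots are 0..4
for tok in line.split():
    dev, slot, flag = parse(tok)
    # (F) overrides (S); an unmarked slot >= raid_disks would be a spare
    role = "active" if slot < raid_disks and flag is None else \
           "failed" if flag == "F" else "spare"
    print(dev, slot, role)   # sde2 comes out as slot 5, "failed"
```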

#3: Re: FailSpare event?

Posted on 2007-01-11 23:36:28 by Mike

On Fri, 12 Jan 2007, Neil Brown might have said:

> On Thursday January 11, mikee@mikee.ath.cx wrote:
> > Can someone tell me what this means please? I just received this in
> > an email from one of my servers:
> [...]
> It means that mdadm has just noticed that /dev/sde2 is a spare and is faulty.
> [...]
> Yes, (S) means the device is spare. You don't have (S) next to sde2
> on md2 because (F) (failed) overrides (S).
> You can tell by the position [5], that it isn't part of the array
> (being a 5 disk array, the active positions are 0,1,2,3,4).
>
> NeilBrown

Thanks for the quick response.

So I'm ok for the moment? Yes, I need to find the error and fix everything
back to the (S) state.

The messages in $HOST:/var/log/messages for the time of the email are:

Jan 11 16:04:25 elo kernel: sd 2:0:4:0: SCSI error: return code = 0x8000002
Jan 11 16:04:25 elo kernel: sde: Current: sense key: Hardware Error
Jan 11 16:04:25 elo kernel: Additional sense: Internal target failure
Jan 11 16:04:25 elo kernel: Info fld=0x10b93c4d
Jan 11 16:04:25 elo kernel: end_request: I/O error, dev sde, sector 280575053
Jan 11 16:04:25 elo kernel: raid5: Disk failure on sde2, disabling device. Operation continuing on 5 devices

This is a Dell box running Fedora Core with recent patches. It is a production
box, so I do not patch each night.

On AIX boxes I can blink the drives to identify a bad/failing device. Is there
a way to blink the drives in Linux?

Mike

#4: Re: FailSpare event?

Posted on 2007-01-11 23:59:15 by NeilBrown

On Thursday January 11, mikee@mikee.ath.cx wrote:
>
> So I'm ok for the moment? Yes, I need to find the error and fix everything
> back to the (S) state.

Yes, OK for the moment.

>
> The messages in $HOST:/var/log/messages for the time of the email are:
>
> Jan 11 16:04:25 elo kernel: sd 2:0:4:0: SCSI error: return code = 0x8000002
> Jan 11 16:04:25 elo kernel: sde: Current: sense key: Hardware Error
> Jan 11 16:04:25 elo kernel: Additional sense: Internal target failure
> Jan 11 16:04:25 elo kernel: Info fld=0x10b93c4d
> Jan 11 16:04:25 elo kernel: end_request: I/O error, dev sde, sector 280575053
> Jan 11 16:04:25 elo kernel: raid5: Disk failure on sde2, disabling device. Operation continuing on 5 devices

Given the sector number it looks likely that it was a superblock
update.
No idea how bad an 'internal target failure' is. Maybe powercycling
the drive would 'fix' it, maybe not.
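The superblock hunch can in fact be checked against the fdisk output earlier in the thread, assuming version-0.90 superblocks (which sit in the last 64 KiB-aligned 64 KiB of the component device):

```python
# Locate the md 0.90 superblock on /dev/sde2 and compare it with the
# failing sector from the kernel log (all numbers come from this thread).
SECTORS_PER_CYL = 16065                    # fdisk: 255 heads * 63 sectors
part_start = (14 - 1) * SECTORS_PER_CYL    # sde2 starts at cylinder 14
part_sectors = 140183190 * 2               # fdisk "Blocks" are 1 KiB units

# 0.90 superblock: 64 KiB (128 sectors), placed 64 KiB below the device
# size rounded down to a 64 KiB boundary.
sb_offset = (part_sectors // 128) * 128 - 128

print(part_start + sb_offset)   # 280575053 - exactly the sector in the log
```

So the failed write really was the md superblock update, as guessed.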

>
> On AIX boxes I can blink the drives to identify a bad/failing device. Is there
> a way to blink the drives in Linux?

Unfortunately not.

NeilBrown

#5: Re: FailSpare event?

Posted on 2007-01-12 00:06:36 by Mike

On Fri, 12 Jan 2007, Neil Brown might have said:

> On Thursday January 11, mikee@mikee.ath.cx wrote:
> [...]
> Given the sector number it looks likely that it was a superblock
> update.
> No idea how bad an 'internal target failure' is. Maybe powercycling
> the drive would 'fix' it, maybe not.
> [...]
> NeilBrown

I found the smartctl command. I have a 'long' test running in the background.
I checked this drive and the other drives. This drive has been used the least
(confirms it is a spare?) and is the only one with 'Total uncorrected errors' > 0.

How do I determine, correct, or clear the error?

Mike

[root@$HOST ~]# smartctl -a /dev/sde
smartctl version 5.36 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: SEAGATE ST3146707LC Version: D703
Serial number: 3KS30WY8
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Thu Jan 11 17:00:26 2007 CST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature: 48 C
Drive Trip Temperature: 68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
Blocks sent to initiator = 66108
Blocks received from initiator = 147374656
Blocks read from cache and sent to initiator = 42215
Number of read and write commands whose size <= segment size = 12635583
Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 3943.42
number of minutes until next internal SMART test = 94

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/     errors  algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:        354        0         0       354        354          0.546           0
write:         0        0         0         0          0        185.871           1

Non-medium error count: 0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed, segment failed   -     3943        - [-   -    -]

Long (extended) Self Test duration: 2726 seconds [45.4 minutes]
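The fixed-width error-counter table pastes badly into mail; when comparing several drives, the "Total uncorrected errors" column (the last field of each data row) can be pulled out with a few lines of script. A sketch, with the two data rows from the output above embedded as strings:

```python
# Extract "Total uncorrected errors" from smartctl's error counter log.
rows = [
    "read:        354        0         0       354        354          0.546           0",
    "write:         0        0         0         0          0        185.871           1",
]
uncorrected = {}
for row in rows:
    fields = row.split()
    # first field is the operation ("read:"/"write:"), last is the
    # total uncorrected error count
    uncorrected[fields[0].rstrip(":")] = int(fields[-1])

print(uncorrected)   # {'read': 0, 'write': 1} -> the write side has the error
```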


#6: Re: FailSpare event?

Posted on 2007-01-12 01:05:41 by Michael Hardy

google "BadBlockHowto"

Any "just google it" response sounds glib, but this is actually how to
do it :-)

If you're new to md and mdadm, don't forget to actually remove the drive
from the array before you start working on it with 'dd'.

-Mike
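One detail the HOWTO's dd step needs: the kernel log reports the sector relative to the whole disk (/dev/sde), while dd on /dev/sde2 needs a partition-relative offset. A sketch of the conversion with this thread's numbers (the dd line in the comment is illustrative only; double-check everything before writing to a live device):

```python
# Convert the whole-disk bad sector to a partition-relative sector
# so it can be used as dd's seek= on /dev/sde2.
bad_sector = 280575053             # from the kernel log, relative to /dev/sde
sde2_start = (14 - 1) * 16065      # sde2 begins at cylinder 14 (fdisk output)
rel = bad_sector - sde2_start
print(rel)                         # 280366208
# DESTRUCTIVE, and only after 'mdadm --manage ... -r' removed sde2:
#   dd if=/dev/zero of=/dev/sde2 bs=512 count=1 seek=280366208
```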

Mike wrote:
> I found the smartctl command. I have a 'long' test running in the background.
> I checked this drive and the other drives. This drive has been used the least
> (confirms it is a spare?) and is the only one with 'Total uncorrected errors' > 0.
>
> How do I determine, correct, or clear the error?
[...]

#7: Re: FailSpare event?

Posted on 2007-01-12 01:40:18 by Corey Hickey

Mike wrote:
> I found the smartctl command. I have a 'long' test running in the background.
> I checked this drive and the other drives. This drive has been used the least
> (confirms it is a spare?) and is the only one with 'Total uncorrected errors' > 0.
>
> How do I determine, correct, or clear the error?
>
> Mike
>

[cut]

> SMART Self-test log
> Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                              number   (hours)
> # 1  Background long   Completed, segment failed   -     3943        - [-   -    -]
>
> Long (extended) Self Test duration: 2726 seconds [45.4 minutes]

Am I mistaken, or does the above information not say that the long
self-test actually failed? If a SMART test fails, that should be
sufficient cause to RMA the drive if it's still under warranty.

It might not actually be that surprising to have a largely unused drive
fail. I've had a couple drives fail due to what I presume is bearing
wear: the drive gradually gets noisy (over many months) and eventually
starts having intermittent errors that get more and more frequent. If
your drive was spinning while it was a spare, then it would be just as
likely to wear out a bad bearing as any of your other drives. Of course,
it could be some other problem; that's just an example.

-Corey

#8: Re: FailSpare event?

Posted on 2007-01-12 01:48:55 by martin

2007/1/12, Mike <mikee@mikee.ath.cx>:
> # 1 Background long Completed, segment failed - 3943

This should still be in warranty. Try to get a replacement.

Best
Martin

#9: Re: FailSpare event?

Posted on 2007-01-12 15:34:15 by Ernst Herzberg

On Thursday 11 January 2007 23:23, Neil Brown wrote:
> On Thursday January 11, mikee@mikee.ath.cx wrote:
> > Can someone tell me what this means please? I just received this in
> > an email from one of my servers:
>
> ....
>

Same problem here, on different machines, but only with mdadm 2.6; with
mdadm 2.5.5 there are no problems.

First machine sends this directly after starting mdadm in monitor mode:
(kernel 2.6.20-rc3)
-----------------------------
event=DeviceDisappeared
mddev=/dev/md1
device=Wrong-Level

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
md1 : active raid0 sdb2[1] sda2[0]
3904704 blocks 16k chunks

md2 : active raid0 sdb3[1] sda3[0]
153930112 blocks 16k chunks

md3 : active raid5 sdf1[3] sde1[2] sdd1[1] sdc1[0]
732587712 blocks level 5, 16k chunk, algorithm 2 [4/4] [UUUU]

md0 : active raid1 sdb1[1] sda1[0]
192640 blocks [2/2] [UU]

unused devices: <none>
-----------------------
and a second time for md2.
Then, about every 60 seconds, 4 times:

event=SpareActive
mddev=/dev/md3

******************************

Second machine sends, about every 60 sec, 8 messages with:
(kernel 2.6.19.2)
--------------------------
event=SpareActive
mddev=/dev/md0
device=

Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md1 : active raid1 sdb1[1] sda1[0]
979840 blocks [2/2] [UU]

md3 : active raid5 sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0]
4899200 blocks level 5, 8k chunk, algorithm 2 [6/6] [UUUUUU]

md2 : active raid5 sdh2[7] sdg2[6] sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1] sda2[0]
6858880 blocks level 5, 4k chunk, algorithm 2 [8/8] [UUUUUUUU]

md0 : active raid5 sdh3[7] sdg3[6] sdf3[5] sde3[4] sdd3[3] sdc3[2] sdb3[1] sda3[0]
235086656 blocks level 5, 16k chunk, algorithm 2 [8/8] [UUUUUUUU]

unused devices: <none>

--------------------------

Both machines have never seen any spare device, and there are no failing
devices; everything works as expected.

<earny>

#10: Re: FailSpare event?

Posted on 2007-01-13 19:10:59 by Nix

On 12 Jan 2007, Ernst Herzberg told this:
> Then, about every 60 seconds, 4 times:
>
> event=SpareActive
> mddev=/dev/md3

I see exactly this on both my RAID-5 arrays, neither of which have any
spare device --- nor have any active devices transitioned to spare
(which is what that event is actually supposed to mean).

mdadm-2.6 bug, I fear. I haven't tracked it down yet but will look
shortly: I can't afford to not run mdadm --monitor... odd, that
code hasn't changed during 2.6 development.

--
`He accused the FSF of being "something of a hypocrit", which
shows that he neither understands hypocrisy nor can spell.'
--- jimmybgood

#11: Re: FailSpare event?

Posted on 2007-01-13 23:29:16 by Mike

On Fri, 12 Jan 2007, Neil Brown might have said:

> On Thursday January 11, mikee@mikee.ath.cx wrote:
> [...]
> It means that md thinks sde2 cannot be trusted. To find out why you
> would need to look at kernel logs for IO errors.
> [...]
> Yes, (S) means the device is spare. You don't have (S) next to sde2
> on md2 because (F) (failed) overrides (S).
>
> NeilBrown

I have cleared the error by:

# mdadm --manage /dev/md2 -f /dev/sde2
( make sure it has failed )
# mdadm --manage /dev/md2 -r /dev/sde2
( remove from the array )
# mdadm --manage /dev/md2 -a /dev/sde2
( add the device back to the array )
# mdadm --detail /dev/md2
( verify there are no faults and the array knows about the spare )

#12: Re: FailSpare event?

Posted on 2007-01-14 00:34:26 by Nix

On 13 Jan 2007, nix@esperi.org.uk spake thusly:

> On 12 Jan 2007, Ernst Herzberg told this:
>> Then, about every 60 seconds, 4 times:
>>
>> event=SpareActive
>> mddev=/dev/md3
>
> I see exactly this on both my RAID-5 arrays, neither of which have any
> spare device --- nor have any active devices transitioned to spare
> (which is what that event is actually supposed to mean).

Hm, the manual says that it means that a spare has transitioned to
active (which seems more likely). Perhaps the comment at line 82 of
Monitor.c is wrong, or I just don't understand what a `reverse
transition' is supposed to be.


#13: Re: FailSpare event?

Posted on 2007-01-14 00:38:00 by Nix

On 13 Jan 2007, nix@esperi.org.uk uttered the following:

> On 12 Jan 2007, Ernst Herzberg told this:
>> Then, about every 60 seconds, 4 times:
>>
>> event=SpareActive
>> mddev=/dev/md3
>
> I see exactly this on both my RAID-5 arrays, neither of which have any
> spare device --- nor have any active devices transitioned to spare
> (which is what that event is actually supposed to mean).

One oddity has already come to light. My /proc/mdstat says

md2 : active raid5 sdb7[0] hda5[3] sda7[1]
19631104 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md1 : active raid5 sda6[0] hdc5[3] sdb6[1]
76807296 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

hda5 and hdc5 look odd. Indeed, --examine says

Number Major Minor RaidDevice State
0 8 6 0 active sync /dev/sda6
1 8 22 1 active sync /dev/sdb6
3 22 5 2 active sync /dev/hdc5

Number Major Minor RaidDevice State
0 8 23 0 active sync /dev/sdb7
1 8 7 1 active sync /dev/sda7
3 3 5 2 active sync /dev/hda5

0, 1, and *3*. Where has number 2 gone? (And how does `Number' differ
from `RaidDevice'? Why have both?)


#14: Re: FailSpare event?

Posted on 2007-01-14 16:01:06 by Nix

On 13 Jan 2007, nix@esperi.org.uk uttered the following:
> mdadm-2.6 bug, I fear. I haven't tracked it down yet but will look
> shortly: I can't afford to not run mdadm --monitor... odd, that
> code hasn't changed during 2.6 development.

Whoo! Compile Monitor.c without optimization and the problem goes away.

Hunting: maybe it's a compiler bug (anyone not using GCC 4.1.1 seeing
this?), maybe mdadm is tripping undefined behaviour somewhere...


#15: Re: FailSpare event?

Posted on 2007-01-14 22:20:10 by NeilBrown

On Sunday January 14, nix@esperi.org.uk wrote:
> On 13 Jan 2007, nix@esperi.org.uk uttered the following:
> > mdadm-2.6 bug, I fear. I haven't tracked it down yet but will look
> > shortly: I can't afford to not run mdadm --monitor... odd, that
> > code hasn't changed during 2.6 development.
>
> Whoo! Compile Monitor.c without optimization and the problem goes away.
>
> Hunting: maybe it's a compiler bug (anyone not using GCC 4.1.1 seeing
> this?), maybe mdadm is tripping undefined behaviour somewhere...

Probably....

A quick look suggests that the following patch might make a
difference, but there is more to it than that. I think there are
subtle differences due to the use of version-1 superblocks. That
might be just another one-line change, but I want to make sure first.

Thanks,
NeilBrown



### Diffstat output
./Monitor.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/Monitor.c ./Monitor.c
--- .prev/Monitor.c 2006-12-21 17:15:55.000000000 +1100
+++ ./Monitor.c 2007-01-15 08:17:30.000000000 +1100
@@ -383,7 +383,7 @@ int Monitor(mddev_dev_t devlist,
)
alert("SpareActive", dev, dv, mailaddr, mailfrom, alert_cmd, dosyslog);
}
- st->devstate[i] = disc.state;
+ st->devstate[i] = newstate;
st->devid[i] = makedev(disc.major, disc.minor);
}
st->active = array.active_disks;
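The one-liner makes sense when you model the monitor loop: the code compares a computed `newstate` against the cached state, but then caches the raw `disc.state`, so the comparison can keep firing on every pass. A toy model of that behaviour (illustrative only, not mdadm's actual code):

```python
def poll(cached, current, fix):
    """One monitoring pass: returns (alert_fired, new_cached_state)."""
    newstate = current & 0b111    # pretend we normalize away some flag bits
    fired = newstate != cached
    # buggy version caches the raw state, so the normalized comparison
    # never sees the value it computed and keeps re-firing
    return fired, (newstate if fix else current)

def count_alerts(fix, passes=5, raw_state=0b1111):
    cached, fired_total = 0, 0
    for _ in range(passes):
        fired, cached = poll(cached, raw_state, fix)
        fired_total += fired
    return fired_total

print(count_alerts(fix=True))    # 1 -> alert once, then the state sticks
print(count_alerts(fix=False))   # 5 -> alert every pass, as the thread reports
```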

#16: Re: FailSpare event?

Posted on 2007-01-15 20:59:43 by Nix

On 15 Jan 2007, Bill Davidsen told this:
> Nix wrote:
>> Number Major Minor RaidDevice State
>> 0 8 6 0 active sync /dev/sda6
>> 1 8 22 1 active sync /dev/sdb6
>> 3 22 5 2 active sync /dev/hdc5
>>
>> Number Major Minor RaidDevice State
>> 0 8 23 0 active sync /dev/sdb7
>> 1 8 7 1 active sync /dev/sda7
>> 3 3 5 2 active sync /dev/hda5
>>
>> 0, 1, and *3*. Where has number 2 gone? (And how does `Number' differ
>> from `RaidDevice'? Why have both?)
>>
>>
> Did you ever move the data to these drives from another? I think this
> is what you see when you migrate by adding a drive as a spare, then
> mark an existing drive as failed, so the data is rebuilt on the new
> drive. Was there ever a device 2?

Nope. These arrays were created in one lump and never had a spare.

Plenty of pvmoves have happened on them, but that's *inside* the
arrays, of course...


#17: Re: FailSpare event?

Posted on 2007-01-15 21:08:40 by Nix

On 14 Jan 2007, Neil Brown told this:
> A quick look suggests that the following patch might make a
> difference, but there is more to it than that. I think there are
> subtle differences due to the use of version-1 superblocks. That
> might be just another one-line change, but I want to make sure first.

Well, that certainly made that warning go away. I don't have any
actually-failed disks, so I can't tell if it would *ever* warn anymore ;)

.... actually, it just picked up some monthly array check activity:

Jan 15 20:03:17 loki daemon warning: mdadm: Rebuild20 event detected on md device /dev/md2

So it looks like it works perfectly well now.

(Looking at the code, yeah, without that change it'll never remember
state changes at all!)

One bit of residue from the state before this patch remains on line 352,
where you initialize disc.state and then never use it for anything...
