RAID6 check found different events, how should I proceed?

RAID6 check found different events, how should I proceed?

on 06.08.2011 15:23:24 by mathias.buren

First, thanks for this:

> The primary purpose of data scrubbing a RAID is to detect & correct
> read errors on any of the member devices; both check and repair
> perform this function. Finding (and w/ repair correcting) mismatches
> is only a secondary purpose - it is only if there are no read errors
> but the data copy or parity blocks are found to be inconsistent that a
> mismatch is reported. In order to repair a mismatch, MD needs to
> restore consistency, by over writing the inconsistent data copy or
> parity blocks w/ the correct data. But, because the underlying member
> devices did not return any errors, MD has no way of knowing which
> blocks are correct, and which are incorrect; when it is told to do a
> repair, it makes the assumption that the first copy in a RAID1 or
> RAID10, or the data (non-parity) blocks in RAID4/5/6 are correct, and
> corrects the mismatch based on that assumption.
>
> That assumption may or may not be correct, but MD has no way of
> determining that reliably - but the user might be able to, by using
> additional knowledge or tools, so MD gives the user the option to
> perform data scrubbing either with (repair) or without (check) MD
> correcting the mismatches using that assumption.
>
>
> I hope that answers your question,
> Beolach
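
For reference, the check and repair actions described above are driven through sysfs; a minimal sketch, assuming the array is /dev/md0 as in this thread (run as root):

  # scrub without correcting: read every stripe and count inconsistencies
  echo check > /sys/block/md0/md/sync_action

  # scrub and overwrite inconsistent blocks using MD's assumption described above
  echo repair > /sys/block/md0/md/sync_action

  # progress, and the number of mismatched sectors found by the last scrub
  cat /proc/mdstat
  cat /sys/block/md0/md/mismatch_cnt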

My RAID6 is currently degraded with one HDD (panic mail on the list),
and my weekly cron job kicked in doing the RAID6 check action. This is
the result:

DEV EVENTS REALL PEND UNCORR CRC RAW ZONE END
sdb1 6239487 0 0 0 2 0 0
sdc1 6239487 0 0 0 0 0 0
sdd1 6239487 0 0 0 0 0 0
sde1 6239487 0 0 0 0 0 0
sdf1 6239490 0 0 0 0 49 6
sdg1 6239491 0 0 0 0 0 0
sdh1 (missing, on RMA trip)


(so the SMART is actually fine for all drives)
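
As an aside, a table like the one above can be collected roughly as follows. The column-to-attribute mapping is my assumption (REALL = Reallocated_Sector_Ct, PEND = Current_Pending_Sector, UNCORR = Offline_Uncorrectable, CRC = UDMA_CRC_Error_Count), with EVENTS taken from each member's md superblock:

  for d in sdb sdc sdd sde sdf sdg; do
      ev=$(mdadm --examine /dev/${d}1 | awk '/Events/ {print $3}')
      sm=$(smartctl -A /dev/$d | awk '/Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count/ {printf "%s ", $10}')
      echo "${d}1 $ev $sm"
  done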

Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdf1[5] sdg1[0] sdd1[4] sde1[7] sdc1[3] sdb1[1]
9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2
[7/6] [UUUUU_U]

unused devices:


/dev/md0:
Version : 1.2
Creation Time : Tue Oct 19 08:58:41 2010
Raid Level : raid6
Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
Raid Devices : 7
Total Devices : 6
Persistence : Superblock is persistent

Update Time : Sat Aug 6 14:13:08 2011
State : clean, degraded
Active Devices : 6
Working Devices : 6
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 64K

Name : ion:0 (local to host ion)
UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
Events : 6239491

Number Major Minor RaidDevice State
0 8 97 0 active sync /dev/sdg1
1 8 17 1 active sync /dev/sdb1
4 8 49 2 active sync /dev/sdd1
3 8 33 3 active sync /dev/sdc1
5 8 81 4 active sync /dev/sdf1
5 0 0 5 removed
7 8 65 6 active sync /dev/sde1

So sdf1 and sdg1 have a different event count. Does this mean the HDDs
have silently corrupted the data? I have no way of checking whether the
data itself is corrupt, except perhaps an fsck of the filesystem. Does
that make sense?

* Should I run a repair?
* Should I run a check again, to see if the event count changes?
* Is it likely I've 2 more bad harddrives that will die soon?
* Is it wise to run another smartctl -t long on all devices?

Thanks,
Mathias
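
Regarding the fsck idea above: a read-only pass is a comparatively safe first step on a degraded array. A sketch, assuming an ext4 filesystem on /dev/md0 and a hypothetical mount point /mnt/raid:

  umount /mnt/raid           # hypothetical mount point; unmount (or remount read-only) first
  fsck.ext4 -f -n /dev/md0   # -n answers "no" to every prompt, so nothing is modified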

Re: RAID6 check found different events, how should I proceed?

on 06.08.2011 18:02:48 by mathias.buren

On 6 August 2011 14:23, Mathias Burén <mathias.buren@gmail.com> wrote:
> My RAID6 is currently degraded with one HDD (panic mail on the list),
> and my weekly cron job kicked in doing the RAID6 check action. This is
> the result:
>
> DEV     EVENTS  REALL   PEND    UNCORR  CRC     RAW     ZONE    END
> sdb1    6239487 0       0       0       2       0       0
> sdc1    6239487 0       0       0       0       0       0
> sdd1    6239487 0       0       0       0       0       0
> sde1    6239487 0       0       0       0       0       0
> sdf1    6239490 0       0       0       0       49      6
> sdg1    6239491 0       0       0       0       0       0
> sdh1    (missing, on RMA trip)
>
(snip)
> * Should I run a repair?
> * Should I run a check again, to see if the event count changes?
> * Is it likely I've 2 more bad harddrives that will die soon?
> * Is it wise to run another smartctl -t long on all devices?
>
> Thanks,
> Mathias
>

A followup;

I ran smartctl -t long on all devices, and they all passed, SMART is
fine. The number of events is also the same for all HDDs now:

DEV     EVENTS  REALL   PEND    UNCORR  CRC     RAW     ZONE    END
sdb1    6244415 0       0       0       2       0       0
sdc1    6244415 0       0       0       0       0       0
sdd1    6244415 0       0       0       0       0       0
sde1    6244415 0       0       0       0       0       0
sdf1    6244415 0       0       0       0       49      6
sdg1    6244415 0       0       0       0       0       0
sdh1

This is without me running repair or anything like that.

Mathias
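
For reference, running the long self-tests and reading their results looks roughly like this (a sketch, using the drive names from the table above):

  for d in sdb sdc sdd sde sdf sdg; do
      smartctl -t long /dev/$d       # the test runs on the drive itself, in the background
  done
  # hours later, read the outcome per drive:
  smartctl -l selftest /dev/sdb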

Re: RAID6 check found different events, how should I proceed?

on 06.08.2011 19:09:04 by unknown

Can't offer any advice on this issue, but would be very interested to
hear the debrief once the situation is resolved.

On Sat, Aug 6, 2011 at 6:08 PM, Cal Leeming [Simplicity Media Ltd]
wrote:
>
> Can't offer any advice on this issue, but would be very interested to
> hear the debrief once the situation is resolved.
> On Sat, Aug 6, 2011 at 5:02 PM, Mathias Burén <mathias.buren@gmail.com> wrote:
>>
>> On 6 August 2011 14:23, Mathias Burén <mathias.buren@gmail.com> wrote:
>> > My RAID6 is currently degraded with one HDD (panic mail on the list),
>> > and my weekly cron job kicked in doing the RAID6 check action. This is
>> > the result:
>> >
>> > DEV     EVENTS  REALL   PEND    UNCORR  CRC     RAW     ZONE    END
>> > sdb1    6239487 0       0       0       2       0       0
>> > sdc1    6239487 0       0       0       0       0       0
>> > sdd1    6239487 0       0       0       0       0       0
>> > sde1    6239487 0       0       0       0       0       0
>> > sdf1    6239490 0       0       0       0       49      6
>> > sdg1    6239491 0       0       0       0       0       0
>> > sdh1    (missing, on RMA trip)
>> >
>> (snip)
>> > * Should I run a repair?
>> > * Should I run a check again, to see if the event count changes?
>> > * Is it likely I've 2 more bad harddrives that will die soon?
>> > * Is it wise to run another smartctl -t long on all devices?
>> >
>> > Thanks,
>> > Mathias
>> >
>>
>> A followup;
>>
>> I ran smartctl -t long on all devices, and they all passed, SMART is
>> fine. The number of events is also the same for all HDDs now:
>>
>> DEV     EVENTS  REALL   PEND    UNCORR  CRC     RAW     ZONE    END
>> sdb1    6244415 0       0       0       2       0       0
>> sdc1    6244415 0       0       0       0       0       0
>> sdd1    6244415 0       0       0       0       0       0
>> sde1    6244415 0       0       0       0       0       0
>> sdf1    6244415 0       0       0       0       49      6
>> sdg1    6244415 0       0       0       0       0       0
>> sdh1
>>
>> This is without me running repair or anything like that.
>>
>> Mathias
>

Re: RAID6 check found different events, how should I proceed?

on 06.08.2011 19:54:20 by alexander.kuehn

I'd do _nothing_ until I got a replacement drive. Then plug that in
and let it regain full redundancy.
After that you can start stressing the disks with the actions you
suggested if you like.
Alex.
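
For reference, re-adding the replacement once it arrives would look roughly like this; it assumes the new disk shows up as /dev/sdh again and gets the same partition layout as the other members:

  sfdisk -d /dev/sdb | sfdisk /dev/sdh   # copy the partition table from a healthy member
  mdadm /dev/md0 --add /dev/sdh1         # the array then rebuilds onto the new partition
  cat /proc/mdstat                       # watch the recovery until full redundancy is back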

Quoting Mathias Burén:

> First, thanks for this:
>
>> The primary purpose of data scrubbing a RAID is to detect & correct
>> read errors on any of the member devices; both check and repair
>> perform this function. Finding (and w/ repair correcting) mismatches
>> is only a secondary purpose - it is only if there are no read errors
>> but the data copy or parity blocks are found to be inconsistent that a
>> mismatch is reported. In order to repair a mismatch, MD needs to
>> restore consistency, by over writing the inconsistent data copy or
>> parity blocks w/ the correct data. But, because the underlying member
>> devices did not return any errors, MD has no way of knowing which
>> blocks are correct, and which are incorrect; when it is told to do a
>> repair, it makes the assumption that the first copy in a RAID1 or
>> RAID10, or the data (non-parity) blocks in RAID4/5/6 are correct, and
>> corrects the mismatch based on that assumption.
>>
>> That assumption may or may not be correct, but MD has no way of
>> determining that reliably - but the user might be able to, by using
>> additional knowledge or tools, so MD gives the user the option to
>> perform data scrubbing either with (repair) or without (check) MD
>> correcting the mismatches using that assumption.
>>
>>
>> I hope that answers your question,
>> Beolach
>
> My RAID6 is currently degraded with one HDD (panic mail on the list),
> and my weekly cron job kicked in doing the RAID6 check action. This is
> the result:
>
> DEV EVENTS REALL PEND UNCORR CRC RAW ZONE END
> sdb1 6239487 0 0 0 2 0 0
> sdc1 6239487 0 0 0 0 0 0
> sdd1 6239487 0 0 0 0 0 0
> sde1 6239487 0 0 0 0 0 0
> sdf1 6239490 0 0 0 0 49 6
> sdg1 6239491 0 0 0 0 0 0
> sdh1 (missing, on RMA trip)
>
>
> (so the SMART is actually fine for all drives)
>
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdf1[5] sdg1[0] sdd1[4] sde1[7] sdc1[3] sdb1[1]
> 9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2
> [7/6] [UUUUU_U]
>
> unused devices:
>
>
> /dev/md0:
> Version : 1.2
> Creation Time : Tue Oct 19 08:58:41 2010
> Raid Level : raid6
> Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
> Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
> Raid Devices : 7
> Total Devices : 6
> Persistence : Superblock is persistent
>
> Update Time : Sat Aug 6 14:13:08 2011
> State : clean, degraded
> Active Devices : 6
> Working Devices : 6
> Failed Devices : 0
> Spare Devices : 0
>
> Layout : left-symmetric
> Chunk Size : 64K
>
> Name : ion:0 (local to host ion)
> UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
> Events : 6239491
>
> Number Major Minor RaidDevice State
> 0 8 97 0 active sync /dev/sdg1
> 1 8 17 1 active sync /dev/sdb1
> 4 8 49 2 active sync /dev/sdd1
> 3 8 33 3 active sync /dev/sdc1
> 5 8 81 4 active sync /dev/sdf1
> 5 0 0 5 removed
> 7 8 65 6 active sync /dev/sde1
>
> So sdf1 and sdg1 have a different event count. Does this mean the HDDs
> have silently corrupted the data? I have no way of checking if the
> data itself is corrupt or not, except for perhaps a fsck of the
> filesystem? Does that make sense?
>
> * Should I run a repair?
> * Should I run a check again, to see if the event count changes?
> * Is it likely I've 2 more bad harddrives that will die soon?
> * Is it wise to run another smartctl -t long on all devices?
>
> Thanks,
> Mathias

Re: RAID6 check found different events, how should I proceed?

on 09.08.2011 00:57:04 by NeilBrown

On Sat, 6 Aug 2011 17:02:48 +0100 Mathias Burén <mathias.buren@gmail.com> wrote:

> On 6 August 2011 14:23, Mathias Burén <mathias.buren@gmail.com> wrote:
> > My RAID6 is currently degraded with one HDD (panic mail on the list),
> > and my weekly cron job kicked in doing the RAID6 check action. This is
> > the result:
> >
> > DEV     EVENTS  REALL   PEND    UNCORR  CRC     RAW     ZONE    END
> > sdb1    6239487 0       0       0       2       0       0
> > sdc1    6239487 0       0       0       0       0       0
> > sdd1    6239487 0       0       0       0       0       0
> > sde1    6239487 0       0       0       0       0       0
> > sdf1    6239490 0       0       0       0       49      6
> > sdg1    6239491 0       0       0       0       0       0
> > sdh1    (missing, on RMA trip)
> >
> (snip)
> > * Should I run a repair?
> > * Should I run a check again, to see if the event count changes?
> > * Is it likely I've 2 more bad harddrives that will die soon?
> > * Is it wise to run another smartctl -t long on all devices?
> >
> > Thanks,
> > Mathias
> >
>
> A followup;
>
> I ran smartctl -t long on all devices, and they all passed, SMART is
> fine. The number of events is also the same for all HDDs now:
>
> DEV     EVENTS  REALL   PEND    UNCORR  CRC     RAW     ZONE    END
> sdb1    6244415 0       0       0       2       0       0
> sdc1    6244415 0       0       0       0       0       0
> sdd1    6244415 0       0       0       0       0       0
> sde1    6244415 0       0       0       0       0       0
> sdf1    6244415 0       0       0       0       49      6
> sdg1    6244415 0       0       0       0       0       0
> sdh1
>
> This is without me running repair or anything like that.

The thing that you did which produced the change was that you let time pass.

Presumably there was a time delay (maybe small) between extracting the
'events' number from sde1 and sdf1, then sdf1 and sdg1. During these times
the events on all devices in the array was updated. This implies some thread
was writing, but possibly not writing very heavily.

When you sampled them all the second time and got the same number there were
presumably no writes happening, so the event numbers didn't change.

When there are occasional writes the array oscillates between 'clean' and
'active', and each change updates the 'events' number.

NeilBrown
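
A minimal way to watch the behaviour described above, assuming the array is /dev/md0: the state flips between 'clean' and 'active' as occasional writes arrive, and each transition bumps the events counter.

  watch -n 2 'cat /sys/block/md0/md/array_state; mdadm --detail /dev/md0 | grep Events'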
