raid 5 mismatch_cnt errors

on 20.05.2010 19:02:23 by Trey Scarborough

I have a RAID 5 array with 9 disks, and its mismatch_cnt keeps
growing. This is causing file corruption on the underlying file systems
as well: if I copy a group of 100 files of 100 MB each and then run md5sum on
them, 1-3 will be corrupt. If a bad drive is responsible, is there
any way to get a report of how many of these mismatches occur per
drive? I have run the smartmontools tests and do not see one drive that
stands out as causing errors. Could something else be causing these errors?
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: raid 5 mismatch_cnt errors

on 20.05.2010 23:16:45 by NeilBrown

When RAID5 detects an inconsistency, there is no way to know which device was
wrong.
SMART only detects some errors, not all.
I have had hard drives before which appeared to have a single-bit error in
their internal buffer. No error would be reported, but the data you read would
sometimes be wrong.
RAID5 cannot help you with this sort of error.

I would suggest backing up all your data (if it isn't already too late),
breaking the array, and testing each device individually,
e.g. create a filesystem on the device and try copying data on and reading it
off.
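
The copy-and-read-back test could be scripted along these lines. This is only a rough sketch, not something posted in the thread: the function names and the mount point `/mnt/testdisk` are made up for illustration, and SHA-256 stands in for the md5sum mentioned earlier.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(sources, dest_dir):
    """Copy each source file into dest_dir and return the names of the
    files whose copy does not hash to the same value as the original."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    corrupt = []
    for src in map(Path, sources):
        dst = dest_dir / src.name
        shutil.copyfile(src, dst)
        if sha256_of(src) != sha256_of(dst):
            corrupt.append(src.name)
    return corrupt


# Hypothetical usage, with the device under test mounted at /mnt/testdisk:
#   bad = copy_and_verify(Path("/data/samples").glob("*.bin"), "/mnt/testdisk")
```

Note that a read straight after a write may be served from the page cache rather than from the disk, so unmounting and remounting the filesystem (or otherwise dropping caches) before hashing the copies makes the test more meaningful.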

NeilBrown

Re: raid 5 mismatch_cnt errors

on 21.05.2010 00:29:37 by Trey Scarborough

That's what I was afraid of. The problem I have is that if I back it up, I
don't know which data is bad. Luckily it appears to be a write error, because
once data has been written correctly I can run checksums on all the files and
I do not see any more errors. I was thinking that there might be a way of doing
a resync with the debug level turned up somehow, so that it would log each
mismatch along with the drives it was reading from at the time. I could then
take that information and, considering there are 9 drives in the array,
the one that shows up most often should be the culprit. I could
then remove that drive from the array and test it, leaving the rest in a
state that could be rebuilt, with the data consistent because the
drive with the bad writes would be removed. Is this something that
might be possible?

Thanks,
Trey

Re: raid 5 mismatch_cnt errors

on 21.05.2010 00:38:19 by NeilBrown

To detect a mismatch, raid5 reads from all drives in parallel, calculates the
parity across the data blocks and compares that to the parity block.
So no: something like that is not possible.
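
A toy illustration of why the check cannot point at a device. This is a simplified model with made-up two-byte "blocks", not the md driver's code. A corrupted data block and a corrupted parity block both give the same answer, "the stripe does not add up":

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity_block):
    """True if the stored parity equals the XOR of the data blocks."""
    return xor_blocks(data_blocks) == parity_block


data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_blocks(data)

# Flip one bit in a data block, or in the parity block instead.
bad_data = [data[0], b"\x11\x20", data[2]]
bad_parity = bytes([parity[0] ^ 0x01, parity[1]])

print(stripe_is_consistent(data, parity))      # True
print(stripe_is_consistent(bad_data, parity))  # False
print(stripe_is_consistent(data, bad_parity))  # False
```

With a single parity block, the comparison yields one bit of information per stripe, which is not enough to say which of the blocks is the wrong one.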

The only thing I can suggest:

- add a write-intent bitmap so you can remove/re-add devices fairly cheaply
- create a very large file.
- write random data to the file without truncating it (use dd of=file
conv=notrunc), then read it back and see if it matches. If it does, then
this approach doesn't help. If it doesn't:

One by one, fail/remove a drive from the array. Write new random data to the
same file and read it back and compare. Then --re-add the missing device.
I'm hoping that you will get an error every time except when the 'bad'
device has been removed.
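
A rough sketch of how this procedure might be scripted, in Python rather than dd. The function names are invented for illustration, and the array and device names in the usage comment are placeholders. The mdadm command lines are only built here, not executed.

```python
import hashlib
import os


def overwrite_and_verify(path, size, chunk_size=1 << 20):
    """Overwrite the first `size` bytes of an existing file with random data,
    without truncating it, then read the bytes back and compare digests.
    Returns the byte offsets of the chunks that did not read back intact."""
    digests = []
    with open(path, "r+b") as f:  # "r+b" leaves the file's length alone
        remaining = size
        while remaining > 0:
            chunk = os.urandom(min(chunk_size, remaining))
            f.write(chunk)
            digests.append((len(chunk), hashlib.sha256(chunk).digest()))
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())  # push the writes down to the array
    bad_offsets = []
    with open(path, "rb") as f:
        offset = 0
        for length, digest in digests:
            if hashlib.sha256(f.read(length)).digest() != digest:
                bad_offsets.append(offset)
            offset += length
    return bad_offsets


def bitmap_command(array):
    """Command line that adds an internal write-intent bitmap to the array."""
    return ["mdadm", "--grow", array, "--bitmap=internal"]


def drive_cycle_commands(array, device):
    """Command lines to take one member out and, after the test, put it back."""
    take_out = [["mdadm", array, "--fail", device],
                ["mdadm", array, "--remove", device]]
    put_back = [["mdadm", array, "--re-add", device]]
    return take_out, put_back


# Placeholder names; run the commands by hand or via subprocess as root:
#   take_out, put_back = drive_cycle_commands("/dev/md0", "/dev/sdb1")
```

Two caveats for anyone trying this: the read-back can be satisfied from the page cache unless caches are dropped first, and since RAID5 survives only one missing member, the re-added drive must finish resyncing before the next one is failed.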

NeilBrown

Re: raid 5 mismatch_cnt errors

on 21.05.2010 04:16:07 by Doug Ledford

While a bad drive is certainly a possibility here, this is precisely the
type of failure scenario that would make me suspect bad RAM,
motherboard, or CPU. So I wouldn't rule those out as possibilities either.

-- 
Doug Ledford
GPG KeyID: CFBFF194
http://people.redhat.com/dledford

Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband


Re: raid 5 mismatch_cnt errors

on 21.05.2010 18:40:34 by MRK

On 05/21/2010 04:16 AM, Doug Ledford wrote:
> While a bad drive is certainly a possibility here, this is precisely the
> type of failure scenario that would make me suspect bad RAM,
> motherboard, or CPU. So I wouldn't rule those out as possibilities either.
>

Could the cabling to the drive be causing this? (Maybe it is failing, or maybe
it's partly disconnected.)
I don't remember how far along Linux is in implementing checksums
between the controller and the drive.

Re: raid 5 mismatch_cnt errors

on 21.05.2010 22:57:29 by Doug Ledford

On 05/21/2010 12:40 PM, MRK wrote:
> Could the cabling to the drive be causing this? (maybe failing or maybe
> it's partly disconnected)
> I don't remember at what point Linux is at implementing the checksums
> between the controller and the drive.

I don't know. I'm not up on the SATA signaling details so I don't know
if it uses CRC on the signal, but I suspect it does and a bad cable
would cause failed requests. But I wouldn't bet my house on it, so I
would ask some SATA gurus.


-- 
Doug Ledford
GPG KeyID: CFBFF194
http://people.redhat.com/dledford

Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband


Re: raid 5 mismatch_cnt errors

on 24.05.2010 11:34:28 by Tim Small

On 21/05/10 21:57, Doug Ledford wrote:
> On 05/21/2010 12:40 PM, MRK wrote:
>
>> On 05/21/2010 04:16 AM, Doug Ledford wrote:
>>
>> Could the cabling to the drive be causing this? (maybe failing or maybe
>> it's partly disconnected)
>> I don't remember at what point Linux is at implementing the checksums
>> between the controller and the drive.
>>
> I don't know. I'm not up on the SATA signaling details so I don't know
> if it uses CRC on the signal, but I suspect it does and a bad cable
> would cause failed requests. But I wouldn't bet my house on it, so I
> would ask some SATA gurus.
>

I wouldn't call myself that, but I believe PATA and SATA-level CRC
errors show up in the UDMA_CRC_Error_Count SMART attribute - look for a
non-zero raw value in the smartctl output. This is presumably just the
error count from the drive's point of view (bad data received at the drive
end). I don't know what happens with CRC errors detected at the Linux
end, or whether detection is controller-dependent. Better ask on
linux-ide.
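
Checking that attribute across nine drives is easier with a small parser. The sketch below assumes the usual ten-column attribute table printed by `smartctl -A`. The sample text is illustrative only, not output from the system discussed in this thread, and the function name is made up.

```python
def udma_crc_raw_value(smartctl_output):
    """Return the raw value of UDMA_CRC_Error_Count from the text printed
    by `smartctl -A`, or None if the attribute is not listed."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows: ID, name, flag, value, worst, thresh, type,
        # updated, when_failed, raw value.
        if len(fields) >= 10 and fields[1] == "UDMA_CRC_Error_Count":
            return int(fields[9])
    return None


# Illustrative sample only (values invented for the example).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       17
"""

print(udma_crc_raw_value(SAMPLE))  # 17
```

In practice the text would come from running `smartctl -A` against each member device in turn. A drive whose count is non-zero, and especially one whose count keeps rising, points at the cable or port rather than the platters.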


From the SMART attribute name, presumably the earlier PATA transfer
modes don't support CRC error detection.

An easy thing to check might be to reduce the libata transfer speed from
3 Gbps to 1.5 Gbps. Similarly, try to test each drive and SATA port in
isolation if you can.

Tim.

--
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53 http://seoss.co.uk/ +44-(0)1273-808309

Re: raid 5 mismatch_cnt errors

on 25.05.2010 21:09:43 by Robert Hancock

ATA transfer errors should cause a bad CRC, resulting in a failed
transfer, which will cause complaints in the kernel log. For PATA, only
UDMA modes can detect CRC errors; PIO and MWDMA transfers can't.

There are other places where data corruption can occur, however, such as
inside the controller or the drive itself.
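
To look for such complaints, one could count CRC-related lines per ATA port in saved kernel log text. This sketch assumes that libata marks interface CRC problems with `ICRC` in the error flags and `BadCRC` in the SError decoding. The sample lines are illustrative, not taken from the affected machine.

```python
import re

# Assumed markers for interface CRC problems in libata messages.
CRC_PATTERN = re.compile(r"\b(ICRC|BadCRC)\b")
PORT_PATTERN = re.compile(r"\b(ata\d+)(?:\.\d+)?:")


def crc_complaints_by_port(log_text):
    """Count kernel log lines mentioning a CRC error, grouped by ATA port."""
    counts = {}
    for line in log_text.splitlines():
        if CRC_PATTERN.search(line):
            match = PORT_PATTERN.search(line)
            port = match.group(1) if match else "unknown"
            counts[port] = counts.get(port, 0) + 1
    return counts


# Illustrative sample only.
SAMPLE_LOG = """\
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata3: SError: { UnrecovData BadCRC }
ata3.00: error: { ICRC ABRT }
"""

print(crc_complaints_by_port(SAMPLE_LOG))  # {'ata3': 2}
```

An empty result only rules out link-level corruption that the CRC caught. It says nothing about errors inside the controller or the drive.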

Re: raid 5 mismatch_cnt errors

on 26.05.2010 17:07:26 by Bill Davidsen

Doug Ledford wrote:
> While a bad drive is certainly a possibility here, this is precisely the
> type of failure scenario that would make me suspect bad RAM,
> motherboard, or CPU. So I wouldn't rule those out as possibilities either.
>

I have the same thought. I would remove half the RAM from the system and
test again, then swap to the "other" half and repeat. Of course running
memtest first is a good idea, but I have seen failures which only happen
on disk access.

If the system is overclocked, obviously the first step is to cut the speed back.

--
Bill Davidsen
"We can't solve today's problems by using the same thinking we
used in creating them." - Einstein

Re: raid 5 mismatch_cnt errors

on 26.05.2010 17:49:52 by Doug Ledford

On 05/26/2010 11:07 AM, Bill Davidsen wrote:
> I have the same thought, I would remove half the RAM from the system and
> test again, then swap to the "other" half and repeat. Of course running
> memtest first is a good idea, but I have seen failures which only happen
> on disk access.

Indeed, I've seen lots of failures that only happen with disk access and
not with memory testers. That is why I have a shell script on my web page
(linked in my sig) that uses disk access to test memory.

-- 
Doug Ledford
GPG KeyID: CFBFF194
http://people.redhat.com/dledford

Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband

