mdadm / force parity checking of blocks on all reads?

on 18.02.2011 03:04:48 by Steve Costaras

I'm looking at alternatives to ZFS, since it still has some time to go
before it's ready for large-scale deployment as a kernel-level file
system (and btrfs has years to go). I am running into problems with
silent data corruption on large deployments of disks. Currently no
hardware RAID vendor supports T10 DIF (which, even if supported, would
only work with SAS/FC drives anyway), nor does any perform read-time
parity checking.

I am hoping either that there is a way I don't know of to enable mdadm
to read the data plus the P+Q parity blocks for every request and
compare them for accuracy (similar to what a scrub does, but
/ALWAYS/), or to have this functionality added as an option.
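
For reference, the closest existing mechanism I know of is the
on-demand scrub driven through sysfs; a minimal sketch, assuming an
array at /dev/md0:

    # kick off a full pass comparing data against parity (no rewrites)
    echo check > /sys/block/md0/md/sync_action

    # once it finishes, see how many stripes disagreed
    cat /sys/block/md0/md/mismatch_cnt

    # 'repair' instead of 'check' would rewrite parity to match the data
    echo repair > /sys/block/md0/md/sync_action

What I'm after is effectively that check applied inline to every read.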

With the large-capacity drives we have today, getting bit errors is
quite common (I have scripts that run complete file checks every two
weeks across 50TB arrays, and they turn up errors every month). I'm
looking at expanding to 200-300TB volumes shortly, so the problem will
only become that much more frequent. Being able to check the data
against parity would make it possible to find, report, and correct
errors at read time, before they reach user space. This fixes bit rot
as well as torn/wild reads/writes, and mitigates transmission issues.
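
For what it's worth, the file checks amount to something like this (a
rough sketch; the paths are illustrative):

    # build a baseline of SHA-1 hashes for every file on the array
    find /data -type f -print0 | xargs -0 sha1sum > /root/baseline.sha1

    # two weeks later: re-hash and report any file whose contents changed
    sha1sum --check --quiet /root/baseline.sha1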

I searched the list but couldn't find this being discussed before; is
this possible?

Steve Costaras
stevecs@chaven.com

Re: mdadm / force parity checking of blocks on all reads?

on 18.02.2011 04:25:36 by NeilBrown

On Thu, 17 Feb 2011 20:04:48 -0600 Steve Costaras wrote:

>
>
> I'm looking at alternatives to ZFS, since it still has some time to go
> before it's ready for large-scale deployment as a kernel-level file
> system (and btrfs has years to go). I am running into problems with
> silent data corruption on large deployments of disks. Currently no
> hardware RAID vendor supports T10 DIF (which, even if supported, would
> only work with SAS/FC drives anyway), nor does any perform read-time
> parity checking.

Maybe I'm just naive, but I find it impossible to believe that "silent
data corruption" is ever acceptable. You should fix or replace your
hardware.

Yes, I know silent data corruption is theoretically possible at a very low
probability and that as you add more and more storage, that probability gets
higher and higher.

But my point is that the probability of unfixable but detectable corruption
will ALWAYS be much (much much) higher than the probability of silent data
corruption (on a correctly working system).

So if you are getting unfixable errors reported on some component,
replace that component. And if you aren't, then ask your vendor to
replace the system, because it is broken.


>
> I am hoping either that there is a way I don't know of to enable mdadm
> to read the data plus the P+Q parity blocks for every request and
> compare them for accuracy (similar to what a scrub does, but
> /ALWAYS/), or to have this functionality added as an option.

No, it is not currently possible to do this, nor do I have any plan to
implement it. I guess it would be possible in theory, though.

NeilBrown


>
> With the large-capacity drives we have today, getting bit errors is
> quite common (I have scripts that run complete file checks every two
> weeks across 50TB arrays, and they turn up errors every month). I'm
> looking at expanding to 200-300TB volumes shortly, so the problem will
> only become that much more frequent. Being able to check the data
> against parity would make it possible to find, report, and correct
> errors at read time, before they reach user space. This fixes bit rot
> as well as torn/wild reads/writes, and mitigates transmission issues.
>
> I searched the list but couldn't find this being discussed before; is
> this possible?
>
> Steve Costaras
> stevecs@chaven.com


Re: mdadm / force parity checking of blocks on all reads?

on 18.02.2011 05:34:32 by Roberto Spadim

You could put together a lot of RAID1 devices, heh, but I don't know if
that's a good idea... maybe change your hardware and try another, more
mature filesystem. A solution of RAID + LVM + filesystem might be
better; with LVM you can do online backups.
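
For example, an online backup from an LVM snapshot might look roughly
like this (volume group, volume, and mount names are illustrative):

    # create a point-in-time snapshot of the live volume
    lvcreate --snapshot --size 10G --name data_snap /dev/vg0/data

    # mount it read-only and back it up while the original stays online
    mount -o ro /dev/vg0/data_snap /mnt/snap
    tar -czf /backup/data.tar.gz -C /mnt/snap .

    # clean up
    umount /mnt/snap
    lvremove -f /dev/vg0/data_snap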

2011/2/18 NeilBrown:
> On Thu, 17 Feb 2011 20:04:48 -0600 Steve Costaras wrote:
>
>>
>>
>> I'm looking at alternatives to ZFS, since it still has some time to go
>> before it's ready for large-scale deployment as a kernel-level file
>> system (and btrfs has years to go). I am running into problems with
>> silent data corruption on large deployments of disks. Currently no
>> hardware RAID vendor supports T10 DIF (which, even if supported, would
>> only work with SAS/FC drives anyway), nor does any perform read-time
>> parity checking.
>
> Maybe I'm just naive, but I find it impossible to believe that "silent
> data corruption" is ever acceptable. You should fix or replace your
> hardware.
>
> Yes, I know silent data corruption is theoretically possible at a very
> low probability, and that as you add more and more storage, that
> probability gets higher and higher.
>
> But my point is that the probability of unfixable but detectable
> corruption will ALWAYS be much (much much) higher than the probability
> of silent data corruption (on a correctly working system).
>
> So if you are getting unfixable errors reported on some component,
> replace that component. And if you aren't, then ask your vendor to
> replace the system, because it is broken.
>
>
>>
>> I am hoping either that there is a way I don't know of to enable mdadm
>> to read the data plus the P+Q parity blocks for every request and
>> compare them for accuracy (similar to what a scrub does, but
>> /ALWAYS/), or to have this functionality added as an option.
>
> No, it is not currently possible to do this, nor do I have any plan to
> implement it. I guess it would be possible in theory, though.
>
> NeilBrown
>
>
>>
>> With the large-capacity drives we have today, getting bit errors is
>> quite common (I have scripts that run complete file checks every two
>> weeks across 50TB arrays, and they turn up errors every month). I'm
>> looking at expanding to 200-300TB volumes shortly, so the problem will
>> only become that much more frequent. Being able to check the data
>> against parity would make it possible to find, report, and correct
>> errors at read time, before they reach user space. This fixes bit rot
>> as well as torn/wild reads/writes, and mitigates transmission issues.
>>
>> I searched the list but couldn't find this being discussed before; is
>> this possible?
>>
>> Steve Costaras
>> stevecs@chaven.com



--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: mdadm / force parity checking of blocks on all reads?

on 18.02.2011 12:13:19 by Steve Costaras

On 2011-02-17 21:25, NeilBrown wrote:
> On Thu, 17 Feb 2011 20:04:48 -0600 Steve Costaras wrote:
>
>>
>> I'm looking at alternatives to ZFS, since it still has some time to go
>> before it's ready for large-scale deployment as a kernel-level file
>> system (and btrfs has years to go). I am running into problems with
>> silent data corruption on large deployments of disks. Currently no
>> hardware RAID vendor supports T10 DIF (which, even if supported, would
>> only work with SAS/FC drives anyway), nor does any perform read-time
>> parity checking.
> Maybe I'm just naive, but I find it impossible to believe that "silent
> data corruption" is ever acceptable. You should fix or replace your
> hardware.
>
> Yes, I know silent data corruption is theoretically possible at a very low
> probability and that as you add more and more storage, that probability gets
> higher and higher.
>
> But my point is that the probability of unfixable but detectable corruption
> will ALWAYS be much (much much) higher than the probability of silent data
> corruption (on a correctly working system).
>
> So if you are getting unfixable errors reported on some component, replace
> that component. And if you aren't, then ask your vendor to replace the
> system, because it is broken.
>
>
Would love to; do you have the home phone numbers of all the drive
manufacturers' CTOs so I can talk to them?
It's a fact of life across /ALL/ drives. This is 'SILENT' corruption,
i.e. it's not reported by anything in the I/O chain, since every layer
'assumes' the data in the request is good. That assumption has been
proven flawed.

You can discover this the way we do here: keep SHA-1 hashes of all
files and compare them over time. On our 40TB arrays (1TB Seagate and
Hitachi drives rated at a 10^15 BER) we find about 1-2 mismatches per
month. Each one requires us to restore the data from tape (after
checking that as well). This type of corruption is well known and
quite common; we first discovered it back in 2007-2008, which is why I
wrote the scripts to check for it. There has been a lot of discussion
of this in larger deployments (I know that at least CERN has seen it,
as they wrote a paper on it). Ideally, drive manufacturers should
improve their BER to 10^17 or better for the large-capacity drives
(unfortunately it's the smaller drives that get the better BER (10^16))
and also allow for T10 DIF (520-byte sectors, or 4160-byte for 4K
drives). However, that standard was only adopted by the T10 (SAS/SCSI)
group, not the T13 (SATA/IDE) group, which leaves another huge gap;
let alone the lack of HBAs that support T10 DIF/fat sectors (the LSI
9200 series is the only one I've found). The only large-capacity drive
I've found that seems to have some additional protection is the
Seagate ST32000444SS SAS drive, as it does ECC checks of each block at
read time and tries to correct it. Running 80 of these over the past
several months, I have not found an error that has reached user space
/so far/. However, this only checks that the block ECC matches the
block, so a wild write or wild read would go unnoticed (that's where
the DIF/DIX standard, which also covers the LBA, would be useful).
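
(As an aside, if you have sg3_utils you can check whether a SAS drive
was formatted with protection information; a quick sketch, device name
illustrative:)

    # query the long read-capacity page; prot_en=1 means the drive was
    # formatted with protection information (T10 DIF), p_type its type
    sg_readcap --long /dev/sdb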

This is the real driving factor for ZFS: it does not require T10 DIF
(fat sectors) or high-BER drives (manufacturers are not making them; a
lot of 2TB and 3TB drives are rated even at 10^14!!!!). ZFS works by
creating its own checksum and checking it on every transaction
(read/write), at least with regard to this type of problem. The same
level of assurance could be provided by /any/ type of RAID, since the
redundant data is already there, but it needs to be checked on every
transaction to verify integrity and, if wrong, corrected BEFORE the
data is handed to user space.
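
On the ZFS side the verification is automatic on every read, and a
full pass can also be forced on demand, e.g. (pool name illustrative):

    # walk the whole pool, verifying every block against its checksum
    zpool scrub tank

    # report per-device checksum error counts and any unrepairable files
    zpool status -v tank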

If this is not something that is planned for mdadm, then I'm back to
Solaris or FreeBSD in the meantime, until native ZFS is up to snuff.





Re: mdadm / force parity checking of blocks on all reads?

on 18.02.2011 13:07:04 by John Robinson

On 18/02/2011 11:13, Steve Costaras wrote:
> On 2011-02-17 21:25, NeilBrown wrote:
>> On Thu, 17 Feb 2011 20:04:48 -0600 Steve Costaras
>> wrote:
>>> I'm looking at alternatives to ZFS, since it still has some time to go
>>> before it's ready for large-scale deployment as a kernel-level file
>>> system (and btrfs has years to go). I am running into problems with
>>> silent data corruption on large deployments of disks. Currently no
>>> hardware RAID vendor supports T10 DIF (which, even if supported, would
>>> only work with SAS/FC drives anyway), nor does any perform read-time
>>> parity checking.
>> Maybe I'm just naive, but I find it impossible to believe that "silent
>> data corruption" is ever acceptable. You should fix or replace your
>> hardware.
>>
>> Yes, I know silent data corruption is theoretically possible at a very
>> low probability, and that as you add more and more storage, that
>> probability gets higher and higher.
>>
>> But my point is that the probability of unfixable but detectable
>> corruption will ALWAYS be much (much much) higher than the probability
>> of silent data corruption (on a correctly working system).
>>
>> So if you are getting unfixable errors reported on some component,
>> replace that component. And if you aren't, then ask your vendor to
>> replace the system, because it is broken.
>>
>>
> Would love to; do you have the home phone numbers of all the drive
> manufacturers' CTOs so I can talk to them?
> It's a fact of life across /ALL/ drives. This is 'SILENT' corruption,
> i.e. it's not reported by anything in the I/O chain, since every layer
> 'assumes' the data in the request is good. That assumption has been
> proven flawed.
>
> You can discover this the way we do here: keep SHA-1 hashes of all
> files and compare them over time. On our 40TB arrays (1TB Seagate and
> Hitachi drives rated at a 10^15 BER) we find about 1-2 mismatches per
> month.

I thought the BER was for reported uncorrectable errors? Or it might
include the silent ones, but those ought to be thousands or possibly
millions of times rarer - I don't know what ECC techniques they're
using, but presumably the manufacturers don't quote a BER for silent
corruption?

I did some sums a while ago and found that with current drives you've an
evens chance of getting a bit error with every ~43TB you read, with a 1
in 10^15 BER. I assumed that the drive would report it, allowing md or
any other RAID setup to reconstruct the data and re-write it.
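
For anyone who wants to redo the sums, a sketch under a simple model of
independent bit errors at the quoted rate (the exact figure depends on
the error model and on decimal vs. binary terabytes):

    # P(at least one bit error) when reading tb terabytes at a BER of
    # 1 in 1e15 bits, assuming independent errors (Poisson approximation)
    awk -v tb=43 'BEGIN { bits = tb * 1e12 * 8;
        printf("P(>=1 error in %gTB) = %.2f\n", tb, 1 - exp(-bits * 1e-15)) }'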

Can you estimate from your usage of your 40TB arrays what your "silent
BER" is?

[...]
> The only large capacity
> drive I've found that seems to have some additional protections is the
> Seagate ST32000444SS sas drive as it does ECC checks of each block at
> read time and tries to correct it.

Again, in theory, don't all drives do ECC all the time just to reach
their 1 in 10^15 BER? Do those Seagates quote a much better BER? Ooh,
no, but they do also quote a miscorrected BER of 1 in 10^21, which is
something I haven't seen quoted before, and they also note that these
rates only apply while the drive is doing "full read retries", so they
presumably wouldn't apply to a RAID setup using shortened SCT ERC
timeouts.
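
(Those are the timeouts typically shortened for RAID use via smartctl's
SCT ERC interface; a sketch, device name illustrative:)

    # read the drive's current SCT error recovery control settings
    smartctl -l scterc /dev/sdb

    # cap read/write error recovery at 7.0 seconds (values are in
    # tenths of a second), the usual setting for drives behind RAID
    smartctl -l scterc,70,70 /dev/sdb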

[...]
> This is the real driving factor for ZFS: it does not require T10 DIF
> (fat sectors) or high-BER drives (manufacturers are not making them; a
> lot of 2TB and 3TB drives are rated even at 10^14!!!!). ZFS works by
> creating its own checksum and checking it on every transaction
> (read/write), at least with regard to this type of problem. The same
> level of assurance could be provided by /any/ type of RAID, since the
> redundant data is already there, but it needs to be checked on every
> transaction to verify integrity and, if wrong, corrected BEFORE the
> data is handed to user space.
>
> If this is not something that is planned for mdadm, then I'm back to
> Solaris or FreeBSD in the meantime, until native ZFS is up to snuff.

A separate device-mapper target which did another layer of ECC over
hard drives has been suggested here, and I vaguely remember seeing a
patch at some point, which would take (perhaps) 64 sectors of data and
add an ECC sector. Such a thing should work well under RAID, but I
don't know what (if anything) happened to it.

Cheers,

John.
