the true behavior of mdadm's raid-1 with regard to vertical parity and silent error detection/scrubbing
the true behavior of mdadm's raid-1 with regard to vertical parity and silent error detection/scrubbing
On 18.08.2010 05:56:38 by Brett
I've googled endlessly about the internal nature of an md RAID-1.

Over the years, while verifying the two halves of Linux RAID-1 mirrors
that had been split, I've found several single-bit flips on traditional
platter disks. This seems to imply that what I'm looking for doesn't
exist- and that is simply a vertical parity within each disk at the md
level- even a single crc32 every once in a while, so that if a bit flips
on drive 1 of a mirror, drive 2's copy replaces it instead of drive 1's
bit-flipped copy replacing drive 2's good copy. From what I can gather,
it's a 50/50 shot whether your good copy gets mangled in the event of a
silent bit flip.
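
As a rough illustration of the arbitration described above - purely
hypothetical, since md's RAID-1 stores no per-block checksum today; the
helper name and the use of zlib's crc32() merely stand in for whatever
md would actually use:

#include <stdint.h>
#include <zlib.h>        /* crc32(), standing in for a real CRC helper */

#define BLOCK_SIZE 4096

/*
 * Given the two mirror copies of a block and the CRC32 recorded when
 * the block was last written, return the copy whose checksum still
 * matches, or NULL if the result is ambiguous (both match or both
 * fail).  Today's md RAID-1 keeps no such checksum, so on a mismatch
 * it cannot make this decision.
 */
const unsigned char *
resolve_mirror_mismatch(const unsigned char *copy_a,
                        const unsigned char *copy_b,
                        uint32_t stored_crc)
{
    uint32_t crc_a = crc32(0L, copy_a, BLOCK_SIZE);
    uint32_t crc_b = crc32(0L, copy_b, BLOCK_SIZE);

    if (crc_a == stored_crc && crc_b != stored_crc)
        return copy_a;   /* drive 2 took the flip; resync from drive 1 */
    if (crc_b == stored_crc && crc_a != stored_crc)
        return copy_b;   /* drive 1 took the flip; resync from drive 2 */
    return NULL;         /* ambiguous - back to today's 50/50 situation */
}
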
So- is there any built-in parity that helps mdadm decide which copy to
use when the copies disagree on a RAID-1 mirror during a resync?

If not- is there a reason why not beyond the extra space overhead and
the read/compute/write overhead?
This issue interests me all the more as I look into SSDs and the
prospect of flash blocks wearing out.
I'd choose a higher RAID level if I could, but this is only a very small
Atom 330 box with only mildly important data. I think I'm ultimately
looking for something like what ZFS has, but ZFS under RHEL/CentOS will
probably never happen in any meaningful, production-worthy way due to
licensing, the demise of Sun, and the taint Oracle puts on everything it
touches.
I'd love any information anyone has on the subject.
-Brett
Re: the true behavior of mdadm's raid-1 with regard to vertical parity and silent error detection/scrubbing
On 18.08.2010 09:45:33 by Michael Tokarev
18.08.2010 07:56, Brett L. Trotter wrote:
[]
> So- is there any built-in parity that helps mdadm decide which copy to
> use when the copies disagree on a RAID-1 mirror during a resync?
No, there isn't.
> If not- is there a reason why not beyond the extra space overhead and
> the read/compute/write overhead?
Well, this sounds pretty much like the old discussion about bad-block
marking in md, in filesystems, or in any other layer like that. But
nowadays - hopefully, anyway - all drives are capable of handling this
internally by remapping bad blocks. If a drive is no longer able to
remap a new bad block, it's time to throw it away or RMA it, instead of
trying to "cure" it in upper layers.
The same goes for parity. All modern drives, at least in theory, have
ways to ensure they either return whatever has been written or indicate
an error: ECC codes, checksums, parity, whatnot - things meant to detect
errors and sometimes correct simple ones like bit flips.
I understand you've got real cases where such detection did not work for
some reason. Well, bad-block remapping didn't always work in the past
either... ;)
It shouldn't be very difficult to implement checksumming and/or simple
ECC codes in md, storing the parity information in extra blocks either
at the end of the underlying device or in every, say, 64th block or so -
in order not to reduce the sector size to something like 511 bytes :).
The overhead shouldn't be large either, especially if it were done
together with bad-block remapping. But to me, the question is whether
there's a real reason/demand for doing so.
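
A quick sketch of what that "every 64th block" layout would imply - the
63+1 grouping and the helper names here are assumptions for
illustration, not anything md implements:

#include <stdint.h>

#define GROUP_DATA_BLOCKS  63u   /* data blocks per group (assumed split) */
#define GROUP_TOTAL_BLOCKS 64u   /* the 64th block holds their checksums  */

/*
 * Map a logical data block number (what the filesystem sees) to the
 * physical block on the member device, when every 64th on-disk block
 * is reserved for the checksums of the 63 data blocks before it.
 * Space overhead: 1/64, i.e. about 1.6%.
 */
uint64_t data_to_phys(uint64_t logical)
{
    return (logical / GROUP_DATA_BLOCKS) * GROUP_TOTAL_BLOCKS
         + (logical % GROUP_DATA_BLOCKS);
}

/* Physical block holding the checksum that covers 'logical'. */
uint64_t checksum_block_of(uint64_t logical)
{
    return (logical / GROUP_DATA_BLOCKS) * GROUP_TOTAL_BLOCKS
         + GROUP_DATA_BLOCKS;
}

The "at the end of the device" variant is the same idea, just with all
the checksum blocks collected after the data area instead of being
interleaved with it.
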
On the other hand, following this line of thought, one might say that
the whole md subsystem has been made obsolete by hardware RAID
controllers... :)
> This issue interests me all the more as I look into SSDs and the
> prospect of flash blocks wearing out.
And it is even more important for SSDs to have such a feature, and as
far as I understand, this is what they actually have. I might be wrong,
but...
/mjt
Re: the true behavior of mdadm's raid-1 with regard to vertical parity and silent error detection/scrubbing
On 18.08.2010 10:00:17 by Mikael Abrahamsson
On Wed, 18 Aug 2010, Michael Tokarev wrote:
> But to me, the question is whether there's a real reason/demand for doing so.
ZFS does it, and people who are paranoid about bit rot really want it.
It gives more protection against memory errors etc., i.e. problems
outside the drive while the bits are in transit from the drive through
the cables/controllers/drivers/block subsystem and so on. Of course it's
not perfect, but it gives some added protection.
Whether the cost/benefit analysis holds up I don't know, because I don't
know the complexity. Having a 64k stripe in md actually occupy 68k on
disk and store some checksum might make sense, but it doesn't give great
granularity. Perhaps those extra 4k could hold a checksum for each 4k
within the 64k stripe block, so that a fairly fine-grained error can be
reported, and if parity information is also available it can be read and
the problem corrected.
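
Sketched out, that per-4k-checksum idea could look something like the
following - the trailer layout, the constants, and the use of zlib's
crc32() are assumptions for illustration only:

#include <stdint.h>
#include <stddef.h>
#include <zlib.h>        /* crc32(), for illustration */

#define STRIPE_SIZE  (64 * 1024)
#define CHUNK_SIZE   (4 * 1024)
#define CHUNKS       (STRIPE_SIZE / CHUNK_SIZE)   /* 16 */

/*
 * Hypothetical on-disk layout: each 64k stripe block is followed by a
 * 4k trailer whose first 16 words are CRC32s, one per 4k chunk.  Only
 * 64 of the 4096 trailer bytes are needed, so there is room left over
 * for parity/ECC data if correction, not just detection, were wanted.
 */
struct stripe_trailer {
    uint32_t chunk_crc[CHUNKS];
    uint8_t  reserved[CHUNK_SIZE - CHUNKS * sizeof(uint32_t)];
};

/*
 * Return the index of the first 4k chunk whose checksum no longer
 * matches, or -1 if the whole 64k stripe block verifies cleanly.
 */
int verify_stripe(const unsigned char *stripe,
                  const struct stripe_trailer *trailer)
{
    for (int i = 0; i < CHUNKS; i++) {
        uint32_t crc = crc32(0L, stripe + (size_t)i * CHUNK_SIZE,
                             CHUNK_SIZE);
        if (crc != trailer->chunk_crc[i])
            return i;    /* rebuild just this 4k chunk from a good copy */
    }
    return -1;
}
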
--
Mikael Abrahamsson email: swmike@swm.pp.se
Re: the true behavior of mdadm's raid-1 with regard to vertical parity and silent error detection/scrubbing
On 04.09.2010 20:38:56 by Bill Davidsen
Brett L. Trotter wrote:
[]
> So- is there any built-in parity that helps mdadm decide which copy to
> use when the copies disagree on a RAID-1 mirror during a resync?
>
> If not- is there a reason why not beyond the extra space overhead and
> the read/compute/write overhead?
[]
I don't think you are going to love this: as far as I can tell, no
better recovery is done for the higher RAID levels either, if the
failure is silent rather than a drive failing outright. When a 'check'
is run and an error is found, Neil seems to believe it is not worth the
overhead of identifying which data is most likely wrong, so the data is
simply rewritten to make the mismatch go away, rather than attempting to
identify the most likely correct data for the most likely bad sector and
fix that. On an N>2 copy RAID-1, no check is made (unless the change is
very recent) to see whether N-1 copies agree on a value, and with RAID-6
the obvious check to find the data most likely to be wrong isn't done.
This has been discussed to death; I don't see any changes coming.
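
For what it's worth, the N-1-copies-agree check would be cheap to
express. A minimal sketch, illustrative only - md's check/resync does
nothing of the sort, it simply rewrites the other copies to make the
mismatch go away:

#include <string.h>

#define BLOCK_SIZE 4096

/*
 * Given n copies of the same block from an n-way RAID-1, return the
 * index of a copy that at least n-1 copies (including itself) agree
 * with, or -1 if no such majority exists.  For n == 2 this degenerates
 * to "pick the first copy", which is more or less the situation
 * described above.
 */
int pick_majority_copy(const unsigned char *const copies[], int n)
{
    for (int i = 0; i < n; i++) {
        int agree = 0;
        for (int j = 0; j < n; j++)
            if (memcmp(copies[i], copies[j], BLOCK_SIZE) == 0)
                agree++;
        if (agree >= n - 1)
            return i;
    }
    return -1;   /* too divergent to settle by voting */
}
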
--
Bill Davidsen
"We can't solve today's problems by using the same thinking we
used in creating them." - Einstein
Re: the true behavior of mdadm's raid-1 with regard to vertical parity and silent error detection/scrubbing
On 04.09.2010 20:56:21 by Mikael Abrahamsson
On Sat, 4 Sep 2010, Bill Davidsen wrote:
> This has been discussed to death; I don't see any changes coming.
True. Unless someone who really wants this (for instance, 64k on disk
plus 4k of ECC data for a total of 68k per stripe) is willing either to
put money on the table for someone to write it or to volunteer, I don't
see this coming either. And if someone is willing to actually code it
(or have it coded), a discussion should first happen about whether such
a change would be accepted.
--
Mikael Abrahamsson email: swmike@swm.pp.se