Triple-parity raid6

on 09.06.2011 02:01:06 by David Brown

Has anyone considered triple-parity raid6? As far as I can see, it
should not be significantly harder than normal raid6 - either to
implement, or for the processor at run-time. Once you have the GF(2⁸)
field arithmetic in place for raid6, it's just a matter of making
another parity block in the same way but using a different generator:

P = D_0 + D_1 + D_2 + .. + D_(n-1)
Q = D_0 + g.D_1 + g².D_2 + .. + g^(n-1).D_(n-1)
R = D_0 + h.D_1 + h².D_2 + .. + h^(n-1).D_(n-1)

The raid6 implementation in mdraid uses g = 0x02 to generate the second
parity (based on "The mathematics of RAID-6" - I haven't checked the
source code). You can make a third parity using h = 0x04 and then get a
redundancy of 3 disks. (Note - I haven't yet confirmed that this is
valid for more than 100 data disks - I need to make my checker program
more efficient first.)
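
To make that concrete, here is roughly what the calculation looks like
for one byte position, in Python (the language of my quick-and-dirty
checker). The helper names are my own invention, not anything from
mdraid; gf_mul reduces by the same 0x11d polynomial as "The mathematics
of RAID-6":

def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return result

def gf_pow(g, n):
    """g to the n'th power in GF(2^8), by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, g)
    return result

def syndromes(data):
    """P, Q, R for one byte position across data disks d_0 .. d_(n-1)."""
    p = q = r = 0
    for i, d in enumerate(data):
        p ^= d                        # P: plain xor
        q ^= gf_mul(gf_pow(2, i), d)  # Q: generator g = 0x02
        r ^= gf_mul(gf_pow(4, i), d)  # R: generator h = 0x04
    return p, q, r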

Rebuilding a disk, or running in degraded mode, is just an obvious
extension to the current raid6 algorithms. If you are missing three
data blocks, the maths looks hard to start with - but if you express the
equations as a set of linear equations and use standard matrix inversion
techniques, it should not be hard to implement. You only need to do
this inversion once when you find that one or more disks have failed -
then you pre-compute the multiplication tables in the same way as is
done for raid6 today.
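
As a rough sketch of that recovery step (again Python, again with names
of my own): once the surviving disks' contributions have been xored out
of P, Q and R, the three missing bytes are the solution of a 3x3 linear
system over GF(2⁸), and plain Gauss-Jordan elimination solves it -
subtraction is just xor. The gf helpers are repeated from the sketch
above so this stands alone:

def gf_mul(a, b):
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return result

def gf_pow(g, n):
    result = 1
    for _ in range(n):
        result = gf_mul(result, g)
    return result

def gf_inv(a):
    return gf_pow(a, 254)  # a^254 = a^-1; the non-zero elements form a group of order 255

def recover3(failed, p, q, r):
    """Solve for the data bytes at the three failed indexes, given the
    partial syndromes p, q, r (all surviving disks already xored out).
    The matrix is invertible for valid generator choices."""
    aug = [[gf_pow(g, i) for i in failed] + [rhs]
           for g, rhs in ((1, p), (2, q), (4, r))]
    n = 3
    for col in range(n):
        piv = next(row for row in range(col, n) if aug[row][col])
        aug[col], aug[piv] = aug[piv], aug[col]
        inv = gf_inv(aug[col][col])
        aug[col] = [gf_mul(inv, v) for v in aug[col]]
        for row in range(n):
            if row != col and aug[row][col]:
                f = aug[row][col]
                aug[row] = [a ^ gf_mul(f, b) for a, b in zip(aug[row], aug[col])]
    return [row[n] for row in aug]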

In normal use, calculating the R parity is no more demanding than
calculating the Q parity. And most rebuilds or degraded situations will
only involve a single disk, and the data can thus be re-constructed
using the P parity just like raid5 or two-parity raid6.


I'm sure there are situations where triple-parity raid6 would be
appealing - it has already been implemented in ZFS, and it is only a
matter of time before two-parity raid6 has a real probability of hitting
an unrecoverable read error during a rebuild.


And of course, there is no particular reason to stop at three parity
blocks - the maths can easily be generalised. 1, 2, 4 and 8 can be used
as generators for quad-parity (checked up to 60 disks), and adding 16
gives you quintuple parity (checked up to 30 disks) - but that's maybe
getting a bit paranoid.


ref.:

Regards,

David


Re: Triple-parity raid6

on 09.06.2011 03:49:54 by NeilBrown

On Thu, 09 Jun 2011 02:01:06 +0200 David Brown wrote:

> Has anyone considered triple-parity raid6? As far as I can see, it
> should not be significantly harder than normal raid6 - either to
> implement, or for the processor at run-time.
> [...]

-ENOPATCH :-)

I have a series of patches nearly ready which removes a lot of the remaining
duplication in raid5.c between raid5 and raid6 paths. So there will be
relatively few places where RAID5 and RAID6 do different things - only the
places where they *must* do different things.
After that, adding a new level or layout which has 'max_degraded == 3' would
be quite easy.
The most difficult part would be the enhancements to libraid6 to generate the
new 'syndrome', and to handle the different recovery possibilities.

So if you're not otherwise busy this weekend, a patch would be nice :-)

Thanks,
NeilBrown

Re: Triple-parity raid6

on 09.06.2011 13:32:59 by David Brown

On 09/06/2011 03:49, NeilBrown wrote:
> On Thu, 09 Jun 2011 02:01:06 +0200 David Brown wrote:
>
>> Has anyone considered triple-parity raid6? [...]
>
> -ENOPATCH :-)
>
> I have a series of patches nearly ready which removes a lot of the remaining
> duplication in raid5.c between raid5 and raid6 paths. [...]
> The most difficult part would be the enhancements to libraid6 to generate the
> new 'syndrome', and to handle the different recovery possibilities.
>
> So if you're not otherwise busy this weekend, a patch would be nice :-)

I'm not going to promise any patches, but maybe I can help with the
maths. You say the difficult part is the syndrome calculations and
recovery - I've got these bits figured out on paper and some
quick-and-dirty python test code. On the other hand, I don't really
want to get into the md kernel code, or the mdadm code - I haven't done
Linux kernel development before (I mostly program 8-bit microcontrollers
- when I code on Linux, I use Python), and I fear it would take me a
long time to get up to speed.

However, if the parity generation and recovery is neatly separated into
a libraid6 library, the whole thing becomes much more tractable from my
viewpoint. Since I am new to this, can you tell me where I should get
the current libraid6 code? I'm sure google will find some sources for
me, but I'd like to make sure I start with whatever version /you/ have.





Re: Triple-parity raid6

on 09.06.2011 14:04:38 by NeilBrown

On Thu, 09 Jun 2011 13:32:59 +0200 David Brown wrote:

> [...]
>
> However, if the parity generation and recovery is neatly separated into
> a libraid6 library, the whole thing becomes much more tractable from my
> viewpoint. Since I am new to this, can you tell me where I should get
> the current libraid6 code?

You can see the current kernel code at:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tree;f=lib/raid6;h=970c541a452d3b9983223d74b10866902f1a47c7;hb=HEAD

int.uc is the generic C code which 'unroll.awk' processes to make various
versions that unroll the loops different amounts to work with CPUs with
different numbers of registers.
Then there is sse1, sse2, altivec which provide the same functionality in
assembler which is optimised for various processors.

And 'recov' has the smarts for doing the reverse calculation when 2 data
blocks, or 1 data and P are missing.

Even if you don't feel up to implementing everything, a start might be
useful. You never know when someone might jump up and offer to help.

NeilBrown

Re: Triple-parity raid6

on 09.06.2011 21:19:35 by David Brown

On 09/06/11 14:04, NeilBrown wrote:
> [...]
>
> Even if you don't feel up to implementing everything, a start might be
> useful. You never know when someone might jump up and offer to help.
>
> NeilBrown

Monday is a holiday here in Norway, so I've got a long weekend. I
should get at least /some/ time to have a look at libraid6!

Regards,

David



Re: Triple-parity raid6

on 10.06.2011 00:42:41 by David Brown

On 09/06/11 02:01, David Brown wrote:
> Has anyone considered triple-parity raid6? [...]
>
> And of course, there is no particular reason to stop at three parity
> blocks - the maths can easily be generalised. 1, 2, 4 and 8 can be used
> as generators for quad-parity (checked up to 60 disks), and adding 16
> gives you quintuple parity (checked up to 30 disks) - but that's maybe
> getting a bit paranoid.
> [...]

Just to follow up on my numbers here - I've now checked the validity of
triple-parity using generators 1, 2 and 4 for up to 254 data disks
(i.e., 257 disks altogether). I've checked the validity of quad-parity
up to 120 disks - checking the full 253 disks will probably take the
machine most of the night. I'm sure there is some mathematical way to
prove this, and it could certainly be checked more efficiently than with
a Python program - but my computer has more spare time than me!



Re: Triple-parity raid6

on 10.06.2011 05:22:14 by Namhyung Kim

NeilBrown writes:
> You can see the current kernel code at:
> [...]
>
> Even if you don't feel up to implementing everything, a start might be
> useful. You never know when someone might jump up and offer to help.

Maybe I could help David to some extent. :)

I'm gonna read the raid6 code next week and hope that there is room I
can help with, FWIW.

Thanks.


--
Regards,
Namhyung Kim

Re: Triple-parity raid6

on 10.06.2011 10:45:46 by David Brown

On 10/06/2011 05:22, Namhyung Kim wrote:
> NeilBrown writes:
>> [...]
>
> Maybe I could help David to some extent. :)
>
> I'm gonna read the raid6 code next week and hope that there is room I
> can help with, FWIW.
>

No matter how far I manage to get, I'm going to need help from someone
who knows the system and the development process, how the new functions
would fit with the rest of the system, and not least people who can
check and test the code.

Making multiple parity syndromes is easy enough mathematically:

For each parity block P_j, you have a generator g_j and calculate for d_i
running over all data disks:

P_j = sum((g_j ^ i) . d_i)

Raid5 parity uses g_0 = 1, so it is just the xor.
Raid6 uses g_0 = 1, g_1 = 2.

Any independent generators could be used, of course. I am not sure how
to prove that a set of generators is independent (except in the easy
case of 2 generators, as shown in the raid6.pdf paper) - but brute force
testing over all choices of dead disks is easy enough.

For Raid7, the obvious choice is g_2 = 4 - then we can re-use existing
macros and optimisations.
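
For what it's worth, the brute-force independence check I mentioned
looks roughly like this in Python - it builds the coefficient matrix for
every choice of failed data disks and verifies that it has full rank
over GF(2^8). Again, all the names are mine, and gf_mul is the usual
0x11d multiply:

from itertools import combinations

def gf_mul(a, b):
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return result

def gf_pow(g, n):
    result = 1
    for _ in range(n):
        result = gf_mul(result, g)
    return result

def full_rank(m):
    """Gaussian elimination over GF(2^8); subtraction is xor."""
    m = [row[:] for row in m]
    n = len(m)
    for col in range(n):
        piv = next((r for r in range(col, n) if m[r][col]), None)
        if piv is None:
            return False
        m[col], m[piv] = m[piv], m[col]
        inv = gf_pow(m[col][col], 254)        # multiplicative inverse
        for r in range(col + 1, n):
            if m[r][col]:
                f = gf_mul(m[r][col], inv)
                m[r] = [a ^ gf_mul(f, b) for a, b in zip(m[r], m[col])]
    return True

def check(gens, ndatadisks):
    """Return the first failure pattern whose recovery equations are not
    independent, or None if every pattern is recoverable."""
    k = len(gens)
    for failed in combinations(range(ndatadisks), k):
        if not full_rank([[gf_pow(g, i) for i in failed] for g in gens]):
            return failed
    return None

# check((1, 2, 4), 254) returning None is the triple-parity claim above.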


I've had a brief look at int.uc - the gen_syndrome function is nice and
small, but that's before awk gets at it. I haven't yet tried building
any of this, but extending raid6_int to a third syndrome is, I hope,
straightforward:

static void raid7_int$#_gen_syndrome(int disks, size_t bytes, void **ptrs)
{
	u8 **dptr = (u8 **)ptrs;
	u8 *p, *q, *r;
	int d, z, z0;

	unative_t wd$$, wq$$, wp$$, w1$$, w2$$, wr$$, w3$$, w4$$;

	z0 = disks - 4;		/* Highest data disk */
	p = dptr[z0+1];		/* XOR parity */
	q = dptr[z0+2];		/* RS syndrome */
	r = dptr[z0+3];		/* RS syndrome 2 */

	for ( d = 0 ; d < bytes ; d += NSIZE*$# ) {
		wr$$ = wq$$ = wp$$ = *(unative_t *)&dptr[z0][d+$$*NSIZE];
		for ( z = z0-1 ; z >= 0 ; z-- ) {
			wd$$ = *(unative_t *)&dptr[z][d+$$*NSIZE];
			/* P: plain xor */
			wp$$ ^= wd$$;
			/* Q: multiply the running sum by g = 2 in GF(2^8),
			   reducing by 0x1d, then xor in the data */
			w2$$ = MASK(wq$$);
			w1$$ = SHLBYTE(wq$$);
			w2$$ &= NBYTES(0x1d);
			w1$$ ^= w2$$;
			wq$$ = w1$$ ^ wd$$;
			/* R: multiply by h = 4, i.e. two multiplies by 2 */
			w4$$ = MASK(wr$$);
			w3$$ = SHLBYTE(wr$$);
			w4$$ &= NBYTES(0x1d);
			w3$$ ^= w4$$;
			w4$$ = MASK(w3$$);
			w3$$ = SHLBYTE(w3$$);
			w4$$ &= NBYTES(0x1d);
			w3$$ ^= w4$$;
			wr$$ = w3$$ ^ wd$$;
		}
		*(unative_t *)&p[d+NSIZE*$$] = wp$$;
		*(unative_t *)&q[d+NSIZE*$$] = wq$$;
		*(unative_t *)&r[d+NSIZE*$$] = wr$$;
	}
}


I wrote the wr$$ calculations using a second set of working variables.
I don't know (yet) what compiler options are used to generate the target
code, nor what the awk'ed code looks like. If the compiler can handle
the scheduling to interlace the Q and R calculations and reduce pipeline
delays, then that's great. If not, then they can be manually interlaced
if it helps the code. But as I'm writing this blind (on a windows
machine, no less), I don't know if it makes a difference.

As a general point regarding optimisations, is there a minimum level of
gcc we can expect to have here? And are there rules about compiler
flags? Later versions of gcc are getting very smart about
vectorisation, and generating multiple versions of code that are
selected automatically at run-time depending on the capabilities of the
processor. My hope is that the code can be greatly simplified by
writing a single clear version in C that runs fast and takes advantage
of the cpu, without needing to make explicit SSE, AltiVec, etc.,
versions. Are there rules about which gcc attributes or extensions we
can use?


Another question - what should we call this? Raid 7? Raid 6.3? Raid
7.3? While we are in the mood for triple-parity Raid, we can easily
extend it to quad-parity or more - the name should probably be flexible
enough to allow that.




Re: Triple-parity raid6

on 10.06.2011 11:03:26 by David Brown

On 09/06/2011 14:04, NeilBrown wrote:
> [...]
>
> And 'recov' has the smarts for doing the reverse calculation when 2 data
> blocks, or 1 data and P are missing.
>
> Even if you don't feel up to implementing everything, a start might be
> useful. You never know when someone might jump up and offer to help.
>
> NeilBrown


When looking at recov.c, I see in the "raid6_dual_recov" function there
is no code for testing data+Q failure as it is equivalent to raid5
recovery. Should this not still be implemented here so that testing can
be more complete?

Is there a general entry point for the recovery routines, which then
decides which of raid6_2data_recov, raid6_datap_recov, or
raid6_dual_recov is called? With triple-parity raid, there are many
more combinations - it would make sense for the library to have a single
function like:

void raid7_3_recov(int disks, size_t bytes, int noOfFails,
                   int *pFails, void **ptrs);

or even (to cover quad parity and more):

void raid7_n_recov(int disks, int noOfParities, size_t bytes,
                   int noOfFails, int *pFails, void **ptrs);







Re: Triple-parity raid6

on 10.06.2011 14:20:40 by Christoph Dittmann

On 06/10/2011 10:45 AM, David Brown wrote:
> Making multiple parity syndromes is easy enough mathematically:

Adam Leventhal, who wrote the double parity and triple parity code for
ZFS, mentioned on his blog [1] that going beyond triple parity poses
significant challenges if write performance should not suffer.

In particular, it looks like a relevant math paper (originally) had a
flaw in the claim that quad parity and above can be implemented just
like triple parity.

I don't know the implementation you were going to use, neither am I
knowledgeable about multi-parity in general. I only thought it might be
relevant to add to the current discussion that other people had issues
with implementing N-parity for N > 3.


Christoph

[1] http://blogs.oracle.com/ahl/entry/triple_parity_raid_z

Re: Triple-parity raid6

on 10.06.2011 15:56:02 by Bill Davidsen

NeilBrown wrote:
> [...]
>
> Then there is sse1, sse2, altivec which provide the same functionality in
> assembler which is optimised for various processors.
>
And at some point I'm sure one of the video card vendors will provide a
hack to do it in the GPU in massively parallel fashion.

> And 'recov' has the smarts for doing the reverse calculation when 2 data
> blocks, or 1 data and P are missing.
>
> Even if you don't feel up to implementing everything, a start might be
> useful. You never know when someone might jump up and offer to help.
>


--
Bill Davidsen
We are not out of the woods yet, but we know the direction and have
taken the first step. The steps are many, but finite in number, and if
we persevere we will reach our destination. -me, 2010




Re: Triple-parity raid6

on 10.06.2011 16:28:09 by David Brown

On 10/06/2011 14:20, Christoph Dittmann wrote:
> On 06/10/2011 10:45 AM, David Brown wrote:
>> Making multiple parity syndromes is easy enough mathematically:
>
> Adam Leventhal, who wrote the double parity and triple parity code for
> ZFS, mentioned on his blog [1] that going beyond triple parity poses
> significant challenges if write performance should not suffer.
> [...]

I've looked at Adam's blog article - in fact, stumbling over it when
wandering around the 'net was one of my inspirations to think about
multi-parity raid again.

But there are a few key differences between the description in the James
Plank papers referenced, and the implementation I've looked at.

One point is that he is looking at GF(2^4), which quickly becomes a
limiting factor. It's hard to get enough independent parity syndromes,
and you can easily start to think that things won't work for larger
parity counts.
The other is the choice of factors for the syndromes. Assuming I
understand the papers correctly, the first paper is suggesting this
arrangement (all arithmetic done in GF()):

P_0 = 1^0 . d_0 + 2^0 . d_1 + 3^0 . d_2 + 4^0 . d_3
P_1 = 1^1 . d_0 + 2^1 . d_1 + 3^1 . d_2 + 4^1 . d_3
P_2 = 1^2 . d_0 + 2^2 . d_1 + 3^2 . d_2 + 4^2 . d_3
P_3 = 1^3 . d_0 + 2^3 . d_1 + 3^3 . d_2 + 4^3 . d_3

The second paper changes it to:

P_0 = 1^0 . d_0 + 1^1 . d_1 + 1^2 . d_2 + 1^3 . d_3
P_1 = 2^0 . d_0 + 2^1 . d_1 + 2^2 . d_2 + 2^3 . d_3
P_2 = 3^0 . d_0 + 3^1 . d_1 + 3^2 . d_2 + 3^3 . d_3
P_3 = 4^0 . d_0 + 4^1 . d_1 + 4^2 . d_2 + 4^3 . d_3


For the first two parity blocks, this is the same as for Linux raid6:

P = d_0 + d_1 + d_2 + d_3
Q = g^0 . d_0 + g^1 . d_1 + g^2 . d_2 + g^3 . d_3
(where g = 2)

But the third (and later) lines are not good - the restoration equations
on multiple disk failure are not independent for all combinations of
failed disks. For example, if you have 20 data disks and 3 parity
disks, there are 1140 different combinations of three data-disk
failures. Of these, 92 combinations lead to non-independent equations
when you try to restore the data.


The equations I am using are:

P_0 = 1^0 . d_0 + 1^1 . d_1 + 1^2 . d_2 + 1^3 . d_3
P_1 = 2^0 . d_0 + 2^1 . d_1 + 2^2 . d_2 + 2^3 . d_3
P_2 = 4^0 . d_0 + 4^1 . d_1 + 4^2 . d_2 + 4^3 . d_3
P_3 = 8^0 . d_0 + 8^1 . d_1 + 8^2 . d_2 + 8^3 . d_3

P_4 would use powers of 16, etc.

I have checked that the restoration equations are solvable for all
combinations of 3 disk failures for triple parity, and all combinations
of 4 disk failures for quad parity. The restoration equations can be
written in matrix form and are dependent on the indexes of the failed
disks. To solve them, you simply invert the matrix and this gives you a
linear formula for re-calculating each missing data block. So to check
that the data can be restored, you need to check that the determinant of
this matrix is non-zero for all combinations of n-disk failures.
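
That determinant check is only a few lines of Python - sketched below
with my own naming and the usual 0x11d multiply, as earlier in the
thread. Elimination works just as over the reals, except that
subtraction is xor and a row swap costs no sign change in
characteristic 2:

def gf_mul(a, b):
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return result

def gf_pow(g, n):
    result = 1
    for _ in range(n):
        result = gf_mul(result, g)
    return result

def gf_det(m):
    """Determinant over GF(2^8) by Gaussian elimination."""
    m = [row[:] for row in m]
    n = len(m)
    det = 1
    for col in range(n):
        piv = next((r for r in range(col, n) if m[r][col]), None)
        if piv is None:
            return 0
        m[col], m[piv] = m[piv], m[col]
        det = gf_mul(det, m[col][col])
        inv = gf_pow(m[col][col], 254)        # multiplicative inverse
        for r in range(col + 1, n):
            if m[r][col]:
                f = gf_mul(m[r][col], inv)
                m[r] = [a ^ gf_mul(f, b) for a, b in zip(m[r], m[col])]
    return det

# For failed data disks, e.g. failed = (3, 7, 11), with generators
# 1, 2 and 4, recovery is solvable iff this is non-zero:
#   gf_det([[gf_pow(g, i) for i in failed] for g in (1, 2, 4)])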

I /believe/ these should work fine for at least up to P_7 (i.e., 8
parity disks), but I haven't yet checked more than a few larger parity
modes. Checking all 4 disk failures from 253 disks took my inefficient
python program about 45 minutes - to check for 8-disk failures would
mean finding all combinations of 8 choices for up to 249 disks. If my
maths serves me right, that's 316,528,258,780,856 combinations - and for
each one, I'd have to find the determinant of an 8x8 matrix over
GF(2^8). Fortunately, I don't think there'll be much call for
octal-parity RAID in the near future :-)


Of course, all this assumes that my maths is correct!



Re: Triple-parity raid6

on 11.06.2011 12:13:12 by Piergiorgio Sartor

[snip]
> Of course, all this assume that my maths is correct !

I would suggest checking out the Reed-Solomon thing
in the more friendly form of the Vandermonde matrix.

It will be completely clear how to generate k parity
sets with n data sets (disks), so that n+k < 258 for the
GF(256) space.

It will also be much clearer how to re-construct
the data sets in case of erasure (known data lost).

You can have a look, for reference, at:

http://lsirwww.epfl.ch/wdas2004/Presentations/manasse.ppt

If you search for something like "Reed Solomon Vandermonde"
you'll find even more information.

Hope this helps.

bye,

--

piergiorgio

Re: Triple-parity raid6

on 11.06.2011 13:51:12 by David Brown

On 11/06/11 12:13, Piergiorgio Sartor wrote:
> [snip]
>> Of course, all this assumes that my maths is correct!
>
> I would suggest checking out the Reed-Solomon thing
> in the more friendly form of the Vandermonde matrix.
> [...]
>

That presentation is using Vandermonde matrices, which are the same as
the ones used in James Plank's papers. As far as I can see, these are
limited in how well you can recover from missing disks (the presentation
here says it only works for up to triple parity). The Vandermonde
matrices have the advantage that the determinants are easily calculated
- I haven't yet figured out an analytical method of calculating the
determinants in my equations, and have just used brute force checking.
(My syndromes also have the advantage of being easy to calculate quickly.)


Still, I think the next step for me should be to write up the maths a
bit more formally, rather than just hints in mailing list posts. Then
others can have a look, and have an opinion on whether I've got it right
or not. It makes sense to be sure the algorithms will work before
spending much time implementing them!

I certainly /believe/ my maths is correct here - but it's nearly twenty
years since I did much formal algebra. I studied maths at university,
but I don't use group theory often in my daily job as an embedded
programmer.



Re: Triple-parity raid6

on 11.06.2011 15:18:01 by Piergiorgio Sartor

On Sat, Jun 11, 2011 at 01:51:12PM +0200, David Brown wrote:
> > [...]
>
> That presentation is using Vandermonde matrices, which are the same
> as the ones used in James Plank's papers. As far as I can see,
> these are limited in how well you can recover from missing disks
> (the presentation here says it only works for up to triple parity).

As far as I understood, 3 is only an example; it works
up to k lost disks, with n+k < 258 (or 259).
I mean, I do not see why it should not.

You have, of course, to know which disks have failed.
On the other hand, having k parities allows you to find
up to k/2 error positions.
This is a bit more complicated, I guess.
You can search for the Berlekamp-Massey Algorithm (and related)
in order to see how to *find* the errors.

> The Vandermonde matrices have the advantage that the determinants
> are easily calculated - I haven't yet figured out an analytical
> method of calculating the determinants in my equations, and have
> just used brute force checking. (My syndromes also have the
> advantage of being easy to calculate quickly.)

I think the point of Reed Solomon (with Vandermonde or Cauchy
matrices) is also that it generalizes the parity concept.

This means you do not have to care if it is 2 or 3 or 7 or ...

In this way you can have as many parities as you like, up to
the limitation of Reed Solomon in GF(256).

> Still, I think the next step for me should be to write up the maths
> a bit more formally, rather than just hints in mailing list posts.
> Then others can have a look, and have an opinion on whether I've got
> it right or not. It makes sense to be sure the algorithms will work
> before spending much time implementing them!

I tend to agree. At first you should set up the background
theory, then the algorithm, later the implementation and
eventually the optimization.

> I certainly /believe/ my maths is correct here - but it's nearly
> twenty years since I did much formal algebra. I studied maths at
> university, but I don't use group theory often in my daily job as an
> embedded programmer.

Well, I, for sure, will stay tuned for your results!

bye,

--

piergiorgio

Re: Triple-parity raid6

on 11.06.2011 16:53:38 by David Brown

On 11/06/11 15:18, Piergiorgio Sartor wrote:
> On Sat, Jun 11, 2011 at 01:51:12PM +0200, David Brown wrote:
>> [...]
>> That presentation is using Vandermonde matrices, which are the same
>> as the ones used in James Plank's papers. As far as I can see,
>> these are limited in how well you can recover from missing disks
>> (the presentation here says it only works for up to triple parity).
>
> As far as I understood, 3 is only an example; it works
> up to k lost disks, with n+k < 258 (or 259).
> I mean, I do not see why it should not.
>

I also don't see why 3 parity should be a limitation - I think it must
be because of the choice of syndrome calculations. But the presentation
you linked to specifically says on page 3 that it will say "Why it stops
at three erasures, and works only for GF(2^k)". I haven't investigated
anything other than GF(2^8), since that is optimal for implementing raid
(well, 2^1 is easier - but that only gives you raid5). Unfortunately,
the paper doesn't give details there. Adam Leventhal's blog (mentioned
earlier in this thread) also says that implementation of triple-parity
for ZFS was relatively easy, but not for more than three parities.

> You have, of course, to know which disks have failed.

That's normally the case for disk applications.

> On the other hand, having k parities allows you to find
> up to k/2 error positions.
> This is a bit more complicated, I guess.
> You can search for the Berlekamp-Massey Algorithm (and related)
> in order to see how to *find* the errors.
>

I've worked with ECC systems for data transmission and communications
systems, when you don't know if there are any errors or where the errors
might be. But although there is a fair overlap of the theory here,
there are big differences in the way you implement such checking and
correction, and your priorities. With raid, you know either that your
block read is correct (because of the ECC handled at the drive firmware
level), or incorrect.

To deal with unknown errors or error positions, you have to read in
everything in a stripe and run your error checking for every read - that
would be a significant run-time cost, which would normally be wasted (as
the raid set is normally consistent).

One situation where that might be useful, however, is for scrubbing or
checking when the array is known to be inconsistent (such as after a
power failure). Neil has already argued that the simple approach of
re-creating the parity blocks (rather than identifying incorrect blocks)
is better, or at least no worse, than being "smart". But the balance of
that argument might change with more parity blocks.



>> The Vandermonde matrices have the advantage that the determinants
>> are easily calculated - I haven't yet figured out an analytical
>> method of calculating the determinants in my equations, and have
>> just used brute force checking. (My syndromes also have the
>> advantage of being easy to calculate quickly.)
>
> I think the point of Reed Solomon (with Vandermonde or Cauchy
> matrices) is also that it generalizes the parity concept.
>
> This means you do not have to care if it is 2 or 3 or 7 or ...
>
> In this way you can have as many parities as you like, up to
> the limitation of Reed Solomon in GF(256).
>

I agree. However, I'm not sure there is much practical use in going
beyond perhaps 4 parity blocks - at that stage you are probably better
off dividing up your array, or (if you need more protection) using
n-parity raid6 over raid1 mirror sets.

>> Still, I think the next step for me should be to write up the maths
>> a bit more formally, rather than just hints in mailing list posts.
>> Then others can have a look, and have an opinion on whether I've got
>> it right or not. It makes sense to be sure the algorithms will work
>> before spending much time implementing them!
>
> I tend to agree. At first you should set up the background
> theory, then the algorithm, later the implementation and
> eventually the optimization.
>

Yes - we've already established that the implementation will be
possible, and that there are people willing and able to help with it.
And I believe that much of the optimisation can be handled by the
compiler - gcc has come a long way since raid6 was first implemented in
mdraid.

>> I certainly /believe/ my maths is correct here - but it's nearly
>> twenty years since I did much formal algebra. I studied maths at
>> university, but I don't use group theory often in my daily job as an
>> embedded programmer.
>
> Well, I, for sure, will stay tuned for your results!
>
> bye,
>



Re: Triple-parity raid6

am 11.06.2011 17:05:08 von Joe Landman

On 06/11/2011 10:53 AM, David Brown wrote:

> Yes - we've already established that the implementation will be
> possible, and that there are people willing and able to help with it.
> And I believe that much of the optimisation can be handled by the
> compiler - gcc has come a long way since raid6 was first implemented in
> mdraid.

Hmmm ... don't be overly reliant upon optimization from a compiler for
your performance. It makes sense to have a well-designed (and proven
correct) algorithm first, one that is hand-optimizable. There are many
designs which are anathema to performance, and you have to be careful
to avoid using them in your coding.

We are interested in working on this capability (and more generic
capability) as well.

Is anyone in particular starting to design/code this? Please let me know.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615

Re: Triple-parity raid6

am 11.06.2011 18:31:19 von David Brown

On 11/06/11 17:05, Joe Landman wrote:
> On 06/11/2011 10:53 AM, David Brown wrote:
>
>> Yes - we've already established that the implementation will be
>> possible, and that there are people willing and able to help with it.
>> And I believe that much of the optimisation can be handled by the
>> compiler - gcc has come a long way since raid6 was first implemented in
>> mdraid.
>
> Hmmm ... don't be overly reliant upon optimization from a compiler for
> your performance. It makes sense to have a well-designed (and proven
> correct) algorithm first, one that is hand-optimizable. There are many
> designs which are anathema to performance, and you have to be careful
> to avoid using them in your coding.
>

Absolutely - it's the designer's job to pick a good algorithm, and the
programmer's job to make a good implementation of it. But it's the
compiler's job to turn that into tight object code. And for code like
this, the algorithm designer has to work closely with the programmer (or
be the same person, of course) to pick algorithms that can be
implemented well. Similarly, the programmer has to have a good
understanding of the compiler and be able to understand the generated
assembly, in order to get the best from the tools.

What has changed over the years is that there is no longer such a need
for manual assembly code to get optimal speed out of the cpu. While
writing such assembly is fun, it is time-consuming to write and hard to
maintain, especially for code that must run on so many different platforms.

> We are interested in working on this capability (and more generic
> capability) as well.
>
> Is anyone in particular starting to design/code this? Please let me know.
>

Well, I am currently trying to write up some of the maths - I started
the thread because I had been playing around with the maths, and thought
it should work. I made a brief stab at writing a
"raid7_int$#_gen_syndrome()" function, but I haven't done any testing
with it (or even tried to compile it) - first I want to be sure of the
algorithms.
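
For what it's worth, the rough shape of what I tried looks like this - a
completely untested sketch, one 64-bit lane at a time, following the
style of the kernel's int$# raid6 code (here dptr[0..disks-4] are the
data blocks and the last three pointers are P, Q, R):

#include <stdint.h>
#include <stddef.h>

/* Multiply every byte of a 64-bit word by 0x02 in GF(2^8), using the
 * same SHLBYTE/MASK bit-trick as the kernel's integer raid6 code. */
static uint64_t gf_x2(uint64_t v)
{
	uint64_t t = v & 0x8080808080808080ULL; /* high bit of each byte  */
	uint64_t m = (t << 1) - (t >> 7);       /* 0xff per overflow byte */
	return ((v << 1) & 0xfefefefefefefefeULL)
	       ^ (m & 0x1d1d1d1d1d1d1d1dULL);
}

void raid7_int1_gen_syndrome(int disks, size_t bytes, void **ptrs)
{
	uint64_t **dptr = (uint64_t **)ptrs;
	int z0 = disks - 4;                     /* highest data disk      */
	uint64_t *p = dptr[disks-3], *q = dptr[disks-2], *r = dptr[disks-1];

	for (size_t d = 0; d < bytes / 8; d++) {
		uint64_t wp, wq, wr;
		wp = wq = wr = dptr[z0][d];
		for (int z = z0 - 1; z >= 0; z--) {
			uint64_t wd = dptr[z][d];
			wp ^= wd;                   /* P: plain XOR       */
			wq = gf_x2(wq) ^ wd;        /* Q: Horner with g=2 */
			wr = gf_x2(gf_x2(wr)) ^ wd; /* R: Horner with h=4 */
		}
		p[d] = wp;
		q[d] = wq;
		r[d] = wr;
	}
}

The only real difference from the raid6 version is the extra running
word wr, multiplied by h = g² at each step (i.e. gf_x2 applied twice),
so in this naive form the R parity should cost roughly twice the Q
parity per byte.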



Re: Triple-parity raid6

am 11.06.2011 18:57:47 von Joe Landman

On 06/11/2011 12:31 PM, David Brown wrote:

> What has changed over the years is that there is no longer such a need
> for manual assembly code to get optimal speed out of the cpu. While

Hmmm ... I've done studies on this using an incredibly simple function
(the Riemann zeta function, cf. http://scalability.org/?p=470 ). The
short version is that hand-optimized SSE2 is ~4x faster (for this case)
than the best compiler optimization of high-level code. Hand-optimized
assembler is even better.

> writing such assembly is fun, it is time-consuming to write and hard to
> maintain, especially for code that must run on so many different platforms.

Yes, it is generally hard to write and maintain. But with it you can
get the rest of the language semantics out of the way. If you look at
the tests that Linux does when it starts up, you can see a fairly wide
distribution in the performance.

raid5: using function: generic_sse (13356.000 MB/sec)
raid6: int64x1 3507 MB/s
raid6: int64x2 3886 MB/s
raid6: int64x4 3257 MB/s
raid6: int64x8 3054 MB/s
raid6: sse2x1 8347 MB/s
raid6: sse2x2 9695 MB/s
raid6: sse2x4 10972 MB/s

Some of these are hand coded assembly. See
${KERNEL_SOURCE}/drivers/md/raid6sse2.c and look at the
raid6_sse24_gen_syndrome code.

Really, getting the best performance out of the system requires a
fairly deep understanding of how the processor/memory system operates.
These functions do use the SSE registers, but we can have only so many
SSE operations in flight at once. These processors can generally have
quite a few simultaneous operations in flight, so knowledge about that,
the mix of operations, and how they interact with the instruction
scheduler in the hardware is fairly essential to getting good
performance.

>
>> We are interested in working on this capability (and more generic
>> capability) as well.
>>
>> Is anyone in particular starting to design/code this? Please let me know.
>>
>
> Well, I am currently trying to write up some of the maths - I started
> the thread because I had been playing around with the maths, and thought
> it should work. I made a brief stab at writing a
> "raid7_int$#_gen_syndrome()" function, but I haven't done any testing
> with it (or even tried to compile it) - first I want to be sure of the
> algorithms.

I've been coding various bits as "pseudocode" using Octave. Makes
checking with the built-in Galois functions pretty easy.

I haven't looked at the math behind the triple parity syndrome calc
yet, though I'd imagine someone has, and can write it down. If someone
hasn't done that yet, it's a good first step. Then we can code the
simple version from there with test drivers/cases, and then start
optimizing the implementation.
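
As a rough illustration of the sort of test driver I have in mind
(untested C; gf_mul here is the obvious shift-and-xor reference
multiply over the raid6 polynomial 0x11d, deliberately slow but easy
to trust):

#include <stdint.h>

/* Reference GF(256) multiply - slow but obviously correct. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
	uint8_t prod = 0;
	while (b) {
		if (b & 1)
			prod ^= a;
		a = (uint8_t)(a << 1) ^ ((a & 0x80) ? 0x1d : 0);
		b >>= 1;
	}
	return prod;
}

/* Recompute P/Q/R for one byte column directly from the definitions
 * (g = 0x02, h = 0x04) and compare with the fast implementation. */
static int check_column(const uint8_t *data, int ndata,
			uint8_t p, uint8_t q, uint8_t r)
{
	uint8_t rp = 0, rq = 0, rr = 0, gz = 1, hz = 1;
	for (int z = 0; z < ndata; z++) {
		rp ^= data[z];
		rq ^= gf_mul(gz, data[z]);
		rr ^= gf_mul(hz, data[z]);
		gz = gf_mul(gz, 0x02);      /* advance g^z */
		hz = gf_mul(hz, 0x04);      /* advance h^z */
	}
	return rp == p && rq == q && rr == r;
}

Feed it random data blocks, run the fast gen_syndrome over them, and
check every byte column; then knock out up to three blocks and verify
the recovery path the same way.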


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615

Re: Triple-parity raid6

am 11.06.2011 19:14:38 von Joe Landman

A quick note of caution (and someone from Netapp, feel free to speak up).

Netapp has a patent on triple parity raid (cf.
http://www.freepatentsonline.com/7640484.html). A quick look over this
suggests that the major innovation is the layout and computation, which
they simplified in a particular manner. That is, I don't think their
patent covers triple parity RAID in general, but it does cover their
implementation, and the diagonal parity with anti-diagonal parity
(effectively counter-propagating or orthogonalized parity).

I am not sure what this means in a coding sense, other than that we
should not use their techniques without a license to do so. If Netapp
wants to grant such a license, that would be good, but I suspect it
wouldn't be quite as simple as that.

Just a note so that we don't encounter problems. I think it's very
possible to avoid their IP, as it would be somewhat hard to claim
ownership of the Galois Field math behind RAID calculations. They can
(and do) claim a particular implementation and algorithm.

[also not trying to open the patents-on-code wars here, just pointing
out the current situation]


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615

Re: Triple-parity raid6

am 11.06.2011 20:05:32 von David Brown

On 11/06/11 19:14, Joe Landman wrote:
> A quick note of caution (and someone from Netapp, feel free to speak up).
>
> Netapp has a patent on triple parity raid (cf.
> http://www.freepatentsonline.com/7640484.html). A quick look over this
> suggests that the major innovation is the layout and computation, which
> they simplified in a particular manner. That is, I don't think their
> patent covers triple parity RAID in general, but it does cover their
> implementation, and the diagonal parity with anti-diagonal parity
> (effectively counter-propagating or orthogonalized parity).
>
> I am not sure what this means in a coding sense, other than that we
> should not use their techniques without a license to do so. If Netapp
> wants to grant such a license, that would be good, but I suspect it
> wouldn't be quite as simple as that.
>
> Just a note so that we don't encounter problems. I think it's very
> possible to avoid their IP, as it would be somewhat hard to claim
> ownership of the Galois Field math behind RAID calculations. They can
> (and do) claim a particular implementation and algorithm.
>
> [also not trying to open the patents-on-code wars here, just pointing
> out the current situation]
>
>

I've read a little about diagonal parities - I can see some advantage
in their simplicity, but I think they are a poor choice for raid. Raid5
and above already suffer from performance issues because you often have
to read a whole stripe at a time just to change a few blocks - with
diagonal parity, you'd have to read a whole 2-D set of stripes.



Re: Triple-parity raid6

am 12.06.2011 11:05:40 von David Brown

On 11/06/11 18:57, Joe Landman wrote:
> On 06/11/2011 12:31 PM, David Brown wrote:
>
>> What has changed over the years is that there is no longer such a need
>> for manual assembly code to get optimal speed out of the cpu. While
>
> Hmmm ... I've done studies on this using an incredibly simple function
> (the Riemann zeta function, cf. http://scalability.org/?p=470 ). The
> short version is that hand-optimized SSE2 is ~4x faster (for this case)
> than the best compiler optimization of high-level code. Hand-optimized
> assembler is even better.
>
>> writing such assembly is fun, it is time-consuming to write and hard to
>> maintain, especially for code that must run on so many different
>> platforms.
>
> Yes, it is generally hard to write and maintain. But with it you can
> get the rest of the language semantics out of the way. If you look at
> the tests that Linux does when it starts up, you can see a fairly wide
> distribution in the performance.
>
> raid5: using function: generic_sse (13356.000 MB/sec)
> raid6: int64x1 3507 MB/s
> raid6: int64x2 3886 MB/s
> raid6: int64x4 3257 MB/s
> raid6: int64x8 3054 MB/s
> raid6: sse2x1 8347 MB/s
> raid6: sse2x2 9695 MB/s
> raid6: sse2x4 10972 MB/s
>
> Some of these are hand coded assembly. See
> ${KERNEL_SOURCE}/drivers/md/raid6sse2.c and look at the
> raid6_sse24_gen_syndrome code.
>
> Really, getting the best performance out of the system requires a
> fairly deep understanding of how the processor/memory system operates.
> These functions do use the SSE registers, but we can have only so many
> SSE operations in flight at once. These processors can generally have
> quite a few simultaneous operations in flight, so knowledge about that,
> the mix of operations, and how they interact with the instruction
> scheduler in the hardware is fairly essential to getting good
> performance.
>

I am not suggesting that hand-coding assembly won't make the
calculations faster - just that better compiler optimisations (which
will automatically make use of SSE instructions) will bring the generic
code closer to the theoretical maximum.

Out of curiosity, have you re-tried your zeta function code using a more
modern version of gcc? A lot has happened with gcc since 4.1 - in
particular, the "graphite" code in gcc 4.4 will make a big difference to
code that loops through a lot of data (it re-arranges the loops to
unroll inner blocks, and to make loop strides match cache sizes).
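
(If I remember the option names correctly, those transformations sit
behind flags like -floop-interchange, -floop-strip-mine and -floop-block
on top of the usual -O3, so a quick experiment would be something like:

  gcc -O3 -floop-interchange -floop-strip-mine -floop-block zeta.c

- where zeta.c just stands in for whatever your benchmark source is.)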


>>
>>> We are interested in working on this capability (and more generic
>>> capability) as well.
>>>
>>> Is anyone in particular starting to design/code this? Please let me
>>> know.
>>>
>>
>> Well, I am currently trying to write up some of the maths - I started
>> the thread because I had been playing around with the maths, and thought
>> it should work. I made a brief stab at writing a
>> "raid7_int$#_gen_syndrome()" function, but I haven't done any testing
>> with it (or even tried to compile it) - first I want to be sure of the
>> algorithms.
>
> I've been coding various bits as "pseudocode" using Octave. Makes
> checking with the built-in Galois functions pretty easy.
>
> I haven't looked at the math behind the triple parity syndrome calc
> yet, though I'd imagine someone has, and can write it down. If someone
> hasn't done that yet, it's a good first step. Then we can code the
> simple version from there with test drivers/cases, and then start
> optimizing the implementation.
>
>

