RAID6 r-m-w, op-journaled fs, SSDs

on 30.04.2011 17:27:48 by pg_xf2

While I agree with BAARF.com arguments fully, I sometimes have
to deal with legacy systems with wide RAID6 sets (for example 16
drives, quite revolting) which have op-journaled filesystems on
them like XFS or JFS (sometimes block-journaled ext[34], but I
am not that interested in them for this).

Sometimes (but fortunately not that recently) I have had to deal
with small-file filesystems set up on wide-stripe RAID6 sets by
morons who don't understand the difference between a database
and a filesystem (and I have strong doubts that RAID6 is
remotely appropriate to databases).

So I'd like to figure out how much effort I should invest in
undoing cases of the above, that is, how bad they are likely to
be and how badly they degrade over time (usually very badly).

First a couple of questions purely about RAID, but indirectly
relevant to op-journaled filesystems:

* Can Linux MD do "abbreviated" read-modify-write RAID6
updates like for RAID5? That is where not the whole stripe
is read in, modified and written, but just the block to be
updated and the parity blocks.

* When reading or writing part of RAID[456] stripe for example
smaller than a sector, what is the minimum unit of transfer
with Linux MD? The full stripe, the chunk containing the
sector, or just the sector containing the bytes to be
written or updated (and potentially the parity sectors)? I
would expect reads to always read just the sector, but not
so sure about writing.

* What about popular HW RAID host adapters (e.g. LSI, Adaptec,
Areca, 3ware), where is the documentation if any on how they
behave in these cases?

Regardless, op-journaled file system designs like JFS and XFS
write small records (way below a stripe set size, and usually
way below a chunk size) to the journal when they queue
operations, even if sometimes, depending on design and options,
they may "batch" the journal updates (potentially breaking safety
semantics). Also they do small writes when they dequeue the
operations from the journal to the actual metadata records
involved.

How bad can this be when the journal is, say, internal for a
filesystem that is held on a wide-stripe RAID6 set? I suspect very
very bad, with apocalyptic read-modify-write storms, eating IOPS.

I suspect that this happens a lot with SSDs too, where the role
of stripe set size is played by the erase block size (often in
the hundreds of KBytes, and even more expensive).
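
To put rough numbers on that suspicion, a back-of-the-envelope sketch
(the 14+2 geometry, the single 4 KiB journal write and the pure
per-device I/O counting are illustrative assumptions, not measurements
of any real array):

# Rough I/O-amplification sketch for one small (sub-chunk) write on RAID6.
# Only member-device transfers are counted; caches, coalescing and
# controller firmware tricks are ignored.

def full_stripe_rmw(n_data):
    """Read every data chunk, recompute P and Q, write back what changed."""
    reads = n_data          # all data chunks are needed to recompute parity
    writes = 1 + 2          # the modified data chunk plus P and Q
    return reads, writes

def abbreviated_rmw():
    """RAID5-style shortcut: subtract old data from parity, add new data."""
    reads = 1 + 2           # old data block, old P, old Q
    writes = 1 + 2          # new data block, new P, new Q
    return reads, writes

if __name__ == "__main__":
    n_data = 14             # e.g. a 16-drive RAID6 set (14 data + 2 parity)
    for name, (r, w) in [("full-stripe r-m-w", full_stripe_rmw(n_data)),
                         ("abbreviated r-m-w", abbreviated_rmw())]:
        print(f"{name}: {r} reads + {w} writes = {r + w} device I/Os "
              "for one 4 KiB journal write")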

Where are studies or even just impressions or anecdotes on how
bad this is?

Are there instrumentation tools in JFS or XFS that may allow me
to watch/inspect what is happening with the journal? For Linux
MD to see what are the rates of stripe r-m-w cases?


Re: RAID6 r-m-w, op-journaled fs, SSDs

on 30.04.2011 18:02:13 by Emmanuel Florac

On Sat, 30 Apr 2011 16:27:48 +0100, you wrote:

> While I agree with BAARF.com arguments fully, I sometimes have
> to deal with legacy systems with wide RAID6 sets (for example 16
> drives, quite revolting)

Revolting for what? I manage hundreds of such systems, but 99% of them
are used for video storage (typical file sizes range from several to
hundreds of GB).

> Sometimes (but fortunately not that recently) I have had to deal
> with small-file filesystems set up on wide-stripe RAID6 sets

What do you call "wide stripe" exactly? Do you mean a 256K stripe, a
4MB stripe?

> by
> morons who don't understand the difference between a database
> and a filesystem (and I have strong doubts that RAID6 is
> remotely appropriate to databases).

RAID-6 isn't appropriate for databases, but works reasonably well if the
workload is almost only reading. And creating hundreds of millions of
files in a filesystem works reasonably well, too.

> So I'd like to figure out how much effort I should invest in
> undoing cases of the above, that is, how bad they are likely to
> be and how badly they degrade over time (usually very badly).

Well, actually my bet is that it's impossible to say without you
providing much more detail on the hardware, the file IO patterns...

>
> * When reading or writing part of RAID[456] stripe for example
> smaller than a sector, what is the minimum unit of transfer
> with Linux MD? The full stripe, the chunk containing the
> sector, or just the sector containing the bytes to be
> written or updated (and potentially the parity sectors)? I
> would expect reads to always read just the sector, but not
> so sure about writing.
>
> * What about popular HW RAID host adapters (e.g. LSI, Adaptec,
> Areca, 3ware), where is the documentation if any on how they
> behave in these cases?

I may be wrong, but in my tests, both Linux RAID and 3Ware, LSI and
Adaptec controllers (I didn't really test Areca on that point) would
read the full stripe most of the time. At least, they'll read the full
stripe in a single-threaded environment. However, when using many
concurrent threads the behaviour changes and they seem to work at chunk
level.
> Regardless, op-journaled file system designs like JFS and XFS
> write small records (way below a stripe set size, and usually
> way below a chunk size) to the journal when they queue
> operations, even if sometimes, depending on design and options,
> they may "batch" the journal updates (potentially breaking safety
> semantics). Also they do small writes when they dequeue the
> operations from the journal to the actual metadata records
> involved.
>
> How bad can this be when the journal is, say, internal for a
> filesystem that is held on a wide-stripe RAID6 set?

Not that bad because typically the journal is small enough to fit
entirely in the controller cache.

> I suspect very
> very bad, with apocalyptic read-modify-write storms, eating IOPS.

Not if you're using write-back cache.

--
------------------------------------------------------------------------
Emmanuel Florac      |  Direction technique
                     |  Intellique
                     |
                     |  +33 1 78 94 84 02
------------------------------------------------------------------------

Re: RAID6 r-m-w, op-journaled fs, SSDs

on 01.05.2011 00:27:17 by NeilBrown

On Sat, 30 Apr 2011 16:27:48 +0100 pg_xf2@xf2.for.sabi.co.UK (Peter Grandi)
wrote:

> While I agree with BAARF.com arguments fully, I sometimes have
> to deal with legacy systems with wide RAID6 sets (for example 16
> drives, quite revolting) which have op-journaled filesystems on
> them like XFS or JFS (sometimes block-journaled ext[34], but I
> am not that interested in them for this).
>
> Sometimes (but fortunately not that recently) I have had to deal
> with small-file filesystems set up on wide-stripe RAID6 sets by
> morons who don't understand the difference between a database
> and a filesystem (and I have strong doubts that RAID6 is
> remotely appropriate to databases).
>
> So I'd like to figure out how much effort I should invest in
> undoing cases of the above, that is, how bad they are likely to
> be and how badly they degrade over time (usually very badly).
>
> First a couple of questions purely about RAID, but indirectly
> relevant to op-journaled filesystems:
>
> * Can Linux MD do "abbreviated" read-modify-write RAID6
> updates like for RAID5? That is where not the whole stripe
> is read in, modified and written, but just the block to be
> updated and the parity blocks.

No. (patches welcome).

>
> * When reading or writing part of RAID[456] stripe for example
> smaller than a sector, what is the minimum unit of transfer
> with Linux MD? The full stripe, the chunk containing the
> sector, or just the sector containing the bytes to be
> written or updated (and potentially the parity sectors)? I
> would expect reads to always read just the sector, but not
> so sure about writing.

1 "PAGE" - normally 4K.


>
> * What about popular HW RAID host adapters (e.g. LSI, Adaptec,
> Areca, 3ware), where is the documentation if any on how they
> behave in these cases?
>
> Regardless, op-journaled file system designs like JFS and XFS
> write small records (way below a stripe set size, and usually
> way below a chunk size) to the journal when they queue
> operations, even if sometimes, depending on design and options,
> they may "batch" the journal updates (potentially breaking safety
> semantics). Also they do small writes when they dequeue the
> operations from the journal to the actual metadata records
> involved.

The ideal config for a journalled filesystem is to put the journal on a
separate, smaller, lower-latency device, e.g. a small RAID1 pair.

In a previous work place I had good results with:
RAID1 pair of small disks with root, swap, journal
Large RAID5/6 array with bulk of filesystem.

I also did data journalling as it helps a lot with NFS.



>
> How bad can this be when the journal is, say, internal for a
> filesystem that is held on a wide-stripe RAID6 set? I suspect very
> very bad, with apocalyptic read-modify-write storms, eating IOPS.
>
> I suspect that this happens a lot with SSDs too, where the role
> of stripe set size is played by the erase block size (often in
> the hundreds of KBytes, and even more expensive).
>
> Where are studies or even just impressions or anecdotes on how
> bad this is?
>
> Are there instrumentation tools in JFS or XFS that may allow me
> to watch/inspect what is happening with the journal? For Linux
> MD to see what are the rates of stripe r-m-w cases?

Not that I am aware of.


NeilBrown


Re: RAID6 r-m-w, op-journaled fs, SSDs

on 01.05.2011 11:36:59 by Dave Chinner

On Sat, Apr 30, 2011 at 04:27:48PM +0100, Peter Grandi wrote:
> Regardless, op-journaled file system designs like JFS and XFS
> write small records (way below a stripe set size, and usually
> way below a chunk size) to the journal when they queue
> operations,

XFS will write log-stripe-unit sized records to disk. If the log
buffers are not full, it pads them. Supported log-sunit sizes are up
to 256k.

> even if sometimes, depending on design and options,
> they may "batch" the journal updates (potentially breaking safety
> semantics). Also they do small writes when they dequeue the
> operations from the journal to the actual metadata records
> involved.
>
> How bad can this be when the journal is, say, internal for a
> filesystem that is held on a wide-stripe RAID6 set? I suspect very
> very bad, with apocalyptic read-modify-write storms, eating IOPS.

Not bad at all, because the journal writes are sequential, and XFS
can have multiple log IOs in progress at once (up to 8 x 256k =
2MB). So in general while metadata operations are in progress, XFS
will fill full stripes with log IO and you won't get problems with
RMW.
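
If you want to sanity-check that arithmetic against a given array,
something like this will do (the 14-data-disk, 64k-chunk geometry is
just an assumed example; the 256k log sunit and 8 in-flight buffers are
the XFS figures above):

KIB = 1024

# XFS figures from above
log_sunit   = 256 * KIB        # maximum supported log stripe unit
log_buffers = 8                # log IOs XFS can keep in flight at once
in_flight   = log_sunit * log_buffers

# Assumed array geometry - adjust to your own
chunk       = 64 * KIB         # md/RAID chunk size
data_disks  = 14               # e.g. a 16-drive RAID6 (14 data + 2 parity)
stripe      = chunk * data_disks

print(f"in-flight log IO       : {in_flight // KIB} KiB")
print(f"full stripe width      : {stripe // KIB} KiB")
print(f"stripes covered at once: {in_flight / stripe:.2f}")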

> Where are studies or even just impressions or anecdotes on how
> bad this is?

Just buy decent RAID hardware with a BBWC and journal IO does not
hurt at all.

> Are there instrumentation tools in JFS or XFS that may allow me
> to watch/inspect what is happening with the journal? For Linux
> MD to see what are the rates of stripe r-m-w cases?

XFS has plenty of event tracing, including all the transaction
reservation and commit accounting in it. And if you know what you
are looking for, you can see all the log IO and transaction
completion processing in the event traces, too.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

Re: RAID6 r-m-w, op-journaled fs, SSDs

on 01.05.2011 17:24:09 by David Brown

On 01/05/11 00:27, NeilBrown wrote:
> On Sat, 30 Apr 2011 16:27:48 +0100 pg_xf2@xf2.for.sabi.co.UK (Peter Grandi)
> wrote:
>
>> While I agree with BAARF.com arguments fully, I sometimes have
>> to deal with legacy systems with wide RAID6 sets (for example 16
>> drives, quite revolting) which have op-journaled filesystems on
>> them like XFS or JFS (sometimes block-journaled ext[34], but I
>> am not that interested in them for this).
>>
>> Sometimes (but fortunately not that recently) I have had to deal
>> with small-file filesystems set up on wide-stripe RAID6 sets by
>> morons who don't understand the difference between a database
>> and a filesystem (and I have strong doubts that RAID6 is
>> remotely appropriate to databases).
>>
>> So I'd like to figure out how much effort I should invest in
>> undoing cases of the above, that is, how bad they are likely to
>> be and how badly they degrade over time (usually very badly).
>>
>> First a couple of questions purely about RAID, but indirectly
>> relevant to op-journaled filesystems:
>>
>> * Can Linux MD do "abbreviated" read-modify-write RAID6
>> updates like for RAID5? That is where not the whole stripe
>> is read in, modified and written, but just the block to be
>> updated and the parity blocks.
>
> No. (patches welcome).

As far as I understand the raid6 mathematics, it shouldn't be too hard
to do such abbreviated updates, but it could quickly lead to
complex code if you are trying to update more than a couple of blocks at
a time.

>
>>
>> * When reading or writing part of RAID[456] stripe for example
>> smaller than a sector, what is the minimum unit of transfer
>> with Linux MD? The full stripe, the chunk containing the
>> sector, or just the sector containing the bytes to be
>> written or updated (and potentially the parity sectors)? I
>> would expect reads to always read just the sector, but not
>> so sure about writing.
>
> 1 "PAGE" - normally 4K.
>
>
>>
>> * What about popular HW RAID host adapters (e.g. LSI, Adaptec,
>> Areca, 3ware), where is the documentation if any on how they
>> behave in these cases?
>>
>> Regardless, op-journaled file system designs like JFS and XFS
>> write small records (way below a stripe set size, and usually
>> way below a chunk size) to the journal when they queue
>> operations, even if sometimes, depending on design and options,
>> they may "batch" the journal updates (potentially breaking safety
>> semantics). Also they do small writes when they dequeue the
>> operations from the journal to the actual metadata records
>> involved.
>
> The ideal config for a journalled filesystem is to put the journal on a
> separate, smaller, lower-latency device, e.g. a small RAID1 pair.
>
> In a previous work place I had good results with:
> RAID1 pair of small disks with root, swap, journal
> Large RAID5/6 array with bulk of filesystem.
>
> I also did data journalling as it helps a lot with NFS.
>

I suppose it also makes sense to put the write-intent bitmap for md raid
on such a raid1 pair (typically SSDs).

What would be very nice is a RAM-based SSD with battery backup, rather
than a flash disk. These sorts of devices exist, but they are usually
vastly expensive because RAM is expensive at disk-like sizes. I'd
like to see a physically small and cheap RAM-based SSD with 1 or 2 GB -
that would be ideal for file system journals, write intent bitmaps, etc.

>
>
>>
>> How bad can this be when the journal is, say, internal for a
>> filesystem that is held on a wide-stripe RAID6 set? I suspect very
>> very bad, with apocalyptic read-modify-write storms, eating IOPS.
>>
>> I suspect that this happens a lot with SSDs too, where the role
>> of stripe set size is played by the erase block size (often in
>> the hundreds of KBytes, and even more expensive).
>>
>> Where are studies or even just impressions or anecdotes on how
>> bad this is?
>>
>> Are there instrumentation tools in JFS or XFS that may allow me
>> to watch/inspect what is happening with the journal? For Linux
>> MD to see what are the rates of stripe r-m-w cases?
>
> Not that I am aware of.
>
>
> NeilBrown
>

Re: RAID6 r-m-w, op-journaled fs, SSDs

on 01.05.2011 17:31:34 by pg_xf2

[ ... ]

>> * Can Linux MD do "abbreviated" read-modify-write RAID6
>> updates like for RAID5? [ ... ]

> No. (patches welcome).

Ahhhm, but let me dig a bit deeper, even if it may be implied in
the answer: would it be *possible*?

That is, is the double parity scheme used in MD such that it is
possible to "subtract" the old content of a page and "add" the
new content of that page to both parity pages?

[ ... ]

> The ideal config for a journalled filesystem is for put the
> journal on a separate smaller lower-latency device. e.g. a
> small RAID1 pair.

> In a previous work place I had good results with:
> RAID1 pair of small disks with root, swap, journal
> Large RAID5/6 array with bulk of filesystem.

Sounds reasonable, except that I am allergic to RAID5 (except in
two cases) and RAID6 (in general) :-), but it would work equally
well, I guess, with RAID10 and its delightful MD implementation.

[ ... ]

Thanks for the information!


Re: RAID6 r-m-w, op-journaled fs, SSDs

on 01.05.2011 18:48:18 by Christoph Hellwig

On Sun, May 01, 2011 at 05:24:09PM +0200, David Brown wrote:
> I suppose it also makes sense to put the write-intent bitmap for md
> raid on such a raid1 pair (typically SSDs).

Note that right now you can't actually put the bitmap on a device;
it requires a file on a filesystem, which seems rather confusing.

Also make sure that you don't already max out the solid state device's
IOPS rate with the log writes, in which case putting the bitmap on
it as well will slow it down.


Re: RAID6 r-m-w, op-journaled fs, SSDs

on 01.05.2011 20:32:22 by David Brown

On 01/05/11 17:31, Peter Grandi wrote:
> [ ... ]
>
>>> * Can Linux MD do "abbreviated" read-modify-write RAID6
>>> updates like for RAID5? [ ... ]
>
>> No. (patches welcome).
>
> Ahhhm, but let me dig a bit deeper, even if it may be implied in
> the answer: would it be *possible*?
>
> That is, is the double parity scheme used in MD such that it is
> possible to "subtract" the old content of a page and "add" the
> new content of that page to both parity pages?
>

If I've understood the maths correctly, then yes it would be possible.
But it would involve more calculations, and it is difficult to see where
the best balance lies between cpu demands and IO demands. In general,
calculating the Q parity block for raid6 is processor-intensive -
there's a fair amount of optimisation done in the normal calculations to
keep it reasonable.

Basically, the first parity P is a simple calculation:

P = D_0 + D_1 + ... + D_(n-1)

But Q is more difficult:

Q = D_0 + g.D_1 + g².D_2 + ... + g^(n-1).D_(n-1)

where "plus" is xor, "times" is a weird multiplication defined over the
GF(2^8) field, and g is a generator for that field.

If you want to replace D_i, then you can calculate:

P(new) = P(old) + D_i(old) + D_i(new)

Q(new) = Q(old) + g^i.(D_i(old) + D_i(new))

This means multiplying by g^i for whichever block i is being replaced.

The generator and multiply operation are picked to make it relatively
fast and easy to multiply by g, especially if you've got a processor
that has vector operations (as most powerful cpus do). This means that
the original Q calculation is fairly efficient. But to do general
multiplications by g^i is more effort, and will typically involve
cache-killing lookup tables or multiple steps.
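
To make that concrete, here is a tiny, unoptimised Python sketch of the
abbreviated update. It uses the 0x11d field polynomial and generator
g = 2 that the Linux raid6 code uses; the six data bytes and the choice
of disk i are made-up example values:

def gmul(a, b):
    """Multiply two bytes in GF(2^8), reducing by the 0x11d polynomial."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D          # x^8 = x^4 + x^3 + x^2 + 1
    return p

def gpow(g, e):
    """g raised to the power e in GF(2^8)."""
    r = 1
    for _ in range(e):
        r = gmul(r, g)
    return r

def pq(data):
    """Full P/Q computation over one byte per data disk."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gmul(gpow(2, i), d)
    return p, q

# A hypothetical 6-data-disk stripe, one byte per disk.
old = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66]
p_old, q_old = pq(old)

# Overwrite disk i = 3 and update P and Q without touching the other disks.
i, d_new = 3, 0xA5
p_new = p_old ^ old[i] ^ d_new
q_new = q_old ^ gmul(gpow(2, i), old[i] ^ d_new)

# The shortcut must agree with recomputing parity from scratch.
new = old.copy()
new[i] = d_new
assert (p_new, q_new) == pq(new)
print("abbreviated update matches full recompute")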


It is probably reasonable to say that when md raid first implemented
raid6, it made little sense to do these abbreviated parity calculations.
But as processors have got faster (and wider, with more cores) while
disk throughput has made slower progress, it's maybe a different
balance. So it's probably both possible and practical to do these
calculations. All it needs is someone to spend the time writing the
code - and lots of people willing to test it.




Re: RAID6 r-m-w, op-journaled fs, SSDs

on 02.05.2011 00:01:31 by NeilBrown

On Sun, 01 May 2011 17:24:09 +0200 David Brown wrote:

> On 01/05/11 00:27, NeilBrown wrote:
> > On Sat, 30 Apr 2011 16:27:48 +0100 pg_xf2@xf2.for.sabi.co.UK (Peter Grandi)
> > wrote:
> >

> >> * Can Linux MD do "abbreviated" read-modify-write RAID6
> >> updates like for RAID5? That is where not the whole stripe
> >> is read in, modified and written, but just the block to be
> >> updated and the parity blocks.
> >
> > No. (patches welcome).
>
> As far as I understand the raid6 mathematics, it shouldn't be too hard
> to do such abbreviated updates, but it could quickly lead to
> complex code if you are trying to update more than a couple of blocks at
> a time.

The RAID5 code already handles some of this complexity.

It would be quite easy to modify the code so that we have
- list of 'old' data blocks
- P and Q blocks
- list of 'new' data blocks.

We then just need the (optimised) maths to deduce the new P and Q given all
of that.
Of course you would only bother with a 7-or-more disk array.
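
A quick I/O-counting sketch of where that cut-off comes from (it only
counts member-device transfers for a k-block update in one stripe of an
n-disk RAID6, ignoring CPU cost and caching entirely):

def rmw_ios(k):
    """Abbreviated r-m-w: read k old data blocks + P + Q, write them back."""
    return (k + 2) + (k + 2)

def reconstruct_ios(n, k):
    """Reconstruct-write: read the n-2-k untouched data blocks,
    write the k new data blocks plus P and Q."""
    return (n - 2 - k) + (k + 2)

if __name__ == "__main__":
    k = 1                          # one small (journal-sized) update
    for n in range(4, 17):         # total disks in the RAID6 set
        rmw, rcw = rmw_ios(k), reconstruct_ios(n, k)
        winner = "r-m-w" if rmw < rcw else "reconstruct"
        print(f"n={n:2d}: r-m-w={rmw:2d}  reconstruct={rcw:2d}  -> {winner}")

For a single-block update the shortcut only starts winning at n = 7,
hence the remark above.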

NeilBrown
