raid5/raid6 write performance question

On 17.02.2011 19:52:07 by lopresti

I have a fair amount of experience with hardware RAID devices, but now
I am investigating Linux software RAID and I have a question. Well, a
few questions.

The classic problem for RAID5/RAID6 write performance, especially when
striping across many drives, is that a single small write requires
reading in the entire stripe from all disks to calculate the new
syndrome block(s).

Hardware RAID controllers typically mitigate this problem by using a
sizable (512MiB - 4GiB) non-volatile write-back cache, in the hopes
that enough blocks will be written in a short period of time to
populate an entire stripe. Once an entire stripe is in the write-back
cache, it can be written out with its syndrome blocks without having
to read anything.

Of course, the cache has to be non-volatile (battery backed or solid
state), because the kernel is expecting stuff it has written to disk
not to vanish because of a power failure.

My question is this: How does Linux RAID5/RAID6 avoid reading an
entire stripe every time the kernel flushes a single page? Does it
have a (volatile?) cache? Or does it rely on the kernel flushing lots
of contiguous data in a single request? Or something else?

Does Linux RAID keep track of which disk blocks have already been
written at least once, so that there is a difference between writing a
block for the first time and updating it later? (But I guess that
would not make sense, since eventually all writes become updates as
files are created and deleted.)

Thanks.

- Pat

Re: raid5/raid6 write performance question

On 17.02.2011 21:13:00 by Piergiorgio Sartor

On Thu, Feb 17, 2011 at 10:52:07AM -0800, Patrick J. LoPresti wrote:
> I have a fair amount of experience with hardware RAID devices, but now
> I am investigating Linux software RAID and I have a question. Well, a
> few questions.
>
> The classic problem for RAID5/RAID6 write performance, especially when
> striping across many drives, is that a single small write requires
> reading in the entire stripe from all disks to calculate the new
> syndrome block(s).
>
> Hardware RAID controllers typically mitigate this problem by using a
> sizable (512MiB - 4GiB) non-volatile write-back cache, in the hopes
> that enough blocks will be written in a short period of time to
> populate an entire stripe. Once an entire stripe is in the write-back
> cache, it can be written out with its syndrome blocks without having
> to read anything.
>
> Of course, the cache has to be non-volatile (battery backed or solid
> state), because the kernel is expecting stuff it has written to disk
> not to vanish because of a power failure.
>
> My question is this: How does Linux RAID5/RAID6 avoid reading an
> entire stripe every time the kernel flushes a single page? Does it
> have a (volatile?) cache? Or does it rely on the kernel flushing lots
> of contiguous data in a single request? Or something else?

This one I know... :-)
There is a cache (the stripe cache - volatile, since it is in system
RAM), which can be tuned via sysfs.

I have an i7 Xeon with 12 GiB of RAM and a 4-HDD RAID-5, and I set
the cache to 6 GiB. It is allocated dynamically, so it uses RAM only
when needed.
Some benchmarks show that you can achieve the full speed of 3 HDDs
both for small data writes and for sustained writes.
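
For reference, the knob is the stripe_cache_size attribute under
/sys/block/<md>/md/. Here is a minimal Python sketch of reading and
raising it (the device name "md0", the disk count and the target
value are just example assumptions; writing needs root):

import os

MD_DEV = "md0"   # assumption: adjust to your own array
SYSFS = "/sys/block/%s/md/stripe_cache_size" % MD_DEV
PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")  # usually 4096 bytes
N_DISKS = 4      # assumption: number of disks in the array

def cache_bytes(entries):
    # Worst-case RAM use: one page per disk per cache entry, i.e.
    # stripe_cache_size * page size * number of disks.
    return entries * PAGE_SIZE * N_DISKS

# Read the current setting (counted in stripe entries, not bytes).
with open(SYSFS) as f:
    current = int(f.read())
print("current: %d entries, up to %.0f MiB"
      % (current, cache_bytes(current) / 2.0 ** 20))

# Raise it; e.g. 32768 entries on a 4-disk array caps out around
# 32768 * 4 KiB * 4 = 512 MiB, allocated only as needed.
with open(SYSFS, "w") as f:
    f.write("32768\n")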

I must say I was really impressed by the difference in write
performance after increasing the cache, not only in benchmarks, but
also with some I/O-intensive applications.

It made me rethink the "quality" of the benchmarks you can find
around: it seems nobody has understood this capability of md.

Of course, in case of a power failure without a UPS, you risk a lot.
Nevertheless, it depends on the overall requirements, I guess.

> Does Linux RAID keep track of which disk blocks have already been
> written at least once, so that there is a difference between writing a
> block for the first time and updating it later? (But I guess that
> would not make sense, since eventually all writes become updates as
> files are created and deleted.)

This one I do not know... :-)

bye,

--

piergiorgio

Re: raid5/raid6 write performance question

On 18.02.2011 10:56:34 by David Brown

On 17/02/2011 19:52, Patrick J. LoPresti wrote:
> I have a fair amount of experience with hardware RAID devices, but now
> I am investigating Linux software RAID and I have a question. Well, a
> few questions.
>

I'll give some answers, but I am not sure about all the details. I hope
that someone else will correct me if I'm wrong :-)

> The classic problem for RAID5/RAID6 write performance, especially when
> striping across many drives, is that a single small write requires
> reading in the entire stripe from all disks to calculate the new
> syndrome block(s).
>

You don't need to read the whole stripe (at least, not for RAID5 - I
don't know enough about RAID6 to comment).

With RAID5, the parity block is the XOR of all the data blocks in the
stripe. So if you only want to change one block, you can read the old
data block and the old parity block, and calculate the new parity
block as the XOR of the old data block, the old parity block, and the
new data block.

You still have to do some reads then a write, but at least you don't
need to read the whole stripe.

I presume that's the way md RAID5 implements small writes.
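
To make the arithmetic concrete, here is a toy Python sketch of that
read-modify-write shortcut (pure illustration, not md's actual code;
"blocks" are just byte strings, XORed byte by byte):

from functools import reduce

def xor_blocks(*blocks):
    # Byte-wise XOR of equal-length blocks.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A 4-disk RAID5 stripe: three data blocks plus their parity.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)

# Small write replacing d1: read old data and old parity (2 reads),
# write new data and new parity (2 writes) - four I/Os in total, no
# matter how many disks are in the array.
new_d1 = b"XXXX"
new_parity = xor_blocks(parity, d1, new_d1)

# Sanity check: identical to recomputing parity from the whole stripe.
assert new_parity == xor_blocks(d0, new_d1, d2)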

> Hardware RAID controllers typically mitigate this problem by using a
> sizable (512MiB - 4GiB) non-volatile write-back cache, in the hopes
> that enough blocks will be written in a short period of time to
> populate an entire stripe. Once an entire stripe is in the write-back
> cache, it can be written out with its syndrome blocks without having
> to read anything.
>
> Of course, the cache has to be non-volatile (battery backed or solid
> state), because the kernel is expecting stuff it has written to disk
> not to vanish because of a power failure.
>
> My question is this: How does Linux RAID5/RAID6 avoid reading an
> entire stripe every time the kernel flushes a single page? Does it
> have a (volatile?) cache? Or does it rely on the kernel flushing lots
> of contiguous data in a single request? Or something else?
>

My understanding is that md keeps a cache of the stripes in RAM. Any
write must be completed to the disk itself, rather than just to the
stripe cache, before being reported to the file system as completed,
since this cache is volatile. But the next time you make a small
write to a stripe that is in the cache, md can avoid the reads. Of
course, the cache will also be used for reads.

The size of this cache is configurable - a larger stripe cache will
give you a higher hit ratio, and thus faster small writes on average.
But the same RAM could be used for other types of caches - directory
entry caches, file caches, etc. The best balance will depend on your
load: for a read-mostly array, RAM is probably better spent as file
cache, while for a write-mostly array the stripe cache is more
important.
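
As a rough illustration of that trade-off, here is a toy model of a
stripe cache with LRU eviction (an assumption - md's real cache is
more involved, and the write pattern here is synthetic):

from collections import OrderedDict
import random

def writes_needing_prereads(writes, cache_entries):
    cache = OrderedDict()  # stripe number -> present, in LRU order
    misses = 0
    for stripe in writes:
        if stripe in cache:
            cache.move_to_end(stripe)      # hit: old data already in RAM
        else:
            misses += 1                    # miss: must read before writing
            cache[stripe] = True
            if len(cache) > cache_entries:
                cache.popitem(last=False)  # evict least recently used
    return misses

random.seed(0)
# 10000 small writes over ~2000 stripes, with some locality.
writes = [int(random.gauss(0, 300)) % 2000 for _ in range(10000)]
for size in (64, 256, 4096):
    print("cache=%4d stripes: %4d of %d writes needed pre-reads"
          % (size, writes_needing_prereads(writes, size), len(writes)))

The bigger the cache, the more small writes land on stripes whose old
contents are already in RAM, so fewer of them pay for the pre-reads.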


My understanding of hardware RAID cards is that they have stripe
caches too, but these are typically volatile. A non-volatile cache
would mean you couldn't swap out controllers or disks while the
system is switched off, as some of the data might still be in the
controller card's cache instead of on the disks.

For high-end systems, the battery backup must not only keep the cache
alive, but also keep the disks running so that the cache can be
flushed to disk when there is a power failure. The controller can
then report a write as "complete" as soon as it is cached, and handle
the flush to disk in the background.

Less high-end systems would, I believe, handle the cache in the same
way as md RAID - the stripe cache in RAM helps avoid the reads before
writing to part of a RAID5 stripe. Typically, this on-board cache
will be a lot smaller than what you would have in an md RAID system.


