Abysmal performance of O_DIRECT write on parity raid

on 31.12.2010 05:35:06 by Spelic

Hi all linux raiders

on kernel 2.6.36.2 (but probably others too), O_DIRECT write performance is
abysmal on parity raid compared to non-parity raid.

And this is NOT due to the RMW apparently! (see below)

With dd bs=1M to the bare MD device (a 6-disk raid5 with 1024k chunks) I
get 2.1MB/sec, while the same test on a 4-disk raid10 goes at 160MB/sec
(about 80 times faster), even with stripe_cache_size at the maximum.
Non-direct writes to the arrays run at about 250MB/sec for the raid5 and
about 180MB/sec for the raid10.
With bs=4k direct I/O it's 205KB/sec on the raid5 vs 28MB/sec on the
raid10 (about 136 times faster).
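
For reference, a sketch of the kind of commands behind those numbers
(/dev/zero as the source and /dev/md0 / /dev/md1 as the array names are
just assumptions for illustration):

  dd if=/dev/zero of=/dev/md0 bs=1M count=1024 oflag=direct    # 6-disk raid5, 1024k chunk
  dd if=/dev/zero of=/dev/md1 bs=1M count=1024 oflag=direct    # 4-disk raid10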

This does NOT seem to be due to RMW, because from the second run onwards MD
does *not* read from the disks anymore (checked with iostat -x 1).
(BTW, how do you clear that cache? echo 3 > /proc/sys/vm/drop_caches does
not appear to work.)

It's so bad it looks like a bug. Could you please have a look at this?
A lot of important things use O_DIRECT, in particular:
- LVM, I think, especially pvmove and mirror creation, which are
impossibly slow on parity raid
- Databases (ok, I understand we should use raid10, but the difference
should not be SO great!)
- Virtualization. E.g. KVM wants bare devices for high performance and
wants to do direct I/O (see the sketch below). Go figure.
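
A minimal sketch of the KVM case, assuming a qemu-kvm guest backed directly
by the array (device path and options are illustrative):

  # cache=none makes qemu open the backing device with O_DIRECT,
  # so every guest write reaches the raid5 as a direct write
  qemu-kvm -m 2048 -drive file=/dev/md0,if=virtio,cache=none,format=raw ...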

With such a bad worst case for O_DIRECT we seriously risk having to
abandon MD parity raid completely.
Please have a look.

Thank you

Re: Abysmal performance of O_DIRECT write on parity raid

on 31.12.2010 06:36:37 by Doug Dumitru

A couple of comments.

First, your test stripe size is very large. With a 6-disk raid-5 and
1M chunks, you need 5MB of IO to fill a stripe. With direct IO, the
IO must complete and "sync" before dd continues. Thus each 1M write
will do reads from 4 drives and then 2 writes. I am not sure about
iostat not seeing this. I ran this here against 8 SSDs.
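
A quick sketch of that stripe arithmetic (it only restates the numbers above):

  # 6-disk raid5 with 1024K chunks: 5 data chunks + 1 parity chunk per stripe
  chunk_kb=1024; data_disks=5
  echo "full stripe = $((chunk_kb * data_disks)) KiB"    # 5120 KiB = 5 MiB
  # a 1M direct write covers only 1/5 of a stripe, so it forces a read-modify-write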

test file: 1G of random data copied from /dev/urandom into /dev/shm
(ssds can vary speed based on data content, hdds don't tend to act
this way).
array: /dev/md0 - 8 Indilinx SSDs 1024K chunk size. raid 5.
test: dd if=/dev/shm/rand.1b of=/dev/md0 bs=1M oflag=direct
result: 56.6 MB/s

Here is a 2-second iostat snapshot during the dd:

Device:   rrqm/s   wrqm/s     r/s     w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdb      1382.00  2990.50  157.00  291.00    6.31   13.00     88.29      1.66   3.78   0.41  18.55
sdc      1265.00  2309.50  144.00  203.50    5.50    9.89     90.73      1.16   3.33   0.45  15.60
sdd      1473.50  2688.00  190.50  283.00    6.50   11.53     78.00      0.98   2.07   0.31  14.85
sde      1497.50  3128.50  166.50  327.50    6.50   13.50     82.91      2.46   4.98   0.48  23.65
sdf      1498.00  3133.00  166.00  323.00    6.50   13.50     83.76      1.74   3.56   0.42  20.30
sdg      1482.00  3127.50  182.00  328.50    6.50   13.50     80.24      1.04   2.03   0.31  15.95
sdh      1464.50  3033.00  163.00  322.00    6.11   13.00     80.69      0.94   1.92   0.32  15.60
sdi      1488.00  3002.00  176.00  326.00    6.50   13.00     79.55      1.48   2.94   0.35  17.55
md0         0.00     0.00    0.00  454.50    0.00   50.50    227.56      0.00   0.00   0.00   0.00

so there are lots of RMWs going on.

If I do the same test with chunk set to 64K and bs set to 458752
(chunk size * 7 or /sys/block/md0/queue/optimal_io_size) the dd
improves to 250 MB/sec.
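
A sketch of driving dd from that sysfs value directly (same test file and
device names as above):

  # use the array's advertised optimal IO size as the dd block size
  opt=$(cat /sys/block/md0/queue/optimal_io_size)
  dd if=/dev/shm/rand.1b of=/dev/md0 bs=$opt oflag=direct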

This is still a lot slower than "perfect" IO. For these drives on
raid-5, perfect is about 700MB/sec. I have hit 670MB/sec with in-house
patches to raid5.c (see another thread), but those patches don't
translate down to user-space and programs like dd.

In general, if you want to run linear IO, you want to do IO at a
multiple of optimal_io_size. If the chunk size is too large, then
optimal_io_size is way too big to fit in a single bio.

The other issue with oflag=direct is that with direct IO you only have a
single IO outstanding before the next IO starts. Again, testing with
SSDs, the raid/5 logic tends to schedule about 35 IOPS for small
random writes. This is the raid layer waiting for additional IOs to
arrive before scheduling the RMW reads to back-fill the stripe cache
buffers. Again, the issue is single-threaded operation and how it
gets scheduled.
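
A sketch of how the queue-depth effect can be shown with fio (device name,
run time and depths are illustrative):

  # a single outstanding IO at a time
  fio --name=qd1 --filename=/dev/md0 --rw=randwrite --bs=4k --direct=1 \
      --ioengine=libaio --iodepth=1 --runtime=30 --time_based
  # the same workload with many IOs in flight for the raid layer to merge
  fio --name=qd32 --filename=/dev/md0 --rw=randwrite --bs=4k --direct=1 \
      --ioengine=libaio --iodepth=32 --runtime=30 --time_based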

In terms of how this impacts application performance, it can get
complicated. For single-threaded apps that do direct IO, the numbers
you are seeing are real. If the app does multi-threaded IO, then the
numbers are real, but for each thread independently. Again, with SSDs,
raid/5 can hit 18,000 write IOPS (with really good drives) if the
queue depth is really deep. Mind you, raid-10 can hit 80,000 IOPS, and
front-end FTLs (Flash Translation Layers) in software over raid/5 (see
http://www.managedflash.com) can hit 250,000 IOPS with the same
drives. It is all about scheduling and keeping the drives busy moving
meaningful data.

Bottom line is that the current raid-5 code is doing the best it can.
Its real issue is knowing when IO is random and when it is linear,
given that all it "sees" is inbound, un-associated block requests. The
problem is deciding when to "pull the trigger" and assume IO is random
when it might help to wait for some more linear blocks. There is talk
"now and again" about adding a write cache to raid/456.
Unfortunately, without some non-volatile memory (think hardware raid
with batteries to back up RAM), a bad shutdown will kill data left and
right if the raid code re-orders writes and crashes.

Perhaps what is needed is a new bio status bit to let the layers know
that the request is complete and needs to be "pushed" immediately.
Unfortunately, such a change is a "big deal" and would require just about
every app to become "aware" in order for it to help.

Doug



--
Doug Dumitru
EasyCo LLC

Re: Abysmal performance of O_DIRECT write on parity raid

on 05.01.2011 12:51:47 by Spelic

On 12/31/2010 06:36 AM, Doug Dumitru wrote:

> With direct IO, the
> IO must complete and "sync" before dd continues. Thus each 1M write
> will do reads from 4 drives and then 2 writes. I am not sure about
> iostat not seeing this. I ran this here against 8 SSDs.
>

I confirm: it is the stripe_cache doing that.
You need to raise stripe_cache_size to 32768, then do a little I/O the
first time (less than 32768 * 4k * number of disks) so the stripe cache fills up.
Then do it again and you will see no reads.
I also found how to clear it: bring stripe_cache_size down to 32 and then back to
32768. After that it will read again.
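
A sketch of that sequence, with /dev/md0 and a 6-disk array as example values:

  # grow the stripe cache, then prime it with some direct writes
  echo 32768 > /sys/block/md0/md/stripe_cache_size
  dd if=/dev/zero of=/dev/md0 bs=1M count=512 oflag=direct    # well under 32768 * 4k * 6
  # re-running the dd now causes no reads; shrinking and re-growing
  # the cache makes it read again
  echo 32 > /sys/block/md0/md/stripe_cache_size
  echo 32768 > /sys/block/md0/md/stripe_cache_size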

If you test that, it will probably be lightning fast for you because you
have SSDs.
So run iostat -x 10 (10-second intervals) to get a "frozen" summary; you
will see no reads.

Thanks for all your info, it's interesting stuff, and I confirm you are
right about parallelism: with fio running 20 threads doing random 1M direct
writes, the bandwidth adds up proportionally like you say.
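
Roughly the fio invocation behind that (exact device name and run time are
just examples):

  fio --name=par1m --filename=/dev/md0 --rw=randwrite --bs=1M --direct=1 \
      --thread --numjobs=20 --group_reporting --runtime=60 --time_based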

However:
I confirm that in my case, even when it DOESN'T read (the stripe_cache
effect), sequential dd with O_DIRECT bs=1M is dog slow on my raid-5.

What I see with iostat (I paid more attention now) is that, every other
second, iostat -x 1 shows ZERO I/O except for exactly one disk (below the
md raid) with 1 in avgqu-sz. If I go to the /sys/block/ entry in question
I can see it's an in-flight write: that disk has 1 in-flight write
100% of the time. This goes on for a while; then the disk changes, and
it's another disk of the array that has 1 in-flight write 100% of
the time... It cycles through all the disks of the array in this pattern:
[3] [2] [1] [0] [6] [4] (remapped to the device order in that array
from cat /proc/mdstat). I don't have a disk 5 in that array, maybe a
quirk from when I created it; if I had a disk "5" instead of disk "6" it
probably would have been 3 2 1 0 5 4. I think the pattern follows the
position of either the data disk being written or the parity disk being
written.
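
A sketch of how to watch that, with example member-device names:

  # print each member disk's in-flight read/write counters once a second
  watch -n1 'grep . /sys/block/sd[b-h]/inflight'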

My interpretation is that since it's direct (and hence synchronous) I/O, MD
waits for completion of the in-flight write before submitting another one,
and every so many requests there is one that stays stuck for
1-2 seconds, so everything freezes for 1-2 seconds. That's why it is
dog slow.

Now why does that inflight write take so long??

I thought this might be a bug in my controller; it's a 3ware 9650SE,
not the best for MD...

However, please note that I see this problem on all raid-5 arrays at most
"bs" sizes (it disappears around bs=4M, where the speed varies a lot
from attempt to attempt), and I do NOT see the problem on raid-10 or
raid-1 arrays, e.g. when doing sequential dd O_DIRECT writes with bs=1M or
any other bs to a raid10 array. Note that this is a very similar
scenario because:
- it is direct
- it is sync
- it does not read
- it generates far more IOPS per disk than the problematic
raid-5 case I am reporting
and still I don't see this problem of hanging requests; dd goes very
fast at any block size (obviously faster for reasonably big sizes).

So I am wondering whether MD itself contributes to this "bug"...?

Thank you
