single cpu thread performance limit?

single cpu thread performance limit?

On 11.08.2011 17:58:48 by mark delfman

I seem to have hit a significant hard stop in MD RAID1/10 performance
which seems to be linked to a single CPU thread.

I am using extremely high-speed (IOPS) internal block devices – 8 in
total. They are capable of achieving > 1 million IOPS.

However, if I use RAID1/10, MD seems to use a single thread which
reaches 100% CPU utilisation (a single core) at around 200K IOPS,
limiting the entire array to around 200K.

If I use, say, 4 x RAID1/10s with a RAID0 on top, I see little
improvement (although in theory I should, and there are now 4 CPU
threads running, it still seems to hit 4 x 100% at maybe 350K).

Is there any way to increase the number of threads per RAID set? Or
any other suggestions on configurations? (I have tried every
permutation of R0 + R1/10s.)
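
For reference, a minimal sketch of the single-array layout described
above, with placeholder device names (/dev/flash0../dev/flash7 stand in
for whatever nodes the PCIe flash driver actually exposes):

# one md RAID10 across all 8 devices - serviced by a single md thread
~$ mdadm --create /dev/md0 --level=10 --raid-devices=8 \
       /dev/flash0 /dev/flash1 /dev/flash2 /dev/flash3 \
       /dev/flash4 /dev/flash5 /dev/flash6 /dev/flash7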

Thank you for any advice.


Mark

Re: single cpu thread performance limit?

On 11.08.2011 18:01:32 by mathias.buren

On 11 August 2011 16:58, mark delfman wrote:
> [...]

Maybe create separate MD RAID1 devices, then a new MD device with
RAID0? (instead of using mdadm RAID"10")
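
A minimal sketch of that layered layout, again with placeholder device
names:

# four 2-way mirrors, each with its own md thread
~$ mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/flash0 /dev/flash1
~$ mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/flash2 /dev/flash3
~$ mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/flash4 /dev/flash5
~$ mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/flash6 /dev/flash7
# RAID0 striped across the four mirrors
~$ mdadm --create /dev/md10 --level=0 --raid-devices=4 /dev/md1 /dev/md2 /dev/md3 /dev/md4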

/M

Re: single cpu thread performance limit?

On 11.08.2011 18:07:18 by mark delfman

Tried this... it results in the same :(

On Thu, Aug 11, 2011 at 5:01 PM, Mathias Burén wrote:
> [...]

Re: single cpu thread performance limit?

On 11.08.2011 20:58:37 by Stan Hoeppner

On 8/11/2011 10:58 AM, mark delfman wrote:
> I seem to have hit a significant hard stop in MD RAID1/10 performance
> which seems to be linked to a single CPU thread.

What is the name of the kernel thread that is pegging your cores? Could
the device driver be eating the CPU rather than the md kernel threads? Is
it both? Is it a different thread? How much CPU is the IO generator
app eating?

What Linux kernel version are you running? Which Linux distribution?
What application are you using to generate the IO load? Does it work at
the raw device/partition level or at the file level?

> I am using extremely high-speed (IOPS) internal block devices – 8 in
> total. They are capable of achieving > 1 million IOPS.

8 solid state drives of one model or another, probably occupying 8 PCIe
slots. IBIS, VeloDrive, the LSI SSD, or other PCIe based SSD? Or are
these plain SATA II SSDs that *claim* to have 125K 4KB random IOPS
performance?

> However, if I use RAID1/10, MD seems to use a single thread which
> reaches 100% CPU utilisation (a single core) at around 200K IOPS,
> limiting the entire array to around 200K.

CPU frequency? How many sockets? Total cores? Whose box? HP, Dell,
IBM, whitebox, self built? If the latter two, whose motherboard? How
many PCIe slots are occupied by the SSD cards?

> If I use, say, 4 x RAID1/10s with a RAID0 on top, I see little
> improvement (although in theory I should, and there are now 4 CPU
> threads running, it still seems to hit 4 x 100% at maybe 350K).

Assuming you have 4 processors (cores), then yes, you should see better
scaling. If you have fewer cores than threads, then no. Do you see more
IOPS before running out of CPU when reading vs. writing? You should, as
md issues half as many device IOs for reads.

> Is there any way to increase the number of threads per RAID set? Or
> any other suggestions on configurations? (I have tried every
> permutation of R0 + R1/10s.)

The answer to the first question AFAIK is no. Do you have the same
problem with a single --linear array? What is the result when putting a
filesystem on each individual drive? Do you get your 1 million IOPS?

Is MSI enabled and verified to be working for each PCIe SSD device? See:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/PCI/MSI-HOWTO.txt;hb=HEAD
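
A sketch of checking this per device (the bus address is a placeholder;
"Enable+" means the capability is active):

~$ lspci -vv -s 03:00.0 | grep -E 'MSI|MSI-X'
# MSI interrupts also show up as PCI-MSI entries here
~$ grep -i msi /proc/interrupts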

--
Stan

Re: single cpu thread performance limit?

On 11.08.2011 21:04:12 by Bernd Schubert

On 08/11/2011 05:58 PM, mark delfman wrote:
> I seem to have hit a significant hard stop in MD RAID1/10 performance
> which seems to be linked to a single CPU thread.
>
> I am using extremely high-speed (IOPS) internal block devices – 8 in
> total. They are capable of achieving > 1 million IOPS.
>
> However, if I use RAID1/10, MD seems to use a single thread which
> reaches 100% CPU utilisation (a single core) at around 200K IOPS,
> limiting the entire array to around 200K.
>

Out of interest, could you please run "perf top" to let us see where the
kernel is busy?
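
A possible invocation, as a sketch (the perf userspace tool has to match
the running kernel; package names vary by distro):

# live view of the hottest kernel/user symbols while the IO load runs
~$ perf top
# or capture a system-wide profile with call graphs for later inspection
~$ perf record -a -g sleep 30
~$ perf report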

Thanks,
Bernd

Re: single cpu thread performance limit?

On 11.08.2011 21:37:46 by mark delfman

Hi... sorry for the lack of initial info; your questions made me
realise how much I had missed out! Hopefully this adds some colour:

PCIe-based flash - SLC based
Multiple Xeon 5640s (16 cores total)
MSI interrupts all set (and IRQ affinity / pinning tried)
SLES 11 (2.6.32.43-0.5)
Tried on both a Supermicro and a Dell R-series server

The thread is md0_raid10 (or something similar; I am not near the machine now).

This thread is easily tied to the MD device(s):
create 4 x RAID1s and you have 4 x MD threads, etc.

So, a single RAID10 creates a single thread, which maxes out at maybe 200K IOPS.
Creating 4 x RAID10s seems OK, but they do not scale so well with a
RAID0 on top :(
Ideal would be a few threads per RAIDx.


Using basic fio for IOPS (4 workers, queue depth 128) - this uses hardly
any CPU resource.
Reads are maybe 50% faster, as you would expect.
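
A sketch of an fio job along those lines (the target device and run time
are placeholders, not the exact job file used):

~$ fio --name=randwrite --filename=/dev/md0 --direct=1 --ioengine=libaio \
       --rw=randwrite --bs=4k --numjobs=4 --iodepth=128 \
       --runtime=60 --time_based --group_reporting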

The issue seems to be that a single thread will only deliver X IOPS
before hitting 100% CPU... with emerging flash, that falls well short of
the hardware's capability.

FS: a filesystem is not really an option for this solution, so we have
not tried it on this rig, but in the past an FS has degraded the IOPS.

Whilst a R0 on top of the R1/10s does offer some increase in
performance, linear does not :(
LVM striping on top of the MD R1/10s gives much the same results.
The limiter seems fixed at the single thread per R1/10.


Thank you for any feedback!

Mark



On Thu, Aug 11, 2011 at 7:58 PM, Stan Hoeppner wrote:
> [...]

Re: single cpu thread performance limit?

On 11.08.2011 21:57:01 by Joe Landman

On 08/11/2011 03:37 PM, mark delfman wrote:

> So, a single RAID10 creates a single thread - which will max at maybe 200K IOPS.

We are seeing ~110k IOPs per PCI HBA for an SSD variant of what you
have. FWIW, MD RAID is significantly faster than the hardware RAID
here, but that's due to the processor more than anything else.

Which cards if you don't mind my asking? We work with a number of
vendors in this space.

> Create 4 x RAID10's seems OK, but they will not scale so great with a
> RAID0 on top :(
> Ideal would be a few threads per RAIDx

[...]

> Whilst a R0 on top of the R1/10's does offer some increase in
> performance, linear does not :(

Linear makes no sense for distributing IO's among many devices. Linear
is a concatenation.

> LVM striping on top of the MD R1/10s gives much the same results.
> The limiter seems fixed at the single thread per R1/10.

What's your CPU? What does your 'lspci -vvv' output look like (is it
possible you've oversubscribed your PCIe channels?) How many PCIe lanes
do you have on your MB?
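
A sketch of checking the negotiated link per card (the bus address is a
placeholder):

# compare LnkCap (what the card supports) with LnkSta (what it trained at)
~$ lspci -vvv -s 03:00.0 | grep -E 'LnkCap|LnkSta'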

FWIW, our array of SSD's hit 7.8 GB/s and 330k IOPs (8k random reads
against 768GB of data) using MD RAID5's. Each RAID5 hits around 75k
IOPs, and when joined together, they hit closer to 110k per HBA.

The PCIe units are generally much better than this. Last set of cards
we played with a few weeks ago we were getting about 400k IOPs for a
pair of cards in an MD RAID0. I expect newer drivers and other things
to help out a bit.


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615

Re: single cpu thread performance limit?

On 11.08.2011 22:51:39 by Stan Hoeppner

On 8/11/2011 2:37 PM, mark delfman wrote:

> FS: a filesystem is not really an option for this solution, so we have
> not tried it on this rig, but in the past an FS has degraded the IOPS.

I'm wondering what your application is, given you have the option to
write to raw devices in production.

> Whilst a R0 on top of the R1/10s does offer some increase in
> performance, linear does not :(
> LVM striping on top of the MD R1/10s gives much the same results.
> The limiter seems fixed at the single thread per R1/10.

This might provide you some really interesting results. :) Take your 8
flash devices, which are of equal size I assume, and create an md
--linear array on the raw devices, no partitions (we'll worry about
redundancy later). Format this md device with:

~$ mkfs.xfs -d agcount=8 /dev/mdX

Mount it with:

~$ mount -o inode64,logbsize=256k,noatime,nobarrier /dev/mdX /test

(Too bad you're running 2.6.32 instead of 2.6.35 or above, as enabling
the XFS delayed logging mount option would probably bump your small file
block IOPS to well over a million, if the hardware is actually up to it.)

Now, create 8 directories, say test[1-8]. XFS drives parallelism
through allocation groups. Each directory will be created in a
different AG. Thus you'll end up with one directory per SSD, and any
files written to that directory will go to that same SSD. Writing
files to all 8 directories in parallel should therefore get you near
perfect scaling across all disks, with files, not simply raw blocks.

I'm not really that familiar with FIO but I'll assume it can do file as
well as block IO. If not, grab iozone or bonnie, etc, and run tests
writing small files to all 8 directories in parallel. The results may
surprise you. After you've done this, create 4 mirror pairs and then a
--linear of them. Duplicate the above but use 4 allocation groups and 4
directories. Please post the results for both test setups.
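
A sketch of the first setup, assuming 8 placeholder flash devices, the
mkfs/mount options above, and fio as the file-level load generator (all
parameters illustrative only):

# linear (concatenation) array across the 8 raw devices
~$ mdadm --create /dev/md0 --level=linear --raid-devices=8 /dev/flash[0-7]
~$ mkfs.xfs -d agcount=8 /dev/md0
~$ mount -o inode64,logbsize=256k,noatime,nobarrier /dev/md0 /test
~$ mkdir /test/test{1..8}
# one file writer per directory (and thus per AG/SSD), all in parallel
~$ for i in $(seq 1 8); do \
     fio --name=writer --directory=/test/test$i --rw=randwrite --bs=4k \
         --size=1g --iodepth=32 --ioengine=libaio --direct=1 & \
   done; wait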

--
Stan

Re: single cpu thread performance limit?

On 12.08.2011 03:05:35 by Stan Hoeppner

On 8/11/2011 3:51 PM, Stan Hoeppner wrote:
> On 8/11/2011 2:37 PM, mark delfman wrote:
>
>> FS: a filesystem is not really an option for this solution, so we have
>> not tried it on this rig, but in the past an FS has degraded the IOPS.

>> Whilst a R0 on top of the R1/10s does offer some increase in
>> performance, linear does not :(
>> LVM striping on top of the MD R1/10s gives much the same results.
>> The limiter seems fixed at the single thread per R1/10.

This seems to be the case. The md processes apparently aren't threaded,
at least not when doing mirroring/striping. xfsbufd, xfssyncd, and
xfsaild are all threaded.

> This might provide you some really interesting results. :) Take your 8
> flash devices, which are of equal size I assume, and create an md
> --linear array on the raw device, no partitions (we'll worry about
> redundancy later). Format this md device with:

A concat shouldn't use nearly as much CPU as a mirror or stripe. Though
I don't know if one core will be enough here. Test and see.

> ~$ mkfs.xfs -d agcount=8 /dev/mdX
>
> Mount it with:
>
> ~$ mount -o inode64,logbsize=256k,noatime,nobarrier /dev/mdX /test
>
> (Too bad you're running 2.6.32 instead of 2.6.35 or above, as enabling
> the XFS delayed logging mount option would probably bump your small file
> block IOPS to well over a million, if the hardware is actually up to it.)
>
> Now, create 8 directories, say test[1-8]. XFS drives parallelism
> through allocation groups. Each directory will be created in a
> different AG. Thus you'll end up with one directory per SSD, and any
> files written to that directory will go to that same SSD. Writing
> files to all 8 directories in parallel should therefore get you near
> perfect scaling across all disks, with files, not simply raw blocks.

In actuality, since you're running up against CPU vs IOPs, it may be
better here to create 32 or even 64 allocation groups and spread files
evenly across them. IIRC, each XFS file IO gets its own worker thread,
so you'll be able to take advantage of all 16 cores in the box. The
kernel IO is more than sufficiently threaded.
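
For instance, a sketch with placeholder device and mount point:

~$ mkfs.xfs -f -d agcount=32 /dev/md0    # mkfs prints the resulting geometry
~$ mount -o inode64,logbsize=256k,noatime,nobarrier /dev/md0 /test
~$ xfs_info /test                        # double-check agcount/agsize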

You mentioned above that using a filesystem isn't really an option. As
I see it, given the lack of md's lateral (parallel) scalability with
your hardware and workload, you may want to evaluate the following ideas:

1. Upgrade to 2.6.38 or later. There have been IO optimizations since
2.6.32, though I'm not sure WRT the md code itself.

2. Try the XFS option. It may or may not work in your case, but it
will parallelize to hundreds of cores when writing hundreds of files
concurrently. The trick is matching your workload to it, vice versa.
If you're writing single large files, it's likely not going to
parallelize. If you can't use a filesystem...

3. mdraid on your individual cores can't keep up with your SSDs, so:
   A. Switch to 24 SLC SATA SSDs attached to 3x 8-port LSI SAS HBAs:
      http://www.lsi.com/products/storagecomponents/Pages/LSISAS9211-8i.aspx
      which will give you 12 mdraid1 processes instead of 4. Use
      cpusets to lock the 12 mdraid1 processes to 12 specific
      cores, and the mdraid0 process to another core (see the sketch
      below). And disable HT.
   B. Swap the CPUs for higher frequency models, though it'll gain you
      little and cost quite a bit for four 3.6GHz Xeon W5590s.
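
A sketch of the pinning idea in 3.A using taskset (array names and core
numbers are placeholders; a cpuset would achieve the same thing):

# pin each mirror's md write thread to its own core (md1..md12 assumed)
~$ taskset -pc 1 "$(pgrep md1_raid1)"
~$ taskset -pc 2 "$(pgrep md2_raid1)"
...
~$ taskset -pc 12 "$(pgrep md12_raid1)"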

I'm sure you've already thought of these options, but I figured I'd get
them in Google.

--
Stan

Re: single cpu thread performance limit?

On 12.08.2011 11:04:31 by David Brown

On 11/08/11 21:57, Joe Landman wrote:
> On 08/11/2011 03:37 PM, mark delfman wrote:
>
>
>> Whilst a R0 on top of the R1/10's does offer some increase in
>> performance, linear does not :(
>
> Linear makes no sense for distributing IO's among many devices. Linear
> is a concatenation.
>

If the real-world application involves parallel access to lots of
different files, then XFS on a linear concatenation /will/ make sense,
if your allocation groups match your concatenated devices. It won't
give you faster access to any of the files, but it will let you have
fast access to several files at the same time. Of course, YMMV
according to the setup and application.


Re: single cpu thread performance limit?

On 12.08.2011 14:48:23 by Asdo

On 08/11/11 21:37, mark delfman wrote:
> So, a single RAID10 creates a single thread - which will max at maybe 200K IOPS.
> Create 4 x RAID10's seems OK, but they will not scale so great with a
> RAID0 on top :(
> Ideal would be a few threads per RAIDx

Try this: LVM.
AFAIR, LVM does not have its own thread; it is the application thread that
executes the LVM code.
This should not impede scalability.

If you are testing with something like fio, which spans the whole device
with random I/O during the test, you can use a linear LVM concatenation
(which is the default when you create an LV that spans the whole VG).
Otherwise use striping on lvcreate.
Try both if possible.
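
A sketch of both variants (VG/LV names are placeholders and the stripe
count assumes the 4 md RAID1s are the PVs; create one or the other, since
each takes all free space):

# linear: plain concatenation across the PVs (the default allocation policy)
~$ lvcreate -n lv_linear -l 100%FREE vg_flash
# striped: one stripe per PV, 64KB stripe size
~$ lvcreate -n lv_striped -i 4 -I 64 -l 100%FREE vg_flash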

Also, as other people have said, your kernel is quite old... Actually I
don't remember if there were performance improvements regarding what you
are doing, but you probably should try a newer one.

Let me know how it goes.

Re: single cpu thread performance limit?

On 12.08.2011 15:23:06 by mark delfman

Hi

Quick update with the XFS tests suggested (although an FS is still
probably not a real option for me at the moment).

This rig only has 4 x flash cards (2 MLC and 2 SLC)... 125K IOPS each for
the MLC, 165K each for the SLC.

Created a linear RAID and XFS with agcount=4.

Mounted as suggested and created 4 test folders...

If I test them individually, we get 99.9% of the IOPS (i.e. 125K for the
first 2 AGs and 165K for the last 2), which is great news and means the
AGs do what they should.

But if I run the test over all 4, it peaks at around 320K IOPS.
Interestingly each AG = 80K IOPS, and as we can see above this need not
be the case, as CPU load is not an issue - I am presuming this could
simply be an XFS limit, maybe.


More testing with many R1s and R0s on top seems to suggest that R0
loses around 20-25% of the IOPS (R1 around 5%). I have tried an LVM
stripe with much the same result.

On Thu, Aug 11, 2011 at 7:58 PM, Stan Hoeppner wrote:
> [...]

Re: single cpu thread performance limit?

On 12.08.2011 16:23:58 by Asdo

On 08/12/11 15:23, mark delfman wrote:
> This rig only has 4 x flash cards (2 MLC and 2 SLC)... 125K IOPS each for
> the MLC, 165K each for the SLC.
> [...]
> But if I run the test over all 4, it peaks at around 320K IOPS.
> [...]
> More testing with many R1s and R0s on top seems to suggest that R0
> loses around 20-25% of the IOPS (R1 around 5%). I have tried an LVM
> stripe with much the same result.

So you report a higher speed now: (25% overhead + 5% overhead = 30%
overhead, so 70% remains)
(125*2 + 175*2) * 0.7 = 420K
Previously, in your first post, you were talking about 350K; do you confirm?

Unfortunately I think 20% overhead for R0 or LVM is reasonable; I have
measured 15% for LVM in other situations.
Your figures with 4 SSDs are not bad, I'd say.

But this means that you should obtain 840K IOPS when you have all 8 SSD
PCIe cards installed (like in your first post).
If possible repeat the test with LVM stripes on the big rig.

Oh and I also wanted to ask: if you run 8 parallel tests on the big rig
with 8 SSDs, each test on a different SSD but all tests simultaneously,
without RAIDs or LVMs, are you sure you reach 1 million IOPS overall, or
do you max out at 600K or similar? (600K would be the last performance
you measured but adjusted to remove the overheads of LVM and RAID)


BTW: please note you do NOT have 16 cores; you have 8 cores if you have
a dual Xeon 5640. The other 8 cores you see are fake; that's
hyperthreading. If one core's CPU occupation goes up, you will see its
twin phantom core go up as well. This makes the benchmarking more
difficult to interpret, so you might disable hyperthreading in the
BIOS if you want to understand better what's going on. Performance
should probably change very little after you disable hyperthreading.
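
A sketch of spotting the hyperthread siblings, so the phantom cores can
be recognised in the monitoring output:

# "siblings" > "cpu cores" in /proc/cpuinfo means HT is enabled
~$ grep -E 'siblings|cpu cores' /proc/cpuinfo | sort -u
# each entry lists the two logical CPUs sharing one physical core, e.g. "0,8"
~$ cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list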

Re: single cpu thread performance limit?

On 12.08.2011 22:51:30 by Stan Hoeppner

On 8/12/2011 8:23 AM, mark delfman wrote:

> Quick update with the XFS tests suggested (although an FS is still
> probably not a real option for me at the moment).
>
> This rig only has 4 x flash cards (2 MLC and 2 SLC)... 125K IOPS each for
> the MLC, 165K each for the SLC.
>
> Created a linear RAID and XFS with agcount=4.
>
> Mounted as suggested and created 4 test folders...
>
> If I test them individually, we get 99.9% of the IOPS (i.e. 125K for the
> first 2 AGs and 165K for the last 2), which is great news and means the
> AGs do what they should.

Now you know why XFS has the high performance reputation it does.

> But if I run the test over all 4, it peaks at around 320K IOPS.
> Interestingly each AG = 80K IOPS, and as we can see above this need not
> be the case, as CPU load is not an issue - I am presuming this could
> simply be an XFS limit, maybe.

Ok, now this is interesting, because the 320K IOPS you mentioned as a
limit here is very close to the ~350K IOPS you mentioned in your first
post, when 4 cores were pegged with the md processes. In this case your
CPUs are not pegged, but you're hitting nearly the same ceiling, 320K IOPS.

I'm pretty sure you're not hitting an XFS limit here. To confirm,
create 4 subdirectories in each of the current 4 directories, and
generate 16 concurrent writers against the 16 dirs.
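
A sketch of that test, reusing the placeholder paths and fio parameters
from earlier:

~$ mkdir -p /test/test{1..4}/sub{1..4}
# 16 concurrent writers, one per subdirectory
~$ for d in /test/test{1..4}/sub{1..4}; do \
     fio --name=writer --directory=$d --rw=randwrite --bs=4k --size=512m \
         --iodepth=32 --ioengine=libaio --direct=1 & \
   done; wait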

On 8/11/2011 10:58 AM, mark delfman wrote:
> If I use, say, 4 x RAID1/10s with a RAID0 on top, I see little
> improvement (although in theory I should, and there are now 4 CPU
> threads running, it still seems to hit 4 x 100% at maybe 350K).

So it's beginning to look like your scalability issue may not
necessarily be with mdraid, but possibly a hardware bottleneck, or a
bottleneck somewhere else in the kernel. As Bernd mentioned previously,
you should probably run perf top or some other tool to see where the
kernel is busy.

Also, you never answered my question regarding which block device
driver(s) you're using for these PCIe SSDs.

> More testing with many R1s and R0s on top seems to suggest that R0
> loses around 20-25% of the IOPS (R1 around 5%). I have tried an LVM
> stripe with much the same result.

Are you hitting the same ~320K-350K IOPS aggregate limit with all test
configurations?

--
Stan
