high throughput storage server?

on 15.02.2011 00:59:28 by Matt Garman

For many years, I have been using Linux software RAID at home for a
simple NAS system. Now at work, we are looking at buying a massive,
high-throughput storage system (e.g. a SAN). I have little
familiarity with these kinds of pre-built, vendor-supplied solutions.
I just started talking to a vendor, and the prices are extremely high.

So I got to thinking, perhaps I could build an adequate device for
significantly less cost using Linux. The problem is, the requirements
for such a system are significantly higher than my home media server,
and put me into unfamiliar territory (in terms of both hardware and
software configuration).

The requirement is basically this: around 40 to 50 compute machines
act as basically an ad-hoc scientific compute/simulation/analysis
cluster. These machines all need access to a shared 20 TB pool of
storage. Each compute machine has a gigabit network connection, and
it's possible that nearly every machine could simultaneously try to
access a large (100 to 1000 MB) file in the storage pool. In other
words, a 20 TB file store with bandwidth upwards of 50 Gbps.

I was wondering if anyone on the list has built something similar to
this using off-the-shelf hardware (and Linux of course)?

My initial thoughts/questions are:

(1) We need lots of spindles (i.e. many small disks rather than
few big disks). How do you compute disk throughput when there are
multiple consumers? Most manufacturers provide specs on their drives
such as sustained linear read throughput. But how is that number
affected when there are multiple processes simultaneously trying to
access different data?  Is the sustained bulk read throughput value
inversely proportional to the number of consumers?  (E.g. a 100 MB/s
drive only does 33 MB/s with three consumers.)  Or is there a more
specific way to estimate this?  (A rough model is sketched below, after
point (3).)

(2) The big storage server(s) need to connect to the network via
multiple bonded Gigabit ethernet, or something faster like
FibreChannel or 10 GbE. That seems pretty straightforward.

(3) This will probably require multiple servers connected together
somehow and presented to the compute machines as one big data store.
This is where I really don't know much of anything. I did a quick
"back of the envelope" spec for a system with 24 600 GB 15k SAS drives
(based on the observation that 24-bay rackmount enclosures seem to be
fairly common). Such a system would only provide 7.2 TB of storage
using a scheme like RAID-10. So how could two or three of these
servers be "chained" together and look like a single large data pool
to the analysis machines?
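
For concreteness, here is the sort of back-of-envelope arithmetic I have
been playing with for questions (1) and (3). The per-drive streaming
rate, seek time, and chunk size are just guesses for illustration; only
the RAID-10 capacity calculation uses the real numbers from above.

    # Back-of-envelope helpers for points (1) and (3). All per-drive
    # figures are illustrative assumptions, not vendor specs.

    def usable_tb(drive_count, drive_gb):
        """Usable capacity of a RAID-10 set: half the raw space."""
        return drive_count * drive_gb / 1000.0 / 2

    def per_consumer_mb_s(streaming_mb_s, consumers, avg_seek_ms=8.0, chunk_mb=1.0):
        """Crude contention model: with more than one reader, every chunk
        transferred also costs a seek, so aggregate throughput drops below
        the streaming rate, i.e. worse than simple 1/N scaling."""
        transfer_s = chunk_mb / streaming_mb_s
        seek_s = avg_seek_ms / 1000.0 if consumers > 1 else 0.0
        aggregate_mb_s = chunk_mb / (transfer_s + seek_s)
        return aggregate_mb_s / consumers

    print(usable_tb(24, 600))         # 7.2 TB usable per 24-bay RAID-10 box
    print(per_consumer_mb_s(100, 1))  # 100 MB/s with a single reader
    print(per_consumer_mb_s(100, 3))  # ~18 MB/s each, worse than the 33 MB/s guess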

I know this is a broad question, and not 100% about Linux software
RAID. But I've been lurking on this list for years now, and I get the
impression there are list members who regularly work with "big iron"
systems such as what I've described. I'm just looking for any kind of
relevant information here; any and all is appreciated!

Thank you,
Matt
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

on 15.02.2011 03:06:43 by Doug Dumitru

Matt,

You have a whole slew of questions to answer before you can decide on
a design. This is true if you build it yourself or decide to go with
a vendor and buy a supported server. If you do go with a vendor, the
odds are actually quite good you will end up with Linux anyway.

You state a need for 20TB of storage split among 40-50 "users". Your
description implies that this space needs to be shared at the file
level. This means you are building a NAS (Network Attached Storage),
not a SAN (Storage Area Network). SANs typically export block devices
over protocols like iSCSI. These block devices are non-sharable (i.e.,
only a single client can mount them, at least read/write, at a time).

So, 20TB of NAS. Not really that hard to build. Next, you need to
look at the space itself. Is this all unique data, or is there an
opportunity for "data deduplication". Some filesystems (ZFS) and some
block solutions can actively spot blocks that are duplicates and only
store a single copy. With some applications (like virtual servers all
running the same OS), this can result in de-dupe ratios of 20:1. If
your application is like this, your 20TB might only be 1-2 TB. I
suspect this is not the case based on your description.

Next, is the space all the same? Perhaps some of it is "active" and
some of it is archival. If you need 4TB of "fast" storage and 16TB of
"backup" storage, this can really impact how you build a NAS. Space
for backup might be configured with large (> 1TB) SATA drives running
RAID-5/6. These configurations are good at reads and linear writes,
but lousy at random writes. Their cost is wildly lower than "fast"
storage. You can buy a 12-bay 2U chassis for $300 plus power supply, put in
12 2TB 7200 RPM SATA drives in RAID-6, and get ~20TB of usable space. Random
write performance will be quite bad, but for backups and "near line"
storage, it will do quite well. You can probably build this for
around $5K (or maybe a bit less) including a 10GigE adapter and server
class components.
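
In rough numbers (the ~$5K price and the 12 x 2TB layout are my ballparks
from above, the rest is just arithmetic):

    # Capacity and cost-per-TB sanity check for the "slow" RAID-6 backup tier.
    drives, size_tb, system_cost = 12, 2.0, 5000
    usable = (drives - 2) * size_tb      # RAID-6 gives two drives' worth to parity
    print(usable)                        # 20.0 TB usable, matching ~20TB above
    print(system_cost / usable)          # ~$250 per usable TB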

If you need IOPS (IO Operations Per Second), you are looking at SSDs.
You can build 20TB of pure SSD space. If you build it yourself as RAID-10,
expect to pay around $6/GB, or $120K just for drives. 18TB will fit in
a 4U chassis (see the 72-drive SuperMicro double-sided 4U). 72 500GB
drives later and you have 18,000 GB of usable space. Not cheap, but if you
quote a system from NetApp or EMC it will seem so.

If you can cut the "fast" size down to 2-4 TB, SSDs become a lot more
realistic with commercial systems from new companies like WhipTail for
way under $100K.

If you go with hard drives, you are trading speed for space. With
600GB 10K drives you would need 66 drives in RAID-10. Multi-threaded, this
would read at around 10K IOPS and write at around 7K for "small"
blocks (4-8K). Linear IO would be wicked fast but random ops slow you
down. Conversely, large SSD arrays can routinely hit > 400K reads and
> 200K writes if built correctly. Just the 66 hard drives will run
you $30K. These are SAS drives, not WD VelociRaptors, which would save
you 30%.

If you opt for "lots of small drives" (i.e., 72GB 15K SAS drives) or
worse (short-stroked small drives), the SSDs are actually faster and
cheaper per GB. 20TB of RAID-10 with 72GB drives is 550 drives, or $105K
(just for the drives, not counting JBOD enclosures, racks, etc).
Short-stroking would take 1000+ drives. I strongly suspect you do not want
to do this.
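
Lining those options up per usable GB (the prices are my ballparks from
above, so treat the comparison as rough):

    # Cost comparison for ~20TB of "fast" usable space.
    usable_gb = 20000
    options = {
        "SSD raid-10 at $6/GB": 6.00 * usable_gb,   # ~$120K
        "66 x 600GB 10K SAS":   30000.0,
        "550 x 72GB 15K SAS":   105000.0,
    }
    for name, cost in options.items():
        print(name, cost, round(cost / usable_gb, 2))   # total $ and $/usable GB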

In terms of Linux, pretty much any stock distribution will work.
After all, you are just talking about SMB or NFS exports. Not exactly
rocket science.

In terms of hardware, buy good disk controllers and good SAS
expanders. SuperMicro is a good brand for motherboards and white box
chassis. The LSI 8 channel 6gbit SAS PCIe card is a favorite as a
dumb disk controller. The SuperMicro backplanes have LSI SAS expander
chips and work well.

The network is the easiest part. Buy a decent dual-port 10GigE
adapter and two 24-port GigE switches with 10GigE uplink ports. You
will max out at about 1.2 GBytes/sec on the network but should be able
to keep the GigE channels very busy.
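
The arithmetic behind "keep the GigE channels very busy" (the client count
is from the original post; 1.2 GB/s is my practical ceiling for the
10GigE uplink):

    # Where the network bottleneck sits.
    server_gb_s = 1.2                    # practical 10GigE uplink ceiling
    clients = 48                         # ~40-50 gigabit-connected compute nodes
    client_gb_s = 0.125                  # ~125 MB/s per GigE client at wire speed
    print(clients * client_gb_s)         # ~6 GB/s demand if every client reads at once
    print(server_gb_s / clients * 1000)  # ~25 MB/s per client when all are busy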

Then you get to test, test, test.

Good Luck

Doug Dumitru
EasyCo LLC

Re: high throughput storage server?

on 15.02.2011 05:44:34 by Matt Garman

On Mon, Feb 14, 2011 at 06:06:43PM -0800, Doug Dumitru wrote:
> You have a whole slew of questions to answer before you can decide
> on a design. This is true if you build it yourself or decide to
> go with a vendor and buy a supported server. If you do go with a
> vendor, the odds are actually quite good you will end up with
> Linux anyway.

I kind of assumed/wondered if the vendor-supplied systems didn't run
Linux behind the scenes anyway.

> You state a need for 20TB of storage split among 40-50 "users".
> Your description implies that this space needs to be shared at the
> file level. This means you are building a NAS (Network Attached
> Storage), not a SAN (Storage Area Network). SANs typically export
> block devices over protocols like iSCSI. These block devices are
> non-sharable (ie, only a single client can mount them (at least
> read/write) at a time.

Is that the only distinction between SAN and NAS? (Honest
question, not rhetorical.)

> So, 20TB of NAS. Not really that hard to build. Next, you need
> to look at the space itself. Is this all unique data, or is there
> an opportunity for "data deduplication". Some filesystems (ZFS)
> and some block solutions can actively spot blocks that are
> duplicates and only store a single copy. With some applications
> (like virtual servers all running the same OS), this can result in
> de-dupe ratios of 20:1. If your application is like this, your
> 20TB might only be 1-2 TB. I suspect this is not the case based
> on your description.

Unfortunately, no, there is no duplication. Basically, we have a
bunch of files that are generated via another big collection of
servers scattered throughout different data centers. These files
are "harvested" daily (i.e. copied back to the big store in our
office for the analysis I've mentioned).

> Next, is the space all the same. Perhaps some of it is "active"
> and some of it is archival. If you need 4TB of "fast" storage and
> ...
> well. You can probably build this for around $5K (or maybe a bit
> less) including a 10GigE adapter and server class components.

The whole system needs to be "fast".

Actually, to give more detail, we currently have a simple system I
built for backup/slow access. This is exactly what you described, a
bunch of big, slow disks. Lots of space, lousy I/O performance, but
plenty adequate for backup purposes.

As of right now, we actually have about a dozen "users", i.e.
compute servers. The collection is basically a home-grown compute
farm. Each server has a gigabit ethernet connection, and 1 TB of
RAID-1 spinning disk storage. Each server mounts every other
server via NFS, and the current data is distributed evenly across
all systems.

So, loosely speaking, right now we have roughly 10 TB of
"live"/"fast" data available at 1 to 10 gbps, depending on how you
look at it.

While we only have about a dozen servers now, we have definitely
identified growing this compute farm about 4x (to 40--50 servers)
within the next year. But the storage capacity requirements
shouldn't change too terribly much. The 20 TB number was basically
thrown out there as a "it would be nice to have 2x the live
storage".

I'll also add that this NAS needs to be optimized for *read*
throughput. As I mentioned, the only real write process is the
daily "harvesting" of the data files. Those are copied across
long-haul leased lines, and the copy process isn't really
performance sensitive. In other words, in day-to-day use, those
40--50 client machines will do 100% reading from the NAS.

> If you need IOPS (IO Operations Per Second), you are looking at
> SSDs. You can build 20TB of pure SSD space. If you do it
> yourself raid-10, expect to pay around $6/GB or $120K just for
> drives. 18TB will fit in a 4U chassis (see the 72 drive
> SuperMicro double-sided 4U). 72 500GB drives later and you have
> 18,000 GB of space. Not cheap, but if you quote a system from
> NetApp or EMC it will seem so.

Hmm. That does seem high, but that would be a beast of a system.
And I have to add, I'd love to build something like that!

> If you can cut the "fast" size down to 2-4TBs, SSDs become a lot
> more realistic with commercial systems from new companies like
> WhipTail for way under $100K.
>
> If you go with hard drives, you are trading speed for space. With
> 600GB 10K drives would need 66 drives raid-10. Multi-threaded, this
> would read at around 10K IOPS and write at around 7K for "small"
> blocks (4-8K). Linear IO would be wicked fast but random OPs slow you
> down. Conversly, large SSDs arrays can routinely hit > 400K reads and
> > 200K writes if built correctly. Just the 66 hard drives will run
> you $30K. These are SAS drives, not WD Velociraptors which would save
> you 30%.
>
> If you opt for "lots of small drives" (ie, 72GB 15K SAS drives) or
> worse (short seek small drives), the SSDs are actually faster and
> cheaper per GB. 20TB of raid-10 72GB drives is 550 drives or $105K
> (just for the drives, not counting jbod enclosures, racks, etc).
> Short seeking would be 1000+ drives. I highly expect you do not want
> to do this.

No. :) 72 SSDs sounds like fun; 550 spinning disks sound dreadful.
I have a feeling I'd probably have to keep a significant number
on-hand as spares, as I predict drive failures would probably be a
weekly occurrence.

Thank you for the detailed and thoughtful answers! Definitely very
helpful.

Take care,
Matt


Re: high throughput storage server?

on 15.02.2011 06:49:51 by hansBKK

I highly recommend taking a look at Openfiler; it's pretty simple to set up
and very flexible, really just a stabilized/tested "appliance" built
on Linux/FOSS tools. Then your choices come down to what
top-of-the-line hardware you'd like to buy...

With the money you'd save from not going COTS, you could build two of
them and create high-availability mirrored servers with DRBD/heartbeat
for extra redundancy/fault-tolerance. And pre-pay for a full lifetime
of support, if that gives you and the company an extra level of
comfort. And still have a nice chunk of budget left over for UPSs,
backup hardware, network capacity expansion etc.

Re: high throughput storage server?

on 15.02.2011 10:43:56 by David Brown

On 15/02/2011 05:44, Matt Garman wrote:
> On Mon, Feb 14, 2011 at 06:06:43PM -0800, Doug Dumitru wrote:
>
> I'll also add that this NAS needs to be optimized for *read*
> throughput. As I mentioned, the only real write process is the
> daily "harvesting" of the data files. Those are copied across
> long-haul leased lines, and the copy process isn't really
> performance sensitive. In other words, in day-to-day use, those
> 40--50 client machines will do 100% reading from the NAS.
>

If you are not too bothered about write performance, I'd put a fair
amount of the budget into ram rather than just disk performance. When
you've got the ram space to make sure small reads are mostly cached, the
main bottleneck will be sequential reads - and big hard disks handle
sequential reads as fast as expensive SSDs.

>
> No. :) 72 SSDs sounds like fun; 550 spinning disks sound dreadful.
> I have a feeling I'd probably have to keep a significant number
> on-hand as spares, as I predict drive failures would probably be a
> weekly occurance.
>

Don't forget to include running costs in this - 72 SSDs use a lot less
power than 550 hard disks.


Re: high throughput storage server?

on 15.02.2011 13:29:12 by Stan Hoeppner

Matt Garman put forth on 2/14/2011 5:59 PM:

> The requirement is basically this: around 40 to 50 compute machines
> act as basically an ad-hoc scientific compute/simulation/analysis
> cluster. These machines all need access to a shared 20 TB pool of
> storage. Each compute machine has a gigabit network connection, and
> it's possible that nearly every machine could simultaneously try to
> access a large (100 to 1000 MB) file in the storage pool. In other
> words, a 20 TB file store with bandwidth upwards of 50 Gbps.

If your description of the requirement is accurate, then what you need is a
_reliable_ high performance NFS server backed by many large/fast spindles.

> I was wondering if anyone on the list has built something similar to
> this using off-the-shelf hardware (and Linux of course)?

My thoughtful, considered, recommendation would be to stay away from a DIY build
for the requirement you describe, and stay away from mdraid as well, but not
because mdraid isn't up to the task. I get the feeling you don't fully grasp
some of the consequences of a less than expert level mdraid admin being
responsible for such a system after it's in production. If multiple drives are
kicked off line simultaneously (posts of such seem to occur multiple times/week
here), downing the array, are you capable of bringing it back online intact,
successfully, without outside assistance, in a short period of time? If you
lose the entire array due to a typo'd mdadm parm, then what?

You haven't described a hobby level system here, one which you can fix at your
leisure. You've described a large, expensive, production caliber storage
resource used for scientific discovery. You need to perform one very serious
gut check, and be damn sure you're prepared to successfully manage such a large,
apparently important, mdraid array when things go to the South pole in a
heartbeat. Do the NFS server yourself, as mistakes there are more forgiving
than mistakes at the array level. Thus, I'd recommend the following. And as
you can tell from the length of it, I put some careful consideration (and time)
into whipping this up.

Get one HP ProLiant DL 385 G7 eight core AMD Magny Cours server for deployment
as your 64 bit Linux NFS server ($2500):
http://www.newegg.com/Product/Product.aspx?Item=N82E16859105806

Eight 2.3GHz cores is actually overkill for this NFS server, but this box has
the right combination of price and other features you need. The standard box
comes with 2x2GB DIMMs, using only 2 of the 4 channels of the occupied G34
socket, and 4GB is a tad short of what you'll need. So toss the installed DIMMs
and buy this HP certified 4 channel 16GB kit directly from Kingston ($400):
http://www.ec.kingston.com/ecom/configurator_new/partsinfo.asp?root=us&LinkBack=http://www.kingston.com&ktcpartno=KTH-PL313K4/16G

This box has 4 GbE ports, which will give you max NFS throughput of ~600-800
MB/s bidirectional, roughly 1/3rd to half the storage system bandwidth (see
below). Link aggregation with the switch will help with efficiency. Set jumbo
frames across all the systems and switches obviously, MTU of 9000, or the lowest
common denominator, regardless of which NIC solution you end up with. If that's
not enough b/w...

Add one of these PCI Express 2.0 x8 10 GbE copper Intel NICs ($600) to bump max
NFS throughput to ~1.5-2 GB/s bidirectional (assuming your switch has a copper
10 GbE port):
http://www.newegg.com/Product/Product.aspx?Item=N82E16833106043
Due to using NFS+TCP/UDP as your protocols, this NIC can't outrun the FC back
end, even though the raw signaling rate is slightly higher. However, if you
fired up 10-12 simultaneous FTP gets you'd come really close.
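
For the curious, those throughput figures are just wire-rate math with a
protocol-efficiency fudge factor; the 60-80% range below is my assumption,
not a measurement:

    # Rough NFS payload estimates behind the 600-800 MB/s and 1.5-2 GB/s figures.
    def nfs_payload_mb_s(link_gbit, links=1, efficiency=(0.6, 0.8)):
        wire_mb_s = link_gbit * links * 1000 / 8   # one direction, in MB/s
        return tuple(round(wire_mb_s * e) for e in efficiency)

    print(nfs_payload_mb_s(1, links=4))  # ~(300, 400) each way -> 600-800 MB/s bidirectional
    print(nfs_payload_mb_s(10))          # ~(750, 1000) each way -> 1.5-2 GB/s bidirectional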

Two of these for boot drives ($600):
http://www.newegg.com/Product/Product.aspx?Item=N82E16822332060
Mirror them with the onboard 256MB SmartArray BBWC RAID controller

Qlogic PCIe 2.0 x4/x8 8Gbit FC HBA ($1100):
http://www.newegg.com/Product/Product.aspx?Item=N82E16833380014&cm_re=qlogic-_-33-380-014-_-Product

for connecting to the important part ($20-40K USD):
http://www.nexsan.com/satabeast.php

42 drives in a single 4U chassis, one RAID group or many, up to 254 LUNs or just
one, awesome capacity and performance for the price. To keep costs down yet
performance high, you'll want the 8Gbit FC single controller model with 2GB
cache (standard) and with qty 42* 1TB 7.2K rpm SATA drives. All drives use a
firmware revision tested and certified by Nexsan for use with their controllers
so you won't have problems with drives being randomly kicked offline, etc. This
is an enterprise class SAN controller. (Do some research and look at Nexsan's
customers and what they're using these things for. Caltech dumps the data from
the Spitzer space telescope to a group of 50-60 of these SATABeasts).

A SATABeast with 42 * 1TB drives should run in the ballpark of $20-40K USD
depending on the reseller and your organization status (EDU, non profit,
government, etc). Nexsan has resellers covering the entire Americas and Europe.
If you need to expand in the future, Nexsan offers the NXS-B60E expansion
chassis
(http://www.nexstor.co.uk/prod_pdfs/NXS-B60E_Nexsan%20NXS-B60E%20Datasheet.pdf)
which holds 60 disks and plugs into the SATABeast with redundant multilane SAS
cables, allowing up to 102 drives in 8U of rack space, 204TB total using 2TB
drives, or any combination in between. The NXS-B60E adds no additional
bandwidth to the system. Thus, if you need more speed and space, buy a second
SATABeast and another FC card, or replace the single port FC card with a dual
port model (or buy the dual port up front)

With the full 42 drive chassis configured as a 40 disk RAID 10 (2 spares) you'll
get 20TB usable space and you'll easily peak the 8GBit FC interface in both
directions simultaneously. Aggregate random non-cached IOPS will peak at around
3000, cached at 50,000. The bandwidth figures may seem low to people used to
"testing" md arrays with hdparm or dd and seeing figures of 500MB/s to 1GB/s
with only a handful of disks, however these are usually sequential _read_
figures only, on RAID 6 arrays, which have write performance often 2-3 times
lower. In the real world, 1.6GB/s of sustained bidirectional random I/O
throughput while servicing dozens or hundreds of hosts is pretty phenomenal
performance, especially in this price range. The NFS server will most likely be
the bottleneck though, not this storage, definitely so if 4 bonded GbE
interfaces are used for NFS serving instead of the 10 GbE NIC.
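
The ~3000 non-cached IOPS figure is essentially spindle count times a
per-drive rule of thumb; 75 random IOPS for a 7.2K SATA drive is an
assumption, not a Nexsan spec:

    # Rule-of-thumb random IOPS for the 40-drive RAID 10.
    drives = 40
    iops_per_drive = 75             # typical 7.2K RPM SATA ballpark (assumed)
    print(drives * iops_per_drive)  # ~3000 random IOPS, uncached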

The hardware for this should run you well less than $50K USD for everything.
I'd highly recommend you create a single 40 drive RAID 10 array, as I mentioned
above with 2 spares, if you need performance as much as, if not more than,
capacity--especially write performance. A 40 drive RAID 10 on this SATABeast
will give you performance almost identical to a 20 disk RAID 0 stripe. If you
need additional capacity more than speed, configure 40 drives as a RAID 6. The
read performance will be similar, although the write performance will take a big
dive with 40 drives and dual parity.
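
The "big dive" is the usual RAID write penalty. Using the textbook penalty
factors (2 I/Os per small write for RAID 10, 6 for RAID 6; these are
approximations, not measurements of this box):

    # Small random write capability under the standard RAID write penalties.
    def write_iops(drives, iops_per_drive, penalty):
        return drives * iops_per_drive // penalty

    print(write_iops(40, 75, 2))  # RAID 10: ~1500 small random writes/s
    print(write_iops(40, 75, 6))  # RAID 6:  ~500 small random writes/s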

Configure 90-95% of the array as one logical drive and save the other 5-10% for
a rainy day--you'll be glad you did. Export the logical drive as a single LUN.
Format that LUN as XFS. Visit the XFS mailing list and ask for instructions on
how best to format and mount it. Use the most recent Linux kernel available,
2.6.37 or later, depending on when you actually build the NFS server--2.6.38/39
if they're stable. If you get Linux+XFS+NFS configured and running optimally,
you should be more than impressed and satisfied with the performance and
reliability of this combined system.
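
One of the first things the XFS folks will ask about is stripe alignment.
A sketch of the arithmetic, assuming a hypothetical 64 KiB controller chunk
size (check what the SATABeast is actually configured to use):

    # Stripe alignment numbers for mkfs.xfs / mount, under assumed geometry.
    chunk_kib = 64                  # assumed controller chunk (stripe unit) size
    data_spindles = 20              # 40-drive RAID 10 -> 20 data-bearing stripes
    sunit = chunk_kib * 2           # the mount options sunit/swidth are in 512-byte sectors
    swidth = sunit * data_spindles
    print(sunit, swidth)            # 128, 2560, i.e. su=64k and sw=20 in mkfs.xfs terms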

I don't work for any of the companies whose products are mentioned above. I'm
merely a satisfied customer of all of them. The Nexsan products have the lowest
price/TB of any SAN storage products on the market, and the highest
performance/dollar, and lowest price per watt of power consumption. They're
easy as cake to setup and manage with a nice GUI web interface over an ethernet
management port.

Hope you find this information useful. Feel free to contact me directly if I
can be of further assistance.

--
Stan

Re: high throughput storage server?

on 15.02.2011 13:45:01 by Roberto Spadim

if you want a hobby server, an old computer with many pci-express slots,
many sata2 boards, and mdadm will do the job; no problem on speed
the common bottlenecks:

1) disk speed for sequential read/write
2) disk speed for non-sequential read/write
3) disk channel (SATA/SAS/other)
4) pci-express/pci/isa/other channel speed
5) ram memory speed
6) cpu use

note that the buffer on disk controllers only helps with read speed; if
you want more read speed, add more ram (file system cache) or
controller cache
another solution for big speed is ssd (for read/write it's near a
fixed speed rate); use raid0 when possible, raid1 just for mirroring
(it's not a speed improvement for writes, since the write is done at the
rate of the slowest disk; raid1 reads can work near raid0 speed
if using hard disks)

--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: high throughput storage server?

on 15.02.2011 14:03:16 by Roberto Spadim

disks are good for sequential access
for non-sequential access ssds are better (the sequential access rate for
an ssd is the same as its non-sequential access rate)

in my tests the best disk i used (15000rpm SAS 6Gb 146GB) gets a
sequential read of 160MB/s (for random it's slower)
an OCZ Vertex 2 SSD SATA2 (near USD 200 for 128GB) gets a min of 190MB/s
and a max of 270MB/s for random or sequential reads (maybe a disk isn't a
good option today... the cost of ssd isn't a problem today, i'm using
a Vertex 2 on one production server and the speed is really good)

the solution to get more speed today is raid0 (or another striped raid solution)
why? check this example:

reading sectors 1 to 10
using raid0, 2 hard disks, striping per sector

what the read does today:

considering both disks at position=0
read sector 1
disk1 reads, new position=1 (no access time, since sector 1 = disk 1
position 0)
read sector 2
disk2 reads, new position=1 (no access time, since sector 2 = disk 2
position 0)
read sector 3
disk1 reads, new position=2 (no access time, since sector 3 = disk 1
position 1)
...

that's why you get 2x the read speed for a hard disk raid0 sequential
read: the access time is very small for raid0 on a sequential read. if
you use random access you will have a bigger access time since the disk
must change head position, while with a sequential read the position
isn't changed much
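
the same mapping in a few lines of python, if it helps (two disks,
striped per sector):

    # sector -> (disk, position) for a 2-disk raid0 striped per sector
    def raid0_location(sector, disks=2):
        return ("disk%d" % ((sector - 1) % disks + 1), (sector - 1) // disks)

    for s in range(1, 7):
        print(s, raid0_location(s))  # sector 1 -> disk1 pos 0, sector 2 -> disk2 pos 0, ...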

with raid1 using hard disks you can't get the same speed as raid0
striping, since sector 2 is at position 2 on every disk; that's why
today's raid1 read_balance uses a near-head algorithm, and if it can
use only one disk it will use just one disk

if you want to try another read balance for raid1, i'm testing (benchmarking) it at:
www.spadim.com.br/raid1/

when i get good benchmarks i will send it to Neil to test and try to
get it adopted in the next md version
if you could help with benchmarks =) you are welcome =)
there are many scenarios where a different read_balance is better than near_head
all solutions can use any read_balance
the time-based one is good for anyone, the problem is the number of
parameters to configure it
the round robin is good for ssd since access time is the same for
random or sequential reads
the stripe one is a round robin solution but i didn't see any performance
improvement with it
the near head is good with hard disks

--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: high throughput storage server?

on 15.02.2011 14:39:14 by David Brown

On 15/02/2011 13:29, Stan Hoeppner wrote:
> [...]
> My thoughtful, considered, recommendation would be to stay away from a DIY build
> for the requirement you describe, and stay away from mdraid as well, but not
> because mdraid isn't up to the task. I get the feeling you don't fully grasp
> some of the consequences of a less than expert level mdraid admin being
> responsible for such a system after it's in production. If multiple drives are
> kicked off line simultaneously (posts of such seem to occur multiple times/week
> here), downing the array, are you capable of bringing it back online intact,
> successfully, without outside assistance, in a short period of time? If you
> lose the entire array due to a typo'd mdadm parm, then what?
>

This brings up an important point - no matter what sort of system you
get (home made, mdadm raid, or whatever) you will want to do some tests
and drills at replacing failed drives. Also make sure everything is
well documented, and well labelled. When mdadm sends you an email
telling you drive sdx has failed, you want to be /very/ sure you know
which drive is sdx before you take it out!



You also want to consider your raid setup carefully. RAID 10 has been
mentioned here several times - it is often a good choice, but not
necessarily. RAID 10 gives you fast recovery, and can at best survive a
loss of half your disks - but at worst a loss of two disks will bring
down the whole set. It is also very inefficient in space. If you use
SSDs, it may not be worth double the price to have RAID 10. If you use
hard disks, it may not be sufficient safety.

I haven't built a raid of anything like this size, so my comments here
are only based on my imperfect understanding of the theory - I'm
learning too.

RAID 10 has the advantage of good speed at reading (close to RAID 0
speeds), at the cost of poorer write speed and poor space efficiency.
RAID 5 and RAID 6 are space efficient, and fast for most purposes, but
slow for rebuilds and slow for small writes.

You are not much bothered about write performance, and most of your
writes are large anyway.

How about building the array as a two-tier RAID 6+5 setup? Take 7 x 1TB
disks as a RAID 6 for 5 TB space. Five sets of these as RAID 5 gives
you your 20 TB in 35 drives. This will survive any four failed disks,
or more depending on the combinations. If you are careful how it is
arranged, it will also survive a failing controller card.
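
The capacity arithmetic for that layout, just to check it adds up:

    # Nested RAID 6+5: 7 x 1TB drives per RAID-6 leg, five legs in an outer RAID-5.
    leg_drives, drive_tb, legs = 7, 1.0, 5
    leg_usable = (leg_drives - 2) * drive_tb  # each RAID-6 leg: 5 TB usable
    total_usable = (legs - 1) * leg_usable    # the outer RAID-5 gives up one leg's worth
    print(total_usable, legs * leg_drives)    # 20.0 TB from 35 drives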

If a disk fails, you could remove that whole set from the outer array
(which should have a write intent bitmap) - then the rebuild will go at
maximal speed, while the outer array's speed will not be so badly
affected. Once the rebuild is complete, put it back in the outer array.
Since you are not doing many writes, it will not take long to catch up.

It is probably worth having a small array of SSDs (RAID1 or RAID10) to
hold the write intent bitmap, the journal for your main file system, and
of course your OS. Maybe one of these absurdly fast PCI Express flash
disks would be a good choice.




Re: high throughput storage server?

on 15.02.2011 14:48:05 by Zdenek Kaspar

On 15.2.2011 0:59, Matt Garman wrote:
> The requirement is basically this: around 40 to 50 compute machines
> act as basically an ad-hoc scientific compute/simulation/analysis
> cluster.  These machines all need access to a shared 20 TB pool of
> storage.  Each compute machine has a gigabit network connection, and
> it's possible that nearly every machine could simultaneously try to
> access a large (100 to 1000 MB) file in the storage pool.  In other
> words, a 20 TB file store with bandwidth upwards of 50 Gbps.
> [...]

If you really need to handle 50 Gbit/s of storage traffic, then it's not so
easy to do as a hobby project. For a good price you probably want multiple
machines with lots of hard drives and interconnects..

It might also be worth asking here:
Newsgroups: gmane.comp.clustering.beowulf.general

HTH, Z.


Re: high throughput storage server?

am 15.02.2011 15:29:16 von Roberto Spadim

First, run memtest86 (if you use an x86 CPU) and check your RAM speed.
My HP (an ML350 G5, very old: 2005) gets 2500 MB/s (~20 Gbit/s).

Maybe RAM is a bottleneck for 50 Gbit/s... you will need a multi-computer
RAID, or to stripe file-access operations across machines (database on one
machine, the OS on another...).

For hobby use = SATA2 disks, USD 50 disks of 1 TB at 50 MB/s.
Today's state of the art, in 'my world', is: http://www.ramsan.com/products/3


2011/2/15 Zdenek Kaspar :
> [Zdenek's reply, including the full quote of the original message, snipped]



--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: high throughput storage server?

am 15.02.2011 15:51:42 von a.krijgsman

Just ran memcheck two weeks ago.

If you run your memory in triple-channel mode you get 10 GByte (!) per second.
(This is memory from 2010 ;-) 1333 MHz.)
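For rough context, the theoretical peak of triple-channel DDR3-1333 works out
to 3 channels x 1333 MT/s x 8 bytes per transfer (illustrative arithmetic, not
a measurement of any particular box):

  $ echo $(( 3 * 1333 * 8 ))      # in MB/s
  31992

so roughly 32 GB/s theoretical against the ~10 GB/s measured, and even
10 GB/s is only ~80 Gbit/s before any copying overhead.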


-----Original Message-----
From: Roberto Spadim
Sent: Tuesday, February 15, 2011 3:29 PM
To: Zdenek Kaspar
Cc: linux-raid@vger.kernel.org
Subject: Re: high throughput storage server?

[Roberto's 15:29 message, including its quoted text, snipped]

Re: high throughput storage server?

am 15.02.2011 15:56:07 von Zdenek Kaspar

On 15.2.2011 15:29, Roberto Spadim wrote:
> [...]
> Today's state of the art, in 'my world', is: http://www.ramsan.com/products/3

I doubt 20 TB of SLC flash that will survive huge abuse (writes) is the
low-cost solution the OP wants to build himself..

or 20 TB of RAM, omg..

Z.



Re: high throughput storage server?

am 15.02.2011 16:16:15 von Joe Landman

[disclosure: vendor posting, ignore if you wish, vendor html link at
bottom of message]

On 02/14/2011 11:44 PM, Matt Garman wrote:
> On Mon, Feb 14, 2011 at 06:06:43PM -0800, Doug Dumitru wrote:
>> You have a whole slew of questions to answer before you can decide
>> on a design. This is true if you build it yourself or decide to
>> go with a vendor and buy a supported server. If you do go with a
>> vendor, the odds are actually quite good you will end up with
>> Linux anyway.
>
> I kind of assumed/wondered if the vendor-supplied systems didn't run
> Linux behind the scenes anyway.

We've been using Linux as the basis for our storage systems.
Occasionally there are other OSes required by customers, but for the
most part, Linux is the preferred platform.

[...]

>> Next, is the space all the same. Perhaps some of it is "active"
>> and some of it is archival. If you need 4TB of "fast" storage and
>> ...
>> well. You can probably build this for around $5K (or maybe a bit
>> less) including a 10GigE adapter and server class components.
>
> The whole system needs to be "fast".

Ok ... sounds strange, but ...

Define what you mean by "fast". Seriously ... we've had people tell us
about their "huge" storage needs that we can easily fit onto a single
small unit, no storage cluster needed. We've had people say "fast" when
they mean "able to keep 1 GbE port busy".

Fast needs to be articulated really in terms of what you will do with
it. As you noted in this and other messages, you are scaling up from 10
compute nodes to 40 compute nodes. 4x change in demand, and I am
guessing bandwidth (if these are large files you are streaming) or IOPs
(if these are many small files you are reading). Small and large here
would mean less than 64kB for small, and greater than 4MB for large.


> Actually, to give more detail, we currently have a simple system I
> built for backup/slow access. This is exactly what you described, a
> bunch of big, slow disks. Lots of space, lousy I/O performance, but
> plenty adequate for backup purposes.

Your choice is simple. Build or buy. Many folks have made suggestions,
and some are pretty reasonable, though a pure SSD or Flash based
machine, while doable (and we sell these), is quite unlikely to be close
to the realities of your budget. There are use cases for which this
does make sense, but the costs are quite prohibitive for all but a few
users.

> As of right now, we actually have about a dozen "users", i.e.
> compute servers. The collection is basically a home-grown compute
> farm. Each server has a gigabit ethernet connection, and 1 TB of
> RAID-1 spinning disk storage. Each server mounts every other
> server via NFS, and the current data is distributed evenly across
> all systems.

Ok ... this isn't something that's great to manage. I might suggest
looking at GlusterFS for this. You can aggregate and distribute your
data, and even build in some resiliency if you wish/need. GlusterFS 3.1.2
is open source, so you can deploy fairly easily.
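As a sketch of what that aggregation could look like with the GlusterFS CLI
(hostnames and brick paths below are made up; check the exact syntax against
the 3.1.x documentation):

  # Combine one brick from each of four servers into a single distributed volume
  gluster volume create shared transport tcp \
      node1:/data/brick node2:/data/brick node3:/data/brick node4:/data/brick
  gluster volume start shared

  # Each compute node then mounts the aggregate namespace
  mount -t glusterfs node1:/shared /mnt/shared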

>
> So, loosely speaking, right now we have roughly 10 TB of
> "live"/"fast" data available at 1 to 10 gbps, depending on how you
> look at it.
>
> While we only have about a dozen servers now, we have definitely
> identified growing this compute farm about 4x (to 40--50 servers)
> within the next year. But the storage capacity requirements
> shouldn't change too terribly much. The 20 TB number was basically
> thrown out there as a "it would be nice to have 2x the live
> storage".

Without building a storage unit, you could (in concept) use GlusterFS
for this. In practice, this model gets harder and harder to manage as
you increase the number of nodes. Adding the (N+1)th node means you have
N+1 nodes to modify and manage storage on. This does not scale well at all.

>
> I'll also add that this NAS needs to be optimized for *read*
> throughput. As I mentioned, the only real write process is the
> daily "harvesting" of the data files. Those are copied across
> long-haul leased lines, and the copy process isn't really
> performance sensitive. In other words, in day-to-day use, those
> 40--50 client machines will do 100% reading from the NAS.

Ok.

This isn't a commercial. I'll keep this part short.

We've built systems like this which sustain north of 10GB/s (big B not
little b) for concurrent read and write access from thousands of cores.
20TB (and 40TB) are on the ... small ... side for this, but it is very
doable.

As a tie in to the Linux RAID list, we use md raid for our OS drives
(SSD pairs), and other utility functions within the unit, as well as
striping over our hardware accelerated RAIDs. We would like to use
non-power of two chunk sizes, but haven't delved into the code as much
as we'd like to see if we can make this work.

As a rule, we find mdadm to be an excellent tool, and the whole md RAID
system to be quite good. We may spend time at some point on figuring
out what's wrong with the multi-threaded raid456 bit (it allocated 200+
kernel threads last I played with it), but apart from bits like that, we
do find it very good for production use. It isn't as fast as some
dedicated accelerated RAID hardware (though we have our md + kernel
stack very well tuned, so some of our software RAIDs are faster than many
of our competitors' hardware RAIDs).

You could build a fairly competent unit using md RAID.
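As a rough illustration of the striping-over-hardware-RAID layout described
above, an md RAID 0 across two controller-exported LUNs could look like this
(device names and the chunk size are assumptions for the example):

  # /dev/sdb and /dev/sdc are LUNs presented by hardware RAID controllers
  mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=256 /dev/sdb /dev/sdc
  mkfs.xfs /dev/md0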

It all gets back to build versus buy. In either case, I'd recommend
grabbing a copy of dstat (http://dag.wieers.com/home-made/dstat/) and
watching your IO/network system throughput. I am assuming 1 GbE
switches as the basis for your cluster. I assume this will not change.
The cost of your time/effort and any opportunity cost and productivity
loss should also be accounted for in the cost-benefit analysis. That
is, if it costs you less overall to buy than to build, should you build
anyway? Generally no, but some people simply want the experience.
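For the dstat suggestion, something as simple as the following shows CPU, disk
and network throughput side by side, updating every five seconds:

  dstat -cdn 5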

A big issue you need to be aware of with md raid is the hotswap problem.
Your SATA link needs to allow you to pull a drive out without crashing
the machine. Many of the on-motherboard SATA connections we've used
over the years don't tolerate unplugs/plug-ins very well. I'd recommend
at least a reasonable HBA for this that understands hot swap and
handles it correctly (you need hardware and driver level support to
correctly signal the kernel of these events).

If you decide to buy, have a really clear idea of your performance
regime, and a realistic eye towards budget. A 48 TB server with > 2GB/s
streaming performance for TB sized files is very doable, well under $30k
USD. A 48 TB software RAID version would be quite a bit less than that.

Good luck with this, and let us know what you do.

vendor html link: http://scalableinformatics.com , our storage clusters
http://scalableinformatics.com/sicluster

Re: high throughput storage server?

am 15.02.2011 17:44:43 von Roberto Spadim

10 GByte/s ~ 80 Gbit/s; I don't know if 50 Gbit/s is possible. You also have
OS and CPU time spent reading and writing many things, not just memory
(filesystem cache, etc. etc.), so maybe you can't get this speed with just
80 Gbit/s of memory bandwidth.

2011/2/15 A. Krijgsman :
> [A. Krijgsman's message and its nested quotes snipped]



--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: high throughput storage server?

am 15.02.2011 21:37:56 von NeilBrown

On Tue, 15 Feb 2011 10:16:15 -0500 Joe Landman wrote:

> As a tie in to the Linux RAID list, we use md raid for our OS drives
> (SSD pairs), and other utility functions within the unit, as well as
> striping over our hardware accelerated RAIDs. We would like to use
> non-power of two chunk sizes, but haven't delved into the code as much
> as we'd like to see if we can make this work.
>

md/raid0 (striping) currently supports non-power-of-two chunk sizes, though
it is a relatively recent addition.
(raid4/5/6 doesn't).

Just FYI.

NeilBrown
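As an illustration, requesting a non-power-of-two chunk on a raid0 looks like
the following (device names and the 192 KiB figure are just examples; as comes
up later in the thread, mdadm's own power-of-two check also had to be relaxed,
so a new enough mdadm and a 2.6.31+ kernel are assumed):

  mdadm --create /dev/md0 --level=0 --raid-devices=3 --chunk=192 \
        /dev/sdb /dev/sdc /dev/sdd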


Re: high throughput storage server?

am 15.02.2011 21:47:37 von Joe Landman

On 02/15/2011 03:37 PM, NeilBrown wrote:
> On Tue, 15 Feb 2011 10:16:15 -0500 Joe Landman wrote:
>
>> As a tie in to the Linux RAID list, we use md raid for our OS drives
>> (SSD pairs), and other utility functions within the unit, as well as
>> striping over our hardware accelerated RAIDs. We would like to use
>> non-power of two chunk sizes, but haven't delved into the code as much
>> as we'd like to see if we can make this work.
>>
>
> md/raid0 (striping) currently supports non-power-of-two chunk sizes, though
> it is a relatively recent addition.
> (raid4/5/6 doesn't).

Cool! We need to start playing with this ...

Which kernels have the support?

--
Joe

Re: high throughput storage server?

am 15.02.2011 22:41:36 von NeilBrown

On Tue, 15 Feb 2011 15:47:37 -0500 Joe Landman wrote:

> On 02/15/2011 03:37 PM, NeilBrown wrote:
> > On Tue, 15 Feb 2011 10:16:15 -0500 Joe Landman wrote:
> >
> >> As a tie in to the Linux RAID list, we use md raid for our OS drives
> >> (SSD pairs), and other utility functions within the unit, as well as
> >> striping over our hardware accelerated RAIDs. We would like to use
> >> non-power of two chunk sizes, but haven't delved into the code as much
> >> as we'd like to see if we can make this work.
> >>
> >
> > md/raid0 (striping) currently supports non-power-of-two chunk sizes, though
> > it is a relatively recent addition.
> > (raid4/5/6 doesn't).
>
> Cool! We need to start playing with this ...
>
> Which kernels have the support?

It was enabled by commit fbb704efb784e2c8418e34dc3013af76bdd58101
so

$ git name-rev fbb704efb784e2c8418e34dc3013af76bdd58101
fbb704efb784e2c8418e34dc3013af76bdd58101 tags/v2.6.31-rc1~143^2~18


2.6.31 has this support.

However I note that mdadm still checks that the chunk size is a power of
two (the (chunk-1)&chunk test is only zero for powers of two):
	if (chunk < 8 || ((chunk-1)&chunk)) {

I should fix that...

NeilBrown




Re: high throughput storage server?

am 17.02.2011 00:32:58 von Stan Hoeppner

David Brown put forth on 2/15/2011 7:39 AM:

> This brings up an important point - no matter what sort of system you get (home
> made, mdadm raid, or whatever) you will want to do some tests and drills at
> replacing failed drives. Also make sure everything is well documented, and well
> labelled. When mdadm sends you an email telling you drive sdx has failed, you
> want to be /very/ sure you know which drive is sdx before you take it out!

This is one of the many reasons I recommended an enterprise class vendor
solution. The Nexsan unit can be configured for SMTP and/or SNMP and/or pager
notification. When a drive is taken offline the drive slot is identified in the
GUI. Additionally, the backplane board has power and activity LEDs next to each
drive. When you slide the chassis out of the rack (while still fully
operating), and pull the cover, you will see a distinct blink pattern of the
LEDs next to the failed drive. This is fully described in the documentation,
but even without reading such it'll be crystal clear which drive is down. There
is zero guess work.

The drive replacement testing scenario you describe is unnecessary with the
Nexsan products as well as any enterprise disk array.

> You also want to consider your raid setup carefully. RAID 10 has been mentioned
> here several times - it is often a good choice, but not necessarily. RAID 10
> gives you fast recovery, and can at best survive a loss of half your disks - but
> at worst a loss of two disks will bring down the whole set. It is also very
> inefficient in space. If you use SSDs, it may not be worth double the price to
> have RAID 10. If you use hard disks, it may not be sufficient safety.

RAID level space/cost efficiency from a TCO standpoint is largely irrelevant
today due to the low price of mech drives. Using the SATABeast as an example,
the cost per TB of a 20TB RAID 10 is roughly $1600/TB and a 20TB RAID 6 is about
$1200/TB. Given all the advantages of RAID 10 over RAID 6 the 33% premium is
more than worth it.

--
Stan

Re: high throughput storage server?

am 17.02.2011 01:00:58 von Keld Simonsen

On Wed, Feb 16, 2011 at 05:32:58PM -0600, Stan Hoeppner wrote:
> David Brown put forth on 2/15/2011 7:39 AM:
>
> RAID level space/cost efficiency from a TCO standpoint is largely irrelevant
> today due to the low price of mech drives. Using the SATABeast as an example,
> the cost per TB of a 20TB RAID 10 is roughly $1600/TB and a 20TB RAID 6 is about
> $1200/TB. Given all the advantages of RAID 10 over RAID 6 the 33% premium is
> more than worth it.

I assume that by 20 TB you mean the payload space in both cases, that
is, for the Linux MD RAID10 you actually have 40 TB of raw disk space.
With the Linux MD RAID10 solution you can furthermore enjoy almost
double the IO read speed, since it involves 20 * 2 TB spindles compared
to 12 * 2 TB spindles.

best regards
keld

Re: high throughput storage server?

am 17.02.2011 01:19:41 von Stan Hoeppner

Keld Jørn Simonsen put forth on 2/16/2011 6:00 PM:
> [Keld's reply, quoting Stan's cost figures, snipped]

Enterprise solutions don't use Linux mdraid. The RAID function is built into
the SAN controller. My TCO figures were based on a single controller SATABeast,
42x1TB drives in the RAID 10, and 24x1TB drives in the RAID 6, each
configuration including two spares.
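As a quick sanity check of the usable space behind those drive counts (plain
arithmetic on the configuration as stated, with 1 TB drives):

  # RAID 10: 42 drives minus 2 spares, half lost to mirroring
  $ echo $(( (42 - 2) / 2 ))
  20
  # RAID 6: 24 drives minus 2 spares, minus 2 drives' worth of parity
  $ echo $(( 24 - 2 - 2 ))
  20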

--
Stan

Re: high throughput storage server?

am 17.02.2011 01:26:15 von David Brown

(Sorry for the mixup in sending this by direct email instead of posting
to the list.)

On 17/02/11 00:32, Stan Hoeppner wrote:
> David Brown put forth on 2/15/2011 7:39 AM:
>
>> This brings up an important point - no matter what sort of system you get (home
>> made, mdadm raid, or whatever) you will want to do some tests and drills at
>> replacing failed drives. Also make sure everything is well documented, and well
>> labelled. When mdadm sends you an email telling you drive sdx has failed, you
>> want to be /very/ sure you know which drive is sdx before you take it out!
>
> This is one of the many reasons I recommended an enterprise class vendor
> solution. The Nexsan unit can be configured for SMTP and/or SNMP and/or pager
> notification. When a drive is taken offline the drive slot is identified in the
> GUI. Additionally, the backplane board has power and activity LEDs next to each
> drive. When you slide the chassis out of the rack (while still fully
> operating), and pull the cover, you will see a distinct blink pattern of the
> LEDs next to the failed drive. This is fully described in the documentation,
> but even without reading such it'll be crystal clear which drive is down. There
> is zero guess work.
>
> The drive replacement testing scenario you describe is unnecessary with the
> Nexsan products as well as any enterprise disk array.
>

I'd still like to do a test - you don't want to be surprised at the
wrong moment. The test lets you know everything is working fine, and
gives you a feel of how long it will take, and how easy or difficult it is.
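For an md-based build, that drill can be as simple as deliberately failing a
member of a test array and timing the rebuild (device names here are
placeholders):

  # Mark a member failed, remove it, then add the replacement back
  mdadm /dev/md0 --fail /dev/sdc
  mdadm /dev/md0 --remove /dev/sdc
  # ...swap the physical drive, then...
  mdadm /dev/md0 --add /dev/sdc
  # Watch the resync and note how long it takes
  cat /proc/mdstat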

But I agree there is a lot of benefit in the sort of clear indications
of problems that you get with that sort of hardware rather than with a
home-made system.


>> You also want to consider your raid setup carefully. RAID 10 has been mentioned
>> here several times - it is often a good choice, but not necessarily. RAID 10
>> gives you fast recovery, and can at best survive a loss of half your disks - but
>> at worst a loss of two disks will bring down the whole set. It is also very
>> inefficient in space. If you use SSDs, it may not be worth double the price to
>> have RAID 10. If you use hard disks, it may not be sufficient safety.
>
> RAID level space/cost efficiency from a TCO standpoint is largely irrelevant
> today due to the low price of mech drives. Using the SATABeast as an example,
> the cost per TB of a 20TB RAID 10 is roughly $1600/TB and a 20TB RAID 6 is about
> $1200/TB. Given all the advantages of RAID 10 over RAID 6 the 33% premium is
> more than worth it.
>

>

I don't think it is fair to give general rules like that. In this
particular case, that might be how the sums work out. But in other
cases, using RAID 10 instead of RAID 6 might mean stepping up in chassis
or controller size and costs. Also remember that RAID 10 is not better
than RAID 6 in every way - a RAID 6 array will survive any two failed
drives, while with RAID 10 an unlucky pairing of failed drives will
bring down the whole raid. Different applications require different
balances here.



Re: high throughput storage server?

am 17.02.2011 01:45:02 von Stan Hoeppner

David Brown put forth on 2/16/2011 6:26 PM:

> On 17/02/11 00:32, Stan Hoeppner wrote:

>> RAID level space/cost efficiency from a TCO standpoint is largely irrelevant
>> today due to the low price of mech drives. Using the SATABeast as an example,
>> the cost per TB of a 20TB RAID 10 is roughly $1600/TB and a 20TB RAID 6 is about
>> $1200/TB. Given all the advantages of RAID 10 over RAID 6 the 33% premium is
>> more than worth it.

> I don't think it is fair to give general rules like that. In this particular

The IT press does it every day. CTOs read those articles. In many cases it's
their primary source of information. Speak in terms CTOs (i.e. those holding
the purse) understand.

> case, that might be how the sums work out. But in other cases, using RAID 10
> instead of RAID 6 might mean stepping up in chassis or controller size and
> costs. Also remember that RAID 10 is not better than RAID 6 in every way - a
> RAID 6 array will survive any two failed drives, while with RAID 10 an unlucky
> pairing of failed drives will bring down the whole raid. Different applications
> require different balances here.

I'm not sure about being "fair" but it directly relates to the original question
that started this thread. The OP wanted performance and space with a preference
for performance. This demonstrates he can get the performance for a ~33% cost
premium. He didn't mention a budget limit, only that most vendor figures were
too high.

Also, you're repeating points I've made in this (and other) threads back to me.
Try to keep up David. ;)

--
Stan

Re: high throughput storage server?

am 17.02.2011 03:23:42 von Roberto Spadim

what does 'enterprise' mean?

2011/2/16 Stan Hoeppner :
> [Stan's reply snipped]



--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: high throughput storage server?

am 17.02.2011 04:05:42 von Stan Hoeppner

Roberto Spadim put forth on 2/16/2011 8:23 PM:
> what does 'enterprise' mean?

http://lmgtfy.com/?q=enterprise+storage

--
Stan

Re: high throughput storage server?

am 17.02.2011 11:39:10 von David Brown

On 17/02/2011 01:45, Stan Hoeppner wrote:
> David Brown put forth on 2/16/2011 6:26 PM:
>
>> On 17/02/11 00:32, Stan Hoeppner wrote:
>
>>> RAID level space/cost efficiency from a TCO standpoint is largely irrelevant
>>> today due to the low price of mech drives. Using the SATABeast as an example,
>>> the cost per TB of a 20TB RAID 10 is roughly $1600/TB and a 20TB RAID 6 is about
>>> $1200/TB. Given all the advantages of RAID 10 over RAID 6 the 33% premium is
>>> more than worth it.
>
>> I don't think it is fair to give general rules like that. In this particular
>
> The IT press does it every day. CTOs read those articles. In many cases it's
> their primary source of information. Speak in terms CTOs (i.e. those holding
> the purse) understand.
>

I work at a small company - I get to read the articles, make the
recommendations, and build the servers. So I can put more emphasis on
what I think is technically the best solution for us, rather than what
sounds good in the press. Of course, the other side of the coin is that
being a small company with modest server needs, I don't get to play with
20 TB raid systems!

>> case, that might be how the sums work out. But in other cases, using RAID 10
>> instead of RAID 6 might mean stepping up in chassis or controller size and
>> costs. Also remember that RAID 10 is not better than RAID 6 in every way - a
>> RAID 6 array will survive any two failed drives, while with RAID 10 an unlucky
>> pairing of failed drives will bring down the whole raid. Different applications
>> require different balances here.
>
> I'm not sure about being "fair" but it directly relates to the original question
> that started this thread. The OP wanted performance and space with a preference
> for performance. This demonstrates he can get the performance for a ~33% cost
> premium. He didn't mention a budget limit, only that most vendor figures were
> too high.
>

I agree that RAID 10 sounds like a match for the OP. All I am saying is
that it is not necessarily the best choice in general, and not just
because of the initial purchase price.

> Also, you're repeating points I've made in this (and other) threads back to me.
> Try to keep up David. ;)
>

I'm doing my best! I believe I've got a fair understanding of various
sorts of RAID systems, but I am totally missing real-world experience with
anything more advanced than a four disk setup. Bigger raid setups are
only a hobby interest for me at the moment, so I'm learning as I go
here. And you write such a lot here that it's hard for an amateur to
take it all in :-)


Re: high throughput storage server?

am 17.02.2011 12:07:39 von John Robinson

On 14/02/2011 23:59, Matt Garman wrote:
[...]
> The requirement is basically this: around 40 to 50 compute machines
> act as basically an ad-hoc scientific compute/simulation/analysis
> cluster. These machines all need access to a shared 20 TB pool of
> storage. Each compute machine has a gigabit network connection, and
> it's possible that nearly every machine could simultaneously try to
> access a large (100 to 1000 MB) file in the storage pool. In other
> words, a 20 TB file store with bandwidth upwards of 50 Gbps.

I'd recommend you analyse that requirement more closely. Yes, you have
50 compute machines with GigE connections so it's possible they could
all demand data from the file store at once, but in actual use, would they?

For example, if these machines were each to demand a 100MB file, how
long would they spend computing their results from it? If it's only 1
second, then you would indeed need an aggregate bandwidth of 50Gbps[1].
If it's 20 seconds processing, your filer only needs an aggregate
bandwidth of 2.5Gbps.
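A rough version of that arithmetic, treating each node as consuming one 100 MB
file per 20-second compute cycle:

  # 50 nodes * 100 MB / 20 s, in MB/s of aggregate demand
  $ echo $(( 50 * 100 / 20 ))
  250

i.e. roughly 250 MB/s, or on the order of 2-2.5 Gbit/s depending on how loosely
you convert, which is a very different target from 50 Gbps.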

So I'd recommend you work out first how much data the compute machines
can actually chew through and work up from there, rather than what their
network connections could stream through and work down.

Cheers,

John.

[1] I'm assuming the compute nodes are fetching the data for the next
compute cycle while they're working on this one; if they're not you're
likely making unnecessary demands on your filer while leaving your
compute nodes idle.


Re: high throughput storage server?

am 17.02.2011 14:36:49 von Roberto Spadim

with more network cards = more network Gbps
with better (faster) RAM = more disk reads
with more raid0/4/5/6 = more speed on disk reads
with more raid1 mirrors = more security
with more sas/sata/raid controllers = more GB/TB of storage
with more of anything ~= more money
just know what numbers you want and make it work

2011/2/17 John Robinson :
> [John Robinson's message snipped]



--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: high throughput storage server?

am 17.02.2011 14:54:36 von Roberto Spadim

If building it on only one machine: if you want 50 Gbps, put in six 10 GbE
ports (one more than needed) for network access (you need many PCI Express
slots with 4x (10 Gbps) or 8x (20 Gbps) links).
I use RAID10 for redundancy and speed; you can also do RAID1 for redundancy
and then RAID0/4/5/6 over the RAID1 devices for better speed.

SATA/SAS/RAID controllers? SATA is very cheap and you can use SSDs with a
SATA2 interface; SAS has faster (lower access time) hard disks at 10k/15k rpm.

RAM? More RAM = more cache/buffers, lower disk usage, more read speed.
CPU? I don't know what to use, but it's a big machine; maybe you need a
server motherboard (5 PCI Express slots just for network = big motherboard,
big motherboard = many CPUs). Try with only one 6-core hyperthreaded CPU,
etc.; if it's not enough, put in a second CPU.

Operating system? Linux with md =), it's an md list hehe; maybe NetBSD,
FreeBSD or Windows would work too.
File server? NFS, Samba.
Filesystem? Hmmm, a cluster FS is good here, but a single ext4, XFS or
ReiserFS could work. Is your power good? Do you want journaling?
Redundancy/cluster? Beowulf, OpenMosix, others; Heartbeat, Pacemaker, others.
SQL database? MySQL has NDB for clusters; MyISAM is fast but without some
features, InnoDB is slower with many features, Aria = MyISAM but slower to
write, with a fail-safe feature. Oracle is good but MySQL is low on resource
consumption. Postgres is nice too; maybe your app will tell you what to use.
Network? Many 10 Gbit links with bonding (Linux module) in round robin or
another good (working) load-balance mode.
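A minimal sketch of the round-robin bonding mentioned above (interface names
are assumptions, and most distributions would configure this via their network
config files rather than by hand):

  # Load the bonding driver in round-robin mode with MII link monitoring
  modprobe bonding mode=balance-rr miimon=100
  ip link set bond0 up
  # Enslave two 10 GbE interfaces into the bond and give it an address
  ifenslave bond0 eth2 eth3
  ip addr add 192.168.10.5/24 dev bond0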

2011/2/17 Roberto Spadim :
> [Roberto's earlier message and John's quoted message snipped]



--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: high throughput storage server?

am 17.02.2011 22:47:04 von Stan Hoeppner

John Robinson put forth on 2/17/2011 5:07 AM:
> On 14/02/2011 23:59, Matt Garman wrote:
> [...]
>> The requirement is basically this: around 40 to 50 compute machines
>> act as basically an ad-hoc scientific compute/simulation/analysis
>> cluster. These machines all need access to a shared 20 TB pool of
>> storage. Each compute machine has a gigabit network connection, and
>> it's possible that nearly every machine could simultaneously try to
>> access a large (100 to 1000 MB) file in the storage pool. In other
>> words, a 20 TB file store with bandwidth upwards of 50 Gbps.
>
> I'd recommend you analyse that requirement more closely. Yes, you have
> 50 compute machines with GigE connections so it's possible they could
> all demand data from the file store at once, but in actual use, would they?

This is a very good point and one which I somewhat ignored in my initial
response, making a silent assumption. I did so based on personal
experience, and knowledge of what other sites are deploying.

You don't see many deployed filers on the planet with 5 * 10 GbE front
end connections. In fact, today, you still don't see many deployed
filers with even one 10 GbE front end connection, but usually multiple
(often but not always bonded) GbE connections.

A single 10 GbE front end connection provides a truly enormous amount of
real world bandwidth, over 1 GB/s aggregate sustained. *This is
equivalent to transferring a full length dual layer DVD in 10 seconds*

Few sites/applications actually need this kind of bandwidth, either
burst or sustained. But, this is the system I spec'd for the OP
earlier. Sometimes people get caught up in comparing raw bandwidth
numbers between different platforms and lose sight of the real world
performance they can get from any one of them.

--
Stan

Re: high throughput storage server?

am 17.02.2011 23:13:04 von Joe Landman

On 02/17/2011 04:47 PM, Stan Hoeppner wrote:
> John Robinson put forth on 2/17/2011 5:07 AM:
>> On 14/02/2011 23:59, Matt Garman wrote:
>> [...]
>>> The requirement is basically this: around 40 to 50 compute machines
>>> act as basically an ad-hoc scientific compute/simulation/analysis
>>> cluster. These machines all need access to a shared 20 TB pool of
>>> storage. Each compute machine has a gigabit network connection, and
>>> it's possible that nearly every machine could simultaneously try to
>>> access a large (100 to 1000 MB) file in the storage pool. In other
>>> words, a 20 TB file store with bandwidth upwards of 50 Gbps.
>>
>> I'd recommend you analyse that requirement more closely. Yes, you have
>> 50 compute machines with GigE connections so it's possible they could
>> all demand data from the file store at once, but in actual use, would they?
>
> This is a very good point and one which I somewhat ignored in my initial
> response, making a silent assumption. I did so based on personal
> experience, and knowledge of what other sites are deploying.

Well, the application area appears to be high performance cluster
computing, and the storage behind it. It's a somewhat more specialized
version of storage, and not one that a typical IT person runs into
often. There are different, some profoundly so, demands placed upon
such storage.

Full disclosure: this is our major market; we make/sell products in
this space, and have for a while. Take what we say with that caveat in
mind, as it does color our opinions.

The spec as stated, 50Gb/s ... it's rare ... exceptionally rare ...
that you ever see cluster computing storage requirements stated in such
terms. Usually they are stated in the MB/s or GB/s regime. Using a
basic conversion of Gb/s to GB/s, the OP is looking for ~6GB/s support.

Some basic facts about this.

Fibre Channel (FC-8 in particular) will give you at best 1 GB/s per
loop, and that presumes you aren't oversubscribing the loop. The vast
majority of designs we see coming from IT shops do, in fact, badly
oversubscribe the bandwidth, which causes significant contention on the
loops. The Nexsan unit you indicated (they are nominally a competitor
of ours) is an FC device, though we've heard rumblings that they may
even allow for SAS direct connections (though that would be quite cost
ineffective as a SAS JBOD chassis compared to other units, and you still
have the oversubscription problem).

As I said, high performance storage design is a very ... very ...
different animal from standard IT storage design. There are very
different decision points, and design concepts.

> You don't see many deployed filers on the planet with 5 * 10 GbE front
> end connections. In fact, today, you still don't see many deployed
> filers with even one 10 GbE front end connection, but usually multiple
> (often but not always bonded) GbE connections.

In this space, high performance cluster storage, this statement is
incorrect.

Our units (again, not trying to be a commercial here, see .sig if you
want to converse offline) usually ship with either 2x 10GbE, 2x QDR IB,
or combinations of these. QDR IB gets you 3.2 GB/s. Per port.

In high performance computing storage (again, the focus of the OP's
questions), this is a reasonable configuration and request.
>
> A single 10 GbE front end connection provides a truly enormous amount of
> real world bandwidth, over 1 GB/s aggregate sustained. *This is
> equivalent to transferring a full length dual layer DVD in 10 seconds*

Trust me. This is not *enormous*. Well, ok ... put another way, we
architect systems that scale well beyond 10GB/s sustained. We have nice
TB sprints and similar sorts of "drag racing" as I call them (c.f.
http://scalability.org/?p=2912 http://scalability.org/?p=2356
http://scalability.org/?p=2165 http://scalability.org/?p=1980
http://scalability.org/?p=1756 )

1 GB/s is nothing magical. Again, not a commercial, but our DeltaV
units, running MD raid, achieve 850-900MB/s (0.85-0.9 GB/s) for RAID6.

To get good (great) performance you have to start out with a good
(great) design. One that will really optimize the performance on a per
unit basis.

> Few sites/applications actually need this kind of bandwidth, either
> burst or sustained. But, this is the system I spec'd for the OP
> earlier. Sometimes people get caught up in comparing raw bandwidth
> numbers between different platforms and lose sight of the real world
> performance they can get from any one of them.

The sad part is that we often wind up fighting against others' "marketing
numbers". Our real benchmarks are often comparable to their "strong
wind at the back" numbers. Heck, our MD raid numbers are often better
than others' hardware RAID numbers.

Theoretical bandwidth from the marketing docs doesn't matter. The only
thing that does matter is having a sound design and implementation at
all levels. This is why we do what we do, and why we do use MD raid.

Regards,

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615

Re: high throughput storage server?

am 18.02.2011 00:49:07 von Stan Hoeppner

Joe Landman put forth on 2/17/2011 4:13 PM:

> Well, the application area appears to be high performance cluster
> computing, and the storage behind it. Its a somewhat more specialized
> version of storage, and not one that a typical IT person runs into
> often. There are different, some profoundly so, demands placed upon
> such storage.

The OP's post described an ad hoc collection of 40-50 machines doing
various types of processing on shared data files. This is not classical
cluster computing. He didn't describe any kind of _parallel_
processing. It sounded to me like staged batch processing, the
bandwidth demands of which are typically much lower than a parallel
compute cluster.

> Full disclosure: this is our major market, we make/sell products in
> this space, have for a while. Take what we say with that in your mind
> as a caveat, as it does color our opinions.

Thanks for the disclosure Joe.

> The spec's as stated, 50Gb/s ... its rare ... exceptionally rare ...
> that you ever see cluster computing storage requirements stated in such
> terms. Usually they are stated in the MB/s or GB/s regime. Using a
> basic conversion of Gb/s to GB/s, the OP is looking for ~6GB/s support.

Indeed. You typically don't see this kind of storage b/w need outside
the government labs and supercomputing centers (LLNL, Sandia, NCCS,
SDSC, etc). Of course those sites' requirements are quite a bit higher
than a "puny" 6 GB/s.

> Some basic facts about this.
>
> Fibre channel (FC-8 in particular), will give you, at best 1GB/s per
> loop, and that presumes you aren't oversubscribing the loop. The vast
> majority of designs we see coming from IT shops, do, in fact, badly
> oversubscribe the bandwidth, which causes significant contention on the
> loops.

Who is still doing loops on the front end? Front end loops died many
years ago with the introduction of switches from Brocade, Qlogic,
McData, etc. I've not heard of a front end loop being used in many, many
years. Some storage vendors still use loops on the _back_ end to
connect FC/SAS/SATA expansion chassis to the head controller, IBM and
NetApp come to mind, but it's usually dual loops per chassis, so you're
looking at ~3 GB/s per expansion chassis using 8 Gbit loops. One would
be hard pressed to oversubscribe such a system, as most of these are
sold with multiple chassis. And for systems such as the IBMs and
NetApps, you can get anywhere from 4-32 front end ports of 8 Gbit FC or
10 GbE. In the IBM case you're limited to block access, whereas the
NetApp will do both block and file.

> The Nexsan unit you indicated (they are nominally a competitor
> of ours) is an FC device, though we've heard rumblings that they may
> even allow for SAS direct connections (though that would be quite cost
> ineffective as a SAS JBOD chassis compared to other units, and you still
> have the oversubscription problem).

Nexsan doesn't offer direct SAS connection on the big 42/102 drive Beast
units, only on the Boy units. The Beast units all use dual or quad FC
front end ports, with a couple front end GbE iSCSI ports thrown in for
flexibility. The SAS Boy units beat all competitors on price/TB, as do
all the Nexsan products.

I'd like to note that oversubscription isn't intrinsic to a piece of
hardware. It's indicative of an engineer or storage architect not
knowing what the blank he's doing.

> As I said, high performance storage design is a very ... very ...
> different animal from standard IT storage design. There are very
> different decision points, and design concepts.

Depends on the segment of the HPC market. It seems you're competing in
the low end of it. Configurations get a bit exotic at the very high
end. It also depends on what HPC storage tier you're looking at, and
the application area. For pure parallel computing sites such as NCCS,
NCSA, PSSC, etc your storage infrastructure and the manner in which it
is accessed is going to be quite different than some of the NASA
sponsored projects, such as the Spitzer telescope project being handled
by Caltech. The first will have persistent parallel data writing from
simulation runs across many hundreds or thousands of nodes. The second
will have massive streaming writes as the telescope streams data in real
time to a ground station. Then this data will be staged and processed
with massive streaming writes.

So, again, it really depends on the application(s), as always,
regardless of whether it's HPC or IT. There are few purely streaming IT
workloads (EDL of decision support databases comes to mind), and those
are usually of relatively short duration. They can still put some strain
on a SAN if not architected correctly.

>> You don't see many deployed filers on the planet with 5 * 10 GbE front
>> end connections. In fact, today, you still don't see many deployed
>> filers with even one 10 GbE front end connection, but usually multiple
>> (often but not always bonded) GbE connections.
>
> In this space, high performance cluster storage, this statement is
> incorrect.

The OP doesn't have a high performance cluster. HPC cluster storage by
accepted definition includes highly parallel workloads. This is not
what the OP described. He described ad hoc staged data analysis.

> In high performance computing storage (again, the focus of the OP's
> questions), this is a reasonable configuration and request.

Again, I disagree. See above.

>> A single 10 GbE front end connection provides a truly enormous amount of
>> real world bandwidth, over 1 GB/s aggregate sustained. *This is
>> equivalent to transferring a full length dual layer DVD in 10 seconds*
>
> Trust me. This is not *enormous*. Well, ok ... put another way, we

Given that the OP has nothing right now, this is *enormous* bandwidth.
It would surely meet his needs. For the vast majority of
workloads/environments, 1GB/s sustained is enormous. Sure, there are
environments that may need more, but those folks aren't typically going
to be asking for architecture assistance on this, or any other mailing
list. ;)

> 1 GB/s is nothing magical. Again, not a commercial, but our DeltaV
> units, running MD raid, achieve 850-900MB/s (0.85-0.9 GB/s) for RAID6.

1 GB/s sustained random I/O is a bit magical, for many many
sites/applications. I'm betting the 850-900MB/s RAID6 you quote is a
streaming read, yes? What does that box peak at with a mixed random I/O
workload from 40-50 clients?

> To get good (great) performance you have to start out with a good
> (great) design. One that will really optimize the performance on a per
> unit basis.

Blah blah. You're marketing too much at this point. :)

> The sad part is that we often wind up fighting against others "marketing
> numbers". Our real benchmarks are often comparable to their "strong
> wind a the back" numbers. Heck, our MD raid numbers often are better
> than others hardware RAID numbers.

And they're all on paper. It was great back in the day when vendors
would drop off an eval unit free of charge and let you bang on it for a
month. Today, there are too many players, and margins are too small, for
most companies to have the motivation to do this. Today you're invited
to the vendor to watch them run the hardware through a demo, which has
little bearing on your workload. For a small firm like yours I'm
guessing it would be impossible to deploy eval units in any numbers due
to capitalization issues.

> Theoretical bandwidth from the marketing docs doesn't matter. The only

This is always the case. Which is one reason why certain trade mags are
still read--almost decent product reviews.

> thing that does matter is having a sound design and implementation at
> all levels. This is why we do what we do, and why we do use MD raid.

No argument here. This is one reason why some quality VARs/integrators
are unsung heroes in some quarters. There is a plethora of fantastic
gear on the market today, from servers to storage to networking gear.
One could buy the best $$ products available and still get crappy
performance if it's not integrated properly, from the cabling to the
firmware to the application.

--
Stan

Re: high throughput storage server?

am 18.02.2011 01:06:00 von Joe Landman

On 2/17/2011 6:49 PM, Stan Hoeppner wrote:
> Joe Landman put forth on 2/17/2011 4:13 PM:
>
>> Well, the application area appears to be high performance cluster
>> computing, and the storage behind it. Its a somewhat more specialized
>> version of storage, and not one that a typical IT person runs into
>> often. There are different, some profoundly so, demands placed upon
>> such storage.
>
> The OP's post described an ad hoc collection of 40-50 machines doing
> various types of processing on shared data files. This is not classical
> cluster computing. He didn't describe any kind of _parallel_
> processing. It sounded to me like staged batch processing, the

Semantics at best. He is doing significant processing and data
analysis, in parallel, across a cluster of machines. Doing MPI-IO?
No. Does not using MPI make this not a cluster? No.

> bandwidth demands of which are typically much lower than a parallel
> compute cluster.

See his original post. He posits his bandwidth demands.

>
>> Full disclosure: this is our major market, we make/sell products in
>> this space, have for a while. Take what we say with that in your mind
>> as a caveat, as it does color our opinions.
>
> Thanks for the disclosure Joe.
>
>> The spec's as stated, 50Gb/s ... its rare ... exceptionally rare ...
>> that you ever see cluster computing storage requirements stated in such
>> terms. Usually they are stated in the MB/s or GB/s regime. Using a
>> basic conversion of Gb/s to GB/s, the OP is looking for ~6GB/s support.
>
> Indeed. You typically don't see this kind of storage b/w need outside
> the government labs and supercomputing centers (LLNL, Sandia, NCCS,
> SDSC, etc). Of course those sites' requirements are quite a bit higher
> than a "puny" 6 GB/s.

Heh ... we see it all the time in compute clusters, large data analysis
farms, etc. And not at the big labs.

[...]

> McData, etc. I've not hard of a front end loop being used in many many
> years. Some storage vendors still use loops on the _back_ end to
> connect FC/SAS/SATA expansion chassis to the head controller, IBM and

I am talking about the back end.

> NetApp come to mind, but it's usually dual loops per chassis, so you're
> looking at ~3 GB/s per expansion chassis using 8 Gbit loops. One would

2 GB/s assuming FC-8, and 20 lower speed drives are sufficient to
completely fill 2 GB/s. So, as I was saying, the design matters.

[...]

> Nexsan doesn't offer direct SAS connection on the big 42/102 drive Beast
> units, only on the Boy units. The Beast units all use dual or quad FC
> front end ports, with a couple front end GbE iSCSI ports thrown in for
> flexibility. The SAS Boy units beat all competitors on price/TB, as do
> all the Nexsan products.

As I joked once, many years ago, "broad sweeping generalizations
tend to be incorrect". Yes, it is a recursive joke, but there is a
serious aspect to it. The pricing per TB you proffered, which you claim
beats all ... is much higher than ours, and many others'. No,
they don't beat all, or even many.


> I'd like to note that over subscription isn't intrinsic to a piece of
> hardware. It's indicative of an engineer or storage architect not
> knowing what the blank he's doing.

Oversubscription and its corresponding resource contention, not to
mention poor design of other aspects ... yeah, I agree that this is
indicative of something. One must question why people continue to
deploy architectures which don't scale.

>
>> As I said, high performance storage design is a very ... very ...
>> different animal from standard IT storage design. There are very
>> different decision points, and design concepts.
>
> Depends on the segment of the HPC market. It seems you're competing in
> the low end of it. Configurations get a bit exotic at the very high

I noted this about your previous responses, this particular tone you
take. I debated for a while responding, until I saw something I simply
needed to correct. I'll try not to take your bait.

[...]

> So, again, it really depends on the application(s), as always,
> regardless of whether it's HPC or IT, although there are few purely
> streaming IT workloads, EDL of decision support databases comes to mind,
> but these are usually relatively short duration. They can still put
> some strain on a SAN if not architected correctly.
>
>>> You don't see many deployed filers on the planet with 5 * 10 GbE front
>>> end connections. In fact, today, you still don't see many deployed
>>> filers with even one 10 GbE front end connection, but usually multiple
>>> (often but not always bonded) GbE connections.
>>
>> In this space, high performance cluster storage, this statement is
>> incorrect.
>
> The OP doesn't have a high performance cluster. HPC cluster storage by

Again, semantics. They are doing massive data ingestion and processing.
This kind of workload is called "big data" in HPC circles, and it is *very
much* an HPC problem.

> accepted definition includes highly parallel workloads. This is not
> what the OP described. He described ad hoc staged data analysis.

See above. If you want to argue semantics, be my guest, I won't be
party to such a waste of time. The OP is doing analysis that requires a
high performance architecture. The architecture you suggested is not
one people in the field would likely recommend.

[rest deleted]


--
joe


Re: high throughput storage server?

am 18.02.2011 04:48:36 von Stan Hoeppner

Joe Landman put forth on 2/17/2011 6:06 PM:

> See above. If you want to argue semantics, be my guest, I won't be
> party to such a waste of time. The OP is doing analysis that requires a
> high performance architecture. The architecture you suggested is not
> one people in the field would likely recommend.

We don't actually know what the OP's needs are at this point. Any
suggestion is an educated guess. I clearly stated mine was such.

The OP simply multiplied the quantity of his client hosts' interfaces by
their link speed and posted that as his "requirement", which is where
the 50Gb/s figure came from. IIRC, he posted that as more of a question
than a statement.

--
Stan

Re: high throughput storage server?

am 18.02.2011 14:49:20 von Mattias Wadenstein

On Mon, 14 Feb 2011, Matt Garman wrote:

> For many years, I have been using Linux software RAID at home for a
> simple NAS system. Now at work, we are looking at buying a massive,
> high-throughput storage system (e.g. a SAN). I have little
> familiarity with these kinds of pre-built, vendor-supplied solutions.
> I just started talking to a vendor, and the prices are extremely high.
>
> So I got to thinking, perhaps I could build an adequate device for
> significantly less cost using Linux. The problem is, the requirements
> for such a system are significantly higher than my home media server,
> and put me into unfamiliar territory (in terms of both hardware and
> software configuration).
>
> The requirement is basically this: around 40 to 50 compute machines
> act as basically an ad-hoc scientific compute/simulation/analysis
> cluster. These machines all need access to a shared 20 TB pool of
> storage. Each compute machine has a gigabit network connection, and
> it's possible that nearly every machine could simultaneously try to
> access a large (100 to 1000 MB) file in the storage pool. In other
> words, a 20 TB file store with bandwidth upwards of 50 Gbps.
>
> I was wondering if anyone on the list has built something similar to
> this using off-the-shelf hardware (and Linux of course)?

Well, this seems fairly close to the LHC data analysis case, or HPC usage
in general, both of which I'm rather familiar with.

> My initial thoughts/questions are:
>
> (1) We need lots of spindles (i.e. many small disks rather than
> few big disks). How do you compute disk throughput when there are
> multiple consumers? Most manufacturers provide specs on their drives
> such as sustained linear read throughput. But how is that number
> affected when there are multiple processes simultanesously trying to
> access different data? Is the sustained bulk read throughput value
> inversely proportional to the number of consumers? (E.g. 100 MB/s
> drive only does 33 MB/s w/three consumers.) Or is there are more
> specific way to estimate this?

This is tricky. In general there isn't a good way of estimating this,
because so much about this involves the way your load interacts with
IO-scheduling in both Linux and (if you use them) raid controllers, etc.

The actual IO pattern of your workload is probably the biggest factor
here, determining both if readahead will give any benefits, as well as how
much sequential IO can be done as opposed to just seeking.
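
As a very rough starting point before measuring anything, a toy
seek-plus-transfer model at least shows why there is no simple
"divide by the number of consumers" rule: what matters is how much
sequential work the drive gets to do between seeks. A minimal sketch
only, with assumed numbers (8 ms average seek, 100 MB/s streaming rate),
not measurements from any particular drive:

    # Toy model: once several consumers interleave requests, every
    # request pays roughly one seek before it can stream its data.
    # Whatever aggregate the drive delivers is then shared between
    # the consumers.
    def aggregate_mbps(request_mb, seek_ms=8.0, stream_mbps=100.0):
        transfer_s = request_mb / stream_mbps
        seek_s = seek_ms / 1000.0
        return request_mb / (seek_s + transfer_s)

    for req in (0.1, 1.0, 10.0):
        print("request size %5.1f MB -> ~%5.1f MB/s per drive, shared"
              % (req, aggregate_mbps(req)))

With small requests the drive spends most of its time seeking and the
aggregate collapses; with large requests it stays close to the streaming
rate. Measuring under the real load is still the only reliable answer.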

> (2) The big storage server(s) need to connect to the network via
> multiple bonded Gigabit ethernet, or something faster like
> FibreChannel or 10 GbE. That seems pretty straightforward.

I'd also look at the option of many small & cheap servers, especially if
the load is spread out fairly evenly over the filesets.

> (3) This will probably require multiple servers connected together
> somehow and presented to the compute machines as one big data store.
> This is where I really don't know much of anything. I did a quick
> "back of the envelope" spec for a system with 24 600 GB 15k SAS drives
> (based on the observation that 24-bay rackmount enclosures seem to be
> fairly common). Such a system would only provide 7.2 TB of storage
> using a scheme like RAID-10. So how could two or three of these
> servers be "chained" together and look like a single large data pool
> to the analysis machines?

Here you would either maintain a large list of nfs mounts for the read
load, or start looking at a distributed filesystem. Sticking them all into
one big fileserver is easier on the administration part, but quickly gets
really expensive when you look to put multiple 10GE interfaces on it.

If the load is almost all read and seldom updated, and you can afford the
time to manually layout data files over the servers, the nfs mounts option
might work well for you. If the analysis cluster also creates files here
and there you might need a parallel filesystem.

2U machines with 12 3.5" or 16-24 2.5" hdd slots can be gotten pretty
cheaply. Add a quad-gige card if your load can sustain decent sequential
throughput, or look at fast/ssd 2.5" drives if you are mostly doing short
random reads. Then add as many as you need to sustain the analysis speed
you need. The advantage here is that this is really scalable: if you double
the number of servers you get at least twice the IO capacity.

Oh, yet another setup I've seen is adding some (2-4) fast disks to each
of the analysis machines and then running a distributed replicated
filesystem like Hadoop over them.

/Mattias Wadenstein

Re: high throughput storage server?

am 19.02.2011 00:16:38 von Stan Hoeppner

Mattias Wadenstein put forth on 2/18/2011 7:49 AM:

> Here you would either maintain a large list of nfs mounts for the read
> load, or start looking at a distributed filesystem. Sticking them all
> into one big fileserver is easier on the administration part, but
> quickly gets really expensive when you look to put multiple 10GE
> interfaces on it.

This really depends on one's definition of "really expensive". Taking
the total cost of such a system/infrastructure into account, these two
Intel dual port 10 GbE NICs seem rather cheap at $650-$750 USD:

http://www.newegg.com/Product/Product.aspx?Item=N82E16833106037
http://www.newegg.com/Product/Product.aspx?Item=N82E16833106075

20 Gb/s (40 both ways) raw/peak throughput at this price seems like a
bargain to me (plus the switch module cost obviously, if required,
usually not for RJ-45 or CX4, thus my motivation for mentioning these).

The storage infrastructure on the back end required to keep these pipes
full will be the "really expensive" piece. With 40-50 NFS clients you
end up with a random read/write workload, as has been mentioned. To
sustain 2 GB/s throughput (CRC+TCP+NFS+etc overhead limited) under such
random IO conditions is going to require something on the order of 24-30
15k SAS drives in a RAID 0 stripe, or 48-60 such drives in a RAID 10,
assuming something like 80-90% efficiency in your software or hardware
RAID engine. To get this level of sustained random performance from the
Nexsan arrays you'd have to use 2 units as the controller hardware just
isn't fast enough. This is also exactly why NetApp does good business
in the midrange segment--one unit does it all, including block and file.
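
(Back of the envelope, with assumed numbers: call it ~80 MB/s of mixed
random throughput per 15k SAS spindle and ~85% array efficiency:

    2000 MB/s / (80 MB/s * 0.85) ~= 29 spindles in a RAID 0 stripe
    ... or roughly double that, ~58 spindles, for the same result in RAID 10

The 80 MB/s per-spindle figure is an assumption for fairly large random
requests; small-block random I/O would be far lower and push the spindle
count up dramatically.)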

RAID 5/6 need not apply due to the abysmal RMW partial stripe write
penalty, unless of course you're doing almost no writes. But in that
case, how did the data get there in the first place? :)

--
Stan

Re: high throughput storage server?

am 19.02.2011 01:24:58 von Joe Landman

On 02/18/2011 08:49 AM, Mattias Wadenstein wrote:
> On Mon, 14 Feb 2011, Matt Garman wrote:

[...]

>> I was wondering if anyone on the list has built something similar to
>> this using off-the-shelf hardware (and Linux of course)?
>
> Well, this seems fairly close to the LHC data analysis case, or HPC
> usage in general, both of which I'm rather familiar with.

It's similar to many HPC workloads dealing with large data sets. There's
nothing unusual about this in the HPC world.

>
>> My initial thoughts/questions are:
>>
>> (1) We need lots of spindles (i.e. many small disks rather than
>> few big disks). How do you compute disk throughput when there are
>> multiple consumers? Most manufacturers provide specs on their drives
>> such as sustained linear read throughput. But how is that number
>> affected when there are multiple processes simultanesously trying to
>> access different data? Is the sustained bulk read throughput value
>> inversely proportional to the number of consumers? (E.g. 100 MB/s
>> drive only does 33 MB/s w/three consumers.) Or is there are more
>> specific way to estimate this?
>
> This is tricky. In general there isn't a good way of estimating this,
> because so much about this involves the way your load interacts with
> IO-scheduling in both Linux and (if you use them) raid controllers, etc.
>
> The actual IO pattern of your workload is probably the biggest factor
> here, determining both if readahead will give any benefits, as well as
> how much sequential IO can be done as opposed to just seeking.

Absolutely.

Good real-time data can be had from a number of tools: collectl,
iostat, sar, etc. I personally like atop for the "dashboard"-like
view. Collectl and others can get you even more data that you can analyze.

>
>> (2) The big storage server(s) need to connect to the network via
>> multiple bonded Gigabit ethernet, or something faster like
>> FibreChannel or 10 GbE. That seems pretty straightforward.
>
> I'd also look at the option of many small&cheap servers, especially if
> the load is spread out fairly even over the filesets.

Here is where things like GlusterFS and FhGFS shine. When Ceph firms up
you can use that as well. Happily, all of these run atop an MD raid device
(to tie into the list).

>> (3) This will probably require multiple servers connected together
>> somehow and presented to the compute machines as one big data store.
>> This is where I really don't know much of anything. I did a quick
>> "back of the envelope" spec for a system with 24 600 GB 15k SAS drives
>> (based on the observation that 24-bay rackmount enclosures seem to be
>> fairly common). Such a system would only provide 7.2 TB of storage
>> using a scheme like RAID-10. So how could two or three of these
>> servers be "chained" together and look like a single large data pool
>> to the analysis machines?
>
> Here you would either maintain a large list of nfs mounts for the read
> load, or start looking at a distributed filesystem. Sticking them all
> into one big fileserver is easier on the administration part, but
> quickly gets really expensive when you look to put multiple 10GE
> interfaces on it.
>
> If the load is almost all read and seldom updated, and you can afford
> the time to manually layout data files over the servers, the nfs mounts
> option might work well for you. If the analysis cluster also creates
> files here and there you might need a parallel filesystem.

One of the nicer aspects of GlusterFS in this context is that it
provides an NFS compatible server that NFS clients can connect to. Some
things aren't supported right now in the current release, but I
anticipate they will be soon.

Moreover, with the distribute mode, it will do a reasonable job of
distributing the files among the nodes. Sort of like the nfs layout
model, but with a "random" distribution. This should be, on average,
reasonably good.
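
The idea, very roughly, is hash-based placement: the file name
deterministically picks a node, so no central metadata lookup is needed
and files spread out evenly on average. A simplified illustration only
(GlusterFS's actual elastic hashing is more involved, and the node names
here are made up):

    import hashlib

    nodes = ["store01", "store02", "store03", "store04"]  # hypothetical

    def pick_node(filename):
        # Hash the path and map it onto one of the storage nodes.
        digest = hashlib.md5(filename.encode()).hexdigest()
        return nodes[int(digest, 16) % len(nodes)]

    for f in ("run042/input.dat", "run042/output.dat", "run043/input.dat"):
        print(f, "->", pick_node(f))

With enough files the per-node counts even out, which is what lets you
aggregate the bandwidth of all the boxes instead of being limited to one.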

>
> 2U machines with 12 3.5" or 16-24 2.5" hdd slots can be gotten pretty
> cheaply. Add a quad-gige card if your load can get decent sequential
> load, or look at fast/ssd 2.5" drives if you are mostly short random
> reads. Then add as many as you need to sustain the analysis speed you
> need. The advantage here is that this is really scalable, if you double
> the number of servers you get at least twice the IO capacity.
>
> Oh, yet another setup I've seen is adding a some (2-4) fast disks to
> each of the analysis machines and then running a distributed replicated
> filesystem like hadoop over them.

Ugh ... short-stroking drives or using SSDs? Quite cost-inefficient for
this work. And given the HPC nature of the problem, it's probably a good
idea to aim for something more cost-efficient.

This said, I'd recommend at least looking at GlusterFS. Put it atop an
MD raid (6 or 10), and you should be in pretty good shape with the right
network design. That is, as long as you don't use a bad SATA/SAS HBA.

Joe
--
Joe Landman
landman@scalableinformatics.com

Re: high throughput storage server?

am 21.02.2011 11:04:21 von Mattias Wadenstein

On Fri, 18 Feb 2011, Joe Landman wrote:

> On 02/18/2011 08:49 AM, Mattias Wadenstein wrote:
> [...]
>> 2U machines with 12 3.5" or 16-24 2.5" hdd slots can be gotten pretty
>> cheaply. Add a quad-gige card if your load can get decent sequential
>> load, or look at fast/ssd 2.5" drives if you are mostly short random
>> reads. Then add as many as you need to sustain the analysis speed you
>> need. The advantage here is that this is really scalable, if you double
>> the number of servers you get at least twice the IO capacity.
>>
>> Oh, yet another setup I've seen is adding a some (2-4) fast disks to
>> each of the analysis machines and then running a distributed replicated
>> filesystem like hadoop over them.
>
> Ugh ... short-stroking drives or using SSDs? Quite cost-inefficient for this
> work. And given the HPC nature of the problem, its probably a good idea to
> aim for more cost-efficient.

Or just regular fairly slow sata drives. The advantage being that it is
really cheap to get to 100-200 spindles this way, so you might not need
very fast disks. It depends on your IO pattern, but for the LHC data
analysis this has been shown to be surprisingly fast.

/Mattias Wadenstein

Re: high throughput storage server?

am 21.02.2011 11:25:53 von Mattias Wadenstein

On Fri, 18 Feb 2011, Stan Hoeppner wrote:

> Mattias Wadenstein put forth on 2/18/2011 7:49 AM:
>
>> Here you would either maintain a large list of nfs mounts for the read
>> load, or start looking at a distributed filesystem. Sticking them all
>> into one big fileserver is easier on the administration part, but
>> quickly gets really expensive when you look to put multiple 10GE
>> interfaces on it.
>
> This really depends on one's definition of "really expensive". Taking
> the total cost of such a system/infrastructure into account, these two
> Intel dual port 10 GbE NICs seem rather cheap at $650-$750 USD:
>
> http://www.newegg.com/Product/Product.aspx?Item=N82E16833106037
> http://www.newegg.com/Product/Product.aspx?Item=N82E16833106075
>
> 20 Gb/s (40 both ways) raw/peak throughput at this price seems like a
> bargain to me (plus the switch module cost obviously, if required,
> usually not for RJ-45 or CX4, thus my motivation for mentioning these).
>
> The storage infrastructure on the back end required to keep these pipes
> full will be the "really expensive" piece.

Exactly my point: a storage server that can sustain 20-200 MB/s is rather
cheap, but one that can sustain 2 GB/s is really expensive. Possibly to the
point where 10-100 smaller file servers are much cheaper. The worst case
here is very small random reads, and then you're screwed cost-wise
whatever you choose, if you want to hit the 2 GB/s number.

[snip]

> RAID 5/6 need not apply due the abysmal RMW partial stripe write
> penalty, unless of course you're doing almost no writes. But in that
> case, how did the data get there in the first place? :)

Actually, that's probably the common case for data analysis load. Lots of
random reads, but only occasional sequential writes when you add a new
file/fileset. So raid 5/6 performance-wise works out pretty much as a
stripe of n-[12] disks.

/Mattias Wadenstein

Re: high throughput storage server?

am 21.02.2011 22:51:17 von Stan Hoeppner

Mattias Wadenstein put forth on 2/21/2011 4:25 AM:
> On Fri, 18 Feb 2011, Stan Hoeppner wrote:
>
>> Mattias Wadenstein put forth on 2/18/2011 7:49 AM:
>>
>>> Here you would either maintain a large list of nfs mounts for the read
>>> load, or start looking at a distributed filesystem. Sticking them all
>>> into one big fileserver is easier on the administration part, but
>>> quickly gets really expensive when you look to put multiple 10GE
>>> interfaces on it.
>>
>> This really depends on one's definition of "really expensive". Taking
>> the total cost of such a system/infrastructure into account, these two
>> Intel dual port 10 GbE NICs seem rather cheap at $650-$750 USD:
>>
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16833106037
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16833106075
>>
>> 20 Gb/s (40 both ways) raw/peak throughput at this price seems like a
>> bargain to me (plus the switch module cost obviously, if required,
>> usually not for RJ-45 or CX4, thus my motivation for mentioning these).
>>
>> The storage infrastructure on the back end required to keep these pipes
>> full will be the "really expensive" piece.
>
> Exactly my point, a storage server that can sustain 20-200MB/s is rather
> cheap, but one that can sustain 2GB/s is really expensive. Possibly to
> the point where 10-100 smaller file servers are much cheaper. The worst
> case here is very small random reads, and then you're screwed cost-wise
> whatever you choose, if you want to get the 2GB/s number.

"Screwed" may be a bit harsh, but I agree that one big fast storage
server will usually cost more than many smaller ones with equal
aggregate performance. But looking at this from a TCO standpoint, the
administrative burden is higher in the many-small-servers case, and file layout
can be problematic, specifically in the case where all analysis nodes
need to share a file or group of files. This can create bottlenecks at
individual storage servers. Thus, acquisition cost must be weighed
against operational costs. If any of the data is persistent, backing up
a single server is straightforward. Backing up multiple servers, and
restoring them if necessary, is more complicated.

>> RAID 5/6 need not apply due the abysmal RMW partial stripe write
>> penalty, unless of course you're doing almost no writes. But in that
>> case, how did the data get there in the first place? :)

> Actually, that's probably the common case for data analysis load. Lots
> of random reads, but only occasional sequential writes when you add a
> new file/fileset. So raid 5/6 performance-wise works out pretty much as
> a stripe of n-[12] disks.

RAID5/6 have decent single streaming read performance, but suboptimal
random read, less than suboptimal streaming write, and abysmal random
write performance. They exhibit poor random read performance with high
client counts when compared to RAID0 or RAID10. Additionally, with an
analysis "cluster" designed for overall high utilization (no idle
nodes), one node will be uploading data sets while others are doing
analysis. Thus you end up with a mixed simultaneous random read and
streaming write workload on the server. RAID10 will give many times the
throughput in this case compared to RAID5/6, which will bog down rapidly
under such a workload.

--
Stan

Re: high throughput storage server?

am 22.02.2011 09:57:12 von David Brown

On 21/02/2011 22:51, Stan Hoeppner wrote:
> Mattias Wadenstein put forth on 2/21/2011 4:25 AM:
>> On Fri, 18 Feb 2011, Stan Hoeppner wrote:
>>
>>> Mattias Wadenstein put forth on 2/18/2011 7:49 AM:
>>>
>>> RAID 5/6 need not apply due the abysmal RMW partial stripe write
>>> penalty, unless of course you're doing almost no writes. But in that
>>> case, how did the data get there in the first place? :)
>
>> Actually, that's probably the common case for data analysis load. Lots
>> of random reads, but only occasional sequential writes when you add a
>> new file/fileset. So raid 5/6 performance-wise works out pretty much as
>> a stripe of n-[12] disks.
>
> RAID5/6 have decent single streaming read performance, but sub optimal
> random read, less than sub optimal streaming write, and abysmal random
> write performance. They exhibit poor random read performance with high
> client counts when compared to RAID0 or RAID10. Additionally, with an
> analysis "cluster" designed for overall high utilization (no idle
> nodes), one node will be uploading data sets while others are doing
> analysis. Thus you end up with a mixed simultaneous random read and
> streaming write workload on the server. RAID10 will give many times the
> throughput in this case compared to RAID5/6, which will bog down rapidly
> under such a workload.
>

I'm a little confused here. It's easy to see why RAID5/6 have very poor
random write performance - you need at least two reads and two writes
for a single write access. It's also easy to see that streaming reads
will be good, as you can read from most of the disks in parallel.

However, I can't see that streaming writes would be so bad - you have to
write slightly more than for a RAID0 write, since you have the parity
data too, but the parity is calculated in advance without the need of
any reads, and all the writes are in parallel. So you get the streamed
write performance of n-[12] disks. Contrast this with RAID10 where you
have to write out all data twice - you get the performance of n/2 disks.
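
To put numbers on that, here is the naive ideal-case arithmetic as a
quick sketch. It assumes full-stripe writes, a perfect controller and no
filesystem overhead, with a made-up 100 MB/s per drive:

    def streaming_write_mbps(n_disks, per_disk_mbps=100, level="raid5"):
        # Ideal-case streaming write throughput, ignoring controller,
        # filesystem and caching effects entirely.
        if level == "raid5":
            return (n_disks - 1) * per_disk_mbps   # one disk's worth of parity
        if level == "raid6":
            return (n_disks - 2) * per_disk_mbps   # two disks' worth of parity
        if level == "raid10":
            return (n_disks // 2) * per_disk_mbps  # every block written twice
        raise ValueError(level)

    for lvl in ("raid5", "raid6", "raid10"):
        print(lvl, streaming_write_mbps(12, level=lvl), "MB/s from 12 disks")

which is exactly the n-1 (or n-2) versus n/2 comparison above.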

I also cannot see why random reads would be bad - I would expect that to
be of similar speed to a RAID0 setup. The only exception would be if
you've got atime enabled, and each random read was also causing a small
write - then it would be terrible.

Or am I missing something here?


Re: high throughput storage server?

am 22.02.2011 10:30:10 von Mattias Wadenstein

On Tue, 22 Feb 2011, David Brown wrote:

> On 21/02/2011 22:51, Stan Hoeppner wrote:
>> Mattias Wadenstein put forth on 2/21/2011 4:25 AM:
>>> On Fri, 18 Feb 2011, Stan Hoeppner wrote:
>>>
>>>> Mattias Wadenstein put forth on 2/18/2011 7:49 AM:
>>>>
>>>> RAID 5/6 need not apply due the abysmal RMW partial stripe write
>>>> penalty, unless of course you're doing almost no writes. But in that
>>>> case, how did the data get there in the first place? :)
>>
>>> Actually, that's probably the common case for data analysis load. Lots
>>> of random reads, but only occasional sequential writes when you add a
>>> new file/fileset. So raid 5/6 performance-wise works out pretty much as
>>> a stripe of n-[12] disks.
>>
>> RAID5/6 have decent single streaming read performance, but sub optimal
>> random read, less than sub optimal streaming write, and abysmal random
>> write performance. They exhibit poor random read performance with high
>> client counts when compared to RAID0 or RAID10. Additionally, with an
>> analysis "cluster" designed for overall high utilization (no idle
>> nodes), one node will be uploading data sets while others are doing
>> analysis. Thus you end up with a mixed simultaneous random read and
>> streaming write workload on the server. RAID10 will give many times the
>> throughput in this case compared to RAID5/6, which will bog down rapidly
>> under such a workload.
>>
>
> I'm a little confused here. It's easy to see why RAID5/6 have very poor
> random write performance - you need at least two reads and two writes for a
> single write access. It's also easy to see that streaming reads will be
> good, as you can read from most of the disks in parallel.
>
> However, I can't see that streaming writes would be so bad - you have to
> write slightly more than for a RAID0 write, since you have the parity data
> too, but the parity is calculated in advance without the need of any reads,
> and all the writes are in parallel. So you get the streamed write
> performance of n-[12] disks. Contrast this with RAID10 where you have to
> write out all data twice - you get the performance of n/2 disks.

It's fine as long as you have only a few streaming writes; if you go up to
many streams, things might start breaking down.

> I also cannot see why random reads would be bad - I would expect that to be
> of similar speed to a RAID0 setup. The only exception would be if you've got
> atime enabled, and each random read was also causing a small write - then it
> would be terrible.
>
> Or am I missing something here?

The thing I think you are missing is crappy implementations in several HW
raid controllers. For Linux software raid the situation is, in my
experience, quite sane, just as you describe.

/Mattias Wadenstein

Re: high throughput storage server?

am 22.02.2011 10:49:18 von David Brown

On 22/02/2011 10:30, Mattias Wadenstein wrote:
> On Tue, 22 Feb 2011, David Brown wrote:
>
>> On 21/02/2011 22:51, Stan Hoeppner wrote:
>>> Mattias Wadenstein put forth on 2/21/2011 4:25 AM:
>>>> On Fri, 18 Feb 2011, Stan Hoeppner wrote:
>>>>
>>>>> Mattias Wadenstein put forth on 2/18/2011 7:49 AM:
>>>>>
>>>>> RAID 5/6 need not apply due the abysmal RMW partial stripe write
>>>>> penalty, unless of course you're doing almost no writes. But in that
>>>>> case, how did the data get there in the first place? :)
>>>
>>>> Actually, that's probably the common case for data analysis load. Lots
>>>> of random reads, but only occasional sequential writes when you add a
>>>> new file/fileset. So raid 5/6 performance-wise works out pretty much as
>>>> a stripe of n-[12] disks.
>>>
>>> RAID5/6 have decent single streaming read performance, but sub optimal
>>> random read, less than sub optimal streaming write, and abysmal random
>>> write performance. They exhibit poor random read performance with high
>>> client counts when compared to RAID0 or RAID10. Additionally, with an
>>> analysis "cluster" designed for overall high utilization (no idle
>>> nodes), one node will be uploading data sets while others are doing
>>> analysis. Thus you end up with a mixed simultaneous random read and
>>> streaming write workload on the server. RAID10 will give many times the
>>> throughput in this case compared to RAID5/6, which will bog down rapidly
>>> under such a workload.
>>>
>>
>> I'm a little confused here. It's easy to see why RAID5/6 have very
>> poor random write performance - you need at least two reads and two
>> writes for a single write access. It's also easy to see that streaming
>> reads will be good, as you can read from most of the disks in parallel.
>>
>> However, I can't see that streaming writes would be so bad - you have
>> to write slightly more than for a RAID0 write, since you have the
>> parity data too, but the parity is calculated in advance without the
>> need of any reads, and all the writes are in parallel. So you get the
>> streamed write performance of n-[12] disks. Contrast this with RAID10
>> where you have to write out all data twice - you get the performance
>> of n/2 disks.
>
> It's fine as long as you have only a few streaming writes, if you go up
> to many streams things might start breaking down.
>

That's always going to be the case when you have a lot of writes at the
same time. Perhaps RAID5/6 makes matters a little worse by requiring a
certain ordering on the writes to ensure consistency (maybe you have to
write a whole stripe before starting a new stripe? I don't know how md
raid balances performance and consistency here). I think the choice of
file system is likely to make a bigger impact in such cases.

>> I also cannot see why random reads would be bad - I would expect that
>> to be of similar speed to a RAID0 setup. The only exception would be
>> if you've got atime enabled, and each random read was also causing a
>> small write - then it would be terrible.
>>
>> Or am I missing something here?
>
> The thing I think you are missing is crappy implementations in several
> HW raid controllers. For linux software raid the situation is quite
> sanely as you describe in my experience.
>

Ah, okay. Thanks!



Re: high throughput storage server?

am 22.02.2011 14:38:53 von Stan Hoeppner

David Brown put forth on 2/22/2011 2:57 AM:
> On 21/02/2011 22:51, Stan Hoeppner wrote:

>> RAID5/6 have decent single streaming read performance, but sub optimal
>> random read, less than sub optimal streaming write, and abysmal random
>> write performance. They exhibit poor random read performance with high
>> client counts when compared to RAID0 or RAID10. Additionally, with an
>> analysis "cluster" designed for overall high utilization (no idle
>> nodes), one node will be uploading data sets while others are doing
>> analysis. Thus you end up with a mixed simultaneous random read and
>> streaming write workload on the server. RAID10 will give many times the
>> throughput in this case compared to RAID5/6, which will bog down rapidly
>> under such a workload.
>>
>
> I'm a little confused here. It's easy to see why RAID5/6 have very poor
> random write performance - you need at least two reads and two writes
> for a single write access. It's also easy to see that streaming reads
> will be good, as you can read from most of the disks in parallel.
>
> However, I can't see that streaming writes would be so bad - you have to
> write slightly more than for a RAID0 write, since you have the parity
> data too, but the parity is calculated in advance without the need of
> any reads, and all the writes are in parallel. So you get the streamed
> write performance of n-[12] disks. Contrast this with RAID10 where you
> have to write out all data twice - you get the performance of n/2 disks.
>
> I also cannot see why random reads would be bad - I would expect that to
> be of similar speed to a RAID0 setup. The only exception would be if
> you've got atime enabled, and each random read was also causing a small
> write - then it would be terrible.
>
> Or am I missing something here?

I misspoke. What I meant to say is RAID5/6 have decent streaming and
random read performance, less than optimal *degraded* streaming and
random read performance. The reason for this is that with one drive
down, each stripe for which the dead drive contained data (rather than
parity) must be reconstructed with a parity calculation when read.

This is another huge advantage RAID 10 has over the parity RAIDs: zero
performance loss while degraded. The other two big ones are vastly
lower rebuild times and still very good performance during a rebuild
operation as only two drives in the array take an extra hit from the
rebuild: the survivor of the mirror pair and the spare being written.
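
(Rough numbers, assuming an otherwise idle array and ~100 MB/s of
sequential rebuild rate: re-mirroring a 2 TB member is about

    2e12 bytes / 1e8 bytes/s = 20,000 s, call it 5.5 hours

with only one surviving disk taking the read load. A RAID5/6 rebuild
writes the spare at a similar rate but has to read every surviving member
for the whole duration, so the entire array carries the extra load. Real
rebuild rates vary a lot with concurrent activity and rebuild throttling.)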

--
Stan

Re: high throughput storage server?

am 22.02.2011 15:18:21 von David Brown

On 22/02/2011 14:38, Stan Hoeppner wrote:
> David Brown put forth on 2/22/2011 2:57 AM:
>> On 21/02/2011 22:51, Stan Hoeppner wrote:
>
>>> RAID5/6 have decent single streaming read performance, but sub optimal
>>> random read, less than sub optimal streaming write, and abysmal random
>>> write performance. They exhibit poor random read performance with high
>>> client counts when compared to RAID0 or RAID10. Additionally, with an
>>> analysis "cluster" designed for overall high utilization (no idle
>>> nodes), one node will be uploading data sets while others are doing
>>> analysis. Thus you end up with a mixed simultaneous random read and
>>> streaming write workload on the server. RAID10 will give many times the
>>> throughput in this case compared to RAID5/6, which will bog down rapidly
>>> under such a workload.
>>>
>>
>> I'm a little confused here. It's easy to see why RAID5/6 have very poor
>> random write performance - you need at least two reads and two writes
>> for a single write access. It's also easy to see that streaming reads
>> will be good, as you can read from most of the disks in parallel.
>>
>> However, I can't see that streaming writes would be so bad - you have to
>> write slightly more than for a RAID0 write, since you have the parity
>> data too, but the parity is calculated in advance without the need of
>> any reads, and all the writes are in parallel. So you get the streamed
>> write performance of n-[12] disks. Contrast this with RAID10 where you
>> have to write out all data twice - you get the performance of n/2 disks.
>>
>> I also cannot see why random reads would be bad - I would expect that to
>> be of similar speed to a RAID0 setup. The only exception would be if
>> you've got atime enabled, and each random read was also causing a small
>> write - then it would be terrible.
>>
>> Or am I missing something here?
>
> I misspoke. What I meant to say is RAID5/6 have decent streaming and
> random read performance, less than optimal *degraded* streaming and
> random read performance. The reason for this is that with one drive
> down, each stripe for which that dead drive contained data and not
> parity the stripe must be reconstructed with a parity calculation when read.
>

That makes lots of sense - I was missing the word "degraded"!

I don't think the degraded streaming reads will be too bad - after all,
you are reading the full stripe anyway, and the data reconstruction will
be fast on a modern cpu. But random reads will suffer. For
example, if you have 4+1 drives in a RAID5, then one in every 5 random
reads will land on the dead drive and will require reading the other 4
drives to reconstruct it. That means random reads generate roughly 160%
of the normal I/O load, so you get about 60% of the performance.
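
(The arithmetic behind that, for an n-drive RAID5 with one member dead:

    expected reads per request = (n-1)/n * 1  +  1/n * (n-1)

    n = 5:  4/5 * 1 + 1/5 * 4 = 1.6  ->  ~160% of the healthy I/O load

so roughly 60% of the healthy random read throughput, before any queueing
effects are taken into account.)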

> This is another huge advantage RAID 10 has over the parity RAIDs: zero
> performance loss while degraded. The other two big ones are vastly
> lower rebuild times and still very good performance during a rebuild
> operation as only two drives in the array take an extra hit from the
> rebuild: the survivor of the mirror pair and the spare being written.
>

Yes, this is definitely true - RAID10 is less affected by running
degraded, and recovering is faster and involves less disk wear. The
disadvantage compared to RAID6 is, of course, if the other half of a
disk pair dies during recovery then your raid is gone - with RAID6 you
have better worst-case redundancy.

Once md raid has support for bad block lists, hot replace, and non-sync
lists, then the differences will be far less clear. If a disk in a RAID
5/6 set has a few failures (rather than dying completely), then it will
run as normal except when bad blocks are accessed. This means for all
but the few bad blocks, the degraded performance will be full speed.
And if you use "hot replace" to replace the partially failed drive, the
rebuild will have almost exactly the same characteristics as RAID10
rebuilds - apart from the bad blocks, which must be recovered by parity
calculations, you have a straight disk-to-disk copy.




Re: high throughput storage server?

am 23.02.2011 06:52:02 von Stan Hoeppner

David Brown put forth on 2/22/2011 8:18 AM:

> Yes, this is definitely true - RAID10 is less affected by running
> degraded, and recovering is faster and involves less disk wear. The
> disadvantage compared to RAID6 is, of course, if the other half of a
> disk pair dies during recovery then your raid is gone - with RAID6 you
> have better worst-case redundancy.

The odds of the mirror partner dying during the rebuild are very long,
and the odds of suffering a URE are very low. However, in the case of
RAID5/6, and more so with RAID5, with modern very large drives (1/2/3 TB),
quite a bit is being written these days about unrecoverable read
error rates. Using a sufficient number of these very large disks will
at some point guarantee a URE during an array rebuild, which may very
likely cost you your entire array. This is because every block of every
remaining disk (assuming full disk RAID not small partitions on each
disk) must be read during a RAID5/6 rebuild. I don't have the equation
handy but Google should be able to fetch it for you. IIRC this is one
of the reasons RAID6 is becoming more popular today. Not just because
it can survive an additional disk failure, but that it's more resilient
to a URE during a rebuild.

With a RAID10 rebuild, as you're only reading entire contents of a
single disk, the odds of encountering a URE are much lower than with a
RAID5 with the same number of drives, simply due to the total number of
bits read.
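
(The usual back-of-envelope form of that equation, for what it's worth:

    P(at least one URE during rebuild) = 1 - (1 - p)^b

where p is the drive's unrecoverable read error rate (commonly specified
as 1 in 10^14 bits for consumer SATA, 1 in 10^15 or better for enterprise
drives) and b is the number of bits read during the rebuild. As a worked
example with assumed figures, rebuilding a 7-drive RAID5 of 2 TB consumer
disks means reading the 6 survivors end to end:

    b = 6 * 2e12 bytes * 8 ~= 9.6e13 bits
    P ~= 1 - (1 - 1e-14)^9.6e13 ~= 1 - e^-0.96 ~= 62%

By the same formula a RAID10 rebuild reads only one drive's worth,
~1.6e13 bits, for P of roughly 15%. Treat the quoted error rates as
worst-case vendor specs rather than observed field behaviour.)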

> Once md raid has support for bad block lists, hot replace, and non-sync
> lists, then the differences will be far less clear. If a disk in a RAID
> 5/6 set has a few failures (rather than dying completely), then it will
> run as normal except when bad blocks are accessed. This means for all
> but the few bad blocks, the degraded performance will be full speed. And

You're muddying the definition of a "degraded RAID".

> if you use "hot replace" to replace the partially failed drive, the
> rebuild will have almost exactly the same characteristics as RAID10
> rebuilds - apart from the bad blocks, which must be recovered by parity
> calculations, you have a straight disk-to-disk copy.

Are you saying you'd take a "partially failing" drive in a RAID5/6 and
simply do a full disk copy onto the spare, except "bad blocks",
rebuilding those in the normal fashion, simply to approximate the
recovery speed of RAID10?

I think your logic is a tad flawed here. If a drive is already failing,
why on earth would you trust it, period? I think you'd be asking for
trouble doing this. This is precisely one of the reasons many hardware
RAID controllers have historically kicked drives offline after the first
signs of trouble--if a drive is acting flaky we don't want to trust it,
but replace it as soon as possible.

The assumption is that the data on the array is far more valuable than
the cost of a single drive or the entire hardware for that matter. In
most environments this is the case. Everyone seems fond of the WD20EARS
drives (which I disdain). I hear they're loved because Newegg has them
for less than $100. What's your 2TB of data on that drive worth? In
the case of a MythTV box, to the owner, that $100 is worth more than the
content. In a business setting, I'd dare say the data on that drive is
worth far more than the $100 cost of the drive and the admin $$ time
required to replace/rebuild it.

In the MythTV case what you propose might be a worthwhile risk. In a
business environment, definitely not.

--
Stan

Re: high throughput storage server?

am 23.02.2011 14:56:34 von David Brown

On 23/02/2011 06:52, Stan Hoeppner wrote:
> David Brown put forth on 2/22/2011 8:18 AM:
>
>> Yes, this is definitely true - RAID10 is less affected by running
>> degraded, and recovering is faster and involves less disk wear. The
>> disadvantage compared to RAID6 is, of course, if the other half of a
>> disk pair dies during recovery then your raid is gone - with RAID6 you
>> have better worst-case redundancy.
>
> The odds of the mirror partner dying during rebuild are very very long,
> and the odds of suffering a URE are very low. However, in the case of
> RAID5/6, more so with RAID5, with modern very large drives (1/2/3TB),
> quite a bit is being written these days about unrecoverable read
> error rates. Using a sufficient number of these very large disks will
> at some point guarantee a URE during an array rebuild, which may very
> likely cost you your entire array. This is because every block of every
> remaining disk (assuming full disk RAID not small partitions on each
> disk) must be read during a RAID5/6 rebuild. I don't have the equation
> handy but Google should be able to fetch it for you. IIRC this is one
> of the reasons RAID6 is becoming more popular today. Not just because
> it can survive an additional disk failure, but that it's more resilient
> to a URE during a rebuild.
>

It is certainly the case that the chance of a second failure when doing
a RAID5/6 rebuild goes up with the number of disks (since all the disks
are stressed during the rebuild, and any failures are relevant), while
with RAID 10 rebuilds the chances of a second failure are restricted to
the single disk being used.

However, as disks get bigger, the chance of errors on any given disk is
increasing. And the fact remains that if you have a failure on a RAID10
system, you then have a single point of failure during the rebuild
period - while with RAID6 you still have redundancy (obviously RAID5 is
far worse here).

> With a RAID10 rebuild, as you're only reading the entire contents of a
> single disk, the odds of encountering a URE are much lower than with a
> RAID5 with the same number of drives, simply due to the total number of
> bits read.
>
>> Once md raid has support for bad block lists, hot replace, and non-sync
>> lists, then the differences will be far less clear. If a disk in a RAID
>> 5/6 set has a few failures (rather than dying completely), then it will
>> run as normal except when bad blocks are accessed. This means for all
>> but the few bad blocks, the degraded performance will be full speed. And
>
> You're muddying the definition of a "degraded RAID".
>

That could be the case - I'll try to be clearer. It is certainly
possible that I'm getting terminology wrong.

>> if you use "hot replace" to replace the partially failed drive, the
>> rebuild will have almost exactly the same characteristics as RAID10
>> rebuilds - apart from the bad blocks, which must be recovered by parity
>> calculations, you have a straight disk-to-disk copy.
>
> Are you saying you'd take a "partially failing" drive in a RAID5/6 and
> simply do a full disk copy onto the spare, except "bad blocks",
> rebuilding those in the normal fashion, simply to approximate the
> recover speed of RAID10?
>
> I think your logic is a tad flawed here. If a drive is already failing,
> why on earth would you trust it, period? I think you'd be asking for
> trouble doing this. This is precisely one of the reasons many hardware
> RAID controllers have historically kicked drives offline after the first
> signs of trouble--if a drive is acting flaky we don't want to trust it,
> but replace it as soon as possible.
>

I don't know if you've followed the recent "md road-map: 2011" thread (I
can't see any replies from you in the thread), but that is my reference
point here.

Sometimes disks die suddenly and catastrophically. When that happens,
the disk is gone and needs to be kicked offline.

Other times, you have a single-event corruption - for some reason, a
particular block got corrupted. And sometimes the disk is wearing out -
disks have a set of replacement blocks for re-locating known bad blocks,
and in the end these will run out. Either you get an URE, or a write
failure.

(I don't have any idea what the ratio of these sorts of failure modes is.)

If you have a drive with a few failures, then the rest of the data is
still correct. You can expect that if the drive returns data
successfully for a read, then the data is valid - that's what the
drive's ECC is for. But you would not want to trust it with new data,
and you would want to replace it as soon as possible.

The point of md raid's planned "bad block list" is to track which areas
of the drive should not be used. And the "hot replace" feature is aimed
at making a direct copy of a disk - excluding the bad blocks - to make
replacement of failed drives faster and safer. Since the failing drive
is not removed from the array until the hot replace takes over, you
still have full redundancy for most of the array - just not for stripes
that contain a bad block.

I can well imagine that hardware RAID controllers don't have this sort
of flexibility.

> The assumption is that the data on the array is far more valuable than
> the cost of a single drive or the entire hardware for that matter. In
> most environments this is the case. Everyone seems fond of the WD20EARS
> drives (which I disdain). I hear they're loved because Newegg has them
> for less than $100. What's your 2TB of data on that drive worth? In
> the case of a MythTV box, to the owner, that $100 is worth more than the
> content. In a business setting, I'd dare say the data on that drive is
> worth far more than the $100 cost of the drive and the admin $$ time
> required to replace/rebuild it.
>
> In the MythTV case what you propose might be a worthwhile risk. In a
> business environment, definitely not.
>

I believe it is the value of the data - and the value of keeping as much
redundancy as you can, and minimising the risky rebuild period, that is
Neil Brown's motivation behind the bad block list and hot replace. It
could well be that I'm not explaining it very well - but this is /not/
about saving money by continuing to use a dodgy disk even though you
know it is failing. It is about a dodgy disk with most of a data set
being a lot better than no disk when it comes to rebuild speed and data
redundancy.


Incidentally, what's your opinion on a RAID1+5 or RAID1+6 setup, where
you have a RAID5 or RAID6 build from RAID1 pairs? You get all the
rebuild benefits of RAID1 or RAID10, such as simple and fast direct
copies for rebuilds, and little performance degradation. But you also
get multiple failure redundancy from the RAID5 or RAID6. It could be
that it is excessive - that the extra redundancy is not worth the
performance cost (you still have poor small write performance).

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 23.02.2011 15:25:49 von John Robinson

On 23/02/2011 13:56, David Brown wrote:
[...]
> Incidentally, what's your opinion on a RAID1+5 or RAID1+6 setup, where
> you have a RAID5 or RAID6 build from RAID1 pairs? You get all the
> rebuild benefits of RAID1 or RAID10, such as simple and fast direct
> copies for rebuilds, and little performance degradation. But you also
> get multiple failure redundancy from the RAID5 or RAID6. It could be
> that it is excessive - that the extra redundancy is not worth the
> performance cost (you still have poor small write performance).

I'd also be interested to hear what Stan and other experienced
large-array people think of RAID60. For example, elsewhere in this
thread Stan suggested using a 40-drive RAID-10 (i.e. a 20-way RAID-0
stripe over RAID-1 pairs), and I wondered how a 40-drive RAID-60 (i.e. a
10-way RAID-0 stripe over 4-way RAID-6 arrays) would perform, both in
normal and degraded situations, and whether it might be preferable since
it would avoid the single-disk-failure issue that the RAID-1 mirrors
potentially expose. My guess is that it ought to have similar random
read performance and about half the random write performance, which
might be a trade-off worth making.

Cheers,

John.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 23.02.2011 16:15:50 von David Brown

On 23/02/2011 15:25, John Robinson wrote:
> On 23/02/2011 13:56, David Brown wrote:
> [...]
>> Incidentally, what's your opinion on a RAID1+5 or RAID1+6 setup, where
>> you have a RAID5 or RAID6 build from RAID1 pairs? You get all the
>> rebuild benefits of RAID1 or RAID10, such as simple and fast direct
>> copies for rebuilds, and little performance degradation. But you also
>> get multiple failure redundancy from the RAID5 or RAID6. It could be
>> that it is excessive - that the extra redundancy is not worth the
>> performance cost (you still have poor small write performance).
>
> I'd also be interested to hear what Stan and other experienced
> large-array people think of RAID60. For example, elsewhere in this
> thread Stan suggested using a 40-drive RAID-10 (i.e. a 20-way RAID-0
> stripe over RAID-1 pairs), and I wondered how a 40-drive RAID-60 (i.e. a
> 10-way RAID-0 stripe over 4-way RAID-6 arrays) would perform, both in
> normal and degraded situations, and whether it might be preferable since
> it would avoid the single-disk-failure issue that the RAID-1 mirrors
> potentially expose. My guess is that it ought to have similar random
> read performance and about half the random write performance, which
> might be a trade-off worth making.
>

Basically you are comparing a 4-drive RAID-6 to a 4-drive RAID-10. I
think the RAID-10 will be faster for streamed reads, and a lot faster
for small writes. You get improved safety in that you still have a
one-drive redundancy after a drive has failed, but you pay for it in
longer and more demanding rebuilds. But certainly RAID60 (or at least
RAID50) seems to be a choice many raid controllers support, so it must
be popular.


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 23.02.2011 22:11:00 von Stan Hoeppner

David Brown put forth on 2/23/2011 7:56 AM:

> However, as disks get bigger, the chance of errors on any given disk is
> increasing. And the fact remains that if you have a failure on a RAID10
> system, you then have a single point of failure during the rebuild
> period - while with RAID6 you still have redundancy (obviously RAID5 is
> far worse here).

The problem isn't a 2nd whole drive failure during the rebuild, but a
URE during rebuild:

http://www.zdnet.com/blog/storage/why-raid-5-stops-working-i n-2009/162

> I don't know if you've followed the recent "md road-map: 2011" thread (I
> can't see any replies from you in the thread), but that is my reference
> point here.

Actually I haven't. Is Neil's motivation with this RAID5/6 "mirror
rebuild" to avoid the URE problem?

> Incidentally, what's your opinion on a RAID1+5 or RAID1+6 setup, where
> you have a RAID5 or RAID6 build from RAID1 pairs? You get all the
> rebuild benefits of RAID1 or RAID10, such as simple and fast direct
> copies for rebuilds, and little performance degradation. But you also
> get multiple failure redundancy from the RAID5 or RAID6. It could be
> that it is excessive - that the extra redundancy is not worth the
> performance cost (you still have poor small write performance).

I don't care for and don't use parity RAID levels. Simple mirroring and
RAID10 have served me well for a very long time. They have many
advantages over parity RAID and few, if any, disadvantages. I've
mentioned all of these in previous posts.

--
Stan


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 23.02.2011 22:59:17 von Stan Hoeppner

John Robinson put forth on 2/23/2011 8:25 AM:
> On 23/02/2011 13:56, David Brown wrote:
> [...]
>> Incidentally, what's your opinion on a RAID1+5 or RAID1+6 setup, where
>> you have a RAID5 or RAID6 build from RAID1 pairs? You get all the
>> rebuild benefits of RAID1 or RAID10, such as simple and fast direct
>> copies for rebuilds, and little performance degradation. But you also
>> get multiple failure redundancy from the RAID5 or RAID6. It could be
>> that it is excessive - that the extra redundancy is not worth the
>> performance cost (you still have poor small write performance).
>
> I'd also be interested to hear what Stan and other experienced
> large-array people think of RAID60. For example, elsewhere in this
> thread Stan suggested using a 40-drive RAID-10 (i.e. a 20-way RAID-0
> stripe over RAID-1 pairs),

Actually, that's not what I mentioned. What I described was a 48 drive
storage system consisting of qty 6 RAID10 arrays of 8 drives each.
These could be 6 mdraid10 8 drive arrays using LVM to concatenate them
into a single volume, or they could be 6 HBA hardware RAID10 8 drive
arrays stitched together with mdraid linear into a single logical device.
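
For reference, a minimal sketch of the first (all-mdraid plus LVM)
variant, using hypothetical device names for the 48 disks:

  # six 8-drive RAID10 arrays, disks hypothetically /dev/sdb../dev/sdaw
  disks=(/dev/sd{b..z} /dev/sda{a..w})
  for i in $(seq 0 5); do
      mdadm --create /dev/md$((i+1)) --level=10 --raid-devices=8 \
            "${disks[@]:$((i*8)):8}"
  done
  # concatenate the six arrays into a single volume with LVM
  pvcreate /dev/md{1..6}
  vgcreate vg_store /dev/md{1..6}
  lvcreate -l 100%FREE -n data vg_store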

Then you would use XFS as your filesystem, and its allocation group
architecture to achieve your multi user workload parallelism. This
works well for a lot of workloads. Coincidentally, because we have 6
arrays of 8 drives each, instead of one large 48 drive RAID10, the
probability of the "dreaded" 2nd drive failure during rebuild drops
dramatically. Additionally, the amount of data exposed to loss due
to this architecture decreases to 1/6th of that of a single large RAID10
of 48 drives. If you were to lose both drives during the rebuild, as
long as this 8 drive array is not the first array in the stitched
logical device, it won't contain XFS metadata, and you can recover.
Thus, it's possible to xfs_repair the filesystem, only losing the data
contents of the 8 disk array that failed, or 1/6th of your data. This
failure/recovery scenario is a wild edge case so I wouldn't _rely_ on
it, but it's interesting that it works, and is worth mentioning.

> and I wondered how a 40-drive RAID-60 (i.e. a
> 10-way RAID-0 stripe over 4-way RAID-6 arrays) would perform, both in
> normal and degraded situations, and whether it might be preferable since
> it would avoid the single-disk-failure issue that the RAID-1 mirrors
> potentially expose. My guess is that it ought to have similar random
> read performance and about half the random write performance, which
> might be a trade-off worth making.

First off what you describe here is not a RAID60. RAID60 is defined as
a stripe across _two_ RAID6 arrays--not 10 arrays. RAID50 is the same
but with RAID5 arrays. What you're describing is simply a custom nested
RAID, much like what I mentioned above. Let's call it RAID J-60.

Anyway, you'd be better off striping 13 three-disk mirror sets with a
spare drive making up the 40. This covers the double drive failure
during rebuild (a non issue in my book for RAID1/10), and suffers zero
read or write performance penalty, except possibly LVM striping overhead in the
event you have to use LVM to create the stripe. I'm not familiar enough
with mdadm to know if you can do this nested setup all in mdadm.

The big problem I see is stripe size. How the !@#$ would you calculate
the proper stripe size for this type of nested RAID and actually get
decent performance from your filesystem sitting on top?

--
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 24.02.2011 00:14:31 von Stan Hoeppner

David Brown put forth on 2/23/2011 9:15 AM:

> Basically you are comparing a 4-drive RAID-6 to a 4-drive RAID-10. I
> think the RAID-10 will be faster for streamed reads, and a lot faster

In this 4 drive configuration, RAID6 might be ever so slightly faster in
read performance, but RAID10 will very likely be faster in every other
category, to include degraded performance and rebuild time. I can't say
definitively as I've not actually tested these setups head to head.

> for small writes. You get improved safety in that you still have a
> one-drive redundancy after a drive has failed, but you pay for it in
> longer and more demanding rebuilds.

Just to be clear, you're saying the RAID6 rebuilds are longer and more
demanding than RAID10. To state the opposite would be incorrect.

--
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 24.02.2011 00:43:22 von John Robinson

On 23/02/2011 21:59, Stan Hoeppner wrote:
> John Robinson put forth on 2/23/2011 8:25 AM:
>> On 23/02/2011 13:56, David Brown wrote:
>> [...]
>>> Incidentally, what's your opinion on a RAID1+5 or RAID1+6 setup, where
>>> you have a RAID5 or RAID6 build from RAID1 pairs? You get all the
>>> rebuild benefits of RAID1 or RAID10, such as simple and fast direct
>>> copies for rebuilds, and little performance degradation. But you also
>>> get multiple failure redundancy from the RAID5 or RAID6. It could be
>>> that it is excessive - that the extra redundancy is not worth the
>>> performance cost (you still have poor small write performance).
>>
>> I'd also be interested to hear what Stan and other experienced
>> large-array people think of RAID60. For example, elsewhere in this
>> thread Stan suggested using a 40-drive RAID-10 (i.e. a 20-way RAID-0
>> stripe over RAID-1 pairs),
>
> Actually, that's not what I mentioned.

Yes, it's precisely what you mentioned in this post:
http://marc.info/?l=linux-raid&m=129777295601681&w=2

[...]
>> and I wondered how a 40-drive RAID-60 (i.e. a
>> 10-way RAID-0 stripe over 4-way RAID-6 arrays) would perform
[...]
> First off what you describe here is not a RAID60. RAID60 is defined as
> a stripe across _two_ RAID6 arrays--not 10 arrays. RAID50 is the same
> but with RAID5 arrays. What you're describing is simply a custom nested
> RAID, much like what I mentioned above.

In the same way that RAID10 is not specified as a stripe across two
RAID1 arrays, RAID60 is not specified as a stripe across two arrays. But
yes, it's a nested RAID, in the same way that you have repeatedly
insisted that RAID10 is nested RAID0 over RAID1.

> Anyway, you'd be better off striping 13 three-disk mirror sets with a
> spare drive making up the 40. This covers the double drive failure
> during rebuild (a non issue in my book for RAID1/10), and suffers zero
> read or write performance penalty, except possibly LVM striping overhead in the
> event you have to use LVM to create the stripe. I'm not familiar enough
> with mdadm to know if you can do this nested setup all in mdadm.

Yes of course you can. (You can use md RAID10 with layout n3 or do it
the long way round with multiple RAID1s and a RAID0.) But in order to
get the 20TB of storage you'd need 60 drives. That's why for the sake of
slightly better storage and energy efficiency I'd be interested in how a
RAID 6+0 (if you prefer) in the arrangement I suggested would perform
compared to a RAID 10.
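
(E.g., a minimal sketch of the n3 layout for Stan's 13 x 3 + spare
suggestion, with hypothetical device names for the 40 drives:

  # 39 drives in a 3-copy md RAID10 plus one hot spare
  mdadm --create /dev/md0 --level=10 --layout=n3 --raid-devices=39 \
        --spare-devices=1 /dev/sd{b..z} /dev/sda{a..o}

-- one mdadm invocation, no nesting needed.)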

I'm positing this arrangement specifically to cope with the almost
inevitable URE when trying to recover an array. You dismissed it above
as a non-issue but in another post you linked to the zdnet article on
"why RAID5 stops working in 2009", and as far as I'm concerned much the
same applies to RAID1 pairs. UREs are now a fact of life. When they do
occur the drives aren't necessarily even operating outside their specs:
it's 1 in 10^14 or 10^15 bits, so read a lot more than that (as you will
on a busy drive) and they're going to happen.

Cheers,

John.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 24.02.2011 11:19:21 von David Brown

On 24/02/2011 00:14, Stan Hoeppner wrote:
> David Brown put forth on 2/23/2011 9:15 AM:
>
>> Basically you are comparing a 4-drive RAID-6 to a 4-drive RAID-10. I
>> think the RAID-10 will be faster for streamed reads, and a lot faster
>
> In this 4 drive configuration, RAID6 might be ever so slightly faster in
> read performance, but RAID10 will very likely be faster in every other
> category, to include degraded performance and rebuild time. I can't say
> definitively as I've not actually tested these setups head to head.
>
>> for small writes. You get improved safety in that you still have a
>> one-drive redundancy after a drive has failed, but you pay for it in
>> longer and more demanding rebuilds.
>
> Just to be clear, you're saying the RAID6 rebuilds are longer and more
> demanding than RAID10. To state the opposite would be incorrect.
>

Yes, that is exactly what I am saying.


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 24.02.2011 12:24:54 von David Brown

On 23/02/2011 22:11, Stan Hoeppner wrote:
> David Brown put forth on 2/23/2011 7:56 AM:
>
>> However, as disks get bigger, the chance of errors on any given disk is
>> increasing. And the fact remains that if you have a failure on a RAID10
>> system, you then have a single point of failure during the rebuild
>> period - while with RAID6 you still have redundancy (obviously RAID5 is
>> far worse here).
>
> The problem isn't a 2nd whole drive failure during the rebuild, but a
> URE during rebuild:
>
> http://www.zdnet.com/blog/storage/why-raid-5-stops-working-i n-2009/162
>

Yes, I've read that article - it's one of the reasons for always
preferring RAID6 to RAID5.

My understanding of RAID controllers (software or hardware) is that they
consider a drive to be either "good" or "bad". So if you get an URE,
the controller considers the drive "bad" and ejects it from the array.
It doesn't matter if it is an URE or a total disk death.

Maybe hardware RAID controllers do something else here - you know far
more about them than I do.

The idea of the md raid "bad block list" is that there is a medium
ground - you can have disks that are "mostly good".

Supposing you have a RAID6 array, and one disk has died completely. It
gets replaced by a hot spare, and rebuild begins. As the rebuild
progresses, disk 1 gets an URE. Traditional handling would mean disk 1
is ejected, and now you have a double-degraded RAID6 to rebuild. When
you later get an URE on disk 2, you have lost data for that stripe - and
the whole raid is gone.

But with bad block lists, the URE on disk 1 leads to a bad block entry
on disk 1, and the rebuild continues. When you later get an URE on disk
2, it's no problem - you use data from disk 1 and the other disks.
URE's are no longer a killer unless your set has no redundancy.


URE's are also what I worry about with RAID1 (including RAID10)
rebuilds. If a disk has failed, you are right in saying that the
chances of the second disk in the pair failing completely are tiny. But
the chances of getting an URE on the second disk during the rebuild are
not negligible - they are small, but growing with each new jump in disk
size.

With md raid's future bad block lists and hot replace features, then an
URE on the second disk during rebuilds is only a problem if the first
disk has died completely - if it only had a small problem, then the "hot
replace" rebuild will be able to use both disks to find the data.

>> I don't know if you've followed the recent "md road-map: 2011" thread (I
>> can't see any replies from you in the thread), but that is my reference
>> point here.
>
> Actually I haven't. Is Neil's motivation with this RAID5/6 "mirror
> rebuild" to avoid the URE problem?
>

I know you are more interested in hardware raid than software raid, but
I'm sure you'll find some interesting points in Neil's writings. If you
don't want to read through the thread, at least read his blog post.



>> Incidentally, what's your opinion on a RAID1+5 or RAID1+6 setup, where
>> you have a RAID5 or RAID6 build from RAID1 pairs? You get all the
>> rebuild benefits of RAID1 or RAID10, such as simple and fast direct
>> copies for rebuilds, and little performance degradation. But you also
>> get multiple failure redundancy from the RAID5 or RAID6. It could be
>> that it is excessive - that the extra redundancy is not worth the
>> performance cost (you still have poor small write performance).
>
> I don't care for and don't use parity RAID levels. Simple mirroring and
> RAID10 have served me well for a very long time. They have many
> advantages over parity RAID and few, if any, disadvantages. I've
> mentioned all of these in previous posts.
>


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 24.02.2011 16:53:52 von Stan Hoeppner

John Robinson put forth on 2/23/2011 5:43 PM:
> On 23/02/2011 21:59, Stan Hoeppner wrote:

>> Actually, that's not what I mentioned.
>
> Yes, it's precisely what you mentioned in this post:
> http://marc.info/?l=linux-raid&m=129777295601681&w=2

Sorry John. I thought you were referring to my recent post regarding 48
drives. I usually don't remember my own posts very long, especially
those over a week old. Heck, I'm lucky to remember a post I made 2-3
days ago. ;)

> [...]
>>> and I wondered how a 40-drive RAID-60 (i.e. a
>>> 10-way RAID-0 stripe over 4-way RAID-6 arrays) would perform
> [...]
>> First off what you describe here is not a RAID60. RAID60 is defined as
>> a stripe across _two_ RAID6 arrays--not 10 arrays. RAID50 is the same
>> but with RAID5 arrays. What you're describing is simply a custom nested
>> RAID, much like what I mentioned above.
>
> In the same way that RAID10 is not specified as a stripe across two
> RAID1 arrays, RAID60 is not specified as a stripe across two arrays. But
> yes, it's a nested RAID, in the same way that you have repeatedly
> insisted that RAID10 is nested RAID0 over RAID1.

"RAID 10" is used to describe striped mirrors regardless of the number
of mirror sets used, simply specifying the number of drives in the
description, i.e. "20 drive RAID 10" or "8 drive RAID 10". As I just
learned from doing some research, apparently when one stripes more than
2 RAID6s one would then describe the array as an "n leg RAID 60", or "n
element RAID 60". In your example this would be a "10 leg RAID 60".
I'd only seen the term "RAID 60" used to describe the 2 leg case. My
apologies for straying out here and wasting time on a non-issue.

>> Anyway, you'd be better off striping 13 three-disk mirror sets with a
>> spare drive making up the 40. This covers the double drive failure
>> during rebuild (a non issue in my book for RAID1/10), and suffers zero
>> read or write performance penalty, except possibly LVM striping overhead in the
>> event you have to use LVM to create the stripe. I'm not familiar enough
>> with mdadm to know if you can do this nested setup all in mdadm.
>
> Yes of course you can. (You can use md RAID10 with layout n3 or do it
> the long way round with multiple RAID1s and a RAID0.) But in order to
> get the 20TB of storage you'd need 60 drives. That's why for the sake of
> slightly better storage and energy efficiency I'd be interested in how a
> RAID 6+0 (if you prefer) in the arrangement I suggested would perform
> compared to a RAID 10.

For the definitive answer to this you'd have to test each RAID level
with your target workload. In general, I'd say, other than the problems
with parity performance, the possible gotcha is being able to come up
with a workable stripe block/width with such a setup. Wide arrays
typically don't work well for general use filesystems as most files are
much smaller than the typical stripe block required to get decent
performance from such a wide stripe. The situation is even worse with
nested stripes.

Your example uses a top level stripe width of 10 with a nested stripe
width of 2. Getting any filesystem to work efficiently with such a
nested RAID, from both an overall performance and space efficiency
standpoint, may prove to be very difficult. If you can't find a magic
formula for this, you could very well end up with worse actual space
efficiency in the FS than if you used a straight RAID10.

If you prefer RAID6 legs, what I'd recommend is simply concatenating the
legs instead of striping them. Using your 40 drive example, I'd
recommend using 4 RAID6 legs of 10 drives each, so you get an 8 drive
stripe width per array and thus better performance than the 4 drive
case. Use a stripe block size of 64KB on each array as this should
yield a good mix of space efficiency for average size files/extents and
performance for random IO with such size files. Concatenating in this
manner will avoid the difficult to solve multiple layered stripe
block/width to filesystem harmony problem.

Using XFS atop this concatenated RAID6 setup with an allocation group
count of 32 (4 arrays x 8 stripe spindles/array) will give you good
parallelism across the 4 arrays with a multiuser workload. AFAIK,
EXT3/4, ReiserFS, JFS, don't use allocation groups or anything like
them, and thus can't get parallelism from such a concatenated setup.
This is one of the many reasons why XFS is the only suitable Linux FS
for large/complex arrays. I haven't paid any attention to BTRFS, so I
don't know if it would be suitable for scenarios like this. It's so far
from production quality at this point it's not really even worth
mentioning, but I did so for the sake of being complete.
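
A minimal sketch of that concatenated RAID6 layout (hypothetical device
names, 40 data disks; adjust to taste):

  # four 10-drive RAID6 legs, 64KB chunk, disks /dev/sdb../dev/sdao
  disks=(/dev/sd{b..z} /dev/sda{a..o})
  for i in $(seq 0 3); do
      mdadm --create /dev/md$((i+1)) --level=6 --chunk=64 --raid-devices=10 \
            "${disks[@]:$((i*10)):10}"
  done
  # concatenate the legs rather than striping them
  mdadm --create /dev/md10 --level=linear --raid-devices=4 /dev/md{1..4}
  # 32 allocation groups = 4 arrays x 8 data spindles per leg
  mkfs.xfs -d agcount=32 /dev/md10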

As always, all of this is a strictly academic guessing exercise without
testing the specific workload. That said, for any multiuser workload
this setup should perform relatively well, for a parity based array.

The takeaway here is concatenation instead of layered striping, and
using the appropriate filesystem to take advantage of such.

> I'm positing this arrangement specifically to cope with the almost
> inevitable URE when trying to recover an array. You dismissed it above
> as a non-issue but in another post you linked to the zdnet article on
> "why RAID5 stops working in 2009", and as far as I'm concerned much the
> same applies to RAID1 pairs. UREs are now a fact of life. When they do
> occur the drives aren't necessarily even operating outside their specs:
> it's 1 in 10^14 or 10^15 bits, so read a lot more than that (as you will
> on a busy drive) and they're going to happen.

I didn't mean to discount anything. The math shows that UREs during
rebuild aren't relevant for mirrored RAID schemes. This is because
with current drive sizes and URE rates you have to read more than
something like 12 TB before encountering a URE (a 1 in 10^14 bits URE
rate works out to roughly one error per 12.5 TB read). The largest
drives available are 3TB, or ~1/4th the "URE rebuild threshold" bit count.
Probabilities inform us about the hypothetical world in general terms.
In the real world, sure, anything can happen. Real world data of this
type isn't published, so we have to base our calculations and planning on
what the manufacturers provide.

The article makes an interesting point in that as drives continue to
increase in capacity, with their URE rates remaining basically static,
eventually every RAID6 rebuild will see a URE. I haven't done the math
so I don't know at exactly what drive size/count this will occur. The
obvious answer to it will be RAID7, or triple parity RAID. At that
point, parity RAID will have, in practical $$, lost its only advantage
over mirrors, i.e. RAID10.

In the long run, if the current size:URE rate trend continues, we may
see the 3 leg RAID 10 becoming popular. My personal hope is that the
drive makers can start producing drives with much lower URE rates. I'd
rather never see the days of anything close to hexa parity RAID9 and
quad leg RAID10 being required simply to survive a rebuild process.

--
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 24.02.2011 21:28:14 von Matt Garman

Wow, I can't believe the number of responses I've received to this
question. I've been trying to digest it all. I'm going to throw some
follow-up comments as time allows, starting here...

On Tue, Feb 15, 2011 at 3:43 AM, David Brown wrote:
> If you are not too bothered about write performance, I'd put a fair amount
> of the budget into ram rather than just disk performance. When you've got
> the ram space to make sure small reads are mostly cached, the main
> bottleneck will be sequential reads - and big hard disks handle sequential
> reads as fast as expensive SSDs.

I could be wrong, but I'm not so sure RAM would be beneficial for our
case. Our workload is virtually all reads; however, these are huge
reads. The analysis programs basically do a full read of data files
that are generally pretty big: roughly 100 MB to 5 GB in the worst
case. Average file size is maybe 500 MB (rough estimate). And there
are hundreds of these files, all of which need "immediate" access. So
to cache these in RAM, seems like it would take an awful lot of RAM.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 24.02.2011 21:36:50 von Matt Garman

On Tue, Feb 15, 2011 at 8:56 AM, Zdenek Kaspar wrote:
On 15.2.2011 15:29, Roberto Spadim wrote:
>> for hobby = SATA2 disks, 50USD disks of 1TB 50MB/s
>> the today state of art, in 'my world' is: http://www.ramsan.com/products/3
>
> I doubt 20TB SLC which will survive huge abuse (writes) is low-cost
> solution what OP wants to build himself..
>
> or 20TB RAM omg..

Just to be clear, this is *not* a hobby system. I mentioned hobby
system in my original post just to serve as a reference for my current
knowledge level. I've built and configured the simple linux md raid6
NAS box at home, and a similar system for backups here at work.

But now I'm looking at something that's obviously a completely
different game, with bigger and stricter requirements.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 24.02.2011 21:43:16 von David Brown

On 24/02/11 21:28, Matt Garman wrote:
> Wow, I can't believe the number of responses I've received to this
> question. I've been trying to digest it all. I'm going to throw some
> follow-up comments as time allows, starting here...
>
> On Tue, Feb 15, 2011 at 3:43 AM, David Brown wrote:
>> If you are not too bothered about write performance, I'd put a fair amount
>> of the budget into ram rather than just disk performance. When you've got
>> the ram space to make sure small reads are mostly cached, the main
>> bottleneck will be sequential reads - and big hard disks handle sequential
>> reads as fast as expensive SSDs.
>
> I could be wrong, but I'm not so sure RAM would be beneficial for our
> case. Our workload is virtually all reads; however, these are huge
> reads. The analysis programs basically do a full read of data files
> that are generally pretty big: roughly 100 MB to 5 GB in the worst
> case. Average file size is maybe 500 MB (rough estimate). And there
> are hundreds of these files, all of which need "immediate" access. So
> to cache these in RAM, seems like it would take an awful lot of RAM.

RAM for cache makes a difference if the same file is read more than
once. That applies equally to big files - but only if more than one
machine is reading the same file. If they are all reading different
files, then - as you say - there won't be much to gain as each file is
only used once.

Still, when you have so much data going from the disks and out to the
clients, it is good to have plenty of ram for buffering, even if it is
only used once.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 24.02.2011 21:43:17 von Matt Garman

On Tue, Feb 15, 2011 at 7:03 AM, Roberto Spadim wrote:
> disks are good for sequencial access
> for non-sequencial ssd are better (the sequencial access rate for a
> ssd is the same for a non sequencial access rate)

I have a more general question: say I have an ultra simple NAS system,
with exactly one disk, and an infinitely fast network connection.
Now, with exactly one client, I should be able to do a sequential read
that is exactly the speed of that single drive in the NAS box (assume
network protocol overhead is negligible to keep it simple).

What happens if there are exactly two clients simultaneously
requesting different large files? From the client's perspective, this
is a sequential read, but from the drive's perspective, it's obviously
not.

And likewise, what if there are three clients, or four clients, ...,
all requesting different but large files simultaneously?

How does one calculate the drive's throughput in these cases? And,
clearly, there are two throughputs, one from the clients'
perspectives, and one from the drive's perspective.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 24.02.2011 21:49:35 von Matt Garman

On Tue, Feb 15, 2011 at 7:39 AM, David Brown wrote:
> This brings up an important point - no matter what sort of system you get
> (home made, mdadm raid, or whatever) you will want to do some tests and
> drills at replacing failed drives. Also make sure everything is well
> documented, and well labelled. When mdadm sends you an email telling you
> drive sdx has failed, you want to be /very/ sure you know which drive is sdx
> before you take it out!

Agreed! This will be a learn-as-I-go project.

> You also want to consider your raid setup carefully. RAID 10 has been
> mentioned here several times - it is often a good choice, but not
> necessarily. RAID 10 gives you fast recovery, and can at best survive a
> loss of half your disks - but at worst a loss of two disks will bring down
> the whole set. It is also very inefficient in space. If you use SSDs, it
> may not be worth double the price to have RAID 10. If you use hard disks,
> it may not be sufficient safety.

And that's what has me thinking about cluster filesystems.
Ultimately, I'd like a pool of storage "nodes". These could live on
the same physical machine, or be spread across multiple machines. To
the clients, this pool of nodes would look like one single collection
of storage. The benefit of this, in my opinion, is flexibility
(mainly easy to grow/add new nodes), but also a bit more safety. If
one node dies, it doesn't take down the whole pool, just the files on
that node become unavailable.

Even better would be a "smart" pool, that, when a new node is added,
it automatically re-distributes all the files, so that the new node
has the same kind of space utilization as all the others.

> It is probably worth having a small array of SSDs (RAID1 or RAID10) to hold
> the write intent bitmap, the journal for your main file system, and of
> course your OS. Maybe one of these absurdly fast PCI Express flash disks
> would be a good choice.

Is that really necessary, though, when writes account for probably <5%
of total IO operations? And (relatively speaking) write performance
is unimportant?
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 24.02.2011 21:53:51 von Zdenek Kaspar

On 24.2.2011 21:43, Matt Garman wrote:
> On Tue, Feb 15, 2011 at 7:03 AM, Roberto Spadim wrote:
>> disks are good for sequencial access
>> for non-sequencial ssd are better (the sequencial access rate for a
>> ssd is the same for a non sequencial access rate)
>
> I have a more general question: say I have an ultra simple NAS system,
> with exactly one disk, and an infinitely fast network connection.
> Now, with exactly one client, I should be able to do a sequential read
> that is exactly the speed of that single drive in the NAS box (assume
> network protocol overhead is negligible to keep it simple).
>
> What happens if there are exactly two clients simultaneously
> requesting different large files? From the client's perspective, this
> is a sequential read, but from the drive's perspective, it's obviously
> not.
>
> And likewise, what if there are three clients, or four clients, ...,
> all requesting different but large files simultaneously?
>
> How does one calculate the drive's throughput in these cases? And,
> clearly, there are two throughputs, one from the clients'
> perspectives, and one from the drive's perspective.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

For a rough estimate, try to simulate your workload at a small scale,
i.e. create files on your disk (fs) and run multiple processes (dd)
reading them. To see how the load adds up, watch the disk(s) with:
iostat -mx 1.
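
Something along these lines, for example (a minimal sketch; paths,
sizes and process count are made up, adjust to your workload):

  # create a handful of large test files
  for i in $(seq 1 8); do
      dd if=/dev/urandom of=/data/test$i.bin bs=1M count=500
  done
  # drop the page cache so reads actually hit the disk (as root)
  echo 3 > /proc/sys/vm/drop_caches
  # read them back in parallel
  for i in $(seq 1 8); do
      dd if=/data/test$i.bin of=/dev/null bs=1M &
  done
  wait
  # meanwhile, in another terminal: iostat -mx 1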

HTH, Z.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 24.02.2011 21:58:57 von Matt Garman

On Tue, Feb 15, 2011 at 9:16 AM, Joe Landman wrote:
> [disclosure: vendor posting, ignore if you wish, vendor html link at bottom
> of message]
>
>> The whole system needs to be "fast".
>
> Define what you mean by "fast". Seriously ... we've had people tell us
> about their "huge" storage needs that we can easily fit onto a single
> small unit, no storage cluster needed. We've had people say "fast" when
> they mean "able to keep 1 GbE port busy".
>
> Fast needs to be articulated really in terms of what you will do with it.
> As you noted in this and other messages, you are scaling up from 10
> compute nodes to 40 compute nodes. 4x change in demand, and I am guessing
> bandwidth (if these are large files you are streaming) or IOPs (if these
> are many small files you are reading). Small and large here would mean
> less than 64kB for small, and greater than 4MB for large.

These are definitely large files; maybe "huge" is a better word. All
are over 100 MB in size, some are upwards of 5 GB, most are probably a
few hundred megs in size.

The word "streaming" may be accurate, but to me it is misleading. I
associate streaming with media, i.e. it is generally consumed much
more slowly than it can be sent (e.g. even high-def 1080p video won't
saturate a 100 mbps link). But in our case, these files are basically
read into memory, and then computations are done from there.

So, for an upper bounds on the notion of "fast", I'll illustrate the
worst-case scenario: there are 50 analysis machines, each of which can
run up to 10 processes, making 500 total processes. Every single
process requests a different file at the exact same time, and every
requested file is over 100 MB in size. Ideally, each process would be
able to access the file as though it were local, and was the only
process on the machine. In reality, it's "good enough" if each of the
50 machines' gigabit network connections are saturated. So from the
network perspective, that's 50 gbps.

From the storage perspective, it's less clear to me. That's 500 huge
simultaneous read requests, and I'm not clear on what it would take to
satisfy that.

> Your choice is simple. Build or buy. Many folks have made suggestions,
> and some are pretty reasonable, though a pure SSD or Flash based machine,
> while doable (and we sell these), is quite unlikely to be close to the
> realities of your budget. There are use cases for which this does make
> sense, but the costs are quite prohibitive for all but a few users.

Well, I haven't decided on whether or not to build or buy, but the
thought experiment of planning a buy is very instructive. Thanks to
everyone who has contributed to this thread, I've got more information
than I've been able to digest so far!
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 24.02.2011 22:07:50 von Joe Landman

On 02/24/2011 03:53 PM, Zdenek Kaspar wrote:

>> And likewise, what if there are three clients, or four clients, ...,
>> all requesting different but large files simultaneously?
>>
>> How does one calculate the drive's throughput in these cases? And,
>> clearly, there are two throughputs, one from the clients'
>> perspectives, and one from the drive's perspective.

We use Jens Axboe's fio code to model this.

Best case scenario is you get 1/N of the fixed sized resource that you
share averaged out over time for N requestors of equal size/priority.
Reality is often different, in that there are multiple stacks to
traverse, potential seek time issues as well as network contention
issues, interrupt and general OS "jitter", etc. That is, all the
standard HPC issues you get for compute/analysis nodes, you get for this.

Best advice is "go wide". As many spindles as possible. If you are
read bound (large block streaming IO), then RAID6 is good, and many of
them joined into a parallel file system (ala GlusterFS, FhGFS, MooseFS,
OrangeFS, ... ) is even better. Well, as long as the baseline hardware
is fast to begin with. We do not recommend a single drive per server,
turns out to be a terrible way to aggregate bandwidth in practice. It's
better to build really fast units, and go "wide" with them. Which is,
curiously, what we do with our siCluster boxen.

MD raid should be fine for you.

Regards,

Joe



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 24.02.2011 22:20:56 von Joe Landman

On 02/24/2011 03:58 PM, Matt Garman wrote:

> These are definitely large files; maybe "huge" is a better word. All
> are over 100 MB in size, some are upwards of 5 GB, most are probably a
> few hundred megs in size.

Heh ... the "huge" storage I alluded to above is also quite ... er ...
context sensitive.

>
> The word "streaming" may be accurate, but to me it is misleading. I

Actually not at all. We have quite a few customers that consume files
by slurping them into ram before processing. So the file system streams
(e.g. sends data as fast as the remote process can consume it, modulo
network and other inefficiencies).

> associate streaming with media, i.e. it is generally consumed much
> more slowly than it can be sent (e.g. even high-def 1080p video won't
> saturate a 100 mbps link). But in our case, these files are basically
> read into memory, and then computations are done from there.

Same use case. dd is an example of a "trivial" streaming app, though we
prefer to generate load with fio.
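
For example (a minimal sketch -- the job parameters here are
illustrative, not a tuned profile):

  # 10 parallel sequential readers, 1MB blocks, 500MB per job
  fio --name=readers --directory=/data/test --rw=read --bs=1M \
      --size=500M --numjobs=10 --ioengine=libaio --direct=1 \
      --group_reporting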

>
> So, for an upper bounds on the notion of "fast", I'll illustrate the
> worst-case scenario: there are 50 analysis machines, each of which can
> run up to 10 processes, making 500 total processes. Every single
> process requests a different file at the exact same time, and every
> requested file is over 100 MB in size. Ideally, each process would be
> able to access the file as though it were local, and was the only
> process on the machine. In reality, it's "good enough" if each of the
> 50 machines' gigabit network connections are saturated. So from the
> network perspective, that's 50 gbps.

Ok, so if we divide these 50 Gbps across say ... 10 storage nodes ...
then we need only sustain, on average, 5 Gbps/storage node. This makes
a number of assumptions, some of which are valid (e.g. file distribution
across nodes is effectively random, and can be accomplished via parallel
file system). 5 Gbps/storage node sounds like a node with 6x GbE ports,
or 1x 10GbE port. Run one of the parallel file systems across it and
make sure the interior RAID can handle this sort of bandwidth (you'd
need at least 700 MB/s on the interior RAID, which eliminates many/most
of the units on the market, and you'd need pretty high efficiencies in
the stack, which also have a tendency to reduce your choices ... better
to build the interior RAIDs as fast as possible, deal with the network
efficiency losses and call it a day)

All this said, its better to express your IO bandwidth needs in MB/s,
preferably in terms of sustained bandwidth needs, as this is language
that you'd be talking to vendors in. So on 50 machines, assume each
machine can saturate its 1GbE port (these aren't Broadcom NICs, right?),
that gets you 50x 117 MB/s or about 5.9 GB/s sustained bandwidth for
your IO. 10 machines running at a sustainable 600 MB/s delivered over
the network, and a parallel file system atop this, solves this problem.
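
If you want to play with that arithmetic yourself (a quick sketch,
numbers as above):

  # aggregate client demand and per-node target for N storage nodes
  awk -v clients=50 -v mbps=117 -v nodes=10 'BEGIN {
      total = clients * mbps
      printf "aggregate: %.0f MB/s, per node: %.0f MB/s\n", total, total / nodes
  }'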

Single centralized resources (FC heads, filers, etc.) won't scale to
this. Then again, this isn't their use case.

Regards,

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 25.02.2011 00:30:35 von Stan Hoeppner

David Brown put forth on 2/24/2011 5:24 AM:

> My understanding of RAID controllers (software or hardware) is that they
> consider a drive to be either "good" or "bad". So if you get an URE,
> the controller considers the drive "bad" and ejects it from the array.
> It doesn't matter if it is an URE or a total disk death.
>
> Maybe hardware RAID controllers do something else here - you know far
> more about them than I do.

Most HBA and SAN RAID firmware I've dealt with kicks drives offline
pretty quickly at any sign of an unrecoverable error. I've also seen
drives kicked simply because the RAID firmware didn't like the drive
firmware. I have a fond (sarcasm) memory of DAC960s kicking ST118202
18GB Cheetahs offline left and right in the late 90s. The fact I still
recall that Seagate drive# after 10+ years should be informative
regarding the severity of that issue. :(

> The idea of the md raid "bad block list" is that there is a medium
> ground - you can have disks that are "mostly good".

Everything I've read and seen in the last few years regarding hard disk
technology says that platter manufacturing quality and tolerance are so
high on modern drives that media defects are rarely, if ever, seen by
the customer, as they're mapped out at the factory. The platters don't
suffer wear effects, but the rest of the moving parts do. From what
I've read/seen, "media" errors observed in the wild today are actually
caused by mechanical failures due to physical wear on various moving
parts: VC actuator pivot bearing/race, spindle bearings, etc.
Mechanical failures tend to show mild "media errors" in the beginning
and get worse with time as moving parts go further out of alignment.
Thus, as I see it, any UREs on a modern drive represent a "Don't trust
me--Replace me NOW" flag. I could be all wrong here, but this is what
I've read, and seen in manufacturer videos from WD and Seagate.

> Supposing you have a RAID6 array, and one disk has died completely. It
> gets replaced by a hot spare, and rebuild begins. As the rebuild
> progresses, disk 1 gets an URE. Traditional handling would mean disk 1
> is ejected, and now you have a double-degraded RAID6 to rebuild. When
> you later get an URE on disk 2, you have lost data for that stripe - and
> the whole raid is gone.
>
> But with bad block lists, the URE on disk 1 leads to a bad block entry
> on disk 1, and the rebuild continues. When you later get an URE on disk
> 2, it's no problem - you use data from disk 1 and the other disks. URE's
> are no longer a killer unless your set has no redundancy.

They're not a killer with RAID 6 anyway, are they? You can be
rebuilding one failed drive and suffer UREs left and right, as long as
you don't get two of them on two drives simultaneously in the same
stripe block read. I think that's right. Please correct me if not.

> URE's are also what I worry about with RAID1 (including RAID10)
> rebuilds. If a disk has failed, you are right in saying that the
> chances of the second disk in the pair failing completely are tiny. But
> the chances of getting an URE on the second disk during the rebuild are
> not negligible - they are small, but growing with each new jump in disk
> size.

I touched on this in my other reply, somewhat tongue-in-cheek mentioning
3 leg and 4 leg RAID10. At current capacities and URE ratings I'm not
worried about it with mirror pairs. If URE ratings haven't increased
substantially by the time our avg drive capacity hits 10GB I'll start to
worry.

Somewhat related to this, does anyone else here build their arrays from the
smallest cap drives they can get away with, preferably single platter
models when possible? I adopted this strategy quite some time ago,
mostly to keep rebuild times to a minimum, keep rotational mass low to
consume the least energy since using more drives, but also with the URE
issue in the back of my mind. Anecdotal evidence tends to point to the
trend of OPs going with fewer gargantuan drives instead of many smaller
ones. Maybe that's just members of this list, whose criteria may be
quite different from the typical enterprise data center.

> With md raid's future bad block lists and hot replace features, then an
> URE on the second disk during rebuilds is only a problem if the first
> disk has died completely - if it only had a small problem, then the "hot
> replace" rebuild will be able to use both disks to find the data.

What happens when you have multiple drives at the same or similar bad
block count?

> I know you are more interested in hardware raid than software raid, but
> I'm sure you'll find some interesting points in Neil's writings. If you
> don't want to read through the thread, at least read his blog post.
>
>

Will catch up. Thanks for the blog link.

--
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 25.02.2011 09:20:41 von David Brown

On 25/02/2011 00:30, Stan Hoeppner wrote:
> David Brown put forth on 2/24/2011 5:24 AM:
>
>> My understanding of RAID controllers (software or hardware) is that they
>> consider a drive to be either "good" or "bad". So if you get an URE,
>> the controller considers the drive "bad" and ejects it from the array.
>> It doesn't matter if it is an URE or a total disk death.
>>
>> Maybe hardware RAID controllers do something else here - you know far
>> more about them than I do.
>
> Most HBA and SAN RAID firmware I've dealt with kicks drives offline
> pretty quickly at any sign of an unrecoverable error. I've also seen
> drives kicked simply because the RAID firmware didn't like the drive
> firmware. I have a fond (sarcasm) memory of DAC960s kicking ST118202
> 18GB Cheetahs offline left and right in the late 90s. The fact I still
> recall that Seagate drive# after 10+ years should be informative
> regarding the severity of that issue. :(
>
>> The idea of the md raid "bad block list" is that there is a medium
>> ground - you can have disks that are "mostly good".
>
> Everything I've read and seen in the last few years regarding hard disk
> technology says that platter manufacturing quality and tolerance are so
> high on modern drives that media defects are rarely, if ever, seen by
> the customer, as they're mapped out at the factory. The platters don't
> suffer wear effects, but the rest of the moving parts do. From what
> I've read/seen, "media" errors observed in the wild today are actually
> caused by mechanical failures due to physical wear on various moving
> parts: VC actuator pivot bearing/race, spindle bearings, etc.
> Mechanical failures tend to show mild "media errors" in the beginning
> and get worse with time as moving parts go further out of alignment.
> Thus, as I see it, any UREs on a modern drive represent a "Don't trust
> me--Replace me NOW" flag. I could be all wrong here, but this is what
> I've read, and seen in manufacturer videos from WD and Seagate.
>

That's very useful information to know - I don't go through nearly
enough disks myself to be able to judge these things (and while I read
lots of stuff on the web, I don't see /everything/ !). Thanks.

However, this still sounds to me like a drive with UREs is dying but not
dead yet. Assuming you are correct here (and I've no reason to doubt
that - unless someone else disagrees), it means that a disk with UREs
will be dying quickly rather than dying slowly. But if the non-URE data
on the disk can be used to make a rebuild faster and safer, then surely
that is worth doing?

It may be that when a disk has had an URE and therefore an entry in the
bad block list, then it should be marked read-only and only used for
data recovery and "hot replace" rebuilds. But until it completely
croaks, it is still better than no disk at all while the rebuild is in
progress.


>> Supposing you have a RAID6 array, and one disk has died completely. It
>> gets replaced by a hot spare, and rebuild begins. As the rebuild
>> progresses, disk 1 gets an URE. Traditional handling would mean disk 1
> is ejected, and now you have a double-degraded RAID6 to rebuild. When
>> you later get an URE on disk 2, you have lost data for that stripe - and
>> the whole raid is gone.
>>
>> But with bad block lists, the URE on disk 1 leads to a bad block entry
>> on disk 1, and the rebuild continues. When you later get an URE on disk
>> 2, it's no problem - you use data from disk 1 and the other disks. URE's
>> are no longer a killer unless your set has no redundancy.
>
> They're not a killer with RAID 6 anyway, are they? You can be
> rebuilding one failed drive and suffer UREs left and right, as long as
> you don't get two of them on two drives simultaneously in the same
> stripe block read. I think that's right. Please correct me if not.
>

That's true as long as UREs do not cause that disk to be kicked out of
the array. With bad block support in md raid, a disk suffering an URE
will /not/ be kicked out. But my understanding (from what you wrote
above) was that with hardware raid controllers, an URE /would/ cause a
disk to be kicked out. Or am I mixing something up again?

>> URE's are also what I worry about with RAID1 (including RAID10)
>> rebuilds. If a disk has failed, you are right in saying that the
>> chances of the second disk in the pair failing completely are tiny. But
>> the chances of getting an URE on the second disk during the rebuild are
>> not negligible - they are small, but growing with each new jump in disk
>> size.
>
> I touched on this in my other reply, somewhat tongue-in-cheek mentioning
> 3 leg and 4 leg RAID10. At current capacities and URE ratings I'm not
> worried about it with mirror pairs. If URE ratings haven't increased
> substantially by the time our avg drive capacity hits 10TB I'll start to
> worry.
>
> Somewhat related to this, does anyone else here build their arrays from the
> smallest capacity drives they can get away with, preferably single platter
> models when possible? I adopted this strategy quite some time ago,
> mostly to keep rebuild times to a minimum and rotational mass (and thus
> energy use) low since I'm using more drives, but also with the URE
> issue in the back of my mind. Anecdotal evidence tends to point to the
> trend of OPs going with fewer gargantuan drives instead of many smaller
> ones. Maybe that's just members of this list, whose criteria may be
> quite different from the typical enterprise data center.
>
>> With md raid's future bad block lists and hot replace features, then an
>> URE on the second disk during rebuilds is only a problem if the first
>> disk has died completely - if it only had a small problem, then the "hot
>> replace" rebuild will be able to use both disks to find the data.
>
> What happens when you have multiple drives at the same or similar bad
> block count?
>

You replace them all. Once a drive reaches a certain number of bad
blocks (and that threshold may be just 1, or it may be more), you should
replace it. There isn't any reason not to do hot replace rebuilds on
multiple drives simultaneously, if you've got the drives and drive bays
on hand - apart from the bad blocks, the replacement is just a
straight disk to disk copy.

>> I know you are more interested in hardware raid than software raid, but
>> I'm sure you'll find some interesting points in Neil's writings. If you
>> don't want to read through the thread, at least read his blog post.
>>
>>
>
> Will catch up. Thanks for the blog link.
>


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server? GPFS w/ 10GB/s throughput to the rescue

am 27.02.2011 00:54:54 von Stan Hoeppner

Joe Landman put forth on 2/24/2011 3:20 PM:

> All this said, its better to express your IO bandwidth needs in MB/s,
> preferably in terms of sustained bandwidth needs, as this is language
> that you'd be talking to vendors in.

Heartily agree.

> that gets you 50x 117 MB/s or about 5.9 GB/s sustained bandwidth for
> your IO. 10 machines running at a sustainable 600 MB/s delivered over
> the network, and a parallel file system atop this, solves this problem.

That's 1 file server for every 5 compute nodes, Joe. That is excessive.
Your business is selling these storage servers, so I can understand this
recommendation. What cost is Matt looking at for these 10 storage
servers? $8-15k apiece? $80-150K total, not including installation,
maintenance, service contract, or administration training? And these
require a cluster file system. I'm guessing that's in the territory of
quotes he's already received from NetApp et al.

In that case it makes more sense to simply use direct attached storage
in each compute node at marginal additional cost, and a truly scalable
parallel filesystem across the compute nodes, IBM's GPFS. This will
give better aggregate performance at substantially lower cost, and
likely with much easier filesystem administration.

Matt, if a parallel cluster file system is in your cards, and it very
well may be, the very best way to achieve your storage bandwidth goal
would be leveraging direct attached disks in each compute node, your
existing GbE network, and using IBM GPFS as your parallel cluster
filesystem. I'd recommend using IBM 1U servers with 4 disk bays of
146GB 10k SAS drives in hardware RAID 10 (it's built in--free). With 50
compute nodes, this will give you over 10GB/s aggregate disk bandwidth,
over 200MB/s per node. Using these 146GB 2.5" drives you'd have ~14TB
of GPFS storage and can push/pull over 5GB/s of GPFS throughput over
TCP/IP. Throughput will likely be limited by the network, not the disks.

Each 1U server has dual GbE ports, allowing each node's application to
read 100MB/s from the GPFS while the node is simultaneously serving
100MB/s to all the other nodes, with full network redundancy in the
event a single NIC or switch should fail in one of your redundant
ethernet segments. Or, you could bond the NICs, without fail over, for
over 200MB/s full duplex, giving you aggregate GPFS throughput of
between 6-10GB/s depending on actual workload access patterns.
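
The arithmetic behind those aggregate figures, as a quick sketch (the
per-node rates below are my assumptions, not measurements):

nodes = 50
per_node_disk_mb_s = 200   # assumed streaming rate of 4 x 10k SAS in RAID10
per_node_net_mb_s = 100    # one GbE port's worth of GPFS traffic per node

print("aggregate disk bandwidth    :", nodes * per_node_disk_mb_s / 1000, "GB/s")  # ~10 GB/s
print("network-limited GPFS traffic:", nodes * per_node_net_mb_s / 1000, "GB/s")   # ~5 GB/s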

Your only additional cost here over the base compute node is 4 drives at
~$1000, the GPFS licensing, and consulting fees to IBM Global Services
for setup and training, and maybe another GbE switch or two. This
system is completely scalable. Each time you add a compute node you add
another 100-200MB/s+ of GPFS bandwidth to the cluster, at minimal cost.
I have no idea what IBM GPFS licensing costs are. My wild ass guess
would be a couple hundred dollars per node, which is pretty reasonable
considering the capability it gives you, and the cost savings over other
solutions.

You should make an appointment with IBM Global Services to visit your
site, go over your needs and budget, and make a recommendation or two.
Request they send a GPFS educated engineer along on the call. Express
that you're looking at the architecture I've described. They may have a
better solution given your workload and cost criteria. The key thing is
that you need to get as much information as possible at this point so
have the best options going forward.

Here's an appropriate IBM compute cluster node:
http://www-304.ibm.com/shop/americas/webapp/wcs/stores/servlet/default/ProductDisplay?productId=4611686018425930325&storeId=1&langId=-1&categoryId=4611686018425272306&dualCurrId=73&catalogId=-840

1U rack chassis
Xeon X3430 - 2.4 GHz, 4 core, 8MB cache
8GB DDR3
dual 10/100/1000 Ethernet
4 x 146GB 10k rpm SAS hot swap, RAID10

IBM web price per single unit: ~$3,100
If buying volume in one PO: ~$2,500 or less through a wholesaler

Hope this information is helpful Matt.

--
Stan

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server? GPFS w/ 10GB/s throughput to the rescue

am 27.02.2011 01:56:52 von Joe Landman

On 02/26/2011 06:54 PM, Stan Hoeppner wrote:
> Joe Landman put forth on 2/24/2011 3:20 PM:

[...]

>> that gets you 50x 117 MB/s or about 5.9 GB/s sustained bandwidth for
>> your IO. 10 machines running at a sustainable 600 MB/s delivered over
>> the network, and a parallel file system atop this, solves this problem.
>
> That's 1 file server for each 5 compute nodes Joe. That is excessive.

No Stan, it isn't. As I said, this is our market, we know it pretty
well. Matt stated his needs pretty clearly.

He needs 5.9GB/s sustained bandwidth. Local drives (as you suggested
later on) will deliver 75-100 MB/s of bandwidth, and he'd need 2 for
RAID1, as well as a RAID0 (e.g. RAID10) for local bandwidth (150+ MB/s).
4 drives per unit, 50 units. 200 drives.

Any admin want to admin 200+ drives in 50 chassis? Admin 50 different
file systems?

Oh, and what is the impact if some of those nodes went away? Would they
take down the file system? In the cloud of microdisk model Stan
suggested, yes they would. Which is why you might not want to give that
advice serious consideration. Unless you built in replication. Now we
are at 400 disks in 50 chassis.

Again, this design keeps getting worse.

> Your business is selling these storage servers, so I can understand this
> recommendation. What cost is Matt looking at for these 10 storage

Now this is sad, very sad.

Stan started out selling the Nexsan version of things (and why was he
doing it on the MD RAID list I wonder?), which would have run into the
same costs Stan noted later. Now Stan is selling (actually mis-selling)
GPFS (again, on an MD RAID list, seemingly having picked it off of a
website), without having a clue as to the pricing, implementation,
issues, etc.

> servers? $8-15k apiece? $80-150K total, not including installation,
> maintenance, service contract, or administration training? And these
> require a cluster file system. I'm guessing that's in the territory of
> quotes he's already received from NetApp et al.

I did suggest using GlusterFS as it will help with a number of aspects,
has an open source version. I did also suggest (since he seems to wish
to build it himself) that he pursue a reasonable design to start with,
and avoid the filer based designs Stan suggested (two Nexsan's and some
sort of filer head to handle them), or a SAN switch of some sort.
Neither design works well in his scenario, or for that matter, in the
vast majority of HPC situations.

I did make a full disclosure of my interests up front, and people are
free to take my words with a grain of salt. Insinuating based upon my
disclosure? Sad.


> In that case it makes more sense to simply use direct attached storage
> in each compute node at marginal additional cost, and a truly scalable
> parallel filesystem across the compute nodes, IBM's GPFS. This will
> give better aggregate performance at substantially lower cost, and
> likely with much easier filesystem administration.

See GlusterFS. Open source at zero cost. However, and this is a large
however, this design, using local storage for a pooled "cloud" of disks,
has some often problematic issues (resiliency, performance, hotspots).
A truly hobby design would use this. Local disk is fine for scratch
space, for a few other things. Managing the disk spread out among 50
nodes? Yeah, its harder.

I'm gonna go out on a limb here and suggest Matt speak with HPC cluster
and storage people. He can implement things ranging from effectively
zero cost through things which can be quite expensive. If you are
talking to Netapp about HPC storage, well, probably move onto a real HPC
storage shop. His problem is squarely in the HPC arena.

However, I would strongly advise against designs such as a single
centralized unit, or a cloud of micro disks. The first design is
decidedly non-scalable, which is in part why the HPC community abandoned
it years ago. The second design is very hard to manage and guarantee
any sort of resiliency. You get all the benefits of a RAID0 in what
Stan proposed.

Start out talking with and working with experts, and its pretty likely
you'll come out with a good solution. The inverse is also true.

MD RAID, which Stan dismissed as a "hobby RAID" at first, can work well
for Matt. GlusterFS can help with the parallel file system atop this.
Starting with a realistic design, an MD RAID based system (self built or
otherwise) could easily provide everything Matt needs, at the data rates
he needs it, using entirely open source technologies. And good designs.

You really won't get good performance out of a bad design. The folks
doing HPC work who've responded have largely helped frame good design
patterns. The folks who aren't sure what HPC really is, haven't.

Regards,

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server? GPFS w/ 10GB/s throughput to the rescue

am 27.02.2011 15:55:56 von Stan Hoeppner

Joe Landman put forth on 2/26/2011 6:56 PM:

> Local drives (as you suggested
> later on) will deliver 75-100 MB/s of bandwidth, and he'd need 2 for
> RAID1, as well as a RAID0 (e.g. RAID10) for local bandwidth (150+ MB/s).
> 4 drives per unit, 50 units. 200 drives.

Yes, this is pretty much exactly what I mentioned. ~5GB/s aggregate.
But we've still not received an accurate detailed description from Matt
regarding his actual performance needs. He's not posted iostat numbers
from his current filer, or any similar metrics.

> Any admin want to admin 200+ drives in 50 chassis? Admin 50 different
> file systems?

GPFS has single point administration for all storage in all nodes.

> Oh, and what is the impact if some of those nodes went away? Would they
> take down the file system? In the cloud of microdisk model Stan
> suggested, yes they would.

No, they would not. GPFS has multiple redundancy mechanisms and can
sustain multiple node failures. I think you should read the GPFS
introductory documentation:

http://www.ibm.com/common/ssi/fcgi-bin/ssialias?infotype=SA&subtype=WH&appname=STGE_XB_XB_USEN&htmlfid=XBW03010USEN&attachment=XBW03010USEN.PDF

> Which is why you might not want to give that
> advice serious consideration. Unless you built in replication. Now we
> are at 400 disks in 50 chassis.

Your numbers are wrong, by a factor of 2. He should research GPFS and
give it serious consideration. It may be exactly what he needs.

> Again, this design keeps getting worse.

Actually it's getting better, which you'll see after reading the docs.

> Now this is sad, very sad.
>
> Stan started out selling the Nexsan version of things (and why was he

For the record, I'm not selling anything. I don't have a $$ horse in
this race. I'm simply trying to show Matt some good options. I don't
work for any company selling anything. I'm just an SA, giving free
advice to another SA with regard to his request for information. I just
happen to know a lot more about high performance storage than the
average SA. I recommend Nexsan products because I've used them, they
work very well, and are very competitive WRT price/performance/capacity.

> doing it on the MD RAID list I wonder?),

The OP asked for possible solutions to solve for his need. This need
may not necessarily be best met by mdraid, regardless of the fact he
asked on the Linux RAID list. LED identification of a failed drive is
enough reason for me to not recommend mdraid in this solution, given the
fact he'll only have 4 disks per chassis w/an inbuilt hardware RAID
chip. I'm guessing fault LED is one of the reasons why you use a
combination of PCIe RAID cards and mdraid in your JackRabbit and Delta-V
systems instead of strictly mdraid. I'm not knocking it. That's the
only way to do it properly on such systems. Likewise, please don't
knock me for recommending the obvious better solution in this case.
mdraid would have not materially positive impact, but would introduce
maintenance problems.

> which would have run into the
> same costs Stan noted later. Now Stan is selling (actually mis-selling)
> GPFS (again, on an MD RAID list, seemingly having picked it off of a
> website), without having a clue as to the pricing, implementation,
> issues, etc.

I first learned of GPFS in 2001 when it was deployed on the 256 node IBM
Netfinity dual P3 933 Myrinet cluster at Maui High Performance Computing
Center. GPFS was deployed in this cluster using what is currently
called the Network Shared Disk protocol, spanning the 512 local disks.
GPFS has grown and matured significantly in the 10 years since. Today
it is most commonly deployed with a dedicated file server node farm
architecture, but it still works just as well using NSD. In the
configuration I suggested, each node will be an NSD client and NSD
server. GPFS is renowned for its reliability and performance in the
world of HPC cluster computing due to its excellent 10+ year track
record in the field. It is years ahead of any other cluster filesystem
in capability, performance, manageability, and reliability.

> I did suggest using GlusterFS as it will help with a number of aspects,
> has an open source version. I did also suggest (since he seems to wish
> to build it himself) that he pursue a reasonable design to start with,

I don't believe his desire is to actually DIY the compute and/or storage
nodes. If it is, for a production system of this size/caliber, *I*
wouldn't DIY in this case, and I'm the king of DIY hardware. Actually,
I'm TheHardwareFreak. ;) I guess you've missed the RHS of my email
addy. :) I was given that nickname, flattering or not, about 15 years
ago. Obviously it stuck. It's been my vanity domain for quite a few years.

> and avoid the filer based designs Stan suggested (two Nexsan's and some
> sort of filer head to handle them), or a SAN switch of some sort.

There's nothing wrong with a single filer, just because it's a single
filer. I'm sure you've sold some singles. They can be very performant.
I could build a single DIY 10 GbE filer today from white box parts
using JBOD enclosures that could push highly parallel NFS client reads
at ~4GB/s all day long, about double the performance of your JackRabbit
5U. It would take me some time to tune PCIe interrupt routing, TCP, NFS
server threading, etc, but it can be done. Basic parts list would be
something like:

1 x SuperMicro H8DG6 w/dual 8 core 2GHz Optys, 8x4GB DDR3 ECC RDIMMs
3 x LSI MegaRAID SAS 9280-4i4e PCIe x8 512MB cache
1 x NIAGARA 32714L Quad Port Fiber 10 Gigabit Ethernet NIC
1 x SUPERMICRO CSE-825TQ-R700LPB Black 2U Rackmount 700W redundant PSU
3 x NORCO DS-24E External 4U 24 Bay 6G SAS w/LSI 4x6 SAS expander
74 x Seagate ST3300657SS 15K 300GB 6Gb/s SAS, 2 boot, 72 in JBOD chassis
Configure 24 drive HW RAID6 on each LSI HBA, mdraid linear over them
Format the mdraid device with mkfs.xfs with "-d agcount=66"

With this setup the disks will saturate the 12 SAS host channels at
7.2GB/s aggregate with concurrent parallel streaming reads, as each 22
drive RAID6 will be able to push over 3GB/s with 15k drives. This
excess of disk bandwidth, and high random IOPS of the 15k drives,
ensures that highly random read loads from many concurrent NFS clients
will still hit in the 4GB/s range, again, after the system has been
properly tuned.
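
A quick sketch of where the 7.2GB/s and 3GB/s figures above come from (the
per-lane and per-drive rates are my working assumptions):

sas_lanes = 3 * 4     # three HBAs, one 4-lane 6Gb/s SAS connection each
lane_mb_s = 600       # approximate usable payload per 6Gb/s SAS lane
data_drives = 22      # a 24-drive RAID6 leaves 22 data spindles
drive_mb_s = 150      # assumed streaming read of a 15k 300GB SAS drive

print("SAS host-side bandwidth :", sas_lanes * lane_mb_s / 1000, "GB/s")    # 7.2 GB/s
print("per-RAID6 streaming read:", data_drives * drive_mb_s / 1000, "GB/s") # ~3.3 GB/s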

> Neither design works well in his scenario, or for that matter, in the
> vast majority of HPC situations.

Why don't you ask Matt, as I have, for an actual, accurate description
of his workload. What we've been given isn't an accurate description.
If it was, his current production systems would be so overwhelmed he'd
already be writing checks for new gear. I've seen no iostat or other
metrics, which are standard fare when asking for this kind of advice.

> I did make a full disclosure of my interests up front, and people are
> free to take my words with a grain of salt. Insinuating based upon my
> disclosure? Sad.

It just seems to me you're too willing to oversell him. He apparently
doesn't have that kind of budget anyway. If we, you, me, anyone, really
wants to give Matt good advice, regardless of how much you might profit,
or mere satisfaction I may gain because one of my suggestions was
implemented, why don't we both agree to get as much information as
possible from Matt before making any more recommendations?

I think we've both forgotten once or twice in this thread that it's not
about us, but about Matt's requirement.

> See GlusterFS. Open source at zero cost. However, and this is a large
> however, this design, using local storage for a pooled "cloud" of disks,
> has some often problematic issues (resiliency, performance, hotspots). A
> truly hobby design would use this. Local disk is fine for scratch
> space, for a few other things. Managing the disk spread out among 50
> nodes? Yeah, its harder.

Gluster isn't designed as a high performance parallel filesystem. It
was never meant to be such. There are guys on the dovecot list who have
tried it as a maildir store and it just falls over. It simply cannot
handle random IO workloads, period. And yes, it is difficult to design
a high performance parallel network based filesystem. Much so. IBM has
a massive lead on the other cluster filesystems as IBM started work back
in the mid/late 90s for their Power clusters.

> I'm gonna go out on a limb here and suggest Matt speak with HPC cluster
> and storage people. He can implement things ranging from effectively
> zero cost through things which can be quite expensive. If you are
> talking to Netapp about HPC storage, well, probably move onto a real HPC
> storage shop. His problem is squarely in the HPC arena.

I'm still not convinced of that. Simply stating "I have 50 compute
nodes each w/one GbE port, so I need 6GB/s of bandwidth" isn't actual
application workload data. From what Matt did describe of how the
application behaves, simply time shifting the data access will likely
solve all of his problems, cheaply. He might even be able to get by
with his current filer. We simply need more information. I do anyway.
I'd hope you would as well.

> However, I would strongly advise against designs such as a single
> centralized unit, or a cloud of micro disks. The first design is
> decidedly non-scalable, which is in part why the HPC community abandoned
> it years ago. The second design is very hard to manage and guarantee
> any sort of resiliency. You get all the benefits of a RAID0 in what
> Stan proposed.

A single system filer is scalable up to the point you run out of PCIe
slots. The system I mentioned using the Nexsan array can scale 3x
before running out of slots.

I think some folks at IBM would tend to vehemently disagree with your
assertions here about GPFS. :) It's the only filesystem used on IBM's
pSeries clusters and supercomputers. I'd wager that IBM has shipped
more GPFS nodes into the HPC marketplace than Joe's company has shipped
nodes, total, ever, into any market, or ever will, by a factor of at
least 100.

This isn't really a fair comparison, as IBM has shipped single GPFS
supercomputers with more nodes than Joe's company will sell in its
entire lifespan. Case in point: ASCI Purple has 1640 GPFS client
nodes, and 134 GPFS server nodes. This machine ships GPFS traffic over
the IBM HPS network at 4GB/s per node link, each node having two links
for 8GB/s per client node--a tad faster than GbE. ;).

For this environment, and most HPC "centers", using a few fat GPFS
storage servers with hundreds of terabytes of direct attached fiber
channel storage makes more sense than deploying every compute node as a
GPFS client *and* server using local disk. In Matt's case it makes more
sense to do the latter, called NSD.

For the curious, here are the details of the $140 million ASCI Purple
system including the GPFS setup:
https://computing.llnl.gov/tutorials/purple/

> Start out talking with and working with experts, and its pretty likely
> you'll come out with a good solution. The inverse is also true.

If by experts you mean those working in the HPC field, not vendors,
that's a great idea. Matt, fire off a short polite email to Jack
Dongarra and one to Bill Camp. Dr. Dongarra is the primary author of
the Linpack benchmark, which is used to rate the 500 fastest
supercomputers in the world twice yearly, among other things. His name
is probably the most well known in the field of supercomputing.

Bill Camp designed the Red Storm supercomputer, which is now the
architectural basis for Cray's large MPP supercomputers. He works for
Sandia National Laboratory, which is one of the 4 US nuclear weapons
laboratories.

If neither of these two men has an answer for you, nor can point you to
folks who do, the answer simply doesn't exist. Out of consideration I'm
not going to post their email addresses. You can find them at the
following locations. While you're at it, read the Red Storm document.
It's very interesting.

http://www.netlib.org/utk/people/JackDongarra/

http://www.google.com/url?sa=t&source=web&cd=3&ved=0CCEQFjAC&url=http%3A%2F%2Fwww.lanl.gov%2Forgs%2Fhpc%2Fsalishan%2Fsalishan2003%2Fcamp.pdf&rct=j&q=bill%20camp%20asci%20red&ei=VxRqTdTuEYOClAf4xKH_AQ&usg=AFQjCNFl420n6HAwBkDs5AFBU2TKpsiHvA&cad=rja

I've not corresponded with Professor Dongarra for many years, but back
then he always answered my emails rather promptly, within a day or two.
The key is to keep it short and sweet, as the man is pretty busy I'd
guess. I've never corresponded with Dr. Camp, but I'm sure he'd respond
to you, one way or another. My experience is that technical people
enjoy talking tech shop, at least to a degree.

> MD RAID, which Stan dismissed as a "hobby RAID" at first can work well

That's a mis-characterization of the statement I made.

> for Matt. GlusterFS can help with the parallel file system atop this.
> Starting with a realistic design, an MD RAID based system (self built or
> otherwise) could easily provide everything Matt needs, at the data rates
> he needs it, using entirely open source technologies. And good designs.

I don't recall Matt saying he needed a solution based entirely on FOSS.
If he did I missed it. If he can accomplish his goals with all FOSS
that's always a plus in my book. However, I'm not averse to closed
source when it's a better fit for a requirement.

> You really won't get good performance out of a bad design. The folks

That's brilliant insight. ;)

> doing HPC work who've responded have largely helped frame good design
> patterns. The folks who aren't sure what HPC really is, haven't.

The folks who use the term HPC as a catch all, speaking as if there is
one workload pattern, or only one file access pattern which comprises
HPC, as Joe continues to do, and who attempt to tell others they don't
know what they're talking about, when they most certainly do, should be
viewed with some skepticism.

Just as in the business sector, there are many widely varied workloads
in the HPC space. At opposite ends of the disk access spectrum,
analysis applications tend to read a lot and write very little.
Simulation applications, on the other hand, tend to read very little,
and generate a tremendous amount of output. For each of these, some
benefit greatly from highly parallel communication and disk throughput,
some don't. Some benefit from extreme parallelism, and benefit from
using message passing and Lustre file access over infiniband, some with
lots of serialization don't. Some may benefit from openmp parallelism
but only mild amounts of disk parallelism. In summary, there are many
shades of HPC.

For maximum performance and ROI, just as in the business or any other
computing world, one needs to optimize his compute and storage system to
meet his particular workload. There isn't one size that fits all.
Thus, contrary to what Joe may have anyone here believe, NFS filers are
a perfect fit for some HPC workloads. For Joe to say that any workload
that works fine with an NFS filer isn't an HPC workload is simply
rubbish. One need look no further than a little ways back in this
thread to see this. On the one hand, Joe says Matt's workload is absolutely
an HPC workload. Matt currently uses an NFS filer for this workload.
Thus, Joe would say this isn't an HPC workload because it's working fine
with an NFS filer. Just a bit of self-contradiction there.

Instead of arguing what is and is not HPC, and arguing that Matt's
workload is "an HPC workload", I think, again, that nailing down his
exact data access profile and making a recommendation on that, is what
he needs. I'm betting he could care less if his workload is "an HPC
workload" or not. I'm starting to tire of this thread. Matt has plenty
of conflicting information to sort out. I'll be glad to answer any
questions he may have of me.

--
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 27.02.2011 22:30:48 von Ed W

Your application appears to be an implementation of a queue processing
system? ie each machine: pulls a file down, processes it, gets the next
file, etc?

Can you share some information on
- the size of files you pull down (I saw something in another post)
- how long each machine takes to process each file
- whether there is any dependency between the processing machines? eg
can each machine operate completely independently of the others and
start its job when it wishes (or does it need to sync?)

Given the tentative assumption that
- processing each file takes many multiples of the time needed to
download the file, and
- files are processed independently

It would appear that you can use a much lower powered system to
basically push jobs out to the processing machines in advance; this way
your bandwidth basically only needs to be:
size_of_job * num_machines / time_to_process_jobs
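
Plugging in made-up numbers just to show the shape of it - a 500 MB job,
50 machines, and 5 minutes of processing per job:

size_of_job_mb = 500
num_machines = 50
time_to_process_s = 300
print(round(size_of_job_mb * num_machines / time_to_process_s), "MB/s")  # prints 83 MB/s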

So if the time to process jobs is significant then you have quite some
time to push out the next job to local storage ready?

Firstly is this architecture workable? If so then you have some new
performance parameters to target for the storage architecture?

Good luck

Ed W
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 28.02.2011 16:46:06 von Joe Landman

On 02/27/2011 04:30 PM, Ed W wrote:

[...]

> It would appear that you can use a much lower powered system to
> basically push jobs out to the processing machines in advance, this way
> your bandwidth basically only needs to be:
> size_of_job * num_machines / time_to_process_jobs

This would be good. Matt's original argument suggested he needed this
as his sustained bandwidth given the way the analysis proceeded.

If we assume that the processing time is T_p, and the communication time
is T_c, ignoring other factors, the total time for 1 job is T_j = T_p +
T_c. If T_c << T_p, then you can effectively ignore bandwidth related
issues (and use a much smaller bandwidth system). For T_c << T_p, let's
(for laughs) say T_c = 0.1 x T_p (e.g. communication time is 1/10th the
processing time). Then even if you halved your bandwidth, and doubled
T_c, you are making only about a 10% increase in your total execution
time for a job.

With Nmachines each with Ncores, you have Nmachines x Ncores jobs going
on all at once. If T_c << T_p (as in the above example), then most of
the time, on average, the machines will not be communicating. In fact,
if we do a very rough first pass approximation to an answer (there are
more accurate statistical models) for this, one would expect the network
to be used T_c/T_p fraction of the time by each process. Then the total
consumption of data for a run (assuming all runs are *approximately* of
equal duration)

D = B x T_c

D being the amount of data in MB or GB, and B being the bandwidth
expressed in MB/s or GB/s. Your effective bandwidth per run, Beff will be

D = Beff x T = Beff x (T_c + T_p)

For Nmachines x Ncores jobs, Dtotal is the total data transferred

Dtotal = Nmachines x Ncores x D = Nmachines x Ncores x Beff x (T_c + T_p)


You know Dtotal (aggregate data needed for run). You know Nmachines and
Ncores. You know T_c and T_p (approximately). From this, solve for
Beff. That's what you have to sustain (approximately).
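
A minimal sketch of that last step (the function name and example numbers
are mine, just to show the shape of the calculation):

def required_beff(dtotal_gb, nmachines, ncores, t_c, t_p):
    # Solve Dtotal = Nmachines x Ncores x Beff x (T_c + T_p) for Beff,
    # the effective per-job bandwidth that has to be sustained.
    return dtotal_gb / (nmachines * ncores * (t_c + t_p))

# Example: 50 machines x 4 cores, each job reads 0.5 GB (Dtotal = 100 GB),
# with T_c ~ 5 s of transfer against T_p ~ 300 s of processing.
beff = required_beff(dtotal_gb=100.0, nmachines=50, ncores=4, t_c=5.0, t_p=300.0)
print(f"per-job effective bandwidth  : {beff * 1000:.2f} MB/s")
print(f"aggregate sustained bandwidth: {beff * 1000 * 50 * 4:.0f} MB/s")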

> So if the time to process jobs is significant then you have quite some
> time to push out the next job to local storage ready?
>
> Firstly is this architecture workable? If so then you have some new
> performance parameters to target for the storage architecture?
>
> Good luck
>
> Ed W

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 28.02.2011 23:22:08 von Stan Hoeppner

Ed W put forth on 2/27/2011 3:30 PM:
> Your application appears to be an implementation of a queue processing
> system? ie each machine: pulls a file down, processes it, gets the next
> file, etc?
>
> Can you share some information on
> - the size of files you pull down (I saw something in another post)
> - how long each machine takes to process each file
> - whether there is any dependency between the processing machines? eg
> can each machine operate completely independently of the others and
> start it's job when it wishes (or does it need to sync?)
>
> Given the tentative assumption that
> - processing each file takes many multiples of the time needed to
> download the file, and
> - files are processed independently
>
> It would appear that you can use a much lower powered system to
> basically push jobs out to the processing machines in advance, this way
> your bandwidth basically only needs to be:
> size_of_job * num_machines / time_to_process_jobs
>
> So if the time to process jobs is significant then you have quite some
> time to push out the next job to local storage ready?
>
> Firstly is this architecture workable? If so then you have some new
> performance parameters to target for the storage architecture?
>
> Good luck

Ed, you stated this thought much more thoroughly and eloquently than I
did in my last rambling post. Thank you.

--
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 01.03.2011 00:14:20 von Stan Hoeppner

Joe Landman put forth on 2/28/2011 9:46 AM:
> On 02/27/2011 04:30 PM, Ed W wrote:
>
> [...]
>
>> It would appear that you can use a much lower powered system to
>> basically push jobs out to the processing machines in advance, this way
>> your bandwidth basically only needs to be:
>> size_of_job * num_machines / time_to_process_jobs
>
> This would be good. Matt's original argument suggested he needed this
> as his sustained bandwidth given the way the analysis proceeded.

And Joe has provided a nice mathematical model for quantifying it.

> If we assume that the processing time is T_p, and the communication time
> is T_c, ignoring other factors, the total time for 1 job is T_j = T_p +
> T_c. If T_c << T_p, then you can effectively ignore bandwidth related
> issues (and use a much smaller bandwidth system). For T_c << T_p, lets
> (for laughs) say T_c = 0.1 x T_p (e.g. communication time is 1/10th the
> processing time). Then even if you halved your bandwidth, and doubled
> T_c, you are making only an about 10% increase in your total execution
> time for a job.
>
> With Nmachines each with Ncores, you have Nmachines x Ncores jobs going
> on all at once. If T_c << T_p (as in the above example), then most of
> the time, on average, the machines will not be communicating. In fact,
> if we do a very rough first pass approximation to an answer (there are
> more accurate statistical models) for this, one would expect the network
> to be used T_c/T_p fraction of the time by each process. Then the total
> consumption of data for a run (assuming all runs are *approximately* of
> equal duration)
>
> D = B x T_c
>
> D being the amount of data in MB or GB, and B being the bandwidth
> expressed in MB/s or GB/s. Your effective bandwidth per run, Beff will be
>
> D = Beff x T = Beff x (T_c + T_p)
>
> For Nmachines x Ncores jobs, Dtotal is the total data transferred
>
> Dtotal = Nmachines x Ncores x D = Nmachines x Ncores x Beff x (T_c + T_p)
>
>
> You know Dtotal (aggregate data needed for run). You know Nmachines and
> Ncores. You know T_c and T_p (approximately). From this, solve for
> Beff. Thats what you have to sustain (approximately).

This assumes his application is threaded and scales linearly across
multiple cores. If not, running Ncores processes on each node should
achieve a similar result to the threaded case, assuming the application
is written such that multiple process instances don't trip over each
other by say, all using the same scratch file path/name, etc, etc.

--
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 02.03.2011 04:44:18 von Matt Garman

On Sun, Feb 27, 2011 at 3:30 PM, Ed W wrote:
> Your application appears to be an implementation of a queue processing
> system? ie each machine: pulls a file down, processes it, gets the next
> file, etc?

Sort of. It's not so much "each machine" as it is "each job". A
machine can have multiple jobs.

At this point I'm not exactly sure what the jobs' specifics are; that
is, not sure if a job reads a bunch of files at once, then processes;
or, reads one file, then processes (as you described).

> Can you share some information on
> - the size of files you pull down (I saw something in another post)

They vary; they can be anywhere from about 100 MB to a few TB.
Average is probably on the order of a few hundred MB.

> - how long each machine takes to process each file

I'm not sure how long a job takes to process a file; I'm trying to get
these answers from the people who design and run the jobs.

> - whether there is any dependency between the processing machines? eg can
> each machine operate completely independently of the others and start its
> job when it wishes (or does it need to sync?)

I'm fairly sure the jobs are independent.

> Given the tentative assumption that
> - processing each file takes many multiples of the time needed to download
> the file, and
> - files are processed independently
>
> It would appear that you can use a much lower powered system to basically
> push jobs out to the processing machines in advance; this way your bandwidth
> basically only needs to be:
>    size_of_job * num_machines / time_to_process_jobs
>
> So if the time to process jobs is significant then you have quite some time
> to push out the next job to local storage ready?
>
> Firstly is this architecture workable? If so then you have some new
> performance parameters to target for the storage architecture?

That might be workable, but it would require me (or someone) to
develop and deploy the job dispatching system. Which is certainly
doable, but it might meet some "political" resistance. My boss
basically said, "find a system to buy or spec out a system to build
that meets [the requirements I've mentioned in this and other
emails]."
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 02.03.2011 05:20:21 von Joe Landman

On 03/01/2011 10:44 PM, Matt Garman wrote:

[...]


> That might be workable, but it would require me (or someone) to
> develop and deploy the job dispatching system. Which is certainly

Happily, the "develop" part of this is already done. Have a look at
GridEngine, Torque, slurm, and a number of others (commercial versions
include the excellent LSF from Platform, PBSpro by Altair, and others).

> doable, but it might meet some "political" resistance. My boss
> basically said, "find a system to buy or spec out a system to build
> that meets [the requirements I've mentioned in this and other
> emails]."

This is wandering outside of the MD list focus. You might want to speak
with other folks on the Beowulf list, among others.

I should note that nothing you've brought up isn't a solvable problem.
You simply have some additional data to gather on the apps, some costs
to compare against the benefits they bring, and make decisions from
there. Build vs buy is one of the critical ones, but as Ed, myself and
others have noted, you do need more detail to make sure you don't
under(over) spec the design for the near/mid/far term.

Regards,

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 02.03.2011 08:10:01 von Roberto Spadim

why not supercapacitors to power safe and ram memories as ramdisks?
a backup solution could help on umount or backup

2011/3/2 Joe Landman :
> On 03/01/2011 10:44 PM, Matt Garman wrote:
>
> [...]
>
>
>> That might be workable, but it would require me (or someone) to
>> develop and deploy the job dispatching system. Which is certainly
>
> Happily, the "develop" part of this is already done. Have a look at
> GridEngine, Torque, slurm, and a number of others (commercial versions
> include the excellent LSF from Platform, PBSpro by Altair, and others).
>
>> doable, but it might meet some "political" resistance. My boss
>> basically said, "find a system to buy or spec out a system to build
>> that meets [the requirements I've mentioned in this and other
>> emails]."
>
> This is wandering outside of the MD list focus. You might want to speak
> with other folks on the Beowulf list, among others.
>
> I should note that nothing you've brought up isn't a solvable problem. You
> simply have some additional data to gather on the apps, some costs to
> compare against the benefits they bring, and make decisions from there.
> Build vs buy is one of the critical ones, but as Ed, myself and others have
> noted, you do need more detail to make sure you don't under(over) spec the
> design for the near/mid/far term.
>
> Regards,
>
> Joe
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics, Inc.
> email: landman@scalableinformatics.com
> web  : http://scalableinformatics.com
>        http://scalableinformatics.com/sicluster
> phone: +1 734 786 8423 x121
> fax  : +1 866 888 3112
> cell : +1 734 612 4615
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>



--
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 02.03.2011 20:03:14 von Drew

> why not supercapacitors to power safe and ram memories as ramdisks?
> a backup solution could help on umount or backup

Huh?
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 02.03.2011 20:20:35 von Roberto Spadim

=) high throughput
no ssd, no harddisk for the main data, only ram, and a good ups system
with supercapacitors (not for the cpu, just the ram disks); could use 2.5V
2500F capacitors
ddr3 memory has >=10000gb/s; use a 6gbit SAS channel for each ram disk

and over time, take the ram disk contents and save them to harddisks (backup
only, not online data; some filesystems have snapshots, could use it)
ram is more expensive than ssd and harddisk, but it is faster, and with a
good ups it's less volatile (some hours without computer power
supply)


2011/3/2 Drew :
>> why not supercapacitors to power safe and ram memories as ramdisks?
>> a backup solution could help on umount or backup
>
> Huh?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>



--
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server? GPFS w/ 10GB/s throughput to the rescue

am 12.03.2011 23:49:44 von Matt Garman

Sorry again for the delayed response... it takes me a while to read
through all these and process them. :) I do appreciate all the
feedback though!

On Sun, Feb 27, 2011 at 8:55 AM, Stan Hoeppner wrote:
> Yes, this is pretty much exactly what I mentioned. ~5GB/s aggregate.
> But we've still not received an accurate detailed description from Matt
> regarding his actual performance needs. He's not posted iostat numbers
> from his current filer, or any similar metrics.

Accurate metrics are hard to determine. I did run iostat for 24 hours
on a few servers, but I don't think the results give an accurate
picture of what we really need. Here's the details on what we have
now:

We currently have 10 servers, each with an NFS share. Each server
mounts every other NFS share; mountpoints are consistently named on
every server (and a server's local storage is a symlink named like its
mountpoint on other machines). One server has a huge directory of
symbolic links that acts as the "database" or "index" to all the files
spread across all 10 servers.

We spent some time a while ago creating a semi-smart distribution of
the files. In short, we basically round-robin'ed files in such a way
as to parallelize bulk reads across many servers.

The current system works, but is (as others have suggested), not
particularly scalable. When we add new servers, I have to
re-distribute those files across the new servers.

That, and these storage servers are dual-purposed; they are also used
as analysis servers---basically batch computation jobs that use this
data. The folks who run the analysis programs look at the machine
load to determine how many analysis jobs to run. So when all machines
are running analysis jobs, the machine load is a combination of both
the CPU load from these analysis programs AND the I/O load from
serving files. In other words, if these machines were strictly
compute servers, they would in general show a lower load, and thus
would run even more programs.

Having said all that, I picked a few of the 10 NFS/compute servers and
ran iostat for 24 hours, reporting stats every 1 minute (FYI, this is
actually what Dell asks you to do if you inquire about their storage
solutions). The results from all machines were (as expected)
virtually the same. They average constant, continuous reads at about
3--4 MB/s. You might take that info and say, 4 MB/s times 10
machines, that's only 40 MB/s... that's nothing, not even the full
bandwidth of a single gigabit ethernet connection. But there are
several problems (1) the number of analysis jobs is currently
artificially limited; (2) the file distribution is smart enough that
NFS load is balanced across all 10 machines; and (3) there are
currently about 15 machines doing analysis jobs (10 are dual-purposed
as I already mentioned), but this number is expected to grow to 40 or
50 within the year.

Given all that, I have simplified the requirements as follows: I want
"something" that is capable of keeping the gigabit connections of
those 50 analysis machines saturated at all times. There have been
several suggestions along the lines of smart job scheduling and the
like. However, the thing is, these analysis jobs are custom---they
are constantly being modified, new ones created, and old ones retired.
Meaning, the access patterns are somewhat dynamic, and will certainly
change over time. Our current "smart" file distribution is just based
on the general case of maybe 50% of the analysis programs' access
patterns. But next week someone could come up with a new analysis
program that makes our current file distribution "stupid". The point
is, current access patterns are somewhat meaningless, because they are
all but guaranteed to change. So what do we do? For business
reasons, any surplus manpower needs to be focused on these analysis
jobs; we don't have the resources to constantly adjust job scheduling
and file distribution.

So I think we are truly trying to solve the most general case here,
which is that all 50 gigabit-connected servers will be continuously
requesting data in an arbitrary fashion.

This is definitely a solvable problem; and there are multiple options;
I'm in the learning stage right now, so hopefully I can make a good
decision about which solution is best for our particular case. I
solicited the list because I had the impression that there were at
least a few people who have built and/or administer systems like this.
And clearly there are people with exactly this experience, given the
feedback I've received! So I've learned a lot, which is exactly what
I wanted in the first place.

> http://www.ibm.com/common/ssi/fcgi-bin/ssialias?infotype=SA&subtype=WH&appname=STGE_XB_XB_USEN&htmlfid=XBW03010USEN&attachment=XBW03010USEN.PDF
>
> Your numbers are wrong, by a factor of 2. He should research GPFS and
> give it serious consideration. It may be exactly what he needs.

I'll definitely look over that.

> I don't believe his desire is to actually DIY the compute and/or storage
> nodes. If it is, for a production system of this size/caliber, *I*
> wouldn't DIY in this case, and I'm the king of DIY hardware. Actually,
> I'm TheHardwareFreak. ;) I guess you've missed the RHS of my email
> addy. :) I was given that nickname, flattering or not, about 15 years
> ago. Obviously it stuck. It's been my vanity domain for quite a few years.

I'm now leaning towards a purchased solution, mainly due to the fact
that it seems like a DIY solution would cost a lot more in terms of my
time. Expensive though they are, one of the nicer things about the
vendor solutions is that they seem to provide somewhat of a "set it
and forget it" experience. Of course, a system like this needs
routine maintenance and such, but the vendors claim their
solutions simplify that. But maybe that's just marketspeak! :)
Although I think there's some truth to it---I've been a Linux/DIY
enthusiast/hobbyist for years now, and my experience is that the
DIY/FOSS stuff always takes more individual effort. It's fun to do at
home, but can be costly from a business perspective...

> Why don't you ask Matt, as I have, for an actual, accurate description
> of his workload. What we've been given isn't an accurate description.
> If it was, his current production systems would be so overwhelmed he'd
> already be writing checks for new gear. I've seen no iostat or other
> metrics, which are standard fare when asking for this kind of advice.

Hopefully my description above sheds a little more light on what we
need. Ignoring smarter job scheduling and such, I want to solve the
worst-case scenario, which is 50 servers all requesting enough data to
saturate their gigabit network connections.

> I'm still not convinced of that. Simply stating "I have 50 compute
> nodes each w/one GbE port, so I need 6GB/s of bandwidth" isn't actual
> application workload data. From what Matt did describe of how the
> application behaves, simply time shifting the data access will likely
> solve all of his problems, cheaply. He might even be able to get by
> with his current filer. We simply need more information. I do anyway.
> I'd hope you would as well.

Hopefully I described well enough why our current application workload
data metrics aren't sufficient. We haven't time-shifted data access,
but have somewhat space-shifted it, given the round-robin "smart" file
distribution I described above. But it's only "smart" for today's
usage---tomorrow's usage will almost certainly be different. 50 gbps
/ 6 GB/s is the requirement.

> I don't recall Matt saying he needed a solution based entirely on FOSS.
> If he did I missed it. If he can accomplish his goals with all FOSS
> that's always a plus in my book. However, I'm not averse to closed
> source when it's a better fit for a requirement.

Nope, doesn't have to be entirely FOSS.

-Matt
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 13.03.2011 21:10:00 von Christoph Hellwig

Btw, XFS has been used for >10GB/s throughput systems for about the last
5 years.  The big issue is getting hardware that can reliably sustain
it - if you have that, using it with Linux and XFS is not a problem at
all.  Note that with NUMA systems you also have to think about your
interconnect bandwidth as a limiting factor for buffered I/O, not just
the storage subsystem.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 14.03.2011 13:27:00 von Stan Hoeppner

Christoph Hellwig put forth on 3/13/2011 3:10 PM:
> Btw, XFS has been used for >10GB/s throughput systems for about the last
> 5 years.  The big issue is getting hardware that can reliably sustain
> it - if you have that, using it with Linux and XFS is not a problem at

I already noted this far back in the thread, Christoph, but it is worth
repeating.  And it carries more weight when you, a Linux kernel dev,
state this than when I do.  So thanks for adding your input. :)

> all.  Note that with NUMA systems you also have to think about your
> interconnect bandwidth as a limiting factor for buffered I/O, not just
> the storage subsystem.

Is this only an issue with multi-chassis cabled NUMA systems such as
Altix 4000/UV and the (discontinued) IBM x86 NUMA systems (x440/445)
with their relatively low direct node-node bandwidth, or is this also of
concern with single chassis systems with relatively much higher
node-node bandwidth, such as the AMD Opteron systems, specifically the
newer G34, which have node-node bandwidth of 19.2GB/s bidirectional?

--
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 14.03.2011 13:47:33 von Christoph Hellwig

On Mon, Mar 14, 2011 at 07:27:00AM -0500, Stan Hoeppner wrote:
> Is this only an issue with multi-chassis cabled NUMA systems such as
> Altix 4000/UV and the (discontinued) IBM x86 NUMA systems (x440/445)
> with their relatively low direct node-node bandwidth, or is this also of
> concern with single chassis systems with relatively much higher
> node-node bandwidth, such as the AMD Opteron systems, specifically the
> newer G34, which have node-node bandwidth of 19.2GB/s bidirectional?

Just do your math.  Buffered I/O will do two memory copies - a copy
between the user buffer and the page cache, and a DMA between the page
cache and the device (yes, that's also a copy as far as the memory
subsystem is concerned, even if it is done by the device).

So to get 10GB/s throughput you spend 20GB/s on memcpys for the actual
data alone.  Add to that other system activity and metadata.  Whether you
hit the interconnect or not depends on your memory configuration, I/O
attachment, and process locality. If you have all memory that the
process uses and all I/O on one node you won't hit the interconnect at
all, but depending on memory placement and storage attachment you might
hit it twice:

- userspace memory on node A to pagecache on node B to device on node
C (or A again for that matter).

In short you need to review your configuration pretty carefully. With
direct I/O it's a lot easier as you save a copy.
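
A quick way to see the saved copy on a test box is to compare a buffered
read against an O_DIRECT read of the same file with dd (the path and
sizes below are just placeholders), dropping the page cache in between
so the two runs are comparable:

    # buffered read - data passes through the page cache
    dd if=/data/testfile of=/dev/null bs=1M count=4096
    echo 3 > /proc/sys/vm/drop_caches
    # direct read - the page cache copy is skipped
    dd if=/data/testfile of=/dev/null bs=1M count=4096 iflag=direct

Watching vmstat or the memory bandwidth counters while each one runs
makes the extra copy fairly obvious.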
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 18.03.2011 14:16:26 von Stan Hoeppner

Christoph Hellwig put forth on 3/14/2011 7:47 AM:
> On Mon, Mar 14, 2011 at 07:27:00AM -0500, Stan Hoeppner wrote:
>> Is this only an issue with multi-chassis cabled NUMA systems such as
>> Altix 4000/UV and the (discontinued) IBM x86 NUMA systems (x440/445)
>> with their relatively low direct node-node bandwidth, or is this also of
>> concern with single chassis systems with relatively much higher
>> node-node bandwidth, such as the AMD Opteron systems, specifically the
>> newer G34, which have node-node bandwidth of 19.2GB/s bidirectional?
>
> Just do your math.  Buffered I/O will do two memory copies - a copy
> between the user buffer and the page cache, and a DMA between the page
> cache and the device (yes, that's also a copy as far as the memory
> subsystem is concerned, even if it is done by the device).

The context of this thread was high throughput NFS serving. If we
wanted to do 10 GB/s kernel NFS serving, would we still only have two
memory copies, since the NFS server runs in kernel, not user, space?
I.e. in addition to the block device DMA read into the page cache, would
we also have a memcopy into application buffers from the page cache, or
does the kernel NFS server simply work with the data directly from the
page cache without an extra memory copy being needed? If the latter,
adding in the DMA copy to the NIC would yield two total memory copies.
Is this correct? Or would we have 3 memcopies?

> So to get 10GB/s throughput you spend 20GB/s on memcpys for the actual
> data alone.  Add to that other system activity and metadata.  Whether you
> hit the interconnect or not depends on your memory configuration, I/O
> attachment, and process locality. If you have all memory that the
> process uses and all I/O on one node you won't hit the interconnect at
> all, but depending on memory placement and storage attachment you might
> hit it twice:
>
> - userspace memory on node A to pagecache on node B to device on node
> C (or A again for that matter).

Not to mention hardware interrupt processing load, which, in addition to
eating some interconnect bandwidth, will also take a toll on CPU cycles
given the number of RAID HBAs and NIC required to read and push 10GB/s
NFS to clients.

Will achieving 10GB/s NFS likely require intricate manual process
placement, along with spreading interrupt processing across only node
cores which are directly connected to the IO bridge chips, preventing
interrupt packets from consuming interconnect bandwidth?
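
(To be concrete about what I mean by "spreading interrupt processing",
I'm picturing something along these lines, with the IRQ numbers, device
name patterns and CPU masks being made-up examples:

    # see which IRQ lines the HBAs and NICs ended up on
    grep -E 'mpt|eth' /proc/interrupts
    # pin a hypothetical HBA IRQ to core 2 and a NIC IRQ to core 3,
    # both on the node that owns the corresponding PCIe bridge
    echo 4 > /proc/irq/77/smp_affinity
    echo 8 > /proc/irq/78/smp_affinity

with irqbalance disabled so it doesn't undo the manual placement.)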

> In short you need to review your configuration pretty carefully. With
> direct I/O it's a lot easier as you save a copy.

Is O_DIRECT necessary in this scenario, or does the kernel NFS server
negate the need for direct IO since the worker threads execute in kernel
space not user space?  If not, is it possible to force the kernel NFS
server to always do O_DIRECT reads and writes, or is that the
responsibility of the application on the NFS client?

I was under the impression that the memory manager in recent 2.6
kernels, similar to IRIX on Origin, is sufficiently NUMA aware in the
default configuration to automatically take care of memory placement,
keeping all of a given process/thread's memory on the local node, and in
cases where thread memory ends up on another node for some reason, block
copying that memory to the local node and invalidating the remote CPU
caches, or in certain cases, simply moving the thread execution pointer
to a core in the remote node where the memory resides.

WRT the page cache, if the kernel doesn't automatically place page cache
data associated with a given thread in that thread's local node memory,
is it possible to force this? It's been a while since I read the
cpumemsets and other related documentation, and I don't recall if page
cache memory is manually locatable. That doesn't ring a bell.
Obviously it would be a big win from an interconnect utilization and
overall performance standpoint if the thread's working memory and page
cache memory were both on the local node.

--
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 18.03.2011 15:05:09 von Christoph Hellwig

On Fri, Mar 18, 2011 at 08:16:26AM -0500, Stan Hoeppner wrote:
> The context of this thread was high throughput NFS serving. If we
> wanted to do 10 GB/s kernel NFS serving, would we still only have two
> memory copies, since the NFS server runs in kernel, not user, space?
> I.e. in addition to the block device DMA read into the page cache, would
> we also have a memcopy into application buffers from the page cache, or
> does the kernel NFS server simply work with the data directly from the
> page cache without an extra memory copy being needed? If the latter,
> adding in the DMA copy to the NIC would yield two total memory copies.
> Is this correct? Or would we have 3 memcopies?

When reading from the NFS server you get away with two memory "copies":

1) DMA from the storage controller into the page cache
2) DMA from the page cache into the network card

but when writing to the NFS server you usually need three:

1) DMA from the network card into the socket buffer
2) copy from the socket buffer into the page cache
3) DMA from the page cache to the storage controller

That's because we can't do proper zero copy receive. It's possible in
theory with hardware that can align payload headers on page boundaries,
and while such hardware exists on the high end I don't think we support
it yet, nor do typical setups have the network card firmware smarts for
it.

> Not to mention hardware interrupt processing load, which, in addition to
> eating some interconnect bandwidth, will also take a toll on CPU cycles
> given the number of RAID HBAs and NIC required to read and push 10GB/s
> NFS to clients.

> Will achieving 10GB/s NFS likely require intricate manual process
> placement, along with spreading interrupt processing across only node
> cores which are directly connected to the IO bridge chips, preventing
> interrupt packets from consuming interconnect bandwidth?

Note that we do have a lot of infrastructure for high end NFS serving in
the kernel, e.g. the per-node NFSD thread that Greg Banks wrote for SGI
a couple of years ago. All this was for big SGI NAS servers running
XFS.  But as you mentioned it's not quite trivial to set up.
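
As a trivial starting point (the thread count below is just an example,
tune it for the actual workload), you can at least bump the number of
nfsd threads well above the distro default:

    # run 128 nfsd threads instead of the usual handful
    rpc.nfsd 128
    # or poke the same knob through procfs
    echo 128 > /proc/fs/nfsd/threads

IIRC the pooled-server mode from Greg's work is also controlled through
/proc/fs/nfsd, but check the kernel you're running before relying on
that.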

> > In short you need to review your configuration pretty carefully. With
> > direct I/O it's a lot easier as you save a copy.
>
> Is O_DIRECT necessary in this scenario, or does the kernel NFS server
> negate the need for direct IO since the worker threads execute in kernel
> space not user space?  If not, is it possible to force the kernel NFS
> server to always do O_DIRECT reads and writes, or is that the
> responsibility of the application on the NFS client?

The kernel NFS server doesn't use O_DIRECT - in fact the current
O_DIRECT code can't be used on kernel pages at all. For some NFS
workloads it would certainly be interesting to make use of it, though.
E.g. large stable writes.

> I was under the impression that the memory manager in recent 2.6
> kernels, similar to IRIX on Origin, is sufficiently NUMA aware in the
> default configuration to automatically take care of memory placement,
> keeping all of a given process/thread's memory on the local node, and in
> cases where thread memory ends up on another node for some reason, block
> copying that memory to the local node and invalidating the remote CPU
> caches, or in certain cases, simply moving the thread execution pointer
> to a core in the remote node where the memory resides.
>
> WRT the page cache, if the kernel doesn't automatically place page cache
> data associated with a given thread in that thread's local node memory,
> is it possible to force this? It's been a while since I read the
> cpumemsets and other related documentation, and I don't recall if page
> cache memory is manually locatable. That doesn't ring a bell.
> Obviously it would be a big win from an interconnect utilization and
> overall performance standpoint if the thread's working memory and page
> cache memory were both on the local node.

The kernel is pretty smart in placement of user and page cache data, but
it can't really second guess your intention.  With the numactl tool you
can help it do the proper placement for your workload.  Note that the
choice isn't always trivial - a NUMA system tends to have memory on
multiple nodes, so you'll either have to find a good partitioning of
your workload or live with off-node references.  I don't think
partitioning NFS workloads is trivial, but then again I'm not a
networking expert.
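
For example (the node number and <command> are placeholders):

    # show the node/memory topology the kernel actually sees
    numactl --hardware
    # keep a given process and its allocations on node 0
    numactl --cpunodebind=0 --membind=0 <command>

Checking numactl --hardware first is worth it, because the node layout
the kernel reports is what all of the binding decisions have to be made
against.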

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 18.03.2011 16:43:43 von Stan Hoeppner

Christoph Hellwig put forth on 3/18/2011 9:05 AM:

Thanks for the confirmations and explanations.

> The kernel is pretty smart in placement of user and page cache data, but
> it can't really second guess your intention. With the numactl tool you
> can help it do the proper placement for your workload.  Note that the
> choice isn't always trivial - a NUMA system tends to have memory on
> multiple nodes, so you'll either have to find a good partitioning of
> your workload or live with off-node references. I don't think
> partitioning NFS workloads is trivial, but then again I'm not a
> networking expert.

Bringing mdraid back into the fold, I'm wondering what kind of load the
mdraid threads would place on a system of the caliber needed to push
10GB/s NFS.

Neil, I spent quite a bit of time yesterday spec'ing out what I believe
is the bare minimum AMD64 based hardware needed to push 10GB/s NFS.
This includes:

4 LSI 9285-8e 8port SAS 800MHz dual core PCIE x8 HBAs
3 NIAGARA 32714 PCIe x8 Quad Port Fiber 10 Gigabit Server Adapter

This gives us 32 6Gb/s SAS ports and 12 10GbE ports total, for a raw
hardware bandwidth of 20GB/s SAS and 15GB/s ethernet.

I made the assumption that RAID 10 would be the only suitable RAID level
due to a few reasons:

1. The workload being 50+ NFS large file reads of aggregate 10GB/s,
yielding a massive random IO workload at the disk head level.

2. We'll need 384 15k SAS drives to service a 10GB/s random IO load

3. We'll need multiple "small" arrays enabling multiple mdraid threads,
assuming a single 2.4GHz core isn't enough to handle something like 48
or 96 mdraid disks.

4. Rebuild times for parity raid schemes would be unacceptably high and
would eat all of the CPU the rebuild thread would run on

To get the bandwidth we need and making sure we don't run out of
controller chip IOPS, my calculations show we'd need 16 x 24 drive
mdraid 10 arrays. Thus, ignoring all other considerations momentarily,
a dual AMD 6136 platform with 16 2.4GHz cores seems suitable, with one
mdraid thread per core, each managing a 24 drive RAID 10. Would we then
want to layer a --linear array across the 16 RAID 10 arrays? If we did
this, would the linear thread bottleneck instantly as it runs on only
one core? How many additional memory copies (interconnect transfers)
are we going to be performing per mdraid thread for each block read
before the data is picked up by the nfsd kernel threads?

How much of each core's cycles will we consume with normal random read
operations assuming 10GB/s of continuous aggregate throughput? Would
the mdraid threads consume sufficient cycles that when combined with
network stack processing and interrupt processing, that 16 cores at
2.4GHz would be insufficient? If so, would bumping the two sockets up
to 24 cores at 2.1GHz be enough for the total workload? Or, would we
need to move to a 4 socket system with 32 or 48 cores?

Is this possibly a situation where mdraid just isn't suitable due to the
CPU, memory, and interconnect bandwidth demands, making hardware RAID
the only real option?  And if it does require hardware RAID, would it
be possible to stick 16 block devices together in a --linear mdraid
array and maintain the 10GB/s performance?  Or, would the single
--linear array be processed by a single thread?  If so, would a single
2.4GHz core be able to handle an mdraid --linear thread managing 8
devices at 10GB/s aggregate?

Unfortunately I don't currently work in a position allowing me to test
such a system, and I certainly don't have the personal financial
resources to build it. My rough estimate on the hardware cost is
$150-200K USD. The 384 Hitachi 15k SAS 146GB drives at $250 each
wholesale are a little over $90k.

It would be really neat to have a job that allowed me to setup and test
such things. :)

--
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 18.03.2011 17:21:54 von Roberto Spadim

Did you contact texas ssd solutions?  I don't know how much $$$ you
would have to pay for such a setup, but it's a nice solution...

--
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 18.03.2011 23:01:01 von NeilBrown

On Fri, 18 Mar 2011 10:43:43 -0500 Stan Hoeppner
wrote:

> Christoph Hellwig put forth on 3/18/2011 9:05 AM:
>
> Thanks for the confirmations and explanations.
>
> > The kernel is pretty smart in placement of user and page cache data, but
> > it can't really second guess your intention. With the numactl tool you
> > can help it doing the proper placement for you workload. Note that the
> > choice isn't always trivial - a numa system tends to have memory on
> > multiple nodes, so you'll either have to find a good partitioning of
> > your workload or live with off-node references. I don't think
> > partitioning NFS workloads is trivial, but then again I'm not a
> > networking expert.
>
> Bringing mdraid back into the fold, I'm wondering what kind of load the
> mdraid threads would place on a system of the caliber needed to push
> 10GB/s NFS.
>
> Neil, I spent quite a bit of time yesterday spec'ing out what I believe

Addressing me directly in an email that wasn't addressed to me directly seems
a bit ... odd.  Maybe that is just me.

> is the bare minimum AMD64 based hardware needed to push 10GB/s NFS.
> This includes:
>
> 4 LSI 9285-8e 8port SAS 800MHz dual core PCIE x8 HBAs
> 3 NIAGARA 32714 PCIe x8 Quad Port Fiber 10 Gigabit Server Adapter
>
> This gives us 32 6Gb/s SAS ports and 12 10GbE ports total, for a raw
> hardware bandwidth of 20GB/s SAS and 15GB/s ethernet.
>
> I made the assumption that RAID 10 would be the only suitable RAID level
> due to a few reasons:
>
> 1. The workload being 50+ NFS large file reads of aggregate 10GB/s,
> yielding a massive random IO workload at the disk head level.
>
> 2. We'll need 384 15k SAS drives to service a 10GB/s random IO load
>
> 3. We'll need multiple "small" arrays enabling multiple mdraid threads,
> assuming a single 2.4GHz core isn't enough to handle something like 48
> or 96 mdraid disks.
>
> 4. Rebuild times for parity raid schemes would be unacceptably high and
> would eat all of the CPU the rebuild thread would run on
>
> To get the bandwidth we need and making sure we don't run out of
> controller chip IOPS, my calculations show we'd need 16 x 24 drive
> mdraid 10 arrays. Thus, ignoring all other considerations momentarily,
> a dual AMD 6136 platform with 16 2.4GHz cores seems suitable, with one
> mdraid thread per core, each managing a 24 drive RAID 10. Would we then
> want to layer a --linear array across the 16 RAID 10 arrays? If we did
> this, would the linear thread bottleneck instantly as it runs on only
> one core? How many additional memory copies (interconnect transfers)
> are we going to be performing per mdraid thread for each block read
> before the data is picked up by the nfsd kernel threads?
>
> How much of each core's cycles will we consume with normal random read

For RAID10, the md thread plays no part in reads.  Whichever thread
submitted the read submits it all the way down to the relevant member device.
If the read fails the thread will come into play.

For writes, the thread is used primarily to make sure the writes are properly
ordered w.r.t. bitmap updates.  I could probably remove that requirement if a
bitmap was not in use...

> operations assuming 10GB/s of continuous aggregate throughput? Would
> the mdraid threads consume sufficient cycles that when combined with
> network stack processing and interrupt processing, that 16 cores at
> 2.4GHz would be insufficient? If so, would bumping the two sockets up
> to 24 cores at 2.1GHz be enough for the total workload? Or, would we
> need to move to a 4 socket system with 32 or 48 cores?
>
> Is this possibly a situation where mdraid just isn't suitable due to the
> CPU, memory, and interconnect bandwidth demands, making hardware RAID
> the only real option?

I'm sorry, but I don't do resource usage estimates or comparisons with
hardware raid. I just do software design and coding.


> And if it does require hardware RAID, would it
> be possible to stick 16 block devices together in a --linear mdraid
> array and maintain the 10GB/s performance? Or, would the single
> --linear array be processed by a single thread? If so, would a single
> 2.4GHz core be able to handle an mdraid --linear thread managing 8
> devices at 10GB/s aggregate?

There is no thread for linear or RAID0.

If you want to share load over a number of devices, you would normally use
RAID0. However if the load had a high thread count and the filesystem
distributed IO evenly across the whole device space, then linear might work
for you.

NeilBrown


>
> Unfortunately I don't currently work in a position allowing me to test
> such a system, and I certainly don't have the personal financial
> resources to build it. My rough estimate on the hardware cost is
> $150-200K USD. The 384 Hitachi 15k SAS 146GB drives at $250 each
> wholesale are a little over $90k.
>
> It would be really neat to have a job that allowed me to setup and test
> such things. :)
>

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 18.03.2011 23:23:19 von Roberto Spadim

I think Linux can do this job without problems; the md code is very
mature.  The problem here is: what size/speed of CPU/RAM/network/disk
should we use?
For slow disks, use RAID0.
For mirroring, use RAID1.

RAID 4/5/6 are CPU intensive, which may be a problem at very high speeds
(if you have money, buy more CPU and there's no problem).

--
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 20.03.2011 02:34:26 von Stan Hoeppner

NeilBrown put forth on 3/18/2011 5:01 PM:
> On Fri, 18 Mar 2011 10:43:43 -0500 Stan Hoeppner
> wrote:
>
>> Christoph Hellwig put forth on 3/18/2011 9:05 AM:
>>
>> Thanks for the confirmations and explanations.
>>
>>> The kernel is pretty smart in placement of user and page cache data, but
>>> it can't really second guess your intention. With the numactl tool you
>>> can help it doing the proper placement for you workload. Note that the
>>> choice isn't always trivial - a numa system tends to have memory on
>>> multiple nodes, so you'll either have to find a good partitioning of
>>> your workload or live with off-node references. I don't think
>>> partitioning NFS workloads is trivial, but then again I'm not a
>>> networking expert.
>>
>> Bringing mdraid back into the fold, I'm wondering what kinda of load the
>> mdraid threads would place on a system of the caliber needed to push
>> 10GB/s NFS.
>>
>> Neil, I spent quite a bit of time yesterday spec'ing out what I believe
>
> > Addressing me directly in an email that wasn't addressed to me directly seems
> > a bit ... odd.  Maybe that is just me.

I guess that depends on one's perspective. Is it the content of email
To: and Cc: headers that matters, or the substance of the list
discussion thread? You are the lead developer and maintainer of Linux
mdraid AFAIK. Thus I would have assumed that directly addressing a
question to you within any given list thread was acceptable, regardless
of whose address was where in the email headers.

>> How much of each core's cycles will we consume with normal random read
>
> > For RAID10, the md thread plays no part in reads.  Whichever thread
> > submitted the read submits it all the way down to the relevant member device.
> > If the read fails the thread will come into play.

So with RAID10 read scalability is in essence limited to the execution
rate of the block device layer code and the interconnect b/w required.

> > For writes, the thread is used primarily to make sure the writes are properly
> > ordered w.r.t. bitmap updates.  I could probably remove that requirement if a
> bitmap was not in use...

How compute intensive is this thread during writes, if at all, at
extreme IO bandwidth rates?

>> operations assuming 10GB/s of continuous aggregate throughput? Would
>> the mdraid threads consume sufficient cycles that when combined with
>> network stack processing and interrupt processing, that 16 cores at
>> 2.4GHz would be insufficient? If so, would bumping the two sockets up
>> to 24 cores at 2.1GHz be enough for the total workload? Or, would we
>> need to move to a 4 socket system with 32 or 48 cores?
>>
>> Is this possibly a situation where mdraid just isn't suitable due to the
>> CPU, memory, and interconnect bandwidth demands, making hardware RAID
>> the only real option?
>
> I'm sorry, but I don't do resource usage estimates or comparisons with
> hardware raid. I just do software design and coding.

I probably worded this question very poorly and have possibly made
unfair assumptions about mdraid performance.

>> And if it does requires hardware RAID, would it
>> be possible to stick 16 block devices together in a --linear mdraid
>> array and maintain the 10GB/s performance? Or, would the single
>> --linear array be processed by a single thread? If so, would a single
>> 2.4GHz core be able to handle an mdraid --leaner thread managing 8
>> devices at 10GB/s aggregate?
>
> There is no thread for linear or RAID0.

What kernel code is responsible for the concatenation and striping
operations of mdraid linear and RAID0 if not an mdraid thread?

> If you want to share load over a number of devices, you would normally use
> RAID0. However if the load had a high thread count and the filesystem
> distributed IO evenly across the whole device space, then linear might work
> for you.

In my scenario I'm thinking I'd want to stay away from RAID0 because of the
multi-level stripe width issues of double nested RAID (RAID0 over
RAID10).  I assumed linear would be the way to go, as my scenario calls
for using XFS.  Using 32 allocation groups should evenly spread the
load, which is ~50 NFS clients.
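
(In mdadm terms I'm picturing something like the sketch below -- the
device names are placeholders and the 16 create commands would really be
scripted rather than typed out:

    # 16 x 24-drive RAID10 arrays (md0 .. md15), one per group of 24 drives
    mdadm --create /dev/md0 --level=10 --raid-devices=24 /dev/sd[b-y]
    # ... repeat for md1 through md15 with their own 24 drives ...
    # concatenate the 16 arrays instead of striping over them
    mdadm --create /dev/md16 --level=linear --raid-devices=16 /dev/md{0..15}

with XFS then created on /dev/md16.)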

What I'm trying to figure out is how much CPU time I am going to need for:

1. Aggregate 10GB/s IO rate
2. mdraid managing 384 drives
A. 16 mdraid10 arrays of 24 drives each
B. mdraid linear concatenating the 16 arrays

Thanks for your input Neil.

--
Stan

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 20.03.2011 04:41:47 von NeilBrown

On Sat, 19 Mar 2011 20:34:26 -0500 Stan Hoeppner
wrote:

> NeilBrown put forth on 3/18/2011 5:01 PM:
> > On Fri, 18 Mar 2011 10:43:43 -0500 Stan Hoeppner
> > wrote:
> >
> >> Christoph Hellwig put forth on 3/18/2011 9:05 AM:
> >>
> >> Thanks for the confirmations and explanations.
> >>
> >>> The kernel is pretty smart in placement of user and page cache data, but
> >>> it can't really second guess your intention. With the numactl tool you
> >>> can help it doing the proper placement for you workload. Note that the
> >>> choice isn't always trivial - a numa system tends to have memory on
> >>> multiple nodes, so you'll either have to find a good partitioning of
> >>> your workload or live with off-node references. I don't think
> >>> partitioning NFS workloads is trivial, but then again I'm not a
> >>> networking expert.
> >>
> >> Bringing mdraid back into the fold, I'm wondering what kinda of load the
> >> mdraid threads would place on a system of the caliber needed to push
> >> 10GB/s NFS.
> >>
> >> Neil, I spent quite a bit of time yesterday spec'ing out what I believe
> >
> > Addressing me directly in an email that wasn't addressed to me directly seem
> > a bit ... odd. Maybe that is just me.
>
> I guess that depends on one's perspective. Is it the content of email
> To: and Cc: headers that matters, or the substance of the list
> discussion thread? You are the lead developer and maintainer of Linux
> mdraid AFAIK. Thus I would have assumed that directly addressing a
> question to you within any given list thread was acceptable, regardless
> of whose address was where in the email headers.

This assumes that I read every email on this list. I certainly do read a lot,
but I tend to tune out of threads that don't seem particularly interesting -
and configuring hardware is only vaguely interesting to me - and I am sure
there are people on the list with more experience.

But whatever... there is certainly more chance of me missing something that
isn't directly addressed to me (such messages get filed differently).


>
> >> How much of each core's cycles will we consume with normal random read
> >
> > For RAID10, the md thread plays no part in reads. Which ever thread
> > submitted the read submits it all the way down to the relevant member device.
> > If the read fails the thread will come in to play.
>
> So with RAID10 read scalability is in essence limited to the execution
> rate of the block device layer code and the interconnect b/w required.

Correct.

>
> > For writes, the thread is used primarily to make sure the writes are properly
> > orders w.r.t. bitmap updates. I could probably remove that requirement if a
> > bitmap was not in use...
>
> How compute intensive is this thread during writes, if at all, at
> extreme IO bandwidth rates?

Not compute intensive at all - just single threaded.  So it will only
dispatch a single request at a time.  Whether single threading the writes is
good or bad is not something that I'm completely clear on.  It seems bad in
the sense that modern machines have lots of CPUs and we are forgoing any
possible benefits of parallelism.  However the current VM seems to do all
(or most) writeout from a single thread per device - the 'bdi' threads.
So maybe keeping it single threaded at the md level is perfectly natural and
avoids cache bouncing...


>
> >> operations assuming 10GB/s of continuous aggregate throughput? Would
> >> the mdraid threads consume sufficient cycles that when combined with
> >> network stack processing and interrupt processing, that 16 cores at
> >> 2.4GHz would be insufficient? If so, would bumping the two sockets up
> >> to 24 cores at 2.1GHz be enough for the total workload? Or, would we
> >> need to move to a 4 socket system with 32 or 48 cores?
> >>
> >> Is this possibly a situation where mdraid just isn't suitable due to the
> >> CPU, memory, and interconnect bandwidth demands, making hardware RAID
> >> the only real option?
> >
> > I'm sorry, but I don't do resource usage estimates or comparisons with
> > hardware raid. I just do software design and coding.
>
> I probably worded this question very poorly and have possibly made
> unfair assumptions about mdraid performance.
>
> >> And if it does requires hardware RAID, would it
> >> be possible to stick 16 block devices together in a --linear mdraid
> >> array and maintain the 10GB/s performance? Or, would the single
> >> --linear array be processed by a single thread? If so, would a single
> >> 2.4GHz core be able to handle an mdraid --leaner thread managing 8
> >> devices at 10GB/s aggregate?
> >
> > There is no thread for linear or RAID0.
>
> What kernel code is responsible for the concatenation and striping
> operations of mdraid linear and RAID0 if not an mdraid thread?
>

When the VM or filesystem or whatever wants to start an IO request, it calls
into the md code to find out how big it is allowed to make that request. The
md code returns a number which ensures that the request will end up being
mapped onto just one drive (at least in the majority of cases).
The VM or filesystem builds up the request (a struct bio) to at most that
size and hands it to md. md simply assigns a different target device and
offset in that device to the request, and hands it over to the target device.

So whatever thread it was that started the request carries it all the way
down to the device which is a member of the RAID array (for RAID0/linear).
Typically it then gets placed on a queue, and an interrupt handler takes it
off the queue and acts upon it.

So - no separate md thread.

RAID1 and RAID10 make only limited use of their thread, doing as much of the
work as possible in the original calling thread.
RAID4/5/6 do lots of work in the md thread. The calling thread just finds a
place in the stripe cache to attach the request, attaches it, and signals the
thread.
(Though reads on a non-degraded array can by-pass the cache and are handled
much like reads on RAID0).

> > If you want to share load over a number of devices, you would normally use
> > RAID0. However if the load had a high thread count and the filesystem
> > distributed IO evenly across the whole device space, then linear might work
> > for you.
>
> In my scenario I'm thinking I'd want to stay away from RAID0 because of the
> multi-level stripe width issues of double nested RAID (RAID0 over
> RAID10). I assumed linear would be the way to go, as my scenario calls
> for using XFS. Using 32 allocation groups should evenly spread the
> load, which is ~50 NFS clients.

You may well be right.

>
> What I'm trying to figure out is how much CPU time I am going to need for:
>
> 1. Aggregate 10GB/s IO rate
> 2. mdraid managing 384 drives
> A. 16 mdraid10 arrays of 24 drives each
> B. mdraid linear concatenating the 16 arrays

I very much doubt that CPU is going to be an issue. Memory bandwidth might -
but I'm only really guessing here, so it is probably time to stop.


>
> Thanks for your input Neil.
>
Pleasure.

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 20.03.2011 06:32:38 von Roberto Spadim

With 2 disks in md RAID0 I get 400MB/s (SAS 10krpm, 6Gb/s channel), so
you will need at least 10000/400*2 = 25*2 = 50 disks as a starting number.
Memory/CPU/network speed?  Memory must allow more than 10GB/s (DDR3 can
do this; I don't know if enabling ECC will be a problem or not, check
with memtest86+).  CPU?  Hmm, I don't know very well how to help here,
since it's just reading and writing memory/interfaces (network/disks);
maybe a 'magic' number like 3GHz * 64bits/8bits = 24.000 (maybe
24Gbit/s), but I don't know how to estimate...  I think you will need a
multicore CPU, maybe one core for network, one for disks, one for mdadm,
one for NFS and one for Linux: >=5 cores at least, with 3GHz 64bits each
(maybe starting with a Xeon with 6 cores and hyper-threading).
It's just an idea of how to estimate; it's not correct/true/real.
I think it's better to contact ibm/dell/hp/compaq/texas/any other vendor
and talk about the problem, and post results here; this is a nice
hardware question :)
Don't talk about software RAID, just the hardware to allow this
bandwidth (10GB/s) and to share files.

--
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: high throughput storage server?

am 21.03.2011 00:22:30 von Stan Hoeppner

Roberto Spadim put forth on 3/20/2011 12:32 AM:

> I think it's better to contact ibm/dell/hp/compaq/texas/any other vendor
> and talk about the problem, and post results here; this is a nice
> hardware question :)

I don't need vendor assistance to design a hardware system capable of
the 10GB/s NFS throughput target. That's relatively easy. I've already
specified one possible hardware combination capable of this level of
performance (see below). The configuration will handle 10GB/s using the
RAID function of the LSI SAS HBAs. The only question is if it has
enough individual and aggregate CPU horsepower, memory, and HT
interconnect bandwidth to do the same using mdraid. This is the reason
for my questions directed at Neil.

> Don't talk about software RAID, just the hardware to allow this
> bandwidth (10GB/s) and to share files.

I already posted some of the minimum hardware specs earlier in this
thread for the given workload I described. Following is a description
of the workload and a complete hardware specification.

Target workload:

10GB/s continuous parallel NFS throughput serving 50+ NFS clients whose
application performs large streaming reads. At the storage array level
the 50+ parallel streaming reads become a random IO pattern workload
requiring a huge number of spindles due to the high seek rates.

Minimum hardware requirements, based on performance and cost. Ballpark
guess on total cost of the hardware below is $150-250k USD. We can't
get the data to the clients without a network, so the specification
starts with the switching hardware needed.

Ethernet switches:
One HP A5820X-24XG-SFP+ (JC102A) 24 10 GbE SFP ports
488 Gb/s backplane switching capacity
Five HP A5800-24G Switch (JC100A) 24 GbE ports, 4 10GbE SFP
208 Gb/s backplane switching capacity
Maximum common MTU enabled (jumbo frame) globally
Connect 12 server 10 GbE ports to A5820X
Uplink 2 10 GbE ports from each A5800 to A5820X
2 open 10 GbE ports left on A5820X for cluster expansion
or off cluster data transfers to the main network
Link aggregate 12 server 10 GbE ports to A5820X
Link aggregate each client's 2 GbE ports to A5800s
Aggregate client->switch bandwidth = 12.5 GB/s
Aggregate server->switch bandwidth = 15.0 GB/s
The excess server b/w of 2.5GB/s is a result of the following:
Allowing headroom for an additional 10 clients or out of cluster
data transfers
Balancing the packet load over the 3 quad port 10 GbE server NICs
regardless of how many clients are active to prevent hot spots
in the server memory and interconnect subsystems

Server chassis
HP Proliant DL585 G7 with the following specifications
Dual AMD Opteron 6136, 16 cores @2.4GHz
20GB/s node-node HT b/w, 160GB/s aggregate
128GB DDR3 1333, 16x8GB RDIMMS in 8 channels
20GB/s/node memory bandwidth, 80GB/s aggregate
7 PCIe x8 slots and 4 PCIe x16
8GB/s/slot, 56 GB/s aggregate PCIe x8 bandwidth

IO controllers
4 x LSI SAS 9285-8e 8 port SAS, 800MHz dual core ROC, 1GB cache
3 x NIAGARA 32714 PCIe x8 Quad Port Fiber 10 Gigabit Server Adapter

JBOD enclosures
16 x LSI 620J 2U 24 x 2.5" bay SAS 6Gb/s, w/SAS expander
2 SFF 8088 host and 1 expansion port per enclosure
384 total SAS 6GB/s 2.5" drive bays
Two units are daisy chained with one in each pair
connecting to one of 8 HBA SFF8088 ports, for a total of
32 6Gb/s SAS host connections, yielding 38.4 GB/s full duplex b/w

Disks drives
384 HITACHI Ultrastar C15K147 147GB 15000 RPM 64MB Cache 2.5" SAS
6Gb/s Internal Enterprise Hard Drive


Note that the HBA to disk bandwidths of 19.2GB/s one way and 38.4GB/s
full duplex are in excess of the HBA to PCIe bandwidths, 16 and 32GB/s
respectively, by approximately 20%. Also note that each drive can
stream reads at 160MB/s peak, yielding 61GB/s aggregate streaming read
capacity for the 384 disks. This is almost 4 times the aggregate one
way transfer rate of the 4 PCIe x8 slots, and is 6 times our target host
to parallel client data rate of 10GB/s. There are a few reasons why
this excess of capacity is built into the system:

1. RAID10 is the only suitable RAID level for this type of system with
this many disks, for many reasons that have been discussed before.
RAID10 instantly cuts the number of stripe spindles in two, dropping the
data rate by a factor of 2, giving us 30.5GB/s potential aggregate
throughput.  Now we're only at 3 times our target data rate.

2. As a single disk drive's seek rate increases, its transfer rate
decreases in relation to its single streaming read performance.
Parallel streaming reads will increase seek rates as the disk head must
move between different regions of the disk platter.

3. In relation to 2, if we assume we'll lose no more than 66% of our
single streaming performance with a multi stream workload, we're down to
10.1GB/s throughput, right at our target.
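
A back-of-the-envelope recap of that derivation, purely as a sketch of
the arithmetic (the figures above are rounded):

~# # 384 drives x 160 MB/s streaming reads       ~= 61 GB/s raw
~# # RAID10 mirroring halves the stripe spindles ~= 30.5 GB/s
~# # keep ~1/3 under the multi-stream seek load  ~= 10 GB/s, the target
~# echo '384 * 160 / 2 / 3' | bc                 # prints 10240 (MB/s)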

By using relatively small arrays of 24 drives each (12 stripe spindles),
concatenating (--linear) the 16 resulting arrays, and using a filesystem
such as XFS across the entire array with its intelligent load balancing
of streams using allocation groups, we minimize disk head seeking.
Doing this can in essence divide our 50 client streams across 16 arrays,
with each array seeing approximately 3 of the streaming client reads.
Each disk should be able to easily maintain 33% of its max read rate
while servicing 3 streaming reads.
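
For illustration only, a rough sketch of how the md layer of such a
layout could be assembled.  The device names are hypothetical (sdb
through sdy for one enclosure's 24 drives, md0 through md15 for the 16
RAID10 arrays, md16 for the concatenation), and chunk size, metadata
version, and mdadm.conf handling are left out:

~# mdadm --create /dev/md0 --level=10 --raid-devices=24 /dev/sd[b-y]
~# # ...one such RAID10 array per 24-bay enclosure, /dev/md1../dev/md15
~# mdadm --create /dev/md16 --level=linear --raid-devices=16 /dev/md{0..15}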

I hope you found this informative or interesting. I enjoyed the
exercise. I'd been working on this system specification for quite a few
days now but have been hesitant to post it due to its length, and the
fact that AFAIK hardware discussion is a bit OT on this list.

I hope it may be valuable to someone Google'ing for this type of
information in the future.

--
Stan

Re: high throughput storage server?

am 21.03.2011 01:52:10 von Roberto Spadim

> I don't need vendor assistance to design a hardware system capable of
> the 10GB/s NFS throughput target.  That's relatively easy.  I've already
Does it work? Has it been tested?

--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: high throughput storage server?

am 21.03.2011 03:44:52 von Keld Simonsen

On Sun, Mar 20, 2011 at 06:22:30PM -0500, Stan Hoeppner wrote:
> [...]
> Disks drives
>    384 HITACHI Ultrastar C15K147 147GB 15000 RPM 64MB Cache 2.5" SAS
>       6Gb/s Internal Enterprise Hard Drive
> [...]
> By using relatively small arrays of 24 drives each (12 stripe spindles),
> concatenating (--linear) the 16 resulting arrays, and using a filesystem
> such as XFS across the entire array with its intelligent load balancing
> of streams using allocation groups, we minimize disk head seeking.
> Doing this can in essence divide our 50 client streams across 16 arrays,
> with each array seeing approximately 3 of the streaming client reads.
> Each disk should be able to easily maintain 33% of its max read rate
> while servicing 3 streaming reads.
> [...]

Are you then building the system yourself, and running Linux MD RAID?

Anyway, with 384 spindles and only 50 users, each user will have in
average 7 spindles for himself. I think much of the time this would mean
no random IO, as most users are doing large sequential reading.
Thus on average you can expect quite close to striping speed if you
are running RAID capable of striping.

I am puzzled about the --linear concatenating. I think this may cause
the disks in the --linear array to be considered as one spindle, and thus
no concurrent IO will be made. I may be wrong there.

best regards
Keld

Re: high throughput storage server?

am 21.03.2011 04:13:46 von Roberto Spadim

Hmm, maybe linear will use less CPU than stripe?
I have never tested an array with more than 8 disks with linear, nor
with stripe, hehehe.
Could anyone help here?

2011/3/20 Keld Jørn Simonsen:
> [...]
> Anyway, with 384 spindles and only 50 users, each user will have in
> average 7 spindles for himself. I think much of the time this would mean
> no random IO, as most users are doing large sequential reading.
> Thus on average you can expect quite close to striping speed if you
> are running RAID capable of striping.
>
> I am puzzled about the --linear concatenating. I think this may cause
> the disks in the --linear array to be considered as one spindle, and thus
> no concurrent IO will be made. I may be wrong there.



--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: high throughput storage server?

am 21.03.2011 04:14:15 von Roberto Spadim

With hardware RAID we don't have to think about this problem, but with
software RAID we should consider it, since we run other applications
alongside the software RAID too.

2011/3/21 Roberto Spadim:
> Hmm, maybe linear will use less CPU than stripe?
> I have never tested an array with more than 8 disks with linear, nor
> with stripe, hehehe.
> Could anyone help here?
> [...]



--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: high throughput storage server?

am 21.03.2011 15:18:57 von Stan Hoeppner

Keld Jørn Simonsen put forth on 3/20/2011 9:44 PM:

> Are you then building the system yourself, and running Linux MD RAID?

No. These specifications meet the needs of Matt Garman's analysis
cluster, and extend that performance from 6GB/s to 10GB/s. Christoph's
comments about 10GB/s throughput with XFS on large CPU count Altix 4000
series machines from a few years ago prompted me to specify a single
chassis multicore AMD Opteron based system that can achieve the same
throughput at substantially lower cost.

> Anyway, with 384 spindles and only 50 users, each user will have in
> average 7 spindles for himself. I think much of the time this would mean
> no random IO, as most users are doing large sequential reading.
> Thus on average you can expect quite close to striping speed if you
> are running RAID capable of striping.

This is not how large scale shared RAID storage works under a
multi-stream workload.  I thought I explained this in sufficient detail.
Maybe not.

> I am puzzled about the --linear concatenating. I think this may cause
> the disks in the --linear array to be considered as one spindle, and thus
> no concurrent IO will be made. I may be wrong there.

You are puzzled because you are not familiar with the large scale
performance features built into the XFS filesystem. XFS allocation
groups automatically enable large scale parallelism on a single logical
device comprised of multiple arrays or single disks, when configured
correctly. See:

http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/Allocation_Groups.html

The storage pool in my proposed 10GB/s NFS server system consists of 16
RAID10 arrays comprised of 24 disks of 146GB capacity, 12 stripe
spindles per array, 1.752TB per array, 28TB total raw. Concatenating
the 16 array devices with mdadm --linear creates a 28TB logical device.
We format it with this simple command, not having to worry about stripe
block size, stripe spindle width, stripe alignment, etc:

~# mkfs.xfs -d agcount=64
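
Spelled out against a hypothetical concatenated device and mount point
(say /dev/md16 and /export/data), that is simply:

~# mkfs.xfs -d agcount=64 /dev/md16
~# mount -o inode64,noatime /dev/md16 /export/data

The inode64 mount option just lets XFS place inodes in all allocation
groups on a filesystem this large; it is a mount-time choice, not part
of the on-disk layout decision being discussed here.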

Using this method to achieve parallel scalability is simpler and less
prone to configuration errors when compared to multi-level striping,
which often leads to poor performance and poor space utilization. With
64 XFS allocation groups the kernel can read/write 4 concurrent streams
from/to each array of 12 spindles, which should be able to handle this
load with plenty of headroom.  This system has 32 SAS 6G channels, each
able to carry two 300MB/s streams, 19.2GB/s aggregate, substantially
more than our 10GB/s target.  I was going to state that we're limited to
10.4GB/s due to the PCIe/HT bridge to the processor.  However, I just
realized I made an error when specifying the DL585 G7 with only 2
processors. See [1] below for details.

Using XFS in this manner allows us to avoid nested striped arrays and
the inherent problems associated with them. For example, in absence of
using XFS allocation groups to get our parallelism, we could do the
following:

1. Width 16 RAID0 stripe over width 12 RAID10 stripe
2. Width 16 LVM stripe over width 12 RAID10 stripe

In either case, what is the correct/optimum stripe block size for each
level when nesting the two? The answer is that there really aren't
correct or optimum stripe sizes in this scenario. Writes to the top
level stripe will be broken into 16 chunks. Each of these 16 chunks
will then be broken into 12 more chunks. You may be thinking, "Why
don't we just create one 384 disk RAID10? It would SCREAM with 192
spindles!!" There are many reasons why nobody does this, one being the
same stripe block size issue as with nested stripes. Extremely wide
arrays have a plethora of problems associated with them.

In summary, concatenating many relatively low stripe spindle count
arrays, and using XFS allocation groups to achieve parallel scalability,
gives us the performance we want without the problems associated with
other configurations.


[1] In order to get all 11 PCIe slots in the DL585 G7 one must use the
4 socket model, as the additional PCIe slots of the mezzanine card
connect to two additional SR5690 chips, each one connecting to an HT
port on each of the two additional CPUs. Thus, I'm re-specifying the
DL585 G7 model to have 4 Opteron 6136 CPUs instead of two, 32 cores
total. The 128GB in 16 RDIMMs will be spread across all 16 memory
channels.  Memory bandwidth thus doubles to 160GB/s and interconnect b/w
doubles to 320GB/s.  Thus, we now have up to 19.2 GB/s of available one
way disk bandwidth as we're no longer limited by a 10.4GB/s HT/PCIe
link. Adding the two required CPUs may have just made this system
capable of 15GB/s NFS throughput for less than $5000 additional cost,
not due to the processors, but the extra IO bandwidth enabled as a
consequence of their inclusion. Adding another quad port 10 GbE NIC
will take it close to 20GB/s NFS throughput. Shame on me for not
digging far deeper into the DL585 G7 docs.

--
Stan


Re: high throughput storage server?

am 21.03.2011 18:07:45 von Stan Hoeppner

Roberto Spadim put forth on 3/20/2011 10:14 PM:
> With hardware RAID we don't have to think about this problem, but with
> software RAID we should consider it, since we run other applications
> alongside the software RAID too.

Which is precisely why I asked Neil about this. If you recall Neil
stated that CPU burn shouldn't be an issue when using mdraid linear over
16 mdraid 10 arrays in the proposed system. As long as the kernel
somewhat evenly distributes IO streams amongst multiple cores I'm
inclined to agree with Neil.

Note that the application in this case, the NFS server, is threaded
kernel code, and thus very fast and scalable across all CPUs. By
design, all of the performance critical code in this system runs in
kernel space.
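
As a side note on that point, the kernel nfsd thread count is tunable
at run time; a sketch only, with an illustrative rather than
recommended count:

~# rpc.nfsd 64                   # set the number of kernel nfsd threads
~# cat /proc/fs/nfsd/threads     # confirm the count currently running

Most distributions expose the same setting in their NFS init
configuration, often as RPCNFSDCOUNT.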

--
Stan

Re: high throughput storage server?

am 21.03.2011 18:08:20 von Roberto Spadim

Hmm, I think you have everything you need to work with mdraid and
hardware RAID, right? XFS allocation groups are nice; I don't know what
workload they can handle. Maybe with RAID0 linear this works better
than stripe (I must test).

I think you know what you are doing =)
Any more doubts?


2011/3/21 Stan Hoeppner:
> [...]
> In summary, concatenating many relatively low stripe spindle count
> arrays, and using XFS allocation groups to achieve parallel scalability,
> gives us the performance we want without the problems associated with
> other configurations.
> [...]



--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: high throughput storage server?

am 21.03.2011 23:13:04 von Keld Simonsen

On Mon, Mar 21, 2011 at 09:18:57AM -0500, Stan Hoeppner wrote:
> Keld Jørn Simonsen put forth on 3/20/2011 9:44 PM:
>
> > Are you then building the system yourself, and running Linux MD RAID?
>
> No.  These specifications meet the needs of Matt Garman's analysis
> cluster, and extend that performance from 6GB/s to 10GB/s.  Christoph's
> comments about 10GB/s throughput with XFS on large CPU count Altix 4000
> series machines from a few years ago prompted me to specify a single
> chassis multicore AMD Opteron based system that can achieve the same
> throughput at substantially lower cost.

OK, But I understand that this is running Linux MD RAID, and not some
hardware RAID. True?

Or at least Linux MD RAID is used to build a --linear FS.
Then why not use Linux MD to make the underlying RAID1+0 arrays?

>
> > Anyway, with 384 spindles and only 50 users, each user will have in
> > average 7 spindles for himself. I think much of the time this would mean
> > no random IO, as most users are doing large sequential reading.
> > Thus on average you can expect quite close to striping speed if you
> > are running RAID capable of striping.
>
> This is not how large scale shared RAID storage works under a
> multi-stream workload.  I thought I explained this in sufficient detail.
> Maybe not.

Given that the whole array system is only lightly loaded, this is how I
expect it to function. Maybe you can explain why it would not be so, if
you think otherwise.

> > I am puzzled about the --linear concatenating. I think this may cause
> > the disks in the --linear array to be considered as one spindle, and thus
> > no concurrent IO will be made. I may be wrong there.
>
> You are puzzled because you are not familiar with the large scale
> performance features built into the XFS filesystem.  XFS allocation
> groups automatically enable large scale parallelism on a single logical
> device comprised of multiple arrays or single disks, when configured
> correctly.  See:
>
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/Allocation_Groups.html
> [...]

it is probably not the concurrency of XFS that makes the parallelism of
the IO. It is more likely the IO system, and that would also work for
other file system types, like ext4. I do not see anything in the XFS allocation
blocks with any knowledge of the underlying disk structure.
What the file system does is only to administer the scheduling of the
IO, in combination with the rest of the kernel.

Anyway, thanks for the energy and expertise that you are supplying to
this thread.

Best regards
keld

Re: high throughput storage server?

am 22.03.2011 10:46:58 von Robin Hill


On Mon Mar 21, 2011 at 11:13:04 +0100, Keld Jørn Simonsen wrote:

> On Mon, Mar 21, 2011 at 09:18:57AM -0500, Stan Hoeppner wrote:
> >
> > > Anyway, with 384 spindles and only 50 users, each user will have in
> > > average 7 spindles for himself. I think much of the time this would mean
> > > no random IO, as most users are doing large sequential reading.
> > > Thus on average you can expect quite close to striping speed if you
> > > are running RAID capable of striping.
> >
> > This is not how large scale shared RAID storage works under a
> > multi-stream workload. I thought I explained this in sufficient detail.
> > Maybe not.
>
> Given that the whole array system is only lightly loaded, this is how I
> expect it to function. Maybe you can explain why it would not be so, if
> you think otherwise.
>
If you have more than one system accessing the array simultaneously then
your sequential IO immediately becomes random (as it'll interleave the
requests from the multiple systems). The more systems accessing
simultaneously, the more random the IO becomes. Of course, there will
still be an opportunity for some readahead, so it's not entirely random
IO.

> it is probably not the concurrency of XFS that makes the parallelism of
> the IO. It is more likely the IO system, and that would also work for
> other file system types, like ext4. I do not see anything in the XFS allocation
> blocks with any knowledge of the underlying disk structure.
> What the file system does is only to administer the scheduling of the
> IO, in combination with the rest of the kernel.
>
XFS allows for splitting the single filesystem into multiple allocation
groups. It can then allocate blocks from each group simultaneously
without worrying about collisions. If the allocation groups are on
separate physical spindles then (apart from the initial mapping of a
request to an allocation group, which should be a very quick operation),
the entire write process is parallelised. Most filesystems have only a
single allocation group, so the block allocation is single threaded and
can easily become a bottleneck. It's only once the blocks are allocated
(assuming the filesystem knows about the physical layout) that the
writes can be parallelised. I've not looked into the details of ext4
though, so I don't know whether it makes any moves towards parallelising
block allocation.
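
A crude way to see that effect on XFS, as a sketch with made-up paths:
create several directories (XFS spreads new directories across
allocation groups) and write into them in parallel, then watch iostat;
several of the underlying arrays should be busy rather than just one:

~# mkdir /export/data/stream{0..7}
~# for d in /export/data/stream{0..7}; do dd if=/dev/zero of=$d/f bs=1M count=4096 & done; wait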

Cheers,
Robin
--
 ___
( ' } | Robin Hill |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |


Re: high throughput storage server?

am 22.03.2011 11:00:40 von Stan Hoeppner

Keld Jørn Simonsen put forth on 3/21/2011 5:13 PM:
> On Mon, Mar 21, 2011 at 09:18:57AM -0500, Stan Hoeppner wrote:
>> Keld Jørn Simonsen put forth on 3/20/2011 9:44 PM:
>>
>>> Are you then building the system yourself, and running Linux MD RAID?
>>
>> No.  These specifications meet the needs of Matt Garman's analysis
>> cluster, and extend that performance from 6GB/s to 10GB/s.  Christoph's
>> comments about 10GB/s throughput with XFS on large CPU count Altix 4000
>> series machines from a few years ago prompted me to specify a single
>> chassis multicore AMD Opteron based system that can achieve the same
>> throughput at substantially lower cost.
>
> OK, But I understand that this is running Linux MD RAID, and not some
> hardware RAID. True?
>
> Or at least Linux MD RAID is used to build a --linear FS.
> Then why not use Linux MD to make the underlying RAID1+0 arrays?

Using mdadm --linear is a requirement of this system specification.  The
underlying RAID10 arrays can be either HBA RAID or mdraid.  Note my
recent questions to Neil regarding mdraid CPU consumption across 16
cores with 16 x 24 drive mdraid 10 arrays.

>>> Anyway, with 384 spindles and only 50 users, each user will have in
>>> average 7 spindles for himself. I think much of the time this would mean
>>> no random IO, as most users are doing large sequential reading.
>>> Thus on average you can expect quite close to striping speed if you
>>> are running RAID capable of striping.
>>
>> This is not how large scale shared RAID storage works under a
>> multi-stream workload.  I thought I explained this in sufficient detail.
>> Maybe not.
>
> Given that the whole array system is only lightly loaded, this is how I
> expect it to function. Maybe you can explain why it would not be so, if
> you think otherwise.

Using the term "lightly loaded" to describe any system sustaining
concurrent 10GB/s block IO and NFS throughput doesn't seem to be an
accurate statement. I think you're confusing theoretical maximum
hardware performance with real world IO performance. The former is
always significantly higher than the latter.  With this in mind, as with
any well designed system, I specified this system to have some headroom,
as I previously stated.  Everything we've discussed so far WRT this
system has been strictly parallel reads.

Now, if 10 cluster nodes are added with an application that performs
streaming writes, occurring concurrently with the 50 streaming reads,
we've just significantly increased the amount of head seeking on our
disks. The combined IO workload is now a mixed heavy random read/write
workload. This is the most difficult type of workload for any RAID
subsystem.  It would bring most parity RAID arrays to their knees.  This
is one of the reasons why RAID10 is the only suitable RAID level for
this type of system.

>> In summary, concatenating many relatively low stripe spindle count
>> arrays, and using XFS allocation groups to achieve parallel scalability,
>> gives us the performance we want without the problems associated with
>> other configurations.

> it is probably not the concurrency of XFS that makes the parallelism of
> the IO.

It most certainly is the parallelism of XFS. There are some caveats to
the amount of XFS IO parallelism that are workload dependent. But
generally, with multiple processes/threads reading/writing multiple
files in multiple directories, the device parallelism is very high.  For
example:

If you have 50 NFS clients all reading the same large 20GB file
concurrently, IO parallelism will be limited to the 12 stripe spindles
on the single underlying RAID array upon which the AG holding this file
resides. If no other files in the AG are being accessed at the time,
you'll get something like 1.8GB/s throughput for this 20GB file. Since
the bulk, if not all, of this file will get cached during the read, all
50 NFS clients will likely be served from cache at their line rate of
200MB/s, or 10GB/s aggregate. There's that magic 10GB/s number again.
;) As you can see I put some serious thought into this system
specification.

If you have all 50 NFS clients accessing 50 different files in 50
different directories you have no cache benefit.  But we will have files
residing in all allocation groups on all 16 arrays.  Since XFS evenly
distributes new directories across AGs when the directories are created,
we can probably assume we'll have parallel IO across all 16 arrays with
this workload. Since each array can stream reads at 1.8GB/s, that's
potential parallel throughput of 28GB/s, saturating our PCIe bus
bandwidth of 16GB/s.

Now change this to 50 clients each doing 10,000 4KB file reads in a
directory along with 10,000 4KB file writes. The throughput of each 12
disk array may now drop by over a factor of approximately 128, as each
disk can only sustain about 300 head seeks/second, dropping its
throughput to 300 * 4096 bytes = 1.17MB/s.  Kernel readahead may help
some, but it'll still suck.
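
The arithmetic behind that drop, as a sketch (the 1.17 figure above is
the same number expressed in MiB/s):

~# # ~300 seeks/s per drive x 4 KB per seek-bound IO ~= 1.2 MB/s per drive
~# # x 12 stripe spindles                            ~= 14 MB/s per array
~# # vs ~1.8 GB/s per array when streaming           ~= a ~128x drop
~# echo '1800 / (12 * 300 * 4096 / 1000000)' | bc    # prints 128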

It is the occasional workload such as that above that dictates
overbuilding the disk subsystem. Imagine adding a high IOPS NFS client
workload to this server after it went into production to "only" serve
large streaming reads. The random workload above would drop the
performance of this 384 disk array with 15k spindles from a peak
streaming rate of 28.4GB/s to 18MB/s--yes, that's megabytes.

With one workload the disks can saturate the PCIe bus by almost a factor
of two.  With an opposite workload the disks can only transfer one
14,000th of the PCIe bandwidth. This is why Fortune 500 companies and
others with extremely high random IO workloads such as databases, and
plenty of cash, have farms with multiple thousands of disks attached to
database and other servers.

> It is more likely the IO system, and that would also work for
> other file system types, like ext4.

No.  Upper kernel layers don't provide this parallelism.  This is
strictly an XFS feature, although JFS had something similar (and JFS is
now all but dead), though not as performant.  BTRFS might have something
similar but I've read nothing about BTRFS internals.  Because XFS has
simply been the king of scalable filesystems for 15 years, and added
great new capability along the way, all of the other filesystem
developers have started to steal ideas from XFS.  IIRC Ted Ts'o stole
some things from XFS for use in EXT4, but allocation groups wasn't one
of them.

> I do not see anything in the XFS allocation
> blocks with any knowledge of the underlying disk structure.

The primary structure that allows for XFS parallelism is
xfs_agnumber_t sb_agcount

Making the filesystem with
mkfs.xfs -d agcount=16

creates 16 allocation groups of 1.752TB each in our case, 1 per 12
spindle array.  XFS will read/write to all 16 AGs in parallel, under the
right circumstances, with 1 or multiple IO streams to/from each 12
spindle array. XFS is the only Linux filesystem with this type of
scalability, again, unless BTRFS has something similar.
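
If it helps, the resulting layout is easy to confirm after the fact; a
sketch with hypothetical device and mount point names:

~# mkfs.xfs -d agcount=16 /dev/md16
~# mount /dev/md16 /export/data
~# xfs_info /export/data | grep agcount   # reports agcount=16 and agsize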

> What the file system does is only to administer the scheduling of the
> IO, in combination with the rest of the kernel.

Given that XFS has 64xxx lines of code, BTRFS has 46xxx, and EXT4 has
29xxx, I think there's a bit more to it than that Keld. ;) Note that
XFS has over twice the code size of EXT4. That's not bloat but
features, one of them being allocation groups.  If your simplistic view of
this was correct we'd have only one Linux filesystem. Filesystem code
does much much more than you realize.

> Anyway, thanks for the energy and expertise that you are supplying to
> this thread.

High performance systems are one of my passions. I'm glad to
participate and share. Speaking of sharing, after further reading on
how the parallelism of AGs is done and some other related things, I'm
changing my recommendation to using only 16 allocation groups of 1.752TB
with this system, one AG per array, instead of 64 AGs of 438GB. Using
64 AGs could potentially hinder parallelism in some cases.

-- 
Stan

Re: high throughput storage server?

am 22.03.2011 11:14:03 von Keld Simonsen

On Tue, Mar 22, 2011 at 09:46:58AM +0000, Robin Hill wrote:
> On Mon Mar 21, 2011 at 11:13:04 +0100, Keld Jørn Simonsen wrote:
>
> > On Mon, Mar 21, 2011 at 09:18:57AM -0500, Stan Hoeppner wrote:
> > >
> > > > Anyway, with 384 spindles and only 50 users, each user will have in
> > > > average 7 spindles for himself. I think much of the time this would mean
> > > > no random IO, as most users are doing large sequential reading.
> > > > Thus on average you can expect quite close to striping speed if you
> > > > are running RAID capable of striping.
> > >
> > > This is not how large scale shared RAID storage works under a
> > > multi-stream workload. I thought I explained this in sufficient detail.
> > > Maybe not.
> >
> > Given that the whole array system is only lightly loaded, this is how I
> > expect it to function. Maybe you can explain why it would not be so, if
> > you think otherwise.
> >
> If you have more than one system accessing the array simultaneously then
> your sequential IO immediately becomes random (as it'll interleave the
> requests from the multiple systems). The more systems accessing
> simultaneously, the more random the IO becomes. Of course, there will
> still be an opportunity for some readahead, so it's not entirely random
> IO.

Of course the IO will be randomized if there are more users, but the
read IO will tend to be quite sequential if the reading of each process
is sequential. So if a user reads a big file sequentially, and the
system is lightly loaded, IO schedulers will tend to order all IO
for the process so that it is served in one series of operations,
given that the big file is laid out contiguously on the file system.
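
One knob that influences how sequential such per-process reads stay at the
device is readahead. Purely as an illustration (device name and value are
hypothetical, not a recommendation):

  blockdev --getra /dev/md0           # current readahead, in 512-byte sectors
  blockdev --setra 16384 /dev/md0     # ~8MB readahead per sequential stream

Larger readahead lets each stream fetch bigger contiguous chunks between
seeks to other streams.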

> > it is probably not the concurrency of XFS that makes the parallelism of
> > the IO. It is more likely the IO system, and that would also work for
> > other file system types, like ext4. I do not see anything in the XFS allocation
> > blocks with any knowledge of the underlying disk structure.
> > What the file system does is only to administer the scheduling of the
> > IO, in combination with the rest of the kernel.

> XFS allows for splitting the single filesystem into multiple allocation
> groups. It can then allocate blocks from each group simultaneously
> without worrying about collisions. If the allocation groups are on
> separate physical spindles then (apart from the initial mapping of a
> request to an allocation group, which should be a very quick operation),
> the entire write process is parallelised. Most filesystems have only a
> single allocation group, so the block allocation is single threaded and
> can easily become a bottleneck. It's only once the blocks are allocated
> (assuming the filesystem knows about the physical layout) that the
> writes can be parallelised. I've not looked into the details of ext4
> though, so I don't know whether it makes any moves towards parallelising
> block allocation.

The block allocation is only done when writing. The system at hand was
specified as a mostly reading system, where such a block allocation
bottleneck is not so dominant.

Best regards
keld

Re: high throughput storage server?

am 22.03.2011 12:01:29 von Keld Simonsen

On Tue, Mar 22, 2011 at 05:00:40AM -0500, Stan Hoeppner wrote:
> Keld Jørn Simonsen put forth on 3/21/2011 5:13 PM:
> > On Mon, Mar 21, 2011 at 09:18:57AM -0500, Stan Hoeppner wrote:
> >> Keld Jørn Simonsen put forth on 3/20/2011 9:44 PM:
> >>
> >>> Anyway, with 384 spindles and only 50 users, each user will have in
> >>> average 7 spindles for himself. I think much of the time this would mean
> >>> no random IO, as most users are doing large sequential reading.
> >>> Thus on average you can expect quite close to striping speed if you
> >>> are running RAID capable of striping.
> >>
> >> This is not how large scale shared RAID storage works under a
> >> multi-stream workload. I thought I explained this in sufficient detail.
> >> Maybe not.
> >
> > Given that the whole array system is only lightly loaded, this is how I
> > expect it to function. Maybe you can explain why it would not be so, if
> > you think otherwise.
>
> Using the term "lightly loaded" to describe any system sustaining
> concurrent 10GB/s block IO and NFS throughput doesn't seem to be an
> accurate statement. I think you're confusing theoretical maximum
> hardware performance with real world IO performance. The former is
> always significantly higher than the latter. With this in mind, as with
> any well designed system, I specified this system to have some headroom,
> as I previously stated. Everything we've discussed so far WRT this
> system has been strictly parallel reads.

The disks themselves should be capable of doing about 60 GB/s, so 10 GB/s
is only about 15% utilization of the disks. And most of the IO is concurrent
sequential reading of big files.

> Now, if 10 cluster nodes are added with an application that performs
> streaming writes, occurring concurrently with the 50 streaming reads,
> we've just significantly increased the amount of head seeking on our
> disks. The combined IO workload is now a mixed heavy random read/write
> workload. This is the most difficult type of workload for any RAID
> subsystem. It would bring most parity RAID arrays to their knees. This
> is one of the reasons why RAID10 is the only suitable RAID level for
> this type of system.

Yes, I agree. And that is why I also suggest you use a mirrored raid in
the form of Linux MD RAID 10, F2, for better striping performance and disk
access performance than traditional RAID1+0.

Anyway, the system was not specified to have 10 additional heavy writing
processes.

> >> In summary, concatenating many relatively low stripe spindle count
> >> arrays, and using XFS allocation groups to achieve parallel scalability,
> >> gives us the performance we want without the problems associated with
> >> other configurations.
>
> > it is probably not the concurrency of XFS that makes the parallelism of
> > the IO.
>
> It most certainly is the parallelism of XFS. There are some caveats to
> the amount of XFS IO parallelism that are workload dependent. But
> generally, with multiple processes/threads reading/writing multiple
> files in multiple directories, the device parallelism is very high. For
> example:
>
> If you have 50 NFS clients all reading the same large 20GB file
> concurrently, IO parallelism will be limited to the 12 stripe spindles
> on the single underlying RAID array upon which the AG holding this file
> resides. If no other files in the AG are being accessed at the time,
> you'll get something like 1.8GB/s throughput for this 20GB file. Since
> the bulk, if not all, of this file will get cached during the read, all
> 50 NFS clients will likely be served from cache at their line rate of
> 200MB/s, or 10GB/s aggregate. There's that magic 10GB/s number again
> ;) As you can see I put some serious thought into this system
> specification.
>
> If you have all 50 NFS clients accessing 50 different files in 50
> different directories you have no cache benefit. But we will have files
> residing in all allocation groups on all 16 arrays. Since XFS evenly
> distributes new directories across AGs when the directories are created,
> we can probably assume we'll have parallel IO across all 16 arrays with
> this workload. Since each array can stream reads at 1.8GB/s, that's
> potential parallel throughput of 28GB/s, saturating our PCIe bus
> bandwidth of 16GB/s.

Hmm, yes, RAID1+0 can probably only stream read at 1.8 GB/s. Linux MD
RAID10,F2 can stream read at around 3.6 GB/s on an array of 24
15,000 rpm spindles, given that each spindle is capable of stream
reading at about 150 MB/s.
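
For reference, a minimal sketch of the md layout being referred to here
(device names and chunk size are hypothetical):

  mdadm --create /dev/md0 --level=10 --layout=f2 --chunk=256 \
        --raid-devices=24 /dev/sd[b-y]
  cat /proc/mdstat        # shows the raid10 far-2 layout

The far layout keeps one copy of the data striped across the outer, faster
half of each disk, which is what makes near RAID0 read striping plausible.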

> Now change this to 50 clients each doing 10,000 4KB file reads in a
> directory along with 10,000 4KB file writes. The throughput of each 12
> disk array may now drop by a factor of roughly 128, as each
> disk can only sustain about 300 head seeks/second, dropping its
> throughput to 300 * 4096 bytes = 1.17MB/s. Kernel readahead may help
> some, but it'll still suck.
>
> It is the occasional workload such as that above that dictates
> overbuilding the disk subsystem. Imagine adding a high IOPS NFS client
> workload to this server after it went into production to "only" serve
> large streaming reads. The random workload above would drop the
> performance of this 384 disk array with 15k spindles from a peak
> streaming rate of 28.4GB/s to 18MB/s--yes, that's megabytes.

Yes, random reading can diminish performance a lot.
If the mix of random/sequential reading still has a good sequential
part, then I think the system should still perform well. I think we lack
measurements for things like that, for example incremental sequential
reading speed on a non-saturated file system. I am not sure how to
define such measures, though.

> With one workload the disks can exceed the PCIe bus bandwidth by almost a
> factor of two. With an opposite workload the disks can only transfer one
> 14,000th of the PCIe bandwidth. This is why Fortune 500 companies and
> others with extremely high random IO workloads such as databases, and
> plenty of cash, have farms with multiple thousands of disks attached to
> database and other servers.

Or use SSD.

> > It is more likely the IO system, and that would also work for
> > other file system types, like ext4.
>
> No. Upper kernel layers don't provide this parallelism. This is
> strictly an XFS feature, although JFS had something similar (and JFS is
> now all but dead), though not as performant. BTRFS might have something
> similar but I've read nothing about BTRFS internals. Because XFS has
> simply been the king of scalable filesystems for 15 years, and added
> great new capability along the way, all of the other filesystem
> developers have started to steal ideas from XFS. IIRC Ted Ts'o stole
> some things from XFS for use in EXT4, but allocation groups wasn't one
> of them.
>
> > I do not see anything in the XFS allocation
> > blocks with any knowledge of the underlying disk structure.
>
> The primary structure that allows for XFS parallelism is
> xfs_agnumber_t sb_agcount
>
> Making the filesystem with
> mkfs.xfs -d agcount=16
>
> creates 16 allocation groups of 1.752TB each in our case, 1 per 12
> spindle array. XFS will read/write to all 16 AGs in parallel, under the
> right circumstances, with 1 or multiple IO streams to/from each 12
> spindle array. XFS is the only Linux filesystem with this type of
> scalability, again, unless BTRFS has something similar.
>
> > What the file system does is only to administer the scheduling of the
> > IO, in combination with the rest of the kernel.
>
> Given that XFS has 64xxx lines of code, BTRFS has 46xxx, and EXT4 has
> 29xxx, I think there's a bit more to it than that, Keld. ;) Note that
> XFS has over twice the code size of EXT4. That's not bloat but
> features, one of them being allocation groups. If your simplistic view of
> this was correct we'd have only one Linux filesystem. Filesystem code
> does much, much more than you realize.

Oh, well, of course the file system does a lot of things. And I have done
a number of designs and patches to a number of file systems over the years.
But I was talking about the overall picture. The CPU power should not be the
bottleneck, the bottleneck is the IO. So we use the kernel code to
administer the IO in the best possible way. I am also using XFS for
many file systems, but I am also using EXT3, and I think I get
about the same results for the systems I do, which also do mostly
sequential reading of many big files concurrently (an ftp server).

> > Anyway, thanks for the energy and expertise that you are supplying to
> > this thread.
>
> High performance systems are one of my passions. I'm glad to
> participate and share. Speaking of sharing, after further reading on
> how the parallelism of AGs is done and some other related things, I'm
> changing my recommendation to using only 16 allocation groups of 1.752TB
> with this system, one AG per array, instead of 64 AGs of 438GB. Using
> 64 AGs could potentially hinder parallelism in some cases.

Thank you again for your insights
keld

Re: high throughput storage server?

am 23.03.2011 09:53:45 von Stan Hoeppner

Keld Jørn Simonsen put forth on 3/22/2011 5:14 AM:

> Of course the IO will be randomized if there are more users, but the
> read IO will tend to be quite sequential if the reading of each process
> is sequential. So if a user reads a big file sequentially, and the
> system is lightly loaded, IO schedulers will tend to order all IO
> for the process so that it is served in one series of operations,
> given that the big file is laid out contiguously on the file system.

With the way I've architected this hypothetical system, the read load on
each allocation group (each 12 spindle array) should be relatively low,
about 3 streams on 14 AGs, 4 streams on the remaining two AGs,
_assuming_ the files being read are spread out evenly across at least 16
directories. As you all read in the docs for which I provided links,
XFS AG parallelism functions at the directory and file level. For
example, if we create 32 directories on a virgin XFS filesystem of 16
allocation groups, the following layout would result:

AG1:  /general requirements      AG1:  /alabama
AG2:  /site construction         AG2:  /alaska
AG3:  /concrete                  AG3:  /arizona
.
.
AG14: /conveying systems         AG14: /indiana
AG15: /mechanical                AG15: /iowa
AG16: /electrical                AG16: /kansas

AIUI, the first 16 directories get created in consecutive AGs until we
hit the last AG. The 17th directory is then created in the first AG and
we start the cycle over. This is how XFS allocation group parallelism
works. It doesn't provide linear IO scaling for all workloads, and it's
not magic, but it works especially well for multiuser fileservers, and
typically better than multi nested stripe levels or extremely wide arrays.
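
This rotation is easy to observe on a small scale. A hypothetical sketch
(mount point and file sizes are made up): create a few directories, write a
file in each, and look at the AG column that xfs_bmap -v reports for each
file's extents:

  mkdir /data/dir1 /data/dir2 /data/dir3 /data/dir4
  for d in /data/dir1 /data/dir2 /data/dir3 /data/dir4; do
      dd if=/dev/zero of=$d/test bs=1M count=16
  done
  xfs_bmap -v /data/dir1/test /data/dir2/test /data/dir3/test /data/dir4/test

If the directories rotated across AGs as described, the AG numbers should
differ from one directory's file to the next.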

Imagine you have a 5000 seat company. You'd mount this XFS filesystem in
/home. Each user home directory created would fall in a consecutive AG,
resulting in about 312 user dirs per AG. In this type of environment
XFS AG parallelism will work marvelously as you'll achieve fairly
balanced IO across all AGs and thus all 16 arrays.

In the case where you have many clients reading files from only one
directory, hence the same AG, IO parallelism is limited to the 12
spindles of that one array. When this happens, we end up with a highly
random workload at the disk head, resulting in high seek rates and low
throughput. This is one of the reasons I built some "excess" capacity
into the disk subsystem. Using XFS AGs for parallelism doesn't
guarantee even distribution of IO across all the 192 spindles of the 16
arrays. It gives good parallelism if clients are accessing different
files in different directories concurrently, but not in the opposite case.

> The block allocation is only done when writing. The system at hand was
> specified as a mostly reading system, where such a block allocation
> bottleneck is not so dominant.

This system would excel at massive parallel writes as well, again, as
long as we have many writers into multiple directories concurrently,
which spreads the write load across all AGs, and thus all arrays.

XFS is legendary for multiple large file parallel write throughput,
thanks to delayed allocation, and some other tricks.

-- 
Stan

Re: high throughput storage server?

am 23.03.2011 16:57:43 von Roberto Spadim

it's something like 'partitioning'? i don't know xfs very well, but ...
if you use 99% ag16 and 1% ag1-15
you should use a raid0 with stripe (for better write/read rate),
linear wouldn't help like stripe, am i right?

a question... this example was with directories, how are files (metadata)
saved? and how is file content saved? and journaling?

i see a filesystem as something like: read/write
journaling (metadata/files), read/write metadata, read/write file
content, check/repair filesystem, features (backup, snapshot, garbage
collection, raid1, increase/decrease fs size, others)

speed of write and read will be a function of how you designed it to
use the device layer (it's something like virtual memory utilization: a
big memory, and many programs trying to use small parts, and sometimes
needing to use a big part)


2011/3/23 Stan Hoeppner :
> Keld Jørn Simonsen put forth on 3/22/2011 5:14 AM:
>
>> Of course the IO will be randomized if there are more users, but the
>> read IO will tend to be quite sequential if the reading of each process
>> is sequential. So if a user reads a big file sequentially, and the
>> system is lightly loaded, IO schedulers will tend to order all IO
>> for the process so that it is served in one series of operations,
>> given that the big file is laid out contiguously on the file system.
>
> With the way I've architected this hypothetical system, the read load on
> each allocation group (each 12 spindle array) should be relatively low,
> about 3 streams on 14 AGs, 4 streams on the remaining two AGs,
> _assuming_ the files being read are spread out evenly across at least 16
> directories. As you all read in the docs for which I provided links,
> XFS AG parallelism functions at the directory and file level. For
> example, if we create 32 directories on a virgin XFS filesystem of 16
> allocation groups, the following layout would result:
>
> AG1:  /general requirements      AG1:  /alabama
> AG2:  /site construction         AG2:  /alaska
> AG3:  /concrete                  AG3:  /arizona
> .
> .
> AG14: /conveying systems         AG14: /indiana
> AG15: /mechanical                AG15: /iowa
> AG16: /electrical                AG16: /kansas
>
> AIUI, the first 16 directories get created in consecutive AGs until we
> hit the last AG. The 17th directory is then created in the first AG and
> we start the cycle over. This is how XFS allocation group parallelism
> works. It doesn't provide linear IO scaling for all workloads, and it's
> not magic, but it works especially well for multiuser fileservers, and
> typically better than multi nested stripe levels or extremely wide arrays.
>
> Imagine you have a 5000 seat company. You'd mount this XFS filesystem in
> /home. Each user home directory created would fall in a consecutive AG,
> resulting in about 312 user dirs per AG. In this type of environment
> XFS AG parallelism will work marvelously as you'll achieve fairly
> balanced IO across all AGs and thus all 16 arrays.
>
> In the case where you have many clients reading files from only one
> directory, hence the same AG, IO parallelism is limited to the 12
> spindles of that one array. When this happens, we end up with a highly
> random workload at the disk head, resulting in high seek rates and low
> throughput. This is one of the reasons I built some "excess" capacity
> into the disk subsystem. Using XFS AGs for parallelism doesn't
> guarantee even distribution of IO across all the 192 spindles of the 16
> arrays. It gives good parallelism if clients are accessing different
> files in different directories concurrently, but not in the opposite case.
>
>> The block allocation is only done when writing. The system at hand was
>> specified as a mostly reading system, where such a block allocation
>> bottleneck is not so dominant.
>
> This system would excel at massive parallel writes as well, again, as
> long as we have many writers into multiple directories concurrently,
> which spreads the write load across all AGs, and thus all arrays.
>
> XFS is legendary for multiple large file parallel write throughput,
> thanks to delayed allocation, and some other tricks.
>
> --
> Stan



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: high throughput storage server?

am 23.03.2011 17:19:39 von Joe Landman

On 03/23/2011 11:57 AM, Roberto Spadim wrote:
> it's something like 'partitioning'? i don't know xfs very well, but ...
> if you use 99% ag16 and 1% ag1-15
> you should use a raid0 with stripe (for better write/read rate),
> linear wouldn't help like stripe, am i right?
>
> a question... this example was with directories, how are files (metadata)
> saved? and how is file content saved? and journaling?

I won't comment on the hardware design or choices aspects. Will briefly
touch on the file system and MD raid.

MD RAID0 or RAID10 would be the sanest approach, and xfs happily does
talk nicely to the MD raid system, gathering the stripe information from it.

The issue though is that xfs stores journals internally by default. You
can change this, and in specific use cases, an external journal is
strongly advised. This would be one such use case.

Though, the OP wants a very read heavy machine, and not a write heavy
machine. So it makes more sense to have massive amounts of RAM for the
OP, and lots of high speed fabric (Infiniband HCA, 10-40 GbE NICs, ...).
However, a single system design for the OP's requirements makes very
little economic or practical sense. Would be very expensive to build.

And to keep this on target, MD raid could handle it.

> i see a filesystem something like: read/write
> jornaling(metadata/files), read/write metadata, read/write file
> content, check/repair filesystem, features (backup, snapshot, garbage
> collection, raid1, increase/decrease fs size, others)

Unfortunately, xfs snapshots have to be done via LVM2 right now. My
memory isn't clear on this, there may be an xfs_freeze requirement for
the snapshot to be really valid. e.g.

xfs_freeze -f /mount/point
# insert your lvm snapshot command
xfs_freeze -u /mount/point

I am not sure if this is still required.
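
For completeness, the snapshot step itself would look something like the
following (volume group and LV names are made up for illustration):

  xfs_freeze -f /mount/point
  lvcreate --snapshot --size 20G --name xfs_snap /dev/vg0/xfsvol
  xfs_freeze -u /mount/point

On newer kernels the freeze/unfreeze bracket is reportedly done automatically
when the LVM snapshot is taken.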

> speed of write and read will be a function of how you designed it to
> use device layer (it's something like a virtual memory utilization, a
> big memory, and many programs trying to use small parts and when need
> use a big part)

At the end of the day, it will be *far* more economical to build a
distributed storage cluster with a parallel file system atop it, than
build a single large storage unit. We've achieved well north of 10GB/s
sustained reads and writes from thousands of simultaneous processes
across thousands of cores (yes, with MD backed RAIDs being part of
this), for hundreds of GB reads/writes (well into the TB range).

Hardware design is very important here, as are many other features. The
BOM posted here notwithstanding, very good performance starts with good
selection of underlying components, and a rational design. Not all
designs you might see are worth the electrons used to transport them to
your reader.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615

Re: high throughput storage server?

am 24.03.2011 06:52:00 von Stan Hoeppner

Roberto Spadim put forth on 3/23/2011 10:57 AM:
> it's something like 'partitioning'? i don't know xfs very well, but ...
> if you use 99% ag16 and 1% ag1-15
> you should use a raid0 with stripe (for better write/read rate),
> linear wouldn't help like stripe, am i right?

You should really read up on XFS internals to understand exactly how
allocation groups work.

http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html

I've explained the basics. What I didn't mention is that an individual
file can be written concurrently to more than one allocation group,
yielding some of the benefit of striping but without the baggage of
RAID0 over 16 RAID10 or a wide stripe RAID10. However, I've not been
able to find documentation stating exactly how this is done and under
what circumstances, and I would really like to know. XFS has some good
documentation, but none of it goes into this kind of low level detail
with lay person digestible descriptions. I'm not a dev so I'm unable to
understand how this works by reading the code.

Note that once such a large file is written, reading that file later
puts multiple AGs into play so you have read parallelism approaching the
performance of straight disk striping.

The problems with nested RAID0 over RAID10, or simply a very wide array
(384 disks in this case) are two fold:

1. Lower performance with files smaller than the stripe width
2. Poor space utilization for the same reason

Let's analyze the wide RAID10 case. With 384 disks you get a stripe
width of 192 spindles. A common stripe block size is 64KB, or 16
filesystem blocks, 128 disk sectors. Taking that 64KB and multiplying
by 192 stripe spindles we get a stripe size of exactly 12MB.

If you write a file much smaller than the stripe size, say a 1MB file,
to the filesystem atop this wide RAID10, the file will only be striped
across 16 of the 192 spindles, with 64KB going to each stripe member, 16
filesystem blocks, 128 sectors. I don't know about mdraid, but with
many hardware RAID striping implementations the remaining 176 disks in
the stripe will have zeros or nulls written for their portion of the
stripe for this file that is a tiny fraction of the stripe size. Also,
all modern disk drives are much more efficient when doing larger
multi-sector transfers of anywhere from 512KB to 1MB or more than with
small transfers of 64KB.

By using XFS allocation groups for parallelism instead of a wide stripe
array, you don't suffer from this massive waste of disk space, and,
since each file is striped across fewer disks (12 in the case of my
example system), we end up with slightly better throughput as each
transfer is larger, 170 sectors in this case. The extremely wide array,
or nested stripe over striped array setup, is only useful in situations
where all files being written are close to or larger than the stripe
size. There are many application areas where this is not only plausible
but preferred. Most HPC applications work with data sets far larger
than the 12MB in this example, usually hundreds of megs if not multiple
gigs. In this case extremely wide arrays are the way to go, whether
using a single large file store, a cluster of fileservers, or a cluster
filesystem on SAN storage such as CXFS.

Most other environments are going to have a mix of small and large
files, and all sizes in between. This is the case where leveraging XFS
allocation group parallelism makes far more sense than a very wide
array, and why I chose this configuration for my example system.

Do note that XFS will also outperform any other filesystem when used
directly atop this same 192 spindle wide RAID10 array. You'll still
have 16 allocation groups, but the performance characteristics of the
AGs change when the underlying storage is a wide stripe. In this case
the AGs become cylinder groups from the outer to inner edge of the
disks, instead of each AG occupying an entire 12 spindle disk array.

In this case the AGs do more to prevent fragmentation than increase
parallel throughput at the hardware level. AGs do always allow more
filesystem concurrency though, regardless of the underlying hardware
storage structure, because inodes can be allocated or read in parallel.
This is due to the fact each XFS AG has its own set of B+ trees and
inodes. Each AG is a "filesystem within a filesystem".
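
If you want to confirm how many AGs an existing filesystem has, a read-only
peek at the superblock works (device name here is hypothetical):

  xfs_db -r -c "sb 0" -c "print agcount" /dev/md0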

If we pretend for a moment that an EXT4 filesystem can be larger than
16TB, in this case 28TB, and we tested this 192 spindle RAID10 array
with a high parallel workload with both EXT4 and XFS, you'd find that
EXT4 throughput is a small fraction of XFS due to the fact that so much
of EXT4 IO is serialized, precisely because it lacks XFS' allocation
group architecture.

> a question... this example was with directories, how are files (metadata)
> saved? and how is file content saved? and journaling?

http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html

> speed of write and read will be a function of how you designed it to
> use device layer (it's something like a virtual memory utilization, a
> big memory, and many programs trying to use small parts and when need
> use a big part)

Not only that, but how efficiently you can walk the directory tree to
locate inodes. XFS can walk many directory trees in parallel, partly
due to allocation groups. This is one huge advantage it has over
EXT2/3/4, ReiserFS, JFS, etc.

--
Stan

Re: high throughput storage server?

am 24.03.2011 07:33:43 von NeilBrown

On Thu, 24 Mar 2011 00:52:00 -0500 Stan Hoeppner
wrote:

> If you write a file much smaller than the stripe size, say a 1MB file,
> to the filesystem atop this wide RAID10, the file will only be striped
> across 16 of the 192 spindles, with 64KB going to each stripe member, 16
> filesystem blocks, 128 sectors. I don't know about mdraid, but with
> many hardware RAID striping implementations the remaining 176 disks in
> the stripe will have zeros or nulls written for their portion of the
> stripe for this file that is a tiny fraction of the stripe size.

This doesn't make any sense at all. No RAID - hardware or otherwise - is
going to write zeros to most of the stripe like this. The RAID doesn't even
know about the concept of a file, so it couldn't.
The filesystem places files in the virtual device that is the array, and the
RAID just spreads those blocks out across the various devices.

There will be no space wastage.

If you have a 1MB file, then there is no way you can ever get useful 192-way
parallelism across that file. But if you have 192 1MB files, then they will
be spread evenly across your spindles somehow (depending on FS and RAID level)
and if you have multiple concurrent accessors, they could well get close to
192-way parallelism.

NeilBrown


Re: high throughput storage server?

am 24.03.2011 09:05:02 von Stan Hoeppner

Joe Landman put forth on 3/23/2011 11:19 AM:

> MD RAID0 or RAID10 would be the sanest approach, and xfs happily does
> talk nicely to the MD raid system, gathering the stripe information from
> it.

Surely you don't mean a straight mdraid0 over the 384 drives, I assume.
You're referring to the nested case I mentioned, yes?

Yes, XFS does read the mdraid parameters and sets the block and stripe
sizes, etc. accordingly.
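
When the geometry isn't picked up automatically (e.g. behind a hardware RAID
controller), it can be given explicitly at mkfs time. A hedged illustration
with made-up numbers (64KB chunk, 6 data spindles):

  mkfs.xfs -d su=64k,sw=6 /dev/sdb
  mount /dev/sdb /mnt && xfs_info /mnt | grep -i 'sunit\|swidth'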

> The issue though is that xfs stores journals internally by default. You
> can change this, and in specific use cases, an external journal is
> strongly advised. This would be one such use case.

The target workload is read heavy, with very few writes. Even if we added a
write heavy workload to the system, with the journal residing on an array
that's seeing heavy utilization from the primary workload, delayed logging
would make this a non-issue.

Thus, this is not a case where an external log device is needed. In
fact, now that we have the delayed logging feature, cases where an
external log device might be needed are very few and far between.
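
On kernels of that era, delayed logging was an opt-in mount option, e.g.
(mount point and device are illustrative; it later became the default and
the option eventually went away):

  mount -o delaylog,logbsize=256k /dev/md0 /data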

> Though, the OP wants a very read heavy machine, and not a write heavy
> machine. So it makes more sense to have massive amounts of RAM for the

Assuming the same files aren't being re-read, how does massive RAM
quantity for buffer cache help?

> OP, and lots of high speed fabric (Infiniband HCA, 10-40 GbE NICs, ...).
> However, a single system design for the OP's requirements makes very
> little economic or practical sense. Would be very expensive to build.

I estimated the cost of my proposed 10GB/s NFS server at $150-250k
including all required 10GbE switches, the works. Did you read that
post? What is your definition of "very expensive"? Compared to?

> And to keep this on target, MD raid could handle it.

mdraid was mentioned in my system as well. And yes, Neil seems to think
mdraid would be fine, not a CPU hog.

> Unfortunately, xfs snapshots have to be done via LVM2 right now. My
> memory isn't clear on this, there may be an xfs_freeze requirement for
> the snapshot to be really valid. e.g.

Why do you say "unfortunately"? *ALL* Linux filesystem snapshots are
performed with a filesystem freeze implemented in the VFS layer. The
freeze 'was' specific to XFS. It is such a valuable, *needed* feature
that it was bumped into the VFS so all filesystems could take advantage
of it. Are you saying freezing writes to a filesystem before taking a
snapshot is a bad thing? (/incredulous)

http://en.wikipedia.org/wiki/XFS#Snapshots

> xfs_freeze -f /mount/point
> # insert your lvm snapshot command
> xfs_freeze -u /mount/point
>
> I am not sure if this is still required.

It's been fully automatic since 2.6.29, for all Linux filesystems.
Invoking an LVM snapshot automatically freezes the filesystem.

> At the end of the day, it will be *far* more economical to build a
> distributed storage cluster with a parallel file system atop it, than
> build a single large storage unit.

I must call BS on the "far more economical" comment. At the end of the
day, to use your phrase, the cost of any large scale high performance
storage system comes down to the quantity and price of the disk drives
needed to achieve the required spindle throughput. Whether you use a
$20K server chassis to host the NICs, disk controllers and all the
drives, or you used six $3000 server chassis, the costs come out roughly
the same. The big advantages a single chassis server has are simplicity
of design, maintenance, and use. The only downside is single point of
failure, not higher cost, compared to a storage cluster. Failures of
complete server chassis are very rare, BTW, especially quad socket HP
servers.

If it takes 8 of your JackRabbit boxen, 384 drives, to sustain 10+GB/s
using RAID10, maintaining that rate during a rebuild, with a load of 50+
concurrent 200MB/s clients, we're looking at about $200K USD, correct,
$25K per box? Your site doesn't show any pricing that I can find, so I'm
making an educated guess. That cost figure is not substantially
different than my hypothetical configuration, but mine includes $40K of
HP 10GbE switches to connect the clients and the server at full bandwidth.

> We've achieved well north of 10GB/s
> sustained reads and writes from thousands of simultaneous processes
> across thousands of cores (yes, with MD backed RAIDs being part of
> this), for hundreds of GB reads/writes (well into the TB range)

That's great. Also, be honest with the fine folks on the list. You use
mdraid0 or linear for stitching hardware RAID arrays together, similar
to what I mentioned. You're not using mdraid across all 48 drives in
your chassis. If you are, the information on your website is incorrect
at best, misleading at worst, as it mentions "RAID Controllers" and
quantity per system model, 1-4 in the case of the JackRabbit.

> Hardware design is very important here, as are many other features. The
> BOM posted here notwithstanding, very good performance starts with good
> selection of underlying components, and a rational design. Not all
> designs you might see are worth the electrons used to transport them to
> your reader.

Fortunately for the readers here, such unworthy designs you mention
aren't posted on this list.

--
Stan

Re: high throughput storage server?

am 24.03.2011 09:07:53 von Roberto Spadim

i will read xfs again, but check if i'm thinking wrong or right...

i see two ideas about raid0 (i/o rate vs many users)

first let's think of raid0 as something like harddisk firmware....
the problem: we have many plates/heads and just one arm.

a hard disk = many plates + many heads + only one arm to move the heads (maybe
in the future we can use many arms in only one harddisk!)
plates = many sectors = many bits (a harddisk works like NOR memories, only
with bits, not with bytes or pages like NAND memories; for bytes it
must be head based (stripe) or do many reads (time consuming))

firmware will use:
raid0 stripe => make a group of bits from different plates/heads
(1,2,3,4,5) into a 'block/byte/character' unit (if you have 8 heads you can
read a byte with only one 'read all heads' command and merge the
bits from heads 1,2,3,4,5,6,7,8 to get a byte; it can be done in
parallel, like raid0 stripe does on linux software raid, with only 1
read cycle)
raid0 linear => read many bits from one plate to create a 'sector' of
bits (a 'block unit' too); this can only be done as a sequential read
(many read cycles): wait for the read of bit1 before reading bits 2,3,4,5,6,7,8,9...
different from stripe, where you send many reads and afterwards all the reads
merge their bits to get a byte

-----
it's like a 3GHz cpu with 1 core vs a 1GHz cpu with 3 cores: which is faster?
if you need just 1 cpu cycle, 3GHz is faster
the problem with harddisks is just one: random reads.
think about a mix of ssd and harddisks (there are some disks that have
it! did you try them? they are nice! there's bcache and a facebook
linux kernel module to emulate this at the o.s. level) and you will not have
the random read problem, since ssd is very good for random reads
-----
the only magic i think a filesystem can do is:
1) online compression - think about 32MB blocks: if you read 12MB of
compressed information you can have 32MB of uncompressed information;
if you want more information you will need to jump to the sector of the next
32MB block, and you could use stripe at raid0 here to allow the second disk to
be used without waiting for the access time of the first disk
2) grouping of similar file access (i think that's what xfs calls
allocation groups). this could be done with statistics about: access time, read
rate, write rate, file size, create/delete file rate, file type
(symbolic links, directories, files, devices, pipes, etc), metadata,
journaling
3) how the device works: good for write, good for read, good for sequential
read (few arms - stripe), good for random read (ssd), good for multitasking
(many arms - linear)
----------------

reading harddisk information at database forums/blogs
(intensive disk users)...
harddisks work better with big blocks since they pay one small
access time to read more information...
read rate = bytes read / total time.
total time = access time + read time.
access time = arm positioning + disk positioning,
read time = disk speed (7200rpm, 10krpm, 15krpm...) and sector bits per
disk revolution for harddisks.

thinking about this... sequential reads are fast, random reads are slow

how to optimise random reads? read ahead, raid0 (an arm for each group of sectors)
how can a filesystem optimize random reads? try not to fragment the most
accessed files, put them close together, convert random read patterns into cached
sequential information, use statistics of most read, most written,
file size, create/delete rate, etc to select the best candidates for
future use (a predictive idea)

i think that's all a filesystem and raid0 could do

2011/3/24 NeilBrown :
> On Thu, 24 Mar 2011 00:52:00 -0500 Stan Hoeppner com>
> wrote:
>
>> If you write a file much smaller than the stripe size, say a 1MB file,
>> to the filesystem atop this wide RAID10, the file will only be striped
>> across 16 of the 192 spindles, with 64KB going to each stripe member, 16
>> filesystem blocks, 128 sectors. I don't know about mdraid, but with
>> many hardware RAID striping implementations the remaining 176 disks in
>> the stripe will have zeros or nulls written for their portion of the
>> stripe for this file that is a tiny fraction of the stripe size.
>
> This doesn't make any sense at all. No RAID - hardware or otherwise - is
> going to write zeros to most of the stripe like this. The RAID doesn't even
> know about the concept of a file, so it couldn't.
> The filesystem places files in the virtual device that is the array, and the
> RAID just spreads those blocks out across the various devices.
>
> There will be no space wastage.
>
> If you have a 1MB file, then there is no way you can ever get useful 192-way
> parallelism across that file. But if you have 192 1MB files, then they will
> be spread evenly across your spindles somehow (depending on FS and RAID level)
> and if you have multiple concurrent accessors, they could well get close to
> 192-way parallelism.
>
> NeilBrown
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid"=
in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at =A0http://vger.kernel.org/majordomo-info.html
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: high throughput storage server?

am 24.03.2011 09:31:21 von Stan Hoeppner

NeilBrown put forth on 3/24/2011 1:33 AM:
> On Thu, 24 Mar 2011 00:52:00 -0500 Stan Hoeppner
> wrote:
>
>> If you write a file much smaller than the stripe size, say a 1MB file,
>> to the filesystem atop this wide RAID10, the file will only be striped
>> across 16 of the 192 spindles, with 64KB going to each stripe member, 16
>> filesystem blocks, 128 sectors. I don't know about mdraid, but with
>> many hardware RAID striping implementations the remaining 176 disks in
>> the stripe will have zeros or nulls written for their portion of the
>> stripe for this file that is a tiny fraction of the stripe size.
>
> This doesn't make any sense at all. No RAID - hardware or otherwise - is
> going to write zeros to most of the stripe like this. The RAID doesn't even
> know about the concept of a file, so it couldn't.
> The filesystem places files in the virtual device that is the array, and the
> RAID just spreads those blocks out across the various devices.
>
> There will be no space wastage.

Well that's good to know then. Apparently I was confusing partial block
writes with partial stripe writes. Thanks for clarifying this Neil.

> If you have a 1MB file, then there is no way you can ever get useful 192-way
> parallelism across that file.

That was exactly my point. Hence my recommendation against very wide
stripe arrays for general purpose fileservers.

> Bit if you have 192 1MB files, then they will
> be spread even across your spindles some how (depending on FS and RAID level)
> and if you have multiple concurrent accessors, they could well get close to
> 192-way parallelism.

The key here being parallelism, to a great extent. All 192 files would
need to be in the queue simultaneously. This would have to be a
relatively busy file or DB server.

--
Stan

Re: high throughput storage server?

am 24.03.2011 14:12:59 von Joe Landman

On 03/24/2011 04:05 AM, Stan Hoeppner wrote:

>> At the end of the day, it will be *far* more economical to build a
>> distributed storage cluster with a parallel file system atop it, than
>> build a single large storage unit.
>
> I must call BS on the "far more economical" comment. At the end of the

I find it funny ... really, that the person who hasn't designed and
built the thing that we have is calling BS on us.

This is the reason why email filters were developed.

In another email, Neil corrected some of Stan's other fundamental
misconceptions on RAID writing. Christoph corrected others. Free
advice here ... proceed with caution if you are considering using *any*
of his advice, and get it sanity checked beforehand.

[...]

>> We've achieved well north of 10GB/s

It is important to note this. We have. He hasn't.

One thing we deal with on a fairly regular basis is people slapping
components together that they think will work, and having expectations
set really high on the performance side. Expectations get moderated by
experience. Those who've done these things know what troubles await;
those who haven't look at specs, say "I need X of these, Y of those", and
assume their performance troubles will be gone. It doesn't work that way.
Watching such processes unfold is akin to watching a slow motion train
wreck in a movie ... you don't want it to occur, but it will, and it won't end well.

>> sustained reads and writes from thousands of simultaneous processes
>> across thousands of cores (yes, with MD backed RAIDs being part of
>> this), for hundreds of GB reads/writes (well into the TB range)
>
> That's great. Also, be honest with the fine folks on the list. You use
> mdraid0 or linear for stitching hardware RAID arrays together, similar
> to what I mentioned. You're not using mdraid across all 48 drives in

Again, since we didn't talk about how we use MD RAID, he doesn't know.
Then constructs a strawman and proceeds to knock it down.

I won't fisk the rest of this, just make sure that, before you take his
advice, you check with someone that's done it. He doesn't grok why one
might need lots of ram in a read heavy scenario, or how RAID writes
work, or ...

Yeah, you need to be pretty careful taking advice on building RAID or
high performance scalable file server systems like this from people who
haven't, who are guessing, and who are getting their answers corrected at a
deep fundamental level by others.

[...]

> Fortunately for the readers here, such unworthy designs you mention
> aren't posted on this list.

.... says the person who hasn't designed/built/tested configurations
that the other group they are criticizing has successfully deployed ...

As a reminder of thread history, he started with singing the praises of
the Nexsan FC targets, indicated MD raid wasn't up to the task, that it
wasn't "a professionally used solution" or similar statement. Then he
attacked anyone who disagreed with, or pointed out flaws in, his
statements/arguments. When people like me (and others) suggested cluster
file systems, he went his own single system design way and again, using
FC/SAS, decided that a linear stripe was the right approach.

Heh!

Nothing to see here folks, adjust your filters accordingly.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615

Re: high throughput storage server?

am 24.03.2011 18:07:53 von Christoph Hellwig

On Wed, Mar 23, 2011 at 12:19:39PM -0400, Joe Landman wrote:
> The issue though is that xfs stores journals internally by default.
> You can change this, and in specific use cases, an external journal
> is strongly advised. This would be one such use case.

In general, if you have enough spindles, or an SSD for the log in
an otherwise disk based setup, an external log will always be
a win. For many workloads the log will be the only source of backwards seeks.
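
For anyone unfamiliar with the mechanics, an external XFS log is named at
mkfs time and again at mount time; a minimal sketch with a hypothetical SSD
partition as the log device:

  mkfs.xfs -l logdev=/dev/sdx1,size=128m /dev/md0
  mount -o logdev=/dev/sdx1 /dev/md0 /data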

This is slightly off-topic here though, because as Joe already
says correctly, it won't matter too much for a read heavy workload.

> Unfortunately, xfs snapshots have to be done via LVM2 right now. My
> memory isn't clear on this, there may be an xfs_freeze requirement
> for the snapshot to be really valid. e.g.

That hasn't been needed for a long time now - device mapper
now calls the freeze_fs method to invoke exactly the same code
to freeze the filesystem.


Re: high throughput storage server?

am 25.03.2011 08:06:30 von Stan Hoeppner

Joe Landman put forth on 3/24/2011 8:12 AM:
> On 03/24/2011 04:05 AM, Stan Hoeppner wrote:

>> I must call BS on the "far more economical" comment. At the end of the
>
> I find it funny ... really, that the person whom hasn't designed and
> built the thing that we have, is calling BS on us.

Demonstrate your systems are "far more economical" than the estimate I
gave for the system I specified. You made the claim, I challenged it.
Back your claim.

[lots of smoke/mirrors personal attacks deleted]

> Again, since we didn't talk about how we use MD RAID, he doesn't know.
> Then constructs a strawman and proceeds to knock it down.

Answer the question: do you use/offer/sell hardware RAID controllers?

[lots more smoke/mirrors personal attacks deleted]

--
Stan