md-raid and block sizes
On 27.12.2010 13:00:28 by spider
Hello,
With the advent of the WD*EARS drives and the "Advanced Format"
scheme that uses 4K physical sectors, I wanted to pop a quick question to see
how the md metadata is aligned, and how it should be aligned to get
proper performance out of these devices.
So, for RAID1, I'll assume there is some metadata overhead on the
drives; how large is this block? Will I need to make a partition
on the md device in order to get proper alignment of filesystem =>
platters?
For both RAID1 and RAID5/6, what are the appropriate stride sizes for
ext3/4?
For more advanced RAID configurations (5-6), how should the ext3/4
stripe size be configured?
Yes, these are a lot of similarly naive questions. I'm asking mostly
because devices are changing from 512-byte to 4K sectors, with
quite interesting changes in performance, and I wanted to figure out
the current state of software RAID: at which "level" of the
stack (partition, raid, partition, filesystem) do you have to account
for the block sizes in order not to degrade performance of the devices?
(In a perfect world I'd be able to purchase a stack of the devices,
test for myself, and come back with a report. However, money and hardware
are scarce resources. ;)
ps. please keep me CC'd as I'm not subscribed to the list.
// Spider
Re: md-raid and block sizes
On 27.12.2010 20:02:50 by Doug Dumitru
Mr. Ljungmark,
The metadata is stored at the end of each RAID device, so if you are
dealing with an "Advanced Format" drive, just make sure your
individual partitions start at alignments that make sense.
Normally this is just 'fdisk -u', starting the partition on a
multiple of 8 sectors (start at 64).
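For example, a minimal sketch (the device name is a placeholder; adapt to
your drive):

    # /dev/sdb is illustrative -- substitute your actual EARS drive.
    # -u makes fdisk display and accept sector units; create the
    # partition starting at sector 64 (64 * 512 B = 32 KiB, a
    # multiple of 4 KiB):
    fdisk -u /dev/sdb
    #   n  (new partition), first sector: 64
    # Verify afterwards that the start sector is a multiple of 8:
    fdisk -lu /dev/sdb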
In terms of stripe sizes, they are all multiples of the 4K page size
anyway, so it is not really possible for them to be "wrong" with respect
to drive format alignment.
File systems should use 4K blocks or a multiple thereof. It has been a
while, but I think only XFS really breaks this rule unless you override
the block size to 4K; the ext family is fine.
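On the stride question from the original post, a hedged sketch, assuming
a 4-disk RAID5 with md's default 64K chunk (recompute for your geometry):

    # stride       = chunk / fs block           = 64 KiB / 4 KiB = 16
    # stripe-width = stride * data disks (4-1)  = 16 * 3         = 48
    mkfs.ext4 -b 4096 -E stride=16,stripe-width=48 /dev/md0
    # For RAID1 there is no striping, so a plain "mkfs.ext4 -b 4096"
    # on an aligned partition is enough.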
The same rules apply to most SSDs. Most SSDs prefer 4K alignment
because of how the FTL (Flash Translation Layer) operates, even though
<4K will sometimes still work pretty well.
--
Doug Dumitru
EasyCo LLC
Re: md-raid and block sizes
On 28.12.2010 09:46:20 by spider
Thank you,
This makes planning ahead a bit easier, and means that I "only" have
to worry about the traditional issues regarding block sizes, stripe and
stride on "naked" disks in the array.
Regards,
Spid
Re: md-raid and block sizes
On 28.12.2010 12:29:26 by hansBKK
This doesn't actually relate to the block-size issue, but a caveat:
I've heard that these "green" drives are not suitable for use in a
RAID.
The specific issue is apparently that these drives spin down very
frequently, but most RAID implementations keep spinning them back up
again just as frequently (perhaps unnecessarily?), causing undue
wear and tear on the drives' mechanics and ultimately premature
failure.
Of course this could be spin from the vendors to get people to spend
more money on the "enterprise" level drives. I personally am a firm
believer in saving money by buying consumer-level hardware and using
the savings to buy extra redundancy.
It just so happens that I've bought a batch of Samsung 2TB drives for
a RAID6 I'm building, and it turns out they are quite similar to the
WD*EARS, both in their use of "new format" 4K blocks and in some of
the green features. I haven't yet detected any undue drive
stopping/starting, but then again they are so quiet I'm not sure how
to check...
Confirmation or refutation of these thoughts would be most welcome.
Re: md-raid and block sizes
On 28.12.2010 12:45:13 by NeilBrown
On Tue, 28 Dec 2010 18:29:26 +0700 hansbkk@gmail.com wrote:
> The specific issue is apparently that these drives spin down very
> frequently, but most RAID implementations keep spinning them back up
> again just as frequently (perhaps unnecessarily?) [...]
After a brief period of no writes, md will update the bitmap and/or the
superblock to record that the array is clean (it may update the bitmap at
other times too, but that is not relevant here).
If the auto-spindown time of the drive is less than md's delay before
marking the array clean, then you could get extra spin-ups.
The delay for updating the superblock is in sysfs, in the
md/safe_mode_delay file, which defaults to 0.2 seconds (200 msec).
The delay for updating the bitmap is set by an mdadm option (--bitmap-delay
or something like that) when adding a bitmap to an array, and I think it is
available in sysfs in md/bitmap/something in recent kernels.
The actual delay before a write is between 2 and 3 times this number.
I think it defaults to 5 seconds (hence a 10 to 15 second delay).
So if the drive spins down sooner than 15 seconds after the last IO, there
could be a problem, but tuning md can get rid of it.
If the drive spin-down time is longer than 15 seconds, there should be no
unnecessary spin-ups.
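A hedged sketch of tuning both delays (the sysfs path and the mdadm option
name as I recall them; check against your kernel and mdadm versions):

    # Let the array stay "dirty" longer before the superblock update
    # (value in seconds; the default is about 0.2):
    echo 5 > /sys/block/md0/md/safe_mode_delay
    # Recreate the internal bitmap with a longer update delay
    # (mdadm's -d/--delay, in seconds):
    mdadm --grow /dev/md0 --bitmap=none
    mdadm --grow /dev/md0 --bitmap=internal --delay=30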
If anyone has any data on default spin-down times of these "green" drives I
would be keen to hear about it.
Thanks,
NeilBrown
Re: md-raid and block sizes
On 28.12.2010 12:45:15 by John Robinson
On 27/12/2010 19:02, Doug Dumitru wrote:
[...]
> The same rules apply to most SSDs. Most SSDs prefer 4K alignment
> because of how the FTL (Flash Translation Layers) operate, even though
> <4K will sometimes still work pretty well.
If I remember correctly, it may be better to align partitions on SSDs at
a larger granularity: the SSD's erase block size, which may be up to 512K
depending on the SSD. Though I don't remember at all how well Linux
filesystems, the VFS, block layer drivers and so on handle SSD TRIM.
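As a rough sketch, aligning partitions on a 1 MiB boundary covers any
erase-block size up to 1 MiB (the device name is illustrative):

    # 1 MiB is a common multiple of 4K, 128K and 512K erase blocks.
    parted /dev/sdc mklabel gpt
    parted /dev/sdc mkpart primary 1MiB 100%
    # Newer parted can check the result:
    parted /dev/sdc align-check optimal 1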
Cheers,
John.
Re: md-raid and block sizes
On 28.12.2010 12:45:15 by Roman Mamedov
On Tue, 28 Dec 2010 18:29:26 +0700 hansbkk@gmail.com wrote:
> I've heard that these "green" drives are not suitable for use in a
> RAID. [...]
This is all just FUD, and you shouldn't repeat it and spread it further,
helping the vendors' marketing departments get people to spend money on
an "Enterprise" sticker.
WD*EARS/EADS, or at least most older models in that line-up, do indeed
unload their heads after a short period of time; however, that in no way
inhibits their use in RAID (with this issue there's no difference at all,
RAID or no RAID), and it is user-adjustable using the WDIDLE3 utility:
https://encrypted.google.com/search?q=WDIDLE3
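For those who'd rather not boot DOS, there is also a Linux utility,
idle3-tools; a hedged sketch assuming it is installed (option names from
memory, and the device name is a placeholder):

    # Read the current idle3 (head-unload) timer on a WD drive:
    idle3ctl -g /dev/sdb
    # Raise it (low values are in tenths of a second), or disable
    # the unloading entirely:
    idle3ctl -s 100 /dev/sdb
    idle3ctl -d /dev/sdb
    # Power-cycle the drive for the change to take effect.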
Samsung drives do not have any spindown/load/unload problems, but be
aware of this:
http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
--
With respect,
Roman
Re: md-raid and block sizes
On 28.12.2010 12:50:50 by Mikael Abrahamsson
On Tue, 28 Dec 2010, Neil Brown wrote:
> If anyone has any data on default spin-down times of these "green"
> drives I would be keen to hear about it.
The problem I've seen is the load/unload cycle count increasing; the head
parking occurs after 8 seconds.
I've never seen one spin down, though. I wonder if the original poster
was talking about the head parking rather than drive spindown.
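A simple, hedged way to check (assuming smartmontools is installed;
attribute 193 is Load_Cycle_Count on these drives):

    # Sample the load/unload counter twice while the array is idle;
    # if it climbs, the heads are parking and md I/O is waking them.
    smartctl -A /dev/sda | grep -i load_cycle
    sleep 600
    smartctl -A /dev/sda | grep -i load_cycle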
--
Mikael Abrahamsson email: swmike@swm.pp.se
Re: md-raid and block sizes
On 28.12.2010 12:51:26 by Roman Mamedov
On Tue, 28 Dec 2010 22:45:13 +1100
Neil Brown wrote:
> If anyone has any data on default spin-down times of these "green"
> drives I would be keen to hear about it.
According to many sources, the factory default on WD Green is 8 seconds:
https://encrypted.google.com/search?hl=en&q=wd+load+unload+factory+default+seconds
--
With respect,
Roman