potentially lost largeish raid5 array..

on 23.09.2011 03:50:36 by Thomas Fjellstrom

Hi,

I've been struggling with a SAS card recently that has had poor driver support
for a long time, and tonight it's decided to kick every drive in the array one
after the other. Now mdstat shows:

md1 : active raid5 sdf[0](F) sdh[7](F) sdi[6](F) sdj[5](F) sde[3](F) sdd[2](F)
sdg[1](F)
5860574208 blocks super 1.1 level 5, 512k chunk, algorithm 2 [7/0]
[_______]
bitmap: 3/8 pages [12KB], 65536KB chunk

Does the fact that I'm using a bitmap save my rear here? Or am I hosed? If I'm
not hosed, is there a way I can recover the array without rebooting? Maybe
just a --stop and an --assemble? If that won't work, will a reboot be ok?

I'd really prefer not to have lost all of my data. Please tell me (please)
that it is possible to recover the array. All but sdi are still visible in
/dev (I may be able to get it back via hotplug, but it'd come back as sdk or
something).

--
Thomas Fjellstrom
tfjellstrom@shaw.ca

Re: potentially lost largeish raid5 array..

on 23.09.2011 06:32:10 by NeilBrown

On Thu, 22 Sep 2011 19:50:36 -0600 Thomas Fjellstrom wrote:

> Hi,
>
> I've been struggling with a SAS card recently that has had poor driver support
> for a long time, and tonight it's decided to kick every drive in the array one
> after the other. Now mdstat shows:
>
> md1 : active raid5 sdf[0](F) sdh[7](F) sdi[6](F) sdj[5](F) sde[3](F) sdd[2](F)
> sdg[1](F)
> 5860574208 blocks super 1.1 level 5, 512k chunk, algorithm 2 [7/0]
> [_______]
> bitmap: 3/8 pages [12KB], 65536KB chunk
>
> Does the fact that I'm using a bitmap save my rear here? Or am I hosed? If I'm
> not hosed, is there a way I can recover the array without rebooting? Maybe
> just a --stop and an --assemble? If that won't work, will a reboot be ok?
>
> I'd really prefer not to have lost all of my data. Please tell me (please)
> that it is possible to recover the array. All but sdi are still visible in
> /dev (I may be able to get it back via hotplug, but it'd come back as sdk or
> something).
>

mdadm --stop /dev/md1

mdadm --examine /dev/sd[fhijedg]
mdadm --assemble --verbose /dev/md1 /dev/sd[fhijedg]

Report all output.

NeilBrown



Re: potentially lost largeish raid5 array..

on 23.09.2011 06:49:12 by Thomas Fjellstrom

On September 22, 2011, NeilBrown wrote:
> On Thu, 22 Sep 2011 19:50:36 -0600 Thomas Fjellstrom
>
> wrote:
> > Hi,
> >
> > I've been struggling with a SAS card recently that has had poor driver
> > support for a long time, and tonight its decided to kick every drive in
> > the array one after the other. Now mdstat shows:
> >
> > md1 : active raid5 sdf[0](F) sdh[7](F) sdi[6](F) sdj[5](F) sde[3](F)
> > sdd[2](F) sdg[1](F)
> >
> > 5860574208 blocks super 1.1 level 5, 512k chunk, algorithm 2 [7/0]
> >
> > [_______]
> >
> > bitmap: 3/8 pages [12KB], 65536KB chunk
> >
> > Does the fact that I'm using a bitmap save my rear here? Or am I hosed?
> > If I'm not hosed, is there a way I can recover the array without
> > rebooting? maybe just a --stop and a --assemble ? If that won't work,
> > will a reboot be ok?
> >
> > I'd really prefer not to have lost all of my data. Please tell me
> > (please) that it is possible to recover the array. All but sdi are still
> > visible in /dev (I may be able to get it back via hotplug maybe, but
> > it'd get sdk or something).
>
> mdadm --stop /dev/md1
>
> mdadm --examine /dev/sd[fhijedg]
> mdadm --assemble --verbose /dev/md1 /dev/sd[fhijedg]
>
> Report all output.
>
> NeilBrown

Hi, thanks for the help. Seems the SAS card/driver is in a funky state at the
moment. The --stop worked*. But --examine just gives "no md superblock
detected", and dmesg reports io errors for all drives.

I've just reloaded the driver, and things seem to have come back:

root@boris:~# mdadm --examine /dev/sd[fhijedg]
/dev/sdd:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x1
Array UUID : 7d0e9847:ec3a4a46:32b60a80:06d0ee1c
Name : natasha:0
Creation Time : Wed Oct 14 08:55:25 2009
Raid Level : raid5
Raid Devices : 7

Avail Dev Size : 1953524904 (931.51 GiB 1000.20 GB)
Array Size : 11721148416 (5589.08 GiB 6001.23 GB)
Used Dev Size : 1953524736 (931.51 GiB 1000.20 GB)
Data Offset : 264 sectors
Super Offset : 0 sectors
State : clean
Device UUID : c36d428a:e3bf801a:409009e1:b75207ed

Internal Bitmap : 2 sectors from superblock
Update Time : Thu Sep 22 19:21:04 2011
Checksum : 1eb2e3e5 - correct
Events : 1241766

Layout : left-symmetric
Chunk Size : 512K

Device Role : Active device 2
Array State : AAAAA.A ('A' == active, '.' == missing)
/dev/sde:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x1
Array UUID : 7d0e9847:ec3a4a46:32b60a80:06d0ee1c
Name : natasha:0
Creation Time : Wed Oct 14 08:55:25 2009
Raid Level : raid5
Raid Devices : 7

Avail Dev Size : 1953524904 (931.51 GiB 1000.20 GB)
Array Size : 11721148416 (5589.08 GiB 6001.23 GB)
Used Dev Size : 1953524736 (931.51 GiB 1000.20 GB)
Data Offset : 264 sectors
Super Offset : 0 sectors
State : clean
Device UUID : 9c5ced42:03179de0:5d9520a0:9da28ce4

Internal Bitmap : 2 sectors from superblock
Update Time : Thu Sep 22 19:21:04 2011
Checksum : 54167ee2 - correct
Events : 1241766

Layout : left-symmetric
Chunk Size : 512K

Device Role : Active device 3
Array State : AAAAA.A ('A' == active, '.' == missing)
/dev/sdf:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x1
Array UUID : 7d0e9847:ec3a4a46:32b60a80:06d0ee1c
Name : natasha:0
Creation Time : Wed Oct 14 08:55:25 2009
Raid Level : raid5
Raid Devices : 7

Avail Dev Size : 1953524904 (931.51 GiB 1000.20 GB)
Array Size : 11721148416 (5589.08 GiB 6001.23 GB)
Used Dev Size : 1953524736 (931.51 GiB 1000.20 GB)
Data Offset : 264 sectors
Super Offset : 0 sectors
State : clean
Device UUID : 6622021f:ad59cbf8:6395788e:0c78edca

Internal Bitmap : 2 sectors from superblock
Update Time : Thu Sep 22 19:21:04 2011
Checksum : 1d125cc8 - correct
Events : 1241766

Layout : left-symmetric
Chunk Size : 512K

Device Role : Active device 0
Array State : AAAAA.A ('A' == active, '.' == missing)
/dev/sdg:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x1
Array UUID : 7d0e9847:ec3a4a46:32b60a80:06d0ee1c
Name : natasha:0
Creation Time : Wed Oct 14 08:55:25 2009
Raid Level : raid5
Raid Devices : 7

Avail Dev Size : 1953524904 (931.51 GiB 1000.20 GB)
Array Size : 11721148416 (5589.08 GiB 6001.23 GB)
Used Dev Size : 1953524736 (931.51 GiB 1000.20 GB)
Data Offset : 264 sectors
Super Offset : 0 sectors
State : clean
Device UUID : d566a663:226e4f40:b5b688fb:1c538e7e

Internal Bitmap : 2 sectors from superblock
Update Time : Thu Sep 22 19:21:04 2011
Checksum : c9ebb20e - correct
Events : 1241766

Layout : left-symmetric
Chunk Size : 512K

Device Role : Active device 1
Array State : AAAAA.A ('A' == active, '.' == missing)
/dev/sdh:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x1
Array UUID : 7d0e9847:ec3a4a46:32b60a80:06d0ee1c
Name : natasha:0
Creation Time : Wed Oct 14 08:55:25 2009
Raid Level : raid5
Raid Devices : 7

Avail Dev Size : 1953524736 (931.51 GiB 1000.20 GB)
Array Size : 11721148416 (5589.08 GiB 6001.23 GB)
Data Offset : 432 sectors
Super Offset : 0 sectors
State : clean
Device UUID : 5507b034:a8b38618:a849baa2:a2dfdeb1

Internal Bitmap : 2 sectors from superblock
Update Time : Thu Sep 22 19:21:04 2011
Checksum : 4daeb793 - correct
Events : 1241766

Layout : left-symmetric
Chunk Size : 512K

Device Role : Active device 6
Array State : AAAAA.A ('A' == active, '.' == missing)
/dev/sdi:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x1
Array UUID : 7d0e9847:ec3a4a46:32b60a80:06d0ee1c
Name : natasha:0
Creation Time : Wed Oct 14 08:55:25 2009
Raid Level : raid5
Raid Devices : 7

Avail Dev Size : 1953524736 (931.51 GiB 1000.20 GB)
Array Size : 11721148416 (5589.08 GiB 6001.23 GB)
Data Offset : 432 sectors
Super Offset : 0 sectors
State : active
Device UUID : 9f036406:e5783077:4d9fe524:1966e68e

Internal Bitmap : 2 sectors from superblock
Update Time : Thu Sep 22 19:19:51 2011
Checksum : dd3e54d8 - correct
Events : 1241740

Layout : left-symmetric
Chunk Size : 512K

Device Role : Active device 5
Array State : AAAAAAA ('A' == active, '.' == missing)
/dev/sdj:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x1
Array UUID : 7d0e9847:ec3a4a46:32b60a80:06d0ee1c
Name : natasha:0
Creation Time : Wed Oct 14 08:55:25 2009
Raid Level : raid5
Raid Devices : 7

Avail Dev Size : 1953524904 (931.51 GiB 1000.20 GB)
Array Size : 11721148416 (5589.08 GiB 6001.23 GB)
Used Dev Size : 1953524736 (931.51 GiB 1000.20 GB)
Data Offset : 264 sectors
Super Offset : 0 sectors
State : clean
Device UUID : 1c7f3eb5:94fc143b:7ec49dec:fc575fea

Internal Bitmap : 2 sectors from superblock
Update Time : Thu Sep 22 19:21:04 2011
Checksum : 732f6b75 - correct
Events : 1241766

Layout : left-symmetric
Chunk Size : 512K

Device Role : Active device 4
Array State : AAAAA.A ('A' == active, '.' == missing)

root@boris:~# mdadm --assemble --verbose /dev/md1 /dev/sd[fhijedg]
mdadm: looking for devices for /dev/md1
mdadm: /dev/sdd is identified as a member of /dev/md1, slot 2.
mdadm: /dev/sde is identified as a member of /dev/md1, slot 3.
mdadm: /dev/sdf is identified as a member of /dev/md1, slot 0.
mdadm: /dev/sdg is identified as a member of /dev/md1, slot 1.
mdadm: /dev/sdh is identified as a member of /dev/md1, slot 6.
mdadm: /dev/sdi is identified as a member of /dev/md1, slot 5.
mdadm: /dev/sdj is identified as a member of /dev/md1, slot 4.
mdadm: added /dev/sdg to /dev/md1 as 1
mdadm: added /dev/sdd to /dev/md1 as 2
mdadm: added /dev/sde to /dev/md1 as 3
mdadm: added /dev/sdj to /dev/md1 as 4
mdadm: added /dev/sdi to /dev/md1 as 5
mdadm: added /dev/sdh to /dev/md1 as 6
mdadm: added /dev/sdf to /dev/md1 as 0
mdadm: /dev/md1 has been started with 6 drives (out of 7).


Now I guess the question is, how to get that last drive back in? would:

mdadm --re-add /dev/md1 /dev/sdi

work?

--
Thomas Fjellstrom
tfjellstrom@shaw.ca

Re: potentially lost largeish raid5 array..

on 23.09.2011 06:58:34 by Roman Mamedov

On Thu, 22 Sep 2011 22:49:12 -0600
Thomas Fjellstrom wrote:

> Now I guess the question is, how to get that last drive back in? would:
>
> mdadm --re-add /dev/md1 /dev/sdi
>
> work?

It should, or at least it will not harm anything, but keep in mind that simply
trying to continue using the array (raid5 with a largeish member count) on a
flaky controller card is akin to playing with fire.

--
With respect,
Roman


Re: potentially lost largeish raid5 array..

on 23.09.2011 07:10:28 by Thomas Fjellstrom

On September 22, 2011, Roman Mamedov wrote:
> On Thu, 22 Sep 2011 22:49:12 -0600
>
> Thomas Fjellstrom wrote:
> > Now I guess the question is, how to get that last drive back in? would:
> >
> > mdadm --re-add /dev/md1 /dev/sdi
> >
> > work?
>
> It should, or at least it will not harm anything, but keep in mind that
> simply trying to continue using the array (raid5 with a largeish member
> count) on a flaky controller card is akin to playing with fire.

Yeah, I think I won't be using the 3.0 kernel after tonight. At least the
older kernels would just lock up the card and not cause md to boot the disks
one at a time.

I /really really/ wish the driver for this card was more stable, but you deal
with what you've got (in my case a $100 2 port SAS/8 port SATA card). I've
been rather lucky so far, it seems; I hope my luck keeps up long enough for
either the driver to stabilize, me to get a new card, or at the very least, to
get a third drive for my backup array, so if the main array does go down, I
have a recent daily sync.

--
Thomas Fjellstrom
tfjellstrom@shaw.ca

Re: potentially lost largeish raid5 array..

on 23.09.2011 07:11:08 by NeilBrown

On Thu, 22 Sep 2011 22:49:12 -0600 Thomas Fjellstrom wrote:

> On September 22, 2011, NeilBrown wrote:
> > On Thu, 22 Sep 2011 19:50:36 -0600 Thomas Fjellstrom wrote:
> >
> > > Hi,
> > >
> > > I've been struggling with a SAS card recently that has had poor driver
> > > support for a long time, and tonight it's decided to kick every drive in
> > > the array one after the other. Now mdstat shows:
> > >
> > > md1 : active raid5 sdf[0](F) sdh[7](F) sdi[6](F) sdj[5](F) sde[3](F)
> > > sdd[2](F) sdg[1](F)
> > >
> > > 5860574208 blocks super 1.1 level 5, 512k chunk, algorithm 2 [7/0]
> > >
> > > [_______]
> > >
> > > bitmap: 3/8 pages [12KB], 65536KB chunk
> > >
> > > Does the fact that I'm using a bitmap save my rear here? Or am I hosed?
> > > If I'm not hosed, is there a way I can recover the array without
> > > rebooting? Maybe just a --stop and an --assemble? If that won't work,
> > > will a reboot be ok?
> > >
> > > I'd really prefer not to have lost all of my data. Please tell me
> > > (please) that it is possible to recover the array. All but sdi are still
> > > visible in /dev (I may be able to get it back via hotplug, but it'd come
> > > back as sdk or something).
> >
> > mdadm --stop /dev/md1
> >
> > mdadm --examine /dev/sd[fhijedg]
> > mdadm --assemble --verbose /dev/md1 /dev/sd[fhijedg]
> >
> > Report all output.
> >
> > NeilBrown
>
> Hi, thanks for the help. Seems the SAS card/driver is in a funky state at the
> moment. The --stop worked*. But --examine just gives "no md superblock
> detected", and dmesg reports io errors for all drives.
>
> I've just reloaded the driver, and things seem to have come back:

That's good!!


>
> root@boris:~# mdadm --examine /dev/sd[fhijedg]
.....

sdi has a slightly older event count than the others - the update time is 1:13
older. So it presumably died first.

>
> root@boris:~# mdadm --assemble --verbose /dev/md1 /dev/sd[fhijedg]
> mdadm: looking for devices for /dev/md1
> mdadm: /dev/sdd is identified as a member of /dev/md1, slot 2.
> mdadm: /dev/sde is identified as a member of /dev/md1, slot 3.
> mdadm: /dev/sdf is identified as a member of /dev/md1, slot 0.
> mdadm: /dev/sdg is identified as a member of /dev/md1, slot 1.
> mdadm: /dev/sdh is identified as a member of /dev/md1, slot 6.
> mdadm: /dev/sdi is identified as a member of /dev/md1, slot 5.
> mdadm: /dev/sdj is identified as a member of /dev/md1, slot 4.
> mdadm: added /dev/sdg to /dev/md1 as 1
> mdadm: added /dev/sdd to /dev/md1 as 2
> mdadm: added /dev/sde to /dev/md1 as 3
> mdadm: added /dev/sdj to /dev/md1 as 4
> mdadm: added /dev/sdi to /dev/md1 as 5
> mdadm: added /dev/sdh to /dev/md1 as 6
> mdadm: added /dev/sdf to /dev/md1 as 0
> mdadm: /dev/md1 has been started with 6 drives (out of 7).
>
>
> Now I guess the question is, how to get that last drive back in? would:
>
> mdadm --re-add /dev/md1 /dev/sdi
>
> work?
>

re-add should work, yes. It will use the bitmap info to only update the
blocks that need updating - presumably not many.
It might be interesting to run
mdadm -X /dev/sdf

first to see what the bitmap looks like - how many dirty bits and what the
event counts are.

But yes: --re-add should make it all happy.

NeilBrown
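
Putting those suggestions together, the check-then-re-add sequence amounts to
roughly the following (a sketch only; it assumes the array has been assembled
as /dev/md1 and that the dropped member came back as /dev/sdi):

# inspect the write-intent bitmap on a surviving member (here /dev/sdf)
mdadm -X /dev/sdf

# put the previously failed member back; the bitmap limits the resync
mdadm /dev/md1 --re-add /dev/sdi

# watch the (short) recovery and confirm the final state
cat /proc/mdstat
mdadm --detail /dev/md1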



Re: potentially lost largeish raid5 array..

on 23.09.2011 07:22:56 by Thomas Fjellstrom

On September 22, 2011, NeilBrown wrote:
> On Thu, 22 Sep 2011 22:49:12 -0600 Thomas Fjellstrom
>
> wrote:
> > On September 22, 2011, NeilBrown wrote:
> > > On Thu, 22 Sep 2011 19:50:36 -0600 Thomas Fjellstrom
> > >
> > >
> > > wrote:
> > > > Hi,
> > > >
> > > > I've been struggling with a SAS card recently that has had poor
> > > > driver support for a long time, and tonight its decided to kick
> > > > every drive in the array one after the other. Now mdstat shows:
> > > >
> > > > md1 : active raid5 sdf[0](F) sdh[7](F) sdi[6](F) sdj[5](F) sde[3](F)
> > > > sdd[2](F) sdg[1](F)
> > > >
> > > > 5860574208 blocks super 1.1 level 5, 512k chunk, algorithm 2
> > > > [7/0]
> > > >
> > > > [_______]
> > > >
> > > > bitmap: 3/8 pages [12KB], 65536KB chunk
> > > >
> > > > Does the fact that I'm using a bitmap save my rear here? Or am I
> > > > hosed? If I'm not hosed, is there a way I can recover the array
> > > > without rebooting? maybe just a --stop and a --assemble ? If that
> > > > won't work, will a reboot be ok?
> > > >
> > > > I'd really prefer not to have lost all of my data. Please tell me
> > > > (please) that it is possible to recover the array. All but sdi are
> > > > still visible in /dev (I may be able to get it back via hotplug
> > > > maybe, but it'd get sdk or something).
> > >
> > > mdadm --stop /dev/md1
> > >
> > > mdadm --examine /dev/sd[fhijedg]
> > > mdadm --assemble --verbose /dev/md1 /dev/sd[fhijedg]
> > >
> > > Report all output.
> > >
> > > NeilBrown
> >
> > Hi, thanks for the help. Seems the SAS card/driver is in a funky state at
> > the moment. the --stop worked*. but --examine just gives "no md
> > superblock detected", and dmesg reports io errors for all drives.
>
> > I've just reloaded the driver, and things seem to have come back:
> That's good!!
>
> > root@boris:~# mdadm --examine /dev/sd[fhijedg]
>
> ....
>
> sd1 has a slightly older event count than the others - Update time is 1:13
> older. So it presumably died first.
>
> > root@boris:~# mdadm --assemble --verbose /dev/md1 /dev/sd[fhijedg]
> > mdadm: looking for devices for /dev/md1
> > mdadm: /dev/sdd is identified as a member of /dev/md1, slot 2.
> > mdadm: /dev/sde is identified as a member of /dev/md1, slot 3.
> > mdadm: /dev/sdf is identified as a member of /dev/md1, slot 0.
> > mdadm: /dev/sdg is identified as a member of /dev/md1, slot 1.
> > mdadm: /dev/sdh is identified as a member of /dev/md1, slot 6.
> > mdadm: /dev/sdi is identified as a member of /dev/md1, slot 5.
> > mdadm: /dev/sdj is identified as a member of /dev/md1, slot 4.
> > mdadm: added /dev/sdg to /dev/md1 as 1
> > mdadm: added /dev/sdd to /dev/md1 as 2
> > mdadm: added /dev/sde to /dev/md1 as 3
> > mdadm: added /dev/sdj to /dev/md1 as 4
> > mdadm: added /dev/sdi to /dev/md1 as 5
> > mdadm: added /dev/sdh to /dev/md1 as 6
> > mdadm: added /dev/sdf to /dev/md1 as 0
> > mdadm: /dev/md1 has been started with 6 drives (out of 7).
> >
> >
> > Now I guess the question is, how to get that last drive back in? would:
> >
> > mdadm --re-add /dev/md1 /dev/sdi
> >
> > work?
>
> re-add should work, yes. It will use the bitmap info to only update the
> blocks that need updating - presumably not many.
> It might be interesting to run
> mdadm -X /dev/sdf
>
> first to see what the bitmap looks like - how many dirty bits and what the
> event counts are.

root@boris:~# mdadm -X /dev/sdf
Filename : /dev/sdf
Magic : 6d746962
Version : 4
UUID : 7d0e9847:ec3a4a46:32b60a80:06d0ee1c
Events : 1241766
Events Cleared : 1241740
State : OK
Chunksize : 64 MB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 976762368 (931.51 GiB 1000.20 GB)
Bitmap : 14905 bits (chunks), 18 dirty (0.1%)


> But yes: --re-add should make it all happy.

Very nice. I was quite upset there for a bit. Had to take a walk ;D

> NeilBrown


--
Thomas Fjellstrom
tfjellstrom@shaw.ca

Re: potentially lost largeish raid5 array..

on 23.09.2011 09:06:35 by David Brown

On 23/09/2011 07:10, Thomas Fjellstrom wrote:
> On September 22, 2011, Roman Mamedov wrote:
>> On Thu, 22 Sep 2011 22:49:12 -0600
>>
>> Thomas Fjellstrom wrote:
>>> Now I guess the question is, how to get that last drive back in? would:
>>>
>>> mdadm --re-add /dev/md1 /dev/sdi
>>>
>>> work?
>>
>> It should, or at least it will not harm anything, but keep in mind that
>> simply trying to continue using the array (raid5 with a largeish member
>> count) on a flaky controller card is akin to playing with fire.
>
> Yeah, I think I won't be using the 3.0 kernel after tonight. At least the
> older kernel's would just lock up the card and not cause md to boot the disks
> one at a time.
>
> I /really really/ wish the driver for this card was more stable, but you deal
> with what you've got (in my case a $100 2 port SAS/8 port SATA card). I've
> been rather lucky so far it seems, I hope my luck keeps up long enough for
> either the driver to stabilize, me to get a new card, or at the very least, to
> get a third drive for my backup array, so if the main array does go down, I
> have a recent daily sync.
>

My own (limited) experience with SAS is that you /don't/ get what you
pay for. I had a SAS drive on a server (actually a firewall) as the
server salesman had persuaded me that it was more reliable than SATA,
and therefore a good choice for a critical machine. The SAS controller
card died recently. I replaced it with two SATA drives connected
directly to the motherboard, with md raid - much more reliable and much
cheaper (and faster too).


Re: potentially lost largeish raid5 array..

on 23.09.2011 09:37:04 by Thomas Fjellstrom

On September 23, 2011, David Brown wrote:
> On 23/09/2011 07:10, Thomas Fjellstrom wrote:
> > On September 22, 2011, Roman Mamedov wrote:
> >> On Thu, 22 Sep 2011 22:49:12 -0600
> >>
> >> Thomas Fjellstrom wrote:
> >>> Now I guess the question is, how to get that last drive back in? would:
> >>>
> >>> mdadm --re-add /dev/md1 /dev/sdi
> >>>
> >>> work?
> >>
> >> It should, or at least it will not harm anything, but keep in mind that
> >> simply trying to continue using the array (raid5 with a largeish member
> >> count) on a flaky controller card is akin to playing with fire.
> >
> > Yeah, I think I won't be using the 3.0 kernel after tonight. At least the
> > older kernel's would just lock up the card and not cause md to boot the
> > disks one at a time.
> >
> > I /really really/ wish the driver for this card was more stable, but you
> > deal with what you've got (in my case a $100 2 port SAS/8 port SATA
> > card). I've been rather lucky so far it seems, I hope my luck keeps up
> > long enough for either the driver to stabilize, me to get a new card, or
> > at the very least, to get a third drive for my backup array, so if the
> > main array does go down, I have a recent daily sync.
>
> My own (limited) experience with SAS is that you /don't/ get what you
> pay for. I had a SAS drive on a server (actually a firewall) as the
> server salesman had persuaded me that it was more reliable than SATA,
> and therefore a good choice for a critical machine. The SAS controller
> card died recently. I replaced it with two SATA drives connected
> directly to the motherboard, with md raid - much more reliable and much
> cheaper (and faster too).

Well the driver for this card is known to be rather dodgy, especially with
SATA disks. At one point it was panicking on SATA hotplug, would randomly kick
one or more drives, the entire card would randomly lock up, and there were
random long'ish pauses during access. It's a heck of a lot better now than it
was 2 years ago. Except that those problems never caused the array to fall
apart like it did today. I guess since the card /didn't/ lock up, md was able
to notice that the drives were gone, and subsequently failed the disks.

I am worried about sdi though. The bay light on it is flickering a bit, and I
think it's the only one that's been kicked out lately (other than tonight).
Maybe it is causing the card to behave worse than it would if nothing else was
bad. Usually though, the card would lock up after the first boot, so a reboot
was needed to get the card back in shape, then the array would resync (if
needed), and the bitmap would make the resync only take a few minutes (20m the
last time I think).



--
Thomas Fjellstrom
thomas@fjellstrom.ca

Re: potentially lost largeish raid5 array..

on 23.09.2011 10:09:36 by Thomas Fjellstrom

On September 22, 2011, Thomas Fjellstrom wrote:
> On September 22, 2011, NeilBrown wrote:
> > On Thu, 22 Sep 2011 22:49:12 -0600 Thomas Fjellstrom
> >
> >
> > wrote:
> > > On September 22, 2011, NeilBrown wrote:
> > > > On Thu, 22 Sep 2011 19:50:36 -0600 Thomas Fjellstrom
> > > >
> > > >
> > > > wrote:
> > > > > Hi,
> > > > >
> > > > > I've been struggling with a SAS card recently that has had poor
> > > > > driver support for a long time, and tonight its decided to kick
> > > > > every drive in the array one after the other. Now mdstat shows:
> > > > >
> > > > > md1 : active raid5 sdf[0](F) sdh[7](F) sdi[6](F) sdj[5](F)
> > > > > sde[3](F) sdd[2](F) sdg[1](F)
> > > > >
> > > > > 5860574208 blocks super 1.1 level 5, 512k chunk, algorithm 2
> > > > > [7/0]
> > > > >
> > > > > [_______]
> > > > >
> > > > > bitmap: 3/8 pages [12KB], 65536KB chunk
> > > > >
> > > > > Does the fact that I'm using a bitmap save my rear here? Or am I
> > > > > hosed? If I'm not hosed, is there a way I can recover the array
> > > > > without rebooting? maybe just a --stop and a --assemble ? If that
> > > > > won't work, will a reboot be ok?
> > > > >
> > > > > I'd really prefer not to have lost all of my data. Please tell me
> > > > > (please) that it is possible to recover the array. All but sdi are
> > > > > still visible in /dev (I may be able to get it back via hotplug
> > > > > maybe, but it'd get sdk or something).
> > > >
> > > > mdadm --stop /dev/md1
> > > >
> > > > mdadm --examine /dev/sd[fhijedg]
> > > > mdadm --assemble --verbose /dev/md1 /dev/sd[fhijedg]
> > > >
> > > > Report all output.
> > > >
> > > > NeilBrown
> > >
> > > Hi, thanks for the help. Seems the SAS card/driver is in a funky state
> > > at the moment. the --stop worked*. but --examine just gives "no md
> > > superblock detected", and dmesg reports io errors for all drives.
> >
> > > I've just reloaded the driver, and things seem to have come back:
> > That's good!!
> >
> > > root@boris:~# mdadm --examine /dev/sd[fhijedg]
> >
> > ....
> >
> > sd1 has a slightly older event count than the others - Update time is
> > 1:13 older. So it presumably died first.
> >
> > > root@boris:~# mdadm --assemble --verbose /dev/md1 /dev/sd[fhijedg]
> > > mdadm: looking for devices for /dev/md1
> > > mdadm: /dev/sdd is identified as a member of /dev/md1, slot 2.
> > > mdadm: /dev/sde is identified as a member of /dev/md1, slot 3.
> > > mdadm: /dev/sdf is identified as a member of /dev/md1, slot 0.
> > > mdadm: /dev/sdg is identified as a member of /dev/md1, slot 1.
> > > mdadm: /dev/sdh is identified as a member of /dev/md1, slot 6.
> > > mdadm: /dev/sdi is identified as a member of /dev/md1, slot 5.
> > > mdadm: /dev/sdj is identified as a member of /dev/md1, slot 4.
> > > mdadm: added /dev/sdg to /dev/md1 as 1
> > > mdadm: added /dev/sdd to /dev/md1 as 2
> > > mdadm: added /dev/sde to /dev/md1 as 3
> > > mdadm: added /dev/sdj to /dev/md1 as 4
> > > mdadm: added /dev/sdi to /dev/md1 as 5
> > > mdadm: added /dev/sdh to /dev/md1 as 6
> > > mdadm: added /dev/sdf to /dev/md1 as 0
> > > mdadm: /dev/md1 has been started with 6 drives (out of 7).
> > >
> > >
> > > Now I guess the question is, how to get that last drive back in? would:
> > >
> > > mdadm --re-add /dev/md1 /dev/sdi
> > >
> > > work?
> >
> > re-add should work, yes. It will use the bitmap info to only update the
> > blocks that need updating - presumably not many.
> > It might be interesting to run
> >
> > mdadm -X /dev/sdf
> >
> > first to see what the bitmap looks like - how many dirty bits and what
> > the event counts are.
>
> root@boris:~# mdadm -X /dev/sdf
> Filename : /dev/sdf
> Magic : 6d746962
> Version : 4
> UUID : 7d0e9847:ec3a4a46:32b60a80:06d0ee1c
> Events : 1241766
> Events Cleared : 1241740
> State : OK
> Chunksize : 64 MB
> Daemon : 5s flush period
> Write Mode : Normal
> Sync Size : 976762368 (931.51 GiB 1000.20 GB)
> Bitmap : 14905 bits (chunks), 18 dirty (0.1%)
>
> > But yes: --re-add should make it all happy.
>
> Very nice. I was quite upset there for a bit. Had to take a walk ;D

I forgot to say, but: Thank you very much :) for the help, and your tireless
work on md.

> > NeilBrown


--
Thomas Fjellstrom
thomas@fjellstrom.ca

Re: potentially lost largeish raid5 array..

on 23.09.2011 11:15:39 by NeilBrown

On Fri, 23 Sep 2011 02:09:36 -0600 Thomas Fjellstrom wrote:
>
> I forgot to say, but: Thank you very much :) for the help, and your tireless
> work on md.
>


You're very welcome .... but I felt I needed to respond to that word
"tireless".
The truth is that I am getting rather tired of md .... if anyone knows anyone
who wants to get into kernel development and is wondering where to start -
please consider whispering 'the md driver' in their ear. Plenty to do, great
mentoring possibilities, and competent linux kernel engineers with good
experience are unlikely to have much trouble finding a job ;-)

NeilBrown



Re: potentially lost largeish raid5 array..

on 23.09.2011 14:56:00 by Stan Hoeppner

On 9/23/2011 12:10 AM, Thomas Fjellstrom wrote:

> I /really really/ wish the driver for this card was more stable, but you deal
> with what you've got (in my case a $100 2 port SAS/8 port SATA card).

Please don't shield the identity of the problem card. Others need to
know of your problems. An educated guess tells me it is one of...

Card: SuperMicro AOC-SASLP-MV8 Marvell 88SE6480
Driver: MVSAS

Card: HighPoint RocketRAID 2680/2680SGL Marvell 88SE6485
Driver: MVSAS

This ASIC/driver combo is so historically horrible with Linux that I'm
surprised all the owners haven't had a big bonfire party and thrown all
the cards in. Or simply Ebay'd them to Windows users, where they seem
to work relatively OK.

Solve your problem with an LSI SAS1068E based Intel 8 port PCIe x4
SAS/SATA HBA (about 50% more $$), which uses the mptsas driver:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816117157

It seems this is the card most users switch to after being burned by the
cheap Marvell based SAS 2xSFF8087 cards. The 1068E cards and the mptsas
driver are far more reliable, stable, and faster. Many OEM cards from
IBM, Dell, etc, use this chip and can be had on Ebay for less than the
new retail Intel card. In your situation I'd probably buy new Intel
just in case. Hope this info/insight helps.

--
Stan

Re: potentially lost largeish raid5 array..

on 23.09.2011 15:28:36 by David Brown

On 23/09/2011 14:56, Stan Hoeppner wrote:
> On 9/23/2011 12:10 AM, Thomas Fjellstrom wrote:
>
>> I /really really/ wish the driver for this card was more stable, but
>> you deal
>> with what you've got (in my case a $100 2 port SAS/8 port SATA card).
>
> Please don't shield the identity of the problem card. Others need to
> know of your problems. An educated guess tells me it is one of...
>
> Card: SuperMicro AOC-SASLP-MV8 Marvell 88SE6480
> Driver: MVSAS
>
> Card: HighPoint RocketRAID 2680/2680SGL Marvell 88SE6485
> Driver: MVSAS
>
> This ASIC/driver combo is so historically horrible with Linux that I'm
> surprised all the owners haven't had a big bon fire party and thrown all
> the cards in. Or simply Ebay'd them to Windows users, where they seem to
> work relatively OK.
>
> Solve your problem with a 50% more $$ LSI SAS1068E based Intel 8 port
> PCIe x4 SAS/SATA HBA, which uses the mptsas driver:
> http://www.newegg.com/Product/Product.aspx?Item=N82E16816117157
>
> It seems this is the card most users switch to after being burned by the
> cheap Marvell based SAS 2xSFF8087 cards. The 1068E cards and the mptsas
> driver are far more reliable, stable, and faster. Many OEM cards from
> IBM, Dell, etc, use this chip and can be had on Ebay for less than the
> new retail Intel card. In your situation I'd probably buy new Intel just
> in case. Hope this info/insight helps.
>

The SAS card I had was an LSI SAS1068 card, with 2 SAS (no SATA), Dell
brand. It worked flawlessly with Linux right up to the day the card
died out of the blue.

With a sample size of 1, I don't have the statistics to justify judging
the card or the controller, but I certainly will be sceptical about
using such a card again.



Re: potentially lost largeish raid5 array..

on 23.09.2011 18:22:59 by Thomas Fjellstrom

On September 23, 2011, Stan Hoeppner wrote:
> On 9/23/2011 12:10 AM, Thomas Fjellstrom wrote:
> > I /really really/ wish the driver for this card was more stable, but you
> > deal with what you've got (in my case a $100 2 port SAS/8 port SATA
> > card).
>
> Please don't shield the identity of the problem card. Others need to
> know of your problems. An educated guess tells me it is one of...
>
> Card: SuperMicro AOC-SASLP-MV8 Marvell 88SE6480
> Driver: MVSAS

*DING**DING**DING**DING**DING*

We have a winner! It's a very nice card when it works.

> Card: HighPoint RocketRAID 2680/2680SGL Marvell 88SE6485
> Driver: MVSAS
>
> This ASIC/driver combo is so historically horrible with Linux that I'm
> surprised all the owners haven't had a big bon fire party and thrown all
> the cards in. Or simply Ebay'd them to Windows users, where they seem
> to work relatively OK.
>
> Solve your problem with a 50% more $$ LSI SAS1068E based Intel 8 port
> PCIe x4 SAS/SATA HBA, which uses the mptsas driver:
> http://www.newegg.com/Product/Product.aspx?Item=N82E16816117157
>
> It seems this is the card most users switch to after being burned by the
> cheap Marvell based SAS 2xSFF8087 cards. The 1068E cards and the mptsas
> driver are far more reliable, stable, and faster. Many OEM cards from
> IBM, Dell, etc, use this chip and can be had on Ebay for less than the
> new retail Intel card. In your situation I'd probably buy new Intel
> just in case. Hope this info/insight helps.

I'd love to switch, but I didn't really have the money for the card then, and
now I have less money. I suppose if I eBayed this card first and then bought a
new one, that would work out, but yeah, it will have to wait a bit (things are
VERY tight right now).

So this Intel card looks like a good option, but how much faster is it? I get
500MB/s read off this SASLP. Probably a bit more now that there are 7 drives in
the array. Off of XFS, it gets at least 200MB/s read (the discrepancy between
raw and over XFS really bugs me; something there can't be right, can it?).

Thank you for the suggestion though, I will have to bookmark that link.

--
Thomas Fjellstrom
tfjellstrom@shaw.ca

Re: potentially lost largeish raid5 array..

on 23.09.2011 18:26:53 by Thomas Fjellstrom

On September 23, 2011, NeilBrown wrote:
> On Fri, 23 Sep 2011 02:09:36 -0600 Thomas Fjellstrom
>
> wrote:
> > I forgot to say, but: Thank you very much :) for the help, and your
> > tireless work on md.
>
> You've very welcome .... but I felt I needed to respond to that word
> "tireless".
> The truth is that I am getting rather tired of md .... if anyone knows
> anyone who wants to get into kernel development and is wondering where to
> start - please consider whispering 'the md driver' in their ear. Plenty
> to do, great mentoring possibilities, and competent linux kernel engineers
> with good experience are unlikely to have much trouble finding a job ;-)
>
> NeilBrown

Very tempting. How much work do you think it would be to add in full raid10
reshape support? ;D (not that I'm volunteering, I don't think I could
comprehend a lot of the raid code, at least not at this point in time
(extenuating circumstances)).

--
Thomas Fjellstrom
thomas@fjellstrom.ca

Re: potentially lost largeish raid5 array..

on 24.09.2011 01:24:28 by Stan Hoeppner

On 9/23/2011 11:22 AM, Thomas Fjellstrom wrote:

> I'd love to switch, but I didn't really have the money for the card then, and
> now I have less money. I suppose if I ebayed this card first, and then bought a
> new one that would work out, but yeah, It will have to wait a bit (things are
> VERY tight right now).

Which is why you purchased the cheapest SAS card on the market at that
time. :)

> So this Intel card, looks like a good option, but how much faster is it? I get
> 500MB/s read off this SASLP. Probably a bit more now that there's 7 drives in
> the array. Off of XFS, it gets at least 200MB/s read (the discrepancy between
> raw and over xfs really bugs me, something there can't be right can it?).

When properly configured, XFS will achieve near spindle throughput.
Recent versions of mkfs.xfs read the mdraid configuration and configure
the filesystem automatically for stripe unit/stripe width (su/sw), number
of allocation groups, etc. Thus you should get max performance out of the gate.
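
As an illustration of what those values correspond to, the geometry can also
be given explicitly; for the array discussed in this thread (7-drive RAID5
with a 512 KiB chunk, i.e. 6 data spindles) it would look roughly like this
(a sketch only - mkfs.xfs is destructive):

# stripe unit = md chunk size, stripe width = number of data disks
mkfs.xfs -d su=512k,sw=6 /dev/md1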

If you really would like to fix this, you'll need to post on the XFS
list. Much more data will be required than simply stating "it's slower
by x than 'raw' read". This will include your mdadm config, testing
methodology, and xfs_info output at minimum. There is no simple "check
this box" mega solution with XFS.

> Thank you for the suggestion though, I will have to book mark that link.

You're welcome.

You can't find a better value for an 8 port SAS or SATA solution that
actually works well with Linux. Not to my knowledge anyway. You could
buy two PCIe x1 4 port Marvell based SATA only cards for $20-30 less
maybe, but would be limited to 500MB/s raw unidirectional PCIe b/w vs
2GB/s with an x4 card, have fewer features, eat two slots, etc. That
would be more reliable than what you have now though. The Marvell SATA
driver in Linux is much more solid than the SAS driver, from what I've
read anyway. I've never used/owned any Marvell based cards. If I go
cheap I go Silicon Image. It's too bad they don't have a 4 port PCIe
ASIC in their line up. The only 4 port chip they have is PCI based.
Addonics sells a Silicon Image expander, but the total cost for a 2 port
card and two expanders is quite a bit higher than the better Intel
single card solution.

--
Stan

Re: potentially lost largeish raid5 array..

on 24.09.2011 02:11:07 by Thomas Fjellstrom

On September 23, 2011, Stan Hoeppner wrote:
> On 9/23/2011 11:22 AM, Thomas Fjellstrom wrote:
> > I'd love to switch, but I didn't really have the money for the card then,
> > and now I have less money. I suppose if I ebayed this card first, and
> > then bought a new one that would work out, but yeah, It will have to
> > wait a bit (things are VERY tight right now).
>
> Which is why you purchased the cheapest SAS card on the market at that
> time. :)
>
> > So this Intel card, looks like a good option, but how much faster is it?
> > I get 500MB/s read off this SASLP. Probably a bit more now that there's
> > 7 drives in the array. Off of XFS, it gets at least 200MB/s read (the
> > discrepancy between raw and over xfs really bugs me, something there
> > can't be right can it?).
>
> When properly configured XFS will achieve near spindle throughput.
> Recent versions of mkfs.xfs read the mdraid configuration and configure
> the filesystem automatically for sw, swidth, number of allocation
> groups, etc. Thus you should get max performance out of the gate.

What happens when you add a drive and reshape? Is it enough just to tweak the
mount options?

> If you really would like to fix this, you'll need to post on the XFS
> list. Much more data will be required than simply stating "it's slower
> by x than 'raw' read". This will include your mdadm config, testing
> methodology, and xfs_info output at minimum. There is no simple "check
> this box" mega solution with XFS.

I tweaked a crapload of settings before settling on what I have. It's
reasonable, a balance between raw throughput and directory access/modification
performance. Read performance atm isn't as bad as I remember, about 423MB/s
according to bonnie++. Write performance is 153MB/s, which seems a tad low to
me, but still not horrible. Faster than I generally need at any given time.

> > Thank you for the suggestion though, I will have to book mark that link.
>
> You're welcome.
>
> You can't find a better value for an 8 port SAS or SATA solution that
> actually works well with Linux. Not to my knowledge anyway. You could
> buy two PCIe x1 4 port Marvell based SATA only cards for $20-30 less
> maybe, but would be limited to 500MB/s raw unidirectional PCIe b/w vs
> 2GB/s with an x4 card, have less features, eat two slots, etc. That
> would be more reliable than what you have now though. The Marvell SATA
> driver in Linux is much more solid that the SAS driver, from what I've
> read anyway. I've never used/owned any Marvell based cards. If I go
> cheap I go Silicon Image. It's too bad they don't have a 4 port PCIe
> ASIC in their line up. The only 4 port chip they have is PCI based.
> Addonics sells a Silicon Image expander, but the total cost for a 2 port
> card and two expanders is quite a bit higher than the better Intel
> single card solution.

I appreciate the tips. That Intel/LSI card seems like the best bet.

--
Thomas Fjellstrom
tfjellstrom@shaw.ca

Re: potentially lost largeish raid5 array..

on 24.09.2011 07:59:59 by Mikael Abrahamsson

On Fri, 23 Sep 2011, Thomas Fjellstrom wrote:

>> Card: SuperMicro AOC-SASLP-MV8 Marvell 88SE6480
>> Driver: MVSAS
>
> *DING**DING**DING**DING**DING*
>
> We have a winner! It's a very nice card when it works.

I have one of these, it's in the drawer. As late as 2.6.38 it worked
marginally, before that it wouldn't even survive a raid resync. Sell it
and buy something else. As far as I can tell there is nothing wrong with
the hardware in this card, it just so happens the driver support is...
err... lacking.

--
Mikael Abrahamsson email: swmike@swm.pp.se

Re: potentially lost largeish raid5 array..

on 24.09.2011 14:17:42 by Stan Hoeppner

On 9/23/2011 7:11 PM, Thomas Fjellstrom wrote:
> On September 23, 2011, Stan Hoeppner wrote:
>> On 9/23/2011 11:22 AM, Thomas Fjellstrom wrote:
>>> I'd love to switch, but I didn't really have the money for the card then,
>>> and now I have less money. I suppose if I ebayed this card first, and
>>> then bought a new one that would work out, but yeah, It will have to
>>> wait a bit (things are VERY tight right now).
>>
>> Which is why you purchased the cheapest SAS card on the market at that
>> time. :)
>>
>>> So this Intel card, looks like a good option, but how much faster is it?
>>> I get 500MB/s read off this SASLP. Probably a bit more now that there's
>>> 7 drives in the array. Off of XFS, it gets at least 200MB/s read (the
>>> discrepancy between raw and over xfs really bugs me, something there
>>> can't be right can it?).
>>
>> When properly configured XFS will achieve near spindle throughput.
>> Recent versions of mkfs.xfs read the mdraid configuration and configure
>> the filesystem automatically for sw, swidth, number of allocation
>> groups, etc. Thus you should get max performance out of the gate.
>
> What happens when you add a drive and reshape? Is it enough just to tweak the
> mount options?

When you change the number of effective spindles with a reshape, and
thus the stripe width and stripe size, you definitely should add the
appropriate XFS mount options and values to reflect this. Performance
will be less than optimal if you don't.
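
As a rough sketch of what that means in practice (values assume the array in
this thread grown from 7 to 8 members, i.e. a 512 KiB chunk and 7 data disks;
sunit/swidth are given in 512-byte sectors, and /mnt/array is a hypothetical
mount point):

# 512 KiB chunk = 1024 sectors; swidth = sunit * data disks = 7168
umount /mnt/array
mount -o sunit=1024,swidth=7168 /dev/md1 /mnt/array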

If you use a linear concat under XFS you never have to worry about the
above situation. It has many other advantages over a striped array and
better performance for many workloads, especially multi user general
file serving and maildir storage--workloads with lots of concurrent IO.
If you 'need' maximum single stream performance for large files, a
striped array is obviously better. Most applications however don't need
large single stream performance.

>> If you really would like to fix this, you'll need to post on the XFS
>> list. Much more data will be required than simply stating "it's slower
>> by x than 'raw' read". This will include your mdadm config, testing
>> methodology, and xfs_info output at minimum. There is no simple "check
>> this box" mega solution with XFS.
>
> I tweaked a crap load of settings before settling on what I have. Its
> reasonable, a balance between raw throughput and directory access/modification
> performance. Read performance atm isn't as bad as I remember, about 423MB/s
> according to bonnie++. Write performance is 153MB/s which seems a tad low to
> me, but still not horrible. Faster than I generally need at any given time.

That low write performance is probably due to barriers to some degree.
Disabling barriers could yield a sizable increase in write performance
for some workloads, especially portions of synthetic benchies. Using an
external log journal device could help as well. Keep in mind we're
talking about numbers generated by synthetic benchmarks. Making such
changes may not help your actual application workload much, if at all.

Given your HBA and the notoriously flaky kernel driver for it, you'd be
asking for severe pain if you disabled barriers. If you had a rock
stable system and a good working UPS you could probably run ok with
barriers disabled, but it's always risky without a BBWC RAID card. If
you want to increase benchy write performance I'd first try an external
log device since SATA disks are cheap. You'll want to mirror two disks
for the log, of course. A couple of 2.5" 160GB 7.2k drives would fit
the bill and will run about $100 USD total.
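
A rough sketch of that setup (device names are hypothetical: /dev/md2 stands
in for a small RAID1 pair used as the external log, /dev/md1 for the data
array; mkfs is destructive, and nobarrier is only safe with trusted power and
write caches):

# the external log has to be chosen at mkfs time
mkfs.xfs -l logdev=/dev/md2,size=128m /dev/md1

# mount against the external log; drop barriers only if you accept the risk
mount -o logdev=/dev/md2,nobarrier /dev/md1 /mnt/array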

>>> Thank you for the suggestion though, I will have to book mark that link.
>>
>> You're welcome.
>>
>> You can't find a better value for an 8 port SAS or SATA solution that
>> actually works well with Linux. Not to my knowledge anyway. You could
>> buy two PCIe x1 4 port Marvell based SATA only cards for $20-30 less
>> maybe, but would be limited to 500MB/s raw unidirectional PCIe b/w vs
>> 2GB/s with an x4 card, have less features, eat two slots, etc. That
>> would be more reliable than what you have now though. The Marvell SATA
>> driver in Linux is much more solid that the SAS driver, from what I've
>> read anyway. I've never used/owned any Marvell based cards. If I go
>> cheap I go Silicon Image. It's too bad they don't have a 4 port PCIe
>> ASIC in their line up. The only 4 port chip they have is PCI based.
>> Addonics sells a Silicon Image expander, but the total cost for a 2 port
>> card and two expanders is quite a bit higher than the better Intel
>> single card solution.
>
> I appreciate the tips. That intel/LSI card seems like the best bet.

It's hard to beat for 8 ports at that price point. And it's an Intel
card with LSI ASIC, not some cheapo Rosewill or Syba card with a
Marvell, SI, JMicron, etc.

--
Stan


Re: potentially lost largeish raid5 array..

on 24.09.2011 17:16:40 by David Brown

On 24/09/2011 14:17, Stan Hoeppner wrote:
> On 9/23/2011 7:11 PM, Thomas Fjellstrom wrote:
>> On September 23, 2011, Stan Hoeppner wrote:
>
>>> When properly configured XFS will achieve near spindle throughput.
>>> Recent versions of mkfs.xfs read the mdraid configuration and configure
>>> the filesystem automatically for sw, swidth, number of allocation
>>> groups, etc. Thus you should get max performance out of the gate.
>>
>> What happens when you add a drive and reshape? Is it enough just to
>> tweak the
>> mount options?
>
> When you change the number of effective spindles with a reshape, and
> thus the stripe width and stripe size, you definitely should add the
> appropriate XFS mount options and values to reflect this. Performance
> will be less than optimal if you don't.
>
> If you use a linear concat under XFS you never have to worry about the
> above situation. It has many other advantages over a striped array and
> better performance for many workloads, especially multi user general
> file serving and maildir storage--workloads with lots of concurrent IO.
> If you 'need' maximum single stream performance for large files, a
> striped array is obviously better. Most applications however don't need
> large single stream performance.
>

If you use a linear concatenation of drives for XFS, is it not correct
that you want one allocation group per drive (or per raid set, if you
are concatenating a bunch of raid sets)? If you then add another drive
or raid set, can you grow XFS with another allocation group?

mvh.,

David



Re: potentially lost largeish raid5 array..

on 24.09.2011 18:38:52 by Stan Hoeppner

On 9/24/2011 10:16 AM, David Brown wrote:
> On 24/09/2011 14:17, Stan Hoeppner wrote:
>> On 9/23/2011 7:11 PM, Thomas Fjellstrom wrote:
>>> On September 23, 2011, Stan Hoeppner wrote:
>>
>>>> When properly configured XFS will achieve near spindle throughput.
>>>> Recent versions of mkfs.xfs read the mdraid configuration and configure
>>>> the filesystem automatically for sw, swidth, number of allocation
>>>> groups, etc. Thus you should get max performance out of the gate.
>>>
>>> What happens when you add a drive and reshape? Is it enough just to
>>> tweak the
>>> mount options?
>>
>> When you change the number of effective spindles with a reshape, and
>> thus the stripe width and stripe size, you definitely should add the
>> appropriate XFS mount options and values to reflect this. Performance
>> will be less than optimal if you don't.
>>
>> If you use a linear concat under XFS you never have to worry about the
>> above situation. It has many other advantages over a striped array and
>> better performance for many workloads, especially multi user general
>> file serving and maildir storage--workloads with lots of concurrent IO.
>> If you 'need' maximum single stream performance for large files, a
>> striped array is obviously better. Most applications however don't need
>> large single stream performance.
>>
>
> If you use a linear concatenation of drives for XFS, is it not correct
> that you want one allocation group per drive (or per raid set, if you
> are concatenating a bunch of raid sets)?

Yes. Normally with a linear concat you would make X number of RAID1
mirrors via mdraid or hardware RAID, then concat them with mdadm
--linear or LVM. Then mkfs.xfs -d agcount=X ...
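
A minimal sketch of that layout with two mirror pairs (all device names
hypothetical):

# two RAID1 pairs, then a linear concat over them
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
mdadm --create /dev/md4 --level=linear --raid-devices=2 /dev/md2 /dev/md3

# one allocation group per effective spindle (two mirror pairs here)
mkfs.xfs -d agcount=2 /dev/md4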

Currently XFS has a 1TB limit for allocation groups. If you use 2TB
drives you'll get 2 AGs per effective spindle instead of one. With some
'borderline' workloads this may hinder performance. It depends on how
many top level directories you have in the filesystem and your
concurrency to them.

> If you then add another drive
> or raid set, can you grow XFS with another allocation group?

XFS creates more allocation groups automatically as part of the grow
operation. If you have a linear concat setup you'll obviously want to
control this manually to maintain the same number of AGs per effective
spindle.
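
Continuing the hypothetical concat above, adding a third mirror pair and
growing the filesystem would look roughly like this (new AGs are carved out
of the added space automatically; /mnt/concat is an assumed mount point):

mdadm --create /dev/md5 --level=1 --raid-devices=2 /dev/sde1 /dev/sdf1
mdadm --grow /dev/md4 --add /dev/md5

# grow the mounted filesystem and check the resulting AG count
xfs_growfs /mnt/concat
xfs_info /mnt/concat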

Always remember that the key to linear concat performance with XFS is
directory level parallelism. If you have lots of top level directories
in your filesystem and high concurrent access (home dirs, maildir, etc)
it will typically work better than a striped array. If you have few
directories and low concurrency, are streaming large files, etc, stick
with a striped array.

Also note that a linear concat will only give increased performance with
XFS, again for appropriate workloads. Using a linear concat with EXT3/4
will give you the performance of a single spindle regardless of the
total number of disks used. So one should stick with striped arrays for
EXT3/4.

--
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: potentially lost largeish raid5 array..

am 24.09.2011 19:48:49 von Thomas Fjellstrom

On September 24, 2011, Stan Hoeppner wrote:
> On 9/23/2011 7:11 PM, Thomas Fjellstrom wrote:
> > On September 23, 2011, Stan Hoeppner wrote:
> >> On 9/23/2011 11:22 AM, Thomas Fjellstrom wrote:
> >>> I'd love to switch, but I didn't really have the money for the card
> >>> then, and now I have less money. I suppose if I ebayed this card
> >>> first, and then bought a new one that would work out, but yeah, It
> >>> will have to wait a bit (things are VERY tight right now).
> >>
> >> Which is why you purchased the cheapest SAS card on the market at that
> >> time. :)
> >>
> >>> So this Intel card, looks like a good option, but how much faster is
> >>> it? I get 500MB/s read off this SASLP. Probably a bit more now that
> >>> there's 7 drives in the array. Off of XFS, it gets at least 200MB/s
> >>> read (the discrepancy between raw and over xfs really bugs me,
> >>> something there can't be right can it?).
> >>
> >> When properly configured XFS will achieve near spindle throughput.
> >> Recent versions of mkfs.xfs read the mdraid configuration and configure
> >> the filesystem automatically for sw, swidth, number of allocation
> >> groups, etc. Thus you should get max performance out of the gate.
> >
> > What happens when you add a drive and reshape? Is it enough just to tweak
> > the mount options?
>
> When you change the number of effective spindles with a reshape, and
> thus the stripe width and stripe size, you definitely should add the
> appropriate XFS mount options and values to reflect this. Performance
> will be less than optimal if you don't.
>
> If you use a linear concat under XFS you never have to worry about the
> above situation. It has many other advantages over a striped array and
> better performance for many workloads, especially multi user general
> file serving and maildir storage--workloads with lots of concurrent IO.
> If you 'need' maximum single stream performance for large files, a
> striped array is obviously better. Most applications however don't need
> large single stream performance.
>
> >> If you really would like to fix this, you'll need to post on the XFS
> >> list. Much more data will be required than simply stating "it's slower
> >> by x than 'raw' read". This will include your mdadm config, testing
> >> methodology, and xfs_info output at minimum. There is no simple "check
> >> this box" mega solution with XFS.
> >
> > I tweaked a crap load of settings before settling on what I have. It's
> > reasonable, a balance between raw throughput and directory
> > access/modification performance. Read performance atm isn't as bad as I
> > remember, about 423MB/s according to bonnie++. Write performance is
> > 153MB/s which seems a tad low to me, but still not horrible. Faster than
> > I generally need at any given time.
>
> That low write performance is probably due to barriers to some degree.
> Disabling barriers could yield a sizable increase in write performance
> for some workloads, especially portions of synthetic benchies. Using an
> external log journal device could help as well. Keep in mind we're
> talking about numbers generated by synthetic benchmarks. Making such
> changes may not help your actual application workload much, if at all.

Yeah, I'm not too terribly concerned about synthetic benchmarks. Performance
for what I use it for is pretty good.

This array sees a few uses, like p2p, media streaming, rsnapshot backups, and
an apt mirror. It also used to store the disk images for several VMs as well.
I've since moved them off to two separate 500G drives in RAID1.

What I was concerned about is that I can have my backups going, apt-mirror
doing its thing, and streaming one or more media files (say a 1080p video or
two) all at the same time and get /no/ stutter in the movie playback.

> Given your HBA and the notoriously flaky kernel driver for it, you'd be
> asking for severe pain if you disabled barriers. If you had a rock
> stable system and a good working UPS you could probably run ok with
> barriers disabled, but it's always risky without a BBWC RAID card. If
> you want to increase benchy write performance I'd first try an external
> log device since SATA disks are cheap. You'll want to mirror two disks
> for the log, of course. A couple of 2.5" 160GB 7.2k drives would fit
> the bill and will run about $100 USD total.

If bcache ever gets support for adding a cache to an existing device, I've
been thinking about using a 30G Virtex SSD as a cache for the big array, and
could set some space aside on it for the xfs log.
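
For reference, both the external journal and the post-reshape stripe
alignment discussed above are plain XFS options; a rough sketch with made-up
device names and geometry (512k chunk, 6 data spindles):

  # external log chosen at mkfs time, and named again at mount time
  mkfs.xfs -l logdev=/dev/md21,size=128m /dev/md1
  mount -o logdev=/dev/md21 /dev/md1 /srv/data

  # after a reshape, pass the new geometry explicitly (units of 512-byte sectors)
  mount -o logdev=/dev/md21,sunit=1024,swidth=6144 /dev/md1 /srv/data

Disabling barriers (the 'nobarrier' mount option) is the risky part warned
about above.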

> >>> Thank you for the suggestion though, I will have to book mark that
> >>> link.
> >>
> >> You're welcome.
> >>
> >> You can't find a better value for an 8 port SAS or SATA solution that
> >> actually works well with Linux. Not to my knowledge anyway. You could
> >> buy two PCIe x1 4 port Marvell based SATA only cards for $20-30 less
> >> maybe, but would be limited to 500MB/s raw unidirectional PCIe b/w vs
> >> 2GB/s with an x4 card, have less features, eat two slots, etc. That
> >> would be more reliable than what you have now though. The Marvell SATA
>> driver in Linux is much more solid than the SAS driver, from what I've
> >> read anyway. I've never used/owned any Marvell based cards. If I go
> >> cheap I go Silicon Image. It's too bad they don't have a 4 port PCIe
> >> ASIC in their line up. The only 4 port chip they have is PCI based.
> >> Addonics sells a Silicon Image expander, but the total cost for a 2 port
> >> card and two expanders is quite a bit higher than the better Intel
> >> single card solution.
> >
> > I appreciate the tips. That intel/LSI card seems like the best bet.
>
> It's hard to beat for 8 ports at that price point. And it's an Intel
> card with LSI ASIC, not some cheapo Rosewill or Syba card with a
> Marvell, SI, JMicron, etc.

Oh man don't get me started on JMicron...

--
Thomas Fjellstrom
tfjellstrom@shaw.ca
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: potentially lost largeish raid5 array..

am 24.09.2011 19:53:02 von Thomas Fjellstrom

On September 23, 2011, Mikael Abrahamsson wrote:
> On Fri, 23 Sep 2011, Thomas Fjellstrom wrote:
> >> Card: SuperMicro AOC-SASLP-MV8 Marvell 88SE6480
> >> Driver: MVSAS
> >
> > *DING**DING**DING**DING**DING*
> >
> > We have a winner! It's a very nice card when it works.
>
> I have one of these, it's in the drawer. As late as 2.6.38 it worked
> marginally, before that it wouldn't even survive a raid resync. Sell it
> and buy something else. As far as I can tell there is nothing wrong with
> the hardware in this card, it just so happens the driver support is...
> err... lacking.

For the most part, it works. I've only had a couple scary moments with it. The
funny thing is, it seems as the driver gets closer to working properly, it
gets more dangerous. Before it'd just lock up the entire card, and thus, the
file system was mostly clean, just needed to replay the log. Now instead of
locking up the entire card, the ports return IO errors, which is kinda worse.

2.6.38-2.6.39 were pretty decent. Before that I was using a hacked up version
of a 1-2 year old set of patches from someone I assume works/worked at
Marvell. The only problem with that version of the driver was potential OOPSs
on hot swap, and very odd pauses in IO.

--
Thomas Fjellstrom
tfjellstrom@shaw.ca
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: potentially lost largeish raid5 array..

am 24.09.2011 23:57:58 von Aapo Laine

On 09/23/11 11:15, NeilBrown wrote:
> On Fri, 23 Sep 2011 02:09:36 -0600 Thomas Fjellstrom
> wrote:
>> I forgot to say, but: Thank you very much :) for the help, and your tireless
>> work on md.
>>
>
> You're very welcome .... but I felt I needed to respond to that word
> "tireless".
> The truth is that I am getting rather tired of md .... if anyone knows anyone
> who wants to get into kernel development and is wondering where to start -
> please consider whispering 'the md driver' in their ear. Plenty to do, great
> mentoring possibilities, and competent linux kernel engineers with good
> experience are unlikely to have much trouble finding a job ;-)
>
> NeilBrown

Whoa this is shocking news!

Firstly then, thank you so much for your excellent work up to now. Linux
has what I believe to be the best software raid of all operating systems
thanks to you. Excellent in both features, and reliability i.e. quality
of code.

And also the support through the list was great. I found so many
problems solved already just by looking at the archives... so many
people with their arse saved by you.

I think everybody here is sorry to see you willing to go.

Now the bad news... Regarding the MD takeover, there I think I see a
problem...
The MD code is very tersely commented, compared to its complexity!

- there is not much explanation of overall strategies, or the structure
of code. Also the flow of data between the functions is not much
explained most of the times.

- It's not obvious to me what is the entry point for kernel processes
related to MD arrays, how are they triggered and where do they run...
E.g. in the past I tried to understand how did resync work, but I
couldn't. I thought there was a kernel process controlling resync
advancement, but I couldn't really find the boundaries of code inside
which it was executing.

- it's not clear what the various functions do or in what occasion they
are called. Except from their own name, most of them have no comments
just before the definition.

> - the algorithms within the functions are very long and complex, but only
rarely they are explained by comments. I am now seeing pieces having 5
levels of if's nested one inside the other, and there are almost no
comments.

- last but not least, variables have very short names, and for most of
them, it is not explained what they mean. This is mostly for local
variables, but sometimes even for the structs which go into metadata
e.g. in struct r1_private_data_s most members do not have an
explanation. This is pretty serious, to me at least, for understanding
the code.

- ... maybe more I can't think of right now ...

Your code is of excellent quality, as I wrote, I wish there were more
programmers like you, but if you now want to leave, THEN I start to be
worried! Would you please comment it (much) more before leaving? Fully
understanding your code I think is going to take other people a lot of
time otherwise, and you might not find a replacement easily and/or s/he
might do mistakes.

There were times in the past when I had ideas and I wanted to contribute
code, but when I looked inside MD and tried to understand where should I
put my changes, I realized I wasn't able to understand what current code
was doing. Maybe I am not a good enough C programmer, but I was able to
change things in other occasions.


I hope you won't get these critiques bad...
and thanks for all your efforts, really, in the name of, I think, everybody.
Aapo L.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: potentially lost largeish raid5 array..

am 25.09.2011 11:18:28 von kristleifur

On Sat, Sep 24, 2011 at 9:57 PM, Aapo Laine wrote:
>
> On 09/23/11 11:15, NeilBrown wrote:
>>
>> On Fri, 23 Sep 2011 02:09:36 -0600 Thomas Fjellstrom <tfjellstrom@shaw.ca>
>> wrote:
>>>
>>> I forgot to say, but: Thank you very much :) for the help, and your tireless
>>> work on md.
>>>
>>
>> You're very welcome .... but I felt I needed to respond to that word
>> "tireless".
>> The truth is that I am getting rather tired of md .... if anyone knows anyone
>> who wants to get into kernel development and is wondering where to start -
>> please consider whispering 'the md driver' in their ear.  Plenty to do, great
>> mentoring possibilities, and competent linux kernel engineers with good
>> experience are unlikely to have much trouble finding a job ;-)
>>
>> NeilBrown
>
> Whoa this is shocking news!
>
> Firstly then, thank you so much for your excellent work up to now. Linux
> has what I believe to be the best software raid of all operating systems
> thanks to you. Excellent in both features, and reliability i.e. quality
> of code.
>
> And also the support through the list was great. I found so many
> problems solved already just by looking at the archives... so many
> people with their arse saved by you.
>
> I think everybody here is sorry to see you willing to go.
>
> Now the bad news... Regarding the MD takeover, there I think I see a
> problem...
> The MD code is very tersely commented, compared to its complexity!
>
> - there is not much explanation of overall strategies, or the structure
> of code. Also the flow of data between the functions is not much
> explained most of the times.
>
> - It's not obvious to me what is the entry point for kernel processes
> related to MD arrays, how are they triggered and where do they run...
> E.g. in the past I tried to understand how did resync work, but I
> couldn't. I thought there was a kernel process controlling resync
> advancement, but I couldn't really find the boundaries of code inside
> which it was executing.
>
> - it's not clear what the various functions do or in what occasion they
> are called. Except from their own name, most of them have no comments
> just before the definition.
>
> - the algorithms within the functions are very long and complex, but only
> rarely they are explained by comments. I am now seeing pieces having 5
> levels of if's nested one inside the other, and there are almost no
> comments.
>
> - last but not least, variables have very short names, and for most of
> them, it is not explained what they mean. This is mostly for local
> variables, but sometimes even for the structs which go into metadata
> e.g. in struct r1_private_data_s most members do not have an
> explanation. This is pretty serious, to me at least, for understanding
> the code.
>
> - ... maybe more I can't think of right now ...
>
> Your code is of excellent quality, as I wrote, I wish there were more
> programmers like you, but if you now want to leave, THEN I start to be
> worried! Would you please comment it (much) more before leaving? Fully
> understanding your code I think is going to take other people a lot of
> time otherwise, and you might not find a replacement easily and/or s/he
> might do mistakes.
>
> There were times in the past when I had ideas and I wanted to contribute
> code, but when I looked inside MD and tried to understand where should I
> put my changes, I realized I wasn't able to understand what current code
> was doing. Maybe I am not a good enough C programmer, but I was able to
> change things in other occasions.
>
>
> I hope you won't get these critiques bad...
> and thanks for all your efforts, really, in the name of, I think, everybody.
> Aapo L.
> --

Thank you Mr. Brown for the immensely useful mdadm. In its simplicity,
it is mastery of craft, purely functional to the point that it attains
an inherent beauty of form.

Re. Aapo's comments, a GitHub repository of mdadm could be the ticket.
A de-facto Github merge maintainer owns a master repo, say
"mdadm-documented". People fork from it at will, creating comments for
the code and / or increasing intuitive readability if need be.
Maintainer merges it all, creating an mdadm version that is functionally
identical to mdadm stable but sort of pedagogically smoothed :)

The mdadm maintainer could then merge this into the actual mdadm head at will.

If it takes off, this would take the load off Neil's shoulders - plus
it's sometimes easier for fresh eyes to see things in context when the
initial work of building the full mental picture is done.

-kd
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: potentially lost largeish raid5 array..

am 25.09.2011 11:37:47 von NeilBrown


On Fri, 23 Sep 2011 10:26:53 -0600 Thomas Fjellstrom
wrote:

> On September 23, 2011, NeilBrown wrote:
> > On Fri, 23 Sep 2011 02:09:36 -0600 Thomas Fjellstrom <tfjellstrom@shaw.ca>
> >
> > wrote:
> > > I forgot to say, but: Thank you very much :) for the help, and your
> > > tireless work on md.
> >
> > You're very welcome .... but I felt I needed to respond to that word
> > "tireless".
> > The truth is that I am getting rather tired of md .... if anyone knows
> > anyone who wants to get into kernel development and is wondering where to
> > start - please consider whispering 'the md driver' in their ear.  Plenty
> > to do, great mentoring possibilities, and competent linux kernel engineers
> > with good experience are unlikely to have much trouble finding a job ;-)
> >
> > NeilBrown
>
> Very tempting. How much work do you think it would be to add in full raid10
> reshape support? ;D (not that I'm volunteering, I don't think I could
> comprehend a lot of the raid code, at least not at this point in time
> (extenuating circumstances)).
>

RAID10 reshape is more complicated than RAID5/6 reshape because there are
more options - more combinations.

So you would probably implement a subset of possible reshapes. And then
maybe implement another subset.

Providing you have:
- a clear understanding of the intermediate state and a way to record
that state in the metadata
 - a way to tell if a given block is in the 'old' layout or the 'new' layout
   or 'being reshaped'
- somewhere in memory to store all the blocks that are 'being reshaped'

it should be fairly easy. RAID5/6 has a stripe-cache so the last point is
trivial. Handling that in RAID10 is probably the biggest single part of the
task.

So: not a trivial task, but not an enormous task either.... which doesn't
narrow it down very much I guess.

NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: potentially lost largeish raid5 array..

am 25.09.2011 12:10:25 von NeilBrown


On Sat, 24 Sep 2011 23:57:58 +0200 Aapo Laine
wrote:

> On 09/23/11 11:15, NeilBrown wrote:
> > On Fri, 23 Sep 2011 02:09:36 -0600 Thomas Fjellstrom <tfjellstrom@shaw.ca>
> > wrote:
> >> I forgot to say, but: Thank you very much :) for the help, and your tireless
> >> work on md.
> >>
> >
> > You're very welcome .... but I felt I needed to respond to that word
> > "tireless".
> > The truth is that I am getting rather tired of md .... if anyone knows anyone
> > who wants to get into kernel development and is wondering where to start -
> > please consider whispering 'the md driver' in their ear.  Plenty to do, great
> > mentoring possibilities, and competent linux kernel engineers with good
> > experience are unlikely to have much trouble finding a job ;-)
> >
> > NeilBrown
>
> Whoa this is shocking news!

Hopefully not too shocking... I'm not planning on leaving md any time soon.
I do still enjoy working on it.
But it certainly isn't as fresh and new as it was 10 years ago. It would
probably do both me and md a lot of good to have someone with new enthusiasm
and new perspectives...


>
> Firstly then, thank you so much for your excellent work up to now. Linux
> has what I believe to be the best software raid of all operating systems
> thanks to you. Excellent in both features, and reliability i.e. quality
> of code.
>
> And also the support through the list was great. I found so many
> problems solved already just by looking at the archives... so many
> people with their arse saved by you.
>
> I think everybody here is sorry to see you willing to go.
>
> Now the bad news... Regarding the MD takeover, there I think I see a
> problem...
> The MD code is very tersely commented, compared to its complexity!

That is certainly true, but seems to be true across much of the kernel, and
probably most code in general (though I'm sure such a comment will lead to
people wanting to tell me their favourite exceptions ... so I'll start with
"TeX").

This is one of the reasons I offered "mentoring" to any likely candidate.



>
> - there is not much explanation of overall strategies, or the structure
> of code. Also the flow of data between the functions is not much
> explained most of the times.
>
> - It's not obvious to me what is the entry point for kernel processes
> related to MD arrays, how are they triggered and where do they run...
> E.g. in the past I tried to understand how did resync work, but I
> couldn't. I thought there was a kernel process controlling resync
> advancement, but I couldn't really find the boundaries of code inside
> which it was executing.

md_do_sync() is the heart of the resync process. It calls into the
personality's sync_request() function.

The kernel thread is started by md_check_recovery() if it appears to be
needed. md_check_recovery() is regularly run by each personality's main
controlling thread.

>
> - it's not clear what the various functions do or in what occasion they
> are called. Except from their own name, most of them have no comments
> just before the definition.

How about this:
- you identify some functions for which the purpose or use isn't clear
- I'll explain to you when/how/why they are used
- You create a patch which adds comments which explains it all
- I'll apply that patch.

deal??

>
> - the algorithms within the functions are very long and complex, but only
> rarely they are explained by comments. I am now seeing pieces having 5
> levels of if's nested one inside the other, and there are almost no
> comments.

I feel your pain. I really should factor out the deeply nested levels into
separate functions. Sometimes I have done that but there is plenty more to
do. Again, I would be much more motivated to do this if I were working with
someone who would be directly helped by it. So if you identify specific
problems, it'll be a lot easier for me to help fix them.


>
> - last but not least, variables have very short names, and for most of
> them, it is not explained what they mean. This is mostly for local
> variables, but sometimes even for the structs which go into metadata
> e.g. in struct r1_private_data_s most members do not have an
> explanation. This is pretty serious, to me at least, for understanding
> the code.

Does this help?

diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index e0d676b..feb44ad 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -28,42 +28,67 @@ struct r1_private_data_s {
 	mddev_t		*mddev;
 	mirror_info_t	*mirrors;
 	int		raid_disks;
+
+	/* When choosing the best device for a read (read_balance())
+	 * we try to keep sequential reads on the same device
+	 * using 'last_used' and 'next_seq_sect'
+	 */
 	int		last_used;
 	sector_t	next_seq_sect;
+	/* During resync, read_balancing is only allowed on the part
+	 * of the array that has been resynced.  'next_resync' tells us
+	 * where that is.
+	 */
+	sector_t	next_resync;
+
 	spinlock_t	device_lock;
 
+	/* list of 'r1bio_t' that need to be processed by raid1d, whether
+	 * to retry a read, writeout a resync or recovery block, or
+	 * anything else.
+	 */
 	struct list_head	retry_list;
-	/* queue pending writes and submit them on unplug */
-	struct bio_list	pending_bio_list;
 
-	/* for use when syncing mirrors: */
+	/* queue pending writes to be submitted on unplug */
+	struct bio_list	pending_bio_list;
 
+	/* for use when syncing mirrors:
+	 * We don't allow both normal IO and resync/recovery IO at
+	 * the same time - resync/recovery can only happen when there
+	 * is no other IO.  So when either is active, the other has to wait.
+	 * See more details description in raid1.c near raise_barrier().
+	 */
+	wait_queue_head_t	wait_barrier;
 	spinlock_t	resync_lock;
 	int		nr_pending;
 	int		nr_waiting;
 	int		nr_queued;
 	int		barrier;
-	sector_t	next_resync;
-	int		fullsync;  /* set to 1 if a full sync is needed,
-				    * (fresh device added).
-				    * Cleared when a sync completes.
-				    */
-	int		recovery_disabled; /* when the same as
-				    * mddev->recovery_disabled
-				    * we don't allow recovery
-				    * to be attempted as we
-				    * expect a read error
-				    */
 
-	wait_queue_head_t	wait_barrier;
+	/* Set to 1 if a full sync is needed, (fresh device added).
+	 * Cleared when a sync completes.
+	 */
+	int		fullsync;
 
-	struct pool_info	*poolinfo;
+	/* When the same as mddev->recovery_disabled we don't allow
+	 * recovery to be attempted as we expect a read error.
+	 */
+	int		recovery_disabled;
 
-	struct page	*tmppage;
 
+	/* poolinfo contains information about the content of the
+	 * mempools - it changes when the array grows or shrinks
+	 */
+	struct pool_info	*poolinfo;
 	mempool_t	*r1bio_pool;
 	mempool_t	*r1buf_pool;
 
+	/* temporary buffer for synchronous IO when attempting to repair
+	 * a read error.
+	 */
+	struct page	*tmppage;
+
+
 	/* When taking over an array from a different personality, we store
 	 * the new thread here until we fully activate the array.
 	 */

>
> - ... maybe more I can't think of right now ...
>
> Your code is of excellent quality, as I wrote, I wish there were more
> programmers like you, but if you now want to leave, THEN I start to be
> worried! Would you please comment it (much) more before leaving? Fully
> understanding your code I think is going to take other people a lot of
> time otherwise, and you might not find a replacement easily and/or s/he
> might do mistakes.

I'm not planning on leaving - not for quite some time anyway.
But I know the code so well that it is hard to see which bits need
documenting, and what sort of documentation would really help.
I would love it if you (or anyone) would review the code and point to parts
that particularly need improvement.


>
> There were times in the past when I had ideas and I wanted to contribute
> code, but when I looked inside MD and tried to understand where should I
> put my changes, I realized I wasn't able to understand what current code
> was doing. Maybe I am not a good enough C programmer, but I was able to
> change things in other occasions.

Don't be afraid to ask... But sometimes you do need a bit of persistence
though. :-) Not always easy to find time for that.


>
>
> I hope you won't get these critiques bad...

Not at all.

> and thanks for all your efforts, really, in the name of, I think, everybody.
> Aapo L.

Thanks for your valuable feedback.
Being able to see problems is of significant value. One of the reasons that
I pay close attention to this list is because it shows me where the problems
with md and mdadm are. People often try things that I would never even dream
of trying (because I know they won't work). So this helps me know where the
code can be improved - either so what they try does work, or so it fails more
gracefully and helpfully.

Thanks,
NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: potentially lost largeish raid5 array..

am 25.09.2011 15:03:44 von David Brown

On 24/09/2011 18:38, Stan Hoeppner wrote:
> On 9/24/2011 10:16 AM, David Brown wrote:
>> On 24/09/2011 14:17, Stan Hoeppner wrote:
>>> On 9/23/2011 7:11 PM, Thomas Fjellstrom wrote:
>>>> On September 23, 2011, Stan Hoeppner wrote:
>>>
>>>>> When properly configured XFS will achieve near spindle throughput.
>>>>> Recent versions of mkfs.xfs read the mdraid configuration and
>>>>> configure
>>>>> the filesystem automatically for sw, swidth, number of allocation
>>>>> groups, etc. Thus you should get max performance out of the gate.
>>>>
>>>> What happens when you add a drive and reshape? Is it enough just to
>>>> tweak the
>>>> mount options?
>>>
>>> When you change the number of effective spindles with a reshape, and
>>> thus the stripe width and stripe size, you definitely should add the
>>> appropriate XFS mount options and values to reflect this. Performance
>>> will be less than optimal if you don't.
>>>
>>> If you use a linear concat under XFS you never have to worry about the
>>> above situation. It has many other advantages over a striped array and
>>> better performance for many workloads, especially multi user general
>>> file serving and maildir storage--workloads with lots of concurrent IO.
>>> If you 'need' maximum single stream performance for large files, a
>>> striped array is obviously better. Most applications however don't need
>>> large single stream performance.
>>>
>>
>> If you use a linear concatenation of drives for XFS, is it not correct
>> that you want one allocation group per drive (or per raid set, if you
>> are concatenating a bunch of raid sets)?
>
> Yes. Normally with a linear concat you would make X number of RAID1
> mirrors via mdraid or hardware RAID, then concat them with mdadm
> --linear or LVM. Then mkfs.xfs -d agcount=X ...
>
> Currently XFS has a 1TB limit for allocation groups. If you use 2TB
> drives you'll get 2 AGs per effective spindle instead of one. With some
> 'borderline' workloads this may hinder performance. It depends on how
> many top level directories you have in the filesystem and your
> concurrency to them.
>
>> If you then add another drive
>> or raid set, can you grow XFS with another allocation group?
>
> XFS creates more allocation groups automatically as part of the grow
> operation. If you have a linear concat setup you'll obviously want to
> control this manually to maintain the same number of AGs per effective
> spindle.
>
> Always remember that the key to linear concat performance with XFS is
> directory level parallelism. If you have lots of top level directories
> in your filesystem and high concurrent access (home dirs, maildir, etc)
> it will typically work better than a striped array. If you have few
> directories and low concurrency, are streaming large files, etc, stick
> with a striped array.
>

I understand the point about linear concat and allocation groups being a
good solution when you have multiple parallel accesses to different
files, rather than streamed access to a few large files.

But you seem to be suggesting here that accesses to different files
within the same top-level directory will be put in the same allocation
group - is that correct? That strikes me as very limiting - it is far
from uncommon for most accesses to be under one or two top-level
directories.

> Also note that a linear concat will only give increased performance with
>> XFS, again for appropriate workloads. Using a linear concat with EXT3/4
> will give you the performance of a single spindle regardless of the
> total number of disks used. So one should stick with striped arrays for
> EXT3/4.
>


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: potentially lost largeish raid5 array..

am 25.09.2011 16:39:32 von Stan Hoeppner

On 9/25/2011 8:03 AM, David Brown wrote:
> On 24/09/2011 18:38, Stan Hoeppner wrote:
>> On 9/24/2011 10:16 AM, David Brown wrote:
>>> On 24/09/2011 14:17, Stan Hoeppner wrote:
>>>> On 9/23/2011 7:11 PM, Thomas Fjellstrom wrote:
>>>>> On September 23, 2011, Stan Hoeppner wrote:
>>>>
>>>>>> When properly configured XFS will achieve near spindle throughput.
>>>>>> Recent versions of mkfs.xfs read the mdraid configuration and
>>>>>> configure
>>>>>> the filesystem automatically for sw, swidth, number of allocation
>>>>>> groups, etc. Thus you should get max performance out of the gate.
>>>>>
>>>>> What happens when you add a drive and reshape? Is it enough just to
>>>>> tweak the
>>>>> mount options?
>>>>
>>>> When you change the number of effective spindles with a reshape, and
>>>> thus the stripe width and stripe size, you definitely should add the
>>>> appropriate XFS mount options and values to reflect this. Performance
>>>> will be less than optimal if you don't.
>>>>
>>>> If you use a linear concat under XFS you never have to worry about the
>>>> above situation. It has many other advantages over a striped array and
>>>> better performance for many workloads, especially multi user general
>>>> file serving and maildir storage--workloads with lots of concurrent IO.
>>>> If you 'need' maximum single stream performance for large files, a
>>>> striped array is obviously better. Most applications however don't need
>>>> large single stream performance.
>>>>
>>>
>>> If you use a linear concatenation of drives for XFS, is it not correct
>>> that you want one allocation group per drive (or per raid set, if you
>>> are concatenating a bunch of raid sets)?
>>
>> Yes. Normally with a linear concat you would make X number of RAID1
>> mirrors via mdraid or hardware RAID, then concat them with mdadm
>> --linear or LVM. Then mkfs.xfs -d agcount=X ...
>>
>> Currently XFS has a 1TB limit for allocation groups. If you use 2TB
>> drives you'll get 2 AGs per effective spindle instead of one. With some
>> 'borderline' workloads this may hinder performance. It depends on how
>> many top level directories you have in the filesystem and your
>> concurrency to them.
>>
>>> If you then add another drive
>>> or raid set, can you grow XFS with another allocation group?
>>
>> XFS creates more allocation groups automatically as part of the grow
>> operation. If you have a linear concat setup you'll obviously want to
>> control this manually to maintain the same number of AGs per effective
>> spindle.
>>
>> Always remember that the key to linear concat performance with XFS is
>> directory level parallelism. If you have lots of top level directories
>> in your filesystem and high concurrent access (home dirs, maildir, etc)
>> it will typically work better than a striped array. If you have few
>> directories and low concurrency, are streaming large files, etc, stick
>> with a striped array.
>>
>
> I understand the point about linear concat and allocation groups being a
> good solution when you have multiple parallel accesses to different
> files, rather than streamed access to a few large files.

Not just different files, but files in different top level directories.

> But you seem to be suggesting here that accesses to different files
> within the same top-level directory will be put in the same allocation
> group - is that correct?

When you create a top level directory on an XFS filesystem it is
physically created in one of the on disk allocation groups. When you
create another directory it is physically created in the next allocation
group, and so on, until it wraps back to the first AG. This is why XFS
can derive parallelism from a linear concat and no other filesystem can.
Performance is rarely perfectly symmetrical, as the workload dictates
the file, and thus physical IO, access patterns.

But, with maildir and similar workloads, the odds are very high that
you'll achieve good directory level parallelism because each mailbox is
in a different directory. I've previously discussed the many other
reasons why XFS on a linear concat beats the stuffing out of anything on
a striped array for a maildir workload so I won't repeat all that here.

> That strikes me as very limiting - it is far
> from uncommon for most accesses to be under one or two top-level
> directories.

By design or ignorance? What application workload? What are the IOPS
and bandwidth needs of this workload you describe? Again, read the
paragraph below, which you apparently skipped the first time.

>> Also note that a linear concat will only give increased performance with
>> XFS, again for appropriate workloads. Using a linear concat with EXT3/4
>> will give you the performance of a single spindle regardless of the
>> total number of disks used. So one should stick with striped arrays for
>> EXT3/4.

--
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: potentially lost largeish raid5 array..

am 25.09.2011 17:18:21 von David Brown

On 25/09/11 16:39, Stan Hoeppner wrote:
> On 9/25/2011 8:03 AM, David Brown wrote:
>> On 24/09/2011 18:38, Stan Hoeppner wrote:
>>> On 9/24/2011 10:16 AM, David Brown wrote:
>>>> On 24/09/2011 14:17, Stan Hoeppner wrote:
>>>>> On 9/23/2011 7:11 PM, Thomas Fjellstrom wrote:
>>>>>> On September 23, 2011, Stan Hoeppner wrote:
>>>>>
>>>>>>> When properly configured XFS will achieve near spindle throughput.
>>>>>>> Recent versions of mkfs.xfs read the mdraid configuration and
>>>>>>> configure
>>>>>>> the filesystem automatically for sw, swidth, number of allocation
>>>>>>> groups, etc. Thus you should get max performance out of the gate.
>>>>>>
>>>>>> What happens when you add a drive and reshape? Is it enough just to
>>>>>> tweak the
>>>>>> mount options?
>>>>>
>>>>> When you change the number of effective spindles with a reshape, and
>>>>> thus the stripe width and stripe size, you definitely should add the
>>>>> appropriate XFS mount options and values to reflect this. Performance
>>>>> will be less than optimal if you don't.
>>>>>
>>>>> If you use a linear concat under XFS you never have to worry about the
>>>>> above situation. It has many other advantages over a striped array and
>>>>> better performance for many workloads, especially multi user general
>>>>> file serving and maildir storage--workloads with lots of concurrent
>>>>> IO.
>>>>> If you 'need' maximum single stream performance for large files, a
>>>>> striped array is obviously better. Most applications however don't
>>>>> need
>>>>> large single stream performance.
>>>>>
>>>>
>>>> If you use a linear concatenation of drives for XFS, is it not correct
>>>> that you want one allocation group per drive (or per raid set, if you
>>>> are concatenating a bunch of raid sets)?
>>>
>>> Yes. Normally with a linear concat you would make X number of RAID1
>>> mirrors via mdraid or hardware RAID, then concat them with mdadm
>>> --linear or LVM. Then mkfs.xfs -d agcount=X ...
>>>
>>> Currently XFS has a 1TB limit for allocation groups. If you use 2TB
>>> drives you'll get 2 AGs per effective spindle instead of one. With some
>>> 'borderline' workloads this may hinder performance. It depends on how
>>> many top level directories you have in the filesystem and your
>>> concurrency to them.
>>>
>>>> If you then add another drive
>>>> or raid set, can you grow XFS with another allocation group?
>>>
>>> XFS creates more allocation groups automatically as part of the grow
>>> operation. If you have a linear concat setup you'll obviously want to
>>> control this manually to maintain the same number of AGs per effective
>>> spindle.
>>>
>>> Always remember that the key to linear concat performance with XFS is
>>> directory level parallelism. If you have lots of top level directories
>>> in your filesystem and high concurrent access (home dirs, maildir, etc)
>>> it will typically work better than a striped array. If you have few
>>> directories and low concurrency, are streaming large files, etc, stick
>>> with a striped array.
>>>
>>
>> I understand the point about linear concat and allocation groups being a
>> good solution when you have multiple parallel accesses to different
>> files, rather than streamed access to a few large files.
>
> Not just different files, but files in different top level directories.
>
>> But you seem to be suggesting here that accesses to different files
>> within the same top-level directory will be put in the same allocation
>> group - is that correct?
>
> When you create a top level directory on an XFS filesystem it is
> physically created in one of the on disk allocation groups. When you
> create another directory it is physically created in the next allocation
> group, and so on, until it wraps back to the first AG. This is why XFS
> can derive parallelism from a linear concat and no other filesystem can.
> Performance is rarely perfectly symmetrical, as the workload dictates
> the file, and thus physical IO, access patterns.
>
> But, with maildir and similar workloads, the odds are very high that
> you'll achieve good directory level parallelism because each mailbox is
> in a different directory. I've previously discussed the many other
> reasons why XFS on a linear concat beats the stuffing out of anything on
> a striped array for a maildir workload so I won't repeat all that here.
>
>> That strikes me as very limiting - it is far
>> from uncommon for most accesses to be under one or two top-level
>> directories.
>
> By design or ignorance? What application workload? What are the IOPS and
> bandwidth needs of this workload you describe? Again, read the paragraph
> below, which you apparently skipped the first time.
>

Perhaps I am not expressing myself very clearly. I don't mean to sound
patronising by spelling it out like this - I just want to be sure I'm
getting an answer to the question in my mind (assuming, of course, you
have time and inclination to help me - you've certainly been very
helpful and informative so far!).

Suppose you have an xfs filesystem with 10 allocation groups, mounted on
/mnt. You make a directory /mnt/a. That gets created in allocation
group 1. You make a second directory /mnt/b. That gets created in
allocation group 2. Any files you put in /mnt/a go in allocation group
1, and any files in /mnt/b go in allocation group 2. Am I right so far?

Then you create directories /mnt/a/a1 and /mnt/a/a2. Do these also go
in allocation group 1, or do they go in groups 3 and 4? Similarly, do
files inside them go in group 1 or in groups 3 and 4?

To take an example that is quite relevant to me, consider a mail server
handling two domains. You have (for example) /var/mail/domain1 and
/var/mail/domain2, with each user having a directory within either
domain1 or domain2. What I would like to know, is if the xfs filesystem
is mounted on /var/mail, then are the user directories spread across the
allocation groups, or are all of domain1 users in one group and all of
domain2 users in another group? If it is the former, then xfs on a
linear concat would scale beautifully - if it is the latter, then it
would be pretty terrible scaling.

>>> Also note that a linear concat will only give increased performance with
>>> XFS, again for appropriate workloads. Using a linear concat with EXT3/4
>>> will give you the performance of a single spindle regardless of the
>>> total number of disks used. So one should stick with striped arrays for
>>> EXT3/4.
>

I understand this, which is why I didn't comment earlier. I am aware
that only XFS can utilise the parts of a linear concat to improve
performance - my questions were about the circumstances in which XFS can
utilise the multiple allocation groups.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: potentially lost largeish raid5 array..

am 25.09.2011 20:07:08 von Robert L Mathews

Stan Hoeppner wrote:

> Solve your problem with a 50% more $$ LSI SAS1068E based Intel 8 port
> PCIe x4 SAS/SATA HBA, which uses the mptsas driver:
> http://www.newegg.com/Product/Product.aspx?Item=N82E16816117 157

When this card is used in JBOD mode, is the on-disk format identical to
a standard disk that's not plugged into a RAID card?

In other words, if the card fails, is it possible to take the disks and
connect them directly to any non-RAID SATA/SAS motherboard? Or would you
need a replacement card to read the data?

(There should be a special term for proprietary JBOD formats to prevent
people from being burned by this... something like "JBOPD".)

--
Robert L Mathews, Tiger Technologies, http://www.tigertech.net/
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: potentially lost largeish raid5 array..

am 26.09.2011 01:58:58 von Stan Hoeppner

On 9/25/2011 10:18 AM, David Brown wrote:
> On 25/09/11 16:39, Stan Hoeppner wrote:
>> On 9/25/2011 8:03 AM, David Brown wrote:
>>> On 24/09/2011 18:38, Stan Hoeppner wrote:
>>>> On 9/24/2011 10:16 AM, David Brown wrote:
>>>>> On 24/09/2011 14:17, Stan Hoeppner wrote:
>>>>>> On 9/23/2011 7:11 PM, Thomas Fjellstrom wrote:
>>>>>>> On September 23, 2011, Stan Hoeppner wrote:
>>>>>>
>>>>>>>> When properly configured XFS will achieve near spindle throughput.
>>>>>>>> Recent versions of mkfs.xfs read the mdraid configuration and
>>>>>>>> configure
>>>>>>>> the filesystem automatically for sw, swidth, number of allocation
>>>>>>>> groups, etc. Thus you should get max performance out of the gate.
>>>>>>>
>>>>>>> What happens when you add a drive and reshape? Is it enough just to
>>>>>>> tweak the
>>>>>>> mount options?
>>>>>>
>>>>>> When you change the number of effective spindles with a reshape, and
>>>>>> thus the stripe width and stripe size, you definitely should add the
>>>>>> appropriate XFS mount options and values to reflect this. Performance
>>>>>> will be less than optimal if you don't.
>>>>>>
>>>>>> If you use a linear concat under XFS you never have to worry about
>>>>>> the
>>>>>> above situation. It has many other advantages over a striped array
>>>>>> and
>>>>>> better performance for many workloads, especially multi user general
>>>>>> file serving and maildir storage--workloads with lots of concurrent
>>>>>> IO.
>>>>>> If you 'need' maximum single stream performance for large files, a
>>>>>> striped array is obviously better. Most applications however don't
>>>>>> need
>>>>>> large single stream performance.
>>>>>>
>>>>>
>>>>> If you use a linear concatenation of drives for XFS, is it not correct
>>>>> that you want one allocation group per drive (or per raid set, if you
>>>>> are concatenating a bunch of raid sets)?
>>>>
>>>> Yes. Normally with a linear concat you would make X number of RAID1
>>>> mirrors via mdraid or hardware RAID, then concat them with mdadm
>>>> --linear or LVM. Then mkfs.xfs -d agcount=X ...
>>>>
>>>> Currently XFS has a 1TB limit for allocation groups. If you use 2TB
>>>> drives you'll get 2 AGs per effective spindle instead of one. With some
>>>> 'borderline' workloads this may hinder performance. It depends on how
>>>> many top level directories you have in the filesystem and your
>>>> concurrency to them.
>>>>
>>>>> If you then add another drive
>>>>> or raid set, can you grow XFS with another allocation group?
>>>>
>>>> XFS creates more allocation groups automatically as part of the grow
>>>> operation. If you have a linear concat setup you'll obviously want to
>>>> control this manually to maintain the same number of AGs per effective
>>>> spindle.
>>>>
>>>> Always remember that the key to linear concat performance with XFS is
>>>> directory level parallelism. If you have lots of top level directories
>>>> in your filesystem and high concurrent access (home dirs, maildir, etc)
>>>> it will typically work better than a striped array. If you have few
>>>> directories and low concurrency, are streaming large files, etc, stick
>>>> with a striped array.
>>>>
>>>
>>> I understand the point about linear concat and allocation groups being a
>>> good solution when you have multiple parallel accesses to different
>>> files, rather than streamed access to a few large files.
>>
>> Not just different files, but files in different top level directories.
>>
>>> But you seem to be suggesting here that accesses to different files
>>> within the same top-level directory will be put in the same allocation
>>> group - is that correct?
>>
>> When you create a top level directory on an XFS filesystem it is
>> physically created in one of the on disk allocation groups. When you
>> create another directory it is physically created in the next allocation
>> group, and so on, until it wraps back to the first AG. This is why XFS
>> can derive parallelism from a linear concat and no other filesystem can.
>> Performance is rarely perfectly symmetrical, as the workload dictates
>> the file, and thus physical IO, access patterns.
>>
>> But, with maildir and similar workloads, the odds are very high that
>> you'll achieve good directory level parallelism because each mailbox is
>> in a different directory. I've previously discussed the many other
>> reasons why XFS on a linear concat beats the stuffing out of anything on
>> a striped array for a maildir workload so I won't repeat all that here.
>>
>>> That strikes me as very limiting - it is far
>>> from uncommon for most accesses to be under one or two top-level
>>> directories.
>>
>> By design or ignorance? What application workload? What are the IOPS and
>> bandwidth needs of this workload you describe? Again, read the paragraph
>> below, which you apparently skipped the first time.
>>
>
> Perhaps I am not expressing myself very clearly. I don't mean to sound
> patronising by spelling it out like this - I just want to be sure I'm
> getting an answer to the question in my mind (assuming, of course, you
> have time and inclination to help me - you've certainly been very
> helpful and informative so far!).
>
> Suppose you have an xfs filesystem with 10 allocation groups, mounted on
> /mnt. You make a directory /mnt/a. That gets created in allocation group
> 1. You make a second directory /mnt/b. That gets created in allocation
> group 2. Any files you put in /mnt/a go in allocation group 1, and any
> files in /mnt/b go in allocation group 2.

You're describing the infrastructure first. You *always* start with the
needs of the workload and build the storage stack to best meet those
needs. You're going backwards, but I'll try to play your game.

> Am I right so far?

Yes. There are some corner cases but this is how a fresh XFS behaves.
I should have stated before that my comments are based on using the
inode64 mount option which is required to reach above 16TB, and which
yields superior performance. The default mode, inode32, behaves a bit
differently WRT allocation. It would take too much text to explain the
differences here. You're better off digging into the XFS documentation
at xfs.org.
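
For what it's worth, that option is just a mount flag (device and mount
point here are hypothetical):

  mount -o inode64 /dev/md20 /srv/data

Roughly speaking, inode64 lets new inodes - and hence new directories - be
placed in any allocation group instead of being kept in the low part of the
filesystem, which is part of what gives the per-directory spread described
in this thread.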

> Then you create directories /mnt/a/a1 and /mnt/a/a2. Do these also go in
> allocation group 1, or do they go in groups 3 and 4? Similarly, do files
> inside them go in group 1 or in groups 3 and 4?

Remember this is a filesystem. Think of a file cabinet. The cabinet is
the XFS filesystem, the drawers are the allocation groups, directories
are manila folders, and files are papers in the folders. That's
exactly how the allocation works. Now, a single file will span more
than 1 AG (drawer) if the file is larger than the free space available
within the AG (drawer) when the file is created, or appended.

> To take an example that is quite relevant to me, consider a mail server
> handling two domains. You have (for example) /var/mail/domain1 and
> /var/mail/domain2, with each user having a directory within either
> domain1 or domain2. What I would like to know, is if the xfs filesystem
> is mounted on /var/mail, then are the user directories spread across the
> allocation groups, or are all of domain1 users in one group and all of
> domain2 users in another group? If it is the former, then xfs on a
> linear concat would scale beautifully - if it is the latter, then it
> would be pretty terrible scaling.

See above for file placement.

With only two top level directories you're not going to achieve good
parallelism on an XFS linear concat. Modern delivery agents, dovecot
for example, allow you to store each user mail directory independently,
anywhere you choose, so this isn't a problem. Simply create a top level
directory for every mailbox, something like:

/var/mail/domain1.%user/
/var/mail/domain2.%user/
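
A rough way to sanity-check the placement (inode64 assumed, paths
hypothetical): with inode64 the AG index sits in the high bits of the inode
number, so top-level directories landing in different AGs show widely
spaced inode numbers:

  mkdir /var/mail/domain1.alice /var/mail/domain1.bob /var/mail/domain2.carol
  ls -di /var/mail/domain1.alice /var/mail/domain1.bob /var/mail/domain2.carol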

>>>> Also note that a linear concat will only give increased performance
>>>> with
>>>> XFS, again for appropriate workloads. Using a linear concat with EXT3/4
>>>> will give you the performance of a single spindle regardless of the
>>>> total number of disks used. So one should stick with striped arrays for
>>>> EXT3/4.
>>
>
> I understand this, which is why I didn't comment earlier. I am aware
> that only XFS can utilise the parts of a linear concat to improve
> performance - my questions were about the circumstances in which XFS can
> utilise the multiple allocation groups.

The optimal scenario is rather simple. Create multiple top level
directories and write/read files within all of them concurrently. This
works best with highly concurrent workloads where high random IOPS is
needed. This can be with small or large files.
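
A crude, purely synthetic way to see the effect (paths hypothetical) is to
drive several top-level directories in parallel and compare with the same
I/O aimed at a single directory:

  for d in /srv/data/dir1 /srv/data/dir2 /srv/data/dir3 /srv/data/dir4; do
      mkdir -p "$d"
      dd if=/dev/zero of="$d/testfile" bs=1M count=1024 oflag=direct &
  done
  wait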

The large file case is transactional database specific, and careful
planning and layout of the disks and filesystem are needed. In this
case we span a single large database file over multiple small allocation
groups. Transactional DB systems typically write only a few hundred
bytes per record. Consider a large retailer point of sale application.
With a striped array you would suffer the read-modify-write penalty
when updating records. With a linear concat you simply directly update
a single 4KB block.

XFS is extremely flexible and powerful. It can be tailored to yield
maximum performance for just about any workload with sufficient concurrency.

--
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: potentially lost largeish raid5 array..

am 26.09.2011 04:26:32 von Krzysztof Adamski

On Fri, 2011-09-23 at 07:56 -0500, Stan Hoeppner wrote:
> On 9/23/2011 12:10 AM, Thomas Fjellstrom wrote:
>
> > I /really really/ wish the driver for this card was more stable, but you deal
> > with what you've got (in my case a $100 2 port SAS/8 port SATA card).
>
> Please don't shield the identity of the problem card. Others need to
> know of your problems. An educated guess tells me it is one of...
>
> Card: SuperMicro AOC-SASLP-MV8 Marvell 88SE6480
> Driver: MVSAS
>
> Card: HighPoint RocketRAID 2680/2680SGL Marvell 88SE6485
> Driver: MVSAS
>
> This ASIC/driver combo is so historically horrible with Linux that I'm
> surprised all the owners haven't had a big bon fire party and thrown all
> the cards in. Or simply Ebay'd them to Windows users, where they seem
> to work relatively OK.
>
> Solve your problem with a 50% more $$ LSI SAS1068E based Intel 8 port
> PCIe x4 SAS/SATA HBA, which uses the mptsas driver:
> http://www.newegg.com/Product/Product.aspx?Item=N82E16816117 157
>
> It seems this is the card most users switch to after being burned by the
> cheap Marvell based SAS 2xSFF8087 cards. The 1068E cards and the mptsas
> driver are far more reliable, stable, and faster. Many OEM cards from
> IBM, Dell, etc, use this chip and can be had on Ebay for less than the
> new retail Intel card. In your situation I'd probably buy new Intel
> just in case. Hope this info/insight helps.
>

Keep in mind that the 1068E based cards do NOT support 3T drives fully.

K


Re: potentially lost largeish raid5 array..

am 26.09.2011 08:08:26 von Mikael Abrahamsson

On Sun, 25 Sep 2011, Robert L Mathews wrote:

>> Solve your problem with a 50% more $$ LSI SAS1068E based Intel 8 port PCIe
>> x4 SAS/SATA HBA, which uses the mptsas driver:
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16816117157
>
> When this card is used in JBOD mode, is the on-disk format identical to a
> standard disk that's not plugged into a RAID card?

Yes.

> In other words, if the card fails, is it possible to take the disks and
> connect them directly to any non-RAID SATA/SAS motherboard? Or would you need
> a replacement card to read the data?

Yes.

--
Mikael Abrahamsson email: swmike@swm.pp.se

Re: potentially lost largeish raid5 array..

am 26.09.2011 12:51:01 von David Brown

On 26/09/2011 01:58, Stan Hoeppner wrote:
> On 9/25/2011 10:18 AM, David Brown wrote:
>> On 25/09/11 16:39, Stan Hoeppner wrote:
>>> On 9/25/2011 8:03 AM, David Brown wrote:

(Sorry for getting so off-topic here - if it is bothering anyone, please
say and I will stop. Also Stan, you have been extremely helpful, but if
you feel you've given enough free support to an ignorant user, I fully
understand it. But every answer leads me to new questions, and I hope
that others in this mailing list will also find some of the information
useful.)

>> Suppose you have an xfs filesystem with 10 allocation groups, mounted on
>> /mnt. You make a directory /mnt/a. That gets created in allocation group
>> 1. You make a second directory /mnt/b. That gets created in allocation
>> group 2. Any files you put in /mnt/a go in allocation group 1, and any
>> files in /mnt/b go in allocation group 2.
>
> You're describing the infrastructure first. You *always* start with the
> needs of the workload and build the storage stack to best meet those
> needs. You're going backwards, but I'll try to play your game.
>

I agree on your principle here - figure out what you need before trying
to build it. But here I am trying to understand what happens if I build
/this/ way.

>> Am I right so far?
>
> Yes. There are some corner cases but this is how a fresh XFS behaves. I
> should have stated before that my comments are based on using the
> inode64 mount option which is required to reach above 16TB, and which
> yields superior performance. The default mode, inode32, behaves a bit
> differently WRT allocation. It would take too much text to explain the
> differences here. You're better off digging into the XFS documentation
> at xfs.org.
>

I've heard there are some differences between XFS running under 32-bit
and 64-bit kernels. It's probably fair to say that any modern system
big enough to be looking at scaling across a raid linear concat would be
running on a 64-bit system, and using appropriate mkfs.xfs and mount
options for 64-bit systems. But it's helpful of you to point this out.

>> Then you create directories /mnt/a/a1 and /mnt/a/a2. Do these also go in
>> allocation group 1, or do they go in groups 3 and 4? Similarly, do files
>> inside them go in group 1 or in groups 3 and 4?
>
> Remember this is a filesystem. Think of a file cabinet. The cabinet is
> the XFS filesytsem, the drawers are the allocation groups, directories
> are manilla folders, and files are papers in the folders. That's exactly
> how the allocation works. Now, a single file will span more than 1 AG
> (drawer) if the file is larger than the free space available within the
> AG (drawer) when the file is created, or appended.
>
>> To take an example that is quite relevant to me, consider a mail server
>> handling two domains. You have (for example) /var/mail/domain1 and
>> /var/mail/domain2, with each user having a directory within either
>> domain1 or domain2. What I would like to know, is if the xfs filesystem
>> is mounted on /var/mail, then are the user directories spread across the
>> allocation groups, or are all of domain1 users in one group and all of
>> domain2 users in another group? If it is the former, then xfs on a
>> linear concat would scale beautifully - if it is the later, then it
>> would be pretty terrible scaling.
>
> See above for file placement.
>
> With only two top level directories you're not going to achieve good
> parallelism on an XFS linear concat. Modern delivery agents, dovecot for
> example, allow you to store each user mail directory independently,
> anywhere you choose, so this isn't a problem. Simply create a top level
> directory for every mailbox, something like:
>
> /var/mail/domain1.%user/
> /var/mail/domain2.%user/
>

Yes, that is indeed possible with dovecot.

To my mind, it is an unfortunate limitation that it is only top-level
directories that are spread across allocation groups, rather than all
directories. It means the directory structure needs to be changed to
suit the filesystem. In some cases, such as a dovecot mail server,
that's not a big issue. But in other cases it could be - it is a
somewhat artificial restraint in the way you organise your directories.
Of course, scaling across top-level directories is much better than no
scaling at all - and I'm sure the XFS developers have good reason for
organising the allocation groups in this way.

You have certainly answered my question now - many thanks. Now I am
clear how I need to organise directories in order to take advantage of
allocation groups. Even though I don't have any filesystems planned
that will be big enough to justify linear concats, spreading data across
allocation groups will spread the load across kernel threads and
therefore across processor cores, so it is important to understand it.
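
(Note to self: it looks like xfs_info is the quick way to see how a given
filesystem was laid out, e.g.

xfs_info /var/mail

assuming that is the mount point - the agcount= and agsize= values show up
in its meta-data section.)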

>>>>> Also note that a linear concat will only give increased performance
>>>>> with
>>>>> XFS, again for appropriate workloads. Using a linear concat with EXT3/4
>>>>> will give you the performance of a single spindle regardless of the
>>>>> total number of disks used. So one should stick with striped arrays
>>>>> for
>>>>> EXT3/4.
>>>
>>
>> I understand this, which is why I didn't comment earlier. I am aware
>> that only XFS can utilise the parts of a linear concat to improve
>> performance - my questions were about the circumstances in which XFS can
>> utilise the multiple allocation groups.
>
> The optimal scenario is rather simple. Create multiple top level
> directories and write/read files within all of them concurrently. This
> works best with highly concurrent workloads where high random IOPS is
> needed. This can be with small or large files.
>
> The large file case is transactional database specific, and careful
> planning and layout of the disks and filesystem are needed. In this case
> we span a single large database file over multiple small allocation
> groups. Transactional DB systems typically write only a few hundred
> bytes per record. Consider a large retailer point of sale application.
> With a striped array you would suffer the read-modify-write penalty when
> updating records. With a linear concat you simply directly update a
> single 4KB block.
>

When you are doing that, you would then use a large number of allocation
groups - is that correct?

References I have seen on the internet seem to be in two minds about
whether you should have many or a few allocation groups. On the one
hand, multiple groups let you do more things in parallel - on the other
hand, each group means more memory and overhead needed to keep track of
inode tables, etc. Certainly I see the point of having an allocation
group per part of the linear concat (or a multiple of the number of
parts), and I can see the point of having at least as many groups as you
have processor cores, but is there any point in having more groups than
that? I have read on the net about a size limitation of 4GB per group,
which would mean using more groups on a big system, but I get the
impression that this was a 32-bit limitation and that on a 64-bit system
the limit is 1 TB per group. Assuming a workload with lots of parallel
IO rather than large streams, are there any guidelines as to ideal
numbers of groups? Or is it better just to say that if you want the
last 10% out of a big system, you need to test it and benchmark it
yourself with a realistic test workload?

> XFS is extremely flexible and powerful. It can be tailored to yield
> maximum performance for just about any workload with sufficient
> concurrency.
>

I have also read that JFS uses allocation groups - have you any idea how
these compare to XFS, and whether it scales in the same way?



Re: potentially lost largeish raid5 array..

am 26.09.2011 21:52:53 von Stan Hoeppner

On 9/26/2011 5:51 AM, David Brown wrote:
> On 26/09/2011 01:58, Stan Hoeppner wrote:
>> On 9/25/2011 10:18 AM, David Brown wrote:
>>> On 25/09/11 16:39, Stan Hoeppner wrote:
>>>> On 9/25/2011 8:03 AM, David Brown wrote:
>
> (Sorry for getting so off-topic here - if it is bothering anyone, please
> say and I will stop. Also Stan, you have been extremely helpful, but if
> you feel you've given enough free support to an ignorant user, I fully
> understand it. But every answer leads me to new questions, and I hope
> that others in this mailing list will also find some of the information
> useful.)

I don't mind at all. I love 'talking shop' WRT storage architecture and
XFS. Others might, though, as we're very far OT at this point. The
proper place for this discussion is the XFS mailing list. There are
folks there far more knowledgeable than me who could thus answer your
questions more thoroughly, and correct me if I make an error.



> I've heard there are some differences between XFS running under 32-bit
> and 64-bit kernels. It's probably fair to say that any modern system big
> enough to be looking at scaling across a raid linear concat would be
> running on a 64-bit system, and using appropriate mkfs.xfs and mount
> options for 64-bit systems. But it's helpful of you to point this out.

It's not that straightforward. The default XFS layout for a 64bit Linux
system is inode32, not inode64 (at least up to 2011). This is for
compatibility. There is apparently still some commercial and FOSS
backup software in production that doesn't cope with 64 bit inodes. And
IIRC some time ago there was also an issue with the Linux NFS and other
code not understanding 64 bit inodes. Christoph is in a better position
to discuss this than I am as he is an XFS dev.
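
To be concrete, switching to 64 bit inodes is just a mount option, so it
is something like this (device and mount point are illustrative only):

mount -o inode64 /dev/md0 /srv/data

or the equivalent /etc/fstab entry:

/dev/md0  /srv/data  xfs  inode64  0  0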

>> With only two top level directories you're not going to achieve good
>> parallelism on an XFS linear concat. Modern delivery agents, dovecot for
>> example, allow you to store each user mail directory independently,
>> anywhere you choose, so this isn't a problem. Simply create a top level
>> directory for every mailbox, something like:
>>
>> /var/mail/domain1.%user/
>> /var/mail/domain2.%user/

> Yes, that is indeed possible with dovecot.
>
> To my mind, it is an unfortunate limitation that it is only top-level
> directories that are spread across allocation groups, rather than all
> directories. It means the directory structure needs to be changed to
> suit the filesystem.

That's because you don't yet fully understand how all this XFS goodness
works. Recall my comments about architecting the storage stack to
optimize the performance of a specific workload? Using an XFS+linear
concat setup is a tradeoff, just like anything else. To get maximum
performance you may need to trade some directory layout complexity for
that performance. If you don't want that complexity, simply go with a
plain striped array and use any directory layout you wish.

Striped arrays don't rely on directory or AG placement for performance
as does a linear concat array. However, because of the nature of a
striped array, you'll simply get less performance with the specific
workloads I've mentioned. This is because you will often generate many
physical IOs to the spindles per filesystem operation. With the linear
concat each filesystem IO generates one physical IO to one spindle.
Thus with a highly concurrent workload you get more real file IOPS than
with a striped array before the disks hit their head seek limit. There
are other factors as well, such as latency. Block latency will usually
be lower with a linear concat than with a striped array.

I think what you're failing to fully understand is the serious level of
flexibility that XFS provides, and the resulting complexity of
understanding required by the sysop. Other Linux filesystems offer zero
flexibility WRT optimizing for the underlying hardware layout. Because
of XFS' architecture one can tailor its performance characteristics to
many different physical storage architectures, including standard
striped arrays, linear concats, a combination of the two, etc, and
specific workloads. Again, an XFS+linear concat is a specific
configuration of XFS and the underlying storage, tailored to a specific
type of workload.

> In some cases, such as a dovecot mail server,
> that's not a big issue. But in other cases it could be - it is a
> somewhat artificial restraint in the way you organise your directories.

No, it's not a limitation, but a unique capability. See above.

> Of course, scaling across top-level directories is much better than no
> scaling at all - and I'm sure the XFS developers have good reason for
> organising the allocation groups in this way.

> You have certainly answered my question now - many thanks. Now I am
> clear how I need to organise directories in order to take advantage of
> allocation groups.

Again, this directory layout strategy only applies when using a linear
concat. It's not necessary with XFS atop a striped array. And it's
only a good fit for high concurrency high IOPS workloads.

> Even though I don't have any filesystems planned that
> will be big enough to justify linear concats,

A linear concat can be as small as 2 disks, even 2 partitions, 4 with
redundancy (2 mirror pairs). Maybe you meant workload here instead of
filesystem?

> spreading data across
> allocation groups will spread the load across kernel threads and
> therefore across processor cores, so it is important to understand it.

While this is true, and great engineering, it's only relevant on systems
doing large concurrent/continuous IO, as in multiple GB/s, given the
power of today's CPUs.

The XFS allocation strategy is brilliant, and simply beats the stuffing
out of all the other current Linux filesystems. It's time for me to
stop answering your questions, and time for you to read:

http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/index.html
http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html
http://xfs.org/docs/xfsdocs-xml-dev/XFS_Labs/tmp/en-US/html/index.html

If you have further questions after digesting these valuable resources,
please post them on the xfs mailing list:

http://oss.sgi.com/mailman/listinfo/xfs

Myself and others would be happy to respond.

>> The large file case is transactional database specific, and careful
>> planning and layout of the disks and filesystem are needed. In this case
>> we span a single large database file over multiple small allocation
>> groups. Transactional DB systems typically write only a few hundred
>> bytes per record. Consider a large retailer point of sale application.
>> With a striped array you would suffer the read-modify-write penalty when
>> updating records. With a linear concat you simply directly update a
>> single 4KB block.

> When you are doing that, you would then use a large number of allocation
> groups - is that correct?

Not necessarily. It's a balancing act. And it's a rather complicated
setup. To thoroughly answer this question will take far more list space
and time than I have available. And given your questions the maildir
example prompted, you'll have far more if I try to explain this setup.

Please read the docs I mentioned above. They won't directly answer this
question, but will allow you to answer it yourself after you digest the
information.

> References I have seen on the internet seem to be in two minds about
> whether you should have many or a few allocation groups. On the one
> hand, multiple groups let you do more things in parallel - on the other

More parallelism only to an extent. Disks are very slow. Once you have
enough AGs for your workload to saturate your drive head actuators,
additional AGs simply create a drag on performance due to excess head
seeking amongst all your AGs. Again, it's a balancing act.

> hand, each group means more memory and overhead needed to keep track of
> inode tables, etc.

This is irrelevant. The impact of these things is infinitely small
compared to the physical disk overhead caused by too many AGs.

> Certainly I see the point of having an allocation
> group per part of the linear concat (or a multiple of the number of
> parts), and I can see the point of having at least as many groups as you
> have processor cores, but is there any point in having more groups than
> that?

You should be realizing about now why most people call tuning XFS a
"Black art". ;) Read the docs about allocation groups.

> I have read on the net about a size limitation of 4GB per group,

You've read the wrong place, or old docs. The current AG size
limit is 1TB, has been for quite some time. It will be bumped up some
time in the future as disk sizes increase. The next limit will likely
be 4TB.
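
One practical consequence: a 16TB volume cannot have fewer than 16 AGs, so
something like

mkfs.xfs -d agcount=16 /dev/md0

(device name illustrative) already puts every AG at the 1TB ceiling, and
asking for fewer should simply be rejected by mkfs.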

> which would mean using more groups on a big system, but I get the
> impression that this was a 32-bit limitation and that on a 64-bit system

The AG size limit has nothing to do with the system instruction width.
It is an 'arbitrary' fixed size.

> the limit is 1 TB per group. Assuming a workload with lots of parallel
> IO rather than large streams, are there any guidelines as to ideal
> numbers of groups? Or is it better just to say that if you want the last
> 10% out of a big system, you need to test it and benchmark it yourself
> with a realistic test workload?

There are no general guidelines here, but for the mkfs.xfs defaults.
Coincidentally, recent versions of mkfs.xfs will read the mdadm config
and build the filesystem correctly, automatically, on top of striped md
raid arrays.
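
For a hypothetical 6-drive RAID6 with a 512KiB chunk (4 data spindles) the
manual equivalent would be roughly:

mkfs.xfs -d su=512k,sw=4 /dev/md0

and running xfs_info afterwards shows the sunit/swidth actually used, so
the automatic detection is easy to sanity-check.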

Other than that, there are no general guidelines, and especially none
for a linear concat. The reason is that all storage hardware acts a
little bit differently and each host/storage combo may require different
XFS optimizations for peak performance. Pre-production testing is
*always* a good idea, and not just for XFS. :)

Unless or until one finds that the mkfs.xfs defaults aren't yielding the
required performance, it's best not to peek under the hood, as you're
going to get dirty once you dive in to tune the engine. ;)

>> XFS is extremely flexible and powerful. It can be tailored to yield
>> maximum performance for just about any workload with sufficient
>> concurrency.

> I have also read that JFS uses allocation groups - have you any idea how
> these compare to XFS, and whether it scales in the same way?

I've never used JFS. AIUI it staggers along like a zombie, with one dev
barely maintaining it today. It seems there hasn't been real active
Linux JFS code work for about 7 years, since 2004, only a handful of
commits, all bug fixes IIRC. The tools package appears to have received
slightly more attention.

XFS sees regular commits to both experimental and stable trees, both bug
fixes and new features, with at least a dozen or so devs banging on it
at a given time. I believe there is at least one Red Hat employee
working on XFS full time, or nearly so. Christoph is a kernel dev who
works on XFS, and could give you a more accurate head count. Christoph?

BTW, this is my last post on this subject. It must move to the XFS
list, or die.

--
Stan

Re: potentially lost largeish raid5 array..

am 26.09.2011 22:29:25 von David Brown

On 26/09/11 21:52, Stan Hoeppner wrote:
> On 9/26/2011 5:51 AM, David Brown wrote:
>> On 26/09/2011 01:58, Stan Hoeppner wrote:
>>> On 9/25/2011 10:18 AM, David Brown wrote:
>>>> On 25/09/11 16:39, Stan Hoeppner wrote:
>>>>> On 9/25/2011 8:03 AM, David Brown wrote:
>>
>> (Sorry for getting so off-topic here - if it is bothering anyone, please
>> say and I will stop. Also Stan, you have been extremely helpful, but if
>> you feel you've given enough free support to an ignorant user, I fully
>> understand it. But every answer leads me to new questions, and I hope
>> that others in this mailing list will also find some of the information
>> useful.)
>
> I don't mind at all. I love 'talking shop' WRT storage architecture and
> XFS. Others might, though, as we're very far OT at this point. The proper
> place for this discussion is the XFS mailing list. There are folks there
> far more knowledgeable than me who could thus answer your questions more
> thoroughly, and correct me if I make an error.
>

I will stop after this post (at least, I will /try/ not to continue...).
I've got all the information I was looking for now, and if I need more
details I'll take your advice and look at the XFS mailing list. Before
that, though, I should really try it out a little first - I don't have
any need of a big XFS system at the moment, but it is on my list of
"experiments" to try some quiet evening.

>
>
>>
>> To my mind, it is an unfortunate limitation that it is only top-level
>> directories that are spread across allocation groups, rather than all
>> directories. It means the directory structure needs to be changed to
>> suit the filesystem.
>
> That's because you don't yet fully understand how all this XFS goodness
> works. Recall my comments about architecting the storage stack to
> optimize the performance of a specific workload? Using an XFS+linear
> concat setup is a tradeoff, just like anything else. To get maximum
> performance you may need to trade some directory layout complexity for
> that performance. If you don't want that complexity, simply go with a
> plain striped array and use any directory layout you wish.
>

I understand this - tradeoffs are inevitable. It's a shame that it is a
necessary tradeoff here. I can well see that in some cases (such as a
big dovecot server) the benefits of XFS + linear concat outweigh the
(real or perceived) benefits of a domain/user directory structure. But
that doesn't stop me wanting both!

> Striped arrays don't rely on directory or AG placement for performance
> as does a linear concat array. However, because of the nature of a
> striped array, you'll simply get less performance with the specific
> workloads I've mentioned. This is because you will often generate many
> physical IOs to the spindles per filesystem operation. With the linear
> concat each filesystem IO generates one physical IO to one spindle. Thus
> with a highly concurrent workload you get more real file IOPS than with
> a striped array before the disks hit their head seek limit. There are
> other factors as well, such as latency. Block latency will usually be
> lower with a linear concat than with a striped array.
>
> I think what you're failing to fully understand is the serious level of
> flexibility that XFS provides, and the resulting complexity of
> understanding required by the sysop. Other Linux filesystems offer zero
> flexibility WRT optimizing for the underlying hardware layout. Because
> of XFS' architecture one can tailor its performance characteristics to
> many different physical storage architectures, including standard
> striped arrays, linear concats, a combination of the two, etc, and
> specific workloads. Again, an XFS+linear concat is a specific
> configuration of XFS and the underlying storage, tailored to a specific
> type of workload.
>
>> In some cases, such as a dovecot mail server,
>> that's not a big issue. But in other cases it could be - it is a
>> somewhat artificial restraint in the way you organise your directories.
>
> No, it's not a limitation, but a unique capability. See above.
>

Well, let me rephrase - it is a unique capability, but it is limited to
situations where you can spread your load among many top-level directories.

>> Of course, scaling across top-level directories is much better than no
>> scaling at all - and I'm sure the XFS developers have good reason for
>> organising the allocation groups in this way.
>
>> You have certainly answered my question now - many thanks. Now I am
>> clear how I need to organise directories in order to take advantage of
>> allocation groups.
>
> Again, this directory layout strategy only applies when using a linear
> concat. It's not necessary with XFS atop a striped array. And it's only
> a good fit for high concurrency high IOPS workloads.
>

Yes, I understand that.

>> Even though I don't have any filesystems planned that
>> will be big enough to justify linear concats,
>
> A linear concat can be as small as 2 disks, even 2 partitions, 4 with
> redundancy (2 mirror pairs). Maybe you meant workload here instead of
> filesystem?
>

Yes, I meant workload :-)


>> spreading data across
>> allocation groups will spread the load across kernel threads and
>> therefore across processor cores, so it is important to understand it.
>
> While this is true, and great engineering, it's only relevant on systems
> doing large concurrent/continuous IO, as in multiple GB/s, given the
> power of today's CPUs.
>
> The XFS allocation strategy is brilliant, and simply beats the stuffing
> out of all the other current Linux filesystems. It's time for me to stop
> answering your questions, and time for you to read:
>
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/index.html
>
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html
>
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_Labs/tmp/en-US/html/index.html
>

These will keep me busy for a little while.

> If you have further questions after digesting these valuable resources,
> please post them on the xfs mailing list:
>
> http://oss.sgi.com/mailman/listinfo/xfs
>
> Myself and others would be happy to respond.
>
>>> The large file case is transactional database specific, and careful
>>> planning and layout of the disks and filesystem are needed. In this case
>>> we span a single large database file over multiple small allocation
>>> groups. Transactional DB systems typically write only a few hundred
>>> bytes per record. Consider a large retailer point of sale application.
>>> With a striped array you would suffer the read-modify-write penalty when
>>> updating records. With a linear concat you simply directly update a
>>> single 4KB block.
>
>> When you are doing that, you would then use a large number of allocation
>> groups - is that correct?
>
> Not necessarily. It's a balancing act. And it's a rather complicated
> setup. To thoroughly answer this question will take far more list space
> and time than I have available. And given your questions the maildir
> example prompted, you'll have far more if I try to explain this setup.
>
> Please read the docs I mentioned above. They won't directly answer this
> question, but will allow you to answer it yourself after you digest the
> information.
>

Fair enough - thanks for those links. And if I have more questions,
I'll try the XFS list - if nothing else, it will give you a break!

>> References I have seen on the internet seem to be in two minds about
>> whether you should have many or a few allocation groups. On the one
>> hand, multiple groups let you do more things in parallel - on the other
>
> More parallelism only to an extent. Disks are very slow. Once you have
> enough AGs for your workload to saturate your drive head actuators,
> additional AGs simply create a drag on performance due to excess head
> seeking amongst all your AGs. Again, it's a balancing act.
>
>> hand, each group means more memory and overhead needed to keep track of
>> inode tables, etc.
>
> This is irrelevant. The impact of these things is infinitely small
> compared to the physical disk overhead caused by too many AGs.
>

OK.

One of the problems with reading stuff on the net is that it is often
out of date, and there is no one checking the correctness of what is
published.

>> Certainly I see the point of having an allocation
>> group per part of the linear concat (or a multiple of the number of
>> parts), and I can see the point of having at least as many groups as you
>> have processor cores, but is there any point in having more groups than
>> that?
>
> You should be realizing about now why most people call tuning XFS a
> "Black art". ;) Read the docs about allocation groups.
>
>> I have read on the net about a size limitation of 4GB per group,
>
> You've read the wrong place, or old docs. The current AG size limit
> is 1TB, has been for quite some time. It will be bumped up some time in
> the future as disk sizes increase. The next limit will likely be 4TB.
>
>> which would mean using more groups on a big system, but I get the
>> impression that this was a 32-bit limitation and that on a 64-bit system
>
> The AG size limit has nothing to do with the system instruction width.
> It is an 'arbitrary' fixed size.
>

OK.

>> the limit is 1 TB per group. Assuming a workload with lots of parallel
>> IO rather than large streams, are there any guidelines as to ideal
>> numbers of groups? Or is it better just to say that if you want the last
>> 10% out of a big system, you need to test it and benchmark it yourself
>> with a realistic test workload?
>
> There are no general guidelines here, but for the mkfs.xfs defaults.
> Coincidentally, recent versions of mkfs.xfs will read the mdadm config
> and build the filesystem correctly, automatically, on top of striped md
> raid arrays.
>

Yes, I have read about that - very convenient.

> Other than that, there are no general guidelines, and especially none
> for a linear concat. The reason is that all storage hardware acts a
> little bit differently and each host/storage combo may require different
> XFS optimizations for peak performance. Pre-production testing is
> *always* a good idea, and not just for XFS. :)
>
> Unless or until one finds that the mkfs.xfs defaults aren't yielding the
> required performance, it's best not to peek under the hood, as you're
> going to get dirty once you dive in to tune the engine. ;)
>

I usually find that when I get a new server to play with, I start poking
around, trying different fancy combinations of filesystems and disk
arrangements, trying benchmarks, etc. Then I realise time is running
out before it all has to be in place, and I set up something reasonable
with default settings. Unlike a car engine, it's easy to put the system
back to factory condition with a reformat!

>>> XFS is extremely flexible and powerful. It can be tailored to yield
>>> maximum performance for just about any workload with sufficient
>>> concurrency.
>
>> I have also read that JFS uses allocation groups - have you any idea how
>> these compare to XFS, and whether it scales in the same way?
>
> I've never used JFS. AIUI it staggers along like a zombie, with one dev
> barely maintaining it today. It seems there hasn't been real active
> Linux JFS code work for about 7 years, since 2004, only a handful of
> commits, all bug fixes IIRC. The tools package appears to have received
> slightly more attention.
>

That's the impression I got too.

> XFS sees regular commits to both experimental and stable trees, both bug
> fixes and new features, with at least a dozen or so devs banging on it
> at a given time. I believe there is at least one Red Hat employee
> working on XFS full time, or nearly so. Christoph is a kernel dev who
> works on XFS, and could give you a more accurate head count. Christoph?
>
> BTW, this is my last post on this subject. It must move to the XFS list,
> or die.
>

Fair enough.

Many thanks for your help and your patience.


Re: potentially lost largeish raid5 array..

am 27.09.2011 01:28:15 von Krzysztof Adamski

On Mon, 2011-09-26 at 14:52 -0500, Stan Hoeppner wrote:
> Coincidentally, recent versions of mkfs.xfs will read the mdadm
> config
> and build the filesystem correctly, automatically, on top of striped
> md
> raid arrays.

If I have LVM on top of RAID6, will mkfs.xfs read the mdadm config?


Re: potentially lost largeish raid5 array..

am 27.09.2011 05:53:35 von Stan Hoeppner

On 9/26/2011 6:28 PM, Krzysztof Adamski wrote:
> On Mon, 2011-09-26 at 14:52 -0500, Stan Hoeppner wrote:
>> Coincidentally, recent versions of mkfs.xfs will read the mdadm
>> config
>> and build the filesystem correctly, automatically, on top of striped
>> md
>> raid arrays.
>
> If I have LVM on top of RAID6, will mkfs.xfs read the mdadm config?

I believe the correct answer is that it should, but it depends on a few
things. Support for this may be new enough that some/many distros won't
do it out of the box. You may need to build some code yourself, linked
with the proper libraries. See the following list thread for
background, which is a year old, so you may want to ask this question
directly on the XFS list, or see the XFS FAQ:

http://oss.sgi.com/archives/xfs/2010-09/msg00161.html
http://xfs.org/index.php/XFS_FAQ
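
If it turns out your mkfs.xfs cannot see through the LVM layer, you can
always pass the geometry by hand. For example, for a hypothetical 8-drive
RAID6 with a 512KiB chunk (6 data spindles), roughly:

mkfs.xfs -d su=512k,sw=6 /dev/vg0/lv0

substituting whatever your LV is actually called.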

--
Stan

Re: potentially lost largeish raid5 array..

am 02.10.2011 01:21:18 von Aapo Laine

Excuse the late reply

On 09/25/11 12:10, NeilBrown wrote:
> On Sat, 24 Sep 2011 23:57:58 +0200 Aapo Laine
> wrote:
> Hopefully not too shocking... I'm not planning on leaving md any time soon.
> I do still enjoy working on it.
> But it certainly isn't as fresh and new as it was 10 years ago. It would
> probably do both me and md a lot of good to have someone with new enthusiasm
> and new perspectives...

I understand

> ....
> That is certainly true, but seems to be true across much of the kernel, and
> probably most code in general (though I'm sure such a comment will lead to
> people wanting to tell me their favourite exceptions ... so I'll start with
> "TeX").
>
> This is one of the reasons I offered "mentoring" to any likely candidate.
>
>
>
>> - there is not much explanation of overall strategies, or the structure
>> of code. Also the flow of data between the functions is not much
>> explained most of the times.
>>
>> - It's not obvious to me what is the entry point for kernel processes
>> related to MD arrays, how are they triggered and where do they run...
>> E.g. in the past I tried to understand how did resync work, but I
>> couldn't. I thought there was a kernel process controlling resync
>> advancement, but I couldn't really find the boundaries of code inside
>> which it was executing.
> md_do_sync() is the heart of the resync process. It calls into the
> personality's sync_request() function.
>
> The kernel thread is started by md_check_recovery() if it appears to be
> needed. md_check_recovery() is regularly run by each personality's main
> controlling thread.
>
>> - it's not clear what the various functions do or in what occasion they
>> are called. Except from their own name, most of them have no comments
>> just before the definition.
> How about this:
> - you identify some functions for which the purpose or use isn't clear
> - I'll explain to you when/how/why they are used
> - You create a patch which adds comments which explains it all
> - I'll apply that patch.
>
> deal??

I would never dare to commit your comments back to you with my name, not
even if you ask me to do so :-)

But thanks for offering to explain the things.
If you do this on the mailing list everybody is going to read that and
this will be useful to everybody.

Actually we could even open a new mailing list or maybe there is
something better in web 2.0, like a wiki, for these MD code
explanations. Explanations in a wiki can be longer, there can be user
discussions, and such lines of comments do not need to be pushed as far
as Linus.


>> - the algoritms within the functions are very long and complex, but only
>> rarely they are explained by comments. I am now seeing pieces having 5
>> levels of if's nested one inside the other, and there are almost no
>> comments.
> I feel your pain. I really should factor out the deeply nested levels into
> separate functions. Sometimes I have done that but there is plenty more do
> to. Again, I would be much more motivated to do this if I were working with
> someone who would be directly helped by it. So if you identify specific
> problems, it'll be a lot easier for me to help fix them.

Well, in fact I am not against deeply nested code; actually I am more
against functions which are called from one point only, because that
disrupts the reading flow greatly.
But what I wanted to say was that if the nesting level is 5, it means the
code is complex, so more comments are always appreciated.


>> - last but not least, variables have very short names, and for most of
>> them, it is not explained what they mean. This is mostly for local
>> variables, but sometimes even for the structs which go into metadata
>> e.g. in struct r1_private_data_s most members do not have an
>> explanation. This is pretty serious, to me at least, for understanding
>> the code.
> Does this help?
>
> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
> index e0d676b..feb44ad 100644
> --- a/drivers/md/raid1.h
> +++ b/drivers/md/raid1.h
> @@ -28,42 +28,67 @@ struct r1_private_data_s {
> mddev_t *mddev;
> mirror_info_t *mirrors;
> int raid_disks;
> +
> > + /* When choosing the best device for a read (read_balance())
> > + * we try to keep sequential reads on the same device
> + * using 'last_used' and 'next_seq_sect'
> + */
> int last_used;
> sector_t next_seq_sect;
> + /* During resync, read_balancing is only allowed on the part
> + * of the array that has been resynced. 'next_resync' tells us
> + * where that is.
> + */
> + sector_t next_resync;
> +
> spinlock_t device_lock;
>
> + /* list of 'r1bio_t' that need to be processed by raid1d, whether
> + * to retry a read, writeout a resync or recovery block, or
> + * anything else.
> + */
> struct list_head retry_list;
> - /* queue pending writes and submit them on unplug */
> - struct bio_list pending_bio_list;
>
> - /* for use when syncing mirrors: */
> + /* queue pending writes to be submitted on unplug */
> + struct bio_list pending_bio_list;
>
> + /* for use when syncing mirrors:
> + * We don't allow both normal IO and resync/recovery IO at
> + * the same time - resync/recovery can only happen when there
> + * is no other IO. So when either is active, the other has to wait.
> + * See more details description in raid1.c near raise_barrier().
> + */
> + wait_queue_head_t wait_barrier;
> spinlock_t resync_lock;
> int nr_pending;
> int nr_waiting;
> int nr_queued;
> int barrier;
> - sector_t next_resync;
> - int fullsync; /* set to 1 if a full sync is needed,
> - * (fresh device added).
> - * Cleared when a sync completes.
> - */
> - int recovery_disabled; /* when the same as
> - * mddev->recovery_disabled
> - * we don't allow recovery
> - * to be attempted as we
> - * expect a read error
> - */
>
> - wait_queue_head_t wait_barrier;
> + /* Set to 1 if a full sync is needed, (fresh device added).
> + * Cleared when a sync completes.
> + */
> > + int fullsync;
>
> - struct pool_info *poolinfo;
> + /* When the same as mddev->recovery_disabled we don't allow
> + * recovery to be attempted as we expect a read error.
> + */
> + int recovery_disabled;
>
> - struct page *tmppage;
>
> + /* poolinfo contains information about the content of the
> + * mempools - it changes when the array grows or shrinks
> + */
> + struct pool_info *poolinfo;
> mempool_t *r1bio_pool;
> mempool_t *r1buf_pool;
>
> + /* temporary buffer to synchronous IO when attempting to repair
> + * a read error.
> + */
> + struct page *tmppage;
> +
> +
> /* When taking over an array from a different personality, we store
> * the new thread here until we fully activate the array.
> */

YES!
thanks

>
> I'm not planning on leaving - not for quite some time anyway.
> But I know the code so well that it is hard to see which bits need
> documenting, and what sort of documentation would really help.
> I would love it if you (or anyone) would review the code and point to parts
> that particularly need improvement.

I don't have much time to dedicate to this but if I come across
something not clear I will ask, now that I know I can.
Other people hopefully would do the same so that I don't feel too stupid.

I think concentrating all code explanation requests, discussions, and/or
at least the answers, onto something like a wiki could prevent
double-questions. Unfortunately I know nothing about wikis or other web
2.0 technologies. Maybe someone in this ML can suggest a solution?


> ...
> Thanks for your valuable feedback.
> Being able to see problems is of significant value. One of the reasons that
> I pay close attention to this list is because it shows me where the problems
> with md and mdadm are. People often try things that I would never even dream
> of trying (because I know they won't work). See this helps me know where the
> code and be improved - either so what they try does work, or so it fails more
> gracefully and helpfully.

I understand
Thanks for your time, really


Re: potentially lost largeish raid5 array..

am 02.10.2011 19:00:52 von Aapo Laine

On 10/02/11 01:21, Aapo Laine wrote:
> Actually we could even open a new mailing list or maybe there is
> something better in web 2.0, like a wiki, for these MD code
> explanations. Explanations in a wiki can be longer, there can be user
> discussions, and such lines of comments do not need to be pushed as
> far as Linus.

Sorry, I wrote this before reading other replies. Kristleifur already
proposed what might be optimal, that is, GitHub

> I think concentrating all code explanation requests, discussions,
> and/or at least the answers, onto something like a wiki could prevent
> double-questions. Unfortunately I know nothing about wikis or other
> web 2.0 technologies. Maybe someone in this ML can suggest a solution?

Ditto

+1 on the github

Re: potentially lost largeish raid5 array..

am 05.10.2011 04:06:46 von NeilBrown

On Sun, 02 Oct 2011 01:21:18 +0200 Aapo Laine
wrote:

> >> - it's not clear what the various functions do or in what occasion they
> >> are called. Except from their own name, most of them have no comments
> >> just before the definition.
> > How about this:
> > - you identify some functions for which the purpose or use isn't clear
> > - I'll explain to you when/how/why they are used
> > - You create a patch which adds comments which explains it all
> > - I'll apply that patch.
> >
> > deal??
>
> I would never dare to commit your comments back to you with my name, not
> even if you ask me to do so :-)

There is absolutely nothing wrong with taking someone's work, adding value,
and contributing it. You would obviously give credit to the original author.
This is what open/free licensing is all about.

There is a big difference between an answer to a question and a piece of clear,
coherent documentation. The latter can certainly use text from the former,
but it also involves thinking about what documentation is needed, where to
put it, how to present it etc. There is real value in doing that.


>
> But thanks for offering to explain the things.
> If you do this on the mailing list everybody is going to read that and
> this will be useful to everybody.
>
> Actually we could even open a new mailing list or maybe there is
> something better in web 2.0, like a wiki, for these MD code
> explanations. Explanations in a wiki can be longer, there can be user
> discussions, and such lines of comments do not need to be pushed as far
> as Linus.

http://raid.wiki.kernel.org

might be an appropriate place ... once kernel.org is fully functional again.

One of the problems with documentation is that it quickly goes out of date.
The best chance of keeping it up-to-date is to have it in the source tree
with the code.
>
> >> - last but not least, variables have very short names, and for most of
> >> them, it is not explained what they mean. This is mostly for local
> >> variables, but sometimes even for the structs which go into metadata
> >> e.g. in struct r1_private_data_s most members do not have an
> >> explanation. This is pretty serious, to me at least, for understanding
> >> the code.
> > Does this help?
> >
> > diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
.....
>=20
> YES!
> thanks

I've added that patch to my queue - it should go through in the next merge
window.

NeilBrown



Re: potentially lost largeish raid5 array..

am 05.10.2011 04:13:16 von NeilBrown

On Sun, 02 Oct 2011 19:00:52 +0200 Aapo Laine
wrote:

> On 10/02/11 01:21, Aapo Laine wrote:
> > Actually we could even open a new mailing list or maybe there is
> > something better in web 2.0, like a wiki, for these MD code
> > explanations. Explanations in a wiki can be longer, there can be user
> > discussions, and such lines of comments do not need to be pushed as
> > far as Linus.
>
> Sorry, I wrote this before reading other replies. Kristleifur already
> proposed what might be optimal, that is, GitHub

Anyone who wants to is welcome to create a kernel tree on github, apply
patches, and send me pull requests. Or just email me patches.
However the patches arrive, I will need to review them, at least until I have
enough experience with the person to have good reason to trust their patches.
Typically the extent of review will drop as a history of good patches grows
until I just pull any that looks vaguely credible.

I don't think we need any new infrastructure to allow this. We just need
people with an interest to create and submit patches.
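
For what it's worth the mechanics are nothing special; something like the
following, with the repository, branch and subject names all made up:

git clone git://github.com/yourname/linux.git
cd linux
git checkout -b md-comments
# edit drivers/md/raid1.h, drivers/md/md.c, ...
git commit -as -m "md/raid1: document the fields of r1_private_data_s"
git format-patch -1
git send-email --to linux-raid@vger.kernel.org 0001-*.patch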

If someone says "I've created some patches but they aren't getting into
mainline because ...." for some reason, maybe because I keep ignoring them or
losing them in my mail inbox or something, then it might be time to look at
technology solutions to make it easier for new contributors to get patches
in. But until those new contributors are trying there isn't much point
making it easier for them.

Thanks,
NeilBrown



>
> > I think concentrating all code explanation requests, discussions,
> > and/or at least the answers, onto something like a wiki could prevent
> > double-questions. Unfortunately I know nothing about wikis or other
> > web 2.0 technologies. Maybe someone in this ML can suggest a solution?
>
> Ditto
>
> +1 on the github

