Can reading a raid drive trigger all the other drives in that set?

on 28.06.2011 18:22:05 by Marc MERLIN

I have ext4 over LVM2 on a software RAID5, running kernel 2.6.39.1.

In order to save power, I have my drives spin down.

When I access my filesystem mount point, I get hangs of 30 seconds or a bit
more as each drive is woken up serially, one after the other.

Is there any chance of a patch in the block layer so that when it gets a
read on a block after a certain idle timeout, it does one dummy read on all
the other drives in parallel, so that all the drives spin back up at the
same time rather than serially?

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: Can reading a raid drive trigger all the other drives in that set?

on 02.09.2011 03:23:27 by Marc MERLIN

On Tue, Jun 28, 2011 at 09:22:05AM -0700, Marc MERLIN wrote:
> I have ext4 over LVM2 on a software RAID5, running kernel 2.6.39.1.
>
> In order to save power, I have my drives spin down.
>
> When I access my filesystem mount point, I get hangs of 30 seconds or a bit
> more as each drive is woken up serially, one after the other.
>
> Is there any chance of a patch in the block layer so that when it gets a
> read on a block after a certain idle timeout, it does one dummy read on all
> the other drives in parallel, so that all the drives spin back up at the
> same time rather than serially?

Ok, so the lack of answer probably means 'no' :)

Given that, is there a user space way to do this?
I'm thinking I might be able to poll drives every second to see if they
were spun down and got an IO. If any drive gets an IO, then the other
ones all get a dummy read, although I'd have to make sure that read is
random so that it can't be in the cache.

I take it there is no such code in existence yet, correct? :)
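
The polling idea could be sketched roughly as follows. This is only an
illustration under assumptions: the function names are made up, and it
assumes that summing "reads completed", "writes completed" and "I/Os in
flight" from /proc/diskstats changes as soon as a request is submitted to
the watched device. Whether a request that is blocked on a sleeping drive
already shows up there would need to be checked on the kernel in use.

```shell
#!/bin/sh
# Sketch only: the choice of counters is an assumption, not from the thread.

# Print an activity counter for one block device from a /proc/diskstats-style
# file: reads completed ($4) + writes completed ($8) + I/Os in flight ($12).
activity_count() {
    dev=$1
    stats=${2:-/proc/diskstats}
    awk -v d="$dev" '$3 == d { print $4 + $8 + $12 }' "$stats"
}

# Poll once per second and run the given command whenever the counter
# changes. Runs until interrupted.
watch_device() {
    dev=$1; shift
    last=$(activity_count "$dev")
    while sleep 1; do
        now=$(activity_count "$dev")
        if [ "$now" != "$last" ]; then
            "$@"
        fi
        last=$now
    done
}
```

For example, watch_device md5 echo "md5 was accessed" would print a line
each time the counter for md5 moves; the echo would be replaced by whatever
issues the dummy reads.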

Thanks,
Marc

Re: Can reading a raid drive trigger all the other drives in that set?

on 02.09.2011 23:28:21 by Doug Dumitru

What you are looking to do is not really what RAID is about.
Essentially, the drive wakeup delay is a side effect that the RAID layer
cannot optimize for, because it is not aware of the event. Then again, the
drive does this invisibly, so no software is really aware of it.

You "could" fix this with a "filter" plug-in. Basically, you could
write a device-mapper plug-in that watches IO and, after a sufficiently long
pause, kicks off dummy reads so that all the drives wake up together. In
terms of code, the module would probably take less than 300 lines.

Writing a device mapper plug-in is not that hard (see dm-zero.c for a
hello-world example), but it is kernel code and does require a pretty
good understanding of the BIO structure and how things flow. If you
had such a module, you would load it with a dmsetup command and then
use the 2nd mapper device instead of /dev/mdX.
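
As a rough illustration of the dmsetup step only: no wake-up target exists,
so the sketch below uses the stock "linear" target as a stand-in, which
simply passes all IO through to the underlying device. The function name and
the mapper device name md5_filtered are made up for the example; a real
filter module would supply its own target name and arguments in place of
"linear".

```shell
#!/bin/sh
# Build a device-mapper table line that maps the whole of an underlying
# device 1:1. "linear" is a real stock target used here as a stand-in for
# the (hypothetical) wake-up filter target.
dm_table_line() {
    dev=$1        # underlying device, e.g. /dev/md5
    sectors=$2    # device length in 512-byte sectors
    printf '0 %s linear %s 0\n' "$sectors" "$dev"
}

# Usage (as root), assuming /dev/md5 is the array:
#   dm_table_line /dev/md5 "$(blockdev --getsz /dev/md5)" | dmsetup create md5_filtered
# The new device then appears as /dev/mapper/md5_filtered.
```

dmsetup reads the table from standard input when no table file is given,
and blockdev --getsz reports the size in 512-byte sectors, which is the unit
the table expects.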

If this is an important, i.e. commercial, issue that is costing you
money, then you might want to pursue this. If not, then you probably
need to just disable drive spin-downs or live with it.

It is also possible that you could appeal to someone with some free
time (i.e. not me) who could do this as a "green" project. Anyone who
has worked anywhere near the device-mapper or software-raid layers
should be able to throw something together.

Doug Dumitru
EasyCo LLC


Re: Can reading a raid drive trigger all the other drives in that set?

on 03.09.2011 07:13:13 by Marc MERLIN

On Fri, Sep 02, 2011 at 02:28:21PM -0700, Doug Dumitru wrote:
> If this is an important, i.e. commercial, issue that is costing you
> money, then you might want to pursue this. If not, then you probably
> need to just disable drive spin-downs or live with it.

That was just for home use, where I like to save the 60W or so that I get
from having my drives spun down by default, but the wait of up to 30 seconds
for all of them to spin up one by one feels a bit punitive sometimes :)
I have been "living with it" for a while, but figured that we could do
better.
Thanks for pointing out the plug-in solution, I'll look at that in my
copious spare time [tm] :) in addition to the user space solution.
(I already had to write code to watch drive activity and manually spin
them down due to very ill-designed firmware:
http://marc.merlins.org/perso/linux/post_2010-08-03_Spinning-Down-WD20EADS-Drives-and-Fixing-Load-Cycle.html
)

> It is also possible that you could appeal to someone with some free
> time (i.e. not me) who could do this as a "green" project. Anyone who
> has worked anywhere near the device-mapper or software-raid layers
> should be able to throw something together.

Understood. If anyone ever picks up this message and works on it, please
let me know, but I do understand that pretty much everyone already has
their own priorities.

Thanks for the answer,
Marc

Re: Can reading a raid drive trigger all the other drives in that set?

on 25.09.2011 00:33:46 by Marc MERLIN

Mark/Tejun et al., my issue may be linked to the fact that I'm using a port
multiplier for my drives. Please let me know if that might be the case.

I'm not quite sure what's going on, but I have 2 sets of 5 drives. On the
ST drives, reads from drives in sleep mode can happen in parallel (i.e. all
the drives spin up in parallel), whereas the WDC drives seem to hang the
kernel block layer, so that the next drive is not read and spun up until the
previous one has finished spinning up.

Is that possible?
If not, any idea what's going on?

For what it's worth, all the drives are on the same SIL PMP plugged into the
same Marvell SATA card.

On Fri, Sep 02, 2011 at 02:28:21PM -0700, Doug Dumitru wrote:
> On Thu, Sep 1, 2011 at 6:23 PM, Marc MERLIN wrote:
> > > I have ext4 over lvm2 on a sw raid5 with 2.6.39.1
> > >
> > > In order to save power I have my drives spin down.
> > >
> > > When I access my filesystem mount point, I get hangs of 30sec or a bit more
> > > as each and every drive is woken up serially.
> > >
> > > Is there any chance to put a patch in the block layer so that when it gets a
> > > read on a block after a certain timeout, it just does one dummy read on all
> > > the other drives in parallel so that all the drives have a chance to spin
> > > back up at the same time and not serially?
> >
> > Ok, so the lack of answer probably means 'no' :)
> >
> > Given that, is there a user space way to do this?
> > I'm thinking I might be able to poll drives every second to see if they
> > were spun down and got an IO. If any drive gets an IO, then the other
> > ones all get a dummy read, although I'd have to make sure that read is
> > random so that it can't be in the cache.
>
> What you are looking to do is not really what raid is all about.
> Essentially, the side effect of a drive wakeup is non optimal in that
> the raid layer is not aware of this event. Then again, the drive does
> this invisibly, so no software is really aware.
>
> You "could" fix this with a "filter" plug-in. Basically, you could
> write a device mapper plug-in that watched IO and after some length of
> pause kicked off dummy reads so that all drives would wake up. In
> terms of code, this would probably be less than 300 lines to implement
> the module.
>
> Writing a device mapper plug-in is not that hard (see dm-zero.c for a
> hello-world example), but it is kernel code and does require a pretty
> good understanding of the BIO structure and how things flow. If you
> had such a module, you would load it with a dmsetup command and then
> use the 2nd mapper device instead of /dev/mdX.

I just had a little time to work on what I thought would be the userspace
solution to this.

Please have a quick look at:
http://marc.merlins.org/linux/scripts/swraidwakeup
Basically, I use
iostat -z 1
to detect access to /dev/md5 and then read a random sector from all its
drives in parallel.

The idea is of course to trigger a spin-up of all the drives in parallel, as
opposed to having the raid block layer wait serially for the first drive,
then the second, then the third, etc...

My script outputs what it does, and I can tell that when I access the raid
while the drives are sleeping, these 5 commands are sent at the same time:
dd if=/dev/sdh of=/dev/null bs=1024 ibs=1024 skip=304955122 count=1 2>/dev/null &
dd if=/dev/sdi of=/dev/null bs=1024 ibs=1024 skip=32879776 count=1 2>/dev/null &
dd if=/dev/sdj of=/dev/null bs=1024 ibs=1024 skip=214592398 count=1 2>/dev/null &
dd if=/dev/sdk of=/dev/null bs=1024 ibs=1024 skip=128138452 count=1 2>/dev/null &
dd if=/dev/sdl of=/dev/null bs=1024 ibs=1024 skip=397070851 count=1 2>/dev/null &
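
For readers without access to the script, the parallel wake-up step could
look roughly like the sketch below. This is not the swraidwakeup script
itself: the function names are made up, and the use of /dev/urandom and
blockdev --getsize64 to pick a random 1 KiB block is an assumption about one
reasonable way to do it.

```shell
#!/bin/sh
# Pick a random block number in the range 0 .. $1 - 1.
random_block() {
    limit=$1
    # 4 random bytes from the kernel, printed as one unsigned decimal number.
    r=$(od -An -N4 -tu4 /dev/urandom | tr -d ' \n')
    echo $(( r % limit ))
}

# Read one random 1 KiB block from every listed device, all started in the
# background at once, then wait for all the reads to finish.
wake_all() {
    for dev in "$@"; do
        blocks=$(( $(blockdev --getsize64 "$dev") / 1024 ))
        dd if="$dev" of=/dev/null bs=1024 \
           skip="$(random_block "$blocks")" count=1 2>/dev/null &
    done
    wait
}
```

It would be called as wake_all /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl
(as root). The random offset only makes a page cache hit unlikely; GNU dd's
iflag=direct would bypass the page cache altogether, at the cost of
alignment requirements on the block size.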

I'm working with 2 sets of drives:
/dev/sdc: ST3500630AS: 34°C
/dev/sdd: ST3500630AS: 35°C
/dev/sde: ST3750640AS: 36°C
/dev/sdf: ST3500630AS: 36°C
/dev/sdg: ST3500630AS: 36°C

/dev/sdh: WDC WD20EARS-00MVWB0: 38°C
/dev/sdi: WDC WD20EADS-00W4B0: 38°C
/dev/sdj: WDC WD20EADS-00S2B0: 45°C
/dev/sdk: WDC WD20EADS-00R6B0: 41°C
/dev/sdl: WDC WD20EADS-00R6B0: 41°C

(I use hddtemp since it's a handy way to see if the drive is sleeping or
not without waking it up).

On my raidset with the Seagate drives, they spin up in about 7 seconds, all
at the same time.

Here's an example wakeup with 4 drives sleeping and one awake:
/usr/bin/time -f 'sdc: %E secs' dd if=/dev/sdc of=/dev/null bs=1024 ibs=1024 skip=227835482 count=1 2>&1 | grep -Ev '(records|copied)' &
/usr/bin/time -f 'sdd: %E secs' dd if=/dev/sdd of=/dev/null bs=1024 ibs=1024 skip=158569697 count=1 2>&1 | grep -Ev '(records|copied)' &
/usr/bin/time -f 'sde: %E secs' dd if=/dev/sde of=/dev/null bs=1024 ibs=1024 skip=244180302 count=1 2>&1 | grep -Ev '(records|copied)' &
/usr/bin/time -f 'sdf: %E secs' dd if=/dev/sdf of=/dev/null bs=1024 ibs=1024 skip=257519832 count=1 2>&1 | grep -Ev '(records|copied)' &
/usr/bin/time -f 'sdg: %E secs' dd if=/dev/sdg of=/dev/null bs=1024 ibs=1024 skip=248812549 count=1 2>&1 | grep -Ev '(records|copied)' &
sdg: 0:00.01 secs
sdc: 0:07.56 secs
sdf: 0:07.60 secs
sdd: 0:07.78 secs
sde: 0:07.89 secs

On my other raid, my code still runs the 5 dd commands at the same time, but
the block layer seems to run them sequentially even though they were
scheduled at the same time.

1) Does that make sense?
2) Could that be related to the fact that the drives are on a port multiplier?
3) If so, why is it affecting the WDC drives but not the ST drives? Do the WDC
   drives hang the kernel when issued a command while in sleep mode, but not
   the ST drives?

/usr/bin/time -f 'sdh: %E secs' dd if=/dev/sdh of=/dev/null bs=1024 ibs=1024 skip=31905054 count=1 2>&1 | grep -Ev '(records|copied)' &
/usr/bin/time -f 'sdi: %E secs' dd if=/dev/sdi of=/dev/null bs=1024 ibs=1024 skip=261665955 count=1 2>&1 | grep -Ev '(records|copied)' &
/usr/bin/time -f 'sdj: %E secs' dd if=/dev/sdj of=/dev/null bs=1024 ibs=1024 skip=244694085 count=1 2>&1 | grep -Ev '(records|copied)' &
/usr/bin/time -f 'sdk: %E secs' dd if=/dev/sdk of=/dev/null bs=1024 ibs=1024 skip=323059576 count=1 2>&1 | grep -Ev '(records|copied)' &
/usr/bin/time -f 'sdl: %E secs' dd if=/dev/sdl of=/dev/null bs=1024 ibs=1024 skip=286720059 count=1 2>&1 | grep -Ev '(records|copied)' &
sdh: 0:06.91 secs
sdi: 0:10.38 secs
sdk: 0:20.82 secs
sdl: 0:31.29 secs
sdj: 0:31.91 secs
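
To make the difference between the two sets easier to see, the timing lines
can be turned into gaps between consecutive completions. The helper below is
only an illustrative sketch (the function name is made up); it expects lines
in the 'sdX: M:SS.ss secs' format printed above, in completion order.

```shell
#!/bin/sh
# Read lines like "sdh: 0:06.91 secs" and print, for each drive, its
# completion time in seconds and the gap since the previous completion.
completion_gaps() {
    awk '{
        split($2, t, ":")               # "0:06.91" -> minutes, seconds
        secs = t[1] * 60 + t[2]
        printf "%s %.2f +%.2f\n", $1, secs, secs - prev
        prev = secs
    }'
}
```

Fed the five WDC lines, the gaps after the first drive come out at 3.47,
10.44, 10.47 and 0.62 seconds. For the Seagate set, once the first sleeping
drive has answered after about 7.5 seconds, the remaining gaps are all under
0.2 seconds.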

Thanks,
Marc
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html