md question re: max_hw_sectors_kb

am 03.05.2011 22:20:46 von Michael Reed

Resending to linux-raid.


Hi Neil,

My name is Mike Reed. I work at SGI. I've been looking at a performance
issue with md writes and have tracked it down to the md device not having
a high enough initial max_hw_sectors_kb setting.

There is code in blk_queue_make_request() which lowers the default value
from INT_MAX to BLK_SAFE_MAX_SECTORS, which is 255. This is generally
lower than all the underlying devices with which I use md.

As md appears to be a stacking driver, i.e., it calls disk_stack_limits()
for each member of a volume, it would seem reasonable for md to use the
INT_MAX setting for max_hw_sectors_kb instead of BLK_SAFE_MAX_SECTORS.

I have tried this, and have observed that md correctly limits the md device's
max_sectors_kb to the value of the underlying devices in my mirror volume.

Is this the correct way to address this issue?

Applies to linux-2.6.39-rc4.

Signed-off-by: Michael Reed


--- drivers/md/md.c 2011-04-18 21:26:00.000000000 -0700
+++ drivers/md/md.c.new 2011-04-21 15:53:11.452536201 -0700
@@ -4328,6 +4328,7 @@ static int md_alloc(dev_t dev, char *nam
mddev->queue->queuedata = mddev;

blk_queue_make_request(mddev->queue, md_make_request);
+ blk_queue_max_sectors(mddev->queue, INT_MAX);

disk = alloc_disk(1 << shift);
if (!disk) {


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: md question re: max_hw_sectors_kb

am 04.05.2011 19:58:05 von martin.petersen

>>>>> "Michael" == Michael Reed writes:

Michael> There is code in blk_queue_make_request() which lowers the
Michael> default value from INT_MAX to BLK_SAFE_MAX_SECTORS, which is
Michael> 255. This is generally lower than all the underlying devices
Michael> with which I use md.

Yeah, the SAFE value is there to appease legacy low-level drivers.


Michael> As md appears to be a stacking driver, i.e., it calls
Michael> disk_stack_limits() for each member of a volume, it would seem
Michael> reasonable for md to use the INT_MAX setting for
Michael> max_hw_sectors_kb instead of BLK_SAFE_MAX_SECTORS.

Your fix is functionally correct. However, another case just popped up
this week where we need to distinguish between stacking driver and LLD
defaults. So I think we should try to handle this at the block layer
instead of explicitly tweaking this knob in MD.

I'll get this fixed up and will CC: you on the patch.

--
Martin K. Petersen Oracle Linux Engineering

Re: md question re: max_hw_sectors_kb

am 04.05.2011 20:08:02 von Bernd Schubert

On 05/04/2011 07:58 PM, Martin K. Petersen wrote:
>>>>>> "Michael" == Michael Reed writes:
>
> Michael> There is code in blk_queue_make_request() which lowers the
> Michael> default value from INT_MAX to BLK_SAFE_MAX_SECTORS, which is
> Michael> 255. This is generally lower than all the underlying devices
> Michael> with which I use md.
>
> Yeah, the SAFE value is there to appease legacy low-level drivers.
>
>
> Michael> As md appears to be a stacking driver, i.e., it calls
> Michael> disk_stack_limits() for each member of a volume, it would seem
> Michael> reasonable for md to use the INT_MAX setting for
> Michael> max_hw_sectors_kb instead of BLK_SAFE_MAX_SECTORS.
>
> Your fix is functionally correct. However, another case just popped up
> this week where we need to distinguish between stacking driver and LLD
> defaults. So I think we should try to handle this at the block layer
> instead of explicitly tweaking this knob in MD.
>
> I'll get this fixed up and will CC: you on the patch.
>

Could you please CC me or linux-raid as well? I have a similar patch for
raid1, and while I'm not working on Lustre anymore, I know that Lustre
also always has another kernel patch for raid5...


Thanks,
Bernd

Re: md question re: max_hw_sectors_kb

am 06.05.2011 06:24:34 von NeilBrown

On Tue, 03 May 2011 15:20:46 -0500 Michael Reed wrote:

> Resending to linux-raid.
>
>
> Hi Neil,
>
> My name is Mike Reed. I work at SGI. I've been looking at a performance
> issue with md writes and have tracked it down to the md device not having
> a high enough initial max_hw_sectors_kb setting.
>
> There is code in blk_queue_make_request() which lowers the default value
> from INT_MAX to BLK_SAFE_MAX_SECTORS, which is 255. This is generally
> lower than all the underlying devices with which I use md.
>
> As md appears to be a stacking driver, i.e., it calls disk_stack_limits()
> for each member of a volume, it would seem reasonable for md to use the
> INT_MAX setting for max_hw_sectors_kb instead of BLK_SAFE_MAX_SECTORS.
>
> I have tried this, and have observed that md correctly limits the md device's
> max_sectors_kb to the value of the underlying devices in my mirror volume.
>
> Is this the correct way to address this issue?
>
> Applies to linux-2.6.39-rc4.
>
> Signed-off-by: Michael Reed
>
>
> --- drivers/md/md.c 2011-04-18 21:26:00.000000000 -0700
> +++ drivers/md/md.c.new 2011-04-21 15:53:11.452536201 -0700
> @@ -4328,6 +4328,7 @@ static int md_alloc(dev_t dev, char *nam
> mddev->queue->queuedata = mddev;
>
> blk_queue_make_request(mddev->queue, md_make_request);
> + blk_queue_max_sectors(mddev->queue, INT_MAX);
>
> disk = alloc_disk(1 << shift);
> if (!disk) {
>

Hi Mike,
sorry for not responding to this earlier - it seemed to keep falling through
the cracks for some reason :-(

Yes, I completely agree with your analysis and think your patch is correct and
useful.
I will queue it up for the next merge window.

thanks,
NeilBrown

Re: md question re: max_hw_sectors_kb

am 06.05.2011 06:40:15 von NeilBrown

On Fri, 6 May 2011 14:24:34 +1000 NeilBrown wrote:

> On Tue, 03 May 2011 15:20:46 -0500 Michael Reed wrote:
>
> > Resending to linux-raid.
> >
> >
> > Hi Neil,
> >
> > My name is Mike Reed. I work at SGI. I've been looking at a performance
> > issue with md writes and have tracked it down to the md device not having
> > a high enough initial max_hw_sectors_kb setting.
> >
> > There is code in blk_queue_make_request() which lowers the default value
> > from INT_MAX to BLK_SAFE_MAX_SECTORS, which is 255. This is generally
> > lower than all the underlying devices with which I use md.
> >
> > As md appears to be a stacking driver, i.e., it calls disk_stack_limits()
> > for each member of a volume, it would seem reasonable for md to use the,
> > INT_MAX setting for max_hw_sectors_kb instead of BLK_SAFE_MAX_SECTORS.
> >
> > I have tried this, and have observed that md correctly limits the md device's
> > max_sectors_kb to the value of the underlying devices in my mirror volume.
> >
> > Is this the correct way to address this issue?
> >
> > Applies to linux-2.6.39-rc4.
> >
> > Signed-off-by: Michael Reed
> >
> >
> > --- drivers/md/md.c 2011-04-18 21:26:00.000000000 -0700
> > +++ drivers/md/md.c.new 2011-04-21 15:53:11.452536201 -0700
> > @@ -4328,6 +4328,7 @@ static int md_alloc(dev_t dev, char *nam
> > mddev->queue->queuedata = mddev;
> >
> > blk_queue_make_request(mddev->queue, md_make_request);
> > + blk_queue_max_sectors(mddev->queue, INT_MAX);
> >
> > disk = alloc_disk(1 << shift);
> > if (!disk) {
> >
>
> Hi Mike,
> sorry for not responding to this earlier - it seemed to keep falling through
> the cracks for some reason :-(
>
> Yes, I completely agree with your analysis and think your patch is correct and
> useful.
> I will queue it up for the next merge window.
>
> thanks,
> NeilBrown

...except of course that it should be
blk_queue_max_hw_sectors(......)
^^

because blk_queue_max_sectors was renamed back in 2.6.34 :-)

NeilBrown

Re: md question re: max_hw_sectors_kb

am 09.05.2011 18:02:53 von Michael Reed

On 05/05/2011 11:40 PM, NeilBrown wrote:
> On Fri, 6 May 2011 14:24:34 +1000 NeilBrown wrote:
>
>> On Tue, 03 May 2011 15:20:46 -0500 Michael Reed wrote:
>>
>>> Resending to linux-raid.
>>>
>>>
>>> Hi Neil,
>>>
>>> My name is Mike Reed. I work at SGI. I've been looking at a performance
>>> issue with md writes and have tracked it down to the md device not having
>>> a high enough initial max_hw_sectors_kb setting.
>>>
>>> There is code in blk_queue_make_request() which lowers the default value
>>> from INT_MAX to BLK_SAFE_MAX_SECTORS, which is 255. This is generally
>>> lower than all the underlying devices with which I use md.
>>>
>>> As md appears to be a stacking driver, i.e., it calls disk_stack_limits()
>>> for each member of a volume, it would seem reasonable for md to use the
>>> INT_MAX setting for max_hw_sectors_kb instead of BLK_SAFE_MAX_SECTORS.
>>>
>>> I have tried this, and have observed that md correctly limits the md device's
>>> max_sectors_kb to the value of the underlying devices in my mirror volume.
>>>
>>> Is this the correct way to address this issue?
>>>
>>> Applies to linux-2.6.39-rc4.
>>>
>>> Signed-off-by: Michael Reed
>>>
>>>
>>> --- drivers/md/md.c 2011-04-18 21:26:00.000000000 -0700
>>> +++ drivers/md/md.c.new 2011-04-21 15:53:11.452536201 -0700
>>> @@ -4328,6 +4328,7 @@ static int md_alloc(dev_t dev, char *nam
>>> mddev->queue->queuedata = mddev;
>>>
>>> blk_queue_make_request(mddev->queue, md_make_request);
>>> + blk_queue_max_sectors(mddev->queue, INT_MAX);
>>>
>>> disk = alloc_disk(1 << shift);
>>> if (!disk) {
>>>
>>
>> Hi Mike,
>> sorry for not responding to this earlier - it seemed to keep falling through
>> the cracks for some reason :-(
>>
>> Yes, I completely agree with your analysis and think your patch is correct and
>> useful.
>> I will queue it up for the next merge window.
>>
>> thanks,
>> NeilBrown
>
> ...except of course that it should be
> blk_queue_max_hw_sectors(......)
> ^^
>
> because blk_queue_max_sectors was renamed back in 2.6.34 :-)
>
> NeilBrown

Thank you for catching that! I've attached a corrected patch for the record.

Signed-off-by: Michael Reed


--- drivers/md/md.c 2011-04-18 21:26:00.000000000 -0700
+++ drivers/md/md.c.new 2011-05-09 10:57:16.000000000 -0700
@@ -4328,6 +4328,7 @@ static int md_alloc(dev_t dev, char *nam
mddev->queue->queuedata = mddev;

blk_queue_make_request(mddev->queue, md_make_request);
+ blk_queue_max_hw_sectors(mddev->queue, INT_MAX);

disk = alloc_disk(1 << shift);
if (!disk) {



Re: md question re: max_hw_sectors_kb

am 10.05.2011 01:52:43 von NeilBrown

On Wed, 04 May 2011 13:58:05 -0400 "Martin K. Petersen"
wrote:

> >>>>> "Michael" == Michael Reed writes:
>
> Michael> There is code in blk_queue_make_request() which lowers the
> Michael> default value from INT_MAX to BLK_SAFE_MAX_SECTORS, which is
> Michael> 255. This is generally lower than all the underlying devices
> Michael> with which I use md.
>
> Yeah, the SAFE value is there to appease legacy low-level drivers.
>
>
> Michael> As md appears to be a stacking driver, i.e., it calls
> Michael> disk_stack_limits() for each member of a volume, it would seem
> Michael> reasonable for md to use the INT_MAX setting for
> Michael> max_hw_sectors_kb instead of BLK_SAFE_MAX_SECTORS.
>
> Your fix is functionally correct. However, another case just popped up
> this week where we need to distinguish between stacking driver and LLD
> defaults. So I think we should try to handle this at the block layer
> instead of explicitly tweaking this knob in MD.
>
> I'll get this fixed up and will CC: you on the patch.
>

What case is this?

There is another problem that I am aware of with this patch - maybe it is the
same as what you are thinking of - maybe not.

If you have FS -> DM -> MD, then any change that MD makes to
max_hw_sectors_kb will not be visible to the FS. So adding and activating a
hot spare with a smaller max_hw_sectors_kb could cause it to receive requests
that are too big.

With the current default of BLK_SAFE_MAX_SECTORS, that only seems to affect a
few USB devices. If we raise the default we could see problems happening
more often.

So we really need a proper resolution to this problem first. i.e. A way for
'dm' to notice when 'md' changes its parameters - or in general any stacking
device to find out when an underlying device changes in any way.

I would implement this by having blkdev_get{,_by_path,_by_dev} take an extra
arg which is a pointer to a struct of functions. In the first instance there
would be just one which tells the claimer that something in queue.limits has
changed. Later we could add other calls to help with size changes.

So when md adds a new device, it calls disk_stack_limits() to update its
limits; then, if the bdev for the mddev is claimed with a non-NULL operations
pointer, it calls the 'limits_have_changed' function.

Thoughts?

NeilBrown

Re: md question re: max_hw_sectors_kb

am 12.05.2011 05:51:16 von martin.petersen

>>>>> "Neil" == NeilBrown writes:

Neil,

>> Your fix is functionally correct. However, another case just popped
>> up this week where we need to distinguish between stacking driver and
>> LLD defaults.

Neil> What case is this?

This particular case involved the need to set different defaults for
discard depending on whether it was a stacking or a low level driver.


Neil> If you have FS -> DM -> MD, then any change that MD makes to
Neil> max_hw_sectors_kb will not be visible to the FS. So adding and
Neil> activating a hot spare with a smaller max_hw_sectors_kb could cause
Neil> it to receive requests that are too big.

Yeah, this issue pops up occasionally. Alasdair and I were discussing it
just a couple of weeks ago.


Neil> So we really need a proper resolution to this problem first.
Neil> i.e. A way for 'dm' to notice when 'md' changes its parameters -
Neil> or in general any stacking device to find out when an underlying
Neil> device changes in any way.

Neil> I would implement this by having blkdev_get{,_by_path,_by_dev}
Neil> take an extra arg which is a pointer to a struct of functions. In
Neil> the first instance there would be just one which tells the claimer
Neil> that something in queue.limits has changed. Later we could add
Neil> other calls to help with size changes.

I agree we need a way to propagate queue limit and capacity changes up
the stack. I'll put it on my todo list.

--
Martin K. Petersen Oracle Linux Engineering

Re: md question re: max_hw_sectors_kb

am 31.05.2011 05:06:41 von fibre raid

Hi Mike,

Thank you for the patch. I've tested it on 2.6.38, though, and do not
see any significant change in md RAID 5 write performance compared to
the unpatched kernel. Can you clarify how you achieved a
demonstrable improvement?

-Tommy


On Wed, May 11, 2011 at 8:51 PM, Martin K. Petersen
wrote:
>>>>>> "Neil" == NeilBrown writes:
>
> Neil,
>
>>> Your fix is functionally correct. However, another case just popped
>>> up this week where we need to distinguish between stacking driver and
>>> LLD defaults.
>
> Neil> What case is this?
>
> This particular case involved the need to set different defaults for
> discard depending on whether it was a stacking or a low level driver.
>
>
> Neil> If you have FS -> DM -> MD, then any change that MD makes to
> Neil> max_hw_sectors_kb will not be visible to the FS. So adding and
> Neil> activating a hot spare with a smaller max_hw_sectors_kb could cause
> Neil> it to receive requests that are too big.
>
> Yeah, this issue pops up occasionally. Alasdair and I were discussing it
> just a couple of weeks ago.
>
>
> Neil> So we really need a proper resolution to this problem first.
> Neil> i.e. A way for 'dm' to notice when 'md' changes its parameters -
> Neil> or in general any stacking device to find out when an underlying
> Neil> device changes in any way.
>
> Neil> I would implement this by having blkdev_get{,_by_path,_by_dev}
> Neil> take an extra arg which is a pointer to a struct of functions. In
> Neil> the first instance there would be just one which tells the claimer
> Neil> that something in queue.limits has changed. Later we could add
> Neil> other calls to help with size changes.
>
> I agree we need a way to propagate queue limit and capacity changes up
> the stack. I'll put it on my todo list.
>
> --
> Martin K. Petersen      Oracle Linux Engineering