Re: [PATCH 05/10] block: remove per-queue plugging

on 11.04.2011 14:36:23 by NeilBrown

On Mon, 11 Apr 2011 14:11:58 +0200 Jens Axboe wrote:

> > Yes. But I need to know when to release the requests that I have stored.
> > I need to know when ->write_pages or ->read_pages or whatever has finished
> > submitting a pile of pages so that I can start processing the request that I
> > have put aside. So I need a callback from blk_finish_plug.
>
> OK fair enough, I'll add your callback patch.
>

Thanks. I'll queue up my md fixes to follow it once it gets to -linus.

NeilBrown

Re: [PATCH 05/10] block: remove per-queue plugging

on 11.04.2011 14:48:45 by Jens Axboe

On 2011-04-11 14:36, NeilBrown wrote:
> On Mon, 11 Apr 2011 14:11:58 +0200 Jens Axboe wrote:
>
>>> Yes. But I need to know when to release the requests that I have stored.
>>> I need to know when ->write_pages or ->read_pages or whatever has finished
>>> submitting a pile of pages so that I can start processing the request that I
>>> have put aside. So I need a callback from blk_finish_plug.
>>
>> OK fair enough, I'll add your callback patch.
>>
>
> Thanks. I'll queue up my md fixes to follow it once it gets to -linus.

Great, once you do that and XFS kills the blk_flush_plug() calls too,
then we can remove that export and make it internal only.

--
Jens Axboe

Re: [PATCH 05/10] block: remove per-queue plugging

on 12.04.2011 03:12:55 by Christoph Hellwig

On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> Great, once you do that and XFS kills the blk_flush_plug() calls too,
> then we can remove that export and make it internal only.

Linus pulled the tree, so they are gone now. Btw, there's still some
bits in the area that confuse me:

- what's the point of the queue_sync_plugs? It has a lot of comments
that seem to pre-date the on-stack plugging, but except for that
it's a trivial wrapper around blk_flush_plug, with an argument
that is not used.
- is there a good reason for the existence of __blk_flush_plug? You'd
get one additional instruction in the inlined version of
blk_flush_plug when open-coding, but avoid the need for chained
function calls.
- Why is having a plug in blk_flush_plug marked unlikely? Note that
unlikely is the static branch prediction hint to mark the case
extremely unlikely and is even used for hot/cold partitioning. But
when we call it we usually check beforehand if we actually have
plugs, so it's actually likely to happen.
- what is the point of blk_finish_plug? All callers have
the plug on stack, and there's no good reason for adding the NULL
check. Note that blk_start_plug doesn't have the NULL check either.
- Why does __blk_flush_plug call __blk_finish_plug which might clear
tsk->plug, just to set it back after the call? When manually inlining
__blk_finish_plug into __blk_flush_plug it looks like:

void __blk_flush_plug(struct task_struct *tsk, struct blk_plug *plug)
{
	flush_plug_list(plug);
	if (plug == tsk->plug)
		tsk->plug = NULL;
	tsk->plug = plug;
}

it would seem much smarter to just call flush_plug_list directly.
In fact it seems like the tsk->plug is not necessary at all and
all remaining __blk_flush_plug callers could be replaced with
flush_plug_list.

- and of course the remaining issue of why io_schedule needs an
explicit blk_flush_plug when schedule() already does one in
case it actually needs to schedule.
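
For reference, the on-stack plugging pattern these questions revolve
around looks roughly like this (a minimal sketch of the 2.6.39-era API;
submit_pages_example() is a made-up placeholder, not a real kernel
function):

static void submit_pages_example(struct bio *bio)
{
	struct blk_plug plug;

	blk_start_plug(&plug);	/* sets current->plug = &plug */
	/*
	 * submit_bio() now queues requests on the on-stack plug list
	 * instead of dispatching them to the driver right away.
	 * (bio is assumed to have been prepared by the caller.)
	 */
	submit_bio(READ, bio);
	blk_finish_plug(&plug);	/* explicit unplug: flush the list */
}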


Re: [PATCH 05/10] block: remove per-queue plugging

on 12.04.2011 10:36:30 by Jens Axboe

On 2011-04-12 03:12, hch@infradead.org wrote:
> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
>> then we can remove that export and make it internal only.
>
> Linus pulled the tree, so they are gone now. Btw, there's still some
> bits in the area that confuse me:

Great!

> - what's the point of the queue_sync_plugs? It has a lot of comments
> that seem to pre-date the on-stack plugging, but except for that
> it's a trivial wrapper around blk_flush_plug, with an argument
> that is not used.

There's really no point to it anymore. Its existence was due to the
older revision that had to track write requests for serializing around
a barrier. I'll kill it, since we don't do that anymore.

> - is there a good reason for the existence of __blk_flush_plug? You'd
> get one additional instruction in the inlined version of
> blk_flush_plug when open-coding, but avoid the need for chained
> function calls.
> - Why is having a plug in blk_flush_plug marked unlikely? Note that
> unlikely is the static branch prediction hint to mark the case
> extremely unlikely and is even used for hot/cold partitioning. But
> when we call it we usually check beforehand if we actually have
> plugs, so it's actually likely to happen.

The existence and out-of-line placement are for the scheduler hook. It should be
an unlikely event to schedule with a plug held; normally the plug should
have been explicitly unplugged before that happens.

> - what is the point of blk_finish_plug? All callers have
> the plug on stack, and there's no good reason for adding the NULL
> check. Note that blk_start_plug doesn't have the NULL check either.

That one can probably go, I need to double check that part since some
things changed.

> - Why does __blk_flush_plug call __blk_finish_plug which might clear
> tsk->plug, just to set it back after the call? When manually inlining
> __blk_finish_plug into __blk_flush_plug it looks like:
>
> void __blk_flush_plug(struct task_struct *tsk, struct blk_plug *plug)
> {
> 	flush_plug_list(plug);
> 	if (plug == tsk->plug)
> 		tsk->plug = NULL;
> 	tsk->plug = plug;
> }
>
> it would seem much smarter to just call flush_plug_list directly.
> In fact it seems like the tsk->plug is not necessary at all and
> all remaining __blk_flush_plug callers could be replaced with
> flush_plug_list.

It depends on whether this was an explicit unplug (eg
blk_finish_plug()), or whether it was an implicit event (eg on
schedule()). If we do it on schedule(), then we retain the plug after
the flush. Otherwise we clear it.
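
In rough terms (a sketch of the intended semantics, not the actual
code):

	/* explicit unplug: the plugged section ends here */
	blk_finish_plug(&plug);	/* flush list, clear current->plug */

	/* implicit unplug from schedule(): flush pending requests, but
	 * tsk->plug stays set so plugging resumes when the task runs on */
	blk_flush_plug(tsk);	/* flush list, tsk->plug unchanged */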

> - and of course the remaining issue of why io_schedule needs an
> explicit blk_flush_plug when schedule() already does one in
> case it actually needs to schedule.

Already answered in the other email.

--
Jens Axboe

Re: [PATCH 05/10] block: remove per-queue plugging

on 12.04.2011 14:22:48 by Dave Chinner

On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> On 2011-04-12 03:12, hch@infradead.org wrote:
> > On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> >> Great, once you do that and XFS kills the blk_flush_plug() calls too,
> >> then we can remove that export and make it internal only.
> >
> > Linus pulled the tree, so they are gone now. Btw, there's still some
> > bits in the area that confuse me:
>
> Great!
>
> > - what's the point of the queue_sync_plugs? It has a lot of comments
> > that seem to pre-date the on-stack plugging, but except for that
> > it's a trivial wrapper around blk_flush_plug, with an argument
> > that is not used.
>
> There's really no point to it anymore. Its existence was due to the
> older revision that had to track write requests for serializing around
> a barrier. I'll kill it, since we don't do that anymore.
>
> > - is there a good reason for the existence of __blk_flush_plug? You'd
> > get one additional instruction in the inlined version of
> > blk_flush_plug when open-coding, but avoid the need for chained
> > function calls.
> > - Why is having a plug in blk_flush_plug marked unlikely? Note that
> > unlikely is the static branch prediction hint to mark the case
> > extremely unlikely and is even used for hot/cold partitioning. But
> > when we call it we usually check beforehand if we actually have
> > plugs, so it's actually likely to happen.
>
> The existence and out-of-line placement are for the scheduler hook. It should be
> an unlikely event to schedule with a plug held; normally the plug should
> have been explicitly unplugged before that happens.

Though if it does, haven't you just added a significant amount of
depth to the worst case stack usage? I'm seeing this sort of thing
from io_schedule():

Depth Size Location (40 entries)
----- ---- --------
0) 4256 16 mempool_alloc_slab+0x15/0x20
1) 4240 144 mempool_alloc+0x63/0x160
2) 4096 16 scsi_sg_alloc+0x4c/0x60
3) 4080 112 __sg_alloc_table+0x66/0x140
4) 3968 32 scsi_init_sgtable+0x33/0x90
5) 3936 48 scsi_init_io+0x31/0xc0
6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
7) 3856 112 sd_prep_fn+0x150/0xa90
8) 3744 48 blk_peek_request+0x6a/0x1f0
9) 3696 96 scsi_request_fn+0x60/0x510
10) 3600 32 __blk_run_queue+0x57/0x100
11) 3568 80 flush_plug_list+0x133/0x1d0
12) 3488 32 __blk_flush_plug+0x24/0x50
13) 3456 32 io_schedule+0x79/0x80

(This is from a page fault on ext3 that is doing page cache
readahead and blocking on a locked buffer.)

I've seen traces where mempool_alloc_slab enters direct reclaim
which adds another 1.5k of stack usage to this path. So I'm
extremely concerned that you've just reduced the stack available to
every thread by at least 2.5k of space...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

Re: [PATCH 05/10] block: remove per-queue plugging

on 12.04.2011 14:28:31 by Jens Axboe

On 2011-04-12 14:22, Dave Chinner wrote:
> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
>> On 2011-04-12 03:12, hch@infradead.org wrote:
>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
>>>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
>>>> then we can remove that export and make it internal only.
>>>
>>> Linus pulled the tree, so they are gone now. Btw, there's still some
>>> bits in the area that confuse me:
>>
>> Great!
>>
>>> - what's the point of the queue_sync_plugs? It has a lot of comments
>>> that seem to pre-date the on-stack plugging, but except for that
>>> it's a trivial wrapper around blk_flush_plug, with an argument
>>> that is not used.
>>
>> There's really no point to it anymore. Its existence was due to the
>> older revision that had to track write requests for serializing around
>> a barrier. I'll kill it, since we don't do that anymore.
>>
>>> - is there a good reason for the existence of __blk_flush_plug? You'd
>>> get one additional instruction in the inlined version of
>>> blk_flush_plug when open-coding, but avoid the need for chained
>>> function calls.
>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that
>>> unlikely is the static branch prediction hint to mark the case
>>> extremely unlikely and is even used for hot/cold partitioning. But
>>> when we call it we usually check beforehand if we actually have
>>> plugs, so it's actually likely to happen.
>>
>> The existence and out-of-line placement are for the scheduler hook. It should be
>> an unlikely event to schedule with a plug held; normally the plug should
>> have been explicitly unplugged before that happens.
>
> Though if it does, haven't you just added a significant amount of
> depth to the worst case stack usage? I'm seeing this sort of thing
> from io_schedule():
>
> Depth Size Location (40 entries)
> ----- ---- --------
> 0) 4256 16 mempool_alloc_slab+0x15/0x20
> 1) 4240 144 mempool_alloc+0x63/0x160
> 2) 4096 16 scsi_sg_alloc+0x4c/0x60
> 3) 4080 112 __sg_alloc_table+0x66/0x140
> 4) 3968 32 scsi_init_sgtable+0x33/0x90
> 5) 3936 48 scsi_init_io+0x31/0xc0
> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
> 7) 3856 112 sd_prep_fn+0x150/0xa90
> 8) 3744 48 blk_peek_request+0x6a/0x1f0
> 9) 3696 96 scsi_request_fn+0x60/0x510
> 10) 3600 32 __blk_run_queue+0x57/0x100
> 11) 3568 80 flush_plug_list+0x133/0x1d0
> 12) 3488 32 __blk_flush_plug+0x24/0x50
> 13) 3456 32 io_schedule+0x79/0x80
>
> (This is from a page fault on ext3 that is doing page cache
> readahead and blocking on a locked buffer.)
>
> I've seen traces where mempool_alloc_slab enters direct reclaim
> which adds another 1.5k of stack usage to this path. So I'm
> extremely concerned that you've just reduced the stack available to
> every thread by at least 2.5k of space...

Yeah, that does not look great. If this turns out to be problematic, we
can turn the queue runs from the unlikely case into out-of-line from
kblockd.

But this really isn't that new, you could enter the IO dispatch path
when doing IO already (when submitting it). So we better be able to
handle that.

If it's a problem from the schedule()/io_schedule() path, then let's
ensure that those are truly unlikely events so we can punt them to
kblockd.


--
Jens Axboe

Re: [PATCH 05/10] block: remove per-queue plugging

on 12.04.2011 14:41:34 by Dave Chinner

On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
> On 2011-04-12 14:22, Dave Chinner wrote:
> > On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> >> On 2011-04-12 03:12, hch@infradead.org wrote:
> >>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> >>>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
> >>>> then we can remove that export and make it internal only.
> >>>
> >>> Linus pulled the tree, so they are gone now. Btw, there's still some
> >>> bits in the area that confuse me:
> >>
> >> Great!
> >>
> >>> - what's the point of the queue_sync_plugs? It has a lot of comments
> >>> that seem to pre-date the on-stack plugging, but except for that
> >>> it's a trivial wrapper around blk_flush_plug, with an argument
> >>> that is not used.
> >>
> >> There's really no point to it anymore. Its existence was due to the
> >> older revision that had to track write requests for serializing around
> >> a barrier. I'll kill it, since we don't do that anymore.
> >>
> >>> - is there a good reason for the existence of __blk_flush_plug? You'd
> >>> get one additional instruction in the inlined version of
> >>> blk_flush_plug when open-coding, but avoid the need for chained
> >>> function calls.
> >>> - Why is having a plug in blk_flush_plug marked unlikely? Note that
> >>> unlikely is the static branch prediction hint to mark the case
> >>> extremely unlikely and is even used for hot/cold partitioning. But
> >>> when we call it we usually check beforehand if we actually have
> >>> plugs, so it's actually likely to happen.
> >>
> >> The existence and out-of-line placement are for the scheduler hook. It should be
> >> an unlikely event to schedule with a plug held; normally the plug should
> >> have been explicitly unplugged before that happens.
> >
> > Though if it does, haven't you just added a significant amount of
> > depth to the worst case stack usage? I'm seeing this sort of thing
> > from io_schedule():
> >
> > Depth Size Location (40 entries)
> > ----- ---- --------
> > 0) 4256 16 mempool_alloc_slab+0x15/0x20
> > 1) 4240 144 mempool_alloc+0x63/0x160
> > 2) 4096 16 scsi_sg_alloc+0x4c/0x60
> > 3) 4080 112 __sg_alloc_table+0x66/0x140
> > 4) 3968 32 scsi_init_sgtable+0x33/0x90
> > 5) 3936 48 scsi_init_io+0x31/0xc0
> > 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
> > 7) 3856 112 sd_prep_fn+0x150/0xa90
> > 8) 3744 48 blk_peek_request+0x6a/0x1f0
> > 9) 3696 96 scsi_request_fn+0x60/0x510
> > 10) 3600 32 __blk_run_queue+0x57/0x100
> > 11) 3568 80 flush_plug_list+0x133/0x1d0
> > 12) 3488 32 __blk_flush_plug+0x24/0x50
> > 13) 3456 32 io_schedule+0x79/0x80
> >
> > (This is from a page fault on ext3 that is doing page cache
> > readahead and blocking on a locked buffer.)
> >
> > I've seen traces where mempool_alloc_slab enters direct reclaim
> > which adds another 1.5k of stack usage to this path. So I'm
> > extremely concerned that you've just reduced the stack available to
> > every thread by at least 2.5k of space...
>
> Yeah, that does not look great. If this turns out to be problematic, we
> can turn the queue runs from the unlikely case into out-of-line from
> kblockd.
>
> But this really isn't that new, you could enter the IO dispatch path
> when doing IO already (when submitting it). So we better be able to
> handle that.

The problem I see is that IO is submitted when there's plenty of
stack available, which would previously have been fine. However now it
hits the plug, and then later on after the thread consumes a lot
more stack it, say, waits for a completion. We then schedule, it
unplugs the queue and we add the IO stack to a place where there
isn't much space available.

So effectively we are moving the places where stack is consumed
about, and it's completely unpredictable where that stack is going to
land now.

> If it's a problem from the schedule()/io_schedule() path, then
> let's ensure that those are truly unlikely events so we can punt
> them to kblockd.

Rather than wait for an explosion to be reported before doing this,
why not just punt unplugs to kblockd unconditionally?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

Re: [PATCH 05/10] block: remove per-queue plugging

on 12.04.2011 14:58:46 by Jens Axboe

On 2011-04-12 14:41, Dave Chinner wrote:
> On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
>> On 2011-04-12 14:22, Dave Chinner wrote:
>>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
>>>> On 2011-04-12 03:12, hch@infradead.org wrote:
>>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
>>>>>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
>>>>>> then we can remove that export and make it internal only.
>>>>>
>>>>> Linus pulled the tree, so they are gone now. Btw, there's still some
>>>>> bits in the area that confuse me:
>>>>
>>>> Great!
>>>>
>>>>> - what's the point of the queue_sync_plugs? It has a lot of comments
>>>>> that seem to pre-date the on-stack plugging, but except for that
>>>>> it's a trivial wrapper around blk_flush_plug, with an argument
>>>>> that is not used.
>>>>
>>>> There's really no point to it anymore. Its existence was due to the
>>>> older revision that had to track write requests for serializing around
>>>> a barrier. I'll kill it, since we don't do that anymore.
>>>>
>>>>> - is there a good reason for the existence of __blk_flush_plug? You'd
>>>>> get one additional instruction in the inlined version of
>>>>> blk_flush_plug when open-coding, but avoid the need for chained
>>>>> function calls.
>>>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that
>>>>> unlikely is the static branch prediction hint to mark the case
>>>>> extremely unlikely and is even used for hot/cold partitioning. But
>>>>> when we call it we usually check beforehand if we actually have
>>>>> plugs, so it's actually likely to happen.
>>>>
>>>> The existence and out-of-line placement are for the scheduler hook. It should be
>>>> an unlikely event to schedule with a plug held; normally the plug should
>>>> have been explicitly unplugged before that happens.
>>>
>>> Though if it does, haven't you just added a significant amount of
>>> depth to the worst case stack usage? I'm seeing this sort of thing
>>> from io_schedule():
>>>
>>> Depth Size Location (40 entries)
>>> ----- ---- --------
>>> 0) 4256 16 mempool_alloc_slab+0x15/0x20
>>> 1) 4240 144 mempool_alloc+0x63/0x160
>>> 2) 4096 16 scsi_sg_alloc+0x4c/0x60
>>> 3) 4080 112 __sg_alloc_table+0x66/0x140
>>> 4) 3968 32 scsi_init_sgtable+0x33/0x90
>>> 5) 3936 48 scsi_init_io+0x31/0xc0
>>> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
>>> 7) 3856 112 sd_prep_fn+0x150/0xa90
>>> 8) 3744 48 blk_peek_request+0x6a/0x1f0
>>> 9) 3696 96 scsi_request_fn+0x60/0x510
>>> 10) 3600 32 __blk_run_queue+0x57/0x100
>>> 11) 3568 80 flush_plug_list+0x133/0x1d0
>>> 12) 3488 32 __blk_flush_plug+0x24/0x50
>>> 13) 3456 32 io_schedule+0x79/0x80
>>>
>>> (This is from a page fault on ext3 that is doing page cache
>>> readahead and blocking on a locked buffer.)
>>>
>>> I've seen traces where mempool_alloc_slab enters direct reclaim
>>> which adds another 1.5k of stack usage to this path. So I'm
>>> extremely concerned that you've just reduced the stack available to
>>> every thread by at least 2.5k of space...
>>
>> Yeah, that does not look great. If this turns out to be problematic, we
>> can turn the queue runs from the unlikely case into out-of-line from
>> kblockd.
>>
>> But this really isn't that new, you could enter the IO dispatch path
>> when doing IO already (when submitting it). So we better be able to
>> handle that.
>
> The problem I see is that IO is submitted when there's plenty of
> stack available, which would previously have been fine. However now it
> hits the plug, and then later on after the thread consumes a lot
> more stack it, say, waits for a completion. We then schedule, it
> unplugs the queue and we add the IO stack to a place where there
> isn't much space available.
>
> So effectively we are moving the places where stack is consumed
> about, and it's completely unpredictable where that stack is going to
> land now.

Isn't that example fairly contrived? If we ended up doing the IO
dispatch before, then the only difference now is the stack usage of
schedule() itself. Apart from that, as far as I can tell, there should
not be much difference.


>> If it's a problem from the schedule()/io_schedule() path, then
>> let's ensure that those are truly unlikely events so we can punt
>> them to kblockd.
>
> Rather than wait for an explosion to be reported before doing this,
> why not just punt unplugs to kblockd unconditionally?

Supposedly it's faster to do it inline rather than punt the dispatch.
But that may actually not be true, if you have multiple plugs going (and
thus multiple contenders for the queue lock on dispatch). So let's play
it safe and punt to kblockd, we can always revisit this later.

diff --git a/block/blk-core.c b/block/blk-core.c
index c6eaa1f..36b1a75 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2665,7 +2665,7 @@ static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b)
static void queue_unplugged(struct request_queue *q, unsigned int depth)
{
trace_block_unplug_io(q, depth);
- __blk_run_queue(q, false);
+ __blk_run_queue(q, true);

if (q->unplugged_fn)
q->unplugged_fn(q);
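
For reference, the 'true' argument makes __blk_run_queue() punt the
queue run to the kblockd workqueue instead of calling the request_fn
directly - roughly like this (a sketch of the 2.6.39-era helper,
details may differ):

void __blk_run_queue(struct request_queue *q, bool force_kblockd)
{
	if (unlikely(blk_queue_stopped(q)))
		return;

	/* run inline only when allowed and not recursing */
	if (!force_kblockd && !queue_flag_test_and_set(QUEUE_FLAG_REENTER, q)) {
		q->request_fn(q);
		queue_flag_clear(QUEUE_FLAG_REENTER, q);
	} else
		queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
}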


--
Jens Axboe

Re: [PATCH 05/10] block: remove per-queue plugging

on 12.04.2011 15:31:17 by Dave Chinner

On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote:
> On 2011-04-12 14:41, Dave Chinner wrote:
> > On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
> >> On 2011-04-12 14:22, Dave Chinner wrote:
> >>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> >>>> On 2011-04-12 03:12, hch@infradead.org wrote:
> >>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> >>>>>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
> >>>>>> then we can remove that export and make it internal only.
> >>>>>
> >>>>> Linus pulled the tree, so they are gone now. Btw, there's still some
> >>>>> bits in the area that confuse me:
> >>>>
> >>>> Great!
> >>>>
> >>>>> - what's the point of the queue_sync_plugs? It has a lot of comments
> >>>>> that seem to pre-date the on-stack plugging, but except for that
> >>>>> it's a trivial wrapper around blk_flush_plug, with an argument
> >>>>> that is not used.
> >>>>
> >>>> There's really no point to it anymore. Its existence was due to the
> >>>> older revision that had to track write requests for serializing around
> >>>> a barrier. I'll kill it, since we don't do that anymore.
> >>>>
> >>>>> - is there a good reason for the existence of __blk_flush_plug? You'd
> >>>>> get one additional instruction in the inlined version of
> >>>>> blk_flush_plug when open-coding, but avoid the need for chained
> >>>>> function calls.
> >>>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that
> >>>>> unlikely is the static branch prediction hint to mark the case
> >>>>> extremely unlikely and is even used for hot/cold partitioning. But
> >>>>> when we call it we usually check beforehand if we actually have
> >>>>> plugs, so it's actually likely to happen.
> >>>>
> >>>> The existence and out-of-line placement are for the scheduler hook. It should be
> >>>> an unlikely event to schedule with a plug held; normally the plug should
> >>>> have been explicitly unplugged before that happens.
> >>>
> >>> Though if it does, haven't you just added a significant amount of
> >>> depth to the worst case stack usage? I'm seeing this sort of thing
> >>> from io_schedule():
> >>>
> >>> Depth Size Location (40 entries)
> >>> ----- ---- --------
> >>> 0) 4256 16 mempool_alloc_slab+0x15/0x20
> >>> 1) 4240 144 mempool_alloc+0x63/0x160
> >>> 2) 4096 16 scsi_sg_alloc+0x4c/0x60
> >>> 3) 4080 112 __sg_alloc_table+0x66/0x140
> >>> 4) 3968 32 scsi_init_sgtable+0x33/0x90
> >>> 5) 3936 48 scsi_init_io+0x31/0xc0
> >>> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
> >>> 7) 3856 112 sd_prep_fn+0x150/0xa90
> >>> 8) 3744 48 blk_peek_request+0x6a/0x1f0
> >>> 9) 3696 96 scsi_request_fn+0x60/0x510
> >>> 10) 3600 32 __blk_run_queue+0x57/0x100
> >>> 11) 3568 80 flush_plug_list+0x133/0x1d0
> >>> 12) 3488 32 __blk_flush_plug+0x24/0x50
> >>> 13) 3456 32 io_schedule+0x79/0x80
> >>>
> >>> (This is from a page fault on ext3 that is doing page cache
> >>> readahead and blocking on a locked buffer.)
> >>>
> >>> I've seen traces where mempool_alloc_slab enters direct reclaim
> >>> which adds another 1.5k of stack usage to this path. So I'm
> >>> extremely concerned that you've just reduced the stack available to
> >>> every thread by at least 2.5k of space...
> >>
> >> Yeah, that does not look great. If this turns out to be problematic, we
> >> can turn the queue runs from the unlikely case into out-of-line from
> >> kblockd.
> >>
> >> But this really isn't that new, you could enter the IO dispatch path
> >> when doing IO already (when submitting it). So we better be able to
> >> handle that.
> >
> > The problem I see is that IO is submitted when there's plenty of
> > stack available, which would previously have been fine. However now it
> > hits the plug, and then later on after the thread consumes a lot
> > more stack it, say, waits for a completion. We then schedule, it
> > unplugs the queue and we add the IO stack to a place where there
> > isn't much space available.
> >
> > So effectively we are moving the places where stack is consumed
> > about, and it's completely unpredictable where that stack is going to
> > land now.
>
> Isn't that example fairly contrived?

I don't think so. e.g. in the XFS allocation path we do btree block
readahead, then go do the real work. The real work can end up with a
deeper stack before blocking on locks or completions unrelated to
the readahead, leading to schedule() being called and an unplug
being issued at that point. You might think it contrived, but if
you can't provide a guarantee that it can't happen then I have to
assume it will happen.

My concern is that we're already under stack space stress in the
writeback path, so anything that has the potential to increase it
significantly is a major worry from my point of view...

> If we ended up doing the IO
> dispatch before, then the only difference now is the stack usage of
> schedule() itself. Apart from that, as far as I can tell, there should
> not be much difference.

There's a difference between IO submission and IO dispatch. IO
submission is submit_bio through to the plug; IO dispatch is from the
plug down to the disk. If they happen at the same place, there's no
problem. If IO dispatch is moved to schedule() via a plug....

> >> If it's a problem from the schedule()/io_schedule() path, then
>> let's ensure that those are truly unlikely events so we can punt
> >> them to kblockd.
> >
> > Rather than wait for an explosion to be reported before doing this,
> > why not just punt unplugs to kblockd unconditionally?
>
> Supposedly it's faster to do it inline rather than punt the dispatch.
> But that may actually not be true, if you have multiple plugs going (and
> thus multiple contenders for the queue lock on dispatch). So let's play
> it safe and punt to kblockd, we can always revisit this later.

It's always best to play it safe when it comes to other peoples
data....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

Re: [PATCH 05/10] block: remove per-queue plugging

on 12.04.2011 15:40:31 by Dave Chinner

On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
> On 2011-04-12 14:22, Dave Chinner wrote:
> > On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> >> On 2011-04-12 03:12, hch@infradead.org wrote:
> >>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> >>> function calls.
> >>> - Why is having a plug in blk_flush_plug marked unlikely? Note that
> >>> unlikely is the static branch prediction hint to mark the case
> >>> extremely unlikely and is even used for hot/cold partitioning. But
> >>> when we call it we usually check beforehand if we actually have
> >>> plugs, so it's actually likely to happen.
> >>
> >> The existence and out-of-line placement are for the scheduler hook. It should be
> >> an unlikely event to schedule with a plug held; normally the plug should
> >> have been explicitly unplugged before that happens.
> >
> > Though if it does, haven't you just added a significant amount of
> > depth to the worst case stack usage? I'm seeing this sort of thing
> > from io_schedule():
> >
> > Depth Size Location (40 entries)
> > ----- ---- --------
> > 0) 4256 16 mempool_alloc_slab+0x15/0x20
> > 1) 4240 144 mempool_alloc+0x63/0x160
> > 2) 4096 16 scsi_sg_alloc+0x4c/0x60
> > 3) 4080 112 __sg_alloc_table+0x66/0x140
> > 4) 3968 32 scsi_init_sgtable+0x33/0x90
> > 5) 3936 48 scsi_init_io+0x31/0xc0
> > 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
> > 7) 3856 112 sd_prep_fn+0x150/0xa90
> > 8) 3744 48 blk_peek_request+0x6a/0x1f0
> > 9) 3696 96 scsi_request_fn+0x60/0x510
> > 10) 3600 32 __blk_run_queue+0x57/0x100
> > 11) 3568 80 flush_plug_list+0x133/0x1d0
> > 12) 3488 32 __blk_flush_plug+0x24/0x50
> > 13) 3456 32 io_schedule+0x79/0x80
> >
> > (This is from a page fault on ext3 that is doing page cache
> > readahead and blocking on a locked buffer.)

FYI, the next step in the allocation chain adds >900 bytes to that
stack:

$ cat /sys/kernel/debug/tracing/stack_trace
Depth Size Location (47 entries)
----- ---- --------
0) 5176 40 zone_statistics+0xad/0xc0
1) 5136 288 get_page_from_freelist+0x2cf/0x840
2) 4848 304 __alloc_pages_nodemask+0x121/0x930
3) 4544 48 kmem_getpages+0x62/0x160
4) 4496 96 cache_grow+0x308/0x330
5) 4400 80 cache_alloc_refill+0x21c/0x260
6) 4320 64 kmem_cache_alloc+0x1b7/0x1e0
7) 4256 16 mempool_alloc_slab+0x15/0x20
8) 4240 144 mempool_alloc+0x63/0x160
9) 4096 16 scsi_sg_alloc+0x4c/0x60
10) 4080 112 __sg_alloc_table+0x66/0x140
11) 3968 32 scsi_init_sgtable+0x33/0x90
12) 3936 48 scsi_init_io+0x31/0xc0
13) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
14) 3856 112 sd_prep_fn+0x150/0xa90
15) 3744 48 blk_peek_request+0x6a/0x1f0
16) 3696 96 scsi_request_fn+0x60/0x510
17) 3600 32 __blk_run_queue+0x57/0x100
18) 3568 80 flush_plug_list+0x133/0x1d0
19) 3488 32 __blk_flush_plug+0x24/0x50
20) 3456 32 io_schedule+0x79/0x80

That's close to 1800 bytes now, and that's not entering the reclaim
path. If i get one deeper than that, I'll be sure to post it. :)

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

Re: [PATCH 05/10] block: remove per-queue plugging

on 12.04.2011 15:45:52 by Jens Axboe

On 2011-04-12 15:31, Dave Chinner wrote:
> On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote:
>> On 2011-04-12 14:41, Dave Chinner wrote:
>>> On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
>>>> On 2011-04-12 14:22, Dave Chinner wrote:
>>>>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
>>>>>> On 2011-04-12 03:12, hch@infradead.org wrote:
>>>>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
>>>>>>>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
>>>>>>>> then we can remove that export and make it internal only.
>>>>>>>
>>>>>>> Linus pulled the tree, so they are gone now. Btw, there's still some
>>>>>>> bits in the area that confuse me:
>>>>>>
>>>>>> Great!
>>>>>>
>>>>>>> - what's the point of the queue_sync_plugs? It has a lot of comments
>>>>>>> that seem to pre-date the on-stack plugging, but except for that
>>>>>>> it's a trivial wrapper around blk_flush_plug, with an argument
>>>>>>> that is not used.
>>>>>>
>>>>>> There's really no point to it anymore. Its existence was due to the
>>>>>> older revision that had to track write requests for serializing around
>>>>>> a barrier. I'll kill it, since we don't do that anymore.
>>>>>>
>>>>>>> - is there a good reason for the existence of __blk_flush_plug? You'd
>>>>>>> get one additional instruction in the inlined version of
>>>>>>> blk_flush_plug when open-coding, but avoid the need for chained
>>>>>>> function calls.
>>>>>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that
>>>>>>> unlikely is the static branch prediction hint to mark the case
>>>>>>> extremely unlikely and is even used for hot/cold partitioning. But
>>>>>>> when we call it we usually check beforehand if we actually have
>>>>>>> plugs, so it's actually likely to happen.
>>>>>>
>>>>>> The existence and out-of-line placement are for the scheduler hook. It should be
>>>>>> an unlikely event to schedule with a plug held; normally the plug should
>>>>>> have been explicitly unplugged before that happens.
>>>>>
>>>>> Though if it does, haven't you just added a significant amount of
>>>>> depth to the worst case stack usage? I'm seeing this sort of thing
>>>>> from io_schedule():
>>>>>
>>>>> Depth Size Location (40 entries)
>>>>> ----- ---- --------
>>>>> 0) 4256 16 mempool_alloc_slab+0x15/0x20
>>>>> 1) 4240 144 mempool_alloc+0x63/0x160
>>>>> 2) 4096 16 scsi_sg_alloc+0x4c/0x60
>>>>> 3) 4080 112 __sg_alloc_table+0x66/0x140
>>>>> 4) 3968 32 scsi_init_sgtable+0x33/0x90
>>>>> 5) 3936 48 scsi_init_io+0x31/0xc0
>>>>> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
>>>>> 7) 3856 112 sd_prep_fn+0x150/0xa90
>>>>> 8) 3744 48 blk_peek_request+0x6a/0x1f0
>>>>> 9) 3696 96 scsi_request_fn+0x60/0x510
>>>>> 10) 3600 32 __blk_run_queue+0x57/0x100
>>>>> 11) 3568 80 flush_plug_list+0x133/0x1d0
>>>>> 12) 3488 32 __blk_flush_plug+0x24/0x50
>>>>> 13) 3456 32 io_schedule+0x79/0x80
>>>>>
>>>>> (This is from a page fault on ext3 that is doing page cache
>>>>> readahead and blocking on a locked buffer.)
>>>>>
>>>>> I've seen traces where mempool_alloc_slab enters direct reclaim
>>>>> which adds another 1.5k of stack usage to this path. So I'm
>>>>> extremely concerned that you've just reduced the stack available to
>>>>> every thread by at least 2.5k of space...
>>>>
>>>> Yeah, that does not look great. If this turns out to be problematic, we
>>>> can turn the queue runs from the unlikely case into out-of-line from
>>>> kblockd.
>>>>
>>>> But this really isn't that new, you could enter the IO dispatch path
>>>> when doing IO already (when submitting it). So we better be able to
>>>> handle that.
>>>
>>> The problem I see is that IO is submitted when there's plenty of
>>> stack available, which would previously have been fine. However now it
>>> hits the plug, and then later on after the thread consumes a lot
>>> more stack it, say, waits for a completion. We then schedule, it
>>> unplugs the queue and we add the IO stack to a place where there
>>> isn't much space available.
>>>
>>> So effectively we are moving the places where stack is consumed
>>> about, and it's completely unpredictable where that stack is going to
>>> land now.
>>
>> Isn't that example fairly contrived?
>
> I don't think so. e.g. in the XFS allocation path we do btree block
> readahead, then go do the real work. The real work can end up with a
> deeper stack before blocking on locks or completions unrelated to
> the readahead, leading to schedule() being called and an unplug
> being issued at that point. You might think it contrived, but if
> you can't provide a guarantee that it can't happen then I have to
> assume it will happen.

If you ended up in lock_page() somewhere along the way, the path would
have been pretty much the same as it is now:

lock_page()
__lock_page()
__wait_on_bit_lock()
sync_page()
aops->sync_page();
block_sync_page()
__blk_run_backing_dev()

and the dispatch follows after that. If your schedules are only due to,
say, blocking on a mutex, then yes it'll be different. But is that
really the case?

I bet that worst case stack usage is exactly the same as before, and
that's the only metric we really care about.

> My concern is that we're already under stack space stress in the
> writeback path, so anything that has the potential to increase it
> significantly is a major worry from my point of view...

I agree on writeback being a worry, and that's why I made the change
(since it makes sense for other reasons, too). I just don't think we are
worse off than before.

>> If we ended up doing the IO
>> dispatch before, then the only difference now is the stack usage of
>> schedule() itself. Apart from that, as far as I can tell, there should
>> not be much difference.
>
> There's a difference between IO submission and IO dispatch. IO
> submission is submit_bio through to the plug; IO dispatch is from the
> plug down to the disk. If they happen at the same place, there's no
> problem. If IO dispatch is moved to schedule() via a plug....

The IO submission can easily and non-deterministically turn into an IO
dispatch, so there's no real difference for the submitter. That was the
case before. With the explicit plug now, you _know_ that the IO
submission is only that and doesn't include IO dispatch. Not until you
schedule() or call blk_finish_plug(), both of which are events that you
can control.

>>>> If it's a problem from the schedule()/io_schedule() path, then
>>>> let's ensure that those are truly unlikely events so we can punt
>>>> them to kblockd.
>>>
>>> Rather than wait for an explosion to be reported before doing this,
>>> why not just punt unplugs to kblockd unconditionally?
>>
>> Supposedly it's faster to do it inline rather than punt the dispatch.
>> But that may actually not be true, if you have multiple plugs going (and
>> thus multiple contenders for the queue lock on dispatch). So let's play
>> it safe and punt to kblockd, we can always revisit this later.
>
> It's always best to play it safe when it comes to other peoples
> data....

Certainly, but so far I see no real evidence that this is in fact any
safer.

--
Jens Axboe

Re: [PATCH 05/10] block: remove per-queue plugging

on 12.04.2011 15:48:10 by Jens Axboe

On 2011-04-12 15:40, Dave Chinner wrote:
> On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
>> On 2011-04-12 14:22, Dave Chinner wrote:
>>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
>>>> On 2011-04-12 03:12, hch@infradead.org wrote:
>>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
>>>>> function calls.
>>>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that
>>>>> unlikely is the static branch prediction hint to mark the case
>>>>> extremely unlikely and is even used for hot/cold partitioning. But
>>>>> when we call it we usually check beforehand if we actually have
>>>>> plugs, so it's actually likely to happen.
>>>>
>>>> The existence and out-of-line placement are for the scheduler hook. It should be
>>>> an unlikely event to schedule with a plug held; normally the plug should
>>>> have been explicitly unplugged before that happens.
>>>
>>> Though if it does, haven't you just added a significant amount of
>>> depth to the worst case stack usage? I'm seeing this sort of thing
>>> from io_schedule():
>>>
>>> Depth Size Location (40 entries)
>>> ----- ---- --------
>>> 0) 4256 16 mempool_alloc_slab+0x15/0x20
>>> 1) 4240 144 mempool_alloc+0x63/0x160
>>> 2) 4096 16 scsi_sg_alloc+0x4c/0x60
>>> 3) 4080 112 __sg_alloc_table+0x66/0x140
>>> 4) 3968 32 scsi_init_sgtable+0x33/0x90
>>> 5) 3936 48 scsi_init_io+0x31/0xc0
>>> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
>>> 7) 3856 112 sd_prep_fn+0x150/0xa90
>>> 8) 3744 48 blk_peek_request+0x6a/0x1f0
>>> 9) 3696 96 scsi_request_fn+0x60/0x510
>>> 10) 3600 32 __blk_run_queue+0x57/0x100
>>> 11) 3568 80 flush_plug_list+0x133/0x1d0
>>> 12) 3488 32 __blk_flush_plug+0x24/0x50
>>> 13) 3456 32 io_schedule+0x79/0x80
>>>
>>> (This is from a page fault on ext3 that is doing page cache
>>> readahead and blocking on a locked buffer.)
>
> FYI, the next step in the allocation chain adds >900 bytes to that
> stack:
>
> $ cat /sys/kernel/debug/tracing/stack_trace
> Depth Size Location (47 entries)
> ----- ---- --------
> 0) 5176 40 zone_statistics+0xad/0xc0
> 1) 5136 288 get_page_from_freelist+0x2cf/0x840
> 2) 4848 304 __alloc_pages_nodemask+0x121/0x930
> 3) 4544 48 kmem_getpages+0x62/0x160
> 4) 4496 96 cache_grow+0x308/0x330
> 5) 4400 80 cache_alloc_refill+0x21c/0x260
> 6) 4320 64 kmem_cache_alloc+0x1b7/0x1e0
> 7) 4256 16 mempool_alloc_slab+0x15/0x20
> 8) 4240 144 mempool_alloc+0x63/0x160
> 9) 4096 16 scsi_sg_alloc+0x4c/0x60
> 10) 4080 112 __sg_alloc_table+0x66/0x140
> 11) 3968 32 scsi_init_sgtable+0x33/0x90
> 12) 3936 48 scsi_init_io+0x31/0xc0
> 13) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
> 14) 3856 112 sd_prep_fn+0x150/0xa90
> 15) 3744 48 blk_peek_request+0x6a/0x1f0
> 16) 3696 96 scsi_request_fn+0x60/0x510
> 17) 3600 32 __blk_run_queue+0x57/0x100
> 18) 3568 80 flush_plug_list+0x133/0x1d0
> 19) 3488 32 __blk_flush_plug+0x24/0x50
> 20) 3456 32 io_schedule+0x79/0x80
>
> That's close to 1800 bytes now, and that's not entering the reclaim
> path. If i get one deeper than that, I'll be sure to post it. :)

Do you have traces from 2.6.38, or are you just doing them now?

The path you quote above should not go into reclaim; it's a GFP_ATOMIC
allocation.

--
Jens Axboe

Re: [PATCH 05/10] block: remove per-queue plugging

on 12.04.2011 16:34:52 by Dave Chinner

On Tue, Apr 12, 2011 at 03:45:52PM +0200, Jens Axboe wrote:
> On 2011-04-12 15:31, Dave Chinner wrote:
> > On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote:
> >> On 2011-04-12 14:41, Dave Chinner wrote:
> >> Isn't that example fairly contrived?
> >
> > I don't think so. e.g. in the XFS allocation path we do btree block
> > readahead, then go do the real work. The real work can end up with a
> > deeper stack before blocking on locks or completions unrelated to
> > the readahead, leading to schedule() being called and an unplug
> > being issued at that point. You might think it contrived, but if
> > you can't provide a guarantee that it can't happen then I have to
> > assume it will happen.
>
> If you ended up in lock_page() somewhere along the way, the path would
> have been pretty much the same as it is now:
>
> lock_page()
> __lock_page()
> __wait_on_bit_lock()
> sync_page()
> aops->sync_page();
> block_sync_page()
> __blk_run_backing_dev()
>
> and the dispatch follows after that. If your schedules are only due to,
> say, blocking on a mutex, then yes it'll be different. But is that
> really the case?

XFS metadata IO does not use the page cache anymore, so it won't take
that path - no page locks are taken during read or write. Even
before that change, contending on page locks was extremely rare as
XFS uses the buffer container for synchronisation.

AFAICT, we have nothing that will cause plugs to be flushed until
scheduling occurs. In many cases it will be at the same points as
before (the explicit flushes XFS had), but there are going to be new
ones....

Like this:

0) 5360 40 zone_statistics+0xad/0xc0
1) 5320 288 get_page_from_freelist+0x2cf/0x840
2) 5032 304 __alloc_pages_nodemask+0x121/0x930
3) 4728 48 kmem_getpages+0x62/0x160
4) 4680 96 cache_grow+0x308/0x330
5) 4584 80 cache_alloc_refill+0x21c/0x260
6) 4504 16 __kmalloc+0x230/0x240
7) 4488 176 virtqueue_add_buf_gfp+0x1f9/0x3e0
8) 4312 144 do_virtblk_request+0x1f3/0x400
9) 4168 32 __blk_run_queue+0x57/0x100
10) 4136 80 flush_plug_list+0x133/0x1d0
11) 4056 32 __blk_flush_plug+0x24/0x50
12) 4024 160 schedule+0x867/0x9f0
13) 3864 208 schedule_timeout+0x1f5/0x2c0
14) 3656 144 wait_for_common+0xe7/0x190
15) 3512 16 wait_for_completion+0x1d/0x20
16) 3496 48 xfs_buf_iowait+0x36/0xb0
17) 3448 32 _xfs_buf_read+0x98/0xa0
18) 3416 48 xfs_buf_read+0xa2/0x100
19) 3368 80 xfs_trans_read_buf+0x1db/0x680
.......

This path adds roughly 500 bytes to the previous case of
immediate dispatch of the IO down through _xfs_buf_read()...

> I bet that worst case stack usage is exactly the same as before, and
> that's the only metric we really care about.

I've already demonstrated much worse stack usage with ext3 through
the page fault path via io_schedule(). io_schedule() never used to
dispatch IO and now it does. Similarly there are changes and
increases in XFS stack usage like above. IMO, worst case stack
usage is definitely increased by these changes.

> > My concern is that we're already under stack space stress in the
> > writeback path, so anything that has the potential to increase it
> > significantly is a major worry from my point of view...
>
> I agree on writeback being a worry, and that's why I made the change
> (since it makes sense for other reasons, too). I just don't think we are
> worse off than before.

We certainly are.

Hmmm, I just noticed a new cumulative stack usage path through
direct reclaim - via congestion_wait() -> io_schedule()....

> >> If we ended up doing the IO
> >> dispatch before, then the only difference now is the stack usage of
> >> schedule() itself. Apart from that, as far as I can tell, there should
> >> not be much difference.
> >
> > There's a difference between IO submission and IO dispatch. IO
> > submission is submit_bio through to the plug; IO dispatch is from the
> > plug down to the disk. If they happen at the same place, there's no
> > problem. If IO dispatch is moved to schedule() via a plug....
>
> The IO submission can easily and non-deterministically turn into an IO
> dispatch, so there's no real difference for the submitter. That was the
> case before. With the explicit plug now, you _know_ that the IO
> submission is only that and doesn't include IO dispatch.

You're violently agreeing with me that you've changed where the IO
dispatch path is run from. ;)

> Not until you
> schedule() or call blk_finish_plug(), both of which are events that you
> can control.

Well, not really - now taking any sleeping lock or waiting on
anything can trigger a plug flush where previously you had to
explicitly issue them. I'm not saying what we had is better, just
that there are implicit flushes with your changes that are
inherently uncontrollable...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

Re: [PATCH 05/10] block: remove per-queue plugging

on 12.04.2011 18:44:17 by Christoph Hellwig

On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote:
> Supposedly it's faster to do it inline rather than punt the dispatch.
> But that may actually not be true, if you have multiple plugs going (and
> thus multiple contenders for the queue lock on dispatch). So let's play
> it safe and punt to kblockd, we can always revisit this later.

Note that this can be optimized further by adding a new helper that just
queues up work on kblockd without taking the queue lock, e.g. adding a
new

void blk_run_queue_async(struct request_queue *q)
{
	if (likely(!blk_queue_stopped(q)))
		queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
}

And replacing all

__blk_run_queue(q, true);

callers with that, at which point they won't need the queue lock any
more.
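
Applied to the queue_unplugged() hunk from earlier in the thread, that
would look roughly like this (sketch):

static void queue_unplugged(struct request_queue *q, unsigned int depth)
{
	trace_block_unplug_io(q, depth);
	blk_run_queue_async(q);	/* punt to kblockd, no queue lock needed */

	if (q->unplugged_fn)
		q->unplugged_fn(q);
}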

Re: [PATCH 05/10] block: remove per-queue plugging

on 12.04.2011 18:49:53 by Jens Axboe

On 2011-04-12 18:44, hch@infradead.org wrote:
> On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote:
>> Supposedly it's faster to do it inline rather than punt the dispatch.
>> But that may actually not be true, if you have multiple plugs going (and
>> thus multiple contenders for the queue lock on dispatch). So let's play
>> it safe and punt to kblockd, we can always revisit this later.
>
> Note that this can be optimized further by adding a new helper that just
> queues up work on kblockd without taking the queue lock, e.g. adding a
> new
>
> void blk_run_queue_async(struct request_queue *q)
> {
> 	if (likely(!blk_queue_stopped(q)))
> 		queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
> }
>
> And replacing all
>
> __blk_run_queue(q, true);
>
> callers with that, at which point they won't need the queuelock any
> more.

I realize that, in fact it's already safe as long as you pass in 'true'
for __blk_run_queue(). Before I had rewritten it to move the running
out, so that makes the trick a little difficult. This afternoon I also
tested it and saw no noticeable difference, but I'll probably just do it
anyway as it makes sense.

--
Jens Axboe

Re: [PATCH 05/10] block: remove per-queue plugging

on 12.04.2011 18:50:42 by Christoph Hellwig

On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> The existence and out-of-line placement are for the scheduler hook. It should be
> an unlikely event to schedule with a plug held; normally the plug should
> have been explicitly unplugged before that happens.

I still don't think unlikely() is the right thing to do. The static
branch prediction hints cause a really massive slowdown if the branch
is taken. For things like this that happen during normal operation
you're much better off leaving the prediction to the CPU's dynamic
branch predictor. And I don't think it's all that unlikely - e.g. for
all the metadata during readpages/writepages, schedule/io_schedule will
be the unplugging point right now. I'll see if I can run an I/O
workload with Steve's likely/unlikely profiling turned on.

> > void __blk_flush_plug(struct task_struct *tsk, struct blk_plug *plug)
> > {
> > 	flush_plug_list(plug);
> > 	if (plug == tsk->plug)
> > 		tsk->plug = NULL;
> > 	tsk->plug = plug;
> > }
> >
> > it would seem much smarter to just call flush_plug_list directly.
> > In fact it seems like the tsk->plug is not necessary at all and
> > all remaining __blk_flush_plug callers could be replaced with
> > flush_plug_list.
>
> It depends on whether this was an explicit unplug (eg
> blk_finish_plug()), or whether it was an implicit event (eg on
> schedule()). If we do it on schedule(), then we retain the plug after
> the flush. Otherwise we clear it.

blk_finish_plug doesn't go through this code path.

This is an untested patch how the area should look to me:


diff --git a/block/blk-core.c b/block/blk-core.c
index 90f22cc..6fa5ba1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2668,7 +2668,7 @@ static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b)
return !(rqa->q <= rqb->q);
}

-static void flush_plug_list(struct blk_plug *plug)
+void blk_flush_plug_list(struct blk_plug *plug)
{
struct request_queue *q;
unsigned long flags;
@@ -2716,29 +2716,16 @@ static void flush_plug_list(struct blk_plug *plug)
BUG_ON(!list_empty(&plug->list));
local_irq_restore(flags);
}
-
-static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
-{
- flush_plug_list(plug);
-
- if (plug == tsk->plug)
- tsk->plug = NULL;
-}
+EXPORT_SYMBOL_GPL(blk_flush_plug_list);

void blk_finish_plug(struct blk_plug *plug)
{
- if (plug)
- __blk_finish_plug(current, plug);
+ blk_flush_plug_list(plug);
+ if (plug == current->plug)
+ current->plug = NULL;
}
EXPORT_SYMBOL(blk_finish_plug);

-void __blk_flush_plug(struct task_struct *tsk, struct blk_plug *plug)
-{
- __blk_finish_plug(tsk, plug);
- tsk->plug = plug;
-}
-EXPORT_SYMBOL(__blk_flush_plug);
-
int __init blk_dev_init(void)
{
BUILD_BUG_ON(__REQ_NR_BITS > 8 *
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 32176cc..fa6a4e1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -862,14 +862,14 @@ struct blk_plug {

extern void blk_start_plug(struct blk_plug *);
extern void blk_finish_plug(struct blk_plug *);
-extern void __blk_flush_plug(struct task_struct *, struct blk_plug *);
+extern void blk_flush_plug_list(struct blk_plug *);

static inline void blk_flush_plug(struct task_struct *tsk)
{
struct blk_plug *plug = tsk->plug;

- if (unlikely(plug))
- __blk_flush_plug(tsk, plug);
+ if (plug)
+ blk_flush_plug_list(plug);
}

static inline bool blk_needs_flush_plug(struct task_struct *tsk)

Re: [PATCH 05/10] block: remove per-queue plugging

on 12.04.2011 18:54:55 by Christoph Hellwig

On Tue, Apr 12, 2011 at 06:49:53PM +0200, Jens Axboe wrote:
> I realize that, in fact it's already safe as long as you pass in 'true'
> for __blk_run_queue(). Before I had rewritten it to move the running
> out, so that makes the trick a little difficult. This afternoon I also
> tested it and saw no noticeable difference, but I'll probably just do it
> anyway as it makes sense.

We still need the lock for __elv_add_request, so we'll need to keep the
logic anyway. But splitting out the "just queue to kblockd" case from
__blk_run_queue and giving the latter a sane prototype still sounds
like a good idea to me.

Btw, now that we don't call the request_fn directly any more and thus
can't block, can the unplugging be moved into the preempt notifiers?

Re: [PATCH 05/10] block: remove per-queue plugging

am 12.04.2011 18:58:30 von Christoph Hellwig

On Tue, Apr 12, 2011 at 11:31:17PM +1000, Dave Chinner wrote:
> I don't think so. e.g. in the XFS allocation path we do btree block
> readahead, then go do the real work. The real work can end up with a
> deeper stack before blocking on locks or completions unrelated to
> the readahead, leading to schedule() being called and an unplug
> being issued at that point. You might think it contrived, but if
> you can't provide a guarantee that it can't happen then I have to
> assume it will happen.

In addition to the stack issue, which is a killer, this also has
latency implications. Before, we could submit a synchronous metadata
read request inside readpage or writepage and kick it off to the disk
immediately, while now it won't get submitted until we block the next
time, i.e. until we have done some more work that could have been used
for doing I/O in the background. With the kblockd offload, not only
have we spent more time, but at the point where we finally kick it we
also need another context switch. It seems like we really need to
go through the filesystems and explicitly flush the plugging queue
for such cases. In fact a bio flag marking things as synchronous
metadata reads would help, but then again we need to clean up our
existing bio flags first...

Re: [PATCH 05/10] block: remove per-queue plugging

on 12.04.2011 19:24:21 by Jens Axboe

On 2011-04-12 18:54, hch@infradead.org wrote:
> On Tue, Apr 12, 2011 at 06:49:53PM +0200, Jens Axboe wrote:
>> I realize that, in fact it's already safe as long as you pass in 'true'
>> for __blk_run_queue(). Before I had rewritten it to move the running
>> out, so that makes the trick a little difficult. This afternoon I also
>> tested it and saw no noticable difference, but I'll probably just do it
>> anyway as it makes sense.
>
> We still need the lock for __elv_add_request, so we'll need to keep the
> logic anyway. But splitting out the just queue to kblockd case from
> __blk_run_queue and giving the latter a sane prototype still sounds
> like a good idea to me.
>
> Btw, now that we don't call the request_fn directly any more and thus
> can't block, can the unplugging be moved into the preempt notifiers?

It was only partly the reason; there's still the notice on preempt
(instead of schedule) and the runqueue lock problem. And if we allow
preempt, then we need to disable preempt around all the plug logic.

--
Jens Axboe

Re: [PATCH 05/10] block: remove per-queue plugging

on 12.04.2011 19:29:29 by Jens Axboe

On 2011-04-12 18:58, hch@infradead.org wrote:
> On Tue, Apr 12, 2011 at 11:31:17PM +1000, Dave Chinner wrote:
>> I don't think so. e.g. in the XFS allocation path we do btree block
>> readahead, then go do the real work. The real work can end up with a
>> deeper stack before blocking on locks or completions unrelated to
>> the readahead, leading to schedule() being called and an unplug
>> being issued at that point. You might think it contrived, but if
>> you can't provide a guarantee that it can't happen then I have to
>> assume it will happen.
>
> In addition to the stack issue, which is a killer, this also has
> latency implications. Before, we could submit a synchronous metadata
> read request inside readpage or writepage and kick it off to the disk
> immediately, while now it won't get submitted until we block the next
> time, i.e. after we have done some more work that could have been used
> for doing I/O in the background. With the kblockd offload, not only
> have we spent more time, but at the point where we finally kick it we
> also need another context switch. It seems like we really need to
> go through the filesystems and explicitly flush the plugging queue
> for such cases. In fact a bio flag marking things as synchronous
> metadata reads would help, but then again we need to clean up our
> existing bio flags first...

I think it would be a good idea to audit the SYNC cases, and if feasible
let those retain the 'immediate kick off' logic. If not, at least have
some way to signal it. Essentially, allow some fine-grained control of
what goes into the plug and what does not.

--
Jens Axboe
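
One illustrative way to express that fine-grained control (not actual
kernel code) would be a bypass predicate consulted at submission time:

	/* Should this bio skip the on-stack plug? Synchronous requests
	 * (e.g. metadata reads) want minimal queueing latency. */
	static bool bio_bypasses_plug(struct bio *bio)
	{
		return (bio->bi_rw & REQ_SYNC) != 0;
	}

The submission path would then only add requests to the plug list when
this returns false, keeping the old 'immediate kick off' behaviour for
SYNC I/O.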

Re: [PATCH 05/10] block: remove per-queue plugging

am 12.04.2011 23:08:27 von NeilBrown

On Wed, 13 Apr 2011 00:34:52 +1000 Dave Chinner wrote:

> On Tue, Apr 12, 2011 at 03:45:52PM +0200, Jens Axboe wrote:
> > Not until you
> > schedule() or call blk_finish_plug(), both of which are events that you
> > can control.
>
> Well, not really - now taking any sleeping lock or waiting on
> anything can trigger a plug flush where previously you had to
> explicitly issue them. I'm not saying what we had is better, just
> that there are implicit flushes with your changes that are
> inherently uncontrollable...

It's not just sleeping locks - if preempt is enabled a schedule can happen at
any time - at any depth. I've seen a spin_unlock do it.

NeilBrown
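
For context on that last point: with CONFIG_PREEMPT, the final
preempt_enable() in spin_unlock() checks need_resched() and may preempt,
so any unlock can end up in schedule(). A simplified sketch (not the
literal kernel source):

	static inline void spin_unlock_sketch(spinlock_t *lock)
	{
		do_raw_spin_unlock(lock);	/* release the lock itself */
		preempt_enable();		/* may call preempt_schedule(),
						 * which enters schedule() */
	}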

Re: [PATCH 05/10] block: remove per-queue plugging

am 13.04.2011 01:35:36 von Dave Chinner

On Tue, Apr 12, 2011 at 03:48:10PM +0200, Jens Axboe wrote:
> On 2011-04-12 15:40, Dave Chinner wrote:
> > On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
> >> On 2011-04-12 14:22, Dave Chinner wrote:
> >>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> >>>> On 2011-04-12 03:12, hch@infradead.org wrote:
> >>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> >>>>> function calls.
> >>>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that
> >>>>> unlikely is the static branch prediction hint to mark the case
> >>>>> extremely unlikely and is even used for hot/cold partitioning. But
> >>>>> when we call it we usually check beforehand if we actually have
> >>>>> plugs, so it's actually likely to happen.
> >>>>
> >>>> The existence and out-of-line call is for the schedule() hook. It should be
> >>>> an unlikely event to schedule with a plug held; normally the plug should
> >>>> have been explicitly unplugged before that happens.
> >>>
> >>> Though if it does, haven't you just added a significant amount of
> >>> depth to the worst case stack usage? I'm seeing this sort of thing
> >>> from io_schedule():
> >>>
> >>> Depth Size Location (40 entries)
> >>> ----- ---- --------
> >>> 0) 4256 16 mempool_alloc_slab+0x15/0x20
> >>> 1) 4240 144 mempool_alloc+0x63/0x160
> >>> 2) 4096 16 scsi_sg_alloc+0x4c/0x60
> >>> 3) 4080 112 __sg_alloc_table+0x66/0x140
> >>> 4) 3968 32 scsi_init_sgtable+0x33/0x90
> >>> 5) 3936 48 scsi_init_io+0x31/0xc0
> >>> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
> >>> 7) 3856 112 sd_prep_fn+0x150/0xa90
> >>> 8) 3744 48 blk_peek_request+0x6a/0x1f0
> >>> 9) 3696 96 scsi_request_fn+0x60/0x510
> >>> 10) 3600 32 __blk_run_queue+0x57/0x100
> >>> 11) 3568 80 flush_plug_list+0x133/0x1d0
> >>> 12) 3488 32 __blk_flush_plug+0x24/0x50
> >>> 13) 3456 32 io_schedule+0x79/0x80
> >>>
> >>> (This is from a page fault on ext3 that is doing page cache
> >>> readahead and blocking on a locked buffer.)
> >
> > FYI, the next step in the allocation chain adds >900 bytes to that
> > stack:
> >
> > $ cat /sys/kernel/debug/tracing/stack_trace
> > Depth Size Location (47 entries)
> > ----- ---- --------
> > 0) 5176 40 zone_statistics+0xad/0xc0
> > 1) 5136 288 get_page_from_freelist+0x2cf/0x840
> > 2) 4848 304 __alloc_pages_nodemask+0x121/0x930
> > 3) 4544 48 kmem_getpages+0x62/0x160
> > 4) 4496 96 cache_grow+0x308/0x330
> > 5) 4400 80 cache_alloc_refill+0x21c/0x260
> > 6) 4320 64 kmem_cache_alloc+0x1b7/0x1e0
> > 7) 4256 16 mempool_alloc_slab+0x15/0x20
> > 8) 4240 144 mempool_alloc+0x63/0x160
> > 9) 4096 16 scsi_sg_alloc+0x4c/0x60
> > 10) 4080 112 __sg_alloc_table+0x66/0x140
> > 11) 3968 32 scsi_init_sgtable+0x33/0x90
> > 12) 3936 48 scsi_init_io+0x31/0xc0
> > 13) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
> > 14) 3856 112 sd_prep_fn+0x150/0xa90
> > 15) 3744 48 blk_peek_request+0x6a/0x1f0
> > 16) 3696 96 scsi_request_fn+0x60/0x510
> > 17) 3600 32 __blk_run_queue+0x57/0x100
> > 18) 3568 80 flush_plug_list+0x133/0x1d0
> > 19) 3488 32 __blk_flush_plug+0x24/0x50
> > 20) 3456 32 io_schedule+0x79/0x80
> >
> > That's close to 1800 bytes now, and that's not entering the reclaim
> > path. If i get one deeper than that, I'll be sure to post it. :)
>
> Do you have traces from 2.6.38, or are you just doing them now?

I do stack checks like this all the time. I generally don't keep
them around, just pay attention to the path and depth. ext3 is used
for / on my test VMs, and has never shown up as the worst case stack
usage when running xfstests. As of the block plugging code, this
trace is the top stack user for the first ~130 tests, and often for
the entire test run on XFS....

> The path you quote above should not go into reclaim, it's a GFP_ATOMIC
> allocation.

Right. I'm still trying to produce a trace that shows more stack
usage in the block layer. It's random chance as to what pops up most
of the time. However, some of the stacks that are showing up in
2.6.39 are quite different from any I've ever seen before...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

Re: [PATCH 05/10] block: remove per-queue plugging

am 13.04.2011 04:23:52 von Linus Torvalds

On Tue, Apr 12, 2011 at 2:08 PM, NeilBrown wrote:
> On Wed, 13 Apr 2011 00:34:52 +1000 Dave Chinner wrote:
>>
>> Well, not really - now taking any sleeping lock or waiting on
>> anything can trigger a plug flush where previously you had to
>> explicitly issue them. I'm not saying what we had is better, just
>> that there are implicit flushes with your changes that are
>> inherently uncontrollable...
>
> It's not just sleeping locks - if preempt is enabled a schedule can happen at
> any time - at any depth.  I've seen a spin_unlock do it.

Hmm. I don't think we should flush IO in the preemption path. That
smells wrong on many levels, just one of them being the "any time, any
depth".

It also sounds really wrong from an IO pattern standpoint. The process
is actually still running, and the IO flushing _already_ does the
"only if it's going to sleep" test, but it actually does it _wrong_.
The "current->state" check doesn't make sense for a preemption event,
because it's not actually going to sleep there.

So a patch like the attached (UNTESTED!) sounds like the right thing to do.

Whether it makes any difference for any MD issues, who knows... But
considering that the unplugging already used to test for "prev->state
!= TASK_RUNNING", this is absolutely the right thing to do - that old
test was just broken.

Linus

[attachment: patch.diff]

 kernel/sched.c |   20 ++++++++++----------
 1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 48013633d792..a187c3fe027b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4111,20 +4111,20 @@ need_resched:
 					try_to_wake_up_local(to_wakeup);
 			}
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
+
+			/*
+			 * If we are going to sleep and we have plugged IO queued, make
+			 * sure to submit it to avoid deadlocks.
+			 */
+			if (blk_needs_flush_plug(prev)) {
+				raw_spin_unlock(&rq->lock);
+				blk_flush_plug(prev);
+				raw_spin_lock(&rq->lock);
+			}
 		}
 		switch_count = &prev->nvcsw;
 	}
 
-	/*
-	 * If we are going to sleep and we have plugged IO queued, make
-	 * sure to submit it to avoid deadlocks.
-	 */
-	if (prev->state != TASK_RUNNING && blk_needs_flush_plug(prev)) {
-		raw_spin_unlock(&rq->lock);
-		blk_flush_plug(prev);
-		raw_spin_lock(&rq->lock);
-	}
-
 	pre_schedule(rq, prev);
 
 	if (unlikely(!rq->nr_running))

Re: [PATCH 05/10] block: remove per-queue plugging

am 13.04.2011 13:12:02 von Peter Zijlstra

On Tue, 2011-04-12 at 19:23 -0700, Linus Torvalds wrote:
> kernel/sched.c | 20 ++++++++++----------
> 1 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 48013633d792..a187c3fe027b 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4111,20 +4111,20 @@ need_resched:
> try_to_wake_up_local(to_wakeup);
> }
> deactivate_task(rq, prev, DEQUEUE_SLEEP);
> +
> + /*
> + * If we are going to sleep and we have plugged IO queued, make
> + * sure to submit it to avoid deadlocks.
> + */
> + if (blk_needs_flush_plug(prev)) {
> + raw_spin_unlock(&rq->lock);
> + blk_flush_plug(prev);
> + raw_spin_lock(&rq->lock);
> + }
> }
> switch_count = &prev->nvcsw;
> }
>
> - /*
> - * If we are going to sleep and we have plugged IO queued, make
> - * sure to submit it to avoid deadlocks.
> - */
> - if (prev->state != TASK_RUNNING && blk_needs_flush_plug(prev)) {
> - raw_spin_unlock(&rq->lock);
> - blk_flush_plug(prev);
> - raw_spin_lock(&rq->lock);
> - }
> -
> pre_schedule(rq, prev);
>
> if (unlikely(!rq->nr_running))

Right, that cures the preemption problem. The reason I suggested placing
it where it was is that I'd like to keep all things that release
rq->lock in the middle of schedule() in one place, but I guess we can
cure that with some extra comments.

Re: [PATCH 05/10] block: remove per-queue plugging

am 13.04.2011 13:23:27 von Jens Axboe

On 2011-04-13 13:12, Peter Zijlstra wrote:
> On Tue, 2011-04-12 at 19:23 -0700, Linus Torvalds wrote:
>> kernel/sched.c | 20 ++++++++++----------
>> 1 files changed, 10 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 48013633d792..a187c3fe027b 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -4111,20 +4111,20 @@ need_resched:
>> try_to_wake_up_local(to_wakeup);
>> }
>> deactivate_task(rq, prev, DEQUEUE_SLEEP);
>> +
>> + /*
>> + * If we are going to sleep and we have plugged IO queued, make
>> + * sure to submit it to avoid deadlocks.
>> + */
>> + if (blk_needs_flush_plug(prev)) {
>> + raw_spin_unlock(&rq->lock);
>> + blk_flush_plug(prev);
>> + raw_spin_lock(&rq->lock);
>> + }
>> }
>> switch_count = &prev->nvcsw;
>> }
>>
>> - /*
>> - * If we are going to sleep and we have plugged IO queued, make
>> - * sure to submit it to avoid deadlocks.
>> - */
>> - if (prev->state != TASK_RUNNING && blk_needs_flush_plug(prev)) {
>> - raw_spin_unlock(&rq->lock);
>> - blk_flush_plug(prev);
>> - raw_spin_lock(&rq->lock);
>> - }
>> -
>> pre_schedule(rq, prev);
>>
>> if (unlikely(!rq->nr_running))
>
> Right, that cures the preemption problem. The reason I suggested placing
> it where it was is that I'd like to keep all things that release
> rq->lock in the middle of schedule() in one place, but I guess we can
> cure that with some extra comments.

We definitely only want to do it when going to sleep, not on preempt
events. So if you are fine with this change, then let's please do that.

Linus, I've got a few other things queued up in the area, I'll add this
and send them off soon. Or feel free to add this one yourself, since you
already did it.


--
Jens Axboe

Re: [PATCH 05/10] block: remove per-queue plugging

am 13.04.2011 13:41:06 von Peter Zijlstra

On Wed, 2011-04-13 at 13:23 +0200, Jens Axboe wrote:
> We definitely only want to do it on going to sleep, not preempt events.
> So if you are fine with this change, then lets please do that.

Here's the Acked-by: Peter Zijlstra, that goes
with it ;-)

> Linus, I've got a few other things queued up in the area, I'll add this
> and send them off soon. Or feel free to add this one yourself, since you
> already did it.

Right, please send it onwards or have Linus commit it himself and I'll
cook up a patch clarifying the rq->lock'ing mess around there.

Re: [PATCH 05/10] block: remove per-queue plugging

am 13.04.2011 17:13:28 von Linus Torvalds

On Wed, Apr 13, 2011 at 4:23 AM, Jens Axboe wrote:
>
> Linus, I've got a few other things queued up in the area, I'll add this
> and send them off soon. Or feel free to add this one yourself, since you
> already did it.

Ok, I committed it with Peter's and your acks.

And if you already put it in your git tree too, git will merge it.

Linus

Re: [PATCH 05/10] block: remove per-queue plugging

am 13.04.2011 19:35:29 von Jens Axboe

On 2011-04-13 17:13, Linus Torvalds wrote:
> On Wed, Apr 13, 2011 at 4:23 AM, Jens Axboe wrote:
>>
>> Linus, I've got a few other things queued up in the area, I'll add this
>> and send them off soon. Or feel free to add this one yourself, since you
>> already did it.
>
> Ok, I committed it with Peter's and your acks.

Great, thanks.

> And if you already put it in your git tree too, git will merge it.

I did not, I had a feeling you'd merge this one.

--
Jens Axboe