[PATCH 0/2] FIX: Process hangs at wait_barrier
On 04.02.2011 14:18:18 by krzysztof.wojcik
These patches resolve a problem with processes hanging at wait_barrier()
after a raid0->raid10 takeover.
The first patch resolves this particular problem; the solution is
similar to the RAID1 barrier implementation.
The second is a proposal for general protection against the barrier
counter becoming negative.
---
Krzysztof Wojcik (2):
FIX: md: process hangs at wait_barrier after 0->10 takeover
FIX: md: Prevent barrier from becoming negative
drivers/md/raid1.c | 3 ++-
drivers/md/raid10.c | 9 ++++++---
2 files changed, 8 insertions(+), 4 deletions(-)
--
Krzysztof Wojcik
[PATCH 1/2] FIX: md: process hangs at wait_barrier after 0->10 takeover
On 04.02.2011 14:18:26 by krzysztof.wojcik
The following symptoms were observed:
1. After a raid0->raid10 takeover operation we have an array with two
missing disks.
When we add a disk for rebuild, the recovery process starts as expected,
but it does not finish: it stops at about 90% and the md126_resync
process hangs in the "D" state.
2. Similar behavior occurs when a mounted raid0 array is taken over to
raid10. When we then try to unmount the array, the umount process hangs
in the "D" state.
In both scenarios the processes hang at the same function: wait_barrier()
in raid10.c.
The process waits in the macro wait_event_lock_irq until the
"!conf->barrier" condition becomes true.
In the scenarios above it never does.
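For context, a simplified sketch of wait_barrier() as it stood in
raid10.c at the time (the unplug/cmd argument of wait_event_lock_irq
varied between kernel versions and is elided here):

static void wait_barrier(conf_t *conf)
{
	spin_lock_irq(&conf->resync_lock);
	if (conf->barrier) {
		conf->nr_waiting++;
		/* Sleep until conf->barrier is exactly 0.  A negative
		 * counter keeps "!conf->barrier" false forever, so the
		 * caller hangs in the "D" state. */
		wait_event_lock_irq(conf->wait_barrier, !conf->barrier,
				    conf->resync_lock, /* unplug */);
		conf->nr_waiting--;
	}
	conf->nr_pending++;
	spin_unlock_irq(&conf->resync_lock);
}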
The reason is that at the end of level_store, after calling pers->run,
we call mddev_resume(). With RAID10 this calls pers->quiesce(mddev, 0),
which in turn calls lower_barrier().
However, raise_barrier() had not been called on that 'conf' yet,
so conf->barrier becomes negative, which is bad.
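For context, the RAID10 quiesce hook of this era looked roughly like
the sketch below; mddev_resume() calls pers->quiesce(mddev, 0), which
lands in the "case 0" branch and its unmatched lower_barrier():

static void raid10_quiesce(mddev_t *mddev, int state)
{
	conf_t *conf = mddev->private;

	switch (state) {
	case 1:		/* suspend: block new I/O */
		raise_barrier(conf, 0);
		break;
	case 0:		/* resume: release the barrier */
		lower_barrier(conf);
		break;
	}
}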
This patch sets conf->barrier to 1 after the takeover operation, which
prevents the barrier from becoming negative when lower_barrier() is
called.
Signed-off-by: Krzysztof Wojcik
---
drivers/md/raid10.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 69b6595..3b607b2 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2463,11 +2463,13 @@ static void *raid10_takeover_raid0(mddev_t *mddev)
 	mddev->recovery_cp = MaxSector;
 
 	conf = setup_conf(mddev);
-	if (!IS_ERR(conf))
+	if (!IS_ERR(conf)) {
 		list_for_each_entry(rdev, &mddev->disks, same_set)
 			if (rdev->raid_disk >= 0)
 				rdev->new_raid_disk = rdev->raid_disk * 2;
-
+		conf->barrier = 1;
+	}
+
 	return conf;
 }
[PATCH 2/2] FIX: md: Prevent barrier from becoming negative
On 04.02.2011 14:18:35 by krzysztof.wojcik
In some situations the barrier counter may become negative: calling
lower_barrier() while the counter is already 0 leaves it negative.
This is a harmful state and may cause a process to hang when
wait_barrier() is called.
This patch adds a condition to the lower_barrier() function: the
barrier counter is decremented only if it has actually been raised.
This prevents the barrier variable from becoming negative.
Signed-off-by: Krzysztof Wojcik
---
drivers/md/raid1.c | 3 ++-
drivers/md/raid10.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a23ffa3..fa7077b 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -660,7 +660,8 @@ static void lower_barrier(conf_t *conf)
 	unsigned long flags;
 	BUG_ON(conf->barrier <= 0);
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier--;
+	if (conf->barrier > 0)
+		conf->barrier--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3b607b2..c9e46a9 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -727,7 +727,8 @@ static void lower_barrier(conf_t *conf)
 {
 	unsigned long flags;
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier--;
+	if (conf->barrier > 0)
+		conf->barrier--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
Re: [PATCH 0/2] FIX: Process hangs at wait_barrier
On 08.02.2011 01:50:30 by NeilBrown
On Fri, 04 Feb 2011 14:18:18 +0100 Krzysztof Wojcik wrote:
> These patches resolve a problem with processes hanging at wait_barrier()
> after a raid0->raid10 takeover.
> The first patch resolves this particular problem; the solution is
> similar to the RAID1 barrier implementation.
> The second is a proposal for general protection against the barrier
> counter becoming negative.
>
> ---
>
> Krzysztof Wojcik (2):
> FIX: md: process hangs at wait_barrier after 0->10 takeover
Applied, thanks.
> FIX: md: Prevent barrier from becoming negative
If we are ever trying to make 'barrier' negative, that is a bug somewhere.
So I would prefer:
	BUG_ON(conf->barrier <= 0);
	conf->barrier--;
NeilBrown
>
>
> drivers/md/raid1.c | 3 ++-
> drivers/md/raid10.c | 9 ++++++---
> 2 files changed, 8 insertions(+), 4 deletions(-)
>
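Applying that suggestion to the raid10 version would look roughly like
this (a sketch, not necessarily the committed code):

static void lower_barrier(conf_t *conf)
{
	unsigned long flags;
	/* An unmatched lower_barrier() is a bug; crash loudly instead
	 * of silently ignoring the underflow. */
	BUG_ON(conf->barrier <= 0);
	spin_lock_irqsave(&conf->resync_lock, flags);
	conf->barrier--;
	spin_unlock_irqrestore(&conf->resync_lock, flags);
	wake_up(&conf->wait_barrier);
}

Note that raid1.c's lower_barrier() already carries this BUG_ON (visible
in the context lines of patch 2/2), so the silent "if (conf->barrier > 0)"
guard added there can never fire without the BUG_ON tripping first.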