[PATCH] Fix: Sometimes mdmon throws core dump during reshape

[PATCH] Fix: Sometimes mdmon throws core dump during reshape

am 05.09.2011 12:39:55 von adam.kwolek

Problem was found during reshaping 2 volumes /raid0 and raid5/ in container.
Sometimes mdmon throws core dump due to NULL pointer exception.

Problem occurs in scenario:
- managemon: is about spare activation (degraded raid4 volume == raid0 under takeover)
- managemon: detect level change and signals monitor (manage_member() calls replace_array())
- monitor: detects transition raid4/5->raid0 and sets a->container to NULL
to indicate array deactivation
- managemon : continues his work and tries to activate spare (a->check_degraded is set).
NULL pointer is passed to metadata handler activate_spare()
Core dump is generated.

To resolve this situation managemon (after monitor kick) checks again
a->container pointer to learn if current array is not to be deactivated.

Signed-off-by: Adam Kwolek
---

managemon.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/managemon.c b/managemon.c
index d020f82..3540dac 100644
--- a/managemon.c
+++ b/managemon.c
@@ -475,6 +475,12 @@ static void manage_member(struct mdstat_ent *mdstat,
}
}

+ /* we are after monitor kick,
+ * so container field can be cleared - check it again
+ */
+ if (a->container == NULL)
+ return;
+
/* We don't check the array while any update is pending, as it
* might container a change (such as a spare assignment) which
* could affect our decisions.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Fix: Sometimes mdmon throws core dump during reshape

am 06.09.2011 21:09:54 von dan.j.williams

On Mon, Sep 5, 2011 at 3:39 AM, Adam Kwolek wro=
te:
> Problem was found during reshaping 2 volumes /raid0 and raid5/ in con=
tainer.
> Sometimes mdmon throws core dump due to NULL pointer exception.
>
> Problem occurs in scenario:
> - managemon: is about spare activation (degraded raid4 volume == =
raid0 under takeover)
> - managemon: detect level change and signals monitor (manage_member()=
calls replace_array())
> - monitor: detects transition raid4/5->raid0 and sets a->container to=
NULL
> =A0 =A0 =A0 =A0 =A0 to indicate array deactivation

Maybe I have lost track of the reshape implementation but I don't see
where the monitor sets ->container to NULL during a reshape? Do you
mean deactivate mdmon for the array after the reshape completes?

> - managemon : continues his work and tries to activate spare (a->chec=
k_degraded is set).
> =A0 =A0 =A0 =A0 =A0 =A0 =A0NULL pointer is passed to metadata handler=
activate_spare()
> =A0 =A0 =A0 =A0 =A0 =A0 =A0Core dump is generated.
>
> To resolve this situation managemon (after monitor kick) checks again
> a->container pointer to learn if current array is not to be deactivat=
ed.
[..]
> diff --git a/managemon.c b/managemon.c
> index d020f82..3540dac 100644
> --- a/managemon.c
> +++ b/managemon.c
> @@ -475,6 +475,12 @@ static void manage_member(struct mdstat_ent *mds=
tat,
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0}
> =A0 =A0 =A0 =A0}
>
> + =A0 =A0 =A0 /* we are after monitor kick,
> + =A0 =A0 =A0 =A0* so container field can be cleared - check it again
> + =A0 =A0 =A0 =A0*/
> + =A0 =A0 =A0 if (a->container == NULL)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 return;
> +

Isn't this still racy? Because we don't wait for the monitor to run
before proceeding.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" i=
n
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Fix: Sometimes mdmon throws core dump during reshape

am 07.09.2011 06:07:59 von NeilBrown

On Mon, 05 Sep 2011 12:39:55 +0200 Adam Kwolek wrote:

> Problem was found during reshaping 2 volumes /raid0 and raid5/ in container.
> Sometimes mdmon throws core dump due to NULL pointer exception.
>
> Problem occurs in scenario:
> - managemon: is about spare activation (degraded raid4 volume == raid0 under takeover)
> - managemon: detect level change and signals monitor (manage_member() calls replace_array())
> - monitor: detects transition raid4/5->raid0 and sets a->container to NULL
> to indicate array deactivation
> - managemon : continues his work and tries to activate spare (a->check_degraded is set).
> NULL pointer is passed to metadata handler activate_spare()
> Core dump is generated.
>
> To resolve this situation managemon (after monitor kick) checks again
> a->container pointer to learn if current array is not to be deactivated.

This looks like it might be the same bug as is fixed by
Lukasz Dorau
in
Subject: [PATCH] FIX: Mdmon crashes after changing RAID level from 1 to 0

Does that look likely?

Thanks,
NeilBrown


>
> Signed-off-by: Adam Kwolek
> ---
>
> managemon.c | 6 ++++++
> 1 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/managemon.c b/managemon.c
> index d020f82..3540dac 100644
> --- a/managemon.c
> +++ b/managemon.c
> @@ -475,6 +475,12 @@ static void manage_member(struct mdstat_ent *mdstat,
> }
> }
>
> + /* we are after monitor kick,
> + * so container field can be cleared - check it again
> + */
> + if (a->container == NULL)
> + return;
> +
> /* We don't check the array while any update is pending, as it
> * might container a change (such as a spare assignment) which
> * could affect our decisions.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

RE: [PATCH] Fix: Sometimes mdmon throws core dump during reshape

am 07.09.2011 08:25:33 von adam.kwolek

> -----Original Message-----
> From: Williams, Dan J [mailto:dan.j.williams@intel.com]
> Sent: Tuesday, September 06, 2011 9:10 PM
> To: Kwolek, Adam
> Cc: neilb@suse.de; linux-raid@vger.kernel.org; Ciechanowski, Ed;
> Neubauer, Wojciech
> Subject: Re: [PATCH] Fix: Sometimes mdmon throws core dump during
> reshape
>=20
> On Mon, Sep 5, 2011 at 3:39 AM, Adam Kwolek
> wrote:
> > Problem was found during reshaping 2 volumes /raid0 and raid5/ in
> container.
> > Sometimes mdmon throws core dump due to NULL pointer exception.
> >
> > Problem occurs in scenario:
> > - managemon: is about spare activation (degraded raid4 volume ===
raid0
> under takeover)
> > - managemon: detect level change and signals monitor (manage_member=
()
> calls replace_array())
> > - monitor: detects transition raid4/5->raid0 and sets a->container =
to
> NULL
> > =A0 =A0 =A0 =A0 =A0 to indicate array deactivation
>=20
> Maybe I have lost track of the reshape implementation but I don't see
> where the monitor sets ->container to NULL during a reshape? Do you
> mean deactivate mdmon for the array after the reshape completes?
>=20
> > - managemon : continues his work and tries to activate spare (a-
> >check_degraded is set).
> > =A0 =A0 =A0 =A0 =A0 =A0 =A0NULL pointer is passed to metadata handl=
er
> activate_spare()
> > =A0 =A0 =A0 =A0 =A0 =A0 =A0Core dump is generated.
> >
> > To resolve this situation managemon (after monitor kick) checks aga=
in
> > a->container pointer to learn if current array is not to be
> deactivated.

Yes, when takeover is used. From one hand mdmon tries to resolve takeov=
ered raid0 degradation "problem"
and backward takeover occurs meanwhile.

BR
Adam

> [..]
> > diff --git a/managemon.c b/managemon.c
> > index d020f82..3540dac 100644
> > --- a/managemon.c
> > +++ b/managemon.c
> > @@ -475,6 +475,12 @@ static void manage_member(struct mdstat_ent
> *mdstat,
> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0}
> > =A0 =A0 =A0 =A0}
> >
> > + =A0 =A0 =A0 /* we are after monitor kick,
> > + =A0 =A0 =A0 =A0* so container field can be cleared - check it aga=
in
> > + =A0 =A0 =A0 =A0*/
> > + =A0 =A0 =A0 if (a->container == NULL)
> > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 return;
> > +
>=20
> Isn't this still racy? Because we don't wait for the monitor to run
> before proceeding.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" i=
n
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

RE: [PATCH] Fix: Sometimes mdmon throws core dump during reshape

am 07.09.2011 08:36:12 von adam.kwolek

> -----Original Message-----
> From: NeilBrown [mailto:neilb@suse.de]
> Sent: Wednesday, September 07, 2011 6:08 AM
> To: Kwolek, Adam
> Cc: linux-raid@vger.kernel.org; Williams, Dan J; Ciechanowski, Ed;
> Neubauer, Wojciech
> Subject: Re: [PATCH] Fix: Sometimes mdmon throws core dump during
> reshape
>
> On Mon, 05 Sep 2011 12:39:55 +0200 Adam Kwolek
> wrote:
>
> > Problem was found during reshaping 2 volumes /raid0 and raid5/ in
> container.
> > Sometimes mdmon throws core dump due to NULL pointer exception.
> >
> > Problem occurs in scenario:
> > - managemon: is about spare activation (degraded raid4 volume == raid0
> under takeover)
> > - managemon: detect level change and signals monitor (manage_member()
> calls replace_array())
> > - monitor: detects transition raid4/5->raid0 and sets a->container to
> NULL
> > to indicate array deactivation
> > - managemon : continues his work and tries to activate spare (a-
> >check_degraded is set).
> > NULL pointer is passed to metadata handler
> activate_spare()
> > Core dump is generated.
> >
> > To resolve this situation managemon (after monitor kick) checks again
> > a->container pointer to learn if current array is not to be
> deactivated.
>
> This looks like it might be the same bug as is fixed by
> Lukasz Dorau
> in
> Subject: [PATCH] FIX: Mdmon crashes after changing RAID level from 1
> to 0
>
> Does that look likely?
>
> Thanks,
> NeilBrown

It is very rarely problem and I had got single reproduction only with applied patch pointed by you.
To completely solve this problem using Lukasze's patch only, new array monitoring deactivation
should be extended to every case. Container field should never be used for deactivation task.

Do you prefer such approach?


BR
Adam


>
>
> >
> > Signed-off-by: Adam Kwolek
> > ---
> >
> > managemon.c | 6 ++++++
> > 1 files changed, 6 insertions(+), 0 deletions(-)
> >
> > diff --git a/managemon.c b/managemon.c
> > index d020f82..3540dac 100644
> > --- a/managemon.c
> > +++ b/managemon.c
> > @@ -475,6 +475,12 @@ static void manage_member(struct mdstat_ent
> *mdstat,
> > }
> > }
> >
> > + /* we are after monitor kick,
> > + * so container field can be cleared - check it again
> > + */
> > + if (a->container == NULL)
> > + return;
> > +
> > /* We don't check the array while any update is pending, as it
> > * might container a change (such as a spare assignment) which
> > * could affect our decisions.
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-raid"
> in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html