[PATCH] md: do not write resync checkpoint, if max_sector has been reached.
on 27.01.2011 17:50:15 by Przemyslaw Czarnowski
If a disk fails during resync, the personality's sync service usually skips
the rest of the unsynchronized stripes. The sync thread (md_do_sync())
finishes and wakes up the main raid thread. md_check_recovery() then runs and
unregisters the sync thread.
In the meantime mdmon also services the failed disk: it removes it and
replaces it with a new one (if one is available).
If a checkpoint is stored (with the value of the array's max_sector), the
next md_check_recovery() will restart the resync. That resync finishes
normally and activates ALL spares (including the recently added one), which
is wrong. A further md_check_recovery() will not start recovery, as all
disks are in sync. If the checkpoint is not stored, the second resync does
not start and recovery can proceed.
Signed-off-by: Przemyslaw Czarnowski
---
drivers/md/md.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3e40aad..6eda858 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6929,7 +6929,8 @@ void md_do_sync(mddev_t *mddev)
 	if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) &&
 	    mddev->curr_resync > 2) {
 		if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
-			if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
+			if (test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
+			    mddev->curr_resync < max_sectors) {
 				if (mddev->curr_resync >= mddev->recovery_cp) {
 					printk(KERN_INFO
 					       "md: checkpointing %s of %s.\n",
Re: [PATCH] md: do not write resync checkpoint, if max_sector has been reached.
on 31.01.2011 03:45:57 by NeilBrown
On Thu, 27 Jan 2011 17:50:15 +0100 Przemyslaw Czarnowski wrote:
> If a disk fails during resync, the personality's sync service usually skips
> the rest of the unsynchronized stripes. The sync thread (md_do_sync())
> finishes and wakes up the main raid thread. md_check_recovery() then runs and
> unregisters the sync thread.
> In the meantime mdmon also services the failed disk: it removes it and
> replaces it with a new one (if one is available).
> If a checkpoint is stored (with the value of the array's max_sector), the
> next md_check_recovery() will restart the resync. That resync finishes
> normally and activates ALL spares (including the recently added one), which
> is wrong. A further md_check_recovery() will not start recovery, as all
> disks are in sync. If the checkpoint is not stored, the second resync does
> not start and recovery can proceed.
>
> Signed-off-by: Przemyslaw Czarnowski
> ---
> drivers/md/md.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 3e40aad..6eda858 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -6929,7 +6929,8 @@ void md_do_sync(mddev_t *mddev)
>  	if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) &&
>  	    mddev->curr_resync > 2) {
>  		if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
> -			if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
> +			if (test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
> +			    mddev->curr_resync < max_sectors) {
>  				if (mddev->curr_resync >= mddev->recovery_cp) {
>  					printk(KERN_INFO
>  					       "md: checkpointing %s of %s.\n",
>
This is wrong. If curr_resync has reached some value, then the array *is*
in-sync up to that point.
If a device fails then that often makes the array fully in-sync - because
there is no longer any room for inconsistency.
This is particularly true for RAID1. If one drive in a 2-drive RAID1 fails,
then the array instantly becomes in-sync.
For RAID5, we should arguably fail the array at that point rather than
marking it in-sync, but that would probably cause more data loss than it
avoids, so we don't.
In any case - the array is now in-sync.
If a spare is added by mdmon at this time, then the array is not 'out of
sync', it is 'in need of recovery'. 'recovery' and 'resync' are different
things.
md_check_recovery should run remove_and_add_spares at this point. That
should return a non-zero value (because it found the spare that mdmon added)
and should then start a recovery pass which will ignore recovery_cp (which
is a really badly chosen variable name - it should be 'resync_cp', not
'recovery_cp').
So if you are experiencing a problem where mdmon adds a spare and it appears
to get recovered instantly (which is what you seem to be saying), then the
problem is elsewhere.
If you can reproduce it, then it would help to put some tracing in
md_check_recovery, particularly reporting the return value of
remove_and_add_spares, and the value that is finally chosen for
mddev->recovery.
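A sketch of the kind of tracing Neil suggests (hypothetical debug printks
around the existing remove_and_add_spares() call in md_check_recovery();
untested, and the exact placement in md.c is an assumption):

	spares = remove_and_add_spares(mddev);
	printk(KERN_INFO "md: %s: remove_and_add_spares() returned %d\n",
	       mdname(mddev), spares);
	/* ... and once the action has been decided: */
	printk(KERN_INFO "md: %s: mddev->recovery is now 0x%lx\n",
	       mdname(mddev), mddev->recovery);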
Thanks,
NeilBrown
RE: [PATCH] md: do not write resync checkpoint, if max_sector has been reached.
on 31.01.2011 19:48:08 by Hawrylewicz Czarnowski, Przemyslaw
> -----Original Message-----
> From: NeilBrown [mailto:neilb@suse.de]
> Sent: Monday, January 31, 2011 3:46 AM
> To: Hawrylewicz Czarnowski, Przemyslaw
> Cc: linux-raid@vger.kernel.org; Neubauer, Wojciech; Williams, Dan J;
> Ciechanowski, Ed
> Subject: Re: [PATCH] md: do not write resync checkpoint, if max_sector has
> been reached.
>
> On Thu, 27 Jan 2011 17:50:15 +0100 Przemyslaw Czarnowski wrote:
> > [cut]
>
> This is wrong. If curr_resync has reached some value, then the array *is*
> in-sync up to that point.
>
> If a device fails then that often makes the array fully in-sync - because
> there is no longer any room for inconsistency.
> This is particularly true for RAID1. If one drive in a 2-drive RAID1
> fails,
> then the array instantly becomes in-sync.
> For RAID5, we should arguably fail the array at that point rather than
> marking it in-sync, but that would probably cause more data loss than it
> avoids, so we don't.
> In any case - the array is now in-sync.
Yes, I agree. But that is not the point of this bug.
>
> If a spare is added by mdmon at this time, then the array is not 'out of
> sync', it is 'in need of recovery'. 'recovery' and 'resync' are different
> things.
I fully understand the difference between recovery and resync (and reshape).
>
> md_check_recovery should run remove_and_add_spares at this point. That
And it does.
> should return a non-zero value (because it found the spare that mdmon
> added)
But the return value is wrong (or rather: it is correct for the configuration as it stands at that moment). Please let me explain once again what is going on.
The flow is as follows:
0. resync is in progress
1. one disk fails
2. md_error() wakes up the raid thread
3. md_do_sync() gets skipped=1 from mddev->pers->sync_request() together with some number of skipped sectors/stripes - usually everything remaining to resync. mddev->recovery_cp is set to the last sector (max_sectors in md_do_sync)
3a. md_check_recovery() sees MD_RECOVERY_INTR (clears it) and unregisters the sync thread (the one actually doing the resync)
3b. mdmon unblocks the array member
4. md_check_recovery() checks whether some action is required.
4a. reshape is not taken into account, as reshape_position==MaxSector
4b. recovery is not taken into account, as mdmon has not added the spare yet
4c. resync is started, as recovery_cp!=MaxSector (!)
5. md_do_sync() exits normally (gets skipped=1 from mddev->pers->sync_request()) as the checkpoint pointed at the last sector; it clears mddev->recovery_cp.
6. mdmon adds the disk (via slot_store())
7. md_check_recovery() does cleanup after the finished resync
7a. MD_RECOVERY_INTR is no longer set, so mddev->pers->spare_active() runs; ALL !In_sync devices present in the array are set In_sync, and the array's degraded count is cleared (!).
7b. remove_and_add_spares() does not see any spares to activate (mddev->degraded==0), so recovery does not start; see the sketch after this list.
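The sketch: remove_and_add_spares() of that era only hunts for spares to
activate while the array still counts as degraded, roughly as below (a
paraphrase from memory, not a verbatim quote of the source):

	/* 7a has just cleared mddev->degraded, so this block is skipped
	 * and the disk mdmon added in step 6 is never counted: */
	if (mddev->degraded) {
		list_for_each_entry(rdev, &mddev->disks, same_set)
			if (rdev->raid_disk < 0 &&
			    !test_bit(Faulty, &rdev->flags))
				spares++;	/* would schedule recovery */
	}
	/* spares stays 0, so MD_RECOVERY_RECOVER is never set */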
> and should then start a recovery pass which will ignore recovery_cp (which
> is a really badly chosen variable name - it should be 'resync_cp', not
> 'recovery_cp').
As you can see above, recovery_cp is not ignored (and yes, the name is confusing).
>
> So if you are experiencing a problem where mdmon adds a spare and it appears
> to get recovered instantly (which is what you seem to be saying), then the
To be precise, recovery does not start at all...
> problem is elsewhere.
> If you can reproduce it, then it would help to put some tracing in
> md_check_recovery, particularly reporting the return value of
> remove_and_add_spares, and the value that is finally chosen for
> mddev->recovery.
If you want some logs, I have plenty :) But I think my description will suffice to understand the problem.
>
> Thanks,
> NeilBrown
--
Best Regards,
Przemyslaw Hawrylewicz-Czarnowski
Re: [PATCH] md: do not write resync checkpoint, if max_sector has been reached.
on 01.02.2011 02:07:52 by NeilBrown
On Mon, 31 Jan 2011 18:48:08 +0000 "Hawrylewicz Czarnowski, Przemyslaw" wrote:
> [cut]
> But the return value is wrong (or rather: it is correct for the configuration as it stands at that moment). Please let me explain once again what is going on.
>
> The flow is as follows:
> 0. resync is in progress
> 1. one disk fails
> 2. md_error() wakes up the raid thread
> 3. md_do_sync() gets skipped=1 from mddev->pers->sync_request() together with some number of skipped sectors/stripes - usually everything remaining to resync. mddev->recovery_cp is set to the last sector (max_sectors in md_do_sync)
> 3a. md_check_recovery() sees MD_RECOVERY_INTR (clears it) and unregisters the sync thread (the one actually doing the resync)
> 3b. mdmon unblocks the array member
> 4. md_check_recovery() checks whether some action is required.
> 4a. reshape is not taken into account, as reshape_position==MaxSector
> 4b. recovery is not taken into account, as mdmon has not added the spare yet
> 4c. resync is started, as recovery_cp!=MaxSector (!)
> 5. md_do_sync() exits normally (gets skipped=1 from mddev->pers->sync_request()) as the checkpoint pointed at the last sector; it clears mddev->recovery_cp.
> 6. mdmon adds the disk (via slot_store())
> 7. md_check_recovery() does cleanup after the finished resync
> 7a. MD_RECOVERY_INTR is no longer set, so mddev->pers->spare_active() runs; ALL !In_sync devices present in the array are set In_sync, and the array's degraded count is cleared (!).
> 7b. remove_and_add_spares() does not see any spares to activate (mddev->degraded==0), so recovery does not start.
>
Thank you for this excellent problem description.
I think the mistake here is at step 6.
mdmon should not be trying to add a disk until the resync has completed.
In particular, mdmon shouldn't try, and md should not allow mdmon to succeed.
So slot_store should return -EBUSY if MD_RECOVERY_RUNNING is set.
mdmon needs to 'know' when a sync/etc is happening, and should avoid
processing ->check_degraded if it is.
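A minimal sketch of that md-side guard, near the top of slot_store() in
drivers/md/md.c (shown here in the corrected form that appears later in
this thread):

	/* Reject slot changes while a sync/recovery is running, so mdmon
	 * cannot slip a spare in mid-resync: */
	if (test_bit(MD_RECOVERY_RUNNING, &rdev->mddev->recovery))
		return -EBUSY;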
I have added the md fix to my for-next branch.
I might do the mdadm fix later.
Thanks,
NeilBrown
RE: [PATCH] md: do not write resync checkpoint, if max_sector has been reached.
on 02.02.2011 00:45:23 by Hawrylewicz Czarnowski, Przemyslaw
> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of NeilBrown
> Sent: Tuesday, February 01, 2011 2:08 AM
> To: Hawrylewicz Czarnowski, Przemyslaw
> Cc: linux-raid@vger.kernel.org; Neubauer, Wojciech; Williams, Dan J;
> Ciechanowski, Ed
> Subject: Re: [PATCH] md: do not write resync checkpoint, if max_sector has
> been reached.
>
[cut]
> Thank you for this excellent problem description.
>
> I think the mistake here is at step 6.
> mdmon should not be trying to add a disk until the resync has completed.
> In particular, mdmon shouldn't try, and md should not allow mdmon to
> succeed.
>
> So slot_store should return -EBUSY if MD_RECOVERY_RUNNING is set.
>
> mdmon needs to 'know' when a sync/etc is happening, and should avoid
> processing ->check_degraded if it is.
>
> I have added the md fix to my for-next branch.
I took a look at this change. It does not compile. It should be:

	if (test_bit(MD_RECOVERY_RUNNING, &rdev->mddev->recovery))
		return -EBUSY;
> I might do the mdadm fix later.
Now (without this kernel/mdmon handshaking) there is an infinite loop in the manager (it is quite well reproducible for IMSM raid5). I have noticed that the state is still "resync", mdmon continuously sends "-blocked", and the manager does not add the spare (because of the resync). I haven't found the root cause of why array_state is still "resync", but at first glance, after md_do_sync() finishes, md_check_recovery() does not start.
I will run some more tests tomorrow...
Przemek
>
> Thanks,
> NeilBrown
>
>
[cut]