[RFC] MD: Restart an interrupted incremental recovery.

[RFC] MD: Restart an interrupted incremental recovery.

am 12.10.2011 21:41:35 von Andrei Warkentin

If an incremental recovery was interrupted, a subsequent
re-add will result in a full recovery, even though an
incremental should be possible.

This patch handles 99% of possible interruptions of an
incremental recovery (it doesn't handle I/O failure
that occurs immediately after the hot_add_disk(), which
causes the added rdev to become a spare).

Cc: Neil Brown
Signed-off-by: Andrei Warkentin
---
drivers/md/md.c | 28 +++++++++++++++++++++++++---
include/linux/raid/md_p.h | 6 +++++-
2 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5404b22..8002e02 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1701,10 +1701,27 @@ static int super_1_validate(mddev_t *mddev, mdk_rdev_t *rdev)
break;
default:
if ((le32_to_cpu(sb->feature_map) &
- MD_FEATURE_RECOVERY_OFFSET))
- rdev->recovery_offset = le64_to_cpu(sb->recovery_offset);
- else
+ MD_FEATURE_IN_INCREMENTAL)) {
+ /* This rdev had been in an incremental
+ * recovery, which was interrupted, and
+ * can be restarted. Note that we specifically
+ * ignore recovery_offset, as I/O to the degraded
+ * array could have modified data in sectors
+ * below the recovery offset.
+ */
+ char b[BDEVNAME_SIZE];
set_bit(In_sync, &rdev->flags);
+ printk(KERN_NOTICE
+ "md: restarting interrupted incremental resync on %s\n",
+ bdevname(rdev->bdev, b));
+ } else {
+ if ((le32_to_cpu(sb->feature_map) &
+ MD_FEATURE_RECOVERY_OFFSET))
+ rdev->recovery_offset =
+ le64_to_cpu(sb->recovery_offset);
+ else
+ set_bit(In_sync, &rdev->flags);
+ }
rdev->raid_disk = role;
break;
}
@@ -1738,6 +1755,11 @@ static void super_1_sync(mddev_t *mddev, mdk_rdev_t *rdev)
else
sb->resync_offset = cpu_to_le64(0);

+ if (mddev->bitmap && rdev->saved_raid_disk != -1)
+ sb->feature_map |= cpu_to_le32(MD_FEATURE_IN_INCREMENTAL);
+ else
+ sb->feature_map &= ~cpu_to_le32(MD_FEATURE_IN_INCREMENTAL);
+
sb->cnt_corrected_read = cpu_to_le32(atomic_read(&rdev->corrected_errors));

sb->raid_disks = cpu_to_le32(mddev->raid_disks);
diff --git a/include/linux/raid/md_p.h b/include/linux/raid/md_p.h
index 9e65d9e..8a17446 100644
--- a/include/linux/raid/md_p.h
+++ b/include/linux/raid/md_p.h
@@ -277,7 +277,11 @@ struct mdp_superblock_1 {
*/
#define MD_FEATURE_RESHAPE_ACTIVE 4
#define MD_FEATURE_BAD_BLOCKS 8 /* badblock list is not empty */
+#define MD_FEATURE_IN_INCREMENTAL 16 /* set on disks undergoing incremental
+ * recovery, so that an interrupted
+ * incremental recovery may be restarted
+ */

-#define MD_FEATURE_ALL (1|2|4|8)
+#define MD_FEATURE_ALL (1|2|4|8|16)

#endif
--
1.7.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [RFC] MD: Restart an interrupted incremental recovery.

am 13.10.2011 01:04:57 von andrey.warkentin

2011/10/12 Andrei Warkentin :
> If an incremental recovery was interrupted, a subsequent
> re-add will result in a full recovery, even though an
> incremental should be possible.
>
> This patch handles 99% of possible interruptions of an
> incremental recovery (it doesn't handle I/O failure
> that occurs immediately after the hot_add_disk(), which
> causes the added rdev to become a spare).
>
> Cc: Neil Brown
> Signed-off-by: Andrei Warkentin
> ---

Here is an alternative idea, that doesn't involve messing with the
superblocks - don't write out SB for recovering device (undergoing
incremental) until array is not degraded.

A
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[RFC] MD: Allow restarting an interrupted incremental recovery.

am 13.10.2011 01:05:33 von Andrei Warkentin

If an incremental recovery was interrupted, a subsequent
re-add will result in a full recovery, even though an
incremental should be possible (seen with raid1).

Solve this problem by not updating the superblock on the
recovering device until array is not degraded any longer.

Cc: Neil Brown
Signed-off-by: Andrei Warkentin
---
drivers/md/md.c | 19 +++++++++++++------
drivers/md/md.h | 6 ++++++
2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5404b22..153b3c6 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2444,9 +2444,12 @@ repeat:
continue; /* no noise on spare devices */
if (test_bit(Faulty, &rdev->flags))
dprintk("(skipping faulty ");
+ else if (test_bit(InIncremental, &rdev->flags))
+ dprintk("(skipping incremental s/r ");

dprintk("%s ", bdevname(rdev->bdev,b));
- if (!test_bit(Faulty, &rdev->flags)) {
+ if (!test_bit(Faulty, &rdev->flags) &&
+ !test_bit(InIncremental, &rdev->flags)) {
md_super_write(mddev,rdev,
rdev->sb_start, rdev->sb_size,
rdev->sb_page);
@@ -5490,9 +5493,10 @@ static int add_new_disk(mddev_t * mddev, mdu_disk_info_t *info)
return -EINVAL;
}

- if (test_bit(In_sync, &rdev->flags))
+ if (test_bit(In_sync, &rdev->flags)) {
rdev->saved_raid_disk = rdev->raid_disk;
- else
+ set_bit(InIncremental, &rdev->flags);
+ } else
rdev->saved_raid_disk = -1;

clear_bit(In_sync, &rdev->flags); /* just to be sure */
@@ -7353,15 +7357,18 @@ static void reap_sync_thread(mddev_t *mddev)
if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
mddev->pers->finish_reshape)
mddev->pers->finish_reshape(mddev);
- md_update_sb(mddev, 1);

/* if array is no-longer degraded, then any saved_raid_disk
- * information must be scrapped
+ * information must be scrapped, and superblock for
+ * incrementally recovered device written out.
*/
if (!mddev->degraded)
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ list_for_each_entry(rdev, &mddev->disks, same_set) {
rdev->saved_raid_disk = -1;
+ clear_bit(InIncremental, &rdev->flags);
+ }

+ md_update_sb(mddev, 1);
clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1e586bb..5e5399a 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -104,6 +104,12 @@ struct mdk_rdev_s
* accurately as possible is good, but
* not absolutely critical.
*/
+#define InIncremental 12 /* Device is undergoing incremental
+ * recovery, hence its superblock
+ * is not written out until the recovery
+ * ends, allowing the later to be
+ * restared if interrupted.
+ */
wait_queue_head_t blocked_wait;

int desc_nr; /* descriptor index in the superblock */
--
1.7.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [RFC] MD: Restart an interrupted incremental recovery.

am 14.10.2011 03:07:05 von andrey.warkentin

2011/10/12 Andrei Warkentin :
> If an incremental recovery was interrupted, a subsequent
> re-add will result in a full recovery, even though an
> incremental should be possible.
>
> This patch handles 99% of possible interruptions of an
> incremental recovery (it doesn't handle I/O failure
> that occurs immediately after the hot_add_disk(), which
> causes the added rdev to become a spare).
>
> Cc: Neil Brown
> Signed-off-by: Andrei Warkentin

Abandoning since there is a better solution (other patch which doesn't touch
the recovering driver's sb while incremental recovery is going on).

A
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [RFC] MD: Allow restarting an interrupted incremental recovery.

am 14.10.2011 03:18:43 von andrey.warkentin

2011/10/12 Andrei Warkentin :
> If an incremental recovery was interrupted, a subsequent
> re-add will result in a full recovery, even though an
> incremental should be possible (seen with raid1).
>
> Solve this problem by not updating the superblock on the
> recovering device until array is not degraded any longer.
>
> Cc: Neil Brown
> Signed-off-by: Andrei Warkentin

FWIW it appears to me that this idea seems to work well, for the
following reasons:

1) The recovering sb is not touched until the array is not degraded
(only for incremental sync).
2) The events_cleared count isn't updated in the active bitmap sb
until array is not degraded. This implies that if the incremental was
interrupted, recovering_sb->events is NOT less than
active_bitmap->events_cleared).
3) The bitmaps (and sb) are updated on all drives at all times as it
were before.

How I tested it:
1) Create RAID1 array with bitmap.
2) Degrade array by removing a drive.
3) Write a bunch of data (Gigs...)
4) Re-add removed drive - an incremental recovery is started.
5) Interrupt the incremental.
6) Write some more data.
7) MD5sum the data.
8) Re-add removed drive - and incremental recovery is restarted (I
verified it starts at sec 0, just like you mentioned it should be, to
avoid consistency issues). Verified that, indeed, only changed blocks
(as noted by write-intent) are synced.
10) Remove other half.
11) MD5sum data - hashes match.

Without this fix, you would of course have to deal with a full resync
after the interrupted incremental.

Is there anything you think I'm missing here?

A
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [RFC] MD: Allow restarting an interrupted incremental recovery.

am 17.10.2011 05:20:39 von NeilBrown

--Sig_/yXSfnSIr2WlpWf93NQSXvQk
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable

On Thu, 13 Oct 2011 21:18:43 -0400 "Andrei E. Warkentin"
wrote:

> 2011/10/12 Andrei Warkentin :
> > If an incremental recovery was interrupted, a subsequent
> > re-add will result in a full recovery, even though an
> > incremental should be possible (seen with raid1).
> >
> > Solve this problem by not updating the superblock on the
> > recovering device until array is not degraded any longer.
> >
> > Cc: Neil Brown
> > Signed-off-by: Andrei Warkentin
>=20
> FWIW it appears to me that this idea seems to work well, for the
> following reasons:
>=20
> 1) The recovering sb is not touched until the array is not degraded
> (only for incremental sync).
> 2) The events_cleared count isn't updated in the active bitmap sb
> until array is not degraded. This implies that if the incremental was
> interrupted, recovering_sb->events is NOT less than
> active_bitmap->events_cleared).
> 3) The bitmaps (and sb) are updated on all drives at all times as it
> were before.
>=20
> How I tested it:
> 1) Create RAID1 array with bitmap.
> 2) Degrade array by removing a drive.
> 3) Write a bunch of data (Gigs...)
> 4) Re-add removed drive - an incremental recovery is started.
> 5) Interrupt the incremental.
> 6) Write some more data.
> 7) MD5sum the data.
> 8) Re-add removed drive - and incremental recovery is restarted (I
> verified it starts at sec 0, just like you mentioned it should be, to
> avoid consistency issues). Verified that, indeed, only changed blocks
> (as noted by write-intent) are synced.
> 10) Remove other half.
> 11) MD5sum data - hashes match.
>=20
> Without this fix, you would of course have to deal with a full resync
> after the interrupted incremental.
>=20
> Is there anything you think I'm missing here?
>=20
> A

Not much, it looks good, and your testing is of course a good sign.

My only thought is whether we really need the new InIncremental flag.
You set it exactly when saved_raid_disk is set, and clear it exactly when
saved_raid_disk is cleared (set to -1).
So maybe we can just used saved_raid_disk.
If you look at it that way, you might notice that saved_raid_disk is also s=
et
in slot_store, so probably InIncremental should be set there. So that might
be the one thing you missed.

Could you respin the patch without adding InIncremental, and testing=20
rdev->saved_raid_disk >=3D 0
instead, check if you agree that should work, and perform a similar test?
(Is that asking too much?).

If you agree that works I would like to go with that version.

Thanks,
NeilBrown


--Sig_/yXSfnSIr2WlpWf93NQSXvQk
Content-Type: application/pgp-signature; name=signature.asc
Content-Disposition: attachment; filename=signature.asc

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)

iQIVAwUBTpufBznsnt1WYoG5AQIjEg//Uhz8hAmcqQG2tlH+Y5e8JBFlgnkv YUTo
hEWiDSGvs+p1r1LOS7V321FwpaRTHXJ2KAAIG/wgkQuyCY9flaRV2IIlPJvT BVLY
UjGpJnZqu9Ga2lVPCisG2ou/1eF58bMbK/YfuRqwm4Dbb1iaJ1cN96+XthNQ wLy2
9RGwpsSyrerE0mfwJISEv7eVrW44ekgW875SC1dc+fj+2kCQGwIVJQ0jQwh4 QiYW
xi2Bo1ZVBH+YJ+WTAhTjh9twrFjGOoEyh/Tf0uh8waj9lx+OOv0kdG3PGxH9 TvOF
SBR4qDXmtINlnKuIIlVN+uVHfAXoW87m8N2N390wAMZdDqW0iai4nY9jI3IT v8gK
gCfBStIOJ11sCDcQOsw7MYNj8z6HbO63AjXatfF4NCKcpZEoHw2tI20v0S3u yrlO
Kjw7BGIoexSW+psNLEKsxMtFj1znNllYYnydkrm8FQVhhCh0nhhSxm3BbDS7 40Pr
msu/x/PxbFB0vEph/jBXiwW7UStPfiB+0r2gINgIHFATOz2fTfzJE2Yww80g KTQ2
znyBIfa4RHLDuf1rc//vceBr7AWp0gvhOQDBkJ448NkH0WYqJerDWe2+K88j ZxkS
it1f41lRMGoi9llb2rJ/YsKXPpnnwfz768C2jASKLb/fBc648GCWH+lJZK2U Jy9r
s7cSDYVFprU=
=zpIN
-----END PGP SIGNATURE-----

--Sig_/yXSfnSIr2WlpWf93NQSXvQk--
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [RFC] MD: Allow restarting an interrupted incremental recovery.

am 17.10.2011 19:26:44 von Andrei Warkentin

Hi Neil,

----- Original Message -----
> From: "NeilBrown"
> To: "Andrei E. Warkentin"
> Cc: "Andrei Warkentin" , linux-raid@vger.kernel.org
> Sent: Sunday, October 16, 2011 11:20:39 PM
> Subject: Re: [RFC] MD: Allow restarting an interrupted incremental recovery.
>
> On Thu, 13 Oct 2011 21:18:43 -0400 "Andrei E. Warkentin"
> wrote:
>
> > 2011/10/12 Andrei Warkentin :
> > > If an incremental recovery was interrupted, a subsequent
> > > re-add will result in a full recovery, even though an
> > > incremental should be possible (seen with raid1).
> > >
> > > Solve this problem by not updating the superblock on the
> > > recovering device until array is not degraded any longer.
> > >
> > > Cc: Neil Brown
> > > Signed-off-by: Andrei Warkentin
> >
> > FWIW it appears to me that this idea seems to work well, for the
> > following reasons:
> >
> > 1) The recovering sb is not touched until the array is not degraded
> > (only for incremental sync).
> > 2) The events_cleared count isn't updated in the active bitmap sb
> > until array is not degraded. This implies that if the incremental
> > was
> > interrupted, recovering_sb->events is NOT less than
> > active_bitmap->events_cleared).
> > 3) The bitmaps (and sb) are updated on all drives at all times as
> > it
> > were before.
> >
> > How I tested it:
> > 1) Create RAID1 array with bitmap.
> > 2) Degrade array by removing a drive.
> > 3) Write a bunch of data (Gigs...)
> > 4) Re-add removed drive - an incremental recovery is started.
> > 5) Interrupt the incremental.
> > 6) Write some more data.
> > 7) MD5sum the data.
> > 8) Re-add removed drive - and incremental recovery is restarted (I
> > verified it starts at sec 0, just like you mentioned it should be,
> > to
> > avoid consistency issues). Verified that, indeed, only changed
> > blocks
> > (as noted by write-intent) are synced.
> > 10) Remove other half.
> > 11) MD5sum data - hashes match.
> >
> > Without this fix, you would of course have to deal with a full
> > resync
> > after the interrupted incremental.
> >
> > Is there anything you think I'm missing here?
> >
> > A
>
> Not much, it looks good, and your testing is of course a good sign.
>
> My only thought is whether we really need the new InIncremental flag.
> You set it exactly when saved_raid_disk is set, and clear it exactly
> when
> saved_raid_disk is cleared (set to -1).
> So maybe we can just used saved_raid_disk.

You're right, I looked at the code paths again (for non-bitmapped disks), and there is
no reason not to leverage saved_raid_disk. It should be safe (i.e. not missing necessary sb fluesh)
for non-bitmapped disks - a disk resuming recovery (from rec_offset) will never have In_sync, and raid_saved_disk
will not be set. For multipath this is irrelevant as well.

> If you look at it that way, you might notice that saved_raid_disk is
> also set
> in slot_store, so probably InIncremental should be set there. So
> that might
> be the one thing you missed.
>

Actually, maybe you can clarify something a bit about that code? The part where
slot != -1 and mddev->pers != NULL looks a lot like the add_new_disk path - except:

After pers->hot_add_disk:
1) rdev In_sync isn't cleared
2) Array isn't checked for degraded state and MD_RECOVERY_RECOVER isn't conditionally set.
3) MD_RECOVERY_NEEDED isn't set.
4) mddev->thread is't woken up.

Is this because if an array was degraded AND there were extra spares, they would already be
assigned to the array?

> Could you respin the patch without adding InIncremental, and testing
> rdev->saved_raid_disk >= 0
> instead, check if you agree that should work, and perform a similar
> test?
> (Is that asking too much?).
>

Absolutely! Thanks again!

A
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [RFC] MD: Allow restarting an interrupted incremental recovery.

am 17.10.2011 22:58:54 von NeilBrown

--Sig_/42eCC_37AWdTWY91GIUJxIi
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable

On Mon, 17 Oct 2011 10:26:44 -0700 (PDT) Andrei Warkentin
wrote:

> Hi Neil,

> > If you look at it that way, you might notice that saved_raid_disk is
> > also set
> > in slot_store, so probably InIncremental should be set there. So
> > that might
> > be the one thing you missed.
> >
>=20
> Actually, maybe you can clarify something a bit about that code? The part=
where
> slot !=3D -1 and mddev->pers !=3D NULL looks a lot like the add_new_disk =
path - except:
>=20
> After pers->hot_add_disk:
> 1) rdev In_sync isn't cleared

That looks like a bug. I don't think I've ever tested re-adding a device by
setting 'slot' - only adding. And for adding, In_sync is already clear.
Thanks for spotting that.

> 2) Array isn't checked for degraded state and MD_RECOVERY_RECOVER isn't c=
onditionally set.

I think that isn't really needed in add_new_disk.
md_check_recovery will set it before recovery starts if there are spares in
the array.
So I'm inclined to just remove it from add_new_disk, but I feel that I need
to read the code a bit more carefully first.

In any case, the next point makes the omission harmless even if
MD_RECOVERY_RECOVER is needed there for some reason.


> 3) MD_RECOVERY_NEEDED isn't set.


That might be deliberate. It allows a few drives to be added, then recovery
started by writing "recover" to sync_action.
The same could be achieved by writing "frozen" to sync_action first but I
probably wrote this code before I added "frozen".


> 4) mddev->thread is't woken up.

That goes with 3. we only wake up the thread so that it notices
MD_RECOVERY_NEEDED.


>=20
> Is this because if an array was degraded AND there were extra spares, the=
y would already be
> assigned to the array?

Probably, yes.

Thinking a bit more, you would only get to set 'slot' on a spare in an acti=
ve
array if you had first 'frozen' sync_action .. so some of my comments above
might be a bit wrong.
And given that you have frozen sync_action, you really don't want to set
MD_RECOVERY_NEEDED or start the thread.


Thanks,
NeilBrown



> =20
> > Could you respin the patch without adding InIncremental, and testing
> > rdev->saved_raid_disk >=3D 0
> > instead, check if you agree that should work, and perform a similar
> > test?
> > (Is that asking too much?).
> >=20
>=20
> Absolutely! Thanks again!
>=20
> A


--Sig_/42eCC_37AWdTWY91GIUJxIi
Content-Type: application/pgp-signature; name=signature.asc
Content-Disposition: attachment; filename=signature.asc

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)

iQIVAwUBTpyXFDnsnt1WYoG5AQJlrA//UYGrWSufN+/ifEgsVjfbIQ/bunv2 X/wm
0vzYkTlH7KajeoBZaoSUD05aRuD7KZXEoW8HDCbUI8C2DPYTAIo3AgVVNaE3 EBCF
Tl7+iQTRNQANRrO7KtDfchTvuDtdqrW2Qv22bsNDKpx2bK3LJdFftuIWmFwu CefO
76CGRdt8tC/3fz7mhiS8SMThQXvMWImPsmdSTQMRKWRp7zGGueOcuzbyrVyM /hmi
pc8wvNy9YP+Bc15+TWo+5wnrIxezZ/SueuknbbvMZbkq5VHoE/EFpGrRDsay +4TT
h5/tMulVlP1tqu2+VUPmXtYl6f0yHLw+ORuXD+sJXiJiR5UQ0qqyqfxln0V5 IacJ
O3KR0AQizZOJaEIni5VGaS3eqcaAhZOPrKRADOrE2oVYQ687ob+Nx/YwMtE7 qjsK
EE7cX+CgEcJOEgUNR9RQiSjjAmYvQNCukWS/94bX74VsxuQPQRYzt+CL8qBU zbeh
g9DdTAkj8u280Z/rYHHYHjAErwkZ9qQEcByteDD195a8gKI2aOzDiZOsCcU4 gp9k
MVT1RJCs+oZ9IwRrAJ6AzXt4FRFNg4tVedK8dTtp0rbK0Z3xIRI/Jn+iOz/q 36YP
zD3hWFummecgcGlo3BXpiRl1phWxRkuW0g7jk7VEvwdTeB1AJFexCraZrDYq z5Oe
LnM3F6pYh7M=
=wTW0
-----END PGP SIGNATURE-----

--Sig_/42eCC_37AWdTWY91GIUJxIi--
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html