1.X metadata: Resuming an interrupted incremental recovery for RAID1.

On 12.10.2011 02:45:25 by andrey.warkentin

Hi group, Neil,

I've seen the following behavior -
1) Create a RAID1 array with two devices and an internal bitmap.
2) Degrade the array.
3) Write data to the array.
4) Re-add the removed member - this starts an incremental recovery.
5) Interrupt the recovery (cause an I/O failure on the just re-added
disk) - the array is degraded again.
6) Re-add the removed member - this starts a full recovery.

If I understand correctly, the choice between incremental and full
recovery is based on the In_Sync bit, and for the two possible states
of an interrupted recovery, namely an "active but recovering" disk
(with a role) and "a spare prior to role assignment" (i.e. before
remove_and_add_spares is run, I think), the In_Sync bit is never set.

It seems like it should be safe enough to resume an incremental
recovery from where it left off; after all, the write-intent bitmap
will still reflect the unsynchronized data, right?

How about something like the following? (A rough sketch follows the
list.)

1) Add another SB feature - MD_FEATURE_IN_RECOVERY.
2) MD_FEATURE_IN_RECOVERY is set in super_1_sync if
rdev->saved_raid_disk != -1 and mddev->bitmap is set.
3) MD_FEATURE_IN_RECOVERY is unset in super_1_sync otherwise.
4) If MD_FEATURE_IN_RECOVERY is set for the 'default' case in
super_1_validate, set the In_Sync bit, causing an incremental
recovery to happen.

The above handles 99% of cases (as far as I have tested).

The only case left is dealing with the 'spare' transition - in which
case I also need to remember rdev->saved_raid_disk someplace
in the superblock (and restore raid_disk and the In_Sync bit in
super_1_validate as well). If I understand correctly,
sb->resync_offset is a
safe place, since it's disregarded for a bitmapped rdev.

What do you think? Am I missing something or is there a better way of
achieving what I am trying to do? I am basically trying to ensure that
if an rdev went away during incremental recovery, then incremental
recovery will resume when it is re-added. This will not affect adding a
'clean' spare (it will still cause a full recovery).

--
A

Re: 1.X metadata: Resuming an interrupted incremental recovery for RAID1.

On 12.10.2011 05:05:17 by NeilBrown


On Tue, 11 Oct 2011 20:45:25 -0400 "Andrei E. Warkentin" wrote:

> Hi group, Neil,
>
> I've seen the following behavior -
> 1) Create a RAID1 array with two devices and an internal bitmap.
> 2) Degrade the array.
> 3) Write data to the array.
> 4) Re-add the removed member - this starts an incremental recovery.
> 5) Interrupt the recovery (cause an I/O failure on the just re-added
> disk) - the array is degraded again.
> 6) Re-add the removed member - this starts a full recovery.

Yeh, it probably shouldn't do that.


>
> If I understand correctly, the choice between incremental and full
> recovery is based on the In_Sync bit, and for the two possible states
> of an interrupted recovery, namely an "active but recovering" disk
> (with a role) and "a spare prior to role assignment" (i.e. before
> remove_and_add_spares is run, I think), the In_Sync bit is never set.

Something like that ... though of course there are lots more horrible details.


>
> It seems like it should be safe enough to resume an incremental
> recovery from where it left off; after all, the write-intent bitmap
> will still reflect the unsynchronized data, right?

Yes, it should.

>
> How about something like the following?
>
> 1) Add another SB feature - MD_FEATURE_IN_RECOVERY.
> 2) MD_FEATURE_IN_RECOVERY is set in super_1_sync if
> rdev->saved_raid_disk != -1 and mddev->bitmap is set.
> 3) MD_FEATURE_IN_RECOVERY is unset in super_1_sync otherwise.
> 4) If MD_FEATURE_IN_RECOVERY is set for the 'default' case in
> super_1_validate, set the In_Sync bit, causing an incremental
> recovery to happen.

We probably do need another flag somewhere, as we have two different states
overlapping.
The first time you re-added a device, its recovery_offset was MAX and its
event_count was old, but not older than the bitmap.
So we could do a bitmap-based recovery.

When it was re-added we would have updated the metadata to have a newer event
count, but a lower recovery_offset. So it is no longer clear from the
metadata that a bitmap-based recovery is allowed.
We could leave the old metadata, but then the bitmap-based resync would
restart from the beginning.
You want the bitmap-based resync to restart from recovery_offset, which is a
new state. So maybe we do need a new flag.

>
> The above handles 99% of cases (as far as I have tested).

That's not bad.

>
> The only case left is dealing with the 'spare' transition - in which
> case I also need to remember rdev->saved_raid_disk someplace
> in the superblock (and restore raid_disk and the In_Sync bit in
> super_1_validate as well). If I understand correctly,
> sb->resync_offset is a
> safe place, since it's disregarded for a bitmapped rdev.

You've lost me ... and I was coping so well too!

What "spare transition". I cannot see a point in the process where the
metadata would get marked as being a spare...


>
> What do you think? Am I missing something or is there a better way of
> achieving what I am trying to do? I am basically trying to ensure that
> if an rdev went away during incremental recovery, then incremental
> recovery will resume when it is re-added. This will not affect adding a
> 'clean' spare (it will still cause a full recovery).
>



I think that if a spare being bitmap-recovered fails and then gets re-added,
then we have to start the bitmap-recovery from the beginning, i.e. we cannot
use the recovery_offset. This is because it is entirely possible that while
the device was missing the second time, a write went to an address before
recovery_offset and so there is a new bit, which the recovery has to handle.

So the correct thing to do is to *not* update the metadata on the recovering
device until recovery completes. Then if it fails and is re-added, it will
look just the same as when it was re-added the first time, and will do a
bitmap-based recovery.
Only when the bitmap-based recovery finishes should the metadata be updated
with a new event count, and then the bitmap can start forgetting those old
bits.
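
Something along these lines in md_update_sb(), very roughly (untested,
purely to illustrate the idea - the exact condition would need care):

	list_for_each_entry(rdev, &mddev->disks, same_set) {
		if (rdev->raid_disk >= 0 &&
		    rdev->saved_raid_disk >= 0 &&
		    !test_bit(In_sync, &rdev->flags)) {
			/* Still doing a bitmap-based recovery: leave its
			 * superblock (and so its event count) untouched,
			 * so a failure here looks just like the original
			 * re-add. */
			continue;
		}
		/* ... normal path: refresh this rdev's sb and mark it
		 * for writing ... */
	}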

Credible?


NeilBrown



Re: 1.X metadata: Resuming an interrupted incremental recovery for RAID1.

On 12.10.2011 08:58:00 by andrey.warkentin

Hi Neil,

2011/10/11 NeilBrown:
>> 1) Add another SB feature - MD_FEATURE_IN_RECOVERY.
>> 2) MD_FEATURE_IN_RECOVERY is set in super_1_sync if
>> rdev->saved_raid_disk != -1 and mddev->bitmap is set.
>> 3) MD_FEATURE_IN_RECOVERY is unset in super_1_sync otherwise.
>> 4) If MD_FEATURE_IN_RECOVERY is set for the 'default' case in
>> super_1_validate, set the In_Sync bit, causing an incremental
>> recovery to happen.
>
> We probably do need another flag somewhere, as we have two different states
> overlapping.
> The first time you re-added a device, its recovery_offset was MAX and its
> event_count was old, but not older than the bitmap.
> So we could do a bitmap-based recovery.
>
> When it was re-added we would have updated the metadata to have a newer event
> count, but a lower recovery_offset. So it is no longer clear from the
> metadata that a bitmap-based recovery is allowed.
> We could leave the old metadata, but then the bitmap-based resync would
> restart from the beginning.
> You want the bitmap-based resync to restart from recovery_offset, which is a
> new state. So maybe we do need a new flag.
>

I think for the 99% case (i.e. a recovery actually started, and was
interrupted), we can get away with just an additional feature flag. The
remaining case I'll describe in detail further below.

>> The only case left is dealing with the 'spare' transition - in which
>> case I also need to remember rdev->saved_raid_disk someplace
>> in the superblock (and restore raid_disk and the In_Sync bit in
>> super_1_validate as well). If I understand correctly,
>> sb->resync_offset is a
>> safe place, since it's disregarded for a bitmapped rdev.
>
> You've lost me ... and I was coping so well too!
>
> What "spare transition"? I cannot see a point in the process where the
> metadata would get marked as being a spare...

In md.c:add_new_disk(), you have code like this -
...
	rdev->raid_disk = -1; // <---- make me a spare
	err = bind_rdev_to_array(rdev, mddev);
	if (!err && !mddev->pers->hot_remove_disk) {
	...
	}
	if (err)
		export_rdev(rdev);
	else
		sysfs_notify_dirent_safe(rdev->sysfs_state);
	md_update_sb(mddev, 1); // <----- now a spare
...

And I have verified (by injecting an -EIO after the first SB update,
with SystemTap) that after that md_update_sb, the superblock metadata
records the rdev as a spare.

I didn't trace it diligently enough, but I would guess that the
transition of the SB metadata to a 'role' (active, but recovering)
happens when md_check_recovery() calls remove_and_add_spares(), which
calls pers->hot_add_disk in the degraded path. Does that sound right?

If you have an I/O error due to the drive going away sometime during
the SB update *after* that path in add_new_disk (i.e. while the rdev is
a spare), then re-adding will cause a full sync, because the logic will
have no idea that the drive belonged to the array before. Now, this is
a pretty unlikely occurrence (compared to the chance of, say, a network
issue while actually recovering 2TB of data), but it can still happen.

The first thing that came to mind was persisting rdev->saved_raid_disk
in the SB, and since I didn't want to add a new field, stuffing it into
something that isn't used during incremental recovery. I believe
sb->resync_offset has that property. It would be better to have a
separate field for this, of course. I thought about having some runtime
flag (like In_Incremental or something), but the point is that, beyond
raid1, raid5.c and raid10.c need the actual saved raid role to decide
whether a full resync or an incremental one is used. I haven't tried,
but I would guess both raid5.c and raid10.c could be affected as well.
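
To make that concrete, the kind of thing I mean (completely untested;
reusing sb->resync_offset as scratch space is exactly the part I am not
sure about, and a dedicated field would be cleaner):

	/* super_1_sync(): bitmapped rdev that was re-added but is still
	 * a spare (no role assigned yet) - stash the old role. */
	if (mddev->bitmap && rdev->raid_disk < 0 &&
	    rdev->saved_raid_disk != -1) {
		sb->feature_map |= cpu_to_le32(MD_FEATURE_IN_RECOVERY);
		sb->resync_offset = cpu_to_le64(rdev->saved_raid_disk);
	}

	/* super_1_validate(), 'spare' case: restore the old role and mark
	 * the member in-sync, so the eventual hot-add does an incremental
	 * recovery rather than treating this as a brand-new spare. */
	if (le32_to_cpu(sb->feature_map) & MD_FEATURE_IN_RECOVERY) {
		rdev->raid_disk = (int)le64_to_cpu(sb->resync_offset);
		set_bit(In_sync, &rdev->flags);
	}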

>
>
> I think that if a spare being bitmap-recovered fails and then gets re-added,
> then we have to start the bitmap-recovery from the beginning, i.e. we cannot
> use the recovery_offset. This is because it is entirely possible that while
> the device was missing the second time, a write went to an address before
> recovery_offset and so there is a new bit, which the recovery has to handle.
>
> So the correct thing to do is to *not* update the metadata on the recovering
> device until recovery completes. Then if it fails and is re-added, it will
> look just the same as when it was re-added the first time, and will do a
> bitmap-based recovery.
> Only when the bitmap-based recovery finishes should the metadata be updated
> with a new event count, and then the bitmap can start forgetting those old
> bits.
>
> Credible?
>

I think you are spot on. If MD_FEATURE_IN_RECOVERY is set in the SB,
the recovery_offset should be disregarded and recovery should start at
sector zero. In fact, I've verified that in the *current* code, the
full recovery that follows the interrupted incremental actually starts
at recovery_offset, so it's broken right now.

I think the bitmap can surely forget the bits as the chunks are synced
across. After all, if the device is missing after the failed
incremental, new writes would be reflected in the write-intent bitmap,
and thus eventually make it across, no? (as long as recovery_offset is
0, of course)

Anyway, thanks for thinking about this. I hope I am not being too
loopy; I am almost positive I missed at least one of the million corner
cases here =P.

I'll submit the patch for the 99% case (which just involves setting and
clearing MD_FEATURE_IN_RECOVERY) tomorrow for review.

A

Re: 1.X metadata: Resuming an interrupted incremental recovery for RAID1.

On 12.10.2011 18:15:01 by andrey.warkentin

2011/10/12 Andrei E. Warkentin:
>>
>>
>> I think that if a spare being bitmap-recovered fails and then gets re-added,
>> then we have to start the bitmap-recovery from the beginning, i.e. we cannot
>> use the recovery_offset. This is because it is entirely possible that while
>> the device was missing the second time, a write went to an address before
>> recovery_offset and so there is a new bit, which the recovery has to handle.
>>
>> So the correct thing to do is to *not* update the metadata on the recovering
>> device until recovery completes. Then if it fails and is re-added, it will
>> look just the same as when it was re-added the first time, and will do a
>> bitmap-based recovery.
>> Only when the bitmap-based recovery finishes should the metadata be updated
>> with a new event count, and then the bitmap can start forgetting those old
>> bits.
>>
>> Credible?
>>
>
> I think you are spot on. If MD_FEATURE_IN_RECOVERY is set in the SB,
> the recovery_offset should be disregarded and recovery should start at
> sector zero. In fact, I've verified that in the *current* code, the
> full recovery that follows the interrupted incremental actually starts
> at recovery_offset, so it's broken right now.
>

Actually, doesn't this imply that recovery_offset should never be used
during a recovery (full or incremental) - after all,
you could have had I/O to the degraded array?

A