How to implement raid1 repair
on 17.03.2011 15:46:43 by Jan Schmidt
Hello everyone,
Currently, btrfs has its own raid1 but no repair mechanism for bad
checksums or EIOs. While trying to implement such a repair mechanism,
several more or less general questions came up.
There are two different retry paths for data and metadata. (If you know
or don't care how btrfs handles read errors: goto questions)
The data path: btrfs_io_failed_hook is called for each failed bio (EIO
or checksum error). Currently, it does not know which mirror failed at
first, because normally btrfs_map_block is called with mirror_num=0,
leading to a path where find_live_mirror picks one of them. The error
recovery strategy is then to explicitly read available mirrors one after
the other until one succeeds. In case the very first read picked mirror
1 and failed, the retry code will most likely fail at mirror 1 as well.
It would be nice to know which mirror was picked formerly and directly
try the other.
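As a minimal sketch of that idea (stand-in types and a hypothetical helper, not the actual btrfs API), a retry that knows the failed mirror could simply skip it:

/* Sketch only: stand-in types and a hypothetical helper, not the btrfs API. */
typedef unsigned long long u64;

struct read_ctx {
        u64 logical;        /* logical address of the failed block      */
        int failed_mirror;  /* mirror the first (failed) read came from */
        int num_copies;     /* how many mirrors exist for this block    */
};

/* Assumed to exist for the sketch: reads one copy, returns 0 on success. */
int submit_read_from_mirror(u64 logical, int mirror);

/* Try every mirror except the one that is already known to have failed. */
static int retry_other_mirrors(struct read_ctx *ctx)
{
        int mirror;

        for (mirror = 1; mirror <= ctx->num_copies; mirror++) {
                if (mirror == ctx->failed_mirror)
                        continue;               /* skip the known-bad copy */
                if (submit_read_from_mirror(ctx->logical, mirror) == 0)
                        return mirror;          /* good copy found */
        }
        return -1;                              /* all copies failed */
}

Whether the failed-mirror number is carried in the bio's private data or in a separate per-IO struct is an implementation detail; the point is only that the retry loop can skip the copy that already failed.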
The metadata path: there is no failure hook, instead there is a loop in
btree_read_extent_buffer_pages, also starting off at mirror_num=0, which
again leaves the decision to find_live_mirror. If there is an error for
any page to be read, the same retry strategy is used as is in the data
path. This obviously might leave you alone with unreadable data
(consider page x is bad on mirror 1 and page x+1 is bad on mirror 2,
both belonging to the same extent, you lose). It would be nice to have a
mechanism at a lower level issuing page-sized retries. Of course,
knowing which mirror is bad before trying mirror 1 again is desirable as
well.
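To make the page-sized retry idea concrete, here is a rough sketch (hypothetical helpers, not the current btree code) of a read loop that falls back to trying each page on each mirror individually:

/* Sketch only: per-page retries across mirrors, all helpers hypothetical. */
typedef unsigned long long u64;

/* Assumed helpers: read one page from one mirror (0 on success); count mirrors. */
int read_one_page(u64 eb_start, unsigned int page_index, int mirror);
int num_mirrors(u64 eb_start);

/*
 * Read an extent buffer of nr_pages pages, allowing every page to be
 * satisfied by a different mirror: page x may come from mirror 2 while
 * page x+1 comes from mirror 1.  Returns 0 if each page was readable
 * from some copy, -1 if any page is bad on all mirrors.
 */
static int read_eb_pages_any_mirror(u64 eb_start, unsigned int nr_pages)
{
        unsigned int i;

        for (i = 0; i < nr_pages; i++) {
                int mirror, ok = 0;

                for (mirror = 1; mirror <= num_mirrors(eb_start); mirror++) {
                        if (read_one_page(eb_start, i, mirror) == 0) {
                                ok = 1;
                                break;
                        }
                }
                if (!ok)
                        return -1;      /* page i unreadable on every mirror */
        }
        return 0;
}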
questions:
I have a raid1 repair solution in mind (partially coded) for btrfs that
can be implemented quite easily. However, I have some misgivings. All of
the following questions would need a "yes" for my solution to stand:
- Is it acceptable to retry reading a block immediately after the disk
said it won't work? Or in case of a successful read followed by a
checksum error? (Which is already being done right now in btrfs.)
- Is it acceptable to always write both mirrors if one is found to be
bad (also consider ssds)?
If either of the answers is "no", tracking where the initial read came
from seems inevitable. Tracking would be very easy if bios came back
with unmodified values in bi_bdev and bi_sector, which is not the case.
Thanks,
Jan
Re: How to implement raid1 repair
on 17.03.2011 18:19:11 by Josef Bacik
On Thu, Mar 17, 2011 at 03:46:43PM +0100, Jan Schmidt wrote:
> Hello everyone,
>
> Currently, btrfs has its own raid1 but no repair mechanism for bad
> checksums or EIOs. While trying to implement such a repair mechanism,
> several more or less general questions came up.
>
> There are two different retry paths for data and metadata. (If you know
> or don't care how btrfs handles read errors: goto questions)
>
> The data path: btrfs_io_failed_hook is called for each failed bio (EIO
> or checksum error). Currently, it does not know which mirror failed at
> first, because normally btrfs_map_block is called with mirror_num=0,
> leading to a path where find_live_mirror picks one of them. The error
> recovery strategy is then to explicitly read available mirrors one after
> the other until one succeeds. In case the very first read picked mirror
> 1 and failed, the retry code will most likely fail at mirror 1 as well.
> It would be nice to know which mirror was picked formerly and directly
> try the other.
>
So why not add a new field to the btrfs_multi_bio that tells you which mirror
we're acting on so that the endio stuff can know? That should be relatively
simple to accomplish.
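Something along these lines, sketched with a stand-in struct (the real btrfs_multi_bio lives in the btrfs sources and looks different; the field and helper names here are assumptions):

/* Sketch only: a stand-in for the idea, not the real btrfs_multi_bio. */
struct multi_bio_ctx {
        int num_stripes;   /* how many copies the block maps to           */
        int mirror_num;    /* NEW: mirror this particular bio was sent to */
        /* ... orig_bio, end_io, stripes[] and friends omitted ...        */
};

/* Submission side: remember which copy was picked, whether it came from
 * the caller or from find_live_mirror. */
static void record_mirror(struct multi_bio_ctx *ctx, int picked)
{
        ctx->mirror_num = picked;
}

/* Endio side: the failure hook can now ask which copy just failed and
 * start its retries somewhere else. */
static int failed_mirror(const struct multi_bio_ctx *ctx)
{
        return ctx->mirror_num;
}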
> The metadata path: there is no failure hook, instead there is a loop in
> btree_read_extent_buffer_pages, also starting off at mirror_num=0, which
> again leaves the decision to find_live_mirror. If there is an error for
> any page to be read, the same retry strategy is used as is in the data
> path. This obviously might leave you alone with unreadable data
> (consider page x is bad on mirror 1 and page x+1 is bad on mirror 2,
> both belonging to the same extent, you lose). It would be nice to have a
> mechanism at a lower level issuing page-sized retries. Of course,
> knowing which mirror is bad before trying mirror 1 again is desirable as
> well.
>
> questions:
> I have a raid1 repair solution in mind (partially coded) for btrfs that
> can be implemented quite easily. However, I have some misgivings. All of
> the following questions would need a "yes" for my solution to stand:
>
> - Is it acceptable to retry reading a block immediately after the disk
> said it won't work? Or in case of a successful read followed by a
> checksum error? (Which is already being done right now in btrfs.)
>
You mean re-read the same block? No, that's not OK: if the checksum failed or the
drive returned with the buffer not uptodate, then re-reading it isn't
likely to change anything, so it just wastes time.
> - Is it acceptable to always write both mirrors if one is found to be
> bad (also consider ssds)?
>
Yes, if one is bad we really want to re-write the bad one so we don't find it
again.
> If either of the answers is "no", tracking where the initial read came
> from seems inevitable. Tracking would be very easy if bios came back
> with unmodified values in bi_bdev and bi_sector, which is not the case.
Huh? The bios come back with the bi_bdev/bi_sector they were submitted on, how
are you getting something different? Thanks,
Josef
Re: How to implement raid1 repair
on 17.03.2011 18:36:51 by Chris Mason
Excerpts from Jan Schmidt's message of 2011-03-17 10:46:43 -0400:
> Hello everyone,
>
> Currently, btrfs has its own raid1 but no repair mechanism for bad
> checksums or EIOs. While trying to implement such a repair mechanism,
> several more or less general questions came up.
>
> There are two different retry paths for data and metadata. (If you know
> or don't care how btrfs handles read errors: goto questions)
You should talk with Ilya, who is working on replacing failed raid drives as
well.
>
> The data path: btrfs_io_failed_hook is called for each failed bio (EIO
> or checksum error). Currently, it does not know which mirror failed at
> first, because normally btrfs_map_block is called with mirror_num=0,
> leading to a path where find_live_mirror picks one of them. The error
> recovery strategy is then to explicitly read available mirrors one after
> the other until one succeeds. In case the very first read picked mirror
> 1 and failed, the retry code will most likely fail at mirror 1 as well.
> It would be nice to know which mirror was picked formerly and directly
> try the other.
Agree with Josef here, change the code to record which one was used.
The current bio submission stuff only keeps the btrfs_multi_bio struct
around when a given IO spans more than one disk. But you can easily
change it to keep the struct around for all IOs.
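A rough illustration of keeping that context for every IO (placeholder names; the kernel would use kmalloc/kfree rather than the libc calls used here): allocate it unconditionally at submit time and release it in the end_io handler once the mirror number has been handed to the retry code.

/* Sketch only: per-IO context kept for all IOs, not just striped ones. */
#include <stdlib.h>

struct io_ctx {
        int mirror_num;        /* copy this IO was sent to            */
        void *orig_private;    /* whatever the bio's private data was */
};

/* Allocate a context for every submitted IO. */
static struct io_ctx *alloc_io_ctx(int mirror, void *orig_private)
{
        struct io_ctx *ctx = malloc(sizeof(*ctx));

        if (ctx) {
                ctx->mirror_num = mirror;
                ctx->orig_private = orig_private;
        }
        return ctx;
}

/* Completion: report which mirror the IO used, then drop the context. */
static int finish_io_ctx(struct io_ctx *ctx)
{
        int mirror = ctx->mirror_num;

        free(ctx);
        return mirror;
}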
>
> The metadata path: there is no failure hook, instead there is a loop in
> btree_read_extent_buffer_pages, also starting off at mirror_num=0, which
> again leaves the decision to find_live_mirror. If there is an error for
> any page to be read, the same retry strategy is used as is in the data
> path. This obviously might leave you alone with unreadable data
> (consider page x is bad on mirror 1 and page x+1 is bad on mirror 2,
> both belonging to the same extent, you lose). It would be nice to have a
> mechanism at a lower level issuing page-sized retries. Of course,
> knowing which mirror is bad before trying mirror 1 again is desirable as
> well.
Currently the block size is always smaller than the stripe size. But
you have a good point.
>
> questions:
> I have a raid1 repair solution in mind (partially coded) for btrfs that
> can be implemented quite easily. However, I have some misgivings. All of
> the following questions would need a "yes" for my solution to stand:
>
> - Is it acceptable to retry reading a block immediately after the disk
> said it won't work? Or in case of a successful read followed by a
> checksum error? (Which is already being done right now in btrfs.)
In the initial implementation sure, but long term it's not the best.
>
> - Is it acceptable to always write both mirrors if one is found to be
> bad (also consider ssds)?
Sorry, I'd rather not overwrite the copy we know to be good.
>
> If either of the answers is "no", tracking where the initial read came
> from seems inevitable. Tracking would be very easy if bios came back
> with unmodified values in bi_bdev and bi_sector, which is not the case.
-chris
Re: How to implement raid1 repair
on 17.03.2011 18:37:54 by Jan Schmidt
On 03/17/2011 06:09 PM, Andrey Kuzmin wrote:
> On Thu, Mar 17, 2011 at 5:46 PM, Jan Schmidt wrote:
> - Is it acceptable to retry reading a block immediately after the disk
> said it won't work? Or in case of a successful read followed by a
> checksum error? (Which is already being done right now in btrfs.)
>
>
> These are two pretty different cases. When disk firmware fails a read, it
> means it has retried a number of times but gave up (suggesting a media
> error), so an upper-layer retry would hardly make sense. A checksum error
> catches an on-disk EDC fault, so a retry is, on the contrary, quite reasonable.
Agreed.
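Put as a tiny, purely illustrative policy check, the distinction might look like this:

/* Sketch only: illustrative retry policy, not actual btrfs code. */
enum read_failure {
        FAIL_EIO,       /* the device reported a medium/IO error      */
        FAIL_CSUM,      /* the read succeeded but the checksum didn't */
};

/*
 * The drive has already retried internally before returning EIO, so
 * re-reading the same copy only makes sense for a checksum mismatch.
 */
static int worth_retrying_same_copy(enum read_failure why)
{
        return why == FAIL_CSUM;
}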
> - Is it acceptable to always write both mirrors if one is found to be
> bad (also consider ssds)?
>
>
> Writing on the read path, bypassing the file-system transaction mechanism,
> doesn't seem like a good idea to me. Just imagine losing power while
> overwriting the last good copy.
Okay, sounds reasonable to me. Let's say we bypass the transaction
mechanism in the same rude manner, but only write the bad mirror. Does
that seem reasonable?
Thanks,
Jan
Re: How to implement raid1 repair
on 17.03.2011 18:42:03 by Chris Mason
Excerpts from Jan Schmidt's message of 2011-03-17 13:37:54 -0400:
> On 03/17/2011 06:09 PM, Andrey Kuzmin wrote:
> > On Thu, Mar 17, 2011 at 5:46 PM, Jan Schmidt wrote:
> > - Is it acceptable to retry reading a block immediately after the disk
> > said it won't work? Or in case of a successful read followed by a
> > checksum error? (Which is already being done right now in btrfs.)
> >
> >
> > These are two pretty different cases. When disk firmware fails a read, it
> > means it has retried a number of times but gave up (suggesting a media
> > error), so an upper-layer retry would hardly make sense. A checksum error
> > catches an on-disk EDC fault, so a retry is, on the contrary, quite reasonable.
>
> Agreed.
>
> > - Is it acceptable to always write both mirrors if one is found to be
> > bad (also consider ssds)?
> >
> >
> > Writing on the read path, bypassing the file-system transaction mechanism,
> > doesn't seem like a good idea to me. Just imagine losing power while
> > overwriting the last good copy.
>
> Okay, sounds reasonable to me. Let's say we bypass the transaction
> mechanism in the same rude manner, but only write the bad mirror. Does
> that seem reasonable?
The bad mirror is fair game. Write away, as long as you're sure you're
excluding nodatacow and you don't allow that block to get reallocated
elsewhere. You don't actually need to bypass the transaction
mechanism, just those two things.
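As a sketch of those two conditions (all names hypothetical, and the pinning helper is only assumed to exist): skip nodatacow extents, keep the block from being reallocated while the repair write is in flight, and only ever touch the copy known to be bad.

/* Sketch only: hypothetical guards around an in-place repair write. */
typedef unsigned long long u64;

/* Assumed helpers for the sketch. */
int extent_is_nodatacow(u64 logical);
int pin_block(u64 logical);        /* keep block from being reallocated, 0 on success */
void unpin_block(u64 logical);
int write_good_copy_to_mirror(u64 logical, int bad_mirror);

static int repair_bad_mirror(u64 logical, int bad_mirror)
{
        int ret;

        if (extent_is_nodatacow(logical))
                return -1;      /* overwriting in place is not safe here */

        if (pin_block(logical))
                return -1;      /* block might move underneath us, give up */

        /* Only the copy known to be bad is rewritten; the good one is
         * never touched, so a crash mid-write cannot destroy it. */
        ret = write_good_copy_to_mirror(logical, bad_mirror);

        unpin_block(logical);
        return ret;
}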
-chris
Re: How to implement raid1 repair
on 17.03.2011 18:45:54 by Andrey Kuzmin
On Thu, Mar 17, 2011 at 8:42 PM, Chris Mason wrote:
> Excerpts from Jan Schmidt's message of 2011-03-17 13:37:54 -0400:
>> On 03/17/2011 06:09 PM, Andrey Kuzmin wrote:
>> > On Thu, Mar 17, 2011 at 5:46 PM, Jan Schmidt wrote:
>> >    - Is it acceptable to retry reading a block immediately after the disk
>> >    said it won't work? Or in case of a successful read followed by a
>> >    checksum error? (Which is already being done right now in btrfs.)
>> >
>> >
>> > These are two pretty different cases. When disk firmware fails a read, it
>> > means it has retried a number of times but gave up (suggesting a media
>> > error), so an upper-layer retry would hardly make sense. A checksum error
>> > catches an on-disk EDC fault, so a retry is, on the contrary, quite reasonable.
>>
>> Agreed.
>>
>> >    - Is it acceptable to always write both mirrors if one is found to be
>> >    bad (also consider ssds)?
>> >
>> >
>> > Writing on the read path, bypassing the file-system transaction mechanism,
>> > doesn't seem like a good idea to me. Just imagine losing power while
>> > overwriting the last good copy.
>>
>> Okay, sounds reasonable to me. Let's say we bypass the transaction
>> mechanism in the same rude manner, but only write the bad mirror. Does
>> that seem reasonable?
>
> The bad mirror is fair game. Write away, as long as you're sure you're
> excluding nodatacow and you don't allow that block to get reallocated
> elsewhere. You don't actually need to bypass the transaction
> mechanism, just those two things.
What happens if multiple readers (allowed by read lock) attempt an overwrite?
Regards,
Andrey
>
> -chris
Re: How to implement raid1 repair
on 17.03.2011 18:49:31 by Jan Schmidt
On 03/17/2011 06:36 PM, Chris Mason wrote:
> Excerpts from Jan Schmidt's message of 2011-03-17 10:46:43 -0400:
>> Hello everyone,
>>
>> Currently, btrfs has its own raid1 but no repair mechanism for bad
>> checksums or EIOs. While trying to implement such a repair mechanism,
>> several more or less general questions came up.
>>
>> There are two different retry paths for data and metadata. (If you know
>> or don't care how btrfs handles read errors: goto questions)
>
> You should talk with Ilya, who is working on replacing failed raid drives as
> well.
Thanks, I'll do that.
>>
>> The data path: btrfs_io_failed_hook is called for each failed bio (EIO
>> or checksum error). Currently, it does not know which mirror failed at
>> first, because normally btrfs_map_block is called with mirror_num=0,
>> leading to a path where find_live_mirror picks one of them. The error
>> recovery strategy is then to explicitly read available mirrors one after
>> the other until one succeeds. In case the very first read picked mirror
>> 1 and failed, the retry code will most likely fail at mirror 1 as well.
>> It would be nice to know which mirror was picked formerly and directly
>> try the other.
>
> Agree with Josef here, change the code to record which one was used.
> The current bio submission stuff only keeps the btrfs_multi_bio struct
> around when a given IO spans more than one disk. But you can easily
> change it to keep the struct around for all IOs.
I was about to write the same in reply to Josef's mail :-)
>> The metadata path: there is no failure hook, instead there is a loop in
>> btree_read_extent_buffer_pages, also starting off at mirror_num=0, which
>> again leaves the decision to find_live_mirror. If there is an error for
>> any page to be read, the same retry strategy is used as is in the data
>> path. This obviously might leave you alone with unreadable data
>> (consider page x is bad on mirror 1 and page x+1 is bad on mirror 2,
>> both belonging to the same extent, you lose). It would be nice to have a
>> mechanism at a lower level issuing page-sized retries. Of course,
>> knowing which mirror is bad before trying mirror 1 again is desirable as
>> well.
>
> Currently the block size is always smaller than the stripe size. But
> you have a good point.
>
>>
>> questions:
>> I have a raid1 repair solution in mind (partially coded) for btrfs that
>> can be implemented quite easily. However, I have some misgivings. All of
>> the following questions would need a "yes" for my solution to stand:
>>
>> - Is it acceptable to retry reading a block immediately after the disk
>> said it won't work? Or in case of a successful read followed by a
>> checksum error? (Which is already being done right now in btrfs.)
>
> In the initial implementation sure, but long term it's not the best.
>
>>
>> - Is it acceptable to always write both mirrors if one is found to be
>> bad (also consider ssds)?
>
> Sorry, I'd rather not overwrite the copy we know to be good.
>
>>
>> If either of the answers is "no", tracking where the initial read came
>> from seems inevitable. Tracking would be very easy if bios came back
>> with unmodified values in bi_bdev and bi_sector, which is not the case.
First off, I'll go for the tracking part then.
I'll see whether implementing the retry mechanism somewhere near
end_bio_multi_stripe is a good approach, or else how to pass the
mirror_num further up so that the code calling the hooks can take care
of the retries and repairs.
Thanks!
-Jan
Re: How to implement raid1 repair
on 17.03.2011 18:52:38 by Jan Schmidt
(snipped the parts already commented on in the other mails just sent)
On 03/17/2011 06:19 PM, Josef Bacik wrote:
> On Thu, Mar 17, 2011 at 03:46:43PM +0100, Jan Schmidt wrote:
>> If either of the answers is "no", tracking where the initial read came
>> from seems inevitable. Tracking would be very easy if bios came back
>> with unmodified values in bi_bdev and bi_sector, which is not the case.
>
> Huh? The bios come back with the bi_bdev/bi_sector they were submitted on, how
> are you getting something different? Thanks,
That's only if you write to the block device directly. If I'm not
mistaken, when bios are submitted on a partition, bi_bdev and bi_sector
get remapped to values absolute to the underlying block device.
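A trivial userspace illustration of the concern (simplified; the real remapping happens inside the block layer before the driver sees the request): whatever the completion handler needs, such as the mirror number or the originally submitted sector, has to be saved at submit time.

/* Sketch only: why values needed at completion must be saved at submit time. */
#include <stdio.h>

struct submit_ctx {
        unsigned long long orig_sector;  /* sector relative to the partition */
        int mirror_num;                  /* mirror the read was sent to      */
};

int main(void)
{
        unsigned long long partition_start = 2048;   /* example partition offset */
        struct submit_ctx ctx = { .orig_sector = 4096, .mirror_num = 2 };

        /* The block layer rewrites the bio to device-absolute values, so
         * this is what a completion handler would see -- not what was
         * submitted.  The original values survive only in our context. */
        unsigned long long completed_sector = partition_start + ctx.orig_sector;

        printf("submitted %llu on mirror %d, completed as %llu\n",
               ctx.orig_sector, ctx.mirror_num, completed_sector);
        return 0;
}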
Jan