rhel5 raid6 corruption

On 04.04.2011 15:59:02 by Robin Humble

Hi,

we are finding non-zero mismatch_cnt's and getting data corruption when
using RHEL5/CentOS5 kernels with md raid6.
actually, all kernels prior to 2.6.32 seem to have the bug.

the corruption only happens after we replace a failed disk, and the
incorrect data is always on the replacement disk. i.e. the problem is
with rebuild. mismatch_cnt is always a multiple of 8, so I suspect
pages are going astray.

hardware and disk drivers are NOT the problem as I've reproduced it on
2 different machines with FC disks and SATA disks which have completely
different drivers.

rebuilding the raid6 very very slowly (sync_speed_max=5000) mostly
avoids the problem. the faster the rebuild goes or the more i/o to the
raid whilst it's rebuilding, the more likely we are to see mismatches
afterwards.
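
for reference, this is roughly how the throttling is done (assuming the array
is md0; the value is in KB/s, and /proc/sys/dev/raid/speed_limit_max is the
system-wide equivalent):

# cap the resync/rebuild speed for this array only
echo 5000 > /sys/block/md0/md/sync_speed_max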

git bisecting through drivers/md/raid5.c between 2.6.31 (has mismatches)
and .32 (no problems) says that one of these (unbisectable) commits
fixed the issue:
a9b39a741a7e3b262b9f51fefb68e17b32756999 md/raid6: asynchronous handle_stripe_dirtying6
5599becca4bee7badf605e41fd5bcde76d51f2a4 md/raid6: asynchronous handle_stripe_fill6
d82dfee0ad8f240fef1b28e2258891c07da57367 md/raid6: asynchronous handle_parity_check6
6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8 md/raid6: asynchronous handle_stripe6
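
for the record, the bisect was driven roughly like this (a sketch only -
assuming a mainline git tree, and with the usual good/bad terms inverted so
that bisect converges on the commit that fixed the problem rather than one
that introduced it):

# 2.6.31 shows mismatches and 2.6.32 doesn't, so mark the fixed kernel
# "bad" and the broken one "good" in order to hunt for the fixing commit(s)
git bisect start v2.6.32 v2.6.31 -- drivers/md/raid5.c
# for each candidate: build, boot, run the rebuild+check loop described
# below, then mark it
git bisect bad       # this candidate showed no mismatches (already fixed)
git bisect good      # this candidate still showed mismatches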

any ideas?
were any "write i/o whilst rebuilding from degraded" issues fixed by
the above patches?

I was hoping to find something specific and hopefully easily
backportable to .18, but the above looks quite major :-/

which stripe flags are associated with a degraded array that's
rebuilding and also writing data to the disk being reconstructed?

any help would be very much appreciated!

we have asked our hw+sw+filesystem vendor to fix the problem, but I
suspect this will take a very long time. for a variety of reasons (not
the least being we run modified CentOS kernels in production and don't
have a RedHat contract) we can't ask RedHat directly.

there is much more expertise on this list than with any vendor anyway :-)


in case anyone is interested (or is seeing similar corruption and has a
RedHat contract) below are steps to reproduce.

the i/o load that reproduces the mismatch problem is 32-way IOR
http://sourceforge.net/projects/ior-sio/ with small random direct i/o's.
this pattern mimics a small subset of the real i/o on our filesystem.
eg. to local ext3 ->
mpirun -np 32 ./IOR -a POSIX -B -w -z -F -k -Y -e -i3 -m -t4k -b 200MB -o /mnt/blah/testFile

steps to reproduce are:
1) create a md raid6 8+2, 128k chunk, 50GB in size
2) format as ext3 and mount
3) run the above IOR infinitely in a loop
4) mdadm --fail a disk, --remove, then --add it back in
5) killall -STOP the IOR just before the md rebuild finishes
6) let the md rebuild finish
7) run a md check
8) if there are mismatches then exit
9) if no mismatches then killall -CONT IOR
10) goto 4)

step 5) is needed because the corruption is always on the replacement
disk. the replacement disk goes from write-only during rebuild to
read-write when the rebuild finishes. so stopping all i/o to the raid
just before the rebuild finishes leaves any corruption on the
replacement disk and does not allow subsequent i/o to overwrite it,
propagate the corruption to other disks, or otherwise hide the
mismatches.
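
a rough scripted version of steps 4) to 10) above, in case it's useful -
device names, thresholds and the setup commands are placeholders from our
configuration, so treat it as a sketch rather than a polished tool:

#!/bin/bash
# one-off setup (steps 1-3), e.g. something like:
#   mdadm --create /dev/md0 --level=6 --raid-devices=10 --chunk=128 /dev/sd[b-k]
#   (plus --size to keep the array to ~50GB), then mkfs.ext3, mount, and
#   start the 32-way IOR loop against the mount point
MD=md0
DISK=/dev/sdk            # the member we repeatedly kick out and re-add

while true; do
    # step 4: fail, remove and re-add one member
    mdadm /dev/$MD --fail $DISK
    mdadm /dev/$MD --remove $DISK
    mdadm /dev/$MD --add $DISK

    # step 5: pause the IOR processes just before the rebuild completes
    # (crude - poll /proc/mdstat until recovery reaches ~90%)
    until grep -q 'recovery = *9[0-9]\.' /proc/mdstat; do sleep 5; done
    killall -STOP IOR

    # step 6: let the rebuild finish
    while [ "$(cat /sys/block/$MD/md/sync_action)" != idle ]; do sleep 5; done

    # steps 7 and 8: run a check and look at the mismatch count
    echo check > /sys/block/$MD/md/sync_action
    while [ "$(cat /sys/block/$MD/md/sync_action)" != idle ]; do sleep 5; done
    n=$(cat /sys/block/$MD/md/mismatch_cnt)
    if [ "$n" -ne 0 ]; then
        echo "mismatch_cnt=$n"
        break
    fi

    # steps 9 and 10: no mismatches, so resume the i/o and go around again
    killall -CONT IOR
done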

mismatches can usually be found using the above procedure in <100
iterations through the loop (roughly <36 hours). I've been running 2
machines in the above loops - one to FC disks and one to SATA disks.
so the disks and drivers are eliminated as a source of the problem.
the slower older FC disks usually hit the mismatches before the SATA
disks. mismatch_cnt's are always multiples of 8.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

Re: rhel5 raid6 corruption

On 05.04.2011 07:00:22 by NeilBrown

On Mon, 4 Apr 2011 09:59:02 -0400 Robin Humble wrote:

> Hi,
>
> we are finding non-zero mismatch_cnt's and getting data corruption when
> using RHEL5/CentOS5 kernels with md raid6.
> actually, all kernels prior to 2.6.32 seem to have the bug.
>
> the corruption only happens after we replace a failed disk, and the
> incorrect data is always on the replacement disk. i.e. the problem is
> with rebuild. mismatch_cnt is always a multiple of 8, so I suspect
> pages are going astray.
>
> hardware and disk drivers are NOT the problem as I've reproduced it on
> 2 different machines with FC disks and SATA disks which have completely
> different drivers.
>
> rebuilding the raid6 very very slowly (sync_speed_max=5000) mostly
> avoids the problem. the faster the rebuild goes or the more i/o to the
> raid whilst it's rebuilding, the more likely we are to see mismatches
> afterwards.
>
> git bisecting through drivers/md/raid5.c between 2.6.31 (has mismatches)
> and .32 (no problems) says that one of these (unbisectable) commits
> fixed the issue:
> a9b39a741a7e3b262b9f51fefb68e17b32756999 md/raid6: asynchronous handle_stripe_dirtying6
> 5599becca4bee7badf605e41fd5bcde76d51f2a4 md/raid6: asynchronous handle_stripe_fill6
> d82dfee0ad8f240fef1b28e2258891c07da57367 md/raid6: asynchronous handle_parity_check6
> 6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8 md/raid6: asynchronous handle_stripe6
>
> any ideas?
> were any "write i/o whilst rebuilding from degraded" issues fixed by
> the above patches?

It looks like they were, but I didn't notice at the time.

If a write to a block in a stripe happens at exactly the same time as the
recovery of a different block in that stripe - and both operations are
combined into a single "fix up the stripe parity and write it all out"
operation, then the block that needs to be recovered is computed but not
written out. oops.

The following patch should fix it. Please test and report your results.
If they prove the fix I will submit it for the various -stable kernels.
It looks like this bug has "always" been present :-(

Thanks for the report .... and for all that testing! A git-bisect where each
run can take 36 hours is a real test of commitment!!!

NeilBrown

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b8a2c5d..f8cd6ef 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2436,10 +2436,16 @@ static void handle_stripe_dirtying6(raid5_conf_t *conf,
BUG();
case 1:
compute_block_1(sh, r6s->failed_num[0], 0);
+ set_bit(R5_LOCKED,
+ &sh->dev[r6s->failed_num[0]].flags);
break;
case 2:
compute_block_2(sh, r6s->failed_num[0],
r6s->failed_num[1]);
+ set_bit(R5_LOCKED,
+ &sh->dev[r6s->failed_num[0]].flags);
+ set_bit(R5_LOCKED,
+ &sh->dev[r6s->failed_num[1]].flags);
break;
default: /* This request should have been failed? */
BUG();

Re: rhel5 raid6 corruption

On 07.04.2011 09:45:05 by Robin Humble

On Tue, Apr 05, 2011 at 03:00:22PM +1000, NeilBrown wrote:
>On Mon, 4 Apr 2011 09:59:02 -0400 Robin Humble
>> we are finding non-zero mismatch_cnt's and getting data corruption when
>> using RHEL5/CentOS5 kernels with md raid6.
>> actually, all kernels prior to 2.6.32 seem to have the bug.
>>
>> the corruption only happens after we replace a failed disk, and the
>> incorrect data is always on the replacement disk. i.e. the problem is
>> with rebuild. mismatch_cnt is always a multiple of 8, so I suspect
>> pages are going astray.
....
>> git bisecting through drivers/md/raid5.c between 2.6.31 (has mismatches)
>> and .32 (no problems) says that one of these (unbisectable) commits
>> fixed the issue:
>> a9b39a741a7e3b262b9f51fefb68e17b32756999 md/raid6: asynchronous handle_stripe_dirtying6
>> 5599becca4bee7badf605e41fd5bcde76d51f2a4 md/raid6: asynchronous handle_stripe_fill6
>> d82dfee0ad8f240fef1b28e2258891c07da57367 md/raid6: asynchronous handle_parity_check6
>> 6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8 md/raid6: asynchronous handle_stripe6
>>
>> any ideas?
>> were any "write i/o whilst rebuilding from degraded" issues fixed by
>> the above patches?
>
>It looks like they were, but I didn't notice at the time.
>
>If a write to a block in a stripe happens at exactly the same time as the
>recovery of a different block in that stripe - and both operations are
>combined into a single "fix up the stripe parity and write it all out"
>operation, then the block that needs to be recovered is computed but not
>written out. oops.
>
>The following patch should fix it. Please test and report your results.
>If they prove the fix I will submit it for the various -stable kernels.
>It looks like this bug has "always" been present :-(

thanks for the very quick reply!

however, I don't think the patch has solved the problem :-/
I applied it to 2.6.31.14 and have got several mismatches since on both
FC and SATA machines.

BTW, these tests are actually fairly quick. often <10 kickout/rebuild
loops. so just a few hours.

in the above when you say "block in a stripe", does that mean the whole
128k on that disk (we have --chunk=128) might not have been written, or
one 512byte block (or a page)?
we don't see mismatch counts of 256 - usually 8 or 16, but I can see
why our current testing might hide such a count. ok - now I
blank (dd zeros over) the replacement disk before putting it back into
the raid and am seeing perhaps slightly larger typical mismatch counts
of 16 and 32, but so far not 128k of mismatches.

another (potential) data point is that often the position of the mismatches
on the md device doesn't really line up with where I think the data is
being written to. the mismatch is often near the start of the md device,
but sometimes 50% of the way in, and sometimes 95%.
the filesystem is <20% full, although the wildcard is that I really
have no idea how the (sparse) files doing the 4k direct i/o are
allocated across the filesystem, or where fs metadata and journals might
be updating blocks either.
seems odd though...

>Thanks for the report .... and for all that testing! A git-bisect where each
>run can take 36 hours is a real test of commitment!!!

:-) no worries. thanks for md and for the help!
we've been trying to figure this out for months so a bit of testing
isn't a problem. we first eliminated a bunch of other things
(filesystem, drivers, firmware, ...) as possibilities. for a long time
we didn't really believe the problem could be with md as it's so well
tested around the world and has been very solid for us except for these
rebuilds.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility


>NeilBrown
>
>diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>index b8a2c5d..f8cd6ef 100644
>--- a/drivers/md/raid5.c
>+++ b/drivers/md/raid5.c
>@@ -2436,10 +2436,16 @@ static void handle_stripe_dirtying6(raid5_conf_t *conf,
> BUG();
> case 1:
> compute_block_1(sh, r6s->failed_num[0], 0);
>+ set_bit(R5_LOCKED,
>+ &sh->dev[r6s->failed_num[0]].flags);
> break;
> case 2:
> compute_block_2(sh, r6s->failed_num[0],
> r6s->failed_num[1]);
>+ set_bit(R5_LOCKED,
>+ &sh->dev[r6s->failed_num[0]].flags);
>+ set_bit(R5_LOCKED,
>+ &sh->dev[r6s->failed_num[1]].flags);
> break;
> default: /* This request should have been failed? */
> BUG();

Re: rhel5 raid6 corruption

On 08.04.2011 12:33:45 by NeilBrown

On Thu, 7 Apr 2011 03:45:05 -0400 Robin Humble wrote:

> On Tue, Apr 05, 2011 at 03:00:22PM +1000, NeilBrown wrote:
> >On Mon, 4 Apr 2011 09:59:02 -0400 Robin Humble
> >> we are finding non-zero mismatch_cnt's and getting data corruption when
> >> using RHEL5/CentOS5 kernels with md raid6.
> >> actually, all kernels prior to 2.6.32 seem to have the bug.
> >>
> >> the corruption only happens after we replace a failed disk, and the
> >> incorrect data is always on the replacement disk. i.e. the problem is
> >> with rebuild. mismatch_cnt is always a multiple of 8, so I suspect
> >> pages are going astray.
> ...
> >> git bisecting through drivers/md/raid5.c between 2.6.31 (has mismatches)
> >> and .32 (no problems) says that one of these (unbisectable) commits
> >> fixed the issue:
> >> a9b39a741a7e3b262b9f51fefb68e17b32756999 md/raid6: asynchronous handle_stripe_dirtying6
> >> 5599becca4bee7badf605e41fd5bcde76d51f2a4 md/raid6: asynchronous handle_stripe_fill6
> >> d82dfee0ad8f240fef1b28e2258891c07da57367 md/raid6: asynchronous handle_parity_check6
> >> 6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8 md/raid6: asynchronous handle_stripe6
> >>
> >> any ideas?
> >> were any "write i/o whilst rebuilding from degraded" issues fixed by
> >> the above patches?
> >
> >It looks like they were, but I didn't notice at the time.
> >
> >If a write to a block in a stripe happens at exactly the same time as the
> >recovery of a different block in that stripe - and both operations are
> >combined into a single "fix up the stripe parity and write it all out"
> >operation, then the block that needs to be recovered is computed but not
> >written out. oops.
> >
> >The following patch should fix it. Please test and report your results.
> >If they prove the fix I will submit it for the various -stable kernels.
> >It looks like this bug has "always" been present :-(
>
> thanks for the very quick reply!
>
> however, I don't think the patch has solved the problem :-/
> I applied it to 2.6.31.14 and have got several mismatches since on both
> FC and SATA machines.

That's disappointing - I was sure I had found it. I'm tempted to ask "are
you really sure you are running the modified kernel", but I'm sure you are.

>
> BTW, these tests are actually fairly quick. often <10 kickout/rebuild
> loops. so just a few hours.
>
> in the above when you say "block in a stripe", does that mean the whole
> 128k on that disk (we have --chunk=128) might not have been written, or
> one 512byte block (or a page)?

'page'. raid5 does everything in one-page (4K) per device strips.
So mismatch count - which is measured in 512-byte sectors - will always be a
multiple of 8 (one 4K page = 8 sectors).


> we don't see mismatch counts of 256 - usually 8 or 16, but I can see
> why our current testing might hide such a count. ok - now I
> blank (dd zeros over) the replacement disk before putting it back into
> the raid and am seeing perhaps slightly larger typical mismatch counts
> of 16 and 32, but so far not 128k of mismatches.
>
> another (potential) data point is that often the position of the mismatches
> on the md device doesn't really line up with where I think the data is
> being written to. the mismatch is often near the start of the md device,
> but sometimes 50% of the way in, and sometimes 95%.
> the filesystem is <20% full, although the wildcard is that I really
> have no idea how the (sparse) files doing the 4k direct i/o are
> allocated across the filesystem, or where fs metadata and journals might
> be updating blocks either.
> seems odd though...

I suspect the filesystem spreads files across the whole disk, though it
depends a lot on the details of the particular filesystem.


>
> >Thanks for the report .... and for all that testing! A git-bisect where each
> >run can take 36 hours is a real test of commitment!!!
>
> :-) no worries. thanks for md and for the help!
> we've been trying to figure this out for months so a bit of testing
> isn't a problem. we first eliminated a bunch of other things
> (filesystem, drivers, firmware, ...) as possibilities. for a long time
> we didn't really believe the problem could be with md as it's so well
> tested around the world and has been very solid for us except for these
> rebuilds.
>

I'll try staring at the code a bit longer and see if anything jumps out at me.

thanks,
NeilBrown

Re: rhel5 raid6 corruption

On 09.04.2011 06:24:42 by Robin Humble

On Fri, Apr 08, 2011 at 08:33:45PM +1000, NeilBrown wrote:
>On Thu, 7 Apr 2011 03:45:05 -0400 Robin Humble wrote:
>> On Tue, Apr 05, 2011 at 03:00:22PM +1000, NeilBrown wrote:
>> >On Mon, 4 Apr 2011 09:59:02 -0400 Robin Humble
>> >> we are finding non-zero mismatch_cnt's and getting data corruption when
>> >> using RHEL5/CentOS5 kernels with md raid6.
>> >> actually, all kernels prior to 2.6.32 seem to have the bug.
>> >>
>> >> the corruption only happens after we replace a failed disk, and the
>> >> incorrect data is always on the replacement disk. i.e. the problem is
>> >> with rebuild. mismatch_cnt is always a multiple of 8, so I suspect
>> >> pages are going astray.
>> ...
>> >> git bisecting through drivers/md/raid5.c between 2.6.31 (has mismatches)
>> >> and .32 (no problems) says that one of these (unbisectable) commits
>> >> fixed the issue:
>> >> a9b39a741a7e3b262b9f51fefb68e17b32756999 md/raid6: asynchronous handle_stripe_dirtying6
>> >> 5599becca4bee7badf605e41fd5bcde76d51f2a4 md/raid6: asynchronous handle_stripe_fill6
>> >> d82dfee0ad8f240fef1b28e2258891c07da57367 md/raid6: asynchronous handle_parity_check6
>> >> 6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8 md/raid6: asynchronous handle_stripe6
>> >>
>> >> any ideas?
>> >> were any "write i/o whilst rebuilding from degraded" issues fixed by
>> >> the above patches?
>> >
>> >It looks like they were, but I didn't notice at the time.
>> >
>> >If a write to a block in a stripe happens at exactly the same time as the
>> >recovery of a different block in that stripe - and both operations are
>> >combined into a single "fix up the stripe parity and write it all out"
>> >operation, then the block that needs to be recovered is computed but not
>> >written out. oops.
>> >
>> >The following patch should fix it. Please test and report your results.
>> >If they prove the fix I will submit it for the various -stable kernels.
>> >It looks like this bug has "always" been present :-(
>>
>> thanks for the very quick reply!
>>
>> however, I don't think the patch has solved the problem :-/
>> I applied it to 2.6.31.14 and have got several mismatches since on both
>> FC and SATA machines.
>
>That's disappointing - I was sure I had found it. I'm tempted to ask "are
>you really sure you are running the modified kernel", but I'm sure you are.

yes - really running it :-)

so I added some pr_debug's in and around the patched switch statement
in handle_stripe_dirtying6. it seems cases 1 and 2 below aren't executed
during either a normal rebuild or when I get mismatches. so that
explains why fixes there didn't change anything.

so I guess something in
if (s->locked == 0 && rcw == 0 &&
!test_bit(STRIPE_BIT_DELAY, &sh->state)) {
if (must_compute > 0) {

is always failing?

# cat /sys/kernel/debug/dynamic_debug/control | grep handle_stripe_dirtying6
drivers/md/raid5.c:2466 [raid456]handle_stripe_dirtying6 - "Writing stripe %llu block %d\012"
drivers/md/raid5.c:2460 [raid456]handle_stripe_dirtying6 - "Computing parity for stripe %llu\012"
drivers/md/raid5.c:2448 [raid456]handle_stripe_dirtying6 p "rjh - case2\012"
drivers/md/raid5.c:2441 [raid456]handle_stripe_dirtying6 p "rjh - case1, r6s->failed_num[0] = %d, flags %lu\012"
drivers/md/raid5.c:2433 [raid456]handle_stripe_dirtying6 - "rjh - must_compute %d, s->failed %d\012"
drivers/md/raid5.c:2430 [raid456]handle_stripe_dirtying6 - "rjh - s->locked %d rcw %d test_bit(STRIPE_BIT_DELAY, &sh->state) %d\012"
drivers/md/raid5.c:2421 [raid456]handle_stripe_dirtying6 - "Request delayed stripe %llu block %d for Reconstruct\012"
drivers/md/raid5.c:2414 [raid456]handle_stripe_dirtying6 - "Read_old stripe %llu block %d for Reconstruct\012"
drivers/md/raid5.c:2398 [raid456]handle_stripe_dirtying6 - "for sector %llu, rcw=%d, must_compute=%d\012"
drivers/md/raid5.c:2392 [raid456]handle_stripe_dirtying6 - "raid6: must_compute: disk %d flags=%#lx\012"
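
in case anyone wants to replicate this, the pr_debug sites above can be
switched on at runtime via dynamic_debug (assuming CONFIG_DYNAMIC_DEBUG is set
and debugfs is mounted at /sys/kernel/debug), e.g.:

# enable every pr_debug in handle_stripe_dirtying6
echo 'func handle_stripe_dirtying6 +p' > /sys/kernel/debug/dynamic_debug/control
# or just a single site, by file and line
echo 'file drivers/md/raid5.c line 2398 +p' > /sys/kernel/debug/dynamic_debug/control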

the output is verbose if I turn on some of these. but this is a short
snippet that I guess looks ok to you?

raid456:for sector 6114040, rcw=0, must_compute=0
raid456:for sector 6113904, rcw=0, must_compute=0
raid456:for sector 11766712, rcw=0, must_compute=0
raid456:for sector 6113912, rcw=0, must_compute=1
raid456:for sector 6113912, rcw=0, must_compute=0
raid456:for sector 11766712, rcw=0, must_compute=0
raid456:for sector 11767200, rcw=0, must_compute=1
raid456:for sector 11761952, rcw=0, must_compute=0
raid456:for sector 11765560, rcw=0, must_compute=1
raid456:for sector 11763064, rcw=0, must_compute=1

please let me know if you'd like me to try/print something else.

cheers,
robin

>> >diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> >index b8a2c5d..f8cd6ef 100644
>> >--- a/drivers/md/raid5.c
>> >+++ b/drivers/md/raid5.c
>> >@@ -2436,10 +2436,16 @@ static void handle_stripe_dirtying6(raid5_conf_t *conf,
>> > BUG();
>> > case 1:
>> > compute_block_1(sh, r6s->failed_num[0], 0);
>> >+ set_bit(R5_LOCKED,
>> >+ &sh->dev[r6s->failed_num[0]].flags);
>> > break;
>> > case 2:
>> > compute_block_2(sh, r6s->failed_num[0],
>> > r6s->failed_num[1]);
>> >+ set_bit(R5_LOCKED,
>> >+ &sh->dev[r6s->failed_num[0]].flags);
>> >+ set_bit(R5_LOCKED,
>> >+ &sh->dev[r6s->failed_num[1]].flags);
>> > break;
>> > default: /* This request should have been failed? */
>> > BUG();

Re: rhel5 raid6 corruption

On 18.04.2011 02:00:57 by NeilBrown

On Sat, 9 Apr 2011 00:24:42 -0400 Robin Humble wrote:

> On Fri, Apr 08, 2011 at 08:33:45PM +1000, NeilBrown wrote:
> >On Thu, 7 Apr 2011 03:45:05 -0400 Robin Humble wrote:
> >> On Tue, Apr 05, 2011 at 03:00:22PM +1000, NeilBrown wrote:
> >> >On Mon, 4 Apr 2011 09:59:02 -0400 Robin Humble
> >> >> we are finding non-zero mismatch_cnt's and getting data corruption when
> >> >> using RHEL5/CentOS5 kernels with md raid6.
> >> >> actually, all kernels prior to 2.6.32 seem to have the bug.
> >> >>
> >> >> the corruption only happens after we replace a failed disk, and the
> >> >> incorrect data is always on the replacement disk. i.e. the problem is
> >> >> with rebuild. mismatch_cnt is always a multiple of 8, so I suspect
> >> >> pages are going astray.
> >> ...
> >> >> git bisecting through drivers/md/raid5.c between 2.6.31 (has mismatches)
> >> >> and .32 (no problems) says that one of these (unbisectable) commits
> >> >> fixed the issue:
> >> >> a9b39a741a7e3b262b9f51fefb68e17b32756999 md/raid6: asynchronous handle_stripe_dirtying6
> >> >> 5599becca4bee7badf605e41fd5bcde76d51f2a4 md/raid6: asynchronous handle_stripe_fill6
> >> >> d82dfee0ad8f240fef1b28e2258891c07da57367 md/raid6: asynchronous handle_parity_check6
> >> >> 6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8 md/raid6: asynchronous handle_stripe6
> >> >>
> >> >> any ideas?
> >> >> were any "write i/o whilst rebuilding from degraded" issues fixed by
> >> >> the above patches?
> >> >
> >> >It looks like they were, but I didn't notice at the time.
> >> >
> >> >If a write to a block in a stripe happens at exactly the same time as the
> >> >recovery of a different block in that stripe - and both operations are
> >> >combined into a single "fix up the stripe parity and write it all out"
> >> >operation, then the block that needs to be recovered is computed but not
> >> >written out. oops.
> >> >
> >> >The following patch should fix it. Please test and report your results.
> >> >If they prove the fix I will submit it for the various -stable kernels.
> >> >It looks like this bug has "always" been present :-(
> >>
> >> thanks for the very quick reply!
> >>
> >> however, I don't think the patch has solved the problem :-/
> >> I applied it to 2.6.31.14 and have got several mismatches since on both
> >> FC and SATA machines.
> >
> >That's disappointing - I was sure I had found it. I'm tempted to ask "are
> >you really sure you are running the modified kernel", but I'm sure you are.
>
> yes - really running it :-)
>
> so I added some pr_debug's in and around the patched switch statement
> in handle_stripe_dirtying6. it seems cases 1 and 2 below aren't executed
> during either a normal rebuild or when I get mismatches. so that
> explains why fixes there didn't change anything.
>
> so I guess something in
> if (s->locked == 0 && rcw == 0 &&
> !test_bit(STRIPE_BIT_DELAY, &sh->state)) {
> if (must_compute > 0) {
>
> is always failing?

Yes... I think that whenever must_compute is non-zero, s->locked is too.
handle_stripe_fill6 has already done the compute_block calls, so there is
never a chance for handle_stripe_dirtying6 to do them.

I think this is the patch you want.

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f8cd6ef..83f83cd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2466,8 +2466,6 @@ static void handle_stripe_dirtying6(raid5_conf_t *conf,
if (s->locked == disks)
if (!test_and_set_bit(STRIPE_FULL_WRITE, &sh->state))
atomic_inc(&conf->pending_full_writes);
- /* after a RECONSTRUCT_WRITE, the stripe MUST be in-sync */
- set_bit(STRIPE_INSYNC, &sh->state);

if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
atomic_dec(&conf->preread_active_stripes);


The comment isn't correct. While the stripe in-memory must be in-sync, the
stripe on disk might not be: if we computed a block rather than reading it
from an in-sync disk, the in-memory stripe can differ from the on-disk
stripe.

If this bug were still in mainline I would probably want a bigger patch which
would leave this code but also set R5_LOCKED on all blocks that have been
computed. But as it is a stabilisation patch, the above is simple and more
clearly correct.

Thanks for your patience - I look forward to your success/failure report.

NeilBrown



>
> # cat /sys/kernel/debug/dynamic_debug/control | grep handle_stripe_dirtying6
> drivers/md/raid5.c:2466 [raid456]handle_stripe_dirtying6 - "Writing stripe %llu block %d\012"
> drivers/md/raid5.c:2460 [raid456]handle_stripe_dirtying6 - "Computing parity for stripe %llu\012"
> drivers/md/raid5.c:2448 [raid456]handle_stripe_dirtying6 p "rjh - case2\012"
> drivers/md/raid5.c:2441 [raid456]handle_stripe_dirtying6 p "rjh - case1, r6s->failed_num[0] = %d, flags %lu\012"
> drivers/md/raid5.c:2433 [raid456]handle_stripe_dirtying6 - "rjh - must_compute %d, s->failed %d\012"
> drivers/md/raid5.c:2430 [raid456]handle_stripe_dirtying6 - "rjh - s->locked %d rcw %d test_bit(STRIPE_BIT_DELAY, &sh->state) %d\012"
> drivers/md/raid5.c:2421 [raid456]handle_stripe_dirtying6 - "Request delayed stripe %llu block %d for Reconstruct\012"
> drivers/md/raid5.c:2414 [raid456]handle_stripe_dirtying6 - "Read_old stripe %llu block %d for Reconstruct\012"
> drivers/md/raid5.c:2398 [raid456]handle_stripe_dirtying6 - "for sector %llu, rcw=%d, must_compute=%d\012"
> drivers/md/raid5.c:2392 [raid456]handle_stripe_dirtying6 - "raid6: must_compute: disk %d flags=%#lx\012"
>
> the output is verbose if I turn on some of these. but this is a short
> snippet that I guess looks ok to you?
>
> raid456:for sector 6114040, rcw=0, must_compute=0
> raid456:for sector 6113904, rcw=0, must_compute=0
> raid456:for sector 11766712, rcw=0, must_compute=0
> raid456:for sector 6113912, rcw=0, must_compute=1
> raid456:for sector 6113912, rcw=0, must_compute=0
> raid456:for sector 11766712, rcw=0, must_compute=0
> raid456:for sector 11767200, rcw=0, must_compute=1
> raid456:for sector 11761952, rcw=0, must_compute=0
> raid456:for sector 11765560, rcw=0, must_compute=1
> raid456:for sector 11763064, rcw=0, must_compute=1
>
> please let me know if you'd like me to try/print something else.
>
> cheers,
> robin
>
> >> >diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >> >index b8a2c5d..f8cd6ef 100644
> >> >--- a/drivers/md/raid5.c
> >> >+++ b/drivers/md/raid5.c
> >> >@@ -2436,10 +2436,16 @@ static void handle_stripe_dirtying6(raid5_conf_t *conf,
> >> > BUG();
> >> > case 1:
> >> > compute_block_1(sh, r6s->failed_num[0], 0);
> >> >+ set_bit(R5_LOCKED,
> >> >+ &sh->dev[r6s->failed_num[0]].flags);
> >> > break;
> >> > case 2:
> >> > compute_block_2(sh, r6s->failed_num[0],
> >> > r6s->failed_num[1]);
> >> >+ set_bit(R5_LOCKED,
> >> >+ &sh->dev[r6s->failed_num[0]].flags);
> >> >+ set_bit(R5_LOCKED,
> >> >+ &sh->dev[r6s->failed_num[1]].flags);
> >> > break;
> >> > default: /* This request should have been failed? */
> >> > BUG();

Re: rhel5 raid6 corruption

On 24.04.2011 02:45:45 by Robin Humble

On Mon, Apr 18, 2011 at 10:00:57AM +1000, NeilBrown wrote:
>On Sat, 9 Apr 2011 00:24:42 -0400 Robin Humble wrote:
>> On Fri, Apr 08, 2011 at 08:33:45PM +1000, NeilBrown wrote:
>> >On Thu, 7 Apr 2011 03:45:05 -0400 Robin Humble wrote:
>> >> On Tue, Apr 05, 2011 at 03:00:22PM +1000, NeilBrown wrote:
>> >> >On Mon, 4 Apr 2011 09:59:02 -0400 Robin Humble
>> >> >> we are finding non-zero mismatch_cnt's and getting data corruption when
>> >> >> using RHEL5/CentOS5 kernels with md raid6.
>> >> >> actually, all kernels prior to 2.6.32 seem to have the bug.
>> >> >>
>> >> >> the corruption only happens after we replace a failed disk, and the
>> >> >> incorrect data is always on the replacement disk. i.e. the problem is
....
>> so I guess something in
>> if (s->locked == 0 && rcw == 0 &&
>> !test_bit(STRIPE_BIT_DELAY, &sh->state)) {
>> if (must_compute > 0) {
>>
>> is always failing?
>
>Yes... I think that whenever must_compute is non-zero, s->locked is too.
>handle_stripe_fill6 has already done the compute_block calls, so there is
>never a chance for handle_stripe_dirtying6 to do them.
>
>I think this is the patch you want.
>
>diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>index f8cd6ef..83f83cd 100644
>--- a/drivers/md/raid5.c
>+++ b/drivers/md/raid5.c
>@@ -2466,8 +2466,6 @@ static void handle_stripe_dirtying6(raid5_conf_t *conf,
> if (s->locked == disks)
> if (!test_and_set_bit(STRIPE_FULL_WRITE, &sh->state))
> atomic_inc(&conf->pending_full_writes);
>- /* after a RECONSTRUCT_WRITE, the stripe MUST be in-sync */
>- set_bit(STRIPE_INSYNC, &sh->state);
>
> if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
> atomic_dec(&conf->preread_active_stripes);
>
>
>The comment isn't correct. While the stripe in-memory must be in-sync, the
>stripe on disk might not be: if we computed a block rather than reading it
>from an in-sync disk, the in-memory stripe can differ from the on-disk
>stripe.

yes, I think that works. thanks! :)

sorry for the long delay - I was on the other side of the globe for a
while there, and also I wanted to test as much as possible...

I added the above patch to the 2.6.31 that had your prev patch in it.
then ran tests for 3 days on 2 machines (a total of >1000 rebuild
cycles) and it didn't find any mismatches. so looks like that's fixed
it. whoo!

I also backported your 2 fixes to RHEL 5.5's 2.6.18 kernel and added
the rest of our network and filesystem stack back into the mix, and so
far (albeit with less than a day of testing so far) it seems to work
there too.

>If this bug were still in mainline I would probably want a bigger patch which
>would leave this code but also set R5_LOCKED on all blocks that have been
>computed. But as it is a stabilisation patch, the above is simple and more
>clearly correct.

ok.

BTW, the case1 and case2 pr_debug's I added into the case statements of
your first patch (see the pr_debug list below) still don't seem to be
being hit, although I admit I didn't look for hits there very
comprehensively - they can scroll out of dmesg relatively quickly.

so I kinda suspect the mismatch problem is fixed by the above patch alone?

>Thanks for your patience - I look forward to your success/failure report.

AFAICT it's all good. thank you very much!
it's been a long road to this for us... we are very happy :)

cheers,
robin

>>
>> # cat /sys/kernel/debug/dynamic_debug/control | grep handle_stripe_dirtying6
>> drivers/md/raid5.c:2466 [raid456]handle_stripe_dirtying6 - "Writing stripe %llu block %d\012"
>> drivers/md/raid5.c:2460 [raid456]handle_stripe_dirtying6 - "Computing parity for stripe %llu\012"
>> drivers/md/raid5.c:2448 [raid456]handle_stripe_dirtying6 p "rjh - case2\012"
>> drivers/md/raid5.c:2441 [raid456]handle_stripe_dirtying6 p "rjh - case1, r6s->failed_num[0] = %d, flags %lu\012"
>> drivers/md/raid5.c:2433 [raid456]handle_stripe_dirtying6 - "rjh - must_compute %d, s->failed %d\012"
>> drivers/md/raid5.c:2430 [raid456]handle_stripe_dirtying6 - "rjh - s->locked %d rcw %d test_bit(STRIPE_BIT_DELAY, &sh->state) %d\012"
>> drivers/md/raid5.c:2421 [raid456]handle_stripe_dirtying6 - "Request delayed stripe %llu block %d for Reconstruct\012"
>> drivers/md/raid5.c:2414 [raid456]handle_stripe_dirtying6 - "Read_old stripe %llu block %d for Reconstruct\012"
>> drivers/md/raid5.c:2398 [raid456]handle_stripe_dirtying6 - "for sector %llu, rcw=%d, must_compute=%d\012"
>> drivers/md/raid5.c:2392 [raid456]handle_stripe_dirtying6 - "raid6: must_compute: disk %d flags=%#lx\012"
>>
>> the output is verbose if I turn on some of these. but this is a short
>> snippet that I guess looks ok to you?
>>
>> raid456:for sector 6114040, rcw=0, must_compute=0
>> raid456:for sector 6113904, rcw=0, must_compute=0
>> raid456:for sector 11766712, rcw=0, must_compute=0
>> raid456:for sector 6113912, rcw=0, must_compute=1
>> raid456:for sector 6113912, rcw=0, must_compute=0
>> raid456:for sector 11766712, rcw=0, must_compute=0
>> raid456:for sector 11767200, rcw=0, must_compute=1
>> raid456:for sector 11761952, rcw=0, must_compute=0
>> raid456:for sector 11765560, rcw=0, must_compute=1
>> raid456:for sector 11763064, rcw=0, must_compute=1
>>
>> please let me know if you'd like me to try/print something else.
>>
>> cheers,
>> robin
>>
>> >> >diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> >> >index b8a2c5d..f8cd6ef 100644
>> >> >--- a/drivers/md/raid5.c
>> >> >+++ b/drivers/md/raid5.c
>> >> >@@ -2436,10 +2436,16 @@ static void handle_stripe_dirtying6(raid5_conf_t *conf,
>> >> > BUG();
>> >> > case 1:
>> >> > compute_block_1(sh, r6s->failed_num[0], 0);
>> >> >+ set_bit(R5_LOCKED,
>> >> >+ &sh->dev[r6s->failed_num[0]].flags);
>> >> > break;
>> >> > case 2:
>> >> > compute_block_2(sh, r6s->failed_num[0],
>> >> > r6s->failed_num[1]);
>> >> >+ set_bit(R5_LOCKED,
>> >> >+ &sh->dev[r6s->failed_num[0]].flags);
>> >> >+ set_bit(R5_LOCKED,
>> >> >+ &sh->dev[r6s->failed_num[1]].flags);
>> >> > break;
>> >> > default: /* This request should have been failed? */
>> >> > BUG();