md raid1 rebuild bug? (2.6.32.25)

md raid1 rebuild bug? (2.6.32.25)

On 09.11.2010 13:41:11, by faerber

Hi,

I just stumbled across a problem while rebuilding an MD RAID1 on 2.6.32.25.
The server has 2 disks, /dev/hda and /dev/sda. The RAID1 was degraded, so sda
was replaced and I tried rebuilding from /dev/hda onto the new /dev/sda.
While rebuilding I noticed that /dev/hda has some problems/bad sectors,
but the kernel seems to be stuck in some endless loop:

--
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=239147198,
sector=239147057
hda: possibly failed opcode: 0xc8
end_request: I/O error, dev hda, sector 239147057
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=239148174,
sector=239148081
hda: possibly failed opcode: 0xc8
end_request: I/O error, dev hda, sector 239148081
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=239147213,
sector=239147209
hda: possibly failed opcode: 0xc8
end_request: I/O error, dev hda, sector 239147209
raid1: hda: unrecoverable I/O read error for block 237892224
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=239148174,
sector=239148169
hda: possibly failed opcode: 0xc8
end_request: I/O error, dev hda, sector 239148169
raid1: hda: unrecoverable I/O read error for block 237893120
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=239148225,
sector=239148225
hda: possibly failed opcode: 0xc8
end_request: I/O error, dev hda, sector 239148225
raid1: hda: unrecoverable I/O read error for block 237893248
md: md1: recovery done.
RAID1 conf printout:
--- wd:1 rd:2
disk 0, wo:0, o:1, dev:hda6
disk 1, wo:1, o:1, dev:sda6
RAID1 conf printout:
--- wd:1 rd:2
disk 0, wo:0, o:1, dev:hda6
disk 1, wo:1, o:1, dev:sda6
RAID1 conf printout:
--- wd:1 rd:2
disk 0, wo:0, o:1, dev:hda6
disk 1, wo:1, o:1, dev:sda6
--

I got a new "conf printout" message every few seconds until I used mdadm to
set /dev/sda6 to "faulty" (roughly the commands shown below). I know /dev/hda
is bad and I probably won't be able to rebuild the raid device, but this
endless loop seems fishy?
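
A sketch of what that looked like here, assuming the usual mdadm fail/remove
sequence for this array:
--
# mark the half-rebuilt spare as faulty, then drop it from md1
mdadm /dev/md1 --fail /dev/sda6
mdadm /dev/md1 --remove /dev/sda6
--
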
This is an md RAID1 on 2.6.32.25 with superblock version 0.90.
While the "conf printouts" were looping I had a look at /proc/mdstat:
--
# cat /proc/mdstat
Personalities : [raid1] [raid10]
md1 : active raid1 sda6[2] hda6[0]
119011456 blocks [2/1] [U_]
--
This shows that md1 has not been correctly rebuilt, but dmesg reported "md:
md1: recovery done." earlier?

Would be great if someone who knows the raid code could have a look at this.
I can provide more information if necessary.

Regards,

Sebastian

Re: md raid1 rebuild bug? (2.6.32.25)

On 15.11.2010 02:33:42, by NeilBrown

On Tue, 9 Nov 2010 13:41:11 +0100
Sebastian Färber wrote:

> Hi,
>
> I just stumbled across a problem while rebuilding an MD RAID1 on 2.6.32.25.
> The server has 2 disks, /dev/hda and /dev/sda. The RAID1 was degraded, so sda
> was replaced and I tried rebuilding from /dev/hda onto the new /dev/sda.
> While rebuilding I noticed that /dev/hda has some problems/bad sectors,
> but the kernel seems to be stuck in some endless loop:
>
> [dmesg log and "RAID1 conf printout" messages snipped - quoted in full above]
>
> I got a new "conf printout" message every few seconds until I used mdadm to
> set /dev/sda6 to "faulty". I know /dev/hda is bad and I probably won't be able
> to rebuild the raid device, but this endless loop seems fishy?

Fishy indeed!!

This was supposed to have been fixed by commit
4044ba58dd15cb01797c4fd034f39ef4a75f7cc3
in 2.6.29. But it seems not.

The following patch should fix it properly.
Are you able to apply this patch to your kernel, rebuild, and see if it makes
the required difference?
Thanks.

I'm working on making md cope with this situation better and actually finish
the recovery - recording where the bad blocks are so when you read from the
new device, you can still get read errors, but when you over-write, the error
goes away. But there are so many other things to do....

For now, your best bet might be to use dd-rescue (or is that ddrescue) to
copy from hda6 to sda6, then stop using hda6.
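
A minimal sketch of that rescue copy, assuming GNU ddrescue and the partition
names from this thread (the map-file path is an arbitrary choice):
--
# copy whatever is readable from hda6 onto sda6; -f is needed because the
# output is a block device, and the map file lets you resume/retry later
ddrescue -f /dev/hda6 /dev/sda6 /root/hda6.map
# afterwards the array can usually be started from the copy alone
mdadm --assemble --run /dev/md1 /dev/sda6
--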

NeilBrown


From c074e12fe437827908bc31247a05aec4815e1a1b Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Mon, 15 Nov 2010 12:32:47 +1100
Subject: [PATCH] md/raid1: really fix recovery looping when single good device fails.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Commit 4044ba58dd15cb01797c4fd034f39ef4a75f7cc3 supposedly fixed a
problem where if a raid1 with just one good device gets a read-error
during recovery, the recovery would abort and immediately restart in
an infinite loop.

However it depended on raid1_remove_disk removing the spare device
from the array. But that does not happen in this case.
So add a test so that in the 'recovery_disabled' case, the device will be
removed.

Reported-by: Sebastian Färber
Signed-off-by: NeilBrown

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 45f8324..845cf95 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1161,6 +1161,7 @@ static int raid1_remove_disk(mddev_t *mddev, int number)
 	 * is not possible.
 	 */
 	if (!test_bit(Faulty, &rdev->flags) &&
+	    !mddev->recovery_disabled &&
 	    mddev->degraded < conf->raid_disks) {
 		err = -EBUSY;
 		goto abort;
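
For anyone wanting to try this against a stock 2.6.32.25 tree, applying and
rebuilding would look roughly like the following; the patch file name and the
build/install steps are assumptions - adjust them to your own kernel config
and boot setup:
--
# save the patch from this mail as md-raid1-recovery-loop.patch, then:
cd linux-2.6.32.25
patch -p1 < md-raid1-recovery-loop.patch
make oldconfig
make -j4 bzImage modules
make modules_install install
--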


Re: md raid1 rebuild bug? (2.6.32.25)

On 16.11.2010 10:22:08, by faerber

>
> The following patch should fix it properly.
> Are you able to apply this patch to your kernel, rebuild, and see if it makes
> the required difference?
> Thanks.

I tested your patch against 2.6.32.25 on another server (same problem, but the
original server was already decommissioned) and it fixes the infinite loop.
Thanks! Would be great if this fix shows up in the stable kernels (i.e.
2.6.32.26).

> I'm working on making md cope with this situation better and actually finish
> the recovery - recording where the bad blocks are so when you read from the
> new device, you can still get read errors, but when you over-write, the error
> goes away. But there are so many other things to do....

That would be great, I see quite a lot of bad disks/read errors where the
recovery fails because of bad blocks, and I guess it's not too uncommon to have
a RAID-1 over two (cheap) disks :-)
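
As an aside, a periodic md "check" can surface such bad sectors while both
mirrors are still present; a minimal sketch using this array's name (read
errors hit during the check are rewritten from the other copy where possible):
--
echo check > /sys/block/md1/md/sync_action
cat /proc/mdstat          # watch the check progress
--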

Sebastian