md stuck while rebuilding?

on 18.07.2011 17:14:55 by Sandra Escandor

Hello all,

I'm using mdadm 3.1.4. When a member disk is dropped from a RAID10
(4 member disks in total) and operation continues on the other three
disks, a RAID recovery starts. What concerns me is that the recovery
appears to get stuck in a loop once it reports "recovery done", and
this causes the system to hang. Is this a known issue? If so, is there
a work-around or a fix?

Also, what do "wo" and "o" mean in the RAID10 conf printout?

I can send out a more detailed kernel log if needed.

The following are some snippets of the kernel log:

Jul 8 14:57:19 ecs-1u kernel: [ 8753.699144] raid10: Disk failure on
sdc, disabling device.
Jul 8 14:57:19 ecs-1u kernel: [ 8753.699144] raid10: Operation
continuing on 3 devices.

Jul 8 14:57:23 ecs-1u kernel: [ 8758.163655] md: recovery of RAID array
md126
Jul 8 14:57:23 ecs-1u kernel: [ 8758.163660] md: minimum _guaranteed_
speed: 1000 KB/sec/disk.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.163662] md: using maximum
available idle IO bandwidth (but not more than 200000 KB/sec) for
recovery.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.163672] md: using 128k window,
over a total of 732572288 blocks.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.163675] md: resuming recovery of
md126 from checkpoint.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.163677] md: md126: recovery done.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296414] RAID10 conf printout:
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296416] --- wd:3 rd:4
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296417] disk 0, wo:0, o:1,
dev:sdb
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296419] disk 1, wo:1, o:0,
dev:sdc
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296420] disk 2, wo:0, o:1,
dev:sdd
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296421] disk 3, wo:0, o:1,
dev:sde
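
A small sketch for pulling the per-disk fields out of a printout like
the one above. The interpretation in the comments is an assumption
based on a reading of print_conf() in the raid10 driver, not something
confirmed in this thread: wo looks like the negation of the In_sync
flag and o like the negation of the Faulty flag. The function name
parse_conf is made up for illustration.

```python
import re

# Matches one per-disk line of a "RAID10 conf printout". \s* before "dev:"
# lets the pattern span the line break added by mail wrapping.
DISK_RE = re.compile(r"disk (\d+), wo:(\d), o:(\d),\s*dev:(\w+)")

def parse_conf(text):
    """Return one dict per disk line found in text."""
    disks = []
    for m in DISK_RE.finditer(text):
        slot, wo, o, dev = m.groups()
        disks.append({"slot": int(slot), "wo": int(wo), "o": int(o), "dev": dev})
    return disks

sample = """disk 0, wo:0, o:1,
dev:sdb
disk 1, wo:1, o:0,
dev:sdc
"""

for d in parse_conf(sample):
    # Assumption: o:0 means the Faulty flag is set, wo:1 means not In_sync.
    if d["o"] == 0:
        state = "faulty"
    elif d["wo"] == 1:
        state = "not in sync"
    else:
        state = "in sync"
    print(d["dev"], state)
```

Under that reading, sdc (wo:1, o:0) would be the only member that is
both faulty and out of sync, which matches "wd:3 rd:4".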

The following output is repeated:
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296673] md: recovery of RAID array
md126
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296676] md: minimum _guaranteed_
speed: 1000 KB/sec/disk.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296679] md: using maximum
available idle IO bandwidth (but not more than 200000 KB/sec) for
recovery.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296686] md: using 128k window,
over a total of 732572288 blocks.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296689] md: resuming recovery of
md126 from checkpoint.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296691] md: md126: recovery done.
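
To put a number on how often this repeats, one option is to count the
"recovery done" messages per second of uptime in a saved kernel log.
This is only a rough sketch: the function name and the cutoff of 2 are
arbitrary choices for illustration, and the regular expression assumes
the message format shown in the snippets above.

```python
import re
from collections import Counter

# Kernel uptime stamp such as "[ 8758.163677]" followed by the
# "recovery done" message for an md device.
DONE_RE = re.compile(r"\[\s*(\d+)\.\d+\]\s+md: (md\d+): recovery done\.")

def recovery_done_per_second(log_text):
    """Count 'recovery done' messages per (device, uptime second)."""
    counts = Counter()
    for m in DONE_RE.finditer(log_text):
        counts[(m.group(2), int(m.group(1)))] += 1
    return counts

sample = (
    "Jul 8 14:57:23 ecs-1u kernel: [ 8758.163677] md: md126: recovery done.\n"
    "Jul 8 14:57:23 ecs-1u kernel: [ 8758.296691] md: md126: recovery done.\n"
)

counts = recovery_done_per_second(sample)
for (dev, second), n in sorted(counts.items()):
    # Illustrative cutoff: a recovery over ~700 GB should not finish
    # twice within the same second.
    if n >= 2:
        print(f"{dev}: {n} recoveries finished within uptime second {second}")
```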

And then after a while, we get this:

Jul 8 14:57:38 ecs-1u kernel: [ 8773.184381] md: resuming recovery of
md126 from checkpoint.
Jul 8 14:57:38 ecs-1u kernel: [ 8773.184384] md: md126: recovery done.
Jul 8 14:57:38 ecs-1u kernel: [ 8773.340104] RAID10 conf printout:
Jul 8 14:57:38 ecs-1u kernel: [ 8773.340106] --- wd:3 rd:4
Jul 8 14:57:38 ecs-1u kernel: [ 8773.340107] disk 0, wo:0, o:1,
dev:sdb
Jul 8 14:57:38 ecs-1u kernel: [ 8773.340109] disk 1, wo:1, o:0,
dev:sdc
Jul 8 14:57:38 ecs-1u kernel: [ 8773.340110] disk 2, wo:0, o:1,
dev:sdd
Jul 8 14:57:38 ecs-1u kernel: [ 8773.340111] disk 3, wo:0, o:1,
dev:sde
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088705] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088710] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088714] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 63 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088723] end_request: I/O error,
dev sdc, sector 1053778688
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088775] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088776] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088778] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 67 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088781] end_request: I/O error,
dev sdc, sector 1053779712
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088817] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088818] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088820] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 6b 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088823] end_request: I/O error,
dev sdc, sector 1053780736
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088859] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088860] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088862] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 6f 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088865] end_request: I/O error,
dev sdc, sector 1053781760
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088909] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088910] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088912] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 73 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088916] end_request: I/O error,
dev sdc, sector 1053782784
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089014] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089015] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089017] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 77 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089020] end_request: I/O error,
dev sdc, sector 1053783808
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089121] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089122] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089124] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 7b 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089127] end_request: I/O error,
dev sdc, sector 1053784832
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089236] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089237] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089239] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 7f 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089243] end_request: I/O error,
dev sdc, sector 1053785856
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089344] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089345] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089347] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 83 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089351] end_request: I/O error,
dev sdc, sector 1053786880
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089441] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089443] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089444] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 87 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089448] end_request: I/O error,
dev sdc, sector 1053787904
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089536] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089537] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089538] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 8b 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089542] end_request: I/O error,
dev sdc, sector 1053788928
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089631] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089632] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089634] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 8f 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089637] end_request: I/O error,
dev sdc, sector 1053789952
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041839] INFO: task kthreadd:2
blocked for more than 120 seconds.
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041867] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041905] kthreadd D
0000000000000000 0 2 0 0x00000000
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041908] ffff8801bf13aa60
0000000000000046 0000000000000000 ffff8801bf11d000
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041911] 0000000000000400
0000000000003737 000000000000f9e0 ffff8801bf067fd8
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041913] 0000000000015780
0000000000015780 ffff88033f028710 ffff88033f028a08
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041915] Call Trace:
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041925] [] ?
sync_page+0x0/0x46
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041929] [] ?
io_schedule+0x73/0xb7
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041931] [] ?
sync_page+0x41/0x46
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041933] [] ?
__wait_on_bit+0x41/0x70
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041935] [] ?
wait_on_page_bit+0x6b/0x71
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041938] [] ?
wake_bit_function+0x0/0x23
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041943] [] ?
shrink_page_list+0x14e/0x623
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041948] [] ?
del_timer_sync+0xc/0x16
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041953] [] ?
read_tsc+0xa/0x20
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041955] [] ?
schedule_timeout+0xad/0xdd
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041958] [] ?
ktime_get_ts+0x68/0xb2
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041961] [] ?
delayacct_end+0x74/0x7f
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041963] [] ?
isolate_pages_global+0x1a0/0x20f
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041965] [] ?
finish_wait+0x35/0x60
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041967] [] ?
autoremove_wake_function+0x0/0x2e
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041969] [] ?
shrink_list+0x528/0x767
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041971] [] ?
shrink_zone+0x280/0x342
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041975] [] ?
zone_statistics+0x3c/0x5d
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041977] [] ?
zone_watermark_ok+0x20/0xb1
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041979] [] ?
zone_reclaim+0x276/0x357
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041981] [] ?
isolate_pages_global+0x0/0x20f
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041983] [] ?
zone_watermark_ok+0x20/0xb1
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041985] [] ?
get_page_from_freelist+0x1ff/0x760
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041987] [] ?
__alloc_pages_nodemask+0x11c/0x5f4
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041994] [] ?
cpumask_next_and+0x2a/0x3a
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041998] [] ?
find_busiest_group+0x9ae/0xa1e
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042001] [] ?
alloc_pid+0x26e/0x390
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042003] [] ?
__get_free_pages+0x9/0x46
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042005] [] ?
copy_process+0xd7/0x115f
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042007] [] ?
do_fork+0x157/0x31e
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042009] [] ?
finish_task_switch+0x3a/0xaf
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042012] [] ?
kernel_thread+0x82/0xe0
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042014] [] ?
kthread+0x0/0x81
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042015] [] ?
child_rip+0x0/0x20
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042017] [] ?
kthreadd+0xb1/0xec
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042021] [] ?
early_idt_handler+0x0/0x71
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042022] [] ?
child_rip+0xa/0x20
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042024] [] ?
early_idt_handler+0x0/0x71
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042028] [] ?
do_set_mempolicy+0x128/0x13a
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042029] [] ?
kthreadd+0x0/0xec
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042031] [] ?
child_rip+0x0/0x20
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042076] INFO: task
md126_raid10:3493 blocked for more than 120 seconds.
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042101] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042138] md126_raid10 D
0000000000000000 0 3493 2 0x00000000
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042140] ffff88033f02b880
0000000000000046 0000000000000000 0000000a00000006
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042143] 0000006cffffffff
ffff880006e0fa98 000000000000f9e0 ffff88033df07fd8
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042145] 0000000000015780
0000000000015780 ffff88033e79aa60 ffff88033e79ad58
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042147] Call Trace:
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042150] [] ?
sprintf+0x51/0x59
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042152] [] ?
select_task_rq_fair+0x472/0x836
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042154] [] ?
schedule_timeout+0x2e/0xdd
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042156] [] ?
wait_for_common+0xde/0x15b
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042158] [] ?
default_wake_function+0x0/0x9
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042163] [] ?
kthread_create+0x93/0x121
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042167] [] ?
md_thread+0x0/0x10f [md_mod]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042172] [] ?
__kmalloc+0x12f/0x141
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042175] [] ?
md_register_thread+0x22/0xcc [md_mod]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042178] [] ?
md_do_sync+0x0/0xaf6 [md_mod]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042181] [] ?
md_register_thread+0x96/0xcc [md_mod]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042184] [] ?
md_check_recovery+0x3fd/0x4b9 [md_mod]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042187] [] ?
flush_pending_writes+0x13/0x8a [raid10]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042190] [] ?
raid10d+0x42/0xade [raid10]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042191] [] ?
thread_return+0x79/0xe0
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042194] [] ?
apic_timer_interrupt+0xe/0x20
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042196] [] ?
thread_return+0xd6/0xe0
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042197] [] ?
schedule_timeout+0x2e/0xdd
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042200] [] ?
md_thread+0xf1/0x10f [md_mod]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042202] [] ?
autoremove_wake_function+0x0/0x2e
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042205] [] ?
md_thread+0x0/0x10f [md_mod]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042206] [] ?
kthread+0x79/0x81
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042208] [] ?
child_rip+0xa/0x20
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042210] [] ?
kthread+0x0/0x81
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042211] [] ?
child_rip+0x0/0x20
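
As a cross-check on the sdc write errors above, the Write(10) CDB bytes
can be decoded by hand: in the standard 10-byte layout, bytes 2-5 hold
the logical block address and bytes 7-8 the transfer length, both
big-endian. The helper name decode_write10 is made up for this sketch.

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    length = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, length

# First failed command from the log above.
lba, length = decode_write10("2a 00 3e cf 63 00 00 04 00 00")
print(lba, length)  # 1053778688 1024
```

The decoded address matches the sector reported by end_request
(1053778688), and the 1024-block length matches the spacing between
consecutive failing sectors, so these look like sequential writes to
sdc timing out one after another.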