md stuck while rebuilding?
on 18.07.2011 17:14:55 by Sandra Escandor
Hello all,
I'm using mdadm 3.1.4. When a member disk is dropped from a RAID10
(4 member disks in total), operation continues on the other three
disks and a RAID recovery starts. What concerns me is that the recovery
appears to get stuck in a loop once it reports that it is done, and the
system then hangs. Is this a known issue? If so, is there a
work-around or a fix?
Also, what do "wo" and "o" mean in the RAID10 conf printout?
I can send out a more detailed kernel log if needed.
The following are some snippets of the kernel log:
Jul 8 14:57:19 ecs-1u kernel: [ 8753.699144] raid10: Disk failure on
sdc, disabling device.
Jul 8 14:57:19 ecs-1u kernel: [ 8753.699144] raid10: Operation
continuing on 3 devices.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.163655] md: recovery of RAID array
md126
Jul 8 14:57:23 ecs-1u kernel: [ 8758.163660] md: minimum _guaranteed_
speed: 1000 KB/sec/disk.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.163662] md: using maximum
available idle IO bandwidth (but not more than 200000 KB/sec) for
recovery.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.163672] md: using 128k window,
over a total of 732572288 blocks.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.163675] md: resuming recovery of
md126 from checkpoint.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.163677] md: md126: recovery done.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296414] RAID10 conf printout:
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296416] --- wd:3 rd:4
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296417] disk 0, wo:0, o:1,
dev:sdb
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296419] disk 1, wo:1, o:0,
dev:sdc
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296420] disk 2, wo:0, o:1,
dev:sdd
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296421] disk 3, wo:0, o:1,
dev:sde
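To compare the conf printout across repetitions, the per-disk fields can be pulled out with a short script. This is only a sketch: the regular expressions and the function name `parse_conf_printout` are assumptions of this example, it expects the mail-wrapped lines as pasted here, and it does not interpret what "wo" and "o" mean.

```python
import re

# Patterns for the summary line and the per-disk lines of the printout.
# They are applied after the wrapped lines have been joined back together.
SUMMARY_RE = re.compile(r"--- wd:(\d+) rd:(\d+)")
DISK_RE = re.compile(r"disk (\d+), wo:(\d), o:(\d),\s*dev:(\w+)")

def parse_conf_printout(text):
    """Return (wd, rd, disks) extracted from one conf printout."""
    joined = " ".join(text.split())  # undo the mail client's line wrapping
    summary = SUMMARY_RE.search(joined)
    wd = int(summary.group(1)) if summary else None
    rd = int(summary.group(2)) if summary else None
    disks = [
        {"slot": int(slot), "wo": int(wo), "o": int(o), "dev": dev}
        for slot, wo, o, dev in DISK_RE.findall(joined)
    ]
    return wd, rd, disks

# Two of the disk entries from the log above, still wrapped.
sample = """
[ 8758.296416] --- wd:3 rd:4
[ 8758.296417] disk 0, wo:0, o:1,
dev:sdb
[ 8758.296419] disk 1, wo:1, o:0,
dev:sdc
"""

wd, rd, disks = parse_conf_printout(sample)
print(wd, rd)
for disk in disks:
    print(disk)
```

In this excerpt sdc is the only entry whose wo and o values differ from the other members.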
The following output is repeated:
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296673] md: recovery of RAID array
md126
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296676] md: minimum _guaranteed_
speed: 1000 KB/sec/disk.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296679] md: using maximum
available idle IO bandwidth (but not more than 200000 KB/sec) for
recovery.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296686] md: using 128k window,
over a total of 732572288 blocks.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296689] md: resuming recovery of
md126 from checkpoint.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296691] md: md126: recovery done.
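The loop can be made visible by collecting the kernel timestamps of each "recovery done" message and looking at the gaps between them. The following is a rough sketch, not a diagnostic tool: the function names and the 1-second threshold are assumptions chosen for illustration.

```python
import re

# Matches e.g. "[ 8758.163677] md: md126: recovery done."
DONE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+md: (\w+): recovery done\.")

def recovery_done_times(log_text):
    """Map array name -> list of kernel timestamps of 'recovery done'."""
    times = {}
    for line in log_text.splitlines():
        match = DONE_RE.search(line)
        if match:
            times.setdefault(match.group(2), []).append(float(match.group(1)))
    return times

def looks_like_loop(timestamps, max_gap=1.0):
    """True if two consecutive completions are at most max_gap seconds apart.

    The threshold is arbitrary; a real recovery over hundreds of millions
    of blocks would not be expected to finish twice within a second.
    """
    return any(b - a <= max_gap for a, b in zip(timestamps, timestamps[1:]))

sample = """\
Jul 8 14:57:23 ecs-1u kernel: [ 8758.163677] md: md126: recovery done.
Jul 8 14:57:23 ecs-1u kernel: [ 8758.296691] md: md126: recovery done.
"""

times = recovery_done_times(sample)
for array, stamps in times.items():
    print(array, stamps, looks_like_loop(stamps))
```

With the two completions quoted above, the gap is about 0.13 seconds.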
And then after a while, we get this:
Jul 8 14:57:38 ecs-1u kernel: [ 8773.184381] md: resuming recovery of
md126 from checkpoint.
Jul 8 14:57:38 ecs-1u kernel: [ 8773.184384] md: md126: recovery done.
Jul 8 14:57:38 ecs-1u kernel: [ 8773.340104] RAID10 conf printout:
Jul 8 14:57:38 ecs-1u kernel: [ 8773.340106] --- wd:3 rd:4
Jul 8 14:57:38 ecs-1u kernel: [ 8773.340107] disk 0, wo:0, o:1,
dev:sdb
Jul 8 14:57:38 ecs-1u kernel: [ 8773.340109] disk 1, wo:1, o:0,
dev:sdc
Jul 8 14:57:38 ecs-1u kernel: [ 8773.340110] disk 2, wo:0, o:1,
dev:sdd
Jul 8 14:57:38 ecs-1u kernel: [ 8773.340111] disk 3, wo:0, o:1,
dev:sde
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088705] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088710] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088714] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 63 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088723] end_request: I/O error,
dev sdc, sector 1053778688
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088775] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088776] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088778] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 67 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088781] end_request: I/O error,
dev sdc, sector 1053779712
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088817] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088818] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088820] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 6b 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088823] end_request: I/O error,
dev sdc, sector 1053780736
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088859] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088860] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088862] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 6f 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088865] end_request: I/O error,
dev sdc, sector 1053781760
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088909] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088910] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088912] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 73 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.088916] end_request: I/O error,
dev sdc, sector 1053782784
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089014] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089015] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089017] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 77 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089020] end_request: I/O error,
dev sdc, sector 1053783808
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089121] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089122] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089124] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 7b 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089127] end_request: I/O error,
dev sdc, sector 1053784832
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089236] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089237] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089239] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 7f 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089243] end_request: I/O error,
dev sdc, sector 1053785856
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089344] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089345] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089347] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 83 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089351] end_request: I/O error,
dev sdc, sector 1053786880
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089441] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089443] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089444] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 87 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089448] end_request: I/O error,
dev sdc, sector 1053787904
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089536] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089537] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089538] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 8b 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089542] end_request: I/O error,
dev sdc, sector 1053788928
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089631] sd 2:0:0:0: [sdc]
Unhandled error code
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089632] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089634] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 8f 00 00 04 00 00
Jul 8 14:58:17 ecs-1u kernel: [ 8812.089637] end_request: I/O error,
dev sdc, sector 1053789952
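As a cross-check, the CDB bytes in the errors above can be decoded and compared with the sector numbers reported by end_request. This sketch assumes the standard SCSI Write(10) layout (opcode 0x2a, big-endian logical block address in bytes 2 to 5, transfer length in bytes 7 to 8); the function name `decode_write10` is made up for this example.

```python
def decode_write10(cdb_hex):
    """Decode a Write(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks): the starting logical block address and the
    number of blocks to transfer.
    """
    cdb = bytes.fromhex(cdb_hex)  # spaces between bytes are accepted
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a Write(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8
    return lba, blocks

# First two failed writes from the log above.
first = decode_write10("2a 00 3e cf 63 00 00 04 00 00")
second = decode_write10("2a 00 3e cf 67 00 00 04 00 00")
print(first)   # (1053778688, 1024)
print(second)  # (1053779712, 1024)
```

The decoded addresses match the sectors in the corresponding end_request lines, and each write covers 1024 blocks, so the failing writes are back to back. That the addresses equal the 512-byte sector numbers is consistent with the drive using 512-byte logical blocks.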
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041839] INFO: task kthreadd:2
blocked for more than 120 seconds.
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041867] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041905] kthreadd D
0000000000000000 0 2 0 0x00000000
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041908] ffff8801bf13aa60
0000000000000046 0000000000000000 ffff8801bf11d000
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041911] 0000000000000400
0000000000003737 000000000000f9e0 ffff8801bf067fd8
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041913] 0000000000015780
0000000000015780 ffff88033f028710 ffff88033f028a08
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041915] Call Trace:
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041925] [
sync_page+0x0/0x46
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041929] [
io_schedule+0x73/0xb7
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041931] [
sync_page+0x41/0x46
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041933] [
__wait_on_bit+0x41/0x70
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041935] [
wait_on_page_bit+0x6b/0x71
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041938] [
wake_bit_function+0x0/0x23
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041943] [
shrink_page_list+0x14e/0x623
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041948] [
del_timer_sync+0xc/0x16
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041953] [
read_tsc+0xa/0x20
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041955] [
schedule_timeout+0xad/0xdd
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041958] [
ktime_get_ts+0x68/0xb2
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041961] [
delayacct_end+0x74/0x7f
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041963] [
isolate_pages_global+0x1a0/0x20f
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041965] [
finish_wait+0x35/0x60
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041967] [
autoremove_wake_function+0x0/0x2e
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041969] [
shrink_list+0x528/0x767
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041971] [
shrink_zone+0x280/0x342
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041975] [
zone_statistics+0x3c/0x5d
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041977] [
zone_watermark_ok+0x20/0xb1
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041979] [
zone_reclaim+0x276/0x357
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041981] [
isolate_pages_global+0x0/0x20f
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041983] [
zone_watermark_ok+0x20/0xb1
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041985] [
get_page_from_freelist+0x1ff/0x760
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041987] [
__alloc_pages_nodemask+0x11c/0x5f4
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041994] [
cpumask_next_and+0x2a/0x3a
Jul 8 15:01:22 ecs-1u kernel: [ 8997.041998] [
find_busiest_group+0x9ae/0xa1e
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042001] [
alloc_pid+0x26e/0x390
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042003] [
__get_free_pages+0x9/0x46
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042005] [
copy_process+0xd7/0x115f
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042007] [
do_fork+0x157/0x31e
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042009] [
finish_task_switch+0x3a/0xaf
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042012] [
kernel_thread+0x82/0xe0
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042014] [
kthread+0x0/0x81
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042015] [
child_rip+0x0/0x20
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042017] [
kthreadd+0xb1/0xec
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042021] [
early_idt_handler+0x0/0x71
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042022] [
child_rip+0xa/0x20
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042024] [
early_idt_handler+0x0/0x71
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042028] [
do_set_mempolicy+0x128/0x13a
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042029] [
kthreadd+0x0/0xec
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042031] [
child_rip+0x0/0x20
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042076] INFO: task
md126_raid10:3493 blocked for more than 120 seconds.
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042101] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042138] md126_raid10 D
0000000000000000 0 3493 2 0x00000000
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042140] ffff88033f02b880
0000000000000046 0000000000000000 0000000a00000006
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042143] 0000006cffffffff
ffff880006e0fa98 000000000000f9e0 ffff88033df07fd8
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042145] 0000000000015780
0000000000015780 ffff88033e79aa60 ffff88033e79ad58
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042147] Call Trace:
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042150] [
sprintf+0x51/0x59
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042152] [
select_task_rq_fair+0x472/0x836
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042154] [
schedule_timeout+0x2e/0xdd
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042156] [
wait_for_common+0xde/0x15b
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042158] [
default_wake_function+0x0/0x9
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042163] [
kthread_create+0x93/0x121
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042167] [
md_thread+0x0/0x10f [md_mod]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042172] [
__kmalloc+0x12f/0x141
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042175] [
md_register_thread+0x22/0xcc [md_mod]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042178] [
md_do_sync+0x0/0xaf6 [md_mod]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042181] [
md_register_thread+0x96/0xcc [md_mod]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042184] [
md_check_recovery+0x3fd/0x4b9 [md_mod]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042187] [
flush_pending_writes+0x13/0x8a [raid10]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042190] [
raid10d+0x42/0xade [raid10]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042191] [
thread_return+0x79/0xe0
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042194] [
apic_timer_interrupt+0xe/0x20
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042196] [
thread_return+0xd6/0xe0
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042197] [
schedule_timeout+0x2e/0xdd
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042200] [
md_thread+0xf1/0x10f [md_mod]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042202] [
autoremove_wake_function+0x0/0x2e
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042205] [
md_thread+0x0/0x10f [md_mod]
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042206] [
kthread+0x79/0x81
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042208] [
child_rip+0xa/0x20
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042210] [
kthread+0x0/0x81
Jul 8 15:01:22 ecs-1u kernel: [ 8997.042211] [
child_rip+0x0/0x20