md0_raid5 process consuming 100% CPU on disk failure
Posted on 16.06.2011 21:01:17 by Shivani Bhope

Hello all,
I have been testing how an md RAID 5 array handles disk faults. I set up a
three-disk RAID 5 and started some I/O. Then I entered the command:
mdadm /dev/md0 --fail /dev/sdc
As soon as the disk is marked failed, the md0_raid5 kernel thread is seen
consuming 100% CPU. This happened in roughly 2 out of 3 test runs.
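Roughly, the test sequence looks like this (the create options are inferred
from the mdadm -D output below, and the dd is only a stand-in for the I/O
load, so treat this as a sketch rather than the exact commands):

  # create the three-disk RAID 5 with an internal bitmap
  # (chunk 512K and bitmap inferred from the -D output below)
  mdadm --create /dev/md0 --level=5 --raid-devices=3 --chunk=512 \
        --bitmap=internal /dev/sdb /dev/sdc /dev/sdd

  # generate some background I/O on the array (stand-in workload)
  dd if=/dev/zero of=/dev/md0 bs=1M count=50000 &

  # fail one member while the I/O is running
  mdadm /dev/md0 --fail /dev/sdc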
The details:
1. Output for mdadm -D /dev/md0 before the test:
/dev/md0:
        Version : 1.2
  Creation Time : Tue Jun 14 20:05:07 2011
     Raid Level : raid5
     Array Size : 143129600 (136.50 GiB 146.56 GB)
  Used Dev Size : 71564800 (68.25 GiB 73.28 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Jun 15 13:42:59 2011
          State : active
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : 0
           UUID : 5a1f5b73:7ce46e00:2b16a389:cadd7ae8
         Events : 4716

    Number   Major   Minor   RaidDevice   State
       0       8      16         0        active sync   /dev/sdb
       1       8      32         1        active sync   /dev/sdc
       3       8      48         2        active sync   /dev/sdd
2. Output for mdadm -D after the test:
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jun 15 17:44:11 2011
     Raid Level : raid5
     Array Size : 143129600 (136.50 GiB 146.56 GB)
  Used Dev Size : 71564800 (68.25 GiB 73.28 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Jun 15 18:18:41 2011
          State : active, degraded
 Active Devices : 2
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : 0
           UUID : 8663119c:892e5b7d:a4f2be22:5bb2fdd4
         Events : 1049

    Number   Major   Minor   RaidDevice   State
       0       8      16         0        active sync   /dev/sdb
       1       8      32         1        faulty spare rebuilding   /dev/sdc
       3       8      48         2        active sync   /dev/sdd
Note that the failed device is not marked as removed or spare, but as
"faulty spare rebuilding".
The system has to be manually power-cycled to recover.
This system is running RHEL 6.1 with kernel 2.6.32-131.0.15.el6.x86_64 and
mdadm-3.2.1-1.el6.x86_64. The same test was also run on Fedora 15 with
kernel 2.6.38.8-32.fc15.x86_64 and mdadm-3.1.5-2.fc15.x86_64.
The sysrq-t output for the md0_raid5 thread after the hang is:
Jun 15 18:20:39 kernel: md0_raid5 R running task 0 1896 2 0x00000080
Jun 15 18:20:39 kernel: ffff880175e3fbd0 ffffffff814db337 ffff880175e3fb70 ffffffff8104af29
Jun 15 18:20:39 kernel: 000000000000100c 0000000300000001 ffffe8ffffc00270 ffff88002801e988
Jun 15 18:20:39 kernel: ffff8801747bdab8 ffff880175e3ffd8 000000000000f598 ffff8801747bdac0
Jun 15 18:20:39 kernel: Call Trace:
Jun 15 18:20:39 kernel: [stack frame addresses and function names missing from the archived copy]
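(The trace was captured with the 't' SysRq trigger and read back from
/var/log/messages, roughly like this:)

  # make sure SysRq is enabled, then dump all task states to the kernel log
  echo 1 > /proc/sys/kernel/sysrq
  echo t > /proc/sysrq-trigger

  # pull the md0_raid5 portion back out of the log
  grep -A 20 md0_raid5 /var/log/messages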
I tried increasing the stripe_cache_size to 512 and then to 16384, but the
problem is still seen.
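(The values were set through sysfs, e.g.:)

  # raise the raid5 stripe cache from the default of 256 entries
  echo 512 > /sys/block/md0/md/stripe_cache_size
  echo 16384 > /sys/block/md0/md/stripe_cache_size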
What could be the reason for the problem?
Any pointers will be greatly appreciated.
Thanks,
Shivani