High mismatch count on root device - how to best handle?

on 26.04.2011 00:32:59 by Mark Knecht

I did a drive check today, first time in months, and found I have a
high mismatch count on my RAID1 root device. What's the best way to
handle getting this cleaned up?

1) I'm running some smartctl tests as I write this.

2) Do I just do an

echo repair

to md126 or do I have to boot a rescue CD before I do that?

If you need more info please let me know.

Thanks,
Mark

c2stable ~ # cat /sys/block/md3/md/mismatch_cnt
0
c2stable ~ # cat /sys/block/md6/md/mismatch_cnt
0
c2stable ~ # cat /sys/block/md7/md/mismatch_cnt
0
c2stable ~ # cat /sys/block/md126/md/mismatch_cnt
222336
c2stable ~ # df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/md126 51612920 26159408 22831712 54% /
udev 10240 432 9808 5% /dev
/dev/md7 389183252 144979184 224434676 40% /VirtualMachines
shm 6151452 0 6151452 0% /dev/shm
c2stable ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md6 : active raid1 sdc6[2] sda6[0] sdb6[1]
247416933 blocks super 1.1 [3/3] [UUU]

md7 : active raid6 sdc7[2] sda7[0] sdb7[1] sdd2[3] sde2[4]
395387904 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md3 : active raid6 sdc3[2] sda3[0] sdb3[1] sdd3[3] sde3[4]
157305168 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md126 : active raid1 sdc5[2] sda5[0] sdb5[1]
52436032 blocks [3/3] [UUU]

unused devices: <none>
c2stable ~ #

Re: High mismatch count on root device - how to best handle?

on 26.04.2011 03:30:54 by Mark Knecht

On Mon, Apr 25, 2011 at 3:32 PM, Mark Knecht wrote:
> I did a drive check today, first time in months, and found I have a
> high mismatch count on my RAID1 root device. What's the best way to
> handle getting this cleaned up?
[trim /]

The smartctl tests that I ran (long) completed without error on all 5
drives in the system:

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      2887         -
# 2  Extended offline    Completed without error       00%      2046         -


So, if I understand correctly, the next step would be something like

echo repair >/sys/block/md126/md/sync_action

but I'm unclear about the need to do this when mdadm seems to think
the RAID is clean:

c2stable ~ # mdadm -D /dev/md126
/dev/md126:
Version : 0.90
Creation Time : Tue Apr 13 09:02:34 2010
Raid Level : raid1
Array Size : 52436032 (50.01 GiB 53.69 GB)
Used Dev Size : 52436032 (50.01 GiB 53.69 GB)
Raid Devices : 3
Total Devices : 3
Preferred Minor : 126
Persistence : Superblock is persistent

Update Time : Mon Apr 25 18:29:39 2011
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0

UUID : edb0ed65:6e87b20e:dc0d88ba:780ef6a3
Events : 0.248880

Number Major Minor RaidDevice State
0 8 5 0 active sync /dev/sda5
1 8 21 1 active sync /dev/sdb5
2 8 37 2 active sync /dev/sdc5
c2stable ~ #
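
If I go ahead anyway, my rough plan would be to run the repair, wait for
it to finish, and then run a second check pass to confirm the count
actually stays at zero. Something like this (please correct me if I have
any of it wrong):

echo repair >/sys/block/md126/md/sync_action
cat /proc/mdstat                          # wait for the sync to finish
echo check >/sys/block/md126/md/sync_action
cat /sys/block/md126/md/mismatch_cnt      # hopefully 0 after the repair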

Thanks in advance.

Cheers,
Mark

Re: High mismatch count on root device - how to best handle?

on 26.04.2011 19:22:56 by Mark Knecht

On Mon, Apr 25, 2011 at 6:30 PM, Mark Knecht wrote:
> On Mon, Apr 25, 2011 at 3:32 PM, Mark Knecht wrote:
>> I did a drive check today, first time in months, and found I have a
>> high mismatch count on my RAID1 root device. What's the best way to
>> handle getting this cleaned up?
[trim /]

OK, I don't know exactly what I'm looking at for a problem here. I ran
the repair, then rebooted. Mismatch count was zero. It seemed the
repair had worked.

I then used the system for about 4 hours. After 4 hours I did another
check and found the mismatch count had increased.

What I need to get a handle on is:

1) Is this serious? (I assume yes)

2) How do I figure out which drive(s) of the 3 is having trouble?

3) If there is a specific drive, what is the process to swap it out?

Thanks,
Mark


c2stable ~ # cat /sys/block/md126/md/mismatch_cnt
0
c2stable ~ # echo check >/sys/block/md126/md/sync_action
c2stable ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md6 : active raid1 sdb6[1] sdc6[2] sda6[0]
247416933 blocks super 1.1 [3/3] [UUU]

md7 : active raid6 sdb7[1] sdc7[2] sde2[4] sda7[0] sdd2[3]
395387904 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md3 : active raid6 sdb3[1] sdc3[2] sda3[0] sdd3[3] sde3[4]
157305168 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md126 : active raid1 sdc5[2] sdb5[1] sda5[0]
52436032 blocks [3/3] [UUU]
[>....................] check = 1.1% (626560/52436032) finish=11.0min speed=78320K/sec

unused devices: <none>
c2stable ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md6 : active raid1 sdb6[1] sdc6[2] sda6[0]
247416933 blocks super 1.1 [3/3] [UUU]

md7 : active raid6 sdb7[1] sdc7[2] sde2[4] sda7[0] sdd2[3]
395387904 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md3 : active raid6 sdb3[1] sdc3[2] sda3[0] sdd3[3] sde3[4]
157305168 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md126 : active raid1 sdc5[2] sdb5[1] sda5[0]
52436032 blocks [3/3] [UUU]
[===========>.........] check = 59.6% (31291776/52436032) finish=5.5min speed=63887K/sec

unused devices: <none>
c2stable ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md6 : active raid1 sdb6[1] sdc6[2] sda6[0]
247416933 blocks super 1.1 [3/3] [UUU]

md7 : active raid6 sdb7[1] sdc7[2] sde2[4] sda7[0] sdd2[3]
395387904 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md3 : active raid6 sdb3[1] sdc3[2] sda3[0] sdd3[3] sde3[4]
157305168 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md126 : active raid1 sdc5[2] sdb5[1] sda5[0]
52436032 blocks [3/3] [UUU]

unused devices: <none>
c2stable ~ # cat /sys/block/md126/md/mismatch_cnt
7424
c2stable ~ #

Re: High mismatch count on root device - how to best handle?

on 26.04.2011 21:38:38 by Phil Turmel

Hi Mark,

On 04/26/2011 01:22 PM, Mark Knecht wrote:
> On Mon, Apr 25, 2011 at 6:30 PM, Mark Knecht wrote:
[trim /]

> OK, I don't know exactly what I'm looking at for a problem here. I ran
> the repair, then rebooted. Mismatch count was zero. It seemed the
> repair had worked.
>
> I then used the system for about 4 hours. After 4 hours I did another
> check and found the mismatch count had increased.
>
> What I need to get a handle on is:
>
> 1) Is this serious? (I assume yes)

Maybe. Are you using a file in this filesystem as swap in lieu of a dedicated swap partition?

I vaguely recall reading that certain code paths in the swap logic can abandon queued writes (due to the data no longer being needed by the VM), such that one or more raid members are left inconsistent. Supposedly only affecting mirrored raid, and only for swap files/partitions.

I don't know if this was ever fixed, or even if anyone tried to fix it.
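
If you want to rule that out quickly, /proc/swaps will show where swap
actually lives (just a sanity check, nothing more):

cat /proc/swaps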

> 2) How do I figure out which drive(s) of the 3 is having trouble?

Don't know. Failing drives usually give themselves away with warnings in dmesg, and/or ejection from the array. There's nothing in the kernel or mdadm that'll help here. You'd have to do three-way voting comparison of all blocks on the member partitions.
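
If you ever wanted to attempt that by hand, something roughly like this
would at least show whether the members differ (a sketch only: run it
from a rescue environment, or with the filesystem as quiet as possible,
and expect a handful of legitimate differences near the end of each
partition where the md superblock lives):

cmp -l /dev/sda5 /dev/sdb5 | wc -l    # number of differing bytes
cmp -l /dev/sda5 /dev/sdc5 | wc -l
cmp -l /dev/sdb5 /dev/sdc5 | wc -l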

> 3) If there is a specific drive, what is the process to swap it out?

mdadm /dev/mdX --fail /dev/sdXY
mdadm /dev/mdX --remove /dev/sdXY

(swap drives)

mdadm /dev/mdX --add /dev/sdZY
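
For your md126 that would look something like the following (illustrative
only; substitute whichever member you actually decide to replace, and
remember the new drive may not come up under the same name):

mdadm /dev/md126 --fail /dev/sdc5
mdadm /dev/md126 --remove /dev/sdc5
# swap the drive, partition it to match, then:
mdadm /dev/md126 --add /dev/sdc5
cat /proc/mdstat                      # watch the resync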

HTH,

Phil

Re: High mismatch count on root device - how to best handle?

on 28.04.2011 02:38:29 by Mark Knecht

On Tue, Apr 26, 2011 at 12:38 PM, Phil Turmel wrote:
> Hi Mark,
>
> On 04/26/2011 01:22 PM, Mark Knecht wrote:
>> On Mon, Apr 25, 2011 at 6:30 PM, Mark Knecht wrote:
> [trim /]
>
>> OK, I don't know exactly what I'm looking at for a problem here. I ran
>> the repair, then rebooted. Mismatch count was zero. It seemed the
>> repair had worked.
>>
>> I then used the system for about 4 hours. After 4 hours I did another
>> check and found the mismatch count had increased.
>>
>> What I need to get a handle on is:
>>
>> 1) Is this serious? (I assume yes)
>
> Maybe. Are you using a file in this filesystem as swap in lieu of a dedicated swap partition?
>

No, swap is on 3 drives as 3 partitions. The kernel runs swap and it
has nothing to do with RAID other than it shares a portion of the
drives.

> I vaguely recall reading that certain code paths in the swap logic can abandon queued writes (due to the data no longer being needed by the VM), such that one or more raid members are left inconsistent. Supposedly only affecting mirrored raid, and only for swap files/partitions.
>
> I don't know if this was ever fixed, or even if anyone tried to fix it.
>

md126 is the main 3-drive RAID1 root partition of a Gentoo install.
Kernel is 2.6.38-gentoo-r1 and I'm using mdadm-3.1.4.

Nothing I do with echo repair seems to stick very well. For a few
moments mismatch_cnt will read 0, but as far as I can tell if I do
another echo check then I get another high mismatch_cnt again.

One thing I'm wondering about is whether repair even works on a
3-disk RAID1? I've seen threads out there that suggest it doesn't and
that possibly it's just bypassing the actual repair operation?


>> 2) How do I figure out which drive(s) of the 3 is having trouble?
>
> Don't know. Failing drives usually give themselves away with warnings in dmesg, and/or ejection from the array. There's nothing in the kernel or mdadm that'll help here. You'd have to do three-way voting comparison of all blocks on the member partitions.
>
>> 3) If there is a specific drive, what is the process to swap it out?
>
> mdadm /dev/mdX --fail /dev/sdXY
> mdadm /dev/mdX --remove /dev/sdXY
>
> (swap drives)
>
> mdadm /dev/mdX --add /dev/sdZY
>

I will have some additional things to figure out. There are 5 drives
in this box with a mixture of 3-drive RAID1 & 5-drive RAID6 across
them. If I pull a drive then I need to ensure that all four RAIDs are
going to get rebuilt correctly. I suspect they will, but I'll want to
be careful.

Still, if I haven't a clue which drive is causing the mismatch then I
cannot know which one to pull..

Thanks for your inputs!

Cheers,
Mark

Re: High mismatch count on root device - how to best handle?

on 28.04.2011 03:12:30 by Phil Turmel

Hi Mark,

On 04/27/2011 08:38 PM, Mark Knecht wrote:
> On Tue, Apr 26, 2011 at 12:38 PM, Phil Turmel wrote:
>> Hi Mark,
>>
>> On 04/26/2011 01:22 PM, Mark Knecht wrote:
>>> On Mon, Apr 25, 2011 at 6:30 PM, Mark Knecht wrote:
>> [trim /]
>>
>>> OK, I don't know exactly what I'm looking at for a problem here. I ran
>>> the repair, then rebooted. Mismatch count was zero. It seemed the
>>> repair had worked.
>>>
>>> I then used the system for about 4 hours. After 4 hours I did another
>>> check and found the mismatch count had increased.
>>>
>>> What I need to get a handle on is:
>>>
>>> 1) Is this serious? (I assume yes)
>>
>> Maybe. Are you using a file in this filesystem as swap in lieu of a dedicated swap partition?
>>
>
> No, swap is on 3 drives as 3 partitions. The kernel runs swap and it
> has nothing to do with RAID other than it shares a portion of the
> drives.

OK.

>> I vaguely recall reading that certain code paths in the swap logic can abandon queued writes (due to the data no longer being needed by the VM), such that one or more raid members are left inconsistent. Supposedly only affecting mirrored raid, and only for swap files/partitions.
>>
>> I don't know if this was ever fixed, or even if anyone tried to fix it.
>>
>
> md126 is the main 3-drive RAID1 root partition of a Gentoo install.
> Kernel is 2.6.38-gentoo-r1 and I'm using mdadm-3.1.4.
>
> Nothing I do with echo repair seems to stick very well. For a few
> moments mismatch_cnt will read 0, but as far as I can tell if I do
> another echo check then I get another high mismatch_cnt again.

Hmmm. Since it's not swap, this would make me worry about the hardware. Have you considered shuffling SATA port assignments to see if a pattern shows up? Also consider moving some of the drive power load to another PS.
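
Before physically swapping anything, it may also be worth a quick look
for link-level trouble (a rough check only; SMART attribute names vary
by vendor):

dmesg | grep -iE 'ata[0-9]|reset|error'
smartctl -A /dev/sda | grep -i crc    # UDMA CRC errors usually point at cables/ports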

> One thing I'm wondering about is whether repair even works on a
> 3-disk RAID1? I've seen threads out there that suggest it doesn't and
> that possibly it's just bypassing the actual repair operation?

I've not heard of such. But repair does *not* mean "pick the matching data and write to the third", but rather, "unconditionally write whatever is in the first mirror to the other two, if there's any mismatch".

One of Neil's links explains why, but it boils down to the lack of knowledge about the order writes occurred before the interruption (or bug) that caused the mismatch.

http://neil.brown.name/blog/20100211050355

>>> 2) How do I figure out which drive(s) of the 3 is having trouble?

After messing with hardware (one change at a time), brute-force is next:

Image the drives individually to new drives, or loop-mountable files on other storage, then assemble the copies as degraded arrays, one at a time. For each, compute file-by-file checksums, and compare to each other and to backups or other external reference (you *do* have backups... ?).
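
As a very rough sketch of the idea (paths and mount points made up; and
since md126 uses 0.90 metadata, whose superblock sits at the end of each
member, a copied member can be mounted read-only directly rather than
assembled as a degraded array):

dd if=/dev/sda5 of=/some/big/disk/sda5.img bs=1M conv=noerror
mkdir -p /mnt/copy_a
losetup /dev/loop1 /some/big/disk/sda5.img
mount -o ro /dev/loop1 /mnt/copy_a
(cd /mnt/copy_a && find . -type f -print0 | xargs -0 md5sum | sort -k2 > /tmp/sda5.md5)
# repeat for sdb5 and sdc5, then diff the three checksum lists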

Others may have better suggestions. I've never had to do this.

>> Don't know. Failing drives usually give themselves away with warnings in dmesg, and/or ejection from the array. There's nothing in the kernel or mdadm that'll help here. You'd have to do three-way voting comparison of all blocks on the member partitions.
>>
>>> 3) If there is a specific drive, what is the process to swap it out?
>>
>> mdadm /dev/mdX --fail /dev/sdXY
>> mdadm /dev/mdX --remove /dev/sdXY
>>
>> (swap drives)
>>
>> mdadm /dev/mdX --add /dev/sdZY
>>
>
> I will have some additional things to figure out. There are 5 drives
> in this box with a mixture of 3-drive RAID1 & 5-drive RAID6 across
> them. If I pull a drive then I need to ensure that all four RAIDs are
> going to get rebuilt correctly. I suspect they will, but I'll want to
> be careful.

Paranoia is good. Backups are better.

> Still, if I haven't a clue which drive is causing the mismatch then I
> cannot know which one to pull..

This is really a file system problem, and efforts are underway to solve it. Btrfs in particular, although it is still experimental. I'm looking forward to that status changing.

> Thanks for your inputs!
>
> Cheers,
> Mark

Regards,

Phil

Re: High mismatch count on root device - how to best handle?

on 28.04.2011 07:31:23 by Wolfgang Denk

Dear Phil Turmel,

In message <4DB8BEFE.3020009@turmel.org> you wrote:
>
> Hmmm. Since it's not swap, this would make me worry about the hardware. Have you considered shuffling SATA port assignments to see if a pattern shows up? Also consider moving some of the drive power load to another PS.

I do not think this is hardware related. I see this behaviour on at
least 5 different machines which show no other problems except for the
mismatch count in the RAID 1 partitions that hold the /boot partition.

> > Still, if I haven't a clue which drive is causing the mismatch then I
> > cannot know which one to pull..
>
> This is really a file system problem, and efforts are underway to solve it. Btrfs in particular, although it is still experimental. I'm looking forward to that status changing.

It will probably take some time until grub can boot from a RAID1 array
with btrfs on it...

Best regards,

Wolfgang Denk

--
DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
Q: What's a light-year?
A: One-third less calories than a regular year.

Re: High mismatch count on root device - how to best handle?

on 01.05.2011 00:51:36 by Mark Knecht

On Wed, Apr 27, 2011 at 10:31 PM, Wolfgang Denk wrote:
> Dear Phil Turmel,
>
> In message <4DB8BEFE.3020009@turmel.org> you wrote:
>>
>> Hmmm. Since it's not swap, this would make me worry about the hardware. Have you considered shuffling SATA port assignments to see if a pattern shows up? Also consider moving some of the drive power load to another PS.
>
> I do not think this is hardware related. I see this behaviour on at
> least 5 different machines which show no other problems except for the
> mismatch count in the RAID 1 partitions that hold the /boot partition.
>

That's interesting to me.

In my case /boot is on its own partition and not mounted when I do
the test. There was, however, a RAID6 mounted at the time I was doing
the repair on the RAID1. I tried unmounting it, but that didn't change
anything; I still got the same sort of mismatch count.

>> > Still, if I haven't a clue which drive is causing the mismatch then I
>> > cannot know which one to pull..
>>
>> This is really a file system problem, and efforts are underway to solve it. Btrfs in particular, although it is still experimental. I'm looking forward to that status changing.
>
> It will probably take some time until grub can boot from a RAID1 array
> with btrfs on it...
>
> Best regards,
>
> Wolfgang Denk

Thanks for the info.

Cheers,
Mark

Re: High mismatch count on root device - how to best handle?

on 01.05.2011 16:50:32 by Brad Campbell

On 01/05/11 06:51, Mark Knecht wrote:
> On Wed, Apr 27, 2011 at 10:31 PM, Wolfgang Denk wrote:
>> Dear Phil Turmel,
>>
>> In message<4DB8BEFE.3020009@turmel.org> you wrote:
>>>
>>> Hmmm. Since it's not swap, this would make me worry about the hardware. Have you considered shuffling SATA port assignments to see if a pattern shows up? Also consider moving some of the drive power load to another PS.
>>
>> I do not think this is hardware related. I see this behaviour on at
>> least 5 different machines which show no other problems except for the
>> mismatch count in the RAID 1 partitions that hold the /boot partition.

root@srv:/server# grep . /sys/block/md?/md/mismatch_cnt
/sys/block/md0/md/mismatch_cnt:0
/sys/block/md1/md/mismatch_cnt:128
/sys/block/md2/md/mismatch_cnt:0
/sys/block/md3/md/mismatch_cnt:41728
/sys/block/md4/md/mismatch_cnt:896
/sys/block/md5/md/mismatch_cnt:0
/sys/block/md6/md/mismatch_cnt:4352

root@srv:/server# cat /proc/mdstat | grep md[1346]
md6 : active raid1 sdp6[0] sdo6[1]
md4 : active raid1 sdp3[0] sdo3[1]
md3 : active raid1 sdp2[0] sdo2[1]
md1 : active raid1 sdp1[0] sdo1[1]

root@srv:/server# cat /etc/fstab | grep md[1346]
/dev/md1 / ext4 errors=remount-ro,commit=30,noatime 0 1
/dev/md6 /raid0 ext4 defaults,commit=30,noatime 0 1
/dev/md4 /home ext4 defaults,commit=30,noatime 0 1
/dev/md3 none swap sw

I see them _all_ the time on RAID1's..

When I configured this system _years_ ago, I did not know any better, so
it's a bit of a mish-mash.

The machine also has a 10 drive RAID-6 and a 3 drive RAID-5. The only
time I've seen mismatches on those is when I used a SIL 3132 controller
and it trashed the RAID-6.

Brad

Re: High mismatch count on root device - how to best handle?

on 01.05.2011 19:13:12 by Mark Knecht

On Sun, May 1, 2011 at 7:50 AM, Brad Campbell wrote:
[trim /]

Brad,
Thanks very much for sharing the data and your experiences with
this. I'll simply ignore it on the RAID1's from now on.

Cheers,
Mark