Why does one get mismatches?

on 19.01.2010 11:04:57 by Jon Hardcastle

Hi,

I kicked off a check/repair cycle on my machine after I changed the physical ordering of my drives. I am now on my second check/repair cycle and it keeps finding mismatches.

Is it correct that the mismatch count after a repair should equal the count reported by the preceding check? What if it doesn't? What does it mean if another check STILL reveals mismatches?

I had something similar after I reshaped from raid 5 to 6: I had to run check/repair/check/repair several times before I got my 0.
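
For readers unfamiliar with the cycle being described, here is a minimal shell sketch of it. It assumes the standard md sysfs files sync_action and mismatch_cnt under /sys/block/mdN/md. The helper names and the polling interval are illustrative assumptions, not taken from the poster's actual script, and md4 in the usage comment is only an example array name.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; the sysfs files
# (sync_action, mismatch_cnt) are the ones md exposes per array.

# Start a scrub pass; $1 = md sysfs dir, $2 = "check" or "repair".
start_action() {
    echo "$2" > "$1/sync_action"
}

# Block until the array reports no sync action in progress.
wait_idle() {
    while [ "$(cat "$1/sync_action")" != "idle" ]; do
        sleep "${POLL_SECONDS:-60}"
    done
}

# Print the mismatch count left behind by the last check/repair.
mismatches() {
    cat "$1/mismatch_cnt"
}

# One check pass, followed by a repair only if the check counted mismatches.
check_then_repair() {
    start_action "$1" check
    wait_idle "$1"
    found=$(mismatches "$1")
    echo "check: $found"
    if [ "$found" -gt 0 ]; then
        start_action "$1" repair
        wait_idle "$1"
        echo "repair: $(mismatches "$1")"
    fi
}

# Usage (as root, on a real array):
#   check_then_repair /sys/block/md4/md
```

The kernel's md documentation describes mismatch_cnt as a count of sectors that were (or, for a check, would have been) rewritten, so the number is not a count of distinct errors.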


-----------------------
N: Jon Hardcastle
E: Jon@eHardcastle.com
'Do not worry about tomorrow, for tomorrow will bring worries of its own.'

***********
Please note, I am phasing out jd_hardcastle AT yahoo.com and replacing it with jon AT eHardcastle.com
***********

-----------------------



--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: Why does one get mismatches?

on 20.01.2010 15:19:44 by Brett Russ

On 01/19/2010 05:04 AM, Jon Hardcastle wrote:
> I kicked off a check/repair cycle on my machine after i moved the
> physical ordering of my drives around and I am now on my second
> check/repair cycle and it has kept finding mismatches.
>
> Is it correct that the mismatch value after a repair was needed
> should equal the value present after a check? What if it doesn't?
> What does it mean if another check STILL reveals mismatches?
>
> I had something similar after i reshaped from raid 5 to 6 i had to
> run check/repair/check/repair several times before i got my 0.

I think to diagnose this you'll need to show us the results of running
'mdadm -E /dev/[hs]dX#' (e.g. /dev/sda2) for each member device in the
md device you're trying to assemble *before* attempting to start the md
device. This reports the state of that specific member device
(partition) and shows why a resync/repair would or would not be needed.

Note that if your md device is not in read-only mode, the member
states may change underneath you as you run the above command.
Therefore, you should either stop the device and then run the commands,
or at least put the device in read-only mode first.

-BR


Re: Why does one get mismatches?

on 20.01.2010 15:34:01 by Jon Hardcastle

--- On Wed, 20/1/10, Brett Russ wrote:

> I think to diagnose this you'll need to show us the results
> of running 'mdadm -E /dev/[hs]dX#' (i.e. /dev/sda2) for each
> member device in the md device you're trying to assemble
> *before* attempting to start the md device. This will
> report on the state of that specific member device
> (partition) and will show why a resync/repair would/would
> not be needed.
>
> Note that if your md device is not in a read-only mode that
> the member states may be changing underneath you as you run
> the above command. Therefore, you should either stop the
> device then run the commands, or at least have the device in
> a read-only mode first.

I will gather the information you require, but to be clear: it is an echo 'check' that kicks off the mismatch count, not anything at boot.

Also, I have never marked the array read-only while doing this. I could, but never have. This is a data storage array and isn't (or shouldn't be) in use while I am not there (can that be tested?!). The main OS array, md3, runs the check and completes without a problem.


Re: Why does one get mismatches?

on 20.01.2010 15:46:04 by Brett Russ

On 01/20/2010 09:34 AM, Jon Hardcastle wrote:
> I will gather the information you require, but so it is clear it is a
> a echo 'check' that is kicking off the ultimate mismatch not from
> boot.

What do you mean by mismatches detected? How is this observed?
-BR


Re: Why does one get mismatches?

on 20.01.2010 16:03:38 by Jon Hardcastle

--- On Wed, 20/1/10, Brett Russ wrote:

> What do you mean by mismatches detected? How is this observed?

cat /sys/block/md4/md/mismatch_cnt

I have a script that runs a 'check' and then looks at this value once the check is complete. If it is > 0 it reports the number to me via email and then starts a repair. I am under the impression that the repair, when complete, should (in an ideal world) report an identical number, indicating that it did indeed find x errors and repaired them. But I am getting differing amounts: check shows 8, repair shows 12; on the next run it'll be 24 and 6. So something is up.

I have been able to get the info you asked for using mdadm -E, but not with the array stopped. I can't stop it just yet. What would you be looking for in this data?

Attachment: mdadm-EsdX1.txt

/dev/sda1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 7438efd1:9e6ca2b5:d6b88274:7003b1d3
  Creation Time : Thu Oct 11 00:01:49 2007
     Raid Level : raid6
  Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
     Array Size : 2441919680 (2328.80 GiB 2500.53 GB)
   Raid Devices : 7
  Total Devices : 7
Preferred Minor : 4

    Update Time : Wed Jan 20 12:54:16 2010
          State : clean
 Active Devices : 7
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 0
       Checksum : b2468ec2 - correct
         Events : 1834225

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     5       8        1        5      active sync   /dev/sda1

   0     0       8       65        0      active sync   /dev/sde1
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       97        2      active sync   /dev/sdg1
   3     3       8       17        3      active sync   /dev/sdb1
   4     4       8       49        4      active sync   /dev/sdd1
   5     5       8        1        5      active sync   /dev/sda1
   6     6       8       81        6      active sync   /dev/sdf1

The attachment continues with one block per remaining member (/dev/sdb1 to /dev/sdg1). Each repeats the array-level fields and the device table shown above; the per-member fields are:

Device      this (RaidDevice)  Checksum            Events   Update Time
/dev/sda1   5                  b2468ec2 - correct  1834225  Wed Jan 20 12:54:16 2010
/dev/sdb1   3                  b2468ece - correct  1834225  Wed Jan 20 12:54:16 2010
/dev/sdc1   1                  b2468eda - correct  1834225  Wed Jan 20 12:54:16 2010
/dev/sdd1   4                  b2468ef0 - correct  1834225  Wed Jan 20 12:54:16 2010
/dev/sde1   0                  b2468ef8 - correct  1834225  Wed Jan 20 12:54:16 2010
/dev/sdf1   6                  b2468f14 - correct  1834225  Wed Jan 20 12:54:16 2010
/dev/sdg1   2                  b2468f1c - correct  1834225  Wed Jan 20 12:54:16 2010

Re: Why does one get mismatches?

on 20.01.2010 16:34:00 by Brett Russ

On 01/20/2010 10:03 AM, Jon Hardcastle wrote:
> --- On Wed, 20/1/10, Brett Russ wrote:
>> What do you mean by mismatches detected? How is this observed?
>
> cat /sys/block/md4/md/mismatch_cnt

Someone else will need to comment on what this value pertains to and how
it should behave.

> I have been able to get the info you asked for using mdadm -E but not
> with the array stopped. I cant stop it just yet. What would you be
> looking for in this data?

Simply that the "Update Time" and "Events" values matched across all
members, which they do.
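
The comparison described above can be automated. The following is a rough sketch, assuming the "Field : value" layout that 'mdadm -E' prints for each member; distinct_field is a hypothetical helper name, not an mdadm feature, and the /dev/sd[a-g]1 glob in the usage comment matches this array only.

```shell
#!/bin/sh
# Sketch only: distinct_field is a hypothetical helper, not part of mdadm.

# Read concatenated 'mdadm -E' output on stdin and print how many distinct
# values the named field (e.g. "Events" or "Update Time") takes.
distinct_field() {
    awk -F' : ' -v name="$1" '
        { key = $1; sub(/^ +/, "", key) }    # strip the left padding
        key == name { print $2 }
    ' | sort -u | wc -l | tr -d ' '
}

# Usage (as root); a result of 1 means every member agrees:
#   for d in /dev/sd[a-g]1; do mdadm -E "$d"; done | distinct_field "Events"
```

Running it once for "Events" and once for "Update Time" covers both fields mentioned here.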

-BR


Re: Why does one get mismatches?

on 20.01.2010 21:44:07 by majedb

You can find details on these parameters and their values in md's
documentation: http://www.mjmwired.net/kernel/Documentation/md.txt

A mismatch count can be checked by writing "check" to the proper file
as stated in line 376:
http://www.mjmwired.net/kernel/Documentation/md.txt#376

--
Majed B.

Re: Why does one get mismatches?

on 20.01.2010 23:25:47 by Brett Russ

On 01/20/2010 03:44 PM, Majed B. wrote:
> You can find details on these parameters and their values in md's
> documentation: http://www.mjmwired.net/kernel/Documentation/md.txt
>
> A mismatch count can be checked by writing "check" to the proper file
> as stated in line 376:
> http://www.mjmwired.net/kernel/Documentation/md.txt#376

Sounds like Jon may have a flaky HDD if certain members continually
throw errors. Jon, can you check SMART stats on your drives?

for di in a b c d e f g; do smartctl -a /dev/sd$di | grep -i _sect; done

you may have to add another option to smartctl, I forget.

-BR


Re: Why does one get mismatches?

on 20.01.2010 23:30:15 by majedb

He needs to run a full offline or long test before checking with
smartctl -a -- since it won't show any sector errors if those tests
weren't run at least once.

--
Majed B.

Re: Why does one get mismatches?

on 20.01.2010 23:43:45 by Brett Russ

On 01/20/2010 05:30 PM, Majed B. wrote:
> He needs to run a full offline or long test before checking with
> smartctl -a -- since it won't show any sector errors if those tests
> weren't run at least once.

Not sure I agree with that. The md checks he's been doing will cause a
read of all data regions of the relevant partition and if the disk is
throwing errors, those sectors should be marked probational. Then, if a
subsequent repair ends up remapping them, those sectors will show up as
remapped.

The grep will show both probational and remapped sector counts for each
drive.

BTW, the cmd should also include an echo so it's easy to tell which
drive is being reported:

for di in a b c d e f g; do echo $di; smartctl -a /dev/sd$di | grep -i _sect; done
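
If only the two counters matter, the grep can be narrowed to the raw values. This is a sketch under the assumption that the drives report the usual ATA attribute table, with the attribute ID in the first column and the raw value in the last; sector_counts is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: sector_counts is a hypothetical helper.

# Read 'smartctl -A' output on stdin and print "attribute raw_value" for the
# reallocated (ID 5) and pending (ID 197) sector attributes.
sector_counts() {
    awk '$1 == 5 || $1 == 197 { print $2, $NF }'
}

# Usage (as root):
#   for di in a b c d e f g; do
#       echo "sd$di"
#       smartctl -A "/dev/sd$di" | sector_counts
#   done
```

Whether a long self-test (started with 'smartctl -t long') is needed before these counters become meaningful is the point under dispute in this part of the thread; the sketch takes no position on it.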

-BR


Re: Why does one get mismatches?

on 21.01.2010 00:01:37 by Christopher Chen

I keep misreading the subject of this email thread as:

"Why does one get mustaches?"

cc

--
Chris Chen
"The fact that yours is better than anyone else's
is not a guarantee that it's any good."
-- Motivational Poster

Re: Why does one get mismatches?

on 21.01.2010 05:17:56 by Steven Haigh

On Wed, 20 Jan 2010 17:43:45 -0500, Brett Russ wrote:
> On 01/20/2010 05:30 PM, Majed B. wrote:
>> He needs to run a full offline or long test before checking with
>> smartctl -a -- since it won't show any sector errors if those tests
>> weren't run at least once.
>
> Not sure I agree with that. The md checks he's been doing will cause a
> read of all data regions of the relevant partition and if the disk is
> throwing errors, those sectors should be marked probational. Then, if a
> subsequent repair ends up remapping them, those sectors will show up as
> remapped.
>
> The grep will show both probational and remapped sector counts for each
> drive.
>
> BTW, the cmd should also include an echo so it's easy to tell which
> drive is being reported:
>
> for di in a b c d e f g; do echo $di; smartctl -a /dev/sd$di | grep -i
> _sect; done

Interestingly enough, I'm struggling with a system on this matter too... I
can never seem to get rid of mismatches.

# for di in a b c d e f g; do echo $di; smartctl -a /dev/hd$di | grep -i sect; done
a
=== START OF INFORMATION SECTION ===
=== START OF READ SMART DATA SECTION ===
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
b
c
=== START OF INFORMATION SECTION ===
=== START OF READ SMART DATA SECTION ===
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

Full offline tests of both drives less than 400 power on hours ago all
came up clean. No read errors. Just mismatches.

I can run a repair on them and STILL have mismatches again after a check.
At the moment:

# cat /sys/block/md2/md/mismatch_cnt
1024

It's in the middle of a repair now - as quite often the filesystem on
/dev/md2 will go read-only due to a journal error. I've tried everything
except replacing hardware to figure out what's going on here - but it will
do this like clockwork every month. A reboot later and it'll run an fsck,
find no errors, then between 21 and 30 days later it will go readonly
again.

It's annoying as hell and I wish I could get to the bottom of it!

--
Steven Haigh

Email: netwiz@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299

Re: Why does one get mismatches?

on 21.01.2010 09:08:42 by Asdo

Steven Haigh wrote:
> On Wed, 20 Jan 2010 17:43:45 -0500, Brett Russ wrote:
>
> CUT!
Might that be a problem of the disks/controllers?
Jon and Steven, what hardware do you have?


Re: Why does one get mismatches?

on 21.01.2010 11:52:44 by Steven Haigh

On Thu, 21 Jan 2010 09:08:42 +0100, Asdo wrote:
> Steven Haigh wrote:
>> On Wed, 20 Jan 2010 17:43:45 -0500, Brett Russ wrote:
>>
>> CUT!
> Might that be a problem of the disks/controllers?
> Jon and Steven, what hardware do you have?

I'm running some fairly old hardware on this particular server. It's a
dual P3 1Ghz.

After running a repair on /dev/md2, I now see:
# cat /sys/block/md2/md/mismatch_cnt
1536

Again, no smart errors, nothing to indicate a disk problem at all :(

As this really keeps killing the machine and it is a live system - the
only thing I can really think of doing is to break the RAID and just rsync
the drives twice daily :\


Re: Why does one get mismatches?

on 21.01.2010 12:48:20 by Farkas Levente

On 01/21/2010 11:52 AM, Steven Haigh wrote:
> On Thu, 21 Jan 2010 09:08:42 +0100, Asdo wrote:
>> Steven Haigh wrote:
>>> On Wed, 20 Jan 2010 17:43:45 -0500, Brett Russ wrote:
>>>
>>> CUT!
>> Might that be a problem of the disks/controllers?
>> Jon and Steven, what hardware do you have?
>
> I'm running some fairly old hardware on this particular server. It's a
> dual P3 1Ghz.
>
> After running a repair on /dev/md2, I now see:
> # cat /sys/block/md2/md/mismatch_cnt
> 1536
>
> Again, no smart errors, nothing to indicate a disk problem at all :(
>
> As this really keeps killing the machine and it is a live system - the
> only thing I can really think of doing is to break the RAID and just rsync
> the drives twice daily :\

The same has happened to many people, and we all hate it since it causes
a huge load all weekend on most of our servers :-(
According to Red Hat it's not a bug :-(

--
Levente "Si vis pacem para bellum!"

Re: Why does one get mismatches?

on 21.01.2010 13:15:32 by Jon Hardcastle

--- On Thu, 21/1/10, Farkas Levente wrote:

> >> Might that be a problem of the disks/controllers?
> >> Jon and Steven, what hardware do you have?
>
> the same happened with many people. and we all hate it
> since it cause a huge load at all weekend on most of our servers:-(
> according to redhat it's not a bug:-(

Well, I am running a Sempron-based desktop system that has 4 built-in
SATA ports and 2 IDE ports, plus 2 PCI-E controller cards exposing 2 SATA
ports each.

Off the IDE ports I have two 320GB HDDs that are split across 3 md
arrays: boot/swap/main. Only main is checked/repaired. I very rarely
have a problem here!

On the SATA ports I have 7 HDDs of varying size (4x500, 2x750, 1x1TB)
and make (Samsung, Hitachi, Seagate) strung together to form what is now
a RAID6 (RAID5 until a couple of weeks ago). On top of that I have a VG
split into ~6 LVs, and in some of those I have mounted SquashFS
filesystems. Until I moved the drive order around at the weekend for
access reasons, I didn't really have any problems, except the occasional
issue. I currently scrub it weekly.

BUT I have only just converted from RAID5 to RAID6 and have probably not
run that many checks since, so it could be related to that!

Re: Why does one get mismatches?

on 22.01.2010 17:22:03 by Jon Hardcastle


>
> Note that if your md device is not in a read-only mode that
> the member states may be changing underneath you as you run
> the above command. Therefore, you should either stop the
> device then run the commands, or at least have the device in
> a read-only mode first.
>
> -BR
>

I have just tried this: I unmounted all LVs and then deactivated the VG.
I set the array to read-only, but now any attempt to echo check >
sync_action results in

'write error: device or resource busy'

Any clues?




Re: Why does one get mismatches?

on 22.01.2010 17:34:29 by Asdo

Jon Hardcastle wrote:
>
>
>> Note that if your md device is not in a read-only mode that
>> the member states may be changing underneath you as you run
>> the above command. Therefore, you should either stop the
>> device then run the commands, or at least have the device in
>> a read-only mode first.
>>
>> -BR
>>
>>
>
> I have just tried this - i umounted all LV and then deactivated the VG. I set to read-only but now any attempt to echo check > sync_action results in
>
> 'write error: device or resource busy'
>
> any clues/
>

I think running check or repair is not supported with the MD array set
read-only.
I remember Neil Brown himself saying that, and I seem to recall I have
also seen this in the source code.
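
If that recollection is right, a script that kicks off scrubs could look at the array state before writing to sync_action. The Python below is a minimal sketch, not verified against the kernel source: it assumes the md sysfs directory exposes an array_state file whose value is "readonly" when the array has been set read-only, and the helper names are invented for this example. The demo runs against a scratch directory so that it works without a real array.

```python
import tempfile
from pathlib import Path

# Assumption: array_state reads "readonly" for an array set read-only.
READ_ONLY_STATES = {"readonly"}


def can_request_check(md_dir):
    """Return False when the array state says the array is read-only."""
    state = Path(md_dir, "array_state").read_text().strip()
    return state not in READ_ONLY_STATES


def request_check(md_dir):
    """Write 'check' to sync_action unless the array is read-only."""
    if not can_request_check(md_dir):
        raise RuntimeError("array is read-only; make it read-write first")
    Path(md_dir, "sync_action").write_text("check\n")


# Demo on a scratch directory standing in for /sys/block/mdX/md.
with tempfile.TemporaryDirectory() as scratch:
    Path(scratch, "array_state").write_text("readonly\n")
    blocked = not can_request_check(scratch)
    Path(scratch, "array_state").write_text("clean\n")
    request_check(scratch)
    written = Path(scratch, "sync_action").read_text()

print(blocked, repr(written))
```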

Re: Why does one get mismatches?

on 22.01.2010 18:41:04 by Brett Russ

On 01/22/2010 11:22 AM, Jon Hardcastle wrote:
>
>>
>> Note that if your md device is not in a read-only mode that the
>> member states may be changing underneath you as you run the above
>> command. Therefore, you should either stop the device then run the
>> commands, or at least have the device in a read-only mode first.
>>
>> -BR
>>
>
> I have just tried this - i umounted all LV and then deactivated the
> VG. I set to read-only but now any attempt to echo check>
> sync_action results in

Sorry for the misunderstanding, I was suggesting putting the array in
read only mode only for the purposes of doing the 'mdadm --examine' to
detect if member devices were out of sync with each other.

But, it turns out that the mismatches you're seeing are not a result of
the member devices being out of sync with each other but rather member
devices throwing errors. Sounds like other people see this same
behavior and it's not necessarily tied to any disk sector read errors.
If there are also no I/O errors in the kernel log during your 'check'
operation, you'll need either more verbose md logging during the check
or a look at the code to see what other kinds of errors bump the
mismatch counter.

-BR


Re: Why does one get mismatches?

on 25.01.2010 21:43:54 by Greg

On Jan 21, 12:48pm, Farkas Levente wrote:
} Subject: Re: Why does one get mismatches?

Good afternoon to everyone, hope the week is starting well.

> On 01/21/2010 11:52 AM, Steven Haigh wrote:
> > On Thu, 21 Jan 2010 09:08:42 +0100, Asdo wrote:
> >> Steven Haigh wrote:
> >>> On Wed, 20 Jan 2010 17:43:45 -0500, Brett Russ wrote:
> >>>
> >>> CUT!
> >> Might that be a problem of the disks/controllers?
> >> Jon and Steven, what hardware do you have?
> >
> > I'm running some fairly old hardware on this particular server. It's a
> > dual P3 1Ghz.
> >
> > After running a repair on /dev/md2, I now see:
> > # cat /sys/block/md2/md/mismatch_cnt
> > 1536
> >
> > Again, no smart errors, nothing to indicate a disk problem at all :(
> >
> > As this really keeps killing the machine and it is a live system - the
> > only thing I can really think of doing is to break the RAID and just rsync
> > the drives twice daily :\

> the same happened with many people. and we all hate it since it
> cause a huge load at all weekend on most of our servers:-( according
> to redhat it's not a bug:-(

The RAID check/mismatch_count is an example of well intentioned
technology suffering from 'featuritis' by the distributions which is,
as I predicted a couple of times in this forum, causing all sorts of
angst and problems throughout the world. I've had some posts on this
subject but will summarize in the hopes of giving some background
information which will be useful to people.

There is an issue in the kernel which causes these mismatches. The
problem seems to be particularly bad with RAID1 arrays. The
contention is that these mismatches are 'harmless' because they only
occur in areas of the filesystems which are not being used.

The best description is that the buffers containing the data to be
written are not 'pinned' all the way down the I/O stack. This can
cause the contents of a buffer to be changed while in transit through
the I/O stack. Thus one side of a mirror gets a buffer written to it
that differs from what the other side of the mirror gets.
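
The effect described above can be imitated in a few lines of Python. This is only a toy model: the buffer, the two 'disks' and the scribble hook are hypothetical stand-ins, and real md code does nothing this simple. It shows why writing one unpinned buffer to two mirrors can leave them different, and why taking a private copy first avoids that.

```python
def write_to_mirrors(buffer, mirrors, mutate_between=None, pin=False):
    """Copy `buffer` to every mirror in turn.

    mutate_between: hypothetical hook standing in for another writer
    (for example an mmap user) that touches the buffer after the first
    mirror has been written. pin=True takes a private snapshot first.
    """
    source = bytes(buffer) if pin else buffer
    for index, mirror in enumerate(mirrors):
        mirror[:] = source  # the mirror receives whatever `source` holds now
        if index == 0 and mutate_between is not None:
            mutate_between(buffer)


def count_differing_bytes(a, b):
    """Number of byte positions at which the two mirrors disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


def scribble(buf):
    buf[0] = ord("B")  # the page changes while the write is in flight


page = bytearray(b"AAAAAAAA")
disk0, disk1 = bytearray(8), bytearray(8)
write_to_mirrors(page, [disk0, disk1], mutate_between=scribble)
unpinned_diff = count_differing_bytes(disk0, disk1)

page = bytearray(b"AAAAAAAA")
disk0, disk1 = bytearray(8), bytearray(8)
write_to_mirrors(page, [disk0, disk1], mutate_between=scribble, pin=True)
pinned_diff = count_differing_bytes(disk0, disk1)

print(unpinned_diff, pinned_diff)  # 1 0
```

Neither mirror holds bad data in the unpinned case; each simply captured the page at a different moment, which is the sense in which such mismatches are argued to be harmless.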

I've read reasoned discussions about why this occurs with swap over
RAID1 and why it's harmless. I've yet to see the same type of reasoned
discussion as to why it is not problematic with a filesystem over
RAID1. There has been some discussion that it's due to high levels of
MMAP activity on the filesystem.

We have confirmed that, at least with RAID1, this all occurs with no
physical corruption on the 'disk drives'. We implement geographically
mirrored storage with RAID1 across two separate data-centers. At each
data-center the RAID1 'block devices' are RAID5 volumes. These latter
volumes check out with no errors/mismatch counts etc. So the issue is
at the RAID1 data abstraction layer.

There do not appear to be any tools which allow one to determine
'where' the mismatches are. Such a tool, or logging by the kernel,
would be useful for people who want to verify what files, if any, are
affected by the mismatch. Otherwise running a 'repair' results in the
RAID1 code arbitrarily deciding which of the two blocks is the
'correct' one.
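
In the absence of such a tool, the idea can at least be sketched. The Python below compares two files chunk by chunk and reports the byte offsets where they differ. It assumes you have images of the data areas of both mirror halves taken while the array was idle; it knows nothing about md superblocks or data offsets, so pointing it at raw member devices would need extra care. The function name and the 4096-byte chunk size are illustrative choices.

```python
import os
import tempfile


def find_mismatched_chunks(path_a, path_b, chunk_size=4096):
    """Yield the byte offset of every chunk that differs between two files."""
    offset = 0
    with open(path_a, "rb") as side_a, open(path_b, "rb") as side_b:
        while True:
            block_a = side_a.read(chunk_size)
            block_b = side_b.read(chunk_size)
            if not block_a and not block_b:
                break  # both files exhausted
            if block_a != block_b:
                yield offset
            offset += chunk_size


# Demo on two scratch files that differ only in their second chunk.
with tempfile.TemporaryDirectory() as scratch:
    image_a = os.path.join(scratch, "half_a.img")
    image_b = os.path.join(scratch, "half_b.img")
    with open(image_a, "wb") as f:
        f.write(b"\0" * 4096 * 3)
    with open(image_b, "wb") as f:
        f.write(b"\0" * 4096 + b"\1" * 4096 + b"\0" * 4096)
    mismatches = list(find_mismatched_chunks(image_a, image_b))

print(mismatches)  # [4096]
```

Mapping an offset back to a file would still need a filesystem-specific tool, which is outside what this sketch attempts.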

So that's sort of a thumbnail sketch of what is going on. The fact that
the distributions chose to implement this without understanding the
issues it presents is a bit problematic.

> Levente "Si vis pacem para bellum!"

Hopefully this information is helpful.

Greg

}-- End of excerpt from Farkas Levente

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND 58102             development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"I am returning this otherwise good typing paper to you because
someone has printed gibberish all over it and put your name at the
top."
                                -- English Professor, Ohio University

Re: Why does one get mismatches?

on 25.01.2010 23:49:28 by Steven Haigh

On 26/01/2010, at 7:43 AM, greg@enjellic.com wrote:

> There is an issue in the kernel which causes these mismatches. The
> problem seems to be particularly bad with RAID1 arrays. The
> contention is that these mismatches are 'harmless' because they only
> occur in areas of the filesystems which are not being used.

Hi Greg and all,

The funny part is that I believe the mismatches aren't happening in the
empty space of the filesystem, as it seems that the errors are causing
the ext3 journal to abort and force the filesystem read-only in my
particular situation.

It is interesting that I do not get any mismatches on md0, md1 or md3 -
only md2.

md0 = /boot
md1 = swap
md2 = /
md3 = /tmp

I ran weekly checks on all four RAID1 arrays and ONLY md2 had a
problem with mismatches, and it also had a habit of going read-only.
I therefore don't accept the common belief that this problem only
affects empty parts of the filesystem.

I have also run just about every test on the disks that I can think of
with no errors found, leaving only the md layer as suspect.
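
A per-array comparison like the one above is easy to automate. The Python below is a rough sketch that walks a sysfs-style tree and lists the arrays whose mismatch_cnt is non-zero. The helper names are invented for this example, and the demo builds a scratch tree that imitates the four arrays (with 1536 for md2, the figure quoted earlier in the thread) rather than reading a real /sys/block.

```python
import tempfile
from pathlib import Path


def collect_mismatch_counts(sys_block="/sys/block"):
    """Map each md array name to the integer in its mismatch_cnt file."""
    counts = {}
    for counter in sorted(Path(sys_block).glob("md*/md/mismatch_cnt")):
        counts[counter.parent.parent.name] = int(counter.read_text().strip())
    return counts


def arrays_with_mismatches(counts):
    """Return the names of arrays whose last check found mismatches."""
    return [name for name, count in counts.items() if count > 0]


# Demo on a scratch tree that imitates the four arrays described above.
with tempfile.TemporaryDirectory() as scratch:
    for name, value in [("md0", 0), ("md1", 0), ("md2", 1536), ("md3", 0)]:
        md_dir = Path(scratch, name, "md")
        md_dir.mkdir(parents=True)
        (md_dir / "mismatch_cnt").write_text(f"{value}\n")
    counts = collect_mismatch_counts(scratch)

print(arrays_with_mismatches(counts))  # ['md2']
```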

--
Steven Haigh

Email: netwiz@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299

RE: Why does one get mismatches?

on 27.01.2010 22:54:12 by Tirumala Reddy Marri

I ran "echo check > /sys/block/md0/md/sync_action" after I ran "echo
repair > /sys/block/md0/md/sync_action".
I am seeing a whole bunch of mismatches (a count like 1233072). I am
using a RAID-5 array, though.




RE: Why does one get mismatches?

on 28.01.2010 10:16:28 by Jon Hardcastle


Well, I finished running my non-destructive badblocks check and ran
several smart --long tests. I also did a forced fsck on the bad boy, and
NOW the active md4 (with a deactivated VG on it) returns 0 mismatch_cnt.
I haven't rebooted it in days, though, so I just don't know what caused
this. There are no errors in the log, and the pending/reallocated sector
count is still 0 on all drives.

I have reactivated my VG and am running it again now. It is just bizarre.

Re: Why does one get mismatches?

on 28.01.2010 11:29:57 by Asdo

Jon Hardcastle wrote:
> Well, I finished running my non-destructive badblocks check and ran
> several smart --long tests. I also did a forced fsck on the bad boy,
> and NOW the active md4 (with a deactivated VG on it) returns 0
> mismatch_cnt. I haven't rebooted it in days, though, so I just don't
> know what caused this. No errors in the log, and the
> pending/reallocated sector count is still 0 on all drives.
>
> I have reactivated my VG and am running it again now. It is just
> bizarre.
>


Very interesting!

I'm wondering what we can infer...
I'm thinking probably it was either a fault of the disks, fixed with
smart --long tests, or a fault of the filesystem, fixed with forced fsck.
What do you think?

Tell us if mismatches start to happen again.

Thank you

RE: Why does one get mismatches?

on 28.01.2010 18:20:29 by Tirumala Reddy Marri

I have noticed that if I zero /dev/md0 using "dd if=/dev/zero
of=/dev/md0 bs=4k count=64k" and then run "echo check >
/sys/block/md0/md/sync_action", no mismatch_cnt is reported. If I use
"if=/some/file" instead, I see the mismatch count set to a huge number.

I am testing with a small RAID-5 array for quick testing. How reliable
is this test? I am using a HW-accelerated XOR engine for RAID-5.
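
One reason to be wary of the all-zero test: RAID-5 parity is the byte-wise XOR of the data chunks in a stripe, so all-zero data has all-zero parity. The Python sketch below illustrates that point with a deliberately broken, purely hypothetical 'engine' that always returns zeros. It passes the zero test and fails on ordinary data. This says nothing about what the real hardware does; it only shows that a clean result on zeroed data is weak evidence.

```python
from functools import reduce
from operator import xor


def xor_parity(chunks):
    """RAID-5 style parity: byte-wise XOR of equally sized data chunks."""
    return bytes(reduce(xor, column) for column in zip(*chunks))


def stripe_is_consistent(chunks, parity):
    """True when the stored parity matches the parity of the data."""
    return xor_parity(chunks) == parity


def stuck_at_zero_engine(chunks):
    """Hypothetical faulty XOR engine that always outputs zeros."""
    return bytes(len(chunks[0]))


zero_chunks = [bytes(4), bytes(4), bytes(4)]
file_chunks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xff\x00\xff\x00"]

zero_test_passes = stripe_is_consistent(
    zero_chunks, stuck_at_zero_engine(zero_chunks))
file_test_passes = stripe_is_consistent(
    file_chunks, stuck_at_zero_engine(file_chunks))

print(zero_test_passes, file_test_passes)  # True False
```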



-----Original Message-----
From: linux-raid-owner@vger.kernel.org
[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Tirumala Reddy
Marri
Sent: Wednesday, January 27, 2010 1:54 PM
To: Steven Haigh; linux-raid@vger.kernel.org
Subject: RE: Why does one get mismatches?

I ran echo check > /sys/bloc/md0/md/sync_action after I ran the "echo
repair > /sys/block/md0/md/sync_action" .
I am seeing whole bunch of mismatch errors like 1233072 . I am using
RAID-5 array though.



-----Original Message-----
From: linux-raid-owner@vger.kernel.org
[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Steven Haigh
Sent: Monday, January 25, 2010 2:49 PM
To: linux-raid@vger.kernel.org
Subject: Re: Why does one get mismatches?


On 26/01/2010, at 7:43 AM, greg@enjellic.com wrote:

> On Jan 21, 12:48pm, Farkas Levente wrote:
> } Subject: Re: Why does one get mismatches?
>
> Good afternoon to everyone, hope the week is starting well.
>
>> On 01/21/2010 11:52 AM, Steven Haigh wrote:
>>> On Thu, 21 Jan 2010 09:08:42 +0100, Asdo wrote:
>>>> Steven Haigh wrote:
>>>>> On Wed, 20 Jan 2010 17:43:45 -0500, Brett Russ
>>> wrote:
>>>>>
>>>>> CUT!
>>>> Might that be a problem of the disks/controllers?
>>>> Jon and Steven, what hardware do you have?
>>>
>>> I'm running some fairly old hardware on this particular server. It's
a
>>> dual P3 1Ghz.
>>>
>>> After running a repair on /dev/md2, I now see:
>>> # cat /sys/block/md2/md/mismatch_cnt
>>> 1536
>>>
>>> Again, no smart errors, nothing to indicate a disk problem at all :(
>>>
>>> As this really keeps killing the machine and it is a live system -
the
>>> only thing I can really think of doing is to break the RAID and just
rsync
>>> the drives twice daily :\
>
>> the same happened with many people. and we all hate it since it
>> cause a huge load at all weekend on most of our servers:-( according
>> to redhat it's not a bug:-(
>
> The RAID check/mismatch_count is an example of well intentioned
> technology suffering from 'featuritis' by the distributions which is,
> as I predicted a couple of times in this forum, causing all sorts of
> angst and problems throughout the world. I've had some posts on this
> subject but will summarize in the hopes of giving some background
> information which will be useful to people.
>
> There is an issue in the kernel which causes these mismatches. The
> problem seems to be particularly bad with RAID1 arrays. The
> contention is that these mismatches are 'harmless' because they only
> occur in areas of the filesystems which are not being used.
>
> The best description is that the buffers containing the data to be
> written are not 'pinned' all the way down the I/O stack. This can
> cause the contents of a buffer to be changed while in transit through
> the I/O stack. Thus one copy of a mirror gets a buffer written to it
> different then the other side of the mirror.
>
> I've read reasoned discussions about why this occurs with swap over
> RAID1 and why its harmless. I've set to see the same type of reasoned
> discussion as to why it is not problematic with a filesystem over
> RAID1. There has been some discussion that its due to high levels of
> MMAP activity on the filesystem.
>
> We have confirmed, that at least with RAID1, this all occurs with no
> physical corruption on the 'disk drives'. We implement geographically
> mirror storage with RAID1 against two separate data-centers. At each
> data-center the RAID1 'block-device' are RAID5 volumes. These latter
> volumes check out with no errors/mismatch counts etc. So the issue is
> at the RAID1 data abstraction layer.
>
> There do not appear to be any tools which allow one to determine
> 'where' the mismatches are. Such a tool, or logging by the kernel,
> would be useful for people who want to verify what files, if any, are
> affected by the mismatch. Otherwise running a 'repair' results in the
> RAID1 code arbitraily deciding which of the two blocks is the
> 'correct' one.
>
> So thats sort of a thumbnail sketch of what is going on. The fact the
> distributions chose to implement this without understanding the issues
> it presents is a bit problematic.
>
>> Levente "Si vis pacem para bellum!"
>
> Hopefully this information is helpful.
>
> Greg

Hi Greg and all,

The funny part is that I believe the mismatches aren't happening in the
empty space of the filesystem - as it seems that the errors are causing
the ext3 journal to abort and force the filesystem into readonly in my
particular situation.

It is interesting that I do not get any mismatches on md0, md1 or md3 -
only md2.

md0 = /boot
md1 = swap
md2 = /
md3 = /tmp

I ran weekly checks on all four RAID1 arrays and ONLY md2 had a
problem with mismatches. md2 also had a habit of going read-only, so
I don't accept the common belief that this problem only affects
empty parts of the filesystem.

I have also done just about every test to the disks that I can think of
with no errors to be found - leaving only the md layer to be suspect.
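
For comparing arrays like this, a minimal Python sketch that reads the
counter each check leaves behind. It assumes the usual md sysfs layout,
/sys/block/<array>/md/mismatch_cnt; the function name
read_mismatch_counts and the sysfs_root parameter are made up for
illustration and are not part of any md tool.

```python
from pathlib import Path


def read_mismatch_counts(arrays, sysfs_root="/sys/block"):
    """Return {array_name: mismatch_cnt} for each named md array.

    A value of None means the counter could not be read (array missing,
    no permission, or unexpected file contents).
    """
    counts = {}
    for name in arrays:
        # Assumed layout: <sysfs_root>/<array>/md/mismatch_cnt
        path = Path(sysfs_root) / name / "md" / "mismatch_cnt"
        try:
            counts[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            counts[name] = None
    return counts


if __name__ == "__main__":
    # Array names taken from the message above.
    for name, count in read_mismatch_counts(["md0", "md1", "md2", "md3"]).items():
        print(f"{name}: {'unavailable' if count is None else count}")
```

This only reports a count per array; it cannot say where on the array
the mismatches are.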

--
Steven Haigh

Email: netwiz@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299







Re: Why does one get mismatches?

on 28.01.2010 19:23:38 by Goswin von Brederlow

"Tirumala Reddy Marri" writes:

> I have noticed that if I zero /dev/md0 using "dd if=/dev/zero
> of=/dev/md0 bs=4k count=64k" and then run "echo check >
> /sys/block/md0/md/sync_action", no mismatch_cnt is reported. If I use
> "if=/some/file" instead, the mismatch count is set to a huge number.
>
> I am testing with a small RAID-5 array for quick testing. How reliable
> is this test? I am using a HW-accelerated XOR engine for RAID-5.

Have you tried without?

Maybe the XOR engine creates garbage.

Regards,
Goswin

RE: Why does one get mismatches?

on 28.01.2010 20:03:59 by Tirumala Reddy Marri

I just tried without the XOR engine, and the mismatch count is zero.
Interesting: is the XOR engine doing something wrong? Then I ran a test
where I raw-wrote the file to /dev/md0, then did a raw read of the same
size. In this case the XOR matched as expected. Then I failed a drive using
"mdadm -f /dev/md0 /dev/sda" and read the same amount of data again
from /dev/md0, and the checksum matches too.
Does this mean the XOR engine is doing the right thing, but the
"check/repair" test is not functioning properly with the XOR engine?

Or does this have something to do with how the buffers are handled?
Maybe they are cached?
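
To make the parity question concrete, here is a toy software model of
what a RAID-5 stripe check and a degraded read compute. It is a sketch
only, not the md code or the hardware engine's interface; the function
names and the 4-byte blocks are invented for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity):
    """What a 'check' verifies per stripe: stored parity == XOR of the data."""
    return xor_blocks(data_blocks) == parity


def rebuild_missing(surviving_blocks, parity):
    """What a degraded read computes: the missing block from the rest."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Hypothetical 3-data-disk stripe with made-up contents.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

print(stripe_is_consistent(data, parity))
print(rebuild_missing([data[0], data[2]], parity) == data[1])
```

Note that all-zero data always has all-zero parity, so a zeroed array
may be a weak test of an XOR engine: an engine that wrongly returned
zeros would still pass it.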


-Marri



-----Original Message-----
From: goswin-v-b@web.de [mailto:goswin-v-b@web.de]
Sent: Thursday, January 28, 2010 10:24 AM
To: Tirumala Reddy Marri
Cc: linux-raid@vger.kernel.org
Subject: Re: Why does one get mismatches?

"Tirumala Reddy Marri" writes:

> I have noticed that if I zero /dev/md0 using "dd if=/dev/zero
> of=/dev/md0 bs=4k count=64k" and then run "echo check >
> /sys/block/md0/md/sync_action", no mismatch_cnt is reported. If I use
> "if=/some/file" instead, the mismatch count is set to a huge number.
>
> I am testing with a small RAID-5 array for quick testing. How reliable
> is this test? I am using a HW-accelerated XOR engine for RAID-5.

Have you tried without?

Maybe the XOR engine creates garbage.

Regards,
Goswin

Re: Why does one get mismatches?

on 28.01.2010 21:24:46 by Goswin von Brederlow

"Tirumala Reddy Marri" writes:

> I just tried without the XOR engine, and the mismatch count
> is zero. Interesting: is the XOR engine doing something wrong?
> Then I ran a test where I raw-wrote the file to /dev/md0, then
> did a raw read of the same size. In this case the XOR matched
> as expected. Then I failed a drive using
> "mdadm -f /dev/md0 /dev/sda" and read the same amount of data
> again from /dev/md0, and the checksum matches too.
> Does this mean the XOR engine is doing the right thing, but the
> "check/repair" test is not functioning properly with the XOR
> engine?
>
> Or does this have something to do with how the buffers are
> handled? Maybe they are cached?
>
>
> -Marri

No idea. But if everything works without the XOR engine and gives
mismatches with it, then I would think there is a software or hardware
error there and not in the cable or disks.

Regards,
Goswin

Re: Why does one get mismatches?

on 29.01.2010 16:37:05 by Jon Hardcastle

--- On Thu, 28/1/10, Goswin von Brederlow wrote:

> From: Goswin von Brederlow
> Subject: Re: Why does one get mismatches?
> To: linux-raid@vger.kernel.org
> Date: Thursday, 28 January, 2010, 20:24
>
> "Tirumala Reddy Marri" writes:
>
> > I just tried without the XOR engine, and the mismatch count
> > is zero. Interesting: is the XOR engine doing something wrong?
> > Then I ran a test where I raw-wrote the file to /dev/md0, then
> > did a raw read of the same size. In this case the XOR matched
> > as expected. Then I failed a drive using
> > "mdadm -f /dev/md0 /dev/sda" and read the same amount of data
> > again from /dev/md0, and the checksum matches too.
> > Does this mean the XOR engine is doing the right thing, but the
> > "check/repair" test is not functioning properly with the XOR
> > engine?
> >
> > Or does this have something to do with how the buffers are
> > handled? Maybe they are cached?
> >
> >
> > -Marri
>
> No idea. But if everything works without the XOR engine and
> gives mismatches with it, then I would think there is a
> software or hardware error there and not in the cable or disks.
>
> Regards,
> Goswin

I think my RAM is fecked. Does that sound like a possible cause?
memtest86 gives PAGES of red errors when run, but the POST reports
nothing and the machine boots... It has 512MB.

I have some more on order, speculatively.


Re: Why does one get mismatches?

on 30.01.2010 00:52:28 by Goswin von Brederlow

Jon Hardcastle writes:

> --- On Thu, 28/1/10, Goswin von Brederlow wrote:
>
>> From: Goswin von Brederlow
>> Subject: Re: Why does one get mismatches?
>> To: linux-raid@vger.kernel.org
>> Date: Thursday, 28 January, 2010, 20:24
>>
>> "Tirumala Reddy Marri" writes:
>>
>> > I just tried without the XOR engine, and the mismatch count
>> > is zero. Interesting: is the XOR engine doing something wrong?
>> > Then I ran a test where I raw-wrote the file to /dev/md0, then
>> > did a raw read of the same size. In this case the XOR matched
>> > as expected. Then I failed a drive using
>> > "mdadm -f /dev/md0 /dev/sda" and read the same amount of data
>> > again from /dev/md0, and the checksum matches too.
>> > Does this mean the XOR engine is doing the right thing, but the
>> > "check/repair" test is not functioning properly with the XOR
>> > engine?
>> >
>> > Or does this have something to do with how the buffers are
>> > handled? Maybe they are cached?
>> >
>> >
>> > -Marri
>>
>> No idea. But if everything works without the XOR engine and
>> gives mismatches with it, then I would think there is a
>> software or hardware error there and not in the cable or disks.
>>
>> Regards,
>> Goswin
>
> I think my RAM is fecked. Does that sound like a possible
> cause? memtest86 gives PAGES of red errors when run, but the
> POST reports nothing and the machine boots... It has 512MB.
>
> I have some more on order, speculatively.

If memtest gives errors you certainly have errors. The reverse isn't
always true.

Regards,
Goswin

Re: Why does one get mismatches?

on 30.01.2010 11:39:10 by Jon Hardcastle

--- On Fri, 29/1/10, Goswin von Brederlow wrote:

> From: Goswin von Brederlow
> Subject: Re: Why does one get mismatches?
> To: Jon@eHardcastle.com
> Cc: linux-raid@vger.kernel.org
> Date: Friday, 29 January, 2010, 23:52
>
> Jon Hardcastle writes:
>
> > --- On Thu, 28/1/10, Goswin von Brederlow wrote:
> >
> >> From: Goswin von Brederlow
> >> Subject: Re: Why does one get mismatches?
> >> To: linux-raid@vger.kernel.org
> >> Date: Thursday, 28 January, 2010, 20:24
> >>
> >> "Tirumala Reddy Marri" writes:
> >>
> >> > I just tried without the XOR engine, and the mismatch count
> >> > is zero. Interesting: is the XOR engine doing something wrong?
> >> > Then I ran a test where I raw-wrote the file to /dev/md0, then
> >> > did a raw read of the same size. In this case the XOR matched
> >> > as expected. Then I failed a drive using
> >> > "mdadm -f /dev/md0 /dev/sda" and read the same amount of data
> >> > again from /dev/md0, and the checksum matches too.
> >> > Does this mean the XOR engine is doing the right thing, but the
> >> > "check/repair" test is not functioning properly with the XOR
> >> > engine?
> >> >
> >> > Or does this have something to do with how the buffers are
> >> > handled? Maybe they are cached?
> >> >
> >> >
> >> > -Marri
> >>
> >> No idea. But if everything works without the XOR engine and
> >> gives mismatches with it, then I would think there is a
> >> software or hardware error there and not in the cable or disks.
> >>
> >> Regards,
> >> Goswin
> >
> > I think my RAM is fecked. Does that sound like a possible
> > cause? memtest86 gives PAGES of red errors when run, but the
> > POST reports nothing and the machine boots... It has 512MB.
> >
> > I have some more on order, speculatively.
>
> If memtest gives errors you certainly have errors. The reverse isn't
> always true.
>
> Regards,
> Goswin

Might have been a false alarm. Both the new chip and the old one, in an
assortment of slots, give the errors. After upgrading memtest86 from 3.3
to 3.5, the errors go away.


Re: Why does one get mismatches?

on 01.02.2010 21:48:40 by Bill Davidsen

Brett Russ wrote:
> On 01/20/2010 09:34 AM, Jon Hardcastle wrote:
>> I will gather the information you require, but to be clear, it is
>> an echo 'check' that is kicking off the ultimate mismatch, not
>> anything at boot.
>
> What do you mean by mismatches detected? How is this observed?
> -BR

I see it as an error count > 0

--
Bill Davidsen
"We can't solve today's problems by using the same thinking we
used in creating them." - Einstein


Re: Why does one get mismatches?

on 01.02.2010 22:10:15 by Bill Davidsen

Jon Hardcastle wrote:
> --- On Thu, 28/1/10, Goswin von Brederlow wrote:
>
>
>> From: Goswin von Brederlow
>> Subject: Re: Why does one get mismatches?
>> To: linux-raid@vger.kernel.org
>> Date: Thursday, 28 January, 2010, 20:24
>>
>> "Tirumala Reddy Marri" writes:
>>
>>> I just tried without the XOR engine, and the mismatch count
>>> is zero. Interesting: is the XOR engine doing something wrong?
>>> Then I ran a test where I raw-wrote the file to /dev/md0, then
>>> did a raw read of the same size. In this case the XOR matched
>>> as expected. Then I failed a drive using
>>> "mdadm -f /dev/md0 /dev/sda" and read the same amount of data
>>> again from /dev/md0, and the checksum matches too.
>>> Does this mean the XOR engine is doing the right thing, but the
>>> "check/repair" test is not functioning properly with the XOR
>>> engine?
>>>
>>> Or does this have something to do with how the buffers are
>>> handled? Maybe they are cached?
>>>
>>>
>>> -Marri
>>
>> No idea. But if everything works without the XOR engine and
>> gives mismatches with it, then I would think there is a
>> software or hardware error there and not in the cable or disks.
>>
>> Regards,
>> Goswin
>
> I think my RAM is fecked. Does that sound like a possible
> cause? memtest86 gives PAGES of red errors when run, but the
> POST reports nothing and the machine boots... It has 512MB.
>
> I have some more on order, speculatively.

I would bet that your RAM is broken. Any errors indicate bad RAM; no
errors indicate only that there is no persistent error. Scrap that RAM.
More will make the machine faster, too.

--
Bill Davidsen
"We can't solve today's problems by using the same thinking we
used in creating them." - Einstein


RE: Why does one get mismatches?

on 02.02.2010 00:14:37 by Jon Hardcastle

But what if two of the three sources say xyz? Then you can make a guess that has a higher propensity to be right. I guess this would also work for RAID 6.

Incidentally, the cause of my mismatches was almost certainly badly seated RAM, but my tests are ongoing to be sure.
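
The two-out-of-three idea can be sketched as a strict majority vote over
the mirror copies. This is only an illustration of the proposal, not
what md's repair does; majority_block is a hypothetical name. A majority
only says which value is most common, not which one is correct.

```python
from collections import Counter


def majority_block(copies):
    """Return the block held by a strict majority of copies, else None.

    'copies' is a list of bytes objects, one per mirror member, all
    covering the same block of the array.
    """
    if not copies:
        return None
    block, votes = Counter(copies).most_common(1)[0]
    # Require more than half of the members to agree; a tie is no answer.
    return block if votes > len(copies) / 2 else None


# Two members agree, one differs: the vote picks the common value.
print(majority_block([b"xyz", b"xyz", b"abc"]))
# All three differ: there is nothing to vote on.
print(majority_block([b"a", b"b", b"c"]))
```

With a two-way mirror the vote can never settle a disagreement, since
neither copy has a majority.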



-----Original Message-----
From: Neil Brown
Sent: 01 February 2010 22:37
To: Bill Davidsen
Cc: Jon@eHardcastle.com; linux-raid@vger.kernel.org
Subject: Re: Why does one get mismatches?

On Mon, 01 Feb 2010 16:18:23 -0500
Bill Davidsen wrote:

> Comment: when there is a three way RAID-1, why doesn't repair *vote* on
> the correct value instead of just making a guess?
>

Because truth is not democratic.

(and I defy you to define "correct" in any general way in this context).

NeilBrown
