read errors corrected
on 30.12.2010 04:20:48 by James
All,
I'm looking for a bit of guidance here. I have a RAID 6 set up on my
system and am seeing some errors in my logs as follows:
# cat messages | grep "read erro"
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262528 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262536 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262544 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262552 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262560 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262568 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262576 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262584 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262592 on sda4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923648 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923656 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923664 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923672 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923680 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923688 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923696 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923520 on sdc4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923528 on sdc4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923536 on sdc4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940552 on sdd4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940672 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940680 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940688 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940696 on sdb4)
I've Googled the heck out of this error message but am not finding a
clear and concise answer: is this benign? What would cause these
errors? Should I be concerned?
There is an error message (read error corrected) on each of the drives
in the array. They all seem to be functioning properly. The I/O on the
drives is pretty heavy for some parts of the day.
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5]
[raid4] [multipath]
md1 : active raid6 sdb1[1] sda1[0] sdd1[3] sdc1[2]
497792 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
md2 : active raid6 sdb2[1] sda2[0] sdd2[3] sdc2[2]
4000000 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
md3 : active raid6 sdb3[1] sda3[0] sdd3[3] sdc3[2]
25992960 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
md4 : active raid6 sdb4[1] sda4[0] sdd4[3] sdc4[2]
2899780480 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
unused devices: <none>
I have a really hard time believing there's something wrong with all
of the drives in the array, although admittedly they're the same model
from the same manufacturer.
Can someone point me in the right direction?
(a) what causes these errors precisely?
(b) is the error benign? How can I determine if it is *likely* a
hardware problem? (I imagine it's probably impossible to tell if it's
HW until it's too late)
(c) are these errors expected in a RAID array that is heavily used?
(d) what kind of errors should I see regarding "read errors" that
*would* indicate an imminent hardware failure?
Thoughts and ideas would be welcomed. I'm sure a thread where some
hefty discussion is thrown at this topic will help future Googlers
like me. :)
-james
Re: read errors corrected
on 30.12.2010 06:24:21 by Mikael Abrahamsson
On Thu, 30 Dec 2010, James wrote:
> Can someone point me in the right direction?
> (a) what causes these errors precisely?
dmesg should tell you whether these are SATA errors.
> (c) are these errors expected in a RAID array that is heavily used?
No.
> (d) what kind of errors should I see regarding "read errors" that
> *would* indicate an imminent hardware failure?
You should look into the SMART information on the drives using smartctl.
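For example, assuming smartmontools is installed and the member disks are
/dev/sda through /dev/sdd as in the mdstat output above, something like:

# for i in a b c d ; do smartctl -H /dev/sd$i ; done   # overall health verdict
# for i in a b c d ; do smartctl -A /dev/sd$i ; done   # full attribute table

and pay particular attention to attributes such as Reallocated_Sector_Ct,
Current_Pending_Sector and Offline_Uncorrectable.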
--
Mikael Abrahamsson email: swmike@swm.pp.se
Re: read errors corrected
on 30.12.2010 10:15:01 by NeilBrown
On Thu, 30 Dec 2010 03:20:48 +0000 James wrote:
> All,
>
> I'm looking for a bit of guidance here. I have a RAID 6 set up on my
> system and am seeing some errors in my logs as follows:
>
> # cat messages | grep "read erro"
> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 974262528 on sda4)
> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 974262536 on sda4)
> [...]
>
> Can someone point me in the right direction?
> (a) what causes these errors precisely?
When md/raid6 tries to read from a device and gets a read error, it tries to
read from the other devices.  When that succeeds it computes the data that
it had tried to read and then writes it back to the original drive.  If this
succeeds it assumes that the read error has been corrected by the write, and
prints the message that you see.
> (b) is the error benign? How can I determine if it is *likely* a
> hardware problem? (I imagine it's probably impossible to tell if it's
> HW until it's too late)
A few occasional messages like this are fairly benign.  They could be a sign
that the drive surface is degrading.  If you see lots of these messages, then
you should seriously consider replacing the drive.
As you are seeing these messages across all devices, it is possible that the
problem is with the SATA controller rather than the disks.  To know which, you
should check the errors that are reported in dmesg.  If you don't understand
those messages, then post them to the list - feel free to post several hundred
lines of logs - too much is much, much better than not enough.
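A quick way to pull the relevant lines out of the kernel log (only a sketch;
adjust the pattern to your device names):

# dmesg | egrep -i 'ata[0-9]|sd[a-d]' | less

Low-level "ataN: ..." link or exception errors around the same timestamps
would tend to implicate the controller or cabling rather than the platters.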
NeilBrown
Re: read errors corrected
on 30.12.2010 11:13:35 by Giovanni Tessore
On 12/30/2010 04:20 AM, James wrote:
> Can someone point me in the right direction?
> (a) what causes these errors precisely?
> (b) is the error benign? How can I determine if it is *likely* a
> hardware problem? (I imagine it's probably impossible to tell if it's
> HW until it's too late)
> (c) are these errors expected in a RAID array that is heavily used?
> (d) what kind of errors should I see regarding "read errors" that
> *would* indicate an imminent hardware failure?
(a) these errors usually come from defective disk sectors. raid
reconstructs the missing sector from parity on the other disks in the
array, then rewrites the sector on the defective disk; if the sector is
rewritten without error (perhaps because the drive remaps the sector into
its reserved area), then just the log message you saw is displayed.
(b) with raid-6 it's almost benign; to get into trouble you would have to
get a read error on the same sector on more than 2 disks; or have 2 disks
failed and out of the array and get a read error on one of the remaining
disks while reconstructing the array; or have 1 disk failed and get a read
error on the same sector on more than 1 disk while reconstructing (with
raid-5 it's rather dangerous instead, as you can have big trouble if a disk
fails and you get a read error on another disk while reconstructing; that
happened to me!)
(c) no; it's also a good rule to perform a periodic scrub of the array
(a "check" of the array), to reveal and correct defective sectors; see the
example commands below
(d) check the SMART status of the disks, in particular the reallocated
sector count; also, if the md superblock version is >= 1 there is a
persistent count of corrected read errors for each device in
/sys/block/mdXX/md/dev-XX/errors; when this counter reaches 256 the disk is
marked failed; imho when a disk is giving even a few corrected read errors
in a short interval it's better to replace it.
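As a rough illustration of both points, assuming the array and member names
from the /proc/mdstat above (md4 with sda4, etc.):

# echo check > /sys/block/md4/md/sync_action    # start a scrub of md4
# cat /sys/block/md4/md/mismatch_cnt            # parity mismatches found by the last check
# cat /sys/block/md4/md/dev-sda4/errors         # persistent corrected-read-error count for sda4

Many distributions also ship a cron job (e.g. a monthly "checkarray" script)
that triggers the scrub automatically.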
--
Yours faithfully.
Giovanni Tessore
Re: read errors corrected
on 30.12.2010 17:33:31 by James
Lots of tremendous responses. I appreciate it. I'm going to reply to
the first person who responded here, but this email should cover some
of the questions posed in further responses.
On Thu, Dec 30, 2010 at 00:24, Mikael Abrahamsson wrote:
> On Thu, 30 Dec 2010, James wrote:
>
>> Can someone point me in the right direction?
>> (a) what causes these errors precisely?
>
> dmesg should give you information if this is SATA errors.
Here are some other logs that may be relevant:
Dec 15 15:40:34 nuova kernel: sd 0:0:0:0: [sda] Unhandled error code
Dec 15 15:40:34 nuova kernel: sd 0:0:0:0: [sda] Result: hostbyte=0x00
driverbyte=0x06
Dec 15 15:40:34 nuova kernel: sd 0:0:0:0: [sda] CDB: cdb[0]=0x28: 28
00 3b e3 53 ea 00 00 48 00
Dec 15 15:40:34 nuova kernel: end_request: I/O error, dev sda, sector 1004753898
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262528 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262536 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262544 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262552 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262560 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262568 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262576 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262584 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262592 on sda4)
Unfortunately I had not caught those error messages at first
glance...I/O error? Hrmm...doesn't sound good. The issue is repeated
later on.
Dec 29 03:04:01 nuova kernel: sd 1:0:1:0: [sdd] Unhandled error code
Dec 29 03:04:01 nuova kernel: sd 0:0:1:0: [sdb] Unhandled error code
Dec 29 03:04:01 nuova kernel: sd 0:0:1:0: [sdb] Result: hostbyte=0x00
driverbyte=0x06
Dec 29 03:04:01 nuova kernel: sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28
00 1b 06 d2 ea 00 00 78 00
Dec 29 03:04:01 nuova kernel: end_request: I/O error, dev sdb, sector 453432042
Dec 29 03:04:01 nuova kernel: sd 1:0:1:0: [sdd] Result: hostbyte=0x00
driverbyte=0x06
Dec 29 03:04:01 nuova kernel: sd 1:0:1:0: [sdd] CDB: cdb[0]=0x28: 28
00 1b 06 d2 62 00 00 88 00
Dec 29 03:04:01 nuova kernel: end_request: I/O error, dev sdd, sector 453431906
Dec 29 03:04:01 nuova kernel: raid5_end_read_request: 13 callbacks suppressed
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940552 on sdd4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940672 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940680 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940688 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940696 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940704 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940712 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940720 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940728 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940736 on sdb4)
Ouch.
>> (c) are these errors expected in a RAID array that is heavily used?
>
> No.
>
>> (d) what kind of errors should I see regarding "read errors" that
>> *would* indicate an imminent hardware failure?
>
> You should look into the SMART information on the drives using smartctl.
All of the drives indicate that the SMART status is
"passed"...unfortunately this isn't very verbose. :)
Is there something specific I should be looking at in my SMART status?
I also see hundreds and hundreds of lines in my /var/log/messages that
indicate the following:
Dec 20 06:12:40 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 47 to 46
Dec 20 07:12:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
Dec 20 07:12:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 33 to 32
Dec 20 07:12:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
Dec 20 07:12:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 33 to 32
Dec 20 07:42:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 46 to 45
Dec 20 08:12:39 nuova smartd[22451]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 42 to 41
Dec 20 08:42:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 46 to 45
Dec 20 08:42:39 nuova smartd[22451]: Device: /dev/sdc [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 66 to 67
Dec 20 08:42:39 nuova smartd[22451]: Device: /dev/sdc [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 34 to 33
Dec 20 09:42:39 nuova smartd[22451]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
Dec 20 09:42:39 nuova smartd[22451]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 33 to 32
Dec 20 10:12:39 nuova smartd[22451]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 67
Dec 20 10:12:39 nuova smartd[22451]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 32 to 33
Dec 20 10:12:39 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 45 to 44
Dec 20 11:12:40 nuova smartd[22451]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 41 to 40
Dec 20 13:42:39 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 44 to 43
Dec 20 14:42:40 nuova smartd[22451]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 40 to 39
Dec 20 15:42:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 67
Dec 20 15:42:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 32 to 33
Dec 20 15:42:40 nuova smartd[22451]: Device: /dev/sdc [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 66
Dec 20 15:42:40 nuova smartd[22451]: Device: /dev/sdc [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 33 to 34
Dec 20 15:42:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 67
Dec 20 15:42:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 32 to 33
Dec 20 16:12:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
Dec 20 16:12:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 33 to 32
Dec 20 16:12:40 nuova smartd[22451]: Device: /dev/sdc [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 66 to 67
Dec 20 16:12:40 nuova smartd[22451]: Device: /dev/sdc [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 34 to 33
Dec 20 16:12:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
Dec 20 16:12:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 33 to 32
Dec 20 16:42:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 67
Dec 20 16:42:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 32 to 33
Dec 20 16:42:39 nuova smartd[22451]: Device: /dev/sdc [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 66
Dec 20 16:42:39 nuova smartd[22451]: Device: /dev/sdc [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 33 to 34
Dec 20 16:42:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 67
Dec 20 16:42:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 32 to 33
Dec 20 17:12:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
Is it normal for SMART to update the attributes as the drives are
being used? (I've never had SMART installed before, so this is all
very new to me).
-james
Re: read errors corrected
on 30.12.2010 17:35:59 by James
Sorry Neil, I meant to reply-all.
-james
On Thu, Dec 30, 2010 at 11:35, James wrote:
> Inline.
>
> On Thu, Dec 30, 2010 at 04:15, Neil Brown wrote:
>> [...]
>>
>>> (b) is the error benign? How can I determine if it is *likely* a
>>> hardware problem? (I imagine it's probably impossible to tell if it's
>>> HW until it's too late)
>>
>> A few occasional messages like this are fairly benign.  They could be a sign
>> that the drive surface is degrading.  If you see lots of these messages, then
>> you should seriously consider replacing the drive.
>
> Wow, this is hard for me to believe considering this is happening on
> all the drives. It's not impossible, however, since the drives are
> likely from the same batch.
>
>> As you are seeing these messages across all devices, it is possible that the
>> problem is with the SATA controller rather than the disks.  To know which, you
>> should check the errors that are reported in dmesg.  If you don't understand
>> those messages, then post them to the list - feel free to post several hundred
>> lines of logs - too much is much, much better than not enough.
>
> I posted a few errors in my response to the thread a bit ago -- here's
> another snippet:
>
> Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Unhandled error code
> Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Result: hostbyte=0x00
> driverbyte=0x06
> Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] CDB: cdb[0]=0x28: 28
> 00 25 a2 a0 6a 00 00 80 00
> Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdc, sector 631414890
> Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Unhandled error code
> Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Result: hostbyte=0x00
> driverbyte=0x06
> Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28
> 00 25 a2 a0 ea 00 00 38 00
> Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdb, sector 631415018
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923648 on sdb4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923656 on sdb4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923664 on sdb4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923672 on sdb4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923680 on sdb4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923688 on sdb4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923696 on sdb4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923520 on sdc4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923528 on sdc4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923536 on sdc4)
>
> Is there a good way to determine if the issue is with the motherboard
> (where the SATA controller is), or with the drives themselves?
Re: read errors corrected
on 30.12.2010 17:41:18 by James
Inline.
On Thu, Dec 30, 2010 at 05:13, Giovanni Tessore wrote:
> On 12/30/2010 04:20 AM, James wrote:
>>
>> Can someone point me in the right direction?
>> (a) what causes these errors precisely?
>> (b) is the error benign? How can I determine if it is *likely* a
>> hardware problem? (I imagine it's probably impossible to tell if it's
>> HW until it's too late)
>> (c) are these errors expected in a RAID array that is heavily used?
>> (d) what kind of errors should I see regarding "read errors" that
>> *would* indicate an imminent hardware failure?
>
> [...]
>
> (d) check the SMART status of the disks, in particular the reallocated
> sector count; also, if the md superblock version is >= 1 there is a
> persistent count of corrected read errors for each device in
> /sys/block/mdXX/md/dev-XX/errors; when this counter reaches 256 the disk
> is marked failed; imho when a disk is giving even a few corrected read
> errors in a short interval it's better to replace it.
Good call.
Here's the output of the reallocated sector count:
~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep Realloc ; done
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail
Always - 1
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail
Always - 3
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail
Always - 5
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail
Always - 1
Are these values high? Low? Acceptable?
How about values like "Raw_Read_Error_Rate" and "Seek_Error_Rate" -- I
believe I've read those are values that are normally very high...is
this true?
~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep
Raw_Read_Error_Rate ; done
1 Raw_Read_Error_Rate 0x000f 116 099 006 Pre-fail
Always - 106523474
1 Raw_Read_Error_Rate 0x000f 114 099 006 Pre-fail
Always - 77952706
1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail
Always - 137525325
1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail
Always - 179042738
..and...
~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep
Seek_Error_Rate ; done
7 Seek_Error_Rate 0x000f 072 060 030 Pre-fail
Always - 14923821
7 Seek_Error_Rate 0x000f 072 060 030 Pre-fail
Always - 15648709
7 Seek_Error_Rate 0x000f 072 060 030 Pre-fail
Always - 15733727
7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail
Always - 14279452
Thoughts appreciated.
Re: read errors corrected
on 30.12.2010 17:44:33 by Roman Mamedov
On Thu, 30 Dec 2010 11:33:31 -0500
James wrote:
> Dec 20 17:12:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
> Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
>
> Is it normal for SMART to update the attributes as the drives are
> being used? (I've never had SMART installed before, so this is all
> very new to me).
If your drives run at 68 degrees Celsius, you should emergency-cut the power
ASAP and perhaps reach for the nearest fire extinguisher.
--
With respect,
Roman
Re: read errors corrected
on 30.12.2010 17:51:51 by James
On Thu, Dec 30, 2010 at 11:44, Roman Mamedov wrote:
> On Thu, 30 Dec 2010 11:33:31 -0500
> James wrote:
>
>> Dec 20 17:12:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
>> Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
>>
>> Is it normal for SMART to update the attributes as the drives are
>> being used? (I've never had SMART installed before, so this is all
>> very new to me).
>
> If your drives run at 68 degrees Celsius, you should emergency-cut the power
> ASAP and perhaps reach for the nearest fire extinguisher.
Agreed. ;) That's why I posted those messages -- I'm unsure why it
would change those values.
Here's what smartctl shows for all of the drives:
~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep Temperature ;
done
190 Airflow_Temperature_Cel 0x0022 069 059 045 Old_age
Always - 31 (Lifetime Min/Max 23/37)
194 Temperature_Celsius 0x0022 031 041 000 Old_age
Always - 31 (0 23 0 0)
190 Airflow_Temperature_Cel 0x0022 068 058 045 Old_age
Always - 32 (Lifetime Min/Max 22/38)
194 Temperature_Celsius 0x0022 032 042 000 Old_age
Always - 32 (0 22 0 0)
190 Airflow_Temperature_Cel 0x0022 068 057 045 Old_age
Always - 32 (Lifetime Min/Max 22/38)
194 Temperature_Celsius 0x0022 032 043 000 Old_age
Always - 32 (0 22 0 0)
190 Airflow_Temperature_Cel 0x0022 069 059 045 Old_age
Always - 31 (Lifetime Min/Max 23/37)
194 Temperature_Celsius 0x0022 031 041 000 Old_age
Always - 31 (0 23 0 0)
Those values seem appropriate, particularly since the "max" is 37 (as
defined by the drive manufacturer?).
Re: read errors corrected
on 30.12.2010 18:59:15 by Ryan Wagoner
2010/12/30 James :
>
> Here's what smartctl shows for all of the drives:
>
> [...]
> Those values seem appropriate, particularly since the "max" is 37 (as
> defined by the drive manufacturer?).
>
Not sure why the log is showing the weird C temp. The output from
smartctl looks correct. The max is not defined by the manufacturer,
but the maximum temp the drive has reached.
Ryan
Re: read errors corrected
on 30.12.2010 19:03:26 by James
Fair enough. :) Thanks for the response.
So the big question (to all) becomes this: is this a hard drive issue,
or a motherboard / SATA controller issue? Either one would suck, but
hard drives are obviously easier to swap than a motherboard.
Thoughts on how to go about diagnosing the issue further to determine
what is going on would be greatly appreciated. Aside from replacing
all the drives and hoping for the best, I don't see an easy way to
really figure out what is causing the I/O errors that are resulting in
bad sectors.
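One comparatively cheap check, assuming the drives support self-tests
(smartctl -c will say): run a long self-test on each drive and read back the
result, e.g.

# smartctl -t long /dev/sda      # runs inside the drive, takes a few hours
# smartctl -l selftest /dev/sda  # show the result once it has finished

The self-test runs entirely within the drive, so errors logged there implicate
the disk itself, while drives that keep passing cleanly yet still throw
bus-level errors in dmesg point more towards the controller, cabling or power.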
-james
On Thu, Dec 30, 2010 at 12:59, Ryan Wagoner wrote:
> Not sure why the log is showing the weird C temp. The output from
> smartctl looks correct. The max is not defined by the manufacturer,
> but the maximum temp the drive has reached.
Re: read errors corrected
on 30.12.2010 21:19:19 by Richard Scobie
Ryan wrote:
> Not sure why the log is showing the weird C temp.
See the SMART attribute 190 definition here:
http://en.wikipedia.org/wiki/S.M.A.R.T.
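In other words, smartd is logging the normalised value of attribute 190,
which on these drives appears to be 100 minus the airflow temperature in
degrees Celsius; "changed from 67 to 68" is therefore the temperature
dropping from 33 C to 32 C, matching the attribute 194 lines next to it.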
Regards,
Richard
Re: read errors corrected
on 31.12.2010 00:12:43 by NeilBrown
On Thu, 30 Dec 2010 11:35:59 -0500 James wrote:
> Sorry Neil, I meant to reply-all.
>
> -james
>
> On Thu, Dec 30, 2010 at 11:35, James wrote:
> > [...]
> >
> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Unhandled error code
> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Result: hostbyte=0x00
> > driverbyte=0x06
> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] CDB: cdb[0]=0x28: 28
> > 00 25 a2 a0 6a 00 00 80 00
> > Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdc, sector 631414890
> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Unhandled error code
> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Result: hostbyte=0x00
> > driverbyte=0x06
"Unhandled error code" sounds like it could be a driver problem...
Try googling that error message...
http://us.generation-nt.com/answer/2-6-33-libata-issues-via-sata-pata-controller-help-197123882.html
"Also, please try the latest 2.6.34-rc kernel, as that has several fixes
for both pata_via and sata_via which did not make 2.6.33."
What kernel are you running???
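A quick way to confirm the kernel and which libata driver the controller is
actually using (assuming pciutils is available):

# uname -r
# lspci -k | grep -i -A 3 sata

The "Kernel driver in use:" line shows whether it is sata_via, ahci, sata_nv
or something else, which narrows down which driver fixes are relevant.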
NeilBrown
Re: read errors corrected
on 31.12.2010 02:48:07 by James
Neil,
I'm running 2.6.35.
Although an expensive route, the only thing I can think to do to
determine 100% whether the issue is software or hardware (and, if
hardware, whether SATA controller or the drives) is to swap the drives
out.
Ouch!
Any other ideas, however, would be appreciated before I drop a few
hundred bucks. :)
-james
On Thu, Dec 30, 2010 at 23:12, Neil Brown wrote:
> [...]
>
> "Unhandled error code" sounds like it could be a driver problem...
>
> Try googling that error message...
>
> http://us.generation-nt.com/answer/2-6-33-libata-issues-via-sata-pata-controller-help-197123882.html
>
> "Also, please try the latest 2.6.34-rc kernel, as that has several fixes
> for both pata_via and sata_via which did not make 2.6.33."
>
> What kernel are you running???
>
> NeilBrown
RE: read errors corrected
on 31.12.2010 02:56:58 by Guy Watkins
} -----Original Message-----
} From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
} owner@vger.kernel.org] On Behalf Of James
} Sent: Thursday, December 30, 2010 8:48 PM
} To: Neil Brown
} Cc: linux-raid@vger.kernel.org
} Subject: Re: read errors corrected
}
} Neil,
}
} I'm running 2.6.35.
}
} Although an expensive route, the only thing I can think to do to
} determine 100% whether the issue is software or hardware (and, if
} hardware, whether SATA controller or the drives) is to swap the drives
} out.
}
} Ouch!
}
} Any other ideas, however, would be appreciated before I drop a few
} hundred bucks. :)
Just swap out 1 for now? :)
I believe your drives are fine because your smart stats don't reflect the
number of errors you see in the logs.
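If you do end up swapping a disk, the md side is straightforward (sketch
only - device and partition names assumed to match the mdstat above, and the
replacement disk needs the same partition layout created first):

# mdadm /dev/md4 --fail /dev/sda4 --remove /dev/sda4
  (repeat for md1, md2 and md3, power down and swap the disk, repartition)
# mdadm /dev/md4 --add /dev/sda4

then watch the rebuild progress in /proc/mdstat.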
}=20
} -james
}=20
} On Thu, Dec 30, 2010 at 23:12, Neil Brown wrote:
} > On Thu, 30 Dec 2010 11:35:59 -0500 James wrote:
} >
} >> Sorry Neil, I meant to reply-all.
} >>
} >> -james
} >>
} >> On Thu, Dec 30, 2010 at 11:35, James wrote:
} >> > Inline.
} >> >
} >> > On Thu, Dec 30, 2010 at 04:15, Neil Brown wrote:
} >> >> On Thu, 30 Dec 2010 03:20:48 +0000 James wrote:
} >> >>
} >> >>> All,
} >> >>>
} >> >>> I'm looking for a bit of guidance here. I have a RAID 6 set up=
on
} my
} >> >>> system and am seeing some errors in my logs as follows:
} >> >>>
} >> >>> # cat messages | grep "read erro"
} >> >>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error correcte=
d (8
} >> >>> sectors at 974262528 on sda4)
} >> >>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error correcte=
d (8
} >> >>> sectors at 974262536 on sda4)
} >> >> .....
} >> >>
} >> >>>
} >> >>> I've Google'd the heck out of this error message but am not se=
eing
} a
} >> >>> clear and concise message: is this benign? What would cause th=
ese
} >> >>> errors? Should I be concerned?
} >> >>>
} >> >>> There is an error message (read error corrected) on each of th=
e
} drives
} >> >>> in the array. They all seem to be functioning properly. The I/=
O on
} the
} >> >>> drives is pretty heavy for some parts of the day.
} >> >>>
} >> >>> # cat /proc/mdstat
} >> >>> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [rai=
d5]
} >> >>> [raid4] [multipath]
} >> >>> md1 : active raid6 sdb1[1] sda1[0] sdd1[3] sdc1[2]
} >> >>> =A0 =A0 =A0 497792 blocks level 6, 64k chunk, algorithm 2 [4/4=
] [UUUU]
} >> >>>
} >> >>> md2 : active raid6 sdb2[1] sda2[0] sdd2[3] sdc2[2]
} >> >>> =A0 =A0 =A0 4000000 blocks level 6, 64k chunk, algorithm 2 [4/=
4] [UUUU]
} >> >>>
} >> >>> md3 : active raid6 sdb3[1] sda3[0] sdd3[3] sdc3[2]
} >> >>> =A0 =A0 =A0 25992960 blocks level 6, 64k chunk, algorithm 2 [4=
/4] [UUUU]
} >> >>>
} >> >>> md4 : active raid6 sdb4[1] sda4[0] sdd4[3] sdc4[2]
} >> >>> =A0 =A0 =A0 2899780480 blocks level 6, 64k chunk, algorithm 2 =
[4/4]
} [UUUU]
} >> >>>
} >> >>> unused devices:
} >> >>>
} >> >>> I have a really hard time believing there's something wrong wi=
th
} all
} >> >>> of the drives in the array, although admittedly they're the sa=
me
} model
} >> >>> from the same manufacturer.
} >> >>>
} >> >>> Can someone point me in the right direction?
} >> >>> (a) what causes these errors precisely?
} >> >>
} >> >> When md/raid6 tries to read from a device and gets a read error=
, it
} try to
} >> >> read from other other devices. =A0When that succeeds it compute=
s the
} data that
} >> >> it had tried to read and then write it back to the original dri=
ve.
} =A0If this
} >> >> succeeded is assumes that the read error has been correct by a
} write, and
} >> >> prints the message that you see.
} >> >>
} >> >>
} >> >>> (b) is the error benign? How can I determine if it is *likely*=
a
} >> >>> hardware problem? (I imagine it's probably impossible to tell =
if
} it's
} >> >>> HW until it's too late)
} >> >>
} >> >> A few occasional messages like this are fairly benign. =A0The c=
ould be
} a sign
} >> >> that the drive surface is degrading. =A0If you see lots of thes=
e
} messages, then
} >> >> you should seriously consider replacing the drive.
} >> >
} >> > Wow, this is hard for me to believe considering this is happenin=
g on
} >> > all the drives. It's not impossible, however, since the drives a=
re
} >> > likely from the same batch.
} >> >
} >> >> As you are seeing these messages across all devices, it is possible
} >> >> that the problem is with the SATA controller rather than the disks.
} >> >> To know which, you should check the errors that are reported in
} >> >> dmesg. If you don't understand these messages, then post them to the
} >> >> list - feel free to post several hundred lines of logs - too much is
} >> >> much, much better than not enough.
} >> >
} >> > I posted a few errors in my response to the thread a bit ago --
} >> > here's another snippet:
} >> >
} >> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Unhandled error code
} >> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Result: hostbyte=0x00
} >> > driverbyte=0x06
} >> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] CDB: cdb[0]=0x28: 28
} >> > 00 25 a2 a0 6a 00 00 80 00
} >> > Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdc, sector 631414890
} >> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Unhandled error code
} >> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Result: hostbyte=0x00
} >> > driverbyte=0x06
} >
} > "Unhandled error code" sounds like it could be a driver problem...
} >
} > Try googling that error message...
} >
} > http://us.generation-nt.com/answer/2-6-33-libata-issues-via-sata-pata-controller-help-197123882.html
} >
} >
} > "Also, please try the latest 2.6.34-rc kernel, as that has several =
fixes
} > for both pata_via and sata_via which did not make 2.6.33."
} >
} > What kernel are you running???
} >
} > NeilBrown
} >
} >
} >
} >
} >> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28
} >> > 00 25 a2 a0 ea 00 00 38 00
} >> > Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdb, sector 631415018
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923648 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923656 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923664 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923672 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923680 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923688 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923696 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923520 on sdc4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923528 on sdc4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923536 on sdc4)
} >> >
} >> > Is there a good way to determine if the issue is with the motherboard
} >> > (where the SATA controller is), or with the drives themselves?
} >> >
} >> >> NeilBrown
} >> >>
} >> >>
} >> >>
} >> >>> (c) are these errors expected in a RAID array that is heavily used?
} >> >>> (d) what kind of errors should I see regarding "read errors" that
} >> >>> *would* indicate an imminent hardware failure?
} >> >>>
} >> >>> Thoughts and ideas would be welcomed. I'm sure a thread where some
} >> >>> hefty discussion is thrown at this topic will help future Googlers
} >> >>> like me. :)
} >> >>>
} >> >>> -james
} >> >>
} >> >>
} >> >
} >
} >
Re: read errors corrected
on 31.12.2010 03:08:07 by NeilBrown
On Fri, 31 Dec 2010 01:48:07 +0000 James wrote:
> Neil,
>
> I'm running 2.6.35.
>
> Although an expensive route, the only thing I can think to do to
> determine 100% whether the issue is software or hardware (and, if
> hardware, whether SATA controller or the drives) is to swap the drives
> out.
>
> Ouch!
>
> Any other ideas, however, would be appreciated before I drop a few
> hundred bucks. :)
Buy a PCIe SATA controller, plug it in and move some/all drives over to that?
Should be a lot less than $100. Make sure it is a different chipset to what
you have on your motherboard.
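To see what you would be moving away from, something like this (drive letters as in your /proc/mdstat) shows the on-board controller and the path each disk currently sits behind:

# lspci -k | grep -iA2 -E 'sata|ide'
# for i in a b c d ; do echo -n "sd$i: " ; readlink -f /sys/block/sd$i/device ; done

If all four resolved paths point at the same PCI device, moving even two of the disks to the add-in card should tell you whether the errors follow the controller or the drives.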
NeilBrown
Re: read errors corrected
on 15.01.2011 13:00:06 by Giovanni Tessore
On 12/30/2010 05:41 PM, James wrote:
> Inline.
>
> On Thu, Dec 30, 2010 at 05:13, Giovanni Tessore wrote:
>> On 12/30/2010 04:20 AM, James wrote:
>>> Can someone point me in the right direction?
>>> (a) what causes these errors precisely?
>>> (b) is the error benign? How can I determine if it is *likely* a
>>> hardware problem? (I imagine it's probably impossible to tell if it's
>>> HW until it's too late)
>>> (c) are these errors expected in a RAID array that is heavily used?
>>> (d) what kind of errors should I see regarding "read errors" that
>>> *would* indicate an imminent hardware failure?
>> (a) these errors usually come from defective disk sectors. raid
>> reconstructs the missing sector from parity on the other disks in the
>> array, then rewrites the sector on the defective disk; if the sector is
>> rewritten without error (maybe the hd remaps the sector into its reserved
>> area), then just the log message is displayed.
>>
>> (b) with raid-6 it's almost benign; to get into trouble you would need a
>> read error on the same sector on >2 disks; or have 2 disks failed and out
>> of the array and get a read error on one of the other disks while
>> reconstructing the array; or have 1 disk failed and get a read error on
>> the same sector on >1 disk while reconstructing (with raid-5 it's quite
>> dangerous instead, as you can have big trouble if a disk fails and you
>> get a read error on another disk while reconstructing; that happened to
>> me!)
>>
>> (c) no; it's also a good rule to perform a periodic scrub of the array
>> (check of the array), to reveal and correct defective sectors
>>
>> (d) check the smart status of the disks, for the "reallocated sector
>> count"; also if the md superblock is >= 1 there is a persistent count of
>> corrected read errors for each device in
>> /sys/block/mdXX/md/dev-XX/errors; when this counter reaches 256 the disk
>> is marked failed; imho when a disk is giving even a few corrected read
>> errors in a short interval it's better to replace it.
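(Concretely, for points (c) and (d): assuming md4 and members sda4..sdd4 as in this thread, a manual scrub and the per-device corrected-error counters look roughly like this:

# echo check > /sys/block/md4/md/sync_action
# cat /sys/block/md4/md/mismatch_cnt
# for i in a b c d ; do echo -n "sd${i}4: " ; cat /sys/block/md4/md/dev-sd${i}4/errors ; done

Several distros already schedule such a 'check' pass from cron.)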
> Good call.
>
> Here's the output of the reallocated sector count:
>
> ~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep Realloc ; done
> 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 1
> 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 3
> 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 5
> 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 1
>
> Are these values high? Low? Acceptable?
>
> How about values like "Raw_Read_Error_Rate" and "Seek_Error_Rate" -- I
> believe I've read those are values that are normally very high...is
> this true?
>
> ~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep Raw_Read_Error_Rate ; done
> 1 Raw_Read_Error_Rate 0x000f 116 099 006 Pre-fail Always - 106523474
> 1 Raw_Read_Error_Rate 0x000f 114 099 006 Pre-fail Always - 77952706
> 1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 137525325
> 1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 179042738
>
> ...and...
>
> ~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep Seek_Error_Rate ; done
> 7 Seek_Error_Rate 0x000f 072 060 030 Pre-fail Always - 14923821
> 7 Seek_Error_Rate 0x000f 072 060 030 Pre-fail Always - 15648709
> 7 Seek_Error_Rate 0x000f 072 060 030 Pre-fail Always - 15733727
> 7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail Always - 14279452
>
> Thoughts appreciated.
>
As far as I know, Reallocated_Sector_Ct is the most meaningful SMART
parameter related to disk sector health.
Also check for Current_Pending_Sector (sectors that gave a read error and
have not been reallocated yet).
The values of your disks seem quite safe at the moment.
Be proactive if the value grows in a short time.
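A quick way to watch both attributes at once, in the same style as the loop above (drive letters assumed), is:

# for i in a b c d ; do smartctl -A /dev/sd$i | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector' ; done

Running that periodically and comparing against the previous output makes a growing count hard to miss.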
I had the same problem this week: one of my disks gave >800 reallocated
read errors.
The disk was still marked good and alive in the array, but I replaced it
immediately.
Regards.
--
Cordiali saluti.
Yours faithfully.
Giovanni Tessore
Re: read errors corrected
on 16.01.2011 09:33:25 by Jaap Crezee
On 01/15/11 13:00, Giovanni Tessore wrote:
> On 12/30/2010 05:41 PM, James wrote:
> As far as I know, Reallocated_Sector_Ct is the most meaningful SMART
> parameter related to disk sector health.
> Also check for Current_Pending_Sector (sectors that gave a read error and
> have not been reallocated yet).
> The values of your disks seem quite safe at the moment.
> Be proactive if the value grows in a short time.
I wouldn't be too happy with more than 0 reallocated and/or pending sectors:
I replace these disks at once. I have never had any warranty problems with
drives that had more than 0 reallocated sectors. It seems manufacturers use
the same thresholds....
regards,
Jaap Crezee