RAID1, changed disk, 2nd has errors ...

on 26.08.2011 13:46:41 by lists

Please help:

Today I removed a defective hdd out of a RAID1-array and swapped in a
new hdd instead.

Three arrays, to be precise: md[012]

0 and 1 synced fine, in the process of syncing md2 the old sda threw
errors (in sda4):

md/raid1:md2: sda: unrecoverable I/O read error for block 643686144
md: md2: recovery done.

[...]

md/raid1:md2: sda: unrecoverable I/O read error for block 643686272

----

Did the system stop syncing, or does "recovery done" indicate that
md2 was fully recovered BEFORE the system threw sda4 out of the array?

I hope it is the latter!

See:

# mdadm -D /dev/md2
/dev/md2:
Version : 0.90
Creation Time : Thu Feb 11 19:40:11 2010
Raid Level : raid1
Array Size : 962454080 (917.87 GiB 985.55 GB)
Used Dev Size : 962454080 (917.87 GiB 985.55 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 2
Persistence : Superblock is persistent

Update Time : Fri Aug 26 13:40:55 2011
State : clean, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1

UUID : 0ee7bbc7:fc6b0172:d195d856:5f94e963
Events : 0.1833443

Number   Major   Minor   RaidDevice   State
   0       8       4         0        active sync   /dev/sda4
   1       0       0         1        removed

   2       8      20         -        spare   /dev/sdb4

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb3[1] sda3[0]
13679232 blocks [2/2] [UU]

md2 : active raid1 sdb4[2](S) sda4[0]
962454080 blocks [2/1] [U_]

md0 : active raid1 sdb1[1] sda1[0]
128384 blocks [2/2] [UU]

unused devices: <none>

----

The system seems to work OK; md2, which is a PV in an LVM volume group, is
there, etc.

I just wonder whether I should somehow re-add sda4 or not touch a thing
until I have a new hdd at hand?

Can/should I somehow test the integrity of md2?

Please help me relax about this ...

btw:

Linux version 2.6.36-gentoo-r5
mdadm-3.1.4

Thanks in advance, Stefan!

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: RAID1, changed disk, 2nd has errors ...

on 26.08.2011 14:01:51 by mathias.buren

On 26 August 2011 12:46, Stefan G. Weichinger wrote:
> [...]

Hm,

Could you perhaps post the output of "smartctl -a /dev/sda" (and sdb
for completeness sake) here? You can find smartctl in the
smartmontools package.

/Mathias

Re: RAID1, changed disk, 2nd has errors ...

on 26.08.2011 14:19:25 by lists

On 26.08.2011 14:01, Mathias Burén wrote:

> Could you perhaps post the output of "smartctl -a /dev/sda" (and sdb
> for completeness sake) here? You can find smartctl in the
> smartmontools package.

Sure. sdb is the new hdd from today (as mentioned).

->

# smartctl -a /dev/sda
smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.12 family
Device Model: ST31000528AS
Serial Number: 9VP3BSEV
Firmware Version: CC38
User Capacity: 1.000.204.886.016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Fri Aug 26 14:18:06 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 600) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 178) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   101   099   006    Pre-fail  Always       -       77880938
  3 Spin_Up_Time            0x0003   097   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       50
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       110698342
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13359
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       25
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   082   082   000    Old_age   Always       -       18
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       2
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   060   045    Old_age   Always       -       35 (Min/Max 32/36)
194 Temperature_Celsius     0x0022   035   040   000    Old_age   Always       -       35 (0 15 0 0)
195 Hardware_ECC_Recovered  0x001a   046   024   000    Old_age   Always       -       77880938
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       16896401355883
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2526036334
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2586691393

SMART Error Log Version: 1
ATA Error Count: 18 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 18 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
25 00 08 ff ff ff ef 00  01:28:56.212      READ DMA EXT
27 00 00 00 00 00 e0 00  01:28:56.211      READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00  01:28:56.191      IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00  01:28:56.175      SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00  01:28:56.151      READ NATIVE MAX ADDRESS EXT

Error 17 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
25 00 08 ff ff ff ef 00  01:28:53.001      READ DMA EXT
27 00 00 00 00 00 e0 00  01:28:53.000      READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00  01:28:52.980      IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00  01:28:52.961      SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00  01:28:52.940      READ NATIVE MAX ADDRESS EXT

Error 16 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
25 00 08 ff ff ff ef 00  01:28:49.790      READ DMA EXT
27 00 00 00 00 00 e0 00  01:28:49.789      READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00  01:28:49.749      IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00  01:28:49.739      SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00  01:28:49.719      READ NATIVE MAX ADDRESS EXT

Error 15 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
25 00 08 ff ff ff ef 00  01:28:46.580      READ DMA EXT
27 00 00 00 00 00 e0 00  01:28:46.579      READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00  01:28:46.559      IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00  01:28:46.542      SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00  01:28:46.519      READ NATIVE MAX ADDRESS EXT

Error 14 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
25 00 08 ff ff ff ef 00  01:28:43.379      READ DMA EXT
27 00 00 00 00 00 e0 00  01:28:43.378      READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00  01:28:43.358      IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00  01:28:43.345      SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00  01:28:43.318      READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description   Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Completed without error  00%        13357            -
# 2  Short offline      Completed without error  00%        13333            -
# 3  Short offline      Completed without error  00%        13310            -
# 4  Short offline      Completed without error  00%        13286            -
# 5  Short offline      Completed without error  00%        13261            -
# 6  Short offline      Completed without error  00%        13237            -
# 7  Short offline      Completed without error  00%        13213            -
# 8  Extended offline   Completed without error  00%        13207            -
# 9  Short offline      Completed without error  00%        13189            -
#10  Short offline      Completed without error  00%        13164            -
#11  Short offline      Completed without error  00%        13162            -
#12  Short offline      Completed without error  00%        13138            -
#13  Short offline      Completed without error  00%        13114            -
#14  Short offline      Completed without error  00%        13090            -
#15  Short offline      Completed without error  00%        13066            -
#16  Extended offline   Completed without error  00%        13060            -
#17  Short offline      Completed without error  00%        13042            -
#18  Short offline      Completed without error  00%        13018            -
#19  Short offline      Completed without error  00%        12994            -
#20  Short offline      Completed without error  00%        12970            -
#21  Short offline      Completed without error  00%        12946            -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

# smartctl -a /dev/sdb
smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model: ST1000NM0011
Serial Number: Z1N04CMC
Firmware Version: SN02
User Capacity: 1.000.204.886.016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Fri Aug 26 14:18:35 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 114) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 155) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x10bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   066   066   044    Pre-fail  Always       -       5184768
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       8
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       88000
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       3
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       8
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   064   049   045    Old_age   Always       -       36 (Min/Max 30/37)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       7
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       8
194 Temperature_Celsius     0x0022   036   051   000    Old_age   Always       -       36 (0 25 0 0)
195 Hardware_ECC_Recovered  0x001a   102   100   000    Old_age   Always       -       5184768
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description   Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Completed without error  00%        1                -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
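
For readers skimming the two reports, here is a minimal sketch of picking out the counters that are commonly associated with unreadable sectors. The choice of attribute IDs (5, 187, 197, 198) and the helper name `nonzero_watched` are assumptions made for illustration, not something stated in the thread. The sample lines are copied from the sda attribute table.

```python
# IDs assumed (for illustration) to be the ones worth watching for bad sectors.
WATCHED_IDS = {5, 187, 197, 198}


def nonzero_watched(attribute_lines):
    """Return {attribute_name: raw_value} for watched attributes with raw != 0.

    Expects lines in the smartctl -A column order:
    ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    hits = {}
    for line in attribute_lines:
        fields = line.split()
        # Skip headers and anything that is not a full attribute row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) in WATCHED_IDS:
            raw = int(fields[9])
            if raw != 0:
                hits[fields[1]] = raw
    return hits


sda_lines = [
    "5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0",
    "187 Reported_Uncorrect 0x0032 082 082 000 Old_age Always - 18",
    "197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 2",
    "198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 2",
]
print(nonzero_watched(sda_lines))
```

On the sda lines this reports 18 uncorrectable errors and 2 pending/offline-uncorrectable sectors, while the same rows for sdb are all zero.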


Re: RAID1, changed disk, 2nd has errors ...

on 26.08.2011 14:44:53 by lists

Some additional log output that makes me hope the array is valid:

# grep md /var/log/messages

Aug 26 11:22:02 horde mdadm[3739]: RebuildFinished event detected on md device /dev/md/1
Aug 26 11:22:02 horde mdadm[3739]: SpareActive event detected on md device /dev/md/1, component device /dev/sdb3
Aug 26 11:22:02 horde mdadm[3739]: RebuildStarted event detected on md device /dev/md/2
Aug 26 12:12:02 horde mdadm[3739]: Rebuild20 event detected on md device /dev/md/2
Aug 26 12:42:53 horde kernel: ata1.00: cmd 25/00:00:b5:73:12/00:04:28:00:00/e0 tag 0 dma 524288 in
Aug 26 12:42:59 horde kernel: ata1.00: cmd 25/00:00:b5:73:12/00:04:28:00:00/e0 tag 0 dma 524288 in
Aug 26 12:43:05 horde kernel: ata1.00: cmd 25/00:00:b5:73:12/00:04:28:00:00/e0 tag 0 dma 524288 in
Aug 26 12:43:09 horde kernel: ata1.00: cmd 25/00:00:b5:73:12/00:04:28:00:00/e0 tag 0 dma 524288 in
Aug 26 12:43:12 horde kernel: ata1.00: cmd 25/00:00:b5:73:12/00:04:28:00:00/e0 tag 0 dma 524288 in
Aug 26 12:43:15 horde kernel: ata1.00: cmd 25/00:00:b5:73:12/00:04:28:00:00/e0 tag 0 dma 524288 in
Aug 26 12:43:19 horde kernel: ata1.00: cmd 25/00:08:fd:73:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:22 horde kernel: ata1.00: cmd 25/00:08:fd:73:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:25 horde kernel: ata1.00: cmd 25/00:08:fd:73:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:28 horde kernel: ata1.00: cmd 25/00:08:fd:73:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:31 horde kernel: ata1.00: cmd 25/00:08:fd:73:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:34 horde kernel: ata1.00: cmd 25/00:08:fd:73:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:34 horde kernel: md/raid1:md2: sda: unrecoverable I/O read error for block 643686144
Aug 26 12:43:34 horde kernel: md: md2: recovery done.
Aug 26 12:43:37 horde kernel: ata1.00: cmd 25/00:08:35:74:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:41 horde kernel: ata1.00: cmd 25/00:08:35:74:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:44 horde kernel: ata1.00: cmd 25/00:08:35:74:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:47 horde kernel: ata1.00: cmd 25/00:08:35:74:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:50 horde kernel: ata1.00: cmd 25/00:08:35:74:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:53 horde kernel: ata1.00: cmd 25/00:08:35:74:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:54 horde kernel: md/raid1:md2: sda: unrecoverable I/O read error for block 643686272
Aug 26 12:43:54 horde mdadm[3739]: RebuildFinished event detected on md device /dev/md/2
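
The `ata1.00: cmd ...` lines in this log encode the failing read requests. Below is a minimal sketch of decoding the starting LBA and sector count from such a string, assuming the usual libata error-report layout `command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device`. The function name `decode_taskfile` is made up for this example.

```python
def decode_taskfile(taskfile: str) -> tuple[int, int]:
    """Return (start_lba, sector_count) for a 48-bit libata taskfile string.

    Assumed layout:
    cmd/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    groups = taskfile.strip().split("/")
    if len(groups) != 4:
        raise ValueError(f"unexpected taskfile format: {taskfile!r}")
    low = [int(byte, 16) for byte in groups[1].split(":")]
    high = [int(byte, 16) for byte in groups[2].split(":")]
    if len(low) != 5 or len(high) != 5:
        raise ValueError(f"unexpected taskfile format: {taskfile!r}")

    _feature, nsect, lbal, lbam, lbah = low
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = high

    # The "hob" (high order byte) registers carry bits 24..47 of the LBA.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    sectors = (hob_nsect << 8) | nsect
    if sectors == 0:
        sectors = 65536  # in 48-bit commands a count of 0 means 65536 sectors
    return lba, sectors


for cmd in (
    "25/00:00:b5:73:12/00:04:28:00:00/e0",
    "25/00:08:fd:73:12/00:00:28:00:00/e0",
    "25/00:08:35:74:12/00:00:28:00:00/e0",
):
    lba, count = decode_taskfile(cmd)
    print(cmd, "-> LBA", lba, "sectors", count, "bytes", count * 512)
```

The byte counts come out as 524288 and 4096, matching the `dma` sizes in the log. The third command starts 128 sectors after the first, the same distance as between the two md "block" numbers (643686272 - 643686144 = 128), which is consistent with md reporting 512-byte sector offsets within sda4.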

Re: RAID1, changed disk, 2nd has errors ...

on 26.08.2011 14:56:53 by Robin Hill

On Fri Aug 26, 2011 at 01:46:41PM +0200, Stefan G. Weichinger wrote:

>
> Please help:
>
> Today I removed a defective hdd out of a RAID1-array and swapped in a
> new hdd instead.
>
> 3 arrays, to be true, md[012]
>
> 0 and 1 synced fine, in the process of syncing md2 the old sda threw
> errors (in sda4):
>
> md/raid1:md2: sda: unrecoverable I/O read error for block 643686144
> md: md2: recovery done.
>
> [...]
>
> md/raid1:md2: sda: unrecoverable I/O read error for block 643686272
>
> ----
>
> Did the system stop syncing or is "recovery done" the indication that
> md2 was fully recovered BEFORE the system threw sda4 out of the array??
>
> I hope for the second!
>
I think it just indicates that it stopped attempting recovery at this
point.

> # cat /proc/mdstat
> Personalities : [raid1]
> md1 : active raid1 sdb3[1] sda3[0]
> 13679232 blocks [2/2] [UU]
>
> md2 : active raid1 sdb4[2](S) sda4[0]
> 962454080 blocks [2/1] [U_]
>
> md0 : active raid1 sdb1[1] sda1[0]
> 128384 blocks [2/2] [UU]
>
This would indicate that sdb has been reset as a spare, suggesting that
the resync failed so it has left sda alone in the array (as failing it
would destroy the array).

I'd suggest stopping the array and using ddrescue to clone sda4
to sdb4. That'll copy everything possible, flagging up any read issues.
You'll then need to run a "fsck -f" on sdb4 to clear up any filesystem
damage. You may still be left with damaged/missing files, depending on
where any read errors occurred. How critical this is will depend on what
the filesystem is used for (and whether you have any backup).

If that all works okay, then get sda replaced and give it a thorough
badblocks and SMART test.

I'd also advise setting up regular array checks (echo check >
/sys/block/mdX/md/sync_action) to make sure the disks are checked and
any unreadable blocks repaired/mapped out _before_ they're needed for
recovery.
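
The `echo check > /sys/block/mdX/md/sync_action` step could be scripted roughly as follows, for example from a cron job. This is a sketch only: the function names and the `sysfs_root` parameter (there so the code can be tried against a scratch directory) are assumptions for illustration, and it relies on the md sysfs files `sync_action` and `mismatch_cnt`.

```python
from pathlib import Path


def start_check(md_name: str, sysfs_root: str = "/sys") -> None:
    """Request a read-only consistency check of an md array.

    Equivalent to: echo check > /sys/block/<md_name>/md/sync_action
    Refuses to interfere if a resync, recovery or check is already running.
    """
    action = Path(sysfs_root) / "block" / md_name / "md" / "sync_action"
    current = action.read_text().strip()
    if current != "idle":
        raise RuntimeError(f"{md_name} is busy ({current}), not starting a check")
    action.write_text("check\n")


def mismatch_count(md_name: str, sysfs_root: str = "/sys") -> int:
    """Read the mismatch counter that md updates during a check."""
    counter = Path(sysfs_root) / "block" / md_name / "md" / "mismatch_cnt"
    return int(counter.read_text().strip())
```

A call such as `start_check("md2")` needs root privileges on a real system. Progress then shows up in /proc/mdstat, and `mismatch_count("md2")` can be read once the check has finished.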

Cheers,
Robin
--
___
( ' } | Robin Hill |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |


Re: RAID1, changed disk, 2nd has errors ...

on 26.08.2011 15:51:17 by lists

On 26.08.2011 14:56, Robin Hill wrote:

> This would indicate that sdb has been reset as a spare, suggesting
> that the resync failed so it has left sda alone in the array (as
> failing it would destroy the array).

oh my ...

So the array is somehow split-brain now? Some sectors good here, some
there??

Why is sda4 now flagged as (S)? Is it a spare or not?
I don't fully understand the current state of the array ...

> I'd suggest stopping the array and using ddrescue to clone sda4 to
> sdb4. That'll copy everything possible, flagging up any read
> issues. You'll then need to run a "fsck -f" on sdb4 to clear up
> any filesystem damage. You may still be left with damaged/missing
> files, depending on where any read errors occurred. How critical
> this is will depend on what the filesystem is used for (and whether
> you have any backup).

I am rather scared to do so ... as I am ~50 km away from the box now,
and as it seems to be working fine so far (though there are currently
no users working with it).

As mentioned /dev/md2 doesn't contain a filesystem itself, but is the
single PV in a LVM-volumegroup.

This group contains 6 logical volumes ...

As far as I understand it might be possible to spot the defective
sectors and the related LV?

I have backups, yes ...

> If that all works okay, then get sda replaced and give it a
> thorough badblocks and SMART test.
>
> I'd also advise setting up regular array checks (echo check >
> /sys/block/mdX/md/sync_action) to make sure the disks are checked
> and any unreadable blocks repaired/mapped out _before_ they're
> needed for recovery.

re-adding sda4 and starting such a check would be possible?
Or would a re-add damage things?

Should I shut down the box for safety?

I am really feeling unsafe now, and getting another hdd for swapping
will take me at least until Monday.

(I would like to ddrescue to another new disk to keep sdb, just in case.)

Thanks, Stefan


Re: RAID1, changed disk, 2nd has errors ...

on 26.08.2011 16:08:10 by Robin Hill

On Fri Aug 26, 2011 at 03:51:17PM +0200, Stefan G. Weichinger wrote:

> On 26.08.2011 14:56, Robin Hill wrote:
>
> > This would indicate that sdb has been reset as a spare, suggesting
> > that the resync failed so it has left sda alone in the array (as
> > failing it would destroy the array).
>
> oh my ...
>
> So the array is somehow split-brain now? Some sectors good here, some
> there??
>
> Why is sda4 now flagged as (S)? Is it a spare or not?
> I don't fully understand the current state of the array ...
>
sda4 is still in the array, with some unreadable sectors. sdb4 is a
spare because the resync failed due to unreadable sectors on sda4. You
cannot add a disk to an array unless the data can all be read (or
recovered if there's still enough redundancy).

> > I'd suggest stopping the array and using ddrescue to clone sda4 to
> > sdb4. That'll copy everything possible, flagging up any read
> > issues. You'll then need to run a "fsck -f" on sdb4 to clear up
> > any filesystem damage. You may still be left with damaged/missing
> > files, depending on where any read errors occurred. How critical
> > this is will depend on what the filesystem is used for (and whether
> > you have any backup).
>
> I am rather scared to do so ... as I am ~50kms away from the box now,
> and as it seems to be working fine so far (though there are currently
> no users working with it).
>
It'll work fine unless something attempts to read from any of the
unreadable sectors on sda4. If these are not used by the filesystem
currently, then you may never run into an issue (as they'll get remapped
if a write error occurs when they do get used).

> As mentioned /dev/md2 doesn't contain a filesystem itself, but is the
> single PV in a LVM-volumegroup.
>
> This group contains 6 logical volumes ...
>
> As far as I understand it might be possible to spot the defective
> sectors and the related LV?
>
A read of the relevant block device (dd if=/dev/xxx of=/dev/null) will
result in read errors for whichever block device contains the bad
sectors. You could also probably map the sectors reported by the kernel
to the position on the disk to tell which LV they are in.
> I have backups, yes ...
>
In which case the absolute safest option is just to recreate whatever
arrays, PVs, LVs, etc. on sdb4 and restore the data, ignoring whatever's
on sda4 currently.

> > If that all works okay, then get sda replaced and give it a
> > thorough badblocks and SMART test.
> >
> > I'd also advise setting up regular array checks (echo check >
> > /sys/block/mdX/md/sync_action) to make sure the disks are checked
> > and any unreadable blocks repaired/mapped out _before_ they're
> > needed for recovery.
>
> re-adding sda4 and starting such a check would be possible?
> Or would a re-add damage things?
>
You can't add sda4 because it's already in the array.

> Should I shutdown the box for safety?
>
For absolute safety, yes, though I don't think the risk is too high at
the moment, and I don't think things'll get any worse in the short term.

> I am really feeling unsafe now, and getting another hdd for swapping
> will take me at least until monday.
>
> (I would like to dd-rescue to another new disk to keep sdb, just in case)
>
I doubt you'd be able to recover anything useful from sdb4 at the
moment, but that's up to you.

--
___
( ' } | Robin Hill |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |


Re: RAID1, changed disk, 2nd has errors ...

on 26.08.2011 17:41:47 by lists

On 2011-08-26 16:08, Robin Hill wrote:
> sda4 is still in the array, with some unreadable sectors. sdb4 is a
> spare because the resync failed due to unreadable sectors on
> sda4. You cannot add a disk to an array unless the data can all be
> read (or recovered if there's still enough redundancy).

Ah, now I got it.
I misinterpreted this:

md2 : active raid1 sdb4[2](S) sda4[0]
962454080 blocks [2/1] [U_]

I thought [U_] maps to the first line "sdb4 sda4" and somehow read
"sdb4 is UP and sda4 is down"

I could have seen it at

Number   Major   Minor   RaidDevice   State
   0       8       4         0        active sync   /dev/sda4
   1       0       0         1        removed

   2       8      20         -        spare   /dev/sdb4

but you know, panic ;-)
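
To make that reading mistake harder to repeat, here is a small sketch that interprets the two /proc/mdstat lines of one array. It treats `[U_]` as a per-slot status (slot 0 first) that is independent of the order in which members are listed, and treats `(S)` as a spare marker. It also assumes that the number in square brackets after an active member is its slot, which matches the `mdadm -D` output here (sda4[0] is RaidDevice 0) but is not guaranteed in general. The function name is made up.

```python
import re


def summarize_mdstat(device_line: str, status_line: str) -> dict:
    """Map each RAID slot to the active member holding it and list spares."""
    members = []
    # Matches e.g. "sdb4[2](S)" or "sda4[0]"; the flag group is "" if absent.
    for name, number, flag in re.findall(r"(\w+)\[(\d+)\](?:\(([A-Z])\))?", device_line):
        members.append({"name": name, "number": int(number), "flag": flag})

    match = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", status_line)
    if match is None:
        raise ValueError(f"no status pattern found in {status_line!r}")

    slots = {}
    for slot, mark in enumerate(match.group(3)):
        holder = None
        if mark == "U":
            for member in members:
                # Assumption: bracket number == slot for unflagged members.
                if member["flag"] == "" and member["number"] == slot:
                    holder = member["name"]
        slots[slot] = holder

    spares = [m["name"] for m in members if m["flag"] == "S"]
    return {"slots": slots, "spares": spares}


print(summarize_mdstat("md2 : active raid1 sdb4[2](S) sda4[0]",
                       "962454080 blocks [2/1] [U_]"))
```

For md2 this yields sda4 in slot 0, nothing in slot 1, and sdb4 as a spare, which agrees with the `mdadm -D` table above.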

So basically I am where I was before swapping sdb: everything is running
on sda, which has some corrupt sectors that may never have been
touched so far.

>> As far as I understand it might be possible to spot the defective
>> sectors and the related LV?
>>
> A read of the relevant block device (dd if=/dev/xxx of=/dev/null)
> will result in read errors for whichever block device contains the
> bad sectors. You could also probably map the sectors reported by
> the kernel to the position on the disk to tell which LV they are in.

There is only 350GB out of ~920GB mapped to active LVs. It might be
the case that the corrupt stuff isn't even mapped yet.

I once knew how to figure that out, I will have a closer look.
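
The mapping can be sketched in a few lines of Python. This is only an illustration: it assumes the block number in the kernel message is a 512-byte sector offset within md2, and the PV data offset, extent size and LV segment table below are hypothetical placeholders. The real values would have to come from pvs, vgdisplay and lvdisplay -m on the affected box.

```python
SECTOR_SIZE = 512  # bytes; assumed unit of the block numbers in the kernel log

# Sector numbers reported by md/raid1 for md2 (from the thread).
BAD_SECTORS = [643686144, 643686272]

# HYPOTHETICAL layout values, for illustration only.
PE_START_SECTORS = 384                            # offset of the first extent in the PV
PE_SIZE_SECTORS = 4 * 1024 * 1024 // SECTOR_SIZE  # 4 MiB extents, i.e. 8192 sectors
LV_SEGMENTS = [
    # (lv_name, first_pe, last_pe), as read off "lvdisplay -m"; made-up numbers
    ("lv_a", 0, 76799),
    ("lv_b", 76800, 81919),
]


def sector_to_pe(sector, pe_start=PE_START_SECTORS, pe_size=PE_SIZE_SECTORS):
    """Return (physical extent index, sector offset inside that extent)."""
    if pe_start > sector:
        raise ValueError("sector lies in the LVM metadata area, before the first extent")
    return divmod(sector - pe_start, pe_size)


def pe_to_lv(pe, segments=LV_SEGMENTS):
    """Return the LV owning the extent, or None if the extent is unallocated."""
    for name, first, last in segments:
        if pe in range(first, last + 1):
            return name
    return None


if __name__ == "__main__":
    for sector in BAD_SECTORS:
        pe, offset = sector_to_pe(sector)
        print(f"sector {sector}: PE {pe}, offset {offset} sectors, LV {pe_to_lv(pe)}")
```

An extent that falls outside every segment would be unallocated space, which is the case hoped for above.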

>> I have backups, yes ...
>>
> In which case the absolute safest option is just to recreate
> whatever arrays, PVs, LVs, etc. on sdb4 and restore the data,
> ignoring whatever's on sda4 currently.

I understand now, yes.

>> re-adding sda4 and starting such a check would be possible? Or
>> would a re-add damage things?
>>
> You can't add sda4 because it's already in the array.

Sure, now that I figured out the mentioned misunderstanding.

>> Should I shutdown the box for safety?
>>
> For absolute safety, yes, though I don't think the risk is too
> high at the moment, and I don't think things'll get any worse in
> the short term.

That sounds good for my weekend! Thanks ...

>> I am really feeling unsafe now, and getting another hdd for
>> swapping will take me at least until monday.
>>
>> (I would like to dd-rescue to another new disk to keep sdb, just
>> in case)
>>
> I doubt you'd be able to recover anything useful from sdb4 at the
> moment, but that's up to you.

Yep, also clear now.
I wait with that ddrescue-stuff anyway.

Thanks for your help!
Stefan

Re: RAID1, changed disk, 2nd has errors ...

on 26.08.2011 22:00:23 by mathias.buren

On 26 August 2011 13:19, Stefan G. Weichinger wrote:
> On 26.08.2011 14:01, Mathias Burén wrote:
>
>> Could you perhaps post the output of "smartctl -a /dev/sda" (and sdb
>> for completeness sake) here? You can find smartctl in the
>> smartmontools package.
>
> sure. sdb is the new hdd from today (as mentioned)
>
> ->
>
(snip)
>
>

FWIW, sda is failing, looking at uncorrectable sectors and all else.
If possible I'd mount the HDD (array) read-only, copy the contents
somewhere else, then recreate the array from scratch using your new
HDD and a new HDD to replace sda.

/Mathias

Re: RAID1, changed disk, 2nd has errors ...

on 27.08.2011 00:12:00 by lists

On 26.08.2011 22:00, Mathias Burén wrote:

> FWIW, sda is failing, looking at uncorrectable sectors and all else.
> If possible I'd mount the HDD (array) read-only, copy the contents
> somewhere else, then recreate the array from scratch using your new
> HDD and a new HDD to replace sda.

Thanks, Mathias. I will continue working on this on Monday, as soon as I
have another HDD at hand (given the distance etc.)

S


Re: RAID1, changed disk, 2nd has errors ...

on 29.08.2011 09:02:05 by lists

On 26.08.2011 17:41, Stefan G. Weichinger wrote:

> Yep, also clear now.
> I wait with that ddrescue-stuff anyway.

Could I somehow make the hdd re-map those 2 sectors?
S


Re: RAID1, changed disk, 2nd has errors ...

on 29.08.2011 09:45:17 by lists

On 29.08.2011 09:02, Stefan G. Weichinger wrote:
> On 26.08.2011 17:41, Stefan G. Weichinger wrote:
>
>> Yep, also clear now.
>> I wait with that ddrescue-stuff anyway.
>
> Could I somehow make the hdd re-map those 2 sectors?

I now followed

http://smartmontools.sourceforge.net/badblockhowto.html#lvm

and as far as I can see the two bad blocks are inside an LVM LV which is not
important at all!

It is a 20 GB LV prepared for something the customer never really used
so I will mv away the test-data and remove the LV.

Does this somehow help me to be able to maybe remap the bad blocks?

Thanks, Stefan


Re: RAID1, changed disk, 2nd has errors ...

on 29.08.2011 09:51:01 by mathias.buren

On 29 August 2011 08:45, Stefan G. Weichinger wrote:
> On 29.08.2011 09:02, Stefan G. Weichinger wrote:
>> On 26.08.2011 17:41, Stefan G. Weichinger wrote:
>>
>>> Yep, also clear now.
>>> I wait with that ddrescue-stuff anyway.
>>
>> Could I somehow make the hdd re-map those 2 sectors?
>
> I now followed
>
> http://smartmontools.sourceforge.net/badblockhowto.html#lvm
>
> and as far as I can see the two bad blocks are inside an LVM LV which is not
> important at all!
>
> It is a 20 GB LV prepared for something the customer never really used
> so I will mv away the test-data and remove the LV.
>
> Does this somehow help me to be able to maybe remap the bad blocks?
>
> Thanks, Stefan

Maybe running badblocks on the sector range (or over the whole HDD,
but in non-read-write mode it takes quite a while longer) will do the
trick.

/M

Re: RAID1, changed disk, 2nd has errors ...

on 29.08.2011 10:00:18 by lists

On 29.08.2011 09:51, Mathias Burén wrote:

> Maybe running badblocks on the sector range (or over the whole HDD,
> but in non-read-write mode it takes quite a while longer) will do the
> trick.

I currently run "badblocks -n -s /dev/VG01/my_lv" ... we'll see

S

Re: RAID1, changed disk, 2nd has errors ...

on 29.08.2011 10:25:54 by lists

On 29.08.2011 10:00, Stefan G. Weichinger wrote:
> On 29.08.2011 09:51, Mathias Burén wrote:
>
>> Maybe running badblocks on the sector range (or over the whole HDD,
>> but in non-read-write mode it takes quite a while longer) will do the
>> trick.
>
> I currently run "badblocks -n -s /dev/VG01/my_lv" ... we'll see

Switched over to

dd if=/dev/zero of=/dev/VG01/my_lv bs=4096

This executed without error (wrote ~20GB) and now when I check with:

smartctl -a /dev/sda

I get

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

Sounds good to me! Right?

So now I could re-add /dev/sdb4 to retry syncing that array, correct?

Stefan

Re: (solved) RAID1, changed disk, 2nd has errors ...

on 29.08.2011 16:34:48 by lists

On 29.08.2011 10:25, Stefan G. Weichinger wrote:

> I get
>
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
>
>
> Sounds good to me! Right?
>
> So now I could re-add /dev/sdb4 to retry syncing that array, correct?

Did that.

I failed/removed/re-added /dev/sdb4 and waited for some hours of resyncing.

Now /dev/md2 is in sync again, still with no bad sectors in SMART
(attached, @Mathias ;-))
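
The fail/remove/re-add cycle described here could look roughly like the following Python sketch. It only builds the mdadm command lines and prints them; the device names are the ones from this thread, the helper names are invented, and actually running the commands (as root, on the right devices) is left as an explicit opt-in.

```python
import shlex
import subprocess


def readd_commands(array="/dev/md2", member="/dev/sdb4"):
    """Build the mdadm calls for a fail / remove / add cycle of one member."""
    return [
        ["mdadm", array, "--fail", member],
        ["mdadm", array, "--remove", member],
        ["mdadm", array, "--add", member],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first command that fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(readd_commands())  # dry run: prints the three mdadm commands
```

Progress of the resulting resync can then be followed in /proc/mdstat.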

Thanks to Robin and Mathias for your feedback, it helped me to get the
picture and choose the next steps!

For now I will leave the arrays as they are and wait for the second new HDD.
As soon as I have it here I will swap /dev/sda as well.

(a new server with maybe RAID6 is soon to come there ...)

Thanks, Stefan

----

# smartctl -a /dev/sda
smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.12 family
Device Model: ST31000528AS
Serial Number: 9VP3BSEV
Firmware Version: CC38
User Capacity: 1.000.204.886.016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Mon Aug 29 16:31:35 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 600) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 178) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       134791791
  3 Spin_Up_Time            0x0003   097   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       50
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       111650379
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13433
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       25
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   082   082   000    Old_age   Always       -       18
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       2
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   060   045    Old_age   Always       -       33 (Min/Max 27/36)
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 15 0 0)
195 Hardware_ECC_Recovered  0x001a   048   024   000    Old_age   Always       -       134791791
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       255980050855093
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2678846567
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4015371061

SMART Error Log Version: 1
ATA Error Count: 18 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 18 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      01:28:56.212  READ DMA EXT
  27 00 00 00 00 00 e0 00      01:28:56.211  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      01:28:56.191  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      01:28:56.175  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      01:28:56.151  READ NATIVE MAX ADDRESS EXT

Error 17 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      01:28:53.001  READ DMA EXT
  27 00 00 00 00 00 e0 00      01:28:53.000  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      01:28:52.980  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      01:28:52.961  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      01:28:52.940  READ NATIVE MAX ADDRESS EXT

Error 16 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      01:28:49.790  READ DMA EXT
  27 00 00 00 00 00 e0 00      01:28:49.789  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      01:28:49.749  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      01:28:49.739  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      01:28:49.719  READ NATIVE MAX ADDRESS EXT

Error 15 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      01:28:46.580  READ DMA EXT
  27 00 00 00 00 00 e0 00      01:28:46.579  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      01:28:46.559  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      01:28:46.542  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      01:28:46.519  READ NATIVE MAX ADDRESS EXT

Error 14 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      01:28:43.379  READ DMA EXT
  27 00 00 00 00 00 e0 00      01:28:43.378  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      01:28:43.358  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      01:28:43.345  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      01:28:43.318  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     13429         -
# 2  Short offline       Completed without error       00%     13405         -
# 3  Short offline       Completed without error       00%     13381         -
# 4  Extended offline    Completed without error       00%     13375         -
# 5  Short offline       Completed without error       00%     13357         -
# 6  Short offline       Completed without error       00%     13333         -
# 7  Short offline       Completed without error       00%     13310         -
# 8  Short offline       Completed without error       00%     13286         -
# 9  Short offline       Completed without error       00%     13261         -
#10  Short offline       Completed without error       00%     13237         -
#11  Short offline       Completed without error       00%     13213         -
#12  Extended offline    Completed without error       00%     13207         -
#13  Short offline       Completed without error       00%     13189         -
#14  Short offline       Completed without error       00%     13164         -
#15  Short offline       Completed without error       00%     13162         -
#16  Short offline       Completed without error       00%     13138         -
#17  Short offline       Completed without error       00%     13114         -
#18  Short offline       Completed without error       00%     13090         -
#19  Short offline       Completed without error       00%     13066         -
#20  Extended offline    Completed without error       00%     13060         -
#21  Short offline       Completed without error       00%     13042         -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Re: (solved) RAID1, changed disk, 2nd has errors ...

on 30.08.2011 01:40:51 by mathias.buren

On 29 August 2011 15:34, Stefan G. Weichinger wrote:
> (snip)

Glad you got it working, but your drive looks like a failing drive to
me, because of these:

187 Reported_Uncorrect      0x0032   082   082   000    Old_age   Always       -       18

So I'd replace it ASAP. Cheers,

/M
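
As an illustration of the check done by eye here, a small Python sketch that pulls a few attributes out of smartctl -A style output and reports those with a non-zero raw value. The choice of attribute IDs (5, 187, 197, 198) is an assumption made for this example, not a rule from smartmontools, and the sample rows are taken from the output posted earlier in this thread.

```python
# Attribute rows copied from the smartctl output posted in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   082   082   000    Old_age   Always       -       18
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

# Assumption for this sketch: these IDs are the ones worth watching.
WATCHED_IDS = {5, 187, 197, 198}


def parse_attributes(text):
    """Map attribute ID to (name, raw value) for rows that look like attributes."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A full row has 10 columns; the raw value is the 10th.
        if 10 > len(fields) or not fields[0].isdigit():
            continue
        raw = fields[9]
        if raw.isdigit():
            attrs[int(fields[0])] = (fields[1], int(raw))
    return attrs


def suspicious(attrs, watched=WATCHED_IDS):
    """Return watched attributes whose raw value is above zero."""
    return {i: attrs[i] for i in sorted(watched) if i in attrs and attrs[i][1] > 0}


if __name__ == "__main__":
    for attr_id, (name, raw) in suspicious(parse_attributes(SAMPLE)).items():
        print(f"{attr_id} {name}: raw value {raw}")
```

On the sample rows only Reported_Uncorrect is reported, which is the attribute pointed out above; the pending and offline-uncorrectable counters are back at zero after the rewrite.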

Re: (solved) RAID1, changed disk, 2nd has errors ...

on 30.08.2011 14:14:42 by lists

On 30.08.2011 01:40, Mathias Burén wrote:

> Glad you got it working, but your drive looks like a failing drive
> to me, because of these:
>
> 187 Reported_Uncorrect      0x0032   082   082   000    Old_age   Always       -       18
>
> So I'd replace it ASAP.

As mentioned, I ordered the new disk already.
