Do I have a bad HDD?

on 31.07.2011 15:05:44 by mathias.buren

Hi list,

Here's the output of my weekly script:

DEV   EVENTS   REALL  PEND  UNCORR  CRC  RAW  ZONE  END
sdb1  6158767  0      0     0       2    0    0
sdc1  6158767  0      0     0       0    0    0
sdd1  6158767  0      0     0       0    0    0
sde1  6158767  0      0     0       0    0    1
sdf1  6158767  0      0     0       0    47   6
sdg1  6158767  0      0     0       0    0    0
sdh1  6158767  0      6     0       0    340  3
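
The weekly script itself is not shown in this post. As a rough sketch of how one column of such a summary could be gathered, the helper below pulls the raw value of a single SMART attribute out of `smartctl -A` style text. The function name `smart_raw` and the device glob in the usage comment are assumptions made for illustration, not part of the original script.

```shell
#!/bin/sh
# Print the RAW_VALUE column of one SMART attribute.
# Reads `smartctl -A` style text on stdin; $1 is the numeric attribute ID
# (for example 5 = Reallocated_Sector_Ct, 197 = Current_Pending_Sector).
smart_raw() {
    # Field 1 is the ID, field 10 is the start of RAW_VALUE.
    awk -v id="$1" '$1 == id { print $10; exit }'
}

# Hypothetical usage on a live system (needs root and smartmontools):
#   for dev in /dev/sd[b-h]; do
#       printf '%s PEND=%s\n' "$dev" "$(smartctl -A "$dev" | smart_raw 197)"
#   done
```

Matching on the numeric ID rather than the attribute name is a deliberate choice here, since names can differ between drive vendors while the IDs used above are the common ones.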

Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdf1[5] sdh1[6] sdg1[0] sde1[7] sdc1[3] sdd1[4] sdb1[1]
      9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2 [7/7] [UUUUUUU]

unused devices: <none>

/dev/md0:
Version : 1.2
Creation Time : Tue Oct 19 08:58:41 2010
Raid Level : raid6
Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
Raid Devices : 7
Total Devices : 7
Persistence : Superblock is persistent

Update Time : Sun Jul 31 09:50:43 2011
State : clean
Active Devices : 7
Working Devices : 7
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 64K

Name : ion:0 (local to host ion)
UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
Events : 6158767

Number Major Minor RaidDevice State
0 8 97 0 active sync /dev/sdg1
1 8 17 1 active sync /dev/sdb1
4 8 49 2 active sync /dev/sdd1
3 8 33 3 active sync /dev/sdc1
5 8 81 4 active sync /dev/sdf1
6 8 113 5 active sync /dev/sdh1
7 8 65 6 active sync /dev/sde1

Here's the SMART data for sdh:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.39-ck] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F4 EG (AFT)
Device Model: SAMSUNG HD204UI
Serial Number: S2HGJ1RZ800850
LU WWN Device Id: 5 0024e9 003f1ebc9
Firmware Version: 1AQ10003
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Sun Jul 31 14:03:32 2011 IST

==> WARNING: Using smartmontools or hdparm with this
drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
Buggy and fixed firmware report same version number!
See the following web pages for details:
http://www.samsung.com/global/business/hdd/faqView.do?b2b_bbs_msg_id=386
http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  37) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                (20640) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       340
  2 Throughput_Performance  0x0026   055   053   000    Old_age   Always       -       18989
  3 Spin_Up_Time            0x0023   067   044   025    Pre-fail  Always       -       10165
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       18
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       6447
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       10117271
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   057   000    Old_age   Always       -       35 (Min/Max 16/43)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       6
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       3
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       21

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)     50%        6408          -
# 2  Extended offline    Completed without error      00%        6317          -
# 3  Extended offline    Completed without error      00%        6260          -
# 4  Extended offline    Completed without error      00%        6232          -
# 5  Extended offline    Completed without error      00%        6170          -
# 6  Extended offline    Completed without error      00%        6064          -
# 7  Extended offline    Completed without error      00%        6029          -
# 8  Extended offline    Completed without error      00%        5898          -
# 9  Extended offline    Aborted by host              60%        5893          -
#10  Extended offline    Completed without error      00%        5728          -
#11  Extended offline    Completed without error      00%        5706          -
#12  Extended offline    Interrupted (host reset)     40%        5701          -
#13  Extended offline    Interrupted (host reset)     90%        5666          -
#14  Extended offline    Completed without error      00%        5560          -
#15  Extended offline    Completed without error      00%        5527          -
#16  Extended offline    Completed without error      00%        5392          -
#17  Extended offline    Completed without error      00%        5357          -
#18  Extended offline    Completed without error      00%        5250          -
#19  Extended offline    Completed without error      00%        4272          -
#20  Extended offline    Completed without error      00%        4017          -
#21  Extended offline    Completed without error      00%        3935          -

Note: selective self-test log revision number (0) not 1 implies that
no selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has
ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Interrupted [50% left] (0-65535)
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

It has 6 pending sectors. Why have they not been reallocated? Can I
force reallocation somehow? (A scrub did not reallocate them.) Is this
reason enough to replace the HDD?

Thanks,
Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: Do I have a bad HDD?

on 31.07.2011 19:59:00 by mathias.buren

On 31 July 2011 14:05, Mathias Burén wrote:
> [...]
>
> It has 6 pending sectors. Why are they not reallocated? Can I force
> this somehow? (a scrub did not reallocate them) Is this enough to
> replace the HDD?
>
> Thanks,
> Mathias
>

Uh oh, I did another scrub, and here's the status:

DEV   EVENTS   REALL  PEND  UNCORR  CRC  RAW  ZONE  END
sdb1  6158767  0      0     0       2    0    0
sdc1  6158767  0      0     0       0    0    0
sdd1  6158767  0      0     0       0    0    0
sde1  6158767  0      0     0       0    0    1
sdf1  6158768  0      0     0       0    47   6
sdg1  6158767  0      0     0       0    0    0
sdh1  6158767  0      8     1       0    341  3

Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdh1[6] sdg1[0] sdf1[5] sde1[7] sdd1[4] sdb1[1] sdc1[3]
      9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2 [7/7] [UUUUUUU]

unused devices: <none>

/dev/md0:
Version : 1.2
Creation Time : Tue Oct 19 08:58:41 2010
Raid Level : raid6
Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
Raid Devices : 7
Total Devices : 7
Persistence : Superblock is persistent

Update Time : Sun Jul 31 18:51:58 2011
State : clean
Active Devices : 7
Working Devices : 7
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 64K

Name : ion:0 (local to host ion)
UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
Events : 6158767

Number Major Minor RaidDevice State
0 8 97 0 active sync /dev/sdg1
1 8 17 1 active sync /dev/sdb1
4 8 49 2 active sync /dev/sdd1
3 8 33 3 active sync /dev/sdc1
5 8 81 4 active sync /dev/sdf1
6 8 113 5 active sync /dev/sdh1
7 8 65 6 active sync /dev/sde1

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.39-ck] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F4 EG (AFT)
Device Model: SAMSUNG HD204UI
Serial Number: S2HGJ1RZ800850
LU WWN Device Id: 5 0024e9 003f1ebc9
Firmware Version: 1AQ10003
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Sun Jul 31 18:51:59 2011 IST

==> WARNING: Using smartmontools or hdparm with this
drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
Buggy and fixed firmware report same version number!
See the following web pages for details:
http://www.samsung.com/global/business/hdd/faqView.do?b2b_bbs_msg_id=386
http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 118) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (20640) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       341
  2 Throughput_Performance  0x0026   055   053   000    Old_age   Always       -       18989
  3 Spin_Up_Time            0x0023   067   044   025    Pre-fail  Always       -       10165
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       18
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       6452
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       10121757
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   057   000    Old_age   Always       -       31 (Min/Max 16/43)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       3
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       21

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure      60%        6452          1519520304
# 2  Extended offline    Interrupted (host reset)     50%        6408          -
# 3  Extended offline    Completed without error      00%        6317          -
# 4  Extended offline    Completed without error      00%        6260          -
# 5  Extended offline    Completed without error      00%        6232          -
# 6  Extended offline    Completed without error      00%        6170          -
# 7  Extended offline    Completed without error      00%        6064          -
# 8  Extended offline    Completed without error      00%        6029          -
# 9  Extended offline    Completed without error      00%        5898          -
#10  Extended offline    Aborted by host              60%        5893          -
#11  Extended offline    Completed without error      00%        5728          -
#12  Extended offline    Completed without error      00%        5706          -
#13  Extended offline    Interrupted (host reset)     40%        5701          -
#14  Extended offline    Interrupted (host reset)     90%        5666          -
#15  Extended offline    Completed without error      00%        5560          -
#16  Extended offline    Completed without error      00%        5527          -
#17  Extended offline    Completed without error      00%        5392          -
#18  Extended offline    Completed without error      00%        5357          -
#19  Extended offline    Completed without error      00%        5250          -
#20  Extended offline    Completed without error      00%        4272          -
#21  Extended offline    Completed without error      00%        4017          -

Note: selective self-test log revision number (0) not 1 implies that
no selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has
ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Completed_read_failure [60% left] (0-65535)
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
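
The self-test log above now carries a number in the LBA_of_first_error column. As a small sketch (the function name `first_error_lba` is made up for illustration), that value could be extracted from `smartctl -l selftest` style text like this:

```shell
#!/bin/sh
# Print the LBA_of_first_error of the most recent self-test entry that
# reports one. Reads `smartctl -l selftest` style text on stdin.
# Entries without an error carry "-" in the last column and are skipped.
first_error_lba() {
    awk '/^# *[0-9]+ / && $NF ~ /^[0-9]+$/ { print $NF; exit }'
}

# Hypothetical usage on a live system (needs root and smartmontools):
#   smartctl -l selftest /dev/sdh | first_error_lba
```

Note that the reported LBA counts sectors from the start of the whole disk, not from the start of the sdh1 partition, so it would have to be offset by the partition start before being compared with anything md reports.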

:( I do have a bad HDD. RMA is already in progress, I need to take out
the drive and ship it to Samsung in Holland. Will print labels at work
on Tuesday.

Questions:

* How do I remove this HDD without causing damage to the array? Is
  this the correct way?
    mdadm --manage /dev/md0 --fail /dev/sdh1     # fail the device
    mdadm --manage /dev/md0 --remove /dev/sdh1   # remove the device
    * (shut down the system gracefully)
    * (remove the HDD)
    * (install new HDD)
    * (start system)
    sfdisk -d /dev/sde | sfdisk /dev/sdh         # partition the new HDD
    mdadm --manage /dev/md0 --add /dev/sdh1      # add the partition to the array

* After removing the HDD, should I do another scrub?
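
For the replacement steps listed above, it may help to confirm the array state before pulling the drive and again once the rebuild has finished. The following is only a sketch under the assumption that the member map in /proc/mdstat (such as `[UUUUUUU]`) is the indicator of interest; the function name `md_degraded` is hypothetical.

```shell
#!/bin/sh
# Print the name of every md array whose member map (e.g. [UUUUU_U])
# contains "_", meaning at least one slot is missing or failed.
# Reads /proc/mdstat style text on stdin.
md_degraded() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember the current array
        /\[[U_]+\]/ && name != "" {          # line holding the member map
            if (match($0, /\[[U_]+\]/) &&
                index(substr($0, RSTART, RLENGTH), "_"))
                print name
            name = ""
        }
    '
}

# Hypothetical usage on a live system:
#   if [ -z "$(md_degraded < /proc/mdstat)" ]; then
#       echo "all arrays have every member present"
#   fi
```

With RAID6, md0 is expected to show up as degraded between the `--remove` and the end of the resync onto the new partition, so an empty result is only meaningful before the first step and after the rebuild completes.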

Thanks a lot in advance!

/Mathias

Re: Do I have a bad HDD?

on 31.07.2011 20:05:39 by mathias.buren

On 31 July 2011 18:59, Mathias Burén wrote:
> On 31 July 2011 14:05, Mathias Burén wrote:
>> Hi list,
>>
>> Here's the output of my weekly script:
>>
>> DEV Â Â EVENTS Â REALL Â PEND Â Â UNCORR=
Â CRC Â Â RAW Â Â ZONE Â Â END
>> sdb1 Â Â 6158767 0 Â Â Â 0 Â Â =C2=
=A0 0 Â Â Â 2 Â Â Â 0 Â Â Â =
0
>> sdc1 Â Â 6158767 0 Â Â Â 0 Â Â =C2=
=A0 0 Â Â Â 0 Â Â Â 0 Â Â Â =
0
>> sdd1 Â Â 6158767 0 Â Â Â 0 Â Â =C2=
=A0 0 Â Â Â 0 Â Â Â 0 Â Â Â =
0
>> sde1 Â Â 6158767 0 Â Â Â 0 Â Â =C2=
=A0 0 Â Â Â 0 Â Â Â 0 Â Â Â =
1
>> sdf1 Â Â 6158767 0 Â Â Â 0 Â Â =C2=
=A0 0 Â Â Â 0 Â Â Â 47 Â Â Â =
6
>> sdg1 Â Â 6158767 0 Â Â Â 0 Â Â =C2=
=A0 0 Â Â Â 0 Â Â Â 0 Â Â Â =
0
>> sdh1 Â Â 6158767 0 Â Â Â 6 Â Â =C2=
=A0 0 Â Â Â 0 Â Â Â 340 Â Â 3
>>
>>
>> Personalities : [raid6] [raid5] [raid4]
>> md0 : active raid6 sdf1[5] sdh1[6] sdg1[0] sde1[7] sdc1[3] sdd1[4] s=
db1[1]
>> Â Â Â 9751756800 blocks super 1.2 level 6, 64k chunk, =
algorithm 2
>> [7/7] [UUUUUUU]
>>
>> unused devices:
>>
>>
>> /dev/md0:
>> Â Â Â Â Version : 1.2
>> Â Creation Time : Tue Oct 19 08:58:41 2010
>> Â Â Raid Level : raid6
>> Â Â Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
>> Â Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
>> Â Raid Devices : 7
>> Â Total Devices : 7
>> Â Â Persistence : Superblock is persistent
>>
>> Â Â Update Time : Sun Jul 31 09:50:43 2011
>> Â Â Â Â Â State : clean
>> Â Active Devices : 7
>> Working Devices : 7
>> Â Failed Devices : 0
>> Â Spare Devices : 0
>>
>> Â Â Â Â Layout : left-symmetric
>> Â Â Chunk Size : 64K
>>
>> Â Â Â Â Â Name : ion:0 Â (local to host=
ion)
>> Â Â Â Â Â UUID : e6595c64:b3ae90b3:f01133ac=
:3f402d20
>> Â Â Â Â Events : 6158767
>>
>> Â Â Number Â Major Â Minor Â RaidDevice Stat=
e
>> Â Â Â 0 Â Â Â 8 Â Â Â 9=
7 Â Â Â Â 0 Â Â Â active sync Â /=
dev/sdg1
>> Â Â Â 1 Â Â Â 8 Â Â Â 1=
7 Â Â Â Â 1 Â Â Â active sync Â /=
dev/sdb1
>> Â Â Â 4 Â Â Â 8 Â Â Â 4=
9 Â Â Â Â 2 Â Â Â active sync Â /=
dev/sdd1
>> Â Â Â 3 Â Â Â 8 Â Â Â 3=
3 Â Â Â Â 3 Â Â Â active sync Â /=
dev/sdc1
>> Â Â Â 5 Â Â Â 8 Â Â Â 8=
1 Â Â Â Â 4 Â Â Â active sync Â /=
dev/sdf1
>> Â Â Â 6 Â Â Â 8 Â Â Â 11=
3 Â Â Â Â 5 Â Â Â active sync Â /=
dev/sdh1
>> Â Â Â 7 Â Â Â 8 Â Â Â 6=
5 Â Â Â Â 6 Â Â Â active sync Â /=
dev/sde1
>>
>> Here's the SMART data for sdh:
>>
>>
>> smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.39-ck] (local build=
)
>> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourcefor=
ge.net
>>
>> ===3D START OF INFORMATION SECTION ===3D
>> Model Family: Â Â SAMSUNG SpinPoint F4 EG (AFT)
>> Device Model: Â Â SAMSUNG HD204UI
>> Serial Number: Â Â S2HGJ1RZ800850
>> LU WWN Device Id: 5 0024e9 003f1ebc9
>> Firmware Version: 1AQ10003
>> User Capacity: Â Â 2,000,398,934,016 bytes [2.00 TB]
>> Sector Size: Â Â Â 512 bytes logical/physical
>> Device is: Â Â Â Â In smartctl database [for deta=
ils use: -P show]
>> ATA Version is: Â 8
>> ATA Standard is: Â ATA-8-ACS revision 6
>> Local Time is: Â Â Sun Jul 31 14:03:32 2011 IST
>>
>> ==> WARNING: Using smartmontools or hdparm with this
>> drive may result in data loss due to a firmware bug.
>> ****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
>> Buggy and fixed firmware report same version number!
>> See the following web pages for details:
>> http://www.samsung.com/global/business/hdd/faqView.do?b2b_bb s_msg_id=
=3D386
>> http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF 4EGBadBl=
ocks
>>
>> SMART support is: Available - device has SMART capability.
>> SMART support is: Enabled
>>
>> ===3D START OF READ SMART DATA SECTION ===3D
>> SMART overall-health self-assessment test result: PASSED
>>
>> General SMART Values:
>> Offline data collection status:  (0x82) Offline data collection activity
>>                                         was completed without error.
>>                                         Auto Offline Data Collection: Enabled.
>> Self-test execution status:      (  37) The self-test routine was interrupted
>>                                         by the host with a hard or soft reset.
>> Total time to complete Offline
>> data collection:                (20640) seconds.
>> Offline data collection
>> capabilities:                    (0x5b) SMART execute Offline immediate.
>>                                         Auto Offline data collection on/off support.
>>                                         Suspend Offline collection upon new
>>                                         command.
>>                                         Offline surface scan supported.
>>                                         Self-test supported.
>>                                         No Conveyance Self-test supported.
>>                                         Selective Self-test supported.
>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>                                         power-saving mode.
>>                                         Supports SMART auto save timer.
>> Error logging capability:        (0x01) Error logging supported.
>>                                         General Purpose Logging supported.
>> Short self-test routine
>> recommended polling time:        (   2) minutes.
>> Extended self-test routine
>> recommended polling time:        ( 255) minutes.
>> SCT capabilities:              (0x003f) SCT Status supported.
>>                                         SCT Error Recovery Control supported.
>>                                         SCT Feature Control supported.
>>                                         SCT Data Table supported.
>>
>> SMART Attributes Data Structure revision number: 16
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>   1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       340
>>   2 Throughput_Performance  0x0026   055   053   000    Old_age   Always       -       18989
>>   3 Spin_Up_Time            0x0023   067   044   025    Pre-fail  Always       -       10165
>>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       18
>>   5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
>>   7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
>>   8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
>>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       6447
>>  10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
>>  11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
>> 181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       10117271
>> 191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1
>> 192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
>> 194 Temperature_Celsius     0x0002   064   057   000    Old_age   Always       -       35 (Min/Max 16/43)
>> 195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
>> 196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
>> 197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       6
>> 198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
>> 199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
>> 200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       3
>> 223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
>> 225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       21
>>
>> SMART Error Log Version: 1
>> No Errors Logged
>>
>> SMART Self-test log structure revision number 1
>> Num  Test_Description    Status                        Remaining  LifeTime(hours)  LBA_of_first_error
>> # 1  Extended offline    Interrupted (host reset)      50%        6408             -
>> # 2  Extended offline    Completed without error       00%        6317             -
>> # 3  Extended offline    Completed without error       00%        6260             -
>> # 4  Extended offline    Completed without error       00%        6232             -
>> # 5  Extended offline    Completed without error       00%        6170             -
>> # 6  Extended offline    Completed without error       00%        6064             -
>> # 7  Extended offline    Completed without error       00%        6029             -
>> # 8  Extended offline    Completed without error       00%        5898             -
>> # 9  Extended offline    Aborted by host               60%        5893             -
>> #10  Extended offline    Completed without error       00%        5728             -
>> #11  Extended offline    Completed without error       00%        5706             -
>> #12  Extended offline    Interrupted (host reset)      40%        5701             -
>> #13  Extended offline    Interrupted (host reset)      90%        5666             -
>> #14  Extended offline    Completed without error       00%        5560             -
>> #15  Extended offline    Completed without error       00%        5527             -
>> #16  Extended offline    Completed without error       00%        5392             -
>> #17  Extended offline    Completed without error       00%        5357             -
>> #18  Extended offline    Completed without error       00%        5250             -
>> #19  Extended offline    Completed without error       00%        4272             -
>> #20  Extended offline    Completed without error       00%        4017             -
>> #21  Extended offline    Completed without error       00%        3935             -
>>
>> Note: selective self-test log revision number (0) not 1 implies that
>> no selective self-test has ever been run
>> SMART Selective self-test log data structure revision number 0
>> Note: revision number not 1 implies that no selective self-test has
>> ever been run
>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>>     1        0        0  Interrupted [50% left] (0-65535)
>>     2        0        0  Not_testing
>>     3        0        0  Not_testing
>>     4        0        0  Not_testing
>>     5        0        0  Not_testing
>> Selective self-test flags (0x0):
>>   After scanning selected spans, do NOT read-scan remainder of disk.
>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>
>>
>> It has 6 pending sectors. Why are they not reallocated? Can I force
>> this somehow? (a scrub did not reallocate them) Is this enough to
>> replace the HDD?
>>
>> Thanks,
>> Mathias
>>
>
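In the SMART dumps for sdh, the PEND and UNCORR columns of the weekly script line up with the raw values of attributes 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable). A minimal sketch of how such a script might pull one raw value out of `smartctl -A` output follows. The function name `smart_raw` is hypothetical, and it assumes the usual ten-column attribute table with the raw value starting in column 10.

```sh
# smart_raw ID
# Reads `smartctl -A` style output on stdin and prints the first field of
# RAW_VALUE for the attribute whose ID# column equals ID.
# Hypothetical helper, not part of smartmontools.
smart_raw() {
    awk -v id="$1" 'NF >= 10 && $1 == id { print $10; exit }'
}
```

Typical use would be something like `smartctl -A /dev/sdh | smart_raw 197`, run on the unquoted output rather than on a mail quote, since a leading `>` would shift the columns.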
> Uh oh, I did another scrub, and here's the status:
>
> DEV   EVENTS   REALL  PEND  UNCORR  CRC  RAW  ZONE  END
> sdb1  6158767  0      0     0       2    0    0
> sdc1  6158767  0      0     0       0    0    0
> sdd1  6158767  0      0     0       0    0    0
> sde1  6158767  0      0     0       0    0    1
> sdf1  6158768  0      0     0       0    47   6
> sdg1  6158767  0      0     0       0    0    0
> sdh1  6158767  0      8     1       0    341  3
>
>
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdh1[6] sdg1[0] sdf1[5] sde1[7] sdd1[4] sdb1[1] sdc1[3]
>       9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2
> [7/7] [UUUUUUU]
>
> unused devices:
>
>
> /dev/md0:
>         Version : 1.2
>   Creation Time : Tue Oct 19 08:58:41 2010
>      Raid Level : raid6
>      Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
>   Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
>    Raid Devices : 7
>   Total Devices : 7
>     Persistence : Superblock is persistent
>
>     Update Time : Sun Jul 31 18:51:58 2011
>           State : clean
>  Active Devices : 7
> Working Devices : 7
>  Failed Devices : 0
>   Spare Devices : 0
>
>          Layout : left-symmetric
>      Chunk Size : 64K
>
>            Name : ion:0  (local to host ion)
>            UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
>          Events : 6158767
>
>     Number   Major   Minor   RaidDevice State
>        0       8       97        0      active sync   /dev/sdg1
>        1       8       17        1      active sync   /dev/sdb1
>        4       8       49        2      active sync   /dev/sdd1
>        3       8       33        3      active sync   /dev/sdc1
>        5       8       81        4      active sync   /dev/sdf1
>        6       8      113        5      active sync   /dev/sdh1
>        7       8       65        6      active sync   /dev/sde1
>
> smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.39-ck] (local build)
> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
>
> === START OF INFORMATION SECTION ===
> Model Family:     SAMSUNG SpinPoint F4 EG (AFT)
> Device Model:     SAMSUNG HD204UI
> Serial Number:    S2HGJ1RZ800850
> LU WWN Device Id: 5 0024e9 003f1ebc9
> Firmware Version: 1AQ10003
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Sector Size:      512 bytes logical/physical
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  ATA-8-ACS revision 6
> Local Time is:    Sun Jul 31 18:51:59 2011 IST
>
> ==> WARNING: Using smartmontools or hdparm with this
> drive may result in data loss due to a firmware bug.
> ****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
> Buggy and fixed firmware report same version number!
> See the following web pages for details:
> http://www.samsung.com/global/business/hdd/faqView.do?b2b_bbs_msg_id=386
> http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
>
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x80) Offline data collection activity
>                                         was never started.
>                                         Auto Offline Data Collection: Enabled.
> Self-test execution status:      ( 118) The previous self-test completed having
>                                         the read element of the test failed.
> Total time to complete Offline
> data collection:                (20640) seconds.
> Offline data collection
> capabilities:                    (0x5b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         No Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   2) minutes.
> Extended self-test routine
> recommended polling time:        ( 255) minutes.
> SCT capabilities:              (0x003f) SCT Status supported.
>                                         SCT Error Recovery Control supported.
>                                         SCT Feature Control supported.
>                                         SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       341
>   2 Throughput_Performance  0x0026   055   053   000    Old_age   Always       -       18989
>   3 Spin_Up_Time            0x0023   067   044   025    Pre-fail  Always       -       10165
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       18
>   5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
>   8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       6452
>  10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
>  11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
> 181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       10121757
> 191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1
> 192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
> 194 Temperature_Celsius     0x0002   064   057   000    Old_age   Always       -       31 (Min/Max 16/43)
> 195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
> 196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       8
> 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1
> 199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       3
> 223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
> 225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       21
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                        Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed: read failure       60%        6452             1519520304
> # 2  Extended offline    Interrupted (host reset)      50%        6408             -
> # 3  Extended offline    Completed without error       00%        6317             -
> # 4  Extended offline    Completed without error       00%        6260             -
> # 5  Extended offline    Completed without error       00%        6232             -
> # 6  Extended offline    Completed without error       00%        6170             -
> # 7  Extended offline    Completed without error       00%        6064             -
> # 8  Extended offline    Completed without error       00%        6029             -
> # 9  Extended offline    Completed without error       00%        5898             -
> #10  Extended offline    Aborted by host               60%        5893             -
> #11  Extended offline    Completed without error       00%        5728             -
> #12  Extended offline    Completed without error       00%        5706             -
> #13  Extended offline    Interrupted (host reset)      40%        5701             -
> #14  Extended offline    Interrupted (host reset)      90%        5666             -
> #15  Extended offline    Completed without error       00%        5560             -
> #16  Extended offline    Completed without error       00%        5527             -
> #17  Extended offline    Completed without error       00%        5392             -
> #18  Extended offline    Completed without error       00%        5357             -
> #19  Extended offline    Completed without error       00%        5250             -
> #20  Extended offline    Completed without error       00%        4272             -
> #21  Extended offline    Completed without error       00%        4017             -
>
> Note: selective self-test log revision number (0) not 1 implies that
> no selective self-test has ever been run
> SMART Selective self-test log data structure revision number 0
> Note: revision number not 1 implies that no selective self-test has
> ever been run
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Completed_read_failure [60% left] (0-65535)
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> :( I do have a bad HDD. RMA is already in progress, I need to take out
> the drive and ship it to Samsung in Holland. Will print labels at work
> on Tuesday.
>
> Questions:
>
> * How do I remove this HDD without causing damage to the array? Is
> this the correct way?:
> mdadm --manage /dev/md0 --fail /dev/sdh1 # fail the device
> mdadm --manage /dev/md0 --remove /dev/sdh1 # remove the device
> * (shut down the system gracefully)
> * (remove the HDD)
> * (install new HDD)
> * (start system)
> sfdisk -d /dev/sde | sfdisk /dev/sdh # partition the new HDD
> mdadm --manage /dev/md0 --add /dev/sdh1 # add the partition to the array
>
> * After removing the HDD, should I do another scrub?
>
> Thanks a lot in advance!
>
> /Mathias
>
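Before failing a member by hand, it may be worth confirming that every other member is still up, since RAID6 only survives two missing devices. The following is a rough sketch under that assumption, not a statement of the recommended procedure. The function name `md_all_up` is hypothetical; it reads /proc/mdstat-style text on stdin and checks each `[n/m]` counter, such as the `[7/7]` shown above.

```sh
# md_all_up
# Succeeds (exit 0) only if the input contains at least one "[n/m]" counter
# and every such counter has m equal to n, i.e. no array is degraded.
# Hypothetical helper; it only reads text and changes nothing.
md_all_up() {
    awk '
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            # Strip the brackets and split "n/m" into c[1] and c[2].
            split(substr($0, RSTART + 1, RLENGTH - 2), c, "/")
            seen = 1
            if (c[1] != c[2]) bad = 1
        }
        END { exit (seen && !bad) ? 0 : 1 }
    '
}
```

It could be used as a guard, for example `md_all_up < /proc/mdstat && mdadm --manage /dev/md0 --fail /dev/sdh1`, with the device names taken from the message above.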

I think I need to hurry up:

[13957.348692] ata10.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
[13957.348704] ata10.00: failed command: READ FPDMA QUEUED
[13957.348716] ata10.00: cmd 60/08:00:00:ab:5a/00:00:e1:00:00/40 tag 0 ncq 4096 in
[13957.348719] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[13957.348724] ata10.00: status: { DRDY }
[13957.348730] ata10.00: failed command: WRITE FPDMA QUEUED
[13957.348741] ata10.00: cmd 61/08:08:a8:e7:8d/00:00:2e:00:00/40 tag 1 ncq 4096 out
[13957.348743] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[13957.348749] ata10.00: status: { DRDY }
[13957.348759] ata10: hard resetting link
[13962.835319] ata10: link is slow to respond, please be patient (ready=0)
[13967.368679] ata10: SRST failed (errno=-16)
[13967.368693] ata10: hard resetting link
[13970.988699] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[13971.002394] ata10.00: configured for UDMA/133
[13971.002407] ata10.00: device reported invalid CHS sector 0
[13971.002413] ata10.00: device reported invalid CHS sector 0
[13971.002427] ata10: EH complete
[14001.358848] ata10.00: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x6 frozen
[14001.358862] ata10.00: failed command: READ FPDMA QUEUED
[14001.358887] ata10.00: cmd 60/08:08:00:ab:5a/00:00:e1:00:00/40 tag 1 ncq 4096 in
[14001.358890] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[14001.358898] ata10.00: status: { DRDY }
[14001.358913] ata10: hard resetting link
[14006.845324] ata10: link is slow to respond, please be patient (ready=0)
[14011.378656] ata10: SRST failed (errno=-16)
[14011.378669] ata10: hard resetting link
[14016.865323] ata10: link is slow to respond, please be patient (ready=0)
[14021.398640] ata10: SRST failed (errno=-16)
[14021.398652] ata10: hard resetting link
[14026.885310] ata10: link is slow to respond, please be patient (ready=0)
[14029.925349] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[14029.939048] ata10.00: configured for UDMA/133
[14029.939061] ata10.00: device reported invalid CHS sector 0
[14029.939078] ata10: EH complete
[14060.345358] ata10.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[14060.345371] ata10.00: failed command: READ FPDMA QUEUED
[14060.345384] ata10.00: cmd 60/08:00:00:ab:5a/00:00:e1:00:00/40 tag 0 ncq 4096 in
[14060.345387] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[14060.345394] ata10.00: status: { DRDY }
[14060.345407] ata10: hard resetting link
[14065.831985] ata10: link is slow to respond, please be patient (ready=0)
[14070.365333] ata10: SRST failed (errno=-16)
[14070.365345] ata10: hard resetting link
[14074.625347] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[14074.639043] ata10.00: configured for UDMA/133
[14074.639056] ata10.00: device reported invalid CHS sector 0
[14074.639088] ata10: EH complete
[14105.358687] ata10.00: NCQ disabled due to excessive errors
[14105.358700] ata10.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[14105.358712] ata10.00: failed command: READ FPDMA QUEUED
[14105.358729] ata10.00: cmd 60/08:00:00:ab:5a/00:00:e1:00:00/40 tag 0 ncq 4096 in
[14105.358732] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[14105.358741] ata10.00: status: { DRDY }
[14105.358754] ata10: hard resetting link
[14110.845314] ata10: link is slow to respond, please be patient (ready=0)
[14115.378674] ata10: SRST failed (errno=-16)
[14115.378689] ata10: hard resetting link
[14119.372023] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[14119.385704] ata10.00: configured for UDMA/133
[14119.385716] ata10.00: device reported invalid CHS sector 0
[14119.385743] ata10: EH complete
[14121.527814] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[14121.527824] ata10.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable
[14121.527832] ata10.00: failed command: READ DMA EXT
[14121.527844] ata10.00: cmd 25/00:08:00:ab:5a/00:00:e1:00:00/e0 tag 0 dma 4096 in
[14121.527846] res 51/89:08:00:ab:5a/89:00:e1:00:00/e0 Emask 0x10 (ATA bus error)
[14121.527852] ata10.00: status: { DRDY ERR }
[14121.527857] ata10.00: error: { ICRC }
[14121.527867] ata10: hard resetting link
[14127.011973] ata10: link is slow to respond, please be patient (ready=0)
[14131.545295] ata10: SRST failed (errno=-16)
[14131.545307] ata10: hard resetting link
[14137.031984] ata10: link is slow to respond, please be patient (ready=0)
[14141.565306] ata10: SRST failed (errno=-16)
[14141.565317] ata10: hard resetting link
[14147.051993] ata10: link is slow to respond, please be patient (ready=0)
[14161.032035] INFO: task jbd2/dm-0-8:613 blocked for more than 120 seconds.
[14161.032044] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[14161.032050] jbd2/dm-0-8 D ffffffff81823020 0 613 2 0x00000000
[14161.032060] ffff8800c91e11a0 0000000000000046 0000000000000000 ffff8800cf0697c0
[14161.032070] ffff8800cf069780 ffff8800cf069780 ffff8800c93bdfd8 ffff8800c93bdfd8
[14161.032079] ffff8800c91e13d8 0000000000004000 ffff880094ed19c0 ffff8800c97e0688
[14161.032087] Call Trace:
[14161.032106] [] ? generic_make_request+0x2f7/0x570
[14161.032116] [] ? __wait_on_buffer+0x30/0x30
[14161.032124] [] ? io_schedule+0x57/0x80
[14161.032131] [] ? sleep_on_buffer+0xa/0x20
[14161.032137] [] ? __wait_on_bit+0x4f/0x80
[14161.032143] [] ? __wait_on_buffer+0x30/0x30
[14161.032150] [] ? out_of_line_wait_on_bit+0x7d/0xa0
[14161.032159] [] ? autoremove_wake_function+0x30/0x30
[14161.032168] [] ? jbd2_journal_commit_transaction+0x155e/0x16f0
[14161.032176] [] ? abort_exclusive_wait+0xb0/0xb0
[14161.032183] [] ? apic_timer_interrupt+0xe/0x20
[14161.032191] [] ? kjournald2+0xad/0x210
[14161.032198] [] ? abort_exclusive_wait+0xb0/0xb0
[14161.032205] [] ? commit_timeout+0x10/0x10
[14161.032212] [] ? kthread+0x7f/0x90
[14161.032219] [] ? kernel_thread_helper+0x4/0x10
[14161.032226] [] ? kthread_worker_fn+0x180/0x180
[14161.032233] [] ? gs_change+0xb/0xb
[14161.032261] INFO: task squid:1659 blocked for more than 120 seconds.
[14161.032264] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[14161.032269] squid D ffffffff81823020 0 1659 1657 0x00000000
[14161.032277] ffff8800c9179d60 0000000000000082 ffff8800c97b0688 ffffffff812bf074
[14161.032285] ffff880012b818c0 ffffffff81823020 ffff8800c22c5fd8 ffff8800c22c5fd8
[14161.032293] ffff8800c9179f98 0000000000004000 ffff8800c22c5fd8 0000000000000000
[14161.032301] Call Trace:

:-/

Currently shutting down all daemons that access the filesystem on the
array, in an attempt to umount the fs.
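The failed reads in the log above all carry the same address bytes in their `cmd` line, so it can help to turn that field into a decimal LBA. Here is a small sketch, assuming the usual libata layout of that field (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device), which is an assumption about the kernel's log format rather than something stated in this thread. `decode_lba` is a hypothetical name. Note that the log does not say which drive sits on ata10.

```sh
# decode_lba TASKFILE
# TASKFILE is the slash/colon separated hex string printed after "cmd" in a
# libata error report. Prints the 48-bit LBA in decimal.
# Hypothetical helper; the field order is an assumption (see text).
decode_lba() {
    old_ifs=$IFS
    IFS='/:'
    set -- $1          # split into the 12 hex bytes
    IFS=$old_ifs
    # $4..$6 = lbal, lbam, lbah; $9..${11} = hob_lbal, hob_lbam, hob_lbah
    echo $(( 0x${11}${10}${9}${6}${5}${4} ))
}
```

For the READ commands above, `decode_lba 60/08:00:00:ab:5a/00:00:e1:00:00/40` and `decode_lba 25/00:08:00:ab:5a/00:00:e1:00:00/e0` both give 3780815616, so under this reading every retry is hitting the same sector.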

Re: Do I have a bad HDD?

am 31.07.2011 20:14:21 von mathias.buren

On 31 July 2011 19:05, Mathias Burén wrote:
> On 31 July 2011 18:59, Mathias Burén wrote:
>> On 31 July 2011 14:05, Mathias Burén wrote:
>>> Hi list,
>>>
>>> Here's the output of my weekly script:
>>>
>>> DEV   EVENTS   REALL  PEND  UNCORR  CRC  RAW  ZONE  END
>>> sdb1  6158767  0      0     0       2    0    0
>>> sdc1  6158767  0      0     0       0    0    0
>>> sdd1  6158767  0      0     0       0    0    0
>>> sde1  6158767  0      0     0       0    0    1
>>> sdf1  6158767  0      0     0       0    47   6
>>> sdg1  6158767  0      0     0       0    0    0
>>> sdh1  6158767  0      6     0       0    340  3
>>>
>>>
>>> Personalities : [raid6] [raid5] [raid4]
>>> md0 : active raid6 sdf1[5] sdh1[6] sdg1[0] sde1[7] sdc1[3] sdd1[4] sdb1[1]
>>>       9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2
>>> [7/7] [UUUUUUU]
>>>
>>> unused devices:
>>>
>>>
>>> /dev/md0:
>>>         Version : 1.2
>>>   Creation Time : Tue Oct 19 08:58:41 2010
>>>      Raid Level : raid6
>>>      Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
>>>   Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
>>>    Raid Devices : 7
>>>   Total Devices : 7
>>>     Persistence : Superblock is persistent
>>>
>>>     Update Time : Sun Jul 31 09:50:43 2011
>>>           State : clean
>>>  Active Devices : 7
>>> Working Devices : 7
>>>  Failed Devices : 0
>>>   Spare Devices : 0
>>>
>>>          Layout : left-symmetric
>>>      Chunk Size : 64K
>>>
>>>            Name : ion:0  (local to host ion)
>>>            UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
>>>          Events : 6158767
>>>
>>>     Number   Major   Minor   RaidDevice State
>>>        0       8       97        0      active sync   /dev/sdg1
>>>        1       8       17        1      active sync   /dev/sdb1
>>>        4       8       49        2      active sync   /dev/sdd1
>>>        3       8       33        3      active sync   /dev/sdc1
>>>        5       8       81        4      active sync   /dev/sdf1
>>>        6       8      113        5      active sync   /dev/sdh1
>>>        7       8       65        6      active sync   /dev/sde1
>>>
>>> Here's the SMART data for sdh:
>>>
>>>
>>> smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.39-ck] (local build)
>>> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
>>>
>>> === START OF INFORMATION SECTION ===
>>> Model Family:     SAMSUNG SpinPoint F4 EG (AFT)
>>> Device Model:     SAMSUNG HD204UI
>>> Serial Number:    S2HGJ1RZ800850
>>> LU WWN Device Id: 5 0024e9 003f1ebc9
>>> Firmware Version: 1AQ10003
>>> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
>>> Sector Size:      512 bytes logical/physical
>>> Device is:        In smartctl database [for details use: -P show]
>>> ATA Version is:   8
>>> ATA Standard is:  ATA-8-ACS revision 6
>>> Local Time is:    Sun Jul 31 14:03:32 2011 IST
>>>
>>> ==> WARNING: Using smartmontools or hdparm with this
>>> drive may result in data loss due to a firmware bug.
>>> ****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
>>> Buggy and fixed firmware report same version number!
>>> See the following web pages for details:
>>> http://www.samsung.com/global/business/hdd/faqView.do?b2b_bbs_msg_id=386
>>> http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
>>>
>>> SMART support is: Available - device has SMART capability.
>>> SMART support is: Enabled
>>>
>>> === START OF READ SMART DATA SECTION ===
>>> SMART overall-health self-assessment test result: PASSED
>>>
>>> General SMART Values:
>>> Offline data collection status:  (0x82) Offline data collection activity
>>>                                         was completed without error.
>>>                                         Auto Offline Data Collection: Enabled.
>>> Self-test execution status:      (  37) The self-test routine was interrupted
>>>                                         by the host with a hard or soft reset.
>>> Total time to complete Offline
>>> data collection:                (20640) seconds.
>>> Offline data collection
>>> capabilities:                    (0x5b) SMART execute Offline immediate.
>>>                                         Auto Offline data collection on/off support.
>>>                                         Suspend Offline collection upon new
>>>                                         command.
>>>                                         Offline surface scan supported.
>>>                                         Self-test supported.
>>>                                         No Conveyance Self-test supported.
>>>                                         Selective Self-test supported.
>>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>>                                         power-saving mode.
>>>                                         Supports SMART auto save timer.
>>> Error logging capability:        (0x01) Error logging supported.
>>>                                         General Purpose Logging supported.
>>> Short self-test routine
>>> recommended polling time:        (   2) minutes.
>>> Extended self-test routine
>>> recommended polling time:        ( 255) minutes.
>>> SCT capabilities:              (0x003f) SCT Status supported.
>>>                                         SCT Error Recovery Control supported.
>>>                                         SCT Feature Control supported.
>>>                                         SCT Data Table supported.
>>>
>>> SMART Attributes Data Structure revision number: 16
>>> Vendor Specific SMART Attributes with Thresholds:
>>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>>   1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       340
>>>   2 Throughput_Performance  0x0026   055   053   000    Old_age   Always       -       18989
>>>   3 Spin_Up_Time            0x0023   067   044   025    Pre-fail  Always       -       10165
>>>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       18
>>>   5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
>>>   7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
>>>   8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
>>>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       6447
>>>  10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
>>>  11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
>>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
>>> 181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       10117271
>>> 191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1
>>> 192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
>>> 194 Temperature_Celsius     0x0002   064   057   000    Old_age   Always       -       35 (Min/Max 16/43)
>>> 195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
>>> 196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
>>> 197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       6
>>> 198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
>>> 199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
>>> 200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       3
>>> 223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
>>> 225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       21
>>>
>>> SMART Error Log Version: 1
>>> No Errors Logged
>>>
>>> SMART Self-test log structure revision number 1
>>> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
>>> # 1  Extended offline    Interrupted (host reset)      50%      6408         -
>>> # 2  Extended offline    Completed without error       00%      6317         -
>>> # 3  Extended offline    Completed without error       00%      6260         -
>>> # 4  Extended offline    Completed without error       00%      6232         -
>>> # 5  Extended offline    Completed without error       00%      6170         -
>>> # 6  Extended offline    Completed without error       00%      6064         -
>>> # 7  Extended offline    Completed without error       00%      6029         -
>>> # 8  Extended offline    Completed without error       00%      5898         -
>>> # 9  Extended offline    Aborted by host               60%      5893         -
>>> #10  Extended offline    Completed without error       00%      5728         -
>>> #11  Extended offline    Completed without error       00%      5706         -
>>> #12  Extended offline    Interrupted (host reset)      40%      5701         -
>>> #13  Extended offline    Interrupted (host reset)      90%      5666         -
>>> #14  Extended offline    Completed without error       00%      5560         -
>>> #15  Extended offline    Completed without error       00%      5527         -
>>> #16  Extended offline    Completed without error       00%      5392         -
>>> #17  Extended offline    Completed without error       00%      5357         -
>>> #18  Extended offline    Completed without error       00%      5250         -
>>> #19  Extended offline    Completed without error       00%      4272         -
>>> #20  Extended offline    Completed without error       00%      4017         -
>>> #21  Extended offline    Completed without error       00%      3935         -
>>>
>>> Note: selective self-test log revision number (0) not 1 implies that
>>> no selective self-test has ever been run
>>> SMART Selective self-test log data structure revision number 0
>>> Note: revision number not 1 implies that no selective self-test has
>>> ever been run
>>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>>>     1        0        0  Interrupted [50% left] (0-65535)
>>>     2        0        0  Not_testing
>>>     3        0        0  Not_testing
>>>     4        0        0  Not_testing
>>>     5        0        0  Not_testing
>>> Selective self-test flags (0x0):
>>>   After scanning selected spans, do NOT read-scan remainder of disk.
>>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>>
>>>
>>> It has 6 pending sectors. Why are they not reallocated? Can I force
>>> this somehow? (a scrub did not reallocate them) Is this enough to
>>> replace the HDD?
>>>
>>> Thanks,
>>> Mathias
>>>
>>
>> Uh oh, I did another scrub, and here's the status:
>>
>> DEV     EVENTS   REALL  PEND  UNCORR  CRC  RAW  ZONE  END
>> sdb1    6158767  0      0     0       2    0    0
>> sdc1    6158767  0      0     0       0    0    0
>> sdd1    6158767  0      0     0       0    0    0
>> sde1    6158767  0      0     0       0    0    1
>> sdf1    6158768  0      0     0       0    47   6
>> sdg1    6158767  0      0     0       0    0    0
>> sdh1    6158767  0      8     1       0    341  3
>>
>>
>> Personalities : [raid6] [raid5] [raid4]
>> md0 : active raid6 sdh1[6] sdg1[0] sdf1[5] sde1[7] sdd1[4] sdb1[1] sdc1[3]
>>       9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2 [7/7] [UUUUUUU]
>>
>> unused devices:
>>
>>
>> /dev/md0:
>>         Version : 1.2
>>   Creation Time : Tue Oct 19 08:58:41 2010
>>      Raid Level : raid6
>>      Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
>>   Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
>>    Raid Devices : 7
>>   Total Devices : 7
>>     Persistence : Superblock is persistent
>>
>>     Update Time : Sun Jul 31 18:51:58 2011
>>           State : clean
>>  Active Devices : 7
>> Working Devices : 7
>>  Failed Devices : 0
>>   Spare Devices : 0
>>
>>          Layout : left-symmetric
>>      Chunk Size : 64K
>>
>>            Name : ion:0  (local to host ion)
>>            UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
>>          Events : 6158767
>>
>>     Number   Major   Minor   RaidDevice State
>>        0       8       97        0      active sync   /dev/sdg1
>>        1       8       17        1      active sync   /dev/sdb1
>>        4       8       49        2      active sync   /dev/sdd1
>>        3       8       33        3      active sync   /dev/sdc1
>>        5       8       81        4      active sync   /dev/sdf1
>>        6       8      113        5      active sync   /dev/sdh1
>>        7       8       65        6      active sync   /dev/sde1
>>
>> smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.39-ck] (local build)
>> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
>>
>> === START OF INFORMATION SECTION ===
>> Model Family:     SAMSUNG SpinPoint F4 EG (AFT)
>> Device Model:     SAMSUNG HD204UI
>> Serial Number:    S2HGJ1RZ800850
>> LU WWN Device Id: 5 0024e9 003f1ebc9
>> Firmware Version: 1AQ10003
>> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
>> Sector Size:      512 bytes logical/physical
>> Device is:        In smartctl database [for details use: -P show]
>> ATA Version is:   8
>> ATA Standard is:  ATA-8-ACS revision 6
>> Local Time is:    Sun Jul 31 18:51:59 2011 IST
>>
>> ==> WARNING: Using smartmontools or hdparm with this
>> drive may result in data loss due to a firmware bug.
>> ****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
>> Buggy and fixed firmware report same version number!
>> See the following web pages for details:
>> http://www.samsung.com/global/business/hdd/faqView.do?b2b_bbs_msg_id=386
>> http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
>>
>> SMART support is: Available - device has SMART capability.
>> SMART support is: Enabled
>>
>> === START OF READ SMART DATA SECTION ===
>> SMART overall-health self-assessment test result: PASSED
>>
>> General SMART Values:
>> Offline data collection status:  (0x80) Offline data collection activity
>>                                         was never started.
>>                                         Auto Offline Data Collection: Enabled.
>> Self-test execution status:      ( 118) The previous self-test completed having
>>                                         the read element of the test failed.
>> Total time to complete Offline
>> data collection:                (20640) seconds.
>> Offline data collection
>> capabilities:                    (0x5b) SMART execute Offline immediate.
>>                                         Auto Offline data collection on/off support.
>>                                         Suspend Offline collection upon new
>>                                         command.
>>                                         Offline surface scan supported.
>>                                         Self-test supported.
>>                                         No Conveyance Self-test supported.
>>                                         Selective Self-test supported.
>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>                                         power-saving mode.
>>                                         Supports SMART auto save timer.
>> Error logging capability:        (0x01) Error logging supported.
>>                                         General Purpose Logging supported.
>> Short self-test routine
>> recommended polling time:        (   2) minutes.
>> Extended self-test routine
>> recommended polling time:        ( 255) minutes.
>> SCT capabilities:              (0x003f) SCT Status supported.
>>                                         SCT Error Recovery Control supported.
>>                                         SCT Feature Control supported.
>>                                         SCT Data Table supported.
>>
>> SMART Attributes Data Structure revision number: 16
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>   1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       341
>>   2 Throughput_Performance  0x0026   055   053   000    Old_age   Always       -       18989
>>   3 Spin_Up_Time            0x0023   067   044   025    Pre-fail  Always       -       10165
>>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       18
>>   5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
>>   7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
>>   8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
>>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       6452
>>  10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
>>  11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
>> 181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       10121757
>> 191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1
>> 192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
>> 194 Temperature_Celsius     0x0002   064   057   000    Old_age   Always       -       31 (Min/Max 16/43)
>> 195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
>> 196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
>> 197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       8
>> 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1
>> 199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
>> 200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       3
>> 223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
>> 225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       21
>>
>> SMART Error Log Version: 1
>> No Errors Logged
>>
>> SMART Self-test log structure revision number 1
>> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
>> # 1  Extended offline    Completed: read failure       60%      6452         1519520304
>> # 2  Extended offline    Interrupted (host reset)      50%      6408         -
>> # 3  Extended offline    Completed without error       00%      6317         -
>> # 4  Extended offline    Completed without error       00%      6260         -
>> # 5  Extended offline    Completed without error       00%      6232         -
>> # 6  Extended offline    Completed without error       00%      6170         -
>> # 7  Extended offline    Completed without error       00%      6064         -
>> # 8  Extended offline    Completed without error       00%      6029         -
>> # 9  Extended offline    Completed without error       00%      5898         -
>> #10  Extended offline    Aborted by host               60%      5893         -
>> #11  Extended offline    Completed without error       00%      5728         -
>> #12  Extended offline    Completed without error       00%      5706         -
>> #13  Extended offline    Interrupted (host reset)      40%      5701         -
>> #14  Extended offline    Interrupted (host reset)      90%      5666         -
>> #15  Extended offline    Completed without error       00%      5560         -
>> #16  Extended offline    Completed without error       00%      5527         -
>> #17  Extended offline    Completed without error       00%      5392         -
>> #18  Extended offline    Completed without error       00%      5357         -
>> #19  Extended offline    Completed without error       00%      5250         -
>> #20  Extended offline    Completed without error       00%      4272         -
>> #21  Extended offline    Completed without error       00%      4017         -
>>
>> Note: selective self-test log revision number (0) not 1 implies that
>> no selective self-test has ever been run
>> SMART Selective self-test log data structure revision number 0
>> Note: revision number not 1 implies that no selective self-test has
>> ever been run
>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>>     1        0        0  Completed_read_failure [60% left] (0-65535)
>>     2        0        0  Not_testing
>>     3        0        0  Not_testing
>>     4        0        0  Not_testing
>>     5        0        0  Not_testing
>> Selective self-test flags (0x0):
>>   After scanning selected spans, do NOT read-scan remainder of disk.
>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>
>> :( I do have a bad HDD. RMA is already in progress; I need to take out
>> the drive and ship it to Samsung in Holland. I will print labels at work
>> on Tuesday.
>>
>> Questions:
>>
>> * How do I remove this HDD without causing damage to the array? Is
>> this the correct way?
>> mdadm --manage /dev/md0 --fail /dev/sdh1 # fail the device
>> mdadm --manage /dev/md0 --remove /dev/sdh1 # remove the device
>> * (shut down the system gracefully)
>> * (remove the HDD)
>> * (install new HDD)
>> * (start system)
>> sfdisk -d /dev/sde | sfdisk /dev/sdh # copy sde's partition table to the new HDD
>> mdadm --manage /dev/md0 --add /dev/sdh1 # add the partition to the array
>>
>> * After removing the HDD, should I do another scrub?
>>
>> Thanks a lot in advance!
>>
>> /Mathias
>>
>
> I think I need to hurry up:
>
> [13957.348692] ata10.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
> [13957.348704] ata10.00: failed command: READ FPDMA QUEUED
> [13957.348716] ata10.00: cmd 60/08:00:00:ab:5a/00:00:e1:00:00/40 tag 0 ncq 4096 in
> [13957.348719]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> [13957.348724] ata10.00: status: { DRDY }
> [13957.348730] ata10.00: failed command: WRITE FPDMA QUEUED
> [13957.348741] ata10.00: cmd 61/08:08:a8:e7:8d/00:00:2e:00:00/40 tag 1 ncq 4096 out
> [13957.348743]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [13957.348749] ata10.00: status: { DRDY }
> [13957.348759] ata10: hard resetting link
> [13962.835319] ata10: link is slow to respond, please be patient (ready=0)
> [13967.368679] ata10: SRST failed (errno=-16)
> [13967.368693] ata10: hard resetting link
> [13970.988699] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [13971.002394] ata10.00: configured for UDMA/133
> [13971.002407] ata10.00: device reported invalid CHS sector 0
> [13971.002413] ata10.00: device reported invalid CHS sector 0
> [13971.002427] ata10: EH complete
> [14001.358848] ata10.00: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x6 frozen
> [14001.358862] ata10.00: failed command: READ FPDMA QUEUED
> [14001.358887] ata10.00: cmd 60/08:08:00:ab:5a/00:00:e1:00:00/40 tag 1 ncq 4096 in
> [14001.358890]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [14001.358898] ata10.00: status: { DRDY }
> [14001.358913] ata10: hard resetting link
> [14006.845324] ata10: link is slow to respond, please be patient (ready=0)
> [14011.378656] ata10: SRST failed (errno=-16)
> [14011.378669] ata10: hard resetting link
> [14016.865323] ata10: link is slow to respond, please be patient (ready=0)
> [14021.398640] ata10: SRST failed (errno=-16)
> [14021.398652] ata10: hard resetting link
> [14026.885310] ata10: link is slow to respond, please be patient (ready=0)
> [14029.925349] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [14029.939048] ata10.00: configured for UDMA/133
> [14029.939061] ata10.00: device reported invalid CHS sector 0
> [14029.939078] ata10: EH complete
> [14060.345358] ata10.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
> [14060.345371] ata10.00: failed command: READ FPDMA QUEUED
> [14060.345384] ata10.00: cmd 60/08:00:00:ab:5a/00:00:e1:00:00/40 tag 0 ncq 4096 in
> [14060.345387]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> [14060.345394] ata10.00: status: { DRDY }
> [14060.345407] ata10: hard resetting link
> [14065.831985] ata10: link is slow to respond, please be patient (ready=0)
> [14070.365333] ata10: SRST failed (errno=-16)
> [14070.365345] ata10: hard resetting link
> [14074.625347] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [14074.639043] ata10.00: configured for UDMA/133
> [14074.639056] ata10.00: device reported invalid CHS sector 0
> [14074.639088] ata10: EH complete
> [14105.358687] ata10.00: NCQ disabled due to excessive errors
> [14105.358700] ata10.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
> [14105.358712] ata10.00: failed command: READ FPDMA QUEUED
> [14105.358729] ata10.00: cmd 60/08:00:00:ab:5a/00:00:e1:00:00/40 tag 0 ncq 4096 in
> [14105.358732]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> [14105.358741] ata10.00: status: { DRDY }
> [14105.358754] ata10: hard resetting link
> [14110.845314] ata10: link is slow to respond, please be patient (ready=0)
> [14115.378674] ata10: SRST failed (errno=-16)
> [14115.378689] ata10: hard resetting link
> [14119.372023] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [14119.385704] ata10.00: configured for UDMA/133
> [14119.385716] ata10.00: device reported invalid CHS sector 0
> [14119.385743] ata10: EH complete
> [14121.527814] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> [14121.527824] ata10.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable
> [14121.527832] ata10.00: failed command: READ DMA EXT
> [14121.527844] ata10.00: cmd 25/00:08:00:ab:5a/00:00:e1:00:00/e0 tag 0 dma 4096 in
> [14121.527846]          res 51/89:08:00:ab:5a/89:00:e1:00:00/e0 Emask 0x10 (ATA bus error)
> [14121.527852] ata10.00: status: { DRDY ERR }
> [14121.527857] ata10.00: error: { ICRC }
> [14121.527867] ata10: hard resetting link
> [14127.011973] ata10: link is slow to respond, please be patient (ready=0)
> [14131.545295] ata10: SRST failed (errno=-16)
> [14131.545307] ata10: hard resetting link
> [14137.031984] ata10: link is slow to respond, please be patient (ready=0)
> [14141.565306] ata10: SRST failed (errno=-16)
> [14141.565317] ata10: hard resetting link
> [14147.051993] ata10: link is slow to respond, please be patient (ready=0)
> [14161.032035] INFO: task jbd2/dm-0-8:613 blocked for more than 120 seconds.
> [14161.032044] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [14161.032050] jbd2/dm-0-8     D ffffffff81823020     0   613      2 0x00000000
> [14161.032060]  ffff8800c91e11a0 0000000000000046 0000000000000000 ffff8800cf0697c0
> [14161.032070]  ffff8800cf069780 ffff8800cf069780 ffff8800c93bdfd8 ffff8800c93bdfd8
> [14161.032079]  ffff8800c91e13d8 0000000000004000 ffff880094ed19c0 ffff8800c97e0688
> [14161.032087] Call Trace:
> [14161.032106]  [] ? generic_make_request+0x2f7/0x570
> [14161.032116]  [] ? __wait_on_buffer+0x30/0x30
> [14161.032124]  [] ? io_schedule+0x57/0x80
> [14161.032131]  [] ? sleep_on_buffer+0xa/0x20
> [14161.032137]  [] ? __wait_on_bit+0x4f/0x80
> [14161.032143]  [] ? __wait_on_buffer+0x30/0x30
> [14161.032150]  [] ? out_of_line_wait_on_bit+0x7d/0xa0
> [14161.032159]  [] ? autoremove_wake_function+0x30/0x30
> [14161.032168]  [] ? jbd2_journal_commit_transaction+0x155e/0x16f0
> [14161.032176]  [] ? abort_exclusive_wait+0xb0/0xb0
> [14161.032183]  [] ? apic_timer_interrupt+0xe/0x20
> [14161.032191]  [] ? kjournald2+0xad/0x210
> [14161.032198]  [] ? abort_exclusive_wait+0xb0/0xb0
> [14161.032205]  [] ? commit_timeout+0x10/0x10
> [14161.032212]  [] ? kthread+0x7f/0x90
> [14161.032219]  [] ? kernel_thread_helper+0x4/0x10
> [14161.032226]  [] ? kthread_worker_fn+0x180/0x180
> [14161.032233]  [] ? gs_change+0xb/0xb
> [14161.032261] INFO: task squid:1659 blocked for more than 120 seconds.
> [14161.032264] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [14161.032269] squid           D ffffffff81823020     0  1659   1657 0x00000000
> [14161.032277]  ffff8800c9179d60 0000000000000082 ffff8800c97b0688 ffffffff812bf074
> [14161.032285]  ffff880012b818c0 ffffffff81823020 ffff8800c22c5fd8 ffff8800c22c5fd8
> [14161.032293]  ffff8800c9179f98 0000000000004000 ffff8800c22c5fd8 0000000000000000
> [14161.032301] Call Trace:
>
>
> :-/
>
> I am currently shutting down all daemons that access the filesystem on
> the array, in an attempt to unmount the fs.
>

Sorry for spamming like this, it's happening in real time. While I was
shutting down daemons, it looks like MD kicked the failing HDD out of
the array by itself, see:

[14401.032600] INFO: task flush-253:0:25632 blocked for more than 120 seconds.
[14401.032604] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[14401.032608] flush-253:0 D ffffffff81823020 0 25632 2 0x00000000
[14401.032615] ffff88003c02bac0 0000000000000046 ffffffff81425a7f ffff8800cf06f5c0
[14401.032623] ffff8800cf06f580 ffffffff81823020 ffff88001666ffd8 ffff88001666ffd8
[14401.032632] ffff88003c02bcf8 0000000000004000 ffff88001666ffd8 ffff88003c02bd00
[14401.032639] Call Trace:
[14401.032645] [] ? schedule+0x49f/0xcb0
[14401.032653] [] ? sched_clock_local+0x15/0x80
[14401.032662] [] ? get_active_stripe+0x307/0x6f0
[14401.032669] [] ? try_to_wake_up+0x210/0x210
[14401.032676] [] ? make_request+0x198/0x660
[14401.032683] [] ? __map_bio+0x4a/0x1d0
[14401.032690] [] ? abort_exclusive_wait+0xb0/0xb0
[14401.032697] [] ? md_make_request+0x103/0x250
[14401.032704] [] ? generic_make_request+0x2f7/0x570
[14401.032710] [] ? kmem_cache_alloc+0x169/0x180
[14401.032718] [] ? dm_get_live_table+0x44/0x60
[14401.032724] [] ? linear_merge+0x45/0x50
[14401.032731] [] ? submit_bio+0x6d/0x100
[14401.032738] [] ? ext4_io_submit+0x1c/0x50
[14401.032744] [] ? ext4_bio_write_page+0x121/0x370
[14401.032751] [] ? mpage_da_submit_io+0x347/0x450
[14401.032759] [] ? mpage_da_map_and_submit+0x1ce/0x420
[14401.032766] [] ? ext4_da_writepages+0x340/0x620
[14401.032774] [] ? blk_flush_plug_list+0xa7/0x250
[14401.032782] [] ? writeback_single_inode+0x10e/0x270
[14401.032789] [] ? writeback_sb_inodes+0xf1/0x1b0
[14401.032796] [] ? reschedule_interrupt+0xe/0x20
[14401.032803] [] ? writeback_inodes_wb+0x7b/0x150
[14401.032810] [] ? wb_writeback+0x493/0x4f0
[14401.032818] [] ? get_nr_inodes+0x42/0x60
[14401.032825] [] ? wb_check_old_data_flush+0x97/0xa0
[14401.032832] [] ? wb_do_writeback+0x16f/0x210
[14401.032839] [] ? init_timer_deferrable_key+0x10/0x10
[14401.032846] [] ? bdi_writeback_thread+0x7b/0x310
[14401.032852] [] ? __wake_up_common+0x49/0x80
[14401.032860] [] ? wb_do_writeback+0x210/0x210
[14401.032866] [] ? kthread+0x7f/0x90
[14401.032873] [] ? kernel_thread_helper+0x4/0x10
[14401.032880] [] ? kthread_worker_fn+0x180/0x180
[14401.032887] [] ? gs_change+0xb/0xb
[14404.915260] ata10: link is slow to respond, please be patient (ready=0)
[14409.448594] ata10: SRST failed (errno=-16)
[14409.448604] ata10: hard resetting link
[14414.935260] ata10: link is slow to respond, please be patient (ready=0)
[14431.895309] ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[14431.909004] ata10.00: configured for UDMA/33
[14431.909026] ata10: EH complete
[14434.040581] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[14434.040589] ata10.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable
[14434.040597] ata10.00: failed command: READ DMA EXT
[14434.040609] ata10.00: cmd 25/00:08:10:e0:59/00:00:b2:00:00/e0 tag 0 dma 4096 in
[14434.040612] res 51/89:08:10:e0:59/89:00:b2:00:00/e0 Emask 0x10 (ATA bus error)
[14434.040617] ata10.00: status: { DRDY ERR }
[14434.040622] ata10.00: error: { ICRC }
[14434.040631] ata10: hard resetting link
[14439.525256] ata10: link is slow to respond, please be patient (ready=0)
[14444.058589] ata10: SRST failed (errno=-16)
[14444.058600] ata10: hard resetting link
[14449.545254] ata10: link is slow to respond, please be patient (ready=0)
[14454.078594] ata10: SRST failed (errno=-16)
[14454.078604] ata10: hard resetting link
[14459.565256] ata10: link is slow to respond, please be patient (ready=0)
[14476.525289] ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[14476.579007] ata10.00: configured for UDMA/33
[14476.579038] sd 9:0:0:0: [sdh] Device not ready
[14476.579043] sd 9:0:0:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[14476.579049] sd 9:0:0:0: [sdh] Sense Key : Not Ready [current] [descriptor]
[14476.579057] Descriptor sense data with sense descriptors (in hex):
[14476.579061] 72 02 04 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[14476.579076] b2 59 e0 10
[14476.579083] sd 9:0:0:0: [sdh] Add. Sense: Logical unit not ready, cause not reportable
[14476.579091] sd 9:0:0:0: [sdh] CDB: Read(10): 28 00 b2 59 e0 10 00 00 08 00
[14476.579106] end_request: I/O error, dev sdh, sector 2992234512
[14476.579147] ata10: EH complete
[14484.352060] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[14484.352070] ata10.00: failed command: SMART
[14484.352082] ata10.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0
[14484.352084] res 40/00:08:10:e0:59/89:00:b2:00:00/e0 Emask 0x4 (timeout)
[14484.352090] ata10.00: status: { DRDY }
[14484.352099] ata10: hard resetting link
[14489.838590] ata10: link is slow to respond, please be patient (ready=0)
[14494.371940] ata10: SRST failed (errno=-16)
[14494.371951] ata10: hard resetting link
[14499.858587] ata10: link is slow to respond, please be patient (ready=0)
[14504.391923] ata10: SRST failed (errno=-16)
[14504.391933] ata10: hard resetting link
[14509.878583] ata10: link is slow to respond, please be patient (ready=0)
[14536.995640] nfsd: last server has exited, flushing export cache
[14539.425250] ata10: SRST failed (errno=-16)
[14539.425263] ata10: hard resetting link
[14544.431917] ata10: SRST failed (errno=-16)
[14544.431926] ata10: reset failed, giving up
[14544.431932] ata10.00: disabled
[14544.431973] ata10: EH complete
[14544.432042] sd 9:0:0:0: [sdh] Unhandled error code
[14544.432062] sd 9:0:0:0: [sdh] Unhandled error code
[14544.432080] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.432111] sd 9:0:0:0: [sdh]
[14544.432126] sd 9:0:0:0: [sdh] CDB: Read(10)Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.432163] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 e1 5a ab 00 00 00 08
[14544.432216] sd 9:0:0:0: [sdh] Unhandled error code
[14544.432232] 00
[14544.432240] sd 9:0:0:0: [sdh]
[14544.432253] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.432272] end_request: I/O error, dev sdh, sector 3780815616
[14544.432290] sd 9:0:0:0: [sdh] CDB:
[14544.432305] md/raid:md0: Disk failure on sdh1, disabling device.
[14544.432321] md/raid:md0: Operation continuing on 6 devices.
[14544.432336] Write(10): 2a 00 e1 5a ef 78 00 00 80 00
[14544.432389] end_request: I/O error, dev sdh, sector 3780833144
[14544.432412] : 28 00 b2 33 44 80 00 00 08 00
[14544.432433] end_request: I/O error, dev sdh, sector 2989704320
[14544.432453] sd 9:0:0:0: [sdh] Unhandled error code
[14544.432471] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.432498] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 e1 5a ef f8 00 00 80 00
[14544.432553] end_request: I/O error, dev sdh, sector 3780833272
[14544.432618] sd 9:0:0:0: [sdh] Unhandled error code
[14544.432639] sd 9:0:0:0: [sdh] Unhandled error code
[14544.432659] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.432687] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 b2 59 e0 10 00 00 08 00
[14544.432749] end_request: I/O error, dev sdh, sector 2992234512
[14544.432774] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.432785] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00
[14544.432800] sd 9:0:0:0: [sdh] Unhandled error code
[14544.432820] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.432849] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 e1 5a eb 00 00 00 80 00
[14544.432910] end_request: I/O error, dev sdh, sector 3780832000
[14544.432933] e1 5a f0 78 00 00 80 00
[14544.432949] end_request: I/O error, dev sdh, sector 3780833400
[14544.432965] sd 9:0:0:0: [sdh] Unhandled error code
[14544.432972] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.432982] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 e1 5a ef 00 00 00 78 00
[14544.433002] end_request: I/O error, dev sdh, sector 3780833024
[14544.433012] sd 9:0:0:0: [sdh] Unhandled error code
[14544.433033] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.433062] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 e1 5a f0 f8 00 00 80 00
[14544.433121] end_request: I/O error, dev sdh, sector 3780833528
[14544.433205] sd 9:0:0:0: [sdh] Unhandled error code
[14544.433227] sd 9:0:0:0: [sdh] Unhandled error code
[14544.433247] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.433276] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 e1 5a f1 78 00 00 80 00
[14544.433333] end_request: I/O error, dev sdh, sector 3780833656
[14544.433357] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.433366] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 2e 8d e7 b0 00 00 28 00
[14544.433387] end_request: I/O error, dev sdh, sector 781051824
[14544.433399] sd 9:0:0:0: [sdh] Unhandled error code
[14544.433408] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.433419] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 e1 5a f1 f8 00 00 80 00
[14544.433440] end_request: I/O error, dev sdh, sector 3780833784
[14544.433487] sd 9:0:0:0: [sdh] Unhandled error code
[14544.433493] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.433502] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 e1 5a f2 78 00 00 08 00
[14544.433523] end_request: I/O error, dev sdh, sector 3780833912
[14544.488761] RAID conf printout:
[14544.488772] --- level:6 rd:7 wd:6
[14544.488779] disk 0, o:1, dev:sdg1
[14544.488783] disk 1, o:1, dev:sdb1
[14544.488788] disk 2, o:1, dev:sdd1
[14544.488793] disk 3, o:1, dev:sdc1
[14544.488797] disk 4, o:1, dev:sdf1
[14544.488801] disk 5, o:0, dev:sdh1
[14544.488805] disk 6, o:1, dev:sde1
[14544.515279] RAID conf printout:
[14544.515289] --- level:6 rd:7 wd:6
[14544.515295] disk 0, o:1, dev:sdg1
[14544.515299] disk 1, o:1, dev:sdb1
[14544.515304] disk 2, o:1, dev:sdd1
[14544.515308] disk 3, o:1, dev:sdc1
[14544.515313] disk 4, o:1, dev:sdf1
[14544.515317] disk 6, o:1, dev:sde1
[14570.639011] nvidia 0000:00:03.5: PCI INT B disabled
[14570.639441] nvidia 0000:03:00.0: PCI INT A disabled
[14591.505699] HDA Intel 0000:00:08.0: PCI INT A disabled
~ $ dmesg > /srv/http/dmesg-failing-harddrive.log
~ $ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdh1[6](F) sdg1[0] sdf1[5] sde1[7] sdd1[4] sdb1[1] sdc1[3]
9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2
[7/6] [UUUUU_U]

unused devices:
~ $
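
A side note on reading the dmesg excerpt above: each failed Write(10) can be cross-checked against the "end_request" sector that follows it. In the standard SCSI Write(10) layout, byte 0 is the opcode (0x2a), bytes 2-5 are the big-endian logical block address, and bytes 7-8 are the transfer length in blocks. The sketch below is only an illustration; the function name decode_write10 is made up here and is not part of any tool used in this thread.

```python
def decode_write10(cdb_hex):
    """Return (lba, block_count) for a Write(10) CDB given as hex bytes.

    decode_write10 is a hypothetical helper written for this note.
    """
    cdb = bytes.fromhex(cdb_hex)  # spaces between the hex bytes are ignored
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte Write(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


if __name__ == "__main__":
    # CDB logged at 14544.432849; the kernel then reports sector 3780832000.
    print(decode_write10("2a 00 e1 5a eb 00 00 00 80 00"))
```

The same decoding suggests that the truncated "2a 00" at 14544.432785 and the bare "e1 5a f0 78 00 00 80 00" at 14544.432933 are one CDB whose output was interleaved with other messages, since that address matches the reported sector 3780833400.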

$ mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Tue Oct 19 08:58:41 2010
Raid Level : raid6
Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
Raid Devices : 7
Total Devices : 7
Persistence : Superblock is persistent

Update Time : Sun Jul 31 19:07:25 2011
State : clean, degraded
Active Devices : 6
Working Devices : 6
Failed Devices : 1
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 64K

Name : ion:0 (local to host ion)
UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
Events : 6158774

Number Major Minor RaidDevice State
0 8 97 0 active sync /dev/sdg1
1 8 17 1 active sync /dev/sdb1
4 8 49 2 active sync /dev/sdd1
3 8 33 3 active sync /dev/sdc1
5 8 81 4 active sync /dev/sdf1
5 0 0 5 removed
7 8 65 6 active sync /dev/sde1

6 8 113 - faulty spare /dev/sdh1
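
For reference, the "Array Size" reported above is consistent with the usual RAID6 capacity rule: two devices' worth of space goes to parity, so usable space is (Raid Devices - 2) x Used Dev Size. A minimal sketch of that arithmetic, with a made-up function name and the values taken from the mdadm -D output (mdadm reports these sizes in KiB):

```python
def raid6_array_size_kib(used_dev_size_kib, raid_devices):
    """Usable RAID6 capacity in KiB: two devices' worth of space holds parity.

    Hypothetical helper for illustration only.
    """
    if raid_devices < 4:
        raise ValueError("RAID6 needs at least 4 devices")
    return (raid_devices - 2) * used_dev_size_kib


if __name__ == "__main__":
    size_kib = raid6_array_size_kib(1950351360, 7)
    print(size_kib, "KiB =", size_kib / 1024**2, "GiB")
```

The capacity does not change while the array is degraded; only the remaining redundancy does.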

Shutting down the LV:

$ vgchange -an lvstorage
/dev/sdh: read failed after 0 of 4096 at 0: Input/output error
/dev/sdh: read failed after 0 of 4096 at 2000398843904: Input/output error
/dev/sdh: read failed after 0 of 4096 at 2000398925824: Input/output error
/dev/sdh: read failed after 0 of 4096 at 4096: Input/output error
/dev/sdh1: read failed after 0 of 4096 at 2000397795328: Input/output error
/dev/sdh1: read failed after 0 of 4096 at 2000397877248: Input/output error
/dev/sdh1: read failed after 0 of 4096 at 0: Input/output error
/dev/sdh1: read failed after 0 of 4096 at 4096: Input/output error
0 logical volume(s) in volume group "lvstorage" now active

Removing the bad HDD:

$ mdadm --manage /dev/md0 --remove /dev/sdh1
mdadm: hot removed /dev/sdh1 from /dev/md0

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdg1[0] sdf1[5] sde1[7] sdd1[4] sdb1[1] sdc1[3]
9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2
[7/6] [UUUUU_U]

unused devices:
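
The "[7/6] [UUUUU_U]" status line above can also be read mechanically, which may be handy in a monitoring script. The first bracket gives expected/active device counts, and each "_" in the second bracket marks an empty slot, in RaidDevice order. The following is a rough sketch; parse_md_status is a hypothetical name, not something from mdadm or from the weekly script mentioned earlier.

```python
import re


def parse_md_status(line):
    """Parse an mdstat status line such as '[7/6] [UUUUU_U]'.

    Returns (expected_devices, active_devices, missing_slot_indexes).
    Hypothetical helper for illustration only.
    """
    match = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
    if match is None:
        raise ValueError("no md status found in line")
    expected = int(match.group(1))
    active = int(match.group(2))
    flags = match.group(3)
    # Each '_' marks a slot with no working device.
    missing = [slot for slot, flag in enumerate(flags) if flag == "_"]
    return expected, active, missing


if __name__ == "__main__":
    print(parse_md_status("[7/6] [UUUUU_U]"))
```

For this array the missing slot is index 5, which agrees with "disk 5, o:0, dev:sdh1" in the kernel log and with RaidDevice 5 shown as "removed" by mdadm -D.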

$ mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Tue Oct 19 08:58:41 2010
Raid Level : raid6
Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
Raid Devices : 7
Total Devices : 6
Persistence : Superblock is persistent

Update Time : Sun Jul 31 19:13:02 2011
State : clean, degraded
Active Devices : 6
Working Devices : 6
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 64K

Name : ion:0 (local to host ion)
UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
Events : 6158779

Number Major Minor RaidDevice State
0 8 97 0 active sync /dev/sdg1
1 8 17 1 active sync /dev/sdb1
4 8 49 2 active sync /dev/sdd1
3 8 33 3 active sync /dev/sdc1
5 8 81 4 active sync /dev/sdf1
5 0 0 5 removed
7 8 65 6 active sync /dev/sde1

I'll shut down the system now and remove the HDD. I suppose I have just
one question left, then: should I re-scrub the array once it's back up
with 1 HDD removed?

Thanks again,
Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: Do I have a bad HDD?

on 31.07.2011 20:25:42 by Johannes Truschnigg

Hi Mathias,

PLEASE don't always quote everything you've just written a few minutes
ago when replying to yourself. This makes these responses really, really
inconvenient to read.

Concerning your problem at hand: Don't panic, everything should be fine.
Your RAID did what it is supposed to do: protect you from a single
drive failure without inducing downtime. A disk has failed, and now it's
time to replace it; that's nothing out of the ordinary. You don't need
to shut down your system - RAID systems have "no service interruption in
case of an accident" as a design goal, and md is rather good at meeting
just that.

If your SATA controller supports hotplug (and if it's SATA-300 running
in AHCI mode, it most likely does), just unplug the old drive, replace
it with a new one, clone your partition table setup (if any) from an old
drive to the new one, have md pick up the disk and integrate it into
your array, and see how everything'll be taken care of automatically. md
will resync the array once the new disk is part of the array, and it
will be smooth sailing again afterwards. Just make sure you don't pull
out the wrong drive, but correctly identify the broken one. ;)
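
The replacement steps described above could be written down as a dry-run plan before anything is touched. The sketch below only prints commands and runs nothing. The device names are assumptions for illustration (sdb as a healthy source disk, sdh as the replacement), and the sfdisk dump-and-restore pipe assumes the disks use a partition table format your sfdisk version can handle. replacement_commands is a hypothetical helper, not an mdadm feature.

```python
def replacement_commands(array, healthy_disk, new_disk, new_partition):
    """Build the shell commands for swapping in a replacement disk.

    Hypothetical helper: it returns strings and executes nothing, so the
    plan can be reviewed (and the device names double-checked) first.
    """
    return [
        # Copy the partition layout from a healthy member to the new disk.
        f"sfdisk -d {healthy_disk} | sfdisk {new_disk}",
        # Hand the new partition to md; recovery starts on its own.
        f"mdadm --manage {array} --add {new_partition}",
        # Watch the rebuild progress.
        "cat /proc/mdstat",
    ]


if __name__ == "__main__":
    for cmd in replacement_commands("/dev/md0", "/dev/sdb", "/dev/sdh", "/dev/sdh1"):
        print(cmd)
```

Printing the plan first is a cheap guard against the "wrong drive" mistake mentioned above, since the source and target of the partition-table copy are visible before anything is written.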

While the array's being worked on, you can do whatever you intended to
do with your seemingly healthy array from a few hours ago - the only
difference is that (some) things will go slower, but that's about it.
The only serious problem you could run into is a second and third
harddrive failing while your array isn't 100% OK again yet - but even in
that case, you'd have a backup ready, now wouldn't you? :)

--
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www: http://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp: johannes@truschnigg.info

Please do not bother me with HTML-eMail or attachments. Thank you.


Re: Do I have a bad HDD?

on 31.07.2011 20:31:40 by mathias.buren

On 31 July 2011 19:25, Johannes Truschnigg wrote:
> Hi Mathias,
>
> PLEASE don't always quote everything you've just written a few minutes
> ago when replying to yourself. This makes these responses really, really
> inconvenient to read.
>
> The only serious problem you could run into is a second and third
> harddrive failing while your array isn't 100% OK again yet - but even in
> that case, you'd have a backup ready, now wouldn't you? :)

Sorry about that!

No I don't have a backup ;), perhaps it's time to get one. This isn't
super important data, more like "nice to have".

I don't have any hotplug bays or anything like that (you'll see in a
while after I put up the pictures) so I shut down the system to take
the faulty HDD out.

Is it wise to run a check on the array while it's degraded by 1 HDD?

Thanks again,
/Mathias

Re: Do I have a bad HDD?

on 01.08.2011 01:42:42 by Phil Turmel

Good evening Mathias,

On 07/31/2011 02:31 PM, Mathias Burén wrote:

[trim]

> No I don't have a backup ;), perhaps it's time to get one. This isn't
> super important data, more like "nice to have".
>
> I don't have any hotplug bays or anything like that (you'll see in a
> while after I put up the pictures) so I shut down the system to take
> the faulty HDD out.
>
> Is it wise to run a check on the array while it's degraded by 1 HDD?

It shouldn't hurt your array, other than a bit of exercise. If another
drive is marginal, it could conceivably "push it over the edge".

Any positive value it has will occur anyways during the resync with the
replacement drive for sdh.

I wouldn't bother running the scrub until the resync is finished.

Phil

Re: Do I have a bad HDD?

on 01.08.2011 01:56:45 by mathias.buren

On 1 August 2011 00:42, Phil Turmel wrote:
> I wouldn't bother running the scrub until the resync is finished.

Thanks, I'll wait to scrub until I get the new HDD (hopefully within 2
weeks). Pics of the beast here: http://stuff3.imgur.com/htpcnas

/Mathias

Re: Do I have a bad HDD?

on 22.08.2011 19:10:26 by mathias.buren

On 1 August 2011 00:56, Mathias Burén wrote:
> Thanks, I'll wait to scrub until I get the new HDD (hopefully within 2
> weeks). Pics of the beast here: http://stuff3.imgur.com/htpcnas

New HDD received from Samsung via RMA. Adding it went fine, and the
array is now in recovery:

root@ion ~ $ mdadm --manage /dev/md0 --add /dev/sdh1
mdadm: added /dev/sdh1
root@ion ~ $ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdh1[8] sdf1[5] sde1[7] sdg1[0] sdd1[4] sdb1[1] sdc1[3]
9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2
[7/6] [UUUUU_U]
[>....................] recovery = 0.0% (87852/1950351360) finish=1109.9min speed=29284K/sec

unused devices:
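
As a sanity check on the "finish" estimate in the recovery line above: it is roughly the remaining 1K blocks divided by the current speed. A small sketch of that arithmetic, assuming the speed stays constant (it normally does not) and using a made-up function name:

```python
def recovery_eta_minutes(done_kib, total_kib, speed_kib_per_s):
    """Estimate remaining recovery time in minutes at a constant speed.

    Hypothetical helper; inputs are the numbers shown in /proc/mdstat.
    """
    remaining_kib = total_kib - done_kib
    return remaining_kib / speed_kib_per_s / 60


if __name__ == "__main__":
    eta = recovery_eta_minutes(87852, 1950351360, 29284)
    print(f"about {eta:.1f} minutes, or {eta / 60:.1f} hours")
```

With the values from the mdstat line this comes to about 1110 minutes, or roughly 18.5 hours, which is in line with the reported finish=1109.9min.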

Thanks for all the help.

/Mathias