Drives disappearing from /dev/ during surface scan

on 24.06.2010 17:16:02 by John Hendrikx

Hello all,

I'm wondering if anyone could share some insight into a problem I'm having.

The problem is that every week, one or two hard drives simply disappear
(from /dev/) during the weekly Wednesday morning long surface scan
triggered by smartctl. The scan starts at 6 am; one week the drives
dropped at 6:30 am, and the next week at 8:30 am (the surface scans
take, I think, ~3 hours).

Messages in syslog are similar to this:

> Jun 23 08:27:58 Ukyo kernel: ata3: hard resetting link
> Jun 23 08:27:59 Ukyo kernel: ata3: SATA link down (SStatus 0 SControl 300)
> Jun 23 08:28:04 Ukyo kernel: ata3: hard resetting link
> Jun 23 08:28:04 Ukyo kernel: ata3: SATA link down (SStatus 0 SControl 300)
> Jun 23 08:28:09 Ukyo kernel: ata3: hard resetting link
> Jun 23 08:28:09 Ukyo kernel: ata3: SATA link down (SStatus 0 SControl 300)
> Jun 23 08:28:09 Ukyo kernel: ata3.00: disabled
> Jun 23 08:28:09 Ukyo kernel: sd 3:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> Jun 23 08:28:09 Ukyo kernel: sd 3:0:0:0: [sdc] Sense Key : Aborted Command [current] [descriptor]
> Jun 23 08:28:09 Ukyo kernel: Descriptor sense data with sense descriptors (in hex):
> Jun 23 08:28:09 Ukyo kernel: 72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00
> Jun 23 08:28:09 Ukyo kernel: 0f ff ff ff
> Jun 23 08:28:09 Ukyo kernel: sd 3:0:0:0: [sdc] Add. Sense: Scsi parity error
> Jun 23 08:28:09 Ukyo kernel: sd 3:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
> Jun 23 08:28:09 Ukyo kernel: md: super_written gets error=-5, uptodate=0
> Jun 23 08:28:09 Ukyo kernel: ata3: EH complete
> Jun 23 08:28:09 Ukyo kernel: ata3.00: detaching (SCSI 3:0:0:0)
> Jun 23 08:28:09 Ukyo kernel: sd 3:0:0:0: [sdc] Stopping disk
> Jun 23 08:28:09 Ukyo kernel: sd 3:0:0:0: [sdc] START_STOP FAILED
> Jun 23 08:28:09 Ukyo kernel: sd 3:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
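
To see at a glance which ports are dropping, log excerpts like the one above can be tallied per ATA port. The following is only a rough sketch: the function name count_link_down is made up for illustration, and it assumes the kernel messages keep the "ataN: SATA link down" wording shown in this log.

```sh
# Count "SATA link down" events per ATA port in kernel log text on stdin.
# Prints one "<port> <count>" line per port, sorted by port name.
# count_link_down is a hypothetical helper name, not an existing tool.
count_link_down() {
    awk '/SATA link down/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^ata[0-9]+:$/) {   # field such as "ata3:"
                     port = $i
                     sub(/:$/, "", port)      # drop the trailing colon
                     hits[port]++
                 }
         }
         END { for (p in hits) print p, hits[p] }' | sort
}

# Usage (the log path is an assumption and varies by distribution):
#   count_link_down < /var/log/syslog
```

A port that shows up week after week would point at that drive, cable or connector; drops that move between ports would fit a shared cause such as power.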

Trying to get the drive back by using:

echo "- - -" > /sys/class/scsi_host/hostX/scan

has no effect (in the log, ata3 gets rescanned but no drives are found):

> Jun 24 16:36:21 Ukyo kernel: ata3: hard resetting link
> Jun 24 16:36:22 Ukyo kernel: ata3: SATA link down (SStatus 0 SControl 300)
> Jun 24 16:36:22 Ukyo kernel: ata3: EH complete
> Jun 24 16:36:31 Ukyo kernel: ata4: hard resetting link
> Jun 24 16:36:32 Ukyo kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> Jun 24 16:36:32 Ukyo kernel: ata4.00: configured for UDMA/133
> Jun 24 16:36:32 Ukyo kernel: ata4: EH complete
> Jun 24 16:36:32 Ukyo kernel: sd 4:0:0:0: [sdd] 1953525168 512-byte hardware sectors (1000205 MB)
> Jun 24 16:36:32 Ukyo kernel: sd 4:0:0:0: [sdd] Write Protect is off
> Jun 24 16:36:32 Ukyo kernel: sd 4:0:0:0: [sdd] Write cache: disabled, read cache: enabled, doesn't support DPO or

Rebooting the system returns /dev/sdc (ata3) to working order, and
re-adding it to the array results in a short repair and everything is
good again for another week.
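
For reference, the rescan attempt mentioned above can be written as a loop over every SCSI host instead of a single hostX. This is a minimal sketch: the function name rescan_scsi_hosts and its optional directory argument are illustrative, and writing to the real sysfs files needs root.

```sh
# Ask every SCSI host to rescan all channels, targets and LUNs.
# The optional first argument is the sysfs class directory (hypothetical
# parameter, handy for dry runs); it defaults to /sys/class/scsi_host.
rescan_scsi_hosts() {
    root=${1:-/sys/class/scsi_host}
    for scan in "$root"/host*/scan; do
        [ -e "$scan" ] || continue     # glob matched nothing
        echo "- - -" > "$scan"         # wildcard channel, target, LUN
    done
}

# Usage (as root):
#   rescan_scsi_hosts
```

Note the redirection into the scan file; without the ">" the echo only prints its arguments and nothing is rescanned.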

This only happens during the surface scan, and has been hard to
reproduce with just regular server use (copying, array rebuilding, etc.).

I suspect it may be a power issue, so I'm supplying some more numbers.
The PSU is rated for 460 W in total: 165 W on the 5 V rail and 312 W on
the 12 V rail. There are 10 drives in there, all recent models (1 TB+).
System temperature is normal (four 12 cm fans installed, not counting
the PSU one).
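
As a rough sanity check on those numbers, the rail ratings can be converted to amps and divided over the 10 drives. This is only back-of-the-envelope arithmetic: the helper names are made up, and the per-drive figure ignores everything else on the 12 V rail (motherboard, CPU, fans), so the real headroom is smaller.

```sh
# Convert a rail's wattage rating to amps: amps = watts / volts.
rail_amps() {
    awk -v w="$1" -v v="$2" 'BEGIN { printf "%.1f\n", w / v }'
}

# Even share of a rail per drive, assuming nothing else draws from it.
amps_per_drive() {
    awk -v w="$1" -v v="$2" -v n="$3" 'BEGIN { printf "%.2f\n", w / v / n }'
}

rail_amps 312 12          # 12 V rail: prints 26.0
rail_amps 165 5           # 5 V rail:  prints 33.0
amps_per_drive 312 12 10  # 12 V share per drive: prints 2.60
```

The 2.60 A result is an upper bound to compare against the 12 V current listed on each drive's label or datasheet.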

I am, however, somewhat skeptical about the power theory, as I used to
have another server that would hit its power limiter during a cold start
(i.e., it would not power on because the spin-up cycle caused an
overload) -- however, that server would still power up fully when forced
to start by simply powering it on 2 or 3 times quickly in a row. It
would run stable for months once it managed to spin up all drives.

Any insight into why this might occur would be appreciated. I'm
currently considering spreading the weekly surface scans out a bit to
prevent this, but would rather find out what the real issue is. Other
things I'm considering are replacing two drives with one (2x 1 TB ->
1x 2 TB) to reduce the power load a bit... or finding a PSU that is
rated a bit higher.
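
Spreading the scans out could look roughly like the sketch below, which prints one smartd.conf line per drive with the long self-test assigned to consecutive weekdays. It assumes the weekly scan is scheduled through smartd's "-s" directive (format T/MM/DD/d/HH, with 1 = Monday ... 7 = Sunday); the function name and the device names are placeholders.

```sh
# Print one smartd.conf line per device, assigning long self-tests to
# consecutive weekdays (1 = Monday ... 7 = Sunday) at a fixed hour.
# stagger_long_tests is a hypothetical helper; pass the hour without a
# leading zero (6, not 06).
stagger_long_tests() {
    hour=$1; shift
    day=1
    for dev in "$@"; do
        printf '%s -a -s L/../../%d/%02d\n' "$dev" "$day" "$hour"
        day=$(( day % 7 + 1 ))         # wrap from Sunday back to Monday
    done
}

# Example with placeholder device names:
stagger_long_tests 6 /dev/sda /dev/sdb /dev/sdc
```

With 10 drives and 7 weekdays, three days still get two drives at the same hour, so a second hour slot would be needed to keep every scan fully separate.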

--John

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html