OK, Now this is really weird

on 26.02.2011 08:00:42 by Leslie Rhorer

I have a pair of drives each of whose 3 partitions are members of a
set of 3 RAID arrays. One of the two drives had a flaky power connection
which I thought I had fixed, but I guess not, because the drive was taken
offline again on Tuesday. The significant issue, however, is that both
times the drive failed, mdadm behaved really oddly. The first time I
thought it might just be some odd anomaly, but the second time it did
precisely the same thing. Both times, when the drive was de-registered by
udev, the first two arrays properly responded to the failure, but the third
array did not. Here is the layout:

ARRAY /dev/md1 metadata=0.90 UUID=4cde286c:0687556a:4d9996dd:dd23e701
ARRAY /dev/md2 metadata=1.2 name=Backup:2 UUID=d45ff663:9e53774c:6fcf9968:21692025
ARRAY /dev/md3 metadata=1.2 name=Backup:3 UUID=51d22c47:10f58974:0b27ef04:5609d357


Here is the result of examining the live partitions:

/dev/sdl1:
Magic : a92b4efc
Version : 0.90.00
UUID : 4cde286c:0687556a:4d9996dd:dd23e701 (local to host Backup)
Creation Time : Fri Jun 11 20:45:51 2010
Raid Level : raid1
Used Dev Size : 6144704 (5.86 GiB 6.29 GB)
Array Size : 6144704 (5.86 GiB 6.29 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 1

Update Time : Sat Feb 26 00:47:19 2011
State : clean
Internal Bitmap : present
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Checksum : c127a1bf - correct
Events : 1014


Number Major Minor RaidDevice State
this 1 8 177 1 active sync /dev/sdl1

0 0 0 0 0 removed
1 1 8 177 1 active sync /dev/sdl1


/dev/sdl2:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : d45ff663:9e53774c:6fcf9968:21692025
Name : Backup:2 (local to host Backup)
Creation Time : Sat Dec 19 22:59:43 2009
Raid Level : raid1
Raid Devices : 2

Avail Dev Size : 554884828 (264.59 GiB 284.10 GB)
Array Size : 554884828 (264.59 GiB 284.10 GB)
Data Offset : 272 sectors
Super Offset : 8 sectors
State : clean
Device UUID : e0896263:c0f95d43:9c0cb92a:79a95210

Internal Bitmap : 8 sectors from superblock
Update Time : Sat Feb 26 00:47:18 2011
Checksum : 41881e60 - correct
Events : 902752


Device Role : Active device 1
Array State : .A ('A' == active, '.' == missing)


/dev/sdl3:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 51d22c47:10f58974:0b27ef04:5609d357
Name : Backup:3 (local to host Backup)
Creation Time : Sat May 29 14:16:22 2010
Raid Level : raid1
Raid Devices : 2

Avail Dev Size : 409593096 (195.31 GiB 209.71 GB)
Array Size : 409593096 (195.31 GiB 209.71 GB)
Data Offset : 144 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 982c9519:48d21940:3720b6d5:dfb0a312

Internal Bitmap : 8 sectors from superblock
Update Time : Wed Feb 9 20:02:26 2011
Checksum : 6c78f4a2 - correct
Events : 364740


Device Role : Active device 1
Array State : AA ('A' == active, '.' == missing)


Here are the array details:

/dev/md1:
Version : 0.90
Creation Time : Fri Jun 11 20:45:51 2010
Raid Level : raid1
Array Size : 6144704 (5.86 GiB 6.29 GB)
Used Dev Size : 6144704 (5.86 GiB 6.29 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 1
Persistence : Superblock is persistent

Intent Bitmap : Internal

Update Time : Sat Feb 26 00:53:23 2011
State : active, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0

UUID : 4cde286c:0687556a:4d9996dd:dd23e701 (local to host Backup)
Events : 0.1016

Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 177 1 active sync /dev/sdl1

2 8 161 - faulty spare
/dev/md2:
Version : 1.2
Creation Time : Sat Dec 19 22:59:43 2009
Raid Level : raid1
Array Size : 277442414 (264.59 GiB 284.10 GB)
Used Dev Size : 277442414 (264.59 GiB 284.10 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent

Intent Bitmap : Internal

Update Time : Sat Feb 26 00:53:47 2011
State : active, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0

Name : Backup:2 (local to host Backup)
UUID : d45ff663:9e53774c:6fcf9968:21692025
Events : 902890

Number Major Minor RaidDevice State
0 0 0 0 removed
3 8 178 1 active sync /dev/sdl2

2 8 162 - faulty spare
/dev/md3:
Version : 1.2
Creation Time : Sat May 29 14:16:22 2010
Raid Level : raid1
Array Size : 204796548 (195.31 GiB 209.71 GB)
Used Dev Size : 204796548 (195.31 GiB 209.71 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent

Intent Bitmap : Internal

Update Time : Wed Feb 9 20:02:26 2011
State : active, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0

Name : Backup:3 (local to host Backup)
UUID : 51d22c47:10f58974:0b27ef04:5609d357
Events : 364740

Number Major Minor RaidDevice State
2 8 163 0 faulty spare rebuilding
3 8 179 1 active sync /dev/sdl3

So what gives? /dev/sdk3 no longer even exists, so why hasn't it
been failed and removed on /dev/md3 like it has on /dev/md1 and /dev/md2?


Re: OK, Now this is really weird

on 26.02.2011 08:36:11 by Jeff Woods

Quoting Leslie Rhorer :
> I have a pair of drives each of whose 3 partitions are members of a
> set of 3 RAID arrays. One of the two drives had a flaky power connection
> which I thought I had fixed, but I guess not, because the drive was taken
> offline again on Tuesday. The significant issue, however, is that both
> times the drive failed, mdadm behaved really oddly. The first time I
> thought it might just be some odd anomaly, but the second time it did
> precisely the same thing. Both times, when the drive was de-registered by
> udev, the first two arrays properly responded to the failure, but the third
> array did not. Here is the layout:

[snip lots of technical details]

> So what gives? /dev/sdk3 no longer even exists, so why hasn't it
> been failed and removed on /dev/md3 like it has on /dev/md1 and /dev/md2?

Is it possible there has been no I/O request for /dev/md3 since
/dev/sdk failed?
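
A quick way to test that theory, assuming the usual /proc and /sys interfaces
are available on the host, would be to look at md3's block-layer I/O counters;
an all-zero or unchanging md3 line would support the "no I/O" idea:

grep -w md3 /proc/diskstats    # reads/writes completed on md3 since boot
cat /sys/block/md3/stat        # the same counters, exposed per block device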
--
Jeff Woods

RE: OK, Now this is really weird

on 26.02.2011 12:20:49 by Leslie Rhorer

> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Jeff Woods
> Sent: Saturday, February 26, 2011 1:36 AM
> To: lrhorer@satx.rr.com
> Cc: 'Linux RAID'
> Subject: Re: OK, Now this is really weird
>
> Quoting Leslie Rhorer :
> > I have a pair of drives each of whose 3 partitions are members of a
> > set of 3 RAID arrays. One of the two drives had a flaky power
> connection
> > which I thought I had fixed, but I guess not, because the drive was
> taken
> > offline again on Tuesday. The significant issue, however, is that both
> > times the drive failed, mdadm behaved really oddly. The first time I
> > thought it might just be some odd anomaly, but the second time it did
> > precisely the same thing. Both times, when the drive was de-registered
> by
> > udev, the first two arrays properly responded to the failure, but the
> third
> > array did not. Here is the layout:
>
> [snip lots of technical details]
>
> > So what gives? /dev/sdk3 no longer even exists, so why hasn't it
> > been failed and removed on /dev/md3 like it has on /dev/md1 and
> /dev/md2?
>
> Is it possible there has been no I/O request for /dev/md3 since
> /dev/sdk failed?

Well, I thought about that. It's swap space, so I suppose it's
possible. I would have thought, however, that mdadm would fail a missing
member whether there is any I/O or not.
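
Whether the swap on md3 has been touched at all since boot can be checked from
userspace as well (a sketch, using the device names from this thread):

cat /proc/swaps    # lists each swap device with its priority and the KiB in use
swapon -s          # the same summary from the swapon utility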


Re: OK, Now this is really weird

on 26.02.2011 12:35:11 by mathias.buren

On 26 February 2011 11:20, Leslie Rhorer wrote:
>
>
>> -----Original Message-----
>> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
>> owner@vger.kernel.org] On Behalf Of Jeff Woods
>> Sent: Saturday, February 26, 2011 1:36 AM
>> To: lrhorer@satx.rr.com
>> Cc: 'Linux RAID'
>> Subject: Re: OK, Now this is really weird
>>
>> Quoting Leslie Rhorer :
>> >     I have a pair of drives each of whose 3 partitions are members of a
>> > set of 3 RAID arrays.  One of the two drives had a flaky power
>> connection
>> > which I thought I had fixed, but I guess not, because the drive was
>> taken
>> > offline again on Tuesday.  The significant issue, however, is that both
>> > times the drive failed, mdadm behaved really oddly.  The first time I
>> > thought it might just be some odd anomaly, but the second time it did
>> > precisely the same thing.  Both times, when the drive was de-registered
>> by
>> > udev, the first two arrays properly responded to the failure, but the
>> third
>> > array did not.  Here is the layout:
>>
>> [snip lots of technical details]
>>
>> >     So what gives?  /dev/sdk3 no longer even exists, so why hasn't it
>> > been failed and removed on /dev/md3 like it has on /dev/md1 and
>> /dev/md2?
>>
>> Is it possible there has been no I/O request for /dev/md3 since
>> /dev/sdk failed?
>
>        Well, I thought about that.  It's swap space, so I suppose it's
> possible.  I would have thought, however, that mdadm would fail a missing
> member whether there is any I/O or not.
>

I thought so as well. But how will mdadm know if the device is faulty,
unless the device is generating errors? (which usually only happens on
read and/or write)
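
One way to provoke exactly those errors on an otherwise idle array, sketched
here rather than taken from this thread, is to make md read every member by
starting a scrub; recent mdadm can also fail members whose device node has
vanished without waiting for I/O (the 'detached' keyword, if the installed
version supports it):

echo check > /sys/block/md3/md/sync_action   # forces reads of both mirrors; the vanished member should error out and be kicked
mdadm /dev/md3 --fail detached               # or explicitly mark any no-longer-attached member as failed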

// Mathias

Re: OK, Now this is really weird

on 26.02.2011 22:34:33 by NeilBrown

On Sat, 26 Feb 2011 11:35:11 +0000 Mathias Burén wrote:

> On 26 February 2011 11:20, Leslie Rhorer wrote:
> >
> >
> >> -----Original Message-----
> >> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> >> owner@vger.kernel.org] On Behalf Of Jeff Woods
> >> Sent: Saturday, February 26, 2011 1:36 AM
> >> To: lrhorer@satx.rr.com
> >> Cc: 'Linux RAID'
> >> Subject: Re: OK, Now this is really weird
> >>
> >> Quoting Leslie Rhorer :
> >> >     I have a pair of drives each of whose 3 partitions are members of a
> >> > set of 3 RAID arrays.  One of the two drives had a flaky power
> >> connection
> >> > which I thought I had fixed, but I guess not, because the drive was
> >> taken
> >> > offline again on Tuesday.  The significant issue, however, is that both
> >> > times the drive failed, mdadm behaved really oddly.  The first time I
> >> > thought it might just be some odd anomaly, but the second time it did
> >> > precisely the same thing.  Both times, when the drive was de-registered
> >> by
> >> > udev, the first two arrays properly responded to the failure, but the
> >> third
> >> > array did not.  Here is the layout:
> >>
> >> [snip lots of technical details]
> >>
> >> >     So what gives?  /dev/sdk3 no longer even exists, so why hasn't it
> >> > been failed and removed on /dev/md3 like it has on /dev/md1 and
> >> /dev/md2?
> >>
> >> Is it possible there has been no I/O request for /dev/md3 since
> >> /dev/sdk failed?
> >
> >         Well, I thought about that.  It's swap space, so I suppose it's
> > possible.  I would have thought, however, that mdadm would fail a missing
> > member whether there is any I/O or not.
> >
>
> I thought so as well. But how will mdadm know if the device is faulty,
> unless the device is generating errors? (which usually only happens on
> read and/or write)

With very recent mdadm the command

mdadm -If sdXX

will find any md array that has /dev/sdXX as a member and will fail and
remove it.
Note the device name is 'sdXX', not '/dev/something'. This is because, at
the time you want to do this, udev has probably removed all trace
from /dev, so you need to use the name mentioned in /proc/mdstat
or /sys/block/mdXX/md/dev-$DEVNAME.

You can set up a udev rule to run mdadm like this automatically when a device
is hot-unplugged, something like:

SUBSYSTEM=="block", ACTION=="remove", RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
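
For example (device names as in this thread; the rules-file name below is only
an illustration, not a distro standard), the member name md still remembers can
be read back even after udev has dropped the /dev node, and the rule can live
in a local udev rules file:

cat /proc/mdstat                 # lists member names as md knows them, e.g. sdk3
ls /sys/block/md3/md/            # the dev-* entries name the members, e.g. dev-sdk3
mdadm -If sdk3                   # fail and remove the vanished member from every array holding it

# e.g. /etc/udev/rules.d/64-md-incremental-remove.rules
# (relies on udev having populated ID_PATH for the device, as it normally does for disks)
SUBSYSTEM=="block", ACTION=="remove", RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"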

NeilBrown

RE: OK, Now this is really weird

on 27.02.2011 08:15:41 by Leslie Rhorer

> >> >     So what gives?  /dev/sdk3 no longer even exists, so why hasn't it
> >> > been failed and removed on /dev/md3 like it has on /dev/md1 and
> >> /dev/md2?
> >>
> >> Is it possible there has been no I/O request for /dev/md3 since
> >> /dev/sdk failed?
> >
> >         Well, I thought about that.  It's swap space, so I suppose it's
> > possible.  I would have thought, however, that mdadm would fail a
> missing
> > member whether there is any I/O or not.
> >
>
> I thought so as well. But how will mdadm know if the device is faulty,
> unless the device is generating errors? (which usually only happens on
> read and/or write)

Well, reading here, I believe I have seen posts talking about mdadm waking
up sleeping spindles periodically, thereby killing part of the power saving
functions of "green" drives. Have those posts been in error? It's been
days since the drive "failed".


RE: OK, Now this is really weird

on 27.02.2011 08:22:41 by Leslie Rhorer

> > >> >     So what gives?  /dev/sdk3 no longer even exists, so why hasn't
> it
> > >> > been failed and removed on /dev/md3 like it has on /dev/md1 and
> > >> /dev/md2?
> > >>
> > >> Is it possible there has been no I/O request for /dev/md3 since
> > >> /dev/sdk failed?
> > >
> > >         Well, I thought about that.  It's swap space, so I suppose it's
> > > possible.  I would have thought, however, that mdadm would fail a
> missing
> > > member whether there is any I/O or not.
> > >
> >
> > I thought so as well. But how will mdadm know if the device is faulty,
> > unless the device is generating errors? (which usually only happens on
> > read and/or write)
>
> With very recent mdadm the command
>
> mdadm -If sdXX
>
> will find any md array that has /dev/sdXX as a member and will fail and
> remove it.

No, it's version 3.1.4, and that gives me a "Device or Resource
busy" error. It does report that it set sdk3 faulty, but the hot remove
fails.

So how can I remove the drive (so I can add it back)?


Re: OK, Now this is really weird

on 27.02.2011 08:57:04 by NeilBrown

On Sun, 27 Feb 2011 01:22:41 -0600 "Leslie Rhorer" wrote:

>
> > > >> >     So what gives?  /dev/sdk3 no longer even exists, so why hasn't
> > it
> > > >> > been failed and removed on /dev/md3 like it has on /dev/md1 and
> > > >> /dev/md2?
> > > >>
> > > >> Is it possible there has been no I/O request for /dev/md3 since
> > > >> /dev/sdk failed?
> > > >
> > > >         Well, I thought about that.  It's swap space, so I suppose it's
> > > > possible.  I would have thought, however, that mdadm would fail a
> > missing
> > > > member whether there is any I/O or not.
> > > >
> > >
> > > I thought so as well. But how will mdadm know if the device is faulty,
> > > unless the device is generating errors? (which usually only happens on
> > > read and/or write)
> >
> > With very recent mdadm the command
> >
> > mdadm -If sdXX
> >
> > will find any md array that has /dev/sdXX as a member and will fail and
> > remove it.
>
> No, it's version 3.1.4, and that gives me a "Device or Resource
> busy" error. It does report that it set sdk3 faulty, but the hot remove
> fails.
>
> So how can I remove the drive (so I can add it back)?

Maybe:
mdadm /dev/md2 --remove failed
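
A possible full sequence for the array in question, sketched from the mdadm
man page rather than verified here (the partition may reappear under a
different device name once the drive re-enumerates):

mdadm /dev/md3 --fail detached --remove detached   # drop the member whose device node has gone away
mdadm /dev/md3 --re-add /dev/sdk3                  # once the drive is back; the internal bitmap should keep the resync short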

NeilBrown