Raid 5 - not clean and then a failure.

on 25.08.2009 09:54:49 by Jon Hardcastle

Guys,

I have been having some problems with my arrays that I think I have nailed down to a PCI controller (well, I say that: it is always the drives connected to *a* controller, but I have tried two!). Anyway, the latest saga is that I was trying some new kernel options last night, which didn't work.

But when I booted up again this morning, it said one of the drives was in an inconsistent state (I'm not sure of the *exact* error message). I then kicked off an add of the drive and it started syncing. It got about 5% in, and then the second drive on that controller complained and the array failed.

Is there any hope for my data? If I get a good controller in there, will the resync continue? Can I try to tell it to assume the drives are good (which they ought to be)?

Please help!

-----------------------
N: Jon Hardcastle
E: Jon@eHardcastle.com
'Do not worry about tomorrow, for tomorrow will bring worries of its own.'
-----------------------



--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: Raid 5 - not clean and then a failure.

on 25.08.2009 10:16:17 by Robin Hill

On Tue Aug 25, 2009 at 12:54:49AM -0700, Jon Hardcastle wrote:

> Guys,
>
> I have been having some problems with my arrays that I think i have
> nailed down to a pci controller (well I say that - it is always the
> drives connected to *a* controller but I have tried 2!) anyway the
> latest saga is i was trying some new kernel options last night - which
> didn't work.
>
Did they have the same chipset? I had problems with PCI controllers on
one of my systems, which turned out to be some sort of conflict between
the onboard chipset and the chipset on the controllers. I found a PCI
card with a different chipset and have had no issues since.

> But when i booted up again this morning it said one of the drives was
> in an inconsistent state (not sure of the *exact* error message). I
> then kicked off an add of the drive and it started syncing. It got
> about 5% in and then the second drive in on that controller complained
> and the array failed.
>
> Is there any hope for my data? If i get a good controller in there
> will the resync continue? can I try and tell it to assume the drives
> are good (which they ought to be)?
>
There's definitely hope. You can assemble the array (using the good
drives and the last drive to fail) using the --force option, then re-add
(and sync) the other drive (I'd recommend doing a fsck on the filesystem
as well). I've just had to do a similar thing myself after two drives
failed (overheated after a fan failure).
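
As a rough sketch of the steps described above (not a tested recovery procedure), the commands could be laid out as follows. The array name /dev/md0 and all member device names are placeholders that do not come from this thread, and the snippet only builds and prints the command lines; it does not run them.

```python
import shlex

def forced_assembly_plan(array, good_members, last_failed, first_failed):
    """Build (but do not run) the mdadm commands for a forced reassembly.

    All device names passed in are hypothetical placeholders.
    """
    # Step 1: stop any half-assembled array so its members are free.
    stop = ["mdadm", "--stop", array]
    # Step 2: force assembly from the good drives plus the drive that
    # failed last; the drive that dropped out first is left out.
    assemble = ["mdadm", "--assemble", "--force", array,
                *good_members, last_failed]
    # Step 3: put the stale drive back so that it gets resynced.
    re_add = ["mdadm", array, "--re-add", first_failed]
    return [stop, assemble, re_add]

plan = forced_assembly_plan(
    "/dev/md0",
    ["/dev/sda1", "/dev/sdb1", "/dev/sdc1", "/dev/sdd1"],
    last_failed="/dev/sde1",
    first_failed="/dev/sdf1",
)
for cmd in plan:
    print(shlex.join(cmd))
```

If the re-add is refused, a plain --add of the stale drive should trigger a full resync instead. The fsck of the filesystem would follow once the array is running again.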

Cheers,
Robin

Re: Raid 5 - not clean and then a failure.

on 25.08.2009 10:40:31 by Jon Hardcastle

--- On Tue, 25/8/09, Robin Hill wrote:

> Did they have the same chipset? I had problems with PCI controllers on
> one of my systems, which turned out to be some sort of conflict between
> the onboard chipset and the chipset on the controllers. I found a PCI
> card with a different chipset and have had no issues since.

They are/were cheap little VIA ones from 'aria'. I got a new one and
installed it alongside a week ago, with one drive on it to 'test'. When
my array came to do a scrub a week later, I got a whole host of issues.
I am not sure what the cause was, but now neither of the controllers
seems to work reliably. I have a PCI Express controller, but my kernel
doesn't (yet!) support PCI Express. Do you know if you can get SATA 3
on PCI? Or is it too slow?

> There's definitely hope. You can assemble the array (using the good
> drives and the last drive to fail) using the --force option, then re-add
> (and sync) the other drive (I'd recommend doing a fsck on the filesystem
> as well). I've just had to do a similar thing myself after two drives
> failed (overheated after a fan failure).

Thank you, thank you, thank you. I probably won't look at this for a few
days now. I find that sitting down without enough time to really see it
through is when I get problems, and I am a bit of a busy bee at the moment!



Re: Raid 5 - not clean and then a failure.

on 25.08.2009 11:34:37 by Robin Hill

On Tue Aug 25, 2009 at 01:40:31AM -0700, Jon Hardcastle wrote:

> They are/were cheapy little via ones from 'aria' I got a new one and
> installed it along side a week ago 1 drive on it to 'test', when my
> array came to do a scrub a week later I got a whole host of issues. I
> am not sure what the cause was but now either of the controllers seem
> work reliably. I have a pci express controller but my kernel doesnt
> (yet!) support pci express. Do you know of you can get sata 3 on pci?
> or is it too slow?
>
By SATA 3, I assume you're actually referring to SATA 3GBit/s (as SATA 3
has only just been ratified, and I doubt you can get it at all yet)?
You'll probably be able to find PCI cards that support it, but the
standard PCI bus (32-bit, 33MHz) only has a bandwidth of 1 GBit/s, so
can't even keep up with 1.5GBit/s SATA, let alone 3GBit/s. A 64-bit or
66MHz bus would do better though.
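
The arithmetic behind those figures can be sketched as below. This is only a back-of-the-envelope calculation of theoretical peaks: it assumes one bus transfer per clock cycle and the 8b/10b line encoding used by SATA, and it ignores protocol overhead and the fact that the PCI bus is shared by every device on it. The function names are made up for this illustration.

```python
def pci_bandwidth_mbit(bus_width_bits, clock_mhz):
    """Theoretical peak PCI bandwidth in Mbit/s (one transfer per clock)."""
    return bus_width_bits * clock_mhz

def sata_payload_mbit(line_rate_mbit):
    """Usable SATA bandwidth after 8b/10b encoding (8 data bits per 10 line bits)."""
    return line_rate_mbit * 8 / 10

standard_pci = pci_bandwidth_mbit(32, 33)  # 32-bit, 33MHz
wide_pci = pci_bandwidth_mbit(64, 33)      # 64-bit, 33MHz
fast_pci = pci_bandwidth_mbit(32, 66)      # 32-bit, 66MHz
sata_150 = sata_payload_mbit(1500)         # SATA 1.5GBit/s
sata_300 = sata_payload_mbit(3000)         # SATA 3GBit/s

for name, value in [
    ("PCI 32-bit/33MHz", standard_pci),
    ("PCI 64-bit/33MHz", wide_pci),
    ("PCI 32-bit/66MHz", fast_pci),
    ("SATA 1.5GBit/s payload", sata_150),
    ("SATA 3GBit/s payload", sata_300),
]:
    print(f"{name}: {value:.0f} Mbit/s (about {value / 8:.0f} MB/s)")
```

On these numbers the standard bus tops out at roughly 132 MB/s, below the 150 MB/s a single first-generation SATA link can carry, while either the wider or the faster bus doubles that.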

Cheers,
Robin

Re: Raid 5 - not clean and then a failure.

on 25.08.2009 15:47:05 by John Robinson

On 25/08/2009 09:40, Jon Hardcastle wrote:
> --- On Tue, 25/8/09, Robin Hill wrote:
>> On Tue Aug 25, 2009 at 12:54:49AM -0700, Jon Hardcastle wrote:
>>> I have been having some problems with my arrays that I think i have
>>> nailed down to a pci controller (well I say that - it is always the
>>> drives connected to *a* controller but I have tried 2!) anyway the
>>> latest saga is i was trying some new kernel options last night - which
>>> didn't work.
>>
>> Did they have the same chipset? I had problems with PCI controllers on
>> one of my systems, which turned out to be some sort of conflict between
>> the onboard chipset and the chipset on the controllers. I found a PCI
>> card with a different chipset and have had no issues since.
>
> They are/were cheapy little via ones

That's your problem right there. Well, in my experience and therefore
opinion, VIA stuff is all too often junk, or at least iffy enough never
to be trusted with anything professional or important.

[...]
> I have a pci express controller but my kernel doesnt (yet!) support pci express.

*How old* is your kernel?

Cheers,

John.

Re: Raid 5 - not clean and then a failure.

on 25.08.2009 16:11:40 by Jon Hardcastle

--- On Tue, 25/8/09, John Robinson wrote:

> > They are/were cheapy little via ones
>
> That's your problem right there. Well, in my experience and therefore
> opinion, VIA stuff is all too often junk, or at least iffy enough never
> to be trusted with anything professional or important.
>
> [...]
> > I have a pci express controller but my kernel doesnt (yet!) support pci express.
>
> *How old* is your kernel?

This is what I am finding. I plugged in a second controller and it seems
to have knackered the first one, such that I am now getting 'port too
slow to respond' from the drives connected to it.

I am looking at my options. Once I get PCI Express working I have
options. I am also looking at port multipliers.

(PS: support is IN my kernel code. I just run a trimmed-down kernel. I
didn't need PCI Express so I didn't enable it. Now I do :) )


Re: Raid 5 - not clean and then a failure.

on 26.08.2009 13:02:31 by Jon Hardcastle

--- On Tue, 25/8/09, Robin Hill wrote:

> There's definitely hope. You can assemble the array (using the good
> drives and the last drive to fail) using the --force option, then re-add
> (and sync) the other drive (I'd recommend doing a fsck on the filesystem
> as well). I've just had to do a similar thing myself after two drives
> failed (overheated after a fan failure).

It worked! I had to force the array to assemble, but it did. I had some
more problems with the controller, which I think were ultimately caused
by the two VIA controllers conflicting. I think removing them *both* and
booting up helped the computer to work out what was going on (I don't
know how). I also took down the 'minimum guaranteed' speed of the rebuild
to 50MB, as I think the 2 drives on the PCI/150 card were struggling. I
am not sure about this, as the array does a 'check' once a week and has
only ever failed last weekend. So basically I am not 100% sure what
caused this problem, but I do know I need to get a more stable way of
controlling these additional drives!

On a side note, if a 'repair' does everything a 'check' does but also
repairs it, is there any merit in just doing repairs?

Finally, has anyone here got a port multiplier working?



Re: Raid 5 - not clean and then a failure.

on 26.08.2009 13:18:41 by Goswin von Brederlow

Jon Hardcastle writes:

> Guys,
>
> I have been having some problems with my arrays that I think i have nailed down to a pci controller (well I say that - it is always the drives connected to *a* controller but I have tried 2!) anyway the latest saga is i was trying some new kernel options last night - which didn't work.
>
> But when i booted up again this morning it said one of the drives was in an inconsistent state (not sure of the *exact* error message). I then kicked off an add of the drive and it started syncing. It got about 5% in and then the second drive in on that controller complained and the array failed.
>
> Is there any hope for my data? If i get a good controller in there will the resync continue? can I try and tell it to assume the drives are good (which they ought to be)?
>
> Please help!

The inconsistency is probably just a block here or there, and I'm
assuming none of your drives actually failed. So 99.9999% of your data
should be there. Just rebooting might actually get your raid back
(to syncing). If not, then you have to force reassembly from the drives
with the newest serials (event counts). That will give you some data
corruption: whatever was being written when the controller gave errors.
Worst case, you have to recreate the raid with --assume-clean.
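
Assuming the "serials" above mean the per-device event counter that `mdadm --examine` prints, a small sketch for comparing members could look like this. The sample text is made up for illustration and contains only the one field of interest; real output has many more fields and the exact layout can differ between metadata versions, so treat the regular expression as an assumption.

```python
import re

# Matches a line such as "         Events : 4821" (integer form assumed).
EVENTS_RE = re.compile(r"^\s*Events\s*:\s*(\d+)\s*$", re.MULTILINE)

def event_count(examine_output):
    """Pull the event counter out of one device's `mdadm --examine` text."""
    match = EVENTS_RE.search(examine_output)
    if match is None:
        raise ValueError("no Events line found")
    return int(match.group(1))

def rank_members(examine_by_device):
    """Return (device, events) pairs, freshest first."""
    counts = {dev: event_count(text) for dev, text in examine_by_device.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Hypothetical devices and counts, not taken from this thread.
sample = {
    "/dev/sda1": "         Events : 4821\n",
    "/dev/sdb1": "         Events : 4821\n",
    "/dev/sde1": "         Events : 4819\n",
    "/dev/sdf1": "         Events : 4702\n",
}

ranked = rank_members(sample)
for device, events in ranked:
    print(device, events)
```

The members at the top of the list are the candidates for a forced assembly; the one that lags furthest behind is the drive to add back and resync afterwards.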

I recommend adding a bitmap to the raid. That way a wrongfully failed
drive can be resynced in a matter of minutes instead of hours or
days, which makes it way less likely that another error occurs during
resync.

Regards,
Goswin

Re: Raid 5 - not clean and then a failure.

on 26.08.2009 13:29:39 by Jon Hardcastle

--- On Wed, 26/8/09, Goswin von Brederlow wrote:

> I recommend adding a bitmap to the raid. That way a wrongfully failed
> drive can be resynced in a matter of minutes instead of hours or
> days. Makes it way less likely another error occurs during resync.

I did look into bitmaps *a bit*. I could easily have the image for my
6-drive RAID 5 stored on the RAID 1 I have in the same system. The
googling I did, though, did not paint a pretty picture: it talked about
huge performance hits?


Re: Raid 5 - not clean and then a failure.

on 26.08.2009 14:47:13 by John Robinson

On 26/08/2009 12:29, Jon Hardcastle wrote:
[...]
> I did look into bitmaps *abit* i could easily have the imagine for my 6 drive raid 5 stored on the raid1 I have in the same system.. The googling I did tho did not paint a pretty picture it talked about huge performance hits?

There is a performance hit but it can be minimised by picking a bitmap
chunk size to suit; I ended up getting about 80% of bitmap-less write
performance using a 16MB bitmap chunk size instead of about 40% with the
default size, on my 3-drive RAID-5 array.
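
For reference, a minimal sketch of how such a bitmap might be added and later removed with mdadm. It assumes an internal bitmap, an array called /dev/md0 (a placeholder), and that --bitmap-chunk takes its value in KiB; the snippet only builds and prints the command lines without running them.

```python
import shlex

def bitmap_commands(array, chunk_mib=16):
    """Build (but do not run) mdadm commands to add and remove a bitmap."""
    # Convert the chunk size to KiB, the unit assumed for --bitmap-chunk.
    chunk_kib = chunk_mib * 1024
    add = ["mdadm", "--grow", array,
           "--bitmap=internal", f"--bitmap-chunk={chunk_kib}"]
    remove = ["mdadm", "--grow", array, "--bitmap=none"]
    return add, remove

add_cmd, remove_cmd = bitmap_commands("/dev/md0")
print(shlex.join(add_cmd))
print(shlex.join(remove_cmd))
```

Changing the chunk size later would mean removing the bitmap and adding it again with the new value, so it is worth benchmarking a couple of sizes on the actual array.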

Cheers,

John.


Re: Raid 5 - not clean and then a failure.

on 26.08.2009 16:14:31 by Ryan Wagoner

Wouldn't weekly RAID consistency checks reveal a bad block before you
had a failure that required the need to do a full resync? It only
takes 3 hours to resync my 3 x 1TB drives and having a bitmap would
reduce the performance. I've never had to have a resync in the year
I've had the array up. I just wonder if the performance drawback is
worth having the bitmap to save a possible resync once every couple
years. Or are the RAID consistency checks not reliable enough to
prevent more errors during a resync?

Ryan


Re: Raid 5 - not clean and then a failure.

on 26.08.2009 16:19:51 by Jon Hardcastle

Can a bitmap be easily removed? I might give it a go if it can.

I am never sure how thorough these checks are. Are they read/write, or
just read, for example? I make a point of doing read/write badblocks
checks with e2fsck -cc when I do run them (not the automatic ones
though; I don't know how), but that only checks that partition, which is
on LVM, which is on RAID, so WHO KNOWS what underlying drives are being
checked.

I have before now dismantled the array and run read/write badblocks
directly on the constituent drives, so at least SMART is aware of them.
Although I aim to do this once every six months, I think I have
actually done it only once in the 2-year life of the array.


Re: Raid 5 - not clean and then a failure.

on 26.08.2009 16:33:16 by Robin Hill

On Wed Aug 26, 2009 at 10:14:31AM -0400, Ryan Wagoner wrote:

> Wouldn't weekly RAID consistency checks reveal a bad block before you
> had a failure that required the need to do a full resync? It only
> takes 3 hours to resync my 3 x 1TB drives and having a bitmap would
> reduce the performance. I've never had to have a resync in the year
> I've had the array up. I just wonder if the performance drawback is
> worth having the bitmap to save a possible resync once every couple
> years. Or are the RAID consistency checks not reliable enough to
> prevent more errors during a resync?
>
If your system is that stable, then bitmaps will be a waste of time for
you. A lot of people have hardware/software issues which cause drives
to be kicked out of arrays occasionally, or arrays to fail to shut down
cleanly. A bitmap will save time when adding the drive back into the
array in these cases.

Cheers,
Robin

Re: Raid 5 - not clean and then a failure.

on 26.08.2009 16:50:22 by Robin Hill

On Wed Aug 26, 2009 at 07:19:51AM -0700, Jon Hardcastle wrote:

> Can a bitmap be easily removed? I might give it ago if it can.
>
Yes - you can add/remove a bitmap at any time (on a non-degraded array).

> I am never sure how thorough these checks are. Are they read/write, or
> just read? for example. I make of point of doing read/write badblocks
> checks with e2fck -cc when I do run them (not the automatic ones tho -
> dunno how) but that only checks that partition, which is on LVM, which
> is on RAID so WHO KNOWS what underlying drives are being checked.
>
My understanding is that the md "check" action does a read-only check,
verifying that the parity is valid for the data. The "repair" action
will rewrite the parity if it's not valid. Neither of these will
write to the data blocks, or to any valid parity blocks.
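
A minimal sketch of how these actions are usually triggered through sysfs, assuming the md attribute files sync_action and mismatch_cnt. On a real system the directory would be something like /sys/block/md0/md (md0 is a placeholder) and writing there needs root, so the demonstration below runs against a temporary directory standing in for it. The helper names are made up for this example.

```python
import tempfile
from pathlib import Path

def start_scrub(md_dir, action="check"):
    """Ask md to start a 'check' or 'repair' pass by writing to sync_action."""
    if action not in ("check", "repair"):
        raise ValueError("action must be 'check' or 'repair'")
    (Path(md_dir) / "sync_action").write_text(action + "\n")

def mismatch_count(md_dir):
    """Read the mismatch counter left behind by the last pass."""
    return int((Path(md_dir) / "mismatch_cnt").read_text().strip())

# Demonstration against a stand-in directory rather than real sysfs.
with tempfile.TemporaryDirectory() as fake_md:
    (Path(fake_md) / "mismatch_cnt").write_text("0\n")
    start_scrub(fake_md, "check")
    print((Path(fake_md) / "sync_action").read_text().strip())
    print(mismatch_count(fake_md))
```

On a live array the counter is only meaningful once the pass has finished, so a real script would poll sync_action until it reports idle before reading it.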

Running e2fsck -cc should do a read/write check. This will only check
the filesystem data blocks though (and only on ext2/ext3 filesystems of
course), so will miss the LVM metadata and the RAID parity and metadata.

> I have before now, dismantled the array and run read/write badblocks
> directly on the constituent drives so at least smart is aware of them
> and although i aim to do this once every six months, I think I have
> actually done it only 1nce in the 2 year life of the array.
>
If you mean running badblocks in read/write mode, that'll be a
destructive test then. In this case, you're trading the risk of a
failure on one disk for the risk of a failure on one of the others
during rebuild.

You could also run background SMART tests (though this has caused drives
to be kicked out of the array on some occasions for me) - these look to
be mostly read-only tests again (though I'm not 100% sure on that).

Cheers,
Robin

Re: Raid 5 - not clean and then a failure.

on 26.08.2009 22:34:30 by Goswin von Brederlow

Jon Hardcastle writes:

> --- On Wed, 26/8/09, Goswin von Brederlow wrote:
>
>> From: Goswin von Brederlow
>> Subject: Re: Raid 5 - not clean and then a failure.
>> To: Jon@eHardcastle.com
>> Cc: linux-raid@vger.kernel.org
>> Date: Wednesday, 26 August, 2009, 12:18 PM
>> Jon Hardcastle writes:
>>
>> > Guys,
>> >
>> > I have been having some problems with my arrays that I think i have
>> > nailed down to a pci controller (well I say that - it is always the
>> > drives connected to *a* controller but I have tried 2!) anyway the
>> > latest saga is i was trying some new kernel options last night - which
>> > didn't work.
>> >
>> > But when i booted up again this morning it said one of the drives was
>> > in an inconsistent state (not sure of the *exact* error message). I
>> > then kicked off an add of the drive and it started syncing. It got
>> > about 5% in and then the second drive in on that controller complained
>> > and the array failed.
>> >
>> > Is there any hope for my data? If i get a good controller in there
>> > will the resync continue? can I try and tell it to assume the drives
>> > are good (which they ought to be)?
>> >
>> > Please help!
>>
>> The inconsistency is probably just a block here or there and I'm
>> assuming none of your drives actually failed. So 99.9999% of your data
>> should be there. Just rebooting might actually get your raid back (to
>> syncing). If not then you have to force reassembly from the drives
>> with the newest serials. That will give you some data corruption,
>> whatever was being written when the controller gave errors. Worst case
>> you have to recreate the raid with --assume-clean.
>>
>> I recommend adding a bitmap to the raid. That way a wrongfully failed
>> drive can be resynced in a matter of minutes instead of hours or
>> days. Makes it way less likely another error occurs during resync.
>>
>> MfG
>>         Goswin
>>
>
> I did look into bitmaps *a bit*. I could easily have the image for my
> 6 drive raid 5 stored on the raid1 I have in the same system. The
> googling I did, though, did not paint a pretty picture; it talked
> about huge performance hits?

That depends a lot on the bitmap size.

It also depends on the frequency of errors. If your controller has a
hiccup once a week, causing a drive to fail, and you need 1 day to
rebuild the array, you will be left with a double disk failure pretty
quickly without bitmaps.
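
To put rough numbers on that, here is a back-of-the-envelope sketch. It
assumes hiccups arrive independently at a steady average rate (a Poisson
model, which is my assumption, not a measurement), and it treats any
hiccup during a rebuild as fatal, which is pessimistic. The 5 minute
bitmap resync time is likewise only an illustrative guess.

```python
import math


def p_hit_during_rebuild(hiccups_per_week: float, rebuild_hours: float) -> float:
    """Probability of at least one further hiccup while a rebuild is running."""
    expected = hiccups_per_week * rebuild_hours / (7 * 24)
    return 1.0 - math.exp(-expected)


def p_hit_within(rebuilds: int, hiccups_per_week: float, rebuild_hours: float) -> float:
    """Probability that at least one of `rebuilds` rebuilds is interrupted."""
    survive_one = 1.0 - p_hit_during_rebuild(hiccups_per_week, rebuild_hours)
    return 1.0 - survive_one ** rebuilds


if __name__ == "__main__":
    # One hiccup a week; 52 rebuilds is roughly a year of weekly failures.
    print("full resync, 24 h:", p_hit_within(52, 1.0, 24.0))
    print("bitmap resync, 5 min (assumed):", p_hit_within(52, 1.0, 5.0 / 60.0))
```

Under these assumptions a single 24 hour rebuild has about a 13% chance
of being hit, and over 52 rebuilds it is close to certain, whereas a
resync lasting a few minutes keeps the yearly figure at a few percent.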

MfG
Goswin

Re: Raid 5 - not clean and then a failure.

on 26.08.2009 22:35:59 by Goswin von Brederlow

Ryan Wagoner writes:

> Wouldn't weekly RAID consistency checks reveal a bad block before you
> had a failure that required the need to do a full resync? It only
> takes 3 hours to resync my 3 x 1TB drives and having a bitmap would
> reduce the performance. I've never had to have a resync in the year
> I've had the array up. I just wonder if the performance drawback is
> worth having the bitmap to save a possible resync once every couple
> years. Or are the RAID consistency checks not reliable enough to
> prevent more errors during a resync?
>
> Ryan

Bitmaps don't protect against disk failures. They help with
intermittent failures, usually caused by the controller. If you don't
have intermittent failures then bitmaps will only cost you for no
benefit.
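
A toy model of why that is, heavily simplified: the class below just
remembers which chunks were written while a member was missing, so that
only those need copying when the drive comes back. The class and method
names are invented for this sketch; the real md bitmap also marks chunks
before every write and clears them afterwards, which is where the write
cost comes from.

```python
from typing import Set


class WriteIntentBitmap:
    """Toy model: one flag per chunk, set when the chunk is written
    while an array member is absent."""

    def __init__(self, total_chunks: int) -> None:
        self.total_chunks = total_chunks
        self.dirty: Set[int] = set()

    def record_write(self, chunk: int) -> None:
        """Mark a chunk as needing resync on the absent member."""
        if not 0 <= chunk < self.total_chunks:
            raise ValueError("chunk out of range")
        self.dirty.add(chunk)

    def chunks_to_resync(self, have_bitmap: bool = True) -> int:
        """With a bitmap only dirty chunks are copied; without one, all are."""
        return len(self.dirty) if have_bitmap else self.total_chunks


if __name__ == "__main__":
    bitmap = WriteIntentBitmap(total_chunks=1_000_000)
    for chunk in (10, 11, 11, 500_000):  # writes during a controller hiccup
        bitmap.record_write(chunk)
    print("with bitmap:", bitmap.chunks_to_resync())
    print("without bitmap:", bitmap.chunks_to_resync(have_bitmap=False))
```

If the drive really has died, the replacement is empty and every chunk
has to be rebuilt anyway, so the bitmap gains nothing in that case.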

MfG
Goswin