Re: Bug#624343: linux-image-2.6.38-2-amd64: frequent message "bio too big device md0 (248 > 240)"

on 29.04.2011 06:39:40 by Ben Hutchings

On Wed, 2011-04-27 at 09:19 -0700, Jameson Graef Rollins wrote:
> Package: linux-2.6
> Version: 2.6.38-3
> Severity: normal
>
> As you can see from the kern.log snippet below, I am seeing frequent
> messages reporting "bio too big device md0 (248 > 240)".
>
> I run what I imagine is a fairly unusual disk setup on my laptop,
> consisting of:
>
> ssd -> raid1 -> dm-crypt -> lvm -> ext4
>
> I use the raid1 as a backup. The raid1 operates normally in degraded
> mode. For backups I then hot-add a usb hdd, let the raid1 sync, and
> then fail/remove the external hdd.

Well, this is not expected to work. Possibly the hot-addition of a disk
with different bio restrictions should be rejected. But I'm not sure,
because it is safe to do that if there is no mounted filesystem or
stacking device on top of the RAID.
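
For illustration, the per-device limits that have to be reconciled can
be read from sysfs (device names below are examples, not necessarily
the reporter's):

    # maximum request size each queue advertises, in KiB
    cat /sys/block/sda/queue/max_hw_sectors_kb   # internal ssd
    cat /sys/block/sdb/queue/max_hw_sectors_kb   # usb hdd (usb-storage caps this, e.g. 120)
    cat /sys/block/md0/queue/max_hw_sectors_kb   # what md0 advertises upwards

The "248 > 240" in the log message is the same kind of mismatch
expressed in 512-byte sectors: a 248-sector bio, built against the
limit the stack saw earlier, arrives at a queue that now accepts only
240 sectors.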

I would recommend using filesystem-level backup (e.g. dirvish or
backuppc). Aside from this bug, if the SSD fails during a RAID resync
you will be left with an inconsistent and therefore useless 'backup'.

> I started noticing these messages after my last sync. I have not
> rebooted since.
>
> I found a bug report on the launchpad that describes an almost
> identical situation:
>
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/320638
>
> The reporter seemed to be concerned that there may be data loss
> happening. I have not yet noticed any, but of course I'm terrified
> that it's happening and I just haven't found it yet. Unfortunately
> the bug was closed with a "Won't Fix" without any resolution.
>
> Is this a kernel bug, or is there something I can do to remedy the
> situation? I haven't tried to reboot yet to see if the messages stop.
> I'm obviously most worried about data loss. Please advise!

The block layer correctly returns an error after logging this message.
If it's due to a read operation, the error should be propagated up to
the application that tried to read. If it's due to a write operation, I
would expect the error to result in the RAID becoming desynchronised.
In some cases it might be propagated to the application that tried to
write.
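
The check (and the message) comes from the bio submission path in the
block layer; abridged from 2.6.38-era block/blk-core.c:

    /* in __generic_make_request(): a bio larger than the target
     * queue's current limit is not submitted but failed outright */
    if (unlikely(bio_sectors(bio) > queue_max_hw_sectors(q))) {
            printk(KERN_ERR "bio too big device %s (%u > %u)\n",
                   bdevname(bio->bi_bdev, b),
                   bio_sectors(bio),
                   queue_max_hw_sectors(q));
            goto end_io;    /* completes the bio with -EIO */
    }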

If the error is somehow discarded then there *is* a kernel bug with the
risk of data loss.

> I am starting to suspect that these messages are in fact associated with
> data loss on my system. I have witnessed these messages occur during
> write operations to the disk, and I have also started to see some
> strange behavior on my system. dhclient started acting weird after
> these messages appeared (not holding on to leases) and I started to
> notice database exceptions in my mail client.
>
> Interestingly, the messages seem to have gone away after reboot. I will
> watch closely to see if they return after my next raid1 sync.

Ben.

--
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.

Re: Bug#624343: linux-image-2.6.38-2-amd64: frequent message "bio too big device md0 (248 > 240)"

on 02.05.2011 00:06:45 by Jameson Graef Rollins

On Fri, 29 Apr 2011 05:39:40 +0100, Ben Hutchings wrote:
> On Wed, 2011-04-27 at 09:19 -0700, Jameson Graef Rollins wrote:
> > I run what I imagine is a fairly unusual disk setup on my laptop,
> > consisting of:
> >
> > ssd -> raid1 -> dm-crypt -> lvm -> ext4
> >
> > I use the raid1 as a backup. The raid1 operates normally in degraded
> > mode. For backups I then hot-add a usb hdd, let the raid1 sync, and
> > then fail/remove the external hdd.
>
> Well, this is not expected to work. Possibly the hot-addition of a disk
> with different bio restrictions should be rejected. But I'm not sure,
> because it is safe to do that if there is no mounted filesystem or
> stacking device on top of the RAID.

Hi, Ben. Can you explain why this is not expected to work? Which part
exactly is not expected to work and why?

> I would recommend using filesystem-level backup (e.g. dirvish or
> backuppc). Aside from this bug, if the SSD fails during a RAID resync
> you will be left with an inconsistent and therefore useless 'backup'.

I appreciate your recommendation, but it doesn't really have anything to
do with this bug report. Unless I am doing something that is
*expressly* not supposed to work, it should work, and if it doesn't,
that's either a bug or a documentation failure (i.e., if this setup is
not supposed to work, then it should be clearly documented somewhere
what exactly the problem is).

> The block layer correctly returns an error after logging this message.
> If it's due to a read operation, the error should be propagated up to
> the application that tried to read. If it's due to a write operation, I
> would expect the error to result in the RAID becoming desynchronised.
> In some cases it might be propagated to the application that tried to
> write.

Can you say what is "correct" about the returned error? That's what I'm
still not understanding. Why is there an error, and where is it coming
from?

jamie.

Re: Bug#624343: linux-image-2.6.38-2-amd64: frequent message "bio too big device md0 (248 > 240)"

on 02.05.2011 11:11:25 by David Brown

On 02/05/2011 00:06, Jameson Graef Rollins wrote:
> On Fri, 29 Apr 2011 05:39:40 +0100, Ben Hutchings wrote:
>> On Wed, 2011-04-27 at 09:19 -0700, Jameson Graef Rollins wrote:
>>> I run what I imagine is a fairly unusual disk setup on my laptop,
>>> consisting of:
>>>
>>> ssd -> raid1 -> dm-crypt -> lvm -> ext4
>>>
>>> I use the raid1 as a backup. The raid1 operates normally in degraded
>>> mode. For backups I then hot-add a usb hdd, let the raid1 sync, and
>>> then fail/remove the external hdd.

This is not directly related to your issues here, but it is possible to
make a 1-disk raid1 set so that you are not normally degraded. When you
want to do the backup, you can grow the raid1 set with the usb disk,
wait for the resync, then fail it and remove it, then "grow" the raid1
back to 1 disk. That way you don't feel you are always living in a
degraded state.
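
A sketch of that cycle with mdadm (device names are only examples):

    # create a deliberately single-disk raid1; --force is required
    # because mdadm considers one device an unusual raid1
    mdadm --create /dev/md0 --level=1 --raid-devices=1 --force /dev/sda2

    # backup time: attach the usb disk and grow into a real mirror
    mdadm /dev/md0 --add /dev/sdb1
    mdadm --grow /dev/md0 --raid-devices=2
    # ...wait for the resync to finish (watch /proc/mdstat)...

    # then detach the backup and shrink back to a single-disk set
    mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
    mdadm --grow /dev/md0 --raid-devices=1 --force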


Re: Bug#624343: linux-image-2.6.38-2-amd64: frequent message "bio too big device md0 (248 > 240)"

on 02.05.2011 18:38:05 by Jameson Graef Rollins

On Mon, 02 May 2011 11:11:25 +0200, David Brown wrote:
> This is not directly related to your issues here, but it is possible to
> make a 1-disk raid1 set so that you are not normally degraded. When you
> want to do the backup, you can grow the raid1 set with the usb disk,
> wait for the resync, then fail it and remove it, then "grow" the raid1
> back to 1 disk. That way you don't feel you are always living in a
> degraded state.

Hi, David. I appreciate the concern, but I am not at all concerned
about "living in a degraded state". I'm far more concerned about data
loss and the fact that this bug has seemingly revealed that some
commonly held assumptions and uses of software raid are wrong, with
potentially far-reaching effects.

I also don't see how the setup you're describing will avoid this bug.
If this bug is triggered by having a layer between md and the filesystem
and then changing the raid configuration by adding or removing a disk,
then I don't see how there's a difference between hot-adding to a
degraded array and growing a single-disk raid1. In fact, I would
suspect that your suggestion would be more problematic because it
involves *two* raid reconfigurations (grow and then shrink) rather than
one (hot-add) to achieve the same result. I imagine that each raid
reconfiguration could potentially trigger the bug. But I still don't
have a clear understanding of what is going on here to be sure.

jamie.

Re: Bug#624343: linux-image-2.6.38-2-amd64: frequent message "bio too big device md0 (248 > 240)"

on 02.05.2011 20:54:05 by David Brown

On 02/05/11 18:38, Jameson Graef Rollins wrote:
> On Mon, 02 May 2011 11:11:25 +0200, David Brown wrote:
>> This is not directly related to your issues here, but it is possible to
>> make a 1-disk raid1 set so that you are not normally degraded. When you
>> want to do the backup, you can grow the raid1 set with the usb disk,
>> wait for the resync, then fail it and remove it, then "grow" the raid1
>> back to 1 disk. That way you don't feel you are always living in a
>> degraded state.
>
> Hi, David. I appreciate the concern, but I am not at all concerned
> about "living in a degraded state". I'm far more concerned about data
> loss and the fact that this bug has seemingly revealed that some
> commonly held assumptions and uses of software raid are wrong, with
> potentially far-reaching effects.
>
> I also don't see how the setup you're describing will avoid this bug.
> If this bug is triggered by having a layer between md and the filesystem
> and then changing the raid configuration by adding or removing a disk,
> then I don't see how there's a difference between hot-adding to a
> degraded array and growing a single-disk raid1. In fact, I would
> suspect that your suggestion would be more problematic because it
> involves *two* raid reconfigurations (grow and then shrink) rather than
> one (hot-add) to achieve the same result. I imagine that each raid
> reconfiguration could potentially trigger the bug. But I still don't
> have a clear understanding of what is going on here to be sure.
>

I didn't mean to suggest this as a way around these issues - I was just
making a side point. Like you and others in this thread, I am concerned
about failures that could be caused by having the sort of layered and
non-homogeneous raid you describe.

I merely mentioned single-disk raid1 "mirrors" as an interesting feature
you can get with md raid. Many people don't like to have their system
in a continuous error state - it can make it harder to notice when you
have a /real/ problem. And single-disk "mirrors" give you the same
features, but with no "degraded" state.

As you say, it is conceivable that adding or removing disks to the raid
could make matters worse.

From what I have read so far, it looks like you can get around problems
here if the usb disk is attached when the block layers are built up
(i.e., when the dm-crypt is activated, and the lvm and filesystems on
top of it). It should then be safe to remove it, and re-attach it
later. Of course, it's hardly ideal to have to attach your backup
device every time you boot the machine!
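
If you try that, it is worth checking that the whole stack really did
inherit the usb disk's smaller limit when it was assembled (paths are
examples; dm minor numbers vary per system):

    # advertised max request size at each layer, in KiB; with the usb
    # member present at assembly time, these should all show the usb
    # disk's smaller value rather than the ssd's
    cat /sys/block/md0/queue/max_hw_sectors_kb
    cat /sys/block/dm-0/queue/max_hw_sectors_kb   # dm-crypt device
    cat /sys/block/dm-1/queue/max_hw_sectors_kb   # lvm logical volume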


