RAID6 seemingly shrunk itself after hard power outage and rebuild with replacement disk

On 05.03.2011 00:27:09 by robbat2

(Please CC, not subscribed to linux-raid).

Problem summary:
-------------------
After a rebuild following disk replacement, the MD array (RAID6, 12 devices)
appears to have shrunk by 10880KiB. Presumed at the start of the device, but no
confirmation.

Background:
-----------
I got called in to help a friend with a data loss problem after a catastrophic
UPS failure which killed at least one motherboard and several disks. Almost
all of it led to no data loss, except for one system...

For the system in question, one disk died (cciss/c1d12), and was
promptly replaced, and this problem started when the rebuild kicked in.

Prior to calling me, my friend had already tried a few things from a rescue
env, and almost certainly contributed to making the problem worse, and doesn't
have good logs of what he did.

The MD array held portions of two very large LVM LVs (15TiB and ~20TiB
respectively). Specifically, the PV on the MD array was a chunk in the middle of
each of the two LVs.

The kernel version (2.6.35.4) did not change during the power outage.

Problem identification:
-----------------------
When bringing the system back online, LVM refused to make one LV accessible as
it complained of a shrunk device. One other LV exhibited corruption.

The entry in /proc/partitions noted the array size of 14651023360KiB, while
older LVM backups showed the usable size of the array to previously be
14651034240KiB, a difference of 10880KiB.

The first LV has inaccessible data for all files at or after the missing chunk.
All files prior to that point are accessible.

LVM refused to bring the second LV online as it complained the physical device
was now too small for all the extents.

Prior to the outage, 800KiB of the collected devices was used for metadata, and
post the outage, now 11680KiB is used (difference of 10880KiB).
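
As a sanity check, the numbers are self-consistent (a 12-disk RAID6 stripes
data across 10 members, so the array-wide loss should be 10x the per-member
loss):

echo $(( 14651034240 - 14651023360 ))  # prints 10880: KiB lost array-wide
echo $(( 10880 / 10 ))                 # prints 1088: KiB lost per member device
echo $(( 11680 - 800 ))                # prints 10880: KiB of new metadata, matching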

Questions:
----------
Why did the array shrink? How can I get it back to the original size, or
accurately identify the missing chunk size and offset, so that I can adjust the
LVM definitions and recover the other data?

Collected information:
----------------------

Relevant lines from /proc/partitions:
=====================================
9 3 14651023360 md3
105 209 1465103504 cciss/c1d13p1
...

Line from mdstat right now:
===========================
md3 : active raid6 cciss/c1d18p1[5] cciss/c1d17p1[4] cciss/c1d13p1[0]
cciss/c1d21p1[8] cciss/c1d20p1[7] cciss/c1d19p1[6] cciss/c1d15p1[2]
cciss/c1d12p1[12] cciss/c1d14p1[1] cciss/c1d23p1[10] cciss/c1d16p1[3]
cciss/c1d22p1[9]
14651023360 blocks super 1.2 level 6, 64k chunk, algorithm 2
[12/12] [UUUUUUUUUUUU]

MDADM output:
=============
# mdadm --detail /dev/md3
/dev/md3:
Version : 1.2
Creation Time : Wed Feb 16 19:53:05 2011
Raid Level : raid6
Array Size : 14651023360 (13972.30 GiB 15002.65 GB)
Used Dev Size : 1465102336 (1397.23 GiB 1500.26 GB)
Raid Devices : 12
Total Devices : 12
Persistence : Superblock is persistent

Update Time : Fri Mar 4 17:19:43 2011
State : clean
Active Devices : 12
Working Devices : 12
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 64K

Name : CENSORED:3 (local to host CENSORED)
UUID : efa04ecf:4dbd0bfa:820a5942:de8a234f
Events : 25

Number Major Minor RaidDevice State
0 105 209 0 active sync /dev/cciss/c1d13p1
1 105 225 1 active sync /dev/cciss/c1d14p1
2 105 241 2 active sync /dev/cciss/c1d15p1
3 105 257 3 active sync /dev/cciss/c1d16p1
4 105 273 4 active sync /dev/cciss/c1d17p1
5 105 289 5 active sync /dev/cciss/c1d18p1
6 105 305 6 active sync /dev/cciss/c1d19p1
7 105 321 7 active sync /dev/cciss/c1d20p1
8 105 337 8 active sync /dev/cciss/c1d21p1
9 105 353 9 active sync /dev/cciss/c1d22p1
10 105 369 10 active sync /dev/cciss/c1d23p1
12 105 193 11 active sync /dev/cciss/c1d12p1

LVM PV definition:
==================
pv1 {
id = "CENSORED"
device = "/dev/md3" # Hint only
status = ["ALLOCATABLE"]
flags = []
dev_size = 29302068480 # 13.6448 Terabytes
pe_start = 384
pe_count = 3576912 # 13.6448 Terabytes
}
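
(dev_size above is in 512-byte sectors; halving it recovers the pre-outage
array size in KiB:)

echo $(( 29302068480 / 2 ))  # prints 14651034240, the old array size in KiB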

LVM segments output:
====================

# lvs --units 1m --segments \
-o lv_name,lv_size,seg_start,seg_start_pe,seg_size,seg_pe_ranges \
vg/LV1 vg/LV2
LV LSize Start Start SSize PE Ranges
LV1 15728640m 0m 0 1048576m /dev/md2:1048576-1310719
LV1 15728640m 1048576m 262144 1048576m /dev/md2:2008320-2270463
LV1 15728640m 2097152m 524288 7936132m /dev/md3:1592879-3576911
LV1 15728640m 10033284m 2508321 452476m /dev/md4:2560-115678
LV1 15728640m 10485760m 2621440 5242880m /dev/md4:2084381-3395100
LV2 20969720m 0m 0 4194304m /dev/md2:0-1048575
LV2 20969720m 4194304m 1048576 1048576m /dev/md2:1746176-2008319
LV2 20969720m 5242880m 1310720 456516m /dev/md2:2270464-2384592
LV2 20969720m 5699396m 1424849 511996m /dev/md2:1566721-1694719
LV2 20969720m 6211392m 1552848 4m /dev/md2:1566720-1566720
LV2 20969720m 6211396m 1552849 6371516m /dev/md3:0-1592878
LV2 20969720m 12582912m 3145728 512000m /dev/md2:1438720-1566719
LV2 20969720m 13094912m 3273728 7874808m /dev/md4:115679-2084380

--
Robin Hugh Johnson
Gentoo Linux: Developer, Trustee & Infrastructure Lead
E-Mail : robbat2@gentoo.org
GnuPG FP : 11AC BA4F 4778 E3F6 E4ED F38E B27B 944E 3488 4E85


Re: RAID6 seemingly shrunk itself after hard power outage and rebuild with replacement disk

On 05.03.2011 09:32:43 by Stan Hoeppner

Robin H. Johnson put forth on 3/4/2011 5:27 PM:

> After a rebuild following disk replacement, the MD array (RAID6, 12 devices)
> appears to have shrunk by 10880KiB.
----------------------------------------------------------------------
> The entry in /proc/partitions noted the array size of 14651023360KiB, while
> older LVM backups showed the usable size of the array to previously be
> 14651034240KiB, a difference of 10880KiB.
----------------------------------------------------------------------
> Prior to the outage, 800KiB of the collected devices was used for metadata, and
> post the outage, now 11680KiB is used (difference of 10880KiB).
----------------------------------------------------------------------
> Why did the array shrink?

It appears it shrank by exactly the size of the new metadata, 10880KiB,
if this is actually considered an array shrink. It seems you need to
identify why the metadata size increased, and figure out a way to revert
it to its previous size.

Your current metadata version is 1.2. What was it prior to the
catastrophic UPS event?

--
Stan

Re: RAID6 seemingly shrunk itself after hard power outage and rebuild with replacement disk

On 05.03.2011 17:57:58 by Phil Turmel

Hi Robin,

On 03/04/2011 06:27 PM, Robin H. Johnson wrote:
> (Please CC, not subscribed to linux-raid).
>
> Problem summary:
> -------------------
> After a rebuild following disk replacement, the MD array (RAID6, 12 devices)
> appears to have shrunk by 10880KiB. Presumed at the start of the device, but no
> confirmation.

Sounds similar to a problem recently encountered by Simon McNeil...

> Background:
> -----------
> I got called in to help a friend with a data loss problem after a catastrophic
> UPS failure which killed at least one motherboard and several disks. Almost
> all of it led to no data loss, except for one system...
>
> For the system in question, one disk died (cciss/c1d12), and was
> promptly replaced, and this problem started when the rebuild kicked in.
>
> Prior to calling me, my friend had already tried a few things from a rescue
> env, and almost certainly contributed to making the problem worse, and doesn't
> have good logs of what he did.

I have a suspicion that 'mdadm --create --assume-clean' or some variant was one of those. And that the rescue environment has a version of mdadm >= 3.1.2. The default metadata alignment changed in that version.

> The MD array held portions of two very large LVM LVs (15TiB and ~20TiB
> respectively). Specifically, the PV on the MD array was a chunk in the middle of
> each of the two LVs.
>
> The kernel version (2.6.35.4) did not change during the power outage.
>
> Problem identification:
> -----------------------
> When bringing the system back online, LVM refused to make one LV accessible as
> it complained of a shrunk device. One other LV exhibited corruption.
>
> The entry in /proc/partitions noted the array size of 14651023360KiB, while
> older LVM backups showed the usable size of the array to previously be
> 14651034240KiB, a difference of 10880KiB.
>
> The first LV has inaccessible data for all files at or after the missing chunk.
> All files prior to that point are accessible.
>
> LVM refused to bring the second LV online as it complained the physical device
> was now too small for all the extents.
>
> Prior to the outage, 800KiB of the collected devices was used for metadata, and
> post the outage, now 11680KiB is used (difference of 10880KiB).
>
> Questions:
> ----------
> Why did the array shrink? How can I get it back to the original size, or
> accurately identify the missing chunk size and offset, so that I can adjust the
> LVM definitions and recover the other data?

Please share mdadm -E for all of the devices in the problem array, and a sample of mdadm -E for some of the devices in the working arrays. I think you'll find differences in the data offset. Newer mdadm aligns to 1MB. Older mdadm aligns to "superblock size + bitmap size".

"mdadm -E /dev/cciss/c1d{12..23}p1" should show us individual device details for the problem array.

> Collected information:
> ----------------------
>
> Relevant lines from /proc/partitions:
> =====================================
> 9 3 14651023360 md3
> 105 209 1465103504 cciss/c1d13p1
> ...
>
> Line from mdstat right now:
> ===========================
> md3 : active raid6 cciss/c1d18p1[5] cciss/c1d17p1[4] cciss/c1d13p1[0]
> cciss/c1d21p1[8] cciss/c1d20p1[7] cciss/c1d19p1[6] cciss/c1d15p1[2]
> cciss/c1d12p1[12] cciss/c1d14p1[1] cciss/c1d23p1[10] cciss/c1d16p1[3]
> cciss/c1d22p1[9]
> 14651023360 blocks super 1.2 level 6, 64k chunk, algorithm 2
> [12/12] [UUUUUUUUUUUU]
>
> MDADM output:
> =============
> # mdadm --detail /dev/md3
> /dev/md3:
> Version : 1.2
> Creation Time : Wed Feb 16 19:53:05 2011
> Raid Level : raid6
> Array Size : 14651023360 (13972.30 GiB 15002.65 GB)
> Used Dev Size : 1465102336 (1397.23 GiB 1500.26 GB)
> Raid Devices : 12
> Total Devices : 12
> Persistence : Superblock is persistent
>
> Update Time : Fri Mar 4 17:19:43 2011
> State : clean
> Active Devices : 12
> Working Devices : 12
> Failed Devices : 0
> Spare Devices : 0
>
> Layout : left-symmetric
> Chunk Size : 64K
>
> Name : CENSORED:3 (local to host CENSORED)
> UUID : efa04ecf:4dbd0bfa:820a5942:de8a234f
> Events : 25
>
> Number Major Minor RaidDevice State
> 0 105 209 0 active sync /dev/cciss/c1d13p1
> 1 105 225 1 active sync /dev/cciss/c1d14p1
> 2 105 241 2 active sync /dev/cciss/c1d15p1
> 3 105 257 3 active sync /dev/cciss/c1d16p1
> 4 105 273 4 active sync /dev/cciss/c1d17p1
> 5 105 289 5 active sync /dev/cciss/c1d18p1
> 6 105 305 6 active sync /dev/cciss/c1d19p1
> 7 105 321 7 active sync /dev/cciss/c1d20p1
> 8 105 337 8 active sync /dev/cciss/c1d21p1
> 9 105 353 9 active sync /dev/cciss/c1d22p1
> 10 105 369 10 active sync /dev/cciss/c1d23p1
> 12 105 193 11 active sync /dev/cciss/c1d12p1

The lowest device node is the last device role? Any chance these are also out of order?

> LVM PV definition:
> ==================
> pv1 {
> id = "CENSORED"
> device = "/dev/md3" # Hint only
> status = ["ALLOCATABLE"]
> flags = []
> dev_size = 29302068480 # 13.6448 Terabytes
> pe_start = 384
> pe_count = 3576912 # 13.6448 Terabytes
> }

It would be good to know where the LVM PV signature is on the problem array's devices, and which one has it. LVM stores a text copy of the VG's configuration in its metadata blocks at the beginning of a PV, so you should find it on the true "Raid device 0", at the original MD data offset from the beginning of the device.

I suggest scripting a loop through each device, piping the first 1MB (with dd) to "strings -t x" to grep, looking for the PV uuid in clear text.
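
A rough sketch of that loop (PVUUID here stands in for the real, censored PV
uuid from the LVM backup config):

PVUUID="CENSORED"   # the pv1 "id" value from the LVM backup
for d in /dev/cciss/c1d{12..23}p1; do
    echo "== $d"
    # 'strings -t x' prints the hex offset of each match; a hit at the MD
    # data offset marks the true raid device 0
    dd if="$d" bs=1M count=1 2>/dev/null | strings -t x | grep "$PVUUID"
done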

> LVM segments output:
> ====================
>
> # lvs --units 1m --segments \
> -o lv_name,lv_size,seg_start,seg_start_pe,seg_size,seg_pe_ranges \
> vg/LV1 vg/LV2
> LV LSize Start Start SSize PE Ranges
> LV1 15728640m 0m 0 1048576m /dev/md2:1048576-1310719
> LV1 15728640m 1048576m 262144 1048576m /dev/md2:2008320-2270463
> LV1 15728640m 2097152m 524288 7936132m /dev/md3:1592879-3576911
> LV1 15728640m 10033284m 2508321 452476m /dev/md4:2560-115678
> LV1 15728640m 10485760m 2621440 5242880m /dev/md4:2084381-3395100
> LV2 20969720m 0m 0 4194304m /dev/md2:0-1048575
> LV2 20969720m 4194304m 1048576 1048576m /dev/md2:1746176-2008319
> LV2 20969720m 5242880m 1310720 456516m /dev/md2:2270464-2384592
> LV2 20969720m 5699396m 1424849 511996m /dev/md2:1566721-1694719
> LV2 20969720m 6211392m 1552848 4m /dev/md2:1566720-1566720
> LV2 20969720m 6211396m 1552849 6371516m /dev/md3:0-1592878
> LV2 20969720m 12582912m 3145728 512000m /dev/md2:1438720-1566719
> LV2 20969720m 13094912m 3273728 7874808m /dev/md4:115679-2084380
>

If my suspicions are right, you'll have to use an old version of mdadm to redo an 'mdadm --create --assume-clean'.

HTH,

Phil

Re: RAID6 seemingly shrunk itself after hard power outage and rebuild with replacement disk

On 05.03.2011 18:09:12 by Phil Turmel

On 03/05/2011 11:57 AM, Phil Turmel wrote:
> Hi Robin,
[trim /]
>
> Sounds similar to a problem recently encountered by Simon McNeil...

Whoops! That was "Simon Mcnair".

Phil

Re: RAID6 seemingly shrunk itself after hard power outage and rebuild with replacement disk

On 06.03.2011 20:22:56 by robbat2

On Sat, Mar 05, 2011 at 11:57:58AM -0500, Phil Turmel wrote:
> I have a suspicion that 'mdadm --create --assume-clean' or some
> variant was one of those. And that the rescue environment has a
> version of mdadm >= 3.1.2. The default metadata alignment changed in
> that version.
Confirmed.

> The lowest device node is the last device role? Any chance these are also out of order?
Yes, the data was confirmed to be shuffled later.

> If my suspicions are right, you'll have to use an old version of mdadm
> to redo an 'mdadm --create --assume-clean'.
Passed -e 0 with that, corrected the order of the devices, and then it
looked much better. There was some minor data corruption where the new
metadata had overwritten data, but it was much easier to recover those
files than the entire 35 TiB.
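
For the record, the recreate was roughly this shape (illustrative sketch
only; the actual member order had to be worked out from the PV-signature
scan, so the device list below is a placeholder):

# run with an OLD mdadm (pre-3.1.2) so the data offset matches the original
# array; -e 0 selects 0.90 metadata (superblock at the END of each member),
# and --assume-clean prevents a resync from clobbering the data
mdadm --create /dev/md3 -e 0 --level=6 --chunk=64 --raid-devices=12 \
    --assume-clean /dev/cciss/c1dNNp1 ...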

Thanks everybody for the help.

--
Robin Hugh Johnson
Gentoo Linux: Developer, Trustee & Infrastructure Lead
E-Mail : robbat2@gentoo.org
GnuPG FP : 11AC BA4F 4778 E3F6 E4ED F38E B27B 944E 3488 4E85