filesystem corruption

filesystem corruption

on 03.01.2011 02:58:34 by linux-raid

I've been trying to track down an issue for a while now, and from digging
around it appears (though I'm not certain) that the issue lies with the md raid
device.
What's happening is that after improperly shutting down a raid-5 array,
upon reassembly a few files on the filesystem will be corrupt. I don't
think this is normal filesystem corruption from files being modified
during the shutdown, because some of the files that end up corrupted are
several hours old.

The exact details of what I'm doing:
I have a 3-node test cluster I'm doing integrity testing on. Each node
in the cluster is exporting a couple of disks via ATAoE.
I have the first disk of all 3 nodes in a raid-1 that is holding the
journal data for the ext3 filesystem. The array is running with an
internal bitmap as well.
The second disk of all 3 nodes is a raid-5 array holding the ext3
filesystem itself. This is also running with an internal bitmap.
The ext3 filesystem is mounted with 'data=journal,barrier=1,sync'.
When I power down the node which is actively running both md raid
devices, another node in the cluster takes over and starts both arrays
up (in degraded mode of course).
Once the original node comes back up, the new master re-adds its disks
back into the raid arrays and re-syncs them.
During all this, the filesystem is exported through nfs (nfs also has
sync turned on) and a client is randomly creating, removing, and
verifying checksums on the files in the filesystem (nfs is hard mounted
so operations always retry). The client script averages about 30
creations/s, 30 deletes/s, and 30 checksums/s.
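
(For reference, the layout above corresponds roughly to the commands below. The
device names for the journal disks, the mkfs step and the export line are
illustrative guesses, not my actual scripts.)

# raid-1 over the first ATAoE disk of each node, holding the external ext3 journal
mdadm --create /dev/md/fs01_journal --level=1 --raid-devices=3 --bitmap=internal \
      /dev/etherd/e1.0p1 /dev/etherd/e2.0p1 /dev/etherd/e3.0p1
# raid-5 over the second disk of each node, holding the filesystem itself
mdadm --create /dev/md/fs01 --level=5 --raid-devices=3 --bitmap=internal \
      /dev/etherd/e1.1p1 /dev/etherd/e2.1p1 /dev/etherd/e3.1p1
# ext3 with its journal on the raid-1 device
mke2fs -O journal_dev /dev/md/fs01_journal
mke2fs -j -J device=/dev/md/fs01_journal /dev/md/fs01
mount -t ext3 -o data=journal,barrier=1,sync /dev/md/fs01 /export/fs01
# /etc/exports on the active node:  /export/fs01  *(rw,sync,no_subtree_check)
# on failover, the surviving node starts both arrays degraded, roughly:
#   mdadm --assemble --run /dev/md/fs01 /dev/etherd/e2.1p1 /dev/etherd/e3.1p1
# and when the dead node comes back, its disk is re-added:
#   mdadm /dev/md/fs01 --re-add /dev/etherd/e1.1p1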

So, as stated above, every now and then (1 in 50 chance or so), when the
master is hard-rebooted, the client will detect a few files with invalid
md5 checksums. These files could be hours old so they were not being
actively modified.
Another key point that leads me to believe it's an md raid issue: before,
I had the ext3 journal running internally on the raid-5 array
(as part of the filesystem itself). When I did this, there would
occasionally be massive corruption: file modification times in the
future, lots of corrupt files, thousands of files put in the
'lost+found' dir upon fsck, etc. After I put the journal on a separate raid-1,
there are no more invalid modification times, there hasn't been a single
file added to 'lost+found', and the number of corrupt files dropped
significantly. This would seem to indicate that the journal was getting
corrupted, and when it was played back, it went horribly wrong.

So it would seem there's something wrong with the raid-5 array, but I
don't know what it could be. Any ideas or input would be much
appreciated. I can modify the clustering scripts to obtain whatever
information is needed when they start the arrays.

-Patrick

Re: filesystem corruption

on 03.01.2011 04:16:03 by NeilBrown

On Sun, 02 Jan 2011 18:58:34 -0700 "Patrick H."
wrote:

> I've been trying to track down an issue for a while now and from digging
> around it appears (though not certain) the issue lies with the md raid
> device.
> Whats happening is that after improperly shutting down a raid-5 array,
> upon reassembly, a few files on the filesystem will be corrupt. I dont
> think this is normal filesystem corruption from files being modified
> during the shut down because some of the files that end up corrupted are
> several hours old.
>
> The exact details of what I'm doing:
> I have a 3-node test cluster I'm doing integrity testing on. Each node
> in the cluster is exporting a couple of disks via ATAoE.
> I have the first disk of all 3 nodes in a raid-1 that is holding the
> journal data for the ext3 filesystem. The array is running with an
> internal bitmap as well.
> The second disk of all 3 nodes is a raid-5 array holding the ext3
> filesystem itself. This is also running with an internal bitmap.
> The ext3 filesystem is mounted with 'data=journal,barrier=1,sync'.
> When I power down the node which is actively running both md raid
> devices, another node in the cluster takes over and starts both arrays
> up (in degraded mode of course).
> Once the original node comes back up, the new master re-adds its disks
> back into the raid arrays and re-syncs them.
> During all this, the filesystem is exported through nfs (nfs also has
> sync turned on) and a client is randomly creating, removing, and
> verifying checksums on the files in the filesystem (nfs is hard mounted
> so operations always retry). The client script averages about 30
> creations/s, 30 deletes/s, and 30 checksums/s.
>
> So, as stated above, every now and then (1 in 50 chance or so), when the
> master is hard-rebooted, the client will detect a few files with invalid
> md5 checksums. These files could be hours old so they were not being
> actively modified.
> Another key point that leads me to believe its a md raid issue is that
> before I had the ext3 journal running internally on the raid-5 array
> (part of the filesystem itself). When I did this, there would
> occasionally be massive corruption. As in file modification times in the
> future, lots of corrupt files, thousands of files put in the
> 'lost+found' dir upon fsck, etc. After I put it on a separate raid-1,
> there are no more invalid modification times, there hasnt been a single
> file added to 'lost+found', and the number of corrupt files dropped
> significantly. This would seem to indicate that the journal was getting
> corrupted, and when it was played back, it went horribly wrong.
>
> So it would seem there's something wrong with the raid-5 array, but I
> dont know what it could be. Any ideas or input would be much
> appreciated. I can modify the clustering scripts to obtain whatever
> information is needed when they start the arrays.

What you are doing cannot work reliably.

If a RAID5 suffers an unclean shutdown and is restarted without a full
complement of devices, then it can corrupt data that has not been changed
recently, just as you are seeing.
This is why mdadm will not assemble that array unless you provide the --force
flag, which essentially says "I know what I am doing and accept the risk".
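
(For illustration, with the raid-5 member names used later in this thread, the
difference is roughly:)

mdadm --assemble /dev/md/fs01 /dev/etherd/e2.1p1 /dev/etherd/e3.1p1
        # refuses: the array is dirty and a member is missing
mdadm --assemble --force /dev/md/fs01 /dev/etherd/e2.1p1 /dev/etherd/e3.1p1
        # starts anyway, accepting that reconstructed blocks may be wrong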

When md needs to update a block in your 3-drive RAID5, it will read the other
block in the same stripe (if that isn't in the cache or being written at the
same time) and then write out the data block (or blocks) and the newly
computed parity block.

If you crash after one of those writes has completed, but before all of the
writes have completed, then the parity block will not match the data blocks
on disk.
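
(If the array does come back with all members, that inconsistency can be seen
directly with md's own consistency check; the md device name here is only an
example:)

echo check > /sys/block/md126/md/sync_action   # re-read every stripe and compare parity with data
cat /sys/block/md126/md/mismatch_cnt           # non-zero means parity and data disagree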

When you re-assemble the array with one device missing, md will compute the
data that was on the device using the other data block and the parity block.
As the parity and data blocks could be inconsistent, the result could easily
be wrong.

With RAID1 there is no similar problem. When you read after a crash you will
always get "correct" data. It may be from before the last write that was
attempted, or after, but if the data was not written recently you will read
exactly the right data.

This is why the situation improved substantially when you moved the journal
to RAID1.

To get the full improvement, you need to move the data to RAID1 (or RAID10) as
well.

NeilBrown


Re: filesystem corruption

on 03.01.2011 05:56:30 by NeilBrown

On Sun, 02 Jan 2011 21:06:52 -0700 "Patrick H."
wrote:


> That makes sense, assuming that MD acknowledges the write once the data is
> written to the data disks but not necessarily the parity disk, which is
> what I gather you were saying happens. Is there any option that
> can change the behavior so that md won't ack the write until it's been
> committed to all disks (I'm guessing no since you didn't mention it)?
> Also, does raid6 suffer this problem? Is it smart enough to use both
> parity disks when calculating the replacement, or will it just use one?
>

md/raid5 doesn't acknowledge the write until both the data and the parity
have been written. But that doesn't make any difference.
If you schedule a number of interdependent writes (data and parity) and then
allow some to complete but not all, then you have inconsistency.
Recovery from losing a single device requires consistency of parity and data.

RAID6 suffers equally from this problem. Even if it used both parity disks
to recover (which it doesn't), how would that help? It would then have two
possible values for the data and no way to know which was correct, and every
possibility that both are incorrect. This would happen if a single data
block was successfully written, but neither parity block was.

The only way you can avoid this 'write hole' is by journalling in multiples
of whole stripes. No current filesystems that I know of can do this as they
journal in blocks, and the maximum block size is less than the minimum stripe
size. So you would need journalling integrated with md/raid, or you would
need a filesystem which was designed to understand this problem and write
whole stripes at a time, always to an area of the device which did not
contain live data.
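
(Concretely, for the array in this thread: 64K chunks on a 3-disk RAID5 give a
128K full stripe, while ext3 journals in filesystem blocks of at most 4K. A
rough way to compare the two, reusing the device name from this thread:)

mdadm --detail /dev/md/fs01 | grep 'Chunk Size'   # 64K chunk x 2 data disks = 128K full stripe
dumpe2fs -h /dev/md/fs01 | grep 'Block size'      # ext3 block size, at most 4096 bytes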

NeilBrown

Re: filesystem corruption

on 03.01.2011 06:05:06 by linux-raid

Sent: Sun Jan 02 2011 21:56:30 GMT-0700 (Mountain Standard Time)
From: Neil Brown
To: Patrick H. linux-raid@vger.kernel.org
Subject: Re: filesystem corruption
> On Sun, 02 Jan 2011 21:06:52 -0700 "Patrick H."
> wrote:
>
>
>
>> That makes sense assuming that MD acknowleges the write once the data is
>> written to the data disks but not necessarily the parity disk, which is
>> what I gather you were saying is what happens. Is there any option that
>> can change the behavior so that md wont ack the write until its been
>> committed to all disks (I'm guessing no since you didnt mention it)?
>> Also does raid6 suffer this problem? Is it smart enough to use both
>> parity disks when calculating replacement, or will it just use one?
>>
>>
>
> md/raid5 doesn't acknowledge the write until both the data and the parity
> have been written. But that doesn't make any difference.
> If you schedule a number of interdependent writes (data and parity) and then
> allow some to complete but not all, then you have inconsistency.
> Recovery from losing a single device requires consistency of parity and data.
>
> RAID6 suffers equally from this problem. Even if it used both parity disks
> to recover (which it doesn't) how would that help? It would then have two
> possible value for the data and no way to know which was correct, and every
> possibility that both are incorrect. This would happen if a single data
> block was successfully written, but neither parity blocks were.
>
> The only way you can avoid this 'write hole' is by journalling in multiples
> of whole stripes. No current filesystems that I know of can do this as they
> journal in blocks, and the maximum block size is less than the minimum stripe
> size. So you would need journalling integrated with md/raid, or you would
> need a filesystem which was designed to understand this problem and write
> whole stripes at a time, always to an area of the device which did not
> contain live data.
>
> NeilBrown

Ok, thanks for the info.
I think I'll solve it by creating 2 dedicated hosts for running the
array which don't actually export any disks themselves. That way, if a
master dies, all the raid disks are still there and can be picked up by
the other master.

-Patrick

Re: filesystem corruption

on 04.01.2011 06:33:24 by NeilBrown

On Sun, 02 Jan 2011 22:05:06 -0700 "Patrick H."
wrote:

> Ok, thanks for the info.
> I think I'll solve it by creating 2 dedicated hosts for running the
> array, but not actually export any disks themselves. This way if a
> master dies, all the raid disks are still there and can be picked up by
> the other master.
>

That sounds like it should work OK.

NeilBrown


Re: filesystem corruption

on 04.01.2011 08:50:39 by linux-raid

Sent: Mon Jan 03 2011 22:33:24 GMT-0700 (Mountain Standard Time)
From: NeilBrown
To: Patrick H. linux-raid@vger.kernel.org
Subject: Re: filesystem corruption
> On Sun, 02 Jan 2011 22:05:06 -0700 "Patrick H."
> wrote:
>
>
>> Ok, thanks for the info.
>> I think I'll solve it by creating 2 dedicated hosts for running the
>> array, but not actually export any disks themselves. This way if a
>> master dies, all the raid disks are still there and can be picked up by
>> the other master.
>>
>>
>
> That sounds like it should work OK.
>
> NeilBrown
>
Well, it didn't solve it. If I power the entire cluster down and start it
back up, I still get corruption on old files that weren't being modified.
If I power off just a single node, it seems to handle it fine,
just not the whole cluster.

It also seems to happen fairly frequently now. In the previous setup it
was probably 1 in 50 failures that showed corruption. Now it's pretty
much a guarantee there will be corruption if I kill it.
On the last failure I did, when it came back up, it re-assembled the
entire raid-5 array with all disks active and none of them needing any
sort of re-sync. The disk controller is battery backed, so even if it
was re-ordering the writes, the battery should ensure that it all gets
committed.

Any other ideas?

-Patrick

Re: filesystem corruption

on 04.01.2011 18:31:56 by linux-raid

Sent: Tue Jan 04 2011 00:50:39 GMT-0700 (Mountain Standard Time)
From: Patrick H.
To: linux-raid@vger.kernel.org
Subject: Re: filesystem corruption
> Sent: Mon Jan 03 2011 22:33:24 GMT-0700 (Mountain Standard Time)
> From: NeilBrown
> To: Patrick H. linux-raid@vger.kernel.org
> Subject: Re: filesystem corruption
>> On Sun, 02 Jan 2011 22:05:06 -0700 "Patrick H."
>>
>> wrote:
>>
>>
>>> Ok, thanks for the info.
>>> I think I'll solve it by creating 2 dedicated hosts for running the
>>> array, but not actually export any disks themselves. This way if a
>>> master dies, all the raid disks are still there and can be picked up
>>> by the other master.
>>>
>>>
>>
>> That sounds like it should work OK.
>>
>> NeilBrown
>>
> Well, it didnt solve it. if I power the entire cluster down and start
> it back up, I get corruption, on old files that werent being modified
> still. If I power off just a single node, it seems to handle it fine,
> just not the whole cluster.
>
> It also seems to happen fairly frequently now. In the previous setup
> it was probably 1 in 50 failures that there was corruption. Now its
> pretty much a guarantee there will be corruption if I kill it.
> On the last failure I did, when it came back up, it re-assembled the
> entire raid-5 array with all disks active and none of them needing any
> sort of re-sync. The disk controller is battery backed, so even if it
> was re-ordering the writes, the battery should ensure that it all gets
> committed.
>
> Any other ideas?
>
> -Patrick
Here is some info from my most recent failure simulation. This one
resulted in about 50 corrupt files, another 40 or so that can't even be
opened, and one stale nfs file handle.
I had the cluster script dump out a bunch of info before and after
assembling the array.
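
(The collection step amounts to roughly the following; the loop is a sketch,
not the actual cluster script:)

for dev in /dev/etherd/e1.1p1 /dev/etherd/e2.1p1 /dev/etherd/e3.1p1; do
    mdadm -E "$dev"    # member superblock (--examine)
    mdadm -X "$dev"    # member write-intent bitmap (--examine-bitmap)
done
mdadm -D /dev/md/fs01  # array-level detail (--detail), after assembly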

= = = = = = = = = =
# mdadm -E /dev/etherd/e1.1p1
/dev/etherd/e1.1p1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
Name : dm01:126 (local to host dm01)
Creation Time : Tue Jan 4 04:45:50 2011
Raid Level : raid5
Raid Devices : 3

Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
Array Size : 4238848 (2.02 GiB 2.17 GB)
Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : a20adb76:af00f276:5be79a36:b4ff3a8b

Internal Bitmap : 2 sectors from superblock
Update Time : Tue Jan 4 16:45:56 2011
Checksum : 361041f6 - correct
Events : 486

Layout : left-symmetric
Chunk Size : 64K

Device Role : Active device 0
Array State : AAA ('A' == active, '.' == missing)



# mdadm -X /dev/etherd/e1.1p1
Filename : /dev/etherd/e1.1p1
Magic : 6d746962
Version : 4
UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
Events : 486
Events Cleared : 486
State : OK
Chunksize : 64 KB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
Bitmap : 16558 bits (chunks), 189 dirty (1.1%)
= = = = = = = = = =


= = = = = = = = = =
# mdadm -E /dev/etherd/e2.1p1
/dev/etherd/e2.1p1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
Name : dm01:126 (local to host dm01)
Creation Time : Tue Jan 4 04:45:50 2011
Raid Level : raid5
Raid Devices : 3

Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
Array Size : 4238848 (2.02 GiB 2.17 GB)
Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : f9205ace:0796ecf5:2cca363c:c2873816

Internal Bitmap : 2 sectors from superblock
Update Time : Tue Jan 4 16:45:56 2011
Checksum : 9d235885 - correct
Events : 486

Layout : left-symmetric
Chunk Size : 64K

Device Role : Active device 1
Array State : AAA ('A' == active, '.' == missing)



# mdadm -X /dev/etherd/e2.1p1
Filename : /dev/etherd/e2.1p1
Magic : 6d746962
Version : 4
UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
Events : 486
Events Cleared : 486
State : OK
Chunksize : 64 KB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
Bitmap : 16558 bits (chunks), 189 dirty (1.1%)
= = = = = = = = = =


= = = = = = = = = =
# mdadm -E /dev/etherd/e3.1p1
/dev/etherd/e3.1p1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
Name : dm01:126 (local to host dm01)
Creation Time : Tue Jan 4 04:45:50 2011
Raid Level : raid5
Raid Devices : 3

Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
Array Size : 4238848 (2.02 GiB 2.17 GB)
Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : active
Device UUID : 7f90958d:22de5c08:88750ecb:5f376058

Internal Bitmap : 2 sectors from superblock
Update Time : Tue Jan 4 16:46:13 2011
Checksum : 3fce6b33 - correct
Events : 487

Layout : left-symmetric
Chunk Size : 64K

Device Role : Active device 2
Array State : AAA ('A' == active, '.' == missing)



# mdadm -X /dev/etherd/e3.1p1
Filename : /dev/etherd/e3.1p1
Magic : 6d746962
Version : 4
UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
Events : 487
Events Cleared : 486
State : OK
Chunksize : 64 KB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
Bitmap : 16558 bits (chunks), 249 dirty (1.5%)
= = = = = = = = = =



- - - - - - - - - - -
# mdadm -D /dev/md/fs01
/dev/md/fs01:
Version : 1.2
Creation Time : Tue Jan 4 04:45:50 2011
Raid Level : raid5
Array Size : 2119424 (2.02 GiB 2.17 GB)
Used Dev Size : 1059712 (1035.05 MiB 1085.15 MB)
Raid Devices : 3
Total Devices : 3
Persistence : Superblock is persistent

Intent Bitmap : Internal

Update Time : Tue Jan 4 16:46:13 2011
State : active, resyncing
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 64K

Rebuild Status : 1% complete

Name : dm01:126 (local to host dm01)
UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
Events : 486

    Number   Major   Minor   RaidDevice   State
       0       152     273        0       active sync   /dev/block/152:273
       1       152     529        1       active sync   /dev/block/152:529
       3       152     785        2       active sync   /dev/block/152:785
- - - - - - - - - - -



The old method *never* resulted in this much corruption, and never
generated stale nfs file handles. Why is this so much worse now when it
was supposed to be better?

Re: filesystem corruption

on 05.01.2011 02:22:33 by linux-raid

I think I may have found something on this. I was messing around with it
more (switched to iSCSI instead of ATAoE), and managed to create a
situation where 2 of the 3 raid-5 disks had failed, yet the MD device
was still active, and it was letting me use it. This is bad.

mdadm -D /dev/md/fs01
/dev/md/fs01:
Version : 1.2
Creation Time : Tue Jan 4 04:45:50 2011
Raid Level : raid5
Array Size : 2119424 (2.02 GiB 2.17 GB)
Used Dev Size : 1059712 (1035.05 MiB 1085.15 MB)
Raid Devices : 3
Total Devices : 1
Persistence : Superblock is persistent

Intent Bitmap : Internal

Update Time : Tue Jan 4 22:58:44 2011
State : active, FAILED
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 64K

Name : dm01:125 (local to host dm01)
UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
Events : 2980

    Number   Major   Minor   RaidDevice   State
       0        0        0        0       removed
       1        8       80        1       active sync   /dev/sdf
       2        0        0        2       removed


Notice there's only one disk in the array; the other 2 failed and were
removed. Yet the state still says active. The filesystem is still up
and running, and I can even read and write to it, though it spits out
tons of IO errors.
I then stopped the array and tried to reassemble it, and now it won't
reassemble.


# mdadm -A /dev/md/fs01 --uuid 9cd9ae9b:39454845:62f2b08d:a4a1ac6c -vv
mdadm: looking for devices for /dev/md/fs01
mdadm: no recogniseable superblock on /dev/md/fs01_journal
mdadm: /dev/md/fs01_journal has wrong uuid.
mdadm: cannot open device /dev/sdg: Device or resource busy
mdadm: /dev/sdg has wrong uuid.
mdadm: cannot open device /dev/sdd: Device or resource busy
mdadm: /dev/sdd has wrong uuid.
mdadm: cannot open device /dev/sdb: Device or resource busy
mdadm: /dev/sdb has wrong uuid.
mdadm: cannot open device /dev/sda2: Device or resource busy
mdadm: /dev/sda2 has wrong uuid.
mdadm: cannot open device /dev/sda1: Device or resource busy
mdadm: /dev/sda1 has wrong uuid.
mdadm: cannot open device /dev/sda: Device or resource busy
mdadm: /dev/sda has wrong uuid.
mdadm: /dev/sde is identified as a member of /dev/md/fs01, slot 2.
mdadm: /dev/sdc is identified as a member of /dev/md/fs01, slot 0.
mdadm: /dev/sdf is identified as a member of /dev/md/fs01, slot 1.
mdadm: added /dev/sdc to /dev/md/fs01 as 0
mdadm: added /dev/sde to /dev/md/fs01 as 2
mdadm: added /dev/sdf to /dev/md/fs01 as 1
mdadm: /dev/md/fs01 assembled from 1 drive - not enough to start the array.


# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md125 : inactive sdf[1](S) sde[3](S) sdc[0](S)
3179280 blocks super 1.2

md126 : active raid1 sdg[0] sdb[2] sdd[1]
265172 blocks super 1.2 [3/3] [UUU]
bitmap: 0/3 pages [0KB], 64KB chunk

unused devices: <none>


md126 is the ext3 journal for the filesystem
Below is mdadm info on all the devices in the array

# mdadm -E /dev/sdc
/dev/sdc:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
Name : dm01:125 (local to host dm01)
Creation Time : Tue Jan 4 04:45:50 2011
Raid Level : raid5
Raid Devices : 3

Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
Array Size : 4238848 (2.02 GiB 2.17 GB)
Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : active
Device UUID : a20adb76:af00f276:5be79a36:b4ff3a8b

Internal Bitmap : 2 sectors from superblock
Update Time : Tue Jan 4 22:44:20 2011
Checksum : 350c988f - correct
Events : 1150

Layout : left-symmetric
Chunk Size : 64K

Device Role : Active device 0
Array State : AA. ('A' == active, '.' == missing)

# mdadm -X /dev/sdc
Filename : /dev/sdc
Magic : 6d746962
Version : 4
UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
Events : 1150
Events Cleared : 1144
State : OK
Chunksize : 64 KB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
Bitmap : 16558 bits (chunks), 93 dirty (0.6%)

# mdadm -E /dev/sdf
/dev/sdf:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
Name : dm01:125 (local to host dm01)
Creation Time : Tue Jan 4 04:45:50 2011
Raid Level : raid5
Raid Devices : 3

Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
Array Size : 4238848 (2.02 GiB 2.17 GB)
Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : f9205ace:0796ecf5:2cca363c:c2873816

Internal Bitmap : 2 sectors from superblock
Update Time : Tue Jan 4 23:00:49 2011
Checksum : 9c20ba71 - correct
Events : 3062

Layout : left-symmetric
Chunk Size : 64K

Device Role : Active device 1
Array State : .A. ('A' == active, '.' == missing)

# mdadm -X /dev/sdf
Filename : /dev/sdf
Magic : 6d746962
Version : 4
UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
Events : 3062
Events Cleared : 1144
State : OK
Chunksize : 64 KB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
Bitmap : 16558 bits (chunks), 150 dirty (0.9%)

# mdadm -E /dev/sde
/dev/sde:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
Name : dm01:125 (local to host dm01)
Creation Time : Tue Jan 4 04:45:50 2011
Raid Level : raid5
Raid Devices : 3

Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
Array Size : 4238848 (2.02 GiB 2.17 GB)
Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : active
Device UUID : 7f90958d:22de5c08:88750ecb:5f376058

Internal Bitmap : 2 sectors from superblock
Update Time : Tue Jan 4 22:43:53 2011
Checksum : 3ecec198 - correct
Events : 1144

Layout : left-symmetric
Chunk Size : 64K

Device Role : Active device 2
Array State : AAA ('A' == active, '.' == missing)

# mdadm -X /dev/sde
Filename : /dev/sde
Magic : 6d746962
Version : 4
UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
Events : 1144
Events Cleared : 1143
State : OK
Chunksize : 64 KB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
Bitmap : 16558 bits (chunks), 38 dirty (0.2%)








Re: filesystem corruption

on 05.01.2011 08:02:53 by CoolCold

On Mon, Jan 3, 2011 at 6:16 AM, Neil Brown wrote:
> On Sun, 02 Jan 2011 18:58:34 -0700 "Patrick H."
> wrote:
>
> [...]
>
> When md needs to update a block in your 3-drive RAID5, it will read the other
> block in the same stripe (if that isn't in the cache or being written at the
> same time) and then write out the data block (or blocks) and the newly
> computed parity block.
>
> If you crash after one of those writes has completed, but before all of the
> writes have completed, then the parity block will not match the data blocks
> on disk.
Am I understanding right that, in the case of a hardware controller with a
BBU, data and parity are going to be written properly (for locally connected
drives, of course) even in case of power loss, and that this is the only
feature which hardware raid controllers can do and softraid can't?
(Well, except some nice features like MaxIQ, the SSD cache on Adaptec
controllers, and the overall write performance gain from the RAM/BBU.)

--
Best regards,
[COOLCOLD-RIPN]

Re: filesystem corruption

on 05.01.2011 15:28:11 by linux-raid

Sent: Wed Jan 05 2011 00:00:48 GMT-0700 (Mountain Standard Time)
From: CoolCold
To: Neil Brown "Patrick H." ,
linux-raid@vger.kernel.org
Subject: Re: filesystem corruption
>
> Am I understanding right, that in case of hardware controller with
> bbu, data and parity gonna be written properly ( for locally
> connected drives of course ) even in case of powerloss and this is
> the only feature which hardware raid controllers can do and softraid
> can't ? (well, except some nice features like maxiq - cache on ssd for
> adaptec controllers and overall write performance expansion because of
> ram/bbu)
>
>
No, my drives are battery backed as well.

Re: filesystem corruption

on 05.01.2011 16:52:04 by Spelic

On 01/05/2011 03:28 PM, Patrick H. wrote:
> No, my drives are battery backed as well.

What drives are they, if I may ask? OCZ SSDs with a supercapacitor, maybe?

Do you know if they will really flush the whole write cache on sudden
power off? I read some vague statements about this for the OCZ drives. At
certain points it seemed like the supercapacitor was only able to
provide the same guarantees as a HDD, that is, no further data loss due
to erase-then-rewrite-32K and flash wear-levelling stuff, but it was not
able to flush the write cache.
Did you try with e.g. a stream of simple database transactions and then
disconnecting the cable suddenly, like this test:
http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/
?
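
(A minimal sketch of that kind of test without a database; the mount point and
record format are made up:)

i=0
while true; do
    echo "record $i" >> /mnt/test/fsync.log
    sync                      # crude stand-in for a per-record fsync()
    echo "acknowledged $i"    # the last number printed is the last write claimed durable
    i=$((i+1))
done
# after the power cut and remount, compare the last "acknowledged" number
# with the last record actually present in /mnt/test/fsync.log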

Thank you

Re: filesystem corruption

on 05.01.2011 16:55:17 by linux-raid

HP DL360-G6. SAS controller with battery-backed write accelerator.
I haven't been focusing on the reliability of the drives, as this is
proof-of-concept testing. If we decide to use it, the drives will be replaced
with 2TB SSD PCIe cards.

-Patrick

Sent: Wed Jan 05 2011 08:52:04 GMT-0700 (Mountain Standard Time)
From: Spelic
To: Patrick H. linux-raid

Subject: Re: filesystem corruption
> On 01/05/2011 03:28 PM, Patrick H. wrote:
>> No, my drives are battery backed as well.
>
> what drives are they, if I can ask? OCZ SSDs with supercapacitor maybe?
>
> Do you know if they will really flush the whole write cache on sudden
> power off? I read smoky sentences about this for the OCZ drives. In
> certain points it seemed like the supercapacitor was only able to
> provide the same guarantees of a HDD, that is, no further data loss
> due to erase-then-rewrite-32K and flash wear levelling stuff, but was
> not able to flush the write cache.
> Did you try with e.g. a stream of simple databases transactions then
> disconnecting the cable suddenly like this test
> http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/
>
> ?
>
> Thank you