File system corruption when setting new size (native/external metadata) after expansion
on 16.02.2011 16:24:53 by adam.kwolek
Hi,
I've run into the array_size change problem again. A while ago I reported that file system corruption occurs during array expansion.
The previous conclusion was that my tests were wrong (arrays too small; on bigger arrays it seemed to be OK).
I can now reproduce it in our lab on quite big arrays (250G) and on my small test arrays as well.

This problem affects array expansion with both native and external metadata.

The problem is that after expansion the file system on the array is corrupted.
The operation that corrupts the file system is the array size change via sysfs (and, for native metadata, during reshape finalization in md).
In the logs I get the message:
VFS: busy inodes on changed media or resized disks

My investigation in md leads to block_dev.c/inode.c and flush_disk() (called from check_disk_size_change()).
I found the above message there (about busy inodes).
My traces show that in __invalidate_inodes(), inode->i_count == 2 and the inode is locked. This made me look at how invalidate_inodes() is called.
Before invalidate_inodes() is called, the cache is cleared by shrink_dcache_sb().
If I comment out shrink_dcache_sb(), the file system seems to be safe. The same holds when invalidate_inodes() is commented out (only shrink_dcache_sb() called).
If I swap the order of the calls (1. invalidate_inodes(), 2. shrink_dcache_sb()) in __invalidate_device() (block_dev.c), there is no corruption either.
Corruption occurs when invalidate_inodes() is called on busy inodes after shrink_dcache_sb(); if it is called earlier, it doesn't matter.

This problem can be reproduced on mounted arrays only. If the array is not mounted, the file system corruption disappears.
This means that the expansion process itself is correct.
In some (rare) cases the inodes on the mounted device are not locked, and then no FS corruption happens.

I'd like to know your opinion.
Do you think the problem is in releasing the cache but not all inodes (the busy case), the order of the calls...
... or is changing the size of a mounted array perhaps a mistake (for the current code)?
BR
Adam
---------------------------------
My test commands:
#create container
mdadm -C /dev/md/imsm0 -amd -e imsm -n 3 /dev/sdb /dev/sdc /dev/sde -R
#create volume
mdadm -C /dev/md/raid5vol_0 -amd -l 5 --chunk 64 --size 104857 -n 3 /dev/sdb /dev/sdc /dev/sde -R
mkfs /dev/md/raid5vol_0
mount /dev/md/raid5vol_0 /mnt/vol
#copy some files from current directory
cp * /mnt/vol
#add spare
mdadm --add /dev/md/imsm0 /dev/sdd
#start reshape
mdadm --grow /dev/md/imsm0 --raid-devices 4 --backup-file=/backup.bak
-------------------------------------------------------------
Intel Technology Poland Sp. z o.o.
email: adam.kwolek@intel.com
phone: +48 58 766 1773
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: File system corruption when setting new size (native/external metadata) after expansion
on 17.02.2011 05:38:15 by NeilBrown
On Wed, 16 Feb 2011 15:24:53 +0000 "Kwolek, Adam" wrote:
> [...]
Thanks for the script - that makes things a lot easier.
I had to add some "mdadm --wait" commands etc to make it work reliably, but
then I could get it to generate filesystem corruption quite easily.
I also had "mdmon" crashing because imsm_num_data_members got a 'map' of NULL,
and then called get_imsm_raid_level() on it. Something needs to be fixed
there (I hacked it so that if it got NULL, it tried again with second_map
== 0, but I don't know if that is right).
The problem seems to be in the VFS/VM layer. I can see two ways to fix it.
I'll see if I can find someone who wants to take responsibility and make a
decision.
1/ in invalidate_inodes (in fs/inode.c), inside the loop, add a clause:
if (inode->i_state & I_DIRTY)
continue;
That is what is causing the problem - dirty inodes are being discarded.
2/ in check_disk_size_change (in fs/block_dev.c), replace
flush_disk(bdev);
with
if (disk_size == 0 || bdev_size == 0)
flush_disk(bdev);
because I don't think there is any justification for calling flush_disk if
we are just changing the size of the disk - though I don't really know why
it is there at all, so it is hard to say for sure. Certainly if
disk_size > bdev_size we shouldn't be calling flush_disk.
So I suggest you make one of those changes to continue with your testing, and
I'll try to get something sorted out.
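For clarity, the behaviour of the two candidate changes can be modelled in plain userspace C. The names mirror the kernel functions, but everything below is an illustrative sketch under simplified assumptions, not the actual fs/inode.c or fs/block_dev.c code (in particular, the real I_DIRTY is a combination of several state bits):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag value; not the kernel's definition. */
#define I_DIRTY 0x1

struct inode { unsigned i_state; bool discarded; };

/* Fix 1: inside the invalidate_inodes() loop, skip dirty inodes
 * instead of discarding them. */
static void invalidate_inodes_model(struct inode *in, int n)
{
	int i;
	for (i = 0; i < n; i++) {
		if (in[i].i_state & I_DIRTY)
			continue;	/* the proposed clause */
		in[i].discarded = true;
	}
}

/* Fix 2: check_disk_size_change() would only call flush_disk() when
 * capacity appears or disappears entirely (media change), not on an
 * ordinary grow such as a reshape. */
static bool should_flush(long long disk_size, long long bdev_size)
{
	return disk_size == 0 || bdev_size == 0;
}
```

With fix 1 a dirty inode survives invalidation; with fix 2, growing a live array (both sizes non-zero) never reaches the flush path at all.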
If you could look into the circumstances that might make imsm_num_data_members
get a NULL map, I would appreciate that.
Thanks,
NeilBrown
RE: File system corruption when setting new size (native/external metadata) after expansion
on 17.02.2011 09:45:36 by adam.kwolek
> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of NeilBrown
> Sent: Thursday, February 17, 2011 5:38 AM
> To: Kwolek, Adam
> Cc: linux-raid@vger.kernel.org; Williams, Dan J; Ciechanowski, Ed;
> Neubauer, Wojciech
> Subject: Re: File system corruption when setting new size
> (native/external metadata) after expansion
>
> [...]
Thank you for the workarounds/temporary fixes.
Regarding imsm_num_data_members(): setting second_map to 0 cannot help, as it is always called with this parameter set to 0.
In this situation, when the first map should always be present, we may have some race condition.
I cannot reproduce the mdmon crash you are observing, but I'll try some changes in my scripts and I'll watch carefully for any signs
that could indicate reproduction of this problem.
If you could let me know the details of the changes you made to my scenario, it would help.
Thanks again
Adam
Re: File system corruption when setting new size (native/external metadata) after expansion
on 17.02.2011 11:03:39 by NeilBrown
On Thu, 17 Feb 2011 08:45:36 +0000 "Kwolek, Adam" wrote:
> [...]
>
> If you could let me know details about changes you made to my scenario, it could help.
>
This is the script I was using:
----------------------------------------------
export IMSM_NO_PLATFORM=1
export IMSM_DEVNAME_AS_SERIAL=1
export MDADM_EXPERIMENTAL=1
umount /mnt/vol
mdadm -Ss
rm -f /backup.bak
#create container
mdadm -C /dev/md/imsm0 -amd -e imsm -n 3 /dev/sda /dev/sdb /dev/sdc -R
#create volume
mdadm -C /dev/md/raid5vol_0 -amd -l 5 --chunk 64 --size 104857 -n 3 /dev/sda /dev/sdb /dev/sdc -R
mkfs /dev/md/raid5vol_0
mount /dev/md/raid5vol_0 /mnt/vol
#copy some files from current directory
cp * /mnt/vol
#add spare
mdadm --add /dev/md/imsm0 /dev/sdd
mdadm --wait /dev/md/raid5vol_0
#start reshape
mdadm --grow /dev/md/imsm0 --raid-devices 4 --backup-file=/backup.bak
#mdadm --wait /dev/md/raid5vol_0
sleep 10
while grep reshape /proc/mdstat > /dev/null
do sleep 1
done
while ps axgu | grep 'md[a]dm' > /dev/null
do sleep 1
done
umount /mnt/vol
fsck -f -n /dev/md/raid5vol_0
-------------------------------------------------
I did have an 'mdadm --wait' where the 'while grep reshape' is. I changed it
because it seemed to be causing problems, but I may have been wrong about the
cause.
This would fairly reliably result in mdmon dying.
This is the patch I applied
-------------------------------------
diff --git a/super-intel.c b/super-intel.c
index 5d39d5b..fa195c3 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -1600,6 +1600,7 @@ static __u8 imsm_num_data_members(struct imsm_dev *dev, int second_map)
*/
struct imsm_map *map = get_imsm_map(dev, second_map);
+ if (map == NULL) map = get_imsm_map(dev, 0);
switch (get_imsm_raid_level(map)) {
case 0:
case 1:
-----------------------------------
This was on an oldish source tree (commit 152b223157), so maybe it is already
fixed.
But without that patch it crashed often, and with it it didn't crash at all.
NeilBrown
RE: File system corruption when setting new size (native/external metadata) after expansion
on 17.02.2011 11:27:05 by adam.kwolek
> -----Original Message-----
> From: NeilBrown [mailto:neilb@suse.de]
> Sent: Thursday, February 17, 2011 11:04 AM
> To: Kwolek, Adam
> Cc: linux-raid@vger.kernel.org; Williams, Dan J; Ciechanowski, Ed;
> Neubauer, Wojciech
> Subject: Re: File system corruption when setting new size
> (native/external metadata) after expansion
>
> [...]
Thank you for the information.
I saw this problem on older mdadm, and I think it is fixed now.
The problem was in set_array_state(), where the second_map parameter was set to -1, and the get_imsm_map() implementation caused the problem (NULL can be returned for second_map == -1).
get_imsm_map() is fixed now (patch: 'imsm: FIX: crash during getting map' /2011-02-03/).
'-1' as the second map was later changed to 0 by Anna in the 'fix: imsm: assemble doesn't restart recovery' /2011-02-13/ patch.
At this moment second_map is never set to '-1'.
Our lab doesn't report mdmon crashes on the latest mdadm code either.
Anyway, during my tests I'll keep an eye out for mdmon cores (let's say, more carefully than usual ;)).
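A minimal userspace model of the crash path and fallback discussed in this thread may make it concrete. The structs and the map-selector convention are illustrative stand-ins for mdadm's super-intel.c types, not the real definitions:

```c
#include <assert.h>
#include <stddef.h>

struct imsm_map { int raid_level; };

struct imsm_dev {
	struct imsm_map map0;		/* first map: always present */
	struct imsm_map *second_map;	/* only present during migration */
};

/* Model of get_imsm_map(): a non-zero selector asks for the
 * migration-time second map, which may not exist - the NULL that
 * crashed imsm_num_data_members(). */
static struct imsm_map *get_imsm_map_model(struct imsm_dev *dev, int second)
{
	if (second != 0)
		return dev->second_map;	/* may be NULL */
	return &dev->map0;
}

/* The workaround: fall back to the first map when the requested map
 * is absent, so the RAID level lookup never dereferences NULL. */
static int raid_level_model(struct imsm_dev *dev, int second)
{
	struct imsm_map *map = get_imsm_map_model(dev, second);
	if (map == NULL)
		map = get_imsm_map_model(dev, 0);
	return map->raid_level;
}
```

When no migration is active, asking for the second map falls back to the always-present first map instead of crashing.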
BR
Adam