Re: Repeatable md OOPS on suspend, 2.6.39.4 and 3.0.3

on 15.09.2011 01:32:10 by Nigel Cunningham

Hi.

Please try/review the attached patch.

The problem is that TuxOnIce adds a BUG_ON() to catch non-TuxOnIce I/O
during hibernation. The check is meant to stop on-disk data from being
corrupted by writes of data that may already have been overwritten by
the atomic copy.

Stopping the md devices from being marked readonly is the right thing to
do - if we don't resume, we want recovery to be run. If we do resume,
they should still be in the pre-hibernate state.

Regards,

Nigel

Attachment: md-reboot-mark-readonly.patch

diff --git a/drivers/md/md.c b/drivers/md/md.c
index af0e52c..25af0a8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8056,7 +8056,7 @@ static int md_notify_reboot(struct notifier_block *this,
 	struct list_head *tmp;
 	mddev_t *mddev;

-	if ((code == SYS_DOWN) || (code == SYS_HALT) || (code == SYS_POWER_OFF)) {
+	if (((code == SYS_DOWN) || (code == SYS_HALT) || (code == SYS_POWER_OFF)) && !freezer_state) {

 		printk(KERN_INFO "md: stopping all md devices.\n");

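For readers who want to see the patched condition in isolation, here is a minimal userspace sketch. It is not kernel code: the SYS_* constants are local stand-ins, md_should_stop_arrays() is a hypothetical helper name, and freezer_state is assumed to be nonzero while TuxOnIce has the system frozen for hibernation.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the kernel's reboot notifier codes (illustrative values). */
enum { SYS_DOWN = 0x0001, SYS_HALT = 0x0002, SYS_POWER_OFF = 0x0003 };

/*
 * Hypothetical helper mirroring the patched test in md_notify_reboot():
 * stop the arrays (and mark them readonly) only on a real shutdown, not
 * when the reboot notifier fires during hibernation.
 * Assumption: freezer_state is nonzero while the freezer is active.
 */
static bool md_should_stop_arrays(unsigned long code, int freezer_state)
{
	bool shutting_down = (code == SYS_DOWN) ||
			     (code == SYS_HALT) ||
			     (code == SYS_POWER_OFF);

	return shutting_down && !freezer_state;
}

/* Small self-check covering both the shutdown and the hibernation path. */
static void md_should_stop_arrays_selftest(void)
{
	assert(md_should_stop_arrays(SYS_POWER_OFF, 0));  /* normal power-off */
	assert(!md_should_stop_arrays(SYS_POWER_OFF, 1)); /* hibernating */
}
```

With the freezer active the helper returns false, so the "stopping all md devices" path is skipped and no md write is issued that could trip the BUG_ON().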


Re: Repeatable md OOPS on suspend, 2.6.39.4 and 3.0.3

on 15.09.2011 06:18:08 by Nigel Cunningham

Hi.

On 15/09/11 13:31, NeilBrown wrote:
> On Thu, 15 Sep 2011 09:32:10 +1000 Nigel Cunningham wrote:
>
>> Hi.
>>
>> Please try/review the attached patch.
>>
>> The problem is that TuxOnIce adds a BUG_ON() to catch non-TuxOnIce I/O
>> during hibernation. The check is meant to stop on-disk data from being
>> corrupted by writes of data that may already have been overwritten by
>> the atomic copy.
>>
>> Stopping the md devices from being marked readonly is the right thing to
>> do - if we don't resume, we want recovery to be run. If we do resume,
>> they should still be in the pre-hibernate state.
>>
>> Regards,
>>
>> Nigel
>
> This doesn't feel like the right approach to me.
>
> I think the 'md' device *should* be marked 'clean' when it is clean to
> avoid unnecessary resyncs.

I must be missing something. In RAID terminology, what does 'clean'
mean? Googling gives me lots of references to flyspray :) I thought it
meant the filesystems contained therein were cleanly unmounted (which
they aren't in this case). Does it just mean 'cleanly shut down'?

> It would almost certainly make sense to have a way to tell md 'hibernate
> wrote to your device so things might have changed - you should check'.
> Then md could look at the metadata and refresh any in-memory information
> such as device failures and event counts.
> After all if a device fails while writing out the hibernation image, we want
> the hibernation to succeed (I assume) and we want md to know that the device
> is failed when it wakes back up, and currently it won't. So we really need
> that notification anyway.

Now that I understand and agree with.

Regards,

Nigel
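
NeilBrown's suggestion above, that md be told "hibernate wrote to your device" and then refresh its in-memory information from the metadata, could be sketched roughly as follows. This is only an illustration under stated assumptions: struct array_state and refresh_after_hibernate() are hypothetical names, not md's real data structures or API, and the on-disk metadata is reduced to an event count and per-device failure flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_DEVS 4

/* Hypothetical, heavily simplified view of an array's state; not md's real structures. */
struct array_state {
	uint64_t events;            /* event count */
	bool     failed[MAX_DEVS];  /* per-device failure flags */
};

/*
 * Hypothetical post-resume hook: compare the state held in memory (restored
 * from the hibernation image) with what the on-disk metadata says now, and
 * pull in anything newer. Returns true if the in-memory state was stale.
 */
static bool refresh_after_hibernate(struct array_state *mem,
				    const struct array_state *disk)
{
	bool changed = false;

	/* A higher on-disk event count means metadata was updated after the image was taken. */
	if (disk->events > mem->events) {
		mem->events = disk->events;
		changed = true;
	}

	/* A device that failed while the image was being written must stay failed. */
	for (int i = 0; i < MAX_DEVS; i++) {
		if (disk->failed[i] && !mem->failed[i]) {
			mem->failed[i] = true;
			changed = true;
		}
	}

	return changed;
}
```

The point of the sketch is only the direction of the update: after resume, the on-disk metadata is treated as the newer source, so a device failure recorded during the image write is not lost.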