Why does MD overwrite the superblock upon temporary disconnect?
on 21.09.2010 04:49:48 by Jim Schatzman
I have seen quite a number of people writing about what happens to their SATA RAID5 arrays when several drives get accidentally unplugged. Especially with port multipliers and external drive boxes with eSata cables, this can happen rather easily.
In my case, an 8-drive RAID6 array had this happen to it. Even though the array was not in active use at the time, MD immediately marked the 4 temporarily-disconnected drives as "Spare". No combination of "assemble" options seems able to fix this. 4 drives still have "active slot N" status; the other 4 are "spare", wiping out the slot metadata. Apparently, you have to use "create" to recreate the metadata, marking two slots as "missing" (for RAID 6); check the resulting RAID data; then add the "missing" drives back in. Presumably, you should write down the slot numbers or this may be difficult.
This procedure works, but for a 12 TB array the resyncs take a long time: one second of cable disconnect costs two days of resync.
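For reference, a sketch of that recipe. Every detail here (device names, slot order, chunk size, metadata version, which slots are "missing") is illustrative and must match the original array exactly, or the data will be scrambled:

    mdadm --create /dev/md0 --level=6 --raid-devices=8 --metadata=1.2 --chunk=512 \
          /dev/sdb /dev/sdc /dev/sdd /dev/sde missing missing /dev/sdh /dev/sdi
    # mount read-only and verify the data before going further, then:
    mdadm /dev/md0 --add /dev/sdf /dev/sdg

Because two RAID6 slots are "missing", the created array is doubly degraded and no initial sync runs; the resync only starts once the two remaining drives are added back.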
I have some questions-
1) When MD detects that so many drives are offline that the RAID can't function, why not just put the RAID in "stop" state and avoid changing any metadata? Yes, I know that there is a risk of data corruption, but isn't minor data corruption often better than total data loss?
2) Couldn't there be a way to put the RAID back together tentatively (with all 8 drives), check the parity, and go with the result if the parity is o.k.? That would save some time as compared to resyncing two disks from scratch.
3) In my case, I am more concerned about catastrophic data loss than 100% up time. I want to have the raids assembled with "--no-degraded". Is it possible to tell the kernel to do this, or would it be better to specify "AUTO -1.x" in mdadm.conf and use a cron.reboot script to start the RAID?
4) Also, would "--no-degraded" help prevent MD from overwriting the metadata (switching drives to "spare" state)?
Thanks!
Jim
Re: Why does MD overwrite the superblock upon temporary disconnect?
on 21.09.2010 05:34:54 by Richard Scobie
Jim Schatzman wrote:
> I have seen quite a number of people writing about what happens to their SATA RAID5 arrays when several drives get accidentally unplugged. Especially with port multipliers and external drive boxes with eSata cables, this can happen rather easily.
In this scenario, if you are using md UUIDs in your mdadm.conf,
"mdadm -A --force /dev/mdX" has worked fine for me on a number of
occasions.
Quoting the man page:
"An array which requires --force to be started may contain data
corruption. Use it carefully."
Therefore a fsck is a good idea afterwards.
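For concreteness, a minimal sketch of that setup. The UUID is a placeholder, and the fsck assumes the filesystem sits directly on /dev/md0:

    # /etc/mdadm.conf
    ARRAY /dev/md0 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx

    # then:
    mdadm -A --force /dev/md0
    fsck -n /dev/md0        # read-only check first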
Regards,
Richard
Re: Why does MD overwrite the superblock upon temporary disconnect?
on 21.09.2010 06:09:14 by NeilBrown
On Mon, 20 Sep 2010 20:49:48 -0600
Jim Schatzman wrote:
> I have seen quite a number of people writing about what happens to their SATA RAID5 arrays when several drives get accidentally unplugged. Especially with port multipliers and external drive boxes with eSata cables, this can happen rather easily.
>
> In my case, an 8-drive RAID6 array had this happen to it. Even though the array was not in active use at the time, MD immediately marked the 4 temporarily-disconnected drives as "Spare". No combination of "assemble" options seems able to fix this. 4 drives still have "active slot N" status; the other 4 are "spare", wiping out the slot metadata. Apparently, you have to use "create" to recreate the metadata, marking two slots as "missing" (for RAID 6); check the resulting RAID data; then add the "missing" drives back in. Presumably, you should write down the slot numbers or this may be difficult.
>
> This procedure works, but for a 12 TB array, the resyncs take a long time. 1 second of cable disconnect for 2 days of resync.
>
> I have some questions-
>
> 1) When MD detects that so many drives are offline that the RAID can't function, why not just put the RAID in "stop" state and avoid changing any metadata? Yes, I know that there is a risk of data corruption, but isn't minor data corruption often better than total data loss?
This is essentially what MD does, too. It marks the devices as having failed
but otherwise doesn't change the metadata. I've occasionally thought about
leaving the metadata alone once enough devices have failed that the array
cannot work, but I'm not sure it would really gain anything.
The 'destruction' of the metadata happens later, not at the time of device
failure.
>
> 2) Couldn't there be a way to put the RAID back together tentatively (with all 8 drives), check the parity, and go with the result if the parity is o.k.? That would save some time as compared to resyncing two disks from scratch.
>
How is 'check the parity' different from 'resync two disks from scratch' ??
Both require reading every block on every disk.
When you create a new RAID6, md does just check the parity - if all the
parity is correct it won't write anything to the devices.
If you believe the parity to be correct, you can create the array with
'--assume-clean'. If you were wrong and you lose a device, then you could
get data corruption though.
And the '2 days' of resync isn't two wasted days. You can still use the
array. If you tune sync_speed_min right down it should only resync while
nothing else is happening, so you shouldn't notice.
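For concreteness, these are the knobs involved (values are in KiB/s; the per-array path assumes /dev/md0):

    # system-wide:
    echo 100 > /proc/sys/dev/raid/speed_limit_min
    # or per array:
    echo 100 > /sys/block/md0/md/sync_speed_min

With the minimum set low, md throttles the resync heavily whenever the array is seeing other I/O.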
> 3) In my case, I am more concerned about catastrophic data loss than 100% up time. I want to have the raids assembled with "--no-degraded". Is it possible to tell the kernel to do this, or would it be better to specify "AUTO -1.x" in mdadm.conf and use a cron.reboot script to start the RAID?
The kernel doesn't auto-assemble 1.x metadata arrays.
Presumably your arrays are being assembled by the initrd - possibly you could
put the --no-degraded flag in the initrd somewhere.
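One possible arrangement, along the lines you suggest (illustrative only; the details depend on your distro's initrd):

    # /etc/mdadm.conf - stop auto/incremental assembly of 1.x arrays:
    AUTO -1.x

    # then assemble from a boot-time script (e.g. a cron @reboot job),
    # refusing to start degraded arrays:
    mdadm --assemble --scan --no-degraded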
>
> 4) Also, would "--no-degraded" help prevent MD from overwriting the metadata (switching drives to "spare" state)?
>
So here is the crux of the matter - what is over-writing the metadata and
converting the devices to spares? So far: I don't know.
I have tried to reproduce this and cannot. When I assemble the array with
"mdadm --assemble" or "mdadm --incremental" the superblocks stay unchanged.
If I remove a device from an incompletely assembled array and try to re-add it,
that just fails; the superblock is still fine.
If I assemble with '--force' it does the best it can and creates an array with
two missing devices. The superblocks on the others are unchanged... until I
add them of course, then it has to do a rebuild.
If I create the array with an internal bitmap and do all the same, then they
get re-added quite quickly as you would expect.
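(As an aside, a write-intent bitmap can be added to an existing array after the fact; assuming /dev/md0:

    mdadm --grow /dev/md0 --bitmap=internal

With one in place, a briefly-disconnected member only needs a quick catch-up on re-add rather than a full rebuild.)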
So while I'm sure you have a problem, I cannot see how it could happen. I
must be missing something important.
If anyone is able to reproduce this (on a test-array presumably) I would love
to see details.
Just for completeness: what distro are you using, what kernel version, and
what mdadm version?
Thanks,
NeilBrown
Re: Why does MD overwrite the superblock upon temporary disconnect?
on 21.09.2010 15:16:23 by Jim Schatzman
Neil and Richard-
Thanks for your responses. My environment:
OS: Linux l1.fu-lab.com 2.6.34.6-47.fc13.i686.PAE #1 SMP Fri Aug 27 09:29:49 UTC 2010 i686 i686 i386 GNU/Linux
MDADM: mdadm - v3.1.2 - 10th March 2010
SATA controller: SiI 3124 PCI-X Serial ATA Controller
Drive cages: 8 drive chassis with 4x port multipliers.
More details: I tried reassembling the array with "mdadm -A --force /dev/mdX"
and also by specifying all the devices explicitly. I tried this multiple times. It did not work. A couple of things happened:
a) mdadm always reported that there weren't enough drives to start the array
b) about 75% of the time, it would complain that one of the drives was busy, so that the result was 4 active; 3 spare
c) there was no reason that I could see why it would report one busy drive - the drive wasn't part of another array, mounted separately, bad, or marked anything other than "spare".
I had no trouble copying data from the "busy" drive with dd.
As I originally reported, I could not get "assemble" to work, with the above symptoms.
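The commands were roughly of this form (device names are illustrative). One thing worth noting: a member reported as "busy" can still be held by a half-assembled array, and stopping that array first releases it:

    mdadm --stop /dev/md0                    # release any half-assembled array
    mdadm -A --force /dev/md0 /dev/sd[b-i]   # explicit member list
    cat /proc/mdstat                         # inspect the result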
Also, I noticed that the "events" counter was messed up on the "spare" drives. The 4 "active" drives had values of 90, the spare drives had varying events values - most were 0 but as I recall one had a value around 30 or so.
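Roughly, the per-device superblock state can be captured like this before attempting any repair (device names illustrative; exact field names vary with the metadata version):

    for d in /dev/sd[b-i]; do
        echo "== $d"
        mdadm --examine $d | grep -E 'Events|Role|State'
    done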
I didn't note the counter values and the "spare" state until after I rebooted. The exact process was this
1) Jogged the mouse cable which jogged the eSATA cable.
2) I noticed that the array was inactive and immediately shut the system down.
3) Fixed the cables and rebooted.
4) At this point, I had 4 "active" disks and 4 "spares". I tried reassembling many different ways. Sometimes, mdadm would reduce this to 4 "active" and 3 "spares".
5) No progress with the above at all until I recreated ("mdadm -C") the array with 6 drives, checked the data, added the 2 additional drives, at which point resyncing occurred.
Re: "It marks the devices as having failed
but otherwise doesn't change the metadata.
I've occasionally thought about
leaving the metadata alone once enough devices have failed that the array
cannot work, but I'm not sure it would really gain anything."
My response: The problem with what MD does now (overwriting the metadata) is that it loses track of the slot numbers, and also apparently will not allow you to reassemble the drive (maybe based on the events counter??). If it kept the slot numbers around, and allowed you to force "spare" drives to be considered "active", that would be easier to deal with. I think you are saying that this occurred when I rebooted - is that correct?
Re: "The 'destruction' of the metadata happens later, not at the time of device
failure."
My response: So... maybe if I had prevented initrd from trying to start the array when I rebooted, I could have diagnosed the situation and fixed it more easily than by recreating the array. How?
Re: "How is 'check the parity' different from 'resync two disks from scratch' ??
Both require reading every block on every disk."
My response: With RAID6, it appears that MD reads all the data twice - once for each set of parity data. I added the 7th and 8th drives simultaneously, but the resyncing was done one drive at a time (according to mdadm --detail /dev/mdX).
O.k., so this wasn't catastrophic. I was just afraid to stress anything by using the array until the syncing was complete.
Re: "So here is the crux of the matter - what is over-writing the metadata and
converting the devices to spares? So far: I don't know.
I have tried to reproduce this and cannot. "
My response: If I am able to, I will create another, similar array (with no valuable data!) and try this again. Here would be my procedure:
a) Create an 8-drive RAID 6 array.
b) With the array running, unplug half of the array. Observe that the array goes inactive.
c) Reboot the system
If the same behavior repeats itself, I will end up with 4 active drives and 4 spares. It is also possible that connectivity with the 2nd half of the array went on and off several times over the several seconds while I was pulling on the mouse cable - eSata connectors don't seem to be 100% reliable.
Another question:
Do I understand correctly that, if I had added the last 2 drives with "--assume-clean", the resync would have been skipped?
Thanks!
Jim
Re: Why does MD overwrite the superblock upon temporary disconnect?
on 05.10.2010 02:22:49 by NeilBrown
On Tue, 21 Sep 2010 07:16:23 -0600
Jim Schatzman wrote:
> Neil and Richard-
>
> Thanks for your responses. My environment.
>
> OS: Linux l1.fu-lab.com 2.6.34.6-47.fc13.i686.PAE #1 SMP Fri Aug 27 09:29:49 UTC 2010 i686 i686 i386 GNU/Linux
>
> MDADM: mdadm - v3.1.2 - 10th March 2010
>
> SATA controller: SiI 3124 PCI-X Serial ATA Controller
>
> Drive cages: 8 drive chassis with 4x port multipliers.
>
> More details: I tried reassembling the array with mdadm -A --force /dev/mdX
> and also by specifying all the devices explicitly. I tried this multiple times. This did not work. A couple of things happened
>
> a) mdadm always reported that there weren't enough drives to start the array
>
> b) about 75% of the time, it would complain that one of the drives was busy, so that the result was 4 active; 3 spare
>
> c) there was no reason that I could see why it would report one busy drive - the drive wasn't part of another array, mounted separately, bad, or marked anything other than "spare".
> I had no trouble copying data from the "busy" drive with dd.
>
> As I originally reported, I could not get "assemble" to work, with the above symptoms.
>
>
> Also, I noticed that the "events" counter was messed up on the "spare" drives. The 4 "active" drives had values of 90, the spare drives had varying events values - most were 0 but as I recall one had a value around 30 or so.
>
>
> I didn't note the counter values and the "spare" state until after I rebooted. The exact process was this
>
> 1) Jogged the mouse cable which jogged the eSATA cable.
>
> 2) I noticed that the array was inactive and immediate shut the system down.
>
> 3) Fixed the cables and rebooted.
>
> 4) At this point, had 4 "active" disks and 4 "spares". Tried reassembling many different ways. Sometimes, mdadm would reduce this to 4 "active" and 3 "spares".
>
> 5) No progress with the above at all until I recreated ("mdadm -C") the array with 6 drives, checked the data, added the 2 additional drives, at which point resyncing occurred.
>
> Re: "It marks the devices as having failed
> but otherwise doesn't change the metadata.
> I've occasionally thought about
> leaving the metadata alone once enough devices have failed that the array
> cannot work, but I'm not sure it would really gain anything."
>
>
> My response: The problem with what MD does now (overwriting the metadata) is that it loses track of the slot numbers, and also apparently will not allow you to reassemble the drive (maybe based on the events counter??). If it kept the slot numbers around, and allowed you to force "spare" drives to be considered "active", that would be easier to deal with. I think you are saying that this occured when I rebooted - is that correct?
>
>
> Re: "The 'destruction' of the metadata happens later, not at the time of device
> failure."
>
> My response: So... maybe if I had prevented initrd from trying to start the array when I rebooted, I could have diagnosed the situation and fixed it more easily than by recreating the array. How?
I suspect this holds the key - the initrd is doing something, while trying to
assemble the arrays, which causes the metadata to be over-written. I don't
really know what this would be. I will try to make sure that mdadm-3.2
doesn't have any opportunity to do this.
>
>
> Re: "How is 'check the parity' different from 'resync two disks from scratch' ??
> Both require reading every block on every disk."
>
> My response: With RAID6, it appears that MD reads all the data twice - once for each set of parity data. I added the 7th and 8th drives simultaneously, but the resyncing was done one drive at a time (according to mdadm --detail /dev/mdX).
>
Good point. I should get mdadm to freeze recovery while adding devices so
that it won't start one resync before the next device is added. I've added
that to my list for mdadm-3.2.
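In the meantime, a manual workaround might look like this, assuming /dev/md0, hypothetical device names, and a kernel whose md sysfs interface accepts the "frozen" sync_action:

    echo frozen > /sys/block/md0/md/sync_action
    mdadm /dev/md0 --add /dev/sdf /dev/sdg
    echo idle > /sys/block/md0/md/sync_action   # recovery of both starts together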
> O.k., so this wasn't catastrophic. I was just afraid to stress anything by using the array until the syncing was complete.
>
>
> Re: "So here is the crux of the matter - what is over-writing the metadata and
> converting the devices to spares? So far: I don't know.
> I have tried to reproduce this and cannot. "
>
> My response: If I am able to, I will create another, similar array (with no valuable data!) and try this again. Here would be my procedure
> a) Create an 8-drive RAID 6 array.
>
> b) With the array running, unplug half of the array. Observe that the array goes inactive.
>
> c) Reboot the system
>
> If the same behavior repeats itself, I will end up with 4 active drives and 4 spares. It is also possible that connectivity with the 2nd half of the array went on and off several times over the several seconds while I was pulling on the mouse cable - eSata connectors don't seem to be 100% reliable.
>
I don't really have enough devices to try that. I can simulate lots with
loopback devices or partitions, but you cannot really unplug those physically (not
that I like doing physical unplugs - I need to leave my desk for that :-) ).
Maybe I can try something though...
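Something like this loopback rig, perhaps (sizes and paths are arbitrary; failing members via mdadm only approximates a cable pull, but it exercises the assemble-after-restart path):

    for i in 0 1 2 3 4 5 6 7; do
        dd if=/dev/zero of=/tmp/md-test-$i bs=1M count=100
        losetup /dev/loop$i /tmp/md-test-$i
    done
    mdadm --create /dev/md9 --level=6 --raid-devices=8 /dev/loop[0-7]
    # simulate losing half the members; the array should go failed/inactive:
    for i in 4 5 6 7; do mdadm /dev/md9 --fail /dev/loop$i; done
    mdadm --stop /dev/md9
    mdadm --examine /dev/loop[0-7] | grep -E 'Events|Role|State'
    mdadm -A /dev/md9 /dev/loop[0-7]     # do the superblocks survive assembly?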
>
> Another question
>
> Do I understand correctly, that if I had added the last 2 drives with "--assume-clean", would the resync have been skipped?
No. --assume-clean is only an option for --create, not for --add or --re-add.
Maybe I could make it an option for --re-add... Not sure if it might be too
dangerous though.
NeilBrown
>
>
> Thanks!
>
> Jim
>