RAID5 write hole?

on 26.06.2010 16:31:25 by Shaochun Wang

Hi:

Recently I heard of the so-called "write hole" problem of RAID5 in
Linux software RAID. I use an ext4 filesystem on my NAS, which assembles
its data disks using Linux software RAID. So I wonder how safe my
system is.

If the "write hole" is inevitable, will it result in corruption of
the ext4 filesystem?

--
Shaochun Wang

Jabber: fungusw@jabber.org
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: RAID5 write hole?

on 26.06.2010 17:42:56 by Mikael Abrahamsson

On Sat, 26 Jun 2010, Shaochun Wang wrote:

> Hi:
>
> Recently I heard of the so-called "write hole" problem of RAID5 in
> Linux software RAID. I use an ext4 filesystem on my NAS, which assembles
> its data disks using Linux software RAID. So I wonder how safe my
> system is.

RAID is never a replacement for backups: corruption can happen at multiple
levels in your system for different reasons. Non-ECC memory can suffer bit
flips that corrupt your data, the write hole can cause data corruption, etc.

Generally, unless you have really really high demands on data integrity,
this is not a major problem.

Ext4 has other potential software/fs interactions when it comes to data
integrity, in that it buffers writes for quite some time, so even if you
think your file is saved, it might take many seconds before it's actually
on disk if your software doesn't fsync() it. Most software doesn't, because
fsync() took so long on Ext3.
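
For illustration, here is a minimal Python sketch of what "fsync() it" means
for an application. The helper name durable_write is made up for this
example, and the directory fsync at the end assumes a POSIX system such as
Linux. It only asks the kernel to flush; it says nothing about what the
RAID layer below does with the write.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and ask the kernel to push it to stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to flush the file to the device
    # Assumption: POSIX. Also fsync the parent directory so that the
    # directory entry for a newly created file is flushed as well.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    # Hypothetical usage: write a small file into a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "important.dat")
        durable_write(target, b"saved for real\n")
        print(os.path.getsize(target), "bytes written and fsync()ed")
```

Without the flush() and fsync() calls the data may sit in buffers for some
time after write() returns, which is the delay described above.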

So generally, don't worry too much, but make sure you have backups for
your important data.

--
Mikael Abrahamsson email: swmike@swm.pp.se

Re: RAID5 write hole?

on 26.06.2010 23:28:59 by Shaochun Wang

Maybe Sun's ZFS is the ultimate choice!

--
Shaochun Wang

Jabber: fungusw@jabber.org

Re: RAID5 write hole?

on 27.06.2010 12:33:49 by John Hendrikx

Shaochun Wang wrote:
> Hi:
>
> Recently I heard of the so-called "write hole" problem of RAID5 in
> Linux software RAID. I use an ext4 filesystem on my NAS, which assembles
> its data disks using Linux software RAID. So I wonder how safe my
> system is.
>
> If the "write hole" is inevitable, will it result in corruption of
> the ext4 filesystem?
The write hole occurs if your system crashes during a write operation,
where one block of a stripe (data or parity) gets updated but the other
corresponding block does not. This could lead to parity information not
matching the corresponding data.
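
A toy Python sketch may make this concrete. It models one stripe of a
three-disk RAID5 as two data blocks plus their XOR parity; the block
contents and names are invented for the example and have nothing to do with
the md on-disk format.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte, as RAID5 parity does."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a 3-disk array: two data blocks plus their parity block.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)

# A new d0 reaches its disk, then the system crashes before the matching
# parity update is written: the stripe is now inconsistent.
d0 = b"CCCC"
parity_is_stale = xor_blocks(d0, d1) != parity

# If the disk holding d1 fails before parity is repaired, d1 has to be
# rebuilt from d0 and the stale parity. The result is wrong, even though
# d1 itself was never being written.
rebuilt_d1 = xor_blocks(d0, parity)

print("parity stale:", parity_is_stale)
print("rebuilt d1 matches real d1:", rebuilt_d1 == d1)
```

As long as all disks are present the data blocks themselves are still
readable; the damage only shows up once a block has to be reconstructed
from the mismatched parity.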

If the raid 5 system at least ensures that the data stripe is always
written before parity, then the monthly resync check that mdadm does
should be able to detect this and write new parity information.

At least this way the bad parity does not lurk around forever on your
raid system, causing numerous problems when a disk finally fails.

The write hole is not inevitable, but would require some special
measures at the raid level which could affect performance. And as with
any corruption, it could definitely corrupt your filesystem.

--John


Re: RAID5 write hole?

on 27.06.2010 14:16:13 by NeilBrown

On Sun, 27 Jun 2010 12:33:49 +0200
John Hendrikx wrote:

> Shaochun Wang wrote:
> > Hi:
> >
> > Recently I heard of the so-called "write hole" problem of RAID5 in
> > Linux software RAID. I use an ext4 filesystem on my NAS, which assembles
> > its data disks using Linux software RAID. So I wonder how safe my
> > system is.
> >
> > If the "write hole" is inevitable, will it result in corruption of
> > the ext4 filesystem?
> The write hole occurs if your system crashes during a write operation,
> where one block of a stripe (data or parity) gets updated but the other
> corresponding block does not. This could lead to parity information not
> matching the corresponding data.

Correct.

>
> If the raid 5 system at least ensures that the data stripe is always
> written before parity, then the monthly resync check that mdadm does
> should be able to detect this and write new parity information.

This bit isn't so correct.
When the RAID5 is next assembled after the crash, if all devices are present
(i.e. the array is not degraded) then it will check and correct all the
parity blocks immediately. If you have a write-intent-bitmap configured,
this will be quite quick. If not it could take hours.

Once the resync has completed you are safe again, any risk from the "write
hole" will have disappeared.
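
As a rough illustration of why the resync closes the hole, and why a
write-intent bitmap makes it quick, here is a toy Python model. The stripe
layout and the "dirty" set are simplifications invented for this sketch; the
real md bitmap tracks regions of the array rather than individual stripes.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length blocks byte by byte (RAID5 parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def resync(stripes, dirty=None):
    """Recompute parity from the data blocks and correct any mismatch.

    stripes: list of {"data": [bytes, ...], "parity": bytes}
    dirty:   optional set of stripe indices that may have been mid-write
             (the role a write-intent bitmap plays); None means check all.
    Returns the indices whose parity had to be corrected.
    """
    indices = range(len(stripes)) if dirty is None else sorted(dirty)
    repaired = []
    for i in indices:
        expected = xor_blocks(stripes[i]["data"])
        if stripes[i]["parity"] != expected:
            stripes[i]["parity"] = expected
            repaired.append(i)
    return repaired


# Three consistent stripes, then an interrupted write leaves stripe 1 stale.
stripes = [{"data": [b"AA", b"BB"], "parity": xor_blocks([b"AA", b"BB"])}
           for _ in range(3)]
stripes[1]["data"][0] = b"CC"  # data updated, parity write lost in the crash

print(resync(stripes, dirty={1}))  # only the flagged stripe is examined
print(resync(stripes))             # a full pass now finds nothing to fix
```

Note that this only works because every data block is still readable; on a
degraded array there is nothing to recompute the parity from.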

If your array was degraded when the system crashed, or is degraded on
restart, or degrades before the resync completes, then you could suffer from
the "Write hole" ... if a write was interrupted by the crash.

In the first two cases (which are effectively the same case), mdadm will
refuse to assemble the array because it knows it could be suffering from a
write-hole problem. You need to reassemble with "--force" which means you
acknowledge that there could be corruption due to the write hole.

If you lose a device during the resync you could still suffer from the write
hole, but md doesn't alert you to this. That could be seen as a
shortcoming, but I'm not sure how it might be fixed. I wouldn't want the
array to suddenly stop working because there is suddenly a risk of write-hole
based corruption....

>
> At least this way the bad parity does not lurk around forever on your
> raid system, causing numerous problems when a disk finally fails.

Yes, it certainly does not lurk forever - the resync fixes it.

>
> The write hole is not inevitable, but would require some special
> measures at the raid level which could affect performance. And as with
> any corruption, it could definitely corrupt your filesystem.

The write hole can be "fixed" in two ways that I am aware of.
1/ log all writes (including parity updates) to some stable storage before
writing them to the RAID5. This is typically done in "hardware RAID" cards
using NVRAM for the stable storage.
Once NVRAM is widely available on commodity server hardware I suspect
md/raid5 will be enhanced to support this. I have thought about doing
this using a RAID1 as the alternate stable storage, but the performance
cost is unlikely to be acceptable.
2/ use a filesystem which understands the layout of the RAID5 and which
somehow "knows" which stripes were written "recently" so that it can
invalidate them (if it cannot verify them) after a crash. This would
almost certainly require a copy-on-write discipline in the filesystem.
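
The first approach can be sketched in a few lines of Python. This is a toy
model only: the class JournaledStripe and its journal attribute (standing in
for NVRAM or other stable storage) are invented for the example and do not
describe how md or any RAID card is actually implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length blocks byte by byte (RAID5 parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class JournaledStripe:
    """One RAID5 stripe whose updates are logged before touching the array."""

    def __init__(self, data_blocks):
        self.data = list(data_blocks)
        self.parity = xor_blocks(self.data)
        self.journal = None  # assumption: survives a crash, like NVRAM

    def write(self, index, new_block, crash_after_data=False):
        new_data = list(self.data)
        new_data[index] = new_block
        # 1. Log the new data block and the new parity to stable storage.
        self.journal = (index, new_block, xor_blocks(new_data))
        # 2. Only then update the array itself.
        self.data[index] = new_block
        if crash_after_data:
            return  # simulated crash: parity on the array is now stale
        self.parity = self.journal[2]
        # 3. Both writes landed, so the log entry can be discarded.
        self.journal = None

    def recover(self):
        """After a crash, replay any logged write so data and parity agree."""
        if self.journal is not None:
            index, block, parity = self.journal
            self.data[index] = block
            self.parity = parity
            self.journal = None


stripe = JournaledStripe([b"AAAA", b"BBBB"])
stripe.write(0, b"CCCC", crash_after_data=True)
print("consistent after crash:  ", stripe.parity == xor_blocks(stripe.data))
stripe.recover()
print("consistent after recover:", stripe.parity == xor_blocks(stripe.data))
```

Because the log is replayed rather than recomputed from the surviving
blocks, recovery in this model works even if a disk is missing, which is
the case the plain resync cannot cover. The cost is that every write hits
stable storage twice.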

NeilBrown

Re: RAID5 write hole?

on 29.06.2010 08:16:35 by Shaochun Wang

On Sun, Jun 27, 2010 at 10:16:13PM +1000, Neil Brown wrote:
> On Sun, 27 Jun 2010 12:33:49 +0200
> John Hendrikx wrote:
>
> parity blocks immediately. If you have a write-intent-bitmap configured,
> this will be quite quick. If not it could take hours.
How do I know whether my RAID5 has a write-intent bitmap enabled?

--
Shaochun Wang

Jabber: fungusw@jabber.org

Re: RAID5 write hole?

on 29.06.2010 08:23:35 by Mikael Abrahamsson

On Tue, 29 Jun 2010, Shaochun Wang wrote:

> How do I know whether my RAID5 has a write-intent bitmap enabled?

$ cat /proc/mdstat | grep bitmap
bitmap: 0/8 pages [0KB], 131072KB chunk

If you don't get any bitmap information in there, it's not enabled.
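
With several arrays in one machine the grep output does not say which array
a bitmap line belongs to. A small Python sketch along these lines could
attribute each line to its array; the function name arrays_with_bitmap is
made up here, and the parsing assumes the usual /proc/mdstat layout in which
an array's detail lines follow its "mdX : ..." line.

```python
import re


def arrays_with_bitmap(mdstat_text: str) -> dict:
    """Map each md array named in /proc/mdstat text to True if it has a bitmap."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            # A new "mdX : active ..." line starts the next array's section.
            current = header.group(1)
            result[current] = False
        elif current and line.strip().startswith("bitmap:"):
            result[current] = True
    return result


if __name__ == "__main__":
    try:
        with open("/proc/mdstat") as f:
            status = arrays_with_bitmap(f.read())
    except OSError:
        status = {}  # no md driver loaded, or not running on Linux
    for name, has_bitmap in status.items():
        print(name, "bitmap enabled" if has_bitmap else "no bitmap")
```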

--
Mikael Abrahamsson email: swmike@swm.pp.se
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: RAID5 write hole?

on 29.06.2010 15:28:16 by Shaochun Wang

On Tue, Jun 29, 2010 at 08:23:35AM +0200, Mikael Abrahamsson wrote:
> On Tue, 29 Jun 2010, Shaochun Wang wrote:
>
> $ cat /proc/mdstat | grep bitmap
> bitmap: 0/8 pages [0KB], 131072KB chunk
>
> If you don't get any bitmap information in there, it's not enabled.
It seems that I do not have a write-intent bitmap enabled. If a
write-intent bitmap is useful, why is it not enabled by default?

--
Shaochun Wang

Jabber: fungusw@jabber.org