Using the new bad-block-log in md for Linux 3.1
on 27.07.2011 06:16:52 by NeilBrown
As mentioned earlier, Linux 3.1 will contain support for recording and
avoiding bad blocks on devices in md arrays.
These patches are currently in -next and I expect to send them to Linus
tomorrow.
Using this functionality requires support in mdadm. When an array is created
some space needs to be reserved to store the bad block list.
I have just created an mdadm branch called devel-3.3 which provides initial
functionality. The main patch is included inline below.
This only supports creating new arrays with badblock support. It also only
supports 1.x metadata.
I hope to add support to add a bad block list to an existing 1.x array at
some stage, but support for 0.90 metadata is not expected to ever be added.
If you create an array with this mdadm it will add a bad block log - you
cannot turn it off (it is only 4K long, so why would you want to?). Then as
errors occur they will cause the faulty block to be added to the log rather
than the device to be removed from the array.
If writing the new bad block list fails, then the device as a whole will fail.
I would very much appreciate any reports of success or failure when using
this new feature. If you can make a test array using a known-faulty device
and can experiment with that I would particularly like to hear about any
experiences.
Thanks,
NeilBrown
git://neil.brown.name/mdadm devel-3.3
http://neil.brown.name/git?p=mdadm;a=shortlog;h=refs/heads/devel-3.3
From f727829c300f5fd56306e5ed5708a55d28fe228e Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Wed, 27 Jul 2011 14:08:10 +1000
Subject: [PATCH] Bad block log
diff --git a/super1.c b/super1.c
index 09be351..f911593 100644
--- a/super1.c
+++ b/super1.c
@@ -70,7 +70,12 @@ struct mdp_superblock_1 {
__u8 device_uuid[16]; /* user-space setable, ignored by kernel */
__u8 devflags; /* per-device flags. Only one defined...*/
#define WriteMostly1 1 /* mask for writemostly flag in above */
- __u8 pad2[64-57]; /* set to 0 when writing */
+ /* bad block log. If there are any bad blocks the feature flag is set.
+ * if offset and size are non-zero, that space is reserved and available.
+ */
+ __u8 bblog_shift; /* shift from sectors to block size for badblocklist */
+ __u16 bblog_size; /* number of sectors reserved for badblocklist */
+ __u32 bblog_offset; /* sector offset from superblock to bblog, signed */
/* array state information - 64 bytes */
__u64 utime; /* 40 bits second, 24 bits microseconds */
@@ -99,8 +104,9 @@ struct misc_dev_info {
* must be honoured
*/
#define MD_FEATURE_RESHAPE_ACTIVE 4
+#define MD_FEATURE_BAD_BLOCKS 8 /* badblock list is not empty */
-#define MD_FEATURE_ALL (1|2|4)
+#define MD_FEATURE_ALL (1|2|4|8)
#ifndef offsetof
#define offsetof(t,f) ((size_t)&(((t*)0)->f))
@@ -278,7 +284,7 @@ static void examine_super1(struct supertype *st, char *homehost)
printf("Internal Bitmap : %ld sectors from superblock\n",
(long)(int32_t)__le32_to_cpu(sb->bitmap_offset));
}
- if (sb->feature_map & __le32_to_cpu(MD_FEATURE_RESHAPE_ACTIVE)) {
+ if (sb->feature_map & __cpu_to_le32(MD_FEATURE_RESHAPE_ACTIVE)) {
printf(" Reshape pos'n : %llu%s\n", (unsigned long long)__le64_to_cpu(sb->reshape_position)/2,
human_size(__le64_to_cpu(sb->reshape_position)<<9));
if (__le32_to_cpu(sb->delta_disks)) {
@@ -322,6 +328,17 @@ static void examine_super1(struct supertype *st, char *homehost)
atime = __le64_to_cpu(sb->utime) & 0xFFFFFFFFFFULL;
printf(" Update Time : %.24s\n", ctime(&atime));
+ if (sb->bblog_size && sb->bblog_offset) {
+ printf(" Bad Block Log : %d entries available at offset %ld sectors",
+ __le16_to_cpu(sb->bblog_size)*512/8,
+ (long)__le32_to_cpu(sb->bblog_offset));
+ if (sb->feature_map &
+ __cpu_to_le32(MD_FEATURE_BAD_BLOCKS))
+ printf(" - bad blocks present.");
+ printf("\n");
+ }
+
+
if (calc_sb_1_csum(sb) == sb->sb_csum)
printf(" Checksum : %x - correct\n", __le32_to_cpu(sb->sb_csum));
else
@@ -1105,10 +1122,12 @@ static int write_init_super1(struct supertype *st)
* 2: 4K from start of device.
* Depending on the array size, we might leave extra space
* for a bitmap.
+ * Also leave 4K for bad-block log.
*/
array_size = __le64_to_cpu(sb->size);
- /* work out how much space we left for a bitmap */
- bm_space = choose_bm_space(array_size);
+ /* work out how much space we left for a bitmap,
+ * Add 8 sectors for bad block log */
+ bm_space = choose_bm_space(array_size) + 8;
switch(st->minor_version) {
case 0:
@@ -1120,6 +1139,10 @@ static int write_init_super1(struct supertype *st)
if (sb_offset < array_size + bm_space)
bm_space = sb_offset - array_size;
sb->data_size = __cpu_to_le64(sb_offset - bm_space);
+ if (bm_space >= 8) {
+ sb->bblog_size = __cpu_to_le16(8);
+ sb->bblog_offset = __cpu_to_le32((unsigned)-8);
+ }
break;
case 1:
sb->super_offset = __cpu_to_le64(0);
@@ -1134,6 +1157,10 @@ static int write_init_super1(struct supertype *st)
sb->data_offset = __cpu_to_le64(reserved);
sb->data_size = __cpu_to_le64(dsize - reserved);
+ if (reserved >= 16) {
+ sb->bblog_size = __cpu_to_le16(8);
+ sb->bblog_offset = __cpu_to_le32(reserved-8);
+ }
break;
case 2:
sb_offset = 4*2;
@@ -1154,6 +1181,10 @@ static int write_init_super1(struct supertype *st)
sb->data_offset = __cpu_to_le64(reserved);
sb->data_size = __cpu_to_le64(dsize - reserved);
+ if (reserved >= 16+16) {
+ sb->bblog_size = __cpu_to_le16(8);
+ sb->bblog_offset = __cpu_to_le32(reserved-8-8);
+ }
break;
default:
return -EINVAL;
--
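For anyone who wants to interpret these fields, a rough sketch (illustrative
only, not part of the patch; it relies only on the struct comments and the
examine_super1() change above):

/* Illustrative only: locate and size the bad block log from the new
 * v1.x superblock fields.  bblog_offset is a signed 32-bit sector
 * offset relative to the superblock, so the (unsigned)-8 written for
 * minor version 0 means "8 sectors before the superblock". */
#include <endian.h>
#include <stdint.h>

static uint64_t bblog_sector(uint64_t super_offset, uint32_t bblog_offset_le)
{
	int32_t off = (int32_t)le32toh(bblog_offset_le);   /* signed offset */
	return super_offset + off;
}

static unsigned bblog_capacity(uint16_t bblog_size_le)
{
	/* bblog_size is in 512-byte sectors; with 8-byte entries this is
	 * the "size*512/8" figure printed by examine_super1() above. */
	return le16toh(bblog_size_le) * 512 / 8;
}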
Re: Using the new bad-block-log in md for Linux 3.1
on 27.07.2011 08:21:10 by Keld Simonsen
On Wed, Jul 27, 2011 at 02:16:52PM +1000, NeilBrown wrote:
>
> As mentioned earlier, Linux 3.1 will contain support for recording and
> avoiding bad blocks on devices in md arrays.
>
> These patches are currently in -next and I expect to send them to Linus
> tomorrow.
>
> Using this functionality requires support in mdadm. When an array is created
> some space needs to be reserved to store the bad block list.
>
> I have just created an mdadm branch called devel-3.3 which provides initial
> functionality. The main patch is included inline below.
>
> This only supports creating new arrays with badblock support. It also only
> supports 1.x metadata.
>
> I hope to add support to add a bad block list to an existing 1.x array at
> some stage, but support for 0.90 metadata is not expected to ever be added.
>
> If you create an array with this mdadm it will add a bad block log - you
> cannot turn it off (it is only 4K long, so why would you want to?). Then as
> errors occur they will cause the faulty block to be added to the log rather
> than the device to be removed from the array.
> If writing the new bad block list fails, then the device as a whole will fail.
>
> I would very much appreciate any reports of success or failure when using
> this new feature. If you can make a test array using a known-faulty device
> and can experiment with that I would particularly like to hear about any
> experiences.
>
> Thanks,
> NeilBrown
>
> git://neil.brown.name/mdadm devel-3.3
>
> http://neil.brown.name/git?p=mdadm;a=shortlog;h=refs/heads/devel-3.3
How is it implemented? Does the bad block get duplicated in a reserve area?
Or are the corresponding good blocks on other sound devices also excluded?
How big a device can it handle?
If a device fails totally and the remaining devices contain bad blocks,
will there then be lost data?
Best regards
keld
Re: Using the new bad-block-log in md for Linux 3.1
on 27.07.2011 08:49:59 by NeilBrown
On Wed, 27 Jul 2011 08:21:10 +0200 keld@keldix.com wrote:
> On Wed, Jul 27, 2011 at 02:16:52PM +1000, NeilBrown wrote:
> >
> > As mentioned earlier, Linux 3.1 will contain support for recording and
> > avoiding bad blocks on devices in md arrays.
> >
> > These patches are currently in -next and I expect to send them to Linus
> > tomorrow.
> >
> > Using this functionality requires support in mdadm. When an array is created
> > some space needs to be reserved to store the bad block list.
> >
> > I have just created an mdadm branch called devel-3.3 which provides initial
> > functionality. The main patch is included inline below.
> >
> > This only supports creating new arrays with badblock support. It also only
> > supports 1.x metadata.
> >
> > I hope to add support to add a bad block list to an existing 1.x array at
> > some stage, but support for 0.90 metadata is not expected to ever be added.
> >
> > If you create an array with this mdadm it will add a bad block log - you
> > cannot turn it off (it is only 4K long, so why would you want to?). Then as
> > errors occur they will cause the faulty block to be added to the log rather
> > than the device to be removed from the array.
> > If writing the new bad block list fails, then the device as a whole will fail.
> >
> > I would very much appreciate any reports of success or failure when using
> > this new feature. If you can make a test array using a known-faulty device
> > and can experiment with that I would particularly like to hear about any
> > experiences.
> >
> > Thanks,
> > NeilBrown
> >
> > git://neil.brown.name/mdadm devel-3.3
> >
> > http://neil.brown.name/git?p=mdadm;a=shortlog;h=refs/heads/devel-3.3
>
> How is it implemented? Does the bad block get duplicated in a reserve area?
No duplication - I expect the underlying device to be doing that, and doing
it again at another level seems pointless.
The easiest way to think about it is that the strip containing a bad block is
treated as 'degraded'. You can have an array where only some strips are
degraded, and they are each missing different devices.
> Or are the corresponding good blocks on other sound devices also excluded?
Not sure what you mean. A bad block is just on one device. Each device has
its own independent table of bad blocks.
>
> How big a device can it handle?
2^54 sectors, which with 512-byte sectors is 8 exbibytes.
With larger sectors, larger devices.
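(To make that concrete: an entry layout consistent with the 2^54 figure packs
each bad range into 64 bits - a 54-bit start sector, a 9-bit length and an
"acknowledged" bit. Off the top of my head it looks roughly like the sketch
below; don't treat the exact masks as gospel.)

/* Sketch of a 64-bit bad-block entry consistent with a 2^54-sector limit.
 * Low 9 bits: length - 1 (so up to 512 sectors per entry);
 * bits 9..62: start sector (54 bits); top bit: "acknowledged",
 * i.e. safely recorded in the on-disk log.  Exact kernel masks may differ. */
#include <stdint.h>

#define BB_LEN_MASK     0x00000000000001FFULL
#define BB_OFFSET_MASK  0x7FFFFFFFFFFFFE00ULL
#define BB_ACK_MASK     0x8000000000000000ULL

static inline uint64_t bb_make(uint64_t start, unsigned len, int ack)
{
	return (start << 9) | (uint64_t)(len - 1) | ((uint64_t)(!!ack) << 63);
}
static inline uint64_t bb_start(uint64_t e) { return (e & BB_OFFSET_MASK) >> 9; }
static inline unsigned bb_len(uint64_t e)   { return (unsigned)(e & BB_LEN_MASK) + 1; }
static inline int      bb_ack(uint64_t e)   { return !!(e & BB_ACK_MASK); }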
>
> If a device fails totally and the remaining devices contain bad blocks,
> will there then be lost data?
Yes. You shouldn't aim to run an array with bad blocks any more than you
should run an array degraded.
The purpose of bad block management is to provide a more graceful failure
path, not to encourage you to run an array with bad drives (except for
testing).
In particular this lays the ground work to implement hot-replace. If you
have a drive that is failing it can stay in the array and hobble along for a
bit longer. Meanwhile you add a fresh new drive as a hot-replace and let it
rebuild. If there is a bad block elsewhere in the array the hot-replace
drive might still rebuild completely. And even if there is a failure, you
will only lose some blocks, not the whole array.
This all makes it very hard to build confidence in the code - most of the
time it is not used at all and I would rather have it that way. But when things
start going wrong, you really want it to be 100% bug free.
Thanks for the questions,
NeilBrown
>
> Best regards
> keld
Re: Using the new bad-block-log in md for Linux 3.1
on 27.07.2011 10:17:29 by Keld Simonsen
On Wed, Jul 27, 2011 at 04:49:59PM +1000, NeilBrown wrote:
> On Wed, 27 Jul 2011 08:21:10 +0200 keld@keldix.com wrote:
>
> > On Wed, Jul 27, 2011 at 02:16:52PM +1000, NeilBrown wrote:
> > >
> > > As mentioned earlier, Linux 3.1 will contain support for recording and
> > > avoiding bad blocks on devices in md arrays.
> > >
> > How is it implemented? Does the bad block get duplicated in a reserve area?
>
> No duplication - I expect the underlying device to be doing that, and doing
> it again at another level seems pointless.
My understanding is that most modern disk devices have their own bad block
management with replacement blocks. But many disks have an inadequate number of replacement
blocks, and when that buffer of blocks runs out, you get a bad block reported
to the IO system in the kernel.
So because of the sometimes low number of intrinsic disk reserve blocks,
there would be a point in having this facility replicated in the md layer.
I have for instance two 1 TB disks with some bad sectors on them,
which I have saved to test MD bad blocks handling (when I get the time)
and do some other bad blocks work on. The errors there are stable, the bad blocks
list has not evolved for about a year. And it is only about 100 blocks out of 1 TB.
I can still use most of the disk on my home server, and I would like to use it
in a fully functioning md array. IMHO there should not be much work in doing
a simple implementation that would guarantee full recovery of all valid data,
should one drive have a fatal error.
> The easiest way to think about it is that the strip containing a bad block is
> treated as 'degraded'. You can have an array where only some strips are
> degraded, and they are each missing different devices.
>
> > Or are the corresponding good blocks on other sound devices also excluded?
>
> Not sure what you mean. A bad block is just on one device. Each device has
> its own independent table of bad blocks.
I was thinking that, for example for a raid1 or raid10 device with 2 copies,
you could declare both copies bad, or mark that specific raid block as half-bad;
then you do not need a reserve area for bad blocks. Or maybe you could report
to the file system - e.g. ext3/ext4 - that this is a bad or half-bad block,
and then the file system could treat it accordingly.
There could be some process periodically going through the md bad blocks list -
which would probably be quite short, a few thousand entries in bad cases -
and comparing it to the ext3/ext4 badblocks list; if a new bad block was found, it would
try to retrieve the good data and reallocate the block.
This scheme would only need access to the md badblocks buffer, no
specific APIs needed I think.
For file systems with no intrinsic bad block handling, such as xfs,
one could have a similar periodic process finding new half-bad blocks,
which would then relocate the good data and mark the original block on the good disk as
unusable - so it will not be used again. That would probably require
an API to mark or query a block as bad - or virtually bad - in the md badblocks list.
This solution is general and still rather simple.
Both schemes scale well, given an adequate bad block md buffer.
> > How big a device can it handle?
>
> 2^54 sectors which with 512byte sectors is 8 exbibytes.
> With larger sectors, larger devices.
And how many bad blocks can it handle? 4 KB is not much.
Is it just a simple list of 64 bit entries?
> >
> > If a device fails totally and the remaining devices contain bad blocks,
> > will there then be lost data?
>
> Yes. You shouldn't aim to run an array with bad blocks any more than you
> should run an array degraded.
This is of course true, but I think you could add some more security
if you could handle more incidents occurring almost at the same time.
And in the case with my home server I think it would be OK to run with
a partly damaged disk.
> The purpose of bad block management is to provide a more graceful failure
> path, not to encourage you to run an array with bad drives (except for
> testing).
Yes, that is a great advantage.
> In particular this lays the ground work to implement hot-replace. If you
> have a drive that is failing it can stay in the array and hobble along for a
> bit longer. Meanwhile you add a fresh new drive as a hot-replace and let it
> rebuild. If there is a bad block elsewhere in the array the hot-replace
> drive might still rebuild completely. And even if there is a failure, you
> will only lose some blocks, not the whole array.
>
> This all makes it very hard to build confidence in the code - most of the
> time it is not used at all and I would rather have it that way. But when things
> start going wrong, you really want it to be 100% bug free.
Yes, I appreciate that the code should be simple.
Best regards
keld
Re: Using the new bad-block-log in md for Linux 3.1
on 27.07.2011 12:22:16 by Mikael Abrahamsson
On Wed, 27 Jul 2011, keld@keldix.com wrote:
> So because of the sometimes low number of intrinsic disk reserve blocks,
> there would be a point in having this facility replicated in the md
> layer.
I don't agree at all. I remember back in the old MFM/RLL drive days and I
don't want to go back there. If a drive has filled up its bad block
relocation area with relocated blocks, then it's time to RMA that drive
and get a new one. It's obviously defective.
The complexity the bad block scheme you propose would add to the
code base is worrisome; I'd rather just go with what Neil has suggested so
far. The hot-replace functionality is a really really needed feature and I
imagine it can be done much more cleanly without having to duplicate the
bad block reallocation functionality the drives already have.
--
Mikael Abrahamsson email: swmike@swm.pp.se
Re: Using the new bad-block-log in md for Linux 3.1
on 27.07.2011 14:30:30 by Lutz Vieweg
On 07/27/2011 06:16 AM, NeilBrown wrote:
> Then as errors occur they will cause the faulty block to be added to the log rather
> than the device to be removed from the array.
Can you describe the criteria for MD considering a block as faulty?
In your blog, I read
"... known to be bad. i.e. either a read or a write has recently failed..."
but that definition may be problematic: I've experienced drives
with intermittent read / write failures (due to controller or power stability
problems), and I wonder whether such a situation could quickly fill up the
"bad block list", doing more harm than good in the "intermittent error"-
scenario.
Another scenario: The write succeeded, but later reads of the same
block return read errors. This would result in a "pending sector", and the
harddisk may very well re-map the sector on the next write. Do you mark
the block faulty on the MD level after the first read failed (even though
subsequent reads/writes to the block would succeed), or do you first try
to re-write the block, and call it faulty only if that fails?
One more general thing: I guess that "marking bad blocks" is probably
unsuitable for SSDs, which usually do not associate a fixed physical
storage location with a certain block number. Maybe mdadm could warn that it is better
not to enable the feature if the device is known to be an SSD.
Regards,
Lutz Vieweg
Re: Using the new bad-block-log in md for Linux 3.1
on 27.07.2011 14:44:24 by John Robinson
On 27/07/2011 13:30, Lutz Vieweg wrote:
> On 07/27/2011 06:16 AM, NeilBrown wrote:
>> Then as errors occur they will cause the faulty block to be added to
>> the log rather
>> than the device to be removed from the array.
>
> Can you describe the criteria for MD considering a block as faulty?
I'll try to answer this having followed some of the discussion around
it. It'll be the same circumstances where currently a drive is
considered faulty, causing the array to become degraded. With the
bad block list, instead of the whole array becoming degraded, only the
stripe with the bad block becomes degraded.
> In your blog, I read
> "... known to be bad. i.e. either a read or a write has recently failed..."
> but that definition may be problematic: I've experienced drives
> with intermittent read / write failures (due to controller or power
> stability
> problems), and I wonder whether such a situation could quickly fill up the
> "bad block list", doing more harm than good in the "intermittent error"-
> scenario.
It might quickly fill up the bad block list, but with no bad block list,
the array would be taken offline much sooner. Once the controller or
power issues are resolved, the bad block list can be administratively
modified or cleared.
> Another scenario: The write succeeded, but later reads of the same
> block return read errors. This would result in a "pending sector", and the
> harddisk may very well re-map the sector on the next write. Do you mark
> the block faulty on the MD level after the first read failed (even though
> subsequent reads/writes to the block would succeed), or do you first try
> to re-write the block, and call it faulty only if that fails?
MD already handles this and has done for years; if a read fails,
reconstruction is performed and the data written back. It would be at
this point that a failure would cause the block to be called faulty (or
without the bad block list, the device would be called faulty).
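Roughly, as a sketch (this is only my reading of the flow, not actual md code,
and every name in it is made up):

/* Conceptual sketch of the read-error path described above; not md source. */
struct member;
extern int reconstruct_from_peers(struct member *dev, unsigned long long sector,
                                  int nsectors, void *buf);
extern int rewrite_sectors(struct member *dev, unsigned long long sector,
                           int nsectors, const void *buf);
extern int record_bad_block(struct member *dev, unsigned long long sector,
                            int nsectors);

enum outcome { RECOVERED, BLOCK_MARKED_BAD, DEVICE_FAILED };

enum outcome handle_read_error(struct member *dev, unsigned long long sector,
                               int nsectors, void *buf)
{
	/* Rebuild the data from the other members (mirror copy or parity). */
	if (reconstruct_from_peers(dev, sector, nsectors, buf) != 0)
		return DEVICE_FAILED;           /* nothing to reconstruct from */

	/* Write the reconstructed data back; a healthy drive remaps the sector. */
	if (rewrite_sectors(dev, sector, nsectors, buf) == 0)
		return RECOVERED;

	/* Write-back failed: with a bad block log only this range is marked bad;
	 * without one, the whole device would be failed at this point. */
	if (record_bad_block(dev, sector, nsectors) == 0)
		return BLOCK_MARKED_BAD;

	return DEVICE_FAILED;                   /* could not even update the log */
}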
> One more general thing: I guess that "marking bad blocks" is probably
> unsuitable for SSDs, which usually do not assign fixed physical
> storage location with a certain block number. Maybe mdadm could warn
> about better
> not enabling the feature if the device is known to be a SSD.
I don't think mdadm knows whether its constituent devices are SSDs.
Cheers,
John.
Re: Using the new bad-block-log in md for Linux 3.1
on 27.07.2011 15:06:10 by Lutz Vieweg
On 07/27/2011 02:44 PM, John Robinson wrote:
>> Can you describe the criteria for MD considering a block as faulty?
>
> I'll try to answer this having followed some of the discussion around it.
Thanks a lot for the explanation!
> Once the controller or power issues are resolved, the bad block list can be
> administratively modified or cleared.
Ah, that's good.
> I don't think mdadm knows whether its constituent devices are SSDs.
In block/cfq-iosched.c I see a test that looks like this:
> if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
> return;
If that isn't conclusive, putting a note into the mdadm man-page is probably
the best one can do.
Regards,
Lutz Vieweg
Re: Using the new bad-block-log in md for Linux 3.1
on 27.07.2011 15:23:14 by Lutz Vieweg
On 07/27/2011 03:06 PM, Lutz Vieweg wrote:
>> I don't think mdadm knows whether its constituent devices are SSDs.
>
> In block/cfq-iosched.c I see a test that looks like this:
>> if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
>> return;
I just verified that indeed, /sys/block/sd*/queue/rotational contains
0 for SSDs and 1 for magnetic disks.
One catch, though, is that this attribute does not seem to be
propagated through additional block device layers, so e.g.
a loop-device based on an SSD is, strangely, tagged as being "rotational",
as is a device-mapper device based on the SSD.
So you can only detect SSDs when they are given as underlying devices directly.
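For reference, checking the attribute from user space is trivial - a minimal sketch:

/* Read /sys/block/<dev>/queue/rotational.  Returns 1 (rotational),
 * 0 (non-rotational, e.g. SSD) or -1 if the attribute can't be read. */
#include <stdio.h>

int device_is_rotational(const char *dev)      /* e.g. "sda" */
{
	char path[256];
	FILE *f;
	int val = -1;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/rotational", dev);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}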
Regards,
Lutz Vieweg
Re: Using the new bad-block-log in md for Linux 3.1
on 27.07.2011 22:55:41 by NeilBrown
On Wed, 27 Jul 2011 15:06:10 +0200 Lutz Vieweg wrote:
> On 07/27/2011 02:44 PM, John Robinson wrote:
> >> Can you describe the criteria for MD considering a block as faulty?
> >
> > I'll try to answer this having followed some of the discussion around it.
>
> Thanks a lot for the explanation!
Yes John, thanks for posting.
>
> > Once the controller or power issues are resolved, the bad block list can be
> > administratively modified or cleared.
>
> Ah, that's good.
"administratively" probably isn't the right word. You cannot tell md to
remove blocks from the list (except for testing purposes).
When md finds that it might be good to write to a known-bad-block it has two
options - to write or not.
It makes the choice based on whether it has seen any write errors on that
device since the array was assembled.
If it has - it just doesn't write and leaves the block 'bad'.
If it has not it tries to write. On success it clears the record of the bad
block. On failure it decides not to write to and more bad blocks on that
device.
So if you have a device that is incorrectly reporting errors and filling up
the bad block list, and you then stop the array, fix the hardware, and
re-assemble, then the bad blocks will gradually disappear as writes try to
write to them again and succeed. A 'check' pass should automatically fix
everything up as it tries to re-write bad blocks.
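In rough pseudocode (only to illustrate the decision - this is not the actual
kernel code and the names are invented):

/* Sketch of the decision described above; not real md code. */
struct member { int saw_write_error_since_assembly; };
extern int write_sectors(struct member *dev, unsigned long long sector,
                         int nsectors, const void *buf);
extern void clear_bad_block(struct member *dev, unsigned long long sector,
                            int nsectors);

/* Called when md would like to write over a range currently marked bad. */
void maybe_rewrite_bad_range(struct member *dev, unsigned long long sector,
                             int nsectors, const void *buf)
{
	if (dev->saw_write_error_since_assembly)
		return;                         /* leave the range marked bad */

	if (write_sectors(dev, sector, nsectors, buf) == 0) {
		clear_bad_block(dev, sector, nsectors); /* success: forget it */
	} else {
		/* remember the failure and stop trying to rewrite further
		 * bad blocks on this device until the next assembly */
		dev->saw_write_error_since_assembly = 1;
	}
}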
>
> > I don't think mdadm knows whether its constituent devices are SSDs.
>
> In block/cfq-iosched.c I see a test that looks like this:
> > if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
> > return;
>
> If that isn't conclusive, putting a note into the mdadm man-page is probably
> the best one can do.
>
The idea of marking a device as 'rotational' always seemed dumb to me.
Because people assume that 'rotational' is a disk drive and '!rotational' is
an SSD. But what if some other technology comes along with behaviour
somewhere between the two??
I think the primary meaning of 'rotational' as implemented is 'seek is
instant'. This is quite a different meaning to 'blocks migrate around the
device' even though both are true of current SSDs.
I'm not sure that md can usefully do anything different on SSDs than on
spinning rust.
You certainly still want to record read errors. If you get a write error it
probably means that a large part of the device is bad ... but I suspect you
will notice that soon enough anyway.
NeilBrown
> Regards,
>
> Lutz Vieweg
>
>
Re: Using the new bad-block-log in md for Linux 3.1
on 28.07.2011 11:25:13 by Lutz Vieweg
On 07/27/2011 10:55 PM, NeilBrown wrote:
> When md finds that it might be good to write to a known-bad-block it has two
> options - to write or not.
> It makes the choice based on whether it has seen any write errors on that
> device since the array was assembled.
> If it has - it just doesn't write and leaves the block 'bad'.
> If it has not it tries to write. On success it clears the record of the bad
> block.
Sounds reasonable.
> On failure it decides not to write to and more bad blocks on that
> device.
This sentence may just be missing one word, but that might be an important
one. Did you mean to say "on failure (of writing to a block that had
been marked as bad, after a re-assembly) that one block will not be
written to (until after the next re-assembly)"?
> The idea of marking a device as 'rotational' always seemed dumb to me.
> Because people assume that 'rotational' is a disk drive and '!rotational' is
> an SSD. But what if some other technology comes along with behaviour
> somewhere between the two??
The naming of that flag is really awkward.
> I think the primary meaning of 'rotational' as implemented is 'seek is
> instant'.
(That would be the meaning of 'not rotational'.)
> This is quite a different meaning to 'blocks migrate around the
> device' even though both are true of current SSDs.
Right, the seeking and "wear levelling" features are completely orthogonal.
> I'm not sure that md can usefully do anything different on SSDs than on
> spinning rust.
At least MD could make block devices it creates inherit the "rotational"
flag, as an "OR"ed combination of the slave block devices (because if one
slave needs time for seeking, so probably will the RAID as a whole).
From that the scheduler could benefit when writing to the MD device -
at least the number of places where the "rotational" flag is checked
in the scheduler code suggests that such a benefit may exist.
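(Just to spell out the OR combination I mean - a trivial sketch with made-up names:)

/* The array would count as "rotational" if any member is, since one
 * seeking member slows down the array as a whole.  Hypothetical names. */
struct member_info { int rotational; };

int array_is_rotational(const struct member_info *members, int n)
{
	int i;
	for (i = 0; i < n; i++)
		if (members[i].rotational)
			return 1;
	return 0;
}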
> You certainly still want to record read errors.
It probably cannot harm to record them, but it probably has no benefit, either.
I've had SSDs returning read errors for single blocks (which were gone after
rewriting), and the SSD, unlike a magnetic disk, will certainly not take
any significant extra time to report such an error; it's just a checksum mismatch,
after all, and retries are either extremely fast or futile (no wait for
the next rotation involved).
> If you get a write error it
> probably means that a large part of the device is bad ... but I suspect you
> will notice that soon enough anyway.
I'd guess so, too.
Regards,
Lutz Vieweg
Re: Using the new bad-block-log in md for Linux 3.1
on 28.07.2011 11:55:19 by John Robinson
On 28/07/2011 10:25, Lutz Vieweg wrote:
> On 07/27/2011 10:55 PM, NeilBrown wrote:
[...]
>> On failure it decides not to write to and more bad blocks on that
>> device.
>
> This sentence may just miss one verb, but that might be an important
> one. Did you mean to say "on failure (of writing to a block that had
> been marked as bad, after a re-assembly) that one block will not be
> written to (until after the next re-assembly)"?
I think the typo was that he meant "any" where he wrote "and": "On
failure it decides not to write to any more bad blocks on that device".
So after a re-assembly (e.g. when you boot up after fixing your power,
cable, controller issues) md will try writing to bad blocks again, until
any such writes fail, after which it will stop trying to write to bad
blocks on that device. By this method, md can automatically recover from
spurious write failures caused by temporary issues.
Sorry I got it wrong in the first place, by the way - I'd seen the
writeable sysfs entries for manipulating the bad block list, so that's
why I thought there was an administrative interface for clearing it, but
if that's only there for md/mdadm's internal use and testing, we
ordinary users had better leave it alone :-)
Cheers,
John.
Re: Using the new bad-block-log in md for Linux 3.1
on 28.07.2011 14:53:36 by Michal Soltys
On 27.07.2011 06:16, NeilBrown wrote:
>
> If you create an array with this mdadm it will add a bad block log - you
> cannot turn it off (it is only 4K long, so why would you want to?). Then as
> errors occur they will cause the faulty block to be added to the log rather
> than the device to be removed from the array.
I was wondering: 4KiB is "just" 512 entries, assuming these are only
64-bit sector addresses - it's possible to imagine a failing drive with
lots of non-relocatable (for whatever reason) adjacent sectors that
could fill up the log relatively quickly (the mentioned 512 entries ~ 256KiB
worth of data on 512b drives, just a few chunks before the drive is
kicked out). Though - how realistic such a scenario is with modern drives
is hard for me to judge.
Still - are there any plans to make that 4KiB size not hardcoded, perhaps
along with a customizable offset to the first chunk?
Something of analogous functionality to lvm's options such as:
--metadatasize
--dataalignment
--dataalignmentoffset
(though again, hard to judge if there's any practical use for those
aside from exotic scenarios)