[md PATCH 00/36] md patches for 3.1 - part 2: bad block logs

on 21.07.2011 04:58:47 by NeilBrown

As promised, this is the second of two patch-bombs full of patches
that I plan to submit for linux-3.1.

While the first set was a varied assortment, these all have a very
strong theme: this patch set implements a bad-block log for RAID1,
RAID456 and RAID10 - i.e. the first thing on my "TODO list":
http://neil.brown.name/blog/20110216044002


On v1.x metadata arrays created with a patched mdadm (which I'll post
a pointer to later), 4K of space is reserved to store a list of
known bad blocks - room for 512 eight-byte entries. When md hits an
error, it can now fail just the affected block instead of failing the
whole device. This should mean more graceful failure modes when
devices are producing bad blocks.

I have tested these a reasonable amount (and found a few bugs in the
process) but more testing is needed. One difficulty with testing is
that you need a device that fails occasionally to exercise some of
this code.

One of my tests is below. It inserts a 'faulty' md device between
the RAID5 and each real device and configures two of them to generate
persistent write errors at different rates. The first "mkfs" causes
lots of bad blocks to get logged. The second "mkfs" (after the
'faulty' targets are cleared and flushed) results in all those bad
blocks being successfully repaired and forgotten.
There are obviously lots of other combinations worth testing.

Testing both with the new mdadm and with the old one (or with 0.90
metadata which won't store bad-block lists) would be helpful.

Again, genuine "Reviewed-by" lines are very welcome and will be added
if received before I submit this to Linus.

Thanks,
NeilBrown

(from the mdadm man page for "--grow" for faulty arrays:

When setting the failure mode for level faulty, the options are:
write-transient, wt, read-transient, rt, write-persistent, wp,
read-persistent, rp, write-all, read-fixable, rf, clear, flush,
none.

Each failure mode can be followed by a number, which is used as
a period between fault generation. Without a number, the fault
is generated once on the first relevant request. With a number,
the fault will be generated after that many requests, and will
continue to be generated every time the period elapses.

Multiple failure modes can be current simultaneously by using
the --grow option to set subsequent failure modes.

"clear" or "none" will remove any pending or periodic failure
modes, and "flush" will clear any persistent faults.
)

# test badblock code

mdadm -Ss
mdadm -B /dev/md10 -l faulty -n 1 /dev/sda
mdadm -B /dev/md11 -l faulty -n 1 /dev/sdb
mdadm -B /dev/md12 -l faulty -n 1 /dev/sdc
mdadm -B /dev/md13 -l faulty -n 1 /dev/sdd
../mdadm -CR /dev/md0 -l5 -n4 /dev/md1[0123] --assume-clean

mdadm -G /dev/md10 -l faulty -p wp8000
mdadm -G /dev/md11 -l faulty -p wp7000

mkfs /dev/md0

grep . /sys/block/md0/md/rd?/bad*

mdadm -S /dev/md0
mdadm -G /dev/md10 -l faulty -p clear
mdadm -G /dev/md10 -l faulty -p flush
mdadm -G /dev/md11 -l faulty -p clear
mdadm -G /dev/md11 -l faulty -p flush

mdadm -A /dev/md0 /dev/md1[0123]
mkfs /dev/md0
grep . /sys/block/md0/md/rd?/bad*
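
(For reference: the bad_blocks files print one "sector length" pair per
line - see badblocks_show() in patch 02 - so the first grep might show
entries like "12345 8" against rd0 and rd1, while the second grep,
after the repair pass, should come up empty. Those numbers are purely
illustrative.)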



---

NeilBrown (36):
md/raid10: handle further errors during fix_read_error better.
md/raid10: Handle read errors during recovery better.
md/raid10: simplify read error handling during recovery.
md/raid10: record bad blocks due to write errors during resync/recovery.
md/raid10: attempt to fix read errors during resync/check
md/raid10: Handle write errors by updating badblock log.
md/raid10: clear bad-block record when write succeeds.
md/raid10: avoid writing to known bad blocks on known bad drives.
md/raid10: record bad blocks as needed during recovery.
md/raid10: avoid reading known bad blocks during resync/recovery.
md/raid10: avoid reading from known bad blocks - part 3
md/raid10: avoid reading from known bad blocks - part 2
md/raid10: avoid reading from known bad blocks - part 1
md/raid10: Split handle_read_error out from raid10d.
md/raid10: simplify/reindent some loops.
md/raid5: Clear bad blocks on successful write.
md/raid5: Don't write to known bad blocks on doubtful devices.
md/raid5: write errors should be recorded as bad blocks if possible.
md/raid5: use bad-block log to improve handling of uncorrectable read errors.
md/raid5: avoid reading from known bad blocks.
md/raid1: factor several functions out of raid1d()
md/raid1: improve handling of read failure during recovery.
md/raid1: record badblocks found during resync etc.
md/raid1: Handle write errors by updating badblock log.
md/raid1: store behind-write pages in bi_vecs.
md/raid1: clear bad-block record when write succeeds.
md/raid1: avoid writing to known-bad blocks on known-bad drives.
md: make it easier to wait for bad blocks to be acknowledged.
md: add 'write_error' flag to component devices.
md/raid1: avoid reading known bad blocks during resync
md/raid1: avoid reading from known bad blocks.
md: Disable bad blocks and v0.90 metadata.
md: load/store badblock list from v1.x metadata
md: don't allow arrays to contain devices with bad blocks.
md/bad-block-log: add sysfs interface for accessing bad-block-log.
md: beginnings of bad block management.


drivers/md/md.c | 838 ++++++++++++++++++++++++++++++++++++-
drivers/md/md.h | 83 ++++
drivers/md/raid1.c | 923 ++++++++++++++++++++++++++++++++---------
drivers/md/raid1.h | 20 +
drivers/md/raid10.c | 1015 ++++++++++++++++++++++++++++++++++++---------
drivers/md/raid10.h | 16 +
drivers/md/raid5.c | 183 +++++++-
drivers/md/raid5.h | 21 +
include/linux/raid/md_p.h | 14 -
9 files changed, 2637 insertions(+), 476 deletions(-)


[md PATCH 04/36] md: load/store badblock list from v1.x metadata

on 21.07.2011 04:58:47 by NeilBrown

Space must have been allocated when the array was created.
A feature flag is set when the bad-block list is non-empty, to
ensure old kernels don't load the device and trust the whole of it.

We only update the on-disk bad-block list when it has changed.
If the bad-block list (or other metadata) is stored on a bad block, we
don't cope very well.

If the metadata has no room for a bad-block list, flag bad blocks as
disabled, and do the same for 0.90 metadata.
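
For illustration only - not part of the patch: a minimal user-space
sketch of decoding one on-disk log entry, following the super_1_load()
hunk below. Each entry is a little-endian u64 holding the start in the
high 54 bits and the length in the low 10 bits, both in units of
1<<bblog_shift sectors; an all-ones entry terminates the list.

#include <stdint.h>
#include <stdio.h>

/* Assumes a little-endian host; the kernel applies le64_to_cpu() first. */
static void decode_bblog_entry(uint64_t bb, int bblog_shift)
{
	if (bb + 1 == 0) {		/* 0xffff...ffff marks the end */
		printf("end of list\n");
		return;
	}
	printf("bad range: %d sectors at sector %llu\n",
	       (int)(bb & 0x3ff) << bblog_shift,		/* length */
	       (unsigned long long)(bb >> 10) << bblog_shift);	/* start */
}

int main(void)
{
	decode_bblog_entry((123456ULL << 10) | 8, 0); /* 8 sectors at 123456 */
	decode_bblog_entry(~0ULL, 0);		      /* terminator */
	return 0;
}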

Signed-off-by: NeilBrown
---

drivers/md/md.c | 111 +++++++++++++++++++++++++++++++++++++++++++--
drivers/md/md.h | 5 ++
include/linux/raid/md_p.h | 14 ++++--
3 files changed, 119 insertions(+), 11 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 9324635..18c3aab 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -757,6 +757,10 @@ static void free_disk_sb(mdk_rdev_t * rdev)
rdev->sb_start = 0;
rdev->sectors = 0;
}
+ if (rdev->bb_page) {
+ put_page(rdev->bb_page);
+ rdev->bb_page = NULL;
+ }
}


@@ -1395,6 +1399,8 @@ static __le32 calc_sb_1_csum(struct mdp_superblock_1 * sb)
return cpu_to_le32(csum);
}

+static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors,
+ int acknowledged);
static int super_1_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version)
{
struct mdp_superblock_1 *sb;
@@ -1473,6 +1479,47 @@ static int super_1_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version)
else
rdev->desc_nr = le32_to_cpu(sb->dev_number);

+ if (!rdev->bb_page) {
+ rdev->bb_page = alloc_page(GFP_KERNEL);
+ if (!rdev->bb_page)
+ return -ENOMEM;
+ }
+ if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_BAD_BLOCKS) &&
+ rdev->badblocks.count == 0) {
+ /* need to load the bad block list.
+ * Currently we limit it to one page.
+ */
+ s32 offset;
+ sector_t bb_sector;
+ u64 *bbp;
+ int i;
+ int sectors = le16_to_cpu(sb->bblog_size);
+ if (sectors > (PAGE_SIZE / 512))
+ return -EINVAL;
+ offset = le32_to_cpu(sb->bblog_offset);
+ if (offset == 0)
+ return -EINVAL;
+ bb_sector = (long long)offset;
+ if (!sync_page_io(rdev, bb_sector, sectors << 9,
+ rdev->bb_page, READ, true))
+ return -EIO;
+ bbp = (u64 *)page_address(rdev->bb_page);
+ rdev->badblocks.shift = sb->bblog_shift;
+ for (i = 0 ; i < (sectors << (9-3)) ; i++, bbp++) {
+ u64 bb = le64_to_cpu(*bbp);
+ int count = bb & (0x3ff);
+ u64 sector = bb >> 10;
+ sector <<= sb->bblog_shift;
+ count <<= sb->bblog_shift;
+ if (bb + 1 == 0)
+ break;
+ if (md_set_badblocks(&rdev->badblocks,
+ sector, count, 1) == 0)
+ return -EINVAL;
+ }
+ } else if (sb->bblog_offset == 0)
+ rdev->badblocks.shift = -1;
+
if (!refdev) {
ret = 1;
} else {
@@ -1624,7 +1671,6 @@ static void super_1_sync(mddev_t *mddev, mdk_rdev_t *rdev)
sb->pad0 = 0;
sb->recovery_offset = cpu_to_le64(0);
memset(sb->pad1, 0, sizeof(sb->pad1));
- memset(sb->pad2, 0, sizeof(sb->pad2));
memset(sb->pad3, 0, sizeof(sb->pad3));

sb->utime = cpu_to_le64((__u64)mddev->utime);
@@ -1664,6 +1710,43 @@ static void super_1_sync(mddev_t *mddev, mdk_rdev_t *rdev)
sb->new_chunk = cpu_to_le32(mddev->new_chunk_sectors);
}

+ if (rdev->badblocks.count == 0)
+ /* Nothing to do for bad blocks*/ ;
+ else if (sb->bblog_offset == 0)
+ /* Cannot record bad blocks on this device */
+ md_error(mddev, rdev);
+ else {
+ int havelock = 0;
+ struct badblocks *bb = &rdev->badblocks;
+ u64 *bbp = (u64 *)page_address(rdev->bb_page);
+ u64 *p;
+ sb->feature_map |= cpu_to_le32(MD_FEATURE_BAD_BLOCKS);
+ if (bb->changed) {
+ memset(bbp, 0xff, PAGE_SIZE);
+
+ rcu_read_lock();
+ p = rcu_dereference(bb->active_page);
+ if (!p) {
+ spin_lock_irq(&bb->lock);
+ p = bb->page;
+ havelock = 1;
+ }
+ for (i = 0 ; i < bb->count ; i++) {
+ u64 internal_bb = *p++;
+ u64 store_bb = ((BB_OFFSET(internal_bb) << 10)
+ | BB_LEN(internal_bb));
+ *bbp++ = cpu_to_le64(store_bb);
+ }
+ bb->sector = (rdev->sb_start +
+ (int)le32_to_cpu(sb->bblog_offset));
+ bb->size = le16_to_cpu(sb->bblog_size);
+ bb->changed = 0;
+ if (havelock)
+ spin_unlock_irq(&bb->lock);
+ rcu_read_unlock();
+ }
+ }
+
max_dev = 0;
list_for_each_entry(rdev2, &mddev->disks, same_set)
if (rdev2->desc_nr+1 > max_dev)
@@ -2197,6 +2280,7 @@ static void md_update_sb(mddev_t * mddev, int force_change)
mdk_rdev_t *rdev;
int sync_req;
int nospares = 0;
+ int any_badblocks_changed = 0;

repeat:
/* First make sure individual recovery_offsets are correct */
@@ -2268,6 +2352,11 @@ repeat:
MD_BUG();
mddev->events --;
}
+
+ list_for_each_entry(rdev, &mddev->disks, same_set)
+ if (rdev->badblocks.changed)
+ any_badblocks_changed++;
+
sync_sbs(mddev, nospares);
spin_unlock_irq(&mddev->write_lock);

@@ -2293,6 +2382,13 @@ repeat:
bdevname(rdev->bdev,b),
(unsigned long long)rdev->sb_start);
rdev->sb_events = mddev->events;
+ if (rdev->badblocks.size) {
+ md_super_write(mddev, rdev,
+ rdev->badblocks.sector,
+ rdev->badblocks.size << 9,
+ rdev->bb_page);
+ rdev->badblocks.size = 0;
+ }

} else
dprintk(")\n");
@@ -2316,6 +2412,9 @@ repeat:
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
sysfs_notify(&mddev->kobj, NULL, "sync_completed");

+ if (any_badblocks_changed)
+ list_for_each_entry(rdev, &mddev->disks, same_set)
+ md_ack_all_badblocks(&rdev->badblocks);
}

/* words written to sysfs files may, or may not, be \n terminated.
@@ -2823,6 +2922,8 @@ int md_rdev_init(mdk_rdev_t *rdev)
rdev->sb_events = 0;
rdev->last_read_error.tv_sec = 0;
rdev->last_read_error.tv_nsec = 0;
+ rdev->sb_loaded = 0;
+ rdev->bb_page = NULL;
atomic_set(&rdev->nr_pending, 0);
atomic_set(&rdev->read_errors, 0);
atomic_set(&rdev->corrected_errors, 0);
@@ -2912,11 +3013,9 @@ static mdk_rdev_t *md_import_device(dev_t newdev, int super_format, int super_mi
return rdev;

abort_free:
- if (rdev->sb_page) {
- if (rdev->bdev)
- unlock_rdev(rdev);
- free_disk_sb(rdev);
- }
+ if (rdev->bdev)
+ unlock_rdev(rdev);
+ free_disk_sb(rdev);
kfree(rdev);
return ERR_PTR(err);
}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index d327734..834e46b 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -55,7 +55,7 @@ struct mdk_rdev_s
struct block_device *meta_bdev;
struct block_device *bdev; /* block device handle */

- struct page *sb_page;
+ struct page *sb_page, *bb_page;
int sb_loaded;
__u64 sb_events;
sector_t data_offset; /* start of data in array */
@@ -128,6 +128,9 @@ struct mdk_rdev_s
u64 *active_page; /* either 'page' or 'NULL' */
int changed;
spinlock_t lock;
+
+ sector_t sector;
+ sector_t size; /* in sectors */
} badblocks;
};

diff --git a/include/linux/raid/md_p.h b/include/linux/raid/md_p.h
index 75cbf4f..9e65d9e 100644
--- a/include/linux/raid/md_p.h
+++ b/include/linux/raid/md_p.h
@@ -245,10 +245,16 @@ struct mdp_superblock_1 {
__u8 device_uuid[16]; /* user-space setable, ignored by kernel */
__u8 devflags; /* per-device flags. Only one defined...*/
#define WriteMostly1 1 /* mask for writemostly flag in above */
- __u8 pad2[64-57]; /* set to 0 when writing */
+ /* Bad block log. If there are any bad blocks the feature flag is set.
+ * If offset and size are non-zero, that space is reserved and available
+ */
+ __u8 bblog_shift; /* shift from sectors to block size */
+ __le16 bblog_size; /* number of sectors reserved for list */
+ __le32 bblog_offset; /* sector offset from superblock to bblog,
+ * signed - not unsigned */

/* array state information - 64 bytes */
- __le64 utime; /* 40 bits second, 24 btes microseconds */
+ __le64 utime; /* 40 bits second, 24 bits microseconds */
__le64 events; /* incremented when superblock updated */
__le64 resync_offset; /* data before this offset (from data_offset) known to be in sync */
__le32 sb_csum; /* checksum up to devs[max_dev] */
@@ -270,8 +276,8 @@ struct mdp_superblock_1 {
* must be honoured
*/
#define MD_FEATURE_RESHAPE_ACTIVE 4
+#define MD_FEATURE_BAD_BLOCKS 8 /* badblock list is not empty */

-#define MD_FEATURE_ALL (1|2|4)
+#define MD_FEATURE_ALL (1|2|4|8)

#endif
-



[md PATCH 02/36] md/bad-block-log: add sysfs interface for accessing bad-block-log.

on 21.07.2011 04:58:47 by NeilBrown

This can show the log (provided it fits in one page) and
allows bad blocks to be 'acknowledged', meaning that they
have safely been recorded in metadata.

Clearing bad blocks is not allowed via sysfs (except for
code testing). A bad block can only be cleared when
a write to the block succeeds.
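
For example (the sector numbers are made up, and the leading '-' clear
syntax exists only while DO_DEBUG is defined):

echo "12345 8" > /sys/block/md0/md/rd1/bad_blocks
cat /sys/block/md0/md/rd1/bad_blocks # prints "12345 8"
echo "-12345 8" > /sys/block/md0/md/rd1/bad_blocks # testing only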

Signed-off-by: NeilBrown
---

drivers/md/md.c | 127 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 127 insertions(+), 0 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 220fadb..9324635 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2712,6 +2712,35 @@ static ssize_t recovery_start_store(mdk_rdev_t *rdev, const char *buf, size_t le
static struct rdev_sysfs_entry rdev_recovery_start =
__ATTR(recovery_start, S_IRUGO|S_IWUSR, recovery_start_show, recovery_start_store);

+
+static ssize_t
+badblocks_show(struct badblocks *bb, char *page, int unack);
+static ssize_t
+badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack);
+
+static ssize_t bb_show(mdk_rdev_t *rdev, char *page)
+{
+ return badblocks_show(&rdev->badblocks, page, 0);
+}
+static ssize_t bb_store(mdk_rdev_t *rdev, const char *page, size_t len)
+{
+ return badblocks_store(&rdev->badblocks, page, len, 0);
+}
+static struct rdev_sysfs_entry rdev_bad_blocks =
+__ATTR(bad_blocks, S_IRUGO|S_IWUSR, bb_show, bb_store);
+
+
+static ssize_t ubb_show(mdk_rdev_t *rdev, char *page)
+{
+ return badblocks_show(&rdev->badblocks, page, 1);
+}
+static ssize_t ubb_store(mdk_rdev_t *rdev, const char *page, size_t len)
+{
+ return badblocks_store(&rdev->badblocks, page, len, 1);
+}
+static struct rdev_sysfs_entry rdev_unack_bad_blocks =
+__ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
+
static struct attribute *rdev_default_attrs[] = {
&rdev_state.attr,
&rdev_errors.attr,
@@ -2719,6 +2748,8 @@ static struct attribute *rdev_default_attrs[] = {
&rdev_offset.attr,
&rdev_size.attr,
&rdev_recovery_start.attr,
+ &rdev_bad_blocks.attr,
+ &rdev_unack_bad_blocks.attr,
NULL,
};
static ssize_t
@@ -7775,6 +7806,102 @@ void md_ack_all_badblocks(struct badblocks *bb)
}
EXPORT_SYMBOL_GPL(md_ack_all_badblocks);

+/* sysfs access to bad-blocks list.
+ * We present two files.
+ * 'bad-blocks' lists sector numbers and lengths of ranges that
+ * are recorded as bad. The list is truncated to fit within
+ * the one-page limit of sysfs.
+ * Writing "sector length" to this file adds an acknowledged
+ * bad block list.
+ * 'unacknowledged-bad-blocks' lists bad blocks that have not yet
+ * been acknowledged. Writing to this file adds bad blocks
+ * without acknowledging them. This is largely for testing.
+ *
+ */
+
+static ssize_t
+badblocks_show(struct badblocks *bb, char *page, int unack)
+{
+ size_t len = 0;
+ int i;
+ u64 *p;
+ int havelock = 0;
+
+ if (bb->shift < 0)
+ return 0;
+
+ rcu_read_lock();
+ p = rcu_dereference(bb->active_page);
+ if (!p) {
+ spin_lock_irq(&bb->lock);
+ p = bb->page;
+ havelock = 1;
+ }
+
+ i = 0;
+
+ while (len < PAGE_SIZE && i < bb->count) {
+ sector_t s = BB_OFFSET(p[i]);
+ unsigned int length = BB_LEN(p[i]);
+ int ack = BB_ACK(p[i]);
+ i++;
+
+ if (unack && ack)
+ continue;
+
+ len += snprintf(page+len, PAGE_SIZE-len, "%llu %u\n",
+ (unsigned long long)s << bb->shift,
+ length << bb->shift);
+ }
+
+ if (havelock)
+ spin_unlock_irq(&bb->lock);
+ rcu_read_unlock();
+
+ return len;
+}
+
+#define DO_DEBUG 1
+
+static ssize_t
+badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack)
+{
+ unsigned long long sector;
+ int length;
+ char newline;
+#ifdef DO_DEBUG
+ /* Allow clearing via sysfs *only* for testing/debugging.
+ * Normally only a successful write may clear a badblock
+ */
+ int clear = 0;
+ if (page[0] == '-') {
+ clear = 1;
+ page++;
+ }
+#endif /* DO_DEBUG */
+
+ switch (sscanf(page, "%llu %d%c", &sector, &length, &newline)) {
+ case 3:
+ if (newline != '\n')
+ return -EINVAL;
+ case 2:
+ break;
+ default:
+ return -EINVAL;
+ }
+
+#ifdef DO_DEBUG
+ if (clear) {
+ md_clear_badblocks(bb, sector, length);
+ return len;
+ }
+#endif /* DO_DEBUG */
+ if (md_set_badblocks(bb, sector, length, !unack))
+ return len;
+ else
+ return -ENOSPC;
+}
+
static int md_notify_reboot(struct notifier_block *this,
unsigned long code, void *x)
{



[md PATCH 01/36] md: beginnings of bad block management.

on 21.07.2011 04:58:47 by NeilBrown

This is the first step in allowing md to track bad blocks per-device
so that we can fail individual blocks rather than the whole device.

This patch just adds a data structure for recording bad blocks, with
routines to add to, remove from, and search the list.
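
As an illustration (a stand-alone user-space sketch built from the
BB_* macros the patch adds to md.h - not part of the patch itself):

#include <assert.h>
#include <stdint.h>

#define BB_LEN_MASK	(0x00000000000001FFULL)
#define BB_OFFSET_MASK	(0x7FFFFFFFFFFFFE00ULL)
#define BB_ACK_MASK	(0x8000000000000000ULL)
#define BB_OFFSET(x)	(((x) & BB_OFFSET_MASK) >> 9)
#define BB_LEN(x)	(((x) & BB_LEN_MASK) + 1)
#define BB_ACK(x)	(!!((x) & BB_ACK_MASK))
#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((uint64_t)(!!(ack)) << 63))

int main(void)
{
	/* record 512 bad sectors starting at sector 123456, acknowledged */
	uint64_t e = BB_MAKE(123456ULL, 512, 1);

	assert(BB_OFFSET(e) == 123456ULL);	/* 54-bit start sector */
	assert(BB_LEN(e) == 512);		/* 9 bits, stored as len-1 */
	assert(BB_ACK(e));			/* MSB: recorded in metadata */
	return 0;
}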

Signed-off-by: NeilBrown
---

drivers/md/md.c | 457 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/md/md.h | 49 ++++++
2 files changed, 502 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2a32050..220fadb 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1952,6 +1952,10 @@ static void unbind_rdev_from_array(mdk_rdev_t * rdev)
sysfs_remove_link(&rdev->kobj, "block");
sysfs_put(rdev->sysfs_state);
rdev->sysfs_state = NULL;
+ kfree(rdev->badblocks.page);
+ rdev->badblocks.count = 0;
+ rdev->badblocks.page = NULL;
+ rdev->badblocks.active_page = NULL;
/* We need to delay this, otherwise we can deadlock when
* writing to 'remove' to "dev/state". We also need
* to delay it due to rcu usage.
@@ -2778,7 +2782,7 @@ static struct kobj_type rdev_ktype = {
.default_attrs = rdev_default_attrs,
};

-void md_rdev_init(mdk_rdev_t *rdev)
+int md_rdev_init(mdk_rdev_t *rdev)
{
rdev->desc_nr = -1;
rdev->saved_raid_disk = -1;
@@ -2794,6 +2798,20 @@ void md_rdev_init(mdk_rdev_t *rdev)

INIT_LIST_HEAD(&rdev->same_set);
init_waitqueue_head(&rdev->blocked_wait);
+
+ /* Add space to store bad block list.
+ * This reserves the space even on arrays where it cannot
+ * be used - I wonder if that matters
+ */
+ rdev->badblocks.count = 0;
+ rdev->badblocks.shift = 0;
+ rdev->badblocks.page = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ rdev->badblocks.active_page = rdev->badblocks.page;
+ spin_lock_init(&rdev->badblocks.lock);
+ if (rdev->badblocks.page == NULL)
+ return -ENOMEM;
+
+ return 0;
}
EXPORT_SYMBOL_GPL(md_rdev_init);
/*
@@ -2819,8 +2837,11 @@ static mdk_rdev_t *md_import_device(dev_t newdev, int super_format, int super_mi
return ERR_PTR(-ENOMEM);
}

- md_rdev_init(rdev);
- if ((err = alloc_disk_sb(rdev)))
+ err = md_rdev_init(rdev);
+ if (err)
+ goto abort_free;
+ err = alloc_disk_sb(rdev);
+ if (err)
goto abort_free;

err = lock_rdev(rdev, newdev, super_format == -2);
@@ -7324,6 +7345,436 @@ void md_wait_for_blocked_rdev(mdk_rdev_t *rdev, mddev_t *mddev)
}
EXPORT_SYMBOL(md_wait_for_blocked_rdev);

+
+/* Bad block management.
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than the whole device.
+ * Entries in the bad-block table are 64bits wide. This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ * A 'shift' can be set so that larger blocks are tracked and
+ * consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ */
+/* Locking of the bad-block table is a two-layer affair.
+ * Read access through ->active_page only requires an rcu_readlock.
+ * However if ->active_page is found to be NULL, the table
+ * should be accessed through ->page which requires an irq-spinlock.
+ * Updating the page requires setting ->active_page to NULL,
+ * synchronising with rcu, then updating ->page under the same
+ * irq-spinlock.
+ * We always set or clear bad blocks from process context, but
+ * might look-up bad blocks from interrupt/bh context.
+ *
+ */
+/* When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad. So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ * We return
+ * 0 if there are no known bad blocks in the range
+ * 1 if there are known bad block which are all acknowledged
+ * -1 if there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+int md_is_badblock(struct badblocks *bb, sector_t s, int sectors,
+ sector_t *first_bad, int *bad_sectors)
+{
+ int hi;
+ int lo = 0;
+ u64 *p;
+ int rv = 0;
+ int havelock = 0;
+ sector_t target = s + sectors;
+ unsigned long uninitialized_var(flags);
+
+ if (bb->shift > 0) {
+ /* round the start down, and the end up */
+ s >>= bb->shift;
+ target += (1<<bb->shift) - 1;
+ target >>= bb->shift;
+ sectors = target - s;
+ }
+ /* 'target' is now the first block after the bad range */
+
+ rcu_read_lock();
+ p = rcu_dereference(bb->active_page);
+ if (!p) {
+ spin_lock_irqsave(&bb->lock, flags);
+ p = bb->page;
+ havelock = 1;
+ }
+ hi = bb->count;
+
+ /* Binary search between lo and hi for 'target'
+ * i.e. for the last range that starts before 'target'
+ */
+ /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+ * are known not to be the last range before target.
+ * VARIANT: hi-lo is the number of possible
+ * ranges, and decreases until it reaches 1
+ */
+ while (hi - lo > 1) {
+ int mid = (lo + hi) / 2;
+ sector_t a = BB_OFFSET(p[mid]);
+ if (a < target)
+ /* This could still be the one, earlier ranges
+ * could not. */
+ lo = mid;
+ else
+ /* This and later ranges are definitely out. */
+ hi = mid;
+ }
+ /* 'lo' might be the last that started before target, but 'hi' isn't */
+ if (hi > lo) {
+ /* need to check all range that end after 's' to see if
+ * any are unacknowledged.
+ */
+ while (lo >= 0 &&
+ BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
+ if (BB_OFFSET(p[lo]) < target) {
+ /* starts before the end, and finishes after
+ * the start, so they must overlap
+ */
+ if (rv != -1 && BB_ACK(p[lo]))
+ rv = 1;
+ else
+ rv = -1;
+ *first_bad = BB_OFFSET(p[lo]);
+ *bad_sectors = BB_LEN(p[lo]);
+ }
+ lo--;
+ }
+ }
+
+ if (havelock)
+ spin_unlock_irqrestore(&bb->lock, flags);
+ rcu_read_unlock();
+ return rv;
+}
+EXPORT_SYMBOL_GPL(md_is_badblock);
+
+/*
+ * Add a range of bad blocks to the table.
+ * This might extend the table, or might contract it
+ * if two adjacent ranges can be merged.
+ * We binary-search to find the 'insertion' point, then
+ * decide how best to handle it.
+ */
+static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors,
+ int acknowledged)
+{
+ u64 *p;
+ int lo, hi;
+ int rv = 1;
+
+ if (bb->shift < 0)
+ /* badblocks are disabled */
+ return 0;
+
+ if (bb->shift) {
+ /* round the start down, and the end up */
+ sector_t next = s + sectors;
+ s >>= bb->shift;
+ next += (1<<bb->shift) - 1;
+ next >>= bb->shift;
+ sectors = next - s;
+ }
+
+ while (1) {
+ rcu_assign_pointer(bb->active_page, NULL);
+ synchronize_rcu();
+ spin_lock_irq(&bb->lock);
+ if (bb->active_page == NULL)
+ break;
+ /* someone else just unlocked, better retry */
+ spin_unlock_irq(&bb->lock);
+ }
+ /* now have exclusive access to the page */
+
+ p = bb->page;
+ lo = 0;
+ hi = bb->count;
+ /* Find the last range that starts at-or-before 's' */
+ while (hi - lo > 1) {
+ int mid = (lo + hi) / 2;
+ sector_t a = BB_OFFSET(p[mid]);
+ if (a <= s)
+ lo = mid;
+ else
+ hi = mid;
+ }
+ if (hi > lo && BB_OFFSET(p[lo]) > s)
+ hi = lo;
+
+ if (hi > lo) {
+ /* we found a range that might merge with the start
+ * of our new range
+ */
+ sector_t a = BB_OFFSET(p[lo]);
+ sector_t e = a + BB_LEN(p[lo]);
+ int ack = BB_ACK(p[lo]);
+ if (e >= s) {
+ /* Yes, we can merge with a previous range */
+ if (s == a && s + sectors >= e)
+ /* new range covers old */
+ ack = acknowledged;
+ else
+ ack = ack && acknowledged;
+
+ if (e < s + sectors)
+ e = s + sectors;
+ if (e - a <= BB_MAX_LEN) {
+ p[lo] = BB_MAKE(a, e-a, ack);
+ s = e;
+ } else {
+ /* does not all fit in one range,
+ * make p[lo] maximal
+ */
+ if (BB_LEN(p[lo]) != BB_MAX_LEN)
+ p[lo] = BB_MAKE(a, BB_MAX_LEN, ack);
+ s = a + BB_MAX_LEN;
+ }
+ sectors = e - s;
+ }
+ }
+ if (sectors && hi < bb->count) {
+ /* 'hi' points to the first range that starts after 's'.
+ * Maybe we can merge with the start of that range */
+ sector_t a = BB_OFFSET(p[hi]);
+ sector_t e = a + BB_LEN(p[hi]);
+ int ack = BB_ACK(p[hi]);
+ if (a <= s + sectors) {
+ /* merging is possible */
+ if (e <= s + sectors) {
+ /* full overlap */
+ e = s + sectors;
+ ack = acknowledged;
+ } else
+ ack = ack && acknowledged;
+
+ a = s;
+ if (e - a <= BB_MAX_LEN) {
+ p[hi] = BB_MAKE(a, e-a, ack);
+ s = e;
+ } else {
+ p[hi] = BB_MAKE(a, BB_MAX_LEN, ack);
+ s = a + BB_MAX_LEN;
+ }
+ sectors = e - s;
+ lo = hi;
+ hi++;
+ }
+ }
+ if (sectors == 0 && hi < bb->count) {
+ /* we might be able to combine lo and hi */
+ /* Note: 's' is at the end of 'lo' */
+ sector_t a = BB_OFFSET(p[hi]);
+ int lolen = BB_LEN(p[lo]);
+ int hilen = BB_LEN(p[hi]);
+ int newlen = lolen + hilen - (s - a);
+ if (s >= a && newlen < BB_MAX_LEN) {
+ /* yes, we can combine them */
+ int ack = BB_ACK(p[lo]) && BB_ACK(p[hi]);
+ p[lo] = BB_MAKE(BB_OFFSET(p[lo]), newlen, ack);
+ memmove(p + hi, p + hi + 1,
+ (bb->count - hi - 1) * 8);
+ bb->count--;
+ }
+ }
+ while (sectors) {
+ /* didn't merge (it all).
+ * Need to add a range just before 'hi' */
+ if (bb->count >= MD_MAX_BADBLOCKS) {
+ /* No room for more */
+ rv = 0;
+ break;
+ } else {
+ int this_sectors = sectors;
+ memmove(p + hi + 1, p + hi,
+ (bb->count - hi) * 8);
+ bb->count++;
+
+ if (this_sectors > BB_MAX_LEN)
+ this_sectors = BB_MAX_LEN;
+ p[hi] = BB_MAKE(s, this_sectors, acknowledged);
+ sectors -= this_sectors;
+ s += this_sectors;
+ }
+ }
+
+ bb->changed = 1;
+ rcu_assign_pointer(bb->active_page, bb->page);
+ spin_unlock_irq(&bb->lock);
+
+ return rv;
+}
+
+int rdev_set_badblocks(mdk_rdev_t *rdev, sector_t s, int sectors,
+ int acknowledged)
+{
+ int rv = md_set_badblocks(&rdev->badblocks,
+ s + rdev->data_offset, sectors, acknowledged);
+ if (rv) {
+ /* Make sure they get written out promptly */
+ set_bit(MD_CHANGE_CLEAN, &rdev->mddev->flags);
+ md_wakeup_thread(rdev->mddev->thread);
+ }
+ return rv;
+}
+EXPORT_SYMBOL_GPL(rdev_set_badblocks);
+
+/*
+ * Remove a range of bad blocks from the table.
+ * This may involve extending the table if we split a region,
+ * but it must not fail. So if the table becomes full, we just
+ * drop the remove request.
+ */
+static int md_clear_badblocks(struct badblocks *bb, sector_t s, int sectors)
+{
+ u64 *p;
+ int lo, hi;
+ sector_t target = s + sectors;
+ int rv = 0;
+
+ if (bb->shift > 0) {
+ /* When clearing we round the start up and the end down.
+ * This should not matter as the shift should align with
+ * the block size and no rounding should ever be needed.
+ * However it is better to think a block is bad when it
+ * isn't than to think a block is not bad when it is.
+ */
+ s += (1<<bb->shift) - 1;
+ s >>= bb->shift;
+ target >>= bb->shift;
+ sectors = target - s;
+ }
+
+ while (1) {
+ rcu_assign_pointer(bb->active_page, NULL);
+ synchronize_rcu();
+ spin_lock_irq(&bb->lock);
+ if (bb->active_page == NULL)
+ break;
+ /* someone else just unlocked, better retry */
+ spin_unlock_irq(&bb->lock);
+ }
+ /* now have exclusive access to the page */
+
+ p = bb->page;
+ lo = 0;
+ hi = bb->count;
+ /* Find the last range that starts before 'target' */
+ while (hi - lo > 1) {
+ int mid = (lo + hi) / 2;
+ sector_t a = BB_OFFSET(p[mid]);
+ if (a < target)
+ lo = mid;
+ else
+ hi = mid;
+ }
+ if (hi > lo) {
+ /* p[lo] is the last range that could overlap the
+ * current range. Earlier ranges could also overlap,
+ * but only this one can overlap the end of the range.
+ */
+ if (BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > target) {
+ /* Partial overlap, leave the tail of this range */
+ int ack = BB_ACK(p[lo]);
+ sector_t a = BB_OFFSET(p[lo]);
+ sector_t end = a + BB_LEN(p[lo]);
+
+ if (a < s) {
+ /* we need to split this range */
+ if (bb->count >= MD_MAX_BADBLOCKS) {
+ rv = 0;
+ goto out;
+ }
+ memmove(p+lo+1, p+lo, (bb->count - lo) * 8);
+ bb->count++;
+ p[lo] = BB_MAKE(a, s-a, ack);
+ lo++;
+ }
+ p[lo] = BB_MAKE(target, end - target, ack);
+ /* there is no longer an overlap */
+ hi = lo;
+ lo--;
+ }
+ while (lo >= 0 &&
+ BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
+ /* This range does overlap */
+ if (BB_OFFSET(p[lo]) < s) {
+ /* Keep the early parts of this range. */
+ int ack = BB_ACK(p[lo]);
+ sector_t start = BB_OFFSET(p[lo]);
+ p[lo] = BB_MAKE(start, s - start, ack);
+ /* now low doesn't overlap, so.. */
+ break;
+ }
+ lo--;
+ }
+ /* 'lo' is strictly before, 'hi' is strictly after,
+ * anything between needs to be discarded
+ */
+ if (hi - lo > 1) {
+ memmove(p+lo+1, p+hi, (bb->count - hi) * 8);
+ bb->count -= (hi - lo - 1);
+ }
+ }
+
+ bb->changed = 1;
+out:
+ rcu_assign_pointer(bb->active_page, bb->page);
+ spin_unlock_irq(&bb->lock);
+ return rv;
+}
+
+int rdev_clear_badblocks(mdk_rdev_t *rdev, sector_t s, int sectors)
+{
+ return md_clear_badblocks(&rdev->badblocks,
+ s + rdev->data_offset,
+ sectors);
+}
+EXPORT_SYMBOL_GPL(rdev_clear_badblocks);
+
+/*
+ * Acknowledge all bad blocks in a list.
+ * This only succeeds if ->changed is clear. It is used by
+ * in-kernel metadata updates
+ */
+void md_ack_all_badblocks(struct badblocks *bb)
+{
+ if (bb->page == NULL || bb->changed)
+ /* no point even trying */
+ return;
+ while (1) {
+ rcu_assign_pointer(bb->active_page, NULL);
+ synchronize_rcu();
+ spin_lock_irq(&bb->lock);
+ if (bb->active_page == NULL)
+ break;
+ /* someone else just unlocked, better retry */
+ spin_unlock_irq(&bb->lock);
+ }
+ /* now have exclusive access to the page */
+
+ if (bb->changed == 0) {
+ u64 *p = bb->page;
+ int i;
+ for (i = 0; i < bb->count ; i++) {
+ if (!BB_ACK(p[i])) {
+ sector_t start = BB_OFFSET(p[i]);
+ int len = BB_LEN(p[i]);
+ p[i] = BB_MAKE(start, len, 1);
+ }
+ }
+ }
+ rcu_assign_pointer(bb->active_page, bb->page);
+ spin_unlock_irq(&bb->lock);
+}
+EXPORT_SYMBOL_GPL(md_ack_all_badblocks);
+
static int md_notify_reboot(struct notifier_block *this,
unsigned long code, void *x)
{
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 7d906a9..d327734 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -29,6 +29,13 @@
typedef struct mddev_s mddev_t;
typedef struct mdk_rdev_s mdk_rdev_t;

+/* Bad block numbers are stored sorted in a single page.
+ * 64bits is used for each block or extent.
+ * 54 bits are sector number, 9 bits are extent size,
+ * 1 bit is an 'acknowledged' flag.
+ */
+#define MD_MAX_BADBLOCKS (PAGE_SIZE/8)
+
/*
* MD's 'extended' device
*/
@@ -111,8 +118,48 @@ struct mdk_rdev_s

struct sysfs_dirent *sysfs_state; /* handle for 'state'
* sysfs entry */
+
+ struct badblocks {
+ int count; /* count of bad blocks */
+ int shift; /* shift from sectors to block size
+ * a -ve shift means badblocks are
+ * disabled.*/
+ u64 *page; /* badblock list */
+ u64 *active_page; /* either 'page' or 'NULL' */
+ int changed;
+ spinlock_t lock;
+ } badblocks;
};

+#define BB_LEN_MASK (0x00000000000001FFULL)
+#define BB_OFFSET_MASK (0x7FFFFFFFFFFFFE00ULL)
+#define BB_ACK_MASK (0x8000000000000000ULL)
+#define BB_MAX_LEN 512
+#define BB_OFFSET(x) (((x) & BB_OFFSET_MASK) >> 9)
+#define BB_LEN(x) (((x) & BB_LEN_MASK) + 1)
+#define BB_ACK(x) (!!((x) & BB_ACK_MASK))
+#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
+
+extern int md_is_badblock(struct badblocks *bb, sector_t s, int sectors,
+ sector_t *first_bad, int *bad_sectors);
+static inline int is_badblock(mdk_rdev_t *rdev, sector_t s, int sectors,
+ sector_t *first_bad, int *bad_sectors)
+{
+ if (unlikely(rdev->badblocks.count)) {
+ int rv = md_is_badblock(&rdev->badblocks, rdev->data_offset + s,
+ sectors,
+ first_bad, bad_sectors);
+ if (rv)
+ *first_bad -= rdev->data_offset;
+ return rv;
+ }
+ return 0;
+}
+extern int rdev_set_badblocks(mdk_rdev_t *rdev, sector_t s, int sectors,
+ int acknowledged);
+extern int rdev_clear_badblocks(mdk_rdev_t *rdev, sector_t s, int sectors);
+extern void md_ack_all_badblocks(struct badblocks *bb);
+
struct mddev_s
{
void *private;
@@ -517,7 +564,7 @@ extern void mddev_init(mddev_t *mddev);
extern int md_run(mddev_t *mddev);
extern void md_stop(mddev_t *mddev);
extern void md_stop_writes(mddev_t *mddev);
-extern void md_rdev_init(mdk_rdev_t *rdev);
+extern int md_rdev_init(mdk_rdev_t *rdev);

extern void mddev_suspend(mddev_t *mddev);
extern void mddev_resume(mddev_t *mddev);



[md PATCH 03/36] md: don't allow arrays to contain devices with bad blocks.

on 21.07.2011 04:58:47 by NeilBrown

As no personality understands bad-block lists yet, we must
reject any device that is known to contain bad blocks.
As the personalities are taught, these tests can be removed.

This only applies to raid1/raid5/raid10.
For linear/raid0/multipath/faulty the whole concept of bad blocks
doesn't mean anything, so there is no point adding the checks.

Signed-off-by: NeilBrown
---

drivers/md/raid1.c | 7 +++++++
drivers/md/raid10.c | 8 ++++++++
drivers/md/raid5.c | 7 +++++++
3 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 3cbf0ac..8db311d 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1055,6 +1055,9 @@ static int raid1_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
if (mddev->recovery_disabled == conf->recovery_disabled)
return -EBUSY;

+ if (rdev->badblocks.count)
+ return -EINVAL;
+
if (rdev->raid_disk >= 0)
first = last = rdev->raid_disk;

@@ -1994,6 +1997,10 @@ static int run(mddev_t *mddev)
blk_queue_segment_boundary(mddev->queue,
PAGE_CACHE_SIZE - 1);
}
+ if (rdev->badblocks.count) {
+ printk(KERN_ERR "md/raid1: Cannot handle bad blocks yet\n");
+ return -EINVAL;
+ }
}

mddev->degraded = 0;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 5def27c..8aadd2f 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1101,6 +1101,9 @@ static int raid10_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
int first = 0;
int last = conf->raid_disks - 1;

+ if (rdev->badblocks.count)
+ return -EINVAL;
+
if (mddev->recovery_cp < MaxSector)
/* only hot-add to in-sync arrays, as recovery is
* very different from resync
@@ -2263,6 +2266,11 @@ static int run(mddev_t *mddev)
(conf->raid_disks / conf->near_copies));

list_for_each_entry(rdev, &mddev->disks, same_set) {
+
+ if (rdev->badblocks.count) {
+ printk(KERN_ERR "md/raid10: cannot handle bad blocks yet\n");
+ goto out_free_conf;
+ }
disk_idx = rdev->raid_disk;
if (disk_idx >= conf->raid_disks
|| disk_idx < 0)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6337768..db5d703 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4667,6 +4667,10 @@ static int run(mddev_t *mddev)
* 0 for a fully functional array, 1 or 2 for a degraded array.
*/
list_for_each_entry(rdev, &mddev->disks, same_set) {
+ if (rdev->badblocks.count) {
+ printk(KERN_ERR "md/raid5: cannot handle bad blocks yet\n");
+ goto abort;
+ }
if (rdev->raid_disk < 0)
continue;
if (test_bit(In_sync, &rdev->flags)) {
@@ -4975,6 +4979,9 @@ static int raid5_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
int first = 0;
int last = conf->raid_disks - 1;

+ if (rdev->badblocks.count)
+ return -EINVAL;
+
if (has_failed(conf))
/* no point adding a device */
return -EINVAL;



[md PATCH 10/36] md/raid1: avoid writing to known-bad blocks on known-bad drives.

on 21.07.2011 04:58:48 by NeilBrown

If we have seen any write error on a drive, then don't write to
any known-bad blocks on that drive.
If necessary, we divide the write request up into pieces just
like we do for reads, so each piece is either all written or
all not written to any given drive.
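
To make the split arithmetic concrete, here is a stand-alone sketch
with made-up numbers, mirroring the max_sectors logic in the
make_request() hunk below:

#include <stdio.h>

int main(void)
{
	long long sector = 1000;	/* start of this r1_bio */
	long long max_sectors = 256;	/* sectors still to write */
	long long first_bad = 1100, bad_sectors = 32; /* known bad range */

	if (first_bad <= sector) {
		/* bad range covers the start: cannot write here at all,
		 * so other devices must not write past it either */
		bad_sectors -= sector - first_bad;
		if (bad_sectors < max_sectors)
			max_sectors = bad_sectors;
	} else {
		/* only write the good prefix in this pass */
		long long good_sectors = first_bad - sector;
		if (good_sectors < max_sectors)
			max_sectors = good_sectors;
	}
	printf("write %lld sectors at %lld; the rest needs a new r1_bio\n",
	       max_sectors, sector);	/* prints 100 here */
	return 0;
}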

Signed-off-by: NeilBrown
---

drivers/md/raid1.c | 152 +++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 115 insertions(+), 37 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 4d40d9d..44277dc 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -772,6 +772,9 @@ static int make_request(mddev_t *mddev, struct bio * bio)
const unsigned long do_flush_fua = (bio->bi_rw & (REQ_FLUSH | REQ_FUA));
mdk_rdev_t *blocked_rdev;
int plugged;
+ int first_clone;
+ int sectors_handled;
+ int max_sectors;

/*
* Register the new request and wait if the reconstruction
@@ -832,7 +835,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
/*
* read balancing logic:
*/
- int max_sectors;
int rdisk;

read_again:
@@ -872,7 +874,6 @@ read_again:
/* could not read all from this device, so we will
* need another r1_bio.
*/
- int sectors_handled;

sectors_handled = (r1_bio->sector + max_sectors
- bio->bi_sector);
@@ -906,9 +907,15 @@ read_again:
/*
* WRITE:
*/
- /* first select target devices under spinlock and
+ /* first select target devices under rcu_lock and
* inc refcount on their rdev. Record them by setting
* bios[x] to bio
+ * If there are known/acknowledged bad blocks on any device on
+ * which we have seen a write error, we want to avoid writing those
+ * blocks.
+ * This potentially requires several writes to write around
+ * the bad blocks. Each set of writes gets its own r1bio
+ * with a set of bios attached.
*/
plugged = mddev_check_plugged(mddev);

@@ -916,6 +923,7 @@ read_again:
retry_write:
blocked_rdev = NULL;
rcu_read_lock();
+ max_sectors = r1_bio->sectors;
for (i = 0; i < disks; i++) {
mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
@@ -923,17 +931,57 @@ read_again:
blocked_rdev = rdev;
break;
}
- if (rdev && !test_bit(Faulty, &rdev->flags)) {
- atomic_inc(&rdev->nr_pending);
- if (test_bit(Faulty, &rdev->flags)) {
+ r1_bio->bios[i] = NULL;
+ if (!rdev || test_bit(Faulty, &rdev->flags)) {
+ set_bit(R1BIO_Degraded, &r1_bio->state);
+ continue;
+ }
+
+ atomic_inc(&rdev->nr_pending);
+ if (test_bit(WriteErrorSeen, &rdev->flags)) {
+ sector_t first_bad;
+ int bad_sectors;
+ int is_bad;
+
+ is_bad = is_badblock(rdev, r1_bio->sector,
+ max_sectors,
+ &first_bad, &bad_sectors);
+ if (is_bad < 0) {
+ /* mustn't write here until the bad block is
+ * acknowledged*/
+ set_bit(BlockedBadBlocks, &rdev->flags);
+ blocked_rdev = rdev;
+ break;
+ }
+ if (is_bad && first_bad <= r1_bio->sector) {
+ /* Cannot write here at all */
+ bad_sectors -= (r1_bio->sector - first_bad);
+ if (bad_sectors < max_sectors)
+ /* mustn't write more than bad_sectors
+ * to other devices yet
+ */
+ max_sectors = bad_sectors;
rdev_dec_pending(rdev, mddev);
- r1_bio->bios[i] = NULL;
- } else {
- r1_bio->bios[i] = bio;
- targets++;
+ /* We don't set R1BIO_Degraded as that
+ * only applies if the disk is
+ * missing, so it might be re-added,
+ * and we want to know to recover this
+ * chunk.
+ * In this case the device is here,
+ * and the fact that this chunk is not
+ * in-sync is recorded in the bad
+ * block log
+ */
+ continue;
}
- } else
- r1_bio->bios[i] = NULL;
+ if (is_bad) {
+ int good_sectors = first_bad - r1_bio->sector;
+ if (good_sectors < max_sectors)
+ max_sectors = good_sectors;
+ }
+ }
+ r1_bio->bios[i] = bio;
+ targets++;
}
rcu_read_unlock();

@@ -944,48 +992,56 @@ read_again:
for (j = 0; j < i; j++)
if (r1_bio->bios[j])
rdev_dec_pending(conf->mirrors[j].rdev, mddev);
-
+ r1_bio->state = 0;
allow_barrier(conf);
md_wait_for_blocked_rdev(blocked_rdev, mddev);
wait_barrier(conf);
goto retry_write;
}

- if (targets < conf->raid_disks) {
- /* array is degraded, we will not clear the bitmap
- * on I/O completion (see raid1_end_write_request) */
- set_bit(R1BIO_Degraded, &r1_bio->state);
+ if (max_sectors < r1_bio->sectors) {
+ /* We are splitting this write into multiple parts, so
+ * we need to prepare for allocating another r1_bio.
+ */
+ r1_bio->sectors = max_sectors;
+ spin_lock_irq(&conf->device_lock);
+ if (bio->bi_phys_segments == 0)
+ bio->bi_phys_segments = 2;
+ else
+ bio->bi_phys_segments++;
+ spin_unlock_irq(&conf->device_lock);
}
-
- /* do behind I/O ?
- * Not if there are too many, or cannot allocate memory,
- * or a reader on WriteMostly is waiting for behind writes
- * to flush */
- if (bitmap &&
- (atomic_read(&bitmap->behind_writes)
- < mddev->bitmap_info.max_write_behind) &&
- !waitqueue_active(&bitmap->behind_wait))
- alloc_behind_pages(bio, r1_bio);
+ sectors_handled = r1_bio->sector + max_sectors - bio->bi_sector;

atomic_set(&r1_bio->remaining, 1);
atomic_set(&r1_bio->behind_remaining, 0);

- bitmap_startwrite(bitmap, bio->bi_sector, r1_bio->sectors,
- test_bit(R1BIO_BehindIO, &r1_bio->state));
+ first_clone = 1;
for (i = 0; i < disks; i++) {
struct bio *mbio;
if (!r1_bio->bios[i])
continue;

mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
- r1_bio->bios[i] = mbio;
-
- mbio->bi_sector = r1_bio->sector + conf->mirrors[i].rdev->data_offset;
- mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
- mbio->bi_end_io = raid1_end_write_request;
- mbio->bi_rw = WRITE | do_flush_fua | do_sync;
- mbio->bi_private = r1_bio;
-
+ md_trim_bio(mbio, r1_bio->sector - bio->bi_sector, max_sectors);
+
+ if (first_clone) {
+ /* do behind I/O ?
+ * Not if there are too many, or cannot
+ * allocate memory, or a reader on WriteMostly
+ * is waiting for behind writes to flush */
+ if (bitmap &&
+ (atomic_read(&bitmap->behind_writes)
+ < mddev->bitmap_info.max_write_behind) &&
+ !waitqueue_active(&bitmap->behind_wait))
+ alloc_behind_pages(mbio, r1_bio);
+
+ bitmap_startwrite(bitmap, r1_bio->sector,
+ r1_bio->sectors,
+ test_bit(R1BIO_BehindIO,
+ &r1_bio->state));
+ first_clone = 0;
+ }
if (r1_bio->behind_pages) {
struct bio_vec *bvec;
int j;
@@ -1003,6 +1059,15 @@ read_again:
atomic_inc(&r1_bio->behind_remaining);
}

+ r1_bio->bios[i] = mbio;
+
+ mbio->bi_sector = (r1_bio->sector +
+ conf->mirrors[i].rdev->data_offset);
+ mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
+ mbio->bi_end_io = raid1_end_write_request;
+ mbio->bi_rw = WRITE | do_flush_fua | do_sync;
+ mbio->bi_private = r1_bio;
+
atomic_inc(&r1_bio->remaining);
spin_lock_irqsave(&conf->device_lock, flags);
bio_list_add(&conf->pending_bio_list, mbio);
@@ -1013,6 +1078,19 @@ read_again:
/* In case raid1d snuck in to freeze_array */
wake_up(&conf->wait_barrier);

+ if (sectors_handled < (bio->bi_size >> 9)) {
+ /* We need another r1_bio. It has already been counted
+ * in bio->bi_phys_segments
+ */
+ r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+ r1_bio->master_bio = bio;
+ r1_bio->sectors = (bio->bi_size >> 9) - sectors_handled;
+ r1_bio->state = 0;
+ r1_bio->mddev = mddev;
+ r1_bio->sector = bio->bi_sector + sectors_handled;
+ goto retry_write;
+ }
+
if (do_sync || !bitmap || !plugged)
md_wakeup_thread(mddev->thread);




[md PATCH 14/36] md/raid1: record badblocks found during resync etc.

on 21.07.2011 04:58:48 by NeilBrown

If we find a bad block while writing as part of resync/recovery we
need to report that back to raid1d which must record the bad block,
or fail the device.

Similarly when fixing a read error, a further error should just
record a bad block if possible rather than failing the device.

Signed-off-by: NeilBrown
---

drivers/md/raid1.c | 81 +++++++++++++++++++++++++++++++++-------------------
1 files changed, 51 insertions(+), 30 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 0ebeec9..83c918e 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1387,7 +1387,9 @@ static void end_sync_write(struct bio *bio, int error)
s += sync_blocks;
sectors_to_go -= sync_blocks;
} while (sectors_to_go > 0);
- md_error(mddev, conf->mirrors[mirror].rdev);
+ set_bit(WriteErrorSeen,
+ &conf->mirrors[mirror].rdev->flags);
+ set_bit(R1BIO_WriteError, &r1_bio->state);
} else if (is_badblock(conf->mirrors[mirror].rdev,
r1_bio->sector,
r1_bio->sectors,
@@ -1398,7 +1400,8 @@ static void end_sync_write(struct bio *bio, int error)

if (atomic_dec_and_test(&r1_bio->remaining)) {
int s = r1_bio->sectors;
- if (test_bit(R1BIO_MadeGood, &r1_bio->state))
+ if (test_bit(R1BIO_MadeGood, &r1_bio->state) ||
+ test_bit(R1BIO_WriteError, &r1_bio->state))
reschedule_retry(r1_bio);
else {
put_buf(r1_bio);
@@ -1407,6 +1410,20 @@ static void end_sync_write(struct bio *bio, int error)
}
}

+static int r1_sync_page_io(mdk_rdev_t *rdev, sector_t sector,
+ int sectors, struct page *page, int rw)
+{
+ if (sync_page_io(rdev, sector, sectors << 9, page, rw, false))
+ /* success */
+ return 1;
+ if (rw == WRITE)
+ set_bit(WriteErrorSeen, &rdev->flags);
+ /* need to record an error - either for the block or the device */
+ if (!rdev_set_badblocks(rdev, sector, sectors, 0))
+ md_error(rdev->mddev, rdev);
+ return 0;
+}
+
static int fix_sync_read_error(r1bio_t *r1_bio)
{
/* Try some synchronous reads of other devices to get
@@ -1478,12 +1495,11 @@ static int fix_sync_read_error(r1bio_t *r1_bio)
if (r1_bio->bios[d]->bi_end_io != end_sync_read)
continue;
rdev = conf->mirrors[d].rdev;
- if (sync_page_io(rdev, sect, s<<9,
- bio->bi_io_vec[idx].bv_page,
- WRITE, false) == 0) {
+ if (r1_sync_page_io(rdev, sect, s,
+ bio->bi_io_vec[idx].bv_page,
+ WRITE) == 0) {
r1_bio->bios[d]->bi_end_io = NULL;
rdev_dec_pending(rdev, mddev);
- md_error(mddev, rdev);
}
}
d = start;
@@ -1494,11 +1510,9 @@ static int fix_sync_read_error(r1bio_t *r1_bio)
if (r1_bio->bios[d]->bi_end_io != end_sync_read)
continue;
rdev = conf->mirrors[d].rdev;
- if (sync_page_io(rdev, sect, s<<9,
- bio->bi_io_vec[idx].bv_page,
- READ, false) == 0)
- md_error(mddev, rdev);
- else
+ if (r1_sync_page_io(rdev, sect, s,
+ bio->bi_io_vec[idx].bv_page,
+ READ) != 0)
atomic_add(s, &rdev->corrected_errors);
}
sectors -= s;
@@ -1683,8 +1697,10 @@ static void fix_read_error(conf_t *conf, int read_disk,
} while (!success && d != read_disk);

if (!success) {
- /* Cannot read from anywhere -- bye bye array */
- md_error(mddev, conf->mirrors[read_disk].rdev);
+ /* Cannot read from anywhere - mark it bad */
+ mdk_rdev_t *rdev = conf->mirrors[read_disk].rdev;
+ if (!rdev_set_badblocks(rdev, sect, s, 0))
+ md_error(mddev, rdev);
break;
}
/* write it back and re-read */
@@ -1695,13 +1711,9 @@ static void fix_read_error(conf_t *conf, int read_disk,
d--;
rdev = conf->mirrors[d].rdev;
if (rdev &&
- test_bit(In_sync, &rdev->flags)) {
- if (sync_page_io(rdev, sect, s<<9,
- conf->tmppage, WRITE, false)
- == 0)
- /* Well, this device is dead */
- md_error(mddev, rdev);
- }
+ test_bit(In_sync, &rdev->flags))
+ r1_sync_page_io(rdev, sect, s,
+ conf->tmppage, WRITE);
}
d = start;
while (d != read_disk) {
@@ -1712,12 +1724,8 @@ static void fix_read_error(conf_t *conf, int read_disk,
rdev = conf->mirrors[d].rdev;
if (rdev &&
test_bit(In_sync, &rdev->flags)) {
- if (sync_page_io(rdev, sect, s<<9,
- conf->tmppage, READ, false)
- == 0)
- /* Well, this device is dead */
- md_error(mddev, rdev);
- else {
+ if (r1_sync_page_io(rdev, sect, s,
+ conf->tmppage, READ)) {
atomic_add(s, &rdev->corrected_errors);
printk(KERN_INFO
"md/raid1:%s: read error corrected "
@@ -1861,20 +1869,33 @@ static void raid1d(mddev_t *mddev)
mddev = r1_bio->mddev;
conf = mddev->private;
if (test_bit(R1BIO_IsSync, &r1_bio->state)) {
- if (test_bit(R1BIO_MadeGood, &r1_bio->state)) {
+ if (test_bit(R1BIO_MadeGood, &r1_bio->state) ||
+ test_bit(R1BIO_WriteError, &r1_bio->state)) {
int m;
int s = r1_bio->sectors;
for (m = 0; m < conf->raid_disks ; m++) {
+ mdk_rdev_t *rdev
+ = conf->mirrors[m].rdev;
struct bio *bio = r1_bio->bios[m];
- if (bio->bi_end_io != NULL &&
- test_bit(BIO_UPTODATE,
+ if (bio->bi_end_io == NULL)
+ continue;
+ if (test_bit(BIO_UPTODATE,
&bio->bi_flags)) {
- rdev = conf->mirrors[m].rdev;
rdev_clear_badblocks(
rdev,
r1_bio->sector,
r1_bio->sectors);
}
+ if (!test_bit(BIO_UPTODATE,
+ &bio->bi_flags) &&
+ test_bit(R1BIO_WriteError,
+ &r1_bio->state)) {
+ if (!rdev_set_badblocks(
+ rdev,
+ r1_bio->sector,
+ r1_bio->sectors, 0))
+ md_error(mddev, rdev);
+ }
}
put_buf(r1_bio);
md_done_sync(mddev, s, 1);



[md PATCH 11/36] md/raid1: clear bad-block record when write succeeds.

on 21.07.2011 04:58:48 by NeilBrown

If we succeed in writing to a block that was recorded as
being bad, we clear the bad-block record.

This requires some delayed handling as the bad-block-list update has
to happen in process-context.
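
As a side note on the mechanism, here is a hypothetical user-space
analogue of the IO_BLOCKED/IO_MADE_GOOD sentinels added to raid1.h
below: small integers cast to pointers mark special per-device states,
and BIO_SPECIAL() keeps them away from bio_put():

#include <stdio.h>

struct bio { int x; };

#define IO_BLOCKED	((struct bio *)1)
#define IO_MADE_GOOD	((struct bio *)2)
#define BIO_SPECIAL(b)	((unsigned long)(b) <= 2)

int main(void)
{
	struct bio real = { 0 };
	struct bio *bios[] = { &real, IO_BLOCKED, IO_MADE_GOOD };
	int i;

	for (i = 0; i < 3; i++)
		printf("bios[%d]: %s\n", i,
		       BIO_SPECIAL(bios[i]) ? "sentinel - skip bio_put()"
					    : "real bio - bio_put() it");
	return 0;
}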

Signed-off-by: NeilBrown
---

drivers/md/raid1.c | 79 +++++++++++++++++++++++++++++++++++++++++++++-------
drivers/md/raid1.h | 13 ++++++++-
2 files changed, 80 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 44277dc..6e605b2 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -163,7 +163,7 @@ static void put_all_bios(conf_t *conf, r1bio_t *r1_bio)

for (i = 0; i < conf->raid_disks; i++) {
struct bio **bio = r1_bio->bios + i;
- if (*bio && *bio != IO_BLOCKED)
+ if (!BIO_SPECIAL(*bio))
bio_put(*bio);
*bio = NULL;
}
@@ -337,7 +337,10 @@ static void r1_bio_write_done(r1bio_t *r1_bio)
!test_bit(R1BIO_Degraded, &r1_bio->state),
test_bit(R1BIO_BehindIO, &r1_bio->state));
md_write_end(r1_bio->mddev);
- raid_end_bio_io(r1_bio);
+ if (test_bit(R1BIO_MadeGood, &r1_bio->state))
+ reschedule_retry(r1_bio);
+ else
+ raid_end_bio_io(r1_bio);
}
}

@@ -363,7 +366,7 @@ static void raid1_end_write_request(struct bio *bio, int error)
md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
/* an I/O failed, we can't clear the bitmap */
set_bit(R1BIO_Degraded, &r1_bio->state);
- } else
+ } else {
/*
* Set R1BIO_Uptodate in our master bio, so that we
* will return a good error code for to the higher
@@ -374,8 +377,20 @@ static void raid1_end_write_request(struct bio *bio, int error)
* to user-side. So if something waits for IO, then it
* will wait for the 'master' bio.
*/
+ sector_t first_bad;
+ int bad_sectors;
+
set_bit(R1BIO_Uptodate, &r1_bio->state);

+ /* Maybe we can clear some bad blocks. */
+ if (is_badblock(conf->mirrors[mirror].rdev,
+ r1_bio->sector, r1_bio->sectors,
+ &first_bad, &bad_sectors)) {
+ r1_bio->bios[mirror] = IO_MADE_GOOD;
+ set_bit(R1BIO_MadeGood, &r1_bio->state);
+ }
+ }
+
update_head_pos(mirror, r1_bio);

if (behind) {
@@ -402,7 +417,9 @@ static void raid1_end_write_request(struct bio *bio, int error)
}
}
}
- rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
+ if (r1_bio->bios[mirror] == NULL)
+ rdev_dec_pending(conf->mirrors[mirror].rdev,
+ conf->mddev);

/*
* Let's see if all mirrored write operations have finished
@@ -1341,6 +1358,8 @@ static void end_sync_write(struct bio *bio, int error)
conf_t *conf = mddev->private;
int i;
int mirror=0;
+ sector_t first_bad;
+ int bad_sectors;

for (i = 0; i < conf->raid_disks; i++)
if (r1_bio->bios[i] == bio) {
@@ -1359,14 +1378,22 @@ static void end_sync_write(struct bio *bio, int error)
sectors_to_go -= sync_blocks;
} while (sectors_to_go > 0);
md_error(mddev, conf->mirrors[mirror].rdev);
- }
+ } else if (is_badblock(conf->mirrors[mirror].rdev,
+ r1_bio->sector,
+ r1_bio->sectors,
+ &first_bad, &bad_sectors))
+ set_bit(R1BIO_MadeGood, &r1_bio->state);

update_head_pos(mirror, r1_bio);

if (atomic_dec_and_test(&r1_bio->remaining)) {
- sector_t s = r1_bio->sectors;
- put_buf(r1_bio);
- md_done_sync(mddev, s, uptodate);
+ int s = r1_bio->sectors;
+ if (test_bit(R1BIO_MadeGood, &r1_bio->state))
+ reschedule_retry(r1_bio);
+ else {
+ put_buf(r1_bio);
+ md_done_sync(mddev, s, uptodate);
+ }
}
}

@@ -1728,9 +1755,39 @@ static void raid1d(mddev_t *mddev)

mddev = r1_bio->mddev;
conf = mddev->private;
- if (test_bit(R1BIO_IsSync, &r1_bio->state))
- sync_request_write(mddev, r1_bio);
- else if (test_bit(R1BIO_ReadError, &r1_bio->state)) {
+ if (test_bit(R1BIO_IsSync, &r1_bio->state)) {
+ if (test_bit(R1BIO_MadeGood, &r1_bio->state)) {
+ int m;
+ int s = r1_bio->sectors;
+ for (m = 0; m < conf->raid_disks ; m++) {
+ struct bio *bio = r1_bio->bios[m];
+ if (bio->bi_end_io != NULL &&
+ test_bit(BIO_UPTODATE,
+ &bio->bi_flags)) {
+ rdev = conf->mirrors[m].rdev;
+ rdev_clear_badblocks(
+ rdev,
+ r1_bio->sector,
+ r1_bio->sectors);
+ }
+ }
+ put_buf(r1_bio);
+ md_done_sync(mddev, s, 1);
+ } else
+ sync_request_write(mddev, r1_bio);
+ } else if (test_bit(R1BIO_MadeGood, &r1_bio->state)) {
+ int m;
+ for (m = 0; m < conf->raid_disks ; m++)
+ if (r1_bio->bios[m] == IO_MADE_GOOD) {
+ rdev = conf->mirrors[m].rdev;
+ rdev_clear_badblocks(
+ rdev,
+ r1_bio->sector,
+ r1_bio->sectors);
+ rdev_dec_pending(rdev, mddev);
+ }
+ raid_end_bio_io(r1_bio);
+ } else if (test_bit(R1BIO_ReadError, &r1_bio->state)) {
int disk;
int max_sectors;

diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index aa6af37..f81360d 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -116,7 +116,14 @@ struct r1bio_s {
* correct the read error. To keep track of bad blocks on a per-bio
* level, we store IO_BLOCKED in the appropriate 'bios' pointer
*/
-#define IO_BLOCKED ((struct bio*)1)
+#define IO_BLOCKED ((struct bio *)1)
+/* When we successfully write to a known bad-block, we need to remove the
+ * bad-block marking which must be done from process context. So we record
+ * the success by setting bios[n] to IO_MADE_GOOD
+ */
+#define IO_MADE_GOOD ((struct bio *)2)
+
+#define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)

/* bits for r1bio.state */
#define R1BIO_Uptodate 0
@@ -135,6 +142,10 @@ struct r1bio_s {
* Record that bi_end_io was called with this flag...
*/
#define R1BIO_Returned 6
+/* If a write for this request means we can clear some
+ * known-bad-block records, we set this flag
+ */
+#define R1BIO_MadeGood 7

extern int md_raid1_congested(mddev_t *mddev, int bits);




[md PATCH 09/36] md: make it easier to wait for bad blocks to be acknowledged.

on 21.07.2011 04:58:48 by NeilBrown

It is only safe to choose not to write to a bad block if that bad
block is safely recorded in metadata - i.e. if it has been
'acknowledged'.

If it hasn't we need to wait for the acknowledgement.

We support that using rdev->blocked_wait and
md_wait_for_blocked_rdev by introducing a new device flag
'BlockedBadBlocks'.

This flag is only advisory.
It is cleared whenever we acknowledge a bad block, so that a waiter
can re-check the particular bad blocks that it is interested in.

It should be set by a caller when it finds it needs to wait.
This (set after test) is inherently racy, but as
md_wait_for_blocked_rdev already has a timeout, losing the race will
have minimal impact.
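
A minimal sketch of the intended caller pattern; it assumes that
is_badblock() reports a range containing unacknowledged bad blocks
with a negative return value, which the hunks quoted here do not
show:

	int is_bad = is_badblock(rdev, sector, sectors,
				 &first_bad, &bad_sectors);
	if (is_bad < 0) {
		/* unacknowledged bad block - must not write here yet */
		set_bit(BlockedBadBlocks, &rdev->flags); /* set after test */
		atomic_inc(&rdev->nr_pending);
		/* waits for both Blocked flags to clear, or for the 5s
		 * timeout, and drops the nr_pending reference for us */
		md_wait_for_blocked_rdev(rdev, mddev);
		/* now re-check the bad-block list and retry */
	}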

When we clear "Blocked" we also clear "BlockedBadBlocks" in case it
was set incorrectly (see the race above).

We also modify the way we manage 'Blocked' to fit better with the new
handling of 'BlockedBadBlocks' and to make it consistent between
externally managed and internally managed metadata. This requires
that each raidXd loop checks if the metadata needs to be written and
triggers a write (md_check_recovery) if needed. Otherwise a queued
write request might leave raidXd waiting for the metadata to be
written, even though raidXd itself is the only thread that can write
it.

Before writing metadata, we set FaultRecorded for all devices that
are Faulty, then after writing the metadata we clear Blocked for any
device for which the Fault was certainly Recorded.
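
Condensed from the md_update_sb() hunks below, the handshake is
roughly:

	/* before writing the superblocks */
	if (test_bit(Faulty, &rdev->flags))
		set_bit(FaultRecorded, &rdev->flags);

	/* ... superblocks written and safely on disk ... */

	if (test_and_clear_bit(FaultRecorded, &rdev->flags))
		clear_bit(Blocked, &rdev->flags);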

The 'faulty' device flag now appears in sysfs if the device is faulty
*or* it has unacknowledged bad blocks. So user-space which does not
understand bad blocks can continue to function correctly.
User-space which does understand them should not assume a device is
faulty until it sees the 'faulty' flag, and then sees that the list
of unacknowledged bad blocks is empty.
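
A user-space sketch of that two-step check; it assumes the usual
per-device sysfs directory (/sys/block/mdX/md/dev-NAME/) and an
'unacknowledged_bad_blocks' file alongside 'bad_blocks' - the 'unack'
path through badblocks_show() below implies such a file, but it is
not shown in this excerpt:

	#include <stdio.h>
	#include <string.h>

	/* e.g. rdev_really_faulty("/sys/block/md0/md/dev-sda") */
	int rdev_really_faulty(const char *devdir)
	{
		char path[256], buf[256];
		FILE *f;

		snprintf(path, sizeof(path), "%s/state", devdir);
		f = fopen(path, "r");
		if (!f)
			return 0;
		if (!fgets(buf, sizeof(buf), f))
			buf[0] = '\0';
		fclose(f);
		if (!strstr(buf, "faulty"))
			return 0;	/* not flagged at all */

		snprintf(path, sizeof(path),
			 "%s/unacknowledged_bad_blocks", devdir);
		f = fopen(path, "r");
		if (!f)
			return 1;	/* no bad-block support */
		if (!fgets(buf, sizeof(buf), f))
			buf[0] = '\0';
		fclose(f);
		return buf[0] == '\0';	/* empty => genuinely faulty */
	}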

Signed-off-by: NeilBrown
---

drivers/md/md.c | 77 ++++++++++++++++++++++++++++++++++-----------------
drivers/md/md.h | 25 +++++++++++++++--
drivers/md/raid1.c | 3 ++
drivers/md/raid10.c | 3 ++
drivers/md/raid5.c | 4 +++
5 files changed, 85 insertions(+), 27 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index aa96ead..90d07ab 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2345,8 +2345,18 @@ repeat:
if (!mddev->persistent) {
clear_bit(MD_CHANGE_CLEAN, &mddev->flags);
clear_bit(MD_CHANGE_DEVS, &mddev->flags);
- if (!mddev->external)
+ if (!mddev->external) {
clear_bit(MD_CHANGE_PENDING, &mddev->flags);
+ list_for_each_entry(rdev, &mddev->disks, same_set) {
+ if (rdev->badblocks.changed) {
+ md_ack_all_badblocks(&rdev->badblocks);
+ md_error(mddev, rdev);
+ }
+ clear_bit(Blocked, &rdev->flags);
+ clear_bit(BlockedBadBlocks, &rdev->flags);
+ wake_up(&rdev->blocked_wait);
+ }
+ }
wake_up(&mddev->sb_wait);
return;
}
@@ -2403,9 +2413,12 @@ repeat:
mddev->events --;
}

- list_for_each_entry(rdev, &mddev->disks, same_set)
+ list_for_each_entry(rdev, &mddev->disks, same_set) {
if (rdev->badblocks.changed)
any_badblocks_changed++;
+ if (test_bit(Faulty, &rdev->flags))
+ set_bit(FaultRecorded, &rdev->flags);
+ }

sync_sbs(mddev, nospares);
spin_unlock_irq(&mddev->write_lock);
@@ -2462,9 +2475,15 @@ repeat:
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
sysfs_notify(&mddev->kobj, NULL, "sync_completed");

- if (any_badblocks_changed)
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ list_for_each_entry(rdev, &mddev->disks, same_set) {
+ if (test_and_clear_bit(FaultRecorded, &rdev->flags))
+ clear_bit(Blocked, &rdev->flags);
+
+ if (any_badblocks_changed)
md_ack_all_badblocks(&rdev->badblocks);
+ clear_bit(BlockedBadBlocks, &rdev->flags);
+ wake_up(&rdev->blocked_wait);
+ }
}

/* words written to sysfs files may, or may not, be \n terminated.
@@ -2499,7 +2518,8 @@ state_show(mdk_rdev_t *rdev, char *page)
char *sep = "";
size_t len = 0;

- if (test_bit(Faulty, &rdev->flags)) {
+ if (test_bit(Faulty, &rdev->flags) ||
+ rdev->badblocks.unacked_exist) {
len+= sprintf(page+len, "%sfaulty",sep);
sep = ",";
}
@@ -2511,7 +2531,8 @@ state_show(mdk_rdev_t *rdev, char *page)
len += sprintf(page+len, "%swrite_mostly",sep);
sep = ",";
}
- if (test_bit(Blocked, &rdev->flags)) {
+ if (test_bit(Blocked, &rdev->flags) ||
+ rdev->badblocks.unacked_exist) {
len += sprintf(page+len, "%sblocked", sep);
sep = ",";
}
@@ -2531,12 +2552,12 @@ static ssize_t
state_store(mdk_rdev_t *rdev, const char *buf, size_t len)
{
/* can write
- * faulty - simulates and error
+ * faulty - simulates an error
* remove - disconnects the device
* writemostly - sets write_mostly
* -writemostly - clears write_mostly
- * blocked - sets the Blocked flag
- * -blocked - clears the Blocked flag
+ * blocked - sets the Blocked flags
+ * -blocked - clears the Blocked and possibly simulates an error
* insync - sets Insync providing device isn't active
* write_error - sets WriteErrorSeen
* -write_error - clears WriteErrorSeen
@@ -2566,7 +2587,15 @@ state_store(mdk_rdev_t *rdev, const char *buf, size_t len)
set_bit(Blocked, &rdev->flags);
err = 0;
} else if (cmd_match(buf, "-blocked")) {
+ if (!test_bit(Faulty, &rdev->flags) &&
+ test_bit(BlockedBadBlocks, &rdev->flags)) {
+ /* metadata handler doesn't understand badblocks,
+ * so we need to fail the device
+ */
+ md_error(rdev->mddev, rdev);
+ }
clear_bit(Blocked, &rdev->flags);
+ clear_bit(BlockedBadBlocks, &rdev->flags);
wake_up(&rdev->blocked_wait);
set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
md_wakeup_thread(rdev->mddev->thread);
@@ -2885,7 +2914,11 @@ static ssize_t bb_show(mdk_rdev_t *rdev, char *page)
}
static ssize_t bb_store(mdk_rdev_t *rdev, const char *page, size_t len)
{
- return badblocks_store(&rdev->badblocks, page, len, 0);
+ int rv = badblocks_store(&rdev->badblocks, page, len, 0);
+ /* Maybe that ack was all we needed */
+ if (test_and_clear_bit(BlockedBadBlocks, &rdev->flags))
+ wake_up(&rdev->blocked_wait);
+ return rv;
}
static struct rdev_sysfs_entry rdev_bad_blocks =
__ATTR(bad_blocks, S_IRUGO|S_IWUSR, bb_show, bb_store);
@@ -6401,18 +6434,7 @@ void md_error(mddev_t *mddev, mdk_rdev_t *rdev)
if (!rdev || test_bit(Faulty, &rdev->flags))
return;

- if (mddev->external)
- set_bit(Blocked, &rdev->flags);
-/*
- dprintk("md_error dev:%s, rdev:(%d:%d), (caller: %p,%p,%p,%p).\n",
- mdname(mddev),
- MAJOR(rdev->bdev->bd_dev), MINOR(rdev->bdev->bd_dev),
- __builtin_return_address(0),__builtin_return_address(1),
- __builtin_return_address(2),__builtin_return_address(3));
-*/
- if (!mddev->pers)
- return;
- if (!mddev->pers->error_handler)
+ if (!mddev->pers || !mddev->pers->error_handler)
return;
mddev->pers->error_handler(mddev,rdev);
if (mddev->degraded)
@@ -7289,8 +7311,7 @@ static int remove_and_add_spares(mddev_t *mddev)
list_for_each_entry(rdev, &mddev->disks, same_set) {
if (rdev->raid_disk >= 0 &&
!test_bit(In_sync, &rdev->flags) &&
- !test_bit(Faulty, &rdev->flags) &&
- !test_bit(Blocked, &rdev->flags))
+ !test_bit(Faulty, &rdev->flags))
spares++;
if (rdev->raid_disk < 0
&& !test_bit(Faulty, &rdev->flags)) {
@@ -7534,7 +7555,8 @@ void md_wait_for_blocked_rdev(mdk_rdev_t *rdev, mddev_t *mddev)
{
sysfs_notify_dirent_safe(rdev->sysfs_state);
wait_event_timeout(rdev->blocked_wait,
- !test_bit(Blocked, &rdev->flags),
+ !test_bit(Blocked, &rdev->flags) &&
+ !test_bit(BlockedBadBlocks, &rdev->flags),
msecs_to_jiffies(5000));
rdev_dec_pending(rdev, mddev);
}
@@ -7800,6 +7822,8 @@ static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors,
}

bb->changed = 1;
+ if (!acknowledged)
+ bb->unacked_exist = 1;
rcu_assign_pointer(bb->active_page, bb->page);
spin_unlock_irq(&bb->lock);

@@ -7964,6 +7988,7 @@ void md_ack_all_badblocks(struct badblocks *bb)
p[i] = BB_MAKE(start, len, 1);
}
}
+ bb->unacked_exist = 0;
}
rcu_assign_pointer(bb->active_page, bb->page);
spin_unlock_irq(&bb->lock);
@@ -8017,6 +8042,8 @@ badblocks_show(struct badblocks *bb, char *page, int unack)
(unsigned long long)s << bb->shift,
length << bb->shift);
}
+ if (unack && len == 0)
+ bb->unacked_exist = 0;

if (havelock)
spin_unlock_irq(&bb->lock);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 723c0cd..cd6144f 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -81,12 +81,29 @@ struct mdk_rdev_s
#define In_sync 2 /* device is in_sync with rest of array */
#define WriteMostly 4 /* Avoid reading if at all possible */
#define AutoDetected 7 /* added by auto-detect */
-#define Blocked 8 /* An error occurred on an externally
- * managed array, don't allow writes
+#define Blocked 8 /* An error occurred but has not yet
+ * been acknowledged by the metadata
+ * handler, so don't allow writes
* until it is cleared */
#define WriteErrorSeen 9 /* A write error has been seen on this
* device
*/
+#define FaultRecorded 10 /* Intermediate state for clearing
+ * Blocked. The Fault is/will-be
+ * recorded in the metadata, but that
+ * metadata hasn't been stored safely
+ * on disk yet.
+ */
+#define BlockedBadBlocks 11 /* A writer is blocked because they
+ * found an unacknowledged bad-block.
+ * This can safely be cleared at any
+ * time, and the writer will re-check.
+ * It may be set at any time, and at
+ * worst the writer will timeout and
+ * re-check. So setting it as
+ * accurately as possible is good, but
+ * not absolutely critical.
+ */
wait_queue_head_t blocked_wait;

int desc_nr; /* descriptor index in the superblock */
@@ -124,6 +141,10 @@ struct mdk_rdev_s

struct badblocks {
int count; /* count of bad blocks */
+ int unacked_exist; /* there probably are unacknowledged
+ * bad blocks. This is only cleared
+ * when a read discovers none
+ */
int shift; /* shift from sectors to block size
* a -ve shift means badblocks are
* disabled.*/
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 8c31c39..4d40d9d 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1059,6 +1059,7 @@ static void error(mddev_t *mddev, mdk_rdev_t *rdev)
conf->recovery_disabled = mddev->recovery_disabled;
return;
}
+ set_bit(Blocked, &rdev->flags);
if (test_and_clear_bit(In_sync, &rdev->flags)) {
unsigned long flags;
spin_lock_irqsave(&conf->device_lock, flags);
@@ -1751,6 +1752,8 @@ read_more:
generic_make_request(r1_bio->bios[r1_bio->read_disk]);
}
cond_resched();
+ if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
+ md_check_recovery(mddev);
}
blk_finish_plug(&plug);
}
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 8aadd2f..fe6692e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1021,6 +1021,7 @@ static void error(mddev_t *mddev, mdk_rdev_t *rdev)
*/
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
}
+ set_bit(Blocked, &rdev->flags);
set_bit(Faulty, &rdev->flags);
set_bit(MD_CHANGE_DEVS, &mddev->flags);
printk(KERN_ALERT
@@ -1703,6 +1704,8 @@ static void raid10d(mddev_t *mddev)
}
}
cond_resched();
+ if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
+ md_check_recovery(mddev);
}
blk_finish_plug(&plug);
}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index db5d703..7a3fb72 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1706,6 +1706,7 @@ static void error(mddev_t *mddev, mdk_rdev_t *rdev)
*/
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
}
+ set_bit(Blocked, &rdev->flags);
set_bit(Faulty, &rdev->flags);
set_bit(MD_CHANGE_DEVS, &mddev->flags);
printk(KERN_ALERT
@@ -4143,6 +4144,9 @@ static void raid5d(mddev_t *mddev)
release_stripe(sh);
cond_resched();

+ if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
+ md_check_recovery(mddev);
+
spin_lock_irq(&conf->device_lock);
}
pr_debug("%d stripes handled\n", handled);



[md PATCH 06/36] md/raid1: avoid reading from known bad blocks.

on 21.07.2011 04:58:48 by NeilBrown

Now that we have a bad block list, we should not read from those
blocks.
There are several main parts to this:
1/ read_balance needs to check for bad blocks, and return not only
the chosen device, but also how many good blocks are available
there.
2/ fix_read_error needs to avoid trying to read from bad blocks.
3/ read submission must be ready to issue multiple reads to
different devices as different bad blocks on different devices
could mean that a single large read cannot be served by any one
device, but can still be served by the array.
This requires keeping count of the number of outstanding requests
per bio. This count is stored in 'bi_phys_segments'.
4/ retrying a read needs to also be ready to submit a smaller read
and queue another request for the rest.

This does not yet handle bad blocks when reading to perform resync,
recovery, or check.

'md_trim_bio' will also be used for RAID10, so put it in md.c and
export it.
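
As a usage sketch (offset and size are both in sectors): trimming a
clone of a request down to 8 sectors starting 16 sectors in looks
like

	struct bio *read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);

	/* keep sectors [16, 24) of the original request */
	md_trim_bio(read_bio, 16, 8);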

Signed-off-by: NeilBrown
---

drivers/md/md.c | 49 ++++++++++++
drivers/md/md.h | 1
drivers/md/raid1.c | 208 +++++++++++++++++++++++++++++++++++++++++++++-------
drivers/md/raid1.h | 4 +
4 files changed, 233 insertions(+), 29 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 340e2d4..430bc8b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -215,6 +215,55 @@ struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
}
EXPORT_SYMBOL_GPL(bio_clone_mddev);

+void md_trim_bio(struct bio *bio, int offset, int size)
+{
+ /* 'bio' is a cloned bio which we need to trim to match
+ * the given offset and size.
+ * This requires adjusting bi_sector, bi_size, and bi_io_vec
+ */
+ int i;
+ struct bio_vec *bvec;
+ int sofar = 0;
+
+ size <<= 9;
+ if (offset == 0 && size == bio->bi_size)
+ return;
+
+ bio->bi_sector += offset;
+ bio->bi_size = size;
+ offset <<= 9;
+ clear_bit(BIO_SEG_VALID, &bio->bi_flags);
+
+ while (bio->bi_idx < bio->bi_vcnt &&
+ bio->bi_io_vec[bio->bi_idx].bv_len <= offset) {
+ /* remove this whole bio_vec */
+ offset -= bio->bi_io_vec[bio->bi_idx].bv_len;
+ bio->bi_idx++;
+ }
+ if (bio->bi_idx < bio->bi_vcnt) {
+ bio->bi_io_vec[bio->bi_idx].bv_offset += offset;
+ bio->bi_io_vec[bio->bi_idx].bv_len -= offset;
+ }
+ /* avoid any complications with bi_idx being non-zero*/
+ if (bio->bi_idx) {
+ memmove(bio->bi_io_vec, bio->bi_io_vec+bio->bi_idx,
+ (bio->bi_vcnt - bio->bi_idx) * sizeof(struct bio_vec));
+ bio->bi_vcnt -= bio->bi_idx;
+ bio->bi_idx = 0;
+ }
+ /* Make sure vcnt and last bv are not too big */
+ bio_for_each_segment(bvec, bio, i) {
+ if (sofar + bvec->bv_len > size)
+ bvec->bv_len = size - sofar;
+ if (bvec->bv_len == 0) {
+ bio->bi_vcnt = i;
+ break;
+ }
+ sofar += bvec->bv_len;
+ }
+}
+EXPORT_SYMBOL_GPL(md_trim_bio);
+
/*
* We have a system wide 'event count' that is incremented
* on any 'interesting' event, and readers of /proc/mdstat
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 834e46b..eb11449 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -576,4 +576,5 @@ extern struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
mddev_t *mddev);
extern int mddev_check_plugged(mddev_t *mddev);
+extern void md_trim_bio(struct bio *bio, int offset, int size);
#endif /* _MD_MD_H */
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 8db311d..cc3939d 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -41,11 +41,7 @@
#include "bitmap.h"

#define DEBUG 0
-#if DEBUG
-#define PRINTK(x...) printk(x)
-#else
-#define PRINTK(x...)
-#endif
+#define PRINTK(x...) do { if (DEBUG) printk(x); } while (0)

/*
* Number of guaranteed r1bios in case of extreme VM load:
@@ -177,12 +173,6 @@ static void free_r1bio(r1bio_t *r1_bio)
{
conf_t *conf = r1_bio->mddev->private;

- /*
- * Wake up any possible resync thread that waits for the device
- * to go idle.
- */
- allow_barrier(conf);
-
put_all_bios(conf, r1_bio);
mempool_free(r1_bio, conf->r1bio_pool);
}
@@ -223,6 +213,33 @@ static void reschedule_retry(r1bio_t *r1_bio)
* operation and are ready to return a success/failure code to the buffer
* cache layer.
*/
+static void call_bio_endio(r1bio_t *r1_bio)
+{
+ struct bio *bio = r1_bio->master_bio;
+ int done;
+ conf_t *conf = r1_bio->mddev->private;
+
+ if (bio->bi_phys_segments) {
+ unsigned long flags;
+ spin_lock_irqsave(&conf->device_lock, flags);
+ bio->bi_phys_segments--;
+ done = (bio->bi_phys_segments == 0);
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+ } else
+ done = 1;
+
+ if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
+ clear_bit(BIO_UPTODATE, &bio->bi_flags);
+ if (done) {
+ bio_endio(bio, 0);
+ /*
+ * Wake up any possible resync thread that waits for the device
+ * to go idle.
+ */
+ allow_barrier(conf);
+ }
+}
+
static void raid_end_bio_io(r1bio_t *r1_bio)
{
struct bio *bio = r1_bio->master_bio;
@@ -235,8 +252,7 @@ static void raid_end_bio_io(r1bio_t *r1_bio)
(unsigned long long) bio->bi_sector +
(bio->bi_size >> 9) - 1);

- bio_endio(bio,
- test_bit(R1BIO_Uptodate, &r1_bio->state) ? 0 : -EIO);
+ call_bio_endio(r1_bio);
}
free_r1bio(r1_bio);
}
@@ -295,6 +311,7 @@ static void raid1_end_read_request(struct bio *bio, int error)
bdevname(conf->mirrors[mirror].rdev->bdev,
b),
(unsigned long long)r1_bio->sector);
+ set_bit(R1BIO_ReadError, &r1_bio->state);
reschedule_retry(r1_bio);
}

@@ -381,7 +398,7 @@ static void raid1_end_write_request(struct bio *bio, int error)
(unsigned long long) mbio->bi_sector,
(unsigned long long) mbio->bi_sector +
(mbio->bi_size >> 9) - 1);
- bio_endio(mbio, 0);
+ call_bio_endio(r1_bio);
}
}
}
@@ -412,10 +429,11 @@ static void raid1_end_write_request(struct bio *bio, int error)
*
* The rdev for the device selected will have nr_pending incremented.
*/
-static int read_balance(conf_t *conf, r1bio_t *r1_bio)
+static int read_balance(conf_t *conf, r1bio_t *r1_bio, int *max_sectors)
{
const sector_t this_sector = r1_bio->sector;
- const int sectors = r1_bio->sectors;
+ int sectors;
+ int best_good_sectors;
int start_disk;
int best_disk;
int i;
@@ -430,8 +448,11 @@ static int read_balance(conf_t *conf, r1bio_t *r1_bio)
* We take the first readable disk when above the resync window.
*/
retry:
+ sectors = r1_bio->sectors;
best_disk = -1;
best_dist = MaxSector;
+ best_good_sectors = 0;
+
if (conf->mddev->recovery_cp < MaxSector &&
(this_sector + sectors >= conf->next_resync)) {
choose_first = 1;
@@ -443,6 +464,9 @@ static int read_balance(conf_t *conf, r1bio_t *r1_bio)

for (i = 0 ; i < conf->raid_disks ; i++) {
sector_t dist;
+ sector_t first_bad;
+ int bad_sectors;
+
int disk = start_disk + i;
if (disk >= conf->raid_disks)
disk -= conf->raid_disks;
@@ -465,6 +489,35 @@ static int read_balance(conf_t *conf, r1bio_t *r1_bio)
/* This is a reasonable device to use. It might
* even be best.
*/
+ if (is_badblock(rdev, this_sector, sectors,
+ &first_bad, &bad_sectors)) {
+ if (best_dist < MaxSector)
+ /* already have a better device */
+ continue;
+ if (first_bad <= this_sector) {
+ /* cannot read here. If this is the 'primary'
+ * device, then we must not read beyond
+ * bad_sectors from another device..
+ */
+ bad_sectors -= (this_sector - first_bad);
+ if (choose_first && sectors > bad_sectors)
+ sectors = bad_sectors;
+ if (best_good_sectors > sectors)
+ best_good_sectors = sectors;
+
+ } else {
+ sector_t good_sectors = first_bad - this_sector;
+ if (good_sectors > best_good_sectors) {
+ best_good_sectors = good_sectors;
+ best_disk = disk;
+ }
+ if (choose_first)
+ break;
+ }
+ continue;
+ } else
+ best_good_sectors = sectors;
+
dist = abs(this_sector - conf->mirrors[disk].head_position);
if (choose_first
/* Don't change to another disk for sequential reads */
@@ -493,10 +546,12 @@ static int read_balance(conf_t *conf, r1bio_t *r1_bio)
rdev_dec_pending(rdev, conf->mddev);
goto retry;
}
+ sectors = best_good_sectors;
conf->next_seq_sect = this_sector + sectors;
conf->last_used = best_disk;
}
rcu_read_unlock();
+ *max_sectors = sectors;

return best_disk;
}
@@ -763,11 +818,25 @@ static int make_request(mddev_t *mddev, struct bio * bio)
r1_bio->mddev = mddev;
r1_bio->sector = bio->bi_sector;

+ /* We might need to issue multiple reads to different
+ * devices if there are bad blocks around, so we keep
+ * track of the number of reads in bio->bi_phys_segments.
+ * If this is 0, there is only one r1_bio and no locking
+ * will be needed when requests complete. If it is
+ * non-zero, then it is the number of not-completed requests.
+ */
+ bio->bi_phys_segments = 0;
+ clear_bit(BIO_SEG_VALID, &bio->bi_flags);
+
if (rw == READ) {
/*
* read balancing logic:
*/
- int rdisk = read_balance(conf, r1_bio);
+ int max_sectors;
+ int rdisk;
+
+read_again:
+ rdisk = read_balance(conf, r1_bio, &max_sectors);

if (rdisk < 0) {
/* couldn't find anywhere to read from */
@@ -788,6 +857,8 @@ static int make_request(mddev_t *mddev, struct bio * bio)
r1_bio->read_disk = rdisk;

read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ md_trim_bio(read_bio, r1_bio->sector - bio->bi_sector,
+ max_sectors);

r1_bio->bios[rdisk] = read_bio;

@@ -797,7 +868,38 @@ static int make_request(mddev_t *mddev, struct bio * bio)
read_bio->bi_rw = READ | do_sync;
read_bio->bi_private = r1_bio;

- generic_make_request(read_bio);
+ if (max_sectors < r1_bio->sectors) {
+ /* could not read all from this device, so we will
+ * need another r1_bio.
+ */
+ int sectors_handled;
+
+ sectors_handled = (r1_bio->sector + max_sectors
+ - bio->bi_sector);
+ r1_bio->sectors = max_sectors;
+ spin_lock_irq(&conf->device_lock);
+ if (bio->bi_phys_segments == 0)
+ bio->bi_phys_segments = 2;
+ else
+ bio->bi_phys_segments++;
+ spin_unlock_irq(&conf->device_lock);
+ /* Cannot call generic_make_request directly
+ * as that will be queued in __make_request
+ * and subsequent mempool_alloc might block waiting
+ * for it. So hand bio over to raid1d.
+ */
+ reschedule_retry(r1_bio);
+
+ r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+
+ r1_bio->master_bio = bio;
+ r1_bio->sectors = (bio->bi_size >> 9) - sectors_handled;
+ r1_bio->state = 0;
+ r1_bio->mddev = mddev;
+ r1_bio->sector = bio->bi_sector + sectors_handled;
+ goto read_again;
+ } else
+ generic_make_request(read_bio);
return 0;
}

@@ -849,8 +951,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
goto retry_write;
}

- BUG_ON(targets == 0); /* we never fail the last device */
-
if (targets < conf->raid_disks) {
/* array is degraded, we will not clear the bitmap
* on I/O completion (see raid1_end_write_request) */
@@ -1425,7 +1525,7 @@ static void sync_request_write(mddev_t *mddev, r1bio_t *r1_bio)
*
* 1. Retries failed read operations on working mirrors.
* 2. Updates the raid superblock when problems encounter.
- * 3. Performs writes following reads for array syncronising.
+ * 3. Performs writes following reads for array synchronising.
*/

static void fix_read_error(conf_t *conf, int read_disk,
@@ -1448,9 +1548,14 @@ static void fix_read_error(conf_t *conf, int read_disk,
* which is the thread that might remove
* a device. If raid1d ever becomes multi-threaded....
*/
+ sector_t first_bad;
+ int bad_sectors;
+
rdev = conf->mirrors[d].rdev;
if (rdev &&
test_bit(In_sync, &rdev->flags) &&
+ is_badblock(rdev, sect, s,
+ &first_bad, &bad_sectors) == 0 &&
sync_page_io(rdev, sect, s<<9,
conf->tmppage, READ, false))
success = 1;
@@ -1546,9 +1651,11 @@ static void raid1d(mddev_t *mddev)
conf = mddev->private;
if (test_bit(R1BIO_IsSync, &r1_bio->state))
sync_request_write(mddev, r1_bio);
- else {
+ else if (test_bit(R1BIO_ReadError, &r1_bio->state)) {
int disk;
+ int max_sectors;

+ clear_bit(R1BIO_ReadError, &r1_bio->state);
/* we got a read error. Maybe the drive is bad. Maybe just
* the block and we can fix it.
* We freeze all other IO, and try reading the block from
@@ -1568,21 +1675,28 @@ static void raid1d(mddev_t *mddev)
conf->mirrors[r1_bio->read_disk].rdev);

bio = r1_bio->bios[r1_bio->read_disk];
- if ((disk=read_balance(conf, r1_bio)) == -1) {
+ bdevname(bio->bi_bdev, b);
+read_more:
+ disk = read_balance(conf, r1_bio, &max_sectors);
+ if (disk == -1) {
printk(KERN_ALERT "md/raid1:%s: %s: unrecoverable I/O"
" read error for block %llu\n",
- mdname(mddev),
- bdevname(bio->bi_bdev,b),
+ mdname(mddev), b,
(unsigned long long)r1_bio->sector);
raid_end_bio_io(r1_bio);
} else {
const unsigned long do_sync = r1_bio->master_bio->bi_rw & REQ_SYNC;
- r1_bio->bios[r1_bio->read_disk] =
- mddev->ro ? IO_BLOCKED : NULL;
+ if (bio) {
+ r1_bio->bios[r1_bio->read_disk] =
+ mddev->ro ? IO_BLOCKED : NULL;
+ bio_put(bio);
+ }
r1_bio->read_disk = disk;
- bio_put(bio);
bio = bio_clone_mddev(r1_bio->master_bio,
GFP_NOIO, mddev);
+ md_trim_bio(bio,
+ r1_bio->sector - bio->bi_sector,
+ max_sectors);
r1_bio->bios[r1_bio->read_disk] = bio;
rdev = conf->mirrors[disk].rdev;
printk_ratelimited(
@@ -1597,8 +1711,44 @@ static void raid1d(mddev_t *mddev)
bio->bi_end_io = raid1_end_read_request;
bio->bi_rw = READ | do_sync;
bio->bi_private = r1_bio;
- generic_make_request(bio);
+ if (max_sectors < r1_bio->sectors) {
+ /* Drat - have to split this up more */
+ struct bio *mbio = r1_bio->master_bio;
+ int sectors_handled =
+ r1_bio->sector + max_sectors
+ - mbio->bi_sector;
+ r1_bio->sectors = max_sectors;
+ spin_lock_irq(&conf->device_lock);
+ if (mbio->bi_phys_segments == 0)
+ mbio->bi_phys_segments = 2;
+ else
+ mbio->bi_phys_segments++;
+ spin_unlock_irq(&conf->device_lock);
+ generic_make_request(bio);
+ bio = NULL;
+
+ r1_bio = mempool_alloc(conf->r1bio_pool,
+ GFP_NOIO);
+
+ r1_bio->master_bio = mbio;
+ r1_bio->sectors = (mbio->bi_size >> 9)
+ - sectors_handled;
+ r1_bio->state = 0;
+ set_bit(R1BIO_ReadError,
+ &r1_bio->state);
+ r1_bio->mddev = mddev;
+ r1_bio->sector = mbio->bi_sector
+ + sectors_handled;
+
+ goto read_more;
+ } else
+ generic_make_request(bio);
}
+ } else {
+ /* just a partial read to be scheduled from separate
+ * context
+ */
+ generic_make_request(r1_bio->bios[r1_bio->read_disk]);
}
cond_resched();
}
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index 3cd18cf..aa6af37 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -123,6 +123,10 @@ struct r1bio_s {
#define R1BIO_IsSync 1
#define R1BIO_Degraded 2
#define R1BIO_BehindIO 3
+/* Set ReadError on bios that experience a readerror so that
+ * raid1d knows what to do with them.
+ */
+#define R1BIO_ReadError 4
/* For write-behind requests, we call bi_end_io when
* the last non-write-behind device completes, providing
* any write was successful. Otherwise we call when



[md PATCH 08/36] md: add "write_error" flag to component devices.

on 21.07.2011 04:58:48 by NeilBrown

If a device has ever seen a write error, we will want to handle
known-bad-blocks differently.
So create an appropriate state flag and export it via sysfs.
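
A small user-space sketch to exercise the flag; the dev-sda directory
name is the usual rdev sysfs convention and may differ on your
system:

	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/sys/block/md0/md/dev-sda/state", "w");

		if (!f)
			return 1;
		/* writing "-write_error" instead clears the flag */
		fputs("write_error", f);
		fclose(f);
		return 0;
	}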

Signed-off-by: NeilBrown
---

drivers/md/md.c | 12 ++++++++++++
drivers/md/md.h | 3 +++
2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 430bc8b..aa96ead 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2520,6 +2520,10 @@ state_show(mdk_rdev_t *rdev, char *page)
len += sprintf(page+len, "%sspare", sep);
sep = ",";
}
+ if (test_bit(WriteErrorSeen, &rdev->flags)) {
+ len += sprintf(page+len, "%swrite_error", sep);
+ sep = ",";
+ }
return len+sprintf(page+len, "\n");
}

@@ -2534,6 +2538,8 @@ state_store(mdk_rdev_t *rdev, const char *buf, size_t len)
* blocked - sets the Blocked flag
* -blocked - clears the Blocked flag
* insync - sets Insync providing device isn't active
+ * write_error - sets WriteErrorSeen
+ * -write_error - clears WriteErrorSeen
*/
int err = -EINVAL;
if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
@@ -2569,6 +2575,12 @@ state_store(mdk_rdev_t *rdev, const char *buf, size_t len)
} else if (cmd_match(buf, "insync") && rdev->raid_disk == -1) {
set_bit(In_sync, &rdev->flags);
err = 0;
+ } else if (cmd_match(buf, "write_error")) {
+ set_bit(WriteErrorSeen, &rdev->flags);
+ err = 0;
+ } else if (cmd_match(buf, "-write_error")) {
+ clear_bit(WriteErrorSeen, &rdev->flags);
+ err = 0;
}
if (!err)
sysfs_notify_dirent_safe(rdev->sysfs_state);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index eb11449..723c0cd 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -84,6 +84,9 @@ struct mdk_rdev_s
#define Blocked 8 /* An error occurred on an externally
* managed array, don't allow writes
* until it is cleared */
+#define WriteErrorSeen 9 /* A write error has been seen on this
+ * device
+ */
wait_queue_head_t blocked_wait;

int desc_nr; /* descriptor index in the superblock */



[md PATCH 07/36] md/raid1: avoid reading known bad blocks during resync

on 21.07.2011 04:58:48 by NeilBrown

When performing resync/etc, keep the size of the request
small enough that it doesn't overlap any known bad blocks.
Devices with badblocks at the start of the request are completely
excluded.
If there is nowhere to read from due to bad blocks, record
a bad block on each target device.

Now that we never read from known-bad-blocks we can allow devices with
known-bad-blocks into a RAID1.
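
A standalone toy illustration of the window arithmetic used in
sync_request() below (the numbers are made up): device A has bad
sectors [1040, 1050), device B is clean, and we resync from sector
1000:

	#include <stdio.h>

	int main(void)
	{
		long sector_nr = 1000, good_sectors = 2048, min_bad = 0;
		long first_bad = 1040, bad_sectors = 10;	/* device A */

		if (first_bad > sector_nr) {
			/* bad range starts inside the window: shrink it */
			good_sectors = first_bad - sector_nr;
		} else {
			/* window starts inside the bad range */
			bad_sectors -= (sector_nr - first_bad);
			if (min_bad == 0 || min_bad > bad_sectors)
				min_bad = bad_sectors;
		}
		/* device B reports no bad blocks, so it stays a read
		 * target and good_sectors remains 40 */
		printf("resync %ld sectors from %ld\n",
		       good_sectors, sector_nr);
		return 0;
	}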

Signed-off-by: NeilBrown
---

drivers/md/raid1.c | 97 ++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 75 insertions(+), 22 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index cc3939d..8c31c39 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1155,9 +1155,6 @@ static int raid1_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
if (mddev->recovery_disabled == conf->recovery_disabled)
return -EBUSY;

- if (rdev->badblocks.count)
- return -EINVAL;
-
if (rdev->raid_disk >= 0)
first = last = rdev->raid_disk;

@@ -1303,6 +1300,9 @@ static int fix_sync_read_error(r1bio_t *r1_bio)
* We don't need to freeze the array, because being in an
* active sync request, there is no normal IO, and
* no overlapping syncs.
+ * We don't need to check is_badblock() again as we
+ * made sure that anything with a bad block in range
+ * will have bi_end_io clear.
*/
mddev_t *mddev = r1_bio->mddev;
conf_t *conf = mddev->private;
@@ -1792,6 +1792,8 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, i
int write_targets = 0, read_targets = 0;
sector_t sync_blocks;
int still_degraded = 0;
+ int good_sectors = RESYNC_SECTORS;
+ int min_bad = 0; /* number of sectors that are bad in all devices */

if (!conf->r1buf_pool)
if (init_resync(conf))
@@ -1879,36 +1881,89 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, i

rdev = rcu_dereference(conf->mirrors[i].rdev);
if (rdev == NULL ||
- test_bit(Faulty, &rdev->flags)) {
+ test_bit(Faulty, &rdev->flags)) {
still_degraded = 1;
- continue;
} else if (!test_bit(In_sync, &rdev->flags)) {
bio->bi_rw = WRITE;
bio->bi_end_io = end_sync_write;
write_targets ++;
} else {
/* may need to read from here */
- bio->bi_rw = READ;
- bio->bi_end_io = end_sync_read;
- if (test_bit(WriteMostly, &rdev->flags)) {
- if (wonly < 0)
- wonly = i;
- } else {
- if (disk < 0)
- disk = i;
+ sector_t first_bad = MaxSector;
+ int bad_sectors;
+
+ if (is_badblock(rdev, sector_nr, good_sectors,
+ &first_bad, &bad_sectors)) {
+ if (first_bad > sector_nr)
+ good_sectors = first_bad - sector_nr;
+ else {
+ bad_sectors -= (sector_nr - first_bad);
+ if (min_bad == 0 ||
+ min_bad > bad_sectors)
+ min_bad = bad_sectors;
+ }
+ }
+ if (sector_nr < first_bad) {
+ if (test_bit(WriteMostly, &rdev->flags)) {
+ if (wonly < 0)
+ wonly = i;
+ } else {
+ if (disk < 0)
+ disk = i;
+ }
+ bio->bi_rw = READ;
+ bio->bi_end_io = end_sync_read;
+ read_targets++;
}
- read_targets++;
}
- atomic_inc(&rdev->nr_pending);
- bio->bi_sector = sector_nr + rdev->data_offset;
- bio->bi_bdev = rdev->bdev;
- bio->bi_private = r1_bio;
+ if (bio->bi_end_io) {
+ atomic_inc(&rdev->nr_pending);
+ bio->bi_sector = sector_nr + rdev->data_offset;
+ bio->bi_bdev = rdev->bdev;
+ bio->bi_private = r1_bio;
+ }
}
rcu_read_unlock();
if (disk < 0)
disk = wonly;
r1_bio->read_disk = disk;

+ if (read_targets == 0 && min_bad > 0) {
+ /* These sectors are bad on all InSync devices, so we
+ * need to mark them bad on all write targets
+ */
+ int ok = 1;
+ for (i = 0 ; i < conf->raid_disks ; i++)
+ if (r1_bio->bios[i]->bi_end_io == end_sync_write) {
+ mdk_rdev_t *rdev =
+ rcu_dereference(conf->mirrors[i].rdev);
+ ok = rdev_set_badblocks(rdev, sector_nr,
+ min_bad, 0
+ ) && ok;
+ }
+ set_bit(MD_CHANGE_DEVS, &mddev->flags);
+ *skipped = 1;
+ put_buf(r1_bio);
+
+ if (!ok) {
+ /* Cannot record the badblocks, so need to
+ * abort the resync.
+ * If there are multiple read targets, could just
+ * fail the really bad ones ???
+ */
+ conf->recovery_disabled = mddev->recovery_disabled;
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ return 0;
+ } else
+ return min_bad;
+
+ }
+ if (min_bad > 0 && min_bad < good_sectors) {
+ /* only resync enough to reach the next bad->good
+ * transition */
+ good_sectors = min_bad;
+ }
+
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) && read_targets > 0)
/* extra read targets are also write targets */
write_targets += read_targets-1;
@@ -1925,6 +1980,8 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, i

if (max_sector > mddev->resync_max)
max_sector = mddev->resync_max; /* Don't do IO beyond here */
+ if (max_sector > sector_nr + good_sectors)
+ max_sector = sector_nr + good_sectors;
nr_sectors = 0;
sync_blocks = 0;
do {
@@ -2147,10 +2204,6 @@ static int run(mddev_t *mddev)
blk_queue_segment_boundary(mddev->queue,
PAGE_CACHE_SIZE - 1);
}
- if (rdev->badblocks.count) {
- printk(KERN_ERR "md/raid1: Cannot handle bad blocks yet\n");
- return -EINVAL;
- }
}

mddev->degraded = 0;



[md PATCH 13/36] md/raid1: Handle write errors by updating badblock log.

on 21.07.2011 04:58:48 by NeilBrown

When we get a write error (in the data area, not in metadata),
update the badblock log rather than failing the whole device.

As the write may well cover many blocks, we try writing each
block individually and only log the ones which fail.
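
The alignment step in narrow_write_error() is the only subtle part;
a standalone toy run (8-sector badblock granularity, i.e.
badblocks.shift == 3, write starting at sector 1003 for 20 sectors):

	#include <stdio.h>

	int main(void)
	{
		long sector = 1003, sect_to_write = 20;
		long block_sectors = 8;	/* 1 << badblocks.shift */
		/* first chunk ends at the next block boundary (1008) */
		long sectors = ((sector + block_sectors)
				& ~(block_sectors - 1)) - sector;

		while (sect_to_write) {
			if (sectors > sect_to_write)
				sectors = sect_to_write;
			/* prints [1003,1008) [1008,1016) [1016,1023) */
			printf("write [%ld, %ld)\n",
			       sector, sector + sectors);
			sect_to_write -= sectors;
			sector += sectors;
			sectors = block_sectors;
		}
		return 0;
	}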

Signed-off-by: NeilBrown
---

drivers/md/raid1.c | 168 +++++++++++++++++++++++++++++++++++++++++++++-------
drivers/md/raid1.h | 3 +
2 files changed, 147 insertions(+), 24 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 42e815d..0ebeec9 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -318,25 +318,34 @@ static void raid1_end_read_request(struct bio *bio, int error)
rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
}

+static void close_write(r1bio_t *r1_bio)
+{
+ /* it really is the end of this request */
+ if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
+ /* free extra copy of the data pages */
+ int i = r1_bio->behind_page_count;
+ while (i--)
+ safe_put_page(r1_bio->behind_bvecs[i].bv_page);
+ kfree(r1_bio->behind_bvecs);
+ r1_bio->behind_bvecs = NULL;
+ }
+ /* clear the bitmap if all writes complete successfully */
+ bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
+ r1_bio->sectors,
+ !test_bit(R1BIO_Degraded, &r1_bio->state),
+ test_bit(R1BIO_BehindIO, &r1_bio->state));
+ md_write_end(r1_bio->mddev);
+}
+
static void r1_bio_write_done(r1bio_t *r1_bio)
{
- if (atomic_dec_and_test(&r1_bio->remaining))
- {
- /* it really is the end of this request */
- if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
- /* free extra copy of the data pages */
- int i = r1_bio->behind_page_count;
- while (i--)
- safe_put_page(r1_bio->behind_bvecs[i].bv_page);
- kfree(r1_bio->behind_bvecs);
- r1_bio->behind_bvecs = NULL;
- }
- /* clear the bitmap if all writes complete successfully */
- bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
- r1_bio->sectors,
- !test_bit(R1BIO_Degraded, &r1_bio->state),
- test_bit(R1BIO_BehindIO, &r1_bio->state));
- md_write_end(r1_bio->mddev);
+ if (!atomic_dec_and_test(&r1_bio->remaining))
+ return;
+
+ if (test_bit(R1BIO_WriteError, &r1_bio->state))
+ reschedule_retry(r1_bio);
+ else {
+ close_write(r1_bio);
if (test_bit(R1BIO_MadeGood, &r1_bio->state))
reschedule_retry(r1_bio);
else
@@ -360,12 +369,10 @@ static void raid1_end_write_request(struct bio *bio, int error)
/*
* 'one mirror IO has finished' event handler:
*/
- r1_bio->bios[mirror] = NULL;
- to_put = bio;
if (!uptodate) {
- md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
- /* an I/O failed, we can't clear the bitmap */
- set_bit(R1BIO_Degraded, &r1_bio->state);
+ set_bit(WriteErrorSeen,
+ &conf->mirrors[mirror].rdev->flags);
+ set_bit(R1BIO_WriteError, &r1_bio->state);
} else {
/*
* Set R1BIO_Uptodate in our master bio, so that we
@@ -380,6 +387,8 @@ static void raid1_end_write_request(struct bio *bio, int error)
sector_t first_bad;
int bad_sectors;

+ r1_bio->bios[mirror] = NULL;
+ to_put = bio;
set_bit(R1BIO_Uptodate, &r1_bio->state);

/* Maybe we can clear some bad blocks. */
@@ -1725,6 +1734,101 @@ static void fix_read_error(conf_t *conf, int read_disk,
}
}

+static void bi_complete(struct bio *bio, int error)
+{
+ complete((struct completion *)bio->bi_private);
+}
+
+static int submit_bio_wait(int rw, struct bio *bio)
+{
+ struct completion event;
+ rw |= REQ_SYNC;
+
+ init_completion(&event);
+ bio->bi_private = &event;
+ bio->bi_end_io = bi_complete;
+ submit_bio(rw, bio);
+ wait_for_completion(&event);
+
+ return test_bit(BIO_UPTODATE, &bio->bi_flags);
+}
+
+static int narrow_write_error(r1bio_t *r1_bio, int i)
+{
+ mddev_t *mddev = r1_bio->mddev;
+ conf_t *conf = mddev->private;
+ mdk_rdev_t *rdev = conf->mirrors[i].rdev;
+ int vcnt, idx;
+ struct bio_vec *vec;
+
+ /* bio has the data to be written to device 'i' where
+ * we just recently had a write error.
+ * We repeatedly clone the bio and trim down to one block,
+ * then try the write. Where the write fails we record
+ * a bad block.
+ * It is conceivable that the bio doesn't exactly align with
+ * blocks. We must handle this somehow.
+ *
+ * We currently own a reference on the rdev.
+ */
+
+ int block_sectors;
+ sector_t sector;
+ int sectors;
+ int sect_to_write = r1_bio->sectors;
+ int ok = 1;
+
+ if (rdev->badblocks.shift < 0)
+ return 0;
+
+ block_sectors = 1 << rdev->badblocks.shift;
+ sector = r1_bio->sector;
+ sectors = ((sector + block_sectors)
+ & ~(sector_t)(block_sectors - 1))
+ - sector;
+
+ if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
+ vcnt = r1_bio->behind_page_count;
+ vec = r1_bio->behind_bvecs;
+ idx = 0;
+ while (vec[idx].bv_page == NULL)
+ idx++;
+ } else {
+ vcnt = r1_bio->master_bio->bi_vcnt;
+ vec = r1_bio->master_bio->bi_io_vec;
+ idx = r1_bio->master_bio->bi_idx;
+ }
+ while (sect_to_write) {
+ struct bio *wbio;
+ if (sectors > sect_to_write)
+ sectors = sect_to_write;
+ /* Write at 'sector' for 'sectors'*/
+
+ wbio = bio_alloc_mddev(GFP_NOIO, vcnt, mddev);
+ memcpy(wbio->bi_io_vec, vec, vcnt * sizeof(struct bio_vec));
+ wbio->bi_sector = r1_bio->sector;
+ wbio->bi_rw = WRITE;
+ wbio->bi_vcnt = vcnt;
+ wbio->bi_size = r1_bio->sectors << 9;
+ wbio->bi_idx = idx;
+
+ md_trim_bio(wbio, sector - r1_bio->sector, sectors);
+ wbio->bi_sector += rdev->data_offset;
+ wbio->bi_bdev = rdev->bdev;
+ if (submit_bio_wait(WRITE, wbio) == 0)
+ /* failure! */
+ ok = rdev_set_badblocks(rdev, sector,
+ sectors, 0)
+ && ok;
+
+ bio_put(wbio);
+ sect_to_write -= sectors;
+ sector += sectors;
+ sectors = block_sectors;
+ }
+ return ok;
+}
+
static void raid1d(mddev_t *mddev)
{
r1bio_t *r1_bio;
@@ -1776,7 +1880,8 @@ static void raid1d(mddev_t *mddev)
md_done_sync(mddev, s, 1);
} else
sync_request_write(mddev, r1_bio);
- } else if (test_bit(R1BIO_MadeGood, &r1_bio->state)) {
+ } else if (test_bit(R1BIO_MadeGood, &r1_bio->state) ||
+ test_bit(R1BIO_WriteError, &r1_bio->state)) {
int m;
for (m = 0; m < conf->raid_disks ; m++)
if (r1_bio->bios[m] == IO_MADE_GOOD) {
@@ -1786,7 +1891,24 @@ static void raid1d(mddev_t *mddev)
r1_bio->sector,
r1_bio->sectors);
rdev_dec_pending(rdev, mddev);
+ } else if (r1_bio->bios[m] != NULL) {
+ /* This drive got a write error. We
+ * need to narrow down and record
+ * precise write errors.
+ */
+ if (!narrow_write_error(r1_bio, m)) {
+ md_error(mddev,
+ conf->mirrors[m].rdev);
+ /* an I/O failed, we can't clear
+ * the bitmap */
+ set_bit(R1BIO_Degraded,
+ &r1_bio->state);
+ }
+ rdev_dec_pending(conf->mirrors[m].rdev,
+ mddev);
}
+ if (test_bit(R1BIO_WriteError, &r1_bio->state))
+ close_write(r1_bio);
raid_end_bio_io(r1_bio);
} else if (test_bit(R1BIO_ReadError, &r1_bio->state)) {
int disk;
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index b788547..e0d676b 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -145,7 +145,8 @@ struct r1bio_s {
/* If a write for this request means we can clear some
* known-bad-block records, we set this flag
*/
-#define R1BIO_MadeGood 7
+#define R1BIO_MadeGood 7
+#define R1BIO_WriteError 8

extern int md_raid1_congested(mddev_t *mddev, int bits);




[md PATCH 05/36] md: Disable bad blocks and v0.90 metadata.

on 21.07.2011 04:58:48 by NeilBrown

v0.90 metadata cannot record bad blocks, so when loading metadata
for such a device, set shift to -1.

Signed-off-by: NeilBrown
---

drivers/md/md.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 18c3aab..340e2d4 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1058,6 +1058,7 @@ static int super_90_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version
rdev->preferred_minor = sb->md_minor;
rdev->data_offset = 0;
rdev->sb_size = MD_SB_BYTES;
+ rdev->badblocks.shift = -1;

if (sb->level == LEVEL_MULTIPATH)
rdev->desc_nr = -1;
@@ -3009,6 +3010,9 @@ static mdk_rdev_t *md_import_device(dev_t newdev, int super_format, int super_mi
goto abort_free;
}
}
+ if (super_format == -1)
+ /* hot-add for 0.90, or non-persistent: so no badblocks */
+ rdev->badblocks.shift = -1;

return rdev;



--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 24/36] md/raid10: avoid reading from known bad blocks - part 1

on 21.07.2011 04:58:49 by NeilBrown

This patch just covers the basic read path:
1/ read_balance needs to check for badblocks, and return not only
the chosen slot, but also how many good blocks are available
there.
2/ read submission must be ready to issue multiple reads to
different devices as different bad blocks on different devices
could mean that a single large read cannot be served by any one
device, but can still be served by the array.
This requires keeping count of the number of outstanding requests
per bio. This count is stored in 'bi_phys_segments'.

On a read error we currently just fail the request if another target
cannot handle the whole request. The next patch refines that a bit.
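
The bi_phys_segments accounting is the same scheme raid1 already
uses; condensed (and with conf->device_lock elided), it is

	/* when splitting a read: */
	if (bio->bi_phys_segments == 0)
		bio->bi_phys_segments = 2;  /* piece just issued + rest */
	else
		bio->bi_phys_segments++;

	/* when a piece completes: */
	if (bio->bi_phys_segments == 0 ||
	    --bio->bi_phys_segments == 0)
		bio_endio(bio, 0);	/* last piece ends the master bio */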

Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 141 +++++++++++++++++++++++++++++++++++++++++++++------
drivers/md/raid10.h | 4 +
2 files changed, 129 insertions(+), 16 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index f1b749c..872bf94 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -191,12 +191,6 @@ static void free_r10bio(r10bio_t *r10_bio)
{
conf_t *conf = r10_bio->mddev->private;

- /*
- * Wake up any possible resync thread that waits for the device
- * to go idle.
- */
- allow_barrier(conf);
-
put_all_bios(conf, r10_bio);
mempool_free(r10_bio, conf->r10bio_pool);
}
@@ -235,9 +229,27 @@ static void reschedule_retry(r10bio_t *r10_bio)
static void raid_end_bio_io(r10bio_t *r10_bio)
{
struct bio *bio = r10_bio->master_bio;
+ int done;
+ conf_t *conf = r10_bio->mddev->private;

- bio_endio(bio,
- test_bit(R10BIO_Uptodate, &r10_bio->state) ? 0 : -EIO);
+ if (bio->bi_phys_segments) {
+ unsigned long flags;
+ spin_lock_irqsave(&conf->device_lock, flags);
+ bio->bi_phys_segments--;
+ done = (bio->bi_phys_segments == 0);
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+ } else
+ done = 1;
+ if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
+ clear_bit(BIO_UPTODATE, &bio->bi_flags);
+ if (done) {
+ bio_endio(bio, 0);
+ /*
+ * Wake up any possible resync thread that waits for the device
+ * to go idle.
+ */
+ allow_barrier(conf);
+ }
free_r10bio(r10_bio);
}

@@ -307,6 +319,7 @@ static void raid10_end_read_request(struct bio *bio, int error)
mdname(conf->mddev),
bdevname(conf->mirrors[dev].rdev->bdev, b),
(unsigned long long)r10_bio->sector);
+ set_bit(R10BIO_ReadError, &r10_bio->state);
reschedule_retry(r10_bio);
}
}
@@ -505,11 +518,12 @@ static int raid10_mergeable_bvec(struct request_queue *q,
* FIXME: possibly should rethink readbalancing and do it differently
* depending on near_copies / far_copies geometry.
*/
-static int read_balance(conf_t *conf, r10bio_t *r10_bio)
+static int read_balance(conf_t *conf, r10bio_t *r10_bio, int *max_sectors)
{
const sector_t this_sector = r10_bio->sector;
int disk, slot;
- const int sectors = r10_bio->sectors;
+ int sectors = r10_bio->sectors;
+ int best_good_sectors;
sector_t new_distance, best_dist;
mdk_rdev_t *rdev;
int do_balance;
@@ -518,8 +532,10 @@ static int read_balance(conf_t *conf, r10bio_t *r10_bio)
raid10_find_phys(conf, r10_bio);
rcu_read_lock();
retry:
+ sectors = r10_bio->sectors;
best_slot = -1;
best_dist = MaxSector;
+ best_good_sectors = 0;
do_balance = 1;
/*
* Check if we can balance. We can balance on the whole
@@ -532,6 +548,10 @@ retry:
do_balance = 0;

for (slot = 0; slot < conf->copies ; slot++) {
+ sector_t first_bad;
+ int bad_sectors;
+ sector_t dev_sector;
+
if (r10_bio->devs[slot].bio == IO_BLOCKED)
continue;
disk = r10_bio->devs[slot].devnum;
@@ -541,6 +561,37 @@ retry:
if (!test_bit(In_sync, &rdev->flags))
continue;

+ dev_sector = r10_bio->devs[slot].addr;
+ if (is_badblock(rdev, dev_sector, sectors,
+ &first_bad, &bad_sectors)) {
+ if (best_dist < MaxSector)
+ /* Already have a better slot */
+ continue;
+ if (first_bad <= dev_sector) {
+ /* Cannot read here. If this is the
+ * 'primary' device, then we must not read
+ * beyond 'bad_sectors' from another device.
+ */
+ bad_sectors -= (dev_sector - first_bad);
+ if (!do_balance && sectors > bad_sectors)
+ sectors = bad_sectors;
+ if (best_good_sectors > sectors)
+ best_good_sectors = sectors;
+ } else {
+ sector_t good_sectors =
+ first_bad - dev_sector;
+ if (good_sectors > best_good_sectors) {
+ best_good_sectors = good_sectors;
+ best_slot = slot;
+ }
+ if (!do_balance)
+ /* Must read from here */
+ break;
+ }
+ continue;
+ } else
+ best_good_sectors = sectors;
+
if (!do_balance)
break;

@@ -582,6 +633,7 @@ retry:
} else
disk = -1;
rcu_read_unlock();
+ *max_sectors = best_good_sectors;

return disk;
}
@@ -829,12 +881,27 @@ static int make_request(mddev_t *mddev, struct bio * bio)
r10_bio->sector = bio->bi_sector;
r10_bio->state = 0;

+ /* We might need to issue multiple reads to different
+ * devices if there are bad blocks around, so we keep
+ * track of the number of reads in bio->bi_phys_segments.
+ * If this is 0, there is only one r10_bio and no locking
+ * will be needed when the request completes. If it is
+ * non-zero, then it is the number of not-completed requests.
+ */
+ bio->bi_phys_segments = 0;
+ clear_bit(BIO_SEG_VALID, &bio->bi_flags);
+
if (rw == READ) {
/*
* read balancing logic:
*/
- int disk = read_balance(conf, r10_bio);
- int slot = r10_bio->read_slot;
+ int max_sectors;
+ int disk;
+ int slot;
+
+read_again:
+ disk = read_balance(conf, r10_bio, &max_sectors);
+ slot = r10_bio->read_slot;
if (disk < 0) {
raid_end_bio_io(r10_bio);
return 0;
@@ -842,6 +909,8 @@ static int make_request(mddev_t *mddev, struct bio * bio)
mirror = conf->mirrors + disk;

read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ md_trim_bio(read_bio, r10_bio->sector - bio->bi_sector,
+ max_sectors);

r10_bio->devs[slot].bio = read_bio;

@@ -852,7 +921,39 @@ static int make_request(mddev_t *mddev, struct bio * bio)
read_bio->bi_rw = READ | do_sync;
read_bio->bi_private = r10_bio;

- generic_make_request(read_bio);
+ if (max_sectors < r10_bio->sectors) {
+ /* Could not read all from this device, so we will
+ * need another r10_bio.
+ */
+ int sectors_handled;
+
+ sectors_handled = (r10_bio->sectors + max_sectors
+ - bio->bi_sector);
+ r10_bio->sectors = max_sectors;
+ spin_lock_irq(&conf->device_lock);
+ if (bio->bi_phys_segments == 0)
+ bio->bi_phys_segments = 2;
+ else
+ bio->bi_phys_segments++;
+ spin_unlock(&conf->device_lock);
+ /* Cannot call generic_make_request directly
+ * as that will be queued in __generic_make_request
+ * and subsequent mempool_alloc might block
+ * waiting for it. so hand bio over to raid10d.
+ */
+ reschedule_retry(r10_bio);
+
+ r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
+
+ r10_bio->master_bio = bio;
+ r10_bio->sectors = ((bio->bi_size >> 9)
+ - sectors_handled);
+ r10_bio->state = 0;
+ r10_bio->mddev = mddev;
+ r10_bio->sector = bio->bi_sector + sectors_handled;
+ goto read_again;
+ } else
+ generic_make_request(read_bio);
return 0;
}

@@ -1627,6 +1728,7 @@ static void handle_read_error(mddev_t *mddev, r10bio_t *r10_bio)
mdk_rdev_t *rdev;
char b[BDEVNAME_SIZE];
unsigned long do_sync;
+ int max_sectors;

/* we got a read error. Maybe the drive is bad. Maybe just
* the block and we can fix it.
@@ -1646,8 +1748,8 @@ static void handle_read_error(mddev_t *mddev, r10bio_t *r10_bio)
bio = r10_bio->devs[slot].bio;
r10_bio->devs[slot].bio =
mddev->ro ? IO_BLOCKED : NULL;
- mirror = read_balance(conf, r10_bio);
- if (mirror == -1) {
+ mirror = read_balance(conf, r10_bio, &max_sectors);
+ if (mirror == -1 || max_sectors < r10_bio->sectors) {
printk(KERN_ALERT "md/raid10:%s: %s: unrecoverable I/O"
" read error for block %llu\n",
mdname(mddev),
@@ -1712,8 +1814,15 @@ static void raid10d(mddev_t *mddev)
sync_request_write(mddev, r10_bio);
else if (test_bit(R10BIO_IsRecover, &r10_bio->state))
recovery_request_write(mddev, r10_bio);
- else
+ else if (test_bit(R10BIO_ReadError, &r10_bio->state))
handle_read_error(mddev, r10_bio);
+ else {
+ /* just a partial read to be scheduled from a
+ * separate context
+ */
+ int slot = r10_bio->read_slot;
+ generic_make_request(r10_bio->devs[slot].bio);
+ }

cond_resched();
if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
md_check_recovery(mddev);

diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index a485914..c646152 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -124,4 +124,8 @@ struct r10bio_s {
#define R10BIO_IsSync 1
#define R10BIO_IsRecover 2
#define R10BIO_Degraded 3
+/* Set ReadError on bios that experience a read error
+ * so that raid10d knows what to do with them.
+ */
+#define R10BIO_ReadError 4
#endif



[md PATCH 16/36] md/raid1: factor several functions out of raid1d()

on 21.07.2011 04:58:49 by NeilBrown

raid1d is too big with several deep branches.
So separate them out into their own functions.

Signed-off-by: NeilBrown
---

drivers/md/raid1.c | 318 ++++++++++++++++++++++++++--------------------------
1 files changed, 159 insertions(+), 159 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 08ff21a..d7518dc 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1862,21 +1862,168 @@ static int narrow_write_error(r1bio_t *r1_bio, int i)
return ok;
}

+static void handle_sync_write_finished(conf_t *conf, r1bio_t *r1_bio)
+{
+ int m;
+ int s = r1_bio->sectors;
+ for (m = 0; m < conf->raid_disks ; m++) {
+ mdk_rdev_t *rdev = conf->mirrors[m].rdev;
+ struct bio *bio = r1_bio->bios[m];
+ if (bio->bi_end_io == NULL)
+ continue;
+ if (test_bit(BIO_UPTODATE, &bio->bi_flags) &&
+ test_bit(R1BIO_MadeGood, &r1_bio->state)) {
+ rdev_clear_badblocks(rdev,
+ r1_bio->sector,
+ r1_bio->sectors);
+ }
+ if (!test_bit(BIO_UPTODATE, &bio->bi_flags) &&
+ test_bit(R1BIO_WriteError, &r1_bio->state)) {
+ if (!rdev_set_badblocks(rdev,
+ r1_bio->sector,
+ r1_bio->sectors, 0))
+ md_error(conf->mddev, rdev);
+ }
+ }
+ put_buf(r1_bio);
+ md_done_sync(conf->mddev, s, 1);
+}
+
+static void handle_write_finished(conf_t *conf, r1bio_t *r1_bio)
+{
+ int m;
+ for (m = 0; m < conf->raid_disks ; m++)
+ if (r1_bio->bios[m] == IO_MADE_GOOD) {
+ mdk_rdev_t *rdev = conf->mirrors[m].rdev;
+ rdev_clear_badblocks(rdev,
+ r1_bio->sector,
+ r1_bio->sectors);
+ rdev_dec_pending(rdev, conf->mddev);
+ } else if (r1_bio->bios[m] != NULL) {
+ /* This drive got a write error. We need to
+ * narrow down and record precise write
+ * errors.
+ */
+ if (!narrow_write_error(r1_bio, m)) {
+ md_error(conf->mddev,
+ conf->mirrors[m].rdev);
+ /* an I/O failed, we can't clear the bitmap */
+ set_bit(R1BIO_Degraded, &r1_bio->state);
+ }
+ rdev_dec_pending(conf->mirrors[m].rdev,
+ conf->mddev);
+ }
+ if (test_bit(R1BIO_WriteError, &r1_bio->state))
+ close_write(r1_bio);
+ raid_end_bio_io(r1_bio);
+}
+
+static void handle_read_error(conf_t *conf, r1bio_t *r1_bio)
+{
+ int disk;
+ int max_sectors;
+ mddev_t *mddev = conf->mddev;
+ struct bio *bio;
+ char b[BDEVNAME_SIZE];
+ mdk_rdev_t *rdev;
+
+ clear_bit(R1BIO_ReadError, &r1_bio->state);
+ /* we got a read error. Maybe the drive is bad. Maybe just
+ * the block and we can fix it.
+ * We freeze all other IO, and try reading the block from
+ * other devices. When we find one, we re-write
+ * and check it that fixes the read error.
+ * This is all done synchronously while the array is
+ * frozen
+ */
+ if (mddev->ro == 0) {
+ freeze_array(conf);
+ fix_read_error(conf, r1_bio->read_disk,
+ r1_bio->sector,
+ r1_bio->sectors);
+ unfreeze_array(conf);
+ } else
+ md_error(mddev,
+ conf->mirrors[r1_bio->read_disk].rdev);
+
+ bio = r1_bio->bios[r1_bio->read_disk];
+ bdevname(bio->bi_bdev, b);
+read_more:
+ disk = read_balance(conf, r1_bio, &max_sectors);
+ if (disk == -1) {
+ printk(KERN_ALERT "md/raid1:%s: %s: unrecoverable I/O"
+ " read error for block %llu\n",
+ mdname(mddev), b,
+ (unsigned long long)r1_bio->sector);
+ raid_end_bio_io(r1_bio);
+ } else {
+ const unsigned long do_sync
+ = r1_bio->master_bio->bi_rw & REQ_SYNC;
+ if (bio) {
+ r1_bio->bios[r1_bio->read_disk] =
+ mddev->ro ? IO_BLOCKED : NULL;
+ bio_put(bio);
+ }
+ r1_bio->read_disk = disk;
+ bio = bio_clone_mddev(r1_bio->master_bio, GFP_NOIO, mddev);
+ md_trim_bio(bio,
+ r1_bio->sector - bio->bi_sector,
+ max_sectors);
+ r1_bio->bios[r1_bio->read_disk] = bio;
+ rdev = conf->mirrors[disk].rdev;
+ printk_ratelimited(KERN_ERR
+ "md/raid1:%s: redirecting sector %llu"
+ " to other mirror: %s\n",
+ mdname(mddev),
+ (unsigned long long)r1_bio->sector,
+ bdevname(rdev->bdev, b));
+ bio->bi_sector = r1_bio->sector + rdev->data_offset;
+ bio->bi_bdev = rdev->bdev;
+ bio->bi_end_io = raid1_end_read_request;
+ bio->bi_rw = READ | do_sync;
+ bio->bi_private = r1_bio;
+ if (max_sectors < r1_bio->sectors) {
+ /* Drat - have to split this up more */
+ struct bio *mbio = r1_bio->master_bio;
+ int sectors_handled = (r1_bio->sector + max_sectors
+ - mbio->bi_sector);
+ r1_bio->sectors = max_sectors;
+ spin_lock_irq(&conf->device_lock);
+ if (mbio->bi_phys_segments == 0)
+ mbio->bi_phys_segments = 2;
+ else
+ mbio->bi_phys_segments++;
+ spin_unlock_irq(&conf->device_lock);
+ generic_make_request(bio);
+ bio = NULL;
+
+ r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+
+ r1_bio->master_bio = mbio;
+ r1_bio->sectors = (mbio->bi_size >> 9) - sectors_handled;
+ r1_bio->state = 0;
+ set_bit(R1BIO_ReadError, &r1_bio->state);
+ r1_bio->mddev = mddev;
+ r1_bio->sector = mbio->bi_sector + sectors_handled;
+
+ goto read_more;
+ } else
+ generic_make_request(bio);
+ }
+}
+
static void raid1d(mddev_t *mddev)
{
r1bio_t *r1_bio;
- struct bio *bio;
unsigned long flags;
conf_t *conf = mddev->private;
struct list_head *head = &conf->retry_list;
- mdk_rdev_t *rdev;
struct blk_plug plug;

md_check_recovery(mddev);

blk_start_plug(&plug);
for (;;) {
- char b[BDEVNAME_SIZE];

if (atomic_read(&mddev->plug_cnt) == 0)
flush_pending_writes(conf);
@@ -1895,168 +2042,21 @@ static void raid1d(mddev_t *mddev)
conf = mddev->private;
if (test_bit(R1BIO_IsSync, &r1_bio->state)) {
if (test_bit(R1BIO_MadeGood, &r1_bio->state) ||
- test_bit(R1BIO_WriteError, &r1_bio->state)) {
- int m;
- int s = r1_bio->sectors;
- for (m = 0; m < conf->raid_disks ; m++) {
- mdk_rdev_t *rdev
- = conf->mirrors[m].rdev;
- struct bio *bio = r1_bio->bios[m];
- if (bio->bi_end_io == NULL)
- continue;
- if (test_bit(BIO_UPTODATE,
- &bio->bi_flags) &&
- test_bit(R1BIO_MadeGood,
- &r1_bio->state)) {
- rdev_clear_badblocks(
- rdev,
- r1_bio->sector,
- r1_bio->sectors);
- }
- if (!test_bit(BIO_UPTODATE,
- &bio->bi_flags) &&
- test_bit(R1BIO_WriteError,
- &r1_bio->state)) {
- if (!rdev_set_badblocks(
- rdev,
- r1_bio->sector,
- r1_bio->sectors, 0))
- md_error(mddev, rdev);
- }
- }
- put_buf(r1_bio);
- md_done_sync(mddev, s, 1);
- } else
+ test_bit(R1BIO_WriteError, &r1_bio->state))
+ handle_sync_write_finished(conf, r1_bio);
+ else
sync_request_write(mddev, r1_bio);
} else if (test_bit(R1BIO_MadeGood, &r1_bio->state) ||
- test_bit(R1BIO_WriteError, &r1_bio->state)) {
- int m;
- for (m = 0; m < conf->raid_disks ; m++)
- if (r1_bio->bios[m] == IO_MADE_GOOD) {
- rdev = conf->mirrors[m].rdev;
- rdev_clear_badblocks(
- rdev,
- r1_bio->sector,
- r1_bio->sectors);
- rdev_dec_pending(rdev, mddev);
- } else if (r1_bio->bios[m] != NULL) {
- /* This drive got a write error. We
- * need to narrow down and record
- * precise write errors.
- */
- if (!narrow_write_error(r1_bio, m)) {
- md_error(mddev,
- conf->mirrors[m].rdev);
- /* an I/O failed, we can't clear
- * the bitmap */
- set_bit(R1BIO_Degraded,
- &r1_bio->state);
- }
- rdev_dec_pending(conf->mirrors[m].rdev,
- mddev);
- }
- if (test_bit(R1BIO_WriteError, &r1_bio->state))
- close_write(r1_bio);
- raid_end_bio_io(r1_bio);
- } else if (test_bit(R1BIO_ReadError, &r1_bio->state)) {
- int disk;
- int max_sectors;
-
- clear_bit(R1BIO_ReadError, &r1_bio->state);
- /* we got a read error. Maybe the drive is bad. Maybe just
- * the block and we can fix it.
- * We freeze all other IO, and try reading the block from
- * other devices. When we find one, we re-write
- * and check it that fixes the read error.
- * This is all done synchronously while the array is
- * frozen
- */
- if (mddev->ro == 0) {
- freeze_array(conf);
- fix_read_error(conf, r1_bio->read_disk,
- r1_bio->sector,
- r1_bio->sectors);
- unfreeze_array(conf);
- } else
- md_error(mddev,
- conf->mirrors[r1_bio->read_disk].rdev);
-
- bio = r1_bio->bios[r1_bio->read_disk];
- bdevname(bio->bi_bdev, b);
-read_more:
- disk = read_balance(conf, r1_bio, &max_sectors);
- if (disk == -1) {
- printk(KERN_ALERT "md/raid1:%s: %s: unrecoverable I/O"
- " read error for block %llu\n",
- mdname(mddev), b,
- (unsigned long long)r1_bio->sector);
- raid_end_bio_io(r1_bio);
- } else {
- const unsigned long do_sync = r1_bio->master_bio->bi_rw & REQ_SYNC;
- if (bio) {
- r1_bio->bios[r1_bio->read_disk] =
- mddev->ro ? IO_BLOCKED : NULL;
- bio_put(bio);
- }
- r1_bio->read_disk = disk;
- bio = bio_clone_mddev(r1_bio->master_bio,
- GFP_NOIO, mddev);
- md_trim_bio(bio,
- r1_bio->sector - bio->bi_sector,
- max_sectors);
- r1_bio->bios[r1_bio->read_disk] = bio;
- rdev = conf->mirrors[disk].rdev;
- printk_ratelimited(
- KERN_ERR
- "md/raid1:%s: redirecting sector %llu"
- " to other mirror: %s\n",
- mdname(mddev),
- (unsigned long long)r1_bio->sector,
- bdevname(rdev->bdev, b));
- bio->bi_sector = r1_bio->sector + rdev->data_offset;
- bio->bi_bdev = rdev->bdev;
- bio->bi_end_io = raid1_end_read_request;
- bio->bi_rw = READ | do_sync;
- bio->bi_private = r1_bio;
- if (max_sectors < r1_bio->sectors) {
- /* Drat - have to split this up more */
- struct bio *mbio = r1_bio->master_bio;
- int sectors_handled =
- r1_bio->sector + max_sectors
- - mbio->bi_sector;
- r1_bio->sectors = max_sectors;
- spin_lock_irq(&conf->device_lock);
- if (mbio->bi_phys_segments == 0)
- mbio->bi_phys_segments = 2;
- else
- mbio->bi_phys_segments++;
- spin_unlock_irq(&conf->device_lock);
- generic_make_request(bio);
- bio = NULL;
-
- r1_bio = mempool_alloc(conf->r1bio_pool,
- GFP_NOIO);
-
- r1_bio->master_bio = mbio;
- r1_bio->sectors = (mbio->bi_size >> 9)
- - sectors_handled;
- r1_bio->state = 0;
- set_bit(R1BIO_ReadError,
- &r1_bio->state);
- r1_bio->mddev = mddev;
- r1_bio->sector = mbio->bi_sector
- + sectors_handled;
-
- goto read_more;
- } else
- generic_make_request(bio);
- }
- } else {
+ test_bit(R1BIO_WriteError, &r1_bio->state))
+ handle_write_finished(conf, r1_bio);
+ else if (test_bit(R1BIO_ReadError, &r1_bio->state))
+ handle_read_error(conf, r1_bio);
+ else
/* just a partial read to be scheduled from separate
* context
*/
generic_make_request(r1_bio->bios[r1_bio->read_disk]);
- }
+
cond_resched();
if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
md_check_recovery(mddev);



[md PATCH 23/36] md/raid10: Split handle_read_error out from raid10d.

on 21.07.2011 04:58:49 by NeilBrown

raid10d() is too big and is about to get bigger, so split
handle_read_error() out as a separate function.

Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 123 +++++++++++++++++++++++++++------------------------
1 files changed, 66 insertions(+), 57 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index c489b5c..f1b749c 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1618,21 +1618,81 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)
}
}

+static void handle_read_error(mddev_t *mddev, r10bio_t *r10_bio)
+{
+ int slot = r10_bio->read_slot;
+ int mirror = r10_bio->devs[slot].devnum;
+ struct bio *bio;
+ conf_t *conf = mddev->private;
+ mdk_rdev_t *rdev;
+ char b[BDEVNAME_SIZE];
+ unsigned long do_sync;
+
+ /* we got a read error. Maybe the drive is bad. Maybe just
+ * the block and we can fix it.
+ * We freeze all other IO, and try reading the block from
+ * other devices. When we find one, we re-write
+ * and check it that fixes the read error.
+ * This is all done synchronously while the array is
+ * frozen.
+ */
+ if (mddev->ro == 0) {
+ freeze_array(conf);
+ fix_read_error(conf, mddev, r10_bio);
+ unfreeze_array(conf);
+ }
+ rdev_dec_pending(conf->mirrors[mirror].rdev, mddev);
+
+ bio = r10_bio->devs[slot].bio;
+ r10_bio->devs[slot].bio =
+ mddev->ro ? IO_BLOCKED : NULL;
+ mirror = read_balance(conf, r10_bio);
+ if (mirror == -1) {
+ printk(KERN_ALERT "md/raid10:%s: %s: unrecoverable I/O"
+ " read error for block %llu\n",
+ mdname(mddev),
+ bdevname(bio->bi_bdev, b),
+ (unsigned long long)r10_bio->sector);
+ raid_end_bio_io(r10_bio);
+ bio_put(bio);
+ return;
+ }
+
+ do_sync = (r10_bio->master_bio->bi_rw & REQ_SYNC);
+ bio_put(bio);
+ slot = r10_bio->read_slot;
+ rdev = conf->mirrors[mirror].rdev;
+ printk_ratelimited(
+ KERN_ERR
+ "md/raid10:%s: %s: redirecting"
+ "sector %llu to another mirror\n",
+ mdname(mddev),
+ bdevname(rdev->bdev, b),
+ (unsigned long long)r10_bio->sector);
+ bio = bio_clone_mddev(r10_bio->master_bio,
+ GFP_NOIO, mddev);
+ r10_bio->devs[slot].bio = bio;
+ bio->bi_sector = r10_bio->devs[slot].addr
+ + rdev->data_offset;
+ bio->bi_bdev = rdev->bdev;
+ bio->bi_rw = READ | do_sync;
+ bio->bi_private = r10_bio;
+ bio->bi_end_io = raid10_end_read_request;
+ generic_make_request(bio);
+}
+
static void raid10d(mddev_t *mddev)
{
r10bio_t *r10_bio;
- struct bio *bio;
unsigned long flags;
conf_t *conf = mddev->private;
struct list_head *head = &conf->retry_list;
- mdk_rdev_t *rdev;
struct blk_plug plug;

md_check_recovery(mddev);

blk_start_plug(&plug);
for (;;) {
- char b[BDEVNAME_SIZE];

flush_pending_writes(conf);

@@ -1652,60 +1712,9 @@ static void raid10d(mddev_t *mddev)
sync_request_write(mddev, r10_bio);
else if (test_bit(R10BIO_IsRecover, &r10_bio->state))
recovery_request_write(mddev, r10_bio);
- else {
- int slot = r10_bio->read_slot;
- int mirror = r10_bio->devs[slot].devnum;
- /* we got a read error. Maybe the drive is bad. Maybe just
- * the block and we can fix it.
- * We freeze all other IO, and try reading the block from
- * other devices. When we find one, we re-write
- * and check it that fixes the read error.
- * This is all done synchronously while the array is
- * frozen.
- */
- if (mddev->ro == 0) {
- freeze_array(conf);
- fix_read_error(conf, mddev, r10_bio);
- unfreeze_array(conf);
- }
- rdev_dec_pending(conf->mirrors[mirror].rdev, mddev);
-
- bio = r10_bio->devs[slot].bio;
- r10_bio->devs[slot].bio =
- mddev->ro ? IO_BLOCKED : NULL;
- mirror = read_balance(conf, r10_bio);
- if (mirror == -1) {
- printk(KERN_ALERT "md/raid10:%s: %s: unrecoverable I/O"
- " read error for block %llu\n",
- mdname(mddev),
- bdevname(bio->bi_bdev,b),
- (unsigned long long)r10_bio->sector);
- raid_end_bio_io(r10_bio);
- bio_put(bio);
- } else {
- const unsigned long do_sync = (r10_bio->master_bio->bi_rw & REQ_SYNC);
- bio_put(bio);
- slot = r10_bio->read_slot;
- rdev = conf->mirrors[mirror].rdev;
- printk_ratelimited(
- KERN_ERR
- "md/raid10:%s: %s: redirecting"
- "sector %llu to another mirror\n",
- mdname(mddev),
- bdevname(rdev->bdev, b),
- (unsigned long long)r10_bio->sector);
- bio = bio_clone_mddev(r10_bio->master_bio,
- GFP_NOIO, mddev);
- r10_bio->devs[slot].bio = bio;
- bio->bi_sector = r10_bio->devs[slot].addr
- + rdev->data_offset;
- bio->bi_bdev = rdev->bdev;
- bio->bi_rw = READ | do_sync;
- bio->bi_private = r10_bio;
- bio->bi_end_io = raid10_end_read_request;
- generic_make_request(bio);
- }
- }
+ else
+ handle_read_error(mddev, r10_bio);
+
cond_resched();
if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
md_check_recovery(mddev);



[md PATCH 17/36] md/raid5: avoid reading from known bad blocks.

on 21.07.2011 04:58:49 by NeilBrown

There are two times that we might read in raid5:
1/ when a read request fits within a chunk on a single
working device.
In this case, if there is any bad block in the range of
the read, we simply fail the cache-bypass read and
perform the read through the stripe cache.

2/ when reading into the stripe cache. In this case we
mark as failed any device which has a bad block in that
strip (1 page wide).
Note that we will both avoid reading and avoid writing.
This is correct (as we will never read from the block, there
is no point writing), but not optimal (as writing could 'fix'
the error) - that will be addressed later.

If we have not seen any write errors on the device yet, we treat a bad
block like a recent read error. This will encourage an attempt to fix
the read error which will either generate a write error, or will
ensure good data is stored there. We don't yet forget the bad block
in that case. That comes later.

Now that we honour bad blocks when reading we can allow devices with
bad blocks into the array.
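
For illustration, the decision in case 1/ reduces to an overlap test
between the read and the recorded bad ranges. Below is a stand-alone
sketch with a toy list; the real is_badblock() walks the packed
per-rdev log but uses the same tri-state result convention (0 = clean,
1 = overlaps an acknowledged bad block, -1 = overlaps an
unacknowledged one):

struct toy_bb { unsigned long long start; int len; int acked; };

static int toy_is_badblock(const struct toy_bb *bb, int nbb,
			   unsigned long long sector, int sectors,
			   unsigned long long *first_bad, int *bad_len)
{
	int i, ret = 0;

	for (i = 0; i < nbb; i++) {
		if (bb[i].start >= sector + sectors ||
		    bb[i].start + bb[i].len <= sector)
			continue;	/* no overlap with this entry */
		*first_bad = bb[i].start;
		*bad_len = bb[i].len;
		if (!bb[i].acked)
			return -1;	/* not safely on disk yet */
		ret = 1;
	}
	return ret;
}

A cache-bypass read is diverted to the stripe cache whenever such a
test returns non-zero for its range.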

Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 46 ++++++++++++++++++++++++++++++++--------------
1 files changed, 32 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7a3fb72..52bedc6 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2916,6 +2916,9 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
spin_lock_irq(&conf->device_lock);
for (i=disks; i--; ) {
mdk_rdev_t *rdev;
+ sector_t first_bad;
+ int bad_sectors;
+ int is_bad = 0;

dev = &sh->dev[i];

@@ -2952,15 +2955,32 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
if (dev->written)
s->written++;
rdev = rcu_dereference(conf->disks[i].rdev);
- if (s->blocked_rdev == NULL &&
- rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
- s->blocked_rdev = rdev;
- atomic_inc(&rdev->nr_pending);
+ if (rdev) {
+ is_bad = is_badblock(rdev, sh->sector, STRIPE_SECTORS,
+ &first_bad, &bad_sectors);
+ if (s->blocked_rdev == NULL
+ && (test_bit(Blocked, &rdev->flags)
+ || is_bad < 0)) {
+ if (is_bad < 0)
+ set_bit(BlockedBadBlocks,
+ &rdev->flags);
+ s->blocked_rdev = rdev;
+ atomic_inc(&rdev->nr_pending);
+ }
}
clear_bit(R5_Insync, &dev->flags);
if (!rdev)
/* Not in-sync */;
- else if (test_bit(In_sync, &rdev->flags))
+ else if (is_bad) {
+ /* also not in-sync */
+ if (!test_bit(WriteErrorSeen, &rdev->flags)) {
+ /* treat as in-sync, but with a read error
+ * which we can now try to correct
+ */
+ set_bit(R5_Insync, &dev->flags);
+ set_bit(R5_ReadError, &dev->flags);
+ }
+ } else if (test_bit(In_sync, &rdev->flags))
set_bit(R5_Insync, &dev->flags);
else {
/* in sync if before recovery_offset */
@@ -3471,6 +3491,9 @@ static int chunk_aligned_read(mddev_t *mddev, struct bio * raid_bio)
rcu_read_lock();
rdev = rcu_dereference(conf->disks[dd_idx].rdev);
if (rdev && test_bit(In_sync, &rdev->flags)) {
+ sector_t first_bad;
+ int bad_sectors;
+
atomic_inc(&rdev->nr_pending);
rcu_read_unlock();
raid_bio->bi_next = (void*)rdev;
@@ -3478,8 +3501,10 @@ static int chunk_aligned_read(mddev_t *mddev, struct bio * raid_bio)
align_bi->bi_flags &= ~(1 << BIO_SEG_VALID);
align_bi->bi_sector += rdev->data_offset;

- if (!bio_fits_rdev(align_bi)) {
- /* too big in some way */
+ if (!bio_fits_rdev(align_bi) ||
+ is_badblock(rdev, align_bi->bi_sector, align_bi->bi_size>>9,
+ &first_bad, &bad_sectors)) {
+ /* too big in some way, or has a known bad block */
bio_put(align_bi);
rdev_dec_pending(rdev, mddev);
return 0;
@@ -4671,10 +4696,6 @@ static int run(mddev_t *mddev)
* 0 for a fully functional array, 1 or 2 for a degraded array.
*/
list_for_each_entry(rdev, &mddev->disks, same_set) {
- if (rdev->badblocks.count) {
- printk(KERN_ERR "md/raid5: cannot handle bad blocks yet\n");
- goto abort;
- }
if (rdev->raid_disk < 0)
continue;
if (test_bit(In_sync, &rdev->flags)) {
@@ -4983,9 +5004,6 @@ static int raid5_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
int first = 0;
int last = conf->raid_disks - 1;

- if (rdev->badblocks.count)
- return -EINVAL;
-
if (has_failed(conf))
/* no point adding a device */
return -EINVAL;



[md PATCH 21/36] md/raid5: Clear bad blocks on successful write.

on 21.07.2011 04:58:49 by NeilBrown

On a successful write to a known bad block, flag the sh
so that raid5d can remove the known bad block from the list.
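
The work is deferred to raid5d rather than done in the write
completion handler itself; a minimal sketch of that hand-off pattern,
with hypothetical names rather than the md ones:

#include <stdatomic.h>

enum { MADE_GOOD = 1 };			/* analogue of R5_MadeGood */

struct toy_dev { atomic_int flags; };

/* Completion path: may only record the fact. */
static void on_write_done(struct toy_dev *d, int ok, int was_known_bad)
{
	if (ok && was_known_bad)
		atomic_fetch_or(&d->flags, MADE_GOOD);
}

/* Daemon thread: notices the flag and does the list maintenance. */
static void daemon_pass(struct toy_dev *d)
{
	if (atomic_fetch_and(&d->flags, ~MADE_GOOD) & MADE_GOOD) {
		/* rdev_clear_badblocks(...) would run here */
	}
}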

Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 19 ++++++++++++++++++-
drivers/md/raid5.h | 1 +
2 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 33ae4e2..204938c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1675,6 +1675,8 @@ static void raid5_end_write_request(struct bio *bi, int error)
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks, i;
int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
+ sector_t first_bad;
+ int bad_sectors;

for (i=0 ; i<disks; i++)
if (bi == &sh->dev[i].req)
@@ -1691,7 +1693,9 @@ static void raid5_end_write_request(struct bio *bi, int error)
if (!uptodate) {
set_bit(WriteErrorSeen, &conf->disks[i].rdev->flags);
set_bit(R5_WriteError, &sh->dev[i].flags);
- }
+ } else if (is_badblock(conf->disks[i].rdev, sh->sector, STRIPE_SECTORS,
+ &first_bad, &bad_sectors))
+ set_bit(R5_MadeGood, &sh->dev[i].flags);

rdev_dec_pending(conf->disks[i].rdev, conf->mddev);

@@ -3071,6 +3075,13 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
} else
clear_bit(R5_WriteError, &dev->flags);
}
+ if (test_bit(R5_MadeGood, &dev->flags)) {
+ if (!test_bit(Faulty, &rdev->flags)) {
+ s->handle_bad_blocks = 1;
+ atomic_inc(&rdev->nr_pending);
+ } else
+ clear_bit(R5_MadeGood, &dev->flags);
+ }
if (!test_bit(R5_Insync, &dev->flags)) {
/* The ReadError flag will just be confusing now */
clear_bit(R5_ReadError, &dev->flags);
@@ -3340,6 +3351,12 @@ finish:
md_error(conf->mddev, rdev);
rdev_dec_pending(rdev, conf->mddev);
}
+ if (test_and_clear_bit(R5_MadeGood, &dev->flags)) {
+ rdev = conf->disks[i].rdev;
+ rdev_clear_badblocks(rdev, sh->sector,
+ STRIPE_SECTORS);
+ rdev_dec_pending(rdev, conf->mddev);
+ }
}

if (s.ops_request)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index d4729f5..9b96157 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -277,6 +277,7 @@ struct stripe_head_state {
#define R5_Wantdrain 13 /* dev->towrite needs to be drained */
#define R5_WantFUA 14 /* Write should be FUA */
#define R5_WriteError 15 /* got a write error - need to record it */
+#define R5_MadeGood 16 /* A bad block has been fixed by writing to it*/
/*
* Write method
*/



[md PATCH 22/36] md/raid10: simplify/reindent some loops.

on 21.07.2011 04:58:49 by NeilBrown

When a loop ends with a large if, it can be neater to change the
if to invert the condition and just 'continue'.
Then the body of the if can be indented to a lower level.
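
In miniature, with illustrative names, the transformation is:

struct dev { int present; int in_sync; };

static void walk(struct dev *devs, int n)
{
	int i;

	/* Before: all the useful work lives inside one large if. */
	for (i = 0; i < n; i++) {
		if (devs[i].present && devs[i].in_sync) {
			/* ...many lines, one level deeper than needed... */
		}
	}

	/* After: invert the test and continue; the body moves left. */
	for (i = 0; i < n; i++) {
		if (!devs[i].present || !devs[i].in_sync)
			continue;
		/* ...the same lines, one indent level shallower... */
	}
}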

Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 127 ++++++++++++++++++++++++++-------------------------
1 files changed, 65 insertions(+), 62 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index fe6692e..c489b5c 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1533,80 +1533,83 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)
sl--;
d = r10_bio->devs[sl].devnum;
rdev = rcu_dereference(conf->mirrors[d].rdev);
- if (rdev &&
- test_bit(In_sync, &rdev->flags)) {
- atomic_inc(&rdev->nr_pending);
- rcu_read_unlock();
- if (sync_page_io(rdev,
- r10_bio->devs[sl].addr +
- sect,
- s<<9, conf->tmppage, WRITE, false)
- == 0) {
- /* Well, this device is dead */
- printk(KERN_NOTICE
- "md/raid10:%s: read correction "
- "write failed"
- " (%d sectors at %llu on %s)\n",
- mdname(mddev), s,
- (unsigned long long)(
- sect + rdev->data_offset),
- bdevname(rdev->bdev, b));
- printk(KERN_NOTICE "md/raid10:%s: %s: failing "
- "drive\n",
- mdname(mddev),
- bdevname(rdev->bdev, b));
- md_error(mddev, rdev);
- }
- rdev_dec_pending(rdev, mddev);
- rcu_read_lock();
+ if (!rdev ||
+ !test_bit(In_sync, &rdev->flags))
+ continue;
+
+ atomic_inc(&rdev->nr_pending);
+ rcu_read_unlock();
+ if (sync_page_io(rdev,
+ r10_bio->devs[sl].addr +
+ sect,
+ s<<9, conf->tmppage, WRITE, false)
+ == 0) {
+ /* Well, this device is dead */
+ printk(KERN_NOTICE
+ "md/raid10:%s: read correction "
+ "write failed"
+ " (%d sectors at %llu on %s)\n",
+ mdname(mddev), s,
+ (unsigned long long)(
+ sect + rdev->data_offset),
+ bdevname(rdev->bdev, b));
+ printk(KERN_NOTICE "md/raid10:%s: %s: failing "
+ "drive\n",
+ mdname(mddev),
+ bdevname(rdev->bdev, b));
+ md_error(mddev, rdev);
}
+ rdev_dec_pending(rdev, mddev);
+ rcu_read_lock();
}
sl = start;
while (sl != r10_bio->read_slot) {
+ char b[BDEVNAME_SIZE];

if (sl==0)
sl = conf->copies;
sl--;
d = r10_bio->devs[sl].devnum;
rdev = rcu_dereference(conf->mirrors[d].rdev);
- if (rdev &&
- test_bit(In_sync, &rdev->flags)) {
- char b[BDEVNAME_SIZE];
- atomic_inc(&rdev->nr_pending);
- rcu_read_unlock();
- if (sync_page_io(rdev,
- r10_bio->devs[sl].addr +
- sect,
- s<<9, conf->tmppage,
- READ, false) == 0) {
- /* Well, this device is dead */
- printk(KERN_NOTICE
- "md/raid10:%s: unable to read back "
- "corrected sectors"
- " (%d sectors at %llu on %s)\n",
- mdname(mddev), s,
- (unsigned long long)(
- sect + rdev->data_offset),
- bdevname(rdev->bdev, b));
- printk(KERN_NOTICE "md/raid10:%s: %s: failing drive\n",
- mdname(mddev),
- bdevname(rdev->bdev, b));
-
- md_error(mddev, rdev);
- } else {
- printk(KERN_INFO
- "md/raid10:%s: read error corrected"
- " (%d sectors at %llu on %s)\n",
- mdname(mddev), s,
- (unsigned long long)(
- sect + rdev->data_offset),
- bdevname(rdev->bdev, b));
- atomic_add(s, &rdev->corrected_errors);
- }
+ if (!rdev ||
+ !test_bit(In_sync, &rdev->flags))
+ continue;

- rdev_dec_pending(rdev, mddev);
- rcu_read_lock();
+ atomic_inc(&rdev->nr_pending);
+ rcu_read_unlock();
+ if (sync_page_io(rdev,
+ r10_bio->devs[sl].addr +
+ sect,
+ s<<9, conf->tmppage,
+ READ, false) == 0) {
+ /* Well, this device is dead */
+ printk(KERN_NOTICE
+ "md/raid10:%s: unable to read back "
+ "corrected sectors"
+ " (%d sectors at %llu on %s)\n",
+ mdname(mddev), s,
+ (unsigned long long)(
+ sect + rdev->data_offset),
+ bdevname(rdev->bdev, b));
+ printk(KERN_NOTICE "md/raid10:%s: %s: failing "
+ "drive\n",
+ mdname(mddev),
+ bdevname(rdev->bdev, b));
+
+ md_error(mddev, rdev);
+ } else {
+ printk(KERN_INFO
+ "md/raid10:%s: read error corrected"
+ " (%d sectors at %llu on %s)\n",
+ mdname(mddev), s,
+ (unsigned long long)(
+ sect + rdev->data_offset),
+ bdevname(rdev->bdev, b));
+ atomic_add(s, &rdev->corrected_errors);
}
+
+ rdev_dec_pending(rdev, mddev);
+ rcu_read_lock();
}
rcu_read_unlock();




[md PATCH 15/36] md/raid1: improve handling of read failure during recovery.

on 21.07.2011 04:58:49 by NeilBrown

If we cannot read a block from anywhere during recovery, there is
now a better approach than just giving up.
We can record a bad block on each device and keep going - being
careful not to clear the bad block when a later write succeeds, as
that write will contain incorrect data.

We have now reached the state where - for raid1 - we only call
md_error if md_set_badblocks has failed.

Signed-off-by: NeilBrown
---

drivers/md/raid1.c | 41 ++++++++++++++++++++++++++++++++++-------
1 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 83c918e..08ff21a 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1393,7 +1393,12 @@ static void end_sync_write(struct bio *bio, int error)
} else if (is_badblock(conf->mirrors[mirror].rdev,
r1_bio->sector,
r1_bio->sectors,
- &first_bad, &bad_sectors))
+ &first_bad, &bad_sectors) &&
+ !is_badblock(conf->mirrors[r1_bio->read_disk].rdev,
+ r1_bio->sector,
+ r1_bio->sectors,
+ &first_bad, &bad_sectors)
+ )
set_bit(R1BIO_MadeGood, &r1_bio->state);

update_head_pos(mirror, r1_bio);
@@ -1474,16 +1479,36 @@ static int fix_sync_read_error(r1bio_t *r1_bio)

if (!success) {
char b[BDEVNAME_SIZE];
- /* Cannot read from anywhere, array is toast */
- md_error(mddev, conf->mirrors[r1_bio->read_disk].rdev);
+ int abort = 0;
+ /* Cannot read from anywhere, this block is lost.
+ * Record a bad block on each device. If that doesn't
+ * work just disable and interrupt the recovery.
+ * Don't fail devices as that won't really help.
+ */
printk(KERN_ALERT "md/raid1:%s: %s: unrecoverable I/O read error"
" for block %llu\n",
mdname(mddev),
bdevname(bio->bi_bdev, b),
(unsigned long long)r1_bio->sector);
- md_done_sync(mddev, r1_bio->sectors, 0);
- put_buf(r1_bio);
- return 0;
+ for (d = 0; d < conf->raid_disks; d++) {
+ rdev = conf->mirrors[d].rdev;
+ if (!rdev || test_bit(Faulty, &rdev->flags))
+ continue;
+ if (!rdev_set_badblocks(rdev, sect, s, 0))
+ abort = 1;
+ }
+ if (abort) {
+ mddev->recovery_disabled = 1;
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ md_done_sync(mddev, r1_bio->sectors, 0);
+ put_buf(r1_bio);
+ return 0;
+ }
+ /* Try next page */
+ sectors -= s;
+ sect += s;
+ idx++;
+ continue;
}

start = d;
@@ -1880,7 +1905,9 @@ static void raid1d(mddev_t *mddev)
if (bio->bi_end_io == NULL)
continue;
if (test_bit(BIO_UPTODATE,
- &bio->bi_flags)) {
+ &bio->bi_flags) &&
+ test_bit(R1BIO_MadeGood,
+ &r1_bio->state)) {
rdev_clear_badblocks(
rdev,
r1_bio->sector,



[md PATCH 19/36] md/raid5: write errors should be recorded as bad blocks if possible.

on 21.07.2011 04:58:49 by NeilBrown

When a write error is detected, don't mark the device as failed
immediately but rather record the fact for handle_stripe to deal with.

Handle_stripe then attempts to record a bad block. Only if that fails
does the device get marked as faulty.
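
The shape of the policy - try the narrow remedy first, escalate only
when it is impossible - can be sketched on its own (all names here are
hypothetical, not the md API):

struct disk;					/* opaque, illustrative */
int  log_badblock(struct disk *d, unsigned long long sector, int sectors);
void fail_disk(struct disk *d);

static void on_write_error(struct disk *d,
			   unsigned long long sector, int sectors)
{
	/* Prefer losing a few sectors to losing the whole device;
	 * fail the disk only if the log cannot take the entry,
	 * e.g. because the 4K bad-block area is already full.
	 */
	if (!log_badblock(d, sector, sectors))
		fail_disk(d);
}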

Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 33 +++++++++++++++++++++++++++++++--
drivers/md/raid5.h | 18 ++++++++++--------
2 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c34c97d..54e89d1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1658,8 +1658,10 @@ static void raid5_end_write_request(struct bio *bi, int error)
return;
}

- if (!uptodate)
- md_error(conf->mddev, conf->disks[i].rdev);
+ if (!uptodate) {
+ set_bit(WriteErrorSeen, &conf->disks[i].rdev->flags);
+ set_bit(R5_WriteError, &sh->dev[i].flags);
+ }

rdev_dec_pending(conf->disks[i].rdev, conf->mddev);

@@ -3031,6 +3033,14 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
if (sh->sector + STRIPE_SECTORS <= rdev->recovery_offset)
set_bit(R5_Insync, &dev->flags);
}
+ if (test_bit(R5_WriteError, &dev->flags)) {
+ clear_bit(R5_Insync, &dev->flags);
+ if (!test_bit(Faulty, &rdev->flags)) {
+ s->handle_bad_blocks = 1;
+ atomic_inc(&rdev->nr_pending);
+ } else
+ clear_bit(R5_WriteError, &dev->flags);
+ }
if (!test_bit(R5_Insync, &dev->flags)) {
/* The ReadError flag will just be confusing now */
clear_bit(R5_ReadError, &dev->flags);
@@ -3086,6 +3096,11 @@ static void handle_stripe(struct stripe_head *sh)

analyse_stripe(sh, &s);

+ if (s.handle_bad_blocks) {
+ set_bit(STRIPE_HANDLE, &sh->state);
+ goto finish;
+ }
+
if (unlikely(s.blocked_rdev)) {
if (s.syncing || s.expanding || s.expanded ||
s.to_write || s.written) {
@@ -3283,6 +3298,20 @@ finish:
if (unlikely(s.blocked_rdev))
md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);

+ if (s.handle_bad_blocks)
+ for (i = disks; i--; ) {
+ mdk_rdev_t *rdev;
+ struct r5dev *dev = &sh->dev[i];
+ if (test_and_clear_bit(R5_WriteError, &dev->flags)) {
+ /* We own a safe reference to the rdev */
+ rdev = conf->disks[i].rdev;
+ if (!rdev_set_badblocks(rdev, sh->sector,
+ STRIPE_SECTORS, 0))
+ md_error(conf->mddev, rdev);
+ rdev_dec_pending(rdev, conf->mddev);
+ }
+ }
+
if (s.ops_request)
raid_run_ops(sh, s.ops_request);

diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 719fc86..d4729f5 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -252,6 +252,7 @@ struct stripe_head_state {
struct bio *return_bi;
mdk_rdev_t *blocked_rdev;
int dec_preread_active;
+ int handle_bad_blocks;
};

/* Flags */
@@ -267,14 +268,15 @@ struct stripe_head_state {
#define R5_ReWrite 9 /* have tried to over-write the readerror */

#define R5_Expanded 10 /* This block now has post-expand data */
-#define R5_Wantcompute 11 /* compute_block in progress treat as
- * uptodate
- */
-#define R5_Wantfill 12 /* dev->toread contains a bio that needs
- * filling
- */
-#define R5_Wantdrain 13 /* dev->towrite needs to be drained */
-#define R5_WantFUA 14 /* Write should be FUA */
+#define R5_Wantcompute 11 /* compute_block in progress treat as
+ * uptodate
+ */
+#define R5_Wantfill 12 /* dev->toread contains a bio that needs
+ * filling
+ */
+#define R5_Wantdrain 13 /* dev->towrite needs to be drained */
+#define R5_WantFUA 14 /* Write should be FUA */
+#define R5_WriteError 15 /* got a write error - need to record it */
/*
* Write method
*/



[md PATCH 20/36] md/raid5: Don't write to known bad block on doubtful devices.

on 21.07.2011 04:58:49 by NeilBrown

If a device has seen write errors, don't write to any known
bad blocks on that device.

Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 31 ++++++++++++++++++++++++++++++-
1 files changed, 30 insertions(+), 1 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 54e89d1..33ae4e2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -526,6 +526,36 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
atomic_inc(&rdev->nr_pending);
rcu_read_unlock();

+ /* We have already checked bad blocks for reads. Now
+ * need to check for writes.
+ */
+ while ((rw & WRITE) && rdev &&
+ test_bit(WriteErrorSeen, &rdev->flags)) {
+ sector_t first_bad;
+ int bad_sectors;
+ int bad = is_badblock(rdev, sh->sector, STRIPE_SECTORS,
+ &first_bad, &bad_sectors);
+ if (!bad)
+ break;
+
+ if (bad < 0) {
+ set_bit(BlockedBadBlocks, &rdev->flags);
+ if (!conf->mddev->external &&
+ conf->mddev->flags) {
+ /* It is very unlikely, but we might
+ * still need to write out the
+ * bad block log - better give it
+ * a chance*/
+ md_check_recovery(conf->mddev);
+ }
+ md_wait_for_blocked_rdev(rdev, conf->mddev);
+ } else {
+ /* Acknowledged bad block - skip the write */
+ rdev_dec_pending(rdev, conf->mddev);
+ rdev = NULL;
+ }
+ }
+
if (rdev) {
if (s->syncing || s->expanding || s->expanded)
md_sync_acct(rdev->bdev, STRIPE_SECTORS);
@@ -3317,7 +3347,6 @@ finish:

ops_run_io(sh, &s);

-
if (s.dec_preread_active) {
/* We delay this until after ops_run_io so that if make_request
* is waiting on a flush, it won't continue until the writes



[md PATCH 33/36] md/raid10: record bad blocks due to write errors during resync/recovery.

on 21.07.2011 04:58:50 by NeilBrown

If we get a write error during resync/recovery don't fail the device
but instead record a bad block. If that fails we can then fail the
device.

Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 33 +++++++++++++++++++++++----------
1 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index dfea30e..0f120ac 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1452,9 +1452,10 @@ static void end_sync_write(struct bio *bio, int error)

d = find_bio_disk(conf, r10_bio, bio, &slot);

- if (!uptodate)
- md_error(mddev, conf->mirrors[d].rdev);
- else if (is_badblock(conf->mirrors[d].rdev,
+ if (!uptodate) {
+ set_bit(WriteErrorSeen, &conf->mirrors[d].rdev->flags);
+ set_bit(R10BIO_WriteError, &r10_bio->state);
+ } else if (is_badblock(conf->mirrors[d].rdev,
r10_bio->devs[slot].addr,
r10_bio->sectors,
&first_bad, &bad_sectors))
@@ -1465,7 +1466,8 @@ static void end_sync_write(struct bio *bio, int error)
if (r10_bio->master_bio == NULL) {
/* the primary of several recovery bios */
sector_t s = r10_bio->sectors;
- if (test_bit(R10BIO_MadeGood, &r10_bio->state))
+ if (test_bit(R10BIO_MadeGood, &r10_bio->state) ||
+ test_bit(R10BIO_WriteError, &r10_bio->state))
reschedule_retry(r10_bio);
else
put_buf(r10_bio);
@@ -1473,7 +1475,8 @@ static void end_sync_write(struct bio *bio, int error)
break;
} else {
r10bio_t *r10_bio2 = (r10bio_t *)r10_bio->master_bio;
- if (test_bit(R10BIO_MadeGood, &r10_bio->state))
+ if (test_bit(R10BIO_MadeGood, &r10_bio->state) ||
+ test_bit(R10BIO_WriteError, &r10_bio->state))
reschedule_retry(r10_bio);
else
put_buf(r10_bio);
@@ -2029,23 +2032,33 @@ static void handle_write_completed(conf_t *conf, r10bio_t *r10_bio)
/* Some sort of write request has finished and it
* succeeded in writing where we thought there was a
* bad block. So forget the bad block.
+ * Or possibly if failed and we need to record
+ * a bad block.
*/
int m;
mdk_rdev_t *rdev;

if (test_bit(R10BIO_IsSync, &r10_bio->state) ||
test_bit(R10BIO_IsRecover, &r10_bio->state)) {
- for (m = 0; m < conf->copies; m++)
- if (r10_bio->devs[m].bio &&
- test_bit(BIO_UPTODATE,
+ for (m = 0; m < conf->copies; m++) {
+ int dev = r10_bio->devs[m].devnum;
+ rdev = conf->mirrors[dev].rdev;
+ if (r10_bio->devs[m].bio == NULL)
+ continue;
+ if (test_bit(BIO_UPTODATE,
&r10_bio->devs[m].bio->bi_flags)) {
- int dev = r10_bio->devs[m].devnum;
- rdev = conf->mirrors[dev].rdev;
rdev_clear_badblocks(
rdev,
r10_bio->devs[m].addr,
r10_bio->sectors);
+ } else {
+ if (!rdev_set_badblocks(
+ rdev,
+ r10_bio->devs[m].addr,
+ r10_bio->sectors, 0))
+ md_error(conf->mddev, rdev);
}
+ }
put_buf(r10_bio);
} else {
for (m = 0; m < conf->copies; m++) {



[md PATCH 34/36] md/raid10: simplify read error handling during recovery.

on 21.07.2011 04:58:50 by NeilBrown

If a read error is detected during recovery, the code currently
fails the device being read from.
This isn't really necessary: recovery_request_write will signal a
write error to end_sync_write, which will record a write error on
the destination device - either logging a bad block there or kicking
it from the array.

So just remove this call to md_error.

Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 9 ++++-----
1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 0f120ac..f87c8d9 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1418,13 +1418,12 @@ static void end_sync_read(struct bio *bio, int error)

if (test_bit(BIO_UPTODATE, &bio->bi_flags))
set_bit(R10BIO_Uptodate, &r10_bio->state);
- else {
+ else
+ /* The write handler will notice the lack of
+ * R10BIO_Uptodate and record any errors etc
+ */
atomic_add(r10_bio->sectors,
&conf->mirrors[d].rdev->corrected_errors);
- if (!test_bit(MD_RECOVERY_SYNC, &conf->mddev->recovery))
- md_error(r10_bio->mddev,
- conf->mirrors[d].rdev);
- }

/* for reconstruct, we always reschedule after a read.
* for resync, only after all reads



[md PATCH 31/36] md/raid10: Handle write errors by updating badblock log.

on 21.07.2011 04:58:50 by NeilBrown

When we get a write error (in the data area, not in metadata),
update the badblock log rather than failing the whole device.

As the write may well cover many blocks, we try writing each
block individually and only log the ones which fail.
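
The split has one subtle piece of arithmetic: the first chunk must be
shortened so that every later chunk starts on a bad-block-granularity
boundary. A stand-alone demonstration of just that calculation
(block_sectors is a power of two, as in the patch; the numbers are
arbitrary):

#include <stdio.h>

int main(void)
{
	unsigned long long sector = 1003;	/* start of failed write */
	int sect_to_write = 37;			/* its total length */
	int block_sectors = 8;			/* bad-block granularity */

	/* First chunk ends at the next block_sectors boundary. */
	int sectors = (int)(((sector + block_sectors)
			     & ~(unsigned long long)(block_sectors - 1))
			    - sector);

	while (sect_to_write) {
		if (sectors > sect_to_write)
			sectors = sect_to_write;
		printf("write %d sectors at %llu\n", sectors, sector);
		sect_to_write -= sectors;
		sector += sectors;
		sectors = block_sectors;	/* full blocks from here on */
	}
	return 0;
}

This writes 5 sectors at 1003, then 8-sector chunks from 1008 onward,
so any failure maps onto exactly one loggable block.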

Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 133 ++++++++++++++++++++++++++++++++++++++++++++-------
drivers/md/raid10.h | 1
2 files changed, 117 insertions(+), 17 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index c809d3d..428db23 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -327,6 +327,16 @@ static void raid10_end_read_request(struct bio *bio, int error)
}
}

+static void close_write(r10bio_t *r10_bio)
+{
+ /* clear the bitmap if all writes complete successfully */
+ bitmap_endwrite(r10_bio->mddev->bitmap, r10_bio->sector,
+ r10_bio->sectors,
+ !test_bit(R10BIO_Degraded, &r10_bio->state),
+ 0);
+ md_write_end(r10_bio->mddev);
+}
+
static void raid10_end_write_request(struct bio *bio, int error)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -342,9 +352,9 @@ static void raid10_end_write_request(struct bio *bio, int error)
* this branch is our 'one mirror IO has finished' event handler:
*/
if (!uptodate) {
- md_error(r10_bio->mddev, conf->mirrors[dev].rdev);
- /* an I/O failed, we can't clear the bitmap */
- set_bit(R10BIO_Degraded, &r10_bio->state);
+ set_bit(WriteErrorSeen, &conf->mirrors[dev].rdev->flags);
+ set_bit(R10BIO_WriteError, &r10_bio->state);
+ dec_rdev = 0;
} else {
/*
* Set R10BIO_Uptodate in our master bio, so that
@@ -378,16 +388,15 @@ static void raid10_end_write_request(struct bio *bio, int error)
* already.
*/
if (atomic_dec_and_test(&r10_bio->remaining)) {
- /* clear the bitmap if all writes complete successfully */
- bitmap_endwrite(r10_bio->mddev->bitmap, r10_bio->sector,
- r10_bio->sectors,
- !test_bit(R10BIO_Degraded, &r10_bio->state),
- 0);
- md_write_end(r10_bio->mddev);
- if (test_bit(R10BIO_MadeGood, &r10_bio->state))
+ if (test_bit(R10BIO_WriteError, &r10_bio->state))
reschedule_retry(r10_bio);
- else
- raid_end_bio_io(r10_bio);
+ else {
+ close_write(r10_bio);
+ if (test_bit(R10BIO_MadeGood, &r10_bio->state))
+ reschedule_retry(r10_bio);
+ else
+ raid_end_bio_io(r10_bio);
+ }
}
if (dec_rdev)
rdev_dec_pending(conf->mirrors[dev].rdev, conf->mddev);
@@ -1839,6 +1848,82 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)
}
}

+static void bi_complete(struct bio *bio, int error)
+{
+ complete((struct completion *)bio->bi_private);
+}
+
+static int submit_bio_wait(int rw, struct bio *bio)
+{
+ struct completion event;
+ rw |= REQ_SYNC;
+
+ init_completion(&event);
+ bio->bi_private = &event;
+ bio->bi_end_io = bi_complete;
+ submit_bio(rw, bio);
+ wait_for_completion(&event);
+
+ return test_bit(BIO_UPTODATE, &bio->bi_flags);
+}
+
+static int narrow_write_error(r10bio_t *r10_bio, int i)
+{
+ struct bio *bio = r10_bio->master_bio;
+ mddev_t *mddev = r10_bio->mddev;
+ conf_t *conf = mddev->private;
+ mdk_rdev_t *rdev = conf->mirrors[r10_bio->devs[i].devnum].rdev;
+ /* bio has the data to be written to slot 'i' where
+ * we just recently had a write error.
+ * We repeatedly clone the bio and trim down to one block,
+ * then try the write. Where the write fails we record
+ * a bad block.
+ * It is conceivable that the bio doesn't exactly align with
+ * blocks. We must handle this.
+ *
+ * We currently own a reference to the rdev.
+ */
+
+ int block_sectors;
+ sector_t sector;
+ int sectors;
+ int sect_to_write = r10_bio->sectors;
+ int ok = 1;
+
+ if (rdev->badblocks.shift < 0)
+ return 0;
+
+ block_sectors = 1 << rdev->badblocks.shift;
+ sector = r10_bio->sector;
+ sectors = ((r10_bio->sector + block_sectors)
+ & ~(sector_t)(block_sectors - 1))
+ - sector;
+
+ while (sect_to_write) {
+ struct bio *wbio;
+ if (sectors > sect_to_write)
+ sectors = sect_to_write;
+ /* Write at 'sector' for 'sectors' */
+ wbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ md_trim_bio(wbio, sector - bio->bi_sector, sectors);
+ wbio->bi_sector = (r10_bio->devs[i].addr+
+ rdev->data_offset+
+ (sector - r10_bio->sector));
+ wbio->bi_bdev = rdev->bdev;
+ if (submit_bio_wait(WRITE, wbio) == 0)
+ /* Failure! */
+ ok = rdev_set_badblocks(rdev, sector,
+ sectors, 0)
+ && ok;
+
+ bio_put(wbio);
+ sect_to_write -= sectors;
+ sector += sectors;
+ sectors = block_sectors;
+ }
+ return ok;
+}
+
static void handle_read_error(mddev_t *mddev, r10bio_t *r10_bio)
{
int slot = r10_bio->read_slot;
@@ -1962,16 +2047,29 @@ static void handle_write_completed(conf_t *conf, r10bio_t *r10_bio)
}
put_buf(r10_bio);
} else {
- for (m = 0; m < conf->copies; m++)
- if (r10_bio->devs[m].bio == IO_MADE_GOOD) {
- int dev = r10_bio->devs[m].devnum;
- rdev = conf->mirrors[dev].rdev;
+ for (m = 0; m < conf->copies; m++) {
+ int dev = r10_bio->devs[m].devnum;
+ struct bio *bio = r10_bio->devs[m].bio;
+ rdev = conf->mirrors[dev].rdev;
+ if (bio == IO_MADE_GOOD) {
rdev_clear_badblocks(
rdev,
r10_bio->devs[m].addr,
r10_bio->sectors);
rdev_dec_pending(rdev, conf->mddev);
+ } else if (bio != NULL &&
+ !test_bit(BIO_UPTODATE, &bio->bi_flags)) {
+ if (!narrow_write_error(r10_bio, m)) {
+ md_error(conf->mddev, rdev);
+ set_bit(R10BIO_Degraded,
+ &r10_bio->state);
+ }
+ rdev_dec_pending(rdev, conf->mddev);
}
+ }
+ if (test_bit(R10BIO_WriteError,
+ &r10_bio->state))
+ close_write(r10_bio);
raid_end_bio_io(r10_bio);
}
}
@@ -2003,7 +2101,8 @@ static void raid10d(mddev_t *mddev)

mddev = r10_bio->mddev;
conf = mddev->private;
- if (test_bit(R10BIO_MadeGood, &r10_bio->state))
+ if (test_bit(R10BIO_MadeGood, &r10_bio->state) ||
+ test_bit(R10BIO_WriteError, &r10_bio->state))
handle_write_completed(conf, r10_bio);
else if (test_bit(R10BIO_IsSync, &r10_bio->state))
sync_request_write(mddev, r10_bio);
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index d8b7f9a..79cb52a 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -139,4 +139,5 @@ struct r10bio_s {
* known-bad-block records, we set this flag.
*/
#define R10BIO_MadeGood 5
+#define R10BIO_WriteError 6
#endif



[md PATCH 25/36] md/raid10: avoid reading from known bad blocks - part 2

on 21.07.2011 04:58:50 by NeilBrown

When redirecting a read error to a different device, we must
again avoid bad blocks and possibly split the request.

Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 45 ++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 872bf94..0dcd172 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1746,14 +1746,15 @@ static void handle_read_error(mddev_t *mddev, r10bio_t *r10_bio)
rdev_dec_pending(conf->mirrors[mirror].rdev, mddev);

bio = r10_bio->devs[slot].bio;
+ bdevname(bio->bi_bdev, b);
r10_bio->devs[slot].bio =
mddev->ro ? IO_BLOCKED : NULL;
+read_more:
mirror = read_balance(conf, r10_bio, &max_sectors);
- if (mirror == -1 || max_sectors < r10_bio->sectors) {
+ if (mirror == -1) {
printk(KERN_ALERT "md/raid10:%s: %s: unrecoverable I/O"
" read error for block %llu\n",
- mdname(mddev),
- bdevname(bio->bi_bdev, b),
+ mdname(mddev), b,
(unsigned long long)r10_bio->sector);
raid_end_bio_io(r10_bio);
bio_put(bio);
@@ -1761,7 +1762,8 @@ static void handle_read_error(mddev_t *mddev, r10bio_t *r10_bio)
}

do_sync = (r10_bio->master_bio->bi_rw & REQ_SYNC);
- bio_put(bio);
+ if (bio)
+ bio_put(bio);
slot = r10_bio->read_slot;
rdev = conf->mirrors[mirror].rdev;
printk_ratelimited(
@@ -1773,6 +1775,9 @@ static void handle_read_error(mddev_t *mddev, r10bio_t *r10_bio)
(unsigned long long)r10_bio->sector);
bio = bio_clone_mddev(r10_bio->master_bio,
GFP_NOIO, mddev);
+ md_trim_bio(bio,
+ r10_bio->sector - bio->bi_sector,
+ max_sectors);
r10_bio->devs[slot].bio = bio;
bio->bi_sector = r10_bio->devs[slot].addr
+ rdev->data_offset;
@@ -1780,7 +1785,37 @@ static void handle_read_error(mddev_t *mddev, r10bio_t *r10_bio)
bio->bi_rw = READ | do_sync;
bio->bi_private = r10_bio;
bio->bi_end_io = raid10_end_read_request;
- generic_make_request(bio);
+ if (max_sectors < r10_bio->sectors) {
+ /* Drat - have to split this up more */
+ struct bio *mbio = r10_bio->master_bio;
+ int sectors_handled =
+ r10_bio->sector + max_sectors
+ - mbio->bi_sector;
+ r10_bio->sectors = max_sectors;
+ spin_lock_irq(&conf->device_lock);
+ if (mbio->bi_phys_segments == 0)
+ mbio->bi_phys_segments = 2;
+ else
+ mbio->bi_phys_segments++;
+ spin_unlock_irq(&conf->device_lock);
+ generic_make_request(bio);
+ bio = NULL;
+
+ r10_bio = mempool_alloc(conf->r10bio_pool,
+ GFP_NOIO);
+ r10_bio->master_bio = mbio;
+ r10_bio->sectors = (mbio->bi_size >> 9)
+ - sectors_handled;
+ r10_bio->state = 0;
+ set_bit(R10BIO_ReadError,
+ &r10_bio->state);
+ r10_bio->mddev = mddev;
+ r10_bio->sector = mbio->bi_sector
+ + sectors_handled;
+
+ goto read_more;
+ } else
+ generic_make_request(bio);
}

static void raid10d(mddev_t *mddev)



[md PATCH 29/36] md/raid10: avoid writing to known bad blocks on known bad drives.

on 21.07.2011 04:58:50 by NeilBrown

Writing to known bad blocks on drives that have seen a write error
is asking for trouble. So try to avoid these blocks.
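
The clipping around a known bad block has three cases; a simplified
sketch of the decision, using the same tri-state is_badblock result as
elsewhere (0 = clean, 1 = acknowledged bad, -1 = unacknowledged):

/* Returns 1 if this device can take a write starting at dev_sector,
 * 0 if this copy must be skipped, -1 if the caller must wait for the
 * bad block to be acknowledged.  *max_sectors is clipped in place and
 * applies to all copies, so the splits stay aligned across devices.
 */
static int clip_write(int is_bad, unsigned long long dev_sector,
		      unsigned long long first_bad, int bad_sectors,
		      int *max_sectors)
{
	if (is_bad < 0)
		return -1;			/* unacknowledged: wait */
	if (is_bad == 0)
		return 1;			/* nothing bad in range */
	if (first_bad <= dev_sector) {
		/* Bad range covers our start: skip this copy, and do
		 * not let the other copies outrun the bad range's end.
		 */
		bad_sectors -= (int)(dev_sector - first_bad);
		if (bad_sectors < *max_sectors)
			*max_sectors = bad_sectors;
		return 0;
	}
	/* Bad range starts further in: write only the clean prefix. */
	if ((int)(first_bad - dev_sector) < *max_sectors)
		*max_sectors = (int)(first_bad - dev_sector);
	return 1;
}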

Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 105 +++++++++++++++++++++++++++++++++++++++++++++------
1 files changed, 93 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 7f11924..7dcd318 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -807,6 +807,8 @@ static int make_request(mddev_t *mddev, struct bio * bio)
unsigned long flags;
mdk_rdev_t *blocked_rdev;
int plugged;
+ int sectors_handled;
+ int max_sectors;

if (unlikely(bio->bi_rw & REQ_FLUSH)) {
md_flush_request(mddev, bio);
@@ -895,7 +897,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
/*
* read balancing logic:
*/
- int max_sectors;
int disk;
int slot;

@@ -925,8 +926,6 @@ read_again:
/* Could not read all from this device, so we will
* need another r10_bio.
*/
- int sectors_handled;
-
sectors_handled = (r10_bio->sectors + max_sectors
- bio->bi_sector);
r10_bio->sectors = max_sectors;
@@ -963,13 +962,22 @@ read_again:
/* first select target devices under rcu_lock and
* inc refcount on their rdev. Record them by setting
* bios[x] to bio
+ * If there are known/acknowledged bad blocks on any device
+ * on which we have seen a write error, we want to avoid
+ * writing to those blocks. This potentially requires several
+ * writes to write around the bad blocks. Each set of writes
+ * gets its own r10_bio with a set of bios attached. The number
+ * of r10_bios is recorded in bio->bi_phys_segments just as with
+ * the read case.
*/
plugged = mddev_check_plugged(mddev);

raid10_find_phys(conf, r10_bio);
- retry_write:
+retry_write:
blocked_rdev = NULL;
rcu_read_lock();
+ max_sectors = r10_bio->sectors;
+
for (i = 0; i < conf->copies; i++) {
int d = r10_bio->devs[i].devnum;
mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[d].rdev);
@@ -978,13 +986,55 @@ read_again:
blocked_rdev = rdev;
break;
}
- if (rdev && !test_bit(Faulty, &rdev->flags)) {
- atomic_inc(&rdev->nr_pending);
- r10_bio->devs[i].bio = bio;
- } else {
- r10_bio->devs[i].bio = NULL;
+ r10_bio->devs[i].bio = NULL;
+ if (!rdev || test_bit(Faulty, &rdev->flags)) {
set_bit(R10BIO_Degraded, &r10_bio->state);
+ continue;
+ }
+ if (test_bit(WriteErrorSeen, &rdev->flags)) {
+ sector_t first_bad;
+ sector_t dev_sector = r10_bio->devs[i].addr;
+ int bad_sectors;
+ int is_bad;
+
+ is_bad = is_badblock(rdev, dev_sector,
+ max_sectors,
+ &first_bad, &bad_sectors);
+ if (is_bad < 0) {
+ /* Mustn't write here until the bad block
+ * is acknowledged
+ */
+ atomic_inc(&rdev->nr_pending);
+ set_bit(BlockedBadBlocks, &rdev->flags);
+ blocked_rdev = rdev;
+ break;
+ }
+ if (is_bad && first_bad <= dev_sector) {
+ /* Cannot write here at all */
+ bad_sectors -= (dev_sector - first_bad);
+ if (bad_sectors < max_sectors)
+ /* Mustn't write more than bad_sectors
+ * to other devices yet
+ */
+ max_sectors = bad_sectors;
+ /* We don't set R10BIO_Degraded as that
+ * only applies if the disk is missing,
+ * so it might be re-added, and we want to
+ * know to recover this chunk.
+ * In this case the device is here, and the
+ * fact that this chunk is not in-sync is
+ * recorded in the bad block log.
+ */
+ continue;
+ }
+ if (is_bad) {
+ int good_sectors = first_bad - dev_sector;
+ if (good_sectors < max_sectors)
+ max_sectors = good_sectors;
+ }
}
+ r10_bio->devs[i].bio = bio;
+ atomic_inc(&rdev->nr_pending);
}
rcu_read_unlock();

@@ -1004,8 +1054,22 @@ read_again:
goto retry_write;
}

+ if (max_sectors < r10_bio->sectors) {
+ /* We are splitting this into multiple parts, so
+ * we need to prepare for allocating another r10_bio.
+ */
+ r10_bio->sectors = max_sectors;
+ spin_lock_irq(&conf->device_lock);
+ if (bio->bi_phys_segments == 0)
+ bio->bi_phys_segments = 2;
+ else
+ bio->bi_phys_segments++;
+ spin_unlock_irq(&conf->device_lock);
+ }
+ sectors_handled = r10_bio->sector + max_sectors - bio->bi_sector;
+
atomic_set(&r10_bio->remaining, 1);
- bitmap_startwrite(mddev->bitmap, bio->bi_sector, r10_bio->sectors, 0);
+ bitmap_startwrite(mddev->bitmap, r10_bio->sector, r10_bio->sectors, 0);

for (i = 0; i < conf->copies; i++) {
struct bio *mbio;
@@ -1014,10 +1078,12 @@ read_again:
continue;

mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
+ max_sectors);
r10_bio->devs[i].bio = mbio;

- mbio->bi_sector = r10_bio->devs[i].addr+
- conf->mirrors[d].rdev->data_offset;
+ mbio->bi_sector = (r10_bio->devs[i].addr+
+ conf->mirrors[d].rdev->data_offset);
mbio->bi_bdev = conf->mirrors[d].rdev->bdev;
mbio->bi_end_io = raid10_end_write_request;
mbio->bi_rw = WRITE | do_sync | do_fua;
@@ -1042,6 +1108,21 @@ read_again:
/* In case raid10d snuck in to freeze_array */
wake_up(&conf->wait_barrier);

+ if (sectors_handled < (bio->bi_size >> 9)) {
+ /* We need another r10_bio. It has already been counted
+ * in bio->bi_phys_segments.
+ */
+ r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
+
+ r10_bio->master_bio = bio;
+ r10_bio->sectors = (bio->bi_size >> 9) - sectors_handled;
+
+ r10_bio->mddev = mddev;
+ r10_bio->sector = bio->bi_sector + sectors_handled;
+ r10_bio->state = 0;
+ goto retry_write;
+ }
+
if (do_sync || !mddev->bitmap || !plugged)
md_wakeup_thread(mddev->thread);
return 0;



[md PATCH 27/36] md/raid10: avoid reading known bad blocks during resync/recovery.

on 21.07.2011 04:58:50 by NeilBrown

During resync/recovery limit the size of the request to avoid
reading into a bad block that does not start at-or-before the current
read address.

Similarly if there is a bad block at this address, don't allow the
current request to extend beyond the end of that bad block.

Now that we don't ever read from known bad blocks, it is safe to allow
devices with those blocks into the array.
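
Concretely: with max_sync at 128 sectors and a bad range beginning 40
sectors ahead of the read position, the request is clipped to 40
sectors; if the bad range already covers the read position, that copy
is skipped and max_sync is clipped to the remaining length of the bad
range, so the next iteration resumes just past it.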

Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 44 +++++++++++++++++++++++++++++++++++---------
1 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 47e6959..9bac312 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1203,9 +1203,6 @@ static int raid10_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
int first = 0;
int last = conf->raid_disks - 1;

- if (rdev->badblocks.count)
- return -EINVAL;
-
if (mddev->recovery_cp < MaxSector)
/* only hot-add to in-sync arrays, as recovery is
* very different from resync
@@ -1927,7 +1924,6 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
int i;
int max_sync;
sector_t sync_blocks;
-
sector_t sectors_skipped = 0;
int chunks_skipped = 0;

@@ -2070,10 +2066,28 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,

for (j=0; j<conf->copies;j++) {
int d = r10_bio->devs[j].devnum;
+ mdk_rdev_t *rdev;
+ sector_t sector, first_bad;
+ int bad_sectors;
if (!conf->mirrors[d].rdev ||
!test_bit(In_sync, &conf->mirrors[d].rdev->flags))
continue;
/* This is where we read from */
+ rdev = conf->mirrors[d].rdev;
+ sector = r10_bio->devs[j].addr;
+
+ if (is_badblock(rdev, sector, max_sync,
+ &first_bad, &bad_sectors)) {
+ if (first_bad > sector)
+ max_sync = first_bad - sector;
+ else {
+ bad_sectors -= (sector
+ - first_bad);
+ if (max_sync > bad_sectors)
+ max_sync = bad_sectors;
+ continue;
+ }
+ }
bio = r10_bio->devs[0].bio;
bio->bi_next = biolist;
biolist = bio;
@@ -2160,12 +2174,28 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,

for (i=0; i<conf->copies; i++) {
int d = r10_bio->devs[i].devnum;
+ sector_t first_bad, sector;
+ int bad_sectors;
+
bio = r10_bio->devs[i].bio;
bio->bi_end_io = NULL;
clear_bit(BIO_UPTODATE, &bio->bi_flags);
if (conf->mirrors[d].rdev == NULL ||
test_bit(Faulty, &conf->mirrors[d].rdev->flags))
continue;
+ sector = r10_bio->devs[i].addr;
+ if (is_badblock(conf->mirrors[d].rdev,
+ sector, max_sync,
+ &first_bad, &bad_sectors)) {
+ if (first_bad > sector)
+ max_sync = first_bad - sector;
+ else {
+ bad_sectors -= (sector - first_bad);
+ if (max_sync > bad_sectors)
+ max_sync = bad_sectors;
+ continue;
+ }
+ }
atomic_inc(&conf->mirrors[d].rdev->nr_pending);
atomic_inc(&r10_bio->remaining);
bio->bi_next = biolist;
@@ -2173,7 +2203,7 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
bio->bi_private = r10_bio;
bio->bi_end_io = end_sync_read;
bio->bi_rw = READ;
- bio->bi_sector = r10_bio->devs[i].addr +
+ bio->bi_sector = sector +
conf->mirrors[d].rdev->data_offset;
bio->bi_bdev = conf->mirrors[d].rdev->bdev;
count++;
@@ -2431,10 +2461,6 @@ static int run(mddev_t *mddev)

list_for_each_entry(rdev, &mddev->disks, same_set) {

- if (rdev->badblocks.count) {
- printk(KERN_ERR "md/raid10: cannot handle bad blocks yet\n");
- goto out_free_conf;
- }
disk_idx = rdev->raid_disk;
if (disk_idx >= conf->raid_disks
|| disk_idx < 0)



[md PATCH 28/36] md/raid10: record bad blocks as needed during recovery.

on 21.07.2011 04:58:50 by NeilBrown

When recovering one or more devices, if all the good devices have
bad blocks we should record a bad block on the device being rebuilt.

If this fails, we need to abort the recovery.

To ensure we don't think that we aborted later than we actually did,
we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync,
in particular before mddev->curr_resync is updated.
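
The ordering constraint is easiest to see in a skeleton of the sync
loop (hypothetical helpers, not the md code itself):

extern int sync_some(unsigned long long j);	/* issues one batch */
extern int interrupted(void);			/* MD_RECOVERY_INTR? */

static unsigned long long curr_resync;

static void sync_loop(void)
{
	unsigned long long j = 0;

	for (;;) {
		int sectors = sync_some(j);

		if (interrupted())	/* test BEFORE advancing j */
			break;
		j += sectors;
		curr_resync = j;	/* never overstates progress */
	}
}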

Signed-off-by: NeilBrown
---

drivers/md/md.c | 9 ++++-----
drivers/md/raid10.c | 40 ++++++++++++++++++++++++++++++++--------
2 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 90d07ab..b4e1629 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7168,11 +7168,14 @@ void md_do_sync(mddev_t *mddev)
atomic_add(sectors, &mddev->recovery_active);
}

+ if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+ break;
+
j += sectors;
if (j>1) mddev->curr_resync = j;
mddev->curr_mark_cnt = io_sectors;
if (last_check == 0)
- /* this is the earliers that rebuilt will be
+ /* this is the earliest that rebuild will be
* visible in /proc/mdstat
*/
md_new_event(mddev);
@@ -7181,10 +7184,6 @@ void md_do_sync(mddev_t *mddev)
continue;

last_check = io_sectors;
-
- if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
- break;
-
repeat:
if (time_after_eq(jiffies, mark[last_mark] + SYNC_MARK_STEP )) {
/* step marks */
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 9bac312..7f11924 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2005,7 +2005,7 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
max_sync = RESYNC_PAGES << (PAGE_SHIFT-9);
if (!test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
/* recovery... the complicated one */
- int j, k;
+ int j;
r10_bio = NULL;

for (i=0 ; i<conf->raid_disks; i++) {
@@ -2013,6 +2013,7 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
r10bio_t *rb2;
sector_t sect;
int must_sync;
+ int any_working;

if (conf->mirrors[i].rdev == NULL ||
test_bit(In_sync, &conf->mirrors[i].rdev->flags))
@@ -2064,7 +2065,9 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
must_sync = bitmap_start_sync(mddev->bitmap, sect,
&sync_blocks, still_degraded);

+ any_working = 0;
for (j=0; j<conf->copies;j++) {
+ int k;
int d = r10_bio->devs[j].devnum;
mdk_rdev_t *rdev;
sector_t sector, first_bad;
@@ -2073,6 +2076,7 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
!test_bit(In_sync, &conf->mirrors[d].rdev->flags))
continue;
/* This is where we read from */
+ any_working = 1;
rdev = conf->mirrors[d].rdev;
sector = r10_bio->devs[j].addr;

@@ -2121,16 +2125,35 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
break;
}
if (j == conf->copies) {
- /* Cannot recover, so abort the recovery */
+ /* Cannot recover, so abort the recovery or
+ * record a bad block */
put_buf(r10_bio);
if (rb2)
atomic_dec(&rb2->remaining);
r10_bio = rb2;
- if (!test_and_set_bit(MD_RECOVERY_INTR,
- &mddev->recovery))
- printk(KERN_INFO "md/raid10:%s: insufficient "
- "working devices for recovery.\n",
- mdname(mddev));
+ if (any_working) {
+ /* problem is that there are bad blocks
+ * on other device(s)
+ */
+ int k;
+ for (k = 0; k < conf->copies; k++)
+ if (r10_bio->devs[k].devnum == i)
+ break;
+ if (!rdev_set_badblocks(
+ conf->mirrors[i].rdev,
+ r10_bio->devs[k].addr,
+ max_sync, 0))
+ any_working = 0;
+ }
+ if (!any_working) {
+ if (!test_and_set_bit(MD_RECOVERY_INTR,
+ &mddev->recovery))
+ printk(KERN_INFO "md/raid10:%s: insufficient "
+ "working devices for recovery.\n",
+ mdname(mddev));
+ conf->mirrors[i].recovery_disabled
+ = mddev->recovery_disabled;
+ }
break;
}
}
@@ -2290,7 +2313,8 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
return sectors_skipped + nr_sectors;
giveup:
/* There is nowhere to write, so all non-sync
- * drives must be failed, so try the next chunk...
+ * drives must be failed or in resync, all drives
+ * have a bad block, so try the next chunk...
*/
if (sector_nr + max_sync < max_sector)
max_sector = sector_nr + max_sync;



[md PATCH 32/36] md/raid10: attempt to fix read errors during resync/check

on 21.07.2011 04:58:50 by NeilBrown

We already attempt to fix read errors found during normal IO
and a 'repair' process.
It is best to try to repair them at any time they are found,
so move a test so that during sync and check a read error will
be corrected by over-writing with good data.

If both (all) devices have known bad blocks in the sync section we
won't try to fix anything, even though the bad blocks might not
overlap. That should be considered later.

Also if we hit a read error during recovery we don't try to fix it.
It would only be possible to fix if there were at least three copies
of data, which is not very common with RAID10. But it should still
be considered later.

Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 9 +++++----
1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 428db23..dfea30e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1541,11 +1541,12 @@ static void sync_request_write(mddev_t *mddev, r10bio_t *r10_bio)
if (j == vcnt)
continue;
mddev->resync_mismatches += r10_bio->sectors;
+ if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery))
+ /* Don't fix anything. */
+ continue;
}
- if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery))
- /* Don't fix anything. */
- continue;
- /* Ok, we need to write this bio
+ /* Ok, we need to write this bio, either to correct an
+ * inconsistency or to correct an unreadable block.
* First we need to fixup bv_offset, bv_len and
* bi_vecs, as the read request might have corrupted these
*/



[md PATCH 26/36] md/raid10 - avoid reading from known bad blocks - part 3

am 21.07.2011 04:58:50 von NeilBrown

When attempting to repair a read error, don't read from
devices with a known bad block.

As we are only reading PAGE_SIZE blocks, we don't try to
narrow down to smaller regions in the hope that only part of this
page is bad - it isn't worth the effort.

Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 0dcd172..47e6959 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1594,10 +1594,15 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)

rcu_read_lock();
do {
+ sector_t first_bad;
+ int bad_sectors;
+
d = r10_bio->devs[sl].devnum;
rdev = rcu_dereference(conf->mirrors[d].rdev);
if (rdev &&
- test_bit(In_sync, &rdev->flags)) {
+ test_bit(In_sync, &rdev->flags) &&
+ is_badblock(rdev, r10_bio->devs[sl].addr + sect, s,
+ &first_bad, &bad_sectors) == 0) {
atomic_inc(&rdev->nr_pending);
rcu_read_unlock();
success = sync_page_io(rdev,



[md PATCH 36/36] md/raid10: handle further errors during fix_read_error better.

am 21.07.2011 04:58:51 von NeilBrown

If we find more read/write errors we should record a bad block before
failing the device.

Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 59 ++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 44 insertions(+), 15 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b02e384..9214e08 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1749,6 +1749,26 @@ static void check_decay_read_errors(mddev_t *mddev, mdk_rdev_t *rdev)
atomic_set(&rdev->read_errors, read_errors >> hours_since_last);
}

+static int r10_sync_page_io(mdk_rdev_t *rdev, sector_t sector,
+ int sectors, struct page *page, int rw)
+{
+ sector_t first_bad;
+ int bad_sectors;
+
+ if (is_badblock(rdev, sector, sectors, &first_bad, &bad_sectors)
+ && (rw == READ || test_bit(WriteErrorSeen, &rdev->flags)))
+ return -1;
+ if (sync_page_io(rdev, sector, sectors << 9, page, rw, false))
+ /* success */
+ return 1;
+ if (rw == WRITE)
+ set_bit(WriteErrorSeen, &rdev->flags);
+ /* need to record an error - either for the block or the device */
+ if (!rdev_set_badblocks(rdev, sector, sectors, 0))
+ md_error(rdev->mddev, rdev);
+ return 0;
+}
+
/*
* This is a kernel thread which:
*
@@ -1832,9 +1852,19 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)
rcu_read_unlock();

if (!success) {
- /* Cannot read from anywhere -- bye bye array */
+ /* Cannot read from anywhere, just mark the block
+ * as bad on the first device to discourage future
+ * reads.
+ */
int dn = r10_bio->devs[r10_bio->read_slot].devnum;
- md_error(mddev, conf->mirrors[dn].rdev);
+ rdev = conf->mirrors[dn].rdev;
+
+ if (!rdev_set_badblocks(
+ rdev,
+ r10_bio->devs[r10_bio->read_slot].addr
+ + sect,
+ s, 0))
+ md_error(mddev, rdev);
break;
}

@@ -1855,10 +1885,10 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)

atomic_inc(&rdev->nr_pending);
rcu_read_unlock();
- if (sync_page_io(rdev,
- r10_bio->devs[sl].addr +
- sect,
- s<<9, conf->tmppage, WRITE, false)
+ if (r10_sync_page_io(rdev,
+ r10_bio->devs[sl].addr +
+ sect,
+ s, conf->tmppage, WRITE)
== 0) {
/* Well, this device is dead */
printk(KERN_NOTICE
@@ -1873,7 +1903,6 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)
"drive\n",
mdname(mddev),
bdevname(rdev->bdev, b));
- md_error(mddev, rdev);
}
rdev_dec_pending(rdev, mddev);
rcu_read_lock();
@@ -1893,11 +1922,12 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)

atomic_inc(&rdev->nr_pending);
rcu_read_unlock();
- if (sync_page_io(rdev,
- r10_bio->devs[sl].addr +
- sect,
- s<<9, conf->tmppage,
- READ, false) == 0) {
+ switch (r10_sync_page_io(rdev,
+ r10_bio->devs[sl].addr +
+ sect,
+ s, conf->tmppage,
+ READ)) {
+ case 0:
/* Well, this device is dead */
printk(KERN_NOTICE
"md/raid10:%s: unable to read back "
@@ -1911,9 +1941,8 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)
"drive\n",
mdname(mddev),
bdevname(rdev->bdev, b));
-
- md_error(mddev, rdev);
- } else {
+ break;
+ case 1:
printk(KERN_INFO
"md/raid10:%s: read error corrected"
" (%d sectors at %llu on %s)\n",



[md PATCH 35/36] md/raid10: Handle read errors during recovery better.

am 21.07.2011 04:58:51 von NeilBrown

Currently when we get a read error during recovery, we simply abort
the recovery.

Instead, repeat the read in page-sized blocks.
On successful reads, write to the target.
On read errors, record a bad block on the destination,
and only if that fails do we abort the recovery.

As we now retry reads we need to know where we read from. That address
was in bi_sector, but bi_sector can be modified while a read attempt is
in flight.
So store the correct from_addr and to_addr in the r10_bio for later
access.


Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 154 ++++++++++++++++++++++++++++++++++++++++-----------
1 files changed, 121 insertions(+), 33 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index f87c8d9..b02e384 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1141,7 +1141,7 @@ retry_write:
wake_up(&conf->wait_barrier);

if (sectors_handled < (bio->bi_size >> 9)) {
- /* We need another r1_bio. It has already been counted
+ /* We need another r10_bio. It has already been counted
* in bio->bi_phys_segments.
*/
r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
@@ -1438,29 +1438,10 @@ static void end_sync_read(struct bio *bio, int error)
}
}

-static void end_sync_write(struct bio *bio, int error)
+static void end_sync_request(r10bio_t *r10_bio)
{
- int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
- r10bio_t *r10_bio = bio->bi_private;
mddev_t *mddev = r10_bio->mddev;
- conf_t *conf = mddev->private;
- int d;
- sector_t first_bad;
- int bad_sectors;
- int slot;
-
- d = find_bio_disk(conf, r10_bio, bio, &slot);
-
- if (!uptodate) {
- set_bit(WriteErrorSeen, &conf->mirrors[d].rdev->flags);
- set_bit(R10BIO_WriteError, &r10_bio->state);
- } else if (is_badblock(conf->mirrors[d].rdev,
- r10_bio->devs[slot].addr,
- r10_bio->sectors,
- &first_bad, &bad_sectors))
- set_bit(R10BIO_MadeGood, &r10_bio->state);

- rdev_dec_pending(conf->mirrors[d].rdev, mddev);
while (atomic_dec_and_test(&r10_bio->remaining)) {
if (r10_bio->master_bio == NULL) {
/* the primary of several recovery bios */
@@ -1484,6 +1465,33 @@ static void end_sync_write(struct bio *bio, int error)
}
}

+static void end_sync_write(struct bio *bio, int error)
+{
+ int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+ r10bio_t *r10_bio = bio->bi_private;
+ mddev_t *mddev = r10_bio->mddev;
+ conf_t *conf = mddev->private;
+ int d;
+ sector_t first_bad;
+ int bad_sectors;
+ int slot;
+
+ d = find_bio_disk(conf, r10_bio, bio, &slot);
+
+ if (!uptodate) {
+ set_bit(WriteErrorSeen, &conf->mirrors[d].rdev->flags);
+ set_bit(R10BIO_WriteError, &r10_bio->state);
+ } else if (is_badblock(conf->mirrors[d].rdev,
+ r10_bio->devs[slot].addr,
+ r10_bio->sectors,
+ &first_bad, &bad_sectors))
+ set_bit(R10BIO_MadeGood, &r10_bio->state);
+
+ rdev_dec_pending(conf->mirrors[d].rdev, mddev);
+
+ end_sync_request(r10_bio);
+}
+
/*
* Note: sync and recover are handled very differently for raid10
* This code is for resync.
@@ -1600,6 +1608,84 @@ done:
* The second for writing.
*
*/
+static void fix_recovery_read_error(r10bio_t *r10_bio)
+{
+ /* We got a read error during recovery.
+ * We repeat the read in smaller page-sized sections.
+ * If a read succeeds, write it to the new device or record
+ * a bad block if we cannot.
+ * If a read fails, record a bad block on both old and
+ * new devices.
+ */
+ mddev_t *mddev = r10_bio->mddev;
+ conf_t *conf = mddev->private;
+ struct bio *bio = r10_bio->devs[0].bio;
+ sector_t sect = 0;
+ int sectors = r10_bio->sectors;
+ int idx = 0;
+ int dr = r10_bio->devs[0].devnum;
+ int dw = r10_bio->devs[1].devnum;
+
+ while (sectors) {
+ int s = sectors;
+ mdk_rdev_t *rdev;
+ sector_t addr;
+ int ok;
+
+ if (s > (PAGE_SIZE>>9))
+ s = PAGE_SIZE >> 9;
+
+ rdev = conf->mirrors[dr].rdev;
+ addr = r10_bio->devs[0].addr + sect;
+ ok = sync_page_io(rdev,
+ addr,
+ s << 9,
+ bio->bi_io_vec[idx].bv_page,
+ READ, false);
+ if (ok) {
+ rdev = conf->mirrors[dw].rdev;
+ addr = r10_bio->devs[1].addr + sect;
+ ok = sync_page_io(rdev,
+ addr,
+ s << 9,
+ bio->bi_io_vec[idx].bv_page,
+ WRITE, false);
+ if (!ok)
+ set_bit(WriteErrorSeen, &rdev->flags);
+ }
+ if (!ok) {
+ /* We don't worry if we cannot set a bad block -
+ * it really is bad so there is no loss in not
+ * recording it yet
+ */
+ rdev_set_badblocks(rdev, addr, s, 0);
+
+ if (rdev != conf->mirrors[dw].rdev) {
+ /* need bad block on destination too */
+ mdk_rdev_t *rdev2 = conf->mirrors[dw].rdev;
+ addr = r10_bio->devs[1].addr + sect;
+ ok = rdev_set_badblocks(rdev2, addr, s, 0);
+ if (!ok) {
+ /* just abort the recovery */
+ printk(KERN_NOTICE
+ "md/raid10:%s: recovery aborted"
+ " due to read error\n",
+ mdname(mddev));
+
+ conf->mirrors[dw].recovery_disabled
+ = mddev->recovery_disabled;
+ set_bit(MD_RECOVERY_INTR,
+ &mddev->recovery);
+ break;
+ }
+ }
+ }
+
+ sectors -= s;
+ sect += s;
+ idx++;
+ }
+}

static void recovery_request_write(mddev_t *mddev, r10bio_t *r10_bio)
{
@@ -1607,6 +1693,12 @@ static void recovery_request_write(mddev_t *mddev, r10bio_t *r10_bio)
int d;
struct bio *wbio;

+ if (!test_bit(R10BIO_Uptodate, &r10_bio->state)) {
+ fix_recovery_read_error(r10_bio);
+ end_sync_request(r10_bio);
+ return;
+ }
+
/*
* share the pages with the first bio
* and submit the write request
@@ -1616,16 +1708,7 @@ static void recovery_request_write(mddev_t *mddev, r10bio_t *r10_bio)

atomic_inc(&conf->mirrors[d].rdev->nr_pending);
md_sync_acct(conf->mirrors[d].rdev->bdev, wbio->bi_size >> 9);
- if (test_bit(R10BIO_Uptodate, &r10_bio->state))
- generic_make_request(wbio);
- else {
- printk(KERN_NOTICE
- "md/raid10:%s: recovery aborted due to read error\n",
- mdname(mddev));
- conf->mirrors[d].recovery_disabled = mddev->recovery_disabled;
- set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- bio_endio(wbio, 0);
- }
+ generic_make_request(wbio);
}


@@ -2339,6 +2422,7 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
for (j=0; j<conf->copies;j++) {
int k;
int d = r10_bio->devs[j].devnum;
+ sector_t from_addr, to_addr;
mdk_rdev_t *rdev;
sector_t sector, first_bad;
int bad_sectors;
@@ -2368,7 +2452,8 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
bio->bi_private = r10_bio;
bio->bi_end_io = end_sync_read;
bio->bi_rw = READ;
- bio->bi_sector = r10_bio->devs[j].addr +
+ from_addr = r10_bio->devs[j].addr;
+ bio->bi_sector = from_addr +
conf->mirrors[d].rdev->data_offset;
bio->bi_bdev = conf->mirrors[d].rdev->bdev;
atomic_inc(&conf->mirrors[d].rdev->nr_pending);
@@ -2385,12 +2470,15 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
bio->bi_private = r10_bio;
bio->bi_end_io = end_sync_write;
bio->bi_rw = WRITE;
- bio->bi_sector = r10_bio->devs[k].addr +
+ to_addr = r10_bio->devs[k].addr;
+ bio->bi_sector = to_addr +
conf->mirrors[i].rdev->data_offset;
bio->bi_bdev = conf->mirrors[i].rdev->bdev;

r10_bio->devs[0].devnum = d;
+ r10_bio->devs[0].addr = from_addr;
r10_bio->devs[1].devnum = i;
+ r10_bio->devs[1].addr = to_addr;

break;
}



Re: [md PATCH 01/36] md: beginnings of bad block management.

am 22.07.2011 17:03:45 von Namhyung Kim

NeilBrown writes:

> This is the first step in allowing md to track bad-blocks per-device so
> that we can fail individual blocks rather than the whole device.
>
> This patch just adds a data structure for recording bad blocks, with
> routines to add, remove, search the list.
>
> Signed-off-by: NeilBrown
> ---
>
> drivers/md/md.c | 457 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> drivers/md/md.h | 49 ++++++
> 2 files changed, 502 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 2a32050..220fadb 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1952,6 +1952,10 @@ static void unbind_rdev_from_array(mdk_rdev_t * rdev)
> sysfs_remove_link(&rdev->kobj, "block");
> sysfs_put(rdev->sysfs_state);
> rdev->sysfs_state = NULL;
> + kfree(rdev->badblocks.page);
> + rdev->badblocks.count = 0;
> + rdev->badblocks.page = NULL;
> + rdev->badblocks.active_page = NULL;
> /* We need to delay this, otherwise we can deadlock when
> * writing to 'remove' to "dev/state". We also need
> * to delay it due to rcu usage.
> @@ -2778,7 +2782,7 @@ static struct kobj_type rdev_ktype = {
> .default_attrs = rdev_default_attrs,
> };
>
> -void md_rdev_init(mdk_rdev_t *rdev)
> +int md_rdev_init(mdk_rdev_t *rdev)
> {
> rdev->desc_nr = -1;
> rdev->saved_raid_disk = -1;
> @@ -2794,6 +2798,20 @@ void md_rdev_init(mdk_rdev_t *rdev)
>
> INIT_LIST_HEAD(&rdev->same_set);
> init_waitqueue_head(&rdev->blocked_wait);
> +
> + /* Add space to store bad block list.
> + * This reserves the space even on arrays where it cannot
> + * be used - I wonder if that matters
> + */
> + rdev->badblocks.count = 0;
> + rdev->badblocks.shift = 0;
> + rdev->badblocks.page = kmalloc(PAGE_SIZE, GFP_KERNEL);
> + rdev->badblocks.active_page = rdev->badblocks.page;
> + spin_lock_init(&rdev->badblocks.lock);
> + if (rdev->badblocks.page == NULL)
> + return -ENOMEM;
> +
> + return 0;
> }
> EXPORT_SYMBOL_GPL(md_rdev_init);
> /*
> @@ -2819,8 +2837,11 @@ static mdk_rdev_t *md_import_device(dev_t newdev, int super_format, int super_mi
> return ERR_PTR(-ENOMEM);
> }
>
> - md_rdev_init(rdev);
> - if ((err = alloc_disk_sb(rdev)))
> + err = md_rdev_init(rdev);
> + if (err)
> + goto abort_free;
> + err = alloc_disk_sb(rdev);
> + if (err)
> goto abort_free;
>
> err = lock_rdev(rdev, newdev, super_format == -2);
> @@ -7324,6 +7345,436 @@ void md_wait_for_blocked_rdev(mdk_rdev_t *rdev, mddev_t *mddev)
> }
> EXPORT_SYMBOL(md_wait_for_blocked_rdev);
>
> +
> +/* Bad block management.
> + * We can record which blocks on each device are 'bad' and so just
> + * fail those blocks, or that stripe, rather than the whole device.
> + * Entries in the bad-block table are 64bits wide. This comprises:
> + * Length of bad-range, in sectors: 0-511 for lengths 1-512
> + * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
> + * A 'shift' can be set so that larger blocks are tracked and
> + * consequently larger devices can be covered.
> + * 'Acknowledged' flag - 1 bit. - the most significant bit.
> + */
> +/* Locking of the bad-block table is a two-layer affair.
> + * Read access through ->active_page only requires an rcu_readlock.
> + * However if ->active_page is found to be NULL, the table
> + * should be accessed through ->page which requires an irq-spinlock.
> + * Updating the page requires setting ->active_page to NULL,
> + * synchronising with rcu, then updating ->page under the same
> + * irq-spinlock.
> + * We always set or clear bad blocks from process context, but
> + * might look-up bad blocks from interrupt/bh context.
> + *

Empty line.

If the locking is complex, it'd be better to define separate functions
to deal with it, IMHO. Please see below.


> + */
> +/* When looking for a bad block we specify a range and want to
> + * know if any block in the range is bad. So we binary-search
> + * to the last range that starts at-or-before the given endpoint,
> + * (or "before the sector after the target range")
> + * then see if it ends after the given start.
> + * We return
> + * 0 if there are no known bad blocks in the range
> + * 1 if there are known bad block which are all acknowledged
> + * -1 if there are bad blocks which have not yet been acknowledged in metadata.
> + * plus the start/length of the first bad section we overlap.
> + */
> +int md_is_badblock(struct badblocks *bb, sector_t s, int sectors,
> + sector_t *first_bad, int *bad_sectors)
> +{
> + int hi;
> + int lo = 0;
> + u64 *p;
> + int rv = 0;
> + int havelock = 0;
> + sector_t target = s + sectors;
> + unsigned long uninitialized_var(flags);
> +
> + if (bb->shift > 0) {
> + /* round the start down, and the end up */
> + s >>= bb->shift;
> + target += (1<<bb->shift) - 1;
> + target >>= bb->shift;
> + sectors = target - s;
> + }
> + /* 'target' is now the first block after the bad range */
> +
> + rcu_read_lock();
> + p = rcu_dereference(bb->active_page);
> + if (!p) {
> + spin_lock_irqsave(&bb->lock, flags);
> + p = bb->page;
> + havelock = 1;
> + }

Maybe something like:

p = md_read_lock_bb(bb, &havelock, &flags);


> + hi = bb->count;
> +
> + /* Binary search between lo and hi for 'target'
> + * i.e. for the last range that starts before 'target'
> + */
> + /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
> + * are known not to be the last range before target.
> + * VARIANT: hi-lo is the number of possible
> + * ranges, and decreases until it reaches 1
> + */
> + while (hi - lo > 1) {
> + int mid = (lo + hi) / 2;
> + sector_t a = BB_OFFSET(p[mid]);
> + if (a < target)
> + /* This could still be the one, earlier ranges
> + * could not. */
> + lo = mid;
> + else
> + /* This and later ranges are definitely out. */
> + hi = mid;
> + }
> + /* 'lo' might be the last that started before target, but 'hi' isn't */
> + if (hi > lo) {
> + /* need to check all range that end after 's' to see if
> + * any are unacknowledged.
> + */
> + while (lo >= 0 &&
> + BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
> + if (BB_OFFSET(p[lo]) < target) {
> + /* starts before the end, and finishes after
> + * the start, so they must overlap
> + */
> + if (rv != -1 && BB_ACK(p[lo]))
> + rv = 1;
> + else
> + rv = -1;
> + *first_bad = BB_OFFSET(p[lo]);
> + *bad_sectors = BB_LEN(p[lo]);
> + }
> + lo--;
> + }
> + }
> +
> + if (havelock)
> + spin_unlock_irqrestore(&bb->lock, flags);
> + rcu_read_unlock();

And
md_read_unlock_bb(bb, havelock, flags);
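
A minimal sketch of what that helper pair might look like, purely
factoring out the logic above (the names, and the convention that the
caller initialises 'havelock' to 0, are just suggestions):

static u64 *md_read_lock_bb(struct badblocks *bb, int *havelock,
			    unsigned long *flags)
{
	u64 *p;

	rcu_read_lock();
	p = rcu_dereference(bb->active_page);
	if (!p) {
		/* an update is in progress; fall back to ->page
		 * under the irq-spinlock */
		spin_lock_irqsave(&bb->lock, *flags);
		p = bb->page;
		*havelock = 1;
	}
	return p;
}

static void md_read_unlock_bb(struct badblocks *bb, int havelock,
			      unsigned long flags)
{
	if (havelock)
		spin_unlock_irqrestore(&bb->lock, flags);
	rcu_read_unlock();
}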


> + return rv;
> +}
> +EXPORT_SYMBOL_GPL(md_is_badblock);
> +
> +/*
> + * Add a range of bad blocks to the table.
> + * This might extend the table, or might contract it
> + * if two adjacent ranges can be merged.
> + * We binary-search to find the 'insertion' point, then
> + * decide how best to handle it.
> + */
> +static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors,
> + int acknowledged)
> +{
> + u64 *p;
> + int lo, hi;
> + int rv = 1;
> +
> + if (bb->shift < 0)
> + /* badblocks are disabled */
> + return 0;
> +
> + if (bb->shift) {
> + /* round the start down, and the end up */
> + sector_t next = s + sectors;
> + s >>= bb->shift;
> + next += (1<<bb->shift) - 1;
> + next >>= bb->shift;
> + sectors = next - s;
> + }
> +
> + while (1) {
> + rcu_assign_pointer(bb->active_page, NULL);
> + synchronize_rcu();
> + spin_lock_irq(&bb->lock);
> + if (bb->active_page == NULL)
> + break;
> + /* someone else just unlocked, better retry */
> + spin_unlock_irq(&bb->lock);
> + }

md_write_lock_bb(bb);
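
i.e. hoisting the retry loop into something like (sketch only):

static void md_write_lock_bb(struct badblocks *bb)
{
	/* Get exclusive access to ->page: force readers off
	 * ->active_page, then take the lock, retrying if another
	 * updater re-published the page in the meantime. */
	while (1) {
		rcu_assign_pointer(bb->active_page, NULL);
		synchronize_rcu();
		spin_lock_irq(&bb->lock);
		if (bb->active_page == NULL)
			break;
		/* someone else just unlocked, better retry */
		spin_unlock_irq(&bb->lock);
	}
}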


> + /* now have exclusive access to the page */
> +
> + p = bb->page;
> + lo = 0;
> + hi = bb->count;
> + /* Find the last range that starts at-or-before 's' */
> + while (hi - lo > 1) {
> + int mid = (lo + hi) / 2;
> + sector_t a = BB_OFFSET(p[mid]);
> + if (a <= s)
> + lo = mid;
> + else
> + hi = mid;
> + }
> + if (hi > lo && BB_OFFSET(p[lo]) > s)
> + hi = lo;
> +
> + if (hi > lo) {
> + /* we found a range that might merge with the start
> + * of our new range
> + */
> + sector_t a = BB_OFFSET(p[lo]);
> + sector_t e = a + BB_LEN(p[lo]);
> + int ack = BB_ACK(p[lo]);
> + if (e >= s) {
> + /* Yes, we can merge with a previous range */
> + if (s == a && s + sectors >= e)
> + /* new range covers old */
> + ack = acknowledged;
> + else
> + ack = ack && acknowledged;
> +
> + if (e < s + sectors)
> + e = s + sectors;
> + if (e - a <= BB_MAX_LEN) {
> + p[lo] = BB_MAKE(a, e-a, ack);
> + s = e;
> + } else {
> + /* does not all fit in one range,
> + * make p[lo] maximal
> + */
> + if (BB_LEN(p[lo]) != BB_MAX_LEN)
> + p[lo] = BB_MAKE(a, BB_MAX_LEN, ack);
> + s = a + BB_MAX_LEN;
> + }
> + sectors = e - s;
> + }
> + }
> + if (sectors && hi < bb->count) {
> + /* 'hi' points to the first range that starts after 's'.
> + * Maybe we can merge with the start of that range */
> + sector_t a = BB_OFFSET(p[hi]);
> + sector_t e = a + BB_LEN(p[hi]);
> + int ack = BB_ACK(p[hi]);
> + if (a <= s + sectors) {
> + /* merging is possible */
> + if (e <= s + sectors) {
> + /* full overlap */
> + e = s + sectors;
> + ack = acknowledged;
> + } else
> + ack = ack && acknowledged;
> +
> + a = s;
> + if (e - a <= BB_MAX_LEN) {
> + p[hi] = BB_MAKE(a, e-a, ack);
> + s = e;
> + } else {
> + p[hi] = BB_MAKE(a, BB_MAX_LEN, ack);
> + s = a + BB_MAX_LEN;
> + }
> + sectors = e - s;
> + lo = hi;
> + hi++;
> + }
> + }
> + if (sectors == 0 && hi < bb->count) {
> + /* we might be able to combine lo and hi */
> + /* Note: 's' is at the end of 'lo' */
> + sector_t a = BB_OFFSET(p[hi]);
> + int lolen = BB_LEN(p[lo]);
> + int hilen = BB_LEN(p[hi]);
> + int newlen = lolen + hilen - (s - a);
> + if (s >= a && newlen < BB_MAX_LEN) {
> + /* yes, we can combine them */
> + int ack = BB_ACK(p[lo]) && BB_ACK(p[hi]);
> + p[lo] = BB_MAKE(BB_OFFSET(p[lo]), newlen, ack);
> + memmove(p + hi, p + hi + 1,
> + (bb->count - hi - 1) * 8);
> + bb->count--;
> + }
> + }
> + while (sectors) {
> + /* didn't merge (it all).
> + * Need to add a range just before 'hi' */
> + if (bb->count >= MD_MAX_BADBLOCKS) {
> + /* No room for more */
> + rv = 0;
> + break;
> + } else {
> + int this_sectors = sectors;
> + memmove(p + hi + 1, p + hi,
> + (bb->count - hi) * 8);
> + bb->count++;
> +
> + if (this_sectors > BB_MAX_LEN)
> + this_sectors = BB_MAX_LEN;
> + p[hi] = BB_MAKE(s, this_sectors, acknowledged);
> + sectors -= this_sectors;
> + s += this_sectors;
> + }
> + }
> +
> + bb->changed = 1;
> + rcu_assign_pointer(bb->active_page, bb->page);
> + spin_unlock_irq(&bb->lock);

md_write_unlock_bb(bb);
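
with the matching unlock being trivial (sketch again):

static void md_write_unlock_bb(struct badblocks *bb)
{
	/* re-publish the page for lock-free readers, drop the lock */
	rcu_assign_pointer(bb->active_page, bb->page);
	spin_unlock_irq(&bb->lock);
}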


> +
> + return rv;
> +}
> +
> +int rdev_set_badblocks(mdk_rdev_t *rdev, sector_t s, int sectors,
> + int acknowledged)
> +{
> + int rv = md_set_badblocks(&rdev->badblocks,
> + s + rdev->data_offset, sectors, acknowledged);
> + if (rv) {
> + /* Make sure they get written out promptly */
> + set_bit(MD_CHANGE_CLEAN, &rdev->mddev->flags);
> + md_wakeup_thread(rdev->mddev->thread);
> + }
> + return rv;
> +}
> +EXPORT_SYMBOL_GPL(rdev_set_badblocks);

I think it would be better if all exported functions in md.c have
prefixed 'md_'.


> +
> +/*
> + * Remove a range of bad blocks from the table.
> + * This may involve extending the table if we split a region,
> + * but it must not fail. So if the table becomes full, we just
> + * drop the remove request.
> + */
> +static int md_clear_badblocks(struct badblocks *bb, sector_t s, int sectors)
> +{
> + u64 *p;
> + int lo, hi;
> + sector_t target = s + sectors;
> + int rv = 0;
> +
> + if (bb->shift > 0) {
> + /* When clearing we round the start up and the end down.
> + * This should not matter as the shift should align with
> + * the block size and no rounding should ever be needed.
> + * However it is better to think a block is bad when it
> + * isn't than to think a block is not bad when it is.
> + */
> + s += (1<<bb->shift) - 1;
> + s >>= bb->shift;
> + target >>= bb->shift;
> + sectors = target - s;
> + }
> +
> + while (1) {
> + rcu_assign_pointer(bb->active_page, NULL);
> + synchronize_rcu();
> + spin_lock_irq(&bb->lock);
> + if (bb->active_page == NULL)
> + break;
> + /* someone else just unlocked, better retry */
> + spin_unlock_irq(&bb->lock);
> + }
> + /* now have exclusive access to the page */
> +
> + p = bb->page;
> + lo = 0;
> + hi = bb->count;
> + /* Find the last range that starts before 'target' */
> + while (hi - lo > 1) {
> + int mid = (lo + hi) / 2;
> + sector_t a = BB_OFFSET(p[mid]);
> + if (a < target)
> + lo = mid;
> + else
> + hi = mid;
> + }
> + if (hi > lo) {
> + /* p[lo] is the last range that could overlap the
> + * current range. Earlier ranges could also overlap,
> + * but only this one can overlap the end of the range.
> + */
> + if (BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > target) {
> + /* Partial overlap, leave the tail of this range */
> + int ack = BB_ACK(p[lo]);
> + sector_t a = BB_OFFSET(p[lo]);
> + sector_t end = a + BB_LEN(p[lo]);
> +
> + if (a < s) {
> + /* we need to split this range */
> + if (bb->count >= MD_MAX_BADBLOCKS) {
> + rv = 0;
> + goto out;
> + }
> + memmove(p+lo+1, p+lo, (bb->count - lo) * 8);
> + bb->count++;
> + p[lo] = BB_MAKE(a, s-a, ack);
> + lo++;
> + }
> + p[lo] = BB_MAKE(target, end - target, ack);
> + /* there is no longer an overlap */
> + hi = lo;
> + lo--;
> + }
> + while (lo >= 0 &&
> + BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
> + /* This range does overlap */
> + if (BB_OFFSET(p[lo]) < s) {
> + /* Keep the early parts of this range. */
> + int ack = BB_ACK(p[lo]);
> + sector_t start = BB_OFFSET(p[lo]);
> + p[lo] = BB_MAKE(start, s - start, ack);
> + /* now low doesn't overlap, so.. */
> + break;
> + }
> + lo--;
> + }
> + /* 'lo' is strictly before, 'hi' is strictly after,
> + * anything between needs to be discarded
> + */
> + if (hi - lo > 1) {
> + memmove(p+lo+1, p+hi, (bb->count - hi) * 8);
> + bb->count -= (hi - lo - 1);
> + }
> + }
> +
> + bb->changed = 1;
> +out:
> + rcu_assign_pointer(bb->active_page, bb->page);
> + spin_unlock_irq(&bb->lock);
> + return rv;
> +}
> +
> +int rdev_clear_badblocks(mdk_rdev_t *rdev, sector_t s, int sectors)
> +{
> + return md_clear_badblocks(&rdev->badblocks,
> + s + rdev->data_offset,
> + sectors);
> +}
> +EXPORT_SYMBOL_GPL(rdev_clear_badblocks);

Same here.

Thanks.


> +
> +/*
> + * Acknowledge all bad blocks in a list.
> + * This only succeeds if ->changed is clear. It is used by
> + * in-kernel metadata updates
> + */
> +void md_ack_all_badblocks(struct badblocks *bb)
> +{
> + if (bb->page == NULL || bb->changed)
> + /* no point even trying */
> + return;
> + while (1) {
> + rcu_assign_pointer(bb->active_page, NULL);
> + synchronize_rcu();
> + spin_lock_irq(&bb->lock);
> + if (bb->active_page == NULL)
> + break;
> + /* someone else just unlocked, better retry */
> + spin_unlock_irq(&bb->lock);
> + }
> + /* now have exclusive access to the page */
> +
> + if (bb->changed == 0) {
> + u64 *p = bb->page;
> + int i;
> + for (i = 0; i < bb->count ; i++) {
> + if (!BB_ACK(p[i])) {
> + sector_t start = BB_OFFSET(p[i]);
> + int len = BB_LEN(p[i]);
> + p[i] = BB_MAKE(start, len, 1);
> + }
> + }
> + }
> + rcu_assign_pointer(bb->active_page, bb->page);
> + spin_unlock_irq(&bb->lock);
> +}
> +EXPORT_SYMBOL_GPL(md_ack_all_badblocks);
> +
> static int md_notify_reboot(struct notifier_block *this,
> unsigned long code, void *x)
> {
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 7d906a9..d327734 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -29,6 +29,13 @@
> typedef struct mddev_s mddev_t;
> typedef struct mdk_rdev_s mdk_rdev_t;
>
> +/* Bad block numbers are stored sorted in a single page.
> + * 64bits is used for each block or extent.
> + * 54 bits are sector number, 9 bits are extent size,
> + * 1 bit is an 'acknowledged' flag.
> + */
> +#define MD_MAX_BADBLOCKS (PAGE_SIZE/8)
> +
> /*
> * MD's 'extended' device
> */
> @@ -111,8 +118,48 @@ struct mdk_rdev_s
>
> struct sysfs_dirent *sysfs_state; /* handle for 'state'
> * sysfs entry */
> +
> + struct badblocks {
> + int count; /* count of bad blocks */
> + int shift; /* shift from sectors to block size
> + * a -ve shift means badblocks are
> + * disabled.*/
> + u64 *page; /* badblock list */
> + u64 *active_page; /* either 'page' or 'NULL' */
> + int changed;
> + spinlock_t lock;
> + } badblocks;
> };
>
> +#define BB_LEN_MASK (0x00000000000001FFULL)
> +#define BB_OFFSET_MASK (0x7FFFFFFFFFFFFE00ULL)
> +#define BB_ACK_MASK (0x8000000000000000ULL)
> +#define BB_MAX_LEN 512
> +#define BB_OFFSET(x) (((x) & BB_OFFSET_MASK) >> 9)
> +#define BB_LEN(x) (((x) & BB_LEN_MASK) + 1)
> +#define BB_ACK(x) (!!((x) & BB_ACK_MASK))
> +#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
> +
> +extern int md_is_badblock(struct badblocks *bb, sector_t s, int sectors,
> + sector_t *first_bad, int *bad_sectors);
> +static inline int is_badblock(mdk_rdev_t *rdev, sector_t s, int sectors,
> + sector_t *first_bad, int *bad_sectors)
> +{
> + if (unlikely(rdev->badblocks.count)) {
> + int rv = md_is_badblock(&rdev->badblocks, rdev->data_offset + s,
> + sectors,
> + first_bad, bad_sectors);
> + if (rv)
> + *first_bad -= rdev->data_offset;
> + return rv;
> + }
> + return 0;
> +}
> +extern int rdev_set_badblocks(mdk_rdev_t *rdev, sector_t s, int sectors,
> + int acknowledged);
> +extern int rdev_clear_badblocks(mdk_rdev_t *rdev, sector_t s, int sectors);
> +extern void md_ack_all_badblocks(struct badblocks *bb);
> +
> struct mddev_s
> {
> void *private;
> @@ -517,7 +564,7 @@ extern void mddev_init(mddev_t *mddev);
> extern int md_run(mddev_t *mddev);
> extern void md_stop(mddev_t *mddev);
> extern void md_stop_writes(mddev_t *mddev);
> -extern void md_rdev_init(mdk_rdev_t *rdev);
> +extern int md_rdev_init(mdk_rdev_t *rdev);
>
> extern void mddev_suspend(mddev_t *mddev);
> extern void mddev_resume(mddev_t *mddev);
>
>

Re: [md PATCH 02/36] md/bad-block-log: add sysfs interface for accessing bad-block-log.

am 22.07.2011 17:43:22 von Namhyung Kim

NeilBrown writes:

> This can show the log (providing it fits in one page) and
> allows bad blocks to be 'acknowledged' meaning that they
> have safely been recorded in metadata.
>
> Clearing bad blocks is not allowed via sysfs (except for
> code testing). A bad block can only be cleared when
> a write to the block succeeds.
>
> Signed-off-by: NeilBrown
> ---
>
> drivers/md/md.c | 127 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 127 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 220fadb..9324635 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -2712,6 +2712,35 @@ static ssize_t recovery_start_store(mdk_rdev_t *rdev, const char *buf, size_t le
> static struct rdev_sysfs_entry rdev_recovery_start =
> __ATTR(recovery_start, S_IRUGO|S_IWUSR, recovery_start_show, recovery_start_store);
>
> +
> +static ssize_t
> +badblocks_show(struct badblocks *bb, char *page, int unack);
> +static ssize_t
> +badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack);
> +
> +static ssize_t bb_show(mdk_rdev_t *rdev, char *page)
> +{
> + return badblocks_show(&rdev->badblocks, page, 0);
> +}
> +static ssize_t bb_store(mdk_rdev_t *rdev, const char *page, size_t len)
> +{
> + return badblocks_store(&rdev->badblocks, page, len, 0);
> +}
> +static struct rdev_sysfs_entry rdev_bad_blocks =
> +__ATTR(bad_blocks, S_IRUGO|S_IWUSR, bb_show, bb_store);
> +
> +
> +static ssize_t ubb_show(mdk_rdev_t *rdev, char *page)
> +{
> + return badblocks_show(&rdev->badblocks, page, 1);
> +}
> +static ssize_t ubb_store(mdk_rdev_t *rdev, const char *page, size_t len)
> +{
> + return badblocks_store(&rdev->badblocks, page, len, 1);
> +}
> +static struct rdev_sysfs_entry rdev_unack_bad_blocks =
> +__ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
> +
> static struct attribute *rdev_default_attrs[] = {
> &rdev_state.attr,
> &rdev_errors.attr,
> @@ -2719,6 +2748,8 @@ static struct attribute *rdev_default_attrs[] = {
> &rdev_offset.attr,
> &rdev_size.attr,
> &rdev_recovery_start.attr,
> + &rdev_bad_blocks.attr,
> + &rdev_unack_bad_blocks.attr,
> NULL,
> };
> static ssize_t
> @@ -7775,6 +7806,102 @@ void md_ack_all_badblocks(struct badblocks *bb)
> }
> EXPORT_SYMBOL_GPL(md_ack_all_badblocks);
>
> +/* sysfs access to bad-blocks list.
> + * We present two files.
> + * 'bad-blocks' lists sector numbers and lengths of ranges that
> + * are recorded as bad. The list is truncated to fit within
> + * the one-page limit of sysfs.
> + * Writing "sector length" to this file adds an acknowledged
> + * bad block list.
> + * 'unacknowledged-bad-blocks' lists bad blocks that have not yet
> + * been acknowledged. Writing to this file adds bad blocks
> + * without acknowledging them. This is largely for testing.
> + *
> + */
> +
> +static ssize_t
> +badblocks_show(struct badblocks *bb, char *page, int unack)
> +{
> + size_t len = 0;
> + int i;
> + u64 *p;
> + int havelock = 0;
> +
> + if (bb->shift < 0)
> + return 0;
> +
> + rcu_read_lock();
> + p = rcu_dereference(bb->active_page);
> + if (!p) {
> + spin_lock_irq(&bb->lock);
> + p = bb->page;
> + havelock = 1;
> + }
> +
> + i = 0;
> +
> + while (len < PAGE_SIZE && i < bb->count) {
> + sector_t s = BB_OFFSET(p[i]);
> + unsigned int length = BB_LEN(p[i]);
> + int ack = BB_ACK(p[i]);
> + i++;
> +
> + if (unack && ack)
> + continue;
> +
> + len += snprintf(page+len, PAGE_SIZE-len, "%llu %u\n",
> + (unsigned long long)s << bb->shift,
> + length << bb->shift);
> + }
> +
> + if (havelock)
> + spin_unlock_irq(&bb->lock);
> + rcu_read_unlock();
> +
> + return len;
> +}
> +
> +#define DO_DEBUG 1
> +
> +static ssize_t
> +badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack)
> +{
> + unsigned long long sector;
> + int length;
> + char newline;
> +#ifdef DO_DEBUG
> + /* Allow clearing via sysfs *only* for testing/debugging.
> + * Normally only a successful write may clear a badblock
> + */
> + int clear = 0;
> + if (page[0] == '-') {
> + clear = 1;
> + page++;
> + }
> +#endif /* DO_DEBUG */
> +
> + switch (sscanf(page, "%llu %d%c", &sector, &length, &newline)) {

What if user provides negative 'length' here? Should we check that case?
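
Something like this, perhaps (a sketch only; the exact limit is
debatable):

	if (length <= 0)
		/* reject zero-length and negative ranges -
		 * sscanf with %d will happily parse "-8" */
		return -EINVAL;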


> + case 3:
> + if (newline != '\n')
> + return -EINVAL;
> + case 2:
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> +#ifdef DO_DEBUG
> + if (clear) {
> + md_clear_badblocks(bb, sector, length);
> + return len;
> + }
> +#endif /* DO_DEBUG */
> + if (md_set_badblocks(bb, sector, length, !unack))
> + return len;
> + else
> + return -ENOSPC;
> +}
> +
> static int md_notify_reboot(struct notifier_block *this,
> unsigned long code, void *x)
> {
>
>

Re: [md PATCH 03/36] md: don't allow arrays to contain devices with bad blocks.

am 22.07.2011 17:47:38 von Namhyung Kim

NeilBrown writes:

> As no personality understand bad block lists yet, we must
> reject any device that is known to contain bad blocks.
> As the personalities get taught, these tests can be removed.
>
> This only applies to raid1/raid5/raid10.
> For linear/raid0/multipath/faulty the whole concept of bad blocks
> doesn't mean anything so there is no point adding the checks.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim

Re: [md PATCH 04/36] md: load/store badblock list from v1.x metadata

am 22.07.2011 18:34:47 von Namhyung Kim

NeilBrown writes:

> Space must have been allocated when array was created.
> A feature flag is set when the badblock list is non-empty, to
> ensure old kernels don't load and trust the whole device.
>
> We only update the on-disk badblocklist when it has changed.
> If the badblocklist (or other metadata) is stored on a bad block, we
> don't cope very well.
>
> If metadata has no room for bad block, flag bad-blocks as disabled,
> and do the same for 0.90 metadata.
>
> Signed-off-by: NeilBrown
> ---
>
> drivers/md/md.c | 111 +++++++++++++++++++++++++++++++++++++++++++--
> drivers/md/md.h | 5 ++
> include/linux/raid/md_p.h | 14 ++++--
> 3 files changed, 119 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 9324635..18c3aab 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -757,6 +757,10 @@ static void free_disk_sb(mdk_rdev_t * rdev)
> rdev->sb_start = 0;
> rdev->sectors = 0;
> }
> + if (rdev->bb_page) {
> + put_page(rdev->bb_page);
> + rdev->bb_page = NULL;
> + }
> }
>
>
> @@ -1395,6 +1399,8 @@ static __le32 calc_sb_1_csum(struct mdp_superblock_1 * sb)
> return cpu_to_le32(csum);
> }
>
> +static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors,
> + int acknowledged);
> static int super_1_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version)
> {
> struct mdp_superblock_1 *sb;
> @@ -1473,6 +1479,47 @@ static int super_1_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version)
> else
> rdev->desc_nr = le32_to_cpu(sb->dev_number);
>
> + if (!rdev->bb_page) {
> + rdev->bb_page = alloc_page(GFP_KERNEL);
> + if (!rdev->bb_page)
> + return -ENOMEM;
> + }

This will allocate ->bb_page for unsupported arrays too. Checking
->bblog_offset here might be helpful.
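
E.g. (sketch, untested):

	if (sb->bblog_offset && !rdev->bb_page) {
		rdev->bb_page = alloc_page(GFP_KERNEL);
		if (!rdev->bb_page)
			return -ENOMEM;
	}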


> + if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_BAD_BLOCKS) &&
> + rdev->badblocks.count == 0) {
> + /* need to load the bad block list.
> + * Currently we limit it to one page.
> + */
> + s32 offset;
> + sector_t bb_sector;
> + u64 *bbp;
> + int i;
> + int sectors = le16_to_cpu(sb->bblog_size);
> + if (sectors > (PAGE_SIZE / 512))
> + return -EINVAL;
> + offset = le32_to_cpu(sb->bblog_offset);
> + if (offset == 0)
> + return -EINVAL;
> + bb_sector = (long long)offset;
> + if (!sync_page_io(rdev, bb_sector, sectors << 9,
> + rdev->bb_page, READ, true))
> + return -EIO;
> + bbp = (u64 *)page_address(rdev->bb_page);

Unnecessary cast.


> + rdev->badblocks.shift = sb->bblog_shift;
> + for (i = 0 ; i < (sectors << (9-3)) ; i++, bbp++) {
> + u64 bb = le64_to_cpu(*bbp);
> + int count = bb & (0x3ff);
> + u64 sector = bb >> 10;
> + sector <<= sb->bblog_shift;
> + count <<= sb->bblog_shift;
> + if (bb + 1 == 0)
> + break;

This code probably needs comment.
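
Presumably it is the end-of-list check: super_1_sync() below pads the
page with 0xff (the memset), so an all-ones entry means "no more
entries". Maybe something like:

	if (bb + 1 == 0)
		/* the page is padded with 0xff; an all-ones
		 * entry marks the end of the on-disk list */
		break;

and it would arguably read better placed before the decoding of
'count' and 'sector'.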


> + if (md_set_badblocks(&rdev->badblocks,
> + sector, count, 1) == 0)
> + return -EINVAL;
> + }
> + } else if (sb->bblog_offset == 0)
> + rdev->badblocks.shift = -1;

->badblocks.page can be freed as well.
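
e.g. (sketch):

	} else if (sb->bblog_offset == 0) {
		rdev->badblocks.shift = -1;
		kfree(rdev->badblocks.page);
		rdev->badblocks.page = NULL;
		rdev->badblocks.active_page = NULL;
	}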


> +
> if (!refdev) {
> ret = 1;
> } else {
> @@ -1624,7 +1671,6 @@ static void super_1_sync(mddev_t *mddev, mdk_rdev_t *rdev)
> sb->pad0 = 0;
> sb->recovery_offset = cpu_to_le64(0);
> memset(sb->pad1, 0, sizeof(sb->pad1));
> - memset(sb->pad2, 0, sizeof(sb->pad2));
> memset(sb->pad3, 0, sizeof(sb->pad3));
>
> sb->utime = cpu_to_le64((__u64)mddev->utime);
> @@ -1664,6 +1710,43 @@ static void super_1_sync(mddev_t *mddev, mdk_rdev_t *rdev)
> sb->new_chunk = cpu_to_le32(mddev->new_chunk_sectors);
> }
>
> + if (rdev->badblocks.count == 0)
> + /* Nothing to do for bad blocks*/ ;
> + else if (sb->bblog_offset == 0)
> + /* Cannot record bad blocks on this device */
> + md_error(mddev, rdev);
> + else {
> + int havelock = 0;
> + struct badblocks *bb = &rdev->badblocks;
> + u64 *bbp = (u64 *)page_address(rdev->bb_page);

Unnecessary cast too.


> + u64 *p;
> + sb->feature_map |= cpu_to_le32(MD_FEATURE_BAD_BLOCKS);
> + if (bb->changed) {
> + memset(bbp, 0xff, PAGE_SIZE);
> +
> + rcu_read_lock();
> + p = rcu_dereference(bb->active_page);
> + if (!p) {
> + spin_lock_irq(&bb->lock);
> + p = bb->page;
> + havelock = 1;
> + }
> + for (i = 0 ; i < bb->count ; i++) {
> + u64 internal_bb = *p++;
> + u64 store_bb = ((BB_OFFSET(internal_bb) << 10)
> + | BB_LEN(internal_bb));
> + *bbp++ = cpu_to_le64(store_bb);
> + }
> + bb->sector = (rdev->sb_start +
> + (int)le32_to_cpu(sb->bblog_offset));
> + bb->size = le16_to_cpu(sb->bblog_size);
> + bb->changed = 0;
> + if (havelock)
> + spin_unlock_irq(&bb->lock);
> + rcu_read_unlock();
> + }
> + }
> +
> max_dev = 0;
> list_for_each_entry(rdev2, &mddev->disks, same_set)
> if (rdev2->desc_nr+1 > max_dev)
> @@ -2197,6 +2280,7 @@ static void md_update_sb(mddev_t * mddev, int force_change)
> mdk_rdev_t *rdev;
> int sync_req;
> int nospares = 0;
> + int any_badblocks_changed = 0;
>
> repeat:
> /* First make sure individual recovery_offsets are correct */
> @@ -2268,6 +2352,11 @@ repeat:
> MD_BUG();
> mddev->events --;
> }
> +
> + list_for_each_entry(rdev, &mddev->disks, same_set)
> + if (rdev->badblocks.changed)
> + any_badblocks_changed++;
> +
> sync_sbs(mddev, nospares);
> spin_unlock_irq(&mddev->write_lock);
>
> @@ -2293,6 +2382,13 @@ repeat:
> bdevname(rdev->bdev,b),
> (unsigned long long)rdev->sb_start);
> rdev->sb_events = mddev->events;
> + if (rdev->badblocks.size) {
> + md_super_write(mddev, rdev,
> + rdev->badblocks.sector,
> + rdev->badblocks.size << 9,
> + rdev->bb_page);
> + rdev->badblocks.size = 0;
> + }
>
> } else
> dprintk(")\n");
> @@ -2316,6 +2412,9 @@ repeat:
> if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
> sysfs_notify(&mddev->kobj, NULL, "sync_completed");
>
> + if (any_badblocks_changed)
> + list_for_each_entry(rdev, &mddev->disks, same_set)
> + md_ack_all_badblocks(&rdev->badblocks);
> }
>
> /* words written to sysfs files may, or may not, be \n terminated.
> @@ -2823,6 +2922,8 @@ int md_rdev_init(mdk_rdev_t *rdev)
> rdev->sb_events = 0;
> rdev->last_read_error.tv_sec = 0;
> rdev->last_read_error.tv_nsec = 0;
> + rdev->sb_loaded = 0;
> + rdev->bb_page = NULL;
> atomic_set(&rdev->nr_pending, 0);
> atomic_set(&rdev->read_errors, 0);
> atomic_set(&rdev->corrected_errors, 0);
> @@ -2912,11 +3013,9 @@ static mdk_rdev_t *md_import_device(dev_t newdev, int super_format, int super_mi
> return rdev;
>
> abort_free:
> - if (rdev->sb_page) {
> - if (rdev->bdev)
> - unlock_rdev(rdev);
> - free_disk_sb(rdev);
> - }
> + if (rdev->bdev)
> + unlock_rdev(rdev);
> + free_disk_sb(rdev);
> kfree(rdev);
> return ERR_PTR(err);
> }
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index d327734..834e46b 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -55,7 +55,7 @@ struct mdk_rdev_s
> struct block_device *meta_bdev;
> struct block_device *bdev; /* block device handle */
>
> - struct page *sb_page;
> + struct page *sb_page, *bb_page;
> int sb_loaded;
> __u64 sb_events;
> sector_t data_offset; /* start of data in array */
> @@ -128,6 +128,9 @@ struct mdk_rdev_s
> u64 *active_page; /* either 'page' or 'NULL' */
> int changed;
> spinlock_t lock;
> +
> + sector_t sector;
> + sector_t size; /* in sectors */

Looks like 'int' is sufficient for 'size'. Anyway md_super_write()
treats it as int.


> } badblocks;
> };
>
> diff --git a/include/linux/raid/md_p.h b/include/linux/raid/md_p.h
> index 75cbf4f..9e65d9e 100644
> --- a/include/linux/raid/md_p.h
> +++ b/include/linux/raid/md_p.h
> @@ -245,10 +245,16 @@ struct mdp_superblock_1 {
> __u8 device_uuid[16]; /* user-space setable, ignored by kernel */
> __u8 devflags; /* per-device flags. Only one defined...*/
> #define WriteMostly1 1 /* mask for writemostly flag in above */
> - __u8 pad2[64-57]; /* set to 0 when writing */
> + /* Bad block log. If there are any bad blocks the feature flag is set.
> + * If offset and size are non-zero, that space is reserved and available
> + */
> + __u8 bblog_shift; /* shift from sectors to block size */
> + __le16 bblog_size; /* number of sectors reserved for list */
> + __le32 bblog_offset; /* sector offset from superblock to bblog,
> + * signed - not unsigned */
>
> /* array state information - 64 bytes */
> - __le64 utime; /* 40 bits second, 24 btes microseconds */
> + __le64 utime; /* 40 bits second, 24 bits microseconds */
> __le64 events; /* incremented when superblock updated */
> __le64 resync_offset; /* data before this offset (from data_offset) known to be in sync */
> __le32 sb_csum; /* checksum up to devs[max_dev] */
> @@ -270,8 +276,8 @@ struct mdp_superblock_1 {
> * must be honoured
> */
> #define MD_FEATURE_RESHAPE_ACTIVE 4
> +#define MD_FEATURE_BAD_BLOCKS 8 /* badblock list is not empty */
>
> -#define MD_FEATURE_ALL (1|2|4)
> +#define MD_FEATURE_ALL (1|2|4|8)
>
> #endif
> -
>
>

Re: [md PATCH 01/36] md: beginnings of bad block management.

am 22.07.2011 18:52:22 von Namhyung Kim

NeilBrown writes:

> This is the first step in allowing md to track bad-blocks per-device so
> that we can fail individual blocks rather than the whole device.
>
> This patch just adds a data structure for recording bad blocks, with
> routines to add, remove, search the list.
>
> Signed-off-by: NeilBrown
> ---
>
> drivers/md/md.c | 457 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> drivers/md/md.h | 49 ++++++
> 2 files changed, 502 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 2a32050..220fadb 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1952,6 +1952,10 @@ static void unbind_rdev_from_array(mdk_rdev_t * rdev)
> sysfs_remove_link(&rdev->kobj, "block");
> sysfs_put(rdev->sysfs_state);
> rdev->sysfs_state = NULL;
> + kfree(rdev->badblocks.page);
> + rdev->badblocks.count = 0;
> + rdev->badblocks.page = NULL;
> + rdev->badblocks.active_page = NULL;
> /* We need to delay this, otherwise we can deadlock when
> * writing to 'remove' to "dev/state". We also need
> * to delay it due to rcu usage.
> @@ -2778,7 +2782,7 @@ static struct kobj_type rdev_ktype = {
> .default_attrs = rdev_default_attrs,
> };
>
> -void md_rdev_init(mdk_rdev_t *rdev)
> +int md_rdev_init(mdk_rdev_t *rdev)
> {
> rdev->desc_nr = -1;
> rdev->saved_raid_disk = -1;
> @@ -2794,6 +2798,20 @@ void md_rdev_init(mdk_rdev_t *rdev)
>
> INIT_LIST_HEAD(&rdev->same_set);
> init_waitqueue_head(&rdev->blocked_wait);
> +
> + /* Add space to store bad block list.
> + * This reserves the space even on arrays where it cannot
> + * be used - I wonder if that matters
> + */
> + rdev->badblocks.count = 0;
> + rdev->badblocks.shift = 0;
> + rdev->badblocks.page = kmalloc(PAGE_SIZE, GFP_KERNEL);
> + rdev->badblocks.active_page = rdev->badblocks.page;
> + spin_lock_init(&rdev->badblocks.lock);
> + if (rdev->badblocks.page == NULL)
> + return -ENOMEM;
> +
> + return 0;
> }
> EXPORT_SYMBOL_GPL(md_rdev_init);
> /*
> @@ -2819,8 +2837,11 @@ static mdk_rdev_t *md_import_device(dev_t newdev, int super_format, int super_mi
> return ERR_PTR(-ENOMEM);
> }
>
> - md_rdev_init(rdev);
> - if ((err = alloc_disk_sb(rdev)))
> + err = md_rdev_init(rdev);
> + if (err)
> + goto abort_free;
> + err = alloc_disk_sb(rdev);
> + if (err)
> goto abort_free;
>
> err = lock_rdev(rdev, newdev, super_format == -2);

Now error path at abort_free needs to free rdev->badblocks.page
otherwise there will be memory leaks.
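
E.g. (sketch):

 abort_free:
	if (rdev->bdev)
		unlock_rdev(rdev);
	free_disk_sb(rdev);
	kfree(rdev->badblocks.page);
	kfree(rdev);
	return ERR_PTR(err);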

Re: [md PATCH 05/36] md: Disable bad blocks and v0.90 metadata.

am 22.07.2011 19:02:50 von Namhyung Kim

NeilBrown writes:

> v0.90 metadata cannot record bad blocks, so when loading metadata
> for such a device, set shift to -1.
>
> Signed-off-by: NeilBrown
> ---
>
> drivers/md/md.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 18c3aab..340e2d4 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1058,6 +1058,7 @@ static int super_90_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version
> rdev->preferred_minor = sb->md_minor;
> rdev->data_offset = 0;
> rdev->sb_size = MD_SB_BYTES;
> + rdev->badblocks.shift = -1;
>
> if (sb->level == LEVEL_MULTIPATH)
> rdev->desc_nr = -1;
> @@ -3009,6 +3010,9 @@ static mdk_rdev_t *md_import_device(dev_t newdev, int super_format, int super_mi
> goto abort_free;
> }
> }
> + if (super_format == -1)
> + /* hot-add for 0.90, or non-persistent: so no badblocks */
> + rdev->badblocks.shift = -1;

Maybe we need this as well:

if (rdev->badblocks.shift == -1) {
kfree(rdev->badblocks.page);
rdev->badblocks.page = NULL;
rdev->badblocks.active_page = NULL;
rdev->badblocks.count = 0;
}


>
> return rdev;
>
>
>

Re: [md PATCH 01/36] md: beginnings of bad block management.

am 26.07.2011 04:26:27 von NeilBrown

On Sat, 23 Jul 2011 00:03:45 +0900 Namhyung Kim wrote:

> NeilBrown writes:
>
> > This the first step in allowing md to track bad-blocks per-device so
> > that we can fail individual blocks rather than the whole device.
> >
> > This patch just adds a data structure for recording bad blocks, with
> > routines to add, remove, search the list.
> >
> > Signed-off-by: NeilBrown
> > ---
> >
> > drivers/md/md.c | 457 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > drivers/md/md.h | 49 ++++++
> > 2 files changed, 502 insertions(+), 4 deletions(-)
> >
> > +
> > +/* Bad block management.
> > + * We can record which blocks on each device are 'bad' and so just
> > + * fail those blocks, or that stripe, rather than the whole device.
> > + * Entries in the bad-block table are 64bits wide. This comprises:
> > + * Length of bad-range, in sectors: 0-511 for lengths 1-512
> > + * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
> > + * A 'shift' can be set so that larger blocks are tracked and
> > + * consequently larger devices can be covered.
> > + * 'Acknowledged' flag - 1 bit. - the most significant bit.
> > + */
> > +/* Locking of the bad-block table is a two-layer affair.
> > + * Read access through ->active_page only requires an rcu_readlock.
> > + * However if ->active_page is found to be NULL, the table
> > + * should be accessed through ->page which requires an irq-spinlock.
> > + * Updating the page requires setting ->active_page to NULL,
> > + * synchronising with rcu, then updating ->page under the same
> > + * irq-spinlock.
> > + * We always set or clear bad blocks from process context, but
> > + * might look-up bad blocks from interrupt/bh context.
> > + *
>
> Empty line.
>
> If the locking is complex, it'd be better defining separate functions to
> deal with it, IMHO. Please see below.

I too have been feeling uncomfortable about the locking and I recently
realised that I really should be using a seqlock rather than trying to force
RCU into the mould. So I have changed it and it is much better now. Below
is the new version.
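
(For anyone following along: a seqlock reader is lock-free and simply
retries if a writer raced with it, which suits a table that is searched
often but updated rarely - roughly:

	unsigned seq;
	do {
		seq = read_seqbegin(&bb->lock);
		/* search bb->page here; the result is only
		 * trusted if no update intervened */
	} while (read_seqretry(&bb->lock, seq));

which is essentially the shape md_is_badblock takes in the version
below.)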


> > +EXPORT_SYMBOL_GPL(rdev_set_badblocks);
>
> I think it would be better if all exported functions in md.c have
> prefixed 'md_'.
>

Probably good advice. I don't think I'll change it now but maybe in a
subsequent patch so that I change them all at once.

Thanks,
NeilBrown

commit 0980048be17a45ae9e181ad04a549c31a499dee9
Author: NeilBrown
Date: Tue Jul 26 12:22:08 2011 +1000

md: beginnings of bad block management.

This is the first step in allowing md to track bad-blocks per-device so
that we can fail individual blocks rather than the whole device.

This patch just adds a data structure for recording bad blocks, with
routines to add, remove, search the list.

Signed-off-by: NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 7caa096..a9b853b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1952,6 +1952,9 @@ static void unbind_rdev_from_array(mdk_rdev_t * rdev)
sysfs_remove_link(&rdev->kobj, "block");
sysfs_put(rdev->sysfs_state);
rdev->sysfs_state = NULL;
+ kfree(rdev->badblocks.page);
+ rdev->badblocks.count = 0;
+ rdev->badblocks.page = NULL;
/* We need to delay this, otherwise we can deadlock when
* writing to 'remove' to "dev/state". We also need
* to delay it due to rcu usage.
@@ -2778,7 +2781,7 @@ static struct kobj_type rdev_ktype = {
.default_attrs = rdev_default_attrs,
};

-void md_rdev_init(mdk_rdev_t *rdev)
+int md_rdev_init(mdk_rdev_t *rdev)
{
rdev->desc_nr = -1;
rdev->saved_raid_disk = -1;
@@ -2794,6 +2797,19 @@ void md_rdev_init(mdk_rdev_t *rdev)

INIT_LIST_HEAD(&rdev->same_set);
init_waitqueue_head(&rdev->blocked_wait);
+
+ /* Add space to store bad block list.
+ * This reserves the space even on arrays where it cannot
+ * be used - I wonder if that matters
+ */
+ rdev->badblocks.count = 0;
+ rdev->badblocks.shift = 0;
+ rdev->badblocks.page = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ seqlock_init(&rdev->badblocks.lock);
+ if (rdev->badblocks.page == NULL)
+ return -ENOMEM;
+
+ return 0;
}
EXPORT_SYMBOL_GPL(md_rdev_init);
/*
@@ -2819,8 +2835,11 @@ static mdk_rdev_t *md_import_device(dev_t newdev, int super_format, int super_mi
return ERR_PTR(-ENOMEM);
}

- md_rdev_init(rdev);
- if ((err = alloc_disk_sb(rdev)))
+ err = md_rdev_init(rdev);
+ if (err)
+ goto abort_free;
+ err = alloc_disk_sb(rdev);
+ if (err)
goto abort_free;

err = lock_rdev(rdev, newdev, super_format == -2);
@@ -7326,6 +7345,395 @@ void md_wait_for_blocked_rdev(mdk_rdev_t *rdev, mddev_t *mddev)
}
EXPORT_SYMBOL(md_wait_for_blocked_rdev);

+
+/* Bad block management.
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than the whole device.
+ * Entries in the bad-block table are 64bits wide. This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ * A 'shift' can be set so that larger blocks are tracked and
+ * consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ *
+ * Locking of the bad-block table uses a seqlock so md_is_badblock
+ * might need to retry if it is very unlucky.
+ * We will sometimes want to check for bad blocks in a bi_end_io function,
+ * so we use the write_seqlock_irq variant.
+ *
+ * When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad. So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ * We return
+ * 0 if there are no known bad blocks in the range
+ * 1 if there are known bad block which are all acknowledged
+ * -1 if there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+int md_is_badblock(struct badblocks *bb, sector_t s, int sectors,
+ sector_t *first_bad, int *bad_sectors)
+{
+ int hi;
+ int lo = 0;
+ u64 *p = bb->page;
+ int rv = 0;
+ sector_t target = s + sectors;
+ unsigned seq;
+
+ if (bb->shift > 0) {
+ /* round the start down, and the end up */
+ s >>= bb->shift;
+ target += (1<<bb->shift) - 1;
+ target >>= bb->shift;
+ sectors = target - s;
+ }
+ /* 'target' is now the first block after the bad range */
+
+retry:
+ seq = read_seqbegin(&bb->lock);
+
+ hi = bb->count;
+
+ /* Binary search between lo and hi for 'target'
+ * i.e. for the last range that starts before 'target'
+ */
+ /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+ * are known not to be the last range before target.
+ * VARIANT: hi-lo is the number of possible
+ * ranges, and decreases until it reaches 1
+ */
+ while (hi - lo > 1) {
+ int mid = (lo + hi) / 2;
+ sector_t a = BB_OFFSET(p[mid]);
+ if (a < target)
+ /* This could still be the one, earlier ranges
+ * could not. */
+ lo = mid;
+ else
+ /* This and later ranges are definitely out. */
+ hi = mid;
+ }
+ /* 'lo' might be the last that started before target, but 'hi' isn't */
+ if (hi > lo) {
+ /* need to check all ranges that end after 's' to see if
+ * any are unacknowledged.
+ */
+ while (lo >= 0 &&
+ BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
+ if (BB_OFFSET(p[lo]) < target) {
+ /* starts before the end, and finishes after
+ * the start, so they must overlap
+ */
+ if (rv != -1 && BB_ACK(p[lo]))
+ rv = 1;
+ else
+ rv = -1;
+ *first_bad = BB_OFFSET(p[lo]);
+ *bad_sectors = BB_LEN(p[lo]);
+ }
+ lo--;
+ }
+ }
+
+ if (read_seqretry(&bb->lock, seq))
+ goto retry;
+
+ return rv;
+}
+EXPORT_SYMBOL_GPL(md_is_badblock);
+
+/*
+ * Add a range of bad blocks to the table.
+ * This might extend the table, or might contract it
+ * if two adjacent ranges can be merged.
+ * We binary-search to find the 'insertion' point, then
+ * decide how best to handle it.
+ */
+static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors,
+ int acknowledged)
+{
+ u64 *p;
+ int lo, hi;
+ int rv = 1;
+
+ if (bb->shift < 0)
+ /* badblocks are disabled */
+ return 0;
+
+ if (bb->shift) {
+ /* round the start down, and the end up */
+ sector_t next = s + sectors;
+ s >>= bb->shift;
+ next += (1<<bb->shift) - 1;
+ next >>= bb->shift;
+ sectors = next - s;
+ }
+
+ write_seqlock_irq(&bb->lock);
+
+ p = bb->page;
+ lo = 0;
+ hi = bb->count;
+ /* Find the last range that starts at-or-before 's' */
+ while (hi - lo > 1) {
+ int mid = (lo + hi) / 2;
+ sector_t a = BB_OFFSET(p[mid]);
+ if (a <= s)
+ lo = mid;
+ else
+ hi = mid;
+ }
+ if (hi > lo && BB_OFFSET(p[lo]) > s)
+ hi = lo;
+
+ if (hi > lo) {
+ /* we found a range that might merge with the start
+ * of our new range
+ */
+ sector_t a = BB_OFFSET(p[lo]);
+ sector_t e = a + BB_LEN(p[lo]);
+ int ack = BB_ACK(p[lo]);
+ if (e >= s) {
+ /* Yes, we can merge with a previous range */
+ if (s == a && s + sectors >= e)
+ /* new range covers old */
+ ack = acknowledged;
+ else
+ ack = ack && acknowledged;
+
+ if (e < s + sectors)
+ e = s + sectors;
+ if (e - a <= BB_MAX_LEN) {
+ p[lo] = BB_MAKE(a, e-a, ack);
+ s = e;
+ } else {
+ /* does not all fit in one range,
+ * make p[lo] maximal
+ */
+ if (BB_LEN(p[lo]) != BB_MAX_LEN)
+ p[lo] = BB_MAKE(a, BB_MAX_LEN, ack);
+ s = a + BB_MAX_LEN;
+ }
+ sectors = e - s;
+ }
+ }
+ if (sectors && hi < bb->count) {
+ /* 'hi' points to the first range that starts after 's'.
+ * Maybe we can merge with the start of that range */
+ sector_t a = BB_OFFSET(p[hi]);
+ sector_t e = a + BB_LEN(p[hi]);
+ int ack = BB_ACK(p[hi]);
+ if (a <= s + sectors) {
+ /* merging is possible */
+ if (e <= s + sectors) {
+ /* full overlap */
+ e = s + sectors;
+ ack = acknowledged;
+ } else
+ ack = ack && acknowledged;
+
+ a = s;
+ if (e - a <= BB_MAX_LEN) {
+ p[hi] = BB_MAKE(a, e-a, ack);
+ s = e;
+ } else {
+ p[hi] = BB_MAKE(a, BB_MAX_LEN, ack);
+ s = a + BB_MAX_LEN;
+ }
+ sectors = e - s;
+ lo = hi;
+ hi++;
+ }
+ }
+ if (sectors == 0 && hi < bb->count) {
+ /* we might be able to combine lo and hi */
+ /* Note: 's' is at the end of 'lo' */
+ sector_t a = BB_OFFSET(p[hi]);
+ int lolen = BB_LEN(p[lo]);
+ int hilen = BB_LEN(p[hi]);
+ int newlen = lolen + hilen - (s - a);
+ if (s >= a && newlen < BB_MAX_LEN) {
+ /* yes, we can combine them */
+ int ack = BB_ACK(p[lo]) && BB_ACK(p[hi]);
+ p[lo] = BB_MAKE(BB_OFFSET(p[lo]), newlen, ack);
+ memmove(p + hi, p + hi + 1,
+ (bb->count - hi - 1) * 8);
+ bb->count--;
+ }
+ }
+ while (sectors) {
+ /* didn't merge (it all).
+ * Need to add a range just before 'hi' */
+ if (bb->count >= MD_MAX_BADBLOCKS) {
+ /* No room for more */
+ rv = 0;
+ break;
+ } else {
+ int this_sectors = sectors;
+ memmove(p + hi + 1, p + hi,
+ (bb->count - hi) * 8);
+ bb->count++;
+
+ if (this_sectors > BB_MAX_LEN)
+ this_sectors = BB_MAX_LEN;
+ p[hi] = BB_MAKE(s, this_sectors, acknowledged);
+ sectors -= this_sectors;
+ s += this_sectors;
+ }
+ }
+
+ bb->changed = 1;
+ write_sequnlock_irq(&bb->lock);
+
+ return rv;
+}
+
+int rdev_set_badblocks(mdk_rdev_t *rdev, sector_t s, int sectors,
+ int acknowledged)
+{
+ int rv = md_set_badblocks(&rdev->badblocks,
+ s + rdev->data_offset, sectors, acknowledged);
+ if (rv) {
+ /* Make sure they get written out promptly */
+ set_bit(MD_CHANGE_CLEAN, &rdev->mddev->flags);
+ md_wakeup_thread(rdev->mddev->thread);
+ }
+ return rv;
+}
+EXPORT_SYMBOL_GPL(rdev_set_badblocks);
+
+/*
+ * Remove a range of bad blocks from the table.
+ * This may involve extending the table if we split a region,
+ * but it must not fail. So if the table becomes full, we just
+ * drop the remove request.
+ */
+static int md_clear_badblocks(struct badblocks *bb, sector_t s, int sectors)
+{
+ u64 *p;
+ int lo, hi;
+ sector_t target = s + sectors;
+ int rv = 0;
+
+ if (bb->shift > 0) {
+ /* When clearing we round the start up and the end down.
+ * This should not matter as the shift should align with
+ * the block size and no rounding should ever be needed.
+ * However it is better to think a block is bad when it
+ * isn't than to think a block is not bad when it is.
+ */
+ s += (1<<bb->shift) - 1;
+ s >>= bb->shift;
+ target >>= bb->shift;
+ sectors = target - s;
+ }
+
+ write_seqlock_irq(&bb->lock);
+
+ p = bb->page;
+ lo = 0;
+ hi = bb->count;
+ /* Find the last range that starts before 'target' */
+ while (hi - lo > 1) {
+ int mid = (lo + hi) / 2;
+ sector_t a = BB_OFFSET(p[mid]);
+ if (a < target)
+ lo = mid;
+ else
+ hi = mid;
+ }
+ if (hi > lo) {
+ /* p[lo] is the last range that could overlap the
+ * current range. Earlier ranges could also overlap,
+ * but only this one can overlap the end of the range.
+ */
+ if (BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > target) {
+ /* Partial overlap, leave the tail of this range */
+ int ack = BB_ACK(p[lo]);
+ sector_t a = BB_OFFSET(p[lo]);
+ sector_t end = a + BB_LEN(p[lo]);
+
+ if (a < s) {
+ /* we need to split this range */
+ if (bb->count >= MD_MAX_BADBLOCKS) {
+ rv = 0;
+ goto out;
+ }
+ memmove(p+lo+1, p+lo, (bb->count - lo) * 8);
+ bb->count++;
+ p[lo] = BB_MAKE(a, s-a, ack);
+ lo++;
+ }
+ p[lo] = BB_MAKE(target, end - target, ack);
+ /* there is no longer an overlap */
+ hi = lo;
+ lo--;
+ }
+ while (lo >= 0 &&
+ BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
+ /* This range does overlap */
+ if (BB_OFFSET(p[lo]) < s) {
+ /* Keep the early parts of this range. */
+ int ack = BB_ACK(p[lo]);
+ sector_t start = BB_OFFSET(p[lo]);
+ p[lo] = BB_MAKE(start, s - start, ack);
+ /* now 'lo' doesn't overlap, so.. */
+ break;
+ }
+ lo--;
+ }
+ /* 'lo' is strictly before, 'hi' is strictly after,
+ * anything between needs to be discarded
+ */
+ if (hi - lo > 1) {
+ memmove(p+lo+1, p+hi, (bb->count - hi) * 8);
+ bb->count -= (hi - lo - 1);
+ }
+ }
+
+ bb->changed = 1;
+out:
+ write_sequnlock_irq(&bb->lock);
+ return rv;
+}
+
+int rdev_clear_badblocks(mdk_rdev_t *rdev, sector_t s, int sectors)
+{
+ return md_clear_badblocks(&rdev->badblocks,
+ s + rdev->data_offset,
+ sectors);
+}
+EXPORT_SYMBOL_GPL(rdev_clear_badblocks);
+
+/*
+ * Acknowledge all bad blocks in a list.
+ * This only succeeds if ->changed is clear. It is used by
+ * in-kernel metadata updates
+ */
+void md_ack_all_badblocks(struct badblocks *bb)
+{
+ if (bb->page == NULL || bb->changed)
+ /* no point even trying */
+ return;
+ write_seqlock_irq(&bb->lock);
+
+ if (bb->changed == 0) {
+ u64 *p = bb->page;
+ int i;
+ for (i = 0; i < bb->count ; i++) {
+ if (!BB_ACK(p[i])) {
+ sector_t start = BB_OFFSET(p[i]);
+ int len = BB_LEN(p[i]);
+ p[i] = BB_MAKE(start, len, 1);
+ }
+ }
+ }
+ write_sequnlock_irq(&bb->lock);
+}
+EXPORT_SYMBOL_GPL(md_ack_all_badblocks);
+
static int md_notify_reboot(struct notifier_block *this,
unsigned long code, void *x)
{
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 7d906a9..85af843 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -29,6 +29,13 @@
typedef struct mddev_s mddev_t;
typedef struct mdk_rdev_s mdk_rdev_t;

+/* Bad block numbers are stored sorted in a single page.
+ * 64bits is used for each block or extent.
+ * 54 bits are sector number, 9 bits are extent size,
+ * 1 bit is an 'acknowledged' flag.
+ */
+#define MD_MAX_BADBLOCKS (PAGE_SIZE/8)
+
/*
* MD's 'extended' device
*/
@@ -111,8 +118,47 @@ struct mdk_rdev_s

struct sysfs_dirent *sysfs_state; /* handle for 'state'
* sysfs entry */
+
+ struct badblocks {
+ int count; /* count of bad blocks */
+ int shift; /* shift from sectors to block size
+ * a -ve shift means badblocks are
+ * disabled.*/
+ u64 *page; /* badblock list */
+ int changed;
+ seqlock_t lock;
+ } badblocks;
};

+#define BB_LEN_MASK (0x00000000000001FFULL)
+#define BB_OFFSET_MASK (0x7FFFFFFFFFFFFE00ULL)
+#define BB_ACK_MASK (0x8000000000000000ULL)
+#define BB_MAX_LEN 512
+#define BB_OFFSET(x) (((x) & BB_OFFSET_MASK) >> 9)
+#define BB_LEN(x) (((x) & BB_LEN_MASK) + 1)
+#define BB_ACK(x) (!!((x) & BB_ACK_MASK))
+#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
+
+extern int md_is_badblock(struct badblocks *bb, sector_t s, int sectors,
+ sector_t *first_bad, int *bad_sectors);
+static inline int is_badblock(mdk_rdev_t *rdev, sector_t s, int sectors,
+ sector_t *first_bad, int *bad_sectors)
+{
+ if (unlikely(rdev->badblocks.count)) {
+ int rv = md_is_badblock(&rdev->badblocks, rdev->data_offset + s,
+ sectors,
+ first_bad, bad_sectors);
+ if (rv)
+ *first_bad -= rdev->data_offset;
+ return rv;
+ }
+ return 0;
+}
+extern int rdev_set_badblocks(mdk_rdev_t *rdev, sector_t s, int sectors,
+ int acknowledged);
+extern int rdev_clear_badblocks(mdk_rdev_t *rdev, sector_t s, int sectors);
+extern void md_ack_all_badblocks(struct badblocks *bb);
+
struct mddev_s
{
void *private;
@@ -517,7 +563,7 @@ extern void mddev_init(mddev_t *mddev);
extern int md_run(mddev_t *mddev);
extern void md_stop(mddev_t *mddev);
extern void md_stop_writes(mddev_t *mddev);
-extern void md_rdev_init(mdk_rdev_t *rdev);
+extern int md_rdev_init(mdk_rdev_t *rdev);

extern void mddev_suspend(mddev_t *mddev);
extern void mddev_resume(mddev_t *mddev);
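
For readers who want to see the entry format in isolation, here is a
standalone userspace sketch (not part of the patch) that reuses the BB_*
macros from the md.h hunk above to pack and unpack one table entry:

#include <stdio.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint64_t sector_t;

/* copied from the md.h hunk above */
#define BB_LEN_MASK	(0x00000000000001FFULL)
#define BB_OFFSET_MASK	(0x7FFFFFFFFFFFFE00ULL)
#define BB_ACK_MASK	(0x8000000000000000ULL)
#define BB_OFFSET(x)	(((x) & BB_OFFSET_MASK) >> 9)
#define BB_LEN(x)	(((x) & BB_LEN_MASK) + 1)
#define BB_ACK(x)	(!!((x) & BB_ACK_MASK))
#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))

int main(void)
{
	/* an acknowledged 17-sector bad range starting at sector 123456 */
	u64 entry = BB_MAKE((sector_t)123456, 17, 1);

	printf("offset=%llu len=%llu ack=%d\n",
	       (unsigned long long)BB_OFFSET(entry),
	       (unsigned long long)BB_LEN(entry),
	       BB_ACK(entry));
	/* prints: offset=123456 len=17 ack=1 */
	return 0;
}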


Re: [md PATCH 02/36] md/bad-block-log: add sysfs interface for accessing bad-block-log.

am 26.07.2011 04:29:27 von NeilBrown

On Sat, 23 Jul 2011 00:43:22 +0900 Namhyung Kim wrote:

> NeilBrown writes:
>
> > This can show the log (providing it fits in one page) and
> > allows bad blocks to be 'acknowledged' meaning that they
> > have safely been recorded in metadata.
> >
> > Clearing bad blocks is not allowed via sysfs (except for
> > code testing). A bad block can only be cleared when
> > a write to the block succeeds.
> >
> > Signed-off-by: NeilBrown
> > ---
> >
> > drivers/md/md.c | 127 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 files changed, 127 insertions(+), 0 deletions(-)
> >

> > +static ssize_t
> > +badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack)
> > +{
> > + unsigned long long sector;
> > + int length;
> > + char newline;
> > +#ifdef DO_DEBUG
> > + /* Allow clearing via sysfs *only* for testing/debugging.
> > + * Normally only a successful write may clear a badblock
> > + */
> > + int clear = 0;
> > + if (page[0] == '-') {
> > + clear = 1;
> > + page++;
> > + }
> > +#endif /* DO_DEBUG */
> > +
> > > + switch (sscanf(page, "%llu %d%c", &sector, &length, &newline)) {
>
> What if user provides negative 'length' here? Should we check that case?
>

Good point. I've added an appropriate test.

Thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5de0c84..9ba76c7 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7844,6 +7844,8 @@ badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack)
if (newline != '\n')
return -EINVAL;
case 2:
+ if (length <= 0)
+ return -EINVAL;
break;
default:
return -EINVAL;
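
For clarity, here is a minimal userspace rendering of the resulting
parsing rule (an illustrative sketch, not the kernel function itself);
note how case 3 falls through to case 2, so the newline check and the
new non-positive-length check combine:

#include <stdio.h>

static int parse_badblock(const char *page,
			  unsigned long long *sector, int *length)
{
	char newline;

	switch (sscanf(page, "%llu %d%c", sector, length, &newline)) {
	case 3:
		if (newline != '\n')
			return -1;	/* trailing junk */
		/* fall through */
	case 2:
		if (*length <= 0)
			return -1;	/* the test added above */
		break;
	default:
		return -1;
	}
	return 0;
}

int main(void)
{
	unsigned long long s;
	int len;

	printf("%d\n", parse_badblock("1000 8\n", &s, &len));	/* 0: ok   */
	printf("%d\n", parse_badblock("1000 -8\n", &s, &len));	/* -1: bad */
	return 0;
}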

Re: [md PATCH 01/36] md: beginnings of bad block management.

am 26.07.2011 05:20:37 von NeilBrown

On Sat, 23 Jul 2011 01:52:22 +0900 Namhyung Kim wrote:

> NeilBrown writes:
>
> > This is the first step in allowing md to track bad-blocks per-device so
> > that we can fail individual blocks rather than the whole device.
> >
> > This patch just adds a data structure for recording bad blocks, with
> > routines to add, remove, search the list.
> >
> > Signed-off-by: NeilBrown
> > ---
> >
> > drivers/md/md.c | 457 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > drivers/md/md.h | 49 ++++++
> > 2 files changed, 502 insertions(+), 4 deletions(-)
> >

> > @@ -2819,8 +2837,11 @@ static mdk_rdev_t *md_import_device(dev_t newdev, int super_format, int super_mi
> > return ERR_PTR(-ENOMEM);
> > }
> >
> > - md_rdev_init(rdev);
> > - if ((err = alloc_disk_sb(rdev)))
> > + err = md_rdev_init(rdev);
> > + if (err)
> > + goto abort_free;
> > + err = alloc_disk_sb(rdev);
> > + if (err)
> > goto abort_free;
> >
> > err = lock_rdev(rdev, newdev, super_format == -2);
>
> Now error path at abort_free needs to free rdev->badblocks.page
> otherwise there will be memory leaks.


Yep, I've fixed that thanks.

I'll revisit when to allocate the badblocks.page in a subsequent patch.
Then I'll push it all out to for-next and give it 48 hours for any further
review and for me to do some more testing - particularly after the locking
change.

So I hope to push all this to Linus in about 48 hours. Any other
Reviewed-by:s (including any I might have missed) welcome before then.

Thanks,

NeilBrown


Re: [md PATCH 01/36] md: beginnings of bad block management.

am 26.07.2011 07:17:12 von Namhyung Kim

2011-07-26 (화), 12:26 +1000, NeilBrown:
> On Sat, 23 Jul 2011 00:03:45 +0900 Namhyung Kim wrote:
> 
> > NeilBrown writes:
> > 
> > > This is the first step in allowing md to track bad-blocks per-device so
> > > that we can fail individual blocks rather than the whole device.
> > >
> > > This patch just adds a data structure for recording bad blocks, with
> > > routines to add, remove, search the list.
> > >
> > > Signed-off-by: NeilBrown
> > > ---
> > >
> > > drivers/md/md.c | 457 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > > drivers/md/md.h | 49 ++++++
> > > 2 files changed, 502 insertions(+), 4 deletions(-)
> > >
> > > +
> > > +/* Bad block management.
> > > + * We can record which blocks on each device are 'bad' and so just
> > > + * fail those blocks, or that stripe, rather than the whole device.
> > > + * Entries in the bad-block table are 64bits wide. This comprises:
> > > + * Length of bad-range, in sectors: 0-511 for lengths 1-512
> > > + * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
> > > + * A 'shift' can be set so that larger blocks are tracked and
> > > + * consequently larger devices can be covered.
> > > + * 'Acknowledged' flag - 1 bit. - the most significant bit.
> > > + */
> > > +/* Locking of the bad-block table is a two-layer affair.
> > > + * Read access through ->active_page only requires an rcu_readlock.
> > > + * However if ->active_page is found to be NULL, the table
> > > + * should be accessed through ->page which requires an irq-spinlock.
> > > + * Updating the page requires setting ->active_page to NULL,
> > > + * synchronising with rcu, then updating ->page under the same
> > > + * irq-spinlock.
> > > + * We always set or clear bad blocks from process context, but
> > > + * might look-up bad blocks from interrupt/bh context.
> > > + *
> > 
> > Empty line.
> > 
> > If the locking is complex, it'd be better to define separate functions to
> > deal with it, IMHO. Please see below.
> 
> I too have been feeling uncomfortable about the locking and I recently
> realised that I really should be using a seqlock rather than trying to force
> RCU into the mould. So I have changed it and it is much better now. Below
> is the new version.
> 
> 
> > > +EXPORT_SYMBOL_GPL(rdev_set_badblocks);
> > 
> > I think it would be better if all exported functions in md.c were
> > prefixed with 'md_'.
> > 
> 
> Probably good advice. I don't think I'll change it now but maybe in a
> subsequent patch so that I change them all at once.
> 
> Thanks,
> NeilBrown
> 
> commit 0980048be17a45ae9e181ad04a549c31a499dee9
> Author: NeilBrown
> Date: Tue Jul 26 12:22:08 2011 +1000
> 
>     md: beginnings of bad block management.
>     
>     This is the first step in allowing md to track bad-blocks per-device so
>     that we can fail individual blocks rather than the whole device.
>     
>     This patch just adds a data structure for recording bad blocks, with
>     routines to add, remove, search the list.
>     
>     Signed-off-by: NeilBrown

With your other badblocks.page fix:

Reviewed-by: Namhyung Kim


--
Regards,
Namhyung Kim



Re: [md PATCH 02/36] md/bad-block-log: add sysfs interface for accessing bad-block-log.

am 26.07.2011 07:17:42 von Namhyung Kim

2011-07-26 (화), 12:29 +1000, NeilBrown:
> On Sat, 23 Jul 2011 00:43:22 +0900 Namhyung Kim wrote:
> 
> > NeilBrown writes:
> > 
> > > This can show the log (providing it fits in one page) and
> > > allows bad blocks to be 'acknowledged' meaning that they
> > > have safely been recorded in metadata.
> > >
> > > Clearing bad blocks is not allowed via sysfs (except for
> > > code testing). A bad block can only be cleared when
> > > a write to the block succeeds.
> > >
> > > Signed-off-by: NeilBrown
> > > ---
> > >
> > > drivers/md/md.c | 127 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > > 1 files changed, 127 insertions(+), 0 deletions(-)
> > >
> 
> > > +static ssize_t
> > > +badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack)
> > > +{
> > > + unsigned long long sector;
> > > + int length;
> > > + char newline;
> > > +#ifdef DO_DEBUG
> > > + /* Allow clearing via sysfs *only* for testing/debugging.
> > > + * Normally only a successful write may clear a badblock
> > > + */
> > > + int clear = 0;
> > > + if (page[0] == '-') {
> > > + clear = 1;
> > > + page++;
> > > + }
> > > +#endif /* DO_DEBUG */
> > > +
> > > + switch (sscanf(page, "%llu %d%c", &sector, &length, &newline)) {
> > 
> > What if user provides negative 'length' here? Should we check that case?
> > 
> 
> Good point. I've added an appropriate test.
> 
> Thanks,
> NeilBrown
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 5de0c84..9ba76c7 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7844,6 +7844,8 @@ badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack)
> if (newline != '\n')
> return -EINVAL;
> case 2:
> + if (length <= 0)
> + return -EINVAL;
> break;
> default:
> return -EINVAL;

Reviewed-by: Namhyung Kim

Thanks.


--
Regards,
Namhyung Kim



Re: [md PATCH 02/36] md/bad-block-log: add sysfs interface for accessing bad-block-log.

am 26.07.2011 10:48:27 von Namhyung Kim

NeilBrown writes:

> This can show the log (providing it fits in one page) and
> allows bad blocks to be 'acknowledged' meaning that they
> have safely been recorded in metadata.
>
> Clearing bad blocks is not allowed via sysfs (except for
> code testing). A bad block can only be cleared when
> a write to the block succeeds.
>
> Signed-off-by: NeilBrown
> ---
>
> drivers/md/md.c | 127 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 127 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 220fadb..9324635 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -2712,6 +2712,35 @@ static ssize_t recovery_start_store(mdk_rdev_t *rdev, const char *buf, size_t le
> static struct rdev_sysfs_entry rdev_recovery_start =
> __ATTR(recovery_start, S_IRUGO|S_IWUSR, recovery_start_show, recovery_start_store);
>
> +
> +static ssize_t
> +badblocks_show(struct badblocks *bb, char *page, int unack);
> +static ssize_t
> +badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack);
> +
> +static ssize_t bb_show(mdk_rdev_t *rdev, char *page)
> +{
> + return badblocks_show(&rdev->badblocks, page, 0);
> +}
> +static ssize_t bb_store(mdk_rdev_t *rdev, const char *page, size_t len)
> +{
> + return badblocks_store(&rdev->badblocks, page, len, 0);
> +}
> +static struct rdev_sysfs_entry rdev_bad_blocks =
> +__ATTR(bad_blocks, S_IRUGO|S_IWUSR, bb_show, bb_store);
> +
> +
> +static ssize_t ubb_show(mdk_rdev_t *rdev, char *page)
> +{
> + return badblocks_show(&rdev->badblocks, page, 1);
> +}
> +static ssize_t ubb_store(mdk_rdev_t *rdev, const char *page, size_t len)
> +{
> + return badblocks_store(&rdev->badblocks, page, len, 1);
> +}
> +static struct rdev_sysfs_entry rdev_unack_bad_blocks =
> +__ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
> +
> static struct attribute *rdev_default_attrs[] = {
> &rdev_state.attr,
> &rdev_errors.attr,
> @@ -2719,6 +2748,8 @@ static struct attribute *rdev_default_attrs[] = {
> &rdev_offset.attr,
> &rdev_size.attr,
> &rdev_recovery_start.attr,
> + &rdev_bad_blocks.attr,
> + &rdev_unack_bad_blocks.attr,
> NULL,
> };
> static ssize_t
> @@ -7775,6 +7806,102 @@ void md_ack_all_badblocks(struct badblocks *bb)
> }
> EXPORT_SYMBOL_GPL(md_ack_all_badblocks);
>
> +/* sysfs access to bad-blocks list.
> + * We present two files.
> + * 'bad-blocks' lists sector numbers and lengths of ranges that
> + * are recorded as bad. The list is truncated to fit within
> + * the one-page limit of sysfs.
> + * Writing "sector length" to this file adds an acknowledged
> + * bad block list.
> + * 'unacknowledged-bad-blocks' lists bad blocks that have not yet
> + * been acknowledged. Writing to this file adds bad blocks
> + * without acknowledging them. This is largely for testing.
> + *
> + */

Empty line in comment.

And you might need this as well:
(or maybe it should be located somewhere in Documentation/ABI/)

From 95c3f191bd0cbe6d339fced75656502b2d591fe4 Mon Sep 17 00:00:00 2001
From: Namhyung Kim
Date: Tue, 26 Jul 2011 17:39:52 +0900
Subject: [PATCH] md: add documentation for bad block log

Previous patch in the bad block series added new sysfs interfaces
([unacknowledged_]bad_blocks) for each rdev without documentation.
Add it.

Signed-off-by: Namhyung Kim
---
Documentation/md.txt | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/Documentation/md.txt b/Documentation/md.txt
index f0eee83ff78a..be88a24b8584 100644
--- a/Documentation/md.txt
+++ b/Documentation/md.txt
@@ -562,6 +562,20 @@ also have
that number to reach sync_max. Then you can either increase
"sync_max", or can write 'idle' to "sync_action".

+ bad_blocks
+ This gives the list of all known bad blocks in the form of
+ start address and length (in sectors respectively). If output
+ is too big to fit in a page, it will be truncated. Writing
+ "sector length" to this file will add new acknowledged (i.e.
+ saved to disk safely) bad blocks.
+
+ unacknowledged_bad_blocks
+ This gives the list of known-but-not-yet-saved-to-disk bad
+ blocks in the same form of 'bad_blocks'. If output is too big
+ to fit in a page, it will be truncated. Writing to this file
+ adds bad blocks without acknowledging them. This is largely
+ for testing.
+

Each active md device may also have attributes specific to the
personality module that manages it.
--
1.7.6


Re: [md PATCH 06/36] md/raid1: avoid reading from known bad blocks.

am 26.07.2011 16:06:37 von Namhyung Kim

NeilBrown writes:

> Now that we have a bad block list, we should not read from those
> blocks.
> There are several main parts to this:
> 1/ read_balance needs to check for bad blocks, and return not only
> the chosen device, but also how many good blocks are available
> there.
> 2/ fix_read_error needs to avoid trying to read from bad blocks.
> 3/ read submission must be ready to issue multiple reads to
> different devices as different bad blocks on different devices
> could mean that a single large read cannot be served by any one
> device, but can still be served by the array.
> This requires keeping count of the number of outstanding requests
> per bio. This count is stored in 'bi_phys_segments'
> 4/ retrying a read needs to also be ready to submit a smaller read
> and queue another request for the rest.
>
> This does not yet handle bad blocks when reading to perform resync,
> recovery, or check.
>
> 'md_trim_bio' will also be used for RAID10, so put it in md.c and
> export it.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim

and some minor nits below.


> ---
>
> drivers/md/md.c | 49 ++++++++++++
> drivers/md/md.h | 1
> drivers/md/raid1.c | 208 +++++++++++++++++++++++++++++++++++++++++++++-------
> drivers/md/raid1.h | 4 +
> 4 files changed, 233 insertions(+), 29 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 340e2d4..430bc8b 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -215,6 +215,55 @@ struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
> }
> EXPORT_SYMBOL_GPL(bio_clone_mddev);
>
> +void md_trim_bio(struct bio *bio, int offset, int size)
> +{
> + /* 'bio' is a cloned bio which we need to trim to match
> + * the given offset and size.
> + * This requires adjusting bi_sector, bi_size, and bi_io_vec
> + */
> + int i;
> + struct bio_vec *bvec;
> + int sofar = 0;
> +
> + size <<= 9;
> + if (offset == 0 && size == bio->bi_size)
> + return;
> +
> + bio->bi_sector += offset;
> + bio->bi_size = size;
> + offset <<= 9;
> + clear_bit(BIO_SEG_VALID, &bio->bi_flags);
> +
> + while (bio->bi_idx < bio->bi_vcnt &&
> + bio->bi_io_vec[bio->bi_idx].bv_len <= offset) {
> + /* remove this whole bio_vec */
> + offset -= bio->bi_io_vec[bio->bi_idx].bv_len;
> + bio->bi_idx++;
> + }
> + if (bio->bi_idx < bio->bi_vcnt) {
> + bio->bi_io_vec[bio->bi_idx].bv_offset += offset;
> + bio->bi_io_vec[bio->bi_idx].bv_len -= offset;
> + }
> > + /* avoid any complications with bi_idx being non-zero */
> + if (bio->bi_idx) {
> + memmove(bio->bi_io_vec, bio->bi_io_vec+bio->bi_idx,
> + (bio->bi_vcnt - bio->bi_idx) * sizeof(struct bio_vec));
> + bio->bi_vcnt -= bio->bi_idx;
> + bio->bi_idx = 0;
> + }
> + /* Make sure vcnt and last bv are not too big */
> + bio_for_each_segment(bvec, bio, i) {
> + if (sofar + bvec->bv_len > size)
> + bvec->bv_len = size - sofar;
> + if (bvec->bv_len == 0) {
> + bio->bi_vcnt = i;
> + break;
> + }
> + sofar += bvec->bv_len;
> + }
> +}
> +EXPORT_SYMBOL_GPL(md_trim_bio);
> +
> /*
> * We have a system wide 'event count' that is incremented
> * on any 'interesting' event, and readers of /proc/mdstat
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 834e46b..eb11449 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -576,4 +576,5 @@ extern struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
> extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
> mddev_t *mddev);
> extern int mddev_check_plugged(mddev_t *mddev);
> +extern void md_trim_bio(struct bio *bio, int offset, int size);
> #endif /* _MD_MD_H */
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 8db311d..cc3939d 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -41,11 +41,7 @@
> #include "bitmap.h"
>
> #define DEBUG 0
> -#if DEBUG
> -#define PRINTK(x...) printk(x)
> -#else
> -#define PRINTK(x...)
> -#endif
> +#define PRINTK(x...) do { if (DEBUG) printk(x); } while (0)
>
> /*
> * Number of guaranteed r1bios in case of extreme VM load:
> @@ -177,12 +173,6 @@ static void free_r1bio(r1bio_t *r1_bio)
> {
> conf_t *conf = r1_bio->mddev->private;
>
> - /*
> - * Wake up any possible resync thread that waits for the device
> - * to go idle.
> - */
> - allow_barrier(conf);
> -
> put_all_bios(conf, r1_bio);
> mempool_free(r1_bio, conf->r1bio_pool);
> }
> @@ -223,6 +213,33 @@ static void reschedule_retry(r1bio_t *r1_bio)
> * operation and are ready to return a success/failure code to the buffer
> * cache layer.
> */
> +static void call_bio_endio(r1bio_t *r1_bio)
> +{
> + struct bio *bio = r1_bio->master_bio;
> + int done;
> + conf_t *conf = r1_bio->mddev->private;
> +
> + if (bio->bi_phys_segments) {
> + unsigned long flags;
> + spin_lock_irqsave(&conf->device_lock, flags);
> + bio->bi_phys_segments--;
> + done = (bio->bi_phys_segments == 0);
> + spin_unlock_irqrestore(&conf->device_lock, flags);
> + } else
> + done = 1;
> +
> + if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
> + clear_bit(BIO_UPTODATE, &bio->bi_flags);
> + if (done) {
> + bio_endio(bio, 0);
> + /*
> + * Wake up any possible resync thread that waits for the device
> + * to go idle.
> + */
> + allow_barrier(conf);
> + }
> +}
> +
> static void raid_end_bio_io(r1bio_t *r1_bio)
> {
> struct bio *bio = r1_bio->master_bio;
> @@ -235,8 +252,7 @@ static void raid_end_bio_io(r1bio_t *r1_bio)
> (unsigned long long) bio->bi_sector +
> (bio->bi_size >> 9) - 1);
>
> - bio_endio(bio,
> - test_bit(R1BIO_Uptodate, &r1_bio->state) ? 0 : -EIO);
> + call_bio_endio(r1_bio);
> }
> free_r1bio(r1_bio);
> }
> @@ -295,6 +311,7 @@ static void raid1_end_read_request(struct bio *bio, int error)
> bdevname(conf->mirrors[mirror].rdev->bdev,
> b),
> (unsigned long long)r1_bio->sector);
> + set_bit(R1BIO_ReadError, &r1_bio->state);
> reschedule_retry(r1_bio);
> }
>
> @@ -381,7 +398,7 @@ static void raid1_end_write_request(struct bio *bio, int error)
> (unsigned long long) mbio->bi_sector,
> (unsigned long long) mbio->bi_sector +
> (mbio->bi_size >> 9) - 1);
> - bio_endio(mbio, 0);
> + call_bio_endio(r1_bio);
> }
> }
> }
> @@ -412,10 +429,11 @@ static void raid1_end_write_request(struct bio *bio, int error)
> *
> * The rdev for the device selected will have nr_pending incremented.
> */
> -static int read_balance(conf_t *conf, r1bio_t *r1_bio)
> +static int read_balance(conf_t *conf, r1bio_t *r1_bio, int *max_sectors)
> {
> const sector_t this_sector = r1_bio->sector;
> - const int sectors = r1_bio->sectors;
> + int sectors;
> + int best_good_sectors;
> int start_disk;
> int best_disk;
> int i;
> @@ -430,8 +448,11 @@ static int read_balance(conf_t *conf, r1bio_t *r1_bio)
> * We take the first readable disk when above the resync window.
> */
> retry:
> + sectors = r1_bio->sectors;
> best_disk = -1;
> best_dist = MaxSector;
> + best_good_sectors = 0;
> +
> if (conf->mddev->recovery_cp < MaxSector &&
> (this_sector + sectors >= conf->next_resync)) {
> choose_first = 1;
> @@ -443,6 +464,9 @@ static int read_balance(conf_t *conf, r1bio_t *r1_bio)
>
> for (i = 0 ; i < conf->raid_disks ; i++) {
> sector_t dist;
> + sector_t first_bad;
> + int bad_sectors;
> +
> int disk = start_disk + i;
> if (disk >= conf->raid_disks)
> disk -= conf->raid_disks;
> @@ -465,6 +489,35 @@ static int read_balance(conf_t *conf, r1bio_t *r1_bio)
> /* This is a reasonable device to use. It might
> * even be best.
> */
> + if (is_badblock(rdev, this_sector, sectors,
> + &first_bad, &bad_sectors)) {
> + if (best_dist < MaxSector)
> + /* already have a better device */
> + continue;
> + if (first_bad <= this_sector) {
> + /* cannot read here. If this is the 'primary'
> + * device, then we must not read beyond
> + * bad_sectors from another device..
> + */
> + bad_sectors -= (this_sector - first_bad);
> + if (choose_first && sectors > bad_sectors)
> + sectors = bad_sectors;
> + if (best_good_sectors > sectors)
> + best_good_sectors = sectors;
> +
> + } else {
> + sector_t good_sectors = first_bad - this_sector;
> + if (good_sectors > best_good_sectors) {
> + best_good_sectors = good_sectors;
> + best_disk = disk;
> + }
> + if (choose_first)
> + break;
> + }
> + continue;
> + } else
> + best_good_sectors = sectors;
> +
> dist = abs(this_sector - conf->mirrors[disk].head_position);
> if (choose_first
> /* Don't change to another disk for sequential reads */
> @@ -493,10 +546,12 @@ static int read_balance(conf_t *conf, r1bio_t *r1_bio)
> rdev_dec_pending(rdev, conf->mddev);
> goto retry;
> }
> + sectors = best_good_sectors;
> conf->next_seq_sect = this_sector + sectors;
> conf->last_used = best_disk;
> }
> rcu_read_unlock();
> + *max_sectors = sectors;
>
> return best_disk;
> }
> @@ -763,11 +818,25 @@ static int make_request(mddev_t *mddev, struct bio * bio)
> r1_bio->mddev = mddev;
> r1_bio->sector = bio->bi_sector;
>
> + /* We might need to issue multiple reads to different
> + * devices if there are bad blocks around, so we keep
> + * track of the number of reads in bio->bi_phys_segments.
> + * If this is 0, there is only one r1_bio and no locking
> + * will be needed when requests complete. If it is
> + * non-zero, then it is the number of not-completed requests.
> + */
> + bio->bi_phys_segments = 0;
> + clear_bit(BIO_SEG_VALID, &bio->bi_flags);
> +
> if (rw == READ) {
> /*
> * read balancing logic:
> */
> - int rdisk = read_balance(conf, r1_bio);
> + int max_sectors;
> + int rdisk;
> +
> +read_again:
> + rdisk = read_balance(conf, r1_bio, &max_sectors);
>
> if (rdisk < 0) {
> /* couldn't find anywhere to read from */
> @@ -788,6 +857,8 @@ static int make_request(mddev_t *mddev, struct bio * bio)
> r1_bio->read_disk = rdisk;
>
> read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
> + md_trim_bio(read_bio, r1_bio->sector - bio->bi_sector,
> + max_sectors);
>
> r1_bio->bios[rdisk] = read_bio;
>
> @@ -797,7 +868,38 @@ static int make_request(mddev_t *mddev, struct bio * bio)
> read_bio->bi_rw = READ | do_sync;
> read_bio->bi_private = r1_bio;
>
> - generic_make_request(read_bio);
> + if (max_sectors < r1_bio->sectors) {
> + /* could not read all from this device, so we will
> + * need another r1_bio.
> + */
> + int sectors_handled;
> +
> + sectors_handled = (r1_bio->sector + max_sectors
> + - bio->bi_sector);
> + r1_bio->sectors = max_sectors;
> + spin_lock_irq(&conf->device_lock);
> + if (bio->bi_phys_segments == 0)
> + bio->bi_phys_segments = 2;
> + else
> + bio->bi_phys_segments++;
> + spin_unlock_irq(&conf->device_lock);
> + /* Cannot call generic_make_request directly
> + * as that will be queued in __make_request
> + * and subsequent mempool_alloc might block waiting
> + * for it. So hand bio over to raid1d.
> + */
> + reschedule_retry(r1_bio);
> +
> + r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
> +
> + r1_bio->master_bio = bio;
> + r1_bio->sectors = (bio->bi_size >> 9) - sectors_handled;
> + r1_bio->state = 0;
> + r1_bio->mddev = mddev;
> + r1_bio->sector = bio->bi_sector + sectors_handled;
> + goto read_again;
> + } else
> + generic_make_request(read_bio);
> return 0;

To reduce the depth of nesting, how about rearranging this like:

if (max_sectors == r1_bio->sectors) {
generic_make_request(read_bio);
return 0;
}
/* could not read all from this device, so we will need
* another bio
*/
...
goto read_again;

> }
>
> @@ -849,8 +951,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
> goto retry_write;
> }
>
> - BUG_ON(targets == 0); /* we never fail the last device */
> -
> if (targets < conf->raid_disks) {
> /* array is degraded, we will not clear the bitmap
> * on I/O completion (see raid1_end_write_request) */
> @@ -1425,7 +1525,7 @@ static void sync_request_write(mddev_t *mddev, r1bio_t *r1_bio)
> *
> * 1. Retries failed read operations on working mirrors.
> * 2. Updates the raid superblock when problems encounter.
> - * 3. Performs writes following reads for array syncronising.
> + * 3. Performs writes following reads for array synchronising.
> */
>
> static void fix_read_error(conf_t *conf, int read_disk,
> @@ -1448,9 +1548,14 @@ static void fix_read_error(conf_t *conf, int read_disk,
> * which is the thread that might remove
> * a device. If raid1d ever becomes multi-threaded....
> */
> + sector_t first_bad;
> + int bad_sectors;
> +
> rdev = conf->mirrors[d].rdev;
> if (rdev &&
> test_bit(In_sync, &rdev->flags) &&
> + is_badblock(rdev, sect, s,
> + &first_bad, &bad_sectors) == 0 &&
> sync_page_io(rdev, sect, s<<9,
> conf->tmppage, READ, false))
> success = 1;
> @@ -1546,9 +1651,11 @@ static void raid1d(mddev_t *mddev)
> conf = mddev->private;
> if (test_bit(R1BIO_IsSync, &r1_bio->state))
> sync_request_write(mddev, r1_bio);
> - else {
> + else if (test_bit(R1BIO_ReadError, &r1_bio->state)) {
> int disk;
> + int max_sectors;
>
> + clear_bit(R1BIO_ReadError, &r1_bio->state);
> /* we got a read error. Maybe the drive is bad. Maybe just
> * the block and we can fix it.
> * We freeze all other IO, and try reading the block from
> @@ -1568,21 +1675,28 @@ static void raid1d(mddev_t *mddev)
> conf->mirrors[r1_bio->read_disk].rdev);
>
> bio = r1_bio->bios[r1_bio->read_disk];
> - if ((disk=read_balance(conf, r1_bio)) == -1) {
> + bdevname(bio->bi_bdev, b);
> +read_more:
> + disk = read_balance(conf, r1_bio, &max_sectors);
> + if (disk == -1) {
> printk(KERN_ALERT "md/raid1:%s: %s: unrecoverable I/O"
> " read error for block %llu\n",
> - mdname(mddev),
> - bdevname(bio->bi_bdev,b),
> + mdname(mddev), b,
> (unsigned long long)r1_bio->sector);
> raid_end_bio_io(r1_bio);
> } else {
> const unsigned long do_sync = r1_bio->master_bio->bi_rw & REQ_SYNC;
> - r1_bio->bios[r1_bio->read_disk] =
> - mddev->ro ? IO_BLOCKED : NULL;
> + if (bio) {
> + r1_bio->bios[r1_bio->read_disk] =
> + mddev->ro ? IO_BLOCKED : NULL;
> + bio_put(bio);
> + }
> r1_bio->read_disk = disk;
> - bio_put(bio);
> bio = bio_clone_mddev(r1_bio->master_bio,
> GFP_NOIO, mddev);
> + md_trim_bio(bio,
> + r1_bio->sector - bio->bi_sector,
> + max_sectors);
> r1_bio->bios[r1_bio->read_disk] = bio;
> rdev = conf->mirrors[disk].rdev;
> printk_ratelimited(
> @@ -1597,8 +1711,44 @@ static void raid1d(mddev_t *mddev)
> bio->bi_end_io = raid1_end_read_request;
> bio->bi_rw = READ | do_sync;
> bio->bi_private = r1_bio;
> - generic_make_request(bio);
> + if (max_sectors < r1_bio->sectors) {
> + /* Drat - have to split this up more */
> + struct bio *mbio = r1_bio->master_bio;
> + int sectors_handled =
> + r1_bio->sector + max_sectors
> + - mbio->bi_sector;
> + r1_bio->sectors = max_sectors;
> + spin_lock_irq(&conf->device_lock);
> + if (mbio->bi_phys_segments == 0)
> + mbio->bi_phys_segments = 2;
> + else
> + mbio->bi_phys_segments++;
> + spin_unlock_irq(&conf->device_lock);
> + generic_make_request(bio);
> + bio = NULL;
> +
> + r1_bio = mempool_alloc(conf->r1bio_pool,
> + GFP_NOIO);
> +
> + r1_bio->master_bio = mbio;
> + r1_bio->sectors = (mbio->bi_size >> 9)
> + - sectors_handled;
> + r1_bio->state = 0;
> + set_bit(R1BIO_ReadError,
> + &r1_bio->state);
> + r1_bio->mddev = mddev;
> + r1_bio->sector = mbio->bi_sector
> + + sectors_handled;
> +
> + goto read_more;
> + } else
> + generic_make_request(bio);

Same here.


> }
> + } else {
> + /* just a partial read to be scheduled from separate
> + * context
> + */
> + generic_make_request(r1_bio->bios[r1_bio->read_disk]);
> }
> cond_resched();
> }
> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
> index 3cd18cf..aa6af37 100644
> --- a/drivers/md/raid1.h
> +++ b/drivers/md/raid1.h
> @@ -123,6 +123,10 @@ struct r1bio_s {
> #define R1BIO_IsSync 1
> #define R1BIO_Degraded 2
> #define R1BIO_BehindIO 3
> +/* Set ReadError on bios that experience a readerror so that
> + * raid1d knows what to do with them.
> + */
> +#define R1BIO_ReadError 4
> /* For write-behind requests, we call bi_end_io when
> * the last non-write-behind device completes, providing
> * any write was successful. Otherwise we call when
>
>
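The bi_phys_segments accounting this patch introduces is easy to miss in
the diff, so here is a minimal userspace sketch of the counting
convention (illustrative only; a pthread mutex stands in for
conf->device_lock, and zero means "never split"):

#include <pthread.h>
#include <stdio.h>

/* stand-in for the fields the patch uses on the master bio */
struct master_bio {
	int phys_segments;	/* 0 means a single, un-split request */
	pthread_mutex_t lock;	/* stands in for conf->device_lock */
};

/* issuing one more partial request for the same master bio */
static void note_extra_request(struct master_bio *bio)
{
	pthread_mutex_lock(&bio->lock);
	if (bio->phys_segments == 0)
		bio->phys_segments = 2;	/* the first piece plus this one */
	else
		bio->phys_segments++;
	pthread_mutex_unlock(&bio->lock);
}

/* a partial request completed; returns 1 when the master bio is done */
static int note_completion(struct master_bio *bio)
{
	int done;

	pthread_mutex_lock(&bio->lock);
	if (bio->phys_segments) {
		bio->phys_segments--;
		done = (bio->phys_segments == 0);
	} else
		done = 1;		/* the un-split fast path */
	pthread_mutex_unlock(&bio->lock);
	return done;
}

int main(void)
{
	struct master_bio bio = { 0, PTHREAD_MUTEX_INITIALIZER };

	note_extra_request(&bio);		/* read split in two */
	printf("%d\n", note_completion(&bio));	/* 0: one piece pending */
	printf("%d\n", note_completion(&bio));	/* 1: whole bio complete */
	return 0;
}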

Re: [md PATCH 07/36] md/raid1: avoid reading known bad blocks during resync

am 26.07.2011 16:25:10 von Namhyung Kim

NeilBrown writes:

> When performing resync/etc, keep the size of the request
> small enough that it doesn't overlap any known bad blocks.
> Devices with badblocks at the start of the request are completely
> excluded.
> If there is nowhere to read from due to bad blocks, record
> a bad block on each target device.
>
> Now that we never read from known-bad-blocks we can allow devices with
> known-bad-blocks into a RAID1.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim
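
The clamping rule described in this patch can be summarised in a few
lines. This is an illustrative sketch simplified to a single device with
a single known bad range, not the raid1 code itself:

#include <stdio.h>

/* usable sectors from 'start'; 0 means exclude this device entirely */
static int good_sectors(unsigned long long start, int sectors,
			unsigned long long first_bad, int bad_len)
{
	if (first_bad >= start + sectors || first_bad + bad_len <= start)
		return sectors;			/* no overlap at all */
	if (first_bad <= start)
		return 0;			/* bad at the start: exclude */
	return (int)(first_bad - start);	/* stop at the bad range */
}

int main(void)
{
	printf("%d\n", good_sectors(1000, 64, 1040, 8));	/* 40 */
	printf("%d\n", good_sectors(1000, 64, 992, 16));	/* 0  */
	printf("%d\n", good_sectors(1000, 64, 2000, 8));	/* 64 */
	return 0;
}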

[PATCH v2] md: add documentation for bad block log

am 26.07.2011 17:03:43 von Namhyung Kim

Previous patch in the bad block series added new sysfs interfaces
([unacknowledged_]bad_blocks) for each rdev without documentation.
Add it.

Signed-off-by: Namhyung Kim
---
Previous version misplaced the descriptions. Move them to correct
position (under rdev directory).


Documentation/md.txt | 15 ++++++++++++++-
1 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/Documentation/md.txt b/Documentation/md.txt
index f0eee83ff78a..923a6bddce7c 100644
--- a/Documentation/md.txt
+++ b/Documentation/md.txt
@@ -419,7 +419,6 @@ Each directory contains:
written, it will be rejected.

recovery_start
-
When the device is not 'in_sync', this records the number of
sectors from the start of the device which are known to be
correct. This is normally zero, but during a recovery
@@ -435,6 +434,20 @@ Each directory contains:
Setting this to 'none' is equivalent to setting 'in_sync'.
Setting to any other value also clears the 'in_sync' flag.

+ bad_blocks
+ This gives the list of all known bad blocks in the form of
+ start address and length (in sectors respectively). If output
+ is too big to fit in a page, it will be truncated. Writing
+ "sector length" to this file adds new acknowledged (i.e.
+ recorded to disk safely) bad blocks.
+
+ unacknowledged_bad_blocks
+ This gives the list of known-but-not-yet-saved-to-disk bad
+ blocks in the same form of 'bad_blocks'. If output is too big
+ to fit in a page, it will be truncated. Writing to this file
+ adds bad blocks without acknowledging them. This is largely
+ for testing.
+


An active md device will also contain and entry for each active device
--
1.7.6
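
A trivial userspace example of the interface this documentation
describes might look like the following; the "dev-sdb" path component is
only an assumed example, as the exact name depends on the array member:

#include <stdio.h>

int main(void)
{
	/* illustrative path; rdev directories are /sys/block/mdX/md/dev-YYY */
	const char *path = "/sys/block/md0/md/dev-sdb/bad_blocks";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return 1;
	}
	/* "sector length": mark 8 sectors starting at sector 2048 bad */
	fprintf(f, "2048 8\n");
	return fclose(f) ? 1 : 0;
}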


Re: [md PATCH 08/36] md: add "write_error" flag to component devices.

am 26.07.2011 17:22:17 von Namhyung Kim

NeilBrown writes:

> If a device has ever seen a write error, we will want to handle
> known-bad-blocks differently.
> So create an appropriate state flag and export it via sysfs.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim

but it looks like documentation update is needed too
(probably can be squashed to next patch 09/36).

Re: [md PATCH 09/36] md: make it easier to wait for bad blocks to be acknowledged.

am 26.07.2011 18:04:15 von Namhyung Kim

NeilBrown writes:

> It is only safe to choose not to write to a bad block if that bad
> block is safely recorded in metadata - i.e. if it has been
> 'acknowledged'.
>
> If it hasn't we need to wait for the acknowledgement.
>
> We support that using rdev->blocked wait and
> md_wait_for_blocked_rdev by introducing a new device flag
> 'BlockedBadBlock'.
>
> This flag is only advisory.
> It is cleared whenever we acknowledge a bad block, so that a waiter
> can re-check the particular bad blocks that it is interested it.
>
> It should be set by a caller when they find they need to wait.
> This (set after test) is inherently racy, but as
> md_wait_for_blocked_rdev already has a timeout, losing the race will
> have minimal impact.
>
> When we clear "Blocked" was also clear "BlockedBadBlocks" incase it
> was set incorrectly (see above race).
>
> We also modify the way we manage 'Blocked' to fit better with the new
> handling of 'BlockedBadBlocks' and to make it consistent between
> externally managed and internally managed metadata. This requires
> that each raidXd loop checks if the metadata needs to be written and
> triggers a write (md_check_recovery) if needed. Otherwise a queued
> write request might cause raidXd to wait for the metadata to write,
> and only that thread can write it.
>
> Before writing metadata, we set FaultRecorded for all devices that
> are Faulty, then after writing the metadata we clear Blocked for any
> device for which the Fault was certainly Recorded.
>
> The 'faulty' device flag now appears in sysfs if the device is faulty
> *or* it has unacknowledged bad blocks. So user-space which does not
> understand bad blocks can continue to function correctly.
> User space which does, should not assume a device is faulty until it
> sees the 'faulty' flag, and then sees the list of unacknowledged bad
> blocks is empty.
>
> Signed-off-by: NeilBrown

Probably you also need this patch:

From 76320c4fdaed91f26a083a9337bb5a5503300e0e Mon Sep 17 00:00:00 2001
From: Namhyung Kim
Date: Wed, 27 Jul 2011 00:59:26 +0900
Subject: [PATCH] md: update documentation for md/rdev/state sysfs interface

Previous patches in the bad block series extended behavior of
rdev's 'state' interface but lacked documentation update.
Fix it.

Signed-off-by: Namhyung Kim
---
Documentation/md.txt | 14 +++++++++-----
1 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/Documentation/md.txt b/Documentation/md.txt
index 923a6bddce7c..fc94770f44ab 100644
--- a/Documentation/md.txt
+++ b/Documentation/md.txt
@@ -360,18 +360,20 @@ Each directory contains:
A file recording the current state of the device in the array
which can be a comma separated list of
faulty - device has been kicked from active use due to
- a detected fault
+ a detected fault or it has unacknowledged bad
+ blocks
in_sync - device is a fully in-sync member of the array
writemostly - device will only be subject to read
requests if there are no other options.
This applies only to raid1 arrays.
- blocked - device has failed, metadata is "external",
- and the failure hasn't been acknowledged yet.
+ blocked - device has failed, and the failure hasn't been
+ acknowledged yet by the metadata handler.
Writes that would write to this device if
it were not faulty are blocked.
spare - device is working, but not a full member.
This includes spares that are in the process
of being recovered to
+ write_error - device has ever seen a write error.
This list may grow in future.
This can be written to.
Writing "faulty" simulates a failure on the device.
@@ -379,9 +381,11 @@ Each directory contains:
Writing "writemostly" sets the writemostly flag.
Writing "-writemostly" clears the writemostly flag.
Writing "blocked" sets the "blocked" flag.
- Writing "-blocked" clears the "blocked" flag and allows writes
- to complete.
+ Writing "-blocked" clears the "blocked" flags and allows writes
+ to complete and possibly simulates an error.
Writing "in_sync" sets the in_sync flag.
+ Writing "write_error" sets writeerrorseen flag.
+ Writing "-write_error" clears writeerrorseen flag.

This file responds to select/poll. Any change to 'faulty'
or 'blocked' causes an event.
--
1.7.6


Re: [md PATCH 02/36] md/bad-block-log: add sysfs interface for accessing bad-block-log.

am 27.07.2011 03:05:46 von NeilBrown

On Tue, 26 Jul 2011 17:48:27 +0900 Namhyung Kim wrote:


> Empty line in comment.
>
> And you might need this as well:

Thanks. I've added the patch.

> (or maybe it should be located somewhere in Documentation/ABI/)

Yes... maybe... I wonder who reads that ....

Thanks,
NeilBrown



>
> From 95c3f191bd0cbe6d339fced75656502b2d591fe4 Mon Sep 17 00:00:00 2001
> From: Namhyung Kim
> Date: Tue, 26 Jul 2011 17:39:52 +0900
> Subject: [PATCH] md: add documentation for bad block log
>
> Previous patch in the bad block series added new sysfs interfaces
> ([unacknowledged_]bad_blocks) for each rdev without documentation.
> Add it.
>
> Signed-off-by: Namhyung Kim
> ---
> Documentation/md.txt | 14 ++++++++++++++
> 1 files changed, 14 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/md.txt b/Documentation/md.txt
> index f0eee83ff78a..be88a24b8584 100644
> --- a/Documentation/md.txt
> +++ b/Documentation/md.txt
> @@ -562,6 +562,20 @@ also have
> that number to reach sync_max. Then you can either increase
> "sync_max", or can write 'idle' to "sync_action".
>
> + bad_blocks
> + This gives the list of all known bad blocks in the form of
> + start address and length (in sectors respectively). If output
> + is too big to fit in a page, it will be truncated. Writing
> + "sector length" to this file will add new acknowledged (i.e.
> + saved to disk safely) bad blocks.
> +
> + unacknowledged_bad_blocks
> + This gives the list of known-but-not-yet-saved-to-disk bad
> + blocks in the same form of 'bad_blocks'. If output is too big
> + to fit in a page, it will be truncated. Writing to this file
> + adds bad blocks without acknowledging them. This is largely
> + for testing.
> +
>
> Each active md device may also have attributes specific to the
> personality module that manages it.


Re: [md PATCH 09/36] md: make it easier to wait for bad blocks to be acknowledged.

am 27.07.2011 03:18:22 von NeilBrown

On Wed, 27 Jul 2011 01:04:15 +0900 Namhyung Kim wrote:

> NeilBrown writes:
>
> > It is only safe to choose not to write to a bad block if that bad
> > block is safely recorded in metadata - i.e. if it has been
> > 'acknowledged'.
> >
> > If it hasn't we need to wait for the acknowledgement.
> >
> > We support that using rdev->blocked wait and
> > md_wait_for_blocked_rdev by introducing a new device flag
> > 'BlockedBadBlock'.
> >
> > This flag is only advisory.
> > It is cleared whenever we acknowledge a bad block, so that a waiter
> > can re-check the particular bad blocks that it is interested it.
> >
> > It should be set by a caller when they find they need to wait.
> > This (set after test) is inherently racy, but as
> > md_wait_for_blocked_rdev already has a timeout, losing the race will
> > have minimal impact.
> >
> > When we clear "Blocked" was also clear "BlockedBadBlocks" incase it
> > was set incorrectly (see above race).
> >
> > We also modify the way we manage 'Blocked' to fit better with the new
> > handling of 'BlockedBadBlocks' and to make it consistent between
> > externally managed and internally managed metadata. This requires
> > that each raidXd loop checks if the metadata needs to be written and
> > triggers a write (md_check_recovery) if needed. Otherwise a queued
> > write request might cause raidXd to wait for the metadata to write,
> > and only that thread can write it.
> >
> > Before writing metadata, we set FaultRecorded for all devices that
> > are Faulty, then after writing the metadata we clear Blocked for any
> > device for which the Fault was certainly Recorded.
> >
> > The 'faulty' device flag now appears in sysfs if the device is faulty
> > *or* it has unacknowledged bad blocks. So user-space which does not
> > understand bad blocks can continue to function correctly.
> > User space which does, should not assume a device is faulty until it
> > sees the 'faulty' flag, and then sees the list of unacknowledged bad
> > blocks is empty.
> >
> > Signed-off-by: NeilBrown
>
> Probably you also need this patch:
>
> From 76320c4fdaed91f26a083a9337bb5a5503300e0e Mon Sep 17 00:00:00 2001
> From: Namhyung Kim
> Date: Wed, 27 Jul 2011 00:59:26 +0900
> Subject: [PATCH] md: update documentation for md/rdev/state sysfs interface
>
> Previous patches in the bad block series extended behavior of
> rdev's 'state' interface but lacked documentation update.
> Fix it.
>
> Signed-off-by: Namhyung Kim

Applied, thanks.

NeilBrown


> ---
> Documentation/md.txt | 14 +++++++++-----
> 1 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/md.txt b/Documentation/md.txt
> index 923a6bddce7c..fc94770f44ab 100644
> --- a/Documentation/md.txt
> +++ b/Documentation/md.txt
> @@ -360,18 +360,20 @@ Each directory contains:
> A file recording the current state of the device in the array
> which can be a comma separated list of
> faulty - device has been kicked from active use due to
> - a detected fault
> + a detected fault or it has unacknowledged bad
> + blocks
> in_sync - device is a fully in-sync member of the array
> writemostly - device will only be subject to read
> requests if there are no other options.
> This applies only to raid1 arrays.
> - blocked - device has failed, metadata is "external",
> - and the failure hasn't been acknowledged yet.
> + blocked - device has failed, and the failure hasn't been
> + acknowledged yet by the metadata handler.
> Writes that would write to this device if
> it were not faulty are blocked.
> spare - device is working, but not a full member.
> This includes spares that are in the process
> of being recovered to
> + write_error - device has ever seen a write error.
> This list may grow in future.
> This can be written to.
> Writing "faulty" simulates a failure on the device.
> @@ -379,9 +381,11 @@ Each directory contains:
> Writing "writemostly" sets the writemostly flag.
> Writing "-writemostly" clears the writemostly flag.
> Writing "blocked" sets the "blocked" flag.
> - Writing "-blocked" clears the "blocked" flag and allows writes
> - to complete.
> + Writing "-blocked" clears the "blocked" flags and allows writes
> + to complete and possibly simulates an error.
> Writing "in_sync" sets the in_sync flag.
> + Writing "write_error" sets writeerrorseen flag.
> + Writing "-write_error" clears writeerrorseen flag.
>
> This file responds to select/poll. Any change to 'faulty'
> or 'blocked' causes an event.


Re: [md PATCH 10/36] md/raid1: avoid writing to known-bad blocks on known-bad drives.

am 27.07.2011 06:09:07 von Namhyung Kim

NeilBrown writes:

> If we have seen any write error on a drive, then don't write to
> any known-bad blocks on that drive.
> If necessary, we divide the write request up into pieces just
> like we do for reads, so each piece is either all written or
> all not written to any given drive.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim

and a nit below


> ---
>
> drivers/md/raid1.c | 152 +++++++++++++++++++++++++++++++++++++++-------------
> 1 files changed, 115 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 4d40d9d..44277dc 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -772,6 +772,9 @@ static int make_request(mddev_t *mddev, struct bio * bio)
> const unsigned long do_flush_fua = (bio->bi_rw & (REQ_FLUSH | REQ_FUA));
> mdk_rdev_t *blocked_rdev;
> int plugged;
> + int first_clone;
> + int sectors_handled;
> + int max_sectors;
>
> /*
> * Register the new request and wait if the reconstruction
> @@ -832,7 +835,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
> /*
> * read balancing logic:
> */
> - int max_sectors;
> int rdisk;
>
> read_again:
> @@ -872,7 +874,6 @@ read_again:
> /* could not read all from this device, so we will
> * need another r1_bio.
> */
> - int sectors_handled;
>
> sectors_handled = (r1_bio->sector + max_sectors
> - bio->bi_sector);
> @@ -906,9 +907,15 @@ read_again:
> /*
> * WRITE:
> */
> - /* first select target devices under spinlock and
> + /* first select target devices under rcu_lock and
> * inc refcount on their rdev. Record them by setting
> * bios[x] to bio
> + * If there are known/acknowledged bad blocks on any device on
> + * which we have seen a write error, we want to avoid writing those
> + * blocks.
> + * This potentially requires several writes to write around
> + * the bad blocks. Each set of writes gets its own r1bio
> + * with a set of bios attached.
> */
> plugged = mddev_check_plugged(mddev);
>
> @@ -916,6 +923,7 @@ read_again:
> retry_write:
> blocked_rdev = NULL;
> rcu_read_lock();
> + max_sectors = r1_bio->sectors;
> for (i = 0; i < disks; i++) {
> mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
> if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
> @@ -923,17 +931,57 @@ read_again:
> blocked_rdev = rdev;
> break;
> }
> - if (rdev && !test_bit(Faulty, &rdev->flags)) {
> - atomic_inc(&rdev->nr_pending);
> - if (test_bit(Faulty, &rdev->flags)) {
> + r1_bio->bios[i] = NULL;
> + if (!rdev || test_bit(Faulty, &rdev->flags)) {
> + set_bit(R1BIO_Degraded, &r1_bio->state);
> + continue;
> + }
> +
> + atomic_inc(&rdev->nr_pending);
> + if (test_bit(WriteErrorSeen, &rdev->flags)) {
> + sector_t first_bad;
> + int bad_sectors;
> + int is_bad;
> +
> + is_bad = is_badblock(rdev, r1_bio->sector,
> + max_sectors,
> + &first_bad, &bad_sectors);
> + if (is_bad < 0) {
> + /* mustn't write here until the bad block is
> + * acknowledged */
> + set_bit(BlockedBadBlocks, &rdev->flags);
> + blocked_rdev = rdev;
> + break;
> + }
> + if (is_bad && first_bad <= r1_bio->sector) {
> + /* Cannot write here at all */
> + bad_sectors -= (r1_bio->sector - first_bad);
> + if (bad_sectors < max_sectors)
> + /* mustn't write more than bad_sectors
> + * to other devices yet
> + */
> + max_sectors = bad_sectors;
> rdev_dec_pending(rdev, mddev);
> - r1_bio->bios[i] = NULL;
> - } else {
> - r1_bio->bios[i] = bio;
> - targets++;
> + /* We don't set R1BIO_Degraded as that
> + * only applies if the disk is
> + * missing, so it might be re-added,
> + * and we want to know to recover this
> + * chunk.
> + * In this case the device is here,
> + * and the fact that this chunk is not
> + * in-sync is recorded in the bad
> + * block log
> + */
> + continue;
> }
> - } else
> - r1_bio->bios[i] = NULL;
> + if (is_bad) {
> + int good_sectors = first_bad - r1_bio->sector;
> + if (good_sectors < max_sectors)
> + max_sectors = good_sectors;
> + }
> + }
> + r1_bio->bios[i] = bio;
> + targets++;

Looks like variable 'targets' is not needed anymore.


> }
> rcu_read_unlock();
>
> @@ -944,48 +992,56 @@ read_again:
> for (j = 0; j < i; j++)
> if (r1_bio->bios[j])
> rdev_dec_pending(conf->mirrors[j].rdev, mddev);
> -
> + r1_bio->state = 0;
> allow_barrier(conf);
> md_wait_for_blocked_rdev(blocked_rdev, mddev);
> wait_barrier(conf);
> goto retry_write;
> }
>
> - if (targets < conf->raid_disks) {
> - /* array is degraded, we will not clear the bitmap
> - * on I/O completion (see raid1_end_write_request) */
> - set_bit(R1BIO_Degraded, &r1_bio->state);
> + if (max_sectors < r1_bio->sectors) {
> + /* We are splitting this write into multiple parts, so
> + * we need to prepare for allocating another r1_bio.
> + */
> + r1_bio->sectors = max_sectors;
> + spin_lock_irq(&conf->device_lock);
> + if (bio->bi_phys_segments == 0)
> + bio->bi_phys_segments = 2;
> + else
> + bio->bi_phys_segments++;
> + spin_unlock_irq(&conf->device_lock);
> }
> -
> - /* do behind I/O ?
> - * Not if there are too many, or cannot allocate memory,
> - * or a reader on WriteMostly is waiting for behind writes
> - * to flush */
> - if (bitmap &&
> - (atomic_read(&bitmap->behind_writes)
> - < mddev->bitmap_info.max_write_behind) &&
> - !waitqueue_active(&bitmap->behind_wait))
> - alloc_behind_pages(bio, r1_bio);
> + sectors_handled = r1_bio->sector + max_sectors - bio->bi_sector;
>
> atomic_set(&r1_bio->remaining, 1);
> atomic_set(&r1_bio->behind_remaining, 0);
>
> - bitmap_startwrite(bitmap, bio->bi_sector, r1_bio->sectors,
> - test_bit(R1BIO_BehindIO, &r1_bio->state));
> + first_clone = 1;
> for (i = 0; i < disks; i++) {
> struct bio *mbio;
> if (!r1_bio->bios[i])
> continue;
>
> mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
> - r1_bio->bios[i] = mbio;
> -
> - mbio->bi_sector = r1_bio->sector + conf->mirrors[i].rdev->data_offset;
> - mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
> - mbio->bi_end_io = raid1_end_write_request;
> - mbio->bi_rw = WRITE | do_flush_fua | do_sync;
> - mbio->bi_private = r1_bio;
> -
> + md_trim_bio(mbio, r1_bio->sector - bio->bi_sector, max_sectors);
> +
> + if (first_clone) {
> + /* do behind I/O ?
> + * Not if there are too many, or cannot
> + * allocate memory, or a reader on WriteMostly
> + * is waiting for behind writes to flush */
> + if (bitmap &&
> + (atomic_read(&bitmap->behind_writes)
> + < mddev->bitmap_info.max_write_behind) &&
> + !waitqueue_active(&bitmap->behind_wait))
> + alloc_behind_pages(mbio, r1_bio);
> +
> + bitmap_startwrite(bitmap, r1_bio->sector,
> + r1_bio->sectors,
> + test_bit(R1BIO_BehindIO,
> + &r1_bio->state));
> + first_clone = 0;
> + }
> if (r1_bio->behind_pages) {
> struct bio_vec *bvec;
> int j;
> @@ -1003,6 +1059,15 @@ read_again:
> atomic_inc(&r1_bio->behind_remaining);
> }
>
> + r1_bio->bios[i] = mbio;
> +
> + mbio->bi_sector = (r1_bio->sector +
> + conf->mirrors[i].rdev->data_offset);
> + mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
> + mbio->bi_end_io = raid1_end_write_request;
> + mbio->bi_rw = WRITE | do_flush_fua | do_sync;
> + mbio->bi_private = r1_bio;
> +
> atomic_inc(&r1_bio->remaining);
> spin_lock_irqsave(&conf->device_lock, flags);
> bio_list_add(&conf->pending_bio_list, mbio);
> @@ -1013,6 +1078,19 @@ read_again:
> /* In case raid1d snuck in to freeze_array */
> wake_up(&conf->wait_barrier);
>
> + if (sectors_handled < (bio->bi_size >> 9)) {
> + /* We need another r1_bio. It has already been counted
> + * in bio->bi_phys_segments
> + */
> + r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
> + r1_bio->master_bio = bio;
> + r1_bio->sectors = (bio->bi_size >> 9) - sectors_handled;
> + r1_bio->state = 0;
> + r1_bio->mddev = mddev;
> + r1_bio->sector = bio->bi_sector + sectors_handled;
> + goto retry_write;
> + }
> +
> if (do_sync || !bitmap || !plugged)
> md_wakeup_thread(mddev->thread);
>
>
>
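
To make the clamping in the hunk above easier to follow, here is a
freestanding sketch of the same arithmetic. The helper name and
simplified types are stand-ins, not code from the patch:

#include <stdio.h>

typedef unsigned long long sector_t;

/* How many sectors of [sector, sector + max) this r1_bio may cover,
 * given one known bad range [first_bad, first_bad + bad_sectors) on
 * the device being considered.
 */
static sector_t clamp_to_bad_range(sector_t sector, sector_t max,
				   sector_t first_bad, sector_t bad_sectors)
{
	if (first_bad <= sector) {
		/* The write starts inside the bad range: this device is
		 * skipped, and no device may write past the end of the
		 * range, so later pieces stay aligned across mirrors. */
		sector_t bad_left = bad_sectors - (sector - first_bad);
		return bad_left < max ? bad_left : max;
	}
	/* The bad range starts later: stop this piece just before it. */
	return (first_bad - sector) < max ? (first_bad - sector) : max;
}

int main(void)
{
	/* 64-sector write at 100 against bad range [120, 136) */
	printf("%llu\n", clamp_to_bad_range(100, 64, 120, 16)); /* 20 */
	/* same write starting at 124, inside the bad range */
	printf("%llu\n", clamp_to_bad_range(124, 64, 120, 16)); /* 12 */
	return 0;
}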

Re: [md PATCH 10/36] md/raid1: avoid writing to known-bad blocks on known-bad drives.

am 27.07.2011 06:19:41 von NeilBrown

On Wed, 27 Jul 2011 13:09:07 +0900 Namhyung Kim wrote:

> NeilBrown writes:
>
> > If we have seen any write error on a drive, then don't write to
> > any known-bad blocks on that drive.
> > If necessary, we divide the write request up into pieces just
> > like we do for reads, so each piece is either all written or
> > all not written to any given drive.
> >
> > Signed-off-by: NeilBrown
>
> Reviewed-by: Namhyung Kim
>
> and a nit below
>

> > - } else
> > - r1_bio->bios[i] = NULL;
> > + if (is_bad) {
> > + int good_sectors = first_bad - r1_bio->sector;
> > + if (good_sectors < max_sectors)
> > + max_sectors = good_sectors;
> > + }
> > + }
> > + r1_bio->bios[i] = bio;
> > + targets++;
>
> Looks like variable 'targets' is not needed anymore.
>
>

Thanks. I've removed it.

NeilBrown


Re: [md PATCH 11/36] md/raid1: clear bad-block record when write succeeds.

am 27.07.2011 07:05:33 von Namhyung Kim

NeilBrown writes:

> If we succeed in writing to a block that was recorded as
> being bad, we clear the bad-block record.
>
> This requires some delayed handling as the bad-block-list update has
> to happen in process-context.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim
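
The delayed handling mentioned in the changelog is the usual split
between a completion path that must not sleep and a daemon thread that
may. A userspace analogue, purely illustrative with pthreads standing
in for the raid1d wakeup, might look like:

/* Illustrative only, not kernel code: the completion path just marks
 * the r1_bio and wakes the daemon; the daemon updates the bad-block
 * list from process context, where it may sleep.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t kick = PTHREAD_COND_INITIALIZER;
static int made_good;	/* stands in for the R1BIO_MadeGood state bit */

/* completion path: must not sleep, so only record and wake the daemon */
static void end_write_request(void)
{
	pthread_mutex_lock(&lock);
	made_good = 1;
	pthread_cond_signal(&kick);
	pthread_mutex_unlock(&lock);
}

/* daemon thread: safe to take sleeping locks and update the log here */
static void *daemon_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!made_good)
		pthread_cond_wait(&kick, &lock);
	pthread_mutex_unlock(&lock);
	printf("clearing bad-block record in process context\n");
	return NULL;
}

int main(void)
{
	pthread_t tid;
	pthread_create(&tid, NULL, daemon_thread, NULL);
	end_write_request();
	pthread_join(tid, NULL);
	return 0;
}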

Re: [md PATCH 12/36] md/raid1: store behind-write pages in bi_vecs.

am 27.07.2011 17:16:56 von Namhyung Kim

NeilBrown writes:

> When performing write-behind we allocate pages to store the data
> during write.
> Previously we just kept a list of pages. Now we keep a list of
> bi_vecs, each of which includes an offset and size.
> This means that the r1bio has complete information to create a new
> bio which will be needed for retrying after write errors.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim
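
A freestanding sketch of the data-shape change being described, with
simplified stand-in types rather than the kernel's struct bio_vec:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct page { unsigned char data[4096]; };

/* simplified stand-in for the kernel's struct bio_vec */
struct bvec {
	struct page  *bv_page;
	unsigned int  bv_len;
	unsigned int  bv_offset;
};

int main(void)
{
	struct page *pg = malloc(sizeof(*pg));

	/* Before: only the page pointer was remembered.  After: offset
	 * and length survive too, so a retry bio can be rebuilt to
	 * cover exactly the bytes of the original write. */
	struct bvec vec = { .bv_page = pg, .bv_len = 512, .bv_offset = 1024 };

	memset(vec.bv_page->data + vec.bv_offset, 0xab, vec.bv_len);
	printf("retry would cover %u bytes at offset %u\n",
	       vec.bv_len, vec.bv_offset);
	free(pg);
	return 0;
}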

Re: [md PATCH 13/36] md/raid1: Handle write errors by updating badblock log.

am 27.07.2011 17:28:57 von Namhyung Kim

NeilBrown writes:

> When we get a write error (in the data area, not in metadata),
> update the badblock log rather than failing the whole device.
>
> As the write may well be many blocks, we try writing each
> block individually and only log the ones which fail.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim
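
The narrowing strategy the changelog describes, as a freestanding
sketch in which every name is hypothetical:

#include <stdio.h>

typedef unsigned long long sector_t;

/* stand-in for the real block write; here sectors 3 and 4 keep failing */
static int write_one_block(sector_t s)
{
	return s != 3 && s != 4;
}

int main(void)
{
	sector_t start = 0, nr_sectors = 8, s;
	int all_ok = 1;

	/* the whole-range write failed; narrow it down block by block */
	for (s = start; s < start + nr_sectors; s++)
		if (!write_one_block(s)) {
			printf("bad block logged at sector %llu\n", s);
			all_ok = 0;
		}

	puts(all_ok ? "all sectors rewritten"
		    : "only the failing sectors were logged");
	return 0;
}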

Re: [md PATCH 14/36] md/raid1: record badblocks found during resync etc.

am 27.07.2011 17:39:36 von Namhyung Kim

NeilBrown writes:

> If we find a bad block while writing as part of resync/recovery we
> need to report that back to raid1d, which must record the bad block
> or fail the device.
>
> Similarly when fixing a read error, a further error should just
> record a bad block if possible rather than failing the device.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim
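
The fallback this patch establishes reduces to a small idiom: try to
record the bad block, and only fail the device when recording is
impossible. A freestanding sketch with stand-in names and a
deliberately tiny log:

#include <stdio.h>

#define BB_CAPACITY 2		/* tiny log to force the fallback */

static int bb_count;

/* returns nonzero on success, zero when the log cannot take the entry */
static int set_badblock(unsigned long long sector)
{
	if (bb_count >= BB_CAPACITY)
		return 0;
	bb_count++;
	printf("recorded bad block at sector %llu\n", sector);
	return 1;
}

int main(void)
{
	unsigned long long failed[] = { 10, 20, 30 };
	int i;

	for (i = 0; i < 3; i++)
		if (!set_badblock(failed[i]))
			/* only when recording fails is the whole
			 * device given up on, as before the series */
			printf("log full: failing the device\n");
	return 0;
}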

Re: [md PATCH 15/36] md/raid1: improve handling of read failure during recovery.

am 27.07.2011 17:45:15 von Namhyung Kim

NeilBrown writes:

> If we cannot read a block from anywhere during recovery, there is
> now a better approach than just giving up.
> We can record a bad block on each device and keep going - being
> careful not to clear the bad block if a later write succeeds (as it
> might), since that write will be of incorrect data.
>
> We have now reached the state where - for raid1 - we only call
> md_error if md_set_badblocks has failed.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim

Re: [md PATCH 16/36] md/raid1: factor several functions out of raid1d()

am 27.07.2011 17:55:41 von Namhyung Kim

NeilBrown writes:

> raid1d is too big with several deep branches.
> So separate them out into their own functions.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim

with some whitespace changes below..


> ---
>
> drivers/md/raid1.c | 318 ++++++++++++++++++++++++++--------------------------
> 1 files changed, 159 insertions(+), 159 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 08ff21a..d7518dc 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1862,21 +1862,168 @@ static int narrow_write_error(r1bio_t *r1_bio, int i)
> return ok;
> }
>
> +static void handle_sync_write_finished(conf_t *conf, r1bio_t *r1_bio)
> +{
> + int m;
> + int s = r1_bio->sectors;
> + for (m = 0; m < conf->raid_disks ; m++) {
> + mdk_rdev_t *rdev = conf->mirrors[m].rdev;
> + struct bio *bio = r1_bio->bios[m];
> + if (bio->bi_end_io == NULL)
> + continue;
> + if (test_bit(BIO_UPTODATE, &bio->bi_flags) &&
> + test_bit(R1BIO_MadeGood, &r1_bio->state)) {
> + rdev_clear_badblocks(rdev,
> + r1_bio->sector,
> + r1_bio->sectors);
rdev_clear_badblocks(rdev, r1_bio->sector, s);

> + }
> + if (!test_bit(BIO_UPTODATE, &bio->bi_flags) &&
> + test_bit(R1BIO_WriteError, &r1_bio->state)) {
> + if (!rdev_set_badblocks(rdev,
> + r1_bio->sector,
> + r1_bio->sectors, 0))
if (!rdev_set_badblocks(rdev, r1_bio->sector, s, 0))

> + md_error(conf->mddev, rdev);
> + }
> + }
> + put_buf(r1_bio);
> + md_done_sync(conf->mddev, s, 1);
> +}
> +
> +static void handle_write_finished(conf_t *conf, r1bio_t *r1_bio)
> +{
> + int m;
> + for (m = 0; m < conf->raid_disks ; m++)
> + if (r1_bio->bios[m] == IO_MADE_GOOD) {
> + mdk_rdev_t *rdev = conf->mirrors[m].rdev;
> + rdev_clear_badblocks(rdev,
> + r1_bio->sector,
> + r1_bio->sectors);
> + rdev_dec_pending(rdev, conf->mddev);
> + } else if (r1_bio->bios[m] != NULL) {
> + /* This drive got a write error. We need to
> + * narrow down and record precise write
> + * errors.
> + */
> + if (!narrow_write_error(r1_bio, m)) {
> + md_error(conf->mddev,
> + conf->mirrors[m].rdev);
> + /* an I/O failed, we can't clear the bitmap */
> + set_bit(R1BIO_Degraded, &r1_bio->state);
> + }
> + rdev_dec_pending(conf->mirrors[m].rdev,
> + conf->mddev);
> + }
> + if (test_bit(R1BIO_WriteError, &r1_bio->state))
> + close_write(r1_bio);
> + raid_end_bio_io(r1_bio);
> +}
> +
> +static void handle_read_error(conf_t *conf, r1bio_t *r1_bio)
> +{
> + int disk;
> + int max_sectors;
> + mddev_t *mddev = conf->mddev;
> + struct bio *bio;
> + char b[BDEVNAME_SIZE];
> + mdk_rdev_t *rdev;
> +
> + clear_bit(R1BIO_ReadError, &r1_bio->state);
> + /* we got a read error. Maybe the drive is bad. Maybe just
> + * the block and we can fix it.
> + * We freeze all other IO, and try reading the block from
> + * other devices. When we find one, we re-write
> + * and check that this fixes the read error.
> + * This is all done synchronously while the array is
> + * frozen
> + */
> + if (mddev->ro == 0) {
> + freeze_array(conf);
> + fix_read_error(conf, r1_bio->read_disk,
> + r1_bio->sector,
> + r1_bio->sectors);
r1_bio->sector, r1_bio->sectors);

> + unfreeze_array(conf);
> + } else
> + md_error(mddev,
> + conf->mirrors[r1_bio->read_disk].rdev);
> +
> + bio = r1_bio->bios[r1_bio->read_disk];
> + bdevname(bio->bi_bdev, b);
> +read_more:
> + disk = read_balance(conf, r1_bio, &max_sectors);
> + if (disk == -1) {
> + printk(KERN_ALERT "md/raid1:%s: %s: unrecoverable I/O"
> + " read error for block %llu\n",
> + mdname(mddev), b,
> + (unsigned long long)r1_bio->sector);
> + raid_end_bio_io(r1_bio);
> + } else {
> + const unsigned long do_sync
> + = r1_bio->master_bio->bi_rw & REQ_SYNC;
> + if (bio) {
> + r1_bio->bios[r1_bio->read_disk] =
> + mddev->ro ? IO_BLOCKED : NULL;
> + bio_put(bio);
> + }
> + r1_bio->read_disk = disk;
> + bio = bio_clone_mddev(r1_bio->master_bio, GFP_NOIO, mddev);
> + md_trim_bio(bio,
> + r1_bio->sector - bio->bi_sector,
> + max_sectors);
md_trim_bio(bio, r1_bio->sector - bio->bi_sector, max_sectors);

Thanks.
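
The read-error recovery flow that moves into handle_read_error() boils
down to: quiesce the array, find a good copy on another mirror, and
rewrite it over the failing device. A freestanding sketch with
stand-in names:

#include <stdio.h>

#define NDISKS 3

/* stand-ins: disk 0 holds the unreadable copy, the others are good */
static int read_block(int disk)  { return disk != 0; }
static int write_block(int disk) { (void)disk; return 1; }

int main(void)
{
	int bad_disk = 0, d;

	/* freeze_array() would quiesce all other I/O at this point */
	for (d = 0; d < NDISKS; d++) {
		if (d == bad_disk || !read_block(d))
			continue;
		/* first good copy found: rewrite it over the bad sector */
		if (write_block(bad_disk))
			printf("read error fixed using disk %d\n", d);
		break;
	}
	/* unfreeze_array() would lift the barrier here */
	return 0;
}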

Re: [md PATCH 16/36] md/raid1: factor several functions out of raid1d()

am 28.07.2011 03:39:07 von NeilBrown

On Thu, 28 Jul 2011 00:55:41 +0900 Namhyung Kim wrote:

> NeilBrown writes:
>
> > raid1d is too big with several deep branches.
> > So separate them out into their own functions.
> >
> > Signed-off-by: NeilBrown
>
> Reviewed-by: Namhyung Kim
>
> with some whitespace changes below..

Thanks. I made those changes and a couple of other similar ones.

NeilBrown

