[md PATCH 00/34] md patches for 3.1 - part 1

[md PATCH 00/34] md patches for 3.1 - part 1

am 21.07.2011 04:32:24 von NeilBrown

As 3.0 is fast approaching it is long past time to publish my patch
queue for 3.1. Much of it has been in -next for a while and while
that enabled testing it doesn't really encourage review.

So this is the first of 2 patch-bombs I will be sending which comprise
all of the changes is (currently) plan to submit for 3.1.

This set is mostly ad-hoc bits and pieces. Code clean up and minor
bugs fixes and so forth. Probably the strongest theme is some
refactoring in raid5 to reduce code duplication between RAID5 and
RAID6. (almost) all similar code has been factored into common
routines.

All "Reviewed-by" lines gratefully accepted ... until early-ish next
week when I will submit that code to Linus. If anyone is keen to do
some review, the RAID5 changes is where I would be most happy to have
it focused. Remember: if something doesn't look right, it probably
isn't.

Some of these patches have been seen on the list already, but I
thought it best to simply include them all for completeness.

NeilBrown


---

Akinobu Mita (1):
md: use proper little-endian bitops

Christian Dietrich (1):
md/raid: use printk_ratelimited instead of printk_ratelimit

Jonathan Brassow (2):
MD bitmap: Revert DM dirty log hooks
MD: raid1 s/sysfs_notify_dirent/sysfs_notify_dirent_safe

Namhyung Kim (11):
md/raid10: move rdev->corrected_errors counting
md/raid5: move rdev->corrected_errors counting
md/raid1: move rdev->corrected_errors counting
md: get rid of unnecessary casts on page_address()
md: remove ro check in md_check_recovery()
md: introduce link/unlink_rdev() helpers
md/raid5: get rid of duplicated call to bio_data_dir()
md/raid5: use kmem_cache_zalloc()
md/raid10: share pages between read and write bio's during recovery
md/raid10: factor out common bio handling code
md/raid10: get rid of duplicated conditional expression

NeilBrown (19):
md/raid5: Avoid BUG caused by multiple failures.
md/raid10: Improve decision on whether to fail a device with a read error.
md/raid10: Make use of new recovery_disabled handling
md: change managed of recovery_disabled.
md/raid5: finalise new merged handle_stripe.
md/raid5: move some more common code into handle_stripe
md/raid5: move more common code into handle_stripe
md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6
md/raid5: unite fetch_block5 and fetch_block6
md/raid5: rearrange a test in fetch_block6.
md/raid5: move more code into common handle_stripe
md/raid5: Move code for finishing a reconstruction into handle_stripe.
md/raid5: move stripe_head_state and more code into handle_stripe.
md/raid5: add some more fields to stripe_head_state
md/raid5: unify stripe_head_state and r6_state
md/raid5: move common code into handle_stripe
md/raid5: replace sh->lock with an 'active' flag.
md/raid5: Protect some more code with ->device_lock.
md/raid5: Remove use of sh->lock in sync_request


drivers/md/bitmap.c | 137 +++-----
drivers/md/bitmap.h | 5
drivers/md/md.c | 77 ++---
drivers/md/md.h | 28 +-
drivers/md/raid1.c | 66 ++--
drivers/md/raid1.h | 6
drivers/md/raid10.c | 214 +++++++------
drivers/md/raid10.h | 5
drivers/md/raid5.c | 832 ++++++++++++++-------------------------------------
drivers/md/raid5.h | 45 +--
10 files changed, 505 insertions(+), 910 deletions(-)

--
Signature

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 02/34] md/raid10: factor out common bio handling code

am 21.07.2011 04:32:24 von NeilBrown

From: Namhyung Kim

When normal-write and sync-read/write bio completes, we should
find out the disk number the bio belongs to. Factor those common
code out to a separate function.

Signed-off-by: Namhyung Kim
Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 44 +++++++++++++++++++++++---------------------
1 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index d55ae12..e434f1e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -244,6 +244,23 @@ static inline void update_head_pos(int slot, r10bio_t *r10_bio)
r10_bio->devs[slot].addr + (r10_bio->sectors);
}

+/*
+ * Find the disk number which triggered given bio
+ */
+static int find_bio_disk(conf_t *conf, r10bio_t *r10_bio, struct bio *bio)
+{
+ int slot;
+
+ for (slot = 0; slot < conf->copies; slot++)
+ if (r10_bio->devs[slot].bio == bio)
+ break;
+
+ BUG_ON(slot == conf->copies);
+ update_head_pos(slot, r10_bio);
+
+ return r10_bio->devs[slot].devnum;
+}
+
static void raid10_end_read_request(struct bio *bio, int error)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -289,13 +306,10 @@ static void raid10_end_write_request(struct bio *bio, int error)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
r10bio_t *r10_bio = bio->bi_private;
- int slot, dev;
+ int dev;
conf_t *conf = r10_bio->mddev->private;

- for (slot = 0; slot < conf->copies; slot++)
- if (r10_bio->devs[slot].bio == bio)
- break;
- dev = r10_bio->devs[slot].devnum;
+ dev = find_bio_disk(conf, r10_bio, bio);

/*
* this branch is our 'one mirror IO has finished' event handler:
@@ -316,8 +330,6 @@ static void raid10_end_write_request(struct bio *bio, int error)
*/
set_bit(R10BIO_Uptodate, &r10_bio->state);

- update_head_pos(slot, r10_bio);
-
/*
*
* Let's see if all mirrored write operations have finished
@@ -1173,14 +1185,9 @@ static void end_sync_read(struct bio *bio, int error)
{
r10bio_t *r10_bio = bio->bi_private;
conf_t *conf = r10_bio->mddev->private;
- int i,d;
+ int d;

- for (i=0; icopies; i++)
- if (r10_bio->devs[i].bio == bio)
- break;
- BUG_ON(i == conf->copies);
- update_head_pos(i, r10_bio);
- d = r10_bio->devs[i].devnum;
+ d = find_bio_disk(conf, r10_bio, bio);

if (test_bit(BIO_UPTODATE, &bio->bi_flags))
set_bit(R10BIO_Uptodate, &r10_bio->state);
@@ -1211,18 +1218,13 @@ static void end_sync_write(struct bio *bio, int error)
r10bio_t *r10_bio = bio->bi_private;
mddev_t *mddev = r10_bio->mddev;
conf_t *conf = mddev->private;
- int i,d;
+ int d;

- for (i = 0; i < conf->copies; i++)
- if (r10_bio->devs[i].bio == bio)
- break;
- d = r10_bio->devs[i].devnum;
+ d = find_bio_disk(conf, r10_bio, bio);

if (!uptodate)
md_error(mddev, conf->mirrors[d].rdev);

- update_head_pos(i, r10_bio);
-
rdev_dec_pending(conf->mirrors[d].rdev, mddev);
while (atomic_dec_and_test(&r10_bio->remaining)) {
if (r10_bio->master_bio == NULL) {


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 03/34] md/raid10: share pages between read and write bio"sduring recovery

am 21.07.2011 04:32:24 von NeilBrown

From: Namhyung Kim

When performing a recovery, only first 2 slots in r10_bio are in use,
for read and write respectively. However all of pages in the write bio
are never used and just replaced to read bio's when the read completes.

Get rid of those unused pages and share read pages properly.

Signed-off-by: Namhyung Kim
Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 23 ++++++++++++-----------
1 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e434f1e..3715e22 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -123,7 +123,14 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)
for (j = 0 ; j < nalloc; j++) {
bio = r10_bio->devs[j].bio;
for (i = 0; i < RESYNC_PAGES; i++) {
- page = alloc_page(gfp_flags);
+ if (j == 1 && !test_bit(MD_RECOVERY_SYNC,
+ &conf->mddev->recovery)) {
+ /* we can share bv_page's during recovery */
+ struct bio *rbio = r10_bio->devs[0].bio;
+ page = rbio->bi_io_vec[i].bv_page;
+ get_page(page);
+ } else
+ page = alloc_page(gfp_flags);
if (unlikely(!page))
goto out_free_pages;

@@ -1360,20 +1367,14 @@ done:
static void recovery_request_write(mddev_t *mddev, r10bio_t *r10_bio)
{
conf_t *conf = mddev->private;
- int i, d;
- struct bio *bio, *wbio;
-
+ int d;
+ struct bio *wbio;

- /* move the pages across to the second bio
+ /*
+ * share the pages with the first bio
* and submit the write request
*/
- bio = r10_bio->devs[0].bio;
wbio = r10_bio->devs[1].bio;
- for (i=0; i < wbio->bi_vcnt; i++) {
- struct page *p = bio->bi_io_vec[i].bv_page;
- bio->bi_io_vec[i].bv_page = wbio->bi_io_vec[i].bv_page;
- wbio->bi_io_vec[i].bv_page = p;
- }
d = r10_bio->devs[1].devnum;

atomic_inc(&conf->mirrors[d].rdev->nr_pending);


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 01/34] md/raid10: get rid of duplicated conditionalexpression

am 21.07.2011 04:32:24 von NeilBrown

From: Namhyung Kim

Variable 'first' is initialized to zero and updated to @rdev->raid_disk
only if it is greater than 0. Thus condition '>= first' always implies
'>= 0' so the latter is not needed.

Signed-off-by: Namhyung Kim
Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 6e84668..d55ae12 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1093,8 +1093,7 @@ static int raid10_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
if (rdev->raid_disk >= 0)
first = last = rdev->raid_disk;

- if (rdev->saved_raid_disk >= 0 &&
- rdev->saved_raid_disk >= first &&
+ if (rdev->saved_raid_disk >= first &&
conf->mirrors[rdev->saved_raid_disk].rdev == NULL)
mirror = rdev->saved_raid_disk;
else


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 09/34] md/raid5: move common code into handle_stripe

am 21.07.2011 04:32:25 von NeilBrown

There is common code at the start of handle_stripe5 and
handle_stripe6. Move it into handle_stripe.

Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 18 ++++++------------
1 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f8275b5..dfb3d9f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3016,12 +3016,6 @@ static void handle_stripe5(struct stripe_head *sh)
atomic_read(&sh->count), sh->pd_idx, sh->check_state,
sh->reconstruct_state);

- if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
- set_bit(STRIPE_SYNCING, &sh->state);
- clear_bit(STRIPE_INSYNC, &sh->state);
- }
- clear_bit(STRIPE_DELAYED, &sh->state);
-
s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
@@ -3310,12 +3304,6 @@ static void handle_stripe6(struct stripe_head *sh)
sh->check_state, sh->reconstruct_state);
memset(&s, 0, sizeof(s));

- if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
- set_bit(STRIPE_SYNCING, &sh->state);
- clear_bit(STRIPE_INSYNC, &sh->state);
- }
- clear_bit(STRIPE_DELAYED, &sh->state);
-
s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
@@ -3607,6 +3595,12 @@ static void handle_stripe(struct stripe_head *sh)
return;
}

+ if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
+ set_bit(STRIPE_SYNCING, &sh->state);
+ clear_bit(STRIPE_INSYNC, &sh->state);
+ }
+ clear_bit(STRIPE_DELAYED, &sh->state);
+
if (sh->raid_conf->level == 6)
handle_stripe6(sh);
else


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 07/34] md/raid5: Protect some more code with ->device_lock.

am 21.07.2011 04:32:25 von NeilBrown

Other places that change or follow dev->towrite and dev->written take
the device_lock as well as the sh->lock.
So it should really be held in these places too.
Also, doing so will allow sh->lock to be discarded.

with merged fixes by: Namhyung Kim


Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 30 ++++++++++++++++--------------
1 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f2f2ab3..9985138 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1021,10 +1021,12 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
struct bio *wbi;

spin_lock(&sh->lock);
+ spin_lock_irq(&sh->raid_conf->device_lock);
chosen = dev->towrite;
dev->towrite = NULL;
BUG_ON(dev->written);
wbi = dev->written = chosen;
+ spin_unlock_irq(&sh->raid_conf->device_lock);
spin_unlock(&sh->lock);

while (wbi && wbi->bi_sector <
@@ -2141,7 +2143,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
raid5_conf_t *conf = sh->raid_conf;
int firstwrite=0;

- pr_debug("adding bh b#%llu to stripe s#%llu\n",
+ pr_debug("adding bi b#%llu to stripe s#%llu\n",
(unsigned long long)bi->bi_sector,
(unsigned long long)sh->sector);

@@ -2167,19 +2169,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
bi->bi_next = *bip;
*bip = bi;
bi->bi_phys_segments++;
- spin_unlock_irq(&conf->device_lock);
- spin_unlock(&sh->lock);
-
- pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
- (unsigned long long)bi->bi_sector,
- (unsigned long long)sh->sector, dd_idx);
-
- if (conf->mddev->bitmap && firstwrite) {
- bitmap_startwrite(conf->mddev->bitmap, sh->sector,
- STRIPE_SECTORS, 0);
- sh->bm_seq = conf->seq_flush+1;
- set_bit(STRIPE_BIT_DELAY, &sh->state);
- }

if (forwrite) {
/* check if page is covered */
@@ -2194,6 +2183,19 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
}
+ spin_unlock_irq(&conf->device_lock);
+ spin_unlock(&sh->lock);
+
+ pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
+ (unsigned long long)(*bip)->bi_sector,
+ (unsigned long long)sh->sector, dd_idx);
+
+ if (conf->mddev->bitmap && firstwrite) {
+ bitmap_startwrite(conf->mddev->bitmap, sh->sector,
+ STRIPE_SECTORS, 0);
+ sh->bm_seq = conf->seq_flush+1;
+ set_bit(STRIPE_BIT_DELAY, &sh->state);
+ }
return 1;

overlap:


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 11/34] md/raid5: add some more fields to stripe_head_state

am 21.07.2011 04:32:25 von NeilBrown

Adding these three fields will allow more common code to be moved
to handle_stripe()

Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 54 +++++++++++++++++++++++-----------------------------
drivers/md/raid5.h | 4 ++++
2 files changed, 28 insertions(+), 30 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c32ffb5..3327e82 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3003,12 +3003,9 @@ static void handle_stripe5(struct stripe_head *sh)
{
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks, i;
- struct bio *return_bi = NULL;
struct stripe_head_state s;
struct r5dev *dev;
- mdk_rdev_t *blocked_rdev = NULL;
int prexor;
- int dec_preread_active = 0;

memset(&s, 0, sizeof(s));
pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d check:%d "
@@ -3058,9 +3055,9 @@ static void handle_stripe5(struct stripe_head *sh)
if (dev->written)
s.written++;
rdev = rcu_dereference(conf->disks[i].rdev);
- if (blocked_rdev == NULL &&
+ if (s.blocked_rdev == NULL &&
rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
- blocked_rdev = rdev;
+ s.blocked_rdev = rdev;
atomic_inc(&rdev->nr_pending);
}
clear_bit(R5_Insync, &dev->flags);
@@ -3088,15 +3085,15 @@ static void handle_stripe5(struct stripe_head *sh)
spin_unlock_irq(&conf->device_lock);
rcu_read_unlock();

- if (unlikely(blocked_rdev)) {
+ if (unlikely(s.blocked_rdev)) {
if (s.syncing || s.expanding || s.expanded ||
s.to_write || s.written) {
set_bit(STRIPE_HANDLE, &sh->state);
goto unlock;
}
/* There is nothing for the blocked_rdev to block */
- rdev_dec_pending(blocked_rdev, conf->mddev);
- blocked_rdev = NULL;
+ rdev_dec_pending(s.blocked_rdev, conf->mddev);
+ s.blocked_rdev = NULL;
}

if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
@@ -3112,7 +3109,7 @@ static void handle_stripe5(struct stripe_head *sh)
* need to be failed
*/
if (s.failed > 1 && s.to_read+s.to_write+s.written)
- handle_failed_stripe(conf, sh, &s, disks, &return_bi);
+ handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
if (s.failed > 1 && s.syncing) {
md_done_sync(conf->mddev, STRIPE_SECTORS,0);
clear_bit(STRIPE_SYNCING, &sh->state);
@@ -3128,7 +3125,7 @@ static void handle_stripe5(struct stripe_head *sh)
!test_bit(R5_LOCKED, &dev->flags) &&
test_bit(R5_UPTODATE, &dev->flags)) ||
(s.failed == 1 && s.failed_num[0] == sh->pd_idx)))
- handle_stripe_clean_event(conf, sh, disks, &return_bi);
+ handle_stripe_clean_event(conf, sh, disks, &s.return_bi);

/* Now we might consider reading some blocks, either to check/generate
* parity, or to satisfy requests
@@ -3166,7 +3163,7 @@ static void handle_stripe5(struct stripe_head *sh)
}
}
if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
- dec_preread_active = 1;
+ s.dec_preread_active = 1;
}

/* Now to consider new write requests and what else, if anything
@@ -3264,15 +3261,15 @@ static void handle_stripe5(struct stripe_head *sh)
unlock:

/* wait for this device to become unblocked */
- if (unlikely(blocked_rdev))
- md_wait_for_blocked_rdev(blocked_rdev, conf->mddev);
+ if (unlikely(s.blocked_rdev))
+ md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);

if (s.ops_request)
raid_run_ops(sh, s.ops_request);

ops_run_io(sh, &s);

- if (dec_preread_active) {
+ if (s.dec_preread_active) {
/* We delay this until after ops_run_io so that if make_request
* is waiting on a flush, it won't continue until the writes
* have actually been submitted.
@@ -3282,19 +3279,16 @@ static void handle_stripe5(struct stripe_head *sh)
IO_THRESHOLD)
md_wakeup_thread(conf->mddev->thread);
}
- return_io(return_bi);
+ return_io(s.return_bi);
}

static void handle_stripe6(struct stripe_head *sh)
{
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks;
- struct bio *return_bi = NULL;
int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx;
struct stripe_head_state s;
struct r5dev *dev, *pdev, *qdev;
- mdk_rdev_t *blocked_rdev = NULL;
- int dec_preread_active = 0;

pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
"pd_idx=%d, qd_idx=%d\n, check:%d, reconstruct:%d\n",
@@ -3345,9 +3339,9 @@ static void handle_stripe6(struct stripe_head *sh)
if (dev->written)
s.written++;
rdev = rcu_dereference(conf->disks[i].rdev);
- if (blocked_rdev == NULL &&
+ if (s.blocked_rdev == NULL &&
rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
- blocked_rdev = rdev;
+ s.blocked_rdev = rdev;
atomic_inc(&rdev->nr_pending);
}
clear_bit(R5_Insync, &dev->flags);
@@ -3376,15 +3370,15 @@ static void handle_stripe6(struct stripe_head *sh)
spin_unlock_irq(&conf->device_lock);
rcu_read_unlock();

- if (unlikely(blocked_rdev)) {
+ if (unlikely(s.blocked_rdev)) {
if (s.syncing || s.expanding || s.expanded ||
s.to_write || s.written) {
set_bit(STRIPE_HANDLE, &sh->state);
goto unlock;
}
/* There is nothing for the blocked_rdev to block */
- rdev_dec_pending(blocked_rdev, conf->mddev);
- blocked_rdev = NULL;
+ rdev_dec_pending(s.blocked_rdev, conf->mddev);
+ s.blocked_rdev = NULL;
}

if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
@@ -3400,7 +3394,7 @@ static void handle_stripe6(struct stripe_head *sh)
* might need to be failed
*/
if (s.failed > 2 && s.to_read+s.to_write+s.written)
- handle_failed_stripe(conf, sh, &s, disks, &return_bi);
+ handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
if (s.failed > 2 && s.syncing) {
md_done_sync(conf->mddev, STRIPE_SECTORS,0);
clear_bit(STRIPE_SYNCING, &sh->state);
@@ -3425,7 +3419,7 @@ static void handle_stripe6(struct stripe_head *sh)
(s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
&& !test_bit(R5_LOCKED, &qdev->flags)
&& test_bit(R5_UPTODATE, &qdev->flags)))))
- handle_stripe_clean_event(conf, sh, disks, &return_bi);
+ handle_stripe_clean_event(conf, sh, disks, &s.return_bi);

/* Now we might consider reading some blocks, either to check/generate
* parity, or to satisfy requests
@@ -3461,7 +3455,7 @@ static void handle_stripe6(struct stripe_head *sh)
}
}
if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
- dec_preread_active = 1;
+ s.dec_preread_active = 1;
}

/* Now to consider new write requests and what else, if anything
@@ -3561,8 +3555,8 @@ static void handle_stripe6(struct stripe_head *sh)
unlock:

/* wait for this device to become unblocked */
- if (unlikely(blocked_rdev))
- md_wait_for_blocked_rdev(blocked_rdev, conf->mddev);
+ if (unlikely(s.blocked_rdev))
+ md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);

if (s.ops_request)
raid_run_ops(sh, s.ops_request);
@@ -3570,7 +3564,7 @@ static void handle_stripe6(struct stripe_head *sh)
ops_run_io(sh, &s);


- if (dec_preread_active) {
+ if (s.dec_preread_active) {
/* We delay this until after ops_run_io so that if make_request
* is waiting on a flush, it won't continue until the writes
* have actually been submitted.
@@ -3581,7 +3575,7 @@ static void handle_stripe6(struct stripe_head *sh)
md_wakeup_thread(conf->mddev->thread);
}

- return_io(return_bi);
+ return_io(s.return_bi);
}

static void handle_stripe(struct stripe_head *sh)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index d3c61d3..9ceb574 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -248,6 +248,10 @@ struct stripe_head_state {
int failed_num[2];
unsigned long ops_request;
int p_failed, q_failed;
+
+ struct bio *return_bi;
+ mdk_rdev_t *blocked_rdev;
+ int dec_preread_active;
};

/* Flags */


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 05/34] md/raid5: get rid of duplicated call tobio_data_dir()

am 21.07.2011 04:32:25 von NeilBrown

From: Namhyung Kim

In raid5::make_request(), once bio_data_dir(@bi) is detected
it never (and couldn't) be changed. Use the result always.

Signed-off-by: Namhyung Kim
Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0f71aa9..7148064 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4014,7 +4014,7 @@ static int make_request(mddev_t *mddev, struct bio * bi)
}
}

- if (bio_data_dir(bi) == WRITE &&
+ if (rw == WRITE &&
logical_sector >= mddev->suspend_lo &&
logical_sector < mddev->suspend_hi) {
release_stripe(sh);
@@ -4032,7 +4032,7 @@ static int make_request(mddev_t *mddev, struct bio * bi)
}

if (test_bit(STRIPE_EXPANDING, &sh->state) ||
- !add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
+ !add_stripe_bio(sh, bi, dd_idx, rw)) {
/* Stripe is busy expanding or
* add failed due to overlap. Flush everything
* and wait a while


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 08/34] md/raid5: replace sh->lock with an "active" flag.

am 21.07.2011 04:32:25 von NeilBrown

sh->lock is now mainly used to ensure that two threads aren't running
in the locked part of handle_stripe[56] at the same time.

That can more neatly be achieved with an 'active' flag which we set
while running handle_stripe. If we find the flag is set, we simply
requeue the stripe for later by setting STRIPE_HANDLE.

For safety we take ->device_lock while examining the state of the
stripe and creating a summary in 'stripe_head_state / r6_state'.
This possibly isn't needed but as shared fields like ->toread,
->towrite are checked it is safer for now at least.

We leave the label after the old 'unlock' called "unlock" because it
will disappear in a few patches, so renaming seems pointless.

This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
later, but that is not a problem.

Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 26 +++++++++++++-------------
drivers/md/raid5.h | 2 +-
2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9985138..f8275b5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1020,14 +1020,12 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
if (test_and_clear_bit(R5_Wantdrain, &dev->flags)) {
struct bio *wbi;

- spin_lock(&sh->lock);
spin_lock_irq(&sh->raid_conf->device_lock);
chosen = dev->towrite;
dev->towrite = NULL;
BUG_ON(dev->written);
wbi = dev->written = chosen;
spin_unlock_irq(&sh->raid_conf->device_lock);
- spin_unlock(&sh->lock);

while (wbi && wbi->bi_sector <
dev->sector + STRIPE_SECTORS) {
@@ -1322,7 +1320,6 @@ static int grow_one_stripe(raid5_conf_t *conf)
return 0;

sh->raid_conf = conf;
- spin_lock_init(&sh->lock);
#ifdef CONFIG_MULTICORE_RAID456
init_waitqueue_head(&sh->ops.wait_for_ops);
#endif
@@ -1442,7 +1439,6 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
break;

nsh->raid_conf = conf;
- spin_lock_init(&nsh->lock);
#ifdef CONFIG_MULTICORE_RAID456
init_waitqueue_head(&nsh->ops.wait_for_ops);
#endif
@@ -2148,7 +2144,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
(unsigned long long)sh->sector);


- spin_lock(&sh->lock);
spin_lock_irq(&conf->device_lock);
if (forwrite) {
bip = &sh->dev[dd_idx].towrite;
@@ -2184,7 +2179,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
}
spin_unlock_irq(&conf->device_lock);
- spin_unlock(&sh->lock);

pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
(unsigned long long)(*bip)->bi_sector,
@@ -2201,7 +2195,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
overlap:
set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
spin_unlock_irq(&conf->device_lock);
- spin_unlock(&sh->lock);
return 0;
}

@@ -3023,12 +3016,10 @@ static void handle_stripe5(struct stripe_head *sh)
atomic_read(&sh->count), sh->pd_idx, sh->check_state,
sh->reconstruct_state);

- spin_lock(&sh->lock);
if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
set_bit(STRIPE_SYNCING, &sh->state);
clear_bit(STRIPE_INSYNC, &sh->state);
}
- clear_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);

s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
@@ -3037,6 +3028,7 @@ static void handle_stripe5(struct stripe_head *sh)

/* Now to look around and see what can be done */
rcu_read_lock();
+ spin_lock_irq(&conf->device_lock);
for (i=disks; i--; ) {
mdk_rdev_t *rdev;

@@ -3099,6 +3091,7 @@ static void handle_stripe5(struct stripe_head *sh)
s.failed_num = i;
}
}
+ spin_unlock_irq(&conf->device_lock);
rcu_read_unlock();

if (unlikely(blocked_rdev)) {
@@ -3275,7 +3268,6 @@ static void handle_stripe5(struct stripe_head *sh)
handle_stripe_expansion(conf, sh, NULL);

unlock:
- spin_unlock(&sh->lock);

/* wait for this device to become unblocked */
if (unlikely(blocked_rdev))
@@ -3318,12 +3310,10 @@ static void handle_stripe6(struct stripe_head *sh)
sh->check_state, sh->reconstruct_state);
memset(&s, 0, sizeof(s));

- spin_lock(&sh->lock);
if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
set_bit(STRIPE_SYNCING, &sh->state);
clear_bit(STRIPE_INSYNC, &sh->state);
}
- clear_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);

s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
@@ -3332,6 +3322,7 @@ static void handle_stripe6(struct stripe_head *sh)
/* Now to look around and see what can be done */

rcu_read_lock();
+ spin_lock_irq(&conf->device_lock);
for (i=disks; i--; ) {
mdk_rdev_t *rdev;
dev = &sh->dev[i];
@@ -3395,6 +3386,7 @@ static void handle_stripe6(struct stripe_head *sh)
s.failed++;
}
}
+ spin_unlock_irq(&conf->device_lock);
rcu_read_unlock();

if (unlikely(blocked_rdev)) {
@@ -3580,7 +3572,6 @@ static void handle_stripe6(struct stripe_head *sh)
handle_stripe_expansion(conf, sh, &r6s);

unlock:
- spin_unlock(&sh->lock);

/* wait for this device to become unblocked */
if (unlikely(blocked_rdev))
@@ -3608,10 +3599,19 @@ static void handle_stripe6(struct stripe_head *sh)

static void handle_stripe(struct stripe_head *sh)
{
+ clear_bit(STRIPE_HANDLE, &sh->state);
+ if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
+ /* already being handled, ensure it gets handled
+ * again when current action finishes */
+ set_bit(STRIPE_HANDLE, &sh->state);
+ return;
+ }
+
if (sh->raid_conf->level == 6)
handle_stripe6(sh);
else
handle_stripe5(sh);
+ clear_bit(STRIPE_ACTIVE, &sh->state);
}

static void raid5_activate_delayed(raid5_conf_t *conf)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index a330011..217a9d4 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -209,7 +209,6 @@ struct stripe_head {
short ddf_layout;/* use DDF ordering to calculate Q */
unsigned long state; /* state flags */
atomic_t count; /* nr of active thread/requests */
- spinlock_t lock;
int bm_seq; /* sequence number for bitmap flushes */
int disks; /* disks in stripe */
enum check_states check_state;
@@ -290,6 +289,7 @@ struct r6_state {
* Stripe state
*/
enum {
+ STRIPE_ACTIVE,
STRIPE_HANDLE,
STRIPE_SYNC_REQUESTED,
STRIPE_SYNCING,


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 19/34] md/raid5: move some more common code intohandle_stripe

am 21.07.2011 04:32:26 von NeilBrown

The RAID6 version of this code is usable for RAID5 providing:
- we test "conf->max_degraded" rather than "2" as appropriate
- we make sure s->failed_num[1] is meaningful (and not '-1')
when s->failed > 1

The 'return 1' must become 'goto finish' in the new location.

Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 178 +++++++++++++++++++---------------------------------
1 files changed, 66 insertions(+), 112 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ba6f892..f23848f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2969,63 +2969,14 @@ static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
if (test_bit(R5_ReadError, &dev->flags))
clear_bit(R5_Insync, &dev->flags);
if (!test_bit(R5_Insync, &dev->flags)) {
+ if (s->failed < 2)
+ s->failed_num[s->failed] = i;
s->failed++;
- s->failed_num[0] = i;
}
}
spin_unlock_irq(&conf->device_lock);
rcu_read_unlock();

- if (unlikely(s->blocked_rdev)) {
- if (s->syncing || s->expanding || s->expanded ||
- s->to_write || s->written) {
- set_bit(STRIPE_HANDLE, &sh->state);
- return 1;
- }
- /* There is nothing for the blocked_rdev to block */
- rdev_dec_pending(s->blocked_rdev, conf->mddev);
- s->blocked_rdev = NULL;
- }
-
- if (s->to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
- set_bit(STRIPE_OP_BIOFILL, &s->ops_request);
- set_bit(STRIPE_BIOFILL_RUN, &sh->state);
- }
-
- pr_debug("locked=%d uptodate=%d to_read=%d"
- " to_write=%d failed=%d failed_num=%d\n",
- s->locked, s->uptodate, s->to_read, s->to_write,
- s->failed, s->failed_num[0]);
- /* check if the array has lost two devices and, if so, some requests might
- * need to be failed
- */
- if (s->failed > 1 && s->to_read+s->to_write+s->written)
- handle_failed_stripe(conf, sh, s, disks, &s->return_bi);
- if (s->failed > 1 && s->syncing) {
- md_done_sync(conf->mddev, STRIPE_SECTORS,0);
- clear_bit(STRIPE_SYNCING, &sh->state);
- s->syncing = 0;
- }
-
- /* might be able to return some write requests if the parity block
- * is safe, or on a failed drive
- */
- dev = &sh->dev[sh->pd_idx];
- if (s->written &&
- ((test_bit(R5_Insync, &dev->flags) &&
- !test_bit(R5_LOCKED, &dev->flags) &&
- test_bit(R5_UPTODATE, &dev->flags)) ||
- (s->failed == 1 && s->failed_num[0] == sh->pd_idx)))
- handle_stripe_clean_event(conf, sh, disks, &s->return_bi);
-
- /* Now we might consider reading some blocks, either to check/generate
- * parity, or to satisfy requests
- * or to load a block that is being partially written.
- */
- if (s->to_read || s->non_overwrite ||
- (s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
- handle_stripe_fill(sh, s, disks);
-
return 0;
}

@@ -3033,8 +2984,8 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
{
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks;
- int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx;
- struct r5dev *dev, *pdev, *qdev;
+ struct r5dev *dev;
+ int i;

/* Now to look around and see what can be done */

@@ -3108,65 +3059,6 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
spin_unlock_irq(&conf->device_lock);
rcu_read_unlock();

- if (unlikely(s->blocked_rdev)) {
- if (s->syncing || s->expanding || s->expanded ||
- s->to_write || s->written) {
- set_bit(STRIPE_HANDLE, &sh->state);
- return 1;
- }
- /* There is nothing for the blocked_rdev to block */
- rdev_dec_pending(s->blocked_rdev, conf->mddev);
- s->blocked_rdev = NULL;
- }
-
- if (s->to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
- set_bit(STRIPE_OP_BIOFILL, &s->ops_request);
- set_bit(STRIPE_BIOFILL_RUN, &sh->state);
- }
-
- pr_debug("locked=%d uptodate=%d to_read=%d"
- " to_write=%d failed=%d failed_num=%d,%d\n",
- s->locked, s->uptodate, s->to_read, s->to_write, s->failed,
- s->failed_num[0], s->failed_num[1]);
- /* check if the array has lost >2 devices and, if so, some requests
- * might need to be failed
- */
- if (s->failed > 2 && s->to_read+s->to_write+s->written)
- handle_failed_stripe(conf, sh, s, disks, &s->return_bi);
- if (s->failed > 2 && s->syncing) {
- md_done_sync(conf->mddev, STRIPE_SECTORS,0);
- clear_bit(STRIPE_SYNCING, &sh->state);
- s->syncing = 0;
- }
-
- /*
- * might be able to return some write requests if the parity blocks
- * are safe, or on a failed drive
- */
- pdev = &sh->dev[pd_idx];
- s->p_failed = (s->failed >= 1 && s->failed_num[0] == pd_idx)
- || (s->failed >= 2 && s->failed_num[1] == pd_idx);
- qdev = &sh->dev[qd_idx];
- s->q_failed = (s->failed >= 1 && s->failed_num[0] == qd_idx)
- || (s->failed >= 2 && s->failed_num[1] == qd_idx);
-
- if (s->written &&
- (s->p_failed || ((test_bit(R5_Insync, &pdev->flags)
- && !test_bit(R5_LOCKED, &pdev->flags)
- && test_bit(R5_UPTODATE, &pdev->flags)))) &&
- (s->q_failed || ((test_bit(R5_Insync, &qdev->flags)
- && !test_bit(R5_LOCKED, &qdev->flags)
- && test_bit(R5_UPTODATE, &qdev->flags)))))
- handle_stripe_clean_event(conf, sh, disks, &s->return_bi);
-
- /* Now we might consider reading some blocks, either to check/generate
- * parity, or to satisfy requests
- * or to load a block that is being partially written.
- */
- if (s->to_read || s->non_overwrite || (s->to_write && s->failed) ||
- (s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
- handle_stripe_fill(sh, s, disks);
-
return 0;
}

@@ -3178,6 +3070,7 @@ static void handle_stripe(struct stripe_head *sh)
int i;
int prexor;
int disks = sh->disks;
+ struct r5dev *pdev, *qdev;

clear_bit(STRIPE_HANDLE, &sh->state);
if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
@@ -3214,6 +3107,67 @@ static void handle_stripe(struct stripe_head *sh)
if (done)
goto finish;

+ if (unlikely(s.blocked_rdev)) {
+ if (s.syncing || s.expanding || s.expanded ||
+ s.to_write || s.written) {
+ set_bit(STRIPE_HANDLE, &sh->state);
+ goto finish;
+ }
+ /* There is nothing for the blocked_rdev to block */
+ rdev_dec_pending(s.blocked_rdev, conf->mddev);
+ s.blocked_rdev = NULL;
+ }
+
+ if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
+ set_bit(STRIPE_OP_BIOFILL, &s.ops_request);
+ set_bit(STRIPE_BIOFILL_RUN, &sh->state);
+ }
+
+ pr_debug("locked=%d uptodate=%d to_read=%d"
+ " to_write=%d failed=%d failed_num=%d,%d\n",
+ s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
+ s.failed_num[0], s.failed_num[1]);
+ /* check if the array has lost >2 devices and, if so, some requests
+ * might need to be failed
+ */
+ if (s.failed > conf->max_degraded && s.to_read+s.to_write+s.written)
+ handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
+ if (s.failed > conf->max_degraded && s.syncing) {
+ md_done_sync(conf->mddev, STRIPE_SECTORS, 0);
+ clear_bit(STRIPE_SYNCING, &sh->state);
+ s.syncing = 0;
+ }
+
+ /*
+ * might be able to return some write requests if the parity blocks
+ * are safe, or on a failed drive
+ */
+ pdev = &sh->dev[sh->pd_idx];
+ s.p_failed = (s.failed >= 1 && s.failed_num[0] == sh->pd_idx)
+ || (s.failed >= 2 && s.failed_num[1] == sh->pd_idx);
+ qdev = &sh->dev[sh->qd_idx];
+ s.q_failed = (s.failed >= 1 && s.failed_num[0] == sh->qd_idx)
+ || (s.failed >= 2 && s.failed_num[1] == sh->qd_idx)
+ || conf->level < 6;
+
+ if (s.written &&
+ (s.p_failed || ((test_bit(R5_Insync, &pdev->flags)
+ && !test_bit(R5_LOCKED, &pdev->flags)
+ && test_bit(R5_UPTODATE, &pdev->flags)))) &&
+ (s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
+ && !test_bit(R5_LOCKED, &qdev->flags)
+ && test_bit(R5_UPTODATE, &qdev->flags)))))
+ handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
+
+ /* Now we might consider reading some blocks, either to check/generate
+ * parity, or to satisfy requests
+ * or to load a block that is being partially written.
+ */
+ if (s.to_read || s.non_overwrite
+ || (conf->level == 6 && s.to_write && s.failed)
+ || (s.syncing && (s.uptodate + s.compute < disks)) || s.expanding)
+ handle_stripe_fill(sh, &s, disks);
+
/* Now we check to see if any write operations have recently
* completed
*/


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 12/34] md/raid5: move stripe_head_state and more code intohandle_stripe.

am 21.07.2011 04:32:26 von NeilBrown

By defining the 'stripe_head_state' in 'handle_stripe', we can move
some common code out of handle_stripe[56]() and into handle_stripe.

The means that all accesses for stripe_head_state in handle_stripe[56]
need to be 's->' instead of 's.', but the compiler should inline
those functions and just use a direct stack reference, and future
patches while hoist most of this code up into handle_stripe()
so we will revert to "s.".

Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 340 ++++++++++++++++++++++++----------------------------
1 files changed, 158 insertions(+), 182 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3327e82..ce1c291 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2999,24 +2999,13 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
*
*/

-static void handle_stripe5(struct stripe_head *sh)
+static void handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
{
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks, i;
- struct stripe_head_state s;
struct r5dev *dev;
int prexor;

- memset(&s, 0, sizeof(s));
- pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d check:%d "
- "reconstruct:%d\n", (unsigned long long)sh->sector, sh->state,
- atomic_read(&sh->count), sh->pd_idx, sh->check_state,
- sh->reconstruct_state);
-
- s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
- s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
- s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
-
/* Now to look around and see what can be done */
rcu_read_lock();
spin_lock_irq(&conf->device_lock);
@@ -3039,25 +3028,28 @@ static void handle_stripe5(struct stripe_head *sh)
set_bit(R5_Wantfill, &dev->flags);

/* now count some things */
- if (test_bit(R5_LOCKED, &dev->flags)) s.locked++;
- if (test_bit(R5_UPTODATE, &dev->flags)) s.uptodate++;
- if (test_bit(R5_Wantcompute, &dev->flags)) s.compute++;
+ if (test_bit(R5_LOCKED, &dev->flags))
+ s->locked++;
+ if (test_bit(R5_UPTODATE, &dev->flags))
+ s->uptodate++;
+ if (test_bit(R5_Wantcompute, &dev->flags))
+ s->compute++;

if (test_bit(R5_Wantfill, &dev->flags))
- s.to_fill++;
+ s->to_fill++;
else if (dev->toread)
- s.to_read++;
+ s->to_read++;
if (dev->towrite) {
- s.to_write++;
+ s->to_write++;
if (!test_bit(R5_OVERWRITE, &dev->flags))
- s.non_overwrite++;
+ s->non_overwrite++;
}
if (dev->written)
- s.written++;
+ s->written++;
rdev = rcu_dereference(conf->disks[i].rdev);
- if (s.blocked_rdev == NULL &&
+ if (s->blocked_rdev == NULL &&
rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
- s.blocked_rdev = rdev;
+ s->blocked_rdev = rdev;
atomic_inc(&rdev->nr_pending);
}
clear_bit(R5_Insync, &dev->flags);
@@ -3078,62 +3070,62 @@ static void handle_stripe5(struct stripe_head *sh)
if (test_bit(R5_ReadError, &dev->flags))
clear_bit(R5_Insync, &dev->flags);
if (!test_bit(R5_Insync, &dev->flags)) {
- s.failed++;
- s.failed_num[0] = i;
+ s->failed++;
+ s->failed_num[0] = i;
}
}
spin_unlock_irq(&conf->device_lock);
rcu_read_unlock();

- if (unlikely(s.blocked_rdev)) {
- if (s.syncing || s.expanding || s.expanded ||
- s.to_write || s.written) {
+ if (unlikely(s->blocked_rdev)) {
+ if (s->syncing || s->expanding || s->expanded ||
+ s->to_write || s->written) {
set_bit(STRIPE_HANDLE, &sh->state);
- goto unlock;
+ return;
}
/* There is nothing for the blocked_rdev to block */
- rdev_dec_pending(s.blocked_rdev, conf->mddev);
- s.blocked_rdev = NULL;
+ rdev_dec_pending(s->blocked_rdev, conf->mddev);
+ s->blocked_rdev = NULL;
}

- if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
- set_bit(STRIPE_OP_BIOFILL, &s.ops_request);
+ if (s->to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
+ set_bit(STRIPE_OP_BIOFILL, &s->ops_request);
set_bit(STRIPE_BIOFILL_RUN, &sh->state);
}

pr_debug("locked=%d uptodate=%d to_read=%d"
" to_write=%d failed=%d failed_num=%d\n",
- s.locked, s.uptodate, s.to_read, s.to_write,
- s.failed, s.failed_num[0]);
+ s->locked, s->uptodate, s->to_read, s->to_write,
+ s->failed, s->failed_num[0]);
/* check if the array has lost two devices and, if so, some requests might
* need to be failed
*/
- if (s.failed > 1 && s.to_read+s.to_write+s.written)
- handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
- if (s.failed > 1 && s.syncing) {
+ if (s->failed > 1 && s->to_read+s->to_write+s->written)
+ handle_failed_stripe(conf, sh, s, disks, &s->return_bi);
+ if (s->failed > 1 && s->syncing) {
md_done_sync(conf->mddev, STRIPE_SECTORS,0);
clear_bit(STRIPE_SYNCING, &sh->state);
- s.syncing = 0;
+ s->syncing = 0;
}

/* might be able to return some write requests if the parity block
* is safe, or on a failed drive
*/
dev = &sh->dev[sh->pd_idx];
- if ( s.written &&
- ((test_bit(R5_Insync, &dev->flags) &&
- !test_bit(R5_LOCKED, &dev->flags) &&
- test_bit(R5_UPTODATE, &dev->flags)) ||
- (s.failed == 1 && s.failed_num[0] == sh->pd_idx)))
- handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
+ if (s->written &&
+ ((test_bit(R5_Insync, &dev->flags) &&
+ !test_bit(R5_LOCKED, &dev->flags) &&
+ test_bit(R5_UPTODATE, &dev->flags)) ||
+ (s->failed == 1 && s->failed_num[0] == sh->pd_idx)))
+ handle_stripe_clean_event(conf, sh, disks, &s->return_bi);

/* Now we might consider reading some blocks, either to check/generate
* parity, or to satisfy requests
* or to load a block that is being partially written.
*/
- if (s.to_read || s.non_overwrite ||
- (s.syncing && (s.uptodate + s.compute < disks)) || s.expanding)
- handle_stripe_fill5(sh, &s, disks);
+ if (s->to_read || s->non_overwrite ||
+ (s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
+ handle_stripe_fill5(sh, s, disks);

/* Now we check to see if any write operations have recently
* completed
@@ -3158,12 +3150,12 @@ static void handle_stripe5(struct stripe_head *sh)
if (prexor)
continue;
if (!test_bit(R5_Insync, &dev->flags) ||
- (i == sh->pd_idx && s.failed == 0))
+ (i == sh->pd_idx && s->failed == 0))
set_bit(STRIPE_INSYNC, &sh->state);
}
}
if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
- s.dec_preread_active = 1;
+ s->dec_preread_active = 1;
}

/* Now to consider new write requests and what else, if anything
@@ -3172,8 +3164,8 @@ static void handle_stripe5(struct stripe_head *sh)
* 2/ A 'check' operation is in flight, as it may clobber the parity
* block.
*/
- if (s.to_write && !sh->reconstruct_state && !sh->check_state)
- handle_stripe_dirtying5(conf, sh, &s, disks);
+ if (s->to_write && !sh->reconstruct_state && !sh->check_state)
+ handle_stripe_dirtying5(conf, sh, s, disks);

/* maybe we need to check and possibly fix the parity for this stripe
* Any reads will already have been scheduled, so we just see if enough
@@ -3181,12 +3173,13 @@ static void handle_stripe5(struct stripe_head *sh)
* dependent operations are in flight.
*/
if (sh->check_state ||
- (s.syncing && s.locked == 0 &&
+ (s->syncing && s->locked == 0 &&
!test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
!test_bit(STRIPE_INSYNC, &sh->state)))
- handle_parity_checks5(conf, sh, &s, disks);
+ handle_parity_checks5(conf, sh, s, disks);

- if (s.syncing && s.locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
+ if (s->syncing && s->locked == 0
+ && test_bit(STRIPE_INSYNC, &sh->state)) {
md_done_sync(conf->mddev, STRIPE_SECTORS,1);
clear_bit(STRIPE_SYNCING, &sh->state);
}
@@ -3194,22 +3187,22 @@ static void handle_stripe5(struct stripe_head *sh)
/* If the failed drive is just a ReadError, then we might need to progress
* the repair/check process
*/
- if (s.failed == 1 && !conf->mddev->ro &&
- test_bit(R5_ReadError, &sh->dev[s.failed_num[0]].flags)
- && !test_bit(R5_LOCKED, &sh->dev[s.failed_num[0]].flags)
- && test_bit(R5_UPTODATE, &sh->dev[s.failed_num[0]].flags)
+ if (s->failed == 1 && !conf->mddev->ro &&
+ test_bit(R5_ReadError, &sh->dev[s->failed_num[0]].flags)
+ && !test_bit(R5_LOCKED, &sh->dev[s->failed_num[0]].flags)
+ && test_bit(R5_UPTODATE, &sh->dev[s->failed_num[0]].flags)
) {
- dev = &sh->dev[s.failed_num[0]];
+ dev = &sh->dev[s->failed_num[0]];
if (!test_bit(R5_ReWrite, &dev->flags)) {
set_bit(R5_Wantwrite, &dev->flags);
set_bit(R5_ReWrite, &dev->flags);
set_bit(R5_LOCKED, &dev->flags);
- s.locked++;
+ s->locked++;
} else {
/* let's read it back */
set_bit(R5_Wantread, &dev->flags);
set_bit(R5_LOCKED, &dev->flags);
- s.locked++;
+ s->locked++;
}
}

@@ -3227,7 +3220,7 @@ static void handle_stripe5(struct stripe_head *sh)
&sh2->state))
atomic_inc(&conf->preread_active_stripes);
release_stripe(sh2);
- goto unlock;
+ return;
}
if (sh2)
release_stripe(sh2);
@@ -3237,69 +3230,35 @@ static void handle_stripe5(struct stripe_head *sh)
for (i = conf->raid_disks; i--; ) {
set_bit(R5_Wantwrite, &sh->dev[i].flags);
set_bit(R5_LOCKED, &sh->dev[i].flags);
- s.locked++;
+ s->locked++;
}
}

- if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
+ if (s->expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
!sh->reconstruct_state) {
/* Need to write out all blocks after computing parity */
sh->disks = conf->raid_disks;
stripe_set_idx(sh->sector, conf, 0, sh);
- schedule_reconstruction(sh, &s, 1, 1);
- } else if (s.expanded && !sh->reconstruct_state && s.locked == 0) {
+ schedule_reconstruction(sh, s, 1, 1);
+ } else if (s->expanded && !sh->reconstruct_state && s->locked == 0) {
clear_bit(STRIPE_EXPAND_READY, &sh->state);
atomic_dec(&conf->reshape_stripes);
wake_up(&conf->wait_for_overlap);
md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
}

- if (s.expanding && s.locked == 0 &&
+ if (s->expanding && s->locked == 0 &&
!test_bit(STRIPE_COMPUTE_RUN, &sh->state))
handle_stripe_expansion(conf, sh, NULL);
-
- unlock:
-
- /* wait for this device to become unblocked */
- if (unlikely(s.blocked_rdev))
- md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);
-
- if (s.ops_request)
- raid_run_ops(sh, s.ops_request);
-
- ops_run_io(sh, &s);
-
- if (s.dec_preread_active) {
- /* We delay this until after ops_run_io so that if make_request
- * is waiting on a flush, it won't continue until the writes
- * have actually been submitted.
- */
- atomic_dec(&conf->preread_active_stripes);
- if (atomic_read(&conf->preread_active_stripes) <
- IO_THRESHOLD)
- md_wakeup_thread(conf->mddev->thread);
- }
- return_io(s.return_bi);
}

-static void handle_stripe6(struct stripe_head *sh)
+static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
{
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks;
int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx;
- struct stripe_head_state s;
struct r5dev *dev, *pdev, *qdev;

- pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
- "pd_idx=%d, qd_idx=%d\n, check:%d, reconstruct:%d\n",
- (unsigned long long)sh->sector, sh->state,
- atomic_read(&sh->count), pd_idx, qd_idx,
- sh->check_state, sh->reconstruct_state);
- memset(&s, 0, sizeof(s));
-
- s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
- s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
- s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
/* Now to look around and see what can be done */

rcu_read_lock();
@@ -3320,28 +3279,30 @@ static void handle_stripe6(struct stripe_head *sh)
set_bit(R5_Wantfill, &dev->flags);

/* now count some things */
- if (test_bit(R5_LOCKED, &dev->flags)) s.locked++;
- if (test_bit(R5_UPTODATE, &dev->flags)) s.uptodate++;
+ if (test_bit(R5_LOCKED, &dev->flags))
+ s->locked++;
+ if (test_bit(R5_UPTODATE, &dev->flags))
+ s->uptodate++;
if (test_bit(R5_Wantcompute, &dev->flags)) {
- s.compute++;
- BUG_ON(s.compute > 2);
+ s->compute++;
+ BUG_ON(s->compute > 2);
}

if (test_bit(R5_Wantfill, &dev->flags)) {
- s.to_fill++;
+ s->to_fill++;
} else if (dev->toread)
- s.to_read++;
+ s->to_read++;
if (dev->towrite) {
- s.to_write++;
+ s->to_write++;
if (!test_bit(R5_OVERWRITE, &dev->flags))
- s.non_overwrite++;
+ s->non_overwrite++;
}
if (dev->written)
- s.written++;
+ s->written++;
rdev = rcu_dereference(conf->disks[i].rdev);
- if (s.blocked_rdev == NULL &&
+ if (s->blocked_rdev == NULL &&
rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
- s.blocked_rdev = rdev;
+ s->blocked_rdev = rdev;
atomic_inc(&rdev->nr_pending);
}
clear_bit(R5_Insync, &dev->flags);
@@ -3362,43 +3323,43 @@ static void handle_stripe6(struct stripe_head *sh)
if (test_bit(R5_ReadError, &dev->flags))
clear_bit(R5_Insync, &dev->flags);
if (!test_bit(R5_Insync, &dev->flags)) {
- if (s.failed < 2)
- s.failed_num[s.failed] = i;
- s.failed++;
+ if (s->failed < 2)
+ s->failed_num[s->failed] = i;
+ s->failed++;
}
}
spin_unlock_irq(&conf->device_lock);
rcu_read_unlock();

- if (unlikely(s.blocked_rdev)) {
- if (s.syncing || s.expanding || s.expanded ||
- s.to_write || s.written) {
+ if (unlikely(s->blocked_rdev)) {
+ if (s->syncing || s->expanding || s->expanded ||
+ s->to_write || s->written) {
set_bit(STRIPE_HANDLE, &sh->state);
- goto unlock;
+ return;
}
/* There is nothing for the blocked_rdev to block */
- rdev_dec_pending(s.blocked_rdev, conf->mddev);
- s.blocked_rdev = NULL;
+ rdev_dec_pending(s->blocked_rdev, conf->mddev);
+ s->blocked_rdev = NULL;
}

- if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
- set_bit(STRIPE_OP_BIOFILL, &s.ops_request);
+ if (s->to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
+ set_bit(STRIPE_OP_BIOFILL, &s->ops_request);
set_bit(STRIPE_BIOFILL_RUN, &sh->state);
}

pr_debug("locked=%d uptodate=%d to_read=%d"
" to_write=%d failed=%d failed_num=%d,%d\n",
- s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
- s.failed_num[0], s.failed_num[1]);
+ s->locked, s->uptodate, s->to_read, s->to_write, s->failed,
+ s->failed_num[0], s->failed_num[1]);
/* check if the array has lost >2 devices and, if so, some requests
* might need to be failed
*/
- if (s.failed > 2 && s.to_read+s.to_write+s.written)
- handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
- if (s.failed > 2 && s.syncing) {
+ if (s->failed > 2 && s->to_read+s->to_write+s->written)
+ handle_failed_stripe(conf, sh, s, disks, &s->return_bi);
+ if (s->failed > 2 && s->syncing) {
md_done_sync(conf->mddev, STRIPE_SECTORS,0);
clear_bit(STRIPE_SYNCING, &sh->state);
- s.syncing = 0;
+ s->syncing = 0;
}

/*
@@ -3406,28 +3367,28 @@ static void handle_stripe6(struct stripe_head *sh)
* are safe, or on a failed drive
*/
pdev = &sh->dev[pd_idx];
- s.p_failed = (s.failed >= 1 && s.failed_num[0] == pd_idx)
- || (s.failed >= 2 && s.failed_num[1] == pd_idx);
+ s->p_failed = (s->failed >= 1 && s->failed_num[0] == pd_idx)
+ || (s->failed >= 2 && s->failed_num[1] == pd_idx);
qdev = &sh->dev[qd_idx];
- s.q_failed = (s.failed >= 1 && s.failed_num[0] == qd_idx)
- || (s.failed >= 2 && s.failed_num[1] == qd_idx);
+ s->q_failed = (s->failed >= 1 && s->failed_num[0] == qd_idx)
+ || (s->failed >= 2 && s->failed_num[1] == qd_idx);

- if (s.written &&
- (s.p_failed || ((test_bit(R5_Insync, &pdev->flags)
+ if (s->written &&
+ (s->p_failed || ((test_bit(R5_Insync, &pdev->flags)
&& !test_bit(R5_LOCKED, &pdev->flags)
&& test_bit(R5_UPTODATE, &pdev->flags)))) &&
- (s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
+ (s->q_failed || ((test_bit(R5_Insync, &qdev->flags)
&& !test_bit(R5_LOCKED, &qdev->flags)
&& test_bit(R5_UPTODATE, &qdev->flags)))))
- handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
+ handle_stripe_clean_event(conf, sh, disks, &s->return_bi);

/* Now we might consider reading some blocks, either to check/generate
* parity, or to satisfy requests
* or to load a block that is being partially written.
*/
- if (s.to_read || s.non_overwrite || (s.to_write && s.failed) ||
- (s.syncing && (s.uptodate + s.compute < disks)) || s.expanding)
- handle_stripe_fill6(sh, &s, disks);
+ if (s->to_read || s->non_overwrite || (s->to_write && s->failed) ||
+ (s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
+ handle_stripe_fill6(sh, s, disks);

/* Now we check to see if any write operations have recently
* completed
@@ -3450,12 +3411,12 @@ static void handle_stripe6(struct stripe_head *sh)
set_bit(R5_Wantwrite, &dev->flags);
if (!test_bit(R5_Insync, &dev->flags) ||
((i == sh->pd_idx || i == qd_idx) &&
- s.failed == 0))
+ s->failed == 0))
set_bit(STRIPE_INSYNC, &sh->state);
}
}
if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
- s.dec_preread_active = 1;
+ s->dec_preread_active = 1;
}

/* Now to consider new write requests and what else, if anything
@@ -3464,8 +3425,8 @@ static void handle_stripe6(struct stripe_head *sh)
* 2/ A 'check' operation is in flight, as it may clobber the parity
* block.
*/
- if (s.to_write && !sh->reconstruct_state && !sh->check_state)
- handle_stripe_dirtying6(conf, sh, &s, disks);
+ if (s->to_write && !sh->reconstruct_state && !sh->check_state)
+ handle_stripe_dirtying6(conf, sh, s, disks);

/* maybe we need to check and possibly fix the parity for this stripe
* Any reads will already have been scheduled, so we just see if enough
@@ -3473,12 +3434,13 @@ static void handle_stripe6(struct stripe_head *sh)
* dependent operations are in flight.
*/
if (sh->check_state ||
- (s.syncing && s.locked == 0 &&
+ (s->syncing && s->locked == 0 &&
!test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
!test_bit(STRIPE_INSYNC, &sh->state)))
- handle_parity_checks6(conf, sh, &s, disks);
+ handle_parity_checks6(conf, sh, s, disks);

- if (s.syncing && s.locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
+ if (s->syncing && s->locked == 0
+ && test_bit(STRIPE_INSYNC, &sh->state)) {
md_done_sync(conf->mddev, STRIPE_SECTORS,1);
clear_bit(STRIPE_SYNCING, &sh->state);
}
@@ -3486,9 +3448,9 @@ static void handle_stripe6(struct stripe_head *sh)
/* If the failed drives are just a ReadError, then we might need
* to progress the repair/check process
*/
- if (s.failed <= 2 && !conf->mddev->ro)
- for (i = 0; i < s.failed; i++) {
- dev = &sh->dev[s.failed_num[i]];
+ if (s->failed <= 2 && !conf->mddev->ro)
+ for (i = 0; i < s->failed; i++) {
+ dev = &sh->dev[s->failed_num[i]];
if (test_bit(R5_ReadError, &dev->flags)
&& !test_bit(R5_LOCKED, &dev->flags)
&& test_bit(R5_UPTODATE, &dev->flags)
@@ -3497,12 +3459,12 @@ static void handle_stripe6(struct stripe_head *sh)
set_bit(R5_Wantwrite, &dev->flags);
set_bit(R5_ReWrite, &dev->flags);
set_bit(R5_LOCKED, &dev->flags);
- s.locked++;
+ s->locked++;
} else {
/* let's read it back */
set_bit(R5_Wantread, &dev->flags);
set_bit(R5_LOCKED, &dev->flags);
- s.locked++;
+ s->locked++;
}
}
}
@@ -3514,11 +3476,11 @@ static void handle_stripe6(struct stripe_head *sh)
for (i = conf->raid_disks; i--; ) {
set_bit(R5_Wantwrite, &sh->dev[i].flags);
set_bit(R5_LOCKED, &sh->dev[i].flags);
- s.locked++;
+ s->locked++;
}
}

- if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
+ if (s->expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
!sh->reconstruct_state) {
struct stripe_head *sh2
= get_active_stripe(conf, sh->sector, 1, 1, 1);
@@ -3532,7 +3494,7 @@ static void handle_stripe6(struct stripe_head *sh)
&sh2->state))
atomic_inc(&conf->preread_active_stripes);
release_stripe(sh2);
- goto unlock;
+ return;
}
if (sh2)
release_stripe(sh2);
@@ -3540,19 +3502,54 @@ static void handle_stripe6(struct stripe_head *sh)
/* Need to write out all blocks after computing P&Q */
sh->disks = conf->raid_disks;
stripe_set_idx(sh->sector, conf, 0, sh);
- schedule_reconstruction(sh, &s, 1, 1);
- } else if (s.expanded && !sh->reconstruct_state && s.locked == 0) {
+ schedule_reconstruction(sh, s, 1, 1);
+ } else if (s->expanded && !sh->reconstruct_state && s->locked == 0) {
clear_bit(STRIPE_EXPAND_READY, &sh->state);
atomic_dec(&conf->reshape_stripes);
wake_up(&conf->wait_for_overlap);
md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
}

- if (s.expanding && s.locked == 0 &&
+ if (s->expanding && s->locked == 0 &&
!test_bit(STRIPE_COMPUTE_RUN, &sh->state))
- handle_stripe_expansion(conf, sh, &s);
+ handle_stripe_expansion(conf, sh, s);
+}
+
+static void handle_stripe(struct stripe_head *sh)
+{
+ struct stripe_head_state s;
+ raid5_conf_t *conf = sh->raid_conf;
+
+ clear_bit(STRIPE_HANDLE, &sh->state);
+ if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
+ /* already being handled, ensure it gets handled
+ * again when current action finishes */
+ set_bit(STRIPE_HANDLE, &sh->state);
+ return;
+ }
+
+ if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
+ set_bit(STRIPE_SYNCING, &sh->state);
+ clear_bit(STRIPE_INSYNC, &sh->state);
+ }
+ clear_bit(STRIPE_DELAYED, &sh->state);
+
+ pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
+ "pd_idx=%d, qd_idx=%d\n, check:%d, reconstruct:%d\n",
+ (unsigned long long)sh->sector, sh->state,
+ atomic_read(&sh->count), sh->pd_idx, sh->qd_idx,
+ sh->check_state, sh->reconstruct_state);
+ memset(&s, 0, sizeof(s));
+
+ s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
+ s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
+ s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
+
+ if (conf->level == 6)
+ handle_stripe6(sh, &s);
+ else
+ handle_stripe5(sh, &s);

- unlock:

/* wait for this device to become unblocked */
if (unlikely(s.blocked_rdev))
@@ -3576,28 +3573,7 @@ static void handle_stripe6(struct stripe_head *sh)
}

return_io(s.return_bi);
-}

-static void handle_stripe(struct stripe_head *sh)
-{
- clear_bit(STRIPE_HANDLE, &sh->state);
- if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
- /* already being handled, ensure it gets handled
- * again when current action finishes */
- set_bit(STRIPE_HANDLE, &sh->state);
- return;
- }
-
- if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
- set_bit(STRIPE_SYNCING, &sh->state);
- clear_bit(STRIPE_INSYNC, &sh->state);
- }
- clear_bit(STRIPE_DELAYED, &sh->state);
-
- if (sh->raid_conf->level == 6)
- handle_stripe6(sh);
- else
- handle_stripe5(sh);
clear_bit(STRIPE_ACTIVE, &sh->state);
}



--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 14/34] md/raid5: move more code into common handle_stripe

am 21.07.2011 04:32:26 von NeilBrown

The difference between the RAID5 and RAID6 code here is easily
resolved using conf->max_degraded.

Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 90 ++++++++++++++++++----------------------------------
1 files changed, 32 insertions(+), 58 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7b562fc..e41b622 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3177,34 +3177,6 @@ static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
!test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
!test_bit(STRIPE_INSYNC, &sh->state)))
handle_parity_checks5(conf, sh, s, disks);
-
- if (s->syncing && s->locked == 0
- && test_bit(STRIPE_INSYNC, &sh->state)) {
- md_done_sync(conf->mddev, STRIPE_SECTORS,1);
- clear_bit(STRIPE_SYNCING, &sh->state);
- }
-
- /* If the failed drive is just a ReadError, then we might need to progress
- * the repair/check process
- */
- if (s->failed == 1 && !conf->mddev->ro &&
- test_bit(R5_ReadError, &sh->dev[s->failed_num[0]].flags)
- && !test_bit(R5_LOCKED, &sh->dev[s->failed_num[0]].flags)
- && test_bit(R5_UPTODATE, &sh->dev[s->failed_num[0]].flags)
- ) {
- dev = &sh->dev[s->failed_num[0]];
- if (!test_bit(R5_ReWrite, &dev->flags)) {
- set_bit(R5_Wantwrite, &dev->flags);
- set_bit(R5_ReWrite, &dev->flags);
- set_bit(R5_LOCKED, &dev->flags);
- s->locked++;
- } else {
- /* let's read it back */
- set_bit(R5_Wantread, &dev->flags);
- set_bit(R5_LOCKED, &dev->flags);
- s->locked++;
- }
- }
return 0;
}

@@ -3394,36 +3366,6 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
!test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
!test_bit(STRIPE_INSYNC, &sh->state)))
handle_parity_checks6(conf, sh, s, disks);
-
- if (s->syncing && s->locked == 0
- && test_bit(STRIPE_INSYNC, &sh->state)) {
- md_done_sync(conf->mddev, STRIPE_SECTORS,1);
- clear_bit(STRIPE_SYNCING, &sh->state);
- }
-
- /* If the failed drives are just a ReadError, then we might need
- * to progress the repair/check process
- */
- if (s->failed <= 2 && !conf->mddev->ro)
- for (i = 0; i < s->failed; i++) {
- dev = &sh->dev[s->failed_num[i]];
- if (test_bit(R5_ReadError, &dev->flags)
- && !test_bit(R5_LOCKED, &dev->flags)
- && test_bit(R5_UPTODATE, &dev->flags)
- ) {
- if (!test_bit(R5_ReWrite, &dev->flags)) {
- set_bit(R5_Wantwrite, &dev->flags);
- set_bit(R5_ReWrite, &dev->flags);
- set_bit(R5_LOCKED, &dev->flags);
- s->locked++;
- } else {
- /* let's read it back */
- set_bit(R5_Wantread, &dev->flags);
- set_bit(R5_LOCKED, &dev->flags);
- s->locked++;
- }
- }
- }
return 0;
}

@@ -3466,6 +3408,38 @@ static void handle_stripe(struct stripe_head *sh)

if (done)
goto finish;
+
+
+ if (s.syncing && s.locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
+ md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
+ clear_bit(STRIPE_SYNCING, &sh->state);
+ }
+
+ /* If the failed drives are just a ReadError, then we might need
+ * to progress the repair/check process
+ */
+ if (s.failed <= conf->max_degraded && !conf->mddev->ro)
+ for (i = 0; i < s.failed; i++) {
+ struct r5dev *dev = &sh->dev[s.failed_num[i]];
+ if (test_bit(R5_ReadError, &dev->flags)
+ && !test_bit(R5_LOCKED, &dev->flags)
+ && test_bit(R5_UPTODATE, &dev->flags)
+ ) {
+ if (!test_bit(R5_ReWrite, &dev->flags)) {
+ set_bit(R5_Wantwrite, &dev->flags);
+ set_bit(R5_ReWrite, &dev->flags);
+ set_bit(R5_LOCKED, &dev->flags);
+ s.locked++;
+ } else {
+ /* let's read it back */
+ set_bit(R5_Wantread, &dev->flags);
+ set_bit(R5_LOCKED, &dev->flags);
+ s.locked++;
+ }
+ }
+ }
+
+
/* Finish reconstruct operations initiated by the expansion process */
if (sh->reconstruct_state == reconstruct_state_result) {
struct stripe_head *sh2


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 13/34] md/raid5: Move code for finishing a reconstructioninto handle_stripe.

am 21.07.2011 04:32:26 von NeilBrown

Prior to commit ab69ae12ceef7 the code in handle_stripe5 and
handle_stripe6 to "Finish reconstruct operations initiated by the
expansion process" was identical.
That commit added an identical stanza of code to each function, but in
different places. That was careless.

The raid5 code was correct, so move that out into handle_stripe and
remove raid6 version.

Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 153 +++++++++++++++++++---------------------------------
1 files changed, 57 insertions(+), 96 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ce1c291..7b562fc 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2999,7 +2999,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
*
*/

-static void handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
+static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
{
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks, i;
@@ -3081,7 +3081,7 @@ static void handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
if (s->syncing || s->expanding || s->expanded ||
s->to_write || s->written) {
set_bit(STRIPE_HANDLE, &sh->state);
- return;
+ return 1;
}
/* There is nothing for the blocked_rdev to block */
rdev_dec_pending(s->blocked_rdev, conf->mddev);
@@ -3205,54 +3205,10 @@ static void handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
s->locked++;
}
}
-
- /* Finish reconstruct operations initiated by the expansion process */
- if (sh->reconstruct_state == reconstruct_state_result) {
- struct stripe_head *sh2
- = get_active_stripe(conf, sh->sector, 1, 1, 1);
- if (sh2 && test_bit(STRIPE_EXPAND_SOURCE, &sh2->state)) {
- /* sh cannot be written until sh2 has been read.
- * so arrange for sh to be delayed a little
- */
- set_bit(STRIPE_DELAYED, &sh->state);
- set_bit(STRIPE_HANDLE, &sh->state);
- if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE,
- &sh2->state))
- atomic_inc(&conf->preread_active_stripes);
- release_stripe(sh2);
- return;
- }
- if (sh2)
- release_stripe(sh2);
-
- sh->reconstruct_state = reconstruct_state_idle;
- clear_bit(STRIPE_EXPANDING, &sh->state);
- for (i = conf->raid_disks; i--; ) {
- set_bit(R5_Wantwrite, &sh->dev[i].flags);
- set_bit(R5_LOCKED, &sh->dev[i].flags);
- s->locked++;
- }
- }
-
- if (s->expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
- !sh->reconstruct_state) {
- /* Need to write out all blocks after computing parity */
- sh->disks = conf->raid_disks;
- stripe_set_idx(sh->sector, conf, 0, sh);
- schedule_reconstruction(sh, s, 1, 1);
- } else if (s->expanded && !sh->reconstruct_state && s->locked == 0) {
- clear_bit(STRIPE_EXPAND_READY, &sh->state);
- atomic_dec(&conf->reshape_stripes);
- wake_up(&conf->wait_for_overlap);
- md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
- }
-
- if (s->expanding && s->locked == 0 &&
- !test_bit(STRIPE_COMPUTE_RUN, &sh->state))
- handle_stripe_expansion(conf, sh, NULL);
+ return 0;
}

-static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
+static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
{
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks;
@@ -3335,7 +3291,7 @@ static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
if (s->syncing || s->expanding || s->expanded ||
s->to_write || s->written) {
set_bit(STRIPE_HANDLE, &sh->state);
- return;
+ return 1;
}
/* There is nothing for the blocked_rdev to block */
rdev_dec_pending(s->blocked_rdev, conf->mddev);
@@ -3468,57 +3424,15 @@ static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
}
}
}
-
- /* Finish reconstruct operations initiated by the expansion process */
- if (sh->reconstruct_state == reconstruct_state_result) {
- sh->reconstruct_state = reconstruct_state_idle;
- clear_bit(STRIPE_EXPANDING, &sh->state);
- for (i = conf->raid_disks; i--; ) {
- set_bit(R5_Wantwrite, &sh->dev[i].flags);
- set_bit(R5_LOCKED, &sh->dev[i].flags);
- s->locked++;
- }
- }
-
- if (s->expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
- !sh->reconstruct_state) {
- struct stripe_head *sh2
- = get_active_stripe(conf, sh->sector, 1, 1, 1);
- if (sh2 && test_bit(STRIPE_EXPAND_SOURCE, &sh2->state)) {
- /* sh cannot be written until sh2 has been read.
- * so arrange for sh to be delayed a little
- */
- set_bit(STRIPE_DELAYED, &sh->state);
- set_bit(STRIPE_HANDLE, &sh->state);
- if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE,
- &sh2->state))
- atomic_inc(&conf->preread_active_stripes);
- release_stripe(sh2);
- return;
- }
- if (sh2)
- release_stripe(sh2);
-
- /* Need to write out all blocks after computing P&Q */
- sh->disks = conf->raid_disks;
- stripe_set_idx(sh->sector, conf, 0, sh);
- schedule_reconstruction(sh, s, 1, 1);
- } else if (s->expanded && !sh->reconstruct_state && s->locked == 0) {
- clear_bit(STRIPE_EXPAND_READY, &sh->state);
- atomic_dec(&conf->reshape_stripes);
- wake_up(&conf->wait_for_overlap);
- md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
- }
-
- if (s->expanding && s->locked == 0 &&
- !test_bit(STRIPE_COMPUTE_RUN, &sh->state))
- handle_stripe_expansion(conf, sh, s);
+ return 0;
}

static void handle_stripe(struct stripe_head *sh)
{
struct stripe_head_state s;
+ int done;
raid5_conf_t *conf = sh->raid_conf;
+ int i;

clear_bit(STRIPE_HANDLE, &sh->state);
if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
@@ -3546,11 +3460,58 @@ static void handle_stripe(struct stripe_head *sh)
s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);

if (conf->level == 6)
- handle_stripe6(sh, &s);
+ done = handle_stripe6(sh, &s);
else
- handle_stripe5(sh, &s);
+ done = handle_stripe5(sh, &s);
+
+ if (done)
+ goto finish;
+ /* Finish reconstruct operations initiated by the expansion process */
+ if (sh->reconstruct_state == reconstruct_state_result) {
+ struct stripe_head *sh2
+ = get_active_stripe(conf, sh->sector, 1, 1, 1);
+ if (sh2 && test_bit(STRIPE_EXPAND_SOURCE, &sh2->state)) {
+ /* sh cannot be written until sh2 has been read.
+ * so arrange for sh to be delayed a little
+ */
+ set_bit(STRIPE_DELAYED, &sh->state);
+ set_bit(STRIPE_HANDLE, &sh->state);
+ if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE,
+ &sh2->state))
+ atomic_inc(&conf->preread_active_stripes);
+ release_stripe(sh2);
+ goto finish;
+ }
+ if (sh2)
+ release_stripe(sh2);
+
+ sh->reconstruct_state = reconstruct_state_idle;
+ clear_bit(STRIPE_EXPANDING, &sh->state);
+ for (i = conf->raid_disks; i--; ) {
+ set_bit(R5_Wantwrite, &sh->dev[i].flags);
+ set_bit(R5_LOCKED, &sh->dev[i].flags);
+ s.locked++;
+ }
+ }

+ if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
+ !sh->reconstruct_state) {
+ /* Need to write out all blocks after computing parity */
+ sh->disks = conf->raid_disks;
+ stripe_set_idx(sh->sector, conf, 0, sh);
+ schedule_reconstruction(sh, &s, 1, 1);
+ } else if (s.expanded && !sh->reconstruct_state && s.locked == 0) {
+ clear_bit(STRIPE_EXPAND_READY, &sh->state);
+ atomic_dec(&conf->reshape_stripes);
+ wake_up(&conf->wait_for_overlap);
+ md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
+ }
+
+ if (s.expanding && s.locked == 0 &&
+ !test_bit(STRIPE_COMPUTE_RUN, &sh->state))
+ handle_stripe_expansion(conf, sh, NULL);

+finish:
/* wait for this device to become unblocked */
if (unlikely(s.blocked_rdev))
md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 18/34] md/raid5: move more common code into handle_stripe

am 21.07.2011 04:32:26 von NeilBrown

Apart from 'prexor' which can only be set for RAID5, and
'qd_idx' which can only be meaningful for RAID6, these two
chunks of code are nearly the same.

So combine them into one adding a test to call either
handle_parity_checks5 or handle_parity_checks6 as appropriate.

Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 161 ++++++++++++++++++++--------------------------------
1 files changed, 61 insertions(+), 100 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b1eba98..ba6f892 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1756,7 +1756,7 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
/*
* Select the parity disk based on the user selected algorithm.
*/
- pd_idx = qd_idx = ~0;
+ pd_idx = qd_idx = -1;
switch(conf->level) {
case 4:
pd_idx = data_disks;
@@ -2904,7 +2904,6 @@ static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks, i;
struct r5dev *dev;
- int prexor;

/* Now to look around and see what can be done */
rcu_read_lock();
@@ -3027,56 +3026,6 @@ static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
(s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
handle_stripe_fill(sh, s, disks);

- /* Now we check to see if any write operations have recently
- * completed
- */
- prexor = 0;
- if (sh->reconstruct_state == reconstruct_state_prexor_drain_result)
- prexor = 1;
- if (sh->reconstruct_state == reconstruct_state_drain_result ||
- sh->reconstruct_state == reconstruct_state_prexor_drain_result) {
- sh->reconstruct_state = reconstruct_state_idle;
-
- /* All the 'written' buffers and the parity block are ready to
- * be written back to disk
- */
- BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags));
- for (i = disks; i--; ) {
- dev = &sh->dev[i];
- if (test_bit(R5_LOCKED, &dev->flags) &&
- (i == sh->pd_idx || dev->written)) {
- pr_debug("Writing block %d\n", i);
- set_bit(R5_Wantwrite, &dev->flags);
- if (prexor)
- continue;
- if (!test_bit(R5_Insync, &dev->flags) ||
- (i == sh->pd_idx && s->failed == 0))
- set_bit(STRIPE_INSYNC, &sh->state);
- }
- }
- if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
- s->dec_preread_active = 1;
- }
-
- /* Now to consider new write requests and what else, if anything
- * should be read. We do not handle new writes when:
- * 1/ A 'write' operation (copy+xor) is already in flight.
- * 2/ A 'check' operation is in flight, as it may clobber the parity
- * block.
- */
- if (s->to_write && !sh->reconstruct_state && !sh->check_state)
- handle_stripe_dirtying(conf, sh, s, disks);
-
- /* maybe we need to check and possibly fix the parity for this stripe
- * Any reads will already have been scheduled, so we just see if enough
- * data is available. The parity check is held off while parity
- * dependent operations are in flight.
- */
- if (sh->check_state ||
- (s->syncing && s->locked == 0 &&
- !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
- !test_bit(STRIPE_INSYNC, &sh->state)))
- handle_parity_checks5(conf, sh, s, disks);
return 0;
}

@@ -3218,54 +3167,6 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
(s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
handle_stripe_fill(sh, s, disks);

- /* Now we check to see if any write operations have recently
- * completed
- */
- if (sh->reconstruct_state == reconstruct_state_drain_result) {
-
- sh->reconstruct_state = reconstruct_state_idle;
- /* All the 'written' buffers and the parity blocks are ready to
- * be written back to disk
- */
- BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags));
- BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[qd_idx].flags));
- for (i = disks; i--; ) {
- dev = &sh->dev[i];
- if (test_bit(R5_LOCKED, &dev->flags) &&
- (i == sh->pd_idx || i == qd_idx ||
- dev->written)) {
- pr_debug("Writing block %d\n", i);
- BUG_ON(!test_bit(R5_UPTODATE, &dev->flags));
- set_bit(R5_Wantwrite, &dev->flags);
- if (!test_bit(R5_Insync, &dev->flags) ||
- ((i == sh->pd_idx || i == qd_idx) &&
- s->failed == 0))
- set_bit(STRIPE_INSYNC, &sh->state);
- }
- }
- if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
- s->dec_preread_active = 1;
- }
-
- /* Now to consider new write requests and what else, if anything
- * should be read. We do not handle new writes when:
- * 1/ A 'write' operation (copy+gen_syndrome) is already in flight.
- * 2/ A 'check' operation is in flight, as it may clobber the parity
- * block.
- */
- if (s->to_write && !sh->reconstruct_state && !sh->check_state)
- handle_stripe_dirtying(conf, sh, s, disks);
-
- /* maybe we need to check and possibly fix the parity for this stripe
- * Any reads will already have been scheduled, so we just see if enough
- * data is available. The parity check is held off while parity
- * dependent operations are in flight.
- */
- if (sh->check_state ||
- (s->syncing && s->locked == 0 &&
- !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
- !test_bit(STRIPE_INSYNC, &sh->state)))
- handle_parity_checks6(conf, sh, s, disks);
return 0;
}

@@ -3275,6 +3176,8 @@ static void handle_stripe(struct stripe_head *sh)
int done;
raid5_conf_t *conf = sh->raid_conf;
int i;
+ int prexor;
+ int disks = sh->disks;

clear_bit(STRIPE_HANDLE, &sh->state);
if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
@@ -3311,6 +3214,64 @@ static void handle_stripe(struct stripe_head *sh)
if (done)
goto finish;

+ /* Now we check to see if any write operations have recently
+ * completed
+ */
+ prexor = 0;
+ if (sh->reconstruct_state == reconstruct_state_prexor_drain_result)
+ prexor = 1;
+ if (sh->reconstruct_state == reconstruct_state_drain_result ||
+ sh->reconstruct_state == reconstruct_state_prexor_drain_result) {
+ sh->reconstruct_state = reconstruct_state_idle;
+
+ /* All the 'written' buffers and the parity block are ready to
+ * be written back to disk
+ */
+ BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags));
+ BUG_ON(sh->qd_idx >= 0 &&
+ !test_bit(R5_UPTODATE, &sh->dev[sh->qd_idx].flags));
+ for (i = disks; i--; ) {
+ struct r5dev *dev = &sh->dev[i];
+ if (test_bit(R5_LOCKED, &dev->flags) &&
+ (i == sh->pd_idx || i == sh->qd_idx ||
+ dev->written)) {
+ pr_debug("Writing block %d\n", i);
+ set_bit(R5_Wantwrite, &dev->flags);
+ if (prexor)
+ continue;
+ if (!test_bit(R5_Insync, &dev->flags) ||
+ ((i == sh->pd_idx || i == sh->qd_idx) &&
+ s.failed == 0))
+ set_bit(STRIPE_INSYNC, &sh->state);
+ }
+ }
+ if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+ s.dec_preread_active = 1;
+ }
+
+ /* Now to consider new write requests and what else, if anything
+ * should be read. We do not handle new writes when:
+ * 1/ A 'write' operation (copy+xor) is already in flight.
+ * 2/ A 'check' operation is in flight, as it may clobber the parity
+ * block.
+ */
+ if (s.to_write && !sh->reconstruct_state && !sh->check_state)
+ handle_stripe_dirtying(conf, sh, &s, disks);
+
+ /* maybe we need to check and possibly fix the parity for this stripe
+ * Any reads will already have been scheduled, so we just see if enough
+ * data is available. The parity check is held off while parity
+ * dependent operations are in flight.
+ */
+ if (sh->check_state ||
+ (s.syncing && s.locked == 0 &&
+ !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
+ !test_bit(STRIPE_INSYNC, &sh->state))) {
+ if (conf->level == 6)
+ handle_parity_checks6(conf, sh, &s, disks);
+ else
+ handle_parity_checks5(conf, sh, &s, disks);
+ }

if (s.syncing && s.locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
md_done_sync(conf->mddev, STRIPE_SECTORS, 1);


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 16/34] md/raid5: unite fetch_block5 and fetch_block6

am 21.07.2011 04:32:26 von NeilBrown

Provided that ->failed_num[1] is not a valid device number (which is
easily achieved) fetch_block6 provides all the functionality of
fetch_block5.

So remove the latter and rename the former to simply "fetch_block".

Then handle_stripe_fill5 and handle_stripe_fill6 become the same and
can similarly be united.

Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 107 +++++++++++-----------------------------------------
1 files changed, 23 insertions(+), 84 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 100ac1f..ea796a9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2308,91 +2308,20 @@ handle_failed_stripe(raid5_conf_t *conf, struct stripe_head *sh,
md_wakeup_thread(conf->mddev->thread);
}

-/* fetch_block5 - checks the given member device to see if its data needs
+/* fetch_block - checks the given member device to see if its data needs
* to be read or computed to satisfy a request.
*
* Returns 1 when no more member devices need to be checked, otherwise returns
- * 0 to tell the loop in handle_stripe_fill5 to continue
+ * 0 to tell the loop in handle_stripe_fill to continue
*/
-static int fetch_block5(struct stripe_head *sh, struct stripe_head_state *s,
- int disk_idx, int disks)
-{
- struct r5dev *dev = &sh->dev[disk_idx];
- struct r5dev *failed_dev = &sh->dev[s->failed_num[0]];
-
- /* is the data in this block needed, and can we get it? */
- if (!test_bit(R5_LOCKED, &dev->flags) &&
- !test_bit(R5_UPTODATE, &dev->flags) &&
- (dev->toread ||
- (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
- s->syncing || s->expanding ||
- (s->failed && failed_dev->toread) ||
- (s->failed && failed_dev->towrite &&
- !test_bit(R5_OVERWRITE, &failed_dev->flags)))))) {
- /* We would like to get this block, possibly by computing it,
- * otherwise read it if the backing disk is insync
- */
- if ((s->uptodate == disks - 1) &&
- (s->failed && disk_idx == s->failed_num[0])) {
- set_bit(STRIPE_COMPUTE_RUN, &sh->state);
- set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
- set_bit(R5_Wantcompute, &dev->flags);
- sh->ops.target = disk_idx;
- sh->ops.target2 = -1;
- s->req_compute = 1;
- /* Careful: from this point on 'uptodate' is in the eye
- * of raid_run_ops which services 'compute' operations
- * before writes. R5_Wantcompute flags a block that will
- * be R5_UPTODATE by the time it is needed for a
- * subsequent operation.
- */
- s->uptodate++;
- return 1; /* uptodate + compute == disks */
- } else if (test_bit(R5_Insync, &dev->flags)) {
- set_bit(R5_LOCKED, &dev->flags);
- set_bit(R5_Wantread, &dev->flags);
- s->locked++;
- pr_debug("Reading block %d (sync=%d)\n", disk_idx,
- s->syncing);
- }
- }
-
- return 0;
-}
-
-/**
- * handle_stripe_fill5 - read or compute data to satisfy pending requests.
- */
-static void handle_stripe_fill5(struct stripe_head *sh,
- struct stripe_head_state *s, int disks)
-{
- int i;
-
- /* look for blocks to read/compute, skip this if a compute
- * is already in flight, or if the stripe contents are in the
- * midst of changing due to a write
- */
- if (!test_bit(STRIPE_COMPUTE_RUN, &sh->state) && !sh->check_state &&
- !sh->reconstruct_state)
- for (i = disks; i--; )
- if (fetch_block5(sh, s, i, disks))
- break;
- set_bit(STRIPE_HANDLE, &sh->state);
-}
-
-/* fetch_block6 - checks the given member device to see if its data needs
- * to be read or computed to satisfy a request.
- *
- * Returns 1 when no more member devices need to be checked, otherwise returns
- * 0 to tell the loop in handle_stripe_fill6 to continue
- */
-static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
- int disk_idx, int disks)
+static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
+ int disk_idx, int disks)
{
struct r5dev *dev = &sh->dev[disk_idx];
struct r5dev *fdev[2] = { &sh->dev[s->failed_num[0]],
&sh->dev[s->failed_num[1]] };

+ /* is the data in this block needed, and can we get it? */
if (!test_bit(R5_LOCKED, &dev->flags) &&
!test_bit(R5_UPTODATE, &dev->flags) &&
(dev->toread ||
@@ -2400,7 +2329,9 @@ static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
s->syncing || s->expanding ||
(s->failed >= 1 && fdev[0]->toread) ||
(s->failed >= 2 && fdev[1]->toread) ||
- (s->failed && s->to_write)) {
+ (sh->raid_conf->level <= 5 && s->failed && fdev[0]->towrite &&
+ !test_bit(R5_OVERWRITE, &fdev[0]->flags)) ||
+ (sh->raid_conf->level == 6 && s->failed && s->to_write))) {
/* we would like to get this block, possibly by computing it,
* otherwise read it if the backing disk is insync
*/
@@ -2420,6 +2351,12 @@ static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
sh->ops.target = disk_idx;
sh->ops.target2 = -1; /* no 2nd target */
s->req_compute = 1;
+ /* Careful: from this point on 'uptodate' is in the eye
+ * of raid_run_ops which services 'compute' operations
+ * before writes. R5_Wantcompute flags a block that will
+ * be R5_UPTODATE by the time it is needed for a
+ * subsequent operation.
+ */
s->uptodate++;
return 1;
} else if (s->uptodate == disks-2 && s->failed >= 2) {
@@ -2460,11 +2397,11 @@ static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
}

/**
- * handle_stripe_fill6 - read or compute data to satisfy pending requests.
+ * handle_stripe_fill - read or compute data to satisfy pending requests.
*/
-static void handle_stripe_fill6(struct stripe_head *sh,
- struct stripe_head_state *s,
- int disks)
+static void handle_stripe_fill(struct stripe_head *sh,
+ struct stripe_head_state *s,
+ int disks)
{
int i;

@@ -2475,7 +2412,7 @@ static void handle_stripe_fill6(struct stripe_head *sh,
if (!test_bit(STRIPE_COMPUTE_RUN, &sh->state) && !sh->check_state &&
!sh->reconstruct_state)
for (i = disks; i--; )
- if (fetch_block6(sh, s, i, disks))
+ if (fetch_block(sh, s, i, disks))
break;
set_bit(STRIPE_HANDLE, &sh->state);
}
@@ -3123,7 +3060,7 @@ static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
*/
if (s->to_read || s->non_overwrite ||
(s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
- handle_stripe_fill5(sh, s, disks);
+ handle_stripe_fill(sh, s, disks);

/* Now we check to see if any write operations have recently
* completed
@@ -3314,7 +3251,7 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
*/
if (s->to_read || s->non_overwrite || (s->to_write && s->failed) ||
(s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
- handle_stripe_fill6(sh, s, disks);
+ handle_stripe_fill(sh, s, disks);

/* Now we check to see if any write operations have recently
* completed
@@ -3398,6 +3335,8 @@ static void handle_stripe(struct stripe_head *sh)
s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
+ s.failed_num[0] = -1;
+ s.failed_num[1] = -1;

if (conf->level == 6)
done = handle_stripe6(sh, &s);


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 15/34] md/raid5: rearrange a test in fetch_block6.

am 21.07.2011 04:32:26 von NeilBrown

Next patch will unite fetch_block5 and fetch_block6.
First I want to make the differences a little more clear.

For RAID6 if we are writing at all and there is a failed device, then
we need to load or compute every block so we can do a
reconstruct-write.
This case isn't needed for RAID5 - we will do a read-modify-write in
that case.
So make that test a separate test in fetch_block6 rather than merged
with two other tests.

Make a similar change in fetch_block5 so the one bit that is not
needed for RAID6 is clearly separate.

Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 12 +++++-------
1 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e41b622..100ac1f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2326,9 +2326,8 @@ static int fetch_block5(struct stripe_head *sh, struct stripe_head_state *s,
(dev->toread ||
(dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
s->syncing || s->expanding ||
- (s->failed &&
- (failed_dev->toread ||
- (failed_dev->towrite &&
+ (s->failed && failed_dev->toread) ||
+ (s->failed && failed_dev->towrite &&
!test_bit(R5_OVERWRITE, &failed_dev->flags)))))) {
/* We would like to get this block, possibly by computing it,
* otherwise read it if the backing disk is insync
@@ -2399,10 +2398,9 @@ static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
(dev->toread ||
(dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
s->syncing || s->expanding ||
- (s->failed >= 1 &&
- (fdev[0]->toread || s->to_write)) ||
- (s->failed >= 2 &&
- (fdev[1]->toread || s->to_write)))) {
+ (s->failed >= 1 && fdev[0]->toread) ||
+ (s->failed >= 2 && fdev[1]->toread) ||
+ (s->failed && s->to_write)) {
/* we would like to get this block, possibly by computing it,
* otherwise read it if the backing disk is insync
*/


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 27/34] md/raid10: Improve decision on whether to fail adevice with a read error.

am 21.07.2011 04:32:27 von NeilBrown

Normally we would fail a device with a READ error. However if doing
so causes the array to fail, it is better to leave the device
in place and just return the read error to the caller.

The current test for decide if the array will fail is overly
simplistic.
We have a function 'enough' which can tell if the array is failed or
not, so use it to guide the decision.

Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 57 +++++++++++++++++++++++++--------------------------
1 files changed, 28 insertions(+), 29 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 5583201..6721cb0 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -970,6 +970,30 @@ static void status(struct seq_file *seq, mddev_t *mddev)
seq_printf(seq, "]");
}

+/* check if there are enough drives for
+ * every block to appear on atleast one.
+ * Don't consider the device numbered 'ignore'
+ * as we might be about to remove it.
+ */
+static int enough(conf_t *conf, int ignore)
+{
+ int first = 0;
+
+ do {
+ int n = conf->copies;
+ int cnt = 0;
+ while (n--) {
+ if (conf->mirrors[first].rdev &&
+ first != ignore)
+ cnt++;
+ first = (first+1) % conf->raid_disks;
+ }
+ if (cnt == 0)
+ return 0;
+ } while (first != 0);
+ return 1;
+}
+
static void error(mddev_t *mddev, mdk_rdev_t *rdev)
{
char b[BDEVNAME_SIZE];
@@ -982,13 +1006,9 @@ static void error(mddev_t *mddev, mdk_rdev_t *rdev)
* else mark the drive as failed
*/
if (test_bit(In_sync, &rdev->flags)
- && conf->raid_disks-mddev->degraded == 1)
+ && !enough(conf, rdev->raid_disk))
/*
* Don't fail the drive, just return an IO error.
- * The test should really be more sophisticated than
- * "working_disks == 1", but it isn't critical, and
- * can wait until we do more sophisticated "is the drive
- * really dead" tests...
*/
return;
if (test_and_clear_bit(In_sync, &rdev->flags)) {
@@ -1043,27 +1063,6 @@ static void close_sync(conf_t *conf)
conf->r10buf_pool = NULL;
}

-/* check if there are enough drives for
- * every block to appear on atleast one
- */
-static int enough(conf_t *conf)
-{
- int first = 0;
-
- do {
- int n = conf->copies;
- int cnt = 0;
- while (n--) {
- if (conf->mirrors[first].rdev)
- cnt++;
- first = (first+1) % conf->raid_disks;
- }
- if (cnt == 0)
- return 0;
- } while (first != 0);
- return 1;
-}
-
static int raid10_spare_active(mddev_t *mddev)
{
int i;
@@ -1107,7 +1106,7 @@ static int raid10_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
* very different from resync
*/
return -EBUSY;
- if (!enough(conf))
+ if (!enough(conf, -1))
return -EINVAL;

if (rdev->raid_disk >= 0)
@@ -1173,7 +1172,7 @@ static int raid10_remove_disk(mddev_t *mddev, int number)
*/
if (!test_bit(Faulty, &rdev->flags) &&
mddev->recovery_disabled != p->recovery_disabled &&
- enough(conf)) {
+ enough(conf, -1)) {
err = -EBUSY;
goto abort;
}
@@ -2286,7 +2285,7 @@ static int run(mddev_t *mddev)
disk->head_position = 0;
}
/* need to check that every block has at least one working mirror */
- if (!enough(conf)) {
+ if (!enough(conf, -1)) {
printk(KERN_ERR "md/raid10:%s: not enough operational mirrors.\n",
mdname(mddev));
goto out_free_conf;


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 22/34] md/raid: use printk_ratelimited instead ofprintk_ratelimit

am 21.07.2011 04:32:27 von NeilBrown

From: Christian Dietrich

As per printk_ratelimit comment, it should not be used.

Signed-off-by: Christian Dietrich
Signed-off-by: NeilBrown
---

drivers/md/raid1.c | 25 +++++++++++++++----------
drivers/md/raid10.c | 23 +++++++++++++----------
drivers/md/raid5.c | 47 +++++++++++++++++++++++++----------------------
3 files changed, 53 insertions(+), 42 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index f7431b6..d3a8f4b 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -35,6 +35,7 @@
#include
#include
#include
+#include
#include "md.h"
#include "raid1.h"
#include "bitmap.h"
@@ -287,10 +288,13 @@ static void raid1_end_read_request(struct bio *bio, int error)
* oops, read error:
*/
char b[BDEVNAME_SIZE];
- if (printk_ratelimit())
- printk(KERN_ERR "md/raid1:%s: %s: rescheduling sector %llu\n",
- mdname(conf->mddev),
- bdevname(conf->mirrors[mirror].rdev->bdev,b), (unsigned long long)r1_bio->sector);
+ printk_ratelimited(
+ KERN_ERR "md/raid1:%s: %s: "
+ "rescheduling sector %llu\n",
+ mdname(conf->mddev),
+ bdevname(conf->mirrors[mirror].rdev->bdev,
+ b),
+ (unsigned long long)r1_bio->sector);
reschedule_retry(r1_bio);
}

@@ -1580,12 +1584,13 @@ static void raid1d(mddev_t *mddev)
GFP_NOIO, mddev);
r1_bio->bios[r1_bio->read_disk] = bio;
rdev = conf->mirrors[disk].rdev;
- if (printk_ratelimit())
- printk(KERN_ERR "md/raid1:%s: redirecting sector %llu to"
- " other mirror: %s\n",
- mdname(mddev),
- (unsigned long long)r1_bio->sector,
- bdevname(rdev->bdev,b));
+ printk_ratelimited(
+ KERN_ERR
+ "md/raid1:%s: redirecting sector %llu"
+ " to other mirror: %s\n",
+ mdname(mddev),
+ (unsigned long long)r1_bio->sector,
+ bdevname(rdev->bdev, b));
bio->bi_sector = r1_bio->sector + rdev->data_offset;
bio->bi_bdev = rdev->bdev;
bio->bi_end_io = raid1_end_read_request;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3715e22..1725ec1 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -22,6 +22,7 @@
#include
#include
#include
+#include
#include "md.h"
#include "raid10.h"
#include "raid0.h"
@@ -301,10 +302,11 @@ static void raid10_end_read_request(struct bio *bio, int error)
* oops, read error - keep the refcount on the rdev
*/
char b[BDEVNAME_SIZE];
- if (printk_ratelimit())
- printk(KERN_ERR "md/raid10:%s: %s: rescheduling sector %llu\n",
- mdname(conf->mddev),
- bdevname(conf->mirrors[dev].rdev->bdev,b), (unsigned long long)r10_bio->sector);
+ printk_ratelimited(KERN_ERR
+ "md/raid10:%s: %s: rescheduling sector %llu\n",
+ mdname(conf->mddev),
+ bdevname(conf->mirrors[dev].rdev->bdev, b),
+ (unsigned long long)r10_bio->sector);
reschedule_retry(r10_bio);
}
}
@@ -1669,12 +1671,13 @@ static void raid10d(mddev_t *mddev)
bio_put(bio);
slot = r10_bio->read_slot;
rdev = conf->mirrors[mirror].rdev;
- if (printk_ratelimit())
- printk(KERN_ERR "md/raid10:%s: %s: redirecting sector %llu to"
- " another mirror\n",
- mdname(mddev),
- bdevname(rdev->bdev,b),
- (unsigned long long)r10_bio->sector);
+ printk_ratelimited(
+ KERN_ERR
+ "md/raid10:%s: %s: redirecting"
+ "sector %llu to another mirror\n",
+ mdname(mddev),
+ bdevname(rdev->bdev, b),
+ (unsigned long long)r10_bio->sector);
bio = bio_clone_mddev(r10_bio->master_bio,
GFP_NOIO, mddev);
r10_bio->devs[slot].bio = bio;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 93ee26c..b14183e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -51,6 +51,7 @@
#include
#include
#include
+#include
#include "md.h"
#include "raid5.h"
#include "raid0.h"
@@ -96,8 +97,6 @@
#define __inline__
#endif

-#define printk_rl(args...) ((void) (printk_ratelimit() && printk(args)))
-
/*
* We maintain a biased count of active stripes in the bottom 16 bits of
* bi_phys_segments, and a count of processed stripes in the upper 16 bits
@@ -1583,12 +1582,14 @@ static void raid5_end_read_request(struct bio * bi, int error)
set_bit(R5_UPTODATE, &sh->dev[i].flags);
if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
rdev = conf->disks[i].rdev;
- printk_rl(KERN_INFO "md/raid:%s: read error corrected"
- " (%lu sectors at %llu on %s)\n",
- mdname(conf->mddev), STRIPE_SECTORS,
- (unsigned long long)(sh->sector
- + rdev->data_offset),
- bdevname(rdev->bdev, b));
+ printk_ratelimited(
+ KERN_INFO
+ "md/raid:%s: read error corrected"
+ " (%lu sectors at %llu on %s)\n",
+ mdname(conf->mddev), STRIPE_SECTORS,
+ (unsigned long long)(sh->sector
+ + rdev->data_offset),
+ bdevname(rdev->bdev, b));
clear_bit(R5_ReadError, &sh->dev[i].flags);
clear_bit(R5_ReWrite, &sh->dev[i].flags);
}
@@ -1602,22 +1603,24 @@ static void raid5_end_read_request(struct bio * bi, int error)
clear_bit(R5_UPTODATE, &sh->dev[i].flags);
atomic_inc(&rdev->read_errors);
if (conf->mddev->degraded >= conf->max_degraded)
- printk_rl(KERN_WARNING
- "md/raid:%s: read error not correctable "
- "(sector %llu on %s).\n",
- mdname(conf->mddev),
- (unsigned long long)(sh->sector
- + rdev->data_offset),
- bdn);
+ printk_ratelimited(
+ KERN_WARNING
+ "md/raid:%s: read error not correctable "
+ "(sector %llu on %s).\n",
+ mdname(conf->mddev),
+ (unsigned long long)(sh->sector
+ + rdev->data_offset),
+ bdn);
else if (test_bit(R5_ReWrite, &sh->dev[i].flags))
/* Oh, no!!! */
- printk_rl(KERN_WARNING
- "md/raid:%s: read error NOT corrected!! "
- "(sector %llu on %s).\n",
- mdname(conf->mddev),
- (unsigned long long)(sh->sector
- + rdev->data_offset),
- bdn);
+ printk_ratelimited(
+ KERN_WARNING
+ "md/raid:%s: read error NOT corrected!! "
+ "(sector %llu on %s).\n",
+ mdname(conf->mddev),
+ (unsigned long long)(sh->sector
+ + rdev->data_offset),
+ bdn);
else if (atomic_read(&rdev->read_errors)
> conf->max_nr_stripes)
printk(KERN_WARNING


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 23/34] md: introduce link/unlink_rdev() helpers

am 21.07.2011 04:32:27 von NeilBrown

From: Namhyung Kim

There are places where sysfs links to rdev are handled
in a same way. Add the helper functions to consolidate
them.

Signed-off-by: Namhyung Kim
Signed-off-by: NeilBrown
---

drivers/md/md.c | 47 ++++++++++++++---------------------------------
drivers/md/md.h | 14 ++++++++++++++
drivers/md/raid1.c | 15 +++++----------
drivers/md/raid5.c | 10 +++-------
4 files changed, 36 insertions(+), 50 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 91e31e2..0398dc4 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2459,7 +2459,6 @@ slot_store(mdk_rdev_t *rdev, const char *buf, size_t len)
{
char *e;
int err;
- char nm[20];
int slot = simple_strtoul(buf, &e, 10);
if (strncmp(buf, "none", 4)==0)
slot = -1;
@@ -2482,8 +2481,7 @@ slot_store(mdk_rdev_t *rdev, const char *buf, size_t len)
hot_remove_disk(rdev->mddev, rdev->raid_disk);
if (err)
return err;
- sprintf(nm, "rd%d", rdev->raid_disk);
- sysfs_remove_link(&rdev->mddev->kobj, nm);
+ sysfs_unlink_rdev(rdev->mddev, rdev);
rdev->raid_disk = -1;
set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
md_wakeup_thread(rdev->mddev->thread);
@@ -2522,8 +2520,7 @@ slot_store(mdk_rdev_t *rdev, const char *buf, size_t len)
return err;
} else
sysfs_notify_dirent_safe(rdev->sysfs_state);
- sprintf(nm, "rd%d", rdev->raid_disk);
- if (sysfs_create_link(&rdev->mddev->kobj, &rdev->kobj, nm))
+ if (sysfs_link_rdev(rdev->mddev, rdev))
/* failure here is OK */;
/* don't wakeup anyone, leave that to userspace. */
} else {
@@ -3149,15 +3146,13 @@ level_store(mddev_t *mddev, const char *buf, size_t len)
}

list_for_each_entry(rdev, &mddev->disks, same_set) {
- char nm[20];
if (rdev->raid_disk < 0)
continue;
if (rdev->new_raid_disk >= mddev->raid_disks)
rdev->new_raid_disk = -1;
if (rdev->new_raid_disk == rdev->raid_disk)
continue;
- sprintf(nm, "rd%d", rdev->raid_disk);
- sysfs_remove_link(&mddev->kobj, nm);
+ sysfs_unlink_rdev(mddev, rdev);
}
list_for_each_entry(rdev, &mddev->disks, same_set) {
if (rdev->raid_disk < 0)
@@ -3168,11 +3163,10 @@ level_store(mddev_t *mddev, const char *buf, size_t len)
if (rdev->raid_disk < 0)
clear_bit(In_sync, &rdev->flags);
else {
- char nm[20];
- sprintf(nm, "rd%d", rdev->raid_disk);
- if(sysfs_create_link(&mddev->kobj, &rdev->kobj, nm))
- printk("md: cannot register %s for %s after level change\n",
- nm, mdname(mddev));
+ if (sysfs_link_rdev(mddev, rdev))
+ printk(KERN_WARNING "md: cannot register rd%d"
+ " for %s after level change\n",
+ rdev->raid_disk, mdname(mddev));
}
}

@@ -4621,12 +4615,9 @@ int md_run(mddev_t *mddev)
smp_wmb();
mddev->ready = 1;
list_for_each_entry(rdev, &mddev->disks, same_set)
- if (rdev->raid_disk >= 0) {
- char nm[20];
- sprintf(nm, "rd%d", rdev->raid_disk);
- if (sysfs_create_link(&mddev->kobj, &rdev->kobj, nm))
+ if (rdev->raid_disk >= 0)
+ if (sysfs_link_rdev(mddev, rdev))
/* failure here is OK */;
- }

set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);

@@ -4854,11 +4845,8 @@ static int do_md_stop(mddev_t * mddev, int mode, int is_open)
sysfs_notify_dirent_safe(mddev->sysfs_state);

list_for_each_entry(rdev, &mddev->disks, same_set)
- if (rdev->raid_disk >= 0) {
- char nm[20];
- sprintf(nm, "rd%d", rdev->raid_disk);
- sysfs_remove_link(&mddev->kobj, nm);
- }
+ if (rdev->raid_disk >= 0)
+ sysfs_unlink_rdev(mddev, rdev);

set_capacity(disk, 0);
mutex_unlock(&mddev->open_mutex);
@@ -7077,9 +7065,7 @@ static int remove_and_add_spares(mddev_t *mddev)
atomic_read(&rdev->nr_pending)==0) {
if (mddev->pers->hot_remove_disk(
mddev, rdev->raid_disk)==0) {
- char nm[20];
- sprintf(nm,"rd%d", rdev->raid_disk);
- sysfs_remove_link(&mddev->kobj, nm);
+ sysfs_unlink_rdev(mddev, rdev);
rdev->raid_disk = -1;
}
}
@@ -7096,10 +7082,7 @@ static int remove_and_add_spares(mddev_t *mddev)
rdev->recovery_offset = 0;
if (mddev->pers->
hot_add_disk(mddev, rdev) == 0) {
- char nm[20];
- sprintf(nm, "rd%d", rdev->raid_disk);
- if (sysfs_create_link(&mddev->kobj,
- &rdev->kobj, nm))
+ if (sysfs_link_rdev(mddev, rdev))
/* failure here is OK */;
spares++;
md_new_event(mddev);
@@ -7219,9 +7202,7 @@ void md_check_recovery(mddev_t *mddev)
atomic_read(&rdev->nr_pending)==0) {
if (mddev->pers->hot_remove_disk(
mddev, rdev->raid_disk)==0) {
- char nm[20];
- sprintf(nm,"rd%d", rdev->raid_disk);
- sysfs_remove_link(&mddev->kobj, nm);
+ sysfs_unlink_rdev(mddev, rdev);
rdev->raid_disk = -1;
}
}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1c26c7a..6863f72 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -413,6 +413,20 @@ static inline char * mdname (mddev_t * mddev)
return mddev->gendisk ? mddev->gendisk->disk_name : "mdX";
}

+static inline int sysfs_link_rdev(mddev_t *mddev, mdk_rdev_t *rdev)
+{
+ char nm[20];
+ sprintf(nm, "rd%d", rdev->raid_disk);
+ return sysfs_create_link(&mddev->kobj, &rdev->kobj, nm);
+}
+
+static inline void sysfs_unlink_rdev(mddev_t *mddev, mdk_rdev_t *rdev)
+{
+ char nm[20];
+ sprintf(nm, "rd%d", rdev->raid_disk);
+ sysfs_remove_link(&mddev->kobj, nm);
+}
+
/*
* iterates through some rdev ringlist. It's safe to remove the
* current 'rdev'. Dont touch 'tmp' though.
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d3a8f4b..1d79a04 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2159,18 +2159,13 @@ static int raid1_reshape(mddev_t *mddev)
for (d = d2 = 0; d < conf->raid_disks; d++) {
mdk_rdev_t *rdev = conf->mirrors[d].rdev;
if (rdev && rdev->raid_disk != d2) {
- char nm[20];
- sprintf(nm, "rd%d", rdev->raid_disk);
- sysfs_remove_link(&mddev->kobj, nm);
+ sysfs_unlink_rdev(mddev, rdev);
rdev->raid_disk = d2;
- sprintf(nm, "rd%d", rdev->raid_disk);
- sysfs_remove_link(&mddev->kobj, nm);
- if (sysfs_create_link(&mddev->kobj,
- &rdev->kobj, nm))
+ sysfs_unlink_rdev(mddev, rdev);
+ if (sysfs_link_rdev(mddev, rdev))
printk(KERN_WARNING
- "md/raid1:%s: cannot register "
- "%s\n",
- mdname(mddev), nm);
+ "md/raid1:%s: cannot register rd%d\n",
+ mdname(mddev), rdev->raid_disk);
}
if (rdev)
newmirrors[d2++].rdev = rdev;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b14183e..3c67942 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5152,16 +5152,14 @@ static int raid5_start_reshape(mddev_t *mddev)
if (rdev->raid_disk < 0 &&
!test_bit(Faulty, &rdev->flags)) {
if (raid5_add_disk(mddev, rdev) == 0) {
- char nm[20];
if (rdev->raid_disk
>= conf->previous_raid_disks) {
set_bit(In_sync, &rdev->flags);
added_devices++;
} else
rdev->recovery_offset = 0;
- sprintf(nm, "rd%d", rdev->raid_disk);
- if (sysfs_create_link(&mddev->kobj,
- &rdev->kobj, nm))
+
+ if (sysfs_link_rdev(mddev, rdev))
/* Failure here is OK */;
}
} else if (rdev->raid_disk >= conf->previous_raid_disks
@@ -5257,9 +5255,7 @@ static void raid5_finish_reshape(mddev_t *mddev)
d++) {
mdk_rdev_t *rdev = conf->disks[d].rdev;
if (rdev && raid5_remove_disk(mddev, d) == 0) {
- char nm[20];
- sprintf(nm, "rd%d", rdev->raid_disk);
- sysfs_remove_link(&mddev->kobj, nm);
+ sysfs_unlink_rdev(mddev, rdev);
rdev->raid_disk = -1;
}
}


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 24/34] md: remove ro check in md_check_recovery()

am 21.07.2011 04:32:27 von NeilBrown

From: Namhyung Kim

Commit c89a8eee6154 ("Allow faulty devices to be removed from a
readonly array.") added some work on ro array in the function,
but it couldn't be done since we didn't allow the ro array to be
handled from the beginning. Fix it.

Signed-off-by: Namhyung Kim
Signed-off-by: NeilBrown
---

drivers/md/md.c | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0398dc4..77bd8d8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7163,9 +7163,6 @@ void md_check_recovery(mddev_t *mddev)
if (mddev->bitmap)
bitmap_daemon_work(mddev);

- if (mddev->ro)
- return;
-
if (signal_pending(current)) {
if (mddev->pers->sync_request && !mddev->external) {
printk(KERN_INFO "md: %s in immediate safe mode\n",


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 26/34] md/raid10: Make use of new recovery_disabled handling

am 21.07.2011 04:32:27 von NeilBrown

When we get a read error during recovery, RAID10 previously
arranged for the recovering device to appear to fail so that
the recovery stops and doesn't restart. This is misleading and wrong.

Instead, make use of the new recovery_disabled handling and mark
the target device and having recovery disabled.

Add appropriate checks in add_disk and remove_disk so that devices
are removed and not re-added when recovery is disabled.

Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 62 ++++++++++++++++++++++++++++++---------------------
drivers/md/raid10.h | 5 ++++
2 files changed, 41 insertions(+), 26 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 1725ec1..5583201 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1099,7 +1099,6 @@ static int raid10_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
conf_t *conf = mddev->private;
int err = -EEXIST;
int mirror;
- mirror_info_t *p;
int first = 0;
int last = conf->raid_disks - 1;

@@ -1119,32 +1118,36 @@ static int raid10_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
mirror = rdev->saved_raid_disk;
else
mirror = first;
- for ( ; mirror <= last ; mirror++)
- if ( !(p=conf->mirrors+mirror)->rdev) {
-
- disk_stack_limits(mddev->gendisk, rdev->bdev,
- rdev->data_offset << 9);
- /* as we don't honour merge_bvec_fn, we must
- * never risk violating it, so limit
- * ->max_segments to one lying with a single
- * page, as a one page request is never in
- * violation.
- */
- if (rdev->bdev->bd_disk->queue->merge_bvec_fn) {
- blk_queue_max_segments(mddev->queue, 1);
- blk_queue_segment_boundary(mddev->queue,
- PAGE_CACHE_SIZE - 1);
- }
+ for ( ; mirror <= last ; mirror++) {
+ mirror_info_t *p = &conf->mirrors[mirror];
+ if (p->recovery_disabled == mddev->recovery_disabled)
+ continue;
+ if (!p->rdev)
+ continue;

- p->head_position = 0;
- rdev->raid_disk = mirror;
- err = 0;
- if (rdev->saved_raid_disk != mirror)
- conf->fullsync = 1;
- rcu_assign_pointer(p->rdev, rdev);
- break;
+ disk_stack_limits(mddev->gendisk, rdev->bdev,
+ rdev->data_offset << 9);
+ /* as we don't honour merge_bvec_fn, we must
+ * never risk violating it, so limit
+ * ->max_segments to one lying with a single
+ * page, as a one page request is never in
+ * violation.
+ */
+ if (rdev->bdev->bd_disk->queue->merge_bvec_fn) {
+ blk_queue_max_segments(mddev->queue, 1);
+ blk_queue_segment_boundary(mddev->queue,
+ PAGE_CACHE_SIZE - 1);
}

+ p->head_position = 0;
+ rdev->raid_disk = mirror;
+ err = 0;
+ if (rdev->saved_raid_disk != mirror)
+ conf->fullsync = 1;
+ rcu_assign_pointer(p->rdev, rdev);
+ break;
+ }
+
md_integrity_add_rdev(rdev, mddev);
print_conf(conf);
return err;
@@ -1169,6 +1172,7 @@ static int raid10_remove_disk(mddev_t *mddev, int number)
* is not possible.
*/
if (!test_bit(Faulty, &rdev->flags) &&
+ mddev->recovery_disabled != p->recovery_disabled &&
enough(conf)) {
err = -EBUSY;
goto abort;
@@ -1383,8 +1387,14 @@ static void recovery_request_write(mddev_t *mddev, r10bio_t *r10_bio)
md_sync_acct(conf->mirrors[d].rdev->bdev, wbio->bi_size >> 9);
if (test_bit(R10BIO_Uptodate, &r10_bio->state))
generic_make_request(wbio);
- else
- bio_endio(wbio, -EIO);
+ else {
+ printk(KERN_NOTICE
+ "md/raid10:%s: recovery aborted due to read error\n",
+ mdname(mddev));
+ conf->mirrors[d].recovery_disabled = mddev->recovery_disabled;
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ bio_endio(wbio, 0);
+ }
}


diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 944b110..a485914 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -6,6 +6,11 @@ typedef struct mirror_info mirror_info_t;
struct mirror_info {
mdk_rdev_t *rdev;
sector_t head_position;
+ int recovery_disabled; /* matches
+ * mddev->recovery_disabled
+ * when we shouldn't try
+ * recovering this device.
+ */
};

typedef struct r10bio_s r10bio_t;


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 21/34] md: use proper little-endian bitops

am 21.07.2011 04:32:27 von NeilBrown

From: Akinobu Mita

Using __test_and_{set,clear}_bit_le() with ignoring its return value
can be replaced with __{set,clear}_bit_le().

Signed-off-by: Akinobu Mita
Cc: NeilBrown
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown
---

drivers/md/bitmap.c | 8 +++++---
1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 574b09a..39ff0ef 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -932,7 +932,7 @@ static void bitmap_file_set_bit(struct bitmap *bitmap, sector_t block)
if (bitmap->flags & BITMAP_HOSTENDIAN)
set_bit(bit, kaddr);
else
- __test_and_set_bit_le(bit, kaddr);
+ __set_bit_le(bit, kaddr);
kunmap_atomic(kaddr, KM_USER0);
PRINTK("set file bit %lu page %lu\n", bit, page->index);
}
@@ -1304,8 +1304,10 @@ void bitmap_daemon_work(mddev_t *mddev)
clear_bit(file_page_offset(bitmap, j),
paddr);
else
- __test_and_clear_bit_le(file_page_offset(bitmap, j),
- paddr);
+ __clear_bit_le(
+ file_page_offset(bitmap,
+ j),
+ paddr);
kunmap_atomic(paddr, KM_USER0);
} else
log->type->clear_region(log, j);


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 29/34] md/raid1: move rdev->corrected_errors counting

am 21.07.2011 04:32:28 von NeilBrown

From: Namhyung Kim

Read errors are considered to corrected if write-back and re-read
cycle is finished without further problems. Thus moving the rdev->
corrected_errors counting after the re-reading looks more reasonable
IMHO. Also included a couple of whitespace fixes on sync_page_io().

Signed-off-by: Namhyung Kim
Signed-off-by: NeilBrown
---

drivers/md/raid1.c | 17 ++++++-----------
1 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 44069b3..a7e6908 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1224,9 +1224,7 @@ static int fix_sync_read_error(r1bio_t *r1_bio)
* active, and resync is currently active
*/
rdev = conf->mirrors[d].rdev;
- if (sync_page_io(rdev,
- sect,
- s<<9,
+ if (sync_page_io(rdev, sect, s<<9,
bio->bi_io_vec[idx].bv_page,
READ, false)) {
success = 1;
@@ -1261,16 +1259,13 @@ static int fix_sync_read_error(r1bio_t *r1_bio)
if (r1_bio->bios[d]->bi_end_io != end_sync_read)
continue;
rdev = conf->mirrors[d].rdev;
- if (sync_page_io(rdev,
- sect,
- s<<9,
+ if (sync_page_io(rdev, sect, s<<9,
bio->bi_io_vec[idx].bv_page,
WRITE, false) == 0) {
r1_bio->bios[d]->bi_end_io = NULL;
rdev_dec_pending(rdev, mddev);
md_error(mddev, rdev);
- } else
- atomic_add(s, &rdev->corrected_errors);
+ }
}
d = start;
while (d != r1_bio->read_disk) {
@@ -1280,12 +1275,12 @@ static int fix_sync_read_error(r1bio_t *r1_bio)
if (r1_bio->bios[d]->bi_end_io != end_sync_read)
continue;
rdev = conf->mirrors[d].rdev;
- if (sync_page_io(rdev,
- sect,
- s<<9,
+ if (sync_page_io(rdev, sect, s<<9,
bio->bi_io_vec[idx].bv_page,
READ, false) == 0)
md_error(mddev, rdev);
+ else
+ atomic_add(s, &rdev->corrected_errors);
}
sectors -= s;
sect += s;


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 28/34] md: get rid of unnecessary casts on page_address()

am 21.07.2011 04:32:28 von NeilBrown

From: Namhyung Kim

page_address() returns void pointer, so the casts can be removed.

Signed-off-by: Namhyung Kim
Signed-off-by: NeilBrown
---

drivers/md/md.c | 23 +++++++++++------------
1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c7d9c6a..2a32050 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1025,7 +1025,7 @@ static int super_90_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version
ret = -EINVAL;

bdevname(rdev->bdev, b);
- sb = (mdp_super_t*)page_address(rdev->sb_page);
+ sb = page_address(rdev->sb_page);

if (sb->md_magic != MD_SB_MAGIC) {
printk(KERN_ERR "md: invalid raid superblock magic on %s\n",
@@ -1064,7 +1064,7 @@ static int super_90_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version
ret = 1;
} else {
__u64 ev1, ev2;
- mdp_super_t *refsb = (mdp_super_t*)page_address(refdev->sb_page);
+ mdp_super_t *refsb = page_address(refdev->sb_page);
if (!uuid_equal(refsb, sb)) {
printk(KERN_WARNING "md: %s has different UUID to %s\n",
b, bdevname(refdev->bdev,b2));
@@ -1099,7 +1099,7 @@ static int super_90_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version
static int super_90_validate(mddev_t *mddev, mdk_rdev_t *rdev)
{
mdp_disk_t *desc;
- mdp_super_t *sb = (mdp_super_t *)page_address(rdev->sb_page);
+ mdp_super_t *sb = page_address(rdev->sb_page);
__u64 ev1 = md_event(sb);

rdev->raid_disk = -1;
@@ -1230,7 +1230,7 @@ static void super_90_sync(mddev_t *mddev, mdk_rdev_t *rdev)

rdev->sb_size = MD_SB_BYTES;

- sb = (mdp_super_t*)page_address(rdev->sb_page);
+ sb = page_address(rdev->sb_page);

memset(sb, 0, sizeof(*sb));

@@ -1435,7 +1435,7 @@ static int super_1_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version)
if (ret) return ret;


- sb = (struct mdp_superblock_1*)page_address(rdev->sb_page);
+ sb = page_address(rdev->sb_page);

if (sb->magic != cpu_to_le32(MD_SB_MAGIC) ||
sb->major_version != cpu_to_le32(1) ||
@@ -1477,8 +1477,7 @@ static int super_1_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version)
ret = 1;
} else {
__u64 ev1, ev2;
- struct mdp_superblock_1 *refsb =
- (struct mdp_superblock_1*)page_address(refdev->sb_page);
+ struct mdp_superblock_1 *refsb = page_address(refdev->sb_page);

if (memcmp(sb->set_uuid, refsb->set_uuid, 16) != 0 ||
sb->level != refsb->level ||
@@ -1513,7 +1512,7 @@ static int super_1_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version)

static int super_1_validate(mddev_t *mddev, mdk_rdev_t *rdev)
{
- struct mdp_superblock_1 *sb = (struct mdp_superblock_1*)page_address(rdev->sb_page);
+ struct mdp_superblock_1 *sb = page_address(rdev->sb_page);
__u64 ev1 = le64_to_cpu(sb->events);

rdev->raid_disk = -1;
@@ -1619,7 +1618,7 @@ static void super_1_sync(mddev_t *mddev, mdk_rdev_t *rdev)
int max_dev, i;
/* make rdev->sb match mddev and rdev data. */

- sb = (struct mdp_superblock_1*)page_address(rdev->sb_page);
+ sb = page_address(rdev->sb_page);

sb->feature_map = 0;
sb->pad0 = 0;
@@ -1724,7 +1723,7 @@ super_1_rdev_size_change(mdk_rdev_t *rdev, sector_t num_sectors)
num_sectors = max_sectors;
rdev->sb_start = sb_start;
}
- sb = (struct mdp_superblock_1 *) page_address(rdev->sb_page);
+ sb = page_address(rdev->sb_page);
sb->data_size = cpu_to_le64(num_sectors);
sb->super_offset = rdev->sb_start;
sb->sb_csum = calc_sb_1_csum(sb);
@@ -2127,10 +2126,10 @@ static void print_rdev(mdk_rdev_t *rdev, int major_version)
printk(KERN_INFO "md: rdev superblock (MJ:%d):\n", major_version);
switch (major_version) {
case 0:
- print_sb_90((mdp_super_t*)page_address(rdev->sb_page));
+ print_sb_90(page_address(rdev->sb_page));
break;
case 1:
- print_sb_1((struct mdp_superblock_1 *)page_address(rdev->sb_page));
+ print_sb_1(page_address(rdev->sb_page));
break;
}
} else


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 31/34] md/raid10: move rdev->corrected_errors counting

am 21.07.2011 04:32:28 von NeilBrown

From: Namhyung Kim

Read errors are considered to corrected if write-back and re-read
cycle is finished without further problems. Thus moving the rdev->
corrected_errors counting after the re-reading looks more reasonable
IMHO.

Signed-off-by: Namhyung Kim
Signed-off-by: NeilBrown
---

drivers/md/raid10.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 6721cb0..5def27c 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1533,7 +1533,6 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)
test_bit(In_sync, &rdev->flags)) {
atomic_inc(&rdev->nr_pending);
rcu_read_unlock();
- atomic_add(s, &rdev->corrected_errors);
if (sync_page_io(rdev,
r10_bio->devs[sl].addr +
sect,
@@ -1598,6 +1597,7 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)
(unsigned long long)(
sect + rdev->data_offset),
bdevname(rdev->bdev, b));
+ atomic_add(s, &rdev->corrected_errors);
}

rdev_dec_pending(rdev, mddev);


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 30/34] md/raid5: move rdev->corrected_errors counting

am 21.07.2011 04:32:28 von NeilBrown

From: Namhyung Kim

Read errors are considered to corrected if write-back and re-read
cycle is finished without further problems. Thus moving the rdev->
corrected_errors counting after the re-reading looks more reasonable
IMHO.

Signed-off-by: Namhyung Kim
Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 5 +----
1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3c67942..ee26f50 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -547,10 +547,6 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
bi->bi_io_vec[0].bv_offset = 0;
bi->bi_size = STRIPE_SIZE;
bi->bi_next = NULL;
- if ((rw & WRITE) &&
- test_bit(R5_ReWrite, &sh->dev[i].flags))
- atomic_add(STRIPE_SECTORS,
- &rdev->corrected_errors);
generic_make_request(bi);
} else {
if (rw & WRITE)
@@ -1590,6 +1586,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
(unsigned long long)(sh->sector
+ rdev->data_offset),
bdevname(rdev->bdev, b));
+ atomic_add(STRIPE_SECTORS, &rdev->corrected_errors);
clear_bit(R5_ReadError, &sh->dev[i].flags);
clear_bit(R5_ReWrite, &sh->dev[i].flags);
}


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 34/34] MD bitmap: Revert DM dirty log hooks

am 21.07.2011 04:32:28 von NeilBrown

From: Jonathan Brassow

Revert most of commit e384e58549a2e9a83071ad80280c1a9053cfd84c
md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.

MD should not need to use DM's dirty log - we decided to use md's
bitmaps instead.

Keeping the DIV_ROUND_UP clean-ups that were part of commit
e384e58549a2e9a83071ad80280c1a9053cfd84c, however.

Signed-off-by: Jonathan Brassow
Signed-off-by: NeilBrown
---

drivers/md/bitmap.c | 133 ++++++++++++++++-----------------------------------
drivers/md/bitmap.h | 5 --
drivers/md/md.h | 5 --
3 files changed, 43 insertions(+), 100 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 39ff0ef..0dc6546 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -29,7 +29,6 @@
#include "md.h"
#include "bitmap.h"

-#include
/* debug macros */

#define DEBUG 0
@@ -775,10 +774,8 @@ static inline unsigned long file_page_offset(struct bitmap *bitmap, unsigned lon
* 0 or page 1
*/
static inline struct page *filemap_get_page(struct bitmap *bitmap,
- unsigned long chunk)
+ unsigned long chunk)
{
- if (bitmap->filemap == NULL)
- return NULL;
if (file_page_index(bitmap, chunk) >= bitmap->file_pages)
return NULL;
return bitmap->filemap[file_page_index(bitmap, chunk)
@@ -878,28 +875,19 @@ enum bitmap_page_attr {
static inline void set_page_attr(struct bitmap *bitmap, struct page *page,
enum bitmap_page_attr attr)
{
- if (page)
- __set_bit((page->index<<2) + attr, bitmap->filemap_attr);
- else
- __set_bit(attr, &bitmap->logattrs);
+ __set_bit((page->index<<2) + attr, bitmap->filemap_attr);
}

static inline void clear_page_attr(struct bitmap *bitmap, struct page *page,
enum bitmap_page_attr attr)
{
- if (page)
- __clear_bit((page->index<<2) + attr, bitmap->filemap_attr);
- else
- __clear_bit(attr, &bitmap->logattrs);
+ __clear_bit((page->index<<2) + attr, bitmap->filemap_attr);
}

static inline unsigned long test_page_attr(struct bitmap *bitmap, struct page *page,
enum bitmap_page_attr attr)
{
- if (page)
- return test_bit((page->index<<2) + attr, bitmap->filemap_attr);
- else
- return test_bit(attr, &bitmap->logattrs);
+ return test_bit((page->index<<2) + attr, bitmap->filemap_attr);
}

/*
@@ -912,30 +900,26 @@ static inline unsigned long test_page_attr(struct bitmap *bitmap, struct page *p
static void bitmap_file_set_bit(struct bitmap *bitmap, sector_t block)
{
unsigned long bit;
- struct page *page = NULL;
+ struct page *page;
void *kaddr;
unsigned long chunk = block >> CHUNK_BLOCK_SHIFT(bitmap);

- if (!bitmap->filemap) {
- struct dm_dirty_log *log = bitmap->mddev->bitmap_info.log;
- if (log)
- log->type->mark_region(log, chunk);
- } else {
+ if (!bitmap->filemap)
+ return;

- page = filemap_get_page(bitmap, chunk);
- if (!page)
- return;
- bit = file_page_offset(bitmap, chunk);
+ page = filemap_get_page(bitmap, chunk);
+ if (!page)
+ return;
+ bit = file_page_offset(bitmap, chunk);

- /* set the bit */
- kaddr = kmap_atomic(page, KM_USER0);
- if (bitmap->flags & BITMAP_HOSTENDIAN)
- set_bit(bit, kaddr);
- else
- __set_bit_le(bit, kaddr);
- kunmap_atomic(kaddr, KM_USER0);
- PRINTK("set file bit %lu page %lu\n", bit, page->index);
- }
+ /* set the bit */
+ kaddr = kmap_atomic(page, KM_USER0);
+ if (bitmap->flags & BITMAP_HOSTENDIAN)
+ set_bit(bit, kaddr);
+ else
+ __set_bit_le(bit, kaddr);
+ kunmap_atomic(kaddr, KM_USER0);
+ PRINTK("set file bit %lu page %lu\n", bit, page->index);
/* record page number so it gets flushed to disk when unplug occurs */
set_page_attr(bitmap, page, BITMAP_PAGE_DIRTY);
}
@@ -952,16 +936,6 @@ void bitmap_unplug(struct bitmap *bitmap)

if (!bitmap)
return;
- if (!bitmap->filemap) {
- /* Must be using a dirty_log */
- struct dm_dirty_log *log = bitmap->mddev->bitmap_info.log;
- dirty = test_and_clear_bit(BITMAP_PAGE_DIRTY, &bitmap->logattrs);
- need_write = test_and_clear_bit(BITMAP_PAGE_NEEDWRITE, &bitmap->logattrs);
- if (dirty || need_write)
- if (log->type->flush(log))
- bitmap->flags |= BITMAP_WRITE_ERROR;
- goto out;
- }

/* look at each page to see if there are any set bits that need to be
* flushed out to disk */
@@ -990,7 +964,6 @@ void bitmap_unplug(struct bitmap *bitmap)
else
md_super_wait(bitmap->mddev);
}
-out:
if (bitmap->flags & BITMAP_WRITE_ERROR)
bitmap_file_kick(bitmap);
}
@@ -1199,7 +1172,6 @@ void bitmap_daemon_work(mddev_t *mddev)
struct page *page = NULL, *lastpage = NULL;
sector_t blocks;
void *paddr;
- struct dm_dirty_log *log = mddev->bitmap_info.log;

/* Use a mutex to guard daemon_work against
* bitmap_destroy.
@@ -1224,12 +1196,11 @@ void bitmap_daemon_work(mddev_t *mddev)
spin_lock_irqsave(&bitmap->lock, flags);
for (j = 0; j < bitmap->chunks; j++) {
bitmap_counter_t *bmc;
- if (!bitmap->filemap) {
- if (!log)
- /* error or shutdown */
- break;
- } else
- page = filemap_get_page(bitmap, j);
+ if (!bitmap->filemap)
+ /* error or shutdown */
+ break;
+
+ page = filemap_get_page(bitmap, j);

if (page != lastpage) {
/* skip this page unless it's marked as needing cleaning */
@@ -1298,19 +1269,16 @@ void bitmap_daemon_work(mddev_t *mddev)
-1);

/* clear the bit */
- if (page) {
- paddr = kmap_atomic(page, KM_USER0);
- if (bitmap->flags & BITMAP_HOSTENDIAN)
- clear_bit(file_page_offset(bitmap, j),
- paddr);
- else
- __clear_bit_le(
+ paddr = kmap_atomic(page, KM_USER0);
+ if (bitmap->flags & BITMAP_HOSTENDIAN)
+ clear_bit(file_page_offset(bitmap, j),
+ paddr);
+ else
+ __clear_bit_le(
file_page_offset(bitmap,
j),
paddr);
- kunmap_atomic(paddr, KM_USER0);
- } else
- log->type->clear_region(log, j);
+ kunmap_atomic(paddr, KM_USER0);
}
} else
j |= PAGE_COUNTER_MASK;
@@ -1318,16 +1286,12 @@ void bitmap_daemon_work(mddev_t *mddev)
spin_unlock_irqrestore(&bitmap->lock, flags);

/* now sync the final page */
- if (lastpage != NULL || log != NULL) {
+ if (lastpage != NULL) {
spin_lock_irqsave(&bitmap->lock, flags);
if (test_page_attr(bitmap, lastpage, BITMAP_PAGE_NEEDWRITE)) {
clear_page_attr(bitmap, lastpage, BITMAP_PAGE_NEEDWRITE);
spin_unlock_irqrestore(&bitmap->lock, flags);
- if (lastpage)
- write_page(bitmap, lastpage, 0);
- else
- if (log->type->flush(log))
- bitmap->flags |= BITMAP_WRITE_ERROR;
+ write_page(bitmap, lastpage, 0);
} else {
set_page_attr(bitmap, lastpage, BITMAP_PAGE_NEEDWRITE);
spin_unlock_irqrestore(&bitmap->lock, flags);
@@ -1769,12 +1733,10 @@ int bitmap_create(mddev_t *mddev)
BUILD_BUG_ON(sizeof(bitmap_super_t) != 256);

if (!file
- && !mddev->bitmap_info.offset
- && !mddev->bitmap_info.log) /* bitmap disabled, nothing to do */
+ && !mddev->bitmap_info.offset) /* bitmap disabled, nothing to do */
return 0;

BUG_ON(file && mddev->bitmap_info.offset);
- BUG_ON(mddev->bitmap_info.offset && mddev->bitmap_info.log);

bitmap = kzalloc(sizeof(*bitmap), GFP_KERNEL);
if (!bitmap)
@@ -1865,6 +1827,7 @@ int bitmap_create(mddev_t *mddev)
int bitmap_load(mddev_t *mddev)
{
int err = 0;
+ sector_t start = 0;
sector_t sector = 0;
struct bitmap *bitmap = mddev->bitmap;

@@ -1883,24 +1846,14 @@ int bitmap_load(mddev_t *mddev)
}
bitmap_close_sync(bitmap);

- if (mddev->bitmap_info.log) {
- unsigned long i;
- struct dm_dirty_log *log = mddev->bitmap_info.log;
- for (i = 0; i < bitmap->chunks; i++)
- if (!log->type->in_sync(log, i, 1))
- bitmap_set_memory_bits(bitmap,
- (sector_t)i << CHUNK_BLOCK_SHIFT(bitmap),
- 1);
- } else {
- sector_t start = 0;
- if (mddev->degraded == 0
- || bitmap->events_cleared == mddev->events)
- /* no need to keep dirty bits to optimise a
- * re-add of a missing device */
- start = mddev->recovery_cp;
-
- err = bitmap_init_from_disk(bitmap, start);
- }
+ if (mddev->degraded == 0
+ || bitmap->events_cleared == mddev->events)
+ /* no need to keep dirty bits to optimise a
+ * re-add of a missing device */
+ start = mddev->recovery_cp;
+
+ err = bitmap_init_from_disk(bitmap, start);
+
if (err)
goto out;

diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
index b2a127e..a28f2e5 100644
--- a/drivers/md/bitmap.h
+++ b/drivers/md/bitmap.h
@@ -212,10 +212,6 @@ struct bitmap {
unsigned long file_pages; /* number of pages in the file */
int last_page_size; /* bytes in the last page */

- unsigned long logattrs; /* used when filemap_attr doesn't exist
- * because we are working with a dirty_log
- */
-
unsigned long flags;

int allclean;
@@ -237,7 +233,6 @@ struct bitmap {
wait_queue_head_t behind_wait;

struct sysfs_dirent *sysfs_can_clear;
-
};

/* the bitmap API */
diff --git a/drivers/md/md.h b/drivers/md/md.h
index de5455d..7d906a9 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -307,11 +307,6 @@ struct mddev_s
* hot-adding a bitmap. It should
* eventually be settable by sysfs.
*/
- /* When md is serving under dm, it might use a
- * dirty_log to store the bits.
- */
- struct dm_dirty_log *log;
-
struct mutex mutex;
unsigned long chunksize;
unsigned long daemon_sleep; /* how many jiffies between updates? */


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[md PATCH 32/34] md/raid5: Avoid BUG caused by multiple failures.

am 21.07.2011 04:32:28 von NeilBrown

While preparing to write a stripe we keep the parity block or blocks
locked (R5_LOCKED) - towards the end of schedule_reconstruction.

If the array is discovered to have failed before this write completes
we can leave those blocks LOCKED, and init_stripe will notice that a
free stripe still has a locked block and will complain.

So clear the R5_LOCKED flag in handle_failed_stripe, and demote the
'BUG' to a 'WARN_ON'.

Signed-off-by: NeilBrown
---

drivers/md/raid5.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ee26f50..6337768 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -340,7 +340,7 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
(unsigned long long)sh->sector, i, dev->toread,
dev->read, dev->towrite, dev->written,
test_bit(R5_LOCKED, &dev->flags));
- BUG();
+ WARN_ON(1);
}
dev->flags = 0;
raid5_build_block(sh, i, previous);
@@ -2301,6 +2301,10 @@ handle_failed_stripe(raid5_conf_t *conf, struct stripe_head *sh,
if (bitmap_end)
bitmap_endwrite(conf->mddev->bitmap, sh->sector,
STRIPE_SECTORS, 0, 0);
+ /* If we were in the middle of a write the parity block might
+ * still be locked - so just clear all R5_LOCKED flags
+ */
+ clear_bit(R5_LOCKED, &sh->dev[i].flags);
}

if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 06/34] md/raid5: Remove use of sh->lock in sync_request

am 22.07.2011 05:39:54 von Namhyung Kim

NeilBrown writes:

> This is the start of a series of patches to remove sh->lock.
>
> sync_request takes sh->lock before setting STRIPE_SYNCING to ensure
> there is no race with testing it in handle_stripe[56].
>
> Instead, use a new flag STRIPE_SYNC_REQUESTED and test it early
> in handle_stripe[56] (after getting the same lock) and perform the
> same set/clear operations if it was set.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 07/34] md/raid5: Protect some more code with ->device_lock.

am 22.07.2011 05:54:01 von Namhyung Kim

NeilBrown writes:

> Other places that change or follow dev->towrite and dev->written take
> the device_lock as well as the sh->lock.
> So it should really be held in these places too.
> Also, doing so will allow sh->lock to be discarded.
>
> with merged fixes by: Namhyung Kim
>
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 08/34] md/raid5: replace sh->lock with an "active" flag.

am 22.07.2011 06:27:36 von Namhyung Kim

NeilBrown writes:

> sh->lock is now mainly used to ensure that two threads aren't running
> in the locked part of handle_stripe[56] at the same time.
>
> That can more neatly be achieved with an 'active' flag which we set
> while running handle_stripe. If we find the flag is set, we simply
> requeue the stripe for later by setting STRIPE_HANDLE.
>
> For safety we take ->device_lock while examining the state of the
> stripe and creating a summary in 'stripe_head_state / r6_state'.
> This possibly isn't needed but as shared fields like ->toread,
> ->towrite are checked it is safer for now at least.
>
> We leave the label after the old 'unlock' called "unlock" because it
> will disappear in a few patches, so renaming seems pointless.
>
> This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
> later, but that is not a problem.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim

But I have a question, please see below.


> ---
>
> drivers/md/raid5.c | 26 +++++++++++++-------------
> drivers/md/raid5.h | 2 +-
> 2 files changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 9985138..f8275b5 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -1020,14 +1020,12 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
> if (test_and_clear_bit(R5_Wantdrain, &dev->flags)) {
> struct bio *wbi;
>
> - spin_lock(&sh->lock);
> spin_lock_irq(&sh->raid_conf->device_lock);
> chosen = dev->towrite;
> dev->towrite = NULL;
> BUG_ON(dev->written);
> wbi = dev->written = chosen;
> spin_unlock_irq(&sh->raid_conf->device_lock);
> - spin_unlock(&sh->lock);
>
> while (wbi && wbi->bi_sector <
> dev->sector + STRIPE_SECTORS) {
> @@ -1322,7 +1320,6 @@ static int grow_one_stripe(raid5_conf_t *conf)
> return 0;
>
> sh->raid_conf = conf;
> - spin_lock_init(&sh->lock);
> #ifdef CONFIG_MULTICORE_RAID456
> init_waitqueue_head(&sh->ops.wait_for_ops);
> #endif
> @@ -1442,7 +1439,6 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
> break;
>
> nsh->raid_conf = conf;
> - spin_lock_init(&nsh->lock);
> #ifdef CONFIG_MULTICORE_RAID456
> init_waitqueue_head(&nsh->ops.wait_for_ops);
> #endif
> @@ -2148,7 +2144,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
> (unsigned long long)sh->sector);
>
>
> - spin_lock(&sh->lock);
> spin_lock_irq(&conf->device_lock);
> if (forwrite) {
> bip = &sh->dev[dd_idx].towrite;
> @@ -2184,7 +2179,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
> set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
> }
> spin_unlock_irq(&conf->device_lock);
> - spin_unlock(&sh->lock);
>
> pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
> (unsigned long long)(*bip)->bi_sector,
> @@ -2201,7 +2195,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
> overlap:
> set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
> spin_unlock_irq(&conf->device_lock);
> - spin_unlock(&sh->lock);
> return 0;
> }
>
> @@ -3023,12 +3016,10 @@ static void handle_stripe5(struct stripe_head *sh)
> atomic_read(&sh->count), sh->pd_idx, sh->check_state,
> sh->reconstruct_state);
>
> - spin_lock(&sh->lock);
> if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
> set_bit(STRIPE_SYNCING, &sh->state);
> clear_bit(STRIPE_INSYNC, &sh->state);
> }
> - clear_bit(STRIPE_HANDLE, &sh->state);
> clear_bit(STRIPE_DELAYED, &sh->state);
>
> s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
> @@ -3037,6 +3028,7 @@ static void handle_stripe5(struct stripe_head *sh)
>
> /* Now to look around and see what can be done */
> rcu_read_lock();
> + spin_lock_irq(&conf->device_lock);

Do we still need rcu_read_lock()? AFAIK rcu_read_lock() only prevents
task from preemption but spin_lock does same thing as well.

I know it's been already there under sh->lock before this patch, and
it doesn't hurt anything, but I'm not sure it is really needed.


> for (i=disks; i--; ) {
> mdk_rdev_t *rdev;
>
> @@ -3099,6 +3091,7 @@ static void handle_stripe5(struct stripe_head *sh)
> s.failed_num = i;
> }
> }
> + spin_unlock_irq(&conf->device_lock);
> rcu_read_unlock();
>
> if (unlikely(blocked_rdev)) {
> @@ -3275,7 +3268,6 @@ static void handle_stripe5(struct stripe_head *sh)
> handle_stripe_expansion(conf, sh, NULL);
>
> unlock:
> - spin_unlock(&sh->lock);
>
> /* wait for this device to become unblocked */
> if (unlikely(blocked_rdev))
> @@ -3318,12 +3310,10 @@ static void handle_stripe6(struct stripe_head *sh)
> sh->check_state, sh->reconstruct_state);
> memset(&s, 0, sizeof(s));
>
> - spin_lock(&sh->lock);
> if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
> set_bit(STRIPE_SYNCING, &sh->state);
> clear_bit(STRIPE_INSYNC, &sh->state);
> }
> - clear_bit(STRIPE_HANDLE, &sh->state);
> clear_bit(STRIPE_DELAYED, &sh->state);
>
> s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
> @@ -3332,6 +3322,7 @@ static void handle_stripe6(struct stripe_head *sh)
> /* Now to look around and see what can be done */
>
> rcu_read_lock();
> + spin_lock_irq(&conf->device_lock);

Same here.


> for (i=disks; i--; ) {
> mdk_rdev_t *rdev;
> dev = &sh->dev[i];
> @@ -3395,6 +3386,7 @@ static void handle_stripe6(struct stripe_head *sh)
> s.failed++;
> }
> }
> + spin_unlock_irq(&conf->device_lock);
> rcu_read_unlock();
>
> if (unlikely(blocked_rdev)) {
> @@ -3580,7 +3572,6 @@ static void handle_stripe6(struct stripe_head *sh)
> handle_stripe_expansion(conf, sh, &r6s);
>
> unlock:
> - spin_unlock(&sh->lock);
>
> /* wait for this device to become unblocked */
> if (unlikely(blocked_rdev))
> @@ -3608,10 +3599,19 @@ static void handle_stripe6(struct stripe_head *sh)
>
> static void handle_stripe(struct stripe_head *sh)
> {
> + clear_bit(STRIPE_HANDLE, &sh->state);
> + if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
> + /* already being handled, ensure it gets handled
> + * again when current action finishes */
> + set_bit(STRIPE_HANDLE, &sh->state);
> + return;
> + }
> +
> if (sh->raid_conf->level == 6)
> handle_stripe6(sh);
> else
> handle_stripe5(sh);
> + clear_bit(STRIPE_ACTIVE, &sh->state);
> }
>
> static void raid5_activate_delayed(raid5_conf_t *conf)
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index a330011..217a9d4 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -209,7 +209,6 @@ struct stripe_head {
> short ddf_layout;/* use DDF ordering to calculate Q */
> unsigned long state; /* state flags */
> atomic_t count; /* nr of active thread/requests */
> - spinlock_t lock;
> int bm_seq; /* sequence number for bitmap flushes */
> int disks; /* disks in stripe */
> enum check_states check_state;
> @@ -290,6 +289,7 @@ struct r6_state {
> * Stripe state
> */
> enum {
> + STRIPE_ACTIVE,
> STRIPE_HANDLE,
> STRIPE_SYNC_REQUESTED,
> STRIPE_SYNCING,
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 09/34] md/raid5: move common code into handle_stripe

am 22.07.2011 06:30:06 von Namhyung Kim

NeilBrown writes:

> There is common code at the start of handle_stripe5 and
> handle_stripe6. Move it into handle_stripe.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 08/34] md/raid5: replace sh->lock with an "active"flag.

am 22.07.2011 06:49:20 von NeilBrown

On Fri, 22 Jul 2011 13:27:36 +0900 Namhyung Kim wrote:

> NeilBrown writes:
>
> > sh->lock is now mainly used to ensure that two threads aren't running
> > in the locked part of handle_stripe[56] at the same time.
> >
> > That can more neatly be achieved with an 'active' flag which we set
> > while running handle_stripe. If we find the flag is set, we simply
> > requeue the stripe for later by setting STRIPE_HANDLE.
> >
> > For safety we take ->device_lock while examining the state of the
> > stripe and creating a summary in 'stripe_head_state / r6_state'.
> > This possibly isn't needed but as shared fields like ->toread,
> > ->towrite are checked it is safer for now at least.
> >
> > We leave the label after the old 'unlock' called "unlock" because it
> > will disappear in a few patches, so renaming seems pointless.
> >
> > This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
> > later, but that is not a problem.
> >
> > Signed-off-by: NeilBrown
>
> Reviewed-by: Namhyung Kim

Thanks.

>
> But I have a question, please see below.

I like questions...

> > @@ -3037,6 +3028,7 @@ static void handle_stripe5(struct stripe_head *sh)
> >
> > /* Now to look around and see what can be done */
> > rcu_read_lock();
> > + spin_lock_irq(&conf->device_lock);
>
> Do we still need rcu_read_lock()? AFAIK rcu_read_lock() only prevents
> task from preemption but spin_lock does same thing as well.
>
> I know it's been already there under sh->lock before this patch, and
> it doesn't hurt anything, but I'm not sure it is really needed.

I see your point, but I think there are two reasons why it really is needed.

1/ I think rcu_read_lock does more than prevent preemption. What it does
exactly I don't know and I think it depends on which RCU scheme was chosen
when configuring the kernel. But what it conceptually does is stop
a subsequent synchronize_rcu() from completing, and that it what I need.
Disabling preemption might stop that as well, but I really don't know and
so cannot trust it.

2/ It is useful documentation. A reader seeing rcu_read_lock() knows that
there must be some rcu-protected data here which needs to be handled with
a little care and so they can be on the lookout for it.

So I think it is really needed.

Thanks,

NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 10/34] md/raid5: unify stripe_head_state and r6_state

am 22.07.2011 06:49:37 von Namhyung Kim

NeilBrown writes:

> 'struct stripe_head_state' stores state about the 'current' stripe
> that is passed around while handling the stripe.
> For RAID6 there is an extension structure: r6_state, which is also
> passed around.
> There is no value in keeping these separate, so move the fields from
> the latter into the former.
>
> This means that all code now needs to treat s->failed_num as an small
> array, but this is a small cost.
>
> Signed-off-by: NeilBrown
> ---
>
> drivers/md/raid5.c | 79 ++++++++++++++++++++++++++--------------------------
> drivers/md/raid5.h | 8 +----
> 2 files changed, 41 insertions(+), 46 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index dfb3d9f..c32ffb5 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2318,7 +2318,7 @@ static int fetch_block5(struct stripe_head *sh, struct stripe_head_state *s,
> int disk_idx, int disks)
> {
> struct r5dev *dev = &sh->dev[disk_idx];
> - struct r5dev *failed_dev = &sh->dev[s->failed_num];
> + struct r5dev *failed_dev = &sh->dev[s->failed_num[0]];
>
> /* is the data in this block needed, and can we get it? */
> if (!test_bit(R5_LOCKED, &dev->flags) &&
> @@ -2334,7 +2334,7 @@ static int fetch_block5(struct stripe_head *sh, struct stripe_head_state *s,
> * otherwise read it if the backing disk is insync
> */
> if ((s->uptodate == disks - 1) &&
> - (s->failed && disk_idx == s->failed_num)) {
> + (s->failed && disk_idx == s->failed_num[0])) {
> set_bit(STRIPE_COMPUTE_RUN, &sh->state);
> set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
> set_bit(R5_Wantcompute, &dev->flags);
> @@ -2388,11 +2388,11 @@ static void handle_stripe_fill5(struct stripe_head *sh,
> * 0 to tell the loop in handle_stripe_fill6 to continue
> */
> static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
> - struct r6_state *r6s, int disk_idx, int disks)
> + int disk_idx, int disks)
> {
> struct r5dev *dev = &sh->dev[disk_idx];
> - struct r5dev *fdev[2] = { &sh->dev[r6s->failed_num[0]],
> - &sh->dev[r6s->failed_num[1]] };
> + struct r5dev *fdev[2] = { &sh->dev[s->failed_num[0]],
> + &sh->dev[s->failed_num[1]] };
>
> if (!test_bit(R5_LOCKED, &dev->flags) &&
> !test_bit(R5_UPTODATE, &dev->flags) &&
> @@ -2409,8 +2409,8 @@ static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
> BUG_ON(test_bit(R5_Wantcompute, &dev->flags));
> BUG_ON(test_bit(R5_Wantread, &dev->flags));
> if ((s->uptodate == disks - 1) &&
> - (s->failed && (disk_idx == r6s->failed_num[0] ||
> - disk_idx == r6s->failed_num[1]))) {
> + (s->failed && (disk_idx == s->failed_num[0] ||
> + disk_idx == s->failed_num[1]))) {
> /* have disk failed, and we're requested to fetch it;
> * do compute it
> */
> @@ -2465,7 +2465,7 @@ static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
> * handle_stripe_fill6 - read or compute data to satisfy pending requests.
> */
> static void handle_stripe_fill6(struct stripe_head *sh,
> - struct stripe_head_state *s, struct r6_state *r6s,
> + struct stripe_head_state *s,
> int disks)
> {
> int i;
> @@ -2477,7 +2477,7 @@ static void handle_stripe_fill6(struct stripe_head *sh,
> if (!test_bit(STRIPE_COMPUTE_RUN, &sh->state) && !sh->check_state &&
> !sh->reconstruct_state)
> for (i = disks; i--; )
> - if (fetch_block6(sh, s, r6s, i, disks))
> + if (fetch_block6(sh, s, i, disks))
> break;
> set_bit(STRIPE_HANDLE, &sh->state);
> }
> @@ -2625,7 +2625,7 @@ static void handle_stripe_dirtying5(raid5_conf_t *conf,
>
> static void handle_stripe_dirtying6(raid5_conf_t *conf,
> struct stripe_head *sh, struct stripe_head_state *s,
> - struct r6_state *r6s, int disks)
> + int disks)
> {
> int rcw = 0, pd_idx = sh->pd_idx, i;
> int qd_idx = sh->qd_idx;
> @@ -2688,7 +2688,7 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
> s->uptodate--;
> break;
> }
> - dev = &sh->dev[s->failed_num];
> + dev = &sh->dev[s->failed_num[0]];
> /* fall through */
> case check_state_compute_result:
> sh->check_state = check_state_idle;
> @@ -2760,7 +2760,7 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
>
> static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
> struct stripe_head_state *s,
> - struct r6_state *r6s, int disks)
> + int disks)
> {
> int pd_idx = sh->pd_idx;
> int qd_idx = sh->qd_idx;
> @@ -2779,14 +2779,14 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
> switch (sh->check_state) {
> case check_state_idle:
> /* start a new check operation if there are < 2 failures */
> - if (s->failed == r6s->q_failed) {
> + if (s->failed == s->q_failed) {
> /* The only possible failed device holds Q, so it
> * makes sense to check P (If anything else were failed,
> * we would have used P to recreate it).
> */
> sh->check_state = check_state_run;
> }
> - if (!r6s->q_failed && s->failed < 2) {
> + if (!s->q_failed && s->failed < 2) {
> /* Q is not failed, and we didn't use it to generate
> * anything, so it makes sense to check it
> */
> @@ -2828,13 +2828,13 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
> */
> BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */
> if (s->failed == 2) {
> - dev = &sh->dev[r6s->failed_num[1]];
> + dev = &sh->dev[s->failed_num[1]];
> s->locked++;
> set_bit(R5_LOCKED, &dev->flags);
> set_bit(R5_Wantwrite, &dev->flags);
> }
> if (s->failed >= 1) {
> - dev = &sh->dev[r6s->failed_num[0]];
> + dev = &sh->dev[s->failed_num[0]];
> s->locked++;
> set_bit(R5_LOCKED, &dev->flags);
> set_bit(R5_Wantwrite, &dev->flags);
> @@ -2922,7 +2922,7 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
> }
>
> static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
> - struct r6_state *r6s)
> + struct stripe_head_state *r6s)
> {
> int i;
>
> @@ -2964,7 +2964,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
> set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
> for (j = 0; j < conf->raid_disks; j++)
> if (j != sh2->pd_idx &&
> - (!r6s || j != sh2->qd_idx) &&
> + (r6s || j != sh2->qd_idx) &&

Why is this changed?


> !test_bit(R5_Expanded, &sh2->dev[j].flags))
> break;
> if (j == conf->raid_disks) {
> @@ -3082,7 +3082,7 @@ static void handle_stripe5(struct stripe_head *sh)
> clear_bit(R5_Insync, &dev->flags);
> if (!test_bit(R5_Insync, &dev->flags)) {
> s.failed++;
> - s.failed_num = i;
> + s.failed_num[0] = i;
> }
> }
> spin_unlock_irq(&conf->device_lock);
> @@ -3107,7 +3107,7 @@ static void handle_stripe5(struct stripe_head *sh)
> pr_debug("locked=%d uptodate=%d to_read=%d"
> " to_write=%d failed=%d failed_num=%d\n",
> s.locked, s.uptodate, s.to_read, s.to_write,
> - s.failed, s.failed_num);
> + s.failed, s.failed_num[0]);
> /* check if the array has lost two devices and, if so, some requests might
> * need to be failed
> */
> @@ -3127,7 +3127,7 @@ static void handle_stripe5(struct stripe_head *sh)
> ((test_bit(R5_Insync, &dev->flags) &&
> !test_bit(R5_LOCKED, &dev->flags) &&
> test_bit(R5_UPTODATE, &dev->flags)) ||
> - (s.failed == 1 && s.failed_num == sh->pd_idx)))
> + (s.failed == 1 && s.failed_num[0] == sh->pd_idx)))
> handle_stripe_clean_event(conf, sh, disks, &return_bi);
>
> /* Now we might consider reading some blocks, either to check/generate
> @@ -3198,11 +3198,11 @@ static void handle_stripe5(struct stripe_head *sh)
> * the repair/check process
> */
> if (s.failed == 1 && !conf->mddev->ro &&
> - test_bit(R5_ReadError, &sh->dev[s.failed_num].flags)
> - && !test_bit(R5_LOCKED, &sh->dev[s.failed_num].flags)
> - && test_bit(R5_UPTODATE, &sh->dev[s.failed_num].flags)
> + test_bit(R5_ReadError, &sh->dev[s.failed_num[0]].flags)
> + && !test_bit(R5_LOCKED, &sh->dev[s.failed_num[0]].flags)
> + && test_bit(R5_UPTODATE, &sh->dev[s.failed_num[0]].flags)
> ) {
> - dev = &sh->dev[s.failed_num];
> + dev = &sh->dev[s.failed_num[0]];
> if (!test_bit(R5_ReWrite, &dev->flags)) {
> set_bit(R5_Wantwrite, &dev->flags);
> set_bit(R5_ReWrite, &dev->flags);
> @@ -3292,7 +3292,6 @@ static void handle_stripe6(struct stripe_head *sh)
> struct bio *return_bi = NULL;
> int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx;
> struct stripe_head_state s;
> - struct r6_state r6s;
> struct r5dev *dev, *pdev, *qdev;
> mdk_rdev_t *blocked_rdev = NULL;
> int dec_preread_active = 0;
> @@ -3370,7 +3369,7 @@ static void handle_stripe6(struct stripe_head *sh)
> clear_bit(R5_Insync, &dev->flags);
> if (!test_bit(R5_Insync, &dev->flags)) {
> if (s.failed < 2)
> - r6s.failed_num[s.failed] = i;
> + s.failed_num[s.failed] = i;
> s.failed++;
> }
> }
> @@ -3396,7 +3395,7 @@ static void handle_stripe6(struct stripe_head *sh)
> pr_debug("locked=%d uptodate=%d to_read=%d"
> " to_write=%d failed=%d failed_num=%d,%d\n",
> s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
> - r6s.failed_num[0], r6s.failed_num[1]);
> + s.failed_num[0], s.failed_num[1]);
> /* check if the array has lost >2 devices and, if so, some requests
> * might need to be failed
> */
> @@ -3413,17 +3412,17 @@ static void handle_stripe6(struct stripe_head *sh)
> * are safe, or on a failed drive
> */
> pdev = &sh->dev[pd_idx];
> - r6s.p_failed = (s.failed >= 1 && r6s.failed_num[0] == pd_idx)
> - || (s.failed >= 2 && r6s.failed_num[1] == pd_idx);
> + s.p_failed = (s.failed >= 1 && s.failed_num[0] == pd_idx)
> + || (s.failed >= 2 && s.failed_num[1] == pd_idx);
> qdev = &sh->dev[qd_idx];
> - r6s.q_failed = (s.failed >= 1 && r6s.failed_num[0] == qd_idx)
> - || (s.failed >= 2 && r6s.failed_num[1] == qd_idx);
> + s.q_failed = (s.failed >= 1 && s.failed_num[0] == qd_idx)
> + || (s.failed >= 2 && s.failed_num[1] == qd_idx);
>
> - if ( s.written &&
> - ( r6s.p_failed || ((test_bit(R5_Insync, &pdev->flags)
> + if (s.written &&
> + (s.p_failed || ((test_bit(R5_Insync, &pdev->flags)
> && !test_bit(R5_LOCKED, &pdev->flags)
> && test_bit(R5_UPTODATE, &pdev->flags)))) &&
> - ( r6s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
> + (s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
> && !test_bit(R5_LOCKED, &qdev->flags)
> && test_bit(R5_UPTODATE, &qdev->flags)))))
> handle_stripe_clean_event(conf, sh, disks, &return_bi);
> @@ -3434,7 +3433,7 @@ static void handle_stripe6(struct stripe_head *sh)
> */
> if (s.to_read || s.non_overwrite || (s.to_write && s.failed) ||
> (s.syncing && (s.uptodate + s.compute < disks)) || s.expanding)
> - handle_stripe_fill6(sh, &s, &r6s, disks);
> + handle_stripe_fill6(sh, &s, disks);
>
> /* Now we check to see if any write operations have recently
> * completed
> @@ -3472,7 +3471,7 @@ static void handle_stripe6(struct stripe_head *sh)
> * block.
> */
> if (s.to_write && !sh->reconstruct_state && !sh->check_state)
> - handle_stripe_dirtying6(conf, sh, &s, &r6s, disks);
> + handle_stripe_dirtying6(conf, sh, &s, disks);
>
> /* maybe we need to check and possibly fix the parity for this stripe
> * Any reads will already have been scheduled, so we just see if enough
> @@ -3483,7 +3482,7 @@ static void handle_stripe6(struct stripe_head *sh)
> (s.syncing && s.locked == 0 &&
> !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
> !test_bit(STRIPE_INSYNC, &sh->state)))
> - handle_parity_checks6(conf, sh, &s, &r6s, disks);
> + handle_parity_checks6(conf, sh, &s, disks);
>
> if (s.syncing && s.locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
> md_done_sync(conf->mddev, STRIPE_SECTORS,1);
> @@ -3495,7 +3494,7 @@ static void handle_stripe6(struct stripe_head *sh)
> */
> if (s.failed <= 2 && !conf->mddev->ro)
> for (i = 0; i < s.failed; i++) {
> - dev = &sh->dev[r6s.failed_num[i]];
> + dev = &sh->dev[s.failed_num[i]];
> if (test_bit(R5_ReadError, &dev->flags)
> && !test_bit(R5_LOCKED, &dev->flags)
> && test_bit(R5_UPTODATE, &dev->flags)
> @@ -3557,7 +3556,7 @@ static void handle_stripe6(struct stripe_head *sh)
>
> if (s.expanding && s.locked == 0 &&
> !test_bit(STRIPE_COMPUTE_RUN, &sh->state))
> - handle_stripe_expansion(conf, sh, &r6s);
> + handle_stripe_expansion(conf, sh, &s);
>
> unlock:
>
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 217a9d4..d3c61d3 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -245,13 +245,9 @@ struct stripe_head_state {
> int syncing, expanding, expanded;
> int locked, uptodate, to_read, to_write, failed, written;
> int to_fill, compute, req_compute, non_overwrite;
> - int failed_num;
> + int failed_num[2];
> unsigned long ops_request;
> -};
> -
> -/* r6_state - extra state data only relevant to r6 */
> -struct r6_state {
> - int p_failed, q_failed, failed_num[2];
> + int p_failed, q_failed;
> };
>
> /* Flags */
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 08/34] md/raid5: replace sh->lock with an "active"flag.

am 22.07.2011 07:03:27 von Namhyung Kim

2011-07-22 (ê¸=88), 14:49 +1000, NeilBrown:
> On Fri, 22 Jul 2011 13:27:36 +0900 Namhyung Kim =
wrote:
>=20
> > NeilBrown writes:
> >=20
> > > sh->lock is now mainly used to ensure that two threads aren't run=
ning
> > > in the locked part of handle_stripe[56] at the same time.
> > >
> > > That can more neatly be achieved with an 'active' flag which we s=
et
> > > while running handle_stripe. If we find the flag is set, we simp=
ly
> > > requeue the stripe for later by setting STRIPE_HANDLE.
> > >
> > > For safety we take ->device_lock while examining the state of the
> > > stripe and creating a summary in 'stripe_head_state / r6_state'.
> > > This possibly isn't needed but as shared fields like ->toread,
> > > ->towrite are checked it is safer for now at least.
> > >
> > > We leave the label after the old 'unlock' called "unlock" because=
it
> > > will disappear in a few patches, so renaming seems pointless.
> > >
> > > This leaves the stripe 'locked' for longer as we clear STRIPE_ACT=
IVE
> > > later, but that is not a problem.
> > >
> > > Signed-off-by: NeilBrown
> >=20
> > Reviewed-by: Namhyung Kim
>=20
> Thanks.
>=20
> >=20
> > But I have a question, please see below.
>=20
> I like questions...
>=20
> > > @@ -3037,6 +3028,7 @@ static void handle_stripe5(struct stripe_he=
ad *sh)
> > > =20
> > > /* Now to look around and see what can be done */
> > > rcu_read_lock();
> > > + spin_lock_irq(&conf->device_lock);
> >=20
> > Do we still need rcu_read_lock()? AFAIK rcu_read_lock() only preven=
ts
> > task from preemption but spin_lock does same thing as well.
> >=20
> > I know it's been already there under sh->lock before this patch, an=
d
> > it doesn't hurt anything, but I'm not sure it is really needed.
>=20
> I see your point, but I think there are two reasons why it really is =
needed.
>=20
> 1/ I think rcu_read_lock does more than prevent preemption. What it =
does
> exactly I don't know and I think it depends on which RCU scheme wa=
s chosen
> when configuring the kernel. But what it conceptually does is sto=
p
> a subsequent synchronize_rcu() from completing, and that it what I=
need.
> Disabling preemption might stop that as well, but I really don't k=
now and
> so cannot trust it.
>=20
> 2/ It is useful documentation. A reader seeing rcu_read_lock() knows=
that
> there must be some rcu-protected data here which needs to be handl=
ed with
> a little care and so they can be on the lookout for it.
>=20
> So I think it is really needed.
>=20
> Thanks,
>=20
> NeilBrown
>=20

OK, looks reasonable. Thanks for the explanation. :)


--=20
Regards,
Namhyung Kim


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" i=
n
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 10/34] md/raid5: unify stripe_head_state and r6_state

am 22.07.2011 07:15:06 von NeilBrown

On Fri, 22 Jul 2011 13:49:37 +0900 Namhyung Kim wrote:


> > @@ -2964,7 +2964,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
> > set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
> > for (j = 0; j < conf->raid_disks; j++)
> > if (j != sh2->pd_idx &&
> > - (!r6s || j != sh2->qd_idx) &&
> > + (r6s || j != sh2->qd_idx) &&
>
> Why is this changed?
>

No good reason that I can think of ... so must be a bug. I'll fix it.

In the current code, r6s is always NULL! which is also a bug.
When you get to

md/raid5: Move code for finishing a reconstruction into handle_stripe.

You'll find that the call which passed 's' in was changed to pass 'NULL' in
when merging the RAID5 and RAID6 code. I'll fix that too.

Thanks,
NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 11/34] md/raid5: add some more fields to stripe_head_state

am 22.07.2011 07:31:51 von Namhyung Kim

NeilBrown writes:

> Adding these three fields will allow more common code to be moved
> to handle_stripe()
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim

and nitpick below.


> ---
>
> drivers/md/raid5.c | 54 +++++++++++++++++++++++-----------------------------
> drivers/md/raid5.h | 4 ++++
> 2 files changed, 28 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index c32ffb5..3327e82 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3003,12 +3003,9 @@ static void handle_stripe5(struct stripe_head *sh)
> {
> raid5_conf_t *conf = sh->raid_conf;
> int disks = sh->disks, i;
> - struct bio *return_bi = NULL;
> struct stripe_head_state s;
> struct r5dev *dev;
> - mdk_rdev_t *blocked_rdev = NULL;
> int prexor;
> - int dec_preread_active = 0;
>
> memset(&s, 0, sizeof(s));
> pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d check:%d "
> @@ -3058,9 +3055,9 @@ static void handle_stripe5(struct stripe_head *sh)
> if (dev->written)
> s.written++;
> rdev = rcu_dereference(conf->disks[i].rdev);
> - if (blocked_rdev == NULL &&
> + if (s.blocked_rdev == NULL &&
> rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
> - blocked_rdev = rdev;
> + s.blocked_rdev = rdev;
> atomic_inc(&rdev->nr_pending);
> }
> clear_bit(R5_Insync, &dev->flags);
> @@ -3088,15 +3085,15 @@ static void handle_stripe5(struct stripe_head *sh)
> spin_unlock_irq(&conf->device_lock);
> rcu_read_unlock();
>
> - if (unlikely(blocked_rdev)) {
> + if (unlikely(s.blocked_rdev)) {
> if (s.syncing || s.expanding || s.expanded ||
> s.to_write || s.written) {
> set_bit(STRIPE_HANDLE, &sh->state);
> goto unlock;
> }
> /* There is nothing for the blocked_rdev to block */
> - rdev_dec_pending(blocked_rdev, conf->mddev);
> - blocked_rdev = NULL;
> + rdev_dec_pending(s.blocked_rdev, conf->mddev);
> + s.blocked_rdev = NULL;
> }
>
> if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
> @@ -3112,7 +3109,7 @@ static void handle_stripe5(struct stripe_head *sh)
> * need to be failed
> */
> if (s.failed > 1 && s.to_read+s.to_write+s.written)
> - handle_failed_stripe(conf, sh, &s, disks, &return_bi);
> + handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
> if (s.failed > 1 && s.syncing) {
> md_done_sync(conf->mddev, STRIPE_SECTORS,0);
> clear_bit(STRIPE_SYNCING, &sh->state);
> @@ -3128,7 +3125,7 @@ static void handle_stripe5(struct stripe_head *sh)
> !test_bit(R5_LOCKED, &dev->flags) &&
> test_bit(R5_UPTODATE, &dev->flags)) ||
> (s.failed == 1 && s.failed_num[0] == sh->pd_idx)))
> - handle_stripe_clean_event(conf, sh, disks, &return_bi);
> + handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
>
> /* Now we might consider reading some blocks, either to check/generate
> * parity, or to satisfy requests
> @@ -3166,7 +3163,7 @@ static void handle_stripe5(struct stripe_head *sh)
> }
> }
> if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> - dec_preread_active = 1;
> + s.dec_preread_active = 1;
> }
>
> /* Now to consider new write requests and what else, if anything
> @@ -3264,15 +3261,15 @@ static void handle_stripe5(struct stripe_head *sh)
> unlock:
>
> /* wait for this device to become unblocked */
> - if (unlikely(blocked_rdev))
> - md_wait_for_blocked_rdev(blocked_rdev, conf->mddev);
> + if (unlikely(s.blocked_rdev))
> + md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);
>
> if (s.ops_request)
> raid_run_ops(sh, s.ops_request);
>
> ops_run_io(sh, &s);
>
> - if (dec_preread_active) {
> + if (s.dec_preread_active) {
> /* We delay this until after ops_run_io so that if make_request
> * is waiting on a flush, it won't continue until the writes
> * have actually been submitted.
> @@ -3282,19 +3279,16 @@ static void handle_stripe5(struct stripe_head *sh)
> IO_THRESHOLD)
> md_wakeup_thread(conf->mddev->thread);
> }
> - return_io(return_bi);
> + return_io(s.return_bi);
> }
>
> static void handle_stripe6(struct stripe_head *sh)
> {
> raid5_conf_t *conf = sh->raid_conf;
> int disks = sh->disks;
> - struct bio *return_bi = NULL;
> int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx;
> struct stripe_head_state s;
> struct r5dev *dev, *pdev, *qdev;
> - mdk_rdev_t *blocked_rdev = NULL;
> - int dec_preread_active = 0;
>
> pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
> "pd_idx=%d, qd_idx=%d\n, check:%d, reconstruct:%d\n",
> @@ -3345,9 +3339,9 @@ static void handle_stripe6(struct stripe_head *sh)
> if (dev->written)
> s.written++;
> rdev = rcu_dereference(conf->disks[i].rdev);
> - if (blocked_rdev == NULL &&
> + if (s.blocked_rdev == NULL &&
> rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
> - blocked_rdev = rdev;
> + s.blocked_rdev = rdev;
> atomic_inc(&rdev->nr_pending);
> }
> clear_bit(R5_Insync, &dev->flags);
> @@ -3376,15 +3370,15 @@ static void handle_stripe6(struct stripe_head *sh)
> spin_unlock_irq(&conf->device_lock);
> rcu_read_unlock();
>
> - if (unlikely(blocked_rdev)) {
> + if (unlikely(s.blocked_rdev)) {
> if (s.syncing || s.expanding || s.expanded ||
> s.to_write || s.written) {
> set_bit(STRIPE_HANDLE, &sh->state);
> goto unlock;
> }
> /* There is nothing for the blocked_rdev to block */
> - rdev_dec_pending(blocked_rdev, conf->mddev);
> - blocked_rdev = NULL;
> + rdev_dec_pending(s.blocked_rdev, conf->mddev);
> + s.blocked_rdev = NULL;
> }
>
> if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
> @@ -3400,7 +3394,7 @@ static void handle_stripe6(struct stripe_head *sh)
> * might need to be failed
> */
> if (s.failed > 2 && s.to_read+s.to_write+s.written)
> - handle_failed_stripe(conf, sh, &s, disks, &return_bi);
> + handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
> if (s.failed > 2 && s.syncing) {
> md_done_sync(conf->mddev, STRIPE_SECTORS,0);
> clear_bit(STRIPE_SYNCING, &sh->state);
> @@ -3425,7 +3419,7 @@ static void handle_stripe6(struct stripe_head *sh)
> (s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
> && !test_bit(R5_LOCKED, &qdev->flags)
> && test_bit(R5_UPTODATE, &qdev->flags)))))
> - handle_stripe_clean_event(conf, sh, disks, &return_bi);
> + handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
>
> /* Now we might consider reading some blocks, either to check/generate
> * parity, or to satisfy requests
> @@ -3461,7 +3455,7 @@ static void handle_stripe6(struct stripe_head *sh)
> }
> }
> if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> - dec_preread_active = 1;
> + s.dec_preread_active = 1;
> }
>
> /* Now to consider new write requests and what else, if anything
> @@ -3561,8 +3555,8 @@ static void handle_stripe6(struct stripe_head *sh)
> unlock:
>
> /* wait for this device to become unblocked */
> - if (unlikely(blocked_rdev))
> - md_wait_for_blocked_rdev(blocked_rdev, conf->mddev);
> + if (unlikely(s.blocked_rdev))
> + md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);
>
> if (s.ops_request)
> raid_run_ops(sh, s.ops_request);
> @@ -3570,7 +3564,7 @@ static void handle_stripe6(struct stripe_head *sh)
> ops_run_io(sh, &s);
>
>
> - if (dec_preread_active) {
> + if (s.dec_preread_active) {
> /* We delay this until after ops_run_io so that if make_request
> * is waiting on a flush, it won't continue until the writes
> * have actually been submitted.
> @@ -3581,7 +3575,7 @@ static void handle_stripe6(struct stripe_head *sh)
> md_wakeup_thread(conf->mddev->thread);
> }
>
> - return_io(return_bi);
> + return_io(s.return_bi);
> }
>
> static void handle_stripe(struct stripe_head *sh)
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index d3c61d3..9ceb574 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -248,6 +248,10 @@ struct stripe_head_state {
> int failed_num[2];
> unsigned long ops_request;
> int p_failed, q_failed;
> +
> + struct bio *return_bi;
> + mdk_rdev_t *blocked_rdev;
> + int dec_preread_active;
> };

I'd rather rearrange the struct to reduce paddings on 64-bit:

/* stripe_head_state - collects and tracks the dynamic state of a stripe_head
* for handle_stripe. It is only valid under spin_lock(sh->lock);
*/
struct stripe_head_state {
int syncing, expanding, expanded;
int locked, uptodate, to_read, to_write, failed, written;
int to_fill, compute, req_compute, non_overwrite;
int failed_num[2];
int p_failed, q_failed;
int dec_preread_active;
unsigned long ops_request;

struct bio *return_bi;
mdk_rdev_t *blocked_rdev;
};

BTW, comment above mentions sh->lock which is disappeared in previous
patch. It should be fixed as well (in patch 08/34?).

Thanks.


>
> /* Flags */
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 10/34] md/raid5: unify stripe_head_state and r6_state

am 22.07.2011 07:37:46 von NeilBrown

On Fri, 22 Jul 2011 15:15:06 +1000 NeilBrown wrote:

> On Fri, 22 Jul 2011 13:49:37 +0900 Namhyung Kim wrote:
>
>
> > > @@ -2964,7 +2964,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
> > > set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
> > > for (j = 0; j < conf->raid_disks; j++)
> > > if (j != sh2->pd_idx &&
> > > - (!r6s || j != sh2->qd_idx) &&
> > > + (r6s || j != sh2->qd_idx) &&
> >
> > Why is this changed?
> >
>
> No good reason that I can think of ... so must be a bug. I'll fix it.
>
> In the current code, r6s is always NULL! which is also a bug.
> When you get to
>
> md/raid5: Move code for finishing a reconstruction into handle_stripe.
>
> You'll find that the call which passed 's' in was changed to pass 'NULL' in
> when merging the RAID5 and RAID6 code. I'll fix that too.
>

To be specific:

1/ I removed the hunk that you noticed above.
2/ I added the following patch.
3/ I made the obvious changes to the subsequent patch so that it would still
apply.

This is all now in my for-next branch (I haven't added the Reviewed-by's -
I'll do them all together later.

Thanks,
NeilBrown

From 74721b68b4cf8d4845aa9efa3daf99247a611b70 Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Fri, 22 Jul 2011 15:29:38 +1000
Subject: [PATCH] md/raid5: Remove stripe_head_state arg from handle_stripe_expansion.

This arg is only used to differentiate between RAID5 and RAID6 but
that is not needed. For RAID5, raid5_compute_sector will set qd_idx
to "~0" so j with certainly not equals qd_idx, so there is no need
for a guard on that condition.

So remove the guard and remove the arg from the declaration and
callers of handle_stripe_expansion.

Signed-off-by: NeilBrown

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 36873cf..793dd76 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2921,8 +2921,7 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
}
}

-static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
- struct stripe_head_state *r6s)
+static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh)
{
int i;

@@ -2964,7 +2963,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
for (j = 0; j < conf->raid_disks; j++)
if (j != sh2->pd_idx &&
- (!r6s || j != sh2->qd_idx) &&
+ j != sh2->qd_idx &&
!test_bit(R5_Expanded, &sh2->dev[j].flags))
break;
if (j == conf->raid_disks) {
@@ -3249,7 +3248,7 @@ static void handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)

if (s->expanding && s->locked == 0 &&
!test_bit(STRIPE_COMPUTE_RUN, &sh->state))
- handle_stripe_expansion(conf, sh, NULL);
+ handle_stripe_expansion(conf, sh);
}

static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
@@ -3512,7 +3511,7 @@ static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)

if (s->expanding && s->locked == 0 &&
!test_bit(STRIPE_COMPUTE_RUN, &sh->state))
- handle_stripe_expansion(conf, sh, s);
+ handle_stripe_expansion(conf, sh);
}

static void handle_stripe(struct stripe_head *sh)
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 12/34] md/raid5: move stripe_head_state and more code into handle_stripe.

am 22.07.2011 07:41:40 von Namhyung Kim

NeilBrown writes:

> By defining the 'stripe_head_state' in 'handle_stripe', we can move
> some common code out of handle_stripe[56]() and into handle_stripe.
>
> The means that all accesses for stripe_head_state in handle_stripe[56]
> need to be 's->' instead of 's.', but the compiler should inline
> those functions and just use a direct stack reference, and future
> patches while hoist most of this code up into handle_stripe()
> so we will revert to "s.".
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 10/34] md/raid5: unify stripe_head_state and r6_state

am 22.07.2011 07:53:52 von Namhyung Kim

2011-07-22 (ê¸=88), 15:37 +1000, NeilBrown:
> To be specific:
>=20
> 1/ I removed the hunk that you noticed above.
> 2/ I added the following patch.
> 3/ I made the obvious changes to the subsequent patch so that it woul=
d still
> apply.
>=20
> This is all now in my for-next branch (I haven't added the Reviewed-b=
y's -
> I'll do them all together later.
>=20
> Thanks,
> NeilBrown
>=20
> From 74721b68b4cf8d4845aa9efa3daf99247a611b70 Mon Sep 17 00:00:00 200=
1
> From: NeilBrown
> Date: Fri, 22 Jul 2011 15:29:38 +1000
> Subject: [PATCH] md/raid5: Remove stripe_head_state arg from handle_s=
tripe_expansion.
>=20
> This arg is only used to differentiate between RAID5 and RAID6 but
> that is not needed. For RAID5, raid5_compute_sector will set qd_idx
> to "~0" so j with certainly not equals qd_idx, so there is no need
> for a guard on that condition.
>=20
> So remove the guard and remove the arg from the declaration and
> callers of handle_stripe_expansion.
>=20
> Signed-off-by: NeilBrown

Looks good to me, thanks!


--=20
Regards,
Namhyung Kim


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" i=
n
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 13/34] md/raid5: Move code for finishing a reconstruction into handle_stripe.

am 22.07.2011 09:09:10 von Namhyung Kim

NeilBrown writes:

> Prior to commit ab69ae12ceef7 the code in handle_stripe5 and
> handle_stripe6 to "Finish reconstruct operations initiated by the
> expansion process" was identical.
> That commit added an identical stanza of code to each function, but in
> different places. That was careless.
>
> The raid5 code was correct, so move that out into handle_stripe and
> remove raid6 version.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim

and nitpick below..


> ---
>
> drivers/md/raid5.c | 153 +++++++++++++++++++---------------------------------
> 1 files changed, 57 insertions(+), 96 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index ce1c291..7b562fc 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2999,7 +2999,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
> *
> */
>
> -static void handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
> +static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
> {
> raid5_conf_t *conf = sh->raid_conf;
> int disks = sh->disks, i;
> @@ -3081,7 +3081,7 @@ static void handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
> if (s->syncing || s->expanding || s->expanded ||
> s->to_write || s->written) {
> set_bit(STRIPE_HANDLE, &sh->state);
> - return;
> + return 1;
> }
> /* There is nothing for the blocked_rdev to block */
> rdev_dec_pending(s->blocked_rdev, conf->mddev);
> @@ -3205,54 +3205,10 @@ static void handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
> s->locked++;
> }
> }
> -
> - /* Finish reconstruct operations initiated by the expansion process */
> - if (sh->reconstruct_state == reconstruct_state_result) {
> - struct stripe_head *sh2
> - = get_active_stripe(conf, sh->sector, 1, 1, 1);
> - if (sh2 && test_bit(STRIPE_EXPAND_SOURCE, &sh2->state)) {
> - /* sh cannot be written until sh2 has been read.
> - * so arrange for sh to be delayed a little
> - */
> - set_bit(STRIPE_DELAYED, &sh->state);
> - set_bit(STRIPE_HANDLE, &sh->state);
> - if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE,
> - &sh2->state))
> - atomic_inc(&conf->preread_active_stripes);
> - release_stripe(sh2);
> - return;
> - }
> - if (sh2)
> - release_stripe(sh2);
> -
> - sh->reconstruct_state = reconstruct_state_idle;
> - clear_bit(STRIPE_EXPANDING, &sh->state);
> - for (i = conf->raid_disks; i--; ) {
> - set_bit(R5_Wantwrite, &sh->dev[i].flags);
> - set_bit(R5_LOCKED, &sh->dev[i].flags);
> - s->locked++;
> - }
> - }
> -
> - if (s->expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
> - !sh->reconstruct_state) {
> - /* Need to write out all blocks after computing parity */
> - sh->disks = conf->raid_disks;
> - stripe_set_idx(sh->sector, conf, 0, sh);
> - schedule_reconstruction(sh, s, 1, 1);
> - } else if (s->expanded && !sh->reconstruct_state && s->locked == 0) {
> - clear_bit(STRIPE_EXPAND_READY, &sh->state);
> - atomic_dec(&conf->reshape_stripes);
> - wake_up(&conf->wait_for_overlap);
> - md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
> - }
> -
> - if (s->expanding && s->locked == 0 &&
> - !test_bit(STRIPE_COMPUTE_RUN, &sh->state))
> - handle_stripe_expansion(conf, sh, NULL);
> + return 0;
> }
>
> -static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> +static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> {
> raid5_conf_t *conf = sh->raid_conf;
> int disks = sh->disks;
> @@ -3335,7 +3291,7 @@ static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> if (s->syncing || s->expanding || s->expanded ||
> s->to_write || s->written) {
> set_bit(STRIPE_HANDLE, &sh->state);
> - return;
> + return 1;
> }
> /* There is nothing for the blocked_rdev to block */
> rdev_dec_pending(s->blocked_rdev, conf->mddev);
> @@ -3468,57 +3424,15 @@ static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> }
> }
> }
> -
> - /* Finish reconstruct operations initiated by the expansion process */
> - if (sh->reconstruct_state == reconstruct_state_result) {
> - sh->reconstruct_state = reconstruct_state_idle;
> - clear_bit(STRIPE_EXPANDING, &sh->state);
> - for (i = conf->raid_disks; i--; ) {
> - set_bit(R5_Wantwrite, &sh->dev[i].flags);
> - set_bit(R5_LOCKED, &sh->dev[i].flags);
> - s->locked++;
> - }
> - }
> -
> - if (s->expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
> - !sh->reconstruct_state) {
> - struct stripe_head *sh2
> - = get_active_stripe(conf, sh->sector, 1, 1, 1);
> - if (sh2 && test_bit(STRIPE_EXPAND_SOURCE, &sh2->state)) {
> - /* sh cannot be written until sh2 has been read.
> - * so arrange for sh to be delayed a little
> - */
> - set_bit(STRIPE_DELAYED, &sh->state);
> - set_bit(STRIPE_HANDLE, &sh->state);
> - if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE,
> - &sh2->state))
> - atomic_inc(&conf->preread_active_stripes);
> - release_stripe(sh2);
> - return;
> - }
> - if (sh2)
> - release_stripe(sh2);
> -
> - /* Need to write out all blocks after computing P&Q */
> - sh->disks = conf->raid_disks;
> - stripe_set_idx(sh->sector, conf, 0, sh);
> - schedule_reconstruction(sh, s, 1, 1);
> - } else if (s->expanded && !sh->reconstruct_state && s->locked == 0) {
> - clear_bit(STRIPE_EXPAND_READY, &sh->state);
> - atomic_dec(&conf->reshape_stripes);
> - wake_up(&conf->wait_for_overlap);
> - md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
> - }
> -
> - if (s->expanding && s->locked == 0 &&
> - !test_bit(STRIPE_COMPUTE_RUN, &sh->state))
> - handle_stripe_expansion(conf, sh, s);
> + return 0;
> }
>
> static void handle_stripe(struct stripe_head *sh)
> {
> struct stripe_head_state s;
> + int done;
> raid5_conf_t *conf = sh->raid_conf;
> + int i;

Can be declared in a line.


>
> clear_bit(STRIPE_HANDLE, &sh->state);
> if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
> @@ -3546,11 +3460,58 @@ static void handle_stripe(struct stripe_head *sh)
> s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
>
> if (conf->level == 6)
> - handle_stripe6(sh, &s);
> + done = handle_stripe6(sh, &s);
> else
> - handle_stripe5(sh, &s);
> + done = handle_stripe5(sh, &s);
> +
> + if (done)
> + goto finish;
> + /* Finish reconstruct operations initiated by the expansion process */
> + if (sh->reconstruct_state == reconstruct_state_result) {
> + struct stripe_head *sh2
> + = get_active_stripe(conf, sh->sector, 1, 1, 1);

Wouldn't it be better to use more descriptive name rather than sh2,
something like sh_prev, sh_src or whatever?

Thanks


> + if (sh2 && test_bit(STRIPE_EXPAND_SOURCE, &sh2->state)) {
> + /* sh cannot be written until sh2 has been read.
> + * so arrange for sh to be delayed a little
> + */
> + set_bit(STRIPE_DELAYED, &sh->state);
> + set_bit(STRIPE_HANDLE, &sh->state);
> + if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE,
> + &sh2->state))
> + atomic_inc(&conf->preread_active_stripes);
> + release_stripe(sh2);
> + goto finish;
> + }
> + if (sh2)
> + release_stripe(sh2);
> +
> + sh->reconstruct_state = reconstruct_state_idle;
> + clear_bit(STRIPE_EXPANDING, &sh->state);
> + for (i = conf->raid_disks; i--; ) {
> + set_bit(R5_Wantwrite, &sh->dev[i].flags);
> + set_bit(R5_LOCKED, &sh->dev[i].flags);
> + s.locked++;
> + }
> + }
>
> + if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
> + !sh->reconstruct_state) {
> + /* Need to write out all blocks after computing parity */
> + sh->disks = conf->raid_disks;
> + stripe_set_idx(sh->sector, conf, 0, sh);
> + schedule_reconstruction(sh, &s, 1, 1);
> + } else if (s.expanded && !sh->reconstruct_state && s.locked == 0) {
> + clear_bit(STRIPE_EXPAND_READY, &sh->state);
> + atomic_dec(&conf->reshape_stripes);
> + wake_up(&conf->wait_for_overlap);
> + md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
> + }
> +
> + if (s.expanding && s.locked == 0 &&
> + !test_bit(STRIPE_COMPUTE_RUN, &sh->state))
> + handle_stripe_expansion(conf, sh, NULL);
>
> +finish:
> /* wait for this device to become unblocked */
> if (unlikely(s.blocked_rdev))
> md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 14/34] md/raid5: move more code into common handle_stripe

am 22.07.2011 09:32:03 von Namhyung Kim

NeilBrown writes:

> The difference between the RAID5 and RAID6 code here is easily
> resolved using conf->max_degraded.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim

and a coding style issue.


> ---
>
> drivers/md/raid5.c | 90 ++++++++++++++++++----------------------------------
> 1 files changed, 32 insertions(+), 58 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 7b562fc..e41b622 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3177,34 +3177,6 @@ static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
> !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
> !test_bit(STRIPE_INSYNC, &sh->state)))
> handle_parity_checks5(conf, sh, s, disks);
> -
> - if (s->syncing && s->locked == 0
> - && test_bit(STRIPE_INSYNC, &sh->state)) {
> - md_done_sync(conf->mddev, STRIPE_SECTORS,1);
> - clear_bit(STRIPE_SYNCING, &sh->state);
> - }
> -
> - /* If the failed drive is just a ReadError, then we might need to progress
> - * the repair/check process
> - */
> - if (s->failed == 1 && !conf->mddev->ro &&
> - test_bit(R5_ReadError, &sh->dev[s->failed_num[0]].flags)
> - && !test_bit(R5_LOCKED, &sh->dev[s->failed_num[0]].flags)
> - && test_bit(R5_UPTODATE, &sh->dev[s->failed_num[0]].flags)
> - ) {
> - dev = &sh->dev[s->failed_num[0]];
> - if (!test_bit(R5_ReWrite, &dev->flags)) {
> - set_bit(R5_Wantwrite, &dev->flags);
> - set_bit(R5_ReWrite, &dev->flags);
> - set_bit(R5_LOCKED, &dev->flags);
> - s->locked++;
> - } else {
> - /* let's read it back */
> - set_bit(R5_Wantread, &dev->flags);
> - set_bit(R5_LOCKED, &dev->flags);
> - s->locked++;
> - }
> - }
> return 0;
> }
>
> @@ -3394,36 +3366,6 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
> !test_bit(STRIPE_INSYNC, &sh->state)))
> handle_parity_checks6(conf, sh, s, disks);
> -
> - if (s->syncing && s->locked == 0
> - && test_bit(STRIPE_INSYNC, &sh->state)) {
> - md_done_sync(conf->mddev, STRIPE_SECTORS,1);
> - clear_bit(STRIPE_SYNCING, &sh->state);
> - }
> -
> - /* If the failed drives are just a ReadError, then we might need
> - * to progress the repair/check process
> - */
> - if (s->failed <= 2 && !conf->mddev->ro)
> - for (i = 0; i < s->failed; i++) {
> - dev = &sh->dev[s->failed_num[i]];
> - if (test_bit(R5_ReadError, &dev->flags)
> - && !test_bit(R5_LOCKED, &dev->flags)
> - && test_bit(R5_UPTODATE, &dev->flags)
> - ) {
> - if (!test_bit(R5_ReWrite, &dev->flags)) {
> - set_bit(R5_Wantwrite, &dev->flags);
> - set_bit(R5_ReWrite, &dev->flags);
> - set_bit(R5_LOCKED, &dev->flags);
> - s->locked++;
> - } else {
> - /* let's read it back */
> - set_bit(R5_Wantread, &dev->flags);
> - set_bit(R5_LOCKED, &dev->flags);
> - s->locked++;
> - }
> - }
> - }
> return 0;
> }
>
> @@ -3466,6 +3408,38 @@ static void handle_stripe(struct stripe_head *sh)
>
> if (done)
> goto finish;
> +
> +
> + if (s.syncing && s.locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
> + md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
> + clear_bit(STRIPE_SYNCING, &sh->state);
> + }
> +
> + /* If the failed drives are just a ReadError, then we might need
> + * to progress the repair/check process
> + */
> + if (s.failed <= conf->max_degraded && !conf->mddev->ro)
> + for (i = 0; i < s.failed; i++) {
> + struct r5dev *dev = &sh->dev[s.failed_num[i]];
> + if (test_bit(R5_ReadError, &dev->flags)
> + && !test_bit(R5_LOCKED, &dev->flags)
> + && test_bit(R5_UPTODATE, &dev->flags)
> + ) {

This line looks weird and should be combined to previous line.

And I slightly prefer that the logical operators are put on the
same line with previous conditional, so that next one can be on
the same column and more readable, but ...

Thanks.


> + if (!test_bit(R5_ReWrite, &dev->flags)) {
> + set_bit(R5_Wantwrite, &dev->flags);
> + set_bit(R5_ReWrite, &dev->flags);
> + set_bit(R5_LOCKED, &dev->flags);
> + s.locked++;
> + } else {
> + /* let's read it back */
> + set_bit(R5_Wantread, &dev->flags);
> + set_bit(R5_LOCKED, &dev->flags);
> + s.locked++;
> + }
> + }
> + }
> +
> +
> /* Finish reconstruct operations initiated by the expansion process */
> if (sh->reconstruct_state == reconstruct_state_result) {
> struct stripe_head *sh2
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 15/34] md/raid5: rearrange a test in fetch_block6.

am 22.07.2011 09:37:00 von Namhyung Kim

NeilBrown writes:

> Next patch will unite fetch_block5 and fetch_block6.
> First I want to make the differences a little more clear.
>
> For RAID6 if we are writing at all and there is a failed device, then
> we need to load or compute every block so we can do a
> reconstruct-write.
> This case isn't needed for RAID5 - we will do a read-modify-write in
> that case.
> So make that test a separate test in fetch_block6 rather than merged
> with two other tests.
>
> Make a similar change in fetch_block5 so the one bit that is not
> needed for RAID6 is clearly separate.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 16/34] md/raid5: unite fetch_block5 and fetch_block6

am 22.07.2011 10:24:22 von Namhyung Kim

NeilBrown writes:

> Provided that ->failed_num[1] is not a valid device number (which is
> easily achieved) fetch_block6 provides all the functionality of
> fetch_block5.
>
> So remove the latter and rename the former to simply "fetch_block".
>
> Then handle_stripe_fill5 and handle_stripe_fill6 become the same and
> can similarly be united.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6

am 22.07.2011 11:10:33 von Namhyung Kim

NeilBrown writes:

> RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
> also allow 'read-modify-write'
> Apart from this difference, handle_stripe_dirtying[56] are nearly
> identical. So resolve these differences and create just one function.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim

BTW, here is a question:
Why RAID6 doesn't allow the read-modify-write? I don't think it is
not possible, so what prevents doing that? performance? complexity?
or it's not possible? Why? :)
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 18/34] md/raid5: move more common code into handle_stripe

am 22.07.2011 11:20:02 von Namhyung Kim

NeilBrown writes:

> Apart from 'prexor' which can only be set for RAID5, and
> 'qd_idx' which can only be meaningful for RAID6, these two
> chunks of code are nearly the same.
>
> So combine them into one adding a test to call either
> handle_parity_checks5 or handle_parity_checks6 as appropriate.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 19/34] md/raid5: move some more common code into handle_stripe

am 22.07.2011 11:29:16 von Namhyung Kim

NeilBrown writes:

> The RAID6 version of this code is usable for RAID5 providing:
> - we test "conf->max_degraded" rather than "2" as appropriate
> - we make sure s->failed_num[1] is meaningful (and not '-1')
> when s->failed > 1
>
> The 'return 1' must become 'goto finish' in the new location.
>
> Signed-off-by: NeilBrown

Reviewed-by: Namhyung Kim

and nitpick below.


> ---
>
> drivers/md/raid5.c | 178 +++++++++++++++++++---------------------------------
> 1 files changed, 66 insertions(+), 112 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index ba6f892..f23848f 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2969,63 +2969,14 @@ static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
> if (test_bit(R5_ReadError, &dev->flags))
> clear_bit(R5_Insync, &dev->flags);
> if (!test_bit(R5_Insync, &dev->flags)) {
> + if (s->failed < 2)
> + s->failed_num[s->failed] = i;
> s->failed++;
> - s->failed_num[0] = i;
> }
> }
> spin_unlock_irq(&conf->device_lock);
> rcu_read_unlock();
>
> - if (unlikely(s->blocked_rdev)) {
> - if (s->syncing || s->expanding || s->expanded ||
> - s->to_write || s->written) {
> - set_bit(STRIPE_HANDLE, &sh->state);
> - return 1;
> - }
> - /* There is nothing for the blocked_rdev to block */
> - rdev_dec_pending(s->blocked_rdev, conf->mddev);
> - s->blocked_rdev = NULL;
> - }
> -
> - if (s->to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
> - set_bit(STRIPE_OP_BIOFILL, &s->ops_request);
> - set_bit(STRIPE_BIOFILL_RUN, &sh->state);
> - }
> -
> - pr_debug("locked=%d uptodate=%d to_read=%d"
> - " to_write=%d failed=%d failed_num=%d\n",
> - s->locked, s->uptodate, s->to_read, s->to_write,
> - s->failed, s->failed_num[0]);
> - /* check if the array has lost two devices and, if so, some requests might
> - * need to be failed
> - */
> - if (s->failed > 1 && s->to_read+s->to_write+s->written)
> - handle_failed_stripe(conf, sh, s, disks, &s->return_bi);
> - if (s->failed > 1 && s->syncing) {
> - md_done_sync(conf->mddev, STRIPE_SECTORS,0);
> - clear_bit(STRIPE_SYNCING, &sh->state);
> - s->syncing = 0;
> - }
> -
> - /* might be able to return some write requests if the parity block
> - * is safe, or on a failed drive
> - */
> - dev = &sh->dev[sh->pd_idx];
> - if (s->written &&
> - ((test_bit(R5_Insync, &dev->flags) &&
> - !test_bit(R5_LOCKED, &dev->flags) &&
> - test_bit(R5_UPTODATE, &dev->flags)) ||
> - (s->failed == 1 && s->failed_num[0] == sh->pd_idx)))
> - handle_stripe_clean_event(conf, sh, disks, &s->return_bi);
> -
> - /* Now we might consider reading some blocks, either to check/generate
> - * parity, or to satisfy requests
> - * or to load a block that is being partially written.
> - */
> - if (s->to_read || s->non_overwrite ||
> - (s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
> - handle_stripe_fill(sh, s, disks);
> -
> return 0;
> }
>
> @@ -3033,8 +2984,8 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> {
> raid5_conf_t *conf = sh->raid_conf;
> int disks = sh->disks;
> - int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx;
> - struct r5dev *dev, *pdev, *qdev;
> + struct r5dev *dev;
> + int i;
>
> /* Now to look around and see what can be done */
>
> @@ -3108,65 +3059,6 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> spin_unlock_irq(&conf->device_lock);
> rcu_read_unlock();
>
> - if (unlikely(s->blocked_rdev)) {
> - if (s->syncing || s->expanding || s->expanded ||
> - s->to_write || s->written) {
> - set_bit(STRIPE_HANDLE, &sh->state);
> - return 1;
> - }
> - /* There is nothing for the blocked_rdev to block */
> - rdev_dec_pending(s->blocked_rdev, conf->mddev);
> - s->blocked_rdev = NULL;
> - }
> -
> - if (s->to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
> - set_bit(STRIPE_OP_BIOFILL, &s->ops_request);
> - set_bit(STRIPE_BIOFILL_RUN, &sh->state);
> - }
> -
> - pr_debug("locked=%d uptodate=%d to_read=%d"
> - " to_write=%d failed=%d failed_num=%d,%d\n",
> - s->locked, s->uptodate, s->to_read, s->to_write, s->failed,
> - s->failed_num[0], s->failed_num[1]);
> - /* check if the array has lost >2 devices and, if so, some requests
> - * might need to be failed
> - */
> - if (s->failed > 2 && s->to_read+s->to_write+s->written)
> - handle_failed_stripe(conf, sh, s, disks, &s->return_bi);
> - if (s->failed > 2 && s->syncing) {
> - md_done_sync(conf->mddev, STRIPE_SECTORS,0);
> - clear_bit(STRIPE_SYNCING, &sh->state);
> - s->syncing = 0;
> - }
> -
> - /*
> - * might be able to return some write requests if the parity blocks
> - * are safe, or on a failed drive
> - */
> - pdev = &sh->dev[pd_idx];
> - s->p_failed = (s->failed >= 1 && s->failed_num[0] == pd_idx)
> - || (s->failed >= 2 && s->failed_num[1] == pd_idx);
> - qdev = &sh->dev[qd_idx];
> - s->q_failed = (s->failed >= 1 && s->failed_num[0] == qd_idx)
> - || (s->failed >= 2 && s->failed_num[1] == qd_idx);
> -
> - if (s->written &&
> - (s->p_failed || ((test_bit(R5_Insync, &pdev->flags)
> - && !test_bit(R5_LOCKED, &pdev->flags)
> - && test_bit(R5_UPTODATE, &pdev->flags)))) &&
> - (s->q_failed || ((test_bit(R5_Insync, &qdev->flags)
> - && !test_bit(R5_LOCKED, &qdev->flags)
> - && test_bit(R5_UPTODATE, &qdev->flags)))))
> - handle_stripe_clean_event(conf, sh, disks, &s->return_bi);
> -
> - /* Now we might consider reading some blocks, either to check/generate
> - * parity, or to satisfy requests
> - * or to load a block that is being partially written.
> - */
> - if (s->to_read || s->non_overwrite || (s->to_write && s->failed) ||
> - (s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
> - handle_stripe_fill(sh, s, disks);
> -
> return 0;
> }
>
> @@ -3178,6 +3070,7 @@ static void handle_stripe(struct stripe_head *sh)
> int i;
> int prexor;
> int disks = sh->disks;
> + struct r5dev *pdev, *qdev;
>
> clear_bit(STRIPE_HANDLE, &sh->state);
> if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
> @@ -3214,6 +3107,67 @@ static void handle_stripe(struct stripe_head *sh)
> if (done)
> goto finish;
>
> + if (unlikely(s.blocked_rdev)) {
> + if (s.syncing || s.expanding || s.expanded ||
> + s.to_write || s.written) {
> + set_bit(STRIPE_HANDLE, &sh->state);
> + goto finish;
> + }
> + /* There is nothing for the blocked_rdev to block */
> + rdev_dec_pending(s.blocked_rdev, conf->mddev);
> + s.blocked_rdev = NULL;
> + }
> +
> + if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
> + set_bit(STRIPE_OP_BIOFILL, &s.ops_request);
> + set_bit(STRIPE_BIOFILL_RUN, &sh->state);
> + }
> +
> + pr_debug("locked=%d uptodate=%d to_read=%d"
> + " to_write=%d failed=%d failed_num=%d,%d\n",
> + s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
> + s.failed_num[0], s.failed_num[1]);
> + /* check if the array has lost >2 devices and, if so, some requests
> + * might need to be failed
/* check if the array has lost more than max_degraded devices and,
if so, some requests might need to be failed

Thanks.


> + */
> + if (s.failed > conf->max_degraded && s.to_read+s.to_write+s.written)
> + handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
> + if (s.failed > conf->max_degraded && s.syncing) {
> + md_done_sync(conf->mddev, STRIPE_SECTORS, 0);
> + clear_bit(STRIPE_SYNCING, &sh->state);
> + s.syncing = 0;
> + }
> +
> + /*
> + * might be able to return some write requests if the parity blocks
> + * are safe, or on a failed drive
> + */
> + pdev = &sh->dev[sh->pd_idx];
> + s.p_failed = (s.failed >= 1 && s.failed_num[0] == sh->pd_idx)
> + || (s.failed >= 2 && s.failed_num[1] == sh->pd_idx);
> + qdev = &sh->dev[sh->qd_idx];
> + s.q_failed = (s.failed >= 1 && s.failed_num[0] == sh->qd_idx)
> + || (s.failed >= 2 && s.failed_num[1] == sh->qd_idx)
> + || conf->level < 6;
> +
> + if (s.written &&
> + (s.p_failed || ((test_bit(R5_Insync, &pdev->flags)
> + && !test_bit(R5_LOCKED, &pdev->flags)
> + && test_bit(R5_UPTODATE, &pdev->flags)))) &&
> + (s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
> + && !test_bit(R5_LOCKED, &qdev->flags)
> + && test_bit(R5_UPTODATE, &qdev->flags)))))
> + handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
> +
> + /* Now we might consider reading some blocks, either to check/generate
> + * parity, or to satisfy requests
> + * or to load a block that is being partially written.
> + */
> + if (s.to_read || s.non_overwrite
> + || (conf->level == 6 && s.to_write && s.failed)
> + || (s.syncing && (s.uptodate + s.compute < disks)) || s.expanding)
> + handle_stripe_fill(sh, &s, disks);
> +
> /* Now we check to see if any write operations have recently
> * completed
> */
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 20/34] md/raid5: finalise new merged handle_stripe.

am 22.07.2011 11:36:13 von Namhyung Kim

NeilBrown writes:

> handle_stripe5() and handle_stripe6() are now virtually identical.
> So discard one and rename the other to 'analyse_stripe()'.
>
> It always returns 0, so change it to 'void' and remove the 'done'
> variable in handle_stripe().
>
> Signed-off-by: NeilBrown
> ---
>
> drivers/md/raid5.c | 100 +++-------------------------------------------------
> 1 files changed, 5 insertions(+), 95 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index f23848f..93ee26c 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2899,88 +2899,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
> *
> */
>
> -static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
> -{
> - raid5_conf_t *conf = sh->raid_conf;
> - int disks = sh->disks, i;
> - struct r5dev *dev;
> -
> - /* Now to look around and see what can be done */
> - rcu_read_lock();
> - spin_lock_irq(&conf->device_lock);
> - for (i=disks; i--; ) {
> - mdk_rdev_t *rdev;
> -
> - dev = &sh->dev[i];
> -
> - pr_debug("check %d: state 0x%lx toread %p read %p write %p "
> - "written %p\n", i, dev->flags, dev->toread, dev->read,
> - dev->towrite, dev->written);
> -
> - /* maybe we can request a biofill operation
> - *
> - * new wantfill requests are only permitted while
> - * ops_complete_biofill is guaranteed to be inactive
> - */
> - if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&
> - !test_bit(STRIPE_BIOFILL_RUN, &sh->state))
> - set_bit(R5_Wantfill, &dev->flags);
> -
> - /* now count some things */
> - if (test_bit(R5_LOCKED, &dev->flags))
> - s->locked++;
> - if (test_bit(R5_UPTODATE, &dev->flags))
> - s->uptodate++;
> - if (test_bit(R5_Wantcompute, &dev->flags))
> - s->compute++;
> -
> - if (test_bit(R5_Wantfill, &dev->flags))
> - s->to_fill++;
> - else if (dev->toread)
> - s->to_read++;
> - if (dev->towrite) {
> - s->to_write++;
> - if (!test_bit(R5_OVERWRITE, &dev->flags))
> - s->non_overwrite++;
> - }
> - if (dev->written)
> - s->written++;
> - rdev = rcu_dereference(conf->disks[i].rdev);
> - if (s->blocked_rdev == NULL &&
> - rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
> - s->blocked_rdev = rdev;
> - atomic_inc(&rdev->nr_pending);
> - }
> - clear_bit(R5_Insync, &dev->flags);
> - if (!rdev)
> - /* Not in-sync */;
> - else if (test_bit(In_sync, &rdev->flags))
> - set_bit(R5_Insync, &dev->flags);
> - else {
> - /* could be in-sync depending on recovery/reshape status */
> - if (sh->sector + STRIPE_SECTORS <= rdev->recovery_offset)
> - set_bit(R5_Insync, &dev->flags);
> - }
> - if (!test_bit(R5_Insync, &dev->flags)) {
> - /* The ReadError flag will just be confusing now */
> - clear_bit(R5_ReadError, &dev->flags);
> - clear_bit(R5_ReWrite, &dev->flags);
> - }
> - if (test_bit(R5_ReadError, &dev->flags))
> - clear_bit(R5_Insync, &dev->flags);
> - if (!test_bit(R5_Insync, &dev->flags)) {
> - if (s->failed < 2)
> - s->failed_num[s->failed] = i;
> - s->failed++;
> - }
> - }
> - spin_unlock_irq(&conf->device_lock);
> - rcu_read_unlock();
> -
> - return 0;
> -}
> -
> -static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> +static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
> {
> raid5_conf_t *conf = sh->raid_conf;
> int disks = sh->disks;
> @@ -2988,11 +2907,11 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> int i;
>
> /* Now to look around and see what can be done */
> -
> rcu_read_lock();
> spin_lock_irq(&conf->device_lock);
> for (i=disks; i--; ) {
> mdk_rdev_t *rdev;
> +
> dev = &sh->dev[i];
>
> pr_debug("check %d: state 0x%lx read %p write %p written %p\n",
> @@ -3016,9 +2935,9 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> BUG_ON(s->compute > 2);
> }
>
> - if (test_bit(R5_Wantfill, &dev->flags)) {
> + if (test_bit(R5_Wantfill, &dev->flags))
> s->to_fill++;
> - } else if (dev->toread)
> + else if (dev->toread)
> s->to_read++;
> if (dev->towrite) {
> s->to_write++;
> @@ -3058,14 +2977,11 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> }
> spin_unlock_irq(&conf->device_lock);
> rcu_read_unlock();
> -
> - return 0;
> }
>
> static void handle_stripe(struct stripe_head *sh)
> {
> struct stripe_head_state s;
> - int done;
> raid5_conf_t *conf = sh->raid_conf;
> int i;
> int prexor;
> @@ -3099,13 +3015,7 @@ static void handle_stripe(struct stripe_head *sh)
> s.failed_num[0] = -1;
> s.failed_num[1] = -1;

There are codes that set/initialize some fields of stripe_head_state
outside of analyse_stripe() and it'd better off moving them into the
function, IMHO.

Thanks.


>
> - if (conf->level == 6)
> - done = handle_stripe6(sh, &s);
> - else
> - done = handle_stripe5(sh, &s);
> -
> - if (done)
> - goto finish;
> + analyse_stripe(sh, &s);
>
> if (unlikely(s.blocked_rdev)) {
> if (s.syncing || s.expanding || s.expanded ||
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 11/34] md/raid5: add some more fields tostripe_head_state

am 26.07.2011 03:35:08 von NeilBrown

On Fri, 22 Jul 2011 14:31:51 +0900 Namhyung Kim wrote:

> NeilBrown writes:
>
> > Adding these three fields will allow more common code to be moved
> > to handle_stripe()
> >
> > Signed-off-by: NeilBrown
>
> Reviewed-by: Namhyung Kim
>
> and nitpick below.
>
>
> > ---
> >
> > drivers/md/raid5.c | 54 +++++++++++++++++++++++-----------------------------
> > drivers/md/raid5.h | 4 ++++
> > 2 files changed, 28 insertions(+), 30 deletions(-)
> >
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index c32ffb5..3327e82 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -3003,12 +3003,9 @@ static void handle_stripe5(struct stripe_head *sh)
> > {
> > raid5_conf_t *conf = sh->raid_conf;
> > int disks = sh->disks, i;
> > - struct bio *return_bi = NULL;
> > struct stripe_head_state s;
> > struct r5dev *dev;
> > - mdk_rdev_t *blocked_rdev = NULL;
> > int prexor;
> > - int dec_preread_active = 0;
> >
> > memset(&s, 0, sizeof(s));
> > pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d check:%d "
> > @@ -3058,9 +3055,9 @@ static void handle_stripe5(struct stripe_head *sh)
> > if (dev->written)
> > s.written++;
> > rdev = rcu_dereference(conf->disks[i].rdev);
> > - if (blocked_rdev == NULL &&
> > + if (s.blocked_rdev == NULL &&
> > rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
> > - blocked_rdev = rdev;
> > + s.blocked_rdev = rdev;
> > atomic_inc(&rdev->nr_pending);
> > }
> > clear_bit(R5_Insync, &dev->flags);
> > @@ -3088,15 +3085,15 @@ static void handle_stripe5(struct stripe_head *sh)
> > spin_unlock_irq(&conf->device_lock);
> > rcu_read_unlock();
> >
> > - if (unlikely(blocked_rdev)) {
> > + if (unlikely(s.blocked_rdev)) {
> > if (s.syncing || s.expanding || s.expanded ||
> > s.to_write || s.written) {
> > set_bit(STRIPE_HANDLE, &sh->state);
> > goto unlock;
> > }
> > /* There is nothing for the blocked_rdev to block */
> > - rdev_dec_pending(blocked_rdev, conf->mddev);
> > - blocked_rdev = NULL;
> > + rdev_dec_pending(s.blocked_rdev, conf->mddev);
> > + s.blocked_rdev = NULL;
> > }
> >
> > if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
> > @@ -3112,7 +3109,7 @@ static void handle_stripe5(struct stripe_head *sh)
> > * need to be failed
> > */
> > if (s.failed > 1 && s.to_read+s.to_write+s.written)
> > - handle_failed_stripe(conf, sh, &s, disks, &return_bi);
> > + handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
> > if (s.failed > 1 && s.syncing) {
> > md_done_sync(conf->mddev, STRIPE_SECTORS,0);
> > clear_bit(STRIPE_SYNCING, &sh->state);
> > @@ -3128,7 +3125,7 @@ static void handle_stripe5(struct stripe_head *sh)
> > !test_bit(R5_LOCKED, &dev->flags) &&
> > test_bit(R5_UPTODATE, &dev->flags)) ||
> > (s.failed == 1 && s.failed_num[0] == sh->pd_idx)))
> > - handle_stripe_clean_event(conf, sh, disks, &return_bi);
> > + handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
> >
> > /* Now we might consider reading some blocks, either to check/generate
> > * parity, or to satisfy requests
> > @@ -3166,7 +3163,7 @@ static void handle_stripe5(struct stripe_head *sh)
> > }
> > }
> > if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> > - dec_preread_active = 1;
> > + s.dec_preread_active = 1;
> > }
> >
> > /* Now to consider new write requests and what else, if anything
> > @@ -3264,15 +3261,15 @@ static void handle_stripe5(struct stripe_head *sh)
> > unlock:
> >
> > /* wait for this device to become unblocked */
> > - if (unlikely(blocked_rdev))
> > - md_wait_for_blocked_rdev(blocked_rdev, conf->mddev);
> > + if (unlikely(s.blocked_rdev))
> > + md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);
> >
> > if (s.ops_request)
> > raid_run_ops(sh, s.ops_request);
> >
> > ops_run_io(sh, &s);
> >
> > - if (dec_preread_active) {
> > + if (s.dec_preread_active) {
> > /* We delay this until after ops_run_io so that if make_request
> > * is waiting on a flush, it won't continue until the writes
> > * have actually been submitted.
> > @@ -3282,19 +3279,16 @@ static void handle_stripe5(struct stripe_head *sh)
> > IO_THRESHOLD)
> > md_wakeup_thread(conf->mddev->thread);
> > }
> > - return_io(return_bi);
> > + return_io(s.return_bi);
> > }
> >
> > static void handle_stripe6(struct stripe_head *sh)
> > {
> > raid5_conf_t *conf = sh->raid_conf;
> > int disks = sh->disks;
> > - struct bio *return_bi = NULL;
> > int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx;
> > struct stripe_head_state s;
> > struct r5dev *dev, *pdev, *qdev;
> > - mdk_rdev_t *blocked_rdev = NULL;
> > - int dec_preread_active = 0;
> >
> > pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
> > "pd_idx=%d, qd_idx=%d\n, check:%d, reconstruct:%d\n",
> > @@ -3345,9 +3339,9 @@ static void handle_stripe6(struct stripe_head *sh)
> > if (dev->written)
> > s.written++;
> > rdev = rcu_dereference(conf->disks[i].rdev);
> > - if (blocked_rdev == NULL &&
> > + if (s.blocked_rdev == NULL &&
> > rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
> > - blocked_rdev = rdev;
> > + s.blocked_rdev = rdev;
> > atomic_inc(&rdev->nr_pending);
> > }
> > clear_bit(R5_Insync, &dev->flags);
> > @@ -3376,15 +3370,15 @@ static void handle_stripe6(struct stripe_head *sh)
> > spin_unlock_irq(&conf->device_lock);
> > rcu_read_unlock();
> >
> > - if (unlikely(blocked_rdev)) {
> > + if (unlikely(s.blocked_rdev)) {
> > if (s.syncing || s.expanding || s.expanded ||
> > s.to_write || s.written) {
> > set_bit(STRIPE_HANDLE, &sh->state);
> > goto unlock;
> > }
> > /* There is nothing for the blocked_rdev to block */
> > - rdev_dec_pending(blocked_rdev, conf->mddev);
> > - blocked_rdev = NULL;
> > + rdev_dec_pending(s.blocked_rdev, conf->mddev);
> > + s.blocked_rdev = NULL;
> > }
> >
> > if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
> > @@ -3400,7 +3394,7 @@ static void handle_stripe6(struct stripe_head *sh)
> > * might need to be failed
> > */
> > if (s.failed > 2 && s.to_read+s.to_write+s.written)
> > - handle_failed_stripe(conf, sh, &s, disks, &return_bi);
> > + handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
> > if (s.failed > 2 && s.syncing) {
> > md_done_sync(conf->mddev, STRIPE_SECTORS,0);
> > clear_bit(STRIPE_SYNCING, &sh->state);
> > @@ -3425,7 +3419,7 @@ static void handle_stripe6(struct stripe_head *sh)
> > (s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
> > && !test_bit(R5_LOCKED, &qdev->flags)
> > && test_bit(R5_UPTODATE, &qdev->flags)))))
> > - handle_stripe_clean_event(conf, sh, disks, &return_bi);
> > + handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
> >
> > /* Now we might consider reading some blocks, either to check/generate
> > * parity, or to satisfy requests
> > @@ -3461,7 +3455,7 @@ static void handle_stripe6(struct stripe_head *sh)
> > }
> > }
> > if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> > - dec_preread_active = 1;
> > + s.dec_preread_active = 1;
> > }
> >
> > /* Now to consider new write requests and what else, if anything
> > @@ -3561,8 +3555,8 @@ static void handle_stripe6(struct stripe_head *sh)
> > unlock:
> >
> > /* wait for this device to become unblocked */
> > - if (unlikely(blocked_rdev))
> > - md_wait_for_blocked_rdev(blocked_rdev, conf->mddev);
> > + if (unlikely(s.blocked_rdev))
> > + md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);
> >
> > if (s.ops_request)
> > raid_run_ops(sh, s.ops_request);
> > @@ -3570,7 +3564,7 @@ static void handle_stripe6(struct stripe_head *sh)
> > ops_run_io(sh, &s);
> >
> >
> > - if (dec_preread_active) {
> > + if (s.dec_preread_active) {
> > /* We delay this until after ops_run_io so that if make_request
> > * is waiting on a flush, it won't continue until the writes
> > * have actually been submitted.
> > @@ -3581,7 +3575,7 @@ static void handle_stripe6(struct stripe_head *sh)
> > md_wakeup_thread(conf->mddev->thread);
> > }
> >
> > - return_io(return_bi);
> > + return_io(s.return_bi);
> > }
> >
> > static void handle_stripe(struct stripe_head *sh)
> > diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> > index d3c61d3..9ceb574 100644
> > --- a/drivers/md/raid5.h
> > +++ b/drivers/md/raid5.h
> > @@ -248,6 +248,10 @@ struct stripe_head_state {
> > int failed_num[2];
> > unsigned long ops_request;
> > int p_failed, q_failed;
> > +
> > + struct bio *return_bi;
> > + mdk_rdev_t *blocked_rdev;
> > + int dec_preread_active;
> > };
>
> I'd rather rearrange the struct to reduce paddings on 64-bit:
>
> /* stripe_head_state - collects and tracks the dynamic state of a stripe_head
> * for handle_stripe. It is only valid under spin_lock(sh->lock);
> */
> struct stripe_head_state {
> int syncing, expanding, expanded;
> int locked, uptodate, to_read, to_write, failed, written;
> int to_fill, compute, req_compute, non_overwrite;
> int failed_num[2];
> int p_failed, q_failed;
> int dec_preread_active;
> unsigned long ops_request;
>
> struct bio *return_bi;
> mdk_rdev_t *blocked_rdev;
> };

I've made that change, thanks.


>
> BTW, comment above mentions sh->lock which is disappeared in previous
> patch. It should be fixed as well (in patch 08/34?).

True ... and there are lots of other mentions made of the stripe lock in all
those comments.
I have fixed some of them up but I really need to review all of that
commentary.

Thanks,
NeilBrown


>
> Thanks.
>
>
> >
> > /* Flags */
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 13/34] md/raid5: Move code for finishing areconstruction into handle_stripe.

am 26.07.2011 03:44:17 von NeilBrown

On Fri, 22 Jul 2011 16:09:10 +0900 Namhyung Kim wrote:

> NeilBrown writes:
>
> > Prior to commit ab69ae12ceef7 the code in handle_stripe5 and
> > handle_stripe6 to "Finish reconstruct operations initiated by the
> > expansion process" was identical.
> > That commit added an identical stanza of code to each function, but in
> > different places. That was careless.
> >
> > The raid5 code was correct, so move that out into handle_stripe and
> > remove raid6 version.
> >
> > Signed-off-by: NeilBrown
>
> Reviewed-by: Namhyung Kim
>
> and nitpick below..
>
>

> > static void handle_stripe(struct stripe_head *sh)
> > {
> > struct stripe_head_state s;
> > + int done;
> > raid5_conf_t *conf = sh->raid_conf;
> > + int i;
>
> Can be declared in a line.
>

True.... There are differing opinions on whether that is a good idea or not.
I tend to like to keep each variable in it's own declaration, though closely
related variables might get put together.
I have moved 'int i' up one line so the addition is in just one place, but
I've kept it separate from 'done'.

> > + /* Finish reconstruct operations initiated by the expansion process */
> > + if (sh->reconstruct_state == reconstruct_state_result) {
> > + struct stripe_head *sh2
> > + = get_active_stripe(conf, sh->sector, 1, 1, 1);
>
> Wouldn't it be better to use more descriptive name rather than sh2,
> something like sh_prev, sh_src or whatever?

good point. 'sh_src' it is.

Thanks,
NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 14/34] md/raid5: move more code into commonhandle_stripe

am 26.07.2011 03:48:42 von NeilBrown

On Fri, 22 Jul 2011 16:32:03 +0900 Namhyung Kim wrote:

> NeilBrown writes:
>
> > The difference between the RAID5 and RAID6 code here is easily
> > resolved using conf->max_degraded.
> >
> > Signed-off-by: NeilBrown
>
> Reviewed-by: Namhyung Kim
>
> and a coding style issue.
>

> > + if (s.failed <= conf->max_degraded && !conf->mddev->ro)
> > + for (i = 0; i < s.failed; i++) {
> > + struct r5dev *dev = &sh->dev[s.failed_num[i]];
> > + if (test_bit(R5_ReadError, &dev->flags)
> > + && !test_bit(R5_LOCKED, &dev->flags)
> > + && test_bit(R5_UPTODATE, &dev->flags)
> > + ) {
>
> This line looks weird and should be combined to previous line.
>
> And I slightly prefer that the logical operators are put on the
> same line with previous conditional, so that next one can be on
> the same column and more readable, but ...
>

I agree that it is nice if they line up. But I also think the "&&" is
important and want it to stand out. In this case the '!' would stop them
from lining up very nicely, so I'll leave it as it is.

The thing I like about having ") {" on a separate line is that I can later
add a condition by just adding a line - it makes the patch easier to read.

Not a bit thing maybe and I don't always structure conditions like this, but
I think I'll leave it this time.

Thanks,
NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 andhandle_stripe_dirtying6

am 26.07.2011 03:52:38 von NeilBrown

On Fri, 22 Jul 2011 18:10:33 +0900 Namhyung Kim wrote:

> NeilBrown writes:
>
> > RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
> > also allow 'read-modify-write'
> > Apart from this difference, handle_stripe_dirtying[56] are nearly
> > identical. So resolve these differences and create just one function.
> >
> > Signed-off-by: NeilBrown
>
> Reviewed-by: Namhyung Kim
>
> BTW, here is a question:
> Why RAID6 doesn't allow the read-modify-write? I don't think it is
> not possible, so what prevents doing that? performance? complexity?
> or it's not possible? Why? :)


The primary reason in my mind is that when Peter Anvin wrote this code he
didn't implement read-modify-write and I have never seen a need to consider
changing that.

You would need a largish array - at least 7 devices - before RWM could
possibly be a win, but some people do have arrays larger than that.

The computation to "subtract" from the Q-syndrome might be a bit complex - I
don't know.

Peter: maybe you have a reason that I haven't mentioned?

Thanks,
NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 19/34] md/raid5: move some more common code intohandle_stripe

am 26.07.2011 03:59:23 von NeilBrown

On Fri, 22 Jul 2011 18:29:16 +0900 Namhyung Kim wrote:

> NeilBrown writes:
>
> > The RAID6 version of this code is usable for RAID5 providing:
> > - we test "conf->max_degraded" rather than "2" as appropriate
> > - we make sure s->failed_num[1] is meaningful (and not '-1')
> > when s->failed > 1
> >
> > The 'return 1' must become 'goto finish' in the new location.
> >
> > Signed-off-by: NeilBrown
>
> Reviewed-by: Namhyung Kim
>
> and nitpick below.
>
>

> > + pr_debug("locked=%d uptodate=%d to_read=%d"
> > + " to_write=%d failed=%d failed_num=%d,%d\n",
> > + s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
> > + s.failed_num[0], s.failed_num[1]);
> > + /* check if the array has lost >2 devices and, if so, some requests
> > + * might need to be failed
> /* check if the array has lost more than max_degraded devices and,
> if so, some requests might need to be failed
>
> Thanks.
>
>

OK, I made that change - thanks.

NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 20/34] md/raid5: finalise new merged handle_stripe.

am 26.07.2011 04:02:05 von NeilBrown

On Fri, 22 Jul 2011 18:36:13 +0900 Namhyung Kim wrote:

> NeilBrown writes:
>
> > handle_stripe5() and handle_stripe6() are now virtually identical.
> > So discard one and rename the other to 'analyse_stripe()'.
> >
> > It always returns 0, so change it to 'void' and remove the 'done'
> > variable in handle_stripe().
> >
> > Signed-off-by: NeilBrown


> > static void handle_stripe(struct stripe_head *sh)
> > {
> > struct stripe_head_state s;
> > - int done;
> > raid5_conf_t *conf = sh->raid_conf;
> > int i;
> > int prexor;
> > @@ -3099,13 +3015,7 @@ static void handle_stripe(struct stripe_head *sh)
> > s.failed_num[0] = -1;
> > s.failed_num[1] = -1;
>
> There are codes that set/initialize some fields of stripe_head_state
> outside of analyse_stripe() and it'd better off moving them into the
> function, IMHO.
>
> Thanks.

Thanks. I have merged this change.

NeilBrown


diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ed6729d..b321d6c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2905,6 +2905,14 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
struct r5dev *dev;
int i;

+ memset(s, 0, sizeof(*s));
+
+ s->syncing = test_bit(STRIPE_SYNCING, &sh->state);
+ s->expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
+ s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
+ s->failed_num[0] = -1;
+ s->failed_num[1] = -1;
+
/* Now to look around and see what can be done */
rcu_read_lock();
spin_lock_irq(&conf->device_lock);
@@ -3006,13 +3014,6 @@ static void handle_stripe(struct stripe_head *sh)
(unsigned long long)sh->sector, sh->state,
atomic_read(&sh->count), sh->pd_idx, sh->qd_idx,
sh->check_state, sh->reconstruct_state);
- memset(&s, 0, sizeof(s));
-
- s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
- s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
- s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
- s.failed_num[0] = -1;
- s.failed_num[1] = -1;

analyse_stripe(sh, &s);

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6

am 26.07.2011 04:41:03 von hpa

NeilBrown wrote:

>On Fri, 22 Jul 2011 18:10:33 +0900 Namhyung Kim
>wrote:
>
>> NeilBrown writes:
>>
>> > RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
>> > also allow 'read-modify-write'
>> > Apart from this difference, handle_stripe_dirtying[56] are nearly
>> > identical. So resolve these differences and create just one
>function.
>> >
>> > Signed-off-by: NeilBrown
>>
>> Reviewed-by: Namhyung Kim
>>
>> BTW, here is a question:
>> Why RAID6 doesn't allow the read-modify-write? I don't think it is
>> not possible, so what prevents doing that? performance? complexity?
>> or it's not possible? Why? :)
>
>
>The primary reason in my mind is that when Peter Anvin wrote this code
>he
>didn't implement read-modify-write and I have never seen a need to
>consider
>changing that.
>
>You would need a largish array - at least 7 devices - before RWM could
>possibly be a win, but some people do have arrays larger than that.
>
>The computation to "subtract" from the Q-syndrome might be a bit
>complex - I
>don't know.
>
>Peter: maybe you have a reason that I haven't mentioned?
>
>Thanks,
>NeilBrown

When I measured it it wasn't a win on my system, and it created a lot of extra complexity. It probably would be a win with SSSE3 and lots of drives.
--
Sent from my mobile phone. Please excuse my brevity and lack of formatting.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 20/34] md/raid5: finalise new merged handle_stripe.

am 26.07.2011 06:50:53 von Namhyung Kim

2011-07-26 (í™=94), 12:02 +1000, NeilBrown:
> On Fri, 22 Jul 2011 18:36:13 +0900 Namhyung Kim =
wrote:
>=20
> > NeilBrown writes:
> >=20
> > > handle_stripe5() and handle_stripe6() are now virtually identical=

> > > So discard one and rename the other to 'analyse_stripe()'.
> > >
> > > It always returns 0, so change it to 'void' and remove the 'done'
> > > variable in handle_stripe().
> > >
> > > Signed-off-by: NeilBrown
>=20
>=20
> > > static void handle_stripe(struct stripe_head *sh)
> > > {
> > > struct stripe_head_state s;
> > > - int done;
> > > raid5_conf_t *conf =3D sh->raid_conf;
> > > int i;
> > > int prexor;
> > > @@ -3099,13 +3015,7 @@ static void handle_stripe(struct stripe_he=
ad *sh)
> > > s.failed_num[0] =3D -1;
> > > s.failed_num[1] =3D -1;
> >=20
> > There are codes that set/initialize some fields of stripe_head_stat=
e
> > outside of analyse_stripe() and it'd better off moving them into th=
e
> > function, IMHO.
> >=20
> > Thanks.
>=20
> Thanks. I have merged this change.
>=20
> NeilBrown
>=20
>=20
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index ed6729d..b321d6c 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2905,6 +2905,14 @@ static void analyse_stripe(struct stripe_head =
*sh, struct stripe_head_state *s)
> struct r5dev *dev;
> int i;
> =20
> + memset(s, 0, sizeof(*s));
> +
> + s->syncing =3D test_bit(STRIPE_SYNCING, &sh->state);
> + s->expanding =3D test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
> + s->expanded =3D test_bit(STRIPE_EXPAND_READY, &sh->state);
> + s->failed_num[0] =3D -1;
> + s->failed_num[1] =3D -1;
> +
> /* Now to look around and see what can be done */
> rcu_read_lock();
> spin_lock_irq(&conf->device_lock);
> @@ -3006,13 +3014,6 @@ static void handle_stripe(struct stripe_head *=
sh)
> (unsigned long long)sh->sector, sh->state,
> atomic_read(&sh->count), sh->pd_idx, sh->qd_idx,
> sh->check_state, sh->reconstruct_state);
> - memset(&s, 0, sizeof(s));
> -
> - s.syncing =3D test_bit(STRIPE_SYNCING, &sh->state);
> - s.expanding =3D test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
> - s.expanded =3D test_bit(STRIPE_EXPAND_READY, &sh->state);
> - s.failed_num[0] =3D -1;
> - s.failed_num[1] =3D -1;
> =20
> analyse_stripe(sh, &s);
> =20

Reviewed-by: Namhyung Kim

Looks great to me, thanks.


--=20
Regards,
Namhyung Kim


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" i=
n
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 10/34] md/raid5: unify stripe_head_state and r6_state

am 26.07.2011 08:44:00 von Namhyung Kim

2011-07-22 (ê¸=88), 14:54 +0900, Namhyung Kim:
> 2011-07-22 (ê¸=88), 15:37 +1000, NeilBrown:
> From 74721b68b4cf8d4845aa9efa3daf99247a611b70 Mon Sep 17 00:00:00 200=
1
> > From: NeilBrown
> > Date: Fri, 22 Jul 2011 15:29:38 +1000
> > Subject: [PATCH] md/raid5: Remove stripe_head_state arg from handle=
_stripe_expansion.
> >=20
> > This arg is only used to differentiate between RAID5 and RAID6 but
> > that is not needed. For RAID5, raid5_compute_sector will set qd_id=
x
> > to "~0" so j with certainly not equals qd_idx, so there is no need
> > for a guard on that condition.
> >=20
> > So remove the guard and remove the arg from the declaration and
> > callers of handle_stripe_expansion.
> >=20
> > Signed-off-by: NeilBrown
>=20
> Looks good to me, thanks!
>=20

And please feel free to add this in case you need it :)

Reviewed-by: Namhyung Kim



--=20
Regards,
Namhyung Kim


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" i=
n
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 andhandle_stripe_dirtying6

am 26.07.2011 11:40:42 von David Brown

On 26/07/11 03:52, NeilBrown wrote:
> On Fri, 22 Jul 2011 18:10:33 +0900 Namhyung Kim wrote:
>
>> NeilBrown writes:
>>
>>> RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
>>> also allow 'read-modify-write'
>>> Apart from this difference, handle_stripe_dirtying[56] are nearly
>>> identical. So resolve these differences and create just one function.
>>>
>>> Signed-off-by: NeilBrown
>>
>> Reviewed-by: Namhyung Kim
>>
>> BTW, here is a question:
>> Why RAID6 doesn't allow the read-modify-write? I don't think it is
>> not possible, so what prevents doing that? performance? complexity?
>> or it's not possible? Why? :)
>
>
> The primary reason in my mind is that when Peter Anvin wrote this code he
> didn't implement read-modify-write and I have never seen a need to consider
> changing that.
>
> You would need a largish array - at least 7 devices - before RWM could
> possibly be a win, but some people do have arrays larger than that.
>
> The computation to "subtract" from the Q-syndrome might be a bit complex - I
> don't know.
>

The "subtract from Q" isn't difficult in theory, but it involves more
cpu time than the "subtract from P". It should be an overall win in the
same was as for RAID5. However, the code would be more complex if you
allow for RMW on more than one block.

With raid5, if you have a set of data blocks D0, D1, ..., Dn and a
parity block P, and you want to change Di_old to Di_new, you can do this:

Read Di_old and P_old
Calculate P_new = P_old xor Di_old xor Di_new
Write Di_new and P_new

You can easily extend this do changing several data blocks without
reading in the whole old stripe. I don't know if the Linux raid5
implementation does that or not (I understand the theory here, but I am
far from an expert on the implementation). There is no doubt a balance
between the speed gains of multiple-block RMW and the code complexity.
As long as you are changing less than half the blocks in the stripe, the
cpu time will be less than it would for whole stripe write. For RMW
writes that are more than half a stripe, the "best" choice depends on
the balance between cpu time and disk bandwidth - a balance that is very
different on today's servers compared to those when Linux raid5 was
first written.



With raid6, the procedure is similar:

Read Di_old, P_old and Q_old.
Calculate P_new = P_old xor Di_old xor Di_new
Calculate Q_new = Q_old xor (2^i . Di_old) xor (2^i . Di_new)
= Q_old xor (2^i . (Di_old xor Di_new))
Write Di_new, P_new and Q_new

The difference is simply that (Di_old xor Di_new) needs to be multiplied
(over the GF field - not normal multiplication) by 2^i. You would do
this by repeatedly applying the multiply-by-two function that already
exists in the raid6 implementation.

I don't know whether or not it is worth using the common subexpression
(Di_old xor Di_new), which turns up twice.



Multi-block raid6 RMW is similar, but you have to keep track of how many
times you should multiply by 2. For a general case, the code would
probably be too messy - it's easier to simple handle the whole stripe.
But the case of consecutive blocks is easier, and likely to be far more
common in practice. If you want to change blocks Di through Dj, you can do:

Read Di_old, D(i+1)_old, ..., Dj_old, P_old and Q_old.
Calculate P_new = P_old xor Di_old xor Di_new
xor D(i+1)_old xor D(i+1)_new
...
xor Dj_old xor Dj_new


Calculate Q_new = Q_old xor (2^i . (Di_old xor Di_new))
xor (2^(i+1) . (D(i+1)_old xor D(i+1)_new))
...
xor (2^j . (Dj_old xor Dj_new))
= Q_old xor (2^i .
(Di_old xor Di_new)
xor (2^1) . (D(i+1)_old xor D(i+1)_new))
xor (2^2) . (D(i+2)_old xor D(i+2)_new))
...
xor (2^(j-i) . (Dj_old xor Dj_new))
)

Write Di_new, D(i+1)_new, ..., Dj_new, P_new and Q_new


The algorithm above looks a little messy (ASCII emails are not the best
medium for mathematics), but it's not hard to see the pattern, and the
loops needed. It should also be possible to merge such routines with
the main raid6 parity calculation functions.


mvh.,

David



> Peter: maybe you have a reason that I haven't mentioned?
>
> Thanks,
> NeilBrown
>

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6

am 26.07.2011 15:23:27 von Namhyung Kim

David Brown writes:

> On 26/07/11 03:52, NeilBrown wrote:
>> On Fri, 22 Jul 2011 18:10:33 +0900 Namhyung Kim wrote:
>>
>>> NeilBrown writes:
>>>
>>>> RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
>>>> also allow 'read-modify-write'
>>>> Apart from this difference, handle_stripe_dirtying[56] are nearly
>>>> identical. So resolve these differences and create just one function.
>>>>
>>>> Signed-off-by: NeilBrown
>>>
>>> Reviewed-by: Namhyung Kim
>>>
>>> BTW, here is a question:
>>> Why RAID6 doesn't allow the read-modify-write? I don't think it is
>>> not possible, so what prevents doing that? performance? complexity?
>>> or it's not possible? Why? :)
>>
>>
>> The primary reason in my mind is that when Peter Anvin wrote this code he
>> didn't implement read-modify-write and I have never seen a need to consider
>> changing that.
>>
>> You would need a largish array - at least 7 devices - before RWM could
>> possibly be a win, but some people do have arrays larger than that.
>>
>> The computation to "subtract" from the Q-syndrome might be a bit complex - I
>> don't know.
>>
>
> The "subtract from Q" isn't difficult in theory, but it involves more
> cpu time than the "subtract from P". It should be an overall win in
> the same was as for RAID5. However, the code would be more complex if
> you allow for RMW on more than one block.
>
> With raid5, if you have a set of data blocks D0, D1, ..., Dn and a
> parity block P, and you want to change Di_old to Di_new, you can do
> this:
>
> Read Di_old and P_old
> Calculate P_new = P_old xor Di_old xor Di_new
> Write Di_new and P_new
>
> You can easily extend this do changing several data blocks without
> reading in the whole old stripe. I don't know if the Linux raid5
> implementation does that or not (I understand the theory here, but I
> am far from an expert on the implementation). There is no doubt a
> balance between the speed gains of multiple-block RMW and the code
> complexity. As long as you are changing less than half the blocks in
> the stripe, the cpu time will be less than it would for whole stripe
> write. For RMW writes that are more than half a stripe, the "best"
> choice depends on the balance between cpu time and disk bandwidth - a
> balance that is very different on today's servers compared to those
> when Linux raid5 was first written.
>
>
>
> With raid6, the procedure is similar:
>
> Read Di_old, P_old and Q_old.
> Calculate P_new = P_old xor Di_old xor Di_new
> Calculate Q_new = Q_old xor (2^i . Di_old) xor (2^i . Di_new)
> = Q_old xor (2^i . (Di_old xor Di_new))
> Write Di_new, P_new and Q_new
>
> The difference is simply that (Di_old xor Di_new) needs to be
> multiplied (over the GF field - not normal multiplication) by 2^i.
> You would do this by repeatedly applying the multiply-by-two function
> that already exists in the raid6 implementation.
>
> I don't know whether or not it is worth using the common subexpression
> (Di_old xor Di_new), which turns up twice.
>
>
>
> Multi-block raid6 RMW is similar, but you have to keep track of how
> many times you should multiply by 2. For a general case, the code
> would probably be too messy - it's easier to simple handle the whole
> stripe. But the case of consecutive blocks is easier, and likely to be
> far more common in practice. If you want to change blocks Di through
> Dj, you can do:
>
> Read Di_old, D(i+1)_old, ..., Dj_old, P_old and Q_old.
> Calculate P_new = P_old xor Di_old xor Di_new
> xor D(i+1)_old xor D(i+1)_new
> ...
> xor Dj_old xor Dj_new
>
>
> Calculate Q_new = Q_old xor (2^i . (Di_old xor Di_new))
> xor (2^(i+1) . (D(i+1)_old xor D(i+1)_new))
> ...
> xor (2^j . (Dj_old xor Dj_new))
> = Q_old xor (2^i .
> (Di_old xor Di_new)
> xor (2^1) . (D(i+1)_old xor D(i+1)_new))
> xor (2^2) . (D(i+2)_old xor D(i+2)_new))
> ...
> xor (2^(j-i) . (Dj_old xor Dj_new))
> )
>
> Write Di_new, D(i+1)_new, ..., Dj_new, P_new and Q_new
>
>
> The algorithm above looks a little messy (ASCII emails are not the
> best medium for mathematics), but it's not hard to see the pattern,
> and the loops needed. It should also be possible to merge such
> routines with the main raid6 parity calculation functions.
>
>
> mvh.,
>
> David
>

Thanks a lot for your detailed explanation, David!

I think it'd be good if we implement RMW for small (eg. single disk)
write on RAID6 too.


--
Regards,
Namhyung Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 andhandle_stripe_dirtying6

am 26.07.2011 17:01:13 von David Brown

On 26/07/11 15:23, Namhyung Kim wrote:
> David Brown writes:
>
>> On 26/07/11 03:52, NeilBrown wrote:
>>> On Fri, 22 Jul 2011 18:10:33 +0900 Namhyung Kim wrote:
>>>
>>>> NeilBrown writes:
>>>>
>>>>> RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
>>>>> also allow 'read-modify-write'
>>>>> Apart from this difference, handle_stripe_dirtying[56] are nearly
>>>>> identical. So resolve these differences and create just one function.
>>>>>
>>>>> Signed-off-by: NeilBrown
>>>>
>>>> Reviewed-by: Namhyung Kim
>>>>
>>>> BTW, here is a question:
>>>> Why RAID6 doesn't allow the read-modify-write? I don't think it is
>>>> not possible, so what prevents doing that? performance? complexity?
>>>> or it's not possible? Why? :)
>>>
>>>
>>> The primary reason in my mind is that when Peter Anvin wrote this code he
>>> didn't implement read-modify-write and I have never seen a need to consider
>>> changing that.
>>>
>>> You would need a largish array - at least 7 devices - before RWM could
>>> possibly be a win, but some people do have arrays larger than that.
>>>
>>> The computation to "subtract" from the Q-syndrome might be a bit complex - I
>>> don't know.
>>>
>>
>> The "subtract from Q" isn't difficult in theory, but it involves more
>> cpu time than the "subtract from P". It should be an overall win in
>> the same was as for RAID5. However, the code would be more complex if
>> you allow for RMW on more than one block.
>>
>> With raid5, if you have a set of data blocks D0, D1, ..., Dn and a
>> parity block P, and you want to change Di_old to Di_new, you can do
>> this:
>>
>> Read Di_old and P_old
>> Calculate P_new = P_old xor Di_old xor Di_new
>> Write Di_new and P_new
>>
>> You can easily extend this do changing several data blocks without
>> reading in the whole old stripe. I don't know if the Linux raid5
>> implementation does that or not (I understand the theory here, but I
>> am far from an expert on the implementation). There is no doubt a
>> balance between the speed gains of multiple-block RMW and the code
>> complexity. As long as you are changing less than half the blocks in
>> the stripe, the cpu time will be less than it would for whole stripe
>> write. For RMW writes that are more than half a stripe, the "best"
>> choice depends on the balance between cpu time and disk bandwidth - a
>> balance that is very different on today's servers compared to those
>> when Linux raid5 was first written.
>>
>>
>>
>> With raid6, the procedure is similar:
>>
>> Read Di_old, P_old and Q_old.
>> Calculate P_new = P_old xor Di_old xor Di_new
>> Calculate Q_new = Q_old xor (2^i . Di_old) xor (2^i . Di_new)
>> = Q_old xor (2^i . (Di_old xor Di_new))
>> Write Di_new, P_new and Q_new
>>
>> The difference is simply that (Di_old xor Di_new) needs to be
>> multiplied (over the GF field - not normal multiplication) by 2^i.
>> You would do this by repeatedly applying the multiply-by-two function
>> that already exists in the raid6 implementation.
>>
>> I don't know whether or not it is worth using the common subexpression
>> (Di_old xor Di_new), which turns up twice.
>>
>>
>>
>> Multi-block raid6 RMW is similar, but you have to keep track of how
>> many times you should multiply by 2. For a general case, the code
>> would probably be too messy - it's easier to simple handle the whole
>> stripe. But the case of consecutive blocks is easier, and likely to be
>> far more common in practice. If you want to change blocks Di through
>> Dj, you can do:
>>
>> Read Di_old, D(i+1)_old, ..., Dj_old, P_old and Q_old.
>> Calculate P_new = P_old xor Di_old xor Di_new
>> xor D(i+1)_old xor D(i+1)_new
>> ...
>> xor Dj_old xor Dj_new
>>
>>
>> Calculate Q_new = Q_old xor (2^i . (Di_old xor Di_new))
>> xor (2^(i+1) . (D(i+1)_old xor D(i+1)_new))
>> ...
>> xor (2^j . (Dj_old xor Dj_new))
>> = Q_old xor (2^i .
>> (Di_old xor Di_new)
>> xor (2^1) . (D(i+1)_old xor D(i+1)_new))
>> xor (2^2) . (D(i+2)_old xor D(i+2)_new))
>> ...
>> xor (2^(j-i) . (Dj_old xor Dj_new))
>> )
>>
>> Write Di_new, D(i+1)_new, ..., Dj_new, P_new and Q_new
>>
>>
>> The algorithm above looks a little messy (ASCII emails are not the
>> best medium for mathematics), but it's not hard to see the pattern,
>> and the loops needed. It should also be possible to merge such
>> routines with the main raid6 parity calculation functions.
>>
>>
>> mvh.,
>>
>> David
>>
>
> Thanks a lot for your detailed explanation, David!
>
> I think it'd be good if we implement RMW for small (eg. single disk)
> write on RAID6 too.
>

Without having looked much at the details of the implementation, I
believe it should be practical to use RMW for single block writes on
RAID6.

I also think it should be reasonably practical to do consecutive RMW
blocks for both raid5 and raid6. But to do that efficiently would
probably mean generalising the existing raid5 and raid6 functions a
little rather than writing a new set of functions - I'm not sure how
much effort that would take within the existing structure of multiple
implementations that are different speeds on different cpus.


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 08/34] md/raid5: replace sh->lock with an "active" flag.

am 04.08.2011 00:47:47 von dan.j.williams

On Wed, Jul 20, 2011 at 7:32 PM, NeilBrown wrote:
> sh->lock is now mainly used to ensure that two threads aren't running
> in the locked part of handle_stripe[56] at the same time.
>
> That can more neatly be achieved with an 'active' flag which we set
> while running handle_stripe. =A0If we find the flag is set, we simply
> requeue the stripe for later by setting STRIPE_HANDLE.
>
> For safety we take ->device_lock while examining the state of the
> stripe and creating a summary in 'stripe_head_state / r6_state'.
> This possibly isn't needed but as shared fields like ->toread,
> ->towrite are checked it is safer for now at least.
>
> We leave the label after the old 'unlock' called "unlock" because it
> will disappear in a few patches, so renaming seems pointless.
>
> This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
> later, but that is not a problem.
>

This removal reminds me of one thing I have wondered about, but to
date have not found the time to measure (maybe someone might beat me
to it if the idea is out there), is what is the overhead of all the
atomic operations that raid5.c generates? If we can guarantee that
certain updates only happen under sh->lock (now STRIPE_ACTIVE) can we
downgrade set_bit and clear_bit to their non-atomic __set_bit and
__clear_bit versions and recover some cpu cycles?

--
Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" i=
n
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 08/34] md/raid5: replace sh->lock with an "active"flag.

am 04.08.2011 01:35:32 von NeilBrown

On Wed, 3 Aug 2011 15:47:47 -0700 Dan Williams m>
wrote:

> On Wed, Jul 20, 2011 at 7:32 PM, NeilBrown wrote:
> > sh->lock is now mainly used to ensure that two threads aren't runni=
ng
> > in the locked part of handle_stripe[56] at the same time.
> >
> > That can more neatly be achieved with an 'active' flag which we set
> > while running handle_stripe. =A0If we find the flag is set, we simp=
ly
> > requeue the stripe for later by setting STRIPE_HANDLE.
> >
> > For safety we take ->device_lock while examining the state of the
> > stripe and creating a summary in 'stripe_head_state / r6_state'.
> > This possibly isn't needed but as shared fields like ->toread,
> > ->towrite are checked it is safer for now at least.
> >
> > We leave the label after the old 'unlock' called "unlock" because i=
t
> > will disappear in a few patches, so renaming seems pointless.
> >
> > This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIV=
E
> > later, but that is not a problem.
> >
>=20
> This removal reminds me of one thing I have wondered about, but to
> date have not found the time to measure (maybe someone might beat me
> to it if the idea is out there), is what is the overhead of all the
> atomic operations that raid5.c generates? If we can guarantee that
> certain updates only happen under sh->lock (now STRIPE_ACTIVE) can we
> downgrade set_bit and clear_bit to their non-atomic __set_bit and
> __clear_bit versions and recover some cpu cycles?
>=20

You can only used the unlocked version if you know that no other CPU wi=
ll
change any of the bits in the 'unsigned long'. As STRIPE_HANDLE can be=
set
at any time, I think all accesses to sh->state must be atomic.

However 'struct stripe_head_state' is thread-local so setting/clearing =
flags
in ops_request can probably benefit from non-atomic ops.

NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" i=
n
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 08/34] md/raid5: replace sh->lock with an "active" flag.

am 04.08.2011 01:45:35 von dan.j.williams

On Wed, Aug 3, 2011 at 4:35 PM, NeilBrown wrote:
> On Wed, 3 Aug 2011 15:47:47 -0700 Dan Williams com>
> wrote:
>
>> On Wed, Jul 20, 2011 at 7:32 PM, NeilBrown wrote:
>> > sh->lock is now mainly used to ensure that two threads aren't runn=
ing
>> > in the locked part of handle_stripe[56] at the same time.
>> >
>> > That can more neatly be achieved with an 'active' flag which we se=
t
>> > while running handle_stripe. =A0If we find the flag is set, we sim=
ply
>> > requeue the stripe for later by setting STRIPE_HANDLE.
>> >
>> > For safety we take ->device_lock while examining the state of the
>> > stripe and creating a summary in 'stripe_head_state / r6_state'.
>> > This possibly isn't needed but as shared fields like ->toread,
>> > ->towrite are checked it is safer for now at least.
>> >
>> > We leave the label after the old 'unlock' called "unlock" because =
it
>> > will disappear in a few patches, so renaming seems pointless.
>> >
>> > This leaves the stripe 'locked' for longer as we clear STRIPE_ACTI=
VE
>> > later, but that is not a problem.
>> >
>>
>> This removal reminds me of one thing I have wondered about, but to
>> date have not found the time to measure (maybe someone might beat me
>> to it if the idea is out there), is what is the overhead of all the
>> atomic operations that raid5.c generates? =A0If we can guarantee tha=
t
>> certain updates only happen under sh->lock (now STRIPE_ACTIVE) can w=
e
>> downgrade set_bit and clear_bit to their non-atomic __set_bit and
>> __clear_bit versions and recover some cpu cycles?
>>
>
> You can only used the unlocked version if you know that no other CPU =
will
> change any of the bits in the 'unsigned long'. =A0As STRIPE_HANDLE ca=
n be set
> at any time, I think all accesses to sh->state must be atomic.
>
> However 'struct stripe_head_state' is thread-local so setting/clearin=
g flags
> in ops_request can probably benefit from non-atomic ops.

I was particularly eyeing 'flags' in struct r5dev...
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" i=
n
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [md PATCH 08/34] md/raid5: replace sh->lock with an "active"flag.

am 04.08.2011 02:18:10 von NeilBrown

On Wed, 3 Aug 2011 16:45:35 -0700 "Williams, Dan J"
wrote:

> On Wed, Aug 3, 2011 at 4:35 PM, NeilBrown wrote:
> > On Wed, 3 Aug 2011 15:47:47 -0700 Dan Williams l.com>
> > wrote:
> >
> >> On Wed, Jul 20, 2011 at 7:32 PM, NeilBrown wrote:
> >> > sh->lock is now mainly used to ensure that two threads aren't ru=
nning
> >> > in the locked part of handle_stripe[56] at the same time.
> >> >
> >> > That can more neatly be achieved with an 'active' flag which we =
set
> >> > while running handle_stripe. =A0If we find the flag is set, we s=
imply
> >> > requeue the stripe for later by setting STRIPE_HANDLE.
> >> >
> >> > For safety we take ->device_lock while examining the state of th=
e
> >> > stripe and creating a summary in 'stripe_head_state / r6_state'.
> >> > This possibly isn't needed but as shared fields like ->toread,
> >> > ->towrite are checked it is safer for now at least.
> >> >
> >> > We leave the label after the old 'unlock' called "unlock" becaus=
e it
> >> > will disappear in a few patches, so renaming seems pointless.
> >> >
> >> > This leaves the stripe 'locked' for longer as we clear STRIPE_AC=
TIVE
> >> > later, but that is not a problem.
> >> >
> >>
> >> This removal reminds me of one thing I have wondered about, but to
> >> date have not found the time to measure (maybe someone might beat =
me
> >> to it if the idea is out there), is what is the overhead of all th=
e
> >> atomic operations that raid5.c generates? =A0If we can guarantee t=
hat
> >> certain updates only happen under sh->lock (now STRIPE_ACTIVE) can=
we
> >> downgrade set_bit and clear_bit to their non-atomic __set_bit and
> >> __clear_bit versions and recover some cpu cycles?
> >>
> >
> > You can only used the unlocked version if you know that no other CP=
U will
> > change any of the bits in the 'unsigned long'. =A0As STRIPE_HANDLE =
can be set
> > at any time, I think all accesses to sh->state must be atomic.
> >
> > However 'struct stripe_head_state' is thread-local so setting/clear=
ing flags
> > in ops_request can probably benefit from non-atomic ops.
>=20
> I was particularly eyeing 'flags' in struct r5dev...

good point.... yes, they would probably be safe.
Async updates only happen when the refcount on the stripe head is !=3D =
0,
and handle_stripe only updates things when it own the sole reference.

So the updates in handle_stripe and functions directly call from there =
should
be safe to use unlocked versions...

NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" i=
n
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html