[PATCH 00/13] Series short description

on 18.11.2010 10:21:20 by krzysztof.wojcik

The following series of patches is based on mdadm version 3.1.4.

This series is the first of three parts of the reshape/takeover implementation for external metadata.
The next parts are:
- OnlineCapacityExpansion/Checkpointing (already sent)
- Takeover/Migrations (will be sent)

The external-reshape-design.txt file
(provided by 0011-Document-the-external-reshape-implementation.patch)
contains a detailed description of the reshape/takeover design.

---

Dan Williams (13):
Provide a mdstat_ent to subarray helper
block monitor: freeze spare assignment for external arrays
Manage: allow manual control of external raid0 readonly flag
Grow: mark some functions static
Assemble: fix assembly in the delta_disks > max_degraded case
Grow: fix check for raid6 layout normalization
Grow: add missing raid4 geometries to geo_map()
fix a get_linux_version() comparison typo
Create: cleanup/unify default geometry handling
Initialize st->devnum and st->container_dev in super_by_fd
Document the external reshape implementation
External reshape (step 1): container reshape and ->reshape_super()
External reshape (step 2): Freeze container


Assemble.c | 4
Create.c | 21 +-
Detail.c | 11 -
Grow.c | 493 ++++++++++++++++++++++++++++++++++++++++---
Manage.c | 1
external-reshape-design.txt | 168 +++++++++++++++
managemon.c | 21 ++
mdadm.c | 2
mdadm.h | 26 ++
msg.c | 195 +++++++++++++++++
msg.h | 2
restripe.c | 2
super-ddf.c | 11 +
super-intel.c | 15 +
sysfs.c | 33 +++
util.c | 48 +++-
16 files changed, 965 insertions(+), 88 deletions(-)
create mode 100644 external-reshape-design.txt

--
Krzysztof Wojcik
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH 01/13] Provide a mdstat_ent to subarray helper

on 18.11.2010 10:21:29 by krzysztof.wojcik

From: Dan Williams

...before introducing another open-coded instance of this conversion.
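
The conversion being factored out can be sketched on its own. This is a minimal sketch, assuming the mdstat metadata_version string has the form "external:" (9 characters), one flag character ('/' or '-'), the container name, a '/', and then the subarray name. The function name is hypothetical and is not part of mdadm; like the helper in the patch, it does no bounds checking, so the caller is assumed to have already confirmed that the entry belongs to the container.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical standalone equivalent of the offset arithmetic
 * 10 + strlen(container) + 1 used by the patch. */
static const char *subarray_in_version(const char *metadata_version,
                                       const char *container)
{
    size_t prefix = strlen("external:") + 1;   /* 9 chars + flag char = 10 */

    /* skip the container name and the '/' that follows it */
    return metadata_version + prefix + strlen(container) + 1;
}
```

For "external:/md127/0" and container "md127" the returned pointer addresses the "0" at the end of the string.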

Signed-off-by: Dan Williams
---
managemon.c | 2 +-
mdadm.h | 5 +++++
util.c | 11 ++++-------
3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/managemon.c b/managemon.c
index bab0397..544c4a6 100644
--- a/managemon.c
+++ b/managemon.c
@@ -511,7 +511,7 @@ static void manage_new(struct mdstat_ent *mdstat,

new->container = container;

- inst = &mdstat->metadata_version[10+strlen(container->devname)+1];
+ inst = to_subarray(mdstat, container->devname);

new->info.array = mdi->array;
new->info.component_size = mdi->component_size;
diff --git a/mdadm.h b/mdadm.h
index 03dd41c..9787f9e 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -979,6 +979,11 @@ static inline int is_subarray(char *vers)
return (*vers == '/' || *vers == '-');
}

+static inline char *to_subarray(struct mdstat_ent *ent, char *container)
+{
+ return &ent->metadata_version[10+strlen(container)+1];
+}
+
#ifdef DEBUG
#define dprintf(fmt, arg...) \
fprintf(stderr, fmt, ##arg)
diff --git a/util.c b/util.c
index c9bdd6e..6f1c1d2 100644
--- a/util.c
+++ b/util.c
@@ -1437,14 +1437,11 @@ int is_subarray_active(char *subarray, char *container)
struct mdstat_ent *mdstat = mdstat_read(0, 0);
struct mdstat_ent *ent;

- for (ent = mdstat; ent; ent = ent->next) {
- if (is_container_member(ent, container)) {
- char *inst = &ent->metadata_version[10+strlen(container)+1];
-
- if (!subarray || strcmp(inst, subarray) == 0)
+ for (ent = mdstat; ent; ent = ent->next)
+ if (is_container_member(ent, container))
+ if (!subarray ||
+ strcmp(to_subarray(ent, container), subarray) == 0)
break;
- }
- }

free_mdstat(mdstat);



[PATCH 02/13] block monitor: freeze spare assignment for external arrays

on 18.11.2010 10:21:37 by krzysztof.wojcik

From: Dan Williams

In order to support reshape and atomic removal of spares from containers
we need to prevent mdmon from activating spares. In the reshape case we
additionally need to freeze sync_action while the reshape transaction is
initiated with the kernel and recorded in the metadata.

When reshaping a raid0 array we need to freeze the array *before* it is
transitioned to a redundant raid level. Since sync_action does not exist
at this point we extend the '-' prefix of a subarray string to flag
mdmon not to activate spares.
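
The monitor-side check for this flag can be sketched as follows. This is a minimal sketch, assuming the version string looks like "external:-md127/0" when frozen and "external:/md127/0" otherwise. The function name is hypothetical, and the length guard is an addition that the patch itself does not have.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if spare activation should be skipped, 0 otherwise.
 * 'version' is the content of the metadata_version attribute,
 * or NULL if it could not be read. */
static int spare_assignment_frozen(const char *version)
{
    /* unreadable or too short: assume the worst, as the patch does */
    if (version == NULL || strlen(version) < 10)
        return 1;

    /* character 9 follows "external:" and carries the flag */
    return version[9] == '-';
}
```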

Mdadm needs to be reasonably certain that the version of mdmon in the
system honors this 'freeze' indication. If mdmon is not already active
then we assume the version that gets started is the same as the mdadm
version. Otherwise, we check the version of mdmon as returned by the
extended ping_monitor() operation. This is to catch cases where mdadm
is upgraded in the filesystem, but mdmon started in the initramfs is
from a previous release.
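
The version comparison relies on packing "a.b.c" into one integer. The parsing can be sketched as below, assuming a version string of the form "mdadm - v3.1.4 - <date>". The function name is hypothetical; the logic mirrors mdadm_version() from this patch, but takes the string as a required argument.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a*1000000 + b*1000 + c for "... - vA.B.C ...", or -1 if the
 * string does not match that shape. */
static int parse_mdadm_version(const char *version)
{
    const char *cp = strchr(version, '-');
    char *end;
    unsigned long a, b, c;

    if (!cp || cp[1] != ' ' || cp[2] != 'v')
        return -1;
    a = strtoul(cp + 3, &end, 10);
    if (*end != '.')
        return -1;
    b = strtoul(end + 1, &end, 10);
    if (*end != '.')
        return -1;
    c = strtoul(end + 1, &end, 10);
    if (*end != ' ')
        return -1;
    return (int)(a * 1000000 + b * 1000 + c);
}
```

With this encoding, 3.1.4 becomes 3001004 and passes the `ver < 3001003` check in block_monitor(), while 3.1.2 becomes 3001002 and is rejected. A two-component version such as "v3.2" does not match the expected shape and yields -1.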

Signed-off-by: Dan Williams
---
managemon.c | 19 +++++-
mdadm.h | 4 +
msg.c | 195 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
msg.h | 2 +
sysfs.c | 33 ++++++++++
util.c | 24 +++++++
6 files changed, 273 insertions(+), 4 deletions(-)

diff --git a/managemon.c b/managemon.c
index 544c4a6..164e4f8 100644
--- a/managemon.c
+++ b/managemon.c
@@ -394,12 +394,21 @@ static void manage_member(struct mdstat_ent *mdstat,
* trying to find and assign a spare.
* We do that whenever the monitor tells us too.
*/
+ char buf[64];
+ int frozen;
+
// FIXME
a->info.array.raid_disks = mdstat->raid_disks;
a->info.array.chunk_size = mdstat->chunk_size;
// MORE

- if (a->check_degraded) {
+ /* honor 'frozen' */
+ if (sysfs_get_str(&a->info, NULL, "metadata_version", buf, sizeof(buf)) > 0)
+ frozen = buf[9] == '-';
+ else
+ frozen = 1; /* can't read metadata_version assume the worst */
+
+ if (a->check_degraded && !frozen) {
struct metadata_update *updates = NULL;
struct mdinfo *newdev = NULL;
struct active_array *newa;
@@ -656,7 +665,13 @@ void read_sock(struct supertype *container)
/* read and validate the message */
if (receive_message(fd, &msg, tmo) == 0) {
handle_message(container, &msg);
- if (ack(fd, tmo) < 0)
+ if (msg.len == 0) {
+ /* ping reply with version */
+ msg.buf = Version;
+ msg.len = strlen(Version) + 1;
+ if (send_message(fd, &msg, tmo) < 0)
+ terminate = 1;
+ } else if (ack(fd, tmo) < 0)
terminate = 1;
} else
terminate = 1;
diff --git a/mdadm.h b/mdadm.h
index 9787f9e..f7172e9 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -436,6 +436,8 @@ extern int sysfs_fd_get_ll(int fd, unsigned long long *val);
extern int sysfs_get_ll(struct mdinfo *sra, struct mdinfo *dev,
char *name, unsigned long long *val);
extern int sysfs_fd_get_str(int fd, char *val, int size);
+extern int sysfs_attribute_available(struct mdinfo *sra, struct mdinfo *dev,
+ char *name);
extern int sysfs_get_str(struct mdinfo *sra, struct mdinfo *dev,
char *name, char *val, int size);
extern int sysfs_set_safemode(struct mdinfo *sra, unsigned long ms);
@@ -443,6 +445,7 @@ extern int sysfs_set_array(struct mdinfo *info, int vers);
extern int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume);
extern int sysfs_disk_to_scsi_id(int fd, __u32 *id);
extern int sysfs_unique_holder(int devnum, long rdev);
+extern int sysfs_freeze_array(struct mdinfo *sra);
extern int load_sys(char *path, char *buf);


@@ -847,6 +850,7 @@ extern unsigned long bitmap_sectors(struct bitmap_super_s *bsb);

extern int md_get_version(int fd);
extern int get_linux_version(void);
+extern int mdadm_version(char *version);
extern long long parse_size(char *size);
extern int parse_uuid(char *str, int uuid[4]);
extern int parse_layout_10(char *layout);
diff --git a/msg.c b/msg.c
index aabfa8f..8e7ebfd 100644
--- a/msg.c
+++ b/msg.c
@@ -135,7 +135,15 @@ int ack(int fd, int tmo)
int wait_reply(int fd, int tmo)
{
struct metadata_update msg;
- return receive_message(fd, &msg, tmo);
+ int err = receive_message(fd, &msg, tmo);
+
+ /* mdmon sent extra data, but caller only cares that we got a
+ * successful reply
+ */
+ if (err == 0 && msg.len > 0)
+ free(msg.buf);
+
+ return err;
}

int connect_monitor(char *devname)
@@ -195,7 +203,6 @@ int fping_monitor(int sfd)
return err;
}

-
/* give the monitor a chance to update the metadata */
int ping_monitor(char *devname)
{
@@ -206,6 +213,190 @@ int ping_monitor(char *devname)
return err;
}

+static char *ping_monitor_version(char *devname)
+{
+ int sfd = connect_monitor(devname);
+ struct metadata_update msg;
+ int err = 0;
+
+ if (sfd < 0)
+ return NULL;
+
+ if (ack(sfd, 20) != 0)
+ err = -1;
+
+ if (!err && receive_message(sfd, &msg, 20) != 0)
+ err = -1;
+
+ close(sfd);
+
+ if (err || !msg.len || !msg.buf)
+ return NULL;
+ return msg.buf;
+}
+
+static int unblock_subarray(struct mdinfo *sra, const int unfreeze)
+{
+ char buf[64];
+ int rc = 0;
+
+ if (sra) {
+ sprintf(buf, "external:%s\n", sra->text_version);
+ buf[9] = '/';
+ } else
+ buf[9] = '-';
+
+ if (buf[9] == '-' ||
+ sysfs_set_str(sra, NULL, "metadata_version", buf) ||
+ (unfreeze &&
+ sysfs_attribute_available(sra, NULL, "sync_action") &&
+ sysfs_set_str(sra, NULL, "sync_action", "idle")))
+ rc = -1;
+ return rc;
+}
+
+/**
+ * block_monitor - prevent mdmon spare assignment
+ * @container - container to block
+ * @freeze - flag to additionally freeze sync_action
+ *
+ * This is used by the reshape code to freeze the container, and by the
+ * auto-rebuild implementation to atomically move spares. For reshape
+ * we need to freeze sync_action; for auto-rebuild we only need to
+ * block new spare assignment, and existing rebuilds can continue.
+ */
+int block_monitor(char *container, const int freeze)
+{
+ int devnum = devname2devnum(container);
+ struct mdstat_ent *ent, *e, *e2;
+ struct mdinfo *sra = NULL;
+ char *version = NULL;
+ char buf[64];
+ int rv = 0;
+
+ if (!mdmon_running(devnum)) {
+ /* if mdmon is not active we assume that any instance that is
+ * later started will match the current mdadm version, if this
+ * assumption is violated we may inadvertently rebuild an array
+ * that was meant for reshape, or start rebuild on a spare that
+ * was to be moved to another container
+ */
+ /* pass */;
+ } else {
+ int ver;
+
+ version = ping_monitor_version(container);
+ ver = version ? mdadm_version(version) : -1;
+ free(version);
+ if (ver < 3001003) {
+ fprintf(stderr, Name
+ ": mdmon instance for %s cannot be disabled\n",
+ container);
+ return -1;
+ }
+ }
+
+ ent = mdstat_read(0, 0);
+ if (!ent) {
+ fprintf(stderr, Name
+ ": failed to read /proc/mdstat while disabling mdmon\n");
+ return -1;
+ }
+
+ /* freeze container contents */
+ for (e = ent; e; e = e->next) {
+ if (!is_container_member(e, container))
+ continue;
+ sysfs_free(sra);
+ sra = sysfs_read(-1, e->devnum, GET_VERSION);
+ if (!sra) {
+ fprintf(stderr, Name
+ ": failed to read sysfs for subarray%s\n",
+ to_subarray(e, container));
+ break;
+ }
+ /* can't reshape an array that we can't monitor */
+ if (sra->text_version[0] == '-')
+ break;
+
+ if (freeze && sysfs_freeze_array(sra) < 1)
+ break;
+ /* flag this array to not be modified by mdmon (close race with
+ * takeover in reshape case and spare reassignment in the
+ * auto-rebuild case)
+ */
+ sprintf(buf, "external:%s\n", sra->text_version);
+ buf[9] = '-';
+ if (sysfs_set_str(sra, NULL, "metadata_version", buf))
+ break;
+ ping_monitor(container);
+
+ /* check that we did not race with recovery */
+ if ((freeze &&
+ !sysfs_attribute_available(sra, NULL, "sync_action")) ||
+ (freeze &&
+ sysfs_attribute_available(sra, NULL, "sync_action") &&
+ sysfs_get_str(sra, NULL, "sync_action", buf, 20) > 0 &&
+ strcmp(buf, "frozen\n") == 0))
+ /* pass */;
+ else
+ break;
+ }
+
+ if (e) {
+ fprintf(stderr, Name ": failed to freeze subarray%s\n",
+ to_subarray(e, container));
+
+ /* thaw the partially frozen container */
+ for (e2 = ent; e2 && e2 != e; e2 = e2->next) {
+ if (!is_container_member(e2, container))
+ continue;
+ sysfs_free(sra);
+ sra = sysfs_read(-1, e2->devnum, GET_VERSION);
+ if (unblock_subarray(sra, freeze))
+ fprintf(stderr, Name ": Failed to unfreeze %s\n", e2->dev);
+ }
+
+ ping_monitor(container); /* cleared frozen */
+ rv = -1;
+ }
+
+ sysfs_free(sra);
+ free_mdstat(ent);
+ free(container);
+
+ return rv;
+}
+
+void unblock_monitor(char *container, const int unfreeze)
+{
+ struct mdstat_ent *ent, *e;
+ struct mdinfo *sra = NULL;
+
+ ent = mdstat_read(0, 0);
+ if (!ent) {
+ fprintf(stderr, Name
+ ": failed to read /proc/mdstat while unblocking container\n");
+ return;
+ }
+
+ /* unfreeze container contents */
+ for (e = ent; e; e = e->next) {
+ if (!is_container_member(e, container))
+ continue;
+ sysfs_free(sra);
+ sra = sysfs_read(-1, e->devnum, GET_VERSION);
+ if (unblock_subarray(sra, unfreeze))
+ fprintf(stderr, Name ": Failed to unfreeze %s\n", e->dev);
+ }
+ ping_monitor(container);
+
+ sysfs_free(sra);
+ free_mdstat(ent);
+}
+
+
+
/* give the manager a chance to view the updated container state. This
* would naturally happen due to the manager noticing a change in
* /proc/mdstat; however, pinging encourages this detection to happen
diff --git a/msg.h b/msg.h
index f8e89fd..1f916de 100644
--- a/msg.h
+++ b/msg.h
@@ -27,6 +27,8 @@ extern int ack(int fd, int tmo);
extern int wait_reply(int fd, int tmo);
extern int connect_monitor(char *devname);
extern int ping_monitor(char *devname);
+extern int block_monitor(char *container, const int freeze);
+extern void unblock_monitor(char *container, const int unfreeze);
extern int fping_monitor(int sock);
extern int ping_manager(char *devname);

diff --git a/sysfs.c b/sysfs.c
index 6e1d77b..16e41fb 100644
--- a/sysfs.c
+++ b/sysfs.c
@@ -435,6 +435,17 @@ int sysfs_uevent(struct mdinfo *sra, char *event)
return 0;
}

+int sysfs_attribute_available(struct mdinfo *sra, struct mdinfo *dev, char *name)
+{
+ char fname[50];
+ struct stat st;
+
+ sprintf(fname, "/sys/block/%s/md/%s/%s",
+ sra->sys_name, dev?dev->sys_name:"", name);
+
+ return stat(fname, &st) == 0;
+}
+
int sysfs_get_fd(struct mdinfo *sra, struct mdinfo *dev,
char *name)
{
@@ -789,6 +800,28 @@ int sysfs_unique_holder(int devnum, long rdev)
return found;
}

+int sysfs_freeze_array(struct mdinfo *sra)
+{
+ /* Try to freeze resync/rebuild on this array/container.
+ * Return -1 if the array is busy,
+ * return -2 container cannot be frozen,
+ * return 0 if this kernel doesn't support 'frozen'
+ * return 1 if it worked.
+ */
+ char buf[20];
+
+ if (!sysfs_attribute_available(sra, NULL, "sync_action"))
+ return 1; /* no sync_action == frozen */
+ if (sysfs_get_str(sra, NULL, "sync_action", buf, 20) <= 0)
+ return 0;
+ if (strcmp(buf, "idle\n") != 0 &&
+ strcmp(buf, "frozen\n") != 0)
+ return -1;
+ if (sysfs_set_str(sra, NULL, "sync_action", "frozen") < 0)
+ return 0;
+ return 1;
+}
+
#ifndef MDASSEMBLE

static char *clean_states[] = {
diff --git a/util.c b/util.c
index 6f1c1d2..5f2694e 100644
--- a/util.c
+++ b/util.c
@@ -216,6 +216,30 @@ int get_linux_version()
return (a*1000000)+(b*1000)+c;
}

+int mdadm_version(char *version)
+{
+ int a, b, c;
+ char *cp;
+
+ if (!version)
+ version = Version;
+
+ cp = strchr(version, '-');
+ if (!cp || *(cp+1) != ' ' || *(cp+2) != 'v')
+ return -1;
+ cp += 3;
+ a = strtoul(cp, &cp, 10);
+ if (*cp != '.')
+ return -1;
+ b = strtoul(cp+1, &cp, 10);
+ if (*cp != '.')
+ return -1;
+ c = strtoul(cp+1, &cp, 10);
+ if (*cp != ' ')
+ return -1;
+ return (a*1000000)+(b*1000)+c;
+}
+
#ifndef MDASSEMBLE
long long parse_size(char *size)
{


[PATCH 03/13] Manage: allow manual control of external raid0 readonly flag

on 18.11.2010 10:21:45 by krzysztof.wojcik

From: Dan Williams

mdadm --readwrite will clear the external readonly flag ('-'
to '/'), but only for redundant arrays. Allow raid0 arrays as well so
the user has a simple helper to control this flag.
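
The flag flip itself amounts to rewriting one character of the version string. This is a minimal sketch, assuming the subarray's text_version looks like "/md127/0" or "-md127/0". The function name and the static buffer are hypothetical conveniences, and the buffer makes the sketch non-reentrant.

```c
#include <stdio.h>

/* Builds the string to write to metadata_version.
 * Returns NULL if text_version is not a subarray string or is too long. */
static const char *external_version(const char *text_version, int readonly)
{
    static char buf[64];
    int n;

    if (text_version[0] != '/' && text_version[0] != '-')
        return NULL;
    n = snprintf(buf, sizeof(buf), "external:%s", text_version);
    if (n < 0 || (size_t)n >= sizeof(buf))
        return NULL;

    /* index 9 is the first character after "external:" */
    buf[9] = readonly ? '-' : '/';
    return buf;
}
```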

Signed-off-by: Dan Williams
---
Manage.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/Manage.c b/Manage.c
index 6e9d4a0..ac9415b 100644
--- a/Manage.c
+++ b/Manage.c
@@ -56,7 +56,6 @@ int Manage_ro(char *devname, int fd, int readonly)
mdi = sysfs_read(fd, -1, GET_LEVEL|GET_VERSION);
if (mdi &&
mdi->array.major_version == -1 &&
- mdi->array.level > 0 &&
is_subarray(mdi->text_version)) {
char vers[64];
strcpy(vers, "external:");


[PATCH 04/13] Grow: mark some functions static

on 18.11.2010 10:21:53 by krzysztof.wojcik

From: Dan Williams

Going through the Grow API turned up some local routines that could be
marked static.

Signed-off-by: Dan Williams
---
Grow.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/Grow.c b/Grow.c
index 0571f5b..f16228d 100644
--- a/Grow.c
+++ b/Grow.c
@@ -409,7 +409,7 @@ static struct mdp_backup_super {
__u8 pad[512-68-32];
} __attribute__((aligned(512))) bsb, bsb2;

-__u32 bsb_csum(char *buf, int len)
+static __u32 bsb_csum(char *buf, int len)
{
int i;
int csum = 0;
@@ -432,7 +432,7 @@ static int child_same_size(int afd, struct mdinfo *sra, unsigned long blocks,
int disks, int chunk, int level, int layout, int data,
int dests, int *destfd, unsigned long long *destoffsets);

-int freeze_array(struct mdinfo *sra)
+static int freeze_array(struct mdinfo *sra)
{
/* Try to freeze resync on this array.
* Return -1 if the array is busy,
@@ -450,14 +450,14 @@ int freeze_array(struct mdinfo *sra)
return 1;
}

-void unfreeze_array(struct mdinfo *sra, int frozen)
+static void unfreeze_array(struct mdinfo *sra, int frozen)
{
/* If 'frozen' is 1, unfreeze the array */
if (frozen > 0)
sysfs_set_str(sra, NULL, "sync_action", "idle");
}

-void wait_reshape(struct mdinfo *sra)
+static void wait_reshape(struct mdinfo *sra)
{
int fd = sysfs_get_fd(sra, NULL, "sync_action");
char action[20];
@@ -1266,7 +1266,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
*/

/* FIXME return status is never checked */
-int grow_backup(struct mdinfo *sra,
+static int grow_backup(struct mdinfo *sra,
unsigned long long offset, /* per device */
unsigned long stripes, /* per device */
int *sources, unsigned long long *offsets,
@@ -1381,7 +1381,7 @@ int grow_backup(struct mdinfo *sra,
* every works.
*/
/* FIXME return value is often ignored */
-int wait_backup(struct mdinfo *sra,
+static int wait_backup(struct mdinfo *sra,
unsigned long long offset, /* per device */
unsigned long long blocks, /* per device */
unsigned long long blocks2, /* per device - hack */


[PATCH 05/13] Assemble: fix assembly in the delta_disks > max_degraded case

on 18.11.2010 10:22:01 by krzysztof.wojcik

From: Dan Williams

Incremental assembly works on such an array because the kernel sees the
disk as in-sync and that the array is reshaping. Teach Assemble() the
same assumptions.

This is only needed on kernels that do not initialize ->recovery_offset
when activating spares for reshape.
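
The new availability test can be read as a small predicate. This is a minimal sketch, assuming raid_disks is the disk count after the reshape and delta_disks is the number of disks being added. The function name and MAX_SECTOR are hypothetical stand-ins, and the index comparison mirrors the one in the patch.

```c
#include <limits.h>

#define MAX_SECTOR ULLONG_MAX   /* stand-in for mdadm's MaxSector */

/* Returns 1 if the device should be counted as available. */
static int counts_as_available(unsigned long long recovery_start,
                               int reshape_active, int idx,
                               int raid_disks, int delta_disks)
{
    if (recovery_start == MAX_SECTOR)
        return 1;   /* device is fully in sync */

    /* During a reshape, devices in the last delta_disks positions are
     * accepted even if recovery_start was never initialized. */
    return reshape_active && idx >= raid_disks - delta_disks;
}
```

For a 4-to-5 disk grow (raid_disks 5, delta_disks 1), the device at index 4 is accepted while the reshape is active, but a partially recovered device at index 3 is not.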

Signed-off-by: Dan Williams
---
Assemble.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/Assemble.c b/Assemble.c
index afd4e60..409f0d7 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -804,7 +804,9 @@ int Assemble(struct supertype *st, char *mddev,
devices[most_recent].i.events) {
devices[j].uptodate = 1;
if (i < content->array.raid_disks) {
- if (devices[j].i.recovery_start == MaxSector) {
+ if (devices[j].i.recovery_start == MaxSector ||
+ (content->reshape_active &&
+ j >= content->array.raid_disks - content->delta_disks)) {
okcnt++;
avail[i]=1;
} else


[PATCH 06/13] Grow: fix check for raid6 layout normalization

on 18.11.2010 10:22:09 by krzysztof.wojcik

From: Dan Williams

If the user does not specify a layout, don't skip asking about retaining
the non-standard raid6 layout which may be implicitly changed.

Signed-off-by: Dan Williams
---
Grow.c | 11 ++++++-----
1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/Grow.c b/Grow.c
index f16228d..bf634d3 100644
--- a/Grow.c
+++ b/Grow.c
@@ -706,9 +706,9 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,

/* ========= set shape (chunk_size / layout / ndisks) ============== */
/* Check if layout change is a no-op */
- if (layout_str) switch(array.level) {
+ switch(array.level) {
case 5:
- if (array.layout == map_name(r5layout, layout_str))
+ if (layout_str && array.layout == map_name(r5layout, layout_str))
layout_str = NULL;
break;
case 6:
@@ -724,8 +724,9 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
rv = 1;
goto release;
}
- if (strcmp(layout_str, "normalise") == 0 ||
- strcmp(layout_str, "normalize") == 0) {
+ if (layout_str &&
+ (strcmp(layout_str, "normalise") == 0 ||
+ strcmp(layout_str, "normalize") == 0)) {
char *hyphen;
strcpy(alt_layout, map_num(r6layout, array.layout));
hyphen = strrchr(alt_layout, '-');
@@ -735,7 +736,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
}
}

- if (array.layout == map_name(r6layout, layout_str))
+ if (layout_str && array.layout == map_name(r6layout, layout_str))
layout_str = NULL;
if (layout_str && strcmp(layout_str, "preserve") == 0)
layout_str = NULL;


[PATCH 07/13] Grow: add missing raid4 geometries to geo_map()

on 18.11.2010 10:22:17 by krzysztof.wojcik

From: Dan Williams

They are equivalent to their raid5 versions and let the reshape code
optionally use either.
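
The equivalence can be illustrated with a reduced version of the geo_map() switch. This is a minimal sketch covering only the two parity placements touched by the patch. The constant values 4 and 5 are assumed to match the md ALGORITHM_PARITY_0 and ALGORITHM_PARITY_N definitions, and the function name is hypothetical.

```c
#define ALG_PARITY_0 4   /* assumed value of ALGORITHM_PARITY_0 */
#define ALG_PARITY_N 5   /* assumed value of ALGORITHM_PARITY_N */

/* Maps a data block index (or -1 for the parity block) to a device
 * number. Returns -1 for geometries this sketch does not cover. */
static int simple_geo_map(int block, int raid_disks, int level, int layout)
{
    switch (level * 100 + layout) {
    case 400 + ALG_PARITY_N:
    case 500 + ALG_PARITY_N:
        /* parity on the last device, data in order before it */
        return block == -1 ? raid_disks - 1 : block;
    case 400 + ALG_PARITY_0:
    case 500 + ALG_PARITY_0:
        /* parity on device 0 (-1 + 1 == 0), data shifted up by one */
        return block + 1;
    default:
        return -1;
    }
}
```

Because the raid4 and raid5 keys share each case body, a caller gets the same device number whichever level it passes.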

Signed-off-by: Dan Williams
---
restripe.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/restripe.c b/restripe.c
index 3074693..c2fbe5b 100644
--- a/restripe.c
+++ b/restripe.c
@@ -46,6 +46,7 @@ static int geo_map(int block, unsigned long long stripe, int raid_disks,
switch(level*100 + layout) {
case 000:
case 400:
+ case 400 + ALGORITHM_PARITY_N:
case 500 + ALGORITHM_PARITY_N:
/* raid 4 isn't messed around by parity blocks */
if (block == -1)
@@ -75,6 +76,7 @@ static int geo_map(int block, unsigned long long stripe, int raid_disks,
if (block == -1) return pd;
return (pd + 1 + block) % raid_disks;

+ case 400 + ALGORITHM_PARITY_0:
case 500 + ALGORITHM_PARITY_0:
return block + 1;



[PATCH 08/13] fix a get_linux_version() comparison typo

on 18.11.2010 10:22:25 by krzysztof.wojcik

From: Dan Williams
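
A short sketch of why the old constant was a typo, assuming get_linux_version() packs a.b.c as a*1000000 + b*1000 + c (the same formula appears in the util.c context of patch 02). The helper name is hypothetical.

```c
/* Packs a three-part version number the way the return statement of
 * get_linux_version() does. */
static int pack_version(int a, int b, int c)
{
    return a * 1000000 + b * 1000 + c;
}
```

Under this encoding 2.6.16 is 2006016. The old constant 20616 is smaller than the packed value of any 2.x kernel, so the old comparison could not distinguish kernels older than 2.6.16.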

Signed-off-by: Dan Williams
---
mdadm.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mdadm.c b/mdadm.c
index 08e8ea4..e3361ed 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1484,7 +1484,7 @@ int main(int argc, char *argv[])
break;
}
if (delay == 0) {
- if (get_linux_version() > 20616)
+ if (get_linux_version() > 2006016)
/* mdstat responds to poll */
delay = 1000;
else

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH 09/13] Create: cleanup/unify default geometry handling

on 18.11.2010 10:22:33 by krzysztof.wojcik

From: Dan Williams

Support metadata-specific level, layout and chunk size defaults. Remove
unneeded superswitch methods ahead of adding more for the reshape case.
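
The calling convention of the unified callback can be sketched as follows: each parameter is an in/out pointer, NULL means "not asked about", and a value is filled in only if the caller left it unset. This is a minimal sketch; the SK_ constants, the layout mapping, and the platform_chunk parameter are hypothetical stand-ins for UnSet, LEVEL_CONTAINER, the per-format level_to_layout helpers, and the imsm option-ROM default.

```c
#include <stddef.h>

#define SK_UNSET           (-1)    /* hypothetical stand-in for UnSet */
#define SK_LEVEL_CONTAINER (-100)  /* hypothetical stand-in for LEVEL_CONTAINER */

/* Purely illustrative mapping; real handlers have their own tables. */
static int sk_level_to_layout(int level)
{
    return level == 5 ? 0 : SK_UNSET;
}

static void sk_default_geometry(int *level, int *layout, int *chunk,
                                int platform_chunk)
{
    if (level && *level == SK_UNSET)
        *level = SK_LEVEL_CONTAINER;
    if (level && layout && *layout == SK_UNSET)
        *layout = sk_level_to_layout(*level);
    if (chunk && (*chunk == SK_UNSET || *chunk == 0) && platform_chunk > 0)
        *chunk = platform_chunk;
}

/* Usage: ask only for the level, as Create() does when none was given. */
static int sk_default_level_for(int requested)
{
    int level = requested;

    sk_default_geometry(&level, NULL, NULL, 0);
    return level;
}

/* Usage: ask only for the chunk size. */
static int sk_default_chunk_for(int requested, int platform_chunk)
{
    int chunk = requested;

    sk_default_geometry(NULL, NULL, &chunk, platform_chunk);
    return chunk;
}
```

One callback with optional out-parameters replaces the separate default_layout and default_chunk methods, so a caller that needs only one value passes NULL for the others.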

Signed-off-by: Dan Williams
---
Create.c | 21 ++++++---------------
mdadm.h | 8 +++-----
super-ddf.c | 11 ++++++++++-
super-intel.c | 15 +++++++++------
4 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/Create.c b/Create.c
index 2bf7ebe..bc2613a 100644
--- a/Create.c
+++ b/Create.c
@@ -31,8 +31,8 @@ static int default_layout(struct supertype *st, int level, int verbose)
{
int layout = UnSet;

- if (st && st->ss->default_layout)
- layout = st->ss->default_layout(level);
+ if (st && st->ss->default_geometry)
+ st->ss->default_geometry(st, &level, &layout, NULL);

if (layout == UnSet)
switch(level) {
@@ -120,15 +120,8 @@ int Create(struct supertype *st, char *mddev,
int major_num = BITMAP_MAJOR_HI;

memset(&info, 0, sizeof(info));
-
- if (level == UnSet) {
- /* "ddf" and "imsm" metadata only supports one level - should possibly
- * push this into metadata handler??
- */
- if (st && (st->ss == &super_ddf || st->ss == &super_imsm))
- level = LEVEL_CONTAINER;
- }
-
+ if (level == UnSet && st && st->ss->default_geometry)
+ st->ss->default_geometry(st, &level, NULL, NULL);
if (level == UnSet) {
fprintf(stderr,
Name ": a RAID level is needed to create an array.\n");
@@ -235,11 +228,9 @@ int Create(struct supertype *st, char *mddev,
case 6:
case 0:
if (chunk == 0) {
- if (st && st->ss->default_chunk)
- chunk = st->ss->default_chunk(st);
-
+ if (st && st->ss->default_geometry)
+ st->ss->default_geometry(st, NULL, NULL, &chunk);
chunk = chunk ? : 512;
-
if (verbose > 0)
fprintf(stderr, Name ": chunk size defaults to %dK\n", chunk);
}
diff --git a/mdadm.h b/mdadm.h
index f7172e9..a4de06f 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -612,7 +612,7 @@ extern struct superswitch {
* added to validate changing size and new devices. If there are
* inter-device dependencies, it should record sufficient details
* so these can be validated.
- * Both 'size' and '*freesize' are in sectors. chunk is bytes.
+ * Both 'size' and '*freesize' are in sectors. chunk is KiB.
*/
int (*validate_geometry)(struct supertype *st, int level, int layout,
int raiddisks,
@@ -621,10 +621,8 @@ extern struct superswitch {
int verbose);

struct mdinfo *(*container_content)(struct supertype *st);
- /* Allow a metadata handler to override mdadm's default layouts */
- int (*default_layout)(int level); /* optional */
- /* query the supertype for default chunk size */
- int (*default_chunk)(struct supertype *st); /* optional */
+ /* query the supertype for default geometry */
+ void (*default_geometry)(struct supertype *st, int *level, int *layout, int *chunk); /* optional */
/* Permit subarray's to be deleted from inactive containers */
int (*kill_subarray)(struct supertype *st); /* optional */
/* Permit subarray's to be modified */
diff --git a/super-ddf.c b/super-ddf.c
index dba5970..772ca97 100644
--- a/super-ddf.c
+++ b/super-ddf.c
@@ -3653,6 +3653,15 @@ static int ddf_level_to_layout(int level)
}
}

+static void default_geometry_ddf(struct supertype *st, int *level, int *layout, int *chunk)
+{
+ if (level && *level == UnSet)
+ *level = LEVEL_CONTAINER;
+
+ if (level && layout && *layout == UnSet)
+ *layout = ddf_level_to_layout(*level);
+}
+
struct superswitch super_ddf = {
#ifndef MDASSEMBLE
.examine_super = examine_super_ddf,
@@ -3680,7 +3689,7 @@ struct superswitch super_ddf = {
.free_super = free_super_ddf,
.match_metadata_desc = match_metadata_desc_ddf,
.container_content = container_content_ddf,
- .default_layout = ddf_level_to_layout,
+ .default_geometry = default_geometry_ddf,

.external = 1,

diff --git a/super-intel.c b/super-intel.c
index b880a74..7c5fcc4 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -4115,14 +4115,18 @@ static int validate_geometry_imsm(struct supertype *st, int level, int layout,
return 0;
}

-static int default_chunk_imsm(struct supertype *st)
+static void default_geometry_imsm(struct supertype *st, int *level, int *layout, int *chunk)
{
struct intel_super *super = st->sb;

- if (!super->orom)
- return 0;
+ if (level && *level == UnSet)
+ *level = LEVEL_CONTAINER;
+
+ if (level && layout && *layout == UnSet)
+ *layout = imsm_level_to_layout(*level);

- return imsm_orom_default_chunk(super->orom);
+ if (chunk && (*chunk == UnSet || *chunk == 0) && super->orom)
+ *chunk = imsm_orom_default_chunk(super->orom);
}

static void handle_missing(struct intel_super *super, struct imsm_dev *dev);
@@ -5567,7 +5571,6 @@ struct superswitch super_imsm = {
.brief_detail_super = brief_detail_super_imsm,
.write_init_super = write_init_super_imsm,
.validate_geometry = validate_geometry_imsm,
- .default_chunk = default_chunk_imsm,
.add_to_super = add_to_super_imsm,
.detail_platform = detail_platform_imsm,
.kill_subarray = kill_subarray_imsm,
@@ -5588,7 +5591,7 @@ struct superswitch super_imsm = {
.free_super = free_super_imsm,
.match_metadata_desc = match_metadata_desc_imsm,
.container_content = container_content_imsm,
- .default_layout = imsm_level_to_layout,
+ .default_geometry = default_geometry_imsm,

.external = 1,
.name = "imsm",


[PATCH 10/13] Initialize st->devnum and st->container_dev in super_by_fd

on 18.11.2010 10:22:41 by krzysztof.wojcik

From: Dan Williams

Precludes needing to deduce this information later, like in Detail.c and
soon in Grow.c.
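
The information being cached comes from splitting the subarray's text_version. This is a minimal sketch, assuming a string of the form "/md127/0" or "-md127/0", where the first component names the container and the rest names the subarray. Both function names are hypothetical. Unlike the code in the patch, the sketch works on a const string and rejects input that has no second '/'.

```c
#include <stddef.h>
#include <string.h>

/* Returns the length of the container name, or -1 if text_version is
 * not a well-formed subarray string. */
static int member_container_len(const char *text_version)
{
    const char *dev, *slash;

    if (text_version[0] != '/' && text_version[0] != '-')
        return -1;
    dev = text_version + 1;
    slash = strchr(dev, '/');
    if (!slash || slash == dev)
        return -1;
    return (int)(slash - dev);
}

/* Returns a pointer to the subarray name, or NULL on malformed input. */
static const char *member_subarray(const char *text_version)
{
    int len = member_container_len(text_version);

    /* skip the flag character, the container name, and the '/' */
    return len < 0 ? NULL : text_version + 1 + len + 1;
}
```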

Signed-off-by: Dan Williams
---
Detail.c | 11 ++++-------
util.c | 13 ++++++++-----
2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/Detail.c b/Detail.c
index e0817aa..1f2dbf2 100644
--- a/Detail.c
+++ b/Detail.c
@@ -97,16 +97,13 @@ int Detail(char *dev, int brief, int export, int test, char *homehost)
if (st)
max_disks = st->max_devs;

- if (sra && is_subarray(sra->text_version) &&
- strchr(sra->text_version+1, '/')) {
+ if (st->subarray[0]) {
/* This is a subarray of some container.
* We want the name of the container, and the member
*/
- char *s = strchr(sra->text_version+1, '/');
- int dn;
- *s++ = '\0';
- member = s;
- dn = devname2devnum(sra->text_version+1);
+ int dn = st->container_dev;
+
+ member = st->subarray;
container = map_dev(dev2major(dn), dev2minor(dn), 1);
}

diff --git a/util.c b/util.c
index 5f2694e..5023f42 100644
--- a/util.c
+++ b/util.c
@@ -1088,6 +1088,7 @@ struct supertype *super_by_fd(int fd)
char version[20];
int i;
char *subarray = NULL;
+ int container = NoMdDev;

sra = sysfs_read(fd, 0, GET_VERSION);

@@ -1109,15 +1110,15 @@ struct supertype *super_by_fd(int fd)
}
if (minor == -2 && is_subarray(verstr)) {
char *dev = verstr+1;
+
subarray = strchr(dev, '/');
- int devnum;
if (subarray)
*subarray++ = '\0';
- devnum = devname2devnum(dev);
subarray = strdup(subarray);
+ container = devname2devnum(dev);
if (sra)
sysfs_free(sra);
- sra = sysfs_read(-1, devnum, GET_VERSION);
+ sra = sysfs_read(-1, container, GET_VERSION);
if (sra && sra->text_version[0])
verstr = sra->text_version;
else
@@ -1132,11 +1133,13 @@ struct supertype *super_by_fd(int fd)
if (st) {
st->sb = NULL;
if (subarray) {
- strncpy(st->subarray, subarray, 32);
- st->subarray[31] = 0;
+ strncpy(st->subarray, subarray, sizeof(st->subarray));
+ st->subarray[sizeof(st->subarray) - 1] = 0;
free(subarray);
} else
st->subarray[0] = 0;
+ st->container_dev = container;
+ st->devnum = fd2devnum(fd);
}
return st;
}


[PATCH 11/13] Document the external reshape implementation

on 18.11.2010 10:22:51 by krzysztof.wojcik

From: Dan Williams

Signed-off-by: Dan Williams
---
external-reshape-design.txt | 168 +++++++++++++++++++++++++++++++++++++++++++
1 files changed, 168 insertions(+), 0 deletions(-)
create mode 100644 external-reshape-design.txt

diff --git a/external-reshape-design.txt b/external-reshape-design.txt
new file mode 100644
index 0000000..d6fb98d
--- /dev/null
+++ b/external-reshape-design.txt
@@ -0,0 +1,168 @@
+External Reshape
+
+1 Problem statement
+
+External (third-party metadata) reshape differs from native-metadata
+reshape in three key ways:
+
+1.1 Format specific constraints
+
+In the native case reshape is limited by what is implemented in the
+generic reshape routine (Grow_reshape()) and what is supported by the
+kernel. There are exceptional cases where Grow_reshape() may block
+operations when it knows that the kernel implementation is broken, but
+otherwise the kernel is relied upon to be the final arbiter of what
+reshape operations are supported.
+
+In the external case the kernel, and the generic checks in
+Grow_reshape(), become the super-set of what reshapes are possible. The
+metadata format may not support, or have yet to implement a given
+reshape type. The implication for Grow_reshape() is that it must query
+the metadata handler and effect changes in the metadata before the new
+geometry is posted to the kernel. The ->reshape_super method allows
+Grow_reshape() to validate the requested operation and post the metadata
+update.
+
+1.2 Scope of reshape
+
+Native metadata reshape is always performed at the array scope (no
+metadata relationship with sibling arrays on the same disks). External
+reshape, depending on the format, may not allow the number of member
+disks to be changed in a subarray unless the change is simultaneously
+applied to all subarrays in the container. For example the imsm format
+requires all member disks to be a member of all subarrays, so a 4-disk
+raid5 in a container that also houses a 4-disk raid10 array could not be
+reshaped to 5 disks as the imsm format does not support a 5-disk raid10
+representation. This requires the ->reshape_super method to check the
+contents of the array and ask the user to run the reshape at container
+scope (if both subarrays are agreeable to the change), or report an
+error in the case where one subarray cannot support the change.
+
+1.3 Monitoring / checkpointing
+
+Reshape, unlike rebuild/resync, requires strict checkpointing to survive
+interrupted reshape operations. For example when expanding a raid5
+array the first few stripes of the array will be overwritten in a
+destructive manner. When restarting the reshape process we need to know
+the exact location of the last successfully written stripe, and we need
+to restore the data in any partially overwritten stripe. Native
+metadata stores this backup data in the unused portion of spares that
+are being promoted to array members, or in an external backup file
+(located on a non-involved block device).
+
+The kernel is in charge of recording checkpoints of reshape progress,
+but mdadm is delegated the task of managing the backup space which
+involves:
+1/ Identifying what data will be overwritten in the next unit of reshape
+ operation
+2/ Suspending access to that region so that a snapshot of the data can
+ be transferred to the backup space.
+3/ Allowing the kernel to reshape the saved region and setting the
+ boundary for the next backup.
+
+In the external reshape case we want to preserve this mdadm
+'reshape-manager' arrangement, but have a third actor, mdmon, to
+consider. It is tempting to give the role of managing reshape to mdmon,
+but that is counter to its role as a monitor, and conflicts with the
+existing capabilities and role of mdadm to manage the progress of
+reshape. For clarity the external reshape implementation maintains the
+role of mdmon as a (mostly) passive recorder of raid events, and mdadm
+treats it as it would the kernel in the native reshape case (modulo
+needing to send explicit metadata update messages and checking that
+mdmon took the expected action).
+
+External reshape can use the generic md backup file as a fallback, but in the
+optimal/firmware-compatible case the reshape-manager will use the metadata
+specific areas for managing reshape. The implementation also needs to spawn a
+reshape-manager per subarray when the reshape is being carried out at the
+container level. For these two reasons the ->manage_reshape() method is
+introduced. This method in addition to base tasks mentioned above:
+1/ Spawns a manager per-subarray, when necessary
+2/ Uses either generic routines in Grow.c for md-style backup file
+ support, or uses the metadata-format specific location for storing
+ recovery data.
+This aims to avoid a "midlayer mistake"[1] and lets the metadata handler
+optionally take advantage of generic infrastructure in Grow.c
+
+2 Details for specific reshape requests
+
+There are quite a few moving pieces spread out across md, mdadm, and mdmon for
+the support of external reshape, and there are several different types of
+reshape that need to be comprehended by the implementation. A rundown of
+these details follows.
+
+2.0 General provisions:
+
+Obtain an exclusive open on the container to make sure we are not
+running concurrently with a Create() event.
+
+2.1 Freezing sync_action
+
+2.2 Reshape size
+
+ 1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
+ initializes st->update_tail
+ 2/ mdadm::Grow_reshape() calls ->reshape_super() to check that the size change
+ is allowed (being performed at subarray scope / enough room) and prepares a
+ metadata update
+ 3/ mdadm::Grow_reshape(): flushes the metadata update (via
+ flush_metadata_update(), or ->sync_metadata())
+ 4/ mdadm::Grow_reshape(): post the new size to the kernel
+
+
+2.3 Reshape level (simple-takeover)
+
+"simple-takeover" implies the level change can be satisfied without touching
+sync_action
+
+ 1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
+ initializes st->update_tail
+ 2/ mdadm::Grow_reshape() calls ->reshape_super() to check that the level change
+ is allowed (being performed at subarray scope) and prepares a
+ metadata update
+ 2a/ raid10 --> raid0: degrade all mirror legs prior to calling
+ ->reshape_super
+ 3/ mdadm::Grow_reshape(): flushes the metadata update (via
+ flush_metadata_update(), or ->sync_metadata())
+ 4/ mdadm::Grow_reshape(): post the new level to the kernel
+
+2.4 Reshape chunk, layout
+
+2.5 Reshape raid disks (grow)
+
+ 1/ mdadm::Grow_reshape(): unconditionally initializes st->update_tail
+ because only redundant raid levels can modify the number of raid disks
+ 2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level
+ change is allowed (being performed at proper scope / permissible
+ geometry / proper spares available in the container) and prepares a metadata
+ update.
+ 3/ mdadm::Grow_reshape(): Converts each subarray in the container to the
+ raid level that can perform the reshape and starts mdmon.
+ 4/ mdadm::Grow_reshape(): Pushes the update to mdmon...
+ 4a/ mdmon::process_update(): marks the array as reshaping
+ 4b/ mdmon::manage_member(): adds the spares (without assigning a slot)
+ 5/ mdadm::Grow_reshape(): Notes that mdmon has assigned spares and invokes
+ ->manage_reshape()
+ 5/ mdadm::->manage_reshape(): (for each subarray) sets sync_max to
+ zero, starts the reshape, and pings mdmon
+ 5a/ mdmon::read_and_act(): notices that reshape has started and notifies
+ the metadata handler to record the slots chosen by the kernel
+ 6/ mdadm::->manage_reshape(): saves data that will be overwritten by
+ the kernel to either the backup file or the metadata specific location,
+ advances sync_max, waits for reshape, ping mdmon, repeat.
+ 6a/ mdmon::read_and_act(): records checkpoints
+ 7/ mdadm::->manage_reshape(): Once reshape completes changes the raid
+ level back to the nominal raid level (if necessary)
+
+ FIXME: native metadata does not have the capability to record the original
+ raid level in reshape-restart case because the kernel always records current
+ raid level to the metadata, whereas external metadata can masquerade at an
+ alternate level based on the reshape state.
+
+2.6 Reshape raid disks (shrink)
+
+3 TODO
+
+...
+
+[1]: Linux kernel design patterns - part 3, Neil Brown http://lwn.net/Articles/336262/
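
The backup-management loop described in section 1.3 of the design document (identify the next region, suspend access and snapshot it, then let the kernel advance) can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: every name below (toy_reshape, toy_kernel_reshape, toy_manage_reshape) is hypothetical, the "kernel" is a function that destructively rewrites bytes, and nothing here is mdadm or md code. The suspend_lo/suspend_hi/sync_max fields merely mirror the roles those boundaries play in the description.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: hypothetical names, not mdadm structures. */
struct toy_reshape {
	unsigned char *data;	/* stands in for the array contents */
	size_t size;		/* total bytes to reshape */
	size_t unit;		/* bytes covered by one backup/reshape step */
	size_t suspend_lo;	/* start of the region where I/O is held off */
	size_t suspend_hi;	/* end of that region */
	size_t sync_max;	/* the "kernel" must not reshape past this */
	size_t checkpoint;	/* everything below this is safely reshaped */
};

/* Stand-in for the kernel: destructively rewrites [from, sync_max). */
static void toy_kernel_reshape(struct toy_reshape *r, size_t from)
{
	size_t i;

	for (i = from; i < r->sync_max; i++)
		r->data[i] ^= 0xff;
	r->checkpoint = r->sync_max;	/* the kernel records the checkpoint */
}

/* Stand-in for the reshape manager. 'backup' must hold r->unit bytes.
 * Returns the number of backup/reshape steps performed. */
size_t toy_manage_reshape(struct toy_reshape *r, unsigned char *backup)
{
	size_t steps = 0;

	assert(r->unit > 0);
	r->checkpoint = 0;
	r->sync_max = 0;

	while (r->checkpoint < r->size) {
		/* 1/ identify what the next unit of reshape will overwrite */
		size_t lo = r->checkpoint;
		size_t hi = lo + r->unit;

		if (hi > r->size)
			hi = r->size;

		/* 2/ suspend access to that region and snapshot it */
		r->suspend_lo = lo;
		r->suspend_hi = hi;
		memcpy(backup, r->data + lo, hi - lo);

		/* 3/ let the kernel reshape the saved region, which also
		 * sets the boundary for the next backup */
		r->sync_max = hi;
		toy_kernel_reshape(r, lo);

		/* release the suspended region before the next pass */
		r->suspend_lo = r->suspend_hi = 0;
		steps++;
	}
	return steps;
}
```

In this model an interruption between the memcpy() and the checkpoint update leaves the original bytes of the in-flight region in 'backup', which is the property the real backup space has to provide on restart.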


[PATCH 12/13] External reshape (step 1): container reshape and ->reshape_super()

on 18.11.2010 10:22:59 by krzysztof.wojcik

From: Dan Williams

In the native metadata case Grow_reshape() and the kernel validate what
reshapes are possible / supported and the kernel handles all the metadata
updates. In the external case the metadata format may have specific
constraints above this baseline. External formats also introduce the
constraint of only permitting some reshapes at container scope versus subarray
scope. For example, imsm changes to 'raiddisks' must be applied to all arrays
in the container.

This operation assumes that its 'st' parameter has been obtained from
super_by_fd() (such that st->subarray is up to date), and that a snapshot of
the metadata has been loaded from the container.

Why a new method, versus extending an existing one?
->validate_geometry: this routine assumes it is being called from Create(),
adding reshape complicates the cases that this routine needs to handle. Where
we find that checks can be shared between the two cases, those routines are
refactored into common code internal to the metadata handler, i.e. no need to
provide a unified external interface. ->validate_geometry() also does not
expect to update the metadata.

->update_super: this is meant to update single fields at Assembly() and only at
the container scope. Reshape potentially wants to update multiple fields at
either container or subarray scope.

Signed-off-by: Dan Williams
---
Grow.c | 390 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
mdadm.h | 9 +
2 files changed, 391 insertions(+), 8 deletions(-)

diff --git a/Grow.c b/Grow.c
index bf634d3..59032ef 100644
--- a/Grow.c
+++ b/Grow.c
@@ -474,8 +474,222 @@ static void wait_reshape(struct mdinfo *sra)
}
} while (strncmp(action, "reshape", 7) == 0);
}
-
-
+
+static int reshape_super(struct supertype *st, long long size, int level,
+ int layout, int chunksize, int raid_disks,
+ char *backup_file, char *dev, int verbose)
+{
+ /* nothing extra to check in the native case */
+ if (!st->ss->external)
+ return 0;
+ if (!st->ss->reshape_super ||
+ !st->ss->manage_reshape) {
+ fprintf(stderr, Name ": %s metadata does not support reshape\n",
+ st->ss->name);
+ return 1;
+ }
+
+ return st->ss->reshape_super(st, size, level, layout, chunksize,
+ raid_disks, backup_file, dev, verbose);
+}
+
+static void sync_metadata(struct supertype *st)
+{
+ if (st->ss->external) {
+ if (st->update_tail)
+ flush_metadata_updates(st);
+ else
+ st->ss->sync_metadata(st);
+ }
+}
+
+static int subarray_set_num(char *container, struct mdinfo *sra, char *name, int n)
+{
+ /* when dealing with external metadata subarrays we need to be
+ * prepared to handle EAGAIN. The kernel may need to wait for
+ * mdmon to mark the array active so the kernel can handle
+ * allocations/writeback when preparing the reshape action
+ * (md_allow_write()). We temporarily disable safe_mode_delay
+ * to close a race with the array_state going clean before the
+ * next write to raid_disks / stripe_cache_size
+ */
+ char safe[50];
+ int rc;
+
+ /* only 'raid_disks' and 'stripe_cache_size' trigger md_allow_write */
+ if (strcmp(name, "raid_disks") != 0 &&
+ strcmp(name, "stripe_cache_size") != 0)
+ return sysfs_set_num(sra, NULL, name, n);
+
+ rc = sysfs_get_str(sra, NULL, "safe_mode_delay", safe, sizeof(safe));
+ if (rc <= 0)
+ return -1;
+ sysfs_set_num(sra, NULL, "safe_mode_delay", 0);
+ rc = sysfs_set_num(sra, NULL, name, n);
+ if (rc < 0 && errno == EAGAIN) {
+ ping_monitor(container);
+ /* if we get EAGAIN here then the monitor is not active
+ * so stop trying
+ */
+ rc = sysfs_set_num(sra, NULL, name, n);
+ }
+ sysfs_set_str(sra, NULL, "safe_mode_delay", safe);
+ return rc;
+}
+
+static int reshape_container_raid_disks(char *container, int raid_disks)
+{
+ /* for each subarray switch to a raid level that can
+ * support the reshape, and set raid disks
+ */
+ struct mdstat_ent *ent, *e;
+ int changed = 0, rv = 0, err = 0;
+
+ ent = mdstat_read(1, 0);
+ if (!ent) {
+ fprintf(stderr, Name ": unable to read /proc/mdstat\n");
+ return -1;
+ }
+
+ changed = 0;
+ for (e = ent; e; e = e->next) {
+ struct mdinfo *sub;
+ unsigned int cache;
+ int level, takeover_delta = 0;
+
+ if (!is_container_member(e, container))
+ continue;
+
+ level = map_name(pers, e->level);
+ if (level == 0) {
+ sub = sysfs_read(-1, e->devnum, GET_VERSION);
+ if (!sub)
+ break;
+ /* metadata records 'orig_level' */
+ rv = sysfs_set_num(sub, NULL, "level", 4);
+ if (rv < 0) {
+ err = errno;
+ break;
+ }
+ /* we want spares to be used for capacity
+ * expansion, not rebuild
+ */
+ takeover_delta = 1;
+
+ sysfs_free(sub);
+ level = 4;
+ }
+
+ sub = NULL;
+ switch (level) {
+ default:
+ rv = -1;
+ break;
+ case 4:
+ case 5:
+ case 6:
+ sub = sysfs_read(-1, e->devnum, GET_CHUNK|GET_CACHE);
+ if (!sub)
+ break;
+ cache = (sub->array.chunk_size / 4096) * 4;
+ if (cache > sub->cache_size)
+ rv = subarray_set_num(container, sub,
+ "stripe_cache_size", cache);
+ if (rv) {
+ err = errno;
+ break;
+ }
+ /* fall through */
+ case 1:
+ if (!sub)
+ sub = sysfs_read(-1, e->devnum, GET_VERSION);
+ if (!sub)
+ break;
+
+ rv = subarray_set_num(container, sub, "raid_disks",
+ raid_disks + takeover_delta);
+ if (rv)
+ err = errno;
+ else
+ changed++;
+ break;
+ }
+ sysfs_free(sub);
+ if (rv)
+ break;
+ }
+ free_mdstat(ent);
+ if (rv) {
+ fprintf(stderr, Name
+ ": failed to initiate container reshape%s%s\n",
+ err ? ": " : "", err ? strerror(err) : "");
+ return rv;
+ }
+
+ return changed;
+}
+
+static void revert_container_raid_disks(struct supertype *st, int fd, char *container)
+{
+ /* we failed to prepare all subarrays in the container for
+ * reshape, so cancel the changes and restore the nominal raid
+ * level
+ */
+ struct mdstat_ent *ent, *e;
+
+ ent = mdstat_read(0, 0);
+ if (!ent) {
+ fprintf(stderr, Name
+ ": failed to read /proc/mdstat while aborting reshape\n");
+ return;
+ }
+
+ for (e = ent; e; e = e->next) {
+ int level_fixed = 0, disks_fixed = 0;
+ struct mdinfo *sub, prev;
+
+ if (!is_container_member(e, container))
+ continue;
+
+ st->ss->free_super(st);
+ sprintf(st->subarray, "%s", to_subarray(e, container));
+ if (st->ss->load_super(st, fd, NULL)) {
+ fprintf(stderr, Name
+ ": failed read metadata while aborting reshape\n");
+ continue;
+ }
+ st->ss->getinfo_super(st, &prev);
+
+ /* changing level might change raid_disks so we do it
+ * first and then check if raid_disks still needs fixing
+ */
+ if (map_name(pers, e->level) != prev.array.level) {
+ sub = sysfs_read(-1, e->devnum, GET_VERSION);
+ if (sub &&
+ !sysfs_set_num(sub, NULL, "level", prev.array.level))
+ level_fixed = 1;
+ sysfs_free(sub);
+ } else
+ level_fixed = 1;
+
+ sub = sysfs_read(-1, e->devnum, GET_DISKS);
+ if (sub && sub->array.raid_disks != prev.array.raid_disks) {
+ if (!subarray_set_num(container, sub, "raid_disks",
+ prev.array.raid_disks))
+ disks_fixed = 1;
+ } else if (sub)
+ disks_fixed = 1;
+ sysfs_free(sub);
+
+ if (!disks_fixed || !level_fixed)
+ fprintf(stderr, Name
+ ": failed to restore %s to a %d-disk %s array\n",
+ e->dev, prev.array.raid_disks,
+ map_num(pers, prev.array.level));
+ }
+ free_mdstat(ent);
+}
+
int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
long long size,
int level, char *layout_str, int chunksize, int raid_disks)
@@ -518,6 +732,8 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
unsigned long cache;
unsigned long long array_size;
int changed = 0;
+ char *container = NULL;
+ int cfd = -1;
int done;

struct mdinfo *sra;
@@ -545,10 +761,65 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
" Please use a newer kernel\n");
return 1;
}
+
+ st = super_by_fd(fd);
+ if (!st) {
+ fprintf(stderr, Name ": Unable to determine metadata format for %s\n", devname);
+ return 1;
+ }
+
+ /* in the external case we need to check that the requested reshape is
+ * supported, and perform an initial check that the container holds the
+ * pre-requisite spare devices (mdmon owns final validation)
+ */
+ if (st->ss->external) {
+ int container_dev;
+
+ if (st->subarray[0]) {
+ container_dev = st->container_dev;
+ cfd = open_dev_excl(st->container_dev);
+ } else if (size >= 0 || layout_str != NULL || chunksize != 0 ||
+ level != UnSet) {
+ fprintf(stderr,
+ Name ": %s is a container, only 'raid-devices' can be changed\n",
+ devname);
+ return 1;
+ } else {
+ container_dev = st->devnum;
+ close(fd);
+ cfd = open_dev_excl(st->devnum);
+ fd = cfd;
+ }
+ if (cfd < 0) {
+ fprintf(stderr, Name ": Unable to open container for %s\n",
+ devname);
+ return 1;
+ }
+
+ container = devnum2devname(st->devnum);
+ if (!container) {
+ fprintf(stderr, Name ": Could not determine container name\n");
+ return 1;
+ }
+
+ if (st->ss->load_super(st, cfd, NULL)) {
+ fprintf(stderr, Name ": Cannot read superblock for %s\n",
+ devname);
+ return 1;
+ }
+
+ if (mdmon_running(container_dev))
+ st->update_tail = &st->updates;
+ }
+
sra = sysfs_read(fd, 0, GET_LEVEL);
- if (sra)
+ if (sra) {
+ if (st->ss->external && st->subarray[0] == 0) {
+ array.level = LEVEL_CONTAINER;
+ sra->array.level = LEVEL_CONTAINER;
+ }
frozen = freeze_array(sra);
- else {
+ } else {
fprintf(stderr, Name ": failed to read sysfs parameters for %s\n",
devname);
return 1;
@@ -559,8 +830,16 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
return 1;
}

+
/* ========= set size =============== */
if (size >= 0 && (size == 0 || size != array.size)) {
+ long long orig_size = array.size;
+
+ if (reshape_super(st, size, UnSet, UnSet, 0, 0, NULL, devname, !quiet)) {
+ rv = 1;
+ goto release;
+ }
+ sync_metadata(st);
array.size = size;
if (array.size != size) {
/* got truncated to 32bit, write to
@@ -575,6 +854,11 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
rv = ioctl(fd, SET_ARRAY_INFO, &array);
if (rv != 0) {
int err = errno;
+
+ /* restore metadata */
+ if (reshape_super(st, orig_size, UnSet, UnSet, 0, 0,
+ NULL, devname, !quiet) == 0)
+ sync_metadata(st);
fprintf(stderr, Name ": Cannot set device size for %s: %s\n",
devname, strerror(err));
if (err == EBUSY &&
@@ -591,7 +875,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
fprintf(stderr, Name ": component size of %s has been set to %lluK\n",
devname, size);
changed = 1;
- } else {
+ } else if (array.level != LEVEL_CONTAINER) {
size = get_component_size(fd)/2;
if (size == 0)
size = array.size;
@@ -674,6 +958,13 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
} else
layout_str = "parity-last";
} else {
+ /* Level change is a simple takeover. In the external
+ * case we don't check with the metadata handler until
+ * we establish what the final layout will be. If the
+ * level change is disallowed we will revert to
+ * orig_level without disturbing the metadata, otherwise
+ * we will send an update.
+ */
c = map_num(pers, level);
if (c == NULL) {
rv = 1;/* not possible */
@@ -706,7 +997,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,

/* ========= set shape (chunk_size / layout / ndisks) ============== */
/* Check if layout change is a no-op */
- switch(array.level) {
+ switch (array.level) {
case 5:
if (layout_str && array.layout == map_name(r5layout, layout_str))
layout_str = NULL;
@@ -745,6 +1036,11 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
if (layout_str == NULL
&& (chunksize == 0 || chunksize*1024 == array.chunk_size)
&& (raid_disks == 0 || raid_disks == array.raid_disks)) {
+ if (reshape_super(st, -1, level, UnSet, 0, 0, NULL, devname, !quiet)) {
+ rv = 1;
+ goto release;
+ }
+ sync_metadata(st);
rv = 0;
if (level != UnSet && level != array.level) {
/* Looks like this level change doesn't need
@@ -766,18 +1062,69 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
} else if (!changed && !quiet)
fprintf(stderr, Name ": %s: no change requested\n",
devname);
+
+ if (st->ss->external && !mdmon_running(st->container_dev) &&
+ level > 0) {
+ start_mdmon(st->container_dev);
+ ping_monitor(container);
+ }
goto release;
}

c = map_num(pers, array.level);
if (c == NULL) c = "-unknown-";
- switch(array.level) {
+ switch (array.level) {
default: /* raid0, linear, multipath cannot be reconfigured */
fprintf(stderr, Name ": %s array %s cannot be reshaped.\n",
c, devname);
+ /* TODO raid0 raiddisks can be reshaped via raid4 */
rv = 1;
break;
+ case LEVEL_CONTAINER: {
+ int count;
+
+ /* double check that we are not changing anything but raid_disks */
+ if (size >= 0 || layout_str != NULL || chunksize != 0 || level != UnSet) {
+ fprintf(stderr,
+ Name ": %s is a container, only 'raid-devices' can be changed\n",
+ devname);
+ rv = 1;
+ goto release;
+ }
+
+ st->update_tail = &st->updates;
+ if (reshape_super(st, -1, UnSet, UnSet, 0, raid_disks,
+ backup_file, devname, !quiet)) {
+ rv = 1;
+ goto release;
+ }
+
+ count = reshape_container_raid_disks(container, raid_disks);
+ if (count < 0) {
+ revert_container_raid_disks(st, fd, container);
+ rv = 1;
+ goto release;
+ } else if (count == 0) {
+ if (!quiet)
+ fprintf(stderr, Name
+ ": no active subarrays to reshape\n");
+ goto release;
+ }

+ if (!mdmon_running(st->devnum)) {
+ start_mdmon(st->devnum);
+ ping_monitor(container);
+ }
+ sync_metadata(st);
+
+ /* give mdmon a chance to allocate spares */
+ ping_manager(container);
+
+ /* manage_reshape takes care of releasing the array(s) */
+ st->ss->manage_reshape(st, backup_file);
+ frozen = 0;
+ goto release;
+ }
case LEVEL_FAULTY: /* only 'layout' change is permitted */

if (chunksize || raid_disks) {
@@ -813,6 +1160,12 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
break;
}
if (raid_disks > 0) {
+ if (reshape_super(st, -1, UnSet, UnSet, 0, raid_disks,
+ NULL, devname, !quiet)) {
+ rv = 1;
+ goto release;
+ }
+ sync_metadata(st);
array.raid_disks = raid_disks;
if (ioctl(fd, SET_ARRAY_INFO, &array) != 0) {
fprintf(stderr, Name ": Cannot set raid-devices for %s: %s\n",
@@ -830,7 +1183,6 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
* layout/chunksize/raid_disks can be changed
* though the kernel may not support it all.
*/
- st = super_by_fd(fd);

/*
* There are three possibilities.
@@ -1024,6 +1376,12 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
}
}
if (backup_file == NULL) {
+ if (st->ss->external && !st->ss->manage_reshape) {
+ fprintf(stderr, Name ": %s Grow operation not supported by %s metadata\n",
+ devname, st->ss->name);
+ rv = 1;
+ break;
+ }
if (ndata <= odata) {
fprintf(stderr, Name ": %s: Cannot grow - need backup-file\n",
devname);
@@ -1072,6 +1430,13 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
d++;
}

+ /* check that the operation is supported by the metadata */
+ if (reshape_super(st, -1, level, nlayout, nchunk, ndisks,
+ backup_file, devname, !quiet)) {
+ rv = 1;
+ break;
+ }
+
/* lastly, check that the internal stripe cache is
* large enough, or it won't work.
*/
@@ -1088,6 +1453,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
* If only changing raid_disks, use ioctl, else use
* sysfs.
*/
+ sync_metadata(st);
if (ochunk == nchunk && olayout == nlayout) {
array.raid_disks = ndisks;
if (ioctl(fd, SET_ARRAY_INFO, &array) != 0) {
@@ -1136,6 +1502,14 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
break;
}

+ if (st->ss->external) {
+ /* metadata handler takes it from here */
+ ping_manager(container);
+ st->ss->manage_reshape(st, backup_file);
+ frozen = 0;
+ break;
+ }
+
/* set up the backup-super-block. This requires the
* uuid from the array.
*/
diff --git a/mdadm.h b/mdadm.h
index a4de06f..64b32cc 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -627,6 +627,15 @@ extern struct superswitch {
int (*kill_subarray)(struct supertype *st); /* optional */
/* Permit subarray's to be modified */
int (*update_subarray)(struct supertype *st, char *update, mddev_ident_t ident); /* optional */
+ /* Check if reshape is supported for this external format.
+ * st is obtained from super_by_fd() where st->subarray[0] is
+ * initialized to indicate if reshape is being performed at the
+ * container or subarray level
+ */
+ int (*reshape_super)(struct supertype *st, long long size, int level,
+ int layout, int chunksize, int raid_disks,
+ char *backup, char *dev, int verbose); /* optional */
+ int (*manage_reshape)(struct supertype *st, char *backup); /* optional */

/* for mdmon */
int (*open_new)(struct supertype *c, struct active_array *a,
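
To make the scope rule from the commit message concrete, here is a minimal sketch of the kind of check a metadata handler's ->reshape_super might perform. It assumes an imsm-like format where 'raiddisks' may only change at container scope and every other parameter only at subarray scope. All names (toy_supertype, toy_reshape_super, TOY_UNSET) are hypothetical stand-ins; the real method takes more arguments and also prepares the metadata update, which is omitted here.

```c
#include <stdio.h>

#define TOY_UNSET (-2)	/* hypothetical sentinel for "not requested" */

/* Hypothetical stand-in for the relevant part of struct supertype. */
struct toy_supertype {
	char subarray[32];	/* empty string means container scope */
};

/* Returns 0 if the request is acceptable at this scope, 1 otherwise.
 * Conventions follow the calls in the patch: size < 0, chunksize == 0
 * and raid_disks == 0 all mean "unchanged". */
int toy_reshape_super(const struct toy_supertype *st, long long size,
		      int level, int layout, int chunksize, int raid_disks,
		      int verbose)
{
	int container_scope = (st->subarray[0] == '\0');

	if (raid_disks > 0 && !container_scope) {
		if (verbose)
			fprintf(stderr, "toy: raid-devices must be changed "
				"at container scope\n");
		return 1;
	}
	if (container_scope &&
	    (size >= 0 || level != TOY_UNSET || layout != TOY_UNSET ||
	     chunksize != 0)) {
		if (verbose)
			fprintf(stderr, "toy: only raid-devices can be "
				"changed at container scope\n");
		return 1;
	}
	return 0;
}
```

Keeping this decision inside the handler is what lets Grow_reshape() stay format-agnostic: it asks, and either proceeds to post the new geometry or bails out.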


[PATCH 13/13] External reshape (step 2): Freeze container

on 18.11.2010 10:23:08 by krzysztof.wojcik

From: Dan Williams

When growing the number of raid disks the reshape process will promote
container-spares to subarray-spares (later the kernel promotes them to
subarray-members in raid5_start_reshape()). The automatic spare
promotion that mdmon performs upon seeing a degraded array must be
disabled until the reshape process has been initiated. Otherwise, mdmon
may start a rebuild before the reshape parameters can be specified.

In the external case we arrange for the monitor to be blocked, and turn off the safemode delay.
Mdmon is updated to check sync_action is not frozen before initiating
recovery. This introduces a need to check which version of mdmon is
running to be sure it honors the expected semantics. Extend
ping_monitor() to report the version of mdmon. This also permits
discrimination of known buggy mdmon implementations in the future.
Note, it's not enough to know the current version of mdadm because the
mdmon instance may have originated from the initrd, so there is no
guarantee that mdadm and mdmon versions are synchronized.

Signed-off-by: Dan Williams
---
Grow.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++--------------
1 files changed, 69 insertions(+), 19 deletions(-)

diff --git a/Grow.c b/Grow.c
index 59032ef..4139265 100644
--- a/Grow.c
+++ b/Grow.c
@@ -432,29 +432,78 @@ static int child_same_size(int afd, struct mdinfo *sra, unsigned long blocks,
int disks, int chunk, int level, int layout, int data,
int dests, int *destfd, unsigned long long *destoffsets);

-static int freeze_array(struct mdinfo *sra)
+static int freeze_container(struct supertype *st)
{
- /* Try to freeze resync on this array.
+ int container_dev = st->subarray[0] ? st->container_dev : st->devnum;
+ char *container = devnum2devname(container_dev);
+
+ if (!container) {
+ fprintf(stderr, Name
+ ": could not determine container name, freeze aborted\n");
+ return -2;
+ }
+
+ if (block_monitor(container, 1)) {
+ fprintf(stderr, Name ": failed to freeze container\n");
+ return -2;
+ }
+
+ return 1;
+}
+
+static void unfreeze_container(struct supertype *st)
+{
+ int container_dev = st->subarray[0] ? st->container_dev : st->devnum;
+ char *container = devnum2devname(container_dev);
+
+ if (!container) {
+ fprintf(stderr, Name
+ ": could not determine container name, unfreeze aborted\n");
+ return;
+ }
+
+ unblock_monitor(container, 1);
+}
+
+static int freeze(struct supertype *st)
+{
+ /* Try to freeze resync/rebuild on this array/container.
* Return -1 if the array is busy,
+ * return -2 container cannot be frozen,
* return 0 if this kernel doesn't support 'frozen'
* return 1 if it worked.
*/
- char buf[20];
- if (sysfs_get_str(sra, NULL, "sync_action", buf, 20) <= 0)
- return 0;
- if (strcmp(buf, "idle\n") != 0 &&
- strcmp(buf, "frozen\n") != 0)
- return -1;
- if (sysfs_set_str(sra, NULL, "sync_action", "frozen") < 0)
- return 0;
- return 1;
+ if (st->ss->external)
+ return freeze_container(st);
+ else {
+ struct mdinfo *sra = sysfs_read(-1, st->devnum, GET_VERSION);
+ int err;
+
+ if (!sra)
+ return -1;
+ err = sysfs_freeze_array(sra);
+ sysfs_free(sra);
+ return err;
+ }
}

-static void unfreeze_array(struct mdinfo *sra, int frozen)
+static void unfreeze(struct supertype *st, int frozen)
{
/* If 'frozen' is 1, unfreeze the array */
- if (frozen > 0)
- sysfs_set_str(sra, NULL, "sync_action", "idle");
+ if (frozen <= 0)
+ return;
+
+ if (st->ss->external)
+ return unfreeze_container(st);
+ else {
+ struct mdinfo *sra = sysfs_read(-1, st->devnum, GET_VERSION);
+
+ if (sra)
+ sysfs_set_str(sra, NULL, "sync_action", "idle");
+ else
+ fprintf(stderr, Name ": failed to unfreeze array\n");
+ sysfs_free(sra);
+ }
}

static void wait_reshape(struct mdinfo *sra)
@@ -818,19 +867,21 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
array.level = LEVEL_CONTAINER;
sra->array.level = LEVEL_CONTAINER;
}
- frozen = freeze_array(sra);
} else {
fprintf(stderr, Name ": failed to read sysfs parameters for %s\n",
devname);
return 1;
}
- if (frozen < 0) {
+ frozen = freeze(st);
+ if (frozen < -1) {
+ /* freeze() already spewed the reason */
+ return 1;
+ } else if (frozen < 0) {
fprintf(stderr, Name ": %s is performing resync/recovery and cannot"
" be reshaped\n", devname);
return 1;
}

-
/* ========= set size =============== */
if (size >= 0 && (size == 0 || size != array.size)) {
long long orig_size = array.size;
@@ -1611,8 +1662,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
if (c && sysfs_set_str(sra, NULL, "level", c) == 0)
fprintf(stderr, Name ": aborting level change\n");
}
- if (sra)
- unfreeze_array(sra, frozen);
+ unfreeze(st, frozen);
return rv;
}
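
The return-value convention that freeze() introduces (-2, -1, 0, 1) and the way Grow_reshape() reacts to it can be summarised in a small sketch. The enum and function names are hypothetical and exist only to spell out the four cases from the comment in the patch; they are not part of mdadm.

```c
#include <assert.h>

/* Hypothetical labels for what the caller does with freeze()'s result. */
enum toy_freeze_action {
	TOY_ABORT_QUIET,	/* -2: container freeze failed, already reported */
	TOY_ABORT_REPORT,	/* -1: array busy with resync/recovery */
	TOY_PROCEED_UNFROZEN,	/*  0: kernel does not support 'frozen' */
	TOY_PROCEED_FROZEN	/*  1: freeze worked, must unfreeze later */
};

enum toy_freeze_action toy_classify_freeze(int frozen)
{
	if (frozen < -1)
		return TOY_ABORT_QUIET;
	if (frozen < 0)
		return TOY_ABORT_REPORT;
	if (frozen == 0)
		return TOY_PROCEED_UNFROZEN;
	return TOY_PROCEED_FROZEN;
}

/* Mirrors the guard at the top of unfreeze(): only a successful
 * freeze needs to be undone. */
int toy_needs_unfreeze(int frozen)
{
	assert(frozen >= -2 && frozen <= 1);
	return frozen > 0;
}
```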



Re: [PATCH 02/13] block monitor: freeze spare assignment for external arrays

on 23.11.2010 05:03:35 by NeilBrown

I've applied 1/13 to devel-3.2
and I have applied this (2/13) with a few changes:

> + version = ping_monitor_version(container);
> + ver = version ? mdadm_version(version) : -1;
> + free(version);
> + if (ver < 3001003) {

I changed this to 3002000 and changed ReadMe.c so the version is 3.2-devel.

>
> +int mdadm_version(char *version)
> +{
> + int a, b, c;
> + char *cp;
> +
> + if (!version)
> + version = Version;
> +
> + cp = strchr(version, '-');
> + if (!cp || *(cp+1) != ' ' || *(cp+2) != 'v')
> + return -1;
> + cp += 3;
> + a = strtoul(cp, &cp, 10);
> + if (*cp != '.')
> + return -1;
> + b = strtoul(cp+1, &cp, 10);
> + if (*cp != '.')
> + return -1;
> + c = strtoul(cp+1, &cp, 10);
> + if (*cp != ' ')
> + return -1;
> + return (a*1000000)+(b*1000)+c;
> +}
> +

I have fixed this so that it accepts two-number versions, and ignores a trailing
"-tag", so 3.2-devel is parsed OK.
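
For illustration, a relaxed parser along those lines might look like the sketch below. This is not the code that was committed: version_sketch() is a hypothetical name, and the accepted banner shape ("name - vA.B[.C][-tag] ...") is an assumption inferred from the quoted function above.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch, not the committed mdadm code.
 * Parses a banner of the form "<name> - v<a>.<b>[.<c>][-tag] ..." and
 * returns a*1000000 + b*1000 + c, or -1 if the banner does not match. */
int version_sketch(const char *banner)
{
	unsigned long a, b, c = 0;
	const char *cp;
	char *end;

	if (!banner)
		return -1;
	cp = strchr(banner, '-');
	if (!cp || cp[1] != ' ' || cp[2] != 'v')
		return -1;
	cp += 3;

	/* major and minor components are mandatory */
	if (!isdigit((unsigned char)*cp))
		return -1;
	a = strtoul(cp, &end, 10);
	if (*end != '.' || !isdigit((unsigned char)end[1]))
		return -1;
	b = strtoul(end + 1, &end, 10);

	/* the third component is optional */
	if (*end == '.') {
		if (!isdigit((unsigned char)end[1]))
			return -1;
		c = strtoul(end + 1, &end, 10);
	}

	/* skip a trailing "-tag" such as "-devel" */
	if (*end == '-')
		while (*end && *end != ' ')
			end++;

	if (*end != ' ' && *end != '\0')
		return -1;
	return (int)(a * 1000000 + b * 1000 + c);
}
```

With this shape a "3.2-devel" banner yields 3002000, so it passes a "ver < 3002000" gate while a 3.1.4 banner (3001004) does not.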

NeilBrown


Re: [PATCH 03/13] Manage: allow manual control of external raid0 readonly flag

on 23.11.2010 05:08:22 by NeilBrown

On Thu, 18 Nov 2010 10:21:45 +0100
Krzysztof Wojcik wrote:

> From: Dan Williams
>
> mdadm --readwrite will clear the external readonly flag ('-'
> to '/'), but only for redundant arrays. Allow raid0 arrays as well so
> the user has a simple helper to control this flag.
>
> Signed-off-by: Dan Williams
> ---
> Manage.c | 1 -
> 1 files changed, 0 insertions(+), 1 deletions(-)
>
> diff --git a/Manage.c b/Manage.c
> index 6e9d4a0..ac9415b 100644
> --- a/Manage.c
> +++ b/Manage.c
> @@ -56,7 +56,6 @@ int Manage_ro(char *devname, int fd, int readonly)
> mdi = sysfs_read(fd, -1, GET_LEVEL|GET_VERSION);
> if (mdi &&
> mdi->array.major_version == -1 &&
> - mdi->array.level > 0 &&
> is_subarray(mdi->text_version)) {
> char vers[64];
> strcpy(vers, "external:");
>

I've added:
diff --git a/Manage.c b/Manage.c
index 919dc01..a203ec9 100644
--- a/Manage.c
+++ b/Manage.c
@@ -87,6 +87,8 @@ int Manage_ro(char *devname, int fd, int readonly)
if (*cp)
*cp = 0;
ping_monitor(vers+10);
+ if (mdi->array.level <= 0)
+ sysfs_set_str(mdi, NULL, "array_state", "active");
}
return 0;
}


so that you can set raid0 to --readonly again. Yell if you think that is
wrong.
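
For readers unfamiliar with the flag being discussed: the commit message
describes it as the character that flips between '-' and '/'. A minimal
sketch of building the "external:" string follows. The helper name
external_version_string() is hypothetical, and the assumption is that a
subarray's text_version starts with '/' (read-write) or '-' (read-only).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "external:" + text_version with the
 * read-only flag character set as requested.
 * Returns 0 on success, -1 if text_version does not look like a
 * subarray or the output buffer is too small.
 */
int external_version_string(const char *text_version, int readonly,
			    char *out, size_t outlen)
{
	int n;

	assert(text_version != NULL && out != NULL);

	/* assumed subarray form: "/container/index" or "-container/index" */
	if (text_version[0] != '/' && text_version[0] != '-')
		return -1;

	n = snprintf(out, outlen, "external:%s", text_version);
	if (n < 0 || (size_t)n >= outlen)
		return -1;

	/* "external:" is 9 characters, so index 9 holds the flag */
	out[9] = readonly ? '-' : '/';
	return 0;
}
```

The container name handed to ping_monitor() in the hunk above is then the
part of the string starting one character past the flag (vers+10).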

NeilBrown

Re: [PATCH 07/13] Grow: add missing raid4 geometries to geo_map()

am 23.11.2010 05:16:32 von NeilBrown

On Thu, 18 Nov 2010 10:22:17 +0100
Krzysztof Wojcik wrote:

> From: Dan Williams
>
> They are equivalent to their raid5 versions and let the reshape code
> optionally use either.
>
> Signed-off-by: Dan Williams
> ---
> restripe.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/restripe.c b/restripe.c
> index 3074693..c2fbe5b 100644
> --- a/restripe.c
> +++ b/restripe.c
> @@ -46,6 +46,7 @@ static int geo_map(int block, unsigned long long stripe, int raid_disks,
> switch(level*100 + layout) {
> case 000:
> case 400:
> + case 400 + ALGORITHM_PARITY_N:
> case 500 + ALGORITHM_PARITY_N:
> /* raid 4 isn't messed around by parity blocks */
> if (block == -1)
> @@ -75,6 +76,7 @@ static int geo_map(int block, unsigned long long stripe, int raid_disks,
> if (block == -1) return pd;
> return (pd + 1 + block) % raid_disks;
>
> + case 400 + ALGORITHM_PARITY_0:
> case 500 + ALGORITHM_PARITY_0:
> return block + 1;
>

I'm not sure about this.
In the kernel, raid4 ignores the 'layout'. So it seems safest to do the same
in mdadm, and always use a level of '5' when we want PARITY_0.

Why is this needed exactly?
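
To make the alternative concrete, here is a rough sketch of normalising the
geometry before the switch instead of adding raid4 cases. geo_key() is a
hypothetical helper, and the two ALGORITHM_* values are the kernel's raid5
layout numbers as far as I know.

```c
#include <assert.h>

/* Layout numbers as used by the kernel's raid5 code (assumed values) */
#define ALGORITHM_PARITY_0 4
#define ALGORITHM_PARITY_N 5

/* Hypothetical helper: compute the level*100 + layout key that
 * geo_map() switches on, treating raid4 as raid5 with parity on the
 * last device so that the raid4 'layout' value is ignored.
 */
int geo_key(int level, int layout)
{
	/* geo_map() only deals with striped levels */
	assert(level == 0 || level == 4 || level == 5 || level == 6);

	if (level == 4) {
		level = 5;
		layout = ALGORITHM_PARITY_N;
	}
	return level * 100 + layout;
}
```

A caller that really wants parity on the first device would then ask for
level 5 with ALGORITHM_PARITY_0 explicitly, and no 4xx case other than the
existing ones would ever be reached.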

Thanks,
NeilBrown

Re: [PATCH 11/13] Document the external reshape implementation

am 23.11.2010 05:52:29 von NeilBrown

On Thu, 18 Nov 2010 10:22:51 +0100
Krzysztof Wojcik wrote:

> From: Dan Williams
>
> Signed-off-by: Dan Williams
> ---
> external-reshape-design.txt | 168 +++++++++++++++++++++++++++++++++++++++++++

*very* happy to get this sort of documentation.
Hopefully the holes will be filled in in due course(?).
There is no mention of starting mdmon when converting e.g. raid0 to raid10.

But applied as-is.

Thanks,
NeilBrown


> 1 files changed, 168 insertions(+), 0 deletions(-)
> create mode 100644 external-reshape-design.txt
>
> diff --git a/external-reshape-design.txt b/external-reshape-design.txt
> new file mode 100644
> index 0000000..d6fb98d
> --- /dev/null
> +++ b/external-reshape-design.txt
> @@ -0,0 +1,168 @@
> +External Reshape
> +
> +1 Problem statement
> +
> +External (third-party metadata) reshape differs from native-metadata
> +reshape in three key ways:
> +
> +1.1 Format specific constraints
> +
> +In the native case reshape is limited by what is implemented in the
> +generic reshape routine (Grow_reshape()) and what is supported by the
> +kernel. There are exceptional cases where Grow_reshape() may block
> +operations when it knows that the kernel implementation is broken, but
> +otherwise the kernel is relied upon to be the final arbiter of what
> +reshape operations are supported.
> +
> +In the external case the kernel, and the generic checks in
> +Grow_reshape(), become the super-set of what reshapes are possible. The
> +metadata format may not support, or have yet to implement a given
> +reshape type. The implication for Grow_reshape() is that it must query
> +the metadata handler and effect changes in the metadata before the new
> +geometry is posted to the kernel. The ->reshape_super method allows
> +Grow_reshape() to validate the requested operation and post the metadata
> +update.
> +
> +1.2 Scope of reshape
> +
> +Native metadata reshape is always performed at the array scope (no
> +metadata relationship with sibling arrays on the same disks). External
> +reshape, depending on the format, may not allow the number of member
> +disks to be changed in a subarray unless the change is simultaneously
> +applied to all subarrays in the container. For example the imsm format
> +requires all member disks to be a member of all subarrays, so a 4-disk
> +raid5 in a container that also houses a 4-disk raid10 array could not be
> +reshaped to 5 disks as the imsm format does not support a 5-disk raid10
> +representation. This requires the ->reshape_super method to check the
> +contents of the array and ask the user to run the reshape at container
> +scope (if both subarrays are agreeable to the change), or report an
> +error in the case where one subarray cannot support the change.
> +
> +1.3 Monitoring / checkpointing
> +
> +Reshape, unlike rebuild/resync, requires strict checkpointing to survive
> +interrupted reshape operations. For example when expanding a raid5
> +array the first few stripes of the array will be overwritten in a
> +destructive manner. When restarting the reshape process we need to know
> +the exact location of the last successfully written stripe, and we need
> +to restore the data in any partially overwritten stripe. Native
> +metadata stores this backup data in the unused portion of spares that
> +are being promoted to array members, or in an external backup file
> +(located on a non-involved block device).
> +
> +The kernel is in charge of recording checkpoints of reshape progress,
> +but mdadm is delegated the task of managing the backup space which
> +involves:
> +1/ Identifying what data will be overwritten in the next unit of reshape
> + operation
> +2/ Suspending access to that region so that a snapshot of the data can
> + be transferred to the backup space.
> +3/ Allowing the kernel to reshape the saved region and setting the
> + boundary for the next backup.
> +
> +In the external reshape case we want to preserve this mdadm
> +'reshape-manager' arrangement, but have a third actor, mdmon, to
> +consider. It is tempting to give the role of managing reshape to mdmon,
> +but that is counter to its role as a monitor, and conflicts with the
> +existing capabilities and role of mdadm to manage the progress of
> +reshape. For clarity the external reshape implementation maintains the
> +role of mdmon as a (mostly) passive recorder of raid events, and mdadm
> +treats it as it would the kernel in the native reshape case (modulo
> +needing to send explicit metadata update messages and checking that
> +mdmon took the expected action).
> +
> +External reshape can use the generic md backup file as a fallback, but in the
> +optimal/firmware-compatible case the reshape-manager will use the metadata
> +specific areas for managing reshape. The implementation also needs to spawn a
> +reshape-manager per subarray when the reshape is being carried out at the
> +container level. For these two reasons the ->manage_reshape() method is
> +introduced. This method in addition to base tasks mentioned above:
> +1/ Spawns a manager per-subarray, when necessary
> +2/ Uses either generic routines in Grow.c for md-style backup file
> + support, or uses the metadata-format specific location for storing
> + recovery data.
> +This aims to avoid a "midlayer mistake"[1] and lets the metadata handler
> +optionally take advantage of generic infrastructure in Grow.c
> +
> +2 Details for specific reshape requests
> +
> +There are quite a few moving pieces spread out across md, mdadm, and mdmon for
> +the support of external reshape, and there are several different types of
> +reshape that need to be comprehended by the implementation. A rundown of
> +these details follows.
> +
> +2.0 General provisions:
> +
> +Obtain an exclusive open on the container to make sure we are not
> +running concurrently with a Create() event.
> +
> +2.1 Freezing sync_action
> +
> +2.2 Reshape size
> +
> + 1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
> + initializes st->update_tail
> + 2/ mdadm::Grow_reshape() calls ->reshape_super() to check that the size change
> + is allowed (being performed at subarray scope / enough room) prepares a
> + metadata update
> + 3/ mdadm::Grow_reshape(): flushes the metadata update (via
> + flush_metadata_update(), or ->sync_metadata())
> + 4/ mdadm::Grow_reshape(): post the new size to the kernel
> +
> +
> +2.3 Reshape level (simple-takeover)
> +
> +"simple-takeover" implies the level change can be satisfied without touching
> +sync_action
> +
> + 1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
> + initializes st->update_tail
> + 2/ mdadm::Grow_reshape() calls ->reshape_super() to check that the level change
> + is allowed (being performed at subarray scope) prepares a
> + metadata update
> + 2a/ raid10 --> raid0: degrade all mirror legs prior to calling
> + ->reshape_super
> + 3/ mdadm::Grow_reshape(): flushes the metadata update (via
> + flush_metadata_update(), or ->sync_metadata())
> + 4/ mdadm::Grow_reshape(): post the new level to the kernel
> +
> +2.4 Reshape chunk, layout
> +
> +2.5 Reshape raid disks (grow)
> +
> + 1/ mdadm::Grow_reshape(): unconditionally initializes st->update_tail
> + because only redundant raid levels can modify the number of raid disks
> + 2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level
> + change is allowed (being performed at proper scope / permissible
> + geometry / proper spares available in the container) prepares a metadata
> + update.
> + 3/ mdadm::Grow_reshape(): Converts each subarray in the container to the
> + raid level that can perform the reshape and starts mdmon.
> + 4/ mdadm::Grow_reshape(): Pushes the update to mdmon...
> + 4a/ mdmon::process_update(): marks the array as reshaping
> + 4b/ mdmon::manage_member(): adds the spares (without assigning a slot)
> + 5/ mdadm::Grow_reshape(): Notes that mdmon has assigned spares and invokes
> + ->manage_reshape()
> + 5/ mdadm::->manage_reshape(): (for each subarray) sets sync_max to
> + zero, starts the reshape, and pings mdmon
> + 5a/ mdmon::read_and_act(): notices that reshape has started and notifies
> + the metadata handler to record the slots chosen by the kernel
> + 6/ mdadm::->manage_reshape(): saves data that will be overwritten by
> + the kernel to either the backup file or the metadata specific location,
> + advances sync_max, waits for reshape, ping mdmon, repeat.
> + 6a/ mdmon::read_and_act(): records checkpoints
> + 7/ mdadm::->manage_reshape(): Once reshape completes changes the raid
> + level back to the nominal raid level (if necessary)
> +
> + FIXME: native metadata does not have the capability to record the original
> + raid level in reshape-restart case because the kernel always records current
> + raid level to the metadata, whereas external metadata can masquerade at an
> + alternate level based on the reshape state.
> +
> +2.6 Reshape raid disks (shrink)
> +
> +3 TODO
> +
> +...
> +
> +[1]: Linux kernel design patterns - part 3, Neil Brown http://lwn.net/Articles/336262/
>
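
The backup/checkpoint loop described in sections 1.3 and 2.5 of the quoted
document can be sketched as a small in-memory simulation. Everything here is
a stand-in: the struct, the sizes and the function names are hypothetical,
and the "kernel" is just a function that overwrites sectors up to sync_max.

```c
#include <assert.h>
#include <string.h>

#define ARRAY_SECTORS 64	/* size of the simulated array (assumption) */
#define UNIT_SECTORS  16	/* sectors saved per step (assumption) */

struct reshape_sim {
	unsigned char data[ARRAY_SECTORS];	/* simulated array contents */
	unsigned char backup[UNIT_SECTORS];	/* simulated backup space */
	unsigned int backup_start;		/* first sector held in backup */
	unsigned int sync_max;			/* limit the "kernel" may reach */
	unsigned int checkpoint;		/* progress recorded so far */
	unsigned int backups_taken;
};

/* Stand-in for the kernel: rewrite everything up to sync_max. */
static void sim_kernel_reshape(struct reshape_sim *s)
{
	unsigned int i;

	assert(s->sync_max <= ARRAY_SECTORS);
	for (i = s->checkpoint; i < s->sync_max; i++)
		s->data[i] = (unsigned char)~s->data[i];	/* destructive write */
	s->checkpoint = s->sync_max;
}

/* Stand-in for the reshape manager loop; returns the number of backups. */
unsigned int sim_manage_reshape(struct reshape_sim *s)
{
	s->sync_max = 0;	/* hold the kernel until the first backup exists */
	s->checkpoint = 0;
	s->backups_taken = 0;

	while (s->checkpoint < ARRAY_SECTORS) {
		unsigned int start = s->checkpoint;
		unsigned int len = ARRAY_SECTORS - start;

		if (len > UNIT_SECTORS)
			len = UNIT_SECTORS;

		/* 1/ and 2/: save the region that is about to be overwritten */
		memcpy(s->backup, s->data + start, len);
		s->backup_start = start;
		s->backups_taken++;

		/* 3/: let the kernel reshape the saved region, then repeat */
		s->sync_max = start + len;
		sim_kernel_reshape(s);
	}
	return s->backups_taken;
}
```

The point of the ordering is that sync_max never moves past a region that has
not been copied to the backup space first, so an interruption always leaves a
copy of the partially overwritten unit to restore from.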


Re: [PATCH 12/13] External reshape (step 1): container reshape and ->reshape_super()

am 23.11.2010 06:22:49 von NeilBrown

On Thu, 18 Nov 2010 10:22:59 +0100
Krzysztof Wojcik wrote:

> From: Dan Williams
>
> In the native metadata case Grow_reshape() and the kernel validate what
> reshapes are possible / supported and the kernel handles all the metadata
> updates. In the external case the metadata format may have specific
> constraints above this baseline. External formats also introduce the
> constraint of only permitting some reshapes at container scope versus subarray
> scope. For example, imsm changes to 'raiddisks' must be applied to all arrays
> in the container.
>
> This operation assumes that its 'st' parameter has been obtained from
> super_by_fd() (such that st->subarray is up to date), and that a snapshot of
> the metadata has been loaded from the container.
>
> Why a new method, versus extending an existing one?
> ->validate_geometry: this routine assumes it is being called from Create(),
> adding reshape complicates the cases that this routine needs to handle. Where
> we find that checks can be shared between the two cases those routines
> are refactored into common code internal to the metadata handler, i.e. no need to
> provide a unified external interface. ->validate_geometry() also does not
> expect to update the metadata.
>
> ->update_super: this is meant to update single fields at Assembly() and only at
> the container scope. Reshape potentially wants to update multiple fields at
> either container or subarray scope.

I've applied this, but I had to make a few changes due to the new
load_container etc. Hopefully I got it right...

Also, I'm a bit concerned about the handling of container-wide changes.
Currently we only allow changes to the number of devices.
However RAID5 -> RAID6 change typically changes the number of devices and
the level/layout of the RAID5.
Any idea how that can fit into this scheme??
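
To spell out the restriction, here is a stripped-down sketch of the
container-scope check as it appears in the patch. The struct, the helper name
and REQ_UNSET are hypothetical stand-ins (REQ_UNSET plays the role of mdadm's
UnSet; the value used is an assumption).

```c
#include <assert.h>
#include <stddef.h>

#define REQ_UNSET 65534	/* stand-in for mdadm's UnSet (value is an assumption) */

struct reshape_req {
	long long size;		/* -1 when no size change is requested */
	int level;		/* REQ_UNSET when no level change is requested */
	const char *layout_str;	/* NULL when no layout change is requested */
	int chunksize;		/* 0 when no chunk change is requested */
	int raid_disks;		/* 0 when no disk-count change is requested */
};

/* Returns 1 if the request changes raid_disks and nothing else, else 0. */
int container_scope_ok(const struct reshape_req *r)
{
	assert(r != NULL);

	if (r->size >= 0 || r->layout_str != NULL ||
	    r->chunksize != 0 || r->level != REQ_UNSET)
		return 0;
	return r->raid_disks > 0;
}
```

A RAID5 -> RAID6 request sets both level and raid_disks, so a check of this
shape rejects it at container scope, which is the case I am worried about.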

Thanks,
NeilBrown


>
> Signed-off-by: Dan Williams
> ---
> Grow.c | 390 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> mdadm.h | 9 +
> 2 files changed, 391 insertions(+), 8 deletions(-)
>
> diff --git a/Grow.c b/Grow.c
> index bf634d3..59032ef 100644
> --- a/Grow.c
> +++ b/Grow.c
> @@ -474,8 +474,222 @@ static void wait_reshape(struct mdinfo *sra)
> }
> } while (strncmp(action, "reshape", 7) == 0);
> }
> -
> -
> +
> +static int reshape_super(struct supertype *st, long long size, int level,
> + int layout, int chunksize, int raid_disks,
> + char *backup_file, char *dev, int verbose)
> +{
> + /* nothing extra to check in the native case */
> + if (!st->ss->external)
> + return 0;
> + if (!st->ss->reshape_super ||
> + !st->ss->manage_reshape) {
> + fprintf(stderr, Name ": %s metadata does not support reshape\n",
> + st->ss->name);
> + return 1;
> + }
> +
> + return st->ss->reshape_super(st, size, level, layout, chunksize,
> + raid_disks, backup_file, dev, verbose);
> +}
> +
> +static void sync_metadata(struct supertype *st)
> +{
> + if (st->ss->external) {
> + if (st->update_tail)
> + flush_metadata_updates(st);
> + else
> + st->ss->sync_metadata(st);
> + }
> +}
> +
> +static int subarray_set_num(char *container, struct mdinfo *sra, char *name, int n)
> +{
> + /* when dealing with external metadata subarrays we need to be
> + * prepared to handle EAGAIN. The kernel may need to wait for
> + * mdmon to mark the array active so the kernel can handle
> + * allocations/writeback when preparing the reshape action
> + * (md_allow_write()). We temporarily disable safe_mode_delay
> + * to close a race with the array_state going clean before the
> + * next write to raid_disks / stripe_cache_size
> + */
> + char safe[50];
> + int rc;
> +
> + /* only 'raid_disks' and 'stripe_cache_size' trigger md_allow_write */
> + if (strcmp(name, "raid_disks") != 0 &&
> + strcmp(name, "stripe_cache_size") != 0)
> + return sysfs_set_num(sra, NULL, name, n);
> +
> + rc = sysfs_get_str(sra, NULL, "safe_mode_delay", safe, sizeof(safe));
> + if (rc <= 0)
> + return -1;
> + sysfs_set_num(sra, NULL, "safe_mode_delay", 0);
> + rc = sysfs_set_num(sra, NULL, name, n);
> + if (rc < 0 && errno == EAGAIN) {
> + ping_monitor(container);
> + /* if we get EAGAIN here then the monitor is not active
> + * so stop trying
> + */
> + rc = sysfs_set_num(sra, NULL, name, n);
> + }
> + sysfs_set_str(sra, NULL, "safe_mode_delay", safe);
> + return rc;
> +}
> +
> +static int reshape_container_raid_disks(char *container, int raid_disks)
> +{
> + /* for each subarray switch to a raid level that can
> + * support the reshape, and set raid disks
> + */
> + struct mdstat_ent *ent, *e;
> + int changed = 0, rv = 0, err = 0;
> +
> + ent = mdstat_read(1, 0);
> + if (!ent) {
> + fprintf(stderr, Name ": unable to read /proc/mdstat\n");
> + return -1;
> + }
> +
> + changed = 0;
> + for (e = ent; e; e = e->next) {
> + struct mdinfo *sub;
> + unsigned int cache;
> + int level, takeover_delta = 0;
> +
> + if (!is_container_member(e, container))
> + continue;
> +
> + level = map_name(pers, e->level);
> + if (level == 0) {
> + sub = sysfs_read(-1, e->devnum, GET_VERSION);
> + if (!sub)
> + break;
> + /* metadata records 'orig_level' */
> + rv = sysfs_set_num(sub, NULL, "level", 4);
> + if (rv < 0) {
> + err = errno;
> + break;
> + }
> + /* we want spares to be used for capacity
> + * expansion, not rebuild
> + */
> + takeover_delta = 1;
> +
> + sysfs_free(sub);
> + level = 4;
> + }
> +
> + sub = NULL;
> + switch (level) {
> + default:
> + rv = -1;
> + break;
> + case 4:
> + case 5:
> + case 6:
> + sub = sysfs_read(-1, e->devnum, GET_CHUNK|GET_CACHE);
> + if (!sub)
> + break;
> + cache = (sub->array.chunk_size / 4096) * 4;
> + if (cache > sub->cache_size)
> + rv = subarray_set_num(container, sub,
> + "stripe_cache_size", cache);
> + if (rv) {
> + err = errno;
> + break;
> + }
> + /* fall through */
> + case 1:
> + if (!sub)
> + sub = sysfs_read(-1, e->devnum, GET_VERSION);
> + if (!sub)
> + break;
> +
> + rv = subarray_set_num(container, sub, "raid_disks",
> + raid_disks + takeover_delta);
> + if (rv)
> + err = errno;
> + else
> + changed++;
> + break;
> + }
> + sysfs_free(sub);
> + if (rv)
> + break;
> + }
> + free_mdstat(ent);
> + if (rv) {
> + fprintf(stderr, Name
> + ": failed to initiate container reshape%s%s\n",
> + err ? ": " : "", err ? strerror(err) : "");
> + return rv;
> + }
> +
> + return changed;
> +}
> +
> +static void revert_container_raid_disks(struct supertype *st, int fd, char *container)
> +{
> + /* we failed to prepare all subarrays in the container for
> + * reshape, so cancel the changes and restore the nominal raid
> + * level
> + */
> + struct mdstat_ent *ent, *e;
> +
> + ent = mdstat_read(0, 0);
> + if (!ent) {
> + fprintf(stderr, Name
> + ": failed to read /proc/mdstat while aborting reshape\n");
> + return;
> + }
> +
> + for (e = ent; e; e = e->next) {
> + int level_fixed = 0, disks_fixed = 0;
> + struct mdinfo *sub, prev;
> +
> + if (!is_container_member(e, container))
> + continue;
> +
> + st->ss->free_super(st);
> + sprintf(st->subarray, "%s", to_subarray(e, container));
> + if (st->ss->load_super(st, fd, NULL)) {
> + fprintf(stderr, Name
> + ": failed read metadata while aborting reshape\n");
> + continue;
> + }
> + st->ss->getinfo_super(st, &prev);
> +
> + /* changing level might change raid_disks so we do it
> + * first and then check if raid_disks still needs fixing
> + */
> + if (map_name(pers, e->level) != prev.array.level) {
> + sub = sysfs_read(-1, e->devnum, GET_VERSION);
> + if (sub &&
> + !sysfs_set_num(sub, NULL, "level", prev.array.level))
> + level_fixed = 1;
> + sysfs_free(sub);
> + } else
> + level_fixed = 1;
> +
> + sub = sysfs_read(-1, e->devnum, GET_DISKS);
> + if (sub && sub->array.raid_disks != prev.array.raid_disks) {
> + if (!subarray_set_num(container, sub, "raid_disks",
> + prev.array.raid_disks))
> + disks_fixed = 1;
> + } else if (sub)
> + disks_fixed = 1;
> + sysfs_free(sub);
> +
> + if (!disks_fixed || !level_fixed)
> + fprintf(stderr, Name
> + ": failed to restore %s to a %d-disk %s array\n",
> + e->dev, prev.array.raid_disks,
> + map_num(pers, prev.array.level));
> + }
> + free_mdstat(ent);
> +}
> +
> int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> long long size,
> int level, char *layout_str, int chunksize, int raid_disks)
> @@ -518,6 +732,8 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> unsigned long cache;
> unsigned long long array_size;
> int changed = 0;
> + char *container = NULL;
> + int cfd = -1;
> int done;
>
> struct mdinfo *sra;
> @@ -545,10 +761,65 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> " Please use a newer kernel\n");
> return 1;
> }
> +
> + st = super_by_fd(fd);
> + if (!st) {
> + fprintf(stderr, Name ": Unable to determine metadata format for %s\n", devname);
> + return 1;
> + }
> +
> + /* in the external case we need to check that the requested reshape is
> + * supported, and perform an initial check that the container holds the
> + * pre-requisite spare devices (mdmon owns final validation)
> + */
> + if (st->ss->external) {
> + int container_dev;
> +
> + if (st->subarray[0]) {
> + container_dev = st->container_dev;
> + cfd = open_dev_excl(st->container_dev);
> + } else if (size >= 0 || layout_str != NULL || chunksize != 0 ||
> + level != UnSet) {
> + fprintf(stderr,
> + Name ": %s is a container, only 'raid-devices' can be changed\n",
> + devname);
> + return 1;
> + } else {
> + container_dev = st->devnum;
> + close(fd);
> + cfd = open_dev_excl(st->devnum);
> + fd = cfd;
> + }
> + if (cfd < 0) {
> + fprintf(stderr, Name ": Unable to open container for %s\n",
> + devname);
> + return 1;
> + }
> +
> + container = devnum2devname(st->devnum);
> + if (!container) {
> + fprintf(stderr, Name ": Could not determine container name\n");
> + return 1;
> + }
> +
> + if (st->ss->load_super(st, cfd, NULL)) {
> + fprintf(stderr, Name ": Cannot read superblock for %s\n",
> + devname);
> + return 1;
> + }
> +
> + if (mdmon_running(container_dev))
> + st->update_tail = &st->updates;
> + }
> +
> sra = sysfs_read(fd, 0, GET_LEVEL);
> - if (sra)
> + if (sra) {
> + if (st->ss->external && st->subarray[0] == 0) {
> + array.level = LEVEL_CONTAINER;
> + sra->array.level = LEVEL_CONTAINER;
> + }
> frozen = freeze_array(sra);
> - else {
> + } else {
> fprintf(stderr, Name ": failed to read sysfs parameters for %s\n",
> devname);
> return 1;
> @@ -559,8 +830,16 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> return 1;
> }
>
> +
> /* ========= set size =============== */
> if (size >= 0 && (size == 0 || size != array.size)) {
> + long long orig_size = array.size;
> +
> + if (reshape_super(st, size, UnSet, UnSet, 0, 0, NULL, devname, !quiet)) {
> + rv = 1;
> + goto release;
> + }
> + sync_metadata(st);
> array.size = size;
> if (array.size != size) {
> /* got truncated to 32bit, write to
> @@ -575,6 +854,11 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> rv = ioctl(fd, SET_ARRAY_INFO, &array);
> if (rv != 0) {
> int err = errno;
> +
> + /* restore metadata */
> + if (reshape_super(st, orig_size, UnSet, UnSet, 0, 0,
> + NULL, devname, !quiet) == 0)
> + sync_metadata(st);
> fprintf(stderr, Name ": Cannot set device size for %s: %s\n",
> devname, strerror(err));
> if (err == EBUSY &&
> @@ -591,7 +875,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> fprintf(stderr, Name ": component size of %s has been set to %lluK\n",
> devname, size);
> changed = 1;
> - } else {
> + } else if (array.level != LEVEL_CONTAINER) {
> size = get_component_size(fd)/2;
> if (size == 0)
> size = array.size;
> @@ -674,6 +958,13 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> } else
> layout_str = "parity-last";
> } else {
> + /* Level change is a simple takeover. In the external
> + * case we don't check with the metadata handler until
> + * we establish what the final layout will be. If the
> + * level change is disallowed we will revert to
> + * orig_level without disturbing the metadata, otherwise
> + * we will send an update.
> + */
> c = map_num(pers, level);
> if (c == NULL) {
> rv = 1;/* not possible */
> @@ -706,7 +997,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
>
> /* ========= set shape (chunk_size / layout / ndisks) ============== */
> /* Check if layout change is a no-op */
> - switch(array.level) {
> + switch (array.level) {
> case 5:
> if (layout_str && array.layout == map_name(r5layout, layout_str))
> layout_str = NULL;
> @@ -745,6 +1036,11 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> if (layout_str == NULL
> && (chunksize == 0 || chunksize*1024 == array.chunk_size)
> && (raid_disks == 0 || raid_disks == array.raid_disks)) {
> + if (reshape_super(st, -1, level, UnSet, 0, 0, NULL, devname, !quiet)) {
> + rv = 1;
> + goto release;
> + }
> + sync_metadata(st);
> rv = 0;
> if (level != UnSet && level != array.level) {
> /* Looks like this level change doesn't need
> @@ -766,18 +1062,69 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> } else if (!changed && !quiet)
> fprintf(stderr, Name ": %s: no change requested\n",
> devname);
> +
> + if (st->ss->external && !mdmon_running(st->container_dev) &&
> + level > 0) {
> + start_mdmon(st->container_dev);
> + ping_monitor(container);
> + }
> goto release;
> }
>
> c = map_num(pers, array.level);
> if (c == NULL) c = "-unknown-";
> - switch(array.level) {
> + switch (array.level) {
> default: /* raid0, linear, multipath cannot be reconfigured */
> fprintf(stderr, Name ": %s array %s cannot be reshaped.\n",
> c, devname);
> + /* TODO raid0 raiddisks can be reshaped via raid4 */
> rv = 1;
> break;
> + case LEVEL_CONTAINER: {
> + int count;
> +
> + /* double check that we are not changing anything but raid_disks */
> + if (size >= 0 || layout_str != NULL || chunksize != 0 || level != UnSet) {
> + fprintf(stderr,
> + Name ": %s is a container, only 'raid-devices' can be changed\n",
> + devname);
> + rv = 1;
> + goto release;
> + }
> +
> + st->update_tail = &st->updates;
> + if (reshape_super(st, -1, UnSet, UnSet, 0, raid_disks,
> + backup_file, devname, !quiet)) {
> + rv = 1;
> + goto release;
> + }
> +
> + count = reshape_container_raid_disks(container, raid_disks);
> + if (count < 0) {
> + revert_container_raid_disks(st, fd, container);
> + rv = 1;
> + goto release;
> + } else if (count == 0) {
> + if (!quiet)
> + fprintf(stderr, Name
> + ": no active subarrays to reshape\n");
> + goto release;
> + }
>
> + if (!mdmon_running(st->devnum)) {
> + start_mdmon(st->devnum);
> + ping_monitor(container);
> + }
> + sync_metadata(st);
> +
> + /* give mdmon a chance to allocate spares */
> + ping_manager(container);
> +
> + /* manage_reshape takes care of releasing the array(s) */
> + st->ss->manage_reshape(st, backup_file);
> + frozen = 0;
> + goto release;
> + }
> case LEVEL_FAULTY: /* only 'layout' change is permitted */
>
> if (chunksize || raid_disks) {
> @@ -813,6 +1160,12 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> break;
> }
> if (raid_disks > 0) {
> + if (reshape_super(st, -1, UnSet, UnSet, 0, raid_disks,
> + NULL, devname, !quiet)) {
> + rv = 1;
> + goto release;
> + }
> + sync_metadata(st);
> array.raid_disks = raid_disks;
> if (ioctl(fd, SET_ARRAY_INFO, &array) != 0) {
> fprintf(stderr, Name ": Cannot set raid-devices for %s: %s\n",
> @@ -830,7 +1183,6 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> * layout/chunksize/raid_disks can be changed
> * though the kernel may not support it all.
> */
> - st = super_by_fd(fd);
>
> /*
> * There are three possibilities.
> @@ -1024,6 +1376,12 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> }
> }
> if (backup_file == NULL) {
> + if (st->ss->external && !st->ss->manage_reshape) {
> + fprintf(stderr, Name ": %s Grow operation not supported by %s metadata\n",
> + devname, st->ss->name);
> + rv = 1;
> + break;
> + }
> if (ndata <= odata) {
> fprintf(stderr, Name ": %s: Cannot grow - need backup-file\n",
> devname);
> @@ -1072,6 +1430,13 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> d++;
> }
>
> + /* check that the operation is supported by the metadata */
> + if (reshape_super(st, -1, level, nlayout, nchunk, ndisks,
> + backup_file, devname, !quiet)) {
> + rv = 1;
> + break;
> + }
> +
> /* lastly, check that the internal stripe cache is
> * large enough, or it won't work.
> */
> @@ -1088,6 +1453,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> * If only changing raid_disks, use ioctl, else use
> * sysfs.
> */
> + sync_metadata(st);
> if (ochunk == nchunk && olayout == nlayout) {
> array.raid_disks = ndisks;
> if (ioctl(fd, SET_ARRAY_INFO, &array) != 0) {
> @@ -1136,6 +1502,14 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> break;
> }
>
> + if (st->ss->external) {
> + /* metadata handler takes it from here */
> + ping_manager(container);
> + st->ss->manage_reshape(st, backup_file);
> + frozen = 0;
> + break;
> + }
> +
> /* set up the backup-super-block. This requires the
> * uuid from the array.
> */
> diff --git a/mdadm.h b/mdadm.h
> index a4de06f..64b32cc 100644
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -627,6 +627,15 @@ extern struct superswitch {
> int (*kill_subarray)(struct supertype *st); /* optional */
> /* Permit subarray's to be modified */
> int (*update_subarray)(struct supertype *st, char *update, mddev_ident_t ident); /* optional */
> + /* Check if reshape is supported for this external format.
> + * st is obtained from super_by_fd() where st->subarray[0] is
> + * initialized to indicate if reshape is being performed at the
> + * container or subarray level
> + */
> + int (*reshape_super)(struct supertype *st, long long size, int level,
> + int layout, int chunksize, int raid_disks,
> + char *backup, char *dev, int verbose); /* optional */
> + int (*manage_reshape)(struct supertype *st, char *backup); /* optional */
>
> /* for mdmon */
> int (*open_new)(struct supertype *c, struct active_array *a,


Re: [PATCH 13/13] External reshape (step 2): Freeze container

am 23.11.2010 07:11:25 von NeilBrown

On Thu, 18 Nov 2010 10:23:08 +0100
Krzysztof Wojcik wrote:

> From: Dan Williams
>
> When growing the number of raid disks the reshape process will promote
> container-spares to subarray-spares (later the kernel promotes them to
> subarray-members in raid5_start_reshape()). The automatic spare
> promotion that mdmon performs upon seeing a degraded array must be
> disabled until the reshape process has been initiated. Otherwise, mdmon
> may start a rebuild before the reshape parameters can be specified.
>
> In the external case we arrange for the monitor to be blocked, and turn off the safemode delay.
> Mdmon is updated to check sync_action is not frozen before initiating
> recovery. This introduces a need to check which version of mdmon is
> running to be sure it honors the expected semantics. Extend
> ping_monitor() to report the version of mdmon. This also permits
> discrimination of known buggy mdmon implementations in the future.
> Note, it's not enough to know the current version of mdadm because the
> mdmon instance may have originated from the initrd, so there is no
> guarantee that mdadm and mdmon versions are synchronized.

I have applied this, and all the others that I didn't raise explicit issues
with (which I think was only
[PATCH 07/13] Grow: add missing raid4 geometries to geo_map())

and I have pushed out a new devel-3.2
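
For reference, the return-code convention that freeze() documents in the
quoted patch, together with the native-array decision from the code it
replaces, can be written as a small pure function. The enum and function
names are hypothetical; the sysfs read and write are reduced to parameters
so the logic can be followed on its own.

```c
#include <string.h>

/* Return codes as documented in the quoted freeze() comment */
enum freeze_rc {
	FREEZE_NO_CONTAINER = -2,	/* container cannot be frozen */
	FREEZE_BUSY = -1,		/* array is busy */
	FREEZE_UNSUPPORTED = 0,		/* kernel doesn't support 'frozen' */
	FREEZE_OK = 1			/* it worked */
};

/* Hypothetical pure-function version of the native decision.
 * sync_action: contents of the sysfs attribute, or NULL if unreadable.
 * write_ok:    non-zero if writing "frozen" to sync_action succeeded.
 */
int native_freeze_rc(const char *sync_action, int write_ok)
{
	if (!sync_action)
		return FREEZE_UNSUPPORTED;
	if (strcmp(sync_action, "idle\n") != 0 &&
	    strcmp(sync_action, "frozen\n") != 0)
		return FREEZE_BUSY;
	return write_ok ? FREEZE_OK : FREEZE_UNSUPPORTED;
}

/* Only a result of 1 means there is something to undo later. */
int needs_unfreeze(int frozen)
{
	return frozen == FREEZE_OK;
}
```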

Thanks,
NeilBrown


>
> Signed-off-by: Dan Williams
> ---
> Grow.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++--------------
> 1 files changed, 69 insertions(+), 19 deletions(-)
>
> diff --git a/Grow.c b/Grow.c
> index 59032ef..4139265 100644
> --- a/Grow.c
> +++ b/Grow.c
> @@ -432,29 +432,78 @@ static int child_same_size(int afd, struct mdinfo *sra, unsigned long blocks,
> int disks, int chunk, int level, int layout, int data,
> int dests, int *destfd, unsigned long long *destoffsets);
>
> -static int freeze_array(struct mdinfo *sra)
> +static int freeze_container(struct supertype *st)
> {
> - /* Try to freeze resync on this array.
> + int container_dev = st->subarray[0] ? st->container_dev : st->devnum;
> + char *container = devnum2devname(container_dev);
> +
> + if (!container) {
> + fprintf(stderr, Name
> + ": could not determine container name, freeze aborted\n");
> + return -2;
> + }
> +
> + if (block_monitor(container, 1)) {
> + fprintf(stderr, Name ": failed to freeze container\n");
> + return -2;
> + }
> +
> + return 1;
> +}
> +
> +static void unfreeze_container(struct supertype *st)
> +{
> + int container_dev = st->subarray[0] ? st->container_dev : st->devnum;
> + char *container = devnum2devname(container_dev);
> +
> + if (!container) {
> + fprintf(stderr, Name
> + ": could not determine container name, unfreeze aborted\n");
> + return;
> + }
> +
> + unblock_monitor(container, 1);
> +}
> +
> +static int freeze(struct supertype *st)
> +{
> + /* Try to freeze resync/rebuild on this array/container.
> * Return -1 if the array is busy,
> + * return -2 if the container cannot be frozen,
> * return 0 if this kernel doesn't support 'frozen'
> * return 1 if it worked.
> */
> - char buf[20];
> - if (sysfs_get_str(sra, NULL, "sync_action", buf, 20) <= 0)
> - return 0;
> - if (strcmp(buf, "idle\n") != 0 &&
> - strcmp(buf, "frozen\n") != 0)
> - return -1;
> - if (sysfs_set_str(sra, NULL, "sync_action", "frozen") < 0)
> - return 0;
> - return 1;
> + if (st->ss->external)
> + return freeze_container(st);
> + else {
> + struct mdinfo *sra = sysfs_read(-1, st->devnum, GET_VERSION);
> + int err;
> +
> + if (!sra)
> + return -1;
> + err = sysfs_freeze_array(sra);
> + sysfs_free(sra);
> + return err;
> + }
> }
>
> -static void unfreeze_array(struct mdinfo *sra, int frozen)
> +static void unfreeze(struct supertype *st, int frozen)
> {
> /* If 'frozen' is 1, unfreeze the array */
> - if (frozen > 0)
> - sysfs_set_str(sra, NULL, "sync_action", "idle");
> + if (frozen <= 0)
> + return;
> +
> + if (st->ss->external)
> + unfreeze_container(st);
> + else {
> + struct mdinfo *sra = sysfs_read(-1, st->devnum, GET_VERSION);
> +
> + if (sra)
> + sysfs_set_str(sra, NULL, "sync_action", "idle");
> + else
> + fprintf(stderr, Name ": failed to unfreeze array\n");
> + sysfs_free(sra);
> + }
> }
>
> static void wait_reshape(struct mdinfo *sra)
> @@ -818,19 +867,21 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> array.level = LEVEL_CONTAINER;
> sra->array.level = LEVEL_CONTAINER;
> }
> - frozen = freeze_array(sra);
> } else {
> fprintf(stderr, Name ": failed to read sysfs parameters for %s\n",
> devname);
> return 1;
> }
> - if (frozen < 0) {
> + frozen = freeze(st);
> + if (frozen < -1) {
> + /* freeze() already spewed the reason */
> + return 1;
> + } else if (frozen < 0) {
> fprintf(stderr, Name ": %s is performing resync/recovery and cannot"
> " be reshaped\n", devname);
> return 1;
> }
>
> -
> /* ========= set size =============== */
> if (size >= 0 && (size == 0 || size != array.size)) {
> long long orig_size = array.size;
> @@ -1611,8 +1662,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> if (c && sysfs_set_str(sra, NULL, "level", c) == 0)
> fprintf(stderr, Name ": aborting level change\n");
> }
> - if (sra)
> - unfreeze_array(sra, frozen);
> + unfreeze(st, frozen);
> return rv;
> }
>
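
For readers following the return-code convention in the patch above, here is a
minimal standalone sketch of the native (non-external) freeze path and of how
Grow_reshape() interprets the result. It is an illustration under stated
assumptions, not the mdadm implementation: freeze_sync_action(),
unfreeze_sync_action(), reshape_may_proceed() and the enum names are
hypothetical, and a plain file path stands in for the sysfs "sync_action"
attribute that mdadm reaches through sysfs_get_str()/sysfs_set_str().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return codes as documented in the comment inside the patch's freeze(). */
enum freeze_result {
	FREEZE_CONTAINER_FAILED = -2,	/* container could not be frozen */
	FREEZE_BUSY = -1,		/* resync/recovery in progress */
	FREEZE_UNSUPPORTED = 0,		/* no 'frozen' support (or unreadable) */
	FREEZE_OK = 1,			/* array is now frozen */
};

/*
 * Hypothetical stand-in for the removed freeze_array(): 'path' names a
 * file that plays the role of the sysfs sync_action attribute.
 */
static int freeze_sync_action(const char *path)
{
	char buf[20];
	FILE *f;

	assert(path != NULL);

	f = fopen(path, "r");
	if (!f)
		return FREEZE_UNSUPPORTED;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return FREEZE_UNSUPPORTED;
	}
	fclose(f);

	/* Only an idle (or already frozen) array may be frozen. */
	if (strcmp(buf, "idle\n") != 0 && strcmp(buf, "frozen\n") != 0)
		return FREEZE_BUSY;

	f = fopen(path, "w");
	if (!f)
		return FREEZE_UNSUPPORTED;
	fputs("frozen\n", f);
	fclose(f);
	return FREEZE_OK;
}

/* Undo the freeze only if it actually took effect; returns 1 if it wrote. */
static int unfreeze_sync_action(const char *path, int frozen)
{
	FILE *f;

	if (frozen <= 0)
		return 0;
	f = fopen(path, "w");
	if (!f)
		return 0;
	fputs("idle\n", f);
	fclose(f);
	return 1;
}

/*
 * Mirrors the check in Grow_reshape(): any negative result aborts the
 * reshape, while 0 (no 'frozen' support) and 1 both let it continue.
 */
static int reshape_may_proceed(int frozen)
{
	return frozen >= 0;
}
```

A caller would pair the two helpers the way Grow_reshape() pairs freeze() and
unfreeze(): keep the value returned by freeze_sync_action(), stop early if
reshape_may_proceed() is 0, and pass the same value to unfreeze_sync_action()
on the way out so that an array which was never frozen is left untouched.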
