[PATCH 00/53] External Metadata Reshape

on 26.11.2010 09:03:51 by adam.kwolek

This patch series for mdadm combines 3 previous series into one and introduces the following features:
- Freeze array/container and new reshape vectors: patches 0001 to 0015
mdadm devel 3.2 already contains patches 0001 to 0013; patches 0014 and 0015 fix 2 problems in this functionality
- Takeover: patches 0016 to 0017
- Online Capacity Expansion (OLCE): patches 0018 to 0036
- Checkpointing: patches 0037 to 0045
- Migrations: patches 0045 to 0053
1. raid0 to raid5 : patch 0051
2. raid5 to raid0 : patch 0052
3. chunk size migration : patch 0053

Patches are for mdadm 3.1.4, and Neil's feedback on the first 6 OLCE patches is included.
There should be no patch corruption problem now, as the series is sent directly from stgit (not Outlook).

Checkpointing also requires the md patch "md: raid5: update suspend_hi during reshape" (sent earlier).
The new vectors reshape_super() and manage_reshape() (introduced by Dan Williams) are used throughout the whole process.

In the next step, I'll rebase it to mdadm devel 3.2; meanwhile, Krzysztof Wojcik will prepare additional fixes for raid10<->raid0 takeover.

I think that a few patches can be taken into devel 3.2 at this moment, i.e.:
0014-FIX-Cannot-exit-monitor-after-takeover.patch
0015-FIX-Unfreeze-not-only-container-for-external-metada.patch
0016-Add-takeover-support-for-external-meta.patch
0018-Treat-feature-as-experimental.patch
0033-Prepare-and-free-fdlist-in-functions.patch
0034-Compute-backup-blocks-in-function.patch

Online Capacity Expansion for raid0 and raid5 arrays implements the following algorithm for container reshape:
1. mdadm: Freeze container
2. mdadm: Performs takeover to raid5 for all raid0 arrays in container (for imsm, raid0<->raid5 takeover requires no metadata updates)
3. mdadm: set raid_disks sysfs entry for all arrays in container
4. mdadm: prepares and sends metadata update using reshape_super() vector for first array in container.
5. mdadm: waits for array idle or reshape state
6. managemon: prepare_update(): allocates memory for bigger device object
7. monitor: process_update(): applies update, relinks memory for device objects. Sets reshape_delta_disks variable in active array to the requested number of new disks
8. monitor: kicks managemon on reshape_delta_disks value other than RESHAPE_NOT_ACTIVE and RESHAPE_IN_PROGRESS value
9. managemon: adds devices to md (let md set slot number on reshape start)
10. managemon: sets sync_max to 0
11. managemon: starts reshape in md
12. managemon: on success sends slot verification message to monitor to update slots
13. managemon: on failure sends reshape cancelation message (sets idle state to md)
14. managemon: sets reshape_delta_disks variable to RESHAPE_IN_PROGRESS value to avoid managemon procedures reentry.
15. monitor:
a. for set slot message verifies and corrects (if necessary) slot information in metadata
b. for cancel message rolls back metadata information, sets reshape_delta_disks variable to RESHAPE_NOT_ACTIVE
16. mdadm: on idle array state exits and unfreezes array. End
17. mdadm: on reshape array state continues with reshape (it also sends ping to monitor and managemon to be sure that metadata updates hit disks)
18. mdadm: verifies array state: if slots are set correctly
19. mdadm: calls child_grow() function
20. mdadm: waits for reshape finish
21. monitor: on reshape finish sets reshape_delta_disks variable to RESHAPE_NOT_ACTIVE
22. mdadm: sets array size according to information in metadata
23. mdadm: for raid0 array backward takeover to raid0 is executed.
24. mdadm: checks if another array in container requires reshape; if yes, starts again from #4
25. mdadm: unfreezes array
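
The hand-off between monitor and managemon in steps 7, 8, 14, 15b and 21 can be read as a small state machine on reshape_delta_disks. The following is only a sketch of that reading: the numeric values of the two sentinels and both helper function names are assumptions made for illustration and are not taken from the patches.

```c
#include <assert.h>

/* Placeholder sentinel values: the values used in the patches are not
 * shown in this message, so these are assumptions for illustration. */
enum {
	RESHAPE_NOT_ACTIVE  = -1,
	RESHAPE_IN_PROGRESS = -2
};

/* Step 8: monitor kicks managemon only when a reshape was requested,
 * i.e. the variable holds neither of the two sentinel values.
 * (hypothetical helper name) */
int monitor_should_kick_managemon(int reshape_delta_disks)
{
	return reshape_delta_disks != RESHAPE_NOT_ACTIVE &&
	       reshape_delta_disks != RESHAPE_IN_PROGRESS;
}

/* Step 14: once managemon has handled the request it marks the reshape
 * as in progress so that its procedures are not re-entered.
 * (hypothetical helper name) */
int managemon_mark_started(int reshape_delta_disks)
{
	/* only a pending request may be moved to "in progress" */
	assert(monitor_should_kick_managemon(reshape_delta_disks));
	return RESHAPE_IN_PROGRESS;
}
```

With these placeholders, a request for one additional disk (value 1) triggers managemon once; after managemon_mark_started() the same check returns 0 until monitor resets the variable to RESHAPE_NOT_ACTIVE (steps 15b and 21).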

Migration feature reuses code flow introduced for OLCE (Online Capacity Expansion) and uses the same grow/reshape flow in mdadm/mdmon.
Migration is executed in the following way:
1. mdadm: reshape_super() prepares metadata update and sends it to mdmon
2. mdadm: waits for reshape array state
3. monitor: receives metadata update and applies it.
4. monitor: metadata update triggers managemon.
5. managemon: updates array (md) configuration and starts reshape
6. mdadm: finds that reshape is started and continues it using checkpointing
7. mdadm: reshape is finished and manage_reshape() finalizes array:
- Sets array size as is given in metadata
- Performs takeover to raid0 if necessary

In the current patches the placement of the manage_reshape() function call was changed (patch 0050).
It is moved to the end of array processing so that the external metadata reshape case uses the common code from Grow.c; we do not need to duplicate existing code, as it would do the same
things as the code for native metadata. The new manage_reshape() placement leaves only a few things to do in the current implementation and simplifies the code.

Migrations command line:
1. Execute migration raid0->raid5:
mdadm --grow /dev/md/array_name --level 5 --layout=left-asymmetric

This converts n-disks raid0 array to (n+1)-disks raid5 array.
The additional disk for the raid5 array is taken from the spares pool.

2. Execute migration raid5->raid0:
mdadm --grow /dev/md/array_name --level 0

This converts n-disks raid5 array to n-disks raid0 array.

3. Execute chunk size migration
mdadm --grow /dev/md/array_name --chunk N

where N is the new chunk size value

Online Capacity Expansion command line:
1. Add spares to container, i.e.: mdadm --add /dev/md/imsm_container_name /dev/sdX
For raid0, spares are also required. Patch "[PATCH 16] Add spares to raid0 array using takeover" enables this.
2. Execute reshape, i.e.: mdadm --grow /dev/md/imsm_container_name --raid-devices=requested_raid_disks_number
Grow is executed for all arrays in the container that the command is executed on.

The feature is treated as experimental because of Windows compatibility concerns during the reshape process; the code is guarded by the MDADM_EXPERIMENTAL environment variable.
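
A guard of this kind could look roughly like the sketch below. Only the variable name MDADM_EXPERIMENTAL comes from this message; the function names, and the rule that any non-empty value enables the feature, are assumptions for illustration and may differ from what the patch does.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed rule for this sketch: a set, non-empty value enables the
 * feature. The rule used by the patch may be stricter.
 * (hypothetical helper name) */
int flag_value_enables(const char *value)
{
	return value != NULL && value[0] != '\0';
}

/* Hypothetical wrapper: look the variable up in the environment. */
int experimental_enabled(void)
{
	return flag_value_enables(getenv("MDADM_EXPERIMENTAL"));
}

/* Run a guarded operation only when the feature is enabled;
 * returns -1 without calling it otherwise. (hypothetical helper name) */
int run_if_experimental(int (*feature)(void))
{
	assert(feature != NULL);
	if (!experimental_enabled())
		return -1;
	return feature();
}
```

Under these assumptions a user would opt in with something like MDADM_EXPERIMENTAL=1 in the environment of the mdadm invocation.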

---

Adam Kwolek (45):
Migration: Chunk size migration
Migration raid0->raid5
Migration: raid5->raid0
Change manage_reshape() placement
imsm Fix: Core during rebuild on array details read
WORKAROUND: md reports idle state during reshape start
FIX: Honor !reshape state on wait_reshape() entry
FIX: Allow for reshape without backup file
mdadm: support grow operation for external meta
Add mdadm->mdmon sync_max command message
mdadm: migration restart for external meta
mdadm: support backup operations for imsm
mdadm: support restore_stripes() from the given buffer
mdadm: add backup methods to superswitch
mdadm: Add IMSM migration record to intel_super
mdadm: read chunksize and layout from mdstat
mdadm: second_map enhancement for imsm_get_map()
Finalize reshape after adding disks to array
Control reshape in mdadm
Compute backup blocks in function.
Prepare and free fdlist in functions
imsm: FIX: spare list contains one device several times
imsm: FIX: Fill delta_disks field in getinfo_super()
imsm: FIX: Fill sys_name field in getinfo_super()
Add spares to raid0 array using takeover
imsm: Do not indicate resync during reshape
imsm: Do not accept messages sent by mdadm
imsm: Cancel metadata changes on reshape start failure
imsm: Verify slots in meta against slot numbers set by md
Add support to skip slot configuration
Process reshape initialization by managemon
Send information to managemon about reshape request
imsm: FIX: core dump during imsm metadata writing
imsm: Add reshape_update for grow array case
imsm: Add support for general migration
Treat feature as experimental
Disk removal support for Raid10->Raid0 takeover
Add takeover support for external meta
FIX: Unfreeze not only container for external metadata
FIX: Cannot exit monitor after takeover
External reshape (step 2): Freeze container
External reshape (step 1): container reshape and ->reshape_super()
Document the external reshape implementation
Initialize st->devnum and st->container_dev in super_by_fd
block monitor: freeze spare assignment for external arrays

Dan Williams (8):
Create: cleanup/unify default geometry handling
fix a get_linux_version() comparison typo
Grow: add missing raid4 geometries to geo_map()
Grow: fix check for raid6 layout normalization
Assemble: fix assembly in the delta_disks > max_degraded case
Grow: mark some functions static
Manage: allow manual control of external raid0 readonly flag
Provide a mdstat_ent to subarray helper

Assemble.c | 12
Create.c | 21
Detail.c | 22
Grow.c | 1285 ++++++++++++++++--
Makefile | 6
Manage.c | 155 ++
external-reshape-design.txt | 168 ++
managemon.c | 266 ++++
mdadm.c | 2
mdadm.h | 113 +-
mdmon.h | 16
mdstat.c | 11
monitor.c | 112 ++
msg.c | 229 +++
msg.h | 4
restripe.c | 47 -
super-ddf.c | 11
super-intel.c | 3125 ++++++++++++++++++++++++++++++++++++++++++-
sysfs.c | 180 ++
util.c | 241 +++
20 files changed, 5722 insertions(+), 304 deletions(-)
create mode 100644 external-reshape-design.txt

--
Signature
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH 02/53] block monitor: freeze spare assignment for external arrays

on 26.11.2010 09:04:07 by adam.kwolek

In order to support reshape and atomic removal of spares from containers
we need to prevent mdmon from activating spares. In the reshape case we
additionally need to freeze sync_action while the reshape transaction is
initiated with the kernel and recorded in the metadata.

When reshaping a raid0 array we need to freeze the array *before* it is
transitioned to a redundant raid level. Since sync_action does not exist
at this point we extend the '-' prefix of a subarray string to flag
mdmon not to activate spares.
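
As a rough illustration of the flag described above: the patch below tests and sets the character at index 9 of the metadata_version string, i.e. the first character after the "external:" prefix. The helper names and the example device path in the comments are hypothetical; only the prefix and the '-' / '/' convention come from the patch.

```c
#include <assert.h>
#include <string.h>

#define EXTERNAL_PREFIX "external:"	/* 9 characters, so the flag is at index 9 */

/* Returns 1 if the subarray is flagged "do not touch" ('-'),
 * 0 if it is not ('/'), and -1 if the string is not an
 * external metadata version at all. (hypothetical helper name) */
int subarray_is_blocked(const char *metadata_version)
{
	size_t n = strlen(EXTERNAL_PREFIX);

	assert(metadata_version != NULL);
	if (strncmp(metadata_version, EXTERNAL_PREFIX, n) != 0)
		return -1;
	return metadata_version[n] == '-';
}

/* Flip the flag in place, e.g. "external:/md127/0" <-> "external:-md127/0"
 * (the path is an illustrative example). Returns 0 on success, -1 if
 * the string is not an external subarray version. (hypothetical helper name) */
int subarray_set_blocked(char *metadata_version, int blocked)
{
	size_t n = strlen(EXTERNAL_PREFIX);

	assert(metadata_version != NULL);
	if (strncmp(metadata_version, EXTERNAL_PREFIX, n) != 0 ||
	    metadata_version[n] == '\0')
		return -1;
	metadata_version[n] = blocked ? '-' : '/';
	return 0;
}
```

The sketch only manipulates the string; in the patch the modified string is written back through sysfs_set_str() and followed by ping_monitor() so that mdmon notices the change.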

Mdadm needs to be reasonably certain that the version of mdmon in the
system honors this 'freeze' indication. If mdmon is not already active
then we assume the version that gets started is the same as the mdadm
version. Otherwise, we check the version of mdmon as returned by the
extended ping_monitor() operation. This is to catch cases where mdadm
is upgraded in the filesystem, but mdmon started in the initramfs is
from a previous release.
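
The version comparison relies on packing "a.b.c" into one integer, a*1000000 + b*1000 + c, which is what mdadm_version() in the patch below returns. The standalone sketch here mirrors that parsing so the arithmetic can be checked in isolation; the function name and the example version string are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse a string of the form "<name> - v<a>.<b>.<c> <rest>" and pack
 * the three numbers as a*1000000 + b*1000 + c. Returns -1 if the
 * string does not have that shape. (hypothetical helper name) */
int pack_version(const char *version)
{
	const char *cp;
	char *end;
	unsigned long a, b, c;

	assert(version != NULL);
	cp = strchr(version, '-');
	if (!cp || cp[1] != ' ' || cp[2] != 'v')
		return -1;
	a = strtoul(cp + 3, &end, 10);
	if (*end != '.')
		return -1;
	b = strtoul(end + 1, &end, 10);
	if (*end != '.')
		return -1;
	c = strtoul(end + 1, &end, 10);
	if (*end != ' ')
		return -1;
	return (int)(a * 1000000 + b * 1000 + c);
}
```

For an illustrative string such as "mdadm - v3.1.4 - 31st August 2010" this gives 3*1000000 + 1*1000 + 4 = 3001004, which passes the `ver < 3001003` check in block_monitor(), while a 3.1.2 mdmon (3001002) would be rejected.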

Signed-off-by: Dan Williams
---

managemon.c | 19 +++++-
mdadm.h | 4 +
msg.c | 195 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
msg.h | 2 +
sysfs.c | 33 ++++++++++
util.c | 24 +++++++
6 files changed, 273 insertions(+), 4 deletions(-)

diff --git a/managemon.c b/managemon.c
index 544c4a6..164e4f8 100644
--- a/managemon.c
+++ b/managemon.c
@@ -394,12 +394,21 @@ static void manage_member(struct mdstat_ent *mdstat,
* trying to find and assign a spare.
* We do that whenever the monitor tells us too.
*/
+ char buf[64];
+ int frozen;
+
// FIXME
a->info.array.raid_disks = mdstat->raid_disks;
a->info.array.chunk_size = mdstat->chunk_size;
// MORE

- if (a->check_degraded) {
+ /* honor 'frozen' */
+ if (sysfs_get_str(&a->info, NULL, "metadata_version", buf, sizeof(buf)) > 0)
+ frozen = buf[9] == '-';
+ else
+ frozen = 1; /* can't read metadata_version assume the worst */
+
+ if (a->check_degraded && !frozen) {
struct metadata_update *updates = NULL;
struct mdinfo *newdev = NULL;
struct active_array *newa;
@@ -656,7 +665,13 @@ void read_sock(struct supertype *container)
/* read and validate the message */
if (receive_message(fd, &msg, tmo) == 0) {
handle_message(container, &msg);
- if (ack(fd, tmo) < 0)
+ if (msg.len == 0) {
+ /* ping reply with version */
+ msg.buf = Version;
+ msg.len = strlen(Version) + 1;
+ if (send_message(fd, &msg, tmo) < 0)
+ terminate = 1;
+ } else if (ack(fd, tmo) < 0)
terminate = 1;
} else
terminate = 1;
diff --git a/mdadm.h b/mdadm.h
index 9787f9e..f7172e9 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -436,6 +436,8 @@ extern int sysfs_fd_get_ll(int fd, unsigned long long *val);
extern int sysfs_get_ll(struct mdinfo *sra, struct mdinfo *dev,
char *name, unsigned long long *val);
extern int sysfs_fd_get_str(int fd, char *val, int size);
+extern int sysfs_attribute_available(struct mdinfo *sra, struct mdinfo *dev,
+ char *name);
extern int sysfs_get_str(struct mdinfo *sra, struct mdinfo *dev,
char *name, char *val, int size);
extern int sysfs_set_safemode(struct mdinfo *sra, unsigned long ms);
@@ -443,6 +445,7 @@ extern int sysfs_set_array(struct mdinfo *info, int vers);
extern int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume);
extern int sysfs_disk_to_scsi_id(int fd, __u32 *id);
extern int sysfs_unique_holder(int devnum, long rdev);
+extern int sysfs_freeze_array(struct mdinfo *sra);
extern int load_sys(char *path, char *buf);

@@ -847,6 +850,7 @@ extern unsigned long bitmap_sectors(struct bitmap_super_s *bsb);

extern int md_get_version(int fd);
extern int get_linux_version(void);
+extern int mdadm_version(char *version);
extern long long parse_size(char *size);
extern int parse_uuid(char *str, int uuid[4]);
extern int parse_layout_10(char *layout);
diff --git a/msg.c b/msg.c
index aabfa8f..8e7ebfd 100644
--- a/msg.c
+++ b/msg.c
@@ -135,7 +135,15 @@ int ack(int fd, int tmo)
int wait_reply(int fd, int tmo)
{
struct metadata_update msg;
- return receive_message(fd, &msg, tmo);
+ int err = receive_message(fd, &msg, tmo);
+
+ /* mdmon sent extra data, but caller only cares that we got a
+ * successful reply
+ */
+ if (err == 0 && msg.len > 0)
+ free(msg.buf);
+
+ return err;
}

int connect_monitor(char *devname)
@@ -195,7 +203,6 @@ int fping_monitor(int sfd)
return err;
}

-
/* give the monitor a chance to update the metadata */
int ping_monitor(char *devname)
{
@@ -206,6 +213,190 @@ int ping_monitor(char *devname)
return err;
}

+static char *ping_monitor_version(char *devname)
+{
+ int sfd = connect_monitor(devname);
+ struct metadata_update msg;
+ int err = 0;
+
+ if (sfd < 0)
+ return NULL;
+
+ if (ack(sfd, 20) != 0)
+ err = -1;
+
+ if (!err && receive_message(sfd, &msg, 20) != 0)
+ err = -1;
+
+ close(sfd);
+
+ if (err || !msg.len || !msg.buf)
+ return NULL;
+ return msg.buf;
+}
+
+static int unblock_subarray(struct mdinfo *sra, const int unfreeze)
+{
+ char buf[64];
+ int rc = 0;
+
+ if (sra) {
+ sprintf(buf, "external:%s\n", sra->text_version);
+ buf[9] = '/';
+ } else
+ buf[9] = '-';
+
+ if (buf[9] == '-' ||
+ sysfs_set_str(sra, NULL, "metadata_version", buf) ||
+ (unfreeze &&
+ sysfs_attribute_available(sra, NULL, "sync_action") &&
+ sysfs_set_str(sra, NULL, "sync_action", "idle")))
+ rc = -1;
+ return rc;
+}
+
+/**
+ * block_monitor - prevent mdmon spare assignment
+ * @container - container to block
+ * @freeze - flag to additionally freeze sync_action
+ *
+ * This is used by the reshape code to freeze the container, and the
+ * auto-rebuild implementation to atomically move spares. For reshape
+ * we need to freeze sync_action in the auto-rebuild we only need to
+ * block new spare assignment, existing rebuilds can continue
+ */
+int block_monitor(char *container, const int freeze)
+{
+ int devnum = devname2devnum(container);
+ struct mdstat_ent *ent, *e, *e2;
+ struct mdinfo *sra = NULL;
+ char *version = NULL;
+ char buf[64];
+ int rv = 0;
+
+ if (!mdmon_running(devnum)) {
+ /* if mdmon is not active we assume that any instance that is
+ * later started will match the current mdadm version, if this
+ * assumption is violated we may inadvertantly rebuild an array
+ * that was meant for reshape, or start rebuild on a spare that
+ * was to be moved to another container
+ */
+ /* pass */;
+ } else {
+ int ver;
+
+ version = ping_monitor_version(container);
+ ver = version ? mdadm_version(version) : -1;
+ free(version);
+ if (ver < 3001003) {
+ fprintf(stderr, Name
+ ": mdmon instance for %s cannot be disabled\n",
+ container);
+ return -1;
+ }
+ }
+
+ ent = mdstat_read(0, 0);
+ if (!ent) {
+ fprintf(stderr, Name
+ ": failed to read /proc/mdstat while disabling mdmon\n");
+ return -1;
+ }
+
+ /* freeze container contents */
+ for (e = ent; e; e = e->next) {
+ if (!is_container_member(e, container))
+ continue;
+ sysfs_free(sra);
+ sra = sysfs_read(-1, e->devnum, GET_VERSION);
+ if (!sra) {
+ fprintf(stderr, Name
+ ": failed to read sysfs for subarray%s\n",
+ to_subarray(e, container));
+ break;
+ }
+ /* can't reshape an array that we can't monitor */
+ if (sra->text_version[0] == '-')
+ break;
+
+ if (freeze && sysfs_freeze_array(sra) < 1)
+ break;
+ /* flag this array to not be modified by mdmon (close race with
+ * takeover in reshape case and spare reassignment in the
+ * auto-rebuild case)
+ */
+ sprintf(buf, "external:%s\n", sra->text_version);
+ buf[9] = '-';
+ if (sysfs_set_str(sra, NULL, "metadata_version", buf))
+ break;
+ ping_monitor(container);
+
+ /* check that we did not race with recovery */
+ if ((freeze &&
+ !sysfs_attribute_available(sra, NULL, "sync_action")) ||
+ (freeze &&
+ sysfs_attribute_available(sra, NULL, "sync_action") &&
+ sysfs_get_str(sra, NULL, "sync_action", buf, 20) > 0 &&
+ strcmp(buf, "frozen\n") == 0))
+ /* pass */;
+ else
+ break;
+ }
+
+ if (e) {
+ fprintf(stderr, Name ": failed to freeze subarray%s\n",
+ to_subarray(e, container));
+
+ /* thaw the partially frozen container */
+ for (e2 = ent; e2 && e2 != e; e2 = e2->next) {
+ if (!is_container_member(e2, container))
+ continue;
+ sysfs_free(sra);
+ sra = sysfs_read(-1, e2->devnum, GET_VERSION);
+ if (unblock_subarray(sra, freeze))
+ fprintf(stderr, Name ": Failed to unfreeze %s\n", e2->dev);
+ }
+
+ ping_monitor(container); /* cleared frozen */
+ rv = -1;
+ }
+
+ sysfs_free(sra);
+ free_mdstat(ent);
+ free(container);
+
+ return rv;
+}
+
+void unblock_monitor(char *container, const int unfreeze)
+{
+ struct mdstat_ent *ent, *e;
+ struct mdinfo *sra = NULL;
+
+ ent = mdstat_read(0, 0);
+ if (!ent) {
+ fprintf(stderr, Name
+ ": failed to read /proc/mdstat while unblocking container\n");
+ return;
+ }
+
+ /* unfreeze container contents */
+ for (e = ent; e; e = e->next) {
+ if (!is_container_member(e, container))
+ continue;
+ sysfs_free(sra);
+ sra = sysfs_read(-1, e->devnum, GET_VERSION);
+ if (unblock_subarray(sra, unfreeze))
+ fprintf(stderr, Name ": Failed to unfreeze %s\n", e->dev);
+ }
+ ping_monitor(container);
+
+ sysfs_free(sra);
+ free_mdstat(ent);
+}
+
+
+
/* give the manager a chance to view the updated container state. This
* would naturally happen due to the manager noticing a change in
* /proc/mdstat; however, pinging encourages this detection to happen
diff --git a/msg.h b/msg.h
index f8e89fd..1f916de 100644
--- a/msg.h
+++ b/msg.h
@@ -27,6 +27,8 @@ extern int ack(int fd, int tmo);
extern int wait_reply(int fd, int tmo);
extern int connect_monitor(char *devname);
extern int ping_monitor(char *devname);
+extern int block_monitor(char *container, const int freeze);
+extern void unblock_monitor(char *container, const int unfreeze);
extern int fping_monitor(int sock);
extern int ping_manager(char *devname);

diff --git a/sysfs.c b/sysfs.c
index 6e1d77b..3582fed 100644
--- a/sysfs.c
+++ b/sysfs.c
@@ -435,6 +435,17 @@ int sysfs_uevent(struct mdinfo *sra, char *event)
return 0;
}

+int sysfs_attribute_available(struct mdinfo *sra, struct mdinfo *dev, char *name)
+{
+ char fname[60];
+ struct stat st;
+
+ sprintf(fname, "/sys/block/%s/md/%s/%s",
+ sra->sys_name, dev?dev->sys_name:"", name);
+
+ return stat(fname, &st) == 0;
+}
+
int sysfs_get_fd(struct mdinfo *sra, struct mdinfo *dev,
char *name)
{
@@ -789,6 +800,28 @@ int sysfs_unique_holder(int devnum, long rdev)
return found;
}

+int sysfs_freeze_array(struct mdinfo *sra)
+{
+ /* Try to freeze resync/rebuild on this array/container.
+ * Return -1 if the array is busy,
+ * return -2 container cannot be frozen,
+ * return 0 if this kernel doesn't support 'frozen'
+ * return 1 if it worked.
+ */
+ char buf[20];
+
+ if (!sysfs_attribute_available(sra, NULL, "sync_action"))
+ return 1; /* no sync_action == frozen */
+ if (sysfs_get_str(sra, NULL, "sync_action", buf, 20) <= 0)
+ return 0;
+ if (strcmp(buf, "idle\n") != 0 &&
+ strcmp(buf, "frozen\n") != 0)
+ return -1;
+ if (sysfs_set_str(sra, NULL, "sync_action", "frozen") < 0)
+ return 0;
+ return 1;
+}
+
#ifndef MDASSEMBLE

static char *clean_states[] = {
diff --git a/util.c b/util.c
index 6f1c1d2..5f2694e 100644
--- a/util.c
+++ b/util.c
@@ -216,6 +216,30 @@ int get_linux_version()
return (a*1000000)+(b*1000)+c;
}

+int mdadm_version(char *version)
+{
+ int a, b, c;
+ char *cp;
+
+ if (!version)
+ version = Version;
+
+ cp = strchr(version, '-');
+ if (!cp || *(cp+1) != ' ' || *(cp+2) != 'v')
+ return -1;
+ cp += 3;
+ a = strtoul(cp, &cp, 10);
+ if (*cp != '.')
+ return -1;
+ b = strtoul(cp+1, &cp, 10);
+ if (*cp != '.')
+ return -1;
+ c = strtoul(cp+1, &cp, 10);
+ if (*cp != ' ')
+ return -1;
+ return (a*1000000)+(b*1000)+c;
+}
+
#ifndef MDASSEMBLE
long long parse_size(char *size)
{


[PATCH 03/53] Manage: allow manual control of external raid0 readonly flag

on 26.11.2010 09:04:15 by adam.kwolek

From: Dan Williams

mdadm --readwrite will clear the external readonly flag ('-'
to '/'), but only for redundant arrays. Allow raid0 arrays as well so
the user has a simple helper to control this flag.

Signed-off-by: Dan Williams
---

Manage.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/Manage.c b/Manage.c
index 6e9d4a0..ac9415b 100644
--- a/Manage.c
+++ b/Manage.c
@@ -56,7 +56,6 @@ int Manage_ro(char *devname, int fd, int readonly)
mdi = sysfs_read(fd, -1, GET_LEVEL|GET_VERSION);
if (mdi &&
mdi->array.major_version == -1 &&
- mdi->array.level > 0 &&
is_subarray(mdi->text_version)) {
char vers[64];
strcpy(vers, "external:");


[PATCH 04/53] Grow: mark some functions static

on 26.11.2010 09:04:22 by adam.kwolek

From: Dan Williams

While going through the Grow API I found some local routines that could be
marked static.

Signed-off-by: Dan Williams
---

Grow.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/Grow.c b/Grow.c
index 0571f5b..f16228d 100644
--- a/Grow.c
+++ b/Grow.c
@@ -409,7 +409,7 @@ static struct mdp_backup_super {
__u8 pad[512-68-32];
} __attribute__((aligned(512))) bsb, bsb2;

-__u32 bsb_csum(char *buf, int len)
+static __u32 bsb_csum(char *buf, int len)
{
int i;
int csum = 0;
@@ -432,7 +432,7 @@ static int child_same_size(int afd, struct mdinfo *sra, unsigned long blocks,
int disks, int chunk, int level, int layout, int data,
int dests, int *destfd, unsigned long long *destoffsets);

-int freeze_array(struct mdinfo *sra)
+static int freeze_array(struct mdinfo *sra)
{
/* Try to freeze resync on this array.
* Return -1 if the array is busy,
@@ -450,14 +450,14 @@ int freeze_array(struct mdinfo *sra)
return 1;
}

-void unfreeze_array(struct mdinfo *sra, int frozen)
+static void unfreeze_array(struct mdinfo *sra, int frozen)
{
/* If 'frozen' is 1, unfreeze the array */
if (frozen > 0)
sysfs_set_str(sra, NULL, "sync_action", "idle");
}

-void wait_reshape(struct mdinfo *sra)
+static void wait_reshape(struct mdinfo *sra)
{
int fd = sysfs_get_fd(sra, NULL, "sync_action");
char action[20];
@@ -1266,7 +1266,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
*/

/* FIXME return status is never checked */
-int grow_backup(struct mdinfo *sra,
+static int grow_backup(struct mdinfo *sra,
unsigned long long offset, /* per device */
unsigned long stripes, /* per device */
int *sources, unsigned long long *offsets,
@@ -1381,7 +1381,7 @@ int grow_backup(struct mdinfo *sra,
* every works.
*/
/* FIXME return value is often ignored */
-int wait_backup(struct mdinfo *sra,
+static int wait_backup(struct mdinfo *sra,
unsigned long long offset, /* per device */
unsigned long long blocks, /* per device */
unsigned long long blocks2, /* per device - hack */


[PATCH 05/53] Assemble: fix assembly in the delta_disks > max_degraded case

on 26.11.2010 09:04:30 by adam.kwolek

From: Dan Williams

Incremental assembly works on such an array because the kernel sees the
disk as in-sync and that the array is reshaping. Teach Assemble() the
same assumptions.

This is only needed on kernels that do not initialize ->recovery_offset
when activating spares for reshape.

Signed-off-by: Dan Williams
---

Assemble.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/Assemble.c b/Assemble.c
index afd4e60..409f0d7 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -804,7 +804,9 @@ int Assemble(struct supertype *st, char *mddev,
devices[most_recent].i.events) {
devices[j].uptodate = 1;
if (i < content->array.raid_disks) {
- if (devices[j].i.recovery_start == MaxSector) {
+ if (devices[j].i.recovery_start == MaxSector ||
+ (content->reshape_active &&
+ j >= content->array.raid_disks - content->delta_disks)) {
okcnt++;
avail[i]=1;
} else


[PATCH 06/53] Grow: fix check for raid6 layout normalization

on 26.11.2010 09:04:38 by adam.kwolek

From: Dan Williams

If the user does not specify a layout, don't skip asking about retaining
the non-standard raid6 layout which may be implicitly changed.

Signed-off-by: Dan Williams
---

Grow.c | 11 ++++++-----
1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/Grow.c b/Grow.c
index f16228d..bf634d3 100644
--- a/Grow.c
+++ b/Grow.c
@@ -706,9 +706,9 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,

/* ========= set shape (chunk_size / layout / ndisks) ============== */
/* Check if layout change is a no-op */
- if (layout_str) switch(array.level) {
+ switch(array.level) {
case 5:
- if (array.layout == map_name(r5layout, layout_str))
+ if (layout_str && array.layout == map_name(r5layout, layout_str))
layout_str = NULL;
break;
case 6:
@@ -724,8 +724,9 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
rv = 1;
goto release;
}
- if (strcmp(layout_str, "normalise") == 0 ||
- strcmp(layout_str, "normalize") == 0) {
+ if (layout_str &&
+ (strcmp(layout_str, "normalise") == 0 ||
+ strcmp(layout_str, "normalize") == 0)) {
char *hyphen;
strcpy(alt_layout, map_num(r6layout, array.layout));
hyphen = strrchr(alt_layout, '-');
@@ -735,7 +736,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
}
}

- if (array.layout == map_name(r6layout, layout_str))
+ if (layout_str && array.layout == map_name(r6layout, layout_str))
layout_str = NULL;
if (layout_str && strcmp(layout_str, "preserve") == 0)
layout_str = NULL;


[PATCH 07/53] Grow: add missing raid4 geometries to geo_map()

on 26.11.2010 09:04:45 by adam.kwolek

From: Dan Williams

They are equivalent to their raid5 versions and let the reshape code
optionally use either.

Signed-off-by: Dan Williams
---

restripe.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/restripe.c b/restripe.c
index 3074693..c2fbe5b 100644
--- a/restripe.c
+++ b/restripe.c
@@ -46,6 +46,7 @@ static int geo_map(int block, unsigned long long stripe, int raid_disks,
switch(level*100 + layout) {
case 000:
case 400:
+ case 400 + ALGORITHM_PARITY_N:
case 500 + ALGORITHM_PARITY_N:
/* raid 4 isn't messed around by parity blocks */
if (block == -1)
@@ -75,6 +76,7 @@ static int geo_map(int block, unsigned long long stripe, int raid_disks,
if (block == -1) return pd;
return (pd + 1 + block) % raid_disks;

+ case 400 + ALGORITHM_PARITY_0:
case 500 + ALGORITHM_PARITY_0:
return block + 1;


[PATCH 08/53] fix a get_linux_version() comparison typo

on 26.11.2010 09:04:52 by adam.kwolek

From: Dan Williams

Signed-off-by: Dan Williams
---

mdadm.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mdadm.c b/mdadm.c
index 08e8ea4..e3361ed 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1484,7 +1484,7 @@ int main(int argc, char *argv[])
break;
}
if (delay == 0) {
- if (get_linux_version() > 20616)
+ if (get_linux_version() > 2006016)
/* mdstat responds to poll */
delay = 1000;
else


[PATCH 09/53] Create: cleanup/unify default geometry handling

on 26.11.2010 09:05:00 by adam.kwolek

From: Dan Williams

Support metadata-specific level, layout and chunksize defaults. Kill
unneeded superswitch methods ahead of adding more for the reshape case.
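
The unified callback takes three optional out-parameters and only fills in values that are still unset, so a caller can ask for any subset by passing NULL for the rest. The sketch below shows that calling convention in isolation; the *_SKETCH constants, the default layout of 0 and the default chunk of 128 are placeholders chosen for illustration, not mdadm's values.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder sentinels; mdadm's UnSet and LEVEL_CONTAINER values
 * are not reproduced here. */
#define UNSET_SKETCH		(-2)
#define LEVEL_CONTAINER_SKETCH	(-100)

/* Hypothetical handler following the shape of default_geometry():
 * each pointer may be NULL, and only unset values are overwritten. */
void default_geometry_sketch(int *level, int *layout, int *chunk)
{
	if (level && *level == UNSET_SKETCH)
		*level = LEVEL_CONTAINER_SKETCH;

	/* the layout default depends on the level, so both are needed */
	if (level && layout && *layout == UNSET_SKETCH)
		*layout = 0;	/* placeholder default layout */

	if (chunk && (*chunk == UNSET_SKETCH || *chunk == 0))
		*chunk = 128;	/* placeholder default chunk, in KiB */
}
```

A caller that only wants the chunk default passes NULL for the other two pointers, which is how Create() uses the vector in the diff below; a value the user already chose is left untouched.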

Signed-off-by: Dan Williams
---

Create.c | 21 ++++++---------------
mdadm.h | 8 +++-----
super-ddf.c | 11 ++++++++++-
super-intel.c | 15 +++++++++------
4 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/Create.c b/Create.c
index 2bf7ebe..bc2613a 100644
--- a/Create.c
+++ b/Create.c
@@ -31,8 +31,8 @@ static int default_layout(struct supertype *st, int level, int verbose)
{
int layout = UnSet;

- if (st && st->ss->default_layout)
- layout = st->ss->default_layout(level);
+ if (st && st->ss->default_geometry)
+ st->ss->default_geometry(st, &level, &layout, NULL);

if (layout == UnSet)
switch(level) {
@@ -120,15 +120,8 @@ int Create(struct supertype *st, char *mddev,
int major_num = BITMAP_MAJOR_HI;

memset(&info, 0, sizeof(info));
-
- if (level == UnSet) {
- /* "ddf" and "imsm" metadata only supports one level - should possibly
- * push this into metadata handler??
- */
- if (st && (st->ss == &super_ddf || st->ss == &super_imsm))
- level = LEVEL_CONTAINER;
- }
-
+ if (level == UnSet && st && st->ss->default_geometry)
+ st->ss->default_geometry(st, &level, NULL, NULL);
if (level == UnSet) {
fprintf(stderr,
Name ": a RAID level is needed to create an array.\n");
@@ -235,11 +228,9 @@ int Create(struct supertype *st, char *mddev,
case 6:
case 0:
if (chunk == 0) {
- if (st && st->ss->default_chunk)
- chunk = st->ss->default_chunk(st);
-
+ if (st && st->ss->default_geometry)
+ st->ss->default_geometry(st, NULL, NULL, &chunk);
chunk = chunk ? : 512;
-
if (verbose > 0)
fprintf(stderr, Name ": chunk size defaults to %dK\n", chunk);
}
diff --git a/mdadm.h b/mdadm.h
index f7172e9..a4de06f 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -612,7 +612,7 @@ extern struct superswitch {
* added to validate changing size and new devices. If there are
* inter-device dependencies, it should record sufficient details
* so these can be validated.
- * Both 'size' and '*freesize' are in sectors. chunk is bytes.
+ * Both 'size' and '*freesize' are in sectors. chunk is KiB.
*/
int (*validate_geometry)(struct supertype *st, int level, int layout,
int raiddisks,
@@ -621,10 +621,8 @@ extern struct superswitch {
int verbose);

struct mdinfo *(*container_content)(struct supertype *st);
- /* Allow a metadata handler to override mdadm's default layouts */
- int (*default_layout)(int level); /* optional */
- /* query the supertype for default chunk size */
- int (*default_chunk)(struct supertype *st); /* optional */
+ /* query the supertype for default geometry */
+ void (*default_geometry)(struct supertype *st, int *level, int *layout, int *chunk); /* optional */
/* Permit subarray's to be deleted from inactive containers */
int (*kill_subarray)(struct supertype *st); /* optional */
/* Permit subarray's to be modified */
diff --git a/super-ddf.c b/super-ddf.c
index dba5970..772ca97 100644
--- a/super-ddf.c
+++ b/super-ddf.c
@@ -3653,6 +3653,15 @@ static int ddf_level_to_layout(int level)
}
}

+static void default_geometry_ddf(struct supertype *st, int *level, int *layout, int *chunk)
+{
+ if (level && *level == UnSet)
+ *level = LEVEL_CONTAINER;
+
+ if (level && layout && *layout == UnSet)
+ *layout = ddf_level_to_layout(*level);
+}
+
struct superswitch super_ddf = {
#ifndef MDASSEMBLE
.examine_super = examine_super_ddf,
@@ -3680,7 +3689,7 @@ struct superswitch super_ddf = {
.free_super = free_super_ddf,
.match_metadata_desc = match_metadata_desc_ddf,
.container_content = container_content_ddf,
- .default_layout = ddf_level_to_layout,
+ .default_geometry = default_geometry_ddf,

.external = 1,

diff --git a/super-intel.c b/super-intel.c
index b880a74..7c5fcc4 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -4115,14 +4115,18 @@ static int validate_geometry_imsm(struct supertype *st, int level, int layout,
return 0;
}

-static int default_chunk_imsm(struct supertype *st)
+static void default_geometry_imsm(struct supertype *st, int *level, int *layout, int *chunk)
{
struct intel_super *super = st->sb;

- if (!super->orom)
- return 0;
+ if (level && *level == UnSet)
+ *level = LEVEL_CONTAINER;
+
+ if (level && layout && *layout == UnSet)
+ *layout = imsm_level_to_layout(*level);

- return imsm_orom_default_chunk(super->orom);
+ if (chunk && (*chunk == UnSet || *chunk == 0) && super->orom)
+ *chunk = imsm_orom_default_chunk(super->orom);
}

static void handle_missing(struct intel_super *super, struct imsm_dev *dev);
@@ -5567,7 +5571,6 @@ struct superswitch super_imsm = {
.brief_detail_super = brief_detail_super_imsm,
.write_init_super = write_init_super_imsm,
.validate_geometry = validate_geometry_imsm,
- .default_chunk = default_chunk_imsm,
.add_to_super = add_to_super_imsm,
.detail_platform = detail_platform_imsm,
.kill_subarray = kill_subarray_imsm,
@@ -5588,7 +5591,7 @@ struct superswitch super_imsm = {
.free_super = free_super_imsm,
.match_metadata_desc = match_metadata_desc_imsm,
.container_content = container_content_imsm,
- .default_layout = imsm_level_to_layout,
+ .default_geometry = default_geometry_imsm,

.external = 1,
.name = "imsm",
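
The patch above folds two optional callbacks into a single ->default_geometry() call with nullable out-parameters. The calling convention can be sketched in isolation as follows. This is a minimal sketch, not mdadm code: demo_handler, demo_default_geometry(), demo_pick_chunk(), and the DEMO_* constants are hypothetical stand-ins, and the constant values are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for mdadm's UnSet and LEVEL_CONTAINER; values are illustrative. */
#define DEMO_UNSET            (-1)
#define DEMO_LEVEL_CONTAINER  (-100)

/* Hypothetical handler state: a platform default chunk in KiB, 0 if unknown. */
struct demo_handler {
	int default_chunk_kib;
};

/* Fill in only the fields the caller asked about (non-NULL pointers)
 * and only when the caller has not already chosen a value.
 */
static void demo_default_geometry(const struct demo_handler *h,
				  int *level, int *layout, int *chunk)
{
	assert(h != NULL);

	if (level && *level == DEMO_UNSET)
		*level = DEMO_LEVEL_CONTAINER;
	if (level && layout && *layout == DEMO_UNSET)
		*layout = 0;	/* a real handler would derive this from *level */
	if (chunk && (*chunk == DEMO_UNSET || *chunk == 0) &&
	    h->default_chunk_kib > 0)
		*chunk = h->default_chunk_kib;
}

/* Caller side: ask the handler for a chunk only, then fall back to 512K. */
static int demo_pick_chunk(const struct demo_handler *h, int requested_kib)
{
	int chunk = requested_kib;

	if (chunk == 0) {
		if (h)
			demo_default_geometry(h, NULL, NULL, &chunk);
		if (chunk == 0)
			chunk = 512;
	}
	return chunk;
}
```

With a handler whose default is 128, demo_pick_chunk(&h, 0) yields 128; with no handler, or a handler that has no opinion, the generic 512K default applies. Passing NULL for the fields that are not of interest keeps one entry point usable from each call site in Create().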

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH 10/53] Initialize st->devnum and st->container_dev in

on 26.11.2010 09:05:07 by adam.kwolek

Initializing these fields in super_by_fd() precludes needing to deduce this
information later, as Detail.c does today and Grow.c soon will.
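
The deduction being centralized is the split of a subarray's version text into a container name and a member name. A minimal sketch of that split follows, assuming the text has the form "/<container>/<member>" (for example "/md127/0"). demo_member and demo_split_member() are hypothetical names, not mdadm functions, and the 32-byte buffers are an arbitrary choice for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result type for the example. */
struct demo_member {
	char container[32];	/* e.g. "md127" */
	char subarray[32];	/* e.g. "0" */
};

/* Split "/<container>/<member>" into its two parts.
 * Returns 0 on success, -1 if the text does not have that form
 * or a part does not fit in the output buffers.
 */
static int demo_split_member(const char *text_version, struct demo_member *out)
{
	const char *name, *slash;
	size_t len;

	assert(out != NULL);

	if (!text_version || text_version[0] != '/')
		return -1;
	name = text_version + 1;
	slash = strchr(name, '/');
	if (!slash || slash == name || slash[1] == '\0')
		return -1;

	len = (size_t)(slash - name);
	if (len >= sizeof(out->container) ||
	    strlen(slash + 1) >= sizeof(out->subarray))
		return -1;

	memcpy(out->container, name, len);
	out->container[len] = '\0';
	snprintf(out->subarray, sizeof(out->subarray), "%s", slash + 1);
	return 0;
}
```

Unlike the code removed from Detail.c, this sketch does not write a '\0' into the input string, so the caller's copy of the version text stays intact.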

Signed-off-by: Dan Williams
Signed-off-by: Adam Kwolek
---

Detail.c | 21 ++++++++++-----------
util.c | 23 ++++++++++++++---------
2 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/Detail.c b/Detail.c
index e0817aa..0fb90e8 100644
--- a/Detail.c
+++ b/Detail.c
@@ -97,16 +97,13 @@ int Detail(char *dev, int brief, int export, int test, char *homehost)
if (st)
max_disks = st->max_devs;

- if (sra && is_subarray(sra->text_version) &&
- strchr(sra->text_version+1, '/')) {
+ if (st && st->subarray[0]) {
/* This is a subarray of some container.
* We want the name of the container, and the member
*/
- char *s = strchr(sra->text_version+1, '/');
- int dn;
- *s++ = '\0';
- member = s;
- dn = devname2devnum(sra->text_version+1);
+ int dn = st->container_dev;
+
+ member = st->subarray;
container = map_dev(dev2major(dn), dev2minor(dn), 1);
}

@@ -417,7 +414,7 @@ int Detail(char *dev, int brief, int export, int test, char *homehost)
}
free_mdstat(ms);

- if (st->sb && info.reshape_active) {
+ if (st && st->sb && info.reshape_active) {
#if 0
This is pretty boring
printf(" Reshape pos'n : %llu%s\n", (unsigned long long) info.reshape_progress<<9,
@@ -567,9 +564,11 @@ This is pretty boring
if (!brief) printf("\n");
}
if (spares && brief && array.raid_disks) printf(" spares=%d", spares);
- if (brief && st && st->sb)
- st->ss->brief_detail_super(st);
- st->ss->free_super(st);
+ if (st) {
+ if (brief && st->sb)
+ st->ss->brief_detail_super(st);
+ st->ss->free_super(st);
+ }

if (brief > 1 && devices) printf("\n devices=%s", devices);
if (brief) printf("\n");
diff --git a/util.c b/util.c
index 5f2694e..8739278 100644
--- a/util.c
+++ b/util.c
@@ -1085,9 +1085,10 @@ struct supertype *super_by_fd(int fd)
struct supertype *st = NULL;
struct mdinfo *sra;
char *verstr;
- char version[20];
+ char version[30];
int i;
char *subarray = NULL;
+ int container = NoMdDev;

sra = sysfs_read(fd, 0, GET_VERSION);

@@ -1109,15 +1110,16 @@ struct supertype *super_by_fd(int fd)
}
if (minor == -2 && is_subarray(verstr)) {
char *dev = verstr+1;
+
subarray = strchr(dev, '/');
- int devnum;
- if (subarray)
+ if (subarray) {
*subarray++ = '\0';
- devnum = devname2devnum(dev);
- subarray = strdup(subarray);
+ subarray = strdup(subarray);
+ }
+ container = devname2devnum(dev);
if (sra)
sysfs_free(sra);
- sra = sysfs_read(-1, devnum, GET_VERSION);
+ sra = sysfs_read(-1, container, GET_VERSION);
if (sra && sra->text_version[0])
verstr = sra->text_version;
else
@@ -1132,12 +1134,15 @@ struct supertype *super_by_fd(int fd)
if (st) {
st->sb = NULL;
if (subarray) {
- strncpy(st->subarray, subarray, 32);
- st->subarray[31] = 0;
- free(subarray);
+ strncpy(st->subarray, subarray, sizeof(st->subarray));
+ st->subarray[sizeof(st->subarray) - 1] = 0;
} else
st->subarray[0] = 0;
+ st->container_dev = container;
+ st->devnum = fd2devnum(fd);
}
+ if (subarray)
+ free(subarray);
return st;
}
#endif /* !defined(MDASSEMBLE) || defined(MDASSEMBLE) && defined(MDASSEMBLE_AUTO) */


[PATCH 11/53] Document the external reshape implementation

on 26.11.2010 09:05:15 by adam.kwolek

Signed-off-by: Dan Williams
---

external-reshape-design.txt | 168 +++++++++++++++++++++++++++++++++++++++++++
1 files changed, 168 insertions(+), 0 deletions(-)
create mode 100644 external-reshape-design.txt

diff --git a/external-reshape-design.txt b/external-reshape-design.txt
new file mode 100644
index 0000000..28e3434
--- /dev/null
+++ b/external-reshape-design.txt
@@ -0,0 +1,168 @@
+External Reshape
+
+1 Problem statement
+
+External (third-party metadata) reshape differs from native-metadata
+reshape in three key ways:
+
+1.1 Format specific constraints
+
+In the native case reshape is limited by what is implemented in the
+generic reshape routine (Grow_reshape()) and what is supported by the
+kernel. There are exceptional cases where Grow_reshape() may block
+operations when it knows that the kernel implementation is broken, but
+otherwise the kernel is relied upon to be the final arbiter of what
+reshape operations are supported.
+
+In the external case the kernel, and the generic checks in
+Grow_reshape(), become the super-set of what reshapes are possible. The
+metadata format may not support, or have yet to implement a given
+reshape type. The implication for Grow_reshape() is that it must query
+the metadata handler and effect changes in the metadata before the new
+geometry is posted to the kernel. The ->reshape_super method allows
+Grow_reshape() to validate the requested operation and post the metadata
+update.
+
+1.2 Scope of reshape
+
+Native metadata reshape is always performed at the array scope (no
+metadata relationship with sibling arrays on the same disks). External
+reshape, depending on the format, may not allow the number of member
+disks to be changed in a subarray unless the change is simultaneously
+applied to all subarrays in the container. For example the imsm format
+requires all member disks to be a member of all subarrays, so a 4-disk
+raid5 in a container that also houses a 4-disk raid10 array could not be
+reshaped to 5 disks as the imsm format does not support a 5-disk raid10
+representation. This requires the ->reshape_super method to check the
+contents of the array and ask the user to run the reshape at container
+scope (if both subarrays are agreeable to the change), or report an
+error in the case where one subarray cannot support the change.
+
+1.3 Monitoring / checkpointing
+
+Reshape, unlike rebuild/resync, requires strict checkpointing to survive
+interrupted reshape operations. For example when expanding a raid5
+array the first few stripes of the array will be overwritten in a
+destructive manner. When restarting the reshape process we need to know
+the exact location of the last successfully written stripe, and we need
+to restore the data in any partially overwritten stripe. Native
+metadata stores this backup data in the unused portion of spares that
+are being promoted to array members, or in an external backup file
+(located on a non-involved block device).
+
+The kernel is in charge of recording checkpoints of reshape progress,
+but mdadm is delegated the task of managing the backup space which
+involves:
+1/ Identifying what data will be overwritten in the next unit of reshape
+ operation
+2/ Suspending access to that region so that a snapshot of the data can
+ be transferred to the backup space.
+3/ Allowing the kernel to reshape the saved region and setting the
+ boundary for the next backup.
+
+In the external reshape case we want to preserve this mdadm
+'reshape-manager' arrangement, but have a third actor, mdmon, to
+consider. It is tempting to give the role of managing reshape to mdmon,
+but that is counter to its role as a monitor, and conflicts with the
+existing capabilities and role of mdadm to manage the progress of
+reshape. For clarity the external reshape implementation maintains the
+role of mdmon as a (mostly) passive recorder of raid events, and mdadm
+treats it as it would the kernel in the native reshape case (modulo
+needing to send explicit metadata update messages and checking that
+mdmon took the expected action).
+
+External reshape can use the generic md backup file as a fallback, but in the
+optimal/firmware-compatible case the reshape-manager will use the metadata
+specific areas for managing reshape. The implementation also needs to spawn a
+reshape-manager per subarray when the reshape is being carried out at the
+container level. For these two reasons the ->manage_reshape() method is
+introduced. In addition to the base tasks mentioned above, this method:
+1/ Spawns a manager per-subarray, when necessary
+2/ Uses either generic routines in Grow.c for md-style backup file
+ support, or uses the metadata-format specific location for storing
+ recovery data.
+This aims to avoid a "midlayer mistake"[1] and lets the metadata handler
+optionally take advantage of generic infrastructure in Grow.c
+
+2 Details for specific reshape requests
+
+There are quite a few moving pieces spread out across md, mdadm, and mdmon for
+the support of external reshape, and there are several different types of
+reshape that need to be comprehended by the implementation. A rundown of
+these details follows.
+
+2.0 General provisions:
+
+Obtain an exclusive open on the container to make sure we are not
+running concurrently with a Create() event.
+
+2.1 Freezing sync_action
+
+2.2 Reshape size
+
+ 1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
+ initializes st->update_tail
+ 2/ mdadm::Grow_reshape() calls ->reshape_super() to check that the size change
+ is allowed (being performed at subarray scope / enough room) and prepares a
+ metadata update
+ 3/ mdadm::Grow_reshape(): flushes the metadata update (via
+ flush_metadata_update(), or ->sync_metadata())
+ 4/ mdadm::Grow_reshape(): post the new size to the kernel
+
+
+2.3 Reshape level (simple-takeover)
+
+"simple-takeover" implies the level change can be satisfied without touching
+sync_action
+
+ 1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
+ initializes st->update_tail
+ 2/ mdadm::Grow_reshape() calls ->reshape_super() to check that the level change
+ is allowed (being performed at subarray scope) and prepares a
+ metadata update
+ 2a/ raid10 --> raid0: degrade all mirror legs prior to calling
+ ->reshape_super
+ 3/ mdadm::Grow_reshape(): flushes the metadata update (via
+ flush_metadata_update(), or ->sync_metadata())
+ 4/ mdadm::Grow_reshape(): post the new level to the kernel
+
+2.4 Reshape chunk, layout
+
+2.5 Reshape raid disks (grow)
+
+ 1/ mdadm::Grow_reshape(): unconditionally initializes st->update_tail
+ because only redundant raid levels can modify the number of raid disks
+ 2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level
+ change is allowed (being performed at proper scope / permissible
+ geometry / proper spares available in the container) and prepares a
+ metadata update.
+ 3/ mdadm::Grow_reshape(): Converts each subarray in the container to the
+ raid level that can perform the reshape and starts mdmon.
+ 4/ mdadm::Grow_reshape(): Pushes the update to mdmon...
+ 4a/ mdmon::process_update(): marks the array as reshaping
+ 4b/ mdmon::manage_member(): adds the spares (without assigning a slot)
+ 5/ mdadm::Grow_reshape(): Notes that mdmon has assigned spares and invokes
+ ->manage_reshape()
+ 6/ mdadm::->manage_reshape(): (for each subarray) sets sync_max to
+ zero, starts the reshape, and pings mdmon
+ 6a/ mdmon::read_and_act(): notices that reshape has started and notifies
+ the metadata handler to record the slots chosen by the kernel
+ 7/ mdadm::->manage_reshape(): saves data that will be overwritten by
+ the kernel to either the backup file or the metadata specific location,
+ advances sync_max, waits for reshape, pings mdmon, and repeats.
+ 7a/ mdmon::read_and_act(): records checkpoints
+ 8/ mdadm::->manage_reshape(): Once reshape completes changes the raid
+ level back to the nominal raid level (if necessary)
+
+ FIXME: native metadata does not have the capability to record the original
+ raid level in reshape-restart case because the kernel always records current
+ raid level to the metadata, whereas external metadata can masquerade at an
+ alternate level based on the reshape state.
+
+2.6 Reshape raid disks (shrink)
+
+3 TODO
+
+...
+
+[1]: Linux kernel design patterns - part 3, Neil Brown http://lwn.net/Articles/336262/
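
The reshape-manager loop described in section 1.3 of the design document above (identify the next region, suspend and snapshot it, then let the kernel advance) can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: demo_reshape, demo_kernel_reshape(), and demo_manage_reshape() are hypothetical, the "kernel" is a plain function call, and a real implementation would drive suspend_lo, suspend_hi, and sync_max through sysfs and write the backup to disk.

```c
#include <assert.h>
#include <string.h>

#define DEMO_STRIPES 8	/* size of the toy array, in stripes */
#define DEMO_UNIT    2	/* stripes covered by one backup window */

struct demo_reshape {
	int data[DEMO_STRIPES];		/* array contents, one int per stripe */
	int backup[DEMO_UNIT];		/* backup space for the current window */
	int suspend_lo, suspend_hi;	/* region with user I/O suspended */
	int sync_max;			/* how far the "kernel" may reshape */
	int checkpoint;			/* last position the "kernel" recorded */
};

/* Stand-in for the kernel: destructively rewrites stripes [from, to)
 * and records a checkpoint at the end of the region.
 */
static void demo_kernel_reshape(struct demo_reshape *r, int from, int to)
{
	int i;

	for (i = from; i < to; i++)
		r->data[i] += 100;
	r->checkpoint = to;
}

/* Stand-in for the reshape manager. Returns the number of windows handled. */
static int demo_manage_reshape(struct demo_reshape *r)
{
	int pos, end, windows = 0;

	assert(r != NULL);

	r->sync_max = 0;	/* the kernel may not move until data is saved */
	for (pos = 0; pos < DEMO_STRIPES; pos += DEMO_UNIT) {
		end = pos + DEMO_UNIT;
		if (end > DEMO_STRIPES)
			end = DEMO_STRIPES;

		/* 1/ identify what will be overwritten, 2/ suspend access
		 * to it and copy it to the backup space
		 */
		r->suspend_lo = pos;
		r->suspend_hi = end;
		memcpy(r->backup, &r->data[pos],
		       (size_t)(end - pos) * sizeof(r->data[0]));

		/* 3/ allow the kernel to reshape the saved region */
		r->sync_max = end;
		demo_kernel_reshape(r, pos, r->sync_max);
		windows++;
	}
	r->suspend_lo = r->suspend_hi = 0;	/* resume normal I/O */
	return windows;
}
```

After a crash, the pair (checkpoint, backup) is what a restart would need: the checkpoint says where the kernel stopped, and the backup holds the original contents of the window that may have been partially overwritten.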


[PATCH 12/53] External reshape (step 1): container reshape and

on 26.11.2010 09:05:22 by adam.kwolek

In the native metadata case Grow_reshape() and the kernel validate what
reshapes are possible / supported and the kernel handles all the metadata
updates. In the external case the metadata format may have specific
constraints above this baseline. External formats also introduce the
constraint of only permitting some reshapes at container scope versus subarray
scope. For example, imsm changes to 'raiddisks' must be applied to all arrays
in the container.

This operation assumes that its 'st' parameter has been obtained from
super_by_fd() (such that st->subarray is up to date), and that a snapshot of
the metadata has been loaded from the container.

Why a new method, versus extending an existing one?
->validate_geometry: this routine assumes it is being called from Create();
adding reshape complicates the cases that this routine needs to handle. Where
we find that checks can be shared between the two cases, those routines are
refactored into common code internal to the metadata handler, i.e. no need to
provide a unified external interface. ->validate_geometry() also does not
expect to update the metadata.

->update_super: this is meant to update single fields at Assembly() and only at
the container scope. Reshape potentially wants to update multiple fields at
either container or subarray scope.
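
The scope rule can be illustrated with a small standalone check. This is a sketch under assumptions, not the imsm handler: demo_supertype, demo_verdict, and demo_check_scope() are hypothetical, and the only rules modelled are "raid_disks changes run at container scope" and "a container accepts nothing but raid_disks changes".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down view of a supertype. */
struct demo_supertype {
	int external;		/* 1 for third-party metadata */
	char subarray[32];	/* empty string: the container itself was opened */
};

enum demo_verdict {
	DEMO_OK = 0,
	DEMO_NEED_CONTAINER_SCOPE = 1,	/* rerun against the container */
	DEMO_NEED_SUBARRAY_SCOPE = 2	/* rerun against a member array */
};

/* Decide whether a request is being made at a scope the format allows. */
static enum demo_verdict demo_check_scope(const struct demo_supertype *st,
					  int changes_raid_disks)
{
	int at_container;

	assert(st != NULL);

	/* nothing extra to check in the native case */
	if (!st->external)
		return DEMO_OK;

	at_container = (st->subarray[0] == '\0');
	if (changes_raid_disks && !at_container)
		return DEMO_NEED_CONTAINER_SCOPE;
	if (!changes_raid_disks && at_container)
		return DEMO_NEED_SUBARRAY_SCOPE;
	return DEMO_OK;
}
```

A real handler would go on to check the geometry of every sibling array and the available spares before preparing a metadata update; the sketch stops at the scope decision.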

Signed-off-by: Dan Williams
Signed-off-by: Adam Kwolek
---

Grow.c | 414 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
mdadm.h | 9 +
2 files changed, 415 insertions(+), 8 deletions(-)

diff --git a/Grow.c b/Grow.c
index bf634d3..3815fad 100644
--- a/Grow.c
+++ b/Grow.c
@@ -474,8 +474,230 @@ static void wait_reshape(struct mdinfo *sra)
}
} while (strncmp(action, "reshape", 7) == 0);
}
-
-
+
+static int reshape_super(struct supertype *st, long long size, int level,
+ int layout, int chunksize, int raid_disks,
+ char *backup_file, char *dev, int verbose)
+{
+ /* nothing extra to check in the native case */
+ if (!st->ss->external)
+ return 0;
+ if (!st->ss->reshape_super ||
+ !st->ss->manage_reshape) {
+ fprintf(stderr, Name ": %s metadata does not support reshape\n",
+ st->ss->name);
+ return 1;
+ }
+
+ return st->ss->reshape_super(st, size, level, layout, chunksize,
+ raid_disks, backup_file, dev, verbose);
+}
+
+static void sync_metadata(struct supertype *st)
+{
+ if (st->ss->external) {
+ if (st->update_tail)
+ flush_metadata_updates(st);
+ else
+ st->ss->sync_metadata(st);
+ }
+}
+
+static int subarray_set_num(char *container, struct mdinfo *sra, char *name, int n)
+{
+ /* when dealing with external metadata subarrays we need to be
+ * prepared to handle EAGAIN. The kernel may need to wait for
+ * mdmon to mark the array active so the kernel can handle
+ * allocations/writeback when preparing the reshape action
+ * (md_allow_write()). We temporarily disable safe_mode_delay
+ * to close a race with the array_state going clean before the
+ * next write to raid_disks / stripe_cache_size
+ */
+ char safe[50];
+ int rc;
+
+ /* only 'raid_disks' and 'stripe_cache_size' trigger md_allow_write */
+ if (strcmp(name, "raid_disks") != 0 &&
+ strcmp(name, "stripe_cache_size") != 0)
+ return sysfs_set_num(sra, NULL, name, n);
+
+ rc = sysfs_get_str(sra, NULL, "safe_mode_delay", safe, sizeof(safe));
+ if (rc <= 0)
+ return -1;
+ sysfs_set_num(sra, NULL, "safe_mode_delay", 0);
+ rc = sysfs_set_num(sra, NULL, name, n);
+ if (rc < 0 && errno == EAGAIN) {
+ ping_monitor(container);
+ /* if we get EAGAIN here then the monitor is not active
+ * so stop trying
+ */
+ rc = sysfs_set_num(sra, NULL, name, n);
+ }
+ sysfs_set_str(sra, NULL, "safe_mode_delay", safe);
+ return rc;
+}
+
+static int reshape_container_raid_disks(char *container, int raid_disks)
+{
+ /* for each subarray switch to a raid level that can
+ * support the reshape, and set raid disks
+ */
+ struct mdstat_ent *ent, *e;
+ int changed = 0, rv = 0, err = 0;
+ struct mdinfo *sub = NULL;
+
+ if (container == NULL)
+ return -1;
+
+ ent = mdstat_read(1, 0);
+ if (!ent) {
+ fprintf(stderr, Name ": unable to read /proc/mdstat\n");
+ return -1;
+ }
+
+ changed = 0;
+ for (e = ent; e; e = e->next) {
+ unsigned int cache;
+ int level, takeover_delta = 0;
+
+ if (!is_container_member(e, container))
+ continue;
+
+ level = map_name(pers, e->level);
+ if (level == 0) {
+ sub = sysfs_read(-1, e->devnum, GET_VERSION);
+ if (!sub)
+ break;
+ /* metadata records 'orig_level' */
+ rv = sysfs_set_num(sub, NULL, "level", 4);
+ if (rv < 0) {
+ err = errno;
+ break;
+ }
+ /* we want spares to be used for capacity
+ * expansion, not rebuild
+ */
+ takeover_delta = 1;
+
+ sysfs_free(sub);
+ level = 4;
+ }
+
+ sub = NULL;
+ switch (level) {
+ default:
+ rv = -1;
+ break;
+ case 4:
+ case 5:
+ case 6:
+ sub = sysfs_read(-1, e->devnum, GET_CHUNK|GET_CACHE);
+ if (!sub)
+ break;
+ cache = (sub->array.chunk_size / 4096) * 4;
+ if (cache > sub->cache_size)
+ rv = subarray_set_num(container, sub,
+ "stripe_cache_size", cache);
+ if (rv) {
+ err = errno;
+ break;
+ }
+ /* fall through */
+ case 1:
+ if (!sub)
+ sub = sysfs_read(-1, e->devnum, GET_VERSION);
+ if (!sub)
+ break;
+
+ rv = subarray_set_num(container, sub, "raid_disks",
+ raid_disks + takeover_delta);
+ if (rv)
+ err = errno;
+ else
+ changed++;
+ break;
+ }
+ sysfs_free(sub);
+ sub = NULL;
+ if (rv)
+ break;
+ }
+ sysfs_free(sub);
+ free_mdstat(ent);
+ if (rv) {
+ fprintf(stderr, Name
+ ": failed to initiate container reshape%s%s\n",
+ err ? ": " : "", err ? strerror(err) : "");
+ return rv;
+ }
+
+ return changed;
+}
+
+static void revert_container_raid_disks(struct supertype *st, int fd, char *container)
+{
+ /* we failed to prepare all subarrays in the container for
+ * reshape, so cancel the changes and restore the nominal raid
+ * level
+ */
+ struct mdstat_ent *ent, *e;
+
+ if (container == NULL)
+ return;
+
+ ent = mdstat_read(0, 0);
+ if (!ent) {
+ fprintf(stderr, Name
+ ": failed to read /proc/mdstat while aborting reshape\n");
+ return;
+ }
+
+ for (e = ent; e; e = e->next) {
+ int level_fixed = 0, disks_fixed = 0;
+ struct mdinfo *sub, prev;
+
+ if (!is_container_member(e, container))
+ continue;
+
+ st->ss->free_super(st);
+ sprintf(st->subarray, "%s", to_subarray(e, container));
+ if (st->ss->load_super(st, fd, NULL)) {
+ fprintf(stderr, Name
+ ": failed to read metadata while aborting reshape\n");
+ continue;
+ }
+ st->ss->getinfo_super(st, &prev);
+
+ /* changing level might change raid_disks so we do it
+ * first and then check if raid_disks still needs fixing
+ */
+ if (map_name(pers, e->level) != prev.array.level) {
+ sub = sysfs_read(-1, e->devnum, GET_VERSION);
+ if (sub &&
+ !sysfs_set_num(sub, NULL, "level", prev.array.level))
+ level_fixed = 1;
+ sysfs_free(sub);
+ } else
+ level_fixed = 1;
+
+ sub = sysfs_read(-1, e->devnum, GET_DISKS);
+ if (sub && sub->array.raid_disks != prev.array.raid_disks) {
+ if (!subarray_set_num(container, sub, "raid_disks",
+ prev.array.raid_disks))
+ disks_fixed = 1;
+ } else if (sub)
+ disks_fixed = 1;
+ sysfs_free(sub);
+
+ if (!disks_fixed || !level_fixed)
+ fprintf(stderr, Name
+ ": failed to restore %s to a %d-disk %s array\n",
+ e->dev, prev.array.raid_disks,
+ map_num(pers, prev.array.level));
+ }
+ free_mdstat(ent);
+}
+
int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
long long size,
int level, char *layout_str, int chunksize, int raid_disks)
@@ -518,6 +740,8 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
unsigned long cache;
unsigned long long array_size;
int changed = 0;
+ char *container = NULL;
+ int cfd = -1;
int done;

struct mdinfo *sra;
@@ -545,22 +769,97 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
" Please use a newer kernel\n");
return 1;
}
+
+ st = super_by_fd(fd);
+ if (!st) {
+ fprintf(stderr, Name ": Unable to determine metadata format for %s\n", devname);
+ return 1;
+ }
+
+ /* in the external case we need to check that the requested reshape is
+ * supported, and perform an initial check that the container holds the
+ * pre-requisite spare devices (mdmon owns final validation)
+ */
+ if (st->ss->external) {
+ int container_dev;
+
+ if (st->subarray[0]) {
+ container_dev = st->container_dev;
+ cfd = open_dev_excl(st->container_dev);
+ } else if (size >= 0 || layout_str != NULL || chunksize != 0 ||
+ level != UnSet) {
+ fprintf(stderr,
+ Name ": %s is a container, only 'raid-devices' can be changed\n",
+ devname);
+ return 1;
+ } else {
+ container_dev = st->devnum;
+ close(fd);
+ cfd = open_dev_excl(st->devnum);
+ fd = cfd;
+ }
+ if (cfd < 0) {
+ fprintf(stderr, Name ": Unable to open container for %s\n",
+ devname);
+ return 1;
+ }
+
+ container = devnum2devname(container_dev);
+ if (!container) {
+ fprintf(stderr, Name ": Could not determine container name\n");
+ close(cfd);
+ return 1;
+ }
+
+ if (st->ss->load_super(st, cfd, NULL)) {
+ fprintf(stderr, Name ": Cannot read superblock for %s\n",
+ devname);
+ if (container)
+ free(container);
+ close(cfd);
+ return 1;
+ }
+
+ if (mdmon_running(container_dev))
+ st->update_tail = &st->updates;
+ }
+
sra = sysfs_read(fd, 0, GET_LEVEL);
- if (sra)
+ if (sra) {
+ if (st->ss->external && st->subarray[0] == 0) {
+ array.level = LEVEL_CONTAINER;
+ sra->array.level = LEVEL_CONTAINER;
+ }
frozen = freeze_array(sra);
- else {
+ } else {
fprintf(stderr, Name ": failed to read sysfs parameters for %s\n",
devname);
+ if (container)
+ free(container);
+ if (cfd > -1)
+ close(cfd);
return 1;
}
if (frozen < 0) {
fprintf(stderr, Name ": %s is performing resync/recovery and cannot"
" be reshaped\n", devname);
+ if (container)
+ free(container);
+ if (cfd > -1)
+ close(cfd);
return 1;
}

+
/* ========= set size =============== */
if (size >= 0 && (size == 0 || size != array.size)) {
+ long long orig_size = array.size;
+
+ if (reshape_super(st, size, UnSet, UnSet, 0, 0, NULL, devname, !quiet)) {
+ rv = 1;
+ goto release;
+ }
+ sync_metadata(st);
array.size = size;
if (array.size != size) {
/* got truncated to 32bit, write to
@@ -575,6 +874,11 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
rv = ioctl(fd, SET_ARRAY_INFO, &array);
if (rv != 0) {
int err = errno;
+
+ /* restore metadata */
+ if (reshape_super(st, orig_size, UnSet, UnSet, 0, 0,
+ NULL, devname, !quiet) == 0)
+ sync_metadata(st);
fprintf(stderr, Name ": Cannot set device size for %s: %s\n",
devname, strerror(err));
if (err == EBUSY &&
@@ -591,7 +895,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
fprintf(stderr, Name ": component size of %s has been set to %lluK\n",
devname, size);
changed = 1;
- } else {
+ } else if (array.level != LEVEL_CONTAINER) {
size = get_component_size(fd)/2;
if (size == 0)
size = array.size;
@@ -674,6 +978,13 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
} else
layout_str = "parity-last";
} else {
+ /* Level change is a simple takeover. In the external
+ * case we don't check with the metadata handler until
+ * we establish what the final layout will be. If the
+ * level change is disallowed we will revert to
+ * orig_level without disturbing the metadata, otherwise
+ * we will send an update.
+ */
c = map_num(pers, level);
if (c == NULL) {
rv = 1;/* not possible */
@@ -706,7 +1017,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,

/* ========= set shape (chunk_size / layout / ndisks) ============== */
/* Check if layout change is a no-op */
- switch(array.level) {
+ switch (array.level) {
case 5:
if (layout_str && array.layout == map_name(r5layout, layout_str))
layout_str = NULL;
@@ -745,6 +1056,11 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
if (layout_str == NULL
&& (chunksize == 0 || chunksize*1024 == array.chunk_size)
&& (raid_disks == 0 || raid_disks == array.raid_disks)) {
+ if (reshape_super(st, -1, level, UnSet, 0, 0, NULL, devname, !quiet)) {
+ rv = 1;
+ goto release;
+ }
+ sync_metadata(st);
rv = 0;
if (level != UnSet && level != array.level) {
/* Looks like this level change doesn't need
@@ -766,18 +1082,69 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
} else if (!changed && !quiet)
fprintf(stderr, Name ": %s: no change requested\n",
devname);
+
+ if (st->ss->external && !mdmon_running(st->container_dev) &&
+ level > 0) {
+ start_mdmon(st->container_dev);
+ ping_monitor(container);
+ }
goto release;
}

c = map_num(pers, array.level);
if (c == NULL) c = "-unknown-";
- switch(array.level) {
+ switch (array.level) {
default: /* raid0, linear, multipath cannot be reconfigured */
fprintf(stderr, Name ": %s array %s cannot be reshaped.\n",
c, devname);
+ /* TODO raid0 raiddisks can be reshaped via raid4 */
rv = 1;
break;
+ case LEVEL_CONTAINER: {
+ int count;
+
+ /* double check that we are not changing anything but raid_disks */
+ if (size >= 0 || layout_str != NULL || chunksize != 0 || level != UnSet) {
+ fprintf(stderr,
+ Name ": %s is a container, only 'raid-devices' can be changed\n",
+ devname);
+ rv = 1;
+ goto release;
+ }
+
+ st->update_tail = &st->updates;
+ if (reshape_super(st, -1, UnSet, UnSet, 0, raid_disks,
+ backup_file, devname, !quiet)) {
+ rv = 1;
+ goto release;
+ }
+
+ count = reshape_container_raid_disks(container, raid_disks);
+ if (count < 0) {
+ revert_container_raid_disks(st, fd, container);
+ rv = 1;
+ goto release;
+ } else if (count == 0) {
+ if (!quiet)
+ fprintf(stderr, Name
+ ": no active subarrays to reshape\n");
+ goto release;
+ }
+
+ if (!mdmon_running(st->devnum)) {
+ start_mdmon(st->devnum);
+ ping_monitor(container);
+ }
+ sync_metadata(st);

+ /* give mdmon a chance to allocate spares */
+ ping_manager(container);
+
+ /* manage_reshape takes care of releasing the array(s) */
+ st->ss->manage_reshape(st, backup_file);
+ frozen = 0;
+ goto release;
+ }
case LEVEL_FAULTY: /* only 'layout' change is permitted */

if (chunksize || raid_disks) {
@@ -813,6 +1180,12 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
break;
}
if (raid_disks > 0) {
+ if (reshape_super(st, -1, UnSet, UnSet, 0, raid_disks,
+ NULL, devname, !quiet)) {
+ rv = 1;
+ goto release;
+ }
+ sync_metadata(st);
array.raid_disks = raid_disks;
if (ioctl(fd, SET_ARRAY_INFO, &array) != 0) {
fprintf(stderr, Name ": Cannot set raid-devices for %s: %s\n",
@@ -830,7 +1203,6 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
* layout/chunksize/raid_disks can be changed
* though the kernel may not support it all.
*/
- st = super_by_fd(fd);

/*
* There are three possibilities.
@@ -1024,6 +1396,12 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
}
}
if (backup_file == NULL) {
+ if (st->ss->external && !st->ss->manage_reshape) {
+ fprintf(stderr, Name ": %s Grow operation not supported by %s metadata\n",
+ devname, st->ss->name);
+ rv = 1;
+ break;
+ }
if (ndata <= odata) {
fprintf(stderr, Name ": %s: Cannot grow - need backup-file\n",
devname);
@@ -1072,6 +1450,13 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
d++;
}

+ /* check that the operation is supported by the metadata */
+ if (reshape_super(st, -1, level, nlayout, nchunk, ndisks,
+ backup_file, devname, !quiet)) {
+ rv = 1;
+ break;
+ }
+
/* lastly, check that the internal stripe cache is
* large enough, or it won't work.
*/
@@ -1088,6 +1473,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
* If only changing raid_disks, use ioctl, else use
* sysfs.
*/
+ sync_metadata(st);
if (ochunk == nchunk && olayout == nlayout) {
array.raid_disks = ndisks;
if (ioctl(fd, SET_ARRAY_INFO, &array) != 0) {
@@ -1136,6 +1522,14 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
break;
}

+ if (st->ss->external) {
+ /* metadata handler takes it from here */
+ ping_manager(container);
+ st->ss->manage_reshape(st, backup_file);
+ frozen = 0;
+ break;
+ }
+
/* set up the backup-super-block. This requires the
* uuid from the array.
*/
@@ -1239,6 +1633,10 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
}
if (sra)
unfreeze_array(sra, frozen);
+ if (container)
+ free(container);
+ if (cfd > -1)
+ close(cfd);
return rv;
}

diff --git a/mdadm.h b/mdadm.h
index a4de06f..64b32cc 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -627,6 +627,15 @@ extern struct superswitch {
int (*kill_subarray)(struct supertype *st); /* optional */
/* Permit subarray's to be modified */
int (*update_subarray)(struct supertype *st, char *update, mddev_ident_t ident); /* optional */
+ /* Check if reshape is supported for this external format.
+ * st is obtained from super_by_fd() where st->subarray[0] is
+ * initialized to indicate if reshape is being performed at the
+ * container or subarray level
+ */
+ int (*reshape_super)(struct supertype *st, long long size, int level,
+ int layout, int chunksize, int raid_disks,
+ char *backup, char *dev, int verbose); /* optional */
+ int (*manage_reshape)(struct supertype *st, char *backup); /* optional */

/* for mdmon */
int (*open_new)(struct supertype *c, struct active_array *a,
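
The retry logic in subarray_set_num() above follows a general pattern: attempt the write, and if it fails with EAGAIN, wake the monitor once and try exactly one more time. A minimal sketch of that pattern with the sysfs and mdmon pieces replaced by callbacks follows. All names here (demo_set_with_retry(), demo_array, demo_write_raid_disks(), demo_ping()) are hypothetical, and the "array becomes active when pinged" behaviour is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>

typedef int (*demo_write_fn)(void *ctx, int value);
typedef void (*demo_ping_fn)(void *ctx);

/* Try the write; on EAGAIN, ping the monitor and retry once.
 * A second EAGAIN means the monitor is not active, so give up.
 */
static int demo_set_with_retry(demo_write_fn write_attr, demo_ping_fn ping,
			       void *ctx, int value)
{
	int rc;

	assert(write_attr && ping);

	rc = write_attr(ctx, value);
	if (rc < 0 && errno == EAGAIN) {
		ping(ctx);
		rc = write_attr(ctx, value);
	}
	return rc;
}

/* Toy array: writes fail with EAGAIN until it has been marked active. */
struct demo_array {
	int active;
	int raid_disks;
	int pings;
};

static int demo_write_raid_disks(void *ctx, int value)
{
	struct demo_array *a = ctx;

	if (!a->active) {
		errno = EAGAIN;
		return -1;
	}
	a->raid_disks = value;
	return 0;
}

/* Toy monitor: a ping marks the array active. */
static void demo_ping(void *ctx)
{
	struct demo_array *a = ctx;

	a->pings++;
	a->active = 1;
}
```

Starting from an inactive 4-disk toy array, demo_set_with_retry(demo_write_raid_disks, demo_ping, &a, 5) succeeds on the second attempt after a single ping. Bounding the retry at one attempt avoids spinning when no monitor is running.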


[PATCH 13/53] External reshape (step 2): Freeze container

on 26.11.2010 09:05:30 by adam.kwolek

When growing the number of raid disks the reshape process will promote
container-spares to subarray-spares (later the kernel promotes them to
subarray-members in raid5_start_reshape()). The automatic spare
promotion that mdmon performs upon seeing a degraded array must be
disabled until the reshape process has been initiated. Otherwise, mdmon
may start a rebuild before the reshape parameters can be specified.

In the external case we arrange for the monitor to be blocked, and turn off the safemode delay.
Mdmon is updated to check sync_action is not frozen before initiating
recovery. This introduces a need to check which version of mdmon is
running to be sure it honors the expected semantics. Extend
ping_monitor() to report the version of mdmon. This also permits
discrimination of known buggy mdmon implementations in the future.
Note, it's not enough to know the current version of mdadm because the
mdmon instance may have originated from the initrd, so there is no
guarantee that mdadm and mdmon versions are synchronized.
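
The return-code convention of freeze() and the sync_action test it replaces can be sketched in isolation. This is only an illustration under stated assumptions: may_freeze(), freeze_allows_reshape() and freeze_result_str() are hypothetical helper names, not functions from mdadm. The string comparison mirrors the removed freeze_array() code, and the return codes follow the comment in freeze().

```c
#include <string.h>

/* Hypothetical helper: an array may be frozen only when its sync_action
 * reads "idle" or "frozen" (sysfs returns the value with a trailing
 * newline, as in the removed freeze_array() code). */
int may_freeze(const char *sync_action)
{
    return strcmp(sync_action, "idle\n") == 0 ||
           strcmp(sync_action, "frozen\n") == 0;
}

/* Hypothetical helper: Grow_reshape() gives up on any negative result
 * (-1 busy, -2 container cannot be frozen) and continues on 0 or 1. */
int freeze_allows_reshape(int rc)
{
    return rc >= 0;
}

/* Hypothetical helper: human-readable form of the freeze() return codes. */
const char *freeze_result_str(int rc)
{
    switch (rc) {
    case 1:  return "frozen";
    case 0:  return "kernel does not support 'frozen'";
    case -1: return "array busy";
    case -2: return "container cannot be frozen";
    default: return "unknown";
    }
}
```

The point of the separate -2 code is that freeze_container() has already printed its own reason, so the caller can exit without printing the resync/recovery message.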

Signed-off-by: Dan Williams
Signed-off-by: Adam Kwolek
---

Grow.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 74 insertions(+), 19 deletions(-)

diff --git a/Grow.c b/Grow.c
index 3815fad..4060129 100644
--- a/Grow.c
+++ b/Grow.c
@@ -432,29 +432,79 @@ static int child_same_size(int afd, struct mdinfo *sra, unsigned long blocks,
int disks, int chunk, int level, int layout, int data,
int dests, int *destfd, unsigned long long *destoffsets);

-static int freeze_array(struct mdinfo *sra)
+static int freeze_container(struct supertype *st)
{
- /* Try to freeze resync on this array.
+ int container_dev = st->subarray[0] ? st->container_dev : st->devnum;
+ char *container = devnum2devname(container_dev);
+
+ if (!container) {
+ fprintf(stderr, Name
+ ": could not determine container name, freeze aborted\n");
+ return -2;
+ }
+
+ if (block_monitor(container, 1)) {
+ fprintf(stderr, Name ": failed to freeze container\n");
+ return -2;
+ }
+
+ return 1;
+}
+
+static void unfreeze_container(struct supertype *st)
+{
+ int container_dev = st->subarray[0] ? st->container_dev : st->devnum;
+ char *container = devnum2devname(container_dev);
+
+ if (!container) {
+ fprintf(stderr, Name
+ ": could not determine container name, unfreeze aborted\n");
+ return;
+ }
+
+ unblock_monitor(container, 1);
+ free(container);
+}
+
+static int freeze(struct supertype *st)
+{
+ /* Try to freeze resync/rebuild on this array/container.
* Return -1 if the array is busy,
+ * return -2 if the container cannot be frozen,
* return 0 if this kernel doesn't support 'frozen'
* return 1 if it worked.
*/
- char buf[20];
- if (sysfs_get_str(sra, NULL, "sync_action", buf, 20) <= 0)
- return 0;
- if (strcmp(buf, "idle\n") != 0 &&
- strcmp(buf, "frozen\n") != 0)
- return -1;
- if (sysfs_set_str(sra, NULL, "sync_action", "frozen") < 0)
- return 0;
- return 1;
+ if (st->ss->external)
+ return freeze_container(st);
+ else {
+ struct mdinfo *sra = sysfs_read(-1, st->devnum, GET_VERSION);
+ int err;
+
+ if (!sra)
+ return -1;
+ err = sysfs_freeze_array(sra);
+ sysfs_free(sra);
+ return err;
+ }
}

-static void unfreeze_array(struct mdinfo *sra, int frozen)
+static void unfreeze(struct supertype *st, int frozen)
{
/* If 'frozen' is 1, unfreeze the array */
- if (frozen > 0)
- sysfs_set_str(sra, NULL, "sync_action", "idle");
+ if (frozen <= 0)
+ return;
+
+ if (st->ss->external)
+ return unfreeze_container(st);
+ else {
+ struct mdinfo *sra = sysfs_read(-1, st->devnum, GET_VERSION);
+
+ if (sra)
+ sysfs_set_str(sra, NULL, "sync_action", "idle");
+ else
+ fprintf(stderr, Name ": failed to unfreeze array\n");
+ sysfs_free(sra);
+ }
}

static void wait_reshape(struct mdinfo *sra)
@@ -830,7 +880,6 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
array.level = LEVEL_CONTAINER;
sra->array.level = LEVEL_CONTAINER;
}
- frozen = freeze_array(sra);
} else {
fprintf(stderr, Name ": failed to read sysfs parameters for %s\n",
devname);
@@ -840,7 +889,15 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
close(cfd);
return 1;
}
- if (frozen < 0) {
+ frozen = freeze(st);
+ if (frozen < -1) {
+ /* freeze() already spewed the reason */
+ if (container)
+ free(container);
+ if (cfd > -1)
+ close(cfd);
+ return 1;
+ } else if (frozen < 0) {
fprintf(stderr, Name ": %s is performing resync/recovery and cannot"
" be reshaped\n", devname);
if (container)
@@ -850,7 +907,6 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
return 1;
}

-
/* ========= set size =============== */
if (size >= 0 && (size == 0 || size != array.size)) {
long long orig_size = array.size;
@@ -1631,8 +1687,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
if (c && sysfs_set_str(sra, NULL, "level", c) == 0)
fprintf(stderr, Name ": aborting level change\n");
}
- if (sra)
- unfreeze_array(sra, frozen);
+ unfreeze(st, frozen);
if (container)
free(container);
if (cfd > -1)


[PATCH 14/53] FIX: Cannot exit monitor after takeover

am 26.11.2010 09:05:37 von adam.kwolek

When performing a backward takeover to raid0, the monitor cannot exit
in a configuration with a single raid0 array.
The monitor is blocked by the communication (ping_monitor()) issued after unfreeze().

Do not ping the monitor for raid0 arrays, as they should not be monitored.
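
The guard added by this fix can be read as a small predicate. The sketch below is an assumption-laden illustration: struct toy_mdinfo and should_ping_monitor() are hypothetical names standing in for the struct mdinfo fields and the inline condition used in unblock_monitor().

```c
#include <stddef.h>

/* Hypothetical stand-in for the one struct mdinfo field used here. */
struct toy_mdinfo {
    int level; /* raid level read from sysfs (GET_LEVEL) */
};

/* Mirrors the patched condition: ping only when sysfs information was
 * read and the array is not raid0 (level > 0). */
int should_ping_monitor(const struct toy_mdinfo *sra)
{
    return sra != NULL && sra->level > 0;
}
```

For a container holding a single raid0 array the predicate is false, so no ping is sent and mdmon is free to exit.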

Signed-off-by: Adam Kwolek
---

msg.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/msg.c b/msg.c
index 8e7ebfd..95c6f0b 100644
--- a/msg.c
+++ b/msg.c
@@ -385,11 +385,12 @@ void unblock_monitor(char *container, const int unfreeze)
if (!is_container_member(e, container))
continue;
sysfs_free(sra);
- sra = sysfs_read(-1, e->devnum, GET_VERSION);
+ sra = sysfs_read(-1, e->devnum, GET_VERSION|GET_LEVEL);
if (unblock_subarray(sra, unfreeze))
fprintf(stderr, Name ": Failed to unfreeze %s\n", e->dev);
}
- ping_monitor(container);
+ if (sra && sra->array.level > 0)
+ ping_monitor(container);

sysfs_free(sra);
free_mdstat(ent);


[PATCH 15/53] FIX: Unfreeze not only container for external metadata

am 26.11.2010 09:05:45 von adam.kwolek

For the external metadata case, unfreeze must unfreeze the arrays as well as
the container, not only the container as it has done so far. unfreeze() does
not know what configuration changes have been made so far, or whether the
arrays have already been taken out of the frozen state in md.
unfreeze() therefore unfreezes the array itself, to make sure that no array
is left frozen and that the monitor is unblocked.

Signed-off-by: Adam Kwolek
---

Grow.c | 18 ++++++++----------
1 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/Grow.c b/Grow.c
index 4060129..8ca1812 100644
--- a/Grow.c
+++ b/Grow.c
@@ -495,16 +495,14 @@ static void unfreeze(struct supertype *st, int frozen)
return;

if (st->ss->external)
- return unfreeze_container(st);
- else {
- struct mdinfo *sra = sysfs_read(-1, st->devnum, GET_VERSION);
-
- if (sra)
- sysfs_set_str(sra, NULL, "sync_action", "idle");
- else
- fprintf(stderr, Name ": failed to unfreeze array\n");
- sysfs_free(sra);
- }
+ unfreeze_container(st);
+
+ struct mdinfo *sra = sysfs_read(-1, st->devnum, GET_VERSION);
+ if (sra)
+ sysfs_set_str(sra, NULL, "sync_action", "idle");
+ else
+ fprintf(stderr, Name ": failed to unfreeze array\n");
+ sysfs_free(sra);
}

static void wait_reshape(struct mdinfo *sra)


[PATCH 16/53] Add takeover support for external meta

am 26.11.2010 09:05:52 von adam.kwolek

When performing a 0->10 or 10->0 takeover, mdmon should update the external metadata (due to disk slot changes).
To achieve that, mdadm calls update_super with the "update_level" type after changing the level in md.
update_super() allocates a new imsm_dev with updated disk slot numbers, to be processed by mdmon in process_update().
process_update() discovers missing disks and adds them to the imsm metadata.
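
The level filter applied before an update is queued can be sketched as a stand-alone predicate. The function names below are hypothetical; the conditions follow the check in update_level_imsm() (only 0->10 and 10->0 transitions produce an update) and its delta_disks computation.

```c
/* Hypothetical helper: returns 1 when a takeover from old_level to
 * new_level requires a metadata update, which the patch limits to
 * raid0 -> raid10 and raid10 -> raid0. */
int takeover_needs_level_update(int old_level, int new_level)
{
    return (old_level == 0 && new_level == 10) ||
           (old_level == 10 && new_level == 0);
}

/* Hypothetical helper: the disk-count change carried in the update,
 * computed as the new raid_disks minus the current number of members. */
int takeover_delta_disks(int old_members, int new_raid_disks)
{
    return new_raid_disks - old_members;
}
```

A raid0 with two members taken over to a four-disk raid10 yields a delta of 2; the reverse direction yields -2.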

Signed-off-by: Maciej Trela
Signed-off-by: Adam Kwolek
---

Grow.c | 28 ++++++
managemon.c | 16 +++
monitor.c | 2
super-intel.c | 279 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 321 insertions(+), 4 deletions(-)

diff --git a/Grow.c b/Grow.c
index 8ca1812..e977ce2 100644
--- a/Grow.c
+++ b/Grow.c
@@ -1066,6 +1066,31 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
fprintf(stderr, Name " level of %s changed to %s\n",
devname, c);
changed = 1;
+
+ st = super_by_fd(fd);
+ if (!st) {
+ fprintf(stderr, Name ": cannot handle this array\n");
+ if (container)
+ free(container);
+ if (cfd > -1)
+ close(cfd);
+ return 1;
+ } else {
+ if (st && reshape_super(st, -1, level, UnSet, 0, 0, NULL, devname, !quiet)) {
+ rv = 1;
+ goto release;
+ }
+ /* before sending update make sure that for external metadata
+ * and after changing raid level mdmon is running
+ */
+ if (st->ss->external && !mdmon_running(st->container_dev) &&
+ level > 0) {
+ start_mdmon(st->container_dev);
+ if (container)
+ ping_monitor(container);
+ }
+ sync_metadata(st);
+ }
}
}

@@ -1140,7 +1165,8 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
if (st->ss->external && !mdmon_running(st->container_dev) &&
level > 0) {
start_mdmon(st->container_dev);
- ping_monitor(container);
+ if (container)
+ ping_monitor(container);
}
goto release;
}
diff --git a/managemon.c b/managemon.c
index 164e4f8..53ab4a9 100644
--- a/managemon.c
+++ b/managemon.c
@@ -381,6 +381,9 @@ static int disk_init_and_add(struct mdinfo *disk, struct mdinfo *clone,
static void manage_member(struct mdstat_ent *mdstat,
struct active_array *a)
{
+ struct active_array *newa;
+ int level;
+
/* Compare mdstat info with known state of member array.
* We do not need to look for device state changes here, that
* is dealt with by the monitor.
@@ -408,6 +411,19 @@ static void manage_member(struct mdstat_ent *mdstat,
else
frozen = 1; /* can't read metadata_version assume the worst */

+ level = a->info.array.level;
+ if (mdstat->level) {
+ level = map_name(pers, mdstat->level);
+ if (a->info.array.level != level && level >= 0) {
+ newa = duplicate_aa(a);
+ if (newa) {
+ newa->info.array.level = level;
+ replace_array(a->container, a, newa);
+ a = newa;
+ }
+ }
+ }
+
if (a->check_degraded && !frozen) {
struct metadata_update *updates = NULL;
struct mdinfo *newdev = NULL;
diff --git a/monitor.c b/monitor.c
index 59b4181..5705a9b 100644
--- a/monitor.c
+++ b/monitor.c
@@ -483,7 +483,7 @@ static int wait_and_act(struct supertype *container, int nowait)
/* once an array has been deactivated we want to
* ask the manager to discard it.
*/
- if (!a->container) {
+ if (!a->container || a->info.array.level == 0) {
if (discard_this) {
ap = &(*ap)->next;
continue;
diff --git a/super-intel.c b/super-intel.c
index 7c5fcc4..2434fa1 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -285,6 +285,7 @@ enum imsm_update_type {
update_kill_array,
update_rename_array,
update_add_disk,
+ update_level,
};

struct imsm_update_activate_spare {
@@ -320,6 +321,13 @@ struct imsm_update_add_disk {
enum imsm_update_type type;
};

+struct imsm_update_level {
+ enum imsm_update_type type;
+ int delta_disks;
+ int container_member;
+ struct imsm_dev dev;
+};
+
static struct supertype *match_metadata_desc_imsm(char *arg)
{
struct supertype *st;
@@ -1666,6 +1674,9 @@ static void getinfo_super_imsm(struct supertype *st, struct mdinfo *info)
}
}

+static int is_raid_level_supported(const struct imsm_orom *orom, int level, int raiddisks);
+static void imsm_copy_dev(struct imsm_dev *dest, struct imsm_dev *src);
+
static int update_super_imsm(struct supertype *st, struct mdinfo *info,
char *update, char *devname, int verbose,
int uuid_set, char *homehost)
@@ -1698,12 +1709,15 @@ static int update_super_imsm(struct supertype *st, struct mdinfo *info,
struct intel_super *super = st->sb;
struct imsm_super *mpb;

- /* we can only update container info */
- if (!super || super->current_vol >= 0 || !super->anchor)
+ if (!super || !super->anchor)
return 1;

mpb = super->anchor;

+ /* we can only update container info */
+ if (super->current_vol >= 0)
+ return 1;
+
if (strcmp(update, "uuid") == 0 && uuid_set && !info->update_private)
fprintf(stderr,
Name ": '--uuid' not supported for imsm metadata\n");
@@ -1778,6 +1792,45 @@ static void imsm_copy_dev(struct imsm_dev *dest, struct imsm_dev *src)
memcpy(dest, src, sizeof_imsm_dev(src, 0));
}

+struct imsm_dev *reallocate_imsm_dev(struct intel_super *super,
+ unsigned int array_index,
+ int map_num_members)
+{
+ struct imsm_dev *newdev = NULL;
+ struct imsm_dev *retval = NULL;
+ struct intel_dev *dv = NULL;
+ struct imsm_dev *dv_free = NULL;
+ int memNeeded;
+
+ if (!super)
+ return NULL;
+
+ /* Calculate space needed for imsm_dev with a double map */
+ memNeeded = sizeof(struct imsm_dev) + sizeof(__u32) * (map_num_members - 1) +
+ sizeof(struct imsm_map) + sizeof(__u32) * (map_num_members - 1);
+
+ newdev = malloc(memNeeded);
+ if (!newdev) {
+ fprintf(stderr, "error: imsm meta update not possible due to no memory conditions\n");
+ return NULL;
+ }
+ /* Find our device */
+ for (dv = super->devlist; dv; dv = dv->next)
+ if (dv->index == array_index) {
+ /* Copy imsm_dev into the new buffer */
+ imsm_copy_dev(newdev, dv->dev);
+ dv_free = dv->dev;
+ dv->dev = newdev;
+ retval = newdev;
+ free(dv_free);
+ break;
+ }
+ if (retval == NULL)
+ free(newdev);
+
+ return retval;
+}
+
static int compare_super_imsm(struct supertype *st, struct supertype *tst)
{
/*
@@ -5123,6 +5176,57 @@ static void imsm_process_update(struct supertype *st,
mpb = super->anchor;

switch (type) {
+ case update_level: {
+ struct imsm_update_level *u = (void *)update->buf;
+ struct imsm_dev *dev_new, *dev = NULL;
+ struct imsm_map *map;
+ struct dl *d;
+ int i;
+ int start_disk;
+
+ dev_new = &u->dev;
+ for (i = 0; i < mpb->num_raid_devs; i++) {
+ dev = get_imsm_dev(super, i);
+ if (strcmp((char *)dev_new->volume, (char *)dev->volume) == 0)
+ break;
+ }
+ if (i == super->anchor->num_raid_devs)
+ return;
+
+ if (dev == NULL)
+ return;
+
+ imsm_copy_dev(dev, dev_new);
+ map = get_imsm_map(dev, 0);
+ start_disk = mpb->num_disks;
+ mpb->num_disks += u->delta_disks;
+
+ /* clear missing disks list */
+ while (super->missing) {
+ d = super->missing;
+ super->missing = d->next;
+ __free_imsm_disk(d);
+ }
+ find_missing(super);
+
+ /* clear new disk entries if number of disks increased*/
+ d = super->missing;
+ for (i = start_disk; i < map->num_members; i++) {
+ assert(d != NULL);
+ if (!d)
+ break;
+ memset(&d->disk, 0, sizeof(d->disk));
+ strcpy((char *)d->disk.serial, "MISSING");
+ d->disk.total_blocks = map->blocks_per_member;
+ /* Set slot for missing disk */
+ set_imsm_ord_tbl_ent(map, i, d->index | IMSM_ORD_REBUILD);
+ d->raiddisk = i;
+ d = d->next;
+ }
+
+ super->updates_pending++;
+ break;
+ }
case update_activate_spare: {
struct imsm_update_activate_spare *u = (void *) update->buf;
struct imsm_dev *dev = get_imsm_dev(super, u->array);
@@ -5442,6 +5546,26 @@ static void imsm_prepare_update(struct supertype *st,
size_t len = 0;

switch (type) {
+ case update_level: {
+ struct imsm_update_level *u = (void *) update->buf;
+ struct active_array *a;
+
+ dprintf("prepare_update(): update level\n");
+ len += u->delta_disks * sizeof(struct imsm_disk) +
+ u->delta_disks * sizeof(__u32);
+
+ for (a = st->arrays; a; a = a->next)
+ if (a->info.container_member == u->container_member)
+ break;
+ if (a == NULL)
+ break; /* what else we can do here? */
+
+ /* we'll add new disks to imsm_dev */
+ if (u->delta_disks > 0)
+ reallocate_imsm_dev(super, u->container_member,
+ a->info.array.raid_disks);
+ break;
+ }
case update_create_array: {
struct imsm_update_create_array *u = (void *) update->buf;
struct intel_dev *dv;
@@ -5561,6 +5685,156 @@ static void imsm_delete(struct intel_super *super, struct dl **dlp, unsigned ind
}
#endif /* MDASSEMBLE */

+static int update_level_imsm(struct supertype *st, struct mdinfo *info,
+ char *devname, int verbose,
+ int uuid_set, char *homehost)
+{
+ struct intel_super *super = st->sb;
+ struct imsm_super *mpb = super->anchor;
+ struct imsm_update_level *u;
+ struct imsm_dev *dev_new, *dev = NULL;
+ struct imsm_map *map_new, *map;
+ struct mdinfo *newdi;
+ struct dl *dl;
+ int *tmp_ord_tbl;
+ int i, slot, idx;
+ int len, disks;
+
+ if (!is_raid_level_supported(super->orom,
+ info->array.level,
+ info->array.raid_disks))
+ return 1;
+
+ for (i = 0; i < mpb->num_raid_devs; i++) {
+ dev = get_imsm_dev(super, i);
+ if (strcmp(devname, (char *)dev->volume) == 0)
+ break;
+ }
+ if (dev == NULL)
+ return 1;
+
+ if (i == super->anchor->num_raid_devs)
+ return 1;
+
+ map = get_imsm_map(dev, 0);
+
+ /* update level is needed only for 0->10 and 10->0 transitions */
+ if ((info->array.level != 10 || map->raid_level != 0) &&
+ (info->array.level != 0 || map->raid_level != 10))
+ return 1;
+
+ disks = (info->array.raid_disks > map->num_members) ?
+ info->array.raid_disks : map->num_members;
+ len = sizeof(struct imsm_update_level) +
+ ((disks - 1) * sizeof(__u32));
+
+ u = malloc(len);
+ if (u == NULL)
+ return 1;
+
+ dev_new = &u->dev;
+ imsm_copy_dev(dev_new, dev);
+ map_new = get_imsm_map(dev_new, 0);
+
+ tmp_ord_tbl = malloc(sizeof(int) * disks);
+ if (tmp_ord_tbl == NULL) {
+ free(u);
+ return 1;
+ }
+
+ for (i = 0; i < disks; i++)
+ tmp_ord_tbl[i] = -1;
+
+ /* iterate through devices to detect slot changes */
+ for (dl = super->disks; dl; dl = dl->next)
+ for (newdi = info->devs; newdi; newdi = newdi->next) {
+ if ((dl->major != newdi->disk.major) ||
+ (dl->minor != newdi->disk.minor))
+ continue;
+ slot = get_imsm_disk_slot(map, dl->index);
+ idx = get_imsm_ord_tbl_ent(dev_new, slot);
+ tmp_ord_tbl[newdi->disk.raid_disk] = idx;
+ break;
+ }
+
+ for (i = 0; i < disks; i++)
+ set_imsm_ord_tbl_ent(map_new, i, tmp_ord_tbl[i]);
+ free(tmp_ord_tbl);
+ map_new->raid_level = info->array.level;
+ map_new->num_members = info->array.raid_disks;
+ u->type = update_level;
+ u->delta_disks = info->array.raid_disks - map->num_members;
+ u->container_member = info->container_member;
+ append_metadata_update(st, u, len);
+
+ return 0;
+}
+
+
+int imsm_reshape_super(struct supertype *st, long long size, int level,
+ int layout, int chunksize, int raid_disks,
+ char *backup, char *dev, int verbouse)
+{
+ int ret_val = 1;
+ struct mdinfo *sra = NULL;
+ int fd = -1;
+ char buf[PATH_MAX];
+
+ snprintf(buf, PATH_MAX, "/dev/md%i", st->devnum);
+ fd = open(buf , O_RDONLY | O_DIRECT);
+ if (fd < 0) {
+ dprintf("imsm: cannot open device: %s\n", buf);
+ goto imsm_reshape_super_exit;
+ }
+
+ if ((size == -1) && (layout == UnSet) && (raid_disks == 0) && (level != UnSet)) {
+ /* ok - this is takeover */
+ int container_fd;
+ int dn;
+ int err;
+
+ sra = sysfs_read(fd, 0, GET_VERSION | GET_LEVEL |
+ GET_LAYOUT | GET_DISKS | GET_DEVS);
+ if (sra == NULL) {
+ fprintf(stderr, Name ": Cannot read sysfs info (imsm)\n");
+ goto imsm_reshape_super_exit;
+ }
+ dn = devname2devnum(sra->text_version + 1);
+ container_fd = open_dev_excl(dn);
+ if (container_fd < 0) {
+ fprintf(stderr, Name ": Cannot get exclusive access "
+ "to container (imsm).\n");
+ goto imsm_reshape_super_exit;
+ }
+ st->ss->load_super(st, container_fd, NULL);
+ close(container_fd);
+ st->ss->getinfo_super(st, sra);
+
+ /* send metadata update for raid10 takeover
+ * this means we are going from/to raid10
+ * to/from different than raid10 level
+ * if source level is raid0 mdmon is started only
+ */
+ if (((level == 10) || (sra->array.level == 10) || (sra->array.level == 0)) &&
+ (level != sra->array.level) &&
+ (level > 0)) {
+ st->update_tail = &st->updates;
+ err = update_level_imsm(st, sra, sra->name, 0, 0, NULL);
+ ret_val = 0;
+ }
+ sysfs_free(sra);
+ sra = NULL;
+ }
+
+imsm_reshape_super_exit:
+ sysfs_free(sra);
+ if (fd >= 0)
+ close(fd);
+
+ dprintf("imsm: reshape_super Exit code = %i\n", ret_val);
+ return ret_val;
+}
+
struct superswitch super_imsm = {
#ifndef MDASSEMBLE
.examine_super = examine_super_imsm,
@@ -5592,6 +5866,7 @@ struct superswitch super_imsm = {
.match_metadata_desc = match_metadata_desc_imsm,
.container_content = container_content_imsm,
.default_geometry = default_geometry_imsm,
+ .reshape_super = imsm_reshape_super,

.external = 1,
.name = "imsm",


[PATCH 17/53] Disk removal support for Raid10->Raid0 takeover

am 26.11.2010 09:06:00 von adam.kwolek

Until now, Raid10->Raid0 takeover was possible only if all the mirrors were removed before md started the takeover.
Now, when performing a Raid10->Raid0 takeover, mdadm removes all unwanted mirrors from the array before the actual md takeover is called.
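
The selection rule (keep one in-sync disk per mirror set, remove the rest, fail if a set has no in-sync disk) can be sketched on plain arrays. This is a simplified model, not the patch's list-walking code: pick_raid0_survivors() and raid10_near_copies() are hypothetical names, disks are assumed to be already ordered by slot, and the skipping of faulty disks done by the patch is left out.

```c
#include <stddef.h>

/* The near-copies count is the low byte of the raid10 layout word. */
int raid10_near_copies(int layout)
{
    return layout & 0xff;
}

/*
 * in_sync[i] is non-zero when the disk in slot i is in sync.
 * On success keep[i] is 1 for the single disk retained from each mirror
 * set and 0 for disks that would be removed.
 * Returns 0 on success, 1 if the arguments are inconsistent or some
 * mirror set has no in-sync disk (the array cannot be converted).
 */
int pick_raid0_survivors(const int *in_sync, int *keep,
                         size_t ndisks, int copies)
{
    size_t base, i;

    if (copies <= 0 || ndisks % (size_t)copies != 0)
        return 1;

    for (base = 0; base < ndisks; base += (size_t)copies) {
        int kept = 0;

        for (i = base; i < base + (size_t)copies; i++) {
            /* keep the first in-sync disk of the set, drop the others */
            keep[i] = (!kept && in_sync[i]) ? 1 : 0;
            if (keep[i])
                kept = 1;
        }
        if (!kept)
            return 1;
    }
    return 0;
}
```

With the layout accepted by the patch, (1 << 8) + 2, raid10_near_copies() returns 2, so each consecutive pair of slots forms one mirror set.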

Signed-off-by: Maciej Trela
Signed-off-by: Adam Kwolek
---

Grow.c | 107 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 106 insertions(+), 1 deletions(-)

diff --git a/Grow.c b/Grow.c
index e977ce2..347f07b 100644
--- a/Grow.c
+++ b/Grow.c
@@ -746,6 +746,92 @@ static void revert_container_raid_disks(struct supertype *st, int fd, char *cont
free_mdstat(ent);
}

+int remove_disks_on_raid10_to_raid0_takeover(struct supertype *st,
+ struct mdinfo *sra,
+ int layout)
+{
+ int max_disks;
+ int nr_of_copies, in_sync, copies;
+ struct mdinfo info;
+ struct mdinfo *sd_temp;
+ struct mdinfo *sd;
+ int d;
+
+ st->ss->getinfo_super(st, &info);
+ max_disks = info.array.raid_disks;
+
+ nr_of_copies = layout & 0xff;
+ in_sync = nr_of_copies;
+ copies = nr_of_copies;
+
+ /* sort list by slot numbers
+ */
+ sd_temp = sra->devs;
+ for (sd = sra->devs; sd; sd = sd->next) {
+ struct mdinfo *sd1 = sd;
+ struct mdinfo *sd1_next = sd1->next;
+ struct mdinfo *sd1_prev = NULL;
+ for (sd1 = sd; sd1; sd1 = sd1->next) {
+ if (sd1_next) {
+ if (sd1_next->disk.raid_disk < sd1->disk.raid_disk) {
+ if (sd == sd1)
+ sd = sd1_next;
+ if (sd1_prev)
+ sd1_prev->next = sd1_next;
+ sd1->next = sd1_next->next;
+ sd1_next->next = sd1;
+ }
+ }
+ }
+ }
+ /* Find devices that will be removed from the array */
+ d = 0;
+ sd_temp = sra->devs;
+ for (sd = sra->devs; sd; sd = sd->next, d++) {
+ int i, remove_all = 0;
+
+ if (sd->disk.state & (1<<MD_DISK_FAULTY))
+ continue;
+ if (!(sd->disk.state & (1<<MD_DISK_SYNC)))
+ in_sync--;
+
+ copies--;
+ if (!copies) {
+ /* We reached end of "mirrored" set of devices */
+ if (!in_sync) {
+ /* The array is failed and cannot be reshaped */
+ return 1;
+ }
+ /* Now mark all disks to be removed as faulty
+ * (leave only one in_sync disk) */
+ for (i = (d-nr_of_copies+1); i <= d; i++, sd_temp = sd_temp->next) {
+ if (sd_temp == NULL) {
+ /* error, the array is built incorrectly
+ */
+ return 1;
+ }
+ if ((sd_temp->disk.state & (1<<MD_DISK_SYNC)) &&
+ (remove_all == 0)) {
+ /* this will be the candidate for Raid 0,
+ * leave it */
+ remove_all = 1;
+ continue;
+ } else {
+ /* this one will be removed */
+ sysfs_set_str(sra, sd_temp, "state", "faulty");
+ sysfs_set_str(sra, sd_temp, "slot", "none");
+ sysfs_set_str(sra, sd_temp, "state", "remove");
+ }
+ }
+ /* update in_sync and copies for the next set of devices */
+ in_sync = nr_of_copies;
+ copies = nr_of_copies;
+ sd_temp = sd->next;
+ }
+ }
+ return 0;
+}
+
int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
long long size,
int level, char *layout_str, int chunksize, int raid_disks)
@@ -872,7 +958,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
st->update_tail = &st->updates;
}

- sra = sysfs_read(fd, 0, GET_LEVEL);
+ sra = sysfs_read(fd, 0, GET_LEVEL | GET_DEVS | GET_STATE);
if (sra) {
if (st->ss->external && st->subarray[0] == 0) {
array.level = LEVEL_CONTAINER;
@@ -955,6 +1041,25 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
size = array.size;
}

+ /* ========= check for Raid10 -> Raid0 conversion ===============
+ * current implementation assumes that the following conditions must be met:
+ * - far_copies == 1
+ * - near_copies == 2
+ */
+ if (level == 0 && array.level == 10 &&
+ array.layout == ((1 << 8) + 2) && !(array.raid_disks & 1)) {
+ int err;
+ err = remove_disks_on_raid10_to_raid0_takeover(st, sra, array.layout);
+ if (err) {
+ dprintf(Name": Array cannot be reshaped\n");
+ if (container)
+ free(container);
+ if (cfd > -1)
+ close(cfd);
+ return 1;
+ }
+ }
+
/* ======= set level =========== */
if (level != UnSet && level != array.level) {
/* Trying to change the level.


[PATCH 18/53] Treat feature as experimental

am 26.11.2010 09:06:08 von adam.kwolek

Because IMSM Windows compatibility has not been tested yet, the feature has to be treated as experimental until compatibility has been verified.
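
The gate amounts to an environment-variable check. The sketch below is a guess at the shape of such a check, not the mdadm implementation: both function names are hypothetical, and treating only the value "1" as enabling is an assumption, since the behaviour of check_env() is not shown in this patch.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: decide from the raw variable value.
 * Assumption: only the exact value "1" enables the feature. */
int experimental_value_enables(const char *val)
{
    return val != NULL && strcmp(val, "1") == 0;
}

/* Hypothetical wrapper: read MDADM_EXPERIMENTAL from the environment. */
int experimental_enabled(void)
{
    return experimental_value_enables(getenv("MDADM_EXPERIMENTAL"));
}
```

Splitting the decision from the getenv() call keeps the rule testable without touching the process environment.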

Signed-off-by: Adam Kwolek
---

mdadm.h | 1 +
super-intel.c | 4 ++++
util.c | 10 ++++++++++
3 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/mdadm.h b/mdadm.h
index 64b32cc..bf3c1d3 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -890,6 +890,7 @@ extern char *conf_word(FILE *file, int allow_key);
extern int conf_name_is_free(char *name);
extern int devname_matches(char *name, char *match);
extern struct mddev_ident_s *conf_match(struct mdinfo *info, struct supertype *st);
+extern inline int experimental(void);

extern void free_line(char *line);
extern int match_oneof(char *devices, char *devname);
diff --git a/super-intel.c b/super-intel.c
index 2434fa1..f092ccc 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -5780,6 +5780,10 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
int fd = -1;
char buf[PATH_MAX];

+
+ if (experimental() == 0)
+ return ret_val;
+
snprintf(buf, PATH_MAX, "/dev/md%i", st->devnum);
fd = open(buf , O_RDONLY | O_DIRECT);
if (fd < 0) {
diff --git a/util.c b/util.c
index 8739278..f220792 100644
--- a/util.c
+++ b/util.c
@@ -1859,3 +1859,13 @@ void append_metadata_update(struct supertype *st, void *buf, int len)
unsigned int __invalid_size_argument_for_IOC = 0;
#endif

+inline int experimental(void)
+{
+ if (check_env("MDADM_EXPERIMENTAL"))
+ return 1;
+ else {
+ fprintf(stderr, Name "(IMSM): To use this feature the MDADM_EXPERIMENTAL environment variable has to be defined.\n");
+ return 0;
+ }
+}
+


[PATCH 19/53] imsm: Add support for general migration

am 26.11.2010 09:06:15 von adam.kwolek

Internal IMSM procedures need to support the General Migration.
It is used during operations like:
- Online Capacity Expansion,
- migration initialization,
- finishing migration,
- apply changes to raid disks etc.
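
One change in this patch is that end_migration() merges leftover rebuild flags by disk index instead of by table position, because disk positions can change when takeover is used during expansion. A minimal sketch of that matching follows. The names are hypothetical, and the flag value (bit 24) and the 24-bit index mask are assumptions meant to mirror IMSM_ORD_REBUILD and ord_to_idx().

```c
#include <stddef.h>
#include <stdint.h>

#define ORD_REBUILD (1u << 24) /* assumed to match IMSM_ORD_REBUILD */

/* Strip the flag byte and return the disk index (assumed layout). */
static uint32_t ord_idx(uint32_t ord)
{
    return ord & 0x00ffffffu;
}

/* For every entry of the previous map, find the entry of the current map
 * that refers to the same disk index and OR the previous flags into it. */
void merge_rebuild_bits(uint32_t *map, size_t map_n,
                        const uint32_t *prev, size_t prev_n)
{
    size_t i, j;

    for (i = 0; i < prev_n; i++)
        for (j = 0; j < map_n; j++)
            if (ord_idx(map[j]) == ord_idx(prev[i])) {
                map[j] |= prev[i];
                break;
            }
}
```

With a previous map of {0, 1|ORD_REBUILD} and a current map of {1, 0, 2}, the rebuild flag ends up on the current entry for disk 1 even though that disk moved from position 1 to position 0.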

Signed-off-by: Adam Kwolek
---

mdmon.h | 2 +-
super-intel.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/mdmon.h b/mdmon.h
index 5c51566..8190358 100644
--- a/mdmon.h
+++ b/mdmon.h
@@ -76,7 +76,7 @@ void do_monitor(struct supertype *container);
void do_manager(struct supertype *container);
extern int sigterm;

-int read_dev_state(int fd);
+extern int read_dev_state(int fd);
int is_container_member(struct mdstat_ent *mdstat, char *container);

struct mdstat_ent *mdstat_read(int hold, int start);
diff --git a/super-intel.c b/super-intel.c
index f092ccc..90faff6 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -2139,7 +2139,8 @@ static void migrate(struct imsm_dev *dev, __u8 to_state, int migr_type)

/* duplicate and then set the target end state in map[0] */
memcpy(dest, src, sizeof_imsm_map(src));
- if (migr_type == MIGR_REBUILD) {
+ if ((migr_type == MIGR_REBUILD) ||
+ (migr_type == MIGR_GEN_MIGR)) {
__u32 ord;
int i;

@@ -2156,18 +2157,26 @@ static void end_migration(struct imsm_dev *dev, __u8 map_state)
{
struct imsm_map *map = get_imsm_map(dev, 0);
struct imsm_map *prev = get_imsm_map(dev, dev->vol.migr_state);
- int i;
+ int i, j;

/* merge any IMSM_ORD_REBUILD bits that were not successfully
* completed in the last migration.
*
- * FIXME add support for online capacity expansion and
- * raid-level-migration
+ * FIXME add support for raid-level-migration
*/
for (i = 0; i < prev->num_members; i++)
- map->disk_ord_tbl[i] |= prev->disk_ord_tbl[i];
+ for (j = 0; j < map->num_members; j++)
+ /* during online capacity expansion
+ * disks position can be changed if takeover is used
+ */
+ if (ord_to_idx(map->disk_ord_tbl[j]) ==
+ ord_to_idx(prev->disk_ord_tbl[i])) {
+ map->disk_ord_tbl[j] |= prev->disk_ord_tbl[i];
+ break;
+ }

dev->vol.migr_state = 0;
+ dev->vol.migr_type = 0;
dev->vol.curr_migr_unit = 0;
map->map_state = map_state;
}
@@ -4307,6 +4316,17 @@ static int update_subarray_imsm(struct supertype *st, char *update, mddev_ident_
}
#endif /* MDASSEMBLE */

+static int is_gen_migration(struct imsm_dev *dev)
+{
+ if (!dev->vol.migr_state)
+ return 0;
+
+ if (migr_type(dev) == MIGR_GEN_MIGR)
+ return 1;
+
+ return 0;
+}
+
static int is_rebuilding(struct imsm_dev *dev)
{
struct imsm_map *migr_map;
@@ -4388,8 +4408,7 @@ static struct mdinfo *container_content_imsm(struct supertype *st)
* unsupported migration
*/
if (dev->vol.migr_state &&
- (migr_type(dev) == MIGR_GEN_MIGR ||
- migr_type(dev) == MIGR_STATE_CHANGE)) {
+ (migr_type(dev) == MIGR_STATE_CHANGE)) {
fprintf(stderr, Name ": cannot assemble volume '%.16s':"
" unsupported migration in progress\n",
dev->volume);
@@ -4672,6 +4691,8 @@ static void handle_missing(struct intel_super *super, struct imsm_dev *dev)
super->updates_pending++;
}

+static void imsm_set_disk(struct active_array *a, int n, int state);
+
/* Handle dirty -> clean transititions and resync. Degraded and rebuild
* states are handled in imsm_set_disk() with one exception, when a
* resync is stopped due to a new failure this routine will set the
@@ -4747,6 +4768,16 @@ static int imsm_set_array_state(struct active_array *a, int consistent)
dev->vol.dirty = 1;
super->updates_pending++;
}
+
+ /* finalize online capacity expansion/reshape */
+ if ((a->curr_action != reshape) &&
+ (a->prev_action == reshape)) {
+ struct mdinfo *mdi;
+
+ for (mdi = a->info.devs; mdi; mdi = mdi->next)
+ imsm_set_disk(a, mdi->disk.raid_disk, mdi->curr_state);
+ }
+
return consistent;
}

@@ -4810,6 +4841,23 @@ static void imsm_set_disk(struct active_array *a, int n, int state)
end_migration(dev, map_state);
super->updates_pending++;
a->last_checkpoint = 0;
+ } else if (is_gen_migration(dev)) {
+ dprintf("imsm: Detected General Migration in state: ");
+ if (map_state == IMSM_T_STATE_NORMAL) {
+ end_migration(dev, map_state);
+ map = get_imsm_map(dev, 0);
+ map->failed_disk_num = ~0;
+ dprintf("normal\n");
+ } else {
+ if (map_state == IMSM_T_STATE_DEGRADED) {
+ dprintf("degraded\n");
+ end_migration(dev, map_state);
+ } else {
+ dprintf("failed\n");
+ }
+ map->map_state = map_state;
+ }
+ super->updates_pending++;
}
}


[PATCH 20/53] imsm: Add reshape_update for grow array case

am 26.11.2010 09:06:29 von adam.kwolek

Store the metadata update for the array in the container that is currently being reshaped, during Online Capacity Expansion initialization.
A new update type, update_reshape (carried in struct imsm_update_reshape), is added to perform this action.
The active array is extended with the reshape_delta_disks variable, which triggers additional actions in managemon.

1. reshape_super() prepares the metadata update and sends it to mdmon.
2. managemon, in prepare_update(), allocates the memory required for the bigger device object.
3. monitor, in process_update(), replaces the device object with the information
   passed from mdadm (the memory was allocated by managemon).
4. monitor sets the reshape_delta_disks variable to the delta_disks value from the update.
   This signals managemon to add devices to md and start the reshape for this array.
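
The four steps above can be sketched as a small single-threaded model. This is only an illustration of the hand-off, not the code from the patch: struct volume, struct array, struct update and build_update() are hypothetical stand-ins for struct imsm_dev, struct active_array, struct imsm_update_reshape and reshape_super(), and the real code runs the manager and the monitor in separate threads.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum state_of_reshape { reshape_not_active, reshape_is_starting };

/* hypothetical stand-in for struct imsm_dev */
struct volume { char name[16]; int num_members; };

/* hypothetical stand-in for struct active_array */
struct array {
	struct volume *dev;
	enum state_of_reshape reshape_state;
	int reshape_delta_disks;
};

/* hypothetical stand-in for struct imsm_update_reshape */
struct update {
	int delta_disks;
	int prepared;          /* -1 until the manager has allocated memory */
	struct volume payload; /* new device image built by mdadm */
	struct volume *mem;    /* buffer allocated by the manager */
};

/* step 1: mdadm builds the update from the current volume */
static void build_update(struct update *u, const struct volume *cur,
			 int new_members)
{
	memset(u, 0, sizeof(*u));
	u->payload = *cur;
	u->payload.num_members = new_members;
	u->delta_disks = new_members - cur->num_members;
	u->prepared = -1;
}

/* step 2: the manager allocates memory and copies the new device image */
static void prepare_update(struct update *u)
{
	u->mem = NULL;
	u->prepared = -1;
	if (u->delta_disks < 0)
		return;
	u->mem = malloc(sizeof(*u->mem));
	if (u->mem == NULL)
		return;
	*u->mem = u->payload;
	u->prepared = 1;
}

/* steps 3 and 4: the monitor only swaps pointers and raises the signal */
static int process_update(struct array *a, struct update *u)
{
	assert(a != NULL && u != NULL);
	if (u->prepared != 1 || a->reshape_state != reshape_not_active) {
		free(u->mem);
		u->mem = NULL;
		return -1;
	}
	free(a->dev);
	a->dev = u->mem;       /* array now owns the buffer */
	u->mem = NULL;
	a->reshape_delta_disks = u->delta_disks;
	a->reshape_state = reshape_is_starting;
	return 0;
}
```

Calling build_update(), prepare_update() and process_update() in that order on an idle array leaves it in reshape_is_starting with reshape_delta_disks set, which is the condition managemon is expected to react to.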

Signed-off-by: Adam Kwolek
Signed-off-by: Krzysztof Wojcik
---

Makefile | 6
managemon.c | 2
mdadm.h | 4
mdmon.h | 5
super-intel.c | 792 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
sysfs.c | 144 ++++++++++
util.c | 148 +++++++++++
7 files changed, 1094 insertions(+), 7 deletions(-)
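
The update sent to mdmon is one contiguous buffer: a fixed header that already contains the first spare slot, then any further spare slots, then the device images starting at upd_devs_offset. A minimal sketch of that layout follows, assuming simplified record types; struct spare_rec, struct dev_image, alloc_update() and update_dev() are hypothetical names, not part of mdadm. The patch computes the offset with (delta_disks - 1) unconditionally, while the sketch clamps the extra-slot count at zero, which is an assumption about the intent and not what the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* hypothetical stand-ins for struct dl and struct imsm_dev */
struct spare_rec { int major, minor; };
struct dev_image { char volume[16]; unsigned int num_members; };

struct reshape_update {
	int delta_disks;
	size_t devs_offset;         /* byte offset of the first device image */
	size_t device_size;         /* stride between device images */
	struct spare_rec spares[1]; /* first spare slot lives in the header */
	/* further spare slots and then the device images follow */
};

/* slots needed beyond the one embedded in the header (assumed clamp) */
static size_t extra_spare_slots(int delta_disks)
{
	return delta_disks > 1 ? (size_t)(delta_disks - 1) : 0;
}

static struct reshape_update *alloc_update(int delta_disks, int num_devs)
{
	size_t devs_offset, total;
	struct reshape_update *u;

	assert(num_devs >= 0);
	devs_offset = sizeof(struct reshape_update) +
		      extra_spare_slots(delta_disks) * sizeof(struct spare_rec);
	total = devs_offset + (size_t)num_devs * sizeof(struct dev_image);

	u = calloc(1, total);
	if (u == NULL)
		return NULL;
	u->delta_disks = delta_disks;
	u->devs_offset = devs_offset;
	u->device_size = sizeof(struct dev_image);
	return u;
}

/* locate device image idx inside the buffer */
static struct dev_image *update_dev(struct reshape_update *u, int idx)
{
	assert(u != NULL && idx >= 0);
	return (struct dev_image *)((char *)u + u->devs_offset +
				    (size_t)idx * u->device_size);
}
```

Because the receiver recomputes every pointer from devs_offset and device_size, the buffer can be copied between address spaces without fixing up internal pointers.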

diff --git a/Makefile b/Makefile
index e2c65a5..e3fb949 100644
--- a/Makefile
+++ b/Makefile
@@ -112,17 +112,17 @@ SRCS = mdadm.c config.c mdstat.c ReadMe.c util.c Manage.c Assemble.c Build.c \
MON_OBJS = mdmon.o monitor.o managemon.o util.o mdstat.o sysfs.o config.o \
Kill.o sg_io.o dlink.o ReadMe.o super0.o super1.o super-intel.o \
super-ddf.o sha1.o crc32.o msg.o bitmap.o \
- platform-intel.o probe_roms.o
+ platform-intel.o probe_roms.o mapfile.o

MON_SRCS = mdmon.c monitor.c managemon.c util.c mdstat.c sysfs.c config.c \
Kill.c sg_io.c dlink.c ReadMe.c super0.c super1.c super-intel.c \
super-ddf.c sha1.c crc32.c msg.c bitmap.c \
- platform-intel.c probe_roms.c
+ platform-intel.c probe_roms.c mapfile.c

STATICSRC = pwgr.c
STATICOBJS = pwgr.o

-ASSEMBLE_SRCS := mdassemble.c Assemble.c Manage.c config.c dlink.c util.c \
+ASSEMBLE_SRCS := mdassemble.c Assemble.c Manage.c config.c dlink.c util.c mapfile.c\
super0.c super1.c super-ddf.c super-intel.c sha1.c crc32.c sg_io.c mdstat.c \
platform-intel.c probe_roms.c sysfs.c
ASSEMBLE_AUTO_SRCS := mdopen.c
diff --git a/managemon.c b/managemon.c
index 53ab4a9..d495014 100644
--- a/managemon.c
+++ b/managemon.c
@@ -536,6 +536,8 @@ static void manage_new(struct mdstat_ent *mdstat,

new->container = container;

+ new->reshape_state = reshape_not_active;
+
inst = to_subarray(mdstat, container->devname);

new->info.array = mdi->array;
diff --git a/mdadm.h b/mdadm.h
index bf3c1d3..4777ad2 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -447,6 +447,7 @@ extern int sysfs_disk_to_scsi_id(int fd, __u32 *id);
extern int sysfs_unique_holder(int devnum, long rdev);
extern int sysfs_freeze_array(struct mdinfo *sra);
extern int load_sys(char *path, char *buf);
+extern struct mdinfo *sysfs_get_unused_spares(int container_fd, int fd);

extern int save_stripes(int *source, unsigned long long *offsets,
@@ -473,6 +474,7 @@ extern char *map_dev(int major, int minor, int create);

struct active_array;
struct metadata_update;
+enum state_of_reshape;

/* A superswitch provides entry point the a metadata handler.
*
@@ -891,6 +893,8 @@ extern int conf_name_is_free(char *name);
extern int devname_matches(char *name, char *match);
extern struct mddev_ident_s *conf_match(struct mdinfo *info, struct supertype *st);
extern inline int experimental(void);
+extern int find_array_minor(char *text_version, int external, int container, int *minor);
+extern int find_array_minor2(char *text_version, int external, int container, int *minor);

extern void free_line(char *line);
extern int match_oneof(char *devices, char *devname);
diff --git a/mdmon.h b/mdmon.h
index 8190358..9ea0b93 100644
--- a/mdmon.h
+++ b/mdmon.h
@@ -24,6 +24,8 @@ enum array_state { clear, inactive, suspended, readonly, read_auto,
enum sync_action { idle, reshape, resync, recover, check, repair, bad_action };

+enum state_of_reshape { reshape_not_active, reshape_is_starting, reshape_in_progress, reshape_cancel_request };
+
struct active_array {
struct mdinfo info;
struct supertype *container;
@@ -45,6 +47,9 @@ struct active_array {
enum array_state prev_state, curr_state, next_state;
enum sync_action prev_action, curr_action, next_action;

+ enum state_of_reshape reshape_state;
+ int reshape_delta_disks;
+
int check_degraded; /* flag set by mon, read by manage */

int devnum;
diff --git a/super-intel.c b/super-intel.c
index 90faff6..98e4c6d 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -286,6 +286,7 @@ enum imsm_update_type {
update_rename_array,
update_add_disk,
update_level,
+ update_reshape,
};

struct imsm_update_activate_spare {
@@ -296,6 +297,43 @@ struct imsm_update_activate_spare {
struct imsm_update_activate_spare *next;
};

+struct geo_params {
+ int dev_id;
+ char *dev_name;
+ long long size;
+ int level;
+ int layout;
+ int chunksize;
+ int raid_disks;
+};
+
+
+struct imsm_update_reshape {
+ enum imsm_update_type type;
+ int update_memory_size;
+ int reshape_delta_disks;
+ int disks_count;
+ int spares_in_update;
+ int devnum;
+ /* pointers to memory that will be allocated
+ * by manager during prepare_update()
+ */
+ struct intel_dev devs_mem;
+ /* status of update preparation
+ */
+ int update_prepared;
+ /* anchor data prepared by mdadm */
+ int upd_devs_offset;
+ int device_size;
+ struct dl upd_disks[1];
+ /* here goes added spares
+ */
+ /* and here go the imsm_devs located at upd_devs_offset
+ * devs are put here as raw data every device_size bytes
+ *
+ */
+};
+
struct disk_info {
__u8 serial[MAX_RAID_SERIAL_LEN];
};
@@ -5189,6 +5227,7 @@ static int disks_overlap(struct intel_super *super, int idx, struct imsm_update_
}

static void imsm_delete(struct intel_super *super, struct dl **dlp, unsigned index);
+int imsm_get_new_device_name(struct dl *dl);

static void imsm_process_update(struct supertype *st,
struct metadata_update *update)
@@ -5224,6 +5263,102 @@ static void imsm_process_update(struct supertype *st,
mpb = super->anchor;

switch (type) {
+ case update_reshape: {
+ struct imsm_update_reshape *u = (void *)update->buf;
+ struct dl *new_disk;
+ struct active_array *a;
+ int i;
+ __u32 new_mpb_size;
+ int new_disk_num;
+ struct intel_dev *current_dev;
+
+ dprintf("imsm: imsm_process_update() for update_reshape [u->update_prepared = %i]\n", u->update_prepared);
+ if ((u->update_prepared == -1) ||
+ (u->devnum < 0)) {
+ dprintf("imsm: Error: update_reshape not prepared\n");
+ goto update_reshape_exit;
+ }
+
+ if (u->spares_in_update) {
+ new_disk_num = mpb->num_disks + u->reshape_delta_disks;
+ new_mpb_size = disks_to_mpb_size(new_disk_num);
+ if (mpb->mpb_size < new_mpb_size)
+ mpb->mpb_size = new_mpb_size;
+
+ /* enable spares to use in array
+ */
+ for (i = 0; i < u->reshape_delta_disks; i++) {
+ char buf[PATH_MAX];
+
+ new_disk = super->disks;
+ while (new_disk) {
+ if ((new_disk->major == u->upd_disks[i].major) &&
+ (new_disk->minor == u->upd_disks[i].minor))
+ break;
+ new_disk = new_disk->next;
+ }
+ if (new_disk == NULL) {
+ u->update_prepared = -1;
+ goto update_reshape_exit;
+ }
+ if (new_disk->index < 0) {
+ new_disk->index = i + mpb->num_disks;
+ new_disk->raiddisk = new_disk->index; /* slot to fill in autolayout */
+ new_disk->disk.status |= CONFIGURED_DISK;
+ new_disk->disk.status &= ~SPARE_DISK;
+ }
+ sprintf(buf, "%d:%d", new_disk->major, new_disk->minor);
+ if (new_disk->fd < 0)
+ new_disk->fd = dev_open(buf, O_RDWR);
+ imsm_get_new_device_name(new_disk);
+ }
+ }
+
+ dprintf("imsm: process_update(): update_reshape: volume set mpb->num_raid_devs = %i\n", mpb->num_raid_devs);
+ /* manage changes in volumes
+ */
+ /* check if array is in RESHAPE_NOT_ACTIVE reshape state
+ */
+ for (a = st->arrays; a; a = a->next)
+ if (a->devnum == u->devnum)
+ break;
+ if ((a == NULL) || (a->reshape_state != reshape_not_active)) {
+ u->update_prepared = -1;
+ goto update_reshape_exit;
+ }
+ /* find current dev in intel_super
+ */
+ dprintf("\t\tLooking for volume %s\n", (char *)u->devs_mem.dev->volume);
+ current_dev = super->devlist;
+ while (current_dev) {
+ if (strcmp((char *)current_dev->dev->volume,
+ (char *)u->devs_mem.dev->volume) == 0)
+ break;
+ current_dev = current_dev->next;
+ }
+ if (current_dev == NULL) {
+ u->update_prepared = -1;
+ goto update_reshape_exit;
+ }
+
+ dprintf("Found volume %s\n", (char *)current_dev->dev->volume);
+ /* replace current device with provided in update
+ */
+ free(current_dev->dev);
+ current_dev->dev = u->devs_mem.dev;
+ u->devs_mem.dev = NULL;
+
+ /* set reshape_delta_disks
+ */
+ a->reshape_delta_disks = u->reshape_delta_disks;
+ a->reshape_state = reshape_is_starting;
+
+ super->updates_pending++;
+update_reshape_exit:
+ if (u->devs_mem.dev)
+ free(u->devs_mem.dev);
+ break;
+ }
case update_level: {
struct imsm_update_level *u = (void *)update->buf;
struct imsm_dev *dev_new, *dev = NULL;
@@ -5592,8 +5727,58 @@ static void imsm_prepare_update(struct supertype *st,
struct imsm_super *mpb = super->anchor;
size_t buf_len;
size_t len = 0;
+ void *upd_devs;

switch (type) {
+ case update_reshape: {
+ struct imsm_update_reshape *u = (void *)update->buf;
+ struct dl *dl = NULL;
+
+ u->update_prepared = -1;
+ u->devs_mem.dev = NULL;
+ dprintf("imsm: imsm_prepare_update() for update_reshape\n");
+ if (u->devnum < 0) {
+ dprintf("imsm: No passed device.\n");
+ break;
+ }
+ dprintf("imsm: reshape delta disks is = %i\n", u->reshape_delta_disks);
+ if (u->reshape_delta_disks < 0)
+ break;
+ u->update_prepared = 1;
+ if (u->reshape_delta_disks == 0) {
+ /* for non growing reshape buffers sizes are not affected
+ * but check some parameters
+ */
+ break;
+ }
+ /* count HDDs
+ */
+ u->disks_count = 0;
+ for (dl = super->disks; dl; dl = dl->next)
+ if (dl->index >= 0)
+ u->disks_count++;
+
+ /* set pointer in monitor address space
+ */
+ upd_devs = (struct imsm_dev *)((void *)u + u->upd_devs_offset);
+ /* allocate memory for new volumes */
+ if (((struct imsm_dev *)(upd_devs))->vol.migr_type != MIGR_GEN_MIGR) {
+ dprintf("imsm: Error.Device is not in migration state.\n");
+ u->update_prepared = -1;
+ break;
+ }
+ dprintf("passed device : %s\n", ((struct imsm_dev *)(upd_devs))->volume);
+ u->devs_mem.dev = calloc(1, u->device_size);
+ if (u->devs_mem.dev == NULL) {
+ u->update_prepared = -1;
+ break;
+ }
+ dprintf("METADATA Copy - using it.\n");
+ memcpy(u->devs_mem.dev, upd_devs, u->device_size);
+ len = disks_to_mpb_size(u->spares_in_update + mpb->num_disks);
+ dprintf("New anchor length is %llu\n", (unsigned long long)len);
+ break;
+ }
case update_level: {
struct imsm_update_level *u = (void *) update->buf;
struct active_array *a;
@@ -5818,6 +6003,525 @@ static int update_level_imsm(struct supertype *st, struct mdinfo *info,
return 0;
}

+int imsm_reshape_is_allowed_on_container(struct supertype *st,
+ struct geo_params *geo)
+{
+ int ret_val = 0;
+ struct mdinfo *info = NULL;
+ char buf[PATH_MAX];
+ int fd = -1;
+ int device_num = -1;
+ int devices_that_can_grow = 0;
+
+ dprintf("imsm: imsm_reshape_is_allowed_on_container(ENTER): st->devnum = (%i)\n", st->devnum);
+
+ if (geo == NULL ||
+ (geo->size != -1) || (geo->level != UnSet) ||
+ (geo->layout != UnSet) || (geo->chunksize != 0)) {
+ dprintf("imsm: Container operation is allowed for raid disks number change only.\n");
+ return ret_val;
+ }
+
+ snprintf(buf, PATH_MAX, "/dev/md%i", st->devnum);
+ dprintf("imsm: open device (%s)\n", buf);
+ fd = open(buf , O_RDONLY | O_DIRECT);
+ if (fd < 0) {
+ dprintf("imsm: cannot open device\n");
+ return ret_val;
+ }
+
+ if (geo->raid_disks == UnSet) {
+ dprintf("imsm: for container operation raid disks change is required\n");
+ goto exit_imsm_reshape_is_allowed_on_container;
+ }
+
+ device_num = 0; /* start from first device (skip container info) */
+ while (device_num > -1) {
+ int result;
+ int minor;
+ unsigned long long array_blocks;
+ struct imsm_map *map = NULL;
+ struct imsm_dev *dev = NULL;
+ struct intel_super *super = NULL;
+ int used_disks;
+
+
+ dprintf("imsm: checking device_num: %i\n", device_num);
+ sprintf(st->subarray, "%i", device_num);
+ st->ss->load_super(st, fd, NULL);
+ if (st->sb == NULL) {
+ if (device_num == 0) {
+ /* for the first checked device this is an error:
+ there should be at least one device to check
+ */
+ dprintf("imsm: error: superblock is NULL during container operation\n");
+ } else {
+ dprintf("imsm: no more devices to check, number of found devices: %i\n",
+ devices_that_can_grow);
+ /* check if any device in container can be grown
+ */
+ if (devices_that_can_grow)
+ ret_val = 1;
+ /* restore superblock, for last device not loaded */
+ sprintf(st->subarray, "%i", 0);
+ st->ss->load_super(st, fd, NULL);
+ }
+ break;
+ }
+ info = sysfs_read(fd, 0, GET_LEVEL|GET_VERSION|GET_DEVS|GET_STATE);
+ if (info == NULL) {
+ dprintf("imsm: Cannot get device info.\n");
+ break;
+ }
+ st->ss->getinfo_super(st, info);
+
+ if (geo->raid_disks < info->array.raid_disks) {
+ /* we work on container for Online Capacity Expansion
+ * only so raid_disks has to grow
+ */
+ dprintf("imsm: for container operation raid disks increase is required\n");
+ break;
+ }
+ /* check if size is set correctly
+ * wrong conditions could happen when a previous reshape was interrupted
+ */
+ super = st->sb;
+ dev = get_imsm_dev(super, device_num);
+ if (dev == NULL) {
+ dprintf("cannot get imsm device\n");
+ ret_val = 0;
+ break;
+ }
+ map = get_imsm_map(dev, 0);
+ if (map == NULL) {
+ dprintf("cannot get imsm device map\n");
+ ret_val = 0;
+ break;
+ }
+ used_disks = imsm_num_data_members(dev);
+ dprintf("read raid_disks = %i\n", used_disks);
+ dprintf("read requested disks = %i\n", geo->raid_disks);
+ array_blocks = map->blocks_per_member * used_disks;
+ /* round array size down to closest MB
+ */
+ array_blocks = (array_blocks >> SECT_PER_MB_SHIFT) << SECT_PER_MB_SHIFT;
+ if (sysfs_set_num(info, NULL, "array_size", array_blocks/2) < 0)
+ dprintf("cannot set array size to %llu\n", array_blocks/2);
+
+ if (geo->raid_disks > info->array.raid_disks)
+ devices_that_can_grow++;
+
+ if ((info->array.level != 0) &&
+ (info->array.level != 5)) {
+ /* container operation is not supported for other raid levels
+ */
+ dprintf("imsm: for container operation wrong raid level (%i) detected\n", info->array.level);
+ break;
+ } else {
+ /* check for platform support for this raid level configuration
+ */
+ struct intel_super *super = st->sb;
+ if (!is_raid_level_supported(super->orom, info->array.level, geo->raid_disks)) {
+ dprintf("platform does not support raid%d with %d disk%s\n",
+ info->array.level, geo->raid_disks, geo->raid_disks > 1 ? "s" : "");
+ break;
+ }
+ }
+
+ /* all raid5 and raid0 volumes in container
+ * have to be ready for Online Capacity Expansion
+ */
+ result = find_array_minor2(info->text_version, st->ss->external, st->devnum, &minor);
+ if (result < 0) {
+ dprintf("imsm: cannot find array\n");
+ break;
+ }
+ sprintf(info->sys_name, "md%i", minor);
+ if (sysfs_get_str(info, NULL, "array_state", buf, 20) <= 0) {
+ dprintf("imsm: cannot read array state\n");
+ break;
+ }
+ if ((strncmp(buf, "clean", 5) != 0) &&
+ (strncmp(buf, "clear", 5) != 0) &&
+ (strncmp(buf, "active", 6) != 0)) {
+ int index = strlen(buf) - 1;
+
+ if (index < 0)
+ index = 0;
+ *(buf + index) = 0;
+ fprintf(stderr, "imsm: Error: Array %s is not in proper state (current state: %s). Cannot continue.\n", info->sys_name, buf);
+ break;
+ }
+ if (info->array.level > 0) {
+ if (sysfs_get_str(info, NULL, "sync_action", buf, 20) <= 0) {
+ dprintf("imsm: for container operation no sync action\n");
+ break;
+ }
+ /* check that no reshape is in progress
+ */
+ if (strncmp(buf, "reshape", 7) == 0) {
+ dprintf("imsm: for container operation reshape is currently in progress\n");
+ break;
+ }
+ }
+ sysfs_free(info);
+ info = NULL;
+ device_num++;
+ }
+ sysfs_free(info);
+ info = NULL;
+
+exit_imsm_reshape_is_allowed_on_container:
+ if (fd >= 0)
+ close(fd);
+
+ dprintf("imsm: imsm_reshape_is_allowed_on_container(Exit) device_num = %i, ret_val = %i\n", device_num, ret_val);
+ if (ret_val)
+ dprintf("\tContainer operation allowed\n");
+ else
+ dprintf("\tError: %i\n", ret_val);
+
+ return ret_val;
+}
+struct mdinfo *get_spares_imsm(int devnum)
+{
+ int fd = -1;
+ char buf[PATH_MAX];
+ struct mdinfo *info = NULL;
+ struct mdinfo *ret_val = NULL;
+ int cont_id = -1;
+ struct supertype *st = NULL;
+ int find_result;
+
+ dprintf("imsm: get_spares_imsm for device: %i.\n", devnum);
+
+ sprintf(buf, "/dev/md%i", devnum);
+ dprintf("try to read container %s\n", buf);
+
+ cont_id = open(buf, O_RDONLY);
+ if (cont_id < 0) {
+ dprintf("imsm: ERROR: Cannot open container.\n");
+ goto abort;
+ }
+
+ /* get first volume */
+ st = super_by_fd(cont_id);
+ if (st == NULL) {
+ dprintf("imsm: ERROR: Cannot load container information.\n");
+ goto abort;
+ }
+ sprintf(buf, "/md%i/0", devnum);
+ find_result = find_array_minor2(buf, 1, devnum, &devnum);
+ if (find_result < 0) {
+ dprintf("imsm: ERROR: Cannot find array.\n");
+ goto abort;
+ }
+ sprintf(buf, "/dev/md%i", devnum);
+ fd = open(buf, O_RDONLY);
+ if (fd < 0) {
+ dprintf("imsm: ERROR: Cannot open device.\n");
+ goto abort;
+ }
+ sprintf(st->subarray, "0");
+ st->ss->load_super(st, cont_id, NULL);
+ if (st->sb == NULL) {
+ dprintf("imsm: ERROR: Cannot load array information.\n");
+ goto abort;
+ }
+ info = sysfs_read(fd, 0, GET_LEVEL | GET_VERSION | GET_DEVS | GET_STATE);
+ if (info == NULL) {
+ dprintf("imsm: Cannot get device info.\n");
+ goto abort;
+ }
+ st->ss->getinfo_super(st, info);
+ sprintf(buf, "/dev/md/%s", info->name);
+ ret_val = sysfs_get_unused_spares(cont_id, fd);
+ if (ret_val == NULL) {
+ dprintf("imsm: ERROR: Cannot get spare devices.\n");
+ goto abort;
+ }
+ if (ret_val->array.spare_disks == 0) {
+ dprintf("imsm: ERROR: No available spares.\n");
+ free(ret_val);
+ ret_val = NULL;
+ goto abort;
+ }
+
+abort:
+ if (st)
+ st->ss->free_super(st);
+ sysfs_free(info);
+ if (fd > -1)
+ close(fd);
+ if (cont_id > -1)
+ close(cont_id);
+
+ return ret_val;
+}
+
+/******************************************************************************
+ * function: imsm_create_metadata_update_for_reshape
+ * Function creates update for whole IMSM container.
+ * Slot numbers for new devices are only guessed. Managemon will correct them
+ * when the reshape is triggered and md sets the slot numbers.
+ * Slot numbers in metadata will be updated with the stage_2 update
+ ******************************************************************************/
+struct imsm_update_reshape *imsm_create_metadata_update_for_reshape(struct supertype *st, struct geo_params *geo)
+{
+ struct imsm_update_reshape *ret_val = NULL;
+ struct intel_super *super = st->sb;
+ int update_memory_size = 0;
+ struct imsm_update_reshape *u = NULL;
+ struct imsm_map *new_map = NULL;
+ struct mdinfo *spares = NULL;
+ int i;
+ unsigned long long array_blocks;
+ int used_disks;
+ int delta_disks = 0;
+ struct dl *new_disks;
+ int device_size;
+ void *upd_devs;
+
+ dprintf("imsm: imsm_create_metadata_update_for_reshape(enter) raid_disks = %i\n", geo->raid_disks);
+
+ if ((geo->raid_disks < super->anchor->num_disks) ||
+ (geo->raid_disks == UnSet))
+ geo->raid_disks = super->anchor->num_disks;
+ delta_disks = geo->raid_disks - super->anchor->num_disks;
+
+ /* size of all update data without anchor */
+ update_memory_size = sizeof(struct imsm_update_reshape);
+ /* add space for all devices,
+ * then add maps space
+ */
+ device_size = sizeof(struct imsm_dev);
+ device_size += sizeof(struct imsm_map);
+ device_size += 2 * (geo->raid_disks - 1) * sizeof(__u32);
+
+ update_memory_size += device_size * super->anchor->num_raid_devs;
+ if (delta_disks > 1) {
+ /* now add space for spare disks information
+ */
+ update_memory_size += sizeof(struct dl) * (delta_disks - 1);
+ }
+
+ u = calloc(1, update_memory_size);
+ if (u == NULL) {
+ dprintf("error: cannot get memory for imsm_update_reshape update\n");
+ return ret_val;
+ }
+ u->reshape_delta_disks = delta_disks;
+ u->update_prepared = -1;
+ u->update_memory_size = update_memory_size;
+ u->type = update_reshape;
+ u->spares_in_update = 0;
+ u->upd_devs_offset = sizeof(struct imsm_update_reshape) + sizeof(struct dl) * (delta_disks - 1);
+ upd_devs = (struct imsm_dev *)((void *)u + u->upd_devs_offset);
+ u->device_size = device_size;
+
+ for (i = 0; i < super->anchor->num_raid_devs; i++) {
+ struct imsm_dev *old_dev = __get_imsm_dev(super->anchor, i);
+ int old_disk_number;
+ int devnum = -1;
+
+ u->devnum = -1;
+ if (old_dev == NULL)
+ break;
+
+ find_array_minor((char *)old_dev->volume, 1, st->devnum, &devnum);
+ if (devnum == geo->dev_id) {
+ __u8 to_state;
+ struct imsm_map *new_map2;
+ int idx;
+
+ new_map = NULL;
+ imsm_copy_dev(upd_devs, old_dev);
+ new_map = get_imsm_map(upd_devs, 0);
+ old_disk_number = new_map->num_members;
+ new_map->num_members = geo->raid_disks;
+ u->reshape_delta_disks = new_map->num_members - old_disk_number;
+ /* start migration on new device
+ * it puts second map there also
+ */
+
+ to_state = imsm_check_degraded(super, old_dev, 0);
+ migrate(upd_devs, to_state, MIGR_GEN_MIGR);
+ /* second map length is equal to first map
+ * correct second map length to old value
+ */
+ new_map2 = get_imsm_map(upd_devs, 1);
+ if (new_map2) {
+ if (new_map2->num_members != old_disk_number) {
+ new_map2->num_members = old_disk_number;
+ /* guess new disk indexes
+ */
+ for (idx = new_map2->num_members; idx < new_map->num_members; idx++)
+ set_imsm_ord_tbl_ent(new_map, idx, idx);
+ }
+ u->devnum = geo->dev_id;
+ break;
+ }
+ }
+ }
+
+ if (delta_disks <= 0) {
+ dprintf("imsm: reshape without grow (disk add).\n");
+ /* finalize update */
+ goto calculate_size_only;
+ }
+
+ /* now get spare disks list
+ */
+ spares = get_spares_imsm(st->container_dev);
+
+ if (spares == NULL) {
+ dprintf("imsm: ERROR: Cannot get spare devices.\n");
+ goto exit_imsm_create_metadata_update_for_reshape;
+ }
+ if ((spares->array.spare_disks == 0) ||
+ (u->reshape_delta_disks > spares->array.spare_disks)) {
+ dprintf("imsm: ERROR: No available spares.\n");
+ goto exit_imsm_create_metadata_update_for_reshape;
+ }
+ /* we have got spares
+ * update disk list in imsm_disk list table in anchor
+ */
+ dprintf("imsm: %i spares are available.\n\n", spares->array.spare_disks);
+ new_disks = u->upd_disks;
+ for (i = 0; i < u->reshape_delta_disks; i++) {
+ struct mdinfo *dev = spares->devs;
+ __u32 id;
+ int fd;
+ char buf[PATH_MAX];
+ int rv;
+ unsigned long long size;
+
+ sprintf(buf, "%d:%d", dev->disk.major, dev->disk.minor);
+ dprintf("open spare disk %s (%s)\n", buf, dev->sys_name);
+ fd = dev_open(buf, O_RDWR);
+ if (fd < 0) {
+ dprintf("\topen failed\n");
+ goto exit_imsm_create_metadata_update_for_reshape;
+ }
+ if (sysfs_disk_to_scsi_id(fd, &id) == 0)
+ new_disks[i].disk.scsi_id = __cpu_to_le32(id);
+ else
+ new_disks[i].disk.scsi_id = __cpu_to_le32(0);
+ new_disks[i].disk.status = CONFIGURED_DISK;
+ rv = imsm_read_serial(fd, NULL, new_disks[i].disk.serial);
+ if (rv != 0) {
+ dprintf("\tcannot read disk serial\n");
+ close(fd);
+ goto exit_imsm_create_metadata_update_for_reshape;
+ }
+ dprintf("\tdisk serial: %s\n", new_disks[i].disk.serial);
+ get_dev_size(fd, NULL, &size);
+ size /= 512;
+ new_disks[i].disk.total_blocks = __cpu_to_le32(size);
+ new_disks[i].disk.owner_cfg_num = super->anchor->disk->owner_cfg_num;
+
+ new_disks[i].major = dev->disk.major;
+ new_disks[i].minor = dev->disk.minor;
+ /* no relink in update
+ * use table access
+ */
+ new_disks[i].next = NULL;
+
+ close(fd);
+ spares->devs = dev->next;
+ u->spares_in_update++;
+
+ free(dev);
+ dprintf("\n");
+ }
+calculate_size_only:
+ /* calculate new size
+ */
+ if (new_map != NULL) {
+
+ used_disks = imsm_num_data_members(upd_devs);
+ if (used_disks) {
+ array_blocks = new_map->blocks_per_member * used_disks;
+ /* round array size down to closest MB
+ */
+ array_blocks = (array_blocks >> SECT_PER_MB_SHIFT) << SECT_PER_MB_SHIFT;
+ ((struct imsm_dev *)(upd_devs))->size_low = __cpu_to_le32((__u32)array_blocks);
+ ((struct imsm_dev *)(upd_devs))->size_high = __cpu_to_le32((__u32)(array_blocks >> 32));
+ /* finalize update */
+ ret_val = u;
+ }
+ }
+
+exit_imsm_create_metadata_update_for_reshape:
+ /* free spares
+ */
+ if (spares) {
+ while (spares->devs) {
+ struct mdinfo *dev = spares->devs;
+ spares->devs = dev->next;
+ free(dev);
+ }
+ free(spares);
+ }
+
+ if (ret_val == NULL)
+ free(u);
+
+ return ret_val;
+}
+
+char *get_volume_for_olce(struct supertype *st, int raid_disks)
+{
+ char *ret_val = NULL;
+ struct mdinfo *sra = NULL;
+ struct mdinfo info;
+ char *ret_buf;
+ struct intel_super *super = st->sb;
+ int i;
+ int fd = -1;
+ char buf[PATH_MAX];
+
+ snprintf(buf, PATH_MAX, "/dev/md%i", st->devnum);
+ dprintf("imsm: open device (%s)\n", buf);
+ fd = open(buf , O_RDONLY | O_DIRECT);
+ if (fd < 0) {
+ dprintf("imsm: cannot open device\n");
+ return ret_val;
+ }
+
+ ret_buf = malloc(PATH_MAX);
+ if (ret_buf == NULL)
+ goto exit_get_volume_for_olce;
+
+ super = st->sb;
+ for (i = 0; i < super->anchor->num_raid_devs; i++) {
+ sprintf(st->subarray, "%i", i);
+ st->ss->load_super(st, fd, NULL);
+ if (st->sb == NULL)
+ goto exit_get_volume_for_olce;
+ info.devs = NULL;
+ st->ss->getinfo_super(st, &info);
+
+ if (raid_disks > info.array.raid_disks) {
+ snprintf(ret_buf, PATH_MAX,
+ "%s", info.name);
+ dprintf("Found device for OLCE requested raid_disks = %i, array raid_disks = %i\n",
+ raid_disks, info.array.raid_disks);
+ ret_val = ret_buf;
+ break;
+ }
+ }
+
+exit_get_volume_for_olce:
+ if ((ret_val == NULL) && ret_buf)
+ free(ret_buf);
+ sysfs_free(sra);
+ if (fd > -1)
+ close(fd);
+
+ return ret_val;
+}
+

int imsm_reshape_super(struct supertype *st, long long size, int level,
int layout, int chunksize, int raid_disks,
@@ -5827,7 +6531,20 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
struct mdinfo *sra = NULL;
int fd = -1;
char buf[PATH_MAX];
+ struct geo_params geo;
+
+ memset(&geo, 0, sizeof(struct geo_params));
+
+ geo.dev_name = dev;
+ geo.size = size;
+ geo.level = level;
+ geo.layout = layout;
+ geo.chunksize = chunksize;
+ geo.raid_disks = raid_disks;

+ dprintf("imsm: reshape_super called().\n");
+ dprintf("\tfor level : %i\n", geo.level);
+ dprintf("\tfor raid_disks : %i\n", geo.raid_disks);

if (experimental() == 0)
return ret_val;
@@ -5839,7 +6556,46 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
goto imsm_reshape_super_exit;
}

- if ((size == -1) && (layout == UnSet) && (raid_disks == 0) && (level != UnSet)) {
+ /* verify reshape conditions
+ * on container level we can do almost everything */
+ if (st->subarray[0] == 0) {
+ /* check for delta_disks > 0 and supported raid levels 0 and 5 only in container */
+ if (imsm_reshape_is_allowed_on_container(st, &geo)) {
+ struct imsm_update_reshape *u;
+ char *array;
+
+ array = get_volume_for_olce(st, geo.raid_disks);
+ if (array) {
+ find_array_minor(array, 1, st->devnum, &geo.dev_id);
+ if (geo.dev_id > 0) {
+ dprintf("imsm: Preparing metadata update for: %s\n", array);
+
+ st->update_tail = &st->updates;
+ u = imsm_create_metadata_update_for_reshape(st, &geo);
+
+ if (u) {
+ ret_val = 0;
+ append_metadata_update(st, u, u->update_memory_size);
+ } else
+ dprintf("imsm: Cannot prepare update\n");
+ } else
+ dprintf("imsm: Cannot find array in container\n");
+ free(array);
+ }
+ } else
+ dprintf("imsm: Operation is not allowed on container\n");
+ *st->subarray = 0;
+ goto imsm_reshape_super_exit;
+ } else
+ dprintf("imsm: not a container operation\n");
+
+ geo.dev_id = -1;
+ find_array_minor(geo.dev_name, 1, st->devnum, &geo.dev_id);
+
+ /* we have volume so takeover can be performed for single volume only
+ */
+ if ((geo.size == -1) && (geo.layout == UnSet) && (geo.raid_disks == 0) && (geo.level != UnSet) &&
+ (geo.dev_id > -1)) {
/* ok - this is takeover */
int container_fd;
int dn;
@@ -5867,9 +6623,9 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
* to/from different than raid10 level
* if source level is raid0 mdmon is sterted only
*/
- if (((level == 10) || (sra->array.level == 10) || (sra->array.level == 0)) &&
- (level != sra->array.level) &&
- (level > 0)) {
+ if (((geo.level == 10) || (sra->array.level == 10) || (sra->array.level == 0)) &&
+ (geo.level != sra->array.level) &&
+ (geo.level > 0)) {
st->update_tail = &st->updates;
err = update_level_imsm(st, sra, sra->name, 0, 0, NULL);
ret_val = 0;
@@ -5887,6 +6643,34 @@ imsm_reshape_super_exit:
return ret_val;
}

+int imsm_get_new_device_name(struct dl *dl)
+{
+ int rv;
+ char dv[PATH_MAX];
+ char nm[PATH_MAX];
+ char *dname;
+
+ if (dl->devname != NULL)
+ return 0;
+
+ sprintf(dv, "/sys/dev/block/%d:%d", dl->major, dl->minor);
+ memset(nm, 0, sizeof(nm));
+ rv = readlink(dv, nm, sizeof(nm));
+ if (rv > 0) {
+ nm[rv] = '\0';
+ dname = strrchr(nm, '/');
+ if (dname) {
+ char buf[PATH_MAX];
+
+ dname++;
+ sprintf(buf, "/dev/%s", dname);
+ dl->devname = strdup(buf);
+ }
+ }
+
+ return rv;
+}
+
struct superswitch super_imsm = {
#ifndef MDASSEMBLE
.examine_super = examine_super_imsm,
diff --git a/sysfs.c b/sysfs.c
index 3582fed..e316785 100644
--- a/sysfs.c
+++ b/sysfs.c
@@ -800,6 +800,150 @@ int sysfs_unique_holder(int devnum, long rdev)
return found;
}

+int sysfs_is_spare_device_belongs_to(int fd, char *devname)
+{
+ int ret_val = -1;
+ char fname[PATH_MAX];
+ char *base;
+ char *dbase;
+ struct mdinfo *sra;
+ DIR *dir = NULL;
+ struct dirent *de;
+
+ sra = malloc(sizeof(*sra));
+ if (sra == NULL)
+ goto abort;
+ memset(sra, 0, sizeof(*sra));
+ sysfs_init(sra, fd, -1);
+ if (sra->sys_name[0] == 0)
+ goto abort;
+
+ memset(fname, 0, PATH_MAX);
+ sprintf(fname, "/sys/block/%s/md/", sra->sys_name);
+ base = fname + strlen(fname);
+
+ /* Get all the devices as well */
+ *base = 0;
+ dir = opendir(fname);
+ if (!dir)
+ goto abort;
+ while ((de = readdir(dir)) != NULL) {
+ if (de->d_ino == 0 ||
+ strncmp(de->d_name, "dev-", 4) != 0)
+ continue;
+ strcpy(base, de->d_name);
+ dbase = base + strlen(base);
+ *dbase = '\0';
+ dbase = strstr(fname, "/md/");
+ if (dbase && strcmp(devname, dbase) == 0) {
+ ret_val = 1;
+ goto abort;
+ }
+ }
+abort:
+ if (dir)
+ closedir(dir);
+ sysfs_free(sra);
+
+ return ret_val;
+}
+
+struct mdinfo *sysfs_get_unused_spares(int container_fd, int fd)
+{
+ char fname[PATH_MAX];
+ char buf[PATH_MAX];
+ char *base;
+ char *dbase;
+ struct mdinfo *ret_val;
+ struct mdinfo *dev;
+ DIR *dir = NULL;
+ struct dirent *de;
+ int is_in;
+ char *to_check;
+
+ ret_val = malloc(sizeof(*ret_val));
+ if (ret_val == NULL)
+ goto abort;
+ memset(ret_val, 0, sizeof(*ret_val));
+ sysfs_init(ret_val, container_fd, -1);
+ if (ret_val->sys_name[0] == 0)
+ goto abort;
+
+ sprintf(fname, "/sys/block/%s/md/", ret_val->sys_name);
+ base = fname + strlen(fname);
+
+ strcpy(base, "raid_disks");
+ if (load_sys(fname, buf))
+ goto abort;
+ ret_val->array.raid_disks = strtoul(buf, NULL, 0);
+
+ /* Get all the devices as well */
+ *base = 0;
+ dir = opendir(fname);
+ if (!dir)
+ goto abort;
+ ret_val->array.spare_disks = 0;
+ while ((de = readdir(dir)) != NULL) {
+ char *ep;
+ if (de->d_ino == 0 ||
+ strncmp(de->d_name, "dev-", 4) != 0)
+ continue;
+ strcpy(base, de->d_name);
+ dbase = base + strlen(base);
+ *dbase = '\0';
+
+ to_check = strstr(fname, "/md/");
+ is_in = sysfs_is_spare_device_belongs_to(fd, to_check);
+ if (is_in == -1) {
+ dev = malloc(sizeof(*dev));
+ if (!dev)
+ goto abort;
+ strncpy(dev->text_version, fname, 50);
+
+ *dbase++ = '/';
+
+ dev->disk.raid_disk = strtoul(buf, &ep, 10);
+ dev->disk.raid_disk = -1;
+
+ strcpy(dbase, "block/dev");
+ if (load_sys(fname, buf)) {
+ free(dev);
+ continue;
+ }
+ sscanf(buf, "%d:%d", &dev->disk.major, &dev->disk.minor);
+ strcpy(dbase, "block/device/state");
+ if (load_sys(fname, buf) != 0) {
+ free(dev);
+ continue;
+ }
+ if (strncmp(buf, "offline", 7) == 0) {
+ free(dev);
+ continue;
+ }
+ if (strncmp(buf, "failed", 6) == 0) {
+ free(dev);
+ continue;
+ }
+
+ /* add this disk to spares list */
+ dev->next = ret_val->devs;
+ ret_val->devs = dev;
+ ret_val->array.spare_disks++;
+ *(dbase-1) = '\0';
+ dprintf("sysfs: found spare: (%s)\n", fname);
+ }
+ }
+ closedir(dir);
+ return ret_val;
+
+abort:
+ if (dir)
+ closedir(dir);
+ sysfs_free(ret_val);
+
+ return NULL;
+}
+
int sysfs_freeze_array(struct mdinfo *sra)
{
/* Try to freeze resync/rebuild on this array/container.
diff --git a/util.c b/util.c
index f220792..396f6d8 100644
--- a/util.c
+++ b/util.c
@@ -1869,3 +1869,151 @@ inline int experimental(void)
}
}

+int path2devnum(char *pth)
+{
+ char *ep;
+ int fd = -1;
+ char *dev_pth = NULL;
+ char *dev_str;
+ int dev_num = -1;
+
+ fd = open(pth, O_RDONLY);
+ if (fd < 0)
+ return dev_num;
+ close(fd);
+ dev_pth = canonicalize_file_name(pth);
+ if (dev_pth == NULL)
+ return dev_num;
+ dev_str = strrchr(dev_pth, '/');
+ if (dev_str) {
+ while (!isdigit(dev_str[0]))
+ dev_str++;
+ dev_num = strtoul(dev_str, &ep, 10);
+ if (*ep != '\0')
+ dev_num = -1;
+ }
+
+ if (dev_pth)
+ free(dev_pth);
+
+ return dev_num;
+}
+
+extern void map_read(struct map_ent **map);
+extern void map_free(struct map_ent *map);
+int find_array_minor(char *text_version, int external, int container, int *minor)
+{
+ int i;
+ char path[PATH_MAX];
+ struct stat s;
+
+ if (minor == NULL)
+ return -2;
+
+ snprintf(path, PATH_MAX, "/dev/md/%s", text_version);
+ i = path2devnum(path);
+ if (i > -1) {
+ *minor = i;
+ return 0;
+ }
+
+ i = path2devnum(text_version);
+ if (i > -1) {
+ *minor = i;
+ return 0;
+ }
+
+ if (container > 0) {
+ struct map_ent *map = NULL;
+ struct map_ent *m;
+ char cont[PATH_MAX];
+
+ snprintf(cont, PATH_MAX, "/md%i/", container);
+ map_read(&map);
+ for (m = map; m; m = m->next) {
+ int index;
+ unsigned int len = 0;
+ char buf[PATH_MAX];
+
+ /* array has to belong to the proper container
+ */
+ if (strncmp(cont, m->metadata, 6) != 0)
+ continue;
+ /* begin of array name in map have to be the same
+ * as array name in metadata
+ */
+ if (strncmp(m->path, path, strlen(path)) != 0)
+ continue;
+ /* array name has to be followed by '_' char
+ */
+ len = strlen(path);
+ if (*(m->path + len) != '_')
+ continue;
+ /* then we have to have valid index
+ */
+ len++;
+ if (strlen(m->path + len) <= 0)
+ continue;
+ /* index has to be the last position in the array name
+ */
+ index = atoi(m->path + strlen(path) + 1);
+ snprintf(buf, PATH_MAX, "%i", index);
+ len += strlen(buf);
+ if (len != strlen(m->path))
+ continue;
+ dprintf("Found %s device based on mdadm maps\n", m->path);
+ *minor = m->devnum;
+ map_free(map);
+ return 0;
+ }
+ map_free(map);
+ }
+
+ for (i = 127; i >= 0; i--) {
+ char buf[PATH_MAX];
+
+ snprintf(path, PATH_MAX, "/sys/block/md%d/md/", i);
+ if (stat(path, &s) != -1) {
+ strcat(path, "metadata_version");
+ if (load_sys(path, buf))
+ continue;
+ if (external) {
+ char *version = strchr(buf, ':');
+ if (version && strcmp(version + 1,
+ text_version))
+ continue;
+ } else {
+ if (strcmp(buf, text_version))
+ continue;
+ }
+ *minor = i;
+ return 0;
+ }
+ }
+
+
+ return -1;
+}
+
+/* find_array_minor2 looks for frozen devices also
+ */
+int find_array_minor2(char *text_version, int external, int container, int *minor)
+{
+ int result;
+ char buf[PATH_MAX];
+
+ strcpy(buf, text_version);
+ result = find_array_minor(text_version, external, container, minor);
+ if (result < 0) {
+ /* try to find frozen array also
+ */
+ char buf[PATH_MAX];
+
+ strcpy(buf, text_version);
+
+ *buf = '-';
+ result = find_array_minor(buf, external, container, minor);
+ }
+ return result;
+}
+
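The trailing-number parse done by path2devnum() above can be looked at in isolation. The following is a minimal sketch, not the code from the patch: `trailing_devnum` is a hypothetical helper, it skips the canonicalize_file_name() step, and unlike the loop in the patch it stops at the end of the string when the last path component contains no digit.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the number that ends the last path
 * component (e.g. 127 for "/dev/md127"), or -1 if there is none. */
int trailing_devnum(const char *path)
{
	const char *p = strrchr(path, '/');
	char *ep;
	long num;

	p = p ? p + 1 : path;
	/* skip the non-digit prefix, but never run past the terminator */
	while (*p && !isdigit((unsigned char)*p))
		p++;
	if (*p == '\0')
		return -1;
	num = strtol(p, &ep, 10);
	/* anything after the digits means this is not a plain "mdN" name */
	if (*ep != '\0')
		return -1;
	return (int)num;
}
```

With this sketch, "/dev/md127" yields 127, while "/dev/md/" and "/dev/md1p2" both yield -1.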

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH 21/53] imsm: FIX: core dump during imsm metadata writing

on 26.11.2010 09:06:38 by adam.kwolek

A wrong number of disks during a metadata update causes a core dump.
The new disk count, based on internal mdmon information, has to be used for the calculation (not the count previously read from metadata).
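The idea of deriving the disk count from the in-memory lists instead of trusting a stale metadata field can be sketched as follows. This is only an illustration under stated assumptions: `struct disk_node`, `count_table_disks` and `table_bytes` are hypothetical names, and the header and entry sizes are passed in by the caller rather than taken from the real IMSM structures.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the in-memory disk list. */
struct disk_node {
	int index;               /* -1 marks a spare */
	struct disk_node *next;
};

/* Count the disks that occupy an entry in the metadata disk table:
 * every non-spare disk on the active list plus every missing disk. */
int count_table_disks(const struct disk_node *disks,
		      const struct disk_node *missing)
{
	const struct disk_node *d;
	int n = 0;

	for (d = disks; d; d = d->next)
		if (d->index != -1)
			n++;
	for (d = missing; d; d = d->next)
		n++;
	return n;
}

/* Size of the header plus n disk entries; both sizes are
 * assumptions supplied by the caller of this sketch. */
size_t table_bytes(size_t header_bytes, size_t entry_bytes, int n)
{
	return header_bytes + entry_bytes * (size_t)n;
}
```

The point of the ordering is that the size is computed only after the count has been refreshed, so the two can no longer disagree.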

Signed-off-by: Adam Kwolek
---

super-intel.c | 27 ++++++++++++++++++---------
1 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index 98e4c6d..1231fa8 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -3510,8 +3510,9 @@ static int write_super_imsm_spares(struct intel_super *super, int doclose)
return 0;
}

-static int write_super_imsm(struct intel_super *super, int doclose)
+static int write_super_imsm(struct supertype *st, int doclose)
{
+ struct intel_super *super = st->sb;
struct imsm_super *mpb = super->anchor;
struct dl *d;
__u32 generation;
@@ -3519,6 +3520,7 @@ static int write_super_imsm(struct intel_super *super, int doclose)
int spares = 0;
int i;
__u32 mpb_size = sizeof(struct imsm_super) - sizeof(struct imsm_disk);
+ int num_disks = 0;

/* 'generation' is incremented everytime the metadata is written */
generation = __le32_to_cpu(mpb->generation_num);
@@ -3531,21 +3533,28 @@ static int write_super_imsm(struct intel_super *super, int doclose)
if (mpb->orig_family_num == 0)
mpb->orig_family_num = mpb->family_num;

- mpb_size += sizeof(struct imsm_disk) * mpb->num_disks;
for (d = super->disks; d; d = d->next) {
if (d->index == -1)
spares++;
- else
+ else {
mpb->disk[d->index] = d->disk;
+ num_disks++;
+ }
}
- for (d = super->missing; d; d = d->next)
+ for (d = super->missing; d; d = d->next) {
mpb->disk[d->index] = d->disk;
+ num_disks++;
+ }
+ mpb->num_disks = num_disks;
+ mpb_size += sizeof(struct imsm_disk) * mpb->num_disks;

for (i = 0; i < mpb->num_raid_devs; i++) {
struct imsm_dev *dev = __get_imsm_dev(mpb, i);
-
- imsm_copy_dev(dev, get_imsm_dev(super, i));
- mpb_size += sizeof_imsm_dev(dev, 0);
+ struct imsm_dev *dev2 = get_imsm_dev(super, i);
+ if ((dev) && (dev2)) {
+ imsm_copy_dev(dev, dev2);
+ mpb_size += sizeof_imsm_dev(dev, 0);
+ }
}
mpb_size += __le32_to_cpu(mpb->bbm_log_size);
mpb->mpb_size = __cpu_to_le32(mpb_size);
@@ -3665,7 +3674,7 @@ static int write_init_super_imsm(struct supertype *st)
struct dl *d;
for (d = super->disks; d; d = d->next)
Kill(d->devname, NULL, 0, 1, 1);
- return write_super_imsm(st->sb, 1);
+ return write_super_imsm(st, 1);
}
}
#endif
@@ -4938,7 +4947,7 @@ static void imsm_sync_metadata(struct supertype *container)
if (!super->updates_pending)
return;

- write_super_imsm(super, 0);
+ write_super_imsm(container, 0);

super->updates_pending = 0;
}


[PATCH 22/53] Send information to managemon about reshape request

on 26.11.2010 09:06:45 by adam.kwolek

When monitor has made the metadata update and signals a request for managemon to continue reshape initialization, kick managemon to perform its action, unless the array is being deactivated.
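The signalling rule can be reduced to a small predicate. This is a sketch under assumptions: the enum below is a hypothetical encoding (two reserved values, anything else meaning a pending request), and the function names are invented for illustration.

```c
/* Hypothetical encoding of the reshape state. */
enum reshape_state_sketch {
	SK_RESHAPE_NOT_ACTIVE = 0,
	SK_RESHAPE_IN_PROGRESS = 1,
	SK_RESHAPE_REQUESTED = 2   /* stands for "any other value" */
};

/* Deactivation drops a pending request but leaves a running reshape alone. */
int state_after_deactivate(int state)
{
	if (state != SK_RESHAPE_IN_PROGRESS)
		return SK_RESHAPE_NOT_ACTIVE;
	return state;
}

/* The manager is kicked only while a request is still pending. */
int should_signal_manager(int state, int deactivate)
{
	if (deactivate)
		state = state_after_deactivate(state);
	return state != SK_RESHAPE_NOT_ACTIVE &&
	       state != SK_RESHAPE_IN_PROGRESS;
}
```

A pending request on an array that is being deactivated therefore never reaches the manager.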

Signed-off-by: Adam Kwolek
---

monitor.c | 14 +++++++++++++-
super-intel.c | 14 ++++++++++++++
2 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/monitor.c b/monitor.c
index 5705a9b..05bd96c 100644
--- a/monitor.c
+++ b/monitor.c
@@ -399,8 +399,20 @@ static int read_and_act(struct active_array *a)
signal_manager();
}

- if (deactivate)
+ if (deactivate) {
a->container = NULL;
+ /* break reshape also
+ */
+ if (a->reshape_state != reshape_in_progress)
+ a->reshape_state = reshape_not_active;
+ }
+
+ /* signal manager when real delta_disks value is present
+ */
+ if ((a->reshape_state != reshape_not_active) &&
+ (a->reshape_state != reshape_in_progress)) {
+ signal_manager();
+ }

return dirty;
}
diff --git a/super-intel.c b/super-intel.c
index 1231fa8..56f7ea4 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -4755,6 +4755,16 @@ static int imsm_set_array_state(struct active_array *a, int consistent)
__u8 map_state = imsm_check_degraded(super, dev, failed);
__u32 blocks_per_unit;

+ if (a->reshape_state != reshape_not_active) {
+ /* array state change is blocked due to reshape action
+ * metadata changes are during applying only before reshape.
+ *
+ * '1' is returned to indicate that array is clean
+ */
+ dprintf("imsm: prepare to reshape\n");
+ return 1;
+ }
+
/* before we activate this array handle any missing disks */
if (consistent == 2)
handle_missing(super, dev);
@@ -5106,6 +5116,10 @@ static struct mdinfo *imsm_activate_spare(struct active_array *a,

dprintf("imsm: activate spare: inst=%d failed=%d (%d) level=%d\n",
inst, failed, a->info.array.raid_disks, a->info.array.level);
+
+ if (a->reshape_state != reshape_not_active)
+ return NULL;
+
if (imsm_check_degraded(super, dev, failed) != IMSM_T_STATE_DEGRADED)
return NULL;


[PATCH 23/53] Process reshape initialization by managemon

on 26.11.2010 09:06:53 by adam.kwolek

Monitor signals a request to managemon (using the reshape_delta_disks variable).
This causes a call to the reshape_array() vector. It prepares a second metadata update that verifies the slots of the added disks.
Slots are set by md during reshape start and are unknown to user space until then.
The second update is sent after reshape is started. While this update is processed, metadata is checked against the slot numbers set by md, and in case of a mismatch the metadata is updated.

The reshape is started in a delayed state because sync_max was set to 0. After this, reshape_delta_disks is set to the 'in progress' value to avoid reentry.
The reshape process is continued in mdadm.

If reshape cannot be started or any failure condition occurs, a 'cancel' message is prepared by reshape_array() and sent to monitor to roll back the metadata changes.
Mdadm is informed about the failure by the idle array state.
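The choice between the slot verification update and the cancel update can be summarised in a few lines. This is a simplified sketch, not the managemon code: `struct start_steps` and `choose_update` are hypothetical, and the three return values stand in for adding the disks, writing 0 to sync_max, and writing "reshape" to sync_action.

```c
/* Hypothetical summary of the three steps; 0 means the step succeeded. */
struct start_steps {
	int add_disks_rv;
	int sync_max_rv;
	int sync_action_rv;
};

enum next_update { SEND_SLOT_UPDATE, SEND_CANCEL };

/* Only a fully successful start leads to the slot verification update;
 * any failure leads to the cancel (rollback) update. */
enum next_update choose_update(const struct start_steps *s)
{
	if (s->add_disks_rv == 0 &&
	    s->sync_max_rv == 0 &&
	    s->sync_action_rv == 0)
		return SEND_SLOT_UPDATE;
	return SEND_CANCEL;
}
```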

Signed-off-by: Adam Kwolek
---

managemon.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
mdadm.h | 26 +++++++++++++++++++++++
2 files changed, 94 insertions(+), 0 deletions(-)

diff --git a/managemon.c b/managemon.c
index d495014..d9eb743 100644
--- a/managemon.c
+++ b/managemon.c
@@ -424,6 +424,74 @@ static void manage_member(struct mdstat_ent *mdstat,
}
}

+ if ((a->reshape_state != reshape_not_active) &&
+ (a->reshape_state != reshape_in_progress)) {
+ dprintf("Reshape signals need to manage this member\n");
+ if (a->container->ss->reshape_array) {
+ struct metadata_update *updates = NULL;
+ struct mdinfo *newdev = NULL;
+ struct mdinfo *d;
+
+ newdev = newa->container->ss->reshape_array(newa, reshape_in_progress, &updates);
+ if (newdev) {
+ int status_ok = 1;
+ newa = duplicate_aa(a);
+ if (newa == NULL)
+ goto reshape_out;
+
+ for (d = newdev; d ; d = d->next) {
+ struct mdinfo *newd;
+
+ newd = malloc(sizeof(*newd));
+ if (!newd) {
+ status_ok = 0;
+ dprintf("Cannot allocate memory for new disk.\n");
+ continue;
+ }
+ if (sysfs_add_disk(&newa->info, d, 0) < 0) {
+ free(newd);
+ status_ok = 0;
+ dprintf("Cannot add disk to array.\n");
+ continue;
+ }
+ disk_init_and_add(newd, d, newa);
+ }
+ /* go with reshape
+ */
+ if (status_ok)
+ if (sysfs_set_num(&newa->info, NULL, "sync_max", 0) < 0)
+ status_ok = 0;
+ if (status_ok && sysfs_set_str(&newa->info, NULL, "sync_action", "reshape") == 0) {
+ /* reshape executed
+ */
+ dprintf("Reshape was started\n");
+ replace_array(a->container, a, newa);
+ a = newa;
+ } else {
+ /* on problems cancel update
+ */
+ free_aa(newa);
+ free_updates(&updates);
+ updates = NULL;
+ a->container->ss->reshape_array(a, reshape_cancel_request, &updates);
+ sysfs_set_str(&a->info, NULL, "sync_action", "idle");
+ }
+ }
+ dprintf("Send metadata update for reshape.\n");
+
+ queue_metadata_update(updates);
+ updates = NULL;
+ wakeup_monitor();
+reshape_out:
+ while (newdev) {
+ d = newdev->next;
+ free(newdev);
+ newdev = d;
+ }
+ free_updates(&updates);
+ }
+ }
+
if (a->check_degraded && !frozen) {
struct metadata_update *updates = NULL;
struct mdinfo *newdev = NULL;
diff --git a/mdadm.h b/mdadm.h
index 4777ad2..750afcc 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -680,6 +680,32 @@ extern struct superswitch {
struct mdinfo *(*activate_spare)(struct active_array *a,
struct metadata_update **updates);

+ /* reshape_array() will
+ * 1. check if sync_max is set to 0
+ * 2. prepare the list of devices that have to be added
+ * 3. prepare the metadata update message that sets disk slots
+ * after reshape is started
+ * request_type:
+ * 1. RESHAPE_CANCEL_REQUEST
+ * In the error case it prepares a metadata rollback message.
+ * Such a message should be prepared when the
+ * passed request_type is set to RESHAPE_CANCEL_REQUEST.
+ * 2. RESHAPE_IN_PROGRESS
+ * requests transition to the RESHAPE_IN_PROGRESS state,
+ * so the proper update has to be prepared
+ * In the active array structure these values can appear:
+ * 1. RESHAPE_NOT_ACTIVE
+ * 2. RESHAPE_IN_PROGRESS
+ * 3. any other value indicates the requested change in disk number;
+ * this is visible only during reshape and metadata initialization.
+ * After initialization RESHAPE_IN_PROGRESS has to be placed
+ * in reshape_delta_disks. When reshape is finished it is replaced
+ * by RESHAPE_NOT_ACTIVE
+ */
+ struct mdinfo *(*reshape_array)(struct active_array *a,
+ enum state_of_reshape request_type,
+ struct metadata_update **updates);
+
int swapuuid; /* true if uuid is bigending rather than hostendian */
int external;
const char *name; /* canonical metadata name */


[PATCH 24/53] Add support to skip slot configuration

on 26.11.2010 09:07:00 by adam.kwolek

When managemon is signaled by monitor (using the reshape_delta_disks variable), it adds the new disks to the md configuration.
To let md choose the slot numbers, the flag SYSFS_ADD_DISK_DO_NOT_SET_SLOT was introduced to skip slot setting in the sysfs_add_disk() function.
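In the diff that follows, the slot write is skipped whenever raid_disk is negative. A minimal sketch of that decision is shown here, assuming a hypothetical helper `slot_attr_path` that only builds the sysfs attribute path (the path layout is the one used elsewhere in this series) and does not touch sysfs itself.

```c
#include <stdio.h>

/* Hypothetical helper. Returns 1 and fills 'path' when the slot should be
 * written, 0 when raid_disk is negative and the slot is left to md,
 * and -1 if 'path' is too small. */
int slot_attr_path(char *path, size_t len, const char *array,
		   const char *member, int raid_disk)
{
	int n;

	if (raid_disk < 0)
		return 0;
	n = snprintf(path, len, "/sys/block/%s/md/dev-%s/slot", array, member);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 1;
}
```

The array and member names used by a caller (for example "md126" and "sdb") are illustrative only.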

Signed-off-by: Adam Kwolek
---

sysfs.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/sysfs.c b/sysfs.c
index e316785..fa15895 100644
--- a/sysfs.c
+++ b/sysfs.c
@@ -614,7 +614,8 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
* yet, so just ignore status for now.
*/
sysfs_set_str(sra, sd, "state", "insync");
- rv |= sysfs_set_num(sra, sd, "slot", sd->disk.raid_disk);
+ if (sd->disk.raid_disk >= 0)
+ rv |= sysfs_set_num(sra, sd, "slot", sd->disk.raid_disk);
if (resume)
sysfs_set_num(sra, sd, "recovery_start", sd->recovery_start);
}


[PATCH 25/53] imsm: Verify slots in meta against slot numbers set by

on 26.11.2010 09:07:08 by adam.kwolek

To verify the slot numbers stored in metadata against those chosen by md, the update_reshape_set_slots update is used.

Managemon calls the reshape_array() vector, which prepares the slot verification metadata update. The update is sent once reshape has started successfully in md.
Monitor then verifies the slots and corrects them where needed.
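The verify-or-correct pass can be sketched on plain arrays. This is an illustration only: `verify_slots` is a hypothetical function, and the two arrays stand in for the per-disk index kept in metadata and the slot value read back from sysfs.

```c
/* Compare metadata indexes with the slots reported by md.
 * A negative sysfs slot means "could not be read" and is skipped.
 * Returns the number of mismatches; fixes them when 'correct' is set. */
int verify_slots(int *meta_index, const int *sysfs_slot, int n, int correct)
{
	int i;
	int mismatches = 0;

	for (i = 0; i < n; i++) {
		if (sysfs_slot[i] < 0)
			continue;
		if (sysfs_slot[i] != meta_index[i]) {
			mismatches++;
			if (correct)
				meta_index[i] = sysfs_slot[i];
		}
	}
	return mismatches;
}
```

Calling it first with correct == 0 only counts the differences; a second call with correct == 1 brings the metadata view in line with md.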

Signed-off-by: Adam Kwolek
---

super-intel.c | 302 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 302 insertions(+), 0 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index 56f7ea4..799bb51 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -287,6 +287,7 @@ enum imsm_update_type {
update_add_disk,
update_level,
update_reshape,
+ update_reshape_set_slots,
};

struct imsm_update_activate_spare {
@@ -5250,6 +5251,7 @@ static int disks_overlap(struct intel_super *super, int idx, struct imsm_update_
}

static void imsm_delete(struct intel_super *super, struct dl **dlp, unsigned index);
+int imsm_reshape_array_set_slots(struct active_array *a);
int imsm_get_new_device_name(struct dl *dl);

static void imsm_process_update(struct supertype *st,
@@ -5382,6 +5384,25 @@ update_reshape_exit:
free(u->devs_mem.dev);
break;
}
+ case update_reshape_set_slots: {
+ struct imsm_update_reshape *u = (void *)update->buf;
+ struct active_array *a;
+
+ dprintf("imsm: process_update() for update_reshape_set_slot for device %i\n", u->devnum);
+ for (a = st->arrays; a; a = a->next)
+ if (a->devnum == u->devnum) {
+ break;
+ }
+
+ if (a == NULL) {
+ dprintf(" - cannot locate requested array\n");
+ break;
+ }
+
+ if (imsm_reshape_array_set_slots(a) > -1)
+ super->updates_pending++;
+ break;
+ }
case update_level: {
struct imsm_update_level *u = (void *)update->buf;
struct imsm_dev *dev_new, *dev = NULL;
@@ -5802,6 +5823,9 @@ static void imsm_prepare_update(struct supertype *st,
dprintf("New anchor length is %llu\n", (unsigned long long)len);
break;
}
+ case update_reshape_set_slots: {
+ break;
+ }
case update_level: {
struct imsm_update_level *u = (void *) update->buf;
struct active_array *a;
@@ -6694,6 +6718,283 @@ int imsm_get_new_device_name(struct dl *dl)
return rv;
}

+int imsm_reshape_array_manage_new_slots(struct intel_super *super, int inst, int devnum, int correct);
+
+int imsm_reshape_array_set_slots(struct active_array *a)
+{
+ struct intel_super *super = a->container->sb;
+ int inst = a->info.container_member;
+
+ return imsm_reshape_array_manage_new_slots(super, inst, a->devnum, 1);
+}
+/* imsm_reshape_array_manage_new_slots()
+ * returns: number of corrected slots for correct == 1
+ * counted number of different slots for correct == 0
+*/
+int imsm_reshape_array_manage_new_slots(struct intel_super *super, int inst, int devnum, int correct)
+{
+
+ struct imsm_dev *dev = get_imsm_dev(super, inst);
+ struct imsm_map *map_1 = get_imsm_map(dev, 0);
+ struct imsm_map *map_2 = get_imsm_map(dev, 1);
+ struct dl *dl;
+ unsigned long long sysfs_slot;
+ char buf[PATH_MAX];
+ char *devname;
+ int fd;
+ struct mdinfo *sra = NULL;
+ int ret_val = 0;
+
+ if ((map_1 == NULL) || (map_2 == NULL)) {
+ dprintf("imsm_reshape_array_set_slots() no maps (map_1 =%p, map_2 = %p)\n", map_1, map_2);
+ dprintf("\t\tdev->vol.migr_state = %i\n", dev->vol.migr_state);
+ dprintf("\t\tdev->volume = %s\n", dev->volume);
+ return -1;
+ }
+
+ /* verify/correct slot configuration of added disks
+ */
+ dprintf("\n\nStart map verification for %i added devices on device no %i\n",
+ map_1->num_members - map_2->num_members, devnum);
+ devname = devnum2devname(devnum);
+ if (devname == NULL) {
+ dprintf("imsm: ERROR: Cannot get device name.\n");
+ return -1;
+ }
+ sprintf(buf, "/dev/%s", devname);
+ free(devname);
+
+ fd = open(buf, O_RDONLY);
+ if (fd < 0) {
+ dprintf("imsm: ERROR: Cannot open device %s.\n", buf);
+ return -1;
+ }
+
+ sra = sysfs_read(fd, 0, GET_LEVEL|GET_VERSION|GET_DEVS|GET_STATE);
+ if (!sra) {
+ dprintf("imsm: ERROR: Device not found.\n");
+ close(fd);
+ return -1;
+ }
+
+ for (dl = super->disks; dl; dl = dl->next) {
+ int fd2;
+ int rv;
+
+ dprintf("\tLooking at device %s (index = %i).\n", dl->devname, dl->index);
+ if (dl->devname && (strlen(dl->devname) > 5))
+ sprintf(buf, "/sys/block/%s/md/dev-%s/slot",
+ sra->sys_name, dl->devname+5);
+ fd2 = open(buf, O_RDONLY);
+ if (fd2 < 0)
+ continue;
+ rv = sysfs_fd_get_ll(fd2, &sysfs_slot);
+ close(fd2);
+ if (rv < 0)
+ continue;
+ dprintf("\t\tLooking at slot %llu in sysfs.\n", sysfs_slot);
+ if ((int)sysfs_slot != dl->index) {
+ dprintf("Slots doesn't match sysfs->%i and imsm->%i\n", (int)sysfs_slot, dl->index);
+ ret_val++;
+ if (correct)
+ dl->index = sysfs_slot;
+ }
+ }
+ close(fd);
+ sysfs_free(sra);
+ dprintf("IMSM Map verification finished (found wrong slots : %i).\n", ret_val);
+
+ return ret_val;
+}
+
+struct mdinfo *imsm_grow_array(struct active_array *a)
+{
+ int disk_count = 0;
+ struct intel_super *super = a->container->sb;
+ int inst = a->info.container_member;
+ struct imsm_dev *dev = get_imsm_dev(super, inst);
+ struct imsm_map *map = get_imsm_map(dev, 0);
+ struct mdinfo *di;
+ struct dl *dl;
+ int i;
+ int prev_raid_disks = a->info.array.raid_disks;
+ int new_raid_disks = prev_raid_disks + a->reshape_delta_disks;
+ struct mdinfo *vol = NULL;
+ char buf[PATH_MAX];
+ char *p;
+ int fd;
+ struct mdinfo *rv = NULL;
+
+ dprintf("imsm: grow array: inst=%d raid disks=%d(%d) level=%d\n",
+ inst, a->info.array.raid_disks, new_raid_disks, a->info.array.level);
+
+ /* get array sysfs entry
+ */
+ p = devnum2devname(a->devnum);
+ if (p == NULL)
+ return rv;
+ sprintf(buf, "/dev/%s", p);
+ free(p);
+ fd = open(buf, O_RDONLY);
+ if (fd < 0)
+ return rv;
+ vol = sysfs_read(fd, 0, GET_LEVEL|GET_VERSION|GET_DEVS|GET_STATE);
+ if (vol == NULL) {
+ close(fd);
+ return rv;
+ }
+ /* Look for all disks beyond current configuration
+ * To handle degradation after takeover
+ * look also on last disk in configuration.
+ */
+ for (i = prev_raid_disks; i < new_raid_disks; i++) {
+ /* OK, this device can be added. Try to add.
+ */
+ dl = imsm_add_spare(super, i, a, 0);
+ if (!dl)
+ continue;
+
+ if (dl->index < 0)
+ dl->index = i;
+ /* found a usable disk with enough space */
+ di = malloc(sizeof(*di));
+ if (!di)
+ continue;
+
+ memset(di, 0, sizeof(*di));
+ /* dl->index will be -1 in the case we are activating a
+ * pristine spare. imsm_process_update() will create a
+ * new index in this case. Once a disk is found to be
+ * failed in all member arrays it is kicked from the
+ * metadata
+ */
+ di->disk.number = dl->index;
+
+ /* (ab)use di->devs to store a pointer to the device
+ * we chose
+ */
+ di->devs = (struct mdinfo *) dl;
+
+ di->disk.raid_disk = -1;
+ di->disk.major = dl->major;
+ di->disk.minor = dl->minor;
+ di->disk.state = (1< + (1< + di->next_state = 0;
+
+ di->recovery_start = MaxSector;
+ di->data_offset = __le32_to_cpu(map->pba_of_lba0);
+ di->component_size = a->info.component_size;
+ di->container_member = inst;
+ super->random = random32();
+
+ di->next = rv;
+ rv = di;
+ disk_count++;
+ dprintf("%x:%x to be %d at %llu\n", dl->major, dl->minor,
+ i, di->data_offset);
+ }
+
+ dprintf("imsm: imsm_grow_array() configures %i raid disks\n", disk_count);
+ close(fd);
+ sysfs_free(vol);
+ if (disk_count != a->reshape_delta_disks) {
+
+ dprintf("imsm: ERROR: but it should configure %i\n",
+ a->reshape_delta_disks);
+
+ while (rv) {
+ di = rv;
+ rv = rv->next;
+ free(di);
+ }
+ }
+
+ return rv;
+}
+
+struct mdinfo *imsm_reshape_array(struct active_array *a, enum state_of_reshape request_type,
+ struct metadata_update **updates)
+{
+ struct imsm_update_reshape *u = NULL;
+ struct metadata_update *mu;
+ struct mdinfo *disk_list = NULL;
+
+ dprintf("imsm: imsm_reshape_array(reshape_delta_disks = %i)\t", a->reshape_delta_disks);
+ if (request_type == reshape_cancel_request) {
+ dprintf("prepare cancel message.\n");
+ goto imsm_reshape_array_exit;
+ }
+ if (a->reshape_state == reshape_not_active) {
+ dprintf("has nothing to do.\n");
+ return disk_list;
+ }
+ if (a->reshape_delta_disks < 0) {
+ dprintf("doesn't support shrinking.\n");
+ a->reshape_state = reshape_not_active;
+ return disk_list;
+ }
+
+ if (a->reshape_delta_disks == 0) {
+ dprintf("array parameters has to be changed\n");
+ /* TBD */
+ }
+ if (a->reshape_delta_disks > 0) {
+ dprintf("grow is detected.\n");
+ disk_list = imsm_grow_array(a);
+ }
+
+ if (disk_list) {
+ dprintf("imsm: send update update_reshape_set_slots\n");
+
+ u = (struct imsm_update_reshape *)calloc(1, sizeof(struct imsm_update_reshape));
+ if (u) {
+ u->type = update_reshape_set_slots;
+ a->reshape_state = reshape_in_progress;
+ }
+ } else
+ dprintf("error: cannot start reshape\n");
+
+imsm_reshape_array_exit:
+ if (u == NULL) {
+ dprintf("imsm: send update update_reshape_cancel\n");
+ a->reshape_state = reshape_not_active;
+ sysfs_set_str(&a->info, NULL, "sync_action", "idle");
+ }
+
+ if (u) {
+ /* post any prepared update
+ */
+ u->devnum = a->devnum;
+
+ u->update_memory_size = sizeof(struct imsm_update_reshape);
+ u->reshape_delta_disks = a->reshape_delta_disks;
+ u->update_prepared = 1;
+
+ mu = malloc(sizeof(struct metadata_update));
+ if (mu) {
+ mu->buf = (void *)u;
+ mu->space = NULL;
+ mu->len = u->update_memory_size;
+ mu->next = *updates;
+ *updates = mu;
+ } else {
+ a->reshape_state = reshape_not_active;
+ free(u);
+ u = NULL;
+ }
+ }
+
+ if ((disk_list) && (u == NULL)) {
+ while (disk_list) {
+ struct mdinfo *di = disk_list;
+ disk_list = disk_list->next;
+ free(di);
+ }
+ }
+ return disk_list;
+}
+
struct superswitch super_imsm = {
#ifndef MDASSEMBLE
.examine_super = examine_super_imsm,
@@ -6726,6 +7027,7 @@ struct superswitch super_imsm = {
.container_content = container_content_imsm,
.default_geometry = default_geometry_imsm,
.reshape_super = imsm_reshape_super,
+ .reshape_array = imsm_reshape_array,

.external = 1,
.name = "imsm",


[PATCH 26/53] imsm: Cancel metadata changes on reshape start failure

on 26.11.2010 09:07:16 by adam.kwolek

It can happen that managemon cannot start reshape in md.
To cancel the metadata changes, the update_reshape_cancel message is used. It is prepared by the reshape_array() vector.
When monitor receives this message, it rolls back the metadata changes made earlier while processing the update_reshape update.
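Two pieces of the rollback lend themselves to a small sketch: returning the added disks to the spare pool and rounding the restored array size down to a whole megabyte. The names below are hypothetical, and the shift value of 11 (2048 sectors of 512 bytes per MB) is an assumption of this sketch rather than something shown in the patch.

```c
#include <stdint.h>

#define SK_SECT_PER_MB_SHIFT 11 /* assumption: 2048 sectors per MB */

/* Round a sector count down to a whole number of megabytes. */
uint64_t round_down_to_mb(uint64_t sectors)
{
	return (sectors >> SK_SECT_PER_MB_SHIFT) << SK_SECT_PER_MB_SHIFT;
}

/* Return disks added by the cancelled reshape to the spare pool:
 * any index at or beyond the old member count becomes -1.
 * Returns how many disks were released. */
int release_added_disks(int *index, int n, int old_members)
{
	int i;
	int released = 0;

	for (i = 0; i < n; i++) {
		if (index[i] >= old_members) {
			index[i] = -1;
			released++;
		}
	}
	return released;
}
```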

Signed-off-by: Adam Kwolek
---

super-intel.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 123 insertions(+), 0 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index 799bb51..89fb118 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -288,6 +288,7 @@ enum imsm_update_type {
update_level,
update_reshape,
update_reshape_set_slots,
+ update_reshape_cancel,
};

struct imsm_update_activate_spare {
@@ -5403,6 +5404,94 @@ update_reshape_exit:
super->updates_pending++;
break;
}
+ case update_reshape_cancel: {
+ struct imsm_update_reshape *u = (void *)update->buf;
+ struct active_array *a;
+ int inst;
+ int i;
+ struct imsm_dev *dev;
+ struct imsm_dev *devi;
+ struct imsm_map *map_1;
+ struct imsm_map *map_2;
+ int reshape_delta_disks ;
+ struct dl *curr_disk;
+ int used_disks;
+ unsigned long long array_blocks;
+
+
+ dprintf("imsm: process_update() for update_reshape_cancel for device %i\n", u->devnum);
+ for (a = st->arrays; a; a = a->next)
+ if (a->devnum == u->devnum) {
+ break;
+ }
+ if (a == NULL)
+ break;
+
+ inst = a->info.container_member;
+ dev = get_imsm_dev(super, inst);
+ map_1 = get_imsm_map(dev, 0);
+ map_2 = get_imsm_map(dev, 1);
+ if (map_2 == NULL)
+ break;
+ reshape_delta_disks = map_1->num_members - map_2->num_members;
+ dprintf("\t\tRemove %i device(s) from configuration.\n", reshape_delta_disks);
+
+ /* when cancel was applied during reshape of second volume, we need disks for first
+ * array reshaped previously, find the smallest delta_disks to remove
+ */
+ i = 0;
+ devi = get_imsm_dev(super, i);
+ while (devi) {
+ struct imsm_map *mapi = get_imsm_map(devi, 0);
+ int delta_disks = map_1->num_members - mapi->num_members;
+ if ((i != inst) &&
+ (delta_disks < reshape_delta_disks) &&
+ (delta_disks >= 0))
+ reshape_delta_disks = delta_disks;
+ i++;
+ devi = get_imsm_dev(super, i);
+ }
+ /* remove disks
+ */
+ if (reshape_delta_disks > 0) {
+ /* reverse device(s) back to spares
+ */
+ curr_disk = super->disks;
+ while (curr_disk) {
+ dprintf("Looking at %i device to remove\n", curr_disk->index);
+ if (curr_disk->index >= map_2->num_members) {
+ dprintf("\t\t\tREMOVE\n");
+ curr_disk->index = -1;
+ curr_disk->raiddisk = -1;
+ curr_disk->disk.status &= ~CONFIGURED_DISK;
+ curr_disk->disk.status |= SPARE_DISK;
+ }
+ curr_disk = curr_disk->next;
+ }
+ }
+ /* roll back maps and migration
+ */
+ memcpy(map_1, map_2, sizeof_imsm_map(map_2));
+ /* reconfigure map_2 and perform migration end
+ */
+ map_2 = get_imsm_map(dev, 1);
+ memcpy(map_2, map_1, sizeof_imsm_map(map_1));
+ end_migration(dev, map_1->map_state);
+ /* array size rollback
+ */
+ used_disks = imsm_num_data_members(dev);
+ if (used_disks) {
+ array_blocks = map_1->blocks_per_member * used_disks;
+ /* round array size down to closest MB
+ */
+ array_blocks = (array_blocks >> SECT_PER_MB_SHIFT) << SECT_PER_MB_SHIFT;
+ dev->size_low = __cpu_to_le32((__u32)array_blocks);
+ dev->size_high = __cpu_to_le32((__u32)(array_blocks >> 32));
+ }
+
+ super->updates_pending++;
+ break;
+ }
case update_level: {
struct imsm_update_level *u = (void *)update->buf;
struct imsm_dev *dev_new, *dev = NULL;
@@ -5826,6 +5915,9 @@ static void imsm_prepare_update(struct supertype *st,
case update_reshape_set_slots: {
break;
}
+ case update_reshape_cancel: {
+ break;
+ }
case update_level: {
struct imsm_update_level *u = (void *) update->buf;
struct active_array *a;
@@ -6690,6 +6782,31 @@ imsm_reshape_super_exit:
return ret_val;
}

+void imsm_grow_array_remove_devices_on_cancel(struct active_array *a)
+{
+ struct mdinfo *di = a->info.devs;
+ struct mdinfo *di_prev = NULL;
+
+ while (di) {
+ if (di->disk.raid_disk < 0) {
+ struct mdinfo *rmdev = di;
+ sysfs_set_str(&a->info, rmdev, "state", "faulty");
+ sysfs_set_str(&a->info, rmdev, "slot", "none");
+ sysfs_set_str(&a->info, rmdev, "state", "remove");
+
+ if (di_prev)
+ di_prev->next = di->next;
+ else
+ a->info.devs = di->next;
+ di = di->next;
+ free(rmdev);
+ } else {
+ di_prev = di;
+ di = di->next;
+ }
+ }
+}
+
int imsm_get_new_device_name(struct dl *dl)
{
int rv;
@@ -6960,6 +7077,12 @@ imsm_reshape_array_exit:
dprintf("imsm: send update update_reshape_cancel\n");
a->reshape_state = reshape_not_active;
sysfs_set_str(&a->info, NULL, "sync_action", "idle");
+ imsm_grow_array_remove_devices_on_cancel(a);
+ u = (struct imsm_update_reshape *)calloc(1, sizeof(struct imsm_update_reshape));
+ if (u) {
+ u->type = update_reshape_cancel;
+ a->reshape_state = reshape_not_active;
+ }
}

if (u) {


[PATCH 27/53] imsm: Do not accept messages sent by mdadm

on 26.11.2010 09:07:23 by adam.kwolek

The messages update_reshape_cancel and update_reshape_set_slots are intended to be sent by managemon only.
If such a message is issued by mdadm, prepare_message() is called in managemon for it.
In that case update_prepared is set to '-1' to tell process_message() not to process the message.
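The guard works because the prepare step runs only for messages that arrive from outside mdmon. A minimal sketch of that marking scheme follows; the struct, the enum values and both function names are hypothetical stand-ins, not the IMSM handlers.

```c
/* Hypothetical update types and message layout. */
enum { SK_UPDATE_RESHAPE, SK_UPDATE_SET_SLOTS, SK_UPDATE_CANCEL };

struct update_sketch {
	int type;
	int update_prepared;
};

static int is_internal_only(int type)
{
	return type == SK_UPDATE_SET_SLOTS || type == SK_UPDATE_CANCEL;
}

/* Prepare step: reached only by externally submitted messages,
 * so internal-only types are marked as rejected here. */
void prepare_update_sketch(struct update_sketch *u)
{
	if (is_internal_only(u->type))
		u->update_prepared = -1;
}

/* Process step: returns 1 when the update is applied, 0 when rejected. */
int process_update_sketch(const struct update_sketch *u)
{
	if (is_internal_only(u->type) && u->update_prepared == -1)
		return 0;
	return 1;
}
```

A message built inside the monitor never passes through the prepare step, so its update_prepared value stays untouched and it is applied.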

Signed-off-by: Adam Kwolek
---

super-intel.c | 24 ++++++++++++++++++++++++
1 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index 89fb118..2984685 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -5400,6 +5400,13 @@ update_reshape_exit:
break;
}

+ /* do not accept this update type sent by mdadm
+ */
+ if (u->update_prepared == -1) {
+ dprintf("imsm: message is sent by mdadm. cannot accept\n\n");
+ break;
+ }
+
if (imsm_reshape_array_set_slots(a) > -1)
super->updates_pending++;
break;
@@ -5427,6 +5434,13 @@ update_reshape_exit:
if (a == NULL)
break;

+ /* do not accept this update type sent by mdadm
+ */
+ if (u->update_prepared == -1) {
+ dprintf("imsm: message is sent by mdadm. cannot accept\n\n");
+ break;
+ }
+
inst = a->info.container_member;
dev = get_imsm_dev(super, inst);
map_1 = get_imsm_map(dev, 0);
@@ -5913,9 +5927,19 @@ static void imsm_prepare_update(struct supertype *st,
break;
}
case update_reshape_set_slots: {
+ struct imsm_update_reshape *u = (void *)update->buf;
+
+ /* do not accept this update type sent by mdadm
+ */
+ u->update_prepared = -1;
break;
}
case update_reshape_cancel: {
+ struct imsm_update_reshape *u = (void *)update->buf;
+
+ /* do not accept this update type sent by mdadm
+ */
+ u->update_prepared = -1;
break;
}
case update_level: {


[PATCH 28/53] imsm: Do not indicate resync during reshape

on 26.11.2010 09:07:31 by adam.kwolek

Once reshape is started, resync is not allowed in parallel, because it would break the reshape.
If the array is in the General Migration state, do not indicate resync, so that the reshape can continue.
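The resulting decision order can be written as a small predicate. This is a sketch under assumptions: the enum is a hypothetical subset of migration types, and the initial "not migrating" check is assumed from context because that part of the function is not visible in the hunk.

```c
/* Hypothetical subset of migration types. */
enum sk_migr_type {
	SK_MIGR_INIT,
	SK_MIGR_REBUILD,
	SK_MIGR_GEN_MIGR,
	SK_MIGR_REPAIR
};

/* Returns 1 when the array should be reported as resyncing. */
int is_resyncing_sketch(int migrating, enum sk_migr_type type,
			int prev_map_normal)
{
	if (!migrating)                 /* assumed early exit */
		return 0;
	if (type == SK_MIGR_INIT || type == SK_MIGR_REPAIR)
		return 1;
	if (type == SK_MIGR_GEN_MIGR)   /* reshape: never report resync */
		return 0;
	return prev_map_normal ? 1 : 0;
}
```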

Signed-off-by: Adam Kwolek
---

super-intel.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index 2984685..42219f6 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -4677,9 +4677,13 @@ static int is_resyncing(struct imsm_dev *dev)
migr_type(dev) == MIGR_REPAIR)
return 1;

+ if (migr_type(dev) == MIGR_GEN_MIGR)
+ return 0;
+
migr_map = get_imsm_map(dev, 1);

- if (migr_map->map_state == IMSM_T_STATE_NORMAL)
+ if ((migr_map->map_state == IMSM_T_STATE_NORMAL) &&
+ (dev->vol.migr_type != MIGR_GEN_MIGR))
return 1;
else
return 0;


[PATCH 29/53] Add spares to raid0 array using takeover

am 26.11.2010 09:07:38 von adam.kwolek

Online Capacity Expansion uses spares to expand an array.
To run expansion on raid0, spares have to be added to the raid0 volume as well.
A raid0 array cannot have spares, because no mdmon runs for a raid0 array.
To work around this, a takeover to raid5 (and back) is used: mdmon runs temporarily for the raid5 array, and spare drives can be added to the container.

Signed-off-by: Adam Kwolek
---

Manage.c | 154 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 153 insertions(+), 1 deletions(-)
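
The takeover round trip described above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `struct container_model`, `takeover_up()`, `takeover_down()` and `add_spare()` are hypothetical names, and no sysfs or mdmon interaction is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory model of a container holding one volume. */
struct container_model {
	int level;		/* current RAID level of the volume: 0 or 5 */
	int monitor_running;	/* non-zero while a monitor (mdmon) is attached */
	int spares;		/* spare disks known to the container */
};

/* Switch a raid0 volume to raid5 so that a monitor can run.
 * Returns the level to restore afterwards, or -1 if nothing was changed. */
static int takeover_up(struct container_model *c)
{
	if (c->monitor_running || c->level != 0)
		return -1;
	c->level = 5;
	c->monitor_running = 1;
	return 0;
}

/* Undo takeover_up(); a negative restore_level means "nothing to undo". */
static void takeover_down(struct container_model *c, int restore_level)
{
	if (restore_level < 0)
		return;
	c->level = restore_level;
	c->monitor_running = 0;
}

/* Add one spare; returns 0 on success, 1 on failure.
 * The original level is restored on every exit path. */
static int add_spare(struct container_model *c)
{
	int restore_level;
	int rv = 1;

	assert(c != NULL);
	restore_level = takeover_up(c);
	if (c->monitor_running) {
		c->spares++;
		rv = 0;
	}
	takeover_down(c, restore_level);
	return rv;
}
```

The sketch funnels success and failure through a single exit so the level is always restored; the patch gets the same effect by calling takeover5to0() on each return path of the add operation in Manage_subdevs().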

diff --git a/Manage.c b/Manage.c
index ac9415b..65e345e 100644
--- a/Manage.c
+++ b/Manage.c
@@ -31,6 +31,137 @@
#define START_MD _IO (MD_MAJOR, 2)
#define STOP_MD _IO (MD_MAJOR, 3)

+
+void takeover5to0(struct mdinfo *sra)
+{
+ char *c;
+ int err;
+
+ dprintf("Takeover Raid5->Raid0.\n");
+
+ if (sra == NULL)
+ return;
+
+ c = map_num(pers, 0);
+ if (c == NULL)
+ return;
+
+ err = sysfs_set_str(sra, NULL, "level", c);
+
+ if (err)
+ fprintf(stderr,
+ Name ": %s: could not set level "
+ "to %s for external super.\n",
+ sra->sys_name, c);
+ sysfs_free(sra);
+}
+
+struct mdinfo *takeover0to5(int fd)
+{
+ struct mdinfo *ret_val = NULL;
+ int devnum;
+ struct mdinfo *sra = NULL;
+ struct supertype *st = NULL;
+ struct mdinfo info;
+ int dev_fd = -1;
+ int device_num;
+
+ dprintf("Takeover Raid0->Raid5.\n");
+ devnum = fd2devnum(fd);
+ if (mdmon_running(devnum)) {
+ dprintf("mdmon is running for this container - takeover is not required\n");
+ return ret_val;
+ }
+
+ sra = sysfs_read(fd, 0, GET_VERSION);
+ if (sra == NULL)
+ return ret_val;
+
+ st = super_by_fd(fd);
+
+ if ((sra->array.major_version != -1) ||
+ (strncmp(sra->text_version, "imsm", 4) != 0) ||
+ (st == NULL) ||
+ (st->ss->external == 0))
+ goto exit_takeover0to5;
+
+ device_num = 0;
+ while (device_num > -1) {
+ char dev_name[1024];
+ char *c;
+ int err;
+
+ sprintf(st->subarray, "%i", device_num);
+ st->ss->load_super(st, fd, NULL);
+ if (st->sb == NULL)
+ break;
+
+ st->ss->getinfo_super(st, &info);
+ if (info.array.level == 0) {
+ char *p = NULL;
+
+ sprintf(dev_name, "/dev/md/%s", info.name);
+ dev_fd = open_mddev(dev_name , 1);
+
+ if (dev_fd < 0)
+ continue;
+
+ sysfs_free(sra);
+ sra = sysfs_read(dev_fd, 0, GET_VERSION);
+ if (!sra)
+ continue;
+
+ c = map_num(pers, 5);
+ if (c == NULL)
+ break;
+
+ err = sysfs_set_str(sra, NULL, "level", c);
+ if (err) {
+ fprintf(stderr, Name": %s: could not set level to "
+ "%s for external super.\n", sra->sys_name, c);
+ break;
+ }
+
+ /* return to this raid level and do not release
+ */
+ ret_val = sra;
+ sra = NULL;
+
+ /* send update with return level,
+ * return level tells monitor to:
+ * - after reshape return automatically to this level
+ * - if return level is set do not activate spares
+ */
+
+ /* if after takeover mdmon is not running,
+ * start it
+ */
+ if (!mdmon_running(devnum))
+ start_mdmon(devnum);
+ p = devnum2devname(devnum);
+ if (p) {
+ ping_monitor(p);
+ free(p);
+ }
+ sleep(1);
+ device_num = -2;
+ break;
+ }
+ device_num++;
+ }
+
+exit_takeover0to5:
+ if (st)
+ st->ss->free_super(st);
+ sysfs_free(sra);
+ if (dev_fd >= 0)
+ close(dev_fd);
+
+dprintf("Takeover ret = %p\n\n", ret_val);
+ return ret_val;
+
+}
+
int Manage_ro(char *devname, int fd, int readonly)
{
/* switch to readonly or rw
@@ -811,6 +942,7 @@ int Manage_subdevs(char *devname, int fd,
struct mdinfo *sra;
int container_fd;
int devnum = fd2devnum(fd);
+ char *devname = NULL;

container_fd = open_dev_excl(devnum);
if (container_fd < 0) {
@@ -819,10 +951,17 @@ int Manage_subdevs(char *devname, int fd,
dv->devname);
return 1;
}
+ /* Raid 0 add spare via takeover
+ */
+ struct mdinfo *return_raid0_sra = NULL;
+ /* try to perform takeover if needed
+ */
+ return_raid0_sra = takeover0to5(container_fd);

if (!mdmon_running(devnum)) {
fprintf(stderr, Name ": add failed for %s: mdmon not running\n",
dv->devname);
+ takeover5to0(return_raid0_sra);
close(container_fd);
return 1;
}
@@ -831,7 +970,13 @@ int Manage_subdevs(char *devname, int fd,
if (!sra) {
fprintf(stderr, Name ": add failed for %s: sysfs_read failed\n",
dv->devname);
+ takeover5to0(return_raid0_sra);
close(container_fd);
+ devname = devnum2devname(devnum);
+ if (devname) {
+ ping_monitor(devname);
+ free(devname);
+ }
return 1;
}
sra->array.level = LEVEL_CONTAINER;
@@ -843,12 +988,19 @@ int Manage_subdevs(char *devname, int fd,
if (sysfs_add_disk(sra, &new_mdi, 0) != 0) {
fprintf(stderr, Name ": add new device to external metadata"
" failed for %s\n", dv->devname);
+ takeover5to0(return_raid0_sra);
close(container_fd);
+ sysfs_free(sra);
return 1;
}
- ping_monitor(devnum2devname(devnum));
+ takeover5to0(return_raid0_sra);
sysfs_free(sra);
close(container_fd);
+ devname = devnum2devname(devnum);
+ if (devname) {
+ ping_monitor(devname);
+ free(devname);
+ }
} else if (ioctl(fd, ADD_NEW_DISK, &disc)) {
fprintf(stderr, Name ": add new device failed for %s as %d: %s\n",
dv->devname, j, strerror(errno));


[PATCH 30/53] imsm: FIX: Fill sys_name field in getinfo_super()

am 26.11.2010 09:07:46 von adam.kwolek

The sys_name field is not filled in during the getinfo_super() call.

Signed-off-by: Adam Kwolek
---

super-intel.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index 42219f6..3243132 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -1489,6 +1489,7 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info)
struct imsm_map *map = get_imsm_map(dev, 0);
struct dl *dl;
char *devname;
+ int minor;

for (dl = super->disks; dl; dl = dl->next)
if (dl->raiddisk == info->disk.raid_disk)
@@ -1560,6 +1561,11 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info)
free(devname);
info->safe_mode_delay = 4000; /* 4 secs like the Matrix driver */
uuid_from_super_imsm(st, info->uuid);
+
+ /* fill sys_name field
+ */
+ if (find_array_minor2(info->text_version, 1, st->devnum, &minor) == 0)
+ sprintf(info->sys_name, "md%i", minor);
}

/* check the config file to see if we can return a real uuid for this spare */
@@ -1611,6 +1617,7 @@ static void getinfo_super_imsm(struct supertype *st, struct mdinfo *info)
{
struct intel_super *super = st->sb;
struct imsm_disk *disk;
+ int minor;

if (super->current_vol >= 0) {
getinfo_super_imsm_volume(st, info);
@@ -1712,6 +1719,10 @@ static void getinfo_super_imsm(struct supertype *st, struct mdinfo *info)
memcpy(info->uuid, uuid_match_any, sizeof(int[4]));
fixup_container_spare_uuid(info);
}
+ /* fill sys_name field
+ */
+ if (find_array_minor2(info->text_version, 1, st->devnum, &minor) == 0)
+ sprintf(info->sys_name, "md%i", minor);
}

static int is_raid_level_supported(const struct imsm_orom *orom, int level, int raiddisks);


[PATCH 31/53] imsm: FIX: Fill delta_disks field in getinfo_super()

am 26.11.2010 09:07:53 von adam.kwolek

The delta_disks field is not always filled in during the getinfo_super() call.

Signed-off-by: Adam Kwolek
---

super-intel.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
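
The value this patch stores can be sketched on its own. This is a minimal model under stated assumptions: `struct map_model`, `delta_disks()` and `old_disk_count()` are hypothetical names, and the map is reduced to its member count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced view of a metadata map: only the member count. */
struct map_model {
	int num_members;
};

/* Disks added (positive) or removed (negative) by the migration.
 * prev may be NULL when no migration map exists. */
static int delta_disks(const struct map_model *cur, const struct map_model *prev)
{
	assert(cur != NULL);
	if (prev == NULL)
		return 0;
	return cur->num_members - prev->num_members;
}

/* Number of disks before the reshape, derived from the current count. */
static int old_disk_count(int raid_disks, int delta)
{
	return raid_disks - delta;
}
```

For example, growing from 3 to 4 members gives a delta of 1, and subtracting that delta from the current disk count recovers the old count of 3. Later code in this series (imsm_child_grow()) derives its old disk count in the same way.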

diff --git a/super-intel.c b/super-intel.c
index 3243132..3f75550 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -1487,6 +1487,7 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info)
struct intel_super *super = st->sb;
struct imsm_dev *dev = get_imsm_dev(super, super->current_vol);
struct imsm_map *map = get_imsm_map(dev, 0);
+ struct imsm_map *map2 = get_imsm_map(dev, 1);
struct dl *dl;
char *devname;
int minor;
@@ -1566,6 +1567,12 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info)
*/
if (find_array_minor2(info->text_version, 1, st->devnum, &minor) == 0)
sprintf(info->sys_name, "md%i", minor);
+
+ /* fill delta_disks field
+ */
+ info->delta_disks = 0;
+ if (map2)
+ info->delta_disks = map->num_members - map2->num_members;
}

/* check the config file to see if we can return a real uuid for this spare */


[PATCH 32/53] imsm: FIX: spare list contains one device several times

am 26.11.2010 09:08:01 von adam.kwolek

The spare search assumed that each newly picked device is added to the array before the next search.
Under that assumption every call returns a different disk.

When the spares list is created during Online Capacity Expansion, the device list is collected first and only then are all devices added to md.
A device picked from the spares pool therefore has to be checked against the devices picked so far; otherwise the same disk is returned every time.
Devices already picked are stored in a list, and this list is also used to verify new candidates.

Signed-off-by: Adam Kwolek
---

super-intel.c | 24 +++++++++++++++++-------
1 files changed, 17 insertions(+), 7 deletions(-)
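
The check against already picked devices can be sketched as follows. This is only a minimal model: `struct dev_model`, `already_picked()` and `pick_spare()` are hypothetical names, and a device is identified by its major:minor pair alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device record, identified by major:minor numbers. */
struct dev_model {
	int major;
	int minor;
	struct dev_model *next;
};

/* Return 1 if (major, minor) already appears in list, 0 otherwise. */
static int already_picked(const struct dev_model *list, int major, int minor)
{
	const struct dev_model *d;

	for (d = list; d; d = d->next)
		if (d->major == major && d->minor == minor)
			return 1;
	return 0;
}

/* Return the first candidate from pool that is not in picked, or NULL. */
static struct dev_model *pick_spare(struct dev_model *pool,
				    const struct dev_model *picked)
{
	struct dev_model *d;

	for (d = pool; d; d = d->next)
		if (!already_picked(picked, d->major, d->minor))
			return d;
	return NULL;
}
```

The sketch walks the picked list with a local cursor, so every candidate is compared against the whole list. The loop in the patch appears to advance the `additional_test_list` parameter itself, which may leave later candidates compared against only part of the list; that difference is worth checking in review.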

diff --git a/super-intel.c b/super-intel.c
index 3f75550..e4ba875 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -5006,7 +5006,8 @@ static struct dl *imsm_readd(struct intel_super *super, int idx, struct active_a
}

static struct dl *imsm_add_spare(struct intel_super *super, int slot,
- struct active_array *a, int activate_new)
+ struct active_array *a, int activate_new,
+ struct mdinfo *additional_test_list)
{
struct imsm_dev *dev = get_imsm_dev(super, a->info.container_member);
int idx = get_imsm_disk_idx(dev, slot);
@@ -5032,6 +5033,16 @@ static struct dl *imsm_add_spare(struct intel_super *super, int slot,
}
if (d)
continue;
+ while (additional_test_list) {
+ if (additional_test_list->disk.major == dl->major &&
+ additional_test_list->disk.minor == dl->minor) {
+ dprintf("%x:%x already in additional test list\n", dl->major, dl->minor);
+ break;
+ }
+ additional_test_list = additional_test_list->next;
+ }
+ if (additional_test_list)
+ continue;

/* skip in use or failed drives */
if (is_failed(&dl->disk) || idx == dl->index ||
@@ -5165,9 +5176,9 @@ static struct mdinfo *imsm_activate_spare(struct active_array *a,
*/
dl = imsm_readd(super, i, a);
if (!dl)
- dl = imsm_add_spare(super, i, a, 0);
+ dl = imsm_add_spare(super, i, a, 0, NULL);
if (!dl)
- dl = imsm_add_spare(super, i, a, 1);
+ dl = imsm_add_spare(super, i, a, 1, NULL);
if (!dl)
continue;

@@ -6422,11 +6433,11 @@ struct mdinfo *get_spares_imsm(int devnum)
sprintf(buf, "/dev/md/%s", info->name);
ret_val = sysfs_get_unused_spares(cont_id, fd);
if (ret_val == NULL) {
- dprintf("imsm: ERROR: Cannot get spare devices.\n");
+ fprintf(stderr, Name": imsm: ERROR: Cannot get spare devices.\n");
goto abort;
}
if (ret_val->array.spare_disks == 0) {
- dprintf("imsm: ERROR: No available spares.\n");
+ fprintf(stderr, Name": imsm: ERROR: No available spares.\n");
free(ret_val);
ret_val = NULL;
goto abort;
@@ -7013,10 +7024,9 @@ struct mdinfo *imsm_grow_array(struct active_array *a)
for (i = prev_raid_disks; i < new_raid_disks; i++) {
/* OK, this device can be added. Try to add.
*/
- dl = imsm_add_spare(super, i, a, 0);
+ dl = imsm_add_spare(super, i, a, 0, rv);
if (!dl)
continue;
-
if (dl->index < 0)
dl->index = i;
/* found a usable disk with enough space */


[PATCH 33/53] Prepare and free fdlist in functions

am 26.11.2010 09:08:08 von adam.kwolek

Creation of the fd handle table is moved into a function for code reuse.

manage_reshape() will reuse the child_grow() function from Grow.c.
The Grow.c code that prepares the parameters for this function can then be reused as well.

Signed-off-by: Adam Kwolek
---

Grow.c | 136 ++++++++++++++++++++++++++++++++++++++++++++++++---------------
mdadm.h | 11 +++++
2 files changed, 115 insertions(+), 32 deletions(-)
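
The allocate/release pairing behind these two helpers can be sketched independently of mdadm. This is a minimal illustration, not the Grow.c code: `fd_table_alloc()` and `fd_table_free()` are hypothetical names, and opening the component devices is left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate count descriptors (all set to -1, meaning "not open") and
 * count offsets (all zero).  Returns 0 on success, -1 on failure, in
 * which case both output pointers are left NULL. */
static int fd_table_alloc(int **fds_out, unsigned long long **offsets_out,
			  int count)
{
	int i;

	assert(fds_out != NULL && offsets_out != NULL && count > 0);
	*fds_out = malloc(count * sizeof(**fds_out));
	*offsets_out = calloc(count, sizeof(**offsets_out));
	if (*fds_out == NULL || *offsets_out == NULL) {
		free(*fds_out);
		free(*offsets_out);
		*fds_out = NULL;
		*offsets_out = NULL;
		return -1;
	}
	for (i = 0; i < count; i++)
		(*fds_out)[i] = -1;
	return 0;
}

/* Close every open descriptor, free both tables and clear the caller's
 * pointers so that a second call is harmless. */
static void fd_table_free(int **fds_io, unsigned long long **offsets_io,
			  int count)
{
	int i;

	if (fds_io == NULL || offsets_io == NULL)
		return;
	if (*fds_io != NULL)
		for (i = 0; i < count; i++)
			if ((*fds_io)[i] >= 0)
				close((*fds_io)[i]);
	free(*fds_io);
	free(*offsets_io);
	*fds_io = NULL;
	*offsets_io = NULL;
}
```

Passing pointers to the caller's pointers, as the patch does, lets the release function reset them to NULL. The sketch treats descriptor 0 as valid and closes it, whereas reshape_free_fdlist() in the patch only closes descriptors greater than 0.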

diff --git a/Grow.c b/Grow.c
index 347f07b..8cba82b 100644
--- a/Grow.c
+++ b/Grow.c
@@ -832,6 +832,103 @@ int remove_disks_on_raid10_to_raid0_takeover(struct supertype *st,
return 0;
}

+void reshape_free_fdlist(int **fdlist_in,
+ unsigned long long **offsets_in,
+ int size)
+{
+ int i;
+ int *fdlist;
+ unsigned long long *offsets;
+ if ((fdlist_in == NULL) || (offsets_in == NULL)) {
+ dprintf(Name " Error: Parameters verification error #1.\n");
+ return;
+ }
+
+ fdlist = *fdlist_in;
+ offsets = *offsets_in;
+ if ((fdlist == NULL) || (offsets == NULL)) {
+ dprintf(Name " Error: Parameters verification error #2.\n");
+ return;
+ }
+
+ for (i = 0; i < size; i++) {
+ if (fdlist[i] > 0)
+ close(fdlist[i]);
+ }
+
+ free(fdlist);
+ free(offsets);
+ *fdlist_in = NULL;
+ *offsets_in = NULL;
+}
+
+int reshape_prepare_fdlist(char *devname,
+ struct mdinfo *sra,
+ int raid_disks,
+ int nrdisks,
+ unsigned long blocks,
+ char *backup_file,
+ int **fdlist_in,
+ unsigned long long **offsets_in)
+{
+ int d = 0;
+ int *fdlist;
+ unsigned long long *offsets;
+ struct mdinfo *sd;
+
+ if ((devname == NULL) || (sra == NULL) ||
+ (fdlist_in == NULL) || (offsets_in == NULL)) {
+ dprintf(Name " Error: Parameters verification error #1.\n");
+ d = -1;
+ goto release;
+ }
+
+ fdlist = *fdlist_in;
+ offsets = *offsets_in;
+
+ if ((fdlist == NULL) || (offsets == NULL)) {
+ dprintf(Name " Error: Parameters verification error #2.\n");
+ d = -1;
+ goto release;
+ }
+
+ for (d = 0; d <= nrdisks; d++)
+ fdlist[d] = -1;
+ d = raid_disks;
+ for (sd = sra->devs; sd; sd = sd->next) {
+ if (sd->disk.state & (1<<MD_DISK_FAULTY))
+ continue;
+ if (sd->disk.state & (1<<MD_DISK_SYNC)) {
+ char *dn = map_dev(sd->disk.major,
+ sd->disk.minor, 1);
+ fdlist[sd->disk.raid_disk]
+ = dev_open(dn, O_RDONLY);
+ offsets[sd->disk.raid_disk] = sd->data_offset*512;
+ if (fdlist[sd->disk.raid_disk] < 0) {
+ fprintf(stderr, Name ": %s: cannot open component %s\n",
+ devname, dn ? dn : "-unknown-");
+ d = -1;
+ goto release;
+ }
+ } else if (backup_file == NULL) {
+ /* spare */
+ char *dn = map_dev(sd->disk.major,
+ sd->disk.minor, 1);
+ fdlist[d] = dev_open(dn, O_RDWR);
+ offsets[d] = (sd->data_offset + sra->component_size - blocks - 8)*512;
+ if (fdlist[d] < 0) {
+ fprintf(stderr, Name ": %s: cannot open component %s\n",
+ devname, dn ? dn : "-unknown-");
+ d = -1;
+ goto release;
+ }
+ d++;
+ }
+ }
+release:
+ return d;
+}
+
int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
long long size,
int level, char *layout_str, int chunksize, int raid_disks)
@@ -1547,38 +1644,13 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
rv = 1;
break;
}
- for (d=0; d <= nrdisks; d++)
- fdlist[d] = -1;
- d = array.raid_disks;
- for (sd = sra->devs; sd; sd=sd->next) {
- if (sd->disk.state & (1<<MD_DISK_FAULTY))
- continue;
- if (sd->disk.state & (1<<MD_DISK_SYNC)) {
- char *dn = map_dev(sd->disk.major,
- sd->disk.minor, 1);
- fdlist[sd->disk.raid_disk]
- = dev_open(dn, O_RDONLY);
- offsets[sd->disk.raid_disk] = sd->data_offset*512;
- if (fdlist[sd->disk.raid_disk] < 0) {
- fprintf(stderr, Name ": %s: cannot open component %s\n",
- devname, dn?dn:"-unknown-");
- rv = 1;
- goto release;
- }
- } else if (backup_file == NULL) {
- /* spare */
- char *dn = map_dev(sd->disk.major,
- sd->disk.minor, 1);
- fdlist[d] = dev_open(dn, O_RDWR);
- offsets[d] = (sd->data_offset + sra->component_size - blocks - 8)*512;
- if (fdlist[d]<0) {
- fprintf(stderr, Name ": %s: cannot open component %s\n",
- devname, dn?dn:"-unknown");
- rv = 1;
- goto release;
- }
- d++;
- }
+
+ d = reshape_prepare_fdlist(devname, sra, array.raid_disks,
+ nrdisks, blocks, backup_file,
+ &fdlist, &offsets);
+ if (d < 0) {
+ rv = 1;
+ goto release;
}
if (backup_file == NULL) {
if (st->ss->external && !st->ss->manage_reshape) {
diff --git a/mdadm.h b/mdadm.h
index 750afcc..698f1bf 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -448,6 +448,17 @@ extern int sysfs_unique_holder(int devnum, long rdev);
extern int sysfs_freeze_array(struct mdinfo *sra);
extern int load_sys(char *path, char *buf);
extern struct mdinfo *sysfs_get_unused_spares(int container_fd, int fd);
+extern int reshape_prepare_fdlist(char *devname,
+ struct mdinfo *sra,
+ int raid_disks,
+ int nrdisks,
+ unsigned long blocks,
+ char *backup_file,
+ int **fdlist_in,
+ unsigned long long **offsets_in);
+extern void reshape_free_fdlist(int **fdlist_in,
+ unsigned long long **offsets_in,
+ int size);

extern int save_stripes(int *source, unsigned long long *offsets,


[PATCH 34/53] Compute backup blocks in function.

am 26.11.2010 09:08:16 von adam.kwolek

Evaluation of the number of backup blocks is moved into a function for code reuse.

Signed-off-by: Adam Kwolek
---

Grow.c | 44 +++++++++++++++++++++++++++-----------------
mdadm.h | 4 +++-
2 files changed, 30 insertions(+), 18 deletions(-)
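
The calculation being factored out is a least common multiple of the old and new stripe widths, both expressed in 512-byte sectors. The sketch below shows the same idea in a standalone form; it is not the Grow.c code, `gcd_ull()` and `backup_sectors()` are hypothetical names, and it uses the remainder form of Euclid's algorithm instead of the subtraction loop in the patch.

```c
#include <assert.h>

/* Greatest common divisor by the remainder form of Euclid's algorithm. */
static unsigned long long gcd_ull(unsigned long long a, unsigned long long b)
{
	while (b != 0) {
		unsigned long long t = a % b;
		a = b;
		b = t;
	}
	return a;
}

/* Sectors (512 bytes) to back up: the least common multiple of the old
 * and new stripe widths.  Chunk sizes are in bytes; both chunk sizes and
 * both data-disk counts are assumed to be non-zero. */
static unsigned long long backup_sectors(unsigned int new_chunk,
					 unsigned int old_chunk,
					 unsigned int new_data,
					 unsigned int old_data)
{
	unsigned long long old_stripe = (unsigned long long)(old_chunk / 512) * old_data;
	unsigned long long new_stripe = (unsigned long long)(new_chunk / 512) * new_data;

	assert(old_stripe != 0 && new_stripe != 0);
	/* Divide before multiplying to keep the intermediate value small. */
	return old_stripe / gcd_ull(old_stripe, new_stripe) * new_stripe;
}
```

As a worked example, a 64 KiB chunk is 128 sectors; going from 3 to 4 data disks gives stripe widths of 384 and 512 sectors, a GCD of 128, and therefore 384 / 128 * 512 = 1536 sectors to back up. The sketch asserts non-zero inputs because the subtraction loop in the patch would not terminate if either stripe width were zero.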

diff --git a/Grow.c b/Grow.c
index 8cba82b..8cc17d5 100644
--- a/Grow.c
+++ b/Grow.c
@@ -929,6 +929,31 @@ release:
return d;
}

+unsigned long compute_backup_blocks(int nchunk, int ochunk,
+ unsigned int ndata, unsigned int odata)
+{
+ unsigned long a, b, blocks;
+ /* So how much do we need to backup.
+ * We need an amount of data which is both a whole number of
+ * old stripes and a whole number of new stripes.
+ * So LCM for (chunksize*datadisks).
+ */
+ a = (ochunk/512) * odata;
+ b = (nchunk/512) * ndata;
+ /* Find GCD */
+ while (a != b) {
+ if (a < b)
+ b -= a;
+ if (b < a)
+ a -= b;
+ }
+ /* LCM == product / GCD */
+ blocks = (ochunk/512) * (nchunk/512) * odata * ndata / a;
+
+ return blocks;
+}
+
+
int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
long long size,
int level, char *layout_str, int chunksize, int raid_disks)
@@ -967,7 +992,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
int nrdisks;
int err;
int frozen;
- unsigned long a,b, blocks, stripes;
+ unsigned long blocks, stripes;
unsigned long cache;
unsigned long long array_size;
int changed = 0;
@@ -1587,22 +1612,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
break;
}

- /* So how much do we need to backup.
- * We need an amount of data which is both a whole number of
- * old stripes and a whole number of new stripes.
- * So LCM for (chunksize*datadisks).
- */
- a = (ochunk/512) * odata;
- b = (nchunk/512) * ndata;
- /* Find GCD */
- while (a != b) {
- if (a < b)
- b -= a;
- if (b < a)
- a -= b;
- }
- /* LCM == product / GCD */
- blocks = (ochunk/512) * (nchunk/512) * odata * ndata / a;
+ blocks = compute_backup_blocks(nchunk, ochunk, ndata, odata);

sysfs_free(sra);
sra = sysfs_read(fd, 0,
diff --git a/mdadm.h b/mdadm.h
index 698f1bf..06195c8 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -459,7 +459,8 @@ extern int reshape_prepare_fdlist(char *devname,
extern void reshape_free_fdlist(int **fdlist_in,
unsigned long long **offsets_in,
int size);
-
+extern unsigned long compute_backup_blocks(int nchunk, int ochunk,
+ unsigned int ndata, unsigned int odata);

extern int save_stripes(int *source, unsigned long long *offsets,
int raid_disks, int chunk_size, int level, int layout,
@@ -471,6 +472,7 @@ extern int restore_stripes(int *dest, unsigned long long *offsets,
int source, unsigned long long read_offset,
unsigned long long start, unsigned long long length);

+
#ifndef Sendmail
#define Sendmail "/usr/lib/sendmail -t"
#endif


[PATCH 35/53] Control reshape in mdadm

am 26.11.2010 09:08:23 von adam.kwolek

When managemon starts a reshape while sync_max is set to 0, mdadm is already waiting for it in manage_reshape().
When the array reaches the reshape state, the manage_reshape() handler checks whether all metadata updates are in place.
If not, mdadm has to wait until the updates hit the array.
It then starts the reshape using the common child_grow() code and waits until the reshape has finished.
When it has, mdadm sets the size to the value specified in the metadata and performs a backward takeover to raid0 if necessary.

If manage_reshape() finds the array in the idle state (instead of the reshape state), this is treated as an error condition and the process is terminated.

Signed-off-by: Adam Kwolek
---

Grow.c | 16 +-
Makefile | 4
mdadm.h | 6 +
super-intel.c | 526 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 540 insertions(+), 12 deletions(-)
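
The decision taken on the array state can be sketched as a small classifier over the text of the sync_action attribute. This is a minimal model under stated assumptions: `enum reshape_gate` and `classify_sync_action()` are hypothetical names, and only the three outcomes named in the description (continue, keep waiting, fail) are modelled.

```c
#include <assert.h>
#include <string.h>

enum reshape_gate {
	GATE_READY,	/* array is reshaping: continue */
	GATE_WAIT,	/* array is still frozen: poll again */
	GATE_ERROR	/* idle or any other state: give up */
};

/* Classify the text read from a sync_action attribute.  A single
 * trailing newline, as sysfs attributes usually carry, is ignored. */
static enum reshape_gate classify_sync_action(const char *text)
{
	size_t len;

	assert(text != NULL);
	len = strlen(text);
	if (len > 0 && text[len - 1] == '\n')
		len--;
	if (len == 7 && strncmp(text, "reshape", 7) == 0)
		return GATE_READY;
	if (len == 6 && strncmp(text, "frozen", 6) == 0)
		return GATE_WAIT;
	return GATE_ERROR;
}
```

The sketch removes the last character only when it is a newline, and it matches whole words. The patch drops the final character unconditionally and compares prefixes, which behaves the same for the values md normally reports.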

diff --git a/Grow.c b/Grow.c
index 8cc17d5..37bcfd6 100644
--- a/Grow.c
+++ b/Grow.c
@@ -418,10 +418,6 @@ static __u32 bsb_csum(char *buf, int len)
return __cpu_to_le32(csum);
}

-static int child_grow(int afd, struct mdinfo *sra, unsigned long blocks,
- int *fds, unsigned long long *offsets,
- int disks, int chunk, int level, int layout, int data,
- int dests, int *destfd, unsigned long long *destoffsets);
static int child_shrink(int afd, struct mdinfo *sra, unsigned long blocks,
int *fds, unsigned long long *offsets,
int disks, int chunk, int level, int layout, int data,
@@ -451,7 +447,7 @@ static int freeze_container(struct supertype *st)
return 1;
}

-static void unfreeze_container(struct supertype *st)
+void unfreeze_container(struct supertype *st)
{
int container_dev = st->subarray[0] ? st->container_dev : st->devnum;
char *container = devnum2devname(container_dev);
@@ -505,7 +501,7 @@ static void unfreeze(struct supertype *st, int frozen)
sysfs_free(sra);
}

-static void wait_reshape(struct mdinfo *sra)
+void wait_reshape(struct mdinfo *sra)
{
int fd = sysfs_get_fd(sra, NULL, "sync_action");
char action[20];
@@ -2202,10 +2198,10 @@ static void validate(int afd, int bfd, unsigned long long offset)
}
}

-static int child_grow(int afd, struct mdinfo *sra, unsigned long stripes,
- int *fds, unsigned long long *offsets,
- int disks, int chunk, int level, int layout, int data,
- int dests, int *destfd, unsigned long long *destoffsets)
+int child_grow(int afd, struct mdinfo *sra, unsigned long stripes,
+ int *fds, unsigned long long *offsets,
+ int disks, int chunk, int level, int layout, int data,
+ int dests, int *destfd, unsigned long long *destoffsets)
{
char *buf;
int degraded = 0;
diff --git a/Makefile b/Makefile
index e3fb949..6527152 100644
--- a/Makefile
+++ b/Makefile
@@ -112,12 +112,12 @@ SRCS = mdadm.c config.c mdstat.c ReadMe.c util.c Manage.c Assemble.c Build.c \
MON_OBJS = mdmon.o monitor.o managemon.o util.o mdstat.o sysfs.o config.o \
Kill.o sg_io.o dlink.o ReadMe.o super0.o super1.o super-intel.o \
super-ddf.o sha1.o crc32.o msg.o bitmap.o \
- platform-intel.o probe_roms.o mapfile.o
+ platform-intel.o probe_roms.o mapfile.o Grow.o restripe.o

MON_SRCS = mdmon.c monitor.c managemon.c util.c mdstat.c sysfs.c config.c \
Kill.c sg_io.c dlink.c ReadMe.c super0.c super1.c super-intel.c \
super-ddf.c sha1.c crc32.c msg.c bitmap.c \
- platform-intel.c probe_roms.c mapfile.c
+ platform-intel.c probe_roms.c mapfile.c Grow.c restripe.c

STATICSRC = pwgr.c
STATICOBJS = pwgr.o
diff --git a/mdadm.h b/mdadm.h
index 06195c8..2c08ee6 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -446,6 +446,7 @@ extern int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume);
extern int sysfs_disk_to_scsi_id(int fd, __u32 *id);
extern int sysfs_unique_holder(int devnum, long rdev);
extern int sysfs_freeze_array(struct mdinfo *sra);
+extern void wait_reshape(struct mdinfo *sra);
extern int load_sys(char *path, char *buf);
extern struct mdinfo *sysfs_get_unused_spares(int container_fd, int fd);
extern int reshape_prepare_fdlist(char *devname,
@@ -461,6 +462,11 @@ extern void reshape_free_fdlist(int **fdlist_in,
int size);
extern unsigned long compute_backup_blocks(int nchunk, int ochunk,
unsigned int ndata, unsigned int odata);
+extern int child_grow(int afd, struct mdinfo *sra, unsigned long stripes,
+ int *fds, unsigned long long *offsets,
+ int disks, int chunk, int level, int layout, int data,
+ int dests, int *destfd, unsigned long long *destoffsets);
+extern void unfreeze_container(struct supertype *st);

extern int save_stripes(int *source, unsigned long long *offsets,
int raid_disks, int chunk_size, int level, int layout,
diff --git a/super-intel.c b/super-intel.c
index e4ba875..e57a127 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -26,6 +26,7 @@
#include
#include
#include
+#include

/* MPB == Metadata Parameter Block */
#define MPB_SIGNATURE "Intel Raid ISM Cfg Sig. "
@@ -6780,6 +6781,8 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
}
} else
dprintf("imsm: Operation is not allowed on container\n");
+ if (ret_val)
+ unfreeze_container(st);
*st->subarray = 0;
goto imsm_reshape_super_exit;
} else
@@ -6901,6 +6904,13 @@ int imsm_reshape_array_set_slots(struct active_array *a)

return imsm_reshape_array_manage_new_slots(super, inst, a->devnum, 1);
}
+
+int imsm_reshape_array_count_slots_mismatches(struct intel_super *super, int inst, int devnum)
+{
+
+ return imsm_reshape_array_manage_new_slots(super, inst, devnum, 0);
+}
+
/* imsm_reshape_array_manage_new_slots()
* returns: number of corrected slots for correct == 1
* counted number of different slots for correct == 0
@@ -7174,6 +7184,521 @@ imsm_reshape_array_exit:
return disk_list;
}

+int imsm_grow_manage_size(struct supertype *st, struct mdinfo *sra)
+{
+ int ret_val = 0;
+ struct mdinfo *info = NULL;
+ unsigned long long size;
+ int container_fd;
+ unsigned long long current_size = 0;
+
+ /* finalize current volume reshape
+ * for external meta size has to be managed by mdadm
+ * read size set in meta and put it to md when
+ * reshape is finished.
+ */
+
+ if (sra == NULL) {
+ dprintf("Error: imsm_grow_manage_size(): sra == NULL\n");
+ goto exit_grow_manage_size_ext_meta;
+ }
+ wait_reshape(sra);
+
+ /* reshape has finished, update md size
+ * get per-device size and multiply by data disks
+ */
+ container_fd = open_dev(st->devnum);
+ if (container_fd < 0) {
+ dprintf("Error: imsm_grow_manage_size(): container_fd == 0\n");
+ goto exit_grow_manage_size_ext_meta;
+ }
+ if (st->loaded_container)
+ st->ss->load_super(st, container_fd, NULL);
+ info = sysfs_read(container_fd, 0, GET_LEVEL|GET_VERSION|GET_DEVS|GET_STATE);
+ close(container_fd);
+ if (info == NULL) {
+ dprintf("imsm: Cannot get device info.\n");
+ goto exit_grow_manage_size_ext_meta;
+ }
+ st->ss->getinfo_super(st, info);
+ size = info->custom_array_size/2;
+ sysfs_get_ll(sra, NULL, "array_size", &current_size);
+ dprintf("imsm_grow_manage_size(): current size is %llu, set size to %llu\n", current_size, size);
+ sysfs_set_num(sra, NULL, "array_size", size);
+
+ ret_val = 1;
+
+exit_grow_manage_size_ext_meta:
+ sysfs_free(info);
+ return ret_val;
+}
+
+int imsm_child_grow(struct supertype *st, char *devname, int validate_fd, struct mdinfo *sra)
+{
+ int ret_val = 0;
+ int nrdisks;
+ int *fdlist;
+ unsigned long long *offsets;
+ unsigned int ndata, odata;
+ int ndisks, odisks;
+ unsigned long blocks, stripes;
+ int d;
+ struct mdinfo *sd;
+
+ nrdisks = ndisks = odisks = sra->array.raid_disks;
+ odisks -= sra->delta_disks;
+ odata = odisks-1;
+ ndata = ndisks-1;
+ fdlist = malloc((1+nrdisks) * sizeof(int));
+ offsets = malloc((1+nrdisks) * sizeof(offsets[0]));
+ if (!fdlist || !offsets) {
+ fprintf(stderr, Name ": malloc failed: grow aborted\n");
+ ret_val = 1;
+ if (fdlist)
+ free(fdlist);
+ if (offsets)
+ free(offsets);
+ return ret_val;
+ }
+ blocks = compute_backup_blocks(sra->array.chunk_size,
+ sra->array.chunk_size,
+ ndata, odata);
+
+ /* set MD_DISK_SYNC flag to open all devices that have to be backed up
+ */
+ for (sd = sra->devs; sd; sd = sd->next) {
+ if ((sd->disk.raid_disk > -1) &&
+ ((unsigned int)sd->disk.raid_disk < odata)) {
+ sd->disk.state |= (1<
+ sd->disk.state &= ~(1<
+ } else {
+ sd->disk.state |= (1<
+ sd->disk.state &= ~(1<
+ }
+ }
+#ifdef DEBUG
+ dprintf("FD list disk inspection:\n");
+ for (sd = sra->devs; sd; sd = sd->next) {
+ char *dn = map_dev(sd->disk.major,
+ sd->disk.minor, 1);
+ dprintf("Disk %s", dn);
+ dprintf("\tstate = %i\n", sd->disk.state);
+ }
+#endif
+ d = reshape_prepare_fdlist(devname, sra, odisks,
+ nrdisks, blocks, NULL,
+ &fdlist, &offsets);
+ if (d < 0) {
+ fprintf(stderr, Name ": cannot prepare device list\n");
+ ret_val = 1;
+ return ret_val;
+ }
+
+ mlockall(MCL_FUTURE);
+ if (ret_val == 0) {
+ sra->array.raid_disks = odisks;
+ sra->new_level = sra->array.level;
+ sra->new_layout = sra->array.layout;
+ sra->new_chunk = sra->array.chunk_size;
+
+ stripes = blocks / (sra->array.chunk_size/512) / odata;
+ /* child grow returns fixed value == 1
+ */
+ child_grow(validate_fd, sra, stripes,
+ fdlist, offsets,
+ odisks, sra->array.chunk_size,
+ sra->array.level, -1, odata,
+ d - odisks, NULL, offsets + odata);
+ imsm_grow_manage_size(st, sra);
+ }
+ reshape_free_fdlist(&fdlist, &offsets, d);
+
+ return ret_val;
+}
+
+void return_to_raid0(struct mdinfo *sra)
+{
+ if (sra->array.level == 4) {
+ dprintf("Execute backward takeover to raid0\n");
+ sysfs_set_str(sra, NULL, "level", "raid0");
+ }
+}
+
+int imsm_check_reshape_conditions(int fd, struct supertype *st, int current_array)
+{
+ char buf[PATH_MAX];
+ struct mdinfo *info = NULL;
+ int arrays_in_reshape_state = 0;
+ int wait_counter = 0;
+ int i;
+ int ret_val = 0;
+ struct intel_super *super = st->sb;
+ struct imsm_super *mpb = super->anchor;
+ int wrong_slots_counter;
+
+ /* wait until all arrays are in reshape state
+ * or an error occurs (idle state detected)
+ */
+ while ((arrays_in_reshape_state == 0) &&
+ (ret_val == 0)) {
+ arrays_in_reshape_state = 0;
+ int temp_array;
+
+ if (wait_counter)
+ sleep(1);
+
+ for (i = 0; i < mpb->num_raid_devs; i++) {
+ int sync_max;
+ int len;
+
+ /* check array state in md
+ */
+ sprintf(st->subarray, "%i", i);
+ st->ss->load_super(st, fd, NULL);
+ if (st->sb == NULL) {
+ dprintf("cannot get sb\n");
+ ret_val = 1;
+ break;
+ }
+ info = sysfs_read(fd, 0, GET_LEVEL|GET_VERSION|GET_DEVS|GET_STATE);
+ if (info == NULL) {
+ dprintf("imsm: Cannot get device info.\n");
+ break;
+ }
+ st->ss->getinfo_super(st, info);
+
+ find_array_minor(info->name, 1, st->devnum, &temp_array);
+ if (temp_array != current_array) {
+ if (temp_array < 0) {
+ ret_val = -1;
+ break;
+ }
+ sysfs_free(info);
+ info = NULL;
+ continue;
+ }
+
+ if (sysfs_get_str(info, NULL, "raid_disks", buf, sizeof(buf)) < 0) {
+ dprintf("cannot get raid_disks\n");
+ ret_val = 1;
+ break;
+ }
+ /* sync_max should be always set to 0
+ */
+ if (sysfs_get_str(info, NULL, "sync_max", buf, sizeof(buf)) < 0) {
+ dprintf("cannot get sync_max\n");
+ ret_val = 1;
+ break;
+ }
+ len = strlen(buf)-1;
+ if (len < 0)
+ len = 0;
+ *(buf+len) = 0;
+ sync_max = atoi(buf);
+ if (sync_max != 0) {
+ dprintf("sync_max has wrong value (%s)\n", buf);
+ sysfs_free(info);
+ info = NULL;
+ continue;
+ }
+ if (sysfs_get_str(info, NULL, "sync_action", buf, sizeof(buf)) < 0) {
+ dprintf("cannot get sync_action\n");
+ ret_val = 1;
+ break;
+ }
+ len = strlen(buf)-1;
+ if (len < 0)
+ len = 0;
+ *(buf+len) = 0;
+ if (strncmp(buf, "idle", 7) == 0) {
+ dprintf("imsm: Error found array in idle state during reshape initialization\n");
+ ret_val = 1;
+ break;
+ }
+ if (strncmp(buf, "reshape", 7) == 0) {
+ arrays_in_reshape_state++;
+ } else {
+ if (strncmp(buf, "frozen", 6) != 0) {
+ *(buf+strlen(buf)) = 0;
+ dprintf("imsm: Error unexpected array state (%s) during reshape initialization\n",
+ buf);
+ ret_val = 1;
+ break;
+ }
+ }
+ /* this device looks ok, so
+ * check if slots are set correctly
+ */
+ super = st->sb;
+ wrong_slots_counter = imsm_reshape_array_count_slots_mismatches(super, i, atoi(info->sys_name+2));
+ sysfs_free(info);
+ info = NULL;
+ if (wrong_slots_counter != 0) {
+ dprintf("Slots for correction %i.\n", wrong_slots_counter);
+ ret_val = 1;
+ goto exit_imsm_check_reshape_conditions;
+ }
+ }
+ sysfs_free(info);
+ info = NULL;
+ wait_counter++;
+ if (wait_counter > 60) {
+ dprintf("exit on timeout, container is not prepared to reshape\n");
+ ret_val = 1;
+ }
+ }
+
+exit_imsm_check_reshape_conditions:
+ sysfs_free(info);
+ info = NULL;
+
+ return ret_val;
+}
+
+int imsm_manage_container_reshape(struct supertype *st)
+{
+ int ret_val = 1;
+ char buf[PATH_MAX];
+ struct intel_super *super = st->sb;
+ struct imsm_super *mpb = super->anchor;
+ int fd;
+ struct mdinfo *info = NULL;
+ struct mdinfo info2;
+ int validate_fd;
+ int delta_disks;
+ struct geo_params geo;
+#ifdef DEBUG
+ int i;
+#endif
+
+ memset(&geo, 0, sizeof(struct geo_params));
+ /* verify reshape conditions
+ * for single volume reshape exit only and reuse Grow_reshape() code
+ */
+ if (st->subarray[0] != 0) {
+ dprintf("imsm: imsm_manage_container_reshape() current volume: %s\n", st->subarray);
+ dprintf("imsm: imsm_manage_container_reshape() detects volume reshape (devnum = %i), exit.\n", st->devnum);
+ return ret_val;
+ }
+
+ geo.dev_name = devnum2devname(st->devnum);
+ if (geo.dev_name == NULL) {
+ dprintf("imsm: Error: imsm_manage_reshape(): cannot get device name.\n");
+ return ret_val;
+ }
+
+ snprintf(buf, PATH_MAX, "/dev/%s", geo.dev_name);
+ fd = open(buf , O_RDONLY | O_DIRECT);
+ if (fd < 0) {
+ dprintf("imsm: cannot open device\n");
+ goto imsm_manage_container_reshape_exit;
+ }
+
+ /* send pings to roll managemon and monitor
+ */
+ ping_manager(geo.dev_name);
+ ping_monitor(geo.dev_name);
+
+#ifdef DEBUG
+ /* device list for reshape
+ */
+ dprintf("Arrays to run reshape (no: %i)\n", mpb->num_raid_devs);
+ for (i = 0; i < mpb->num_raid_devs; i++) {
+ struct imsm_dev *dev = get_imsm_dev(super, i);
+ dprintf("\tDevice: %s\n", dev->volume);
+ }
+#endif
+
+ info2.devs = NULL;
+ st->ss->getinfo_super(st, &info2);
+ geo.dev_id = -1;
+ find_array_minor(info2.name, 1, st->devnum, &geo.dev_id);
+ if (geo.dev_id < 0) {
+ dprintf("imsm. Error.Cannot get first array.\n");
+ goto imsm_manage_container_reshape_exit;
+ }
+ if (imsm_check_reshape_conditions(fd, st, geo.dev_id)) {
+ dprintf("imsm. Error. Wrong reshape conditions.\n");
+ goto imsm_manage_container_reshape_exit;
+ }
+ geo.raid_disks = info2.array.raid_disks;
+ dprintf("Container is ready for reshape ...\n");
+ switch (fork()) {
+ case 0:
+ fprintf(stderr, Name ": Child forked to run and monitor reshape\n");
+ while (geo.dev_id > -1) {
+ int fd2 = -1;
+ int i;
+ int temp_array = -1;
+ char *array;
+
+ for (i = 0; i < mpb->num_raid_devs; i++) {
+ sprintf(st->subarray, "%i", i);
+ st->ss->load_super(st, fd, NULL);
+ if (st->sb == NULL) {
+ dprintf("cannot get sb\n");
+ ret_val = 1;
+ goto imsm_manage_container_reshape_exit;
+ }
+ info2.devs = NULL;
+ st->ss->getinfo_super(st, &info2);
+ dprintf("Checking slots for device %s\n", info2.sys_name);
+ find_array_minor(info2.name, 1, st->devnum, &temp_array);
+ if (temp_array == geo.dev_id)
+ break;
+ }
+ snprintf(buf, PATH_MAX, "/dev/%s", info2.sys_name);
+ dprintf("Prepare to reshape for device %s (md%i)\n", info2.sys_name, geo.dev_id);
+ fd2 = open(buf, O_RDWR | O_DIRECT);
+ if (fd2 < 0) {
+ dprintf("Reshape is broken (cannot open array)\n");
+ ret_val = 1;
+ goto imsm_manage_container_reshape_exit;
+ }
+ info = sysfs_read(fd2, 0, GET_VERSION | GET_LEVEL | GET_DEVS | GET_STATE |\
+ GET_COMPONENT | GET_OFFSET | GET_CACHE |\
+ GET_CHUNK | GET_DISKS | GET_DEGRADED |
+ GET_SIZE | GET_LAYOUT);
+ if (info == NULL) {
+ dprintf("Reshape is broken (cannot read sysfs)\n");
+ close(fd2);
+ ret_val = 1;
+ goto imsm_manage_container_reshape_exit;
+ }
+ delta_disks = info->delta_disks;
+ super = st->sb;
+ if (check_env("MDADM_GROW_VERIFY"))
+ validate_fd = fd2;
+ else
+ validate_fd = -1;
+
+ if (sysfs_get_str(info, NULL, "sync_completed", buf, sizeof(buf)) >= 0) {
+ /* check if in previous pass we reshape any array
+ * if not we have to omit sync_complete condition
+ * and try to reshape arrays
+ */
+ if ((*buf == '0') ||
+ /* or this array was already reshaped */
+ (strncmp(buf, "none", 4) == 0)) {
+ dprintf("Skip this array, sync_completed is %s\n", buf);
+ geo.dev_id = -1;
+ sysfs_free(info);
+ info = NULL;
+ close(fd2);
+ continue;
+ }
+ } else {
+ dprintf("Reshape is broken (cannot read sync_complete)\n");
+ dprintf("Array level is: %i\n", info->array.level);
+ ret_val = 1;
+ close(fd2);
+ goto imsm_manage_container_reshape_exit;
+ }
+ snprintf(buf, PATH_MAX, "/dev/md/%s", info2.name);
+ info->delta_disks = info2.delta_disks;
+
+ delta_disks = info->array.raid_disks - geo.raid_disks;
+ geo.raid_disks = info->array.raid_disks;
+ if (info->array.level == 4) {
+ geo.raid_disks--;
+ delta_disks--;
+ }
+
+ ret_val = imsm_child_grow(st, buf,
+ validate_fd,
+ info);
+ return_to_raid0(info);
+ sysfs_free(info);
+ info = NULL;
+ close(fd2);
+ if (ret_val) {
+ dprintf("Reshape is broken (cannot reshape)\n");
+ ret_val = 1;
+ goto imsm_manage_container_reshape_exit;
+ }
+ geo.dev_id = -1;
+ sprintf(st->subarray, "%i", 0);
+ array = get_volume_for_olce(st, geo.raid_disks);
+ if (array) {
+ struct imsm_update_reshape *u;
+ dprintf("imsm: next volume to reshape is: %s\n", array);
+ info2.devs = NULL;
+ st->ss->getinfo_super(st, &info2);
+ find_array_minor(info2.name, 1, st->devnum, &geo.dev_id);
+ if (geo.dev_id > -1) {
+ /* send next array update
+ */
+ dprintf("imsm: Preparing metadata update for: %s (md%i)\n", array, geo.dev_id);
+ st->update_tail = &st->updates;
+ u = imsm_create_metadata_update_for_reshape(st, &geo);
+ if (u) {
+ u->reshape_delta_disks = delta_disks;
+ append_metadata_update(st, u, u->update_memory_size);
+ flush_metadata_updates(st);
+ /* send pings to roll managemon and monitor
+ */
+ ping_manager(geo.dev_name);
+ ping_monitor(geo.dev_name);
+
+ if (imsm_check_reshape_conditions(fd, st, geo.dev_id)) {
+ dprintf("imsm. Error. Wrong reshape conditions.\n");
+ ret_val = 1;
+ geo.dev_id = -1;
+ }
+ } else
+ geo.dev_id = -1;
+ }
+ free(array);
+ }
+ }
+ unfreeze_container(st);
+ close(fd);
+ break;
+ case -1:
+ fprintf(stderr, Name ": Cannot run child to monitor reshape: %s\n",
+ strerror(errno));
+ ret_val = 1;
+ break;
+ default:
+ /* The child will take care of unfreezing the array */
+ break;
+ }
+
+imsm_manage_container_reshape_exit:
+ sysfs_free(info);
+ if (fd > -1)
+ close(fd);
+ if (geo.dev_name)
+ free(geo.dev_name);
+
+ return ret_val;
+}
+
+int imsm_manage_reshape(struct supertype *st, char *backup)
+{
+ int ret_val = 0;
+
+ dprintf("imsm: manage_reshape() called\n");
+
+ if (experimental() == 0)
+ return ret_val;
+
+ /* verify reshape conditions
+ * for single volume reshape exit only and reuse Grow_reshape() code
+ */
+ if (st->subarray[0] != 0) {
+ dprintf("imsm: manage_reshape() current volume: %s (devnum = %i)\n", st->subarray, st->devnum);
+ return ret_val;
+ }
+ ret_val = imsm_manage_container_reshape(st);
+ /* unfreeze on error and success
+ * for any result this is end of work
+ */
+ unfreeze_container(st);
+
+ return ret_val;
+}
+
struct superswitch super_imsm = {
#ifndef MDASSEMBLE
.examine_super = examine_super_imsm,
@@ -7207,6 +7732,7 @@ struct superswitch super_imsm = {
.default_geometry = default_geometry_imsm,
.reshape_super = imsm_reshape_super,
.reshape_array = imsm_reshape_array,
+ .manage_reshape = imsm_manage_reshape,

.external = 1,
.name = "imsm",

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH 36/53] Finalize reshape after adding disks to array

on 26.11.2010 09:08:31 by adam.kwolek

When a reshape finishes, the monitor has to finalize the reshape in the metadata.
To do this, set_array_state() should be called.
This finishes the migration and stores the metadata on the disks.

The reshape state (a->reshape_state) is set to its not-active value, reshape_not_active.
This finishes the reshape flow in mdmon.
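
The detection described above amounts to an edge check between two monitor polls. Below is a minimal sketch of that idea, not the mdmon code itself: the sketch_* names, the enum and the finalize_calls counter are hypothetical stand-ins, and the counter only marks the point where set_array_state() would be called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the sync actions the monitor observes. */
enum sketch_action { SK_IDLE, SK_RESHAPE, SK_RESYNC, SK_RECOVER };

/* Illustrative subset of per-array monitor state (not struct active_array). */
struct sketch_array {
	enum sketch_action prev_action;
	enum sketch_action curr_action;
	int reshape_active;	/* nonzero while a reshape is being tracked */
	int finalize_calls;	/* counts how often finalization was triggered */
};

/* True only on the poll where the action changes from reshape to anything else. */
static bool sketch_reshape_finished(const struct sketch_array *a)
{
	assert(a != NULL);
	return a->prev_action == SK_RESHAPE && a->curr_action != SK_RESHAPE;
}

/* One monitor pass: record the new action and finalize on the falling edge. */
static void sketch_monitor_pass(struct sketch_array *a, enum sketch_action now)
{
	a->prev_action = a->curr_action;
	a->curr_action = now;
	if (sketch_reshape_finished(a)) {
		a->reshape_active = 0;	/* allow future rebuilds */
		a->finalize_calls++;	/* where set_array_state() would run */
	}
}
```

Because the check looks at the transition rather than the current state alone, finalization runs once per reshape and not on every idle poll that follows.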

Signed-off-by: Adam Kwolek
---

monitor.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/monitor.c b/monitor.c
index 05bd96c..3e26a8a 100644
--- a/monitor.c
+++ b/monitor.c
@@ -305,6 +305,23 @@ static int read_and_act(struct active_array *a)
}
}

+ if (!deactivate) {
+ /* finalize reshape detection
+ */
+ if ((a->curr_action != reshape) &&
+ (a->prev_action == reshape)) {
+ /* set zero to allow for future rebuilds
+ */
+ a->reshape_state = reshape_not_active;
+
+ /* A reshape has finished.
+ * Some disks may be in sync now.
+ */
+ a->container->ss->set_array_state(a, a->curr_state <= clean);
+ check_degraded = 1;
+ }
+ }
+
/* Check for failures and if found:
* 1/ Record the failure in the metadata and unblock the device.
* FIXME update the kernel to stop notifying on failed drives when


[PATCH 37/53] mdadm: second_map enhancement for imsm_get_map()

on 26.11.2010 09:08:39 by adam.kwolek

Allow map-related operations on a given map: first or second.
Reshape-specific functionality requires access to a specific map.

Until now, the active map was chosen according to the current volume status.
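
The selector convention introduced here (0 = first map, 1 = second map, -1 = choose by migration state) can be sketched in isolation. The structures below are simplified, hypothetical stand-ins for imsm_dev and imsm_map; in particular, the sketch assumes both maps always exist, which the real metadata only guarantees during a migration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct imsm_map. */
struct sketch_map {
	int num_members;
};

/* Simplified stand-in for struct imsm_dev; both maps always present here. */
struct sketch_dev {
	int migr_state;			/* nonzero while a migration is in progress */
	struct sketch_map maps[2];	/* [0] first map, [1] second map */
};

/*
 * second_map:
 *    0 -> first map
 *    1 -> second map
 *   -1 -> pick according to the current migr_state
 */
static const struct sketch_map *sketch_pick_map(const struct sketch_dev *dev,
						 int second_map)
{
	assert(dev != NULL);
	assert(second_map >= -1 && second_map <= 1);
	if (second_map == -1)
		second_map = dev->migr_state ? 1 : 0;
	return &dev->maps[second_map];
}
```

Passing -1 keeps the previous behaviour for existing callers, while reshape code can name the map it needs explicitly.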

Signed-off-by: Maciej Trela
Signed-off-by: Adam Kwolek
---

super-intel.c | 69 ++++++++++++++++++++++++++++++++-------------------------
1 files changed, 39 insertions(+), 30 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index e57a127..eea5fec 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -519,23 +519,33 @@ static struct imsm_dev *get_imsm_dev(struct intel_super *super, __u8 index)
return NULL;
}

-static __u32 get_imsm_ord_tbl_ent(struct imsm_dev *dev, int slot)
+/*
+ * for second_map:
+ * == 0 get first map
+ * == 1 get second map
+ * == -1 then get map according to the current migr_state
+ */
+static __u32 get_imsm_ord_tbl_ent(struct imsm_dev *dev, int slot, int second_map)
{
struct imsm_map *map;

- if (dev->vol.migr_state)
- map = get_imsm_map(dev, 1);
- else
- map = get_imsm_map(dev, 0);
+ if (second_map == -1) {
+ if (dev->vol.migr_state)
+ map = get_imsm_map(dev, 1);
+ else
+ map = get_imsm_map(dev, 0);
+ } else {
+ map = get_imsm_map(dev, second_map);
+ }

/* top byte identifies disk under rebuild */
return __le32_to_cpu(map->disk_ord_tbl[slot]);
}

#define ord_to_idx(ord) (((ord) << 8) >> 8)
-static __u32 get_imsm_disk_idx(struct imsm_dev *dev, int slot)
+static __u32 get_imsm_disk_idx(struct imsm_dev *dev, int slot, int second_map)
{
- __u32 ord = get_imsm_ord_tbl_ent(dev, slot);
+ __u32 ord = get_imsm_ord_tbl_ent(dev, slot, second_map);

return ord_to_idx(ord);
}
@@ -712,13 +722,13 @@ static void print_imsm_dev(struct imsm_dev *dev, char *uuid, int disk_idx)
printf(" Members : %d\n", map->num_members);
printf(" Slots : [");
for (i = 0; i < map->num_members; i++) {
- ord = get_imsm_ord_tbl_ent(dev, i);
+ ord = get_imsm_ord_tbl_ent(dev, i, -1);
printf("%s", ord & IMSM_ORD_REBUILD ? "_" : "U");
}
printf("]\n");
slot = get_imsm_disk_slot(map, disk_idx);
if (slot >= 0) {
- ord = get_imsm_ord_tbl_ent(dev, slot);
+ ord = get_imsm_ord_tbl_ent(dev, slot, -1);
printf(" This Slot : %d%s\n", slot,
ord & IMSM_ORD_REBUILD ? " (out-of-sync)" : "");
} else
@@ -1356,12 +1366,12 @@ static __u32 num_stripes_per_unit_rebuild(struct imsm_dev *dev)
return num_stripes_per_unit_resync(dev);
}

-static __u8 imsm_num_data_members(struct imsm_dev *dev)
+static __u8 imsm_num_data_members(struct imsm_dev *dev, int second_map)
{
/* named 'imsm_' because raid0, raid1 and raid10
* counter-intuitively have the same number of data disks
*/
- struct imsm_map *map = get_imsm_map(dev, 0);
+ struct imsm_map *map = get_imsm_map(dev, second_map);

switch (get_imsm_raid_level(map)) {
case 0:
@@ -1444,7 +1454,7 @@ static __u64 blocks_per_migr_unit(struct imsm_dev *dev)
*/
stripes_per_unit = num_stripes_per_unit_resync(dev);
migr_chunk = migr_strip_blocks_resync(dev);
- disks = imsm_num_data_members(dev);
+ disks = imsm_num_data_members(dev, 0);
blocks_per_unit = stripes_per_unit * migr_chunk * disks;
stripe = __le32_to_cpu(map->blocks_per_strip) * disks;
segment = blocks_per_unit / stripe;
@@ -1675,7 +1685,7 @@ static void getinfo_super_imsm(struct supertype *st, struct mdinfo *info)
* (catches single-degraded vs double-degraded)
*/
for (j = 0; j < map->num_members; j++) {
- __u32 ord = get_imsm_ord_tbl_ent(dev, i);
+ __u32 ord = get_imsm_ord_tbl_ent(dev, i, -1);
__u32 idx = ord_to_idx(ord);

if (!(ord & IMSM_ORD_REBUILD) &&
@@ -3383,7 +3393,7 @@ static int add_to_super_imsm_volume(struct supertype *st, mdu_disk_info_t *dk,
/* Check the device has not already been added */
slot = get_imsm_disk_slot(map, dl->index);
if (slot >= 0 &&
- (get_imsm_ord_tbl_ent(dev, slot) & IMSM_ORD_REBUILD) == 0) {
+ (get_imsm_ord_tbl_ent(dev, slot, -1) & IMSM_ORD_REBUILD) == 0) {
fprintf(stderr, Name ": %s has been included in this array twice\n",
devname);
return 1;
@@ -3629,7 +3639,7 @@ static int create_array(struct supertype *st, int dev_idx)
imsm_copy_dev(&u->dev, dev);
inf = get_disk_info(u);
for (i = 0; i < map->num_members; i++) {
- int idx = get_imsm_disk_idx(dev, i);
+ int idx = get_imsm_disk_idx(dev, i, -1);

disk = get_imsm_disk(super, idx);
serialcpy(inf[i].serial, disk->serial);
@@ -4503,8 +4513,8 @@ static struct mdinfo *container_content_imsm(struct supertype *st)
__u32 ord;

skip = 0;
- idx = get_imsm_disk_idx(dev, slot);
- ord = get_imsm_ord_tbl_ent(dev, slot);
+ idx = get_imsm_disk_idx(dev, slot, 0);
+ ord = get_imsm_ord_tbl_ent(dev, slot, 0);
for (d = super->disks; d ; d = d->next)
if (d->index == idx)
break;
@@ -4599,7 +4609,7 @@ static __u8 imsm_check_degraded(struct intel_super *super, struct imsm_dev *dev,
int insync = insync;

for (i = 0; i < map->num_members; i++) {
- __u32 ord = get_imsm_ord_tbl_ent(dev, i);
+ __u32 ord = get_imsm_ord_tbl_ent(dev, i, -1);
int idx = ord_to_idx(ord);
struct imsm_disk *disk;

@@ -4883,7 +4893,7 @@ static void imsm_set_disk(struct active_array *a, int n, int state)

dprintf("imsm: set_disk %d:%x\n", n, state);

- ord = get_imsm_ord_tbl_ent(dev, n);
+ ord = get_imsm_ord_tbl_ent(dev, n, -1);
disk = get_imsm_disk(super, ord_to_idx(ord));

/* check for new failures */
@@ -4990,7 +5000,7 @@ static void imsm_sync_metadata(struct supertype *container)
static struct dl *imsm_readd(struct intel_super *super, int idx, struct active_array *a)
{
struct imsm_dev *dev = get_imsm_dev(super, a->info.container_member);
- int i = get_imsm_disk_idx(dev, idx);
+ int i = get_imsm_disk_idx(dev, idx, -1);
struct dl *dl;

for (dl = super->disks; dl; dl = dl->next)
@@ -5011,7 +5021,7 @@ static struct dl *imsm_add_spare(struct intel_super *super, int slot,
struct mdinfo *additional_test_list)
{
struct imsm_dev *dev = get_imsm_dev(super, a->info.container_member);
- int idx = get_imsm_disk_idx(dev, slot);
+ int idx = get_imsm_disk_idx(dev, slot, -1);
struct imsm_super *mpb = super->anchor;
struct imsm_map *map;
unsigned long long pos;
@@ -5276,7 +5286,7 @@ static int disks_overlap(struct intel_super *super, int idx, struct imsm_update_
int j;

for (i = 0; i < map->num_members; i++) {
- disk = get_imsm_disk(super, get_imsm_disk_idx(dev, i));
+ disk = get_imsm_disk(super, get_imsm_disk_idx(dev, i, -1));
for (j = 0; j < new_map->num_members; j++)
if (serialcmp(disk->serial, inf[j].serial) == 0)
return 1;
@@ -5527,7 +5537,7 @@ update_reshape_exit:
end_migration(dev, map_1->map_state);
/* array size rollback
*/
- used_disks = imsm_num_data_members(dev);
+ used_disks = imsm_num_data_members(dev, 0);
if (used_disks) {
array_blocks = map_1->blocks_per_member * used_disks;
/* round array size down to closest MB
@@ -5602,7 +5612,7 @@ update_reshape_exit:
struct dl *dl;
unsigned int found;
int failed;
- int victim = get_imsm_disk_idx(dev, u->slot);
+ int victim = get_imsm_disk_idx(dev, u->slot, -1);
int i;

for (dl = super->disks; dl; dl = dl->next)
@@ -5625,7 +5635,7 @@ update_reshape_exit:
for (i = 0; i < map->num_members; i++) {
if (i == u->slot)
continue;
- disk = get_imsm_disk(super, get_imsm_disk_idx(dev, i));
+ disk = get_imsm_disk(super, get_imsm_disk_idx(dev, i, -1));
if (!disk || is_failed(disk))
failed++;
}
@@ -6091,7 +6101,7 @@ static void imsm_delete(struct intel_super *super, struct dl **dlp, unsigned ind
/* update ord entries being careful not to propagate
* ord-flags to the first map
*/
- ord = get_imsm_ord_tbl_ent(dev, j);
+ ord = get_imsm_ord_tbl_ent(dev, j, -1);

if (ord_to_idx(ord) <= index)
continue;
@@ -6182,7 +6192,7 @@ static int update_level_imsm(struct supertype *st, struct mdinfo *info,
(dl->minor != newdi->disk.minor))
continue;
slot = get_imsm_disk_slot(map, dl->index);
- idx = get_imsm_ord_tbl_ent(dev_new, slot);
+ idx = get_imsm_ord_tbl_ent(dev_new, slot, 0);
tmp_ord_tbl[newdi->disk.raid_disk] = idx;
break;
}
@@ -6295,7 +6305,7 @@ int imsm_reshape_is_allowed_on_container(struct supertype *st,
ret_val = 0;
break;
}
- used_disks = imsm_num_data_members(dev);
+ used_disks = imsm_num_data_members(dev, 0);
dprintf("read raid_disks = %i\n", used_disks);
dprintf("read requested disks = %i\n", geo->raid_disks);
array_blocks = map->blocks_per_member * used_disks;
@@ -6635,8 +6645,7 @@ calculate_size_only:
/* calculate new size
*/
if (new_map != NULL) {
-
- used_disks = imsm_num_data_members(upd_devs);
+ used_disks = imsm_num_data_members(upd_devs, 0);
if (used_disks) {
array_blocks = new_map->blocks_per_member * used_disks;
/* round array size down to closest MB


[PATCH 38/53] mdadm: read chunksize and layout from mdstat

on 26.11.2010 09:08:47 by adam.kwolek

Support reading layout and chunk size from mdstat.
It is needed for external reshape with layout or chunk size changes.
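
As a rough illustration of the parsing idea: the chunk size is the token just before a word starting with "chunk", and the layout is the token just after "algorithm". The sketch below works on a plain string with strtok() rather than on mdadm's dl_next() word list, and the sketch_* names and the sample status line used to exercise it are assumptions, not taken from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result holder; mdadm stores these in struct mdstat_ent. */
struct sketch_geom {
	int layout;	/* -1 if no "algorithm N" was seen */
	int chunk_size;	/* bytes, 0 if no "Nk chunk" was seen */
};

/* Parse one whitespace-separated mdstat status line (truncated to 255 chars). */
static struct sketch_geom sketch_parse_mdstat_line(const char *line)
{
	struct sketch_geom g = { -1, 0 };
	char copy[256];
	char *w, *prev = NULL;

	assert(line != NULL);
	strncpy(copy, line, sizeof(copy) - 1);
	copy[sizeof(copy) - 1] = '\0';

	for (w = strtok(copy, " \t\n"); w != NULL;
	     prev = w, w = strtok(NULL, " \t\n")) {
		if (strcmp(w, "algorithm") == 0) {
			/* layout number is the word after "algorithm" */
			char *next = strtok(NULL, " \t\n");
			if (next == NULL)
				break;
			g.layout = atoi(next);
			w = next;
		} else if (strncmp(w, "chunk", 5) == 0 && prev != NULL) {
			/* size in KiB is the word before "chunk", e.g. "64k" */
			g.chunk_size = atoi(prev) * 1024;
		}
	}
	return g;
}
```

For a line such as "1048576 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]" this yields layout 2 and a chunk size of 65536 bytes; atoi() stops at the trailing "k".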

Signed-off-by: Maciej Trela
Signed-off-by: Adam Kwolek
---

mdadm.h | 1 +
mdstat.c | 11 +++++++++--
2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/mdadm.h b/mdadm.h
index 2c08ee6..95f355e 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -358,6 +358,7 @@ struct mdstat_ent {
int resync; /* 1 if resync, 0 if recovery */
int devcnt;
int raid_disks;
+ int layout;
int chunk_size;
char * metadata_version;
struct dev_member {
diff --git a/mdstat.c b/mdstat.c
index 47be2bb..7e73310 100644
--- a/mdstat.c
+++ b/mdstat.c
@@ -146,7 +146,7 @@ struct mdstat_ent *mdstat_read(int hold, int start)
end = &all;
for (; (line = conf_line(f)) ; free_line(line)) {
struct mdstat_ent *ent;
- char *w;
+ char *w, *prev = NULL;
int devnum;
int in_devs = 0;
char *ep;
@@ -192,7 +192,7 @@ struct mdstat_ent *mdstat_read(int hold, int start)
ent->dev = strdup(line);
ent->devnum = devnum;

- for (w=dl_next(line); w!= line ; w=dl_next(w)) {
+ for (w = dl_next(line); w != line ; prev = w, w = dl_next(w)) {
int l = strlen(w);
char *eq;
if (strcmp(w, "active")==0)
@@ -251,6 +251,13 @@ struct mdstat_ent *mdstat_read(int hold, int start)
w[0] <= '9' &&
w[l-1] == '%') {
ent->percent = atoi(w);
+ } else if (strcmp(w, "algorithm") == 0 &&
+ dl_next(w) != line) {
+ w = dl_next(w);
+ ent->layout = atoi(w);
+ } else if (strncmp(w, "chunk", 5) == 0 &&
+ prev != NULL) {
+ ent->chunk_size = atoi(prev) * 1024;
}
}
if (insert_here && (*insert_here)) {


[PATCH 39/53] mdadm: Add IMSM migration record to intel_super

on 26.11.2010 09:08:54 by adam.kwolek

Add support for IMSM migration record structure.
IMSM migration record is stored on the first two disks of IMSM volume during the migration.

Add functions for reading and writing the migration record; they will be used by the subsequent checkpointing patches.
Clear migration record every time MIGR_GEN_MIGR is started.
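
In the diff below, the record is read from and written to the last 512 bytes of a member device through a 512-byte aligned buffer. A minimal sketch of that I/O pattern follows, assuming a regular file in place of a block device (so fstat() stands in for get_dev_size()) and pread()/pwrite() in place of lseek64() plus read()/write(); all sketch_* names are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_REC_SIZE 512

/* Allocate a zeroed, sector-aligned buffer; returns NULL on failure. */
static void *sketch_alloc_rec(void)
{
	void *buf = NULL;

	if (posix_memalign(&buf, SKETCH_REC_SIZE, SKETCH_REC_SIZE) != 0)
		return NULL;
	memset(buf, 0, SKETCH_REC_SIZE);
	return buf;
}

/* Offset of the last 512-byte block, or -1 if the file is too small. */
static off_t sketch_rec_offset(int fd)
{
	struct stat sb;

	if (fstat(fd, &sb) != 0 || sb.st_size < SKETCH_REC_SIZE)
		return -1;
	return sb.st_size - SKETCH_REC_SIZE;
}

/* Write the record to the last block; 0 on success, -1 on error. */
static int sketch_write_rec(int fd, const void *buf)
{
	off_t off = sketch_rec_offset(fd);

	if (off < 0)
		return -1;
	return pwrite(fd, buf, SKETCH_REC_SIZE, off) == SKETCH_REC_SIZE ? 0 : -1;
}

/* Read the record from the last block; 0 on success, -1 on error. */
static int sketch_read_rec(int fd, void *buf)
{
	off_t off = sketch_rec_offset(fd);

	if (off < 0)
		return -1;
	return pread(fd, buf, SKETCH_REC_SIZE, off) == SKETCH_REC_SIZE ? 0 : -1;
}
```

The aligned allocation matters once the descriptor is opened with O_DIRECT; the sketch does not open files itself, so that flag is left to the caller.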

Signed-off-by: Maciej Trela
Signed-off-by: Adam Kwolek
---

super-intel.c | 166 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 161 insertions(+), 5 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index eea5fec..ef1ed45 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -195,6 +195,31 @@ struct bbm_log {
static char *map_state_str[] = { "normal", "uninitialized", "degraded", "failed" };
#endif

+#define UNIT_SRC_NORMAL 0 /* Source data for curr_migr_unit must
+ * be recovered using srcMap */
+#define UNIT_SRC_IN_CP_AREA 1 /* Source data for curr_migr_unit has
+ * already been migrated and must
+ * be recovered from checkpoint area */
+struct migr_record {
+ __u32 rec_status; /* Status used to determine how to restart
+ * migration in case it aborts in some fashion */
+ __u32 curr_migr_unit; /* 0..numMigrUnits-1 */
+ __u32 family_num; /* Family number of MPB containing the RaidDev
+ * that is migrating */
+ __u32 ascending_migr; /* True if migrating in increasing order of lbas */
+ __u32 blocks_per_unit; /* Num disk blocks per unit of operation */
+ __u32 dest_depth_per_unit; /* Num member blocks each destMap member disk
+ * advances per unit-of-operation */
+ __u32 ckpt_area_pba; /* Pba of first block of ckpt copy area */
+ __u32 dest_1st_member_lba; /* First member lba on first stripe of destination */
+ __u32 num_migr_units; /* Total num migration units-of-op */
+ __u32 post_migr_vol_cap; /* Size of volume after migration completes */
+ __u32 post_migr_vol_cap_hi; /* Expansion space for LBA64 */
+ __u32 ckpt_read_disk_num; /* Which member disk in destSubMap[0] the
+ * migration ckpt record was read from
+ * (for recovered migrations) */
+} __attribute__ ((__packed__));
+
static __u8 migr_type(struct imsm_dev *dev)
{
if (dev->vol.migr_type == MIGR_VERIFY &&
@@ -240,6 +265,10 @@ struct intel_super {
void *buf; /* O_DIRECT buffer for reading/writing metadata */
struct imsm_super *anchor; /* immovable parameters */
};
+ union {
+ void *migr_rec_buf; /* buffer for I/O operations */
+ struct migr_record *migr_rec; /* migration record */
+ };
size_t len; /* size of the 'buf' allocation */
void *next_buf; /* for realloc'ing buf from the manager */
size_t next_len;
@@ -1493,6 +1522,104 @@ static int imsm_level_to_layout(int level)
return UnSet;
}

+/*
+ * load_imsm_migr_rec - read imsm migration record
+ */
+__attribute__((unused))
+static int load_imsm_migr_rec(struct intel_super *super, struct mdinfo *info)
+{
+ unsigned long long dsize;
+ struct mdinfo *sd;
+ struct dl *dl;
+ char nm[30];
+ int retval = -1;
+ int fd = -1;
+
+ for (sd = info->devs ; sd ; sd = sd->next) {
+ /* read only from one of the first two slots */
+ if (sd->disk.raid_disk > 1)
+ continue;
+ sprintf(nm, "%d:%d", sd->disk.major, sd->disk.minor);
+ fd = dev_open(nm, O_RDONLY);
+ if (fd >= 0)
+ break;
+ }
+ if (fd < 0) {
+ for (dl = super->disks; dl; dl = dl->next) {
+ /* read only from one of the first two slots */
+ if (dl->index > 1)
+ continue;
+ sprintf(nm, "%d:%d", dl->major, dl->minor);
+ fd = dev_open(nm, O_RDONLY);
+ if (fd >= 0)
+ break;
+ }
+ }
+ if (fd < 0)
+ goto out;
+ get_dev_size(fd, NULL, &dsize);
+ if (lseek64(fd, dsize - 512, SEEK_SET) < 0) {
+ fprintf(stderr,
+ Name ": Cannot seek to anchor block: %s\n",
+ strerror(errno));
+ goto out;
+ }
+ if (read(fd, super->migr_rec_buf, 512) != 512) {
+ fprintf(stderr,
+ Name ": Cannot read migr record block: %s\n",
+ strerror(errno));
+ goto out;
+ }
+ retval = 0;
+ out:
+ if (fd >= 0)
+ close(fd);
+ return retval;
+}
+
+/*
+ * write_imsm_migr_rec - write imsm migration record
+ */
+__attribute__((unused))
+static int write_imsm_migr_rec(struct intel_super *super, struct mdinfo *info)
+{
+ unsigned long long dsize;
+ struct mdinfo *sd;
+ char nm[30];
+ int fd = -1;
+ int retval = -1;
+
+ for (sd = info->devs ; sd ; sd = sd->next) {
+ /* write only to the first two slots */
+ if (sd->disk.raid_disk > 1)
+ continue;
+ sprintf(nm, "%d:%d", sd->disk.major, sd->disk.minor);
+ fd = dev_open(nm, O_RDWR);
+ if (fd < 0)
+ continue;
+ get_dev_size(fd, NULL, &dsize);
+ if (lseek64(fd, dsize - 512, SEEK_SET) < 0) {
+ fprintf(stderr,
+ Name ": Cannot seek to anchor block: %s\n",
+ strerror(errno));
+ goto out;
+ }
+ if (write(fd, super->migr_rec_buf, 512) != 512) {
+ fprintf(stderr,
+ Name ": Cannot write migr record block: %s\n",
+ strerror(errno));
+ goto out;
+ }
+ close(fd);
+ fd = -1;
+ }
+ retval = 0;
+ out:
+ if (fd >= 0)
+ close(fd);
+ return retval;
+}
+
static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info)
{
struct intel_super *super = st->sb;
@@ -2195,8 +2322,11 @@ load_imsm_disk(int fd, struct intel_super *super, char *devname, int keep_fd)
* map1state=normal)
* 4/ Rebuild (migr_state=1 migr_type=MIGR_REBUILD map0state=normal
* map1state=degraded)
+ * 5/ Migration (migr_state=1 migr_type=MIGR_GEN_MIGR map0state=normal
+ * map1state=normal)
*/
-static void migrate(struct imsm_dev *dev, __u8 to_state, int migr_type)
+static void migrate(struct imsm_dev *dev, struct intel_super *super,
+ __u8 to_state, int migr_type)
{
struct imsm_map *dest;
struct imsm_map *src = get_imsm_map(dev, 0);
@@ -2219,6 +2349,10 @@ static void migrate(struct imsm_dev *dev, __u8 to_state, int migr_type)
}
}

+ if (migr_type == MIGR_GEN_MIGR)
+ /* Clear migration record */
+ memset(super->migr_rec, 0, sizeof(struct migr_record));
+
src->map_state = to_state;
}

@@ -2377,6 +2511,14 @@ static int load_imsm_mpb(int fd, struct intel_super *super, char *devname)

sectors = mpb_sectors(anchor) - 1;
free(anchor);
+
+ if (posix_memalign(&super->migr_rec_buf, 512, 512) != 0) {
+ fprintf(stderr, Name
+ ": %s could not allocate migr_rec buffer\n", __func__);
+ free(super->buf);
+ return 2;
+ }
+
if (!sectors) {
check_sum = __gen_imsm_checksum(super->anchor);
if (check_sum != __le32_to_cpu(super->anchor->check_sum)) {
@@ -2479,6 +2621,10 @@ static void __free_imsm(struct intel_super *super, int free_disks)
free(super->buf);
super->buf = NULL;
}
+ if (super->migr_rec_buf) {
+ free(super->migr_rec_buf);
+ super->migr_rec_buf = NULL;
+ }
if (free_disks)
free_imsm_disks(super);
free_devlist(super);
@@ -3326,6 +3472,13 @@ static int init_super_imsm(struct supertype *st, mdu_array_info_t *info,
": %s could not allocate superblock\n", __func__);
return 0;
}
+ if (posix_memalign(&super->migr_rec_buf, 512, 512) != 0) {
+ fprintf(stderr, Name
+ ": %s could not allocate migr_rec buffer\n", __func__);
+ free(super->buf);
+ free(super);
+ return 0;
+ }
memset(super->buf, 0, mpb_size);
mpb = super->buf;
mpb->mpb_size = __cpu_to_le32(mpb_size);
@@ -4825,9 +4978,9 @@ static int imsm_set_array_state(struct active_array *a, int consistent)
/* mark the start of the init process if nothing is failed */
dprintf("imsm: mark resync start\n");
if (map->map_state == IMSM_T_STATE_UNINITIALIZED)
- migrate(dev, IMSM_T_STATE_NORMAL, MIGR_INIT);
+ migrate(dev, super, IMSM_T_STATE_NORMAL, MIGR_INIT);
else
- migrate(dev, IMSM_T_STATE_NORMAL, MIGR_REPAIR);
+ migrate(dev, super, IMSM_T_STATE_NORMAL, MIGR_REPAIR);
super->updates_pending++;
}

@@ -5423,6 +5576,9 @@ static void imsm_process_update(struct supertype *st,
a->reshape_delta_disks = u->reshape_delta_disks;
a->reshape_state = reshape_is_starting;

+ /* Clear migration record */
+ memset(super->migr_rec, 0, sizeof(struct migr_record));
+
super->updates_pending++;
update_reshape_exit:
if (u->devs_mem.dev)
@@ -5652,7 +5808,7 @@ update_reshape_exit:
/* mark rebuild */
to_state = imsm_check_degraded(super, dev, failed);
map->map_state = IMSM_T_STATE_DEGRADED;
- migrate(dev, to_state, MIGR_REBUILD);
+ migrate(dev, super, to_state, MIGR_REBUILD);
migr_map = get_imsm_map(dev, 1);
set_imsm_ord_tbl_ent(map, u->slot, dl->index);
set_imsm_ord_tbl_ent(migr_map, u->slot, dl->index | IMSM_ORD_REBUILD);
@@ -6552,7 +6708,7 @@ struct imsm_update_reshape *imsm_create_metadata_update_for_reshape(struct super
*/

to_state = imsm_check_degraded(super, old_dev, 0);
- migrate(upd_devs, to_state, MIGR_GEN_MIGR);
+ migrate(upd_devs, super, to_state, MIGR_GEN_MIGR);
/* second map length is equal to first map
* correct second map length to old value
*/


[PATCH 40/53] mdadm: add backup methods to superswitch

on 26.11.2010 09:09:02 by adam.kwolek

Add new methods to the superswitch for external metadata supporting its own critical reshape data backup mechanism.

The new methods are:
save_backup    - save critical data to backup area
discard_backup - critical data was successfully migrated, so
                 the current backup may be discarded
recover_backup - recover critical data after reshape crashed
                 during array assembly
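
To illustrate how a metadata handler might fill in these three hooks, here is a small in-memory sketch. It is only an assumption-laden model: the signatures are simplified (the real methods in the diff below take struct supertype and struct mdinfo arguments), the backup area is a plain array, and the return conventions are invented for the example.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_BACKUP_MAX 4096

/* Hypothetical backup area kept in memory instead of on disk. */
struct sketch_backup {
	unsigned char area[SKETCH_BACKUP_MAX];
	unsigned long write_offset;	/* where the saved data belongs in the array */
	int length;			/* 0 means no valid backup is held */
};

/* Simplified method table mirroring the three new hooks. */
struct sketch_backup_ops {
	int (*save_backup)(struct sketch_backup *b, const void *buf,
			   unsigned long write_offset, int length);
	void (*discard_backup)(struct sketch_backup *b);
	int (*recover_backup)(struct sketch_backup *b, void *ptr, int length);
};

/* Copy critical data into the backup area; 0 on success, -1 on bad length. */
static int sketch_save(struct sketch_backup *b, const void *buf,
		       unsigned long write_offset, int length)
{
	assert(b != NULL && buf != NULL);
	if (length <= 0 || length > SKETCH_BACKUP_MAX)
		return -1;
	memcpy(b->area, buf, (size_t)length);
	b->write_offset = write_offset;
	b->length = length;
	return 0;
}

/* The data reached its final place, so the backup is no longer needed. */
static void sketch_discard(struct sketch_backup *b)
{
	assert(b != NULL);
	b->length = 0;
}

/* Copy a held backup out; returns bytes recovered, or -1 if none fits. */
static int sketch_recover(struct sketch_backup *b, void *ptr, int length)
{
	assert(b != NULL && ptr != NULL);
	if (b->length == 0 || length < b->length)
		return -1;
	memcpy(ptr, b->area, (size_t)b->length);
	return b->length;
}

static const struct sketch_backup_ops sketch_ops = {
	sketch_save, sketch_discard, sketch_recover
};
```

The intended order is save before a critical section is overwritten, discard once it has been migrated, and recover only during assembly after an interrupted reshape.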

Signed-off-by: Maciej Trela
Signed-off-by: Adam Kwolek
---

mdadm.h | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/mdadm.h b/mdadm.h
index 95f355e..726e0a4 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -726,6 +726,15 @@ extern struct superswitch {
enum state_of_reshape request_type,
struct metadata_update **updates);

+ /* for external backup area
+ *
+ */
+ int (*save_backup)(struct supertype *st, struct mdinfo *info,
+ void *buf, unsigned long write_offset, int length);
+ void (*discard_backup)(struct supertype *st, struct mdinfo *info);
+ int (*recover_backup)(struct supertype *st, struct mdinfo *info,
+ void *ptr, int length);
+
int swapuuid; /* true if uuid is bigending rather than hostendian */
int external;
const char *name; /* canonical metadata name */


[PATCH 41/53] mdadm: support restore_stripes() from the given buffer

on 26.11.2010 09:09:09 by adam.kwolek

Currently, restore_stripes() can restore data only from the given backup file handles, and it is used only when assembling partially reshaped arrays.
Because this function will also be useful for the external metadata backup mechanism, add support for restoring data from a given source buffer.
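
The change boils down to choosing the data source per chunk: a memory buffer when one is supplied, the file descriptor otherwise. A minimal sketch of that choice is below. The helper name is hypothetical, pread() replaces the lseek64()/read() pair, and indexing the buffer by read_offset is an assumption made for the example, not something the patch confirms.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Fetch one chunk of backup data into dst.
 * If src_buf is non-NULL the chunk is copied from memory, otherwise it is
 * read from the 'source' file descriptor. Returns 0 on success, -1 on error.
 */
static int sketch_fetch_chunk(int source, off_t read_offset,
			      const char *src_buf, char *dst,
			      size_t chunk_size)
{
	assert(dst != NULL);
	if (src_buf != NULL) {
		/* buffer path: no file I/O at all */
		memcpy(dst, src_buf + read_offset, chunk_size);
		return 0;
	}
	/* file path: positional read of exactly one chunk */
	if (pread(source, dst, chunk_size, read_offset) != (ssize_t)chunk_size)
		return -1;
	return 0;
}
```

Existing callers that restore from a backup file pass NULL for the buffer, as the Grow.c hunks below do, and keep the file-based path.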

Signed-off-by: Maciej Trela
Signed-off-by: Adam Kwolek
---

Grow.c | 4 ++--
mdadm.h | 3 ++-
restripe.c | 45 +++++++++++++++++++++++++++------------------
3 files changed, 31 insertions(+), 21 deletions(-)

diff --git a/Grow.c b/Grow.c
index 37bcfd6..64fb1c2 100644
--- a/Grow.c
+++ b/Grow.c
@@ -2520,7 +2520,7 @@ int Grow_restart(struct supertype *st, struct mdinfo *info, int *fdlist, int cnt
info->new_layout,
fd, __le64_to_cpu(bsb.devstart)*512,
__le64_to_cpu(bsb.arraystart)*512,
- __le64_to_cpu(bsb.length)*512)) {
+ __le64_to_cpu(bsb.length)*512, NULL)) {
/* didn't succeed, so giveup */
if (verbose)
fprintf(stderr, Name ": Error restoring backup from %s\n",
@@ -2537,7 +2537,7 @@ int Grow_restart(struct supertype *st, struct mdinfo *info, int *fdlist, int cnt
fd, __le64_to_cpu(bsb.devstart)*512 +
__le64_to_cpu(bsb.devstart2)*512,
__le64_to_cpu(bsb.arraystart2)*512,
- __le64_to_cpu(bsb.length2)*512)) {
+ __le64_to_cpu(bsb.length2)*512, NULL)) {
/* didn't succeed, so giveup */
if (verbose)
fprintf(stderr, Name ": Error restoring second backup from %s\n",
diff --git a/mdadm.h b/mdadm.h
index 726e0a4..be9a93e 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -477,7 +477,8 @@ extern int save_stripes(int *source, unsigned long long *offsets,
extern int restore_stripes(int *dest, unsigned long long *offsets,
int raid_disks, int chunk_size, int level, int layout,
int source, unsigned long long read_offset,
- unsigned long long start, unsigned long long length);
+ unsigned long long start, unsigned long long length,
+ char *src_buf);

#ifndef Sendmail
diff --git a/restripe.c b/restripe.c
index c2fbe5b..3b16ad8 100644
--- a/restripe.c
+++ b/restripe.c
@@ -45,6 +45,7 @@ static int geo_map(int block, unsigned long long stripe, int raid_disks,

switch(level*100 + layout) {
case 000:
+ case 000 + ALGORITHM_PARITY_N: /* layout does not matter for raid0 */
case 400:
case 400 + ALGORITHM_PARITY_N:
case 500 + ALGORITHM_PARITY_N:
@@ -533,11 +534,10 @@ int save_stripes(int *source, unsigned long long *offsets,
fdisk[0], fdisk[1], bufs);
}
}
-
- for (i=0; i<nwrites; i++)
- if (write(dest[i], buf, len) != len)
- return -1;
-
+ if (dest)
+ for (i = 0; i < nwrites; i++)
+ if (write(dest[i], buf, len) != len)
+ return -1;
length -= len;
start += len;
}
@@ -558,7 +558,8 @@ int save_stripes(int *source, unsigned long long *offsets,
int restore_stripes(int *dest, unsigned long long *offsets,
int raid_disks, int chunk_size, int level, int layout,
int source, unsigned long long read_offset,
- unsigned long long start, unsigned long long length)
+ unsigned long long start, unsigned long long length,
+ char *src_buf)
{
char *stripe_buf;
char **stripes = malloc(raid_disks * sizeof(char*));
@@ -576,13 +577,17 @@ int restore_stripes(int *dest, unsigned long long *offsets,
}
if (stripe_buf == NULL || stripes == NULL || blocks == NULL
|| zero == NULL) {
- free(stripe_buf);
- free(stripes);
- free(blocks);
- free(zero);
+ if (stripe_buf != NULL)
+ free(stripe_buf);
+ if (stripes != NULL)
+ free(stripes);
+ if (blocks != NULL)
+ free(blocks);
+ if (zero != NULL)
+ free(zero);
return -2;
}
- for (i=0; i<raid_disks; i++)
+ for (i = 0; i < raid_disks; i++)
stripes[i] = stripe_buf + i * chunk_size;
while (length > 0) {
unsigned int len = data_disks * chunk_size;
@@ -591,15 +596,19 @@ int restore_stripes(int *dest, unsigned long long *offsets,
int syndrome_disks;
if (length < len)
return -3;
- for (i=0; i < data_disks; i++) {
+ for (i = 0; i < data_disks; i++) {
int disk = geo_map(i, start/chunk_size/data_disks,
raid_disks, level, layout);
- if ((unsigned long long)lseek64(source, read_offset, 0)
- != read_offset)
- return -1;
- if (read(source, stripes[disk],
- chunk_size) != chunk_size)
- return -1;
+ if (src_buf == NULL) {
+ /* read from file */
+ if (lseek64(source, read_offset, 0) != (off64_t)read_offset)
+ return -1;
+ if (read(source, stripes[disk], chunk_size) != chunk_size)
+ return -1;
+ } else {
+ /* read from input buffer */
+ memcpy(stripes[disk], src_buf + read_offset, chunk_size);
+ }
read_offset += chunk_size;
}
/* We have the data, now do the parity */

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH 42/53] mdadm: support backup operations for imsm

on 26.11.2010 09:09:17 by adam.kwolek

Add support for the following operations:
save_backup() - save critical data stripes to Migration Copy Area and
update the current migration unit status.
Use restore_stripes() to form a destination stripe,
and to write it to the Copy Area.
save_backup() also initializes the migration record at the
beginning of the reshape.

discard_backup() - critical data was successfully migrated by the kernel.
Update the current unit status in the migration record.

recover_backup() - recover critical data from the Migration Copy Area
while assembling an array.
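
The unit bookkeeping that save_backup() and discard_backup() share can be sketched as below. This is a simplified model under stated assumptions: struct migr_sketch and update_unit() are illustrative names, the fields are kept in host byte order (the real struct migr_record stores them little-endian), and the zero check mirrors the guard in discard_backup_imsm().

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for the few migration-record fields used here;
 * not the real struct migr_record from super-intel.c.
 */
struct migr_sketch {
	uint32_t blocks_per_unit;	/* sectors covered by one migration unit */
	uint32_t dest_depth_per_unit;	/* per-member sectors written per unit */
	uint32_t curr_migr_unit;	/* 1-based index of the current unit */
	uint32_t dest_1st_member_lba;	/* per-member start of the current unit */
};

/* Recompute the current unit and its destination LBA from the reshape
 * progress (in sectors), following the arithmetic used in the patch.
 */
void update_unit(struct migr_sketch *m, unsigned long long reshape_progress)
{
	assert(m->blocks_per_unit != 0);
	m->curr_migr_unit =
		(uint32_t)(reshape_progress / m->blocks_per_unit) + 1;
	m->dest_1st_member_lba =
		(m->curr_migr_unit - 1) * m->dest_depth_per_unit;
}
```

For example, with blocks_per_unit = 2048 and dest_depth_per_unit = 512, a reshape progress of 4096 sectors gives unit 3 and a destination LBA of 1024.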

Signed-off-by: Maciej Trela
Signed-off-by: Adam Kwolek
---

super-intel.c | 264 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 259 insertions(+), 5 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index ef1ed45..a4dda6a 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -1525,7 +1525,6 @@ static int imsm_level_to_layout(int level)
/*
* load_imsm_migr_rec - read imsm migration record
*/
-__attribute__((unused))
static int load_imsm_migr_rec(struct intel_super *super, struct mdinfo *info)
{
unsigned long long dsize;
@@ -1580,7 +1579,6 @@ static int load_imsm_migr_rec(struct intel_super *super, struct mdinfo *info)
/*
* write_imsm_migr_rec - write imsm migration record
*/
-__attribute__((unused))
static int write_imsm_migr_rec(struct intel_super *super, struct mdinfo *info)
{
unsigned long long dsize;
@@ -1590,9 +1588,6 @@ static int write_imsm_migr_rec(struct intel_super *super, struct mdinfo *info)
int retval = -1;

for (sd = info->devs ; sd ; sd = sd->next) {
- /* read only from one of the first two slots */
- if (sd->disk.raid_disk > 1)
- continue;
sprintf(nm, "%d:%d", sd->disk.major, sd->disk.minor);
fd = dev_open(nm, O_RDWR);
if (fd < 0)
@@ -6279,6 +6274,258 @@ static void imsm_delete(struct intel_super *super, struct dl **dlp, unsigned ind
__free_imsm_disk(dl);
}
}
+
+int open_backup_targets(struct mdinfo *info, int raid_disks, int *raid_fds)
+{
+ struct mdinfo *sd;
+
+ for (sd = info->devs ; sd ; sd = sd->next) {
+ if (sd->disk.state & (1<<MD_DISK_FAULTY)) {
+ dprintf("disk is faulty!!\n");
+ continue;
+ }
+
+ if ((sd->disk.raid_disk >= raid_disks) ||
+ (sd->disk.raid_disk < 0)) {
+ raid_fds[sd->disk.raid_disk] = -1;
+ continue;
+ }
+ char *dn = map_dev(sd->disk.major,
+ sd->disk.minor, 1);
+ raid_fds[sd->disk.raid_disk] = dev_open(dn, O_RDWR);
+ if (raid_fds[sd->disk.raid_disk] < 0) {
+ fprintf(stderr, "cannot open component\n");
+ return -1;
+ }
+ }
+ return 0;
+}
+
+#define RAID_DISK_RESERVED_BLOCKS_IMSM_HI 417
+
+void init_migr_record_imsm(struct intel_super *super, struct mdinfo *info,
+ unsigned blocks_per_unit)
+{
+ struct migr_record *migr_rec = super->migr_rec;
+ int new_data_disks, prev_data_disks;
+ long long unsigned new_array_sectors;
+ int prev_stripe_sectors, new_stripe_sectors;
+ unsigned long long dsize, dev_sectors;
+ long long unsigned min_dev_sectors = -1LLU;
+ struct mdinfo *sd;
+ char nm[30];
+ int fd;
+
+ memset(migr_rec, 0, sizeof(struct migr_record));
+ migr_rec->family_num = __cpu_to_le32(super->anchor->family_num);
+
+ migr_rec->ascending_migr = __cpu_to_le32((info->delta_disks > 0) ? 1 : 0);
+
+ prev_data_disks = info->array.raid_disks;
+ if ((info->array.level == 5) || (info->array.level == 4))
+ prev_data_disks--;
+ new_data_disks = info->array.raid_disks + info->delta_disks;
+ if ((info->new_level == 5) || (info->new_level == 4))
+ new_data_disks--;
+
+ new_array_sectors = info->component_size;
+ new_array_sectors &= ~(unsigned long long)((info->new_chunk / 512) - 1);
+ new_array_sectors *= new_data_disks;
+ new_array_sectors = (new_array_sectors >> SECT_PER_MB_SHIFT)
+ << SECT_PER_MB_SHIFT;
+
+ migr_rec->post_migr_vol_cap = __cpu_to_le32(new_array_sectors);
+ migr_rec->post_migr_vol_cap_hi = __cpu_to_le32(new_array_sectors >> 32);
+
+ prev_stripe_sectors = info->array.chunk_size/512 * prev_data_disks;
+ new_stripe_sectors = info->new_chunk/512 * new_data_disks;
+
+ new_array_sectors = info->component_size * new_data_disks / blocks_per_unit;
+ migr_rec->num_migr_units = __cpu_to_le32(new_array_sectors);
+ migr_rec->dest_depth_per_unit = __cpu_to_le32(blocks_per_unit / new_data_disks);
+ migr_rec->blocks_per_unit = __cpu_to_le32(blocks_per_unit);
+
+ /* Find the smallest dev */
+ for (sd = info->devs ; sd ; sd = sd->next) {
+ sprintf(nm, "%d:%d", sd->disk.major, sd->disk.minor);
+ fd = dev_open(nm, O_RDONLY);
+ if (fd < 0)
+ continue;
+ get_dev_size(fd, NULL, &dsize);
+ dev_sectors = dsize / 512;
+ if (dev_sectors < min_dev_sectors)
+ min_dev_sectors = dev_sectors;
+ close(fd);
+ }
+ migr_rec->ckpt_area_pba = __cpu_to_le32(min_dev_sectors -
+ RAID_DISK_RESERVED_BLOCKS_IMSM_HI);
+ return;
+}
+
+int save_backup_imsm(struct supertype *st, struct mdinfo *info,
+ void *buf, unsigned long write_offset,
+ int length)
+{
+ struct intel_super *super = st->sb;
+ unsigned long long *target_offsets = NULL;
+ int *targets = NULL;
+ int new_disks, new_odata;
+ int i, rv = -1;
+
+ if (info->reshape_progress == 0)
+ init_migr_record_imsm(super, info, length/512);
+
+ new_disks = info->array.raid_disks + info->delta_disks;
+ new_odata = new_disks;
+ if ((info->new_level == 5) || (info->new_level == 4))
+ new_odata--;
+
+ targets = malloc(new_disks * sizeof(int));
+ if (!targets)
+ goto abort;
+
+ target_offsets = malloc(new_disks * sizeof(unsigned long long));
+ if (!target_offsets)
+ goto abort;
+
+ for (i = 0; i < new_disks; i++) {
+ targets[i] = -1;
+ target_offsets[i] = (unsigned long long)
+ __le32_to_cpu(super->migr_rec->ckpt_area_pba) * 512;
+ target_offsets[i] -= write_offset / new_odata;
+ }
+
+ open_backup_targets(info, new_disks, targets);
+
+ if (restore_stripes(targets, /* list of dest devices */
+ target_offsets, /* migration record offsets */
+ new_disks,
+ info->new_chunk,
+ info->new_level,
+ info->new_layout,
+ 0, /* source backup file descriptor */
+ 0, /* input buf offset - always 0, buf is already offset */
+ write_offset,
+ info->new_chunk * new_odata,
+ buf) != 0) {
+ fprintf(stderr, Name ": Error restoring stripes\n");
+ goto abort;
+ }
+
+ super->migr_rec->curr_migr_unit =
+ __cpu_to_le32(info->reshape_progress /
+ __le32_to_cpu(super->migr_rec->blocks_per_unit) + 1);
+ super->migr_rec->rec_status = __cpu_to_le32(UNIT_SRC_IN_CP_AREA);
+ super->migr_rec->dest_1st_member_lba =
+ __cpu_to_le32((__le32_to_cpu(super->migr_rec->curr_migr_unit ) - 1)
+ * __le32_to_cpu(super->migr_rec->dest_depth_per_unit));
+
+ write_imsm_migr_rec(super, info);
+ abort:
+ if (targets) {
+ for (i = 0; i < new_disks; i++)
+ if (targets[i] >= 0)
+ close(targets[i]);
+ free(targets);
+ }
+ if (target_offsets)
+ free(target_offsets);
+ return rv;
+}
+
+void discard_backup_imsm(struct supertype *st, struct mdinfo *info)
+{
+ struct intel_super *super = st->sb;
+ load_imsm_migr_rec(super, info);
+ if (__le32_to_cpu(super->migr_rec->blocks_per_unit) == 0) {
+ dprintf("ERROR: blocks_per_unit = 0!!!\n");
+ return;
+ }
+
+ super->migr_rec->curr_migr_unit =
+ __cpu_to_le32(info->reshape_progress /
+ __le32_to_cpu(super->migr_rec->blocks_per_unit) + 1);
+ super->migr_rec->rec_status = __cpu_to_le32(UNIT_SRC_NORMAL);
+ super->migr_rec->dest_1st_member_lba =
+ __cpu_to_le32((__le32_to_cpu(super->migr_rec->curr_migr_unit ) - 1)
+ * __le32_to_cpu(super->migr_rec->dest_depth_per_unit));
+ write_imsm_migr_rec(super, info);
+}
+
+int recover_backup_imsm(struct supertype *st, struct mdinfo *info,
+ void *ptr, int length)
+{
+ struct intel_super *super = st->sb;
+ unsigned long long read_offset;
+ unsigned long long write_offset;
+ unsigned unit_len;
+ int *targets = NULL;
+ int new_disks, i;
+ char *buf = NULL;
+ int retval = 1;
+
+ if (__le32_to_cpu(super->migr_rec->rec_status) == UNIT_SRC_NORMAL)
+ return 0;
+ if (__le32_to_cpu(super->migr_rec->curr_migr_unit)
+ >= __le32_to_cpu(super->migr_rec->num_migr_units))
+ return 0;
+
+ new_disks = info->array.raid_disks + info->delta_disks;
+
+ read_offset = (unsigned long long)
+ __le32_to_cpu(super->migr_rec->ckpt_area_pba) * 512;
+
+ write_offset = ((unsigned long long)
+ __le32_to_cpu(super->migr_rec->dest_1st_member_lba) +
+ info->data_offset) * 512;
+
+ unit_len = __le32_to_cpu(super->migr_rec->dest_depth_per_unit) * 512;
+ if (posix_memalign((void **)&buf, 512, unit_len) != 0)
+ goto abort;
+ targets = malloc(new_disks * sizeof(int));
+ if (!targets)
+ goto abort;
+
+ open_backup_targets(info, new_disks, targets);
+
+ for (i = 0; i < new_disks; i++) {
+ if (lseek64(targets[i], read_offset, SEEK_SET) < 0) {
+ fprintf(stderr,
+ Name ": Cannot seek to block: %s\n",
+ strerror(errno));
+ goto abort;
+ }
+ if (read(targets[i], buf, unit_len) != unit_len) {
+ fprintf(stderr,
+ Name ": Cannot read copy area block: %s\n",
+ strerror(errno));
+ goto abort;
+ }
+ if (lseek64(targets[i], write_offset, SEEK_SET) < 0) {
+ fprintf(stderr,
+ Name ": Cannot seek to block: %s\n",
+ strerror(errno));
+ goto abort;
+ }
+ if (write(targets[i], buf, unit_len) != unit_len) {
+ fprintf(stderr,
+ Name ": Cannot restore block: %s\n",
+ strerror(errno));
+ goto abort;
+ }
+ }
+ retval = 0;
+abort:
+ if (targets) {
+ for (i = 0; i < new_disks; i++)
+ if (targets[i])
+ close(targets[i]);
+ free(targets);
+ }
+ if (buf)
+ free(buf);
+ return retval;
+}
#endif /* MDASSEMBLE */

static int update_level_imsm(struct supertype *st, struct mdinfo *info,
@@ -7899,6 +8146,13 @@ struct superswitch super_imsm = {
.reshape_array = imsm_reshape_array,
.manage_reshape = imsm_manage_reshape,

+ /* for external backup area
+ *
+ */
+ .save_backup = save_backup_imsm,
+ .discard_backup = discard_backup_imsm,
+ .recover_backup = recover_backup_imsm,
+
.external = 1,
.name = "imsm",


[PATCH 43/53] mdadm: migration restart for external meta

on 26.11.2010 09:09:24 by adam.kwolek

Add support for assembling partially migrated arrays with external metadata.
Note that if Raid0 was in use during the migration, it must be changed to
Raid4 while assembling (see check_mpb_migr_compatibility and switch_raid0_configuration).

getinfo_super_imsm_volume() reads migration record and initializes mdadm reshape specific structures.
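
Two small calculations that getinfo_super_imsm_volume() performs on the loaded values can be sketched separately. The function names below are hypothetical, and the sketch assumes the strip size is a power of two, since the masking used in the patch is only correct in that case.

```c
#include <assert.h>
#include <stdint.h>

/* Round blocks_per_member down to a multiple of blocks_per_strip.
 * The mask trick is only valid for power-of-two strip sizes.
 */
uint32_t align_to_strip(uint32_t blocks_per_member, uint32_t blocks_per_strip)
{
	assert(blocks_per_strip != 0 &&
	       (blocks_per_strip & (blocks_per_strip - 1)) == 0);
	return blocks_per_member & ~(blocks_per_strip - 1);
}

/* Reshape progress in sectors, derived from the migration record.
 * The cast widens before multiplying so the product cannot wrap at 32 bits.
 */
unsigned long long reshape_progress(uint32_t blocks_per_unit,
				    uint32_t curr_migr_unit)
{
	return (unsigned long long)blocks_per_unit * curr_migr_unit;
}
```

With a 128-sector strip, a member of 1000 sectors is trimmed to 896 sectors (7 full strips).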

Signed-off-by: Maciej Trela
Signed-off-by: Adam Kwolek
---

Assemble.c | 8 ++
super-intel.c | 205 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 208 insertions(+), 5 deletions(-)

diff --git a/Assemble.c b/Assemble.c
index 409f0d7..c34c109 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -1313,6 +1313,14 @@ int assemble_container_content(struct supertype *st, int mdfd,
close(mdfd);
return 1;
}
+
+ if (content->reshape_active) {
+ sysfs_set_num(sra, NULL, "reshape_position", content->reshape_progress);
+ sysfs_set_num(sra, NULL, "chunk_size", content->new_chunk);
+ sysfs_set_num(sra, NULL, "layout", content->new_layout);
+ sysfs_set_num(sra, NULL, "raid_disks", content->array.raid_disks + content->delta_disks);
+ }
+
if (sra)
sysfs_free(sra);

diff --git a/super-intel.c b/super-intel.c
index a4dda6a..7eb7107 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -850,6 +850,7 @@ static void examine_super_imsm(struct supertype *st, char *homehost)
printf(" Orig Family : %08x\n", __le32_to_cpu(mpb->orig_family_num));
printf(" Family : %08x\n", __le32_to_cpu(mpb->family_num));
printf(" Generation : %08x\n", __le32_to_cpu(mpb->generation_num));
+ info.devs = NULL;
getinfo_super_imsm(st, &info);
fname_from_uuid(st, &info, nbuf, ':');
printf(" UUID : %s\n", nbuf + 5);
@@ -877,6 +878,7 @@ static void examine_super_imsm(struct supertype *st, char *homehost)
struct imsm_dev *dev = __get_imsm_dev(mpb, i);

super->current_vol = i;
+ info.devs = NULL;
getinfo_super_imsm(st, &info);
fname_from_uuid(st, &info, nbuf, ':');
print_imsm_dev(dev, nbuf + 5, super->disks->index);
@@ -900,6 +902,7 @@ static void brief_examine_super_imsm(struct supertype *st, int verbose)
return;
}

+ info.devs = NULL;
getinfo_super_imsm(st, &info);
fname_from_uuid(st, &info, nbuf, ':');
printf("ARRAY metadata=imsm UUID=%s\n", nbuf + 5);
@@ -917,12 +920,14 @@ static void brief_examine_subarrays_imsm(struct supertype *st, int verbose)
if (!super->anchor->num_raid_devs)
return;

+ info.devs = NULL;
getinfo_super_imsm(st, &info);
fname_from_uuid(st, &info, nbuf, ':');
for (i = 0; i < super->anchor->num_raid_devs; i++) {
struct imsm_dev *dev = get_imsm_dev(super, i);

super->current_vol = i;
+ info.devs = NULL;
getinfo_super_imsm(st, &info);
fname_from_uuid(st, &info, nbuf1, ':');
printf("ARRAY /dev/md/%.16s container=%s member=%d UUID=%s\n",
@@ -937,6 +942,7 @@ static void export_examine_super_imsm(struct supertype *st)
struct mdinfo info;
char nbuf[64];

+ info.devs = NULL;
getinfo_super_imsm(st, &info);
fname_from_uuid(st, &info, nbuf, ':');
printf("MD_METADATA=imsm\n");
@@ -950,6 +956,7 @@ static void detail_super_imsm(struct supertype *st, char *homehost)
struct mdinfo info;
char nbuf[64];

+ info.devs = NULL;
getinfo_super_imsm(st, &info);
fname_from_uuid(st, &info, nbuf, ':');
printf("\n UUID : %s\n", nbuf + 5);
@@ -959,6 +966,7 @@ static void brief_detail_super_imsm(struct supertype *st)
{
struct mdinfo info;
char nbuf[64];
+ info.devs = NULL;
getinfo_super_imsm(st, &info);
fname_from_uuid(st, &info, nbuf, ':');
printf(" UUID=%s", nbuf + 5);
@@ -1624,6 +1632,8 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info)
struct dl *dl;
char *devname;
int minor;
+ __u32 blocks_per_member;
+ __u32 blocks_per_strip;

for (dl = super->disks; dl; dl = dl->next)
if (dl->raiddisk == info->disk.raid_disk)
@@ -1631,7 +1641,13 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info)
info->container_member = super->current_vol;
info->array.raid_disks = map->num_members;
info->array.level = get_imsm_raid_level(map);
- info->array.layout = imsm_level_to_layout(info->array.level);
+ if (info->array.level == 4) {
+ map->raid_level = 5;
+ info->array.level = 5;
+ info->array.layout = ALGORITHM_PARITY_N;
+ } else {
+ info->array.layout = imsm_level_to_layout(info->array.level);
+ }
info->array.md_minor = -1;
info->array.ctime = 0;
info->array.utime = 0;
@@ -1649,7 +1665,15 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info)
}

info->data_offset = __le32_to_cpu(map->pba_of_lba0);
- info->component_size = __le32_to_cpu(map->blocks_per_member);
+ /* FIXME: For some unknown reason sometimes in a volume created by
+ * IMSM blocks_per_member is not a multiple of blocks_per strip.
+ * Fix blocks_per_member here:
+ */
+ blocks_per_member = __le32_to_cpu(map->blocks_per_member);
+ blocks_per_strip = __le16_to_cpu(map->blocks_per_strip);
+ blocks_per_member &= ~(blocks_per_strip - 1);
+ info->component_size = blocks_per_member;
+
memset(info->uuid, 0, sizeof(info->uuid));
info->recovery_start = MaxSector;
info->reshape_active = 0;
@@ -1673,7 +1697,43 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info)
*/
case MIGR_REBUILD:
/* this is handled by container_content_imsm() */
- case MIGR_GEN_MIGR:
+ case MIGR_GEN_MIGR: {
+ struct imsm_map *prev_map;
+ int data_members;
+
+ load_imsm_migr_rec(super, info);
+
+ info->reshape_progress = (unsigned long long)
+ __le32_to_cpu(super->migr_rec->blocks_per_unit) *
+ __le32_to_cpu(super->migr_rec->curr_migr_unit);
+
+ /* set previous and new map configurations */
+ prev_map = get_imsm_map(dev, 1);
+ info->reshape_active = 1;
+ info->array.raid_disks = prev_map->num_members;
+ info->delta_disks = map->num_members - prev_map->num_members;
+ info->new_level = info->array.level;
+ info->array.level = get_imsm_raid_level(prev_map);
+ info->new_layout = info->array.layout;
+ info->array.layout = imsm_level_to_layout(info->array.level);
+ info->array.chunk_size = __le16_to_cpu(prev_map->blocks_per_strip) << 9;
+ info->new_chunk = __le16_to_cpu(map->blocks_per_strip) << 9;
+
+ if (info->array.level == 4) {
+ prev_map->raid_level = 5;
+ info->array.level = 5;
+ info->array.layout = ALGORITHM_PARITY_N;
+ }
+
+ /* IMSM FIX for blocks_per_member */
+ blocks_per_strip = __le16_to_cpu(prev_map->blocks_per_strip);
+ blocks_per_member &= ~(blocks_per_strip - 1);
+ info->component_size = blocks_per_member;
+
+ /* Calculate previous array size */
+ data_members = imsm_num_data_members(dev, 1);
+ info->custom_array_size = blocks_per_member * data_members;
+ }
case MIGR_STATE_CHANGE:
/* FIXME handle other migrations */
default:
@@ -2445,6 +2505,117 @@ struct bbm_log *__get_imsm_bbm_log(struct imsm_super *mpb)
return ptr;
}

+/* Switches an N-disk Raid0 map configuration to (N+1)-disk Raid4
+ */
+void switch_raid0_configuration(struct imsm_super *mpb, struct imsm_map *map)
+{
+ __u8 *src, *dst;
+ int bytes_to_copy;
+
+ /* get the pointer to the rest of the metadata */
+ src = (__u8 *)map + sizeof_imsm_map(map);
+
+ /* change the level and disk number to be compatible with IMSM */
+ map->raid_level = 4;
+ map->num_members++;
+
+ /* get the updated pointer to the rest of the metadata */
+ dst = (__u8 *)map + sizeof_imsm_map(map);
+ /* Now move the rest of the metadata to be properly aligned */
+ bytes_to_copy = mpb->mpb_size - (src - (__u8 *)mpb);
+ if (bytes_to_copy > 0)
+ memmove(dst, src, bytes_to_copy);
+ /* Now insert new entry to the map */
+ set_imsm_ord_tbl_ent(map, map->num_members - 1/*slot*/,
+ mpb->num_disks | IMSM_ORD_REBUILD);
+ /* update size */
+ mpb->mpb_size += sizeof(__u32);
+}
+
+/* Make sure that in case of migration in progress we'll convert raid
+ * personalities so we could continue migrating
+ */
+void convert_raid_personalities(struct intel_super *super)
+{
+ struct imsm_super *mpb = super->anchor;
+ struct imsm_map *map;
+ struct imsm_disk *newMissing;
+ int i, map_modified = 0;
+ int bytes_to_copy;
+ __u8 *src, *dst;
+
+ for (i = 0; i < super->anchor->num_raid_devs; i++) {
+ struct imsm_dev *dev_iter = __get_imsm_dev(super->anchor, i);
+
+ map_modified = 0;
+ if (dev_iter &&
+ dev_iter->vol.migr_state == 1 &&
+ dev_iter->vol.migr_type == MIGR_GEN_MIGR) {
+ /* This device is migrating, check for raid0 levels */
+ map = get_imsm_map(dev_iter, 0);
+ if (map->raid_level == 0) {
+ /* Map0: Migrating raid0 detected - lets switch it to level4 */
+ switch_raid0_configuration(mpb, map);
+ map_modified++;
+ }
+ map = get_imsm_map(dev_iter, 1);
+ if (map->raid_level == 0) {
+ /* Map1: Migrating raid0 detected - lets switch it to level4 */
+ switch_raid0_configuration(mpb, map);
+ map_modified++;
+ }
+ }
+ }
+
+ if (map_modified > 0) {
+ /* Add missing device to the MPB disk table */
+ src = (__u8 *)mpb->disk + sizeof(struct imsm_disk) * mpb->num_disks;
+ mpb->num_disks++;
+ dst = (__u8 *)mpb->disk + sizeof(struct imsm_disk) * mpb->num_disks;
+
+ /* Now move the rest of the metadata to be properly aligned */
+ bytes_to_copy = mpb->mpb_size - (src - (__u8 *)mpb);
+ if (bytes_to_copy > 0)
+ memmove(dst, src, bytes_to_copy);
+
+ /* Update mpb size */
+ mpb->mpb_size += sizeof(struct imsm_disk);
+
+ /* Now fill in the new missing disk fields */
+ newMissing = (struct imsm_disk *)src;
+ sprintf((char *)newMissing->serial, "%s", "MISSING DISK");
+ /* copy the device size from the first disk */
+ newMissing->total_blocks = mpb->disk[0].total_blocks;
+ newMissing->scsi_id = 0x0;
+ newMissing->status = FAILED_DISK;
+ }
+}
+
+/* Check for unsupported migration features:
+ * migration optimization area
+ */
+int check_mpb_migr_compatibility(struct intel_super *super)
+{
+ struct imsm_map *map0, *map1;
+ int i;
+
+ for (i = 0; i < super->anchor->num_raid_devs; i++) {
+ struct imsm_dev *dev_iter = __get_imsm_dev(super->anchor, i);
+
+ if (dev_iter &&
+ dev_iter->vol.migr_state == 1 &&
+ dev_iter->vol.migr_type == MIGR_GEN_MIGR) {
+ /* This device is migrating */
+ map0 = get_imsm_map(dev_iter, 0);
+ map1 = get_imsm_map(dev_iter, 1);
+ if (map0->pba_of_lba0 != map1->pba_of_lba0)
+ /* migration optimization area was used */
+ return -1;
+ }
+ }
+ return 0;
+}
+
static void __free_imsm(struct intel_super *super, int free_disks);

/* load_imsm_mpb - read matrix metadata
@@ -2556,6 +2727,21 @@ static int load_imsm_mpb(int fd, struct intel_super *super, char *devname)
return 3;
}

+ /* Check for unsupported migration features */
+ if (check_mpb_migr_compatibility(super) != 0) {
+ if (devname)
+ fprintf(stderr,
+ Name ": Unsupported migration detected on %s\n",
+ devname);
+
+ return 4;
+ }
+
+ /* Now make sure that in case of migration
+ * we'll convert raid personalities
+ */
+ convert_raid_personalities(super);
+
/* FIXME the BBM log is disk specific so we cannot use this global
* buffer for all disks. Ok for now since we only look at the global
* bbm_log_size parameter to gate assembly
@@ -4601,6 +4787,8 @@ static void update_recovery_start(struct imsm_dev *dev, struct mdinfo *array)
rebuild->recovery_start = units * blocks_per_migr_unit(dev);
}

+int recover_backup_imsm(struct supertype *st, struct mdinfo *info,
+ void *ptr, int length);

static struct mdinfo *container_content_imsm(struct supertype *st)
{
@@ -4718,8 +4906,15 @@ static struct mdinfo *container_content_imsm(struct supertype *st)
info_d->data_offset = __le32_to_cpu(map->pba_of_lba0);
info_d->component_size = __le32_to_cpu(map->blocks_per_member);
}
- /* now that the disk list is up-to-date fixup recovery_start */
- update_recovery_start(dev, this);
+ if (this) {
+ /* now that the disk list is up-to-date fixup recovery_start */
+ update_recovery_start(dev, this);
+
+ /* check for reshape */
+ if (this->reshape_active == 1)
+ recover_backup_imsm(st, this, NULL, 0);
+ }
+
rest = this;
}


[PATCH 44/53] Add mdadm->mdmon sync_max command message

on 26.11.2010 09:09:32 by adam.kwolek

Currently only metadata_update messages can be sent from mdadm to mdmon using a socket.
The external metadata reshape implementation also needs support for sending a sync_max command.

A new type of message, "cmd_message", is defined.
cmd_message is a generic structure that allows different types of commands to be defined and sent from mdadm to mdmon.

cmd_messages and update messages are distinguished by the different start magic numbers sent through the socket.

In this patch only one type of cmd_message was defined:
'SET_SYNC_MAX'
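
The framing idea can be sketched on a plain memory buffer instead of a socket. This is an illustration only: frame_message() and classify_frame() are hypothetical names, the magic values are the ones from msg.c in this patch, and the return codes follow the 0 / 1 / -1 convention documented for receive_message().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { MSG_ERROR = -1, MSG_METADATA_UPDATE = 0, MSG_COMMAND = 1 };

static const uint32_t START_MAGIC     = 0x5a5aa5a5;
static const uint32_t START_MAGIC_CMD = 0x6b6bb6b6;
static const uint32_t END_MAGIC       = 0xa5a55a5a;

/* Build "start magic | length | payload | end magic" in out.
 * Returns the number of bytes written, or 0 if out is too small.
 */
size_t frame_message(int is_cmd, const void *payload, int32_t len,
		     unsigned char *out, size_t cap)
{
	uint32_t magic = is_cmd ? START_MAGIC_CMD : START_MAGIC;
	size_t need;

	assert(len >= 0);
	need = 4 + 4 + (size_t)len + 4;
	if (cap < need)
		return 0;
	memcpy(out, &magic, 4);
	memcpy(out + 4, &len, 4);
	if (len > 0)
		memcpy(out + 8, payload, (size_t)len);
	memcpy(out + 8 + len, &END_MAGIC, 4);
	return need;
}

/* Decide the message type from the start magic after checking that the
 * length field and the end magic are consistent with the frame size.
 */
int classify_frame(const unsigned char *in, size_t n)
{
	uint32_t magic, end;
	int32_t len;

	if (n < 12)
		return MSG_ERROR;
	memcpy(&magic, in, 4);
	memcpy(&len, in + 4, 4);
	if (len < 0 || (size_t)len != n - 12)
		return MSG_ERROR;
	memcpy(&end, in + 8 + len, 4);
	if (end != END_MAGIC)
		return MSG_ERROR;
	if (magic == START_MAGIC)
		return MSG_METADATA_UPDATE;
	if (magic == START_MAGIC_CMD)
		return MSG_COMMAND;
	return MSG_ERROR;
}
```

Keeping the length and end-magic layout identical for both message kinds means the receiver only has to branch on the first word, which is how receive_message() is extended in the diff below.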

Signed-off-by: Maciej Trela
Signed-off-by: Adam Kwolek
---

managemon.c | 39 +++++++++++++++++++++++++++++++++++++--
mdadm.h | 18 ++++++++++++++++++
mdmon.h | 3 +++
msg.c | 33 +++++++++++++++++++++++++++++++--
msg.h | 2 ++
util.c | 25 +++++++++++++++++++++++++
6 files changed, 116 insertions(+), 4 deletions(-)

diff --git a/managemon.c b/managemon.c
index d9eb743..9ff3632 100644
--- a/managemon.c
+++ b/managemon.c
@@ -729,13 +729,36 @@ static void handle_message(struct supertype *container, struct metadata_update *
}
}

+static void handle_command(struct supertype *container, struct cmd_message *msg)
+{
+ struct active_array *a;
+
+ /* Search for a member of this container */
+ for (a = container->arrays; a; a = a->next)
+ if (msg->devnum == a->devnum)
+ break;
+
+ if (!a)
+ return;
+
+ /* check command msg type */
+ switch (msg->type) {
+ case SET_SYNC_MAX:
+ /* Add SET_SYNC_MAX handler here */
+ break;
+ }
+}
+
void read_sock(struct supertype *container)
{
int fd;
struct metadata_update msg;
+ struct mdmon_update *update;
+ struct cmd_message *cmd_msg;
int terminate = 0;
long fl;
int tmo = 3; /* 3 second timeout before hanging up the socket */
+ int rv;

fd = accept(container->sock, NULL, NULL);
if (fd < 0)
@@ -749,7 +772,9 @@ void read_sock(struct supertype *container)
msg.buf = NULL;

/* read and validate the message */
- if (receive_message(fd, &msg, tmo) == 0) {
+ rv = receive_message(fd, &msg, tmo);
+ if (rv == 0) {
+ /* metadata update */
handle_message(container, &msg);
if (msg.len == 0) {
/* ping reply with version */
@@ -759,8 +784,18 @@ void read_sock(struct supertype *container)
terminate = 1;
} else if (ack(fd, tmo) < 0)
terminate = 1;
- } else
+ } else if (rv == 1) {
+ /* mdmon_update received */
+ update = (struct mdmon_update *)&msg;
+ cmd_msg = (struct cmd_message *)(update->buf);
+ handle_command(container, cmd_msg);
+
+ free(msg.buf);
+ if (ack(fd, tmo) < 0)
+ terminate = 1;
+ } else {
terminate = 1;
+ }

} while (!terminate);

diff --git a/mdadm.h b/mdadm.h
index be9a93e..eacf0f5 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -750,6 +750,23 @@ struct metadata_update {
struct metadata_update *next;
};

+struct mdmon_update {
+ int len;
+ char *buf;
+};
+
+enum cmd_type {
+ SET_SYNC_MAX,
+};
+
+struct cmd_message {
+ enum cmd_type type;
+ int devnum;
+ union {
+ unsigned long long new_sync_max;
+ } msg_buf;
+};
+
/* A supertype holds a particular collection of metadata.
* It identifies the metadata type by the superswitch, and the particular
* sub-version of that metadata type.
@@ -979,6 +996,7 @@ extern int assemble_container_content(struct supertype *st, int mdfd,
extern int add_disk(int mdfd, struct supertype *st,
struct mdinfo *sra, struct mdinfo *info);
extern int set_array_info(int mdfd, struct supertype *st, struct mdinfo *info);
+extern int send_mdmon_cmd(struct supertype *st, struct mdmon_update *update);
unsigned long long min_recovery_start(struct mdinfo *array);

extern char *human_size(long long bytes);
diff --git a/mdmon.h b/mdmon.h
index 9ea0b93..2c41e47 100644
--- a/mdmon.h
+++ b/mdmon.h
@@ -49,6 +49,9 @@ struct active_array {

enum state_of_reshape reshape_state;
int reshape_delta_disks;
+ int waiting_resync_max; /* wait for resync_max cmd from mdadm */
+ long long unsigned resync_max;
+ long long unsigned sync_completed;

int check_degraded; /* flag set by mon, read by manage */

diff --git a/msg.c b/msg.c
index 95c6f0b..f413824 100644
--- a/msg.c
+++ b/msg.c
@@ -32,6 +32,7 @@
#include "mdmon.h"

static const __u32 start_magic = 0x5a5aa5a5;
+static const __u32 start_magic_cmd = 0x6b6bb6b6;
static const __u32 end_magic = 0xa5a55a5a;

static int send_buf(int fd, const void* buf, int len, int tmo)
@@ -93,14 +94,42 @@ int send_message(int fd, struct metadata_update *msg, int tmo)
return rv;
}

+int send_message_cmd(int fd, struct mdmon_update *update, int tmo)
+{
+ __s32 len = update->len;
+ int rv;
+
+ rv = send_buf(fd, &start_magic_cmd, 4, tmo);
+ rv = rv ?: send_buf(fd, &len, 4, tmo);
+ if (len > 0)
+ rv = rv ?: send_buf(fd, update->buf, update->len, tmo);
+ rv = send_buf(fd, &end_magic, 4, tmo);
+
+ return rv;
+}
+
+/*
+ * return:
+ * 0 - metadata_update received
+ * 1 - mdmon_update received
+ * -1 - error case
+ */
int receive_message(int fd, struct metadata_update *msg, int tmo)
{
__u32 magic;
__s32 len;
int rv;
+ int msg_type;

rv = recv_buf(fd, &magic, 4, tmo);
- if (rv < 0 || magic != start_magic)
+ if (rv < 0)
+ return -1;
+
+ if (magic == start_magic)
+ msg_type = 0;
+ else if (magic == start_magic_cmd)
+ msg_type = 1;
+ else
return -1;
rv = recv_buf(fd, &len, 4, tmo);
if (rv < 0 || len > MSG_MAX_LEN)
@@ -122,7 +151,7 @@ int receive_message(int fd, struct metadata_update *msg, int tmo)
return -1;
}
msg->len = len;
- return 0;
+ return msg_type;
}

int ack(int fd, int tmo)
diff --git a/msg.h b/msg.h
index 1f916de..046f7c4 100644
--- a/msg.h
+++ b/msg.h
@@ -20,9 +20,11 @@

struct mdinfo;
struct metadata_update;
+struct mdmon_update;

extern int receive_message(int fd, struct metadata_update *msg, int tmo);
extern int send_message(int fd, struct metadata_update *msg, int tmo);
+extern int send_message_cmd(int fd, struct mdmon_update *update, int tmo);
extern int ack(int fd, int tmo);
extern int wait_reply(int fd, int tmo);
extern int connect_monitor(char *devname);
diff --git a/util.c b/util.c
index 396f6d8..ea9b148 100644
--- a/util.c
+++ b/util.c
@@ -1840,6 +1840,31 @@ int flush_metadata_updates(struct supertype *st)
return 0;
}

+int send_mdmon_cmd(struct supertype *st, struct mdmon_update *update)
+{
+ int sfd;
+ char *devname;
+
+ devname = devnum2devname(st->container_dev);
+ if (devname == NULL)
+ return -1;
+ sfd = connect_monitor(devname);
+ if (sfd < 0) {
+ free(devname);
+ return -1;
+ }
+
+ send_message_cmd(sfd, update, 0);
+ wait_reply(sfd, 0);
+
+ ack(sfd, 0);
+ wait_reply(sfd, 0);
+ close(sfd);
+ st->update_tail = NULL;
+ free(devname);
+ return 0;
+}
+
void append_metadata_update(struct supertype *st, void *buf, int len)
{


[PATCH 45/53] mdadm: support grow operation for external meta

on 26.11.2010 09:09:40 by adam.kwolek

Assumptions for external metadata reshape implementation:
- mdadm controls whether we are writing over live data
- mdadm advances suspend_hi, does a backup if needed, and then
tells mdmon it is safe to continue by sending
resync_max command_msg to mdmon
- mdmon controls sync_max sysfs entry - so the kernel won't
cross the safe position (reshape progress from metadata)
- mdmon monitors resync_completed and updates the metadata
to reflect 'resync_completed'.
- mdmon moves suspend_lo forward in line with changes in
resync_completed
- md moves suspend_hi forward: if resync_position crosses
suspend_hi, suspend_hi is pushed forward to the new reshape_position.
- md updates/notifies resync_completed periodically, which
guides mdmon in updating the metadata periodically.

In the list above, "mdadm" means a background process forked by "mdadm --grow"
or "mdadm --assemble" which monitors an ongoing reshape.
A general algorithm for external metadata reshape:

<===== we are writing over live data
1. mdadm sets suspend_lo = 0, suspend_hi = 0
2. monitor waits for new sync_max message from mdadm
3. mdadm sets suspend_hi
4. mdadm performs critical data backup with save_backup()
5. mdadm sends new resync_max to monitor
6. mdadm waits on suspend_lo change
7. mdmon wakes up on socket msg
8. mdmon: sync_max is not MAX (we are still writing over live data)
monitor sets sysfs:sync_max
9. md reshapes critical stripes
10. mdmon wakes up on new sync_completed
11. mdmon updates metadata using discard_backup()
12. mdmon updates suspend_lo
13. mdadm wakes on suspend_lo
14. go back to 2.

<==== now critical section is finished
2. mdmon waits for new sync_max message from mdadm
3. mdadm sends new sync_max = MAX to monitor
(this means the end of critical section)
6. mdadm exits
7. mdmon wakes up on socket msg
8. mdmon calculates at which stripe the next checkpoint must be made
9. mdmon sets sysfs:sync_max = next checkpoint
10. md reshapes critical stripes
11. mdmon wakes up on new sync_completed
12. mdmon updates metadata with discard_backup()
13. mdmon sets suspend_lo = sync_completed
14. go back to 8.
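
The handshake in the critical section can be sketched as a two-state machine on the mdmon side. This is only an illustration under simplifying assumptions: the struct, the function names and SYNC_MAX_UNLIMITED are hypothetical, sysfs and the socket are replaced by plain fields and function calls, and the checkpoint calculation used after the critical section is left out.

```c
#include <stdint.h>

/* Hypothetical stand-in for "sync_max = MAX" sent at the end of the critical section. */
#define SYNC_MAX_UNLIMITED UINT64_MAX

enum reshape_wait { WAIT_GROW_BACKUP, WAIT_MD_RESHAPE };

/* Simplified model of the per-array state; not the real struct active_array. */
struct reshape_sim {
	enum reshape_wait waiting_for;
	uint64_t grow_sync_max;	/* last sync_max received from mdadm */
	uint64_t sync_max;	/* value handed to md (sysfs sync_max) */
	uint64_t suspend_lo;	/* value mdadm is waiting on */
};

/* mdadm -> mdmon: the backup is done, md may proceed up to new_sync_max. */
void on_sync_max_msg(struct reshape_sim *s, uint64_t new_sync_max)
{
	s->grow_sync_max = new_sync_max;
	if (s->waiting_for == WAIT_GROW_BACKUP &&
	    s->grow_sync_max > s->sync_max) {
		s->sync_max = s->grow_sync_max;
		s->waiting_for = WAIT_MD_RESHAPE;
	}
}

/* md -> mdmon: sync_completed changed. */
void on_sync_completed(struct reshape_sim *s, uint64_t completed)
{
	if (s->waiting_for != WAIT_MD_RESHAPE || completed < s->sync_max)
		return;
	/* md reached the safe position: release the region to mdadm */
	s->suspend_lo = completed;
	if (s->grow_sync_max != SYNC_MAX_UNLIMITED)
		s->waiting_for = WAIT_GROW_BACKUP;
}
```

In this model mdadm would call on_sync_max_msg() after each save_backup(), and would treat a change of suspend_lo as the signal to back up the next region.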

A new external counterpart for grow_backup() is implemented:
grow_backup_ext().
For a non-grow reshape (the number of data disks does not change) a new child_same_size_ext() function is implemented.
Both use save_stripes() to read critical data from the source array into the buffer and then write the buffer to the external backup area with save_backup().
mdmon uses discard_backup() when notified of the new sync_completed.
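
As a worked example of the arithmetic involved, the sketch below mirrors how grow_backup_ext() in this patch derives the number of data disks and the suspend_hi value it writes. The function names are hypothetical, and the level test is only meaningful for raid0/4/5/6, which are the levels this series deals with.

```c
/* Data-bearing disks: raid4/raid5 give up one disk to parity, raid6 two. */
int data_disks(int level, int raid_disks)
{
	int data = raid_disks;

	if (level >= 4)
		data--;
	if (level == 6)
		data--;
	return data;
}

/*
 * Array-relative suspend_hi in 512-byte sectors.
 * offset is per device (sectors), stripes is per device,
 * chunk_bytes is the chunk size in bytes.
 */
unsigned long long suspend_hi_sectors(unsigned long long offset,
				      unsigned long long stripes,
				      int chunk_bytes,
				      int level, int raid_disks)
{
	return (offset + stripes * chunk_bytes / 512) *
		data_disks(level, raid_disks);
}
```

For a 4-disk raid5 with 64 KiB chunks, backing up 8 stripes from offset 0 gives (0 + 8 * 128) * 3 = 3072 sectors.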

Signed-off-by: Maciej Trela
Signed-off-by: Adam Kwolek
---

Grow.c | 314 ++++++++++++++++++++++++++++++++++++++++++++++++++++-----
managemon.c | 36 ++++++-
mdadm.h | 4 -
mdmon.h | 8 +
monitor.c | 78 ++++++++++++++
super-intel.c | 3 -
6 files changed, 411 insertions(+), 32 deletions(-)

diff --git a/Grow.c b/Grow.c
index 64fb1c2..7253e5a 100644
--- a/Grow.c
+++ b/Grow.c
@@ -422,7 +422,8 @@ static int child_shrink(int afd, struct mdinfo *sra, unsigned long blocks,
int *fds, unsigned long long *offsets,
int disks, int chunk, int level, int layout, int data,
int dests, int *destfd, unsigned long long *destoffsets);
-static int child_same_size(int afd, struct mdinfo *sra, unsigned long blocks,
+static int child_same_size(struct supertype *st,
+ int afd, struct mdinfo *sra, unsigned long blocks,
int *fds, unsigned long long *offsets,
unsigned long long start,
int disks, int chunk, int level, int layout, int data,
@@ -839,7 +840,6 @@ void reshape_free_fdlist(int **fdlist_in,
dprintf(Name " Error: Parameters verification error #1.\n");
return;
}
-
fdlist = *fdlist_in;
offsets = *offsets_in;
if ((fdlist == NULL) || (offsets == NULL)) {
@@ -1837,9 +1837,16 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
else
fd = -1;
mlockall(MCL_FUTURE);
-
+ sra->array.raid_disks = odisks;
+ sra->array.level = array.level;
+ sra->array.layout = olayout;
+ sra->array.chunk_size = ochunk;
+ sra->delta_disks = ndisks - odisks;
+ sra->new_level = (level == UnSet) ? array.level : level;
+ sra->new_layout = nlayout;
+ sra->new_chunk = nchunk;
if (odata < ndata)
- done = child_grow(fd, sra, stripes,
+ done = child_grow(st, fd, sra, stripes,
fdlist, offsets,
odisks, ochunk, array.level, olayout, odata,
d - odisks, fdlist+odisks, offsets+odisks);
@@ -1849,7 +1856,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
odisks, ochunk, array.level, olayout, odata,
d - odisks, fdlist+odisks, offsets+odisks);
else
- done = child_same_size(fd, sra, stripes,
+ done = child_same_size(st, fd, sra, stripes,
fdlist, offsets,
0,
odisks, ochunk, array.level, olayout, odata,
@@ -2198,31 +2205,233 @@ static void validate(int afd, int bfd, unsigned long long offset)
}
}

-int child_grow(int afd, struct mdinfo *sra, unsigned long stripes,
- int *fds, unsigned long long *offsets,
+int wait_reshape_completed_ext(struct supertype *st,
+ struct mdinfo *sra,
+ unsigned long long offset /* per device */)
+{
+
+ /* Wait for resync to pass the section that was backed up
+ * then erase the backup and allow IO
+ */
+ int fd = sysfs_get_fd(sra, NULL, "suspend_lo");
+ unsigned long long completed;
+
+ struct timeval timeout;
+
+ if (fd < 0)
+ return -1;
+ timeout.tv_sec = 0;
+ timeout.tv_usec = 500000;
+ do {
+ char action[20];
+ fd_set rfds;
+ FD_ZERO(&rfds);
+ FD_SET(fd, &rfds);
+ select(fd+1, NULL, NULL, &rfds, &timeout);
+ if (sysfs_fd_get_ll(fd, &completed) < 0) {
+ close(fd);
+ return -1;
+ }
+ if (sysfs_get_str(sra, NULL, "sync_action", action, 20) > 0) {
+ if (strncmp(action, "reshape", 7) != 0) {
+ close(fd);
+ return -2;
+ }
+ } else {
+ /* takeover support, when we will back to raid0
+ * sync_action sysfs entry disappears
+ * so we have to exit also
+ */
+ if (sysfs_get_str(sra, NULL, "level", action, 20) > 0) {
+ if (strncmp(action, "raid0", 5) == 0) {
+ close(fd);
+ return -2;
+ }
+ }
+ }
+ } while (completed < offset);
+ close(fd);
+
+ return 0;
+}
+
+void send_resync_max_to_mdmon(struct supertype *st,
+ struct mdinfo *sra,
+ unsigned long long resync_max)
+{
+ struct mdmon_update msg;
+ struct cmd_message cmd_msg;
+
+ cmd_msg.type = SET_SYNC_MAX;
+ cmd_msg.devnum = devname2devnum(sra->sys_name);
+ cmd_msg.msg_buf.new_sync_max = resync_max;
+ msg.buf = (void *)&cmd_msg;
+ msg.len = sizeof(cmd_msg);
+
+ send_mdmon_cmd(st, &msg);
+}
+
+int grow_backup_ext(struct supertype *st, struct mdinfo *sra,
+ unsigned long long offset, /* per device */
+ unsigned long long stripes, /* per device */
+ int *sources, unsigned long long *offsets,
+ int dests, int *destfd, unsigned long long *destoffsets,
+ int *degraded, char *buf)
+{
+ int disks = sra->array.raid_disks;
+ int chunk = sra->array.chunk_size;
+ int level = sra->array.level;
+ int layout = sra->array.layout;
+ unsigned long long new_degraded;
+ unsigned long long processed = 0;
+ unsigned long long read_offset = 0;
+ unsigned long long write_offset;
+ unsigned long long resync_max;
+ unsigned bytes_per_unit;
+ int new_disks, new_odata;
+ int odata = disks;
+ int retval = 0;
+ int rv = 0;
+ int i;
+
+ if (level >= 4)
+ odata--;
+ if (level == 6)
+ odata--;
+ sysfs_set_num(sra, NULL, "suspend_hi", (offset + stripes * chunk/512) * odata);
+ /* Check that array hasn't become degraded, else we might backup the wrong data */
+ sysfs_get_ll(sra, NULL, "degraded", &new_degraded);
+ if (new_degraded != (unsigned long long)*degraded) {
+ /* check each device to ensure it is still working */
+ struct mdinfo *sd;
+ for (sd = sra->devs ; sd ; sd = sd->next) {
+ if (sd->disk.state & (1<<MD_DISK_FAULTY))
+ continue;
+ if (sd->disk.state & (1<<MD_DISK_SYNC)) {
+ char sbuf[20];
+ if (sysfs_get_str(sra, sd, "state", sbuf, 20) < 0 ||
+ strstr(sbuf, "faulty") ||
+ strstr(sbuf, "in_sync") == NULL) {
+ /* this device is dead */
+ sd->disk.state = (1<<MD_DISK_FAULTY);
+ if (sd->disk.raid_disk >= 0 &&
+ sources[sd->disk.raid_disk] >= 0) {
+ close(sources[sd->disk.raid_disk]);
+ sources[sd->disk.raid_disk] = -1;
+ }
+ }
+ }
+ }
+ *degraded = new_degraded;
+ }
+
+ for (i = 0; i < dests; i++)
+ lseek64(destfd[i], destoffsets[i], 0);
+
+ /* save critical stripes to buf */
+ for (i = 0; i < (int)stripes; i++)
+ rv |= save_stripes(sources, offsets,
+ disks, chunk, level, layout,
+ dests, destfd,
+ offset * 512 * odata + (i * chunk * odata),
+ chunk * odata,
+ buf + (i * chunk * odata));
+
+ if (rv)
+ return rv;
+
+ new_disks = disks + sra->delta_disks;
+ new_odata = new_disks;
+ if (sra->new_level >= 4)
+ new_odata--;
+ if (sra->new_level == 6)
+ new_odata--;
+
+ write_offset = offset * 512 * new_odata;
+ bytes_per_unit = sra->new_chunk * new_odata;
+ if (chunk > sra->new_chunk)
+ bytes_per_unit *= (chunk / sra->new_chunk);
+ while ((processed < stripes * chunk * odata) ||
+ (processed == 0 && stripes * chunk * odata == 0)) {
+ int dn;
+ char *devname;
+
+ /* Save critical stripes to external backup */
+ if (st->ss->save_backup)
+ st->ss->save_backup(st, sra,
+ buf + read_offset,
+ write_offset,
+ bytes_per_unit);
+
+ /* send new sync_max to mdmon */
+ resync_max = write_offset / 512 / new_odata +
+ bytes_per_unit / 512 / new_odata;
+ send_resync_max_to_mdmon(st, sra, resync_max);
+
+ /* Wait for updated suspend_lo */
+ retval = wait_reshape_completed_ext(st, sra, resync_max * new_odata);
+ if (retval == -2) {
+ /* reshape has been finished
+ */
+ rv = -1;
+ break;
+ }
+
+ processed += bytes_per_unit;
+ read_offset += bytes_per_unit;
+ write_offset += bytes_per_unit;
+ sra->reshape_progress = write_offset / 512;
+
+ dn = devname2devnum(sra->text_version + 1);
+ devname = devnum2devname(dn);
+ if (devname) {
+ ping_monitor(devname);
+ free(devname);
+ }
+ }
+
+ return rv;
+}
+
+int child_grow(struct supertype *st, int afd, struct mdinfo *sra,
+ unsigned long stripes, int *fds, unsigned long long *offsets,
int disks, int chunk, int level, int layout, int data,
int dests, int *destfd, unsigned long long *destoffsets)
{
char *buf;
int degraded = 0;
+ int ext_backup = (st->ss->save_backup) ? 1 : 0;
+ unsigned int buf_size;

- if (posix_memalign((void**)&buf, 4096, disks * chunk))
+ buf_size = (ext_backup) ? stripes * disks * chunk :
+ (unsigned int)(disks * chunk);
+ if (posix_memalign((void **)&buf, 4096, buf_size))
/* Don't start the 'reshape' */
return 0;
sysfs_set_num(sra, NULL, "suspend_hi", 0);
sysfs_set_num(sra, NULL, "suspend_lo", 0);
- grow_backup(sra, 0, stripes,
- fds, offsets, disks, chunk, level, layout,
- dests, destfd, destoffsets,
- 0, &degraded, buf);
- validate(afd, destfd[0], destoffsets[0]);
- wait_backup(sra, 0, stripes * (chunk / 512), stripes * (chunk / 512),
- dests, destfd, destoffsets,
- 0);
+ if (ext_backup) {
+ grow_backup_ext(st, sra, 0, stripes, fds,
+ offsets, dests, destfd, destoffsets,
+ &degraded, buf);
+
+ /* Send resync_max=MAX (-1LLU) to mdmon */
+ send_resync_max_to_mdmon(st, sra, -1LLU);
+ } else {
+ grow_backup(sra, 0, stripes,
+ fds, offsets, disks, chunk, level, layout,
+ dests, destfd, destoffsets,
+ 0, &degraded, buf);
+ validate(afd, destfd[0], destoffsets[0]);
+ wait_backup(sra, 0, stripes * chunk / 512, stripes * chunk / 512,
+ dests, destfd, destoffsets,
+ 0);
+ sysfs_set_num(sra, NULL, "suspend_lo", (stripes * chunk/512) * data);
+ /* FIXME this should probably be numeric */
+ sysfs_set_str(sra, NULL, "sync_max", "max");
+ }
sysfs_set_num(sra, NULL, "suspend_lo", (stripes * (chunk/512)) * data);
free(buf);
- /* FIXME this should probably be numeric */
- sysfs_set_str(sra, NULL, "sync_max", "max");
return 1;
}

@@ -2253,7 +2462,7 @@ static int child_shrink(int afd, struct mdinfo *sra, unsigned long stripes,
dests, destfd, destoffsets,
0, &degraded, buf);
validate(afd, destfd[0], destoffsets[0]);
- wait_backup(sra, start, stripes*(chunk/512), 0,
+ wait_backup(sra, start, stripes*chunk/512, 0,
dests, destfd, destoffsets, 0);
sysfs_set_num(sra, NULL, "suspend_lo", (stripes * (chunk/512)) * data);
free(buf);
@@ -2262,11 +2471,58 @@ static int child_shrink(int afd, struct mdinfo *sra, unsigned long stripes,
return 1;
}

-static int child_same_size(int afd, struct mdinfo *sra, unsigned long stripes,
- int *fds, unsigned long long *offsets,
- unsigned long long start,
- int disks, int chunk, int level, int layout, int data,
- int dests, int *destfd, unsigned long long *destoffsets)
+static int child_same_size_ext(struct supertype *st, int afd, struct mdinfo *sra,
+ unsigned long stripes, int *fds,
+ unsigned long long *offsets, unsigned long long start,
+ int disks, int chunk, int level, int layout, int data,
+ int dests, int *destfd, unsigned long long *destoffsets)
+{
+ unsigned long long size;
+ unsigned long tailstripes = stripes;
+ char *buf;
+ unsigned long long speed;
+ int degraded = 0;
+ int status;
+
+ if (posix_memalign((void **)&buf, 4096, stripes * disks * chunk))
+ return 0;
+
+ sysfs_set_num(sra, NULL, "suspend_lo", 0);
+ sysfs_set_num(sra, NULL, "suspend_hi", 0);
+
+ sysfs_get_ll(sra, NULL, "sync_speed_min", &speed);
+ sysfs_set_num(sra, NULL, "sync_speed_min", 200000);
+
+ /* Start the reshape - give a chance to update the metadata */
+ sysfs_set_num(sra, NULL, "sync_max", 0);
+ sysfs_set_str(sra, NULL, "sync_action", "reshape");
+ flush_metadata_updates(st);
+
+ size = sra->component_size / (chunk/512);
+ while (start < size) {
+ if (start + stripes > size)
+ tailstripes = (size - start);
+
+ status = grow_backup_ext(st, sra, start*chunk/512, tailstripes,
+ fds, offsets,
+ dests, destfd, destoffsets,
+ &degraded, buf);
+ if (status == 0)
+ start += stripes;
+ else
+ break;
+ }
+ sysfs_set_num(sra, NULL, "sync_speed_min", speed);
+ free(buf);
+ return 1;
+}
+
+int child_same_size(struct supertype *st, int afd,
+ struct mdinfo *sra, unsigned long stripes,
+ int *fds, unsigned long long *offsets,
+ unsigned long long start,
+ int disks, int chunk, int level, int layout, int data,
+ int dests, int *destfd, unsigned long long *destoffsets)
{
unsigned long long size;
unsigned long tailstripes = stripes;
@@ -2275,6 +2531,13 @@ static int child_same_size(int afd, struct mdinfo *sra, unsigned long stripes,
unsigned long long speed;
int degraded = 0;

+ int ext_backup = (st->ss->save_backup) ? 1 : 0;
+
+ if (ext_backup)
+ return child_same_size_ext(st, afd, sra, stripes,
+ fds, offsets,
+ start, disks, chunk, level, layout, data,
+ dests, destfd, destoffsets);

if (posix_memalign((void**)&buf, 4096, disks * chunk))
return 0;
@@ -2298,6 +2561,7 @@ static int child_same_size(int afd, struct mdinfo *sra, unsigned long stripes,
validate(afd, destfd[0], destoffsets[0]);
part = 0;
start += stripes * 2; /* where to read next */
+
size = sra->component_size / (chunk/512);
while (start < size) {
if (wait_backup(sra, (start-stripes*2)*(chunk/512),
@@ -2754,7 +3018,7 @@ int Grow_continue(int mdfd, struct supertype *st, struct mdinfo *info,
*/
unsigned long long start = info->reshape_progress / ndata;
start /= (info->array.chunk_size/512);
- done = child_same_size(-1, info, stripes,
+ done = child_same_size(st, -1, info, stripes,
fds, offsets,
start,
info->array.raid_disks,
diff --git a/managemon.c b/managemon.c
index 9ff3632..abc1291 100644
--- a/managemon.c
+++ b/managemon.c
@@ -120,6 +120,7 @@ static void close_aa(struct active_array *aa)
close(aa->action_fd);
close(aa->info.state_fd);
close(aa->resync_start_fd);
+ close(aa->sync_completed_fd);
}

static void free_aa(struct active_array *aa)
@@ -431,6 +432,7 @@ static void manage_member(struct mdstat_ent *mdstat,
struct metadata_update *updates = NULL;
struct mdinfo *newdev = NULL;
struct mdinfo *d;
+ int delta_disks = a->reshape_delta_disks;

newdev = newa->container->ss->reshape_array(newa, reshape_in_progress, &updates);
if (newdev) {
@@ -465,6 +467,26 @@ static void manage_member(struct mdstat_ent *mdstat,
/* reshape executed
*/
dprintf("Reshape was started\n");
+ /* during reshape new_data_disks should be set
+ * for proper checkpointing handle
+ */
+ newa->old_data_disks = newa->info.array.raid_disks;
+ if (newa->info.array.level == 4)
+ newa->old_data_disks--;
+ if (newa->info.array.level == 5)
+ newa->old_data_disks--;
+ if (newa->info.array.level == 6)
+ newa->old_data_disks--;
+ newa->new_data_disks = newa->info.array.raid_disks + delta_disks;
+ if (level == 4)
+ newa->new_data_disks--;
+ if (level == 5)
+ newa->new_data_disks--;
+ if (level == 6)
+ newa->new_data_disks--;
+ newa->waiting_for = wait_grow_backup;
+ newa->grow_sync_max = 0;
+
replace_array(a->container, a, newa);
a = newa;
} else {
@@ -582,7 +604,7 @@ static void manage_new(struct mdstat_ent *mdstat,
return;

mdi = sysfs_read(-1, mdstat->devnum,
- GET_LEVEL|GET_CHUNK|GET_DISKS|GET_COMPONENT|
+ GET_LEVEL|GET_LAYOUT|GET_CHUNK|GET_DISKS|GET_COMPONENT|
GET_DEGRADED|GET_DEVS|GET_OFFSET|GET_SIZE|GET_STATE);

new = malloc(sizeof(*new));
@@ -745,6 +767,18 @@ static void handle_command(struct supertype *container, struct cmd_message *msg)
switch (msg->type) {
case SET_SYNC_MAX:
/* Add SET_SYNC_MAX handler here */
+ if (a->waiting_for == wait_grow_backup) {
+ if (msg->msg_buf.new_sync_max <= a->grow_sync_max) {
+ dprintf("%s: unexpected sync_max value: %llu <= %llu!\n",
+ __func__, msg->msg_buf.new_sync_max,
+ a->grow_sync_max);
+ }
+ a->grow_sync_max = msg->msg_buf.new_sync_max;
+ } else {
+ dprintf("%s: unexpected sync_max msg from mdadm!\n",
+ __func__);
+ }
+ wakeup_monitor();
break;
}
}
diff --git a/mdadm.h b/mdadm.h
index eacf0f5..7611a06 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -463,7 +463,8 @@ extern void reshape_free_fdlist(int **fdlist_in,
int size);
extern unsigned long compute_backup_blocks(int nchunk, int ochunk,
unsigned int ndata, unsigned int odata);
-extern int child_grow(int afd, struct mdinfo *sra, unsigned long stripes,
+extern int child_grow(struct supertype *st,
+ int afd, struct mdinfo *sra, unsigned long blocks,
int *fds, unsigned long long *offsets,
int disks, int chunk, int level, int layout, int data,
int dests, int *destfd, unsigned long long *destoffsets);
@@ -875,7 +876,6 @@ extern int Grow_restart(struct supertype *st, struct mdinfo *info,
int *fdlist, int cnt, char *backup_file, int verbose);
extern int Grow_continue(int mdfd, struct supertype *st,
struct mdinfo *info, char *backup_file);
-
extern int Assemble(struct supertype *st, char *mddev,
mddev_ident_t ident,
mddev_dev_t devlist, char *backup_file,
diff --git a/mdmon.h b/mdmon.h
index 2c41e47..6e86994 100644
--- a/mdmon.h
+++ b/mdmon.h
@@ -23,6 +23,7 @@ enum array_state { clear, inactive, suspended, readonly, read_auto,

enum sync_action { idle, reshape, resync, recover, check, repair, bad_action };

+enum reshape_wait { wait_grow_backup, wait_md_reshape };

enum state_of_reshape { reshape_not_active, reshape_is_starting, reshape_in_progress, reshape_cancel_request };

@@ -49,9 +50,10 @@ struct active_array {

enum state_of_reshape reshape_state;
int reshape_delta_disks;
- int waiting_resync_max; /* wait for resync_max cmd from mdadm */
- long long unsigned resync_max;
- long long unsigned sync_completed;
+ unsigned long long grow_sync_max; /* sync_max from mdadm Grow */
+ enum reshape_wait waiting_for; /* we can wait for grow backup event
+ or for md reshape completed */
+ int old_data_disks, new_data_disks;

int check_degraded; /* flag set by mon, read by manage */

diff --git a/monitor.c b/monitor.c
index 3e26a8a..2a92dee 100644
--- a/monitor.c
+++ b/monitor.c
@@ -218,12 +218,17 @@ static int read_and_act(struct active_array *a)
int deactivate = 0;
struct mdinfo *mdi;
int dirty = 0;
+ long long unsigned new_sync_completed;
+ long long unsigned curr_sync_max;
+ unsigned long long safe_sync_max;
+ int signal_md_reshape = 0;

a->next_state = bad_word;
a->next_action = bad_action;

a->curr_state = read_state(a->info.state_fd);
a->curr_action = read_action(a->action_fd);
+ new_sync_completed = read_resync_start(a->sync_completed_fd);
a->info.resync_start = read_resync_start(a->resync_start_fd);
sync_completed = read_sync_completed(a->sync_completed_fd);
for (mdi = a->info.devs; mdi ; mdi = mdi->next) {
@@ -234,6 +239,79 @@ static int read_and_act(struct active_array *a)
}
}

+ if (a->curr_action == reshape && a->waiting_for == wait_grow_backup) {
+ /* We are waiting for mdadm Grow backup completed
+ */
+ sysfs_get_ll(&a->info, NULL, "sync_max", &curr_sync_max);
+ if (a->grow_sync_max > curr_sync_max) {
+ /* grow_resync_max was update by mdadm:
+ * continue the reshape with md
+ */
+ signal_md_reshape = 1;
+ }
+ }
+
+ if (a->curr_action == reshape && a->waiting_for == wait_md_reshape) {
+ /* We are waiting for md reshape completed.
+ * note: if new_sync_completed == 0 md completed the reshape
+ */
+ if (new_sync_completed > 0) {
+ /* It is possible that sync_completed = sync_max + 2 */
+ new_sync_completed &= ~(a->info.array.chunk_size / 512 - 1);
+
+ if (new_sync_completed * a->new_data_disks >= a->info.reshape_progress) {
+ a->info.reshape_progress = new_sync_completed * a->new_data_disks;
+
+ /* write_metadata: migration record */
+ a->container->ss->discard_backup(a->container, &a->info);
+ }
+
+ sysfs_get_ll(&a->info, NULL, "sync_max", &curr_sync_max);
+ if (curr_sync_max == 0)
+ /* sync_max was set to max */
+ curr_sync_max = -1LLU;
+
+ if (new_sync_completed >= curr_sync_max) {
+
+ if (sysfs_set_num(&a->info, NULL, "suspend_lo",
+ new_sync_completed * a->new_data_disks) != 0)
+ dprintf("mdmon: setting suspend_lo() FAILED!\n");
+
+ if (a->grow_sync_max != -1LLU)
+ /* Still have to wait for mdadm Grow backup */
+ a->waiting_for = wait_grow_backup;
+ else
+ /* calculate next sync_max and wait for md*/
+ signal_md_reshape = 1;
+ }
+ } else {
+ /* reshape was finished. should we do something here? */
+ }
+ }
+
+ if (a->curr_action == reshape && signal_md_reshape == 1) {
+ if (a->grow_sync_max == -1LLU) {
+ /* calculate next safe sync_max for the reshape */
+ safe_sync_max = a->info.reshape_progress / a->old_data_disks;
+ safe_sync_max &= ~(a->info.array.chunk_size / 512 - 1);
+
+ if (safe_sync_max >= a->info.component_size)
+ sysfs_set_str(&a->info, NULL, "sync_max", "max");
+ else {
+ /* Workarround:
+ * sometimes md reports sync_completed == 2 but in fact it is 0
+ */
+ if ((new_sync_completed == 2) && (safe_sync_max == 0))
+ safe_sync_max = 2;
+ sysfs_set_num(&a->info, NULL, "sync_max", safe_sync_max);
+ }
+ } else {
+ sysfs_set_num(&a->info, NULL, "sync_max", a->grow_sync_max);
+ }
+ /* sync_max was set. wait for md. */
+ a->waiting_for = wait_md_reshape;
+ }
+
if (a->curr_state <= inactive &&
a->prev_state > inactive) {
/* array has been stopped */
diff --git a/super-intel.c b/super-intel.c
index 7eb7107..7e755fe 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -7911,7 +7911,8 @@ int imsm_child_grow(struct supertype *st, char *devname, int validate_fd, struct
stripes = blocks / (sra->array.chunk_size/512) / odata;
/* child grow returns fixed value == 1
*/
- child_grow(validate_fd, sra, stripes,
+
+ child_grow(st, validate_fd, sra, stripes,
fdlist, offsets,
odisks, sra->array.chunk_size,
sra->array.level, -1, odata,


[PATCH 46/53] FIX: Allow for reshape without backup file

am 26.11.2010 09:09:48 von adam.kwolek

When the reshape process is guarded by metadata-specific checkpointing,
a backup file is no longer necessary.
Remove the backup file requirement from the mdadm command line when
reshape_super and manage_reshape are defined for the external metadata case.
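
The condition described above can be sketched as a small predicate. This is an illustration only: struct meta_caps and backup_file_optional() are hypothetical names that stand in for the st->ss->external flag and the presence of the reshape_super and manage_reshape handlers.

```c
#include <stdbool.h>

/* Hypothetical summary of what a metadata handler provides. */
struct meta_caps {
	bool external;			/* metadata is managed outside md */
	bool has_reshape_super;		/* reshape_super handler is defined */
	bool has_manage_reshape;	/* manage_reshape handler is defined */
};

/*
 * True when checkpointing in the metadata handler can replace the
 * backup file, so mdadm need not insist on --backup-file.
 */
bool backup_file_optional(const struct meta_caps *caps)
{
	return caps->external &&
	       caps->has_reshape_super &&
	       caps->has_manage_reshape;
}
```

Native metadata never satisfies the predicate, so the existing "need backup-file" and "no spare device" errors still apply there.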

Signed-off-by: Adam Kwolek
---

Grow.c | 25 +++++++++++++++++--------
1 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/Grow.c b/Grow.c
index 7253e5a..fdc5bfd 100644
--- a/Grow.c
+++ b/Grow.c
@@ -1659,6 +1659,11 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
goto release;
}
if (backup_file == NULL) {
+ int backup_file_required_for_external;
+
+ backup_file_required_for_external = st->ss->external &&
+ st->ss->reshape_super && st->ss->manage_reshape;
+
if (st->ss->external && !st->ss->manage_reshape) {
fprintf(stderr, Name ": %s Grow operation not supported by %s metadata\n",
devname, st->ss->name);
@@ -1666,10 +1671,12 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
break;
}
if (ndata <= odata) {
- fprintf(stderr, Name ": %s: Cannot grow - need backup-file\n",
- devname);
- rv = 1;
- break;
+ if (!backup_file_required_for_external) {
+ fprintf(stderr, Name ": %s: Cannot grow - need backup-file\n",
+ devname);
+ rv = 1;
+ break;
+ }
} else if (sra->array.spare_disks == 0) {
fprintf(stderr, Name ": %s: Cannot grow - need a spare or "
"backup-file to backup critical section\n",
@@ -1678,10 +1685,12 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
break;
}
if (d == array.raid_disks) {
- fprintf(stderr, Name ": %s: No spare device for backup\n",
- devname);
- rv = 1;
- break;
+ if (!backup_file_required_for_external) {
+ fprintf(stderr, Name ": %s: No spare device for backup\n",
+ devname);
+ rv = 1;
+ break;
+ }
}
} else {
/* need to check backup file is large enough */


[PATCH 47/53] FIX: Honor !reshape state on wait_reshape() entry

am 26.11.2010 09:09:55 von adam.kwolek

When wait_reshape() starts, the reshape may already have finished.
In that case the function can wait a long time for a state change that never comes.
To avoid this, test whether the finish condition has already been reached before waiting.
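
The check-before-wait pattern can be sketched independently of sysfs. In the sketch below the callbacks and the scripted state source are hypothetical test doubles; the real function reads sync_action and blocks in select().

```c
#include <string.h>

typedef int (*read_action_fn)(void *ctx, char *buf, size_t len);
typedef void (*wait_fn)(void *ctx);

/*
 * Read the state first and block only while it is still "reshape".
 * Returns the number of times we had to wait, or -1 if the first
 * read failed.
 */
int wait_while_reshaping(read_action_fn read_action, wait_fn wait_change,
			 void *ctx)
{
	char action[20];
	int waits = 0;

	if (read_action(ctx, action, sizeof(action)) < 0)
		return -1;
	while (strncmp(action, "reshape", 7) == 0) {
		wait_change(ctx);
		waits++;
		if (read_action(ctx, action, sizeof(action)) < 0)
			break;
	}
	return waits;
}

/* Scripted state source: each wait moves on to the next state. */
struct script {
	const char **states;
	int count;
	int pos;
};

int script_read(void *ctx, char *buf, size_t len)
{
	struct script *s = ctx;

	if (s->pos >= s->count)
		return -1;
	strncpy(buf, s->states[s->pos], len - 1);
	buf[len - 1] = '\0';
	return 0;
}

void script_wait(void *ctx)
{
	struct script *s = ctx;

	s->pos++;
}
```

With a do/while loop, the first script entry being "idle" would still cost one blocking wait; with the loop above it costs none.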

Signed-off-by: Adam Kwolek
---

Grow.c | 19 ++++++++++++-------
1 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/Grow.c b/Grow.c
index fdc5bfd..d5f3f16 100644
--- a/Grow.c
+++ b/Grow.c
@@ -507,17 +507,22 @@ void wait_reshape(struct mdinfo *sra)
int fd = sysfs_get_fd(sra, NULL, "sync_action");
char action[20];

- do {
+ if (fd < 0)
+ return;
+
+ if (sysfs_fd_get_str(fd, action, 20) < 0) {
+ close(fd);
+ return;
+ }
+ while (strncmp(action, "reshape", 7) == 0) {
fd_set rfds;
FD_ZERO(&rfds);
FD_SET(fd, &rfds);
select(fd+1, NULL, NULL, &rfds, NULL);
-
- if (sysfs_fd_get_str(fd, action, 20) < 0) {
- close(fd);
- return;
- }
- } while (strncmp(action, "reshape", 7) == 0);
+ if (sysfs_fd_get_str(fd, action, 20) < 0)
+ break;
+ }
+ close(fd);
}

static int reshape_super(struct supertype *st, long long size, int level,


[PATCH 48/53] WORKAROUND: md reports idle state during reshape start

am 26.11.2010 09:10:03 von adam.kwolek

md reports a reshape->idle->reshape state transition on reshape start, so reshape finalization is wrongly indicated.
Finalize the reshape only when some progress has been made.
Once the reshape has really started, an idle state causes reshape finalization as usual.
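
The guarded finalization test can be written as a predicate. This is a sketch with hypothetical names; the threshold of 2 is taken from the condition added by this patch, and the enum is a cut-down version of the one in mdmon.h.

```c
#include <stdbool.h>

enum sim_sync_action { SIM_IDLE, SIM_RESHAPE };

/*
 * A reshape counts as finished only when the action has left "reshape"
 * and some progress was recorded, so the spurious idle state seen at
 * reshape start is ignored.
 */
bool reshape_finished(enum sim_sync_action prev, enum sim_sync_action curr,
		      unsigned long long reshape_progress)
{
	return curr != SIM_RESHAPE &&
	       prev == SIM_RESHAPE &&
	       reshape_progress > 2;
}
```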

Signed-off-by: Adam Kwolek
---

monitor.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/monitor.c b/monitor.c
index 2a92dee..59d9bd6 100644
--- a/monitor.c
+++ b/monitor.c
@@ -387,7 +387,8 @@ static int read_and_act(struct active_array *a)
/* finalize reshape detection
*/
if ((a->curr_action != reshape) &&
- (a->prev_action == reshape)) {
+ (a->prev_action == reshape) &&
+ (a->info.reshape_progress > 2)) {
/* set zero to allow for future rebuilds
*/
a->reshape_state = reshape_not_active;


[PATCH 49/53] imsm Fix: Core during rebuild on array details read

am 26.11.2010 09:10:10 von adam.kwolek

When a rebuild/reshape is in progress, running mdadm to read array details causes a core dump.
This is due to an uninitialized device list pointer in the getinfo_super() call.
Initializing it to NULL allows the code to detect this situation.

Signed-off-by: Adam Kwolek
---

Detail.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/Detail.c b/Detail.c
index 0fb90e8..a2d72f6 100644
--- a/Detail.c
+++ b/Detail.c
@@ -140,6 +140,7 @@ int Detail(char *dev, int brief, int export, int test, char *homehost)
close(fd2);
if (err)
continue;
+ info.devs = NULL;
st->ss->getinfo_super(st, &info);

if (array.raid_disks != 0 && /* container */


[PATCH 50/53] Change manage_reshape() placement

am 26.11.2010 09:10:18 von adam.kwolek

After the reshape_super() call, manage_reshape() should do the same things
as Grow_reshape() does for the native metadata case (for execution on the array).
The only difference is at reshape finish, when md completes its work.
For external metadata, size is managed externally from md's point of view,
so a metadata-specific action is required there.
For this reason manage_reshape() is moved so that it adds only the necessary
actions to the common flow and does not duplicate current code.

Signed-off-by: Adam Kwolek
---

Grow.c | 23 +++++++++++++----------
1 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/Grow.c b/Grow.c
index d5f3f16..57ca66c 100644
--- a/Grow.c
+++ b/Grow.c
@@ -1799,14 +1799,6 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
break;
}

- if (st->ss->external) {
- /* metadata handler takes it from here */
- ping_manager(container);
- st->ss->manage_reshape(st, backup_file);
- frozen = 0;
- break;
- }
-
/* set up the backup-super-block. This requires the
* uuid from the array.
*/
@@ -1877,6 +1869,15 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
d - odisks, fdlist+odisks, offsets+odisks);
if (backup_file && done)
unlink(backup_file);
+
+ /* manage/finalize reshape in metadata specific way
+ */
+ close(fd);
+ if (st->ss->external && st->ss->manage_reshape) {
+ st->ss->manage_reshape(st, backup_file);
+ break;
+ }
+
if (level != UnSet && level != array.level) {
/* We need to wait for the reshape to finish
* (which will have happened unless odata < ndata)
@@ -1887,8 +1888,10 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
if (c == NULL)
exit(0);/* not possible */

- if (odata < ndata)
- wait_reshape(sra);
+ /* child process has always wait for reshape finish
+ * to perform unfreeze
+ */
+ wait_reshape(sra);
err = sysfs_set_str(sra, NULL, "level", c);
if (err)
fprintf(stderr, Name ": %s: could not set level to %s\n",


[PATCH 51/53] Migration: raid5->raid0

am 26.11.2010 09:10:25 von adam.kwolek

Add implementation for migration from raid5 to raid0 in one step.
For this migration case (and others for the external metadata case)
the flow used for Expansion is reused. As a result, array parameters are updated
in managemon based on the metadata update that was sent. For this to work, updating md parameters
in Grow.c has to be disabled for the external metadata case.

In Grow.c, instead of starting the reshape for the external metadata case,
the wait_reshape_start_ext() function is introduced.
The function waits for the reshape start initiated by managemon after it sets
the array parameters, as in the Expansion case.

The subarray_set_num_man() function was added to managemon.
It is similar to the function that exists in Grow.c except for 2 things:
1. it uses a different way to "ping" the monitor
2. it tries to set raid_disks more than 2 times, as we are more sure that the monitor works
during processing in managemon context

For imsm, a flow of raid level parameters from mdadm (via metadata update)
to managemon was added.

Signed-off-by: Adam Kwolek
---

Grow.c | 93 ++++++++++++++++------
managemon.c | 102 ++++++++++++++++++++++---
mdadm.h | 2
mdmon.h | 3 +
super-intel.c | 237 ++++++++++++++++++++++++++++++++++++++++++++++++++++-----
5 files changed, 379 insertions(+), 58 deletions(-)

diff --git a/Grow.c b/Grow.c
index 57ca66c..ac9e5ea 100644
--- a/Grow.c
+++ b/Grow.c
@@ -1768,28 +1768,32 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
break;
}
} else {
- /* set them all just in case some old 'new_*' value
- * persists from some earlier problem
+ /* set parametes here only if managemon is not responsible for this
*/
- int err = err; /* only used if rv==1, and always set if
- * rv==1, so initialisation not needed,
- * despite gcc warning
- */
- if (sysfs_set_num(sra, NULL, "chunk_size", nchunk) < 0)
- rv = 1, err = errno;
- if (!rv && sysfs_set_num(sra, NULL, "layout", nlayout) < 0)
- rv = 1, err = errno;
- if (!rv && sysfs_set_num(sra, NULL, "raid_disks", ndisks) < 0)
- rv = 1, err = errno;
- if (rv) {
- fprintf(stderr, Name ": Cannot set device shape for %s\n",
- devname);
- if (get_linux_version() < 2006030)
- fprintf(stderr, Name ": linux 2.6.30 or later required\n");
- if (err == EBUSY &&
- (array.state & (1<<MD_SB_BITMAP_PRESENT)))
- fprintf(stderr, " Bitmap must be removed before shape can be changed\n");
- break;
+ if ((st->ss->external == 0) || (st->ss->reshape_super == NULL)) {
+ /* set them all just in case some old 'new_*' value
+ * persists from some earlier problem
+ */
+ int err = err; /* only used if rv==1, and always set if
+ * rv==1, so initialisation not needed,
+ * despite gcc warning
+ */
+ if (sysfs_set_num(sra, NULL, "chunk_size", nchunk) < 0)
+ rv = 1, err = errno;
+ if (!rv && sysfs_set_num(sra, NULL, "layout", nlayout) < 0)
+ rv = 1, err = errno;
+ if (!rv && sysfs_set_num(sra, NULL, "raid_disks", ndisks) < 0)
+ rv = 1, err = errno;
+ if (rv) {
+ fprintf(stderr, Name ": Cannot set device shape for %s\n",
+ devname);
+ if (get_linux_version() < 2006030)
+ fprintf(stderr, Name ": linux 2.6.30 or later required\n");
+ if (err == EBUSY &&
+ (array.state & (1<<MD_SB_BITMAP_PRESENT)))
+ fprintf(stderr, " Bitmap must be removed before shape can be changed\n");
+ break;
+ }
}
}

@@ -2272,6 +2276,42 @@ int wait_reshape_completed_ext(struct supertype *st,
return 0;
}

+int wait_reshape_start_ext(struct supertype *st, struct mdinfo *sra)
+{
+#define WAIT_FOR_RESHAPE_START 20
+ int wait_time = WAIT_FOR_RESHAPE_START;
+ int ret_val = -1;
+ char *container = devnum2devname(st->devnum);
+
+ if (container == NULL) {
+ dprintf("wait_reshape_start_ext: cannot find container.\n");
+ return ret_val;
+ }
+ ping_manager(container);
+ ping_monitor(container);
+ while (wait_time) {
+ char action[20];
+ dprintf("wait_reshape_start_ext Waiting for reshape state (%i) ...\n", WAIT_FOR_RESHAPE_START - wait_time + 1);
+ if (sysfs_get_str(sra, NULL, "sync_action", action, 20) < 0) {
+ dprintf("Error: wait_reshape_start_ext cannot read sync_action\n");
+ break;
+ }
+ dprintf("wait_reshape_start_ext: read from sysfs: %s\n", action);
+ if (strncmp(action, "reshape", 7) == 0) {
+ dprintf("wait_reshape_start_ext: reshape started.\n");
+ ret_val = 0;
+ break;
+ }
+ ping_manager(container);
+ ping_monitor(container);
+ sleep(1);
+ wait_time--;
+ }
+
+ free(container);
+ return ret_val;
+}
+
void send_resync_max_to_mdmon(struct supertype *st,
struct mdinfo *sra,
unsigned long long resync_max)
@@ -2510,10 +2550,13 @@ static int child_same_size_ext(struct supertype *st, int afd, struct mdinfo *sra
sysfs_get_ll(sra, NULL, "sync_speed_min", &speed);
sysfs_set_num(sra, NULL, "sync_speed_min", 200000);

- /* Start the reshape - give a chance to update the metadata */
- sysfs_set_num(sra, NULL, "sync_max", 0);
- sysfs_set_str(sra, NULL, "sync_action", "reshape");
- flush_metadata_updates(st);
+ /* wait until reshape is started by managemon
+ * - give a chance to update the metadata */
+ if (wait_reshape_start_ext(st, sra)) {
+ dprintf("Error: Reshape not started\n");
+ free(buf);
+ return -1;
+ }

size = sra->component_size / (chunk/512);
while (start < size) {
diff --git a/managemon.c b/managemon.c
index abc1291..2514963 100644
--- a/managemon.c
+++ b/managemon.c
@@ -379,6 +379,43 @@ static int disk_init_and_add(struct mdinfo *disk, struct mdinfo *clone,
return 0;
}

+int subarray_set_num_man(char *container, struct mdinfo *sra, char *name, int n)
+{
+ /* when dealing with external metadata subarrays we need to be
+ * prepared to handle EAGAIN. The kernel may need to wait for
+ * mdmon to mark the array active so the kernel can handle
+ * allocations/writeback when preparing the reshape action
+ * (md_allow_write()). We temporarily disable safe_mode_delay
+ * to close a race with the array_state going clean before the
+ * next write to raid_disks / stripe_cache_size
+ */
+ char safe[50];
+ int rc;
+#define MANAGEMON_COUNTER 20
+ int counter = MANAGEMON_COUNTER;
+
+ /* only 'raid_disks' and 'stripe_cache_size' trigger md_allow_write */
+ if (strcmp(name, "raid_disks") != 0 &&
+ strcmp(name, "stripe_cache_size") != 0)
+ return sysfs_set_num(sra, NULL, name, n);
+
+ rc = sysfs_get_str(sra, NULL, "safe_mode_delay", safe, sizeof(safe));
+ if (rc <= 0)
+ return -1;
+ sysfs_set_num(sra, NULL, "safe_mode_delay", 0);
+ rc = sysfs_set_num(sra, NULL, name, n);
+ while ((rc < 0) && counter) {
+ counter--;
+ dprintf("managemon: Try to set %s to value %i (%i time(s)).\n", name, n, MANAGEMON_COUNTER - counter);
+ wakeup_monitor();
+ usleep(250000);
+ rc = sysfs_set_num(sra, NULL, name, n);
+ }
+ sysfs_set_str(sra, NULL, "safe_mode_delay", safe);
+ return rc;
+}
+
+
static void manage_member(struct mdstat_ent *mdstat,
struct active_array *a)
{
@@ -433,17 +470,17 @@ static void manage_member(struct mdstat_ent *mdstat,
struct mdinfo *newdev = NULL;
struct mdinfo *d;
int delta_disks = a->reshape_delta_disks;
+ int status_ok = 1;

+ newa = duplicate_aa(a);
+ if (newa == NULL) {
+ a->reshape_state = reshape_not_active;
+ goto reshape_out;
+ }
newdev = newa->container->ss->reshape_array(newa, reshape_in_progress, &updates);
if (newdev) {
- int status_ok = 1;
- newa = duplicate_aa(a);
- if (newa == NULL)
- goto reshape_out;
-
for (d = newdev; d ; d = d->next) {
struct mdinfo *newd;
-
newd = malloc(sizeof(*newd));
if (!newd) {
status_ok = 0;
@@ -458,11 +495,41 @@ static void manage_member(struct mdstat_ent *mdstat,
}
disk_init_and_add(newd, d, newa);
}
- /* go with reshape
+ }
+ if (newa->reshape_state == reshape_in_progress) {
+ /* set reshape parameters
*/
- if (status_ok)
+ if (status_ok) {
+ dprintf("managemon: set sync_max to 0\n");
if (sysfs_set_num(&newa->info, NULL, "sync_max", 0) < 0)
status_ok = 0;
+ }
+
+ if (status_ok && newa->reshape_raid_disks) {
+ dprintf("managemon: set raid_disks to %i\n", newa->reshape_raid_disks);
+ if (subarray_set_num_man(a->container->devname, &newa->info, "raid_disks", newa->reshape_raid_disks))
+ status_ok = 0;
+ }
+
+ if (status_ok && newa->reshape_level > -1) {
+ char *c = map_num(pers, newa->reshape_level);
+ if (c == NULL)
+ status_ok = 0;
+ else {
+ dprintf("managemon: set level to %s\n", c);
+ if (sysfs_set_str(&newa->info, NULL, "level", c) < 0)
+ status_ok = 0;
+ }
+ }
+
+ if (status_ok && newa->reshape_layout >= 0) {
+ dprintf("managemon: set layout to %i\n", newa->reshape_layout);
+ if (sysfs_set_num(&newa->info, NULL, "layout", newa->reshape_layout) < 0)
+ status_ok = 0;
+ }
+
+ /* go with reshape
+ */
if (status_ok && sysfs_set_str(&newa->info, NULL, "sync_action", "reshape") == 0) {
/* reshape executed
*/
@@ -477,7 +544,10 @@ static void manage_member(struct mdstat_ent *mdstat,
newa->old_data_disks--;
if (newa->info.array.level == 6)
newa->old_data_disks--;
- newa->new_data_disks = newa->info.array.raid_disks + delta_disks;
+ if (newa->reshape_raid_disks > 0)
+ newa->new_data_disks = newa->reshape_raid_disks;
+ else
+ newa->new_data_disks = newa->info.array.raid_disks + delta_disks;
if (level == 4)
newa->new_data_disks--;
if (level == 5)
@@ -489,28 +559,38 @@ static void manage_member(struct mdstat_ent *mdstat,

replace_array(a->container, a, newa);
a = newa;
+ newa = NULL;
} else {
/* on problems cancel update
*/
- free_aa(newa);
free_updates(&updates);
updates = NULL;
+
a->container->ss->reshape_array(a, reshape_cancel_request, &updates);
sysfs_set_str(&a->info, NULL, "sync_action", "idle");
+ a->reshape_state = reshape_not_active;
}
}
+reshape_out:
+ if (a->reshape_state == reshape_not_active) {
+ dprintf("Cancel reshape.\n");
+ a->container->ss->reshape_array(a, reshape_cancel_request, &updates);
+ sysfs_set_str(&a->info, NULL, "sync_action", "idle");
+ }
dprintf("Send metadata update for reshape.\n");

queue_metadata_update(updates);
updates = NULL;
wakeup_monitor();
-reshape_out:
+
while (newdev) {
d = newdev->next;
free(newdev);
newdev = d;
}
free_updates(&updates);
+ if (newa)
+ free_aa(newa);
}
}

diff --git a/mdadm.h b/mdadm.h
index 7611a06..d9ea545 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -428,6 +428,8 @@ extern int sysfs_attr_match(const char *attr, const char *str);
extern int sysfs_match_word(const char *word, char **list);
extern int sysfs_set_str(struct mdinfo *sra, struct mdinfo *dev,
char *name, char *val);
+extern int sysfs_get_ll(struct mdinfo *sra, struct mdinfo *dev,
+ char *name, unsigned long long *val);
extern int sysfs_set_num(struct mdinfo *sra, struct mdinfo *dev,
char *name, unsigned long long val);
extern int sysfs_uevent(struct mdinfo *sra, char *event);
diff --git a/mdmon.h b/mdmon.h
index 6e86994..259ea82 100644
--- a/mdmon.h
+++ b/mdmon.h
@@ -50,6 +50,9 @@ struct active_array {

enum state_of_reshape reshape_state;
int reshape_delta_disks;
+ int reshape_raid_disks;
+ int reshape_level;
+ int reshape_layout;
unsigned long long grow_sync_max; /* sync_max from mdadm Grow */
enum reshape_wait waiting_for; /* we can wait for grow backup event
or for md reshape completed */
diff --git a/super-intel.c b/super-intel.c
index 7e755fe..444ae3a 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -344,6 +344,9 @@ struct imsm_update_reshape {
enum imsm_update_type type;
int update_memory_size;
int reshape_delta_disks;
+ int reshape_raid_disks;
+ int reshape_level;
+ int reshape_layout;
int disks_count;
int spares_in_update;
int devnum;
@@ -5684,6 +5687,7 @@ static void imsm_process_update(struct supertype *st,
__u32 new_mpb_size;
int new_disk_num;
struct intel_dev *current_dev;
+ struct imsm_dev *new_dev;

dprintf("imsm: imsm_process_update() for update_reshape [u->update_prepared = %i]\n", u->update_prepared);
if ((u->update_prepared == -1) ||
@@ -5741,11 +5745,12 @@ static void imsm_process_update(struct supertype *st,
}
/* find current dev in intel_super
*/
- dprintf("\t\tLooking for volume %s\n", (char *)u->devs_mem.dev->volume);
+ new_dev = (struct imsm_dev *)((void *)u + u->upd_devs_offset);
+ dprintf("\t\tLooking for volume %s\n", (char *)new_dev->volume);
current_dev = super->devlist;
while (current_dev) {
if (strcmp((char *)current_dev->dev->volume,
- (char *)u->devs_mem.dev->volume) == 0)
+ (char *)new_dev->volume) == 0)
break;
current_dev = current_dev->next;
}
@@ -5764,7 +5769,14 @@ static void imsm_process_update(struct supertype *st,
/* set reshape_delta_disks
*/
a->reshape_delta_disks = u->reshape_delta_disks;
+ a->reshape_raid_disks = u->reshape_raid_disks;
a->reshape_state = reshape_is_starting;
+ a->reshape_level = u->reshape_level;
+ a->reshape_layout = u->reshape_layout;
+ if (a->reshape_level == 0) {
+ a->reshape_level = 5;
+ a->reshape_layout = 5;
+ }

/* Clear migration record */
memset(super->migr_rec, 0, sizeof(struct migr_record));
@@ -6282,12 +6294,7 @@ static void imsm_prepare_update(struct supertype *st,
if (u->reshape_delta_disks < 0)
break;
u->update_prepared = 1;
- if (u->reshape_delta_disks == 0) {
- /* for non growing reshape buffers sizes are not affected
- * but check some parameters
- */
- break;
- }
+
/* count HDDs
*/
u->disks_count = 0;
@@ -7064,6 +7071,106 @@ abort:
return ret_val;
}

+/*****************************************************************************
+ * Function: update_geometry
+ * Description: Prepares imsm volume map update in case of volume reshape
+ * Returns: 0 on success, -1 if fail
+ * ***************************************************************************/
+int update_geometry(struct supertype *st,
+ struct geo_params *geo)
+{
+ int fd = -1, ret_val = -1;
+ struct mdinfo *sra = NULL;
+ char buf[PATH_MAX];
+ char supported = 1;
+
+ snprintf(buf, PATH_MAX, "/dev/md%i", geo->dev_id);
+ fd = open(buf , O_RDONLY | O_DIRECT);
+ if (fd < 0) {
+ dprintf("imsm: cannot open device\n");
+ return -1;
+ }
+
+ sra = sysfs_read(fd, 0, GET_DISKS | GET_LAYOUT | GET_CHUNK | GET_SIZE | GET_LEVEL | GET_DEVS);
+ if (!sra) {
+ dprintf("imsm: Cannot get mdinfo!\n");
+ goto update_geometry_exit;
+ }
+
+ if (sra->devs == NULL) {
+ dprintf("imsm: Cannot load device information.\n");
+ goto update_geometry_exit;
+ }
+ /* is size change possible??? */
+ if (((unsigned long long)geo->size != sra->devs->component_size) && (geo->size != UnSet) && (geo->size > 0)) {
+ geo->size = sra->devs->component_size;
+ dprintf("imsm: Change the array size not supported in imsm!\n");
+ goto update_geometry_exit;
+ }
+
+ if ((geo->level != sra->array.level) && (geo->level >= 0) && (geo->level != UnSet)) {
+ switch (sra->array.level) {
+ case 0:
+ if (geo->level != 5)
+ supported = 0;
+ break;
+ case 5:
+ if (geo->level != 0)
+ supported = 0;
+ break;
+ case 1:
+ if ((geo->level != 5) && (geo->level != 0))
+ supported = 0;
+ break;
+ case 10:
+ if (geo->level != 5)
+ supported = 0;
+ break;
+ default:
+ supported = 0;
+ break;
+ }
+ if (!supported) {
+ dprintf("imsm: Error. Level Migration from %d to %d not supported!\n", sra->array.level, geo->level);
+ goto update_geometry_exit;
+ }
+ } else {
+ geo->level = sra->array.level;
+ }
+
+ if ((geo->layout != sra->array.layout) && ((geo->layout != UnSet) && (geo->layout != -1))) {
+ if ((sra->array.layout == 0) && (sra->array.level == 5) && (geo->layout == 5)) {
+ /* reshape 5 -> 4 */
+ geo->raid_disks++;
+ } else if ((sra->array.layout == 5) && (sra->array.level == 5) && (geo->layout == 0)) {
+ /* reshape 4 -> 5 */
+ geo->layout = 0;
+ geo->level = 5;
+ } else {
+ dprintf("imsm: Error. Layout Migration from %d to %d not supported!\n", sra->array.layout, geo->layout);
+ ret_val = -1;
+ goto update_geometry_exit;
+ }
+ }
+
+ if ((geo->chunksize == 0) || (geo->chunksize == UnSet))
+ geo->chunksize = sra->array.chunk_size;
+
+ if (!validate_geometry_imsm(st, geo->level, geo->layout, geo->raid_disks,
+ geo->chunksize, geo->size,
+ 0, 0, 1))
+ goto update_geometry_exit;
+
+ ret_val = 0;
+
+update_geometry_exit:
+ sysfs_free(sra);
+ if (fd > -1)
+ close(fd);
+
+ return ret_val;
+}
+
/*********************************************************** *******************
* function: imsm_create_metadata_update_for_reshape
* Function creates update for whole IMSM container.
@@ -7117,6 +7224,9 @@ struct imsm_update_reshape *imsm_create_metadata_update_for_reshape(struct super
}
u->reshape_delta_disks = delta_disks;
u->update_prepared = -1;
+ u->reshape_raid_disks = 0;
+ u->reshape_level = -1;
+ u->reshape_layout = -1;
u->update_memory_size = update_memory_size;
u->type = update_reshape;
u->spares_in_update = 0;
@@ -7164,6 +7274,18 @@ struct imsm_update_reshape *imsm_create_metadata_update_for_reshape(struct super
set_imsm_ord_tbl_ent(new_map, idx, idx);
}
u->devnum = geo->dev_id;
+ /* case for reshape without grow */
+ if (u->reshape_delta_disks == 0) {
+ dprintf("imsm: reshape prepate metadata for volume= %d, index= %d\n", geo->dev_id, i);
+ if (update_geometry(st, geo) == -1) {
+ dprintf("imsm: ERROR: Cannot prepare update for volume map!\n");
+ ret_val = NULL;
+ goto exit_imsm_create_metadata_update_for_reshape;
+ } else {
+ new_map->raid_level = geo->level;
+ new_map->blocks_per_strip = geo->chunksize / 512;
+ }
+ }
break;
}
}
@@ -7335,6 +7457,7 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
struct mdinfo *sra = NULL;
int fd = -1;
char buf[PATH_MAX];
+ int delta_disks = -1;
struct geo_params geo;

memset(&geo, sizeof (struct geo_params), 0);
@@ -7395,6 +7518,13 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
} else
dprintf("imsm: not a container operation\n");

+ sra = sysfs_read(fd, 0, GET_VERSION | GET_LEVEL | GET_LAYOUT |
+ GET_DISKS | GET_DEVS | GET_CHUNK | GET_SIZE);
+ if (sra == NULL) {
+ fprintf(stderr, Name ": Cannot read sysfs info (imsm)\n");
+ goto imsm_reshape_super_exit;
+ }
+
geo.dev_id = -1;
find_array_minor(geo.dev_name, 1, st->devnum, &geo.dev_id);

@@ -7407,12 +7537,6 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
int dn;
int err;

- sra = sysfs_read(fd, 0, GET_VERSION | GET_LEVEL |
- GET_LAYOUT | GET_DISKS | GET_DEVS);
- if (sra == NULL) {
- fprintf(stderr, Name ": Cannot read sysfs info (imsm)\n");
- goto imsm_reshape_super_exit;
- }
dn = devname2devnum(sra->text_version + 1);
container_fd = open_dev_excl(dn);
if (container_fd < 0) {
@@ -7435,11 +7559,49 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
st->update_tail = &st->updates;
err = update_level_imsm(st, sra, sra->name, 0, 0, NULL);
ret_val = 0;
+ goto imsm_reshape_super_exit;
+ }
+ }
+
+ /* this is not takeover
+ * continue volume check - proceed if delta_disk is zero only
+ */
+ if (geo.raid_disks > 0 && geo.raid_disks != UnSet)
+ delta_disks = geo.raid_disks - sra->array.raid_disks;
+ else
+ delta_disks = 0;
+ dprintf("imsm: imsm_reshape_super() called on array when delta disks = %i\n", delta_disks);
+ if (delta_disks == 0) {
+ struct imsm_update_reshape *u;
+ st->update_tail = &st->updates;
+ dprintf("imsm: imsm_reshape_super(): raid_disks not changed for volume reshape. Reshape allowed.\n");
+
+ if (find_array_minor(geo.dev_name, 1, st->devnum, &geo.dev_id) > -1) {
+ u = imsm_create_metadata_update_for_reshape(st, &geo);
+ if (u) {
+ if (geo.raid_disks > raid_disks)
+ u->reshape_raid_disks = geo.raid_disks;
+ u->reshape_level = geo.level;
+ u->reshape_layout = geo.layout;
+ ret_val = 0;
+ append_metadata_update(st, u, u->update_memory_size);
+ }
}
- sysfs_free(sra);
- sra = NULL;
+ goto imsm_reshape_super_exit;
+ } else {
+ char *devname = devnum2devname(st->devnum);
+ char *devtoprint = devname;
+
+ if (devtoprint == NULL)
+ devtoprint = "Device";
+ fprintf(stderr, Name
+ ": %s cannot be reshaped. Command has to be executed on container.\n",
+ devtoprint);
+ if (devname)
+ free(devname);
}

+
imsm_reshape_super_exit:
sysfs_free(sra);
if (fd >= 0)
@@ -7727,7 +7889,8 @@ struct mdinfo *imsm_reshape_array(struct active_array *a, enum state_of_reshape

if (a->reshape_delta_disks == 0) {
dprintf("array parameters has to be changed\n");
- /* TBD */
+ a->reshape_state = reshape_in_progress;
+ return disk_list;
}
if (a->reshape_delta_disks > 0) {
dprintf("grow is detected.\n");
@@ -7752,17 +7915,14 @@ imsm_reshape_array_exit:
sysfs_set_str(&a->info, NULL, "sync_action", "idle");
imsm_grow_array_remove_devices_on_cancel(a);
u = (struct imsm_update_reshape *)calloc(1, sizeof(struct imsm_update_reshape));
- if (u) {
+ if (u)
u->type = update_reshape_cancel;
- a->reshape_state = reshape_not_active;
- }
}

if (u) {
/* post any prepared update
*/
u->devnum = a->devnum;
-
u->update_memory_size = sizeof(struct imsm_update_reshape);
u->reshape_delta_disks = a->reshape_delta_disks;
u->update_prepared = 1;
@@ -7926,7 +8086,8 @@ int imsm_child_grow(struct supertype *st, char *devname, int validate_fd, struct

void return_to_raid0(struct mdinfo *sra)
{
- if (sra->array.level == 4) {
+ if ((sra->array.level == 4) ||
+ (sra->array.level == 0)) {
dprintf("Execute backward takeover to raid0\n");
sysfs_set_str(sra, NULL, "level", "raid0");
}
@@ -8295,7 +8456,38 @@ int imsm_manage_reshape(struct supertype *st, char *backup)
* for single vlolume reshape exit only and reuse Grow_reshape() code
*/
if (st->subarray[0] != 0) {
+ char buf[PATH_MAX];
+ int fd;
+
dprintf("imsm: manage_reshape() current volume: %s (devnum = %i)\n", st->subarray, st->devnum);
+
+ snprintf(buf, PATH_MAX, "/dev/md%i", st->devnum);
+ fd = open(buf , O_RDWR | O_DIRECT);
+ if (fd > -1) {
+ struct mdinfo *info;
+ struct mdinfo sra;
+
+ sra.devs = NULL;
+ st->ss->getinfo_super(st, &sra);
+ /* wait for reshape finish
+ * and manage array size based on metadata information
+ */
+ imsm_grow_manage_size(st, &sra);
+
+ /* for level == 4: execute takeover to raid0 */
+ info = sysfs_read(fd, 0, GET_VERSION | GET_LEVEL | GET_DEVS | GET_LAYOUT);
+ if (info) {
+ /* currently md doesn't support direct conversion from raid5 to raid4
+ * it has to be done via raid5 layout 5
+ */
+ if ((info->array.level == 5) &&
+ (info->array.layout == 5))
+ info->array.level = 4;
+ return_to_raid0(info);
+ sysfs_free(info);
+ }
+ close(fd);
+ }
return ret_val;
}
ret_val = imsm_manage_container_reshape(st);
@@ -8363,3 +8555,4 @@ struct superswitch super_imsm = {
.prepare_update = imsm_prepare_update,
#endif /* MDASSEMBLE */
};
+

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH 52/53] Migration raid0->raid5

on 26.11.2010 09:10:33 by adam.kwolek

Add implementation for migration from raid0 to raid5 in one step.
For imsm, a flow of raid level parameters from mdadm (via metadata update) to managemon was added.

Block takeover for this migration case (only update_reshape is used).
For migration on a container (OLCE), reinitialize the variables that are changed
by the single array reshape case.
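
The level changes that the imsm geometry check in this series is meant to accept can be summarised as a small predicate. This is a sketch, not the series' code: level_migration_supported() is a hypothetical name, and it assumes that raid1 is intended to be convertible to either raid0 or raid5.

```c
/* Returns 1 if a level change from `from` to `to` would be accepted,
 * 0 otherwise. Mirrors the per-level switch in update_geometry().
 * ASSUMPTION: for raid1 both raid0 and raid5 are valid targets. */
int level_migration_supported(int from, int to)
{
	switch (from) {
	case 0:
		return to == 5;
	case 5:
		return to == 0;
	case 1:
		return to == 5 || to == 0;
	case 10:
		return to == 5;
	default:
		return 0;
	}
}
```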

Signed-off-by: Adam Kwolek
---

super-intel.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index 444ae3a..9061d43 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -5335,7 +5335,7 @@ static void imsm_sync_metadata(struct supertype *container)
{
struct intel_super *super = container->sb;

- if (!super->updates_pending)
+ if (!super || !super->updates_pending)
return;

write_super_imsm(container, 0);
@@ -7196,6 +7196,13 @@ struct imsm_update_reshape *imsm_create_metadata_update_for_reshape(struct super

dprintf("imsm imsm_update_metadata_for_reshape(enter) raid_disks = %i\n", geo->raid_disks);

+ if (super == NULL || super->anchor == NULL) {
+ dprintf("Error: imsm_create_metadata_update_for_reshape(): null pointers on input\n");
+ dprintf("\t\t super = %p\n", super);
+ if (super)
+ dprintf("\t\t super->anchor = %p\n", super->anchor);
+ return ret_val;
+ }
if ((geo->raid_disks < super->anchor->num_disks) ||
(geo->raid_disks == UnSet))
geo->raid_disks = super->anchor->num_disks;
@@ -7274,8 +7281,11 @@ struct imsm_update_reshape *imsm_create_metadata_update_for_reshape(struct super
set_imsm_ord_tbl_ent(new_map, idx, idx);
}
u->devnum = geo->dev_id;
- /* case for reshape without grow */
- if (u->reshape_delta_disks == 0) {
+ /* case for reshape without grow
+ * or grow is level change effect
+ */
+ if ((u->reshape_delta_disks == 0) ||
+ ((new_map->raid_level != geo->level) && (geo->level != UnSet))) {
dprintf("imsm: reshape prepate metadata for volume= %d, index= %d\n", geo->dev_id, i);
if (update_geometry(st, geo) == -1) {
dprintf("imsm: ERROR: Cannot prepare update for volume map!\n");
@@ -7537,6 +7547,13 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
int dn;
int err;

+ /* takeover raid0->raid5 doesn't need meta update
+ * this can be handled by migrations if necessary
+ */
+ if ((geo.level == 5) && (sra->array.level == 5)) {
+ ret_val = 0;
+ goto imsm_reshape_super_exit;
+ }
dn = devname2devnum(sra->text_version + 1);
container_fd = open_dev_excl(dn);
if (container_fd < 0) {
@@ -7568,8 +7585,10 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
*/
if (geo.raid_disks > 0 && geo.raid_disks != UnSet)
delta_disks = geo.raid_disks - sra->array.raid_disks;
- else
+ else {
delta_disks = 0;
+ geo.raid_disks = sra->array.raid_disks;
+ }
dprintf("imsm: imsm_reshape_super() called on array when delta disks = %i\n", delta_disks);
if (delta_disks == 0) {
struct imsm_update_reshape *u;
@@ -7577,7 +7596,26 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
dprintf("imsm: imsm_reshape_super(): raid_disks not changed for volume reshape. Reshape allowed.\n");

if (find_array_minor(geo.dev_name, 1, st->devnum, &geo.dev_id) > -1) {
- u = imsm_create_metadata_update_for_reshape(st, &geo);
+ struct supertype *st2 = NULL;
+ struct supertype *st_tmp = st;
+ if (st->sb == NULL) {
+ close(fd);
+ /* open container
+ */
+ snprintf(buf, PATH_MAX, "/dev/md%i", st->container_dev);
+ dprintf("imsm: open device %s\n", buf);
+ fd = open(buf , O_RDWR | O_DIRECT);
+ if (fd < 0) {
+ dprintf("imsm: cannot open device %s\n", buf);
+ goto imsm_reshape_super_exit;
+ }
+ st2 = super_by_fd(fd);
+ st->ss->load_super(st2, fd, NULL);
+ if (st2)
+ st_tmp = st2;
+ }
+
+ u = imsm_create_metadata_update_for_reshape(st_tmp, &geo);
if (u) {
if (geo.raid_disks > raid_disks)
u->reshape_raid_disks = geo.raid_disks;
@@ -7585,7 +7623,10 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
u->reshape_layout = geo.layout;
ret_val = 0;
append_metadata_update(st, u, u->update_memory_size);
+
}
+ if (st2)
+ st->ss->free_super(st2);
}
goto imsm_reshape_super_exit;
} else {
@@ -8399,6 +8440,8 @@ int imsm_manage_container_reshape(struct supertype *st)
*/
dprintf("imsm: Preparing metadata update for: %s (md%i)\n", array, geo.dev_id);
st->update_tail = &st->updates;
+ geo.size = UnSet;
+ geo.level = UnSet;
u = imsm_create_metadata_update_for_reshape(st, &geo);
if (u) {
u->reshape_delta_disks = delta_disks;


[PATCH 53/53] Migration: Chunk size migration

on 26.11.2010 09:10:41 by adam.kwolek

Add implementation for chunk size migration for external metadata.
The update works by updating the array parameters in managemon. The reshape is also started by managemon.
mdadm waits for the array to enter the reshape state instead of starting the reshape process itself.
For imsm, a flow of the chunk size parameter from mdadm (via metadata update) to managemon was added.

Signed-off-by: Adam Kwolek
---

managemon.c | 6 ++++++
mdmon.h | 1 +
super-intel.c | 4 ++++
3 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/managemon.c b/managemon.c
index 2514963..f104d0a 100644
--- a/managemon.c
+++ b/managemon.c
@@ -522,6 +522,12 @@ static void manage_member(struct mdstat_ent *mdstat,
}
}

+ if (status_ok && newa->reshape_chunk_size > 0) {
+ dprintf("managemon: set chunk_size to %i\n", newa->reshape_chunk_size);
+ if (sysfs_set_num(&newa->info, NULL, "chunk_size", newa->reshape_chunk_size) < 0)
+ status_ok = 0;
+ }
+
if (status_ok && newa->reshape_layout >= 0) {
dprintf("managemon: set layout to %i\n", newa->reshape_layout);
if (sysfs_set_num(&newa->info, NULL, "layout", newa->reshape_layout) < 0)
diff --git a/mdmon.h b/mdmon.h
index 259ea82..c9f619f 100644
--- a/mdmon.h
+++ b/mdmon.h
@@ -53,6 +53,7 @@ struct active_array {
int reshape_raid_disks;
int reshape_level;
int reshape_layout;
+ int reshape_chunk_size;
unsigned long long grow_sync_max; /* sync_max from mdadm Grow */
enum reshape_wait waiting_for; /* we can wait for grow backup event
or for md reshape completed */
diff --git a/super-intel.c b/super-intel.c
index 9061d43..d26c81b 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -347,6 +347,7 @@ struct imsm_update_reshape {
int reshape_raid_disks;
int reshape_level;
int reshape_layout;
+ int reshape_chunk_size;
int disks_count;
int spares_in_update;
int devnum;
@@ -5777,6 +5778,7 @@ static void imsm_process_update(struct supertype *st,
a->reshape_level = 5;
a->reshape_layout = 5;
}
+ a->reshape_chunk_size = u->reshape_chunk_size;

/* Clear migration record */
memset(super->migr_rec, 0, sizeof(struct migr_record));
@@ -7234,6 +7236,7 @@ struct imsm_update_reshape *imsm_create_metadata_update_for_reshape(struct super
u->reshape_raid_disks = 0;
u->reshape_level = -1;
u->reshape_layout = -1;
+ u->reshape_chunk_size = -1;
u->update_memory_size = update_memory_size;
u->type = update_reshape;
u->spares_in_update = 0;
@@ -7621,6 +7624,7 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
u->reshape_raid_disks = geo.raid_disks;
u->reshape_level = geo.level;
u->reshape_layout = geo.layout;
+ u->reshape_chunk_size = geo.chunksize;
ret_val = 0;
append_metadata_update(st, u, u->update_memory_size);


Re: [PATCH 14/53] FIX: Cannot exit monitor after takeover

on 29.11.2010 00:38:29 by NeilBrown

On Fri, 26 Nov 2010 09:05:37 +0100 Adam Kwolek wrote:

> When performing backward takeover to raid0 monitor cannot exit
> for single raid0 array configuration.
> Monitor is locked by communication (ping_manager()) after unfreeze()

I think you are saying that when we convert a RAID5 to a RAID0, the mdmon
notices that there is nothing more for it to do, so it exits. Then mdadm has
problems contacting it. Is that right?
It doesn't seem quite right as the 'ping_monitor' should simply fail if the
mdmon has disappeared.

Could you say a bit more about what you observe happening?

>
> Do not ping manager for raid0 array as they shouldn't be monitored.

Only this isn't quite what the patch does. What it does is:
if the 'last' subarray found is raid0, then don't ping the monitor.
In general, (though possibly not in imsm) there could be multiple arrays,
some RAID0, some not. So we would need to track whether there are any with
level > 0 and ping_monitor if any such were found.
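
A minimal sketch of that tracking, assuming a hypothetical struct subarray in place of the mdstat/sysfs entries that unblock_monitor() actually walks:

```c
#include <stddef.h>

/* Hypothetical minimal view of one member array of a container. */
struct subarray {
	int level;
};

/* Returns 1 if any member array has level > 0 (so the monitor should be
 * pinged), 0 if all members are RAID0 or the list is empty. */
int needs_monitor_ping(const struct subarray *arrays, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++)
		if (arrays[i].level > 0)
			return 1;
	return 0;
}
```

Unlike a check on the last subarray only, the result does not depend on the order in which the arrays are found.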

I would be reasonably happy with such a patch, except that I cannot yet see
exactly why it is needed. So could you explain exactly what you are seeing
please?

Thanks,
NeilBrown

>
> Signed-off-by: Adam Kwolek
> ---
>
> msg.c | 5 +++--
> 1 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/msg.c b/msg.c
> index 8e7ebfd..95c6f0b 100644
> --- a/msg.c
> +++ b/msg.c
> @@ -385,11 +385,12 @@ void unblock_monitor(char *container, const int unfreeze)
> if (!is_container_member(e, container))
> continue;
> sysfs_free(sra);
> - sra = sysfs_read(-1, e->devnum, GET_VERSION);
> + sra = sysfs_read(-1, e->devnum, GET_VERSION|GET_LEVEL);
> if (unblock_subarray(sra, unfreeze))
> fprintf(stderr, Name ": Failed to unfreeze %s\n", e->dev);
> }
> - ping_monitor(container);
> + if (sra && sra->array.level > 0)
> + ping_monitor(container);
>
> sysfs_free(sra);
> free_mdstat(ent);


Re: [PATCH 15/53] FIX: Unfreeze not only container for external metadata

on 29.11.2010 00:48:44 by NeilBrown

On Fri, 26 Nov 2010 09:05:45 +0100 Adam Kwolek wrote:

> Unfreeze for external metadata case should unfreeze arrays and container,
> not only container as so far. Unfreeze() function doesn't know
> what the changes to configuration was made so far, and if arrays
> are pulled from frozen state in md.
> Unfreeze() has to make sure by performing array unfreeze that all arrays
> are not frozen and then unblock monitor.

unfreeze for external metadata case *does* unfreeze the arrays.
unfreeze_container calls unblock_monitor which calls unblock_subarray
for each subarray.

So I cannot see that this patch changes anything. What have I missed?

NeilBrown

>
> Signed-off-by: Adam Kwolek
> ---
>
> Grow.c | 18 ++++++++----------
> 1 files changed, 8 insertions(+), 10 deletions(-)
>
> diff --git a/Grow.c b/Grow.c
> index 4060129..8ca1812 100644
> --- a/Grow.c
> +++ b/Grow.c
> @@ -495,16 +495,14 @@ static void unfreeze(struct supertype *st, int frozen)
> return;
>
> if (st->ss->external)
> - return unfreeze_container(st);
> - else {
> - struct mdinfo *sra = sysfs_read(-1, st->devnum, GET_VERSION);
> -
> - if (sra)
> - sysfs_set_str(sra, NULL, "sync_action", "idle");
> - else
> - fprintf(stderr, Name ": failed to unfreeze array\n");
> - sysfs_free(sra);
> - }
> + unfreeze_container(st);
> +
> + struct mdinfo *sra = sysfs_read(-1, st->devnum, GET_VERSION);
> + if (sra)
> + sysfs_set_str(sra, NULL, "sync_action", "idle");
> + else
> + fprintf(stderr, Name ": failed to unfreeze array\n");
> + sysfs_free(sra);
> }
>
> static void wait_reshape(struct mdinfo *sra)


Re: [PATCH 16/53] Add takeover support for external meta

on 29.11.2010 01:31:00 by NeilBrown

On Fri, 26 Nov 2010 09:05:52 +0100 Adam Kwolek wrote:

> When performing takeover 0->10 or 10->0 mdmon should update the external metadata (due to disk slot changes).
> To achieve that, mdadm, after changing the level in md, calls update_super with the "update_level" type.
> update_super() allocates a new imsm_dev with updated disk slot numbers to be processed by mdmon in process_update().
> process_update() discovers missing disks and adds them to imsm metadata.

I'm afraid I'm not going to apply this patch as it stands.

It combines multiple distinct changes, some of which are clearly wrong and
some of which are doubtful.

It is really really important to make sure each patch makes just one change.
In the rare case that there is a real need to make multiple changes in the
one patch, it is really important to document them clearly at the top,
preferably with a numbered list.

I would much rather have twice as many patches that are each easy to review,
than fewer patches which, like this one, are hard to review and end up having
to be rejected.

So before you send a patch, make sure you have read through the patch and are
sure that everything in the patch is justified by the commentary at the top.

See below for specific comments.

>
> Signed-off-by: Maciej Trela
> Signed-off-by: Adam Kwolek
> ---
>
> Grow.c | 28 ++++++
> managemon.c | 16 +++
> monitor.c | 2
> super-intel.c | 279 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 321 insertions(+), 4 deletions(-)
>
> diff --git a/Grow.c b/Grow.c
> index 8ca1812..e977ce2 100644
> --- a/Grow.c
> +++ b/Grow.c
> @@ -1066,6 +1066,31 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> fprintf(stderr, Name " level of %s changed to %s\n",
> devname, c);
> changed = 1;
> +
> + st = super_by_fd(fd);
> + if (!st) {
> + fprintf(stderr, Name ": cannot handle this array\n");
> + if (container)
> + free(container);
> + if (cfd > -1)
> + close(cfd);
> + return 1;
> + } else {
> + if (st && reshape_super(st, -1, level, UnSet, 0, 0, NULL, devname, !quiet)) {
> + rv = 1;
> + goto release;
> + }
> + /* before sending update make sure that for external metadata
> + * and after changing raid level mdmon is running
> + */
> + if (st->ss->external && !mdmon_running(st->container_dev) &&
> + level > 0) {
> + start_mdmon(st->container_dev);
> + if (container)
> + ping_monitor(container);
> + }
> + sync_metadata(st);
> + }
> }
> }

This contradicts the comment a little earlier which was only recently added
and says:
/* Level change is a simple takeover. In the external
* case we don't check with the metadata handler until
* we establish what the final layout will be. If the
* level change is disallowed we will revert to
* orig_level without disturbing the metadata, otherwise
* we will send an update.
*/

If that comment is no longer valid, it should at least be removed. But I
suspect it is valid and that this change is wrong.
I appreciate that raid10 <-> raid0 is a particularly complicated case and
might need some sort of special treatment. However the above code doesn't
mention raid10 at all so applies generally. I think that is wrong.

I'm guessing that what you really want is extra cases in the
switch (array.level) {

switch to handle RAID0 and RAID10 specially.
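
One way to read the suggestion above is sketched here: instead of calling the metadata handler for every level change, only the RAID0 <-> RAID10 pair is routed to a special path, and every other change keeps the existing "simple takeover" behaviour. This is a self-contained illustration, not mdadm code; `takeover_action` and `classify_takeover()` are hypothetical names introduced for the sketch.

```c
/* Hypothetical classification of a level change; not part of mdadm. */
enum takeover_action {
	TAKEOVER_NONE,          /* level is not changing */
	TAKEOVER_SIMPLE,        /* metadata handler is consulted later */
	TAKEOVER_RAID10_UPDATE  /* RAID0 <-> RAID10: disk slots change */
};

static enum takeover_action classify_takeover(int cur_level, int new_level)
{
	if (cur_level == new_level)
		return TAKEOVER_NONE;

	/* Special-case only the pair that changes disk slots. */
	switch (cur_level) {
	case 0:
		if (new_level == 10)
			return TAKEOVER_RAID10_UPDATE;
		break;
	case 10:
		if (new_level == 0)
			return TAKEOVER_RAID10_UPDATE;
		break;
	default:
		break;
	}
	return TAKEOVER_SIMPLE;
}
```

With this shape, a RAID0 -> RAID5 change still takes the general path, so the recently added comment about simple takeover would remain valid for it.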

>
> @@ -1140,7 +1165,8 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> if (st->ss->external && !mdmon_running(st->container_dev) &&
> level > 0) {
> start_mdmon(st->container_dev);
> - ping_monitor(container);
> + if (container)
> + ping_monitor(container);
> }
> goto release;
> }

This is simply wrong. If st->ss->external, then container is known to be !=
NULL. See earlier code where 'container' is set.

> diff --git a/managemon.c b/managemon.c
> index 164e4f8..53ab4a9 100644
> --- a/managemon.c
> +++ b/managemon.c
> @@ -381,6 +381,9 @@ static int disk_init_and_add(struct mdinfo *disk, struct mdinfo *clone,
> static void manage_member(struct mdstat_ent *mdstat,
> struct active_array *a)
> {
> + struct active_array *newa;
> + int level;
> +
> /* Compare mdstat info with known state of member array.
> * We do not need to look for device state changes here, that
> * is dealt with by the monitor.
> @@ -408,6 +411,19 @@ static void manage_member(struct mdstat_ent *mdstat,
> else
> frozen = 1; /* can't read metadata_version assume the worst */
>
> + level = a->info.array.level;
> + if (mdstat->level) {
> + level = map_name(pers, mdstat->level);
> + if (a->info.array.level != level && level >= 0) {
> + newa = duplicate_aa(a);
> + if (newa) {
> + newa->info.array.level = level;
> + replace_array(a->container, a, newa);
> + a = newa;
> + }
> + }
> + }
> +
> if (a->check_degraded && !frozen) {
> struct metadata_update *updates = NULL;
> struct mdinfo *newdev = NULL;

This probably makes sense, as a separate patch. It would help to have a
clear step-by-step understanding of how RAID10 conversions are handled to
know how this code fits with that.

> diff --git a/monitor.c b/monitor.c
> index 59b4181..5705a9b 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -483,7 +483,7 @@ static int wait_and_act(struct supertype *container, int nowait)
> /* once an array has been deactivated we want to
> * ask the manager to discard it.
> */
> - if (!a->container) {
> + if (!a->container || a->info.array.level == 0) {
> if (discard_this) {
> ap = &(*ap)->next;
> continue;

The last time you sent this I observed that for consistency that should be
level <= 0
You neither changed it nor explained why you chose not to change it.
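
To spell out the consistency argument: the Grow.c code quoted earlier in this patch starts mdmon only when `level > 0`, so the complementary test for "this array no longer needs monitoring" would be `level <= 0`, which also covers the negative level numbers md uses for non-RAID personalities. A minimal sketch follows; the constant values are recalled from mdadm.h and should be treated as an assumption, and `needs_monitor()` and `can_discard()` are hypothetical helpers.

```c
/* Level numbers for non-RAID personalities; values assumed, see above. */
#define SKETCH_LEVEL_FAULTY     (-5)
#define SKETCH_LEVEL_MULTIPATH  (-4)
#define SKETCH_LEVEL_LINEAR     (-1)

/* Hypothetical helper: mirrors the 'level > 0' test used before
 * start_mdmon() in the quoted Grow.c code. */
static int needs_monitor(int level)
{
	return level > 0;
}

/* The discard test is the negation, i.e. level <= 0, not level == 0. */
static int can_discard(int level)
{
	return !needs_monitor(level);
}
```

Testing `level == 0` instead would treat a linear array (level -1) differently from a RAID0 array, although neither satisfies the `level > 0` condition.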

The rest would make sense (though I haven't looked very closely) as a single
separate patch.

Thanks,
NeilBrown

> diff --git a/super-intel.c b/super-intel.c
> index 7c5fcc4..2434fa1 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -285,6 +285,7 @@ enum imsm_update_type {
> update_kill_array,
> update_rename_array,
> update_add_disk,
> + update_level,
> };
>
> struct imsm_update_activate_spare {
> @@ -320,6 +321,13 @@ struct imsm_update_add_disk {
> enum imsm_update_type type;
> };
>
> +struct imsm_update_level {
> + enum imsm_update_type type;
> + int delta_disks;
> + int container_member;
> + struct imsm_dev dev;
> +};
> +
> static struct supertype *match_metadata_desc_imsm(char *arg)
> {
> struct supertype *st;
> @@ -1666,6 +1674,9 @@ static void getinfo_super_imsm(struct supertype *st, struct mdinfo *info)
> }
> }
>
> +static int is_raid_level_supported(const struct imsm_orom *orom, int level, int raiddisks);
> +static void imsm_copy_dev(struct imsm_dev *dest, struct imsm_dev *src);
> +
> static int update_super_imsm(struct supertype *st, struct mdinfo *info,
> char *update, char *devname, int verbose,
> int uuid_set, char *homehost)
> @@ -1698,12 +1709,15 @@ static int update_super_imsm(struct supertype *st, struct mdinfo *info,
> struct intel_super *super = st->sb;
> struct imsm_super *mpb;
>
> - /* we can only update container info */
> - if (!super || super->current_vol >= 0 || !super->anchor)
> + if (!super || !super->anchor)
> return 1;
>
> mpb = super->anchor;
>
> + /* we can only update container info */
> + if (super->current_vol >= 0)
> + return 1;
> +
> if (strcmp(update, "uuid") == 0 && uuid_set && !info->update_private)
> fprintf(stderr,
> Name ": '--uuid' not supported for imsm metadata\n");
> @@ -1778,6 +1792,45 @@ static void imsm_copy_dev(struct imsm_dev *dest, struct imsm_dev *src)
> memcpy(dest, src, sizeof_imsm_dev(src, 0));
> }
>
> +struct imsm_dev *reallocate_imsm_dev(struct intel_super *super,
> + unsigned int array_index,
> + int map_num_members)
> +{
> + struct imsm_dev *newdev = NULL;
> + struct imsm_dev *retval = NULL;
> + struct intel_dev *dv = NULL;
> + struct imsm_dev *dv_free = NULL;
> + int memNeeded;
> +
> + if (!super)
> + return NULL;
> +
> + /* Calculate space needed for imsm_dev with a double map */
> + memNeeded = sizeof(struct imsm_dev) + sizeof(__u32) * (map_num_members - 1) +
> + sizeof(struct imsm_map) + sizeof(__u32) * (map_num_members - 1);
> +
> + newdev = malloc(memNeeded);
> + if (!newdev) {
> + fprintf(stderr, "error: imsm meta update not possible due to no memory conditions\n");
> + return NULL;
> + }
> + /* Find our device */
> + for (dv = super->devlist; dv; dv = dv->next)
> + if (dv->index == array_index) {
> + /* Copy imsm_dev into the new buffer */
> + imsm_copy_dev(newdev, dv->dev);
> + dv_free = dv->dev;
> + dv->dev = newdev;
> + retval = newdev;
> + free(dv_free);
> + break;
> + }
> + if (retval == NULL)
> + free(newdev);
> +
> + return retval;
> +}
> +
> static int compare_super_imsm(struct supertype *st, struct supertype *tst)
> {
> /*
> @@ -5123,6 +5176,57 @@ static void imsm_process_update(struct supertype *st,
> mpb = super->anchor;
>
> switch (type) {
> + case update_level: {
> + struct imsm_update_level *u = (void *)update->buf;
> + struct imsm_dev *dev_new, *dev = NULL;
> + struct imsm_map *map;
> + struct dl *d;
> + int i;
> + int start_disk;
> +
> + dev_new = &u->dev;
> + for (i = 0; i < mpb->num_raid_devs; i++) {
> + dev = get_imsm_dev(super, i);
> + if (strcmp((char *)dev_new->volume, (char *)dev->volume) == 0)
> + break;
> + }
> + if (i == super->anchor->num_raid_devs)
> + return;
> +
> + if (dev == NULL)
> + return;
> +
> + imsm_copy_dev(dev, dev_new);
> + map = get_imsm_map(dev, 0);
> + start_disk = mpb->num_disks;
> + mpb->num_disks += u->delta_disks;
> +
> + /* clear missing disks list */
> + while (super->missing) {
> + d = super->missing;
> + super->missing = d->next;
> + __free_imsm_disk(d);
> + }
> + find_missing(super);
> +
> + /* clear new disk entries if number of disks increased*/
> + d = super->missing;
> + for (i = start_disk; i < map->num_members; i++) {
> + assert(d != NULL);
> + if (!d)
> + break;
> + memset(&d->disk, 0, sizeof(d->disk));
> + strcpy((char *)d->disk.serial, "MISSING");
> + d->disk.total_blocks = map->blocks_per_member;
> + /* Set slot for missing disk */
> + set_imsm_ord_tbl_ent(map, i, d->index | IMSM_ORD_REBUILD);
> + d->raiddisk = i;
> + d = d->next;
> + }
> +
> + super->updates_pending++;
> + break;
> + }
> case update_activate_spare: {
> struct imsm_update_activate_spare *u = (void *) update->buf;
> struct imsm_dev *dev = get_imsm_dev(super, u->array);
> @@ -5442,6 +5546,26 @@ static void imsm_prepare_update(struct supertype *st,
> size_t len = 0;
>
> switch (type) {
> + case update_level: {
> + struct imsm_update_level *u = (void *) update->buf;
> + struct active_array *a;
> +
> + dprintf("prepare_update(): update level\n");
> + len += u->delta_disks * sizeof(struct imsm_disk) +
> + u->delta_disks * sizeof(__u32);
> +
> + for (a = st->arrays; a; a = a->next)
> + if (a->info.container_member == u->container_member)
> + break;
> + if (a == NULL)
> + break; /* what else we can do here? */
> +
> + /* we'll add new disks to imsm_dev */
> + if (u->delta_disks > 0)
> + reallocate_imsm_dev(super, u->container_member,
> + a->info.array.raid_disks);
> + break;
> + }
> case update_create_array: {
> struct imsm_update_create_array *u = (void *) update->buf;
> struct intel_dev *dv;
> @@ -5561,6 +5685,156 @@ static void imsm_delete(struct intel_super *super, struct dl **dlp, unsigned ind
> }
> #endif /* MDASSEMBLE */
>
> +static int update_level_imsm(struct supertype *st, struct mdinfo *info,
> + char *devname, int verbose,
> + int uuid_set, char *homehost)
> +{
> + struct intel_super *super = st->sb;
> + struct imsm_super *mpb = super->anchor;
> + struct imsm_update_level *u;
> + struct imsm_dev *dev_new, *dev = NULL;
> + struct imsm_map *map_new, *map;
> + struct mdinfo *newdi;
> + struct dl *dl;
> + int *tmp_ord_tbl;
> + int i, slot, idx;
> + int len, disks;
> +
> + if (!is_raid_level_supported(super->orom,
> + info->array.level,
> + info->array.raid_disks))
> + return 1;
> +
> + for (i = 0; i < mpb->num_raid_devs; i++) {
> + dev = get_imsm_dev(super, i);
> + if (strcmp(devname, (char *)dev->volume) == 0)
> + break;
> + }
> + if (dev == NULL)
> + return 1;
> +
> + if (i == super->anchor->num_raid_devs)
> + return 1;
> +
> + map = get_imsm_map(dev, 0);
> +
> + /* update level is needed only for 0->10 and 10->0 transitions */
> + if ((info->array.level != 10 || map->raid_level != 0) &&
> + (info->array.level != 0 || map->raid_level != 10))
> + return 1;
> +
> + disks = (info->array.raid_disks > map->num_members) ?
> + info->array.raid_disks : map->num_members;
> + len = sizeof(struct imsm_update_level) +
> + ((disks - 1) * sizeof(__u32));
> +
> + u = malloc(len);
> + if (u == NULL)
> + return 1;
> +
> + dev_new = &u->dev;
> + imsm_copy_dev(dev_new, dev);
> + map_new = get_imsm_map(dev_new, 0);
> +
> + tmp_ord_tbl = malloc(sizeof(int) * disks);
> + if (tmp_ord_tbl == NULL) {
> + free(u);
> + return 1;
> + }
> +
> + for (i = 0; i < disks; i++)
> + tmp_ord_tbl[i] = -1;
> +
> + /* iterate through devices to detect slot changes */
> + for (dl = super->disks; dl; dl = dl->next)
> + for (newdi = info->devs; newdi; newdi = newdi->next) {
> + if ((dl->major != newdi->disk.major) ||
> + (dl->minor != newdi->disk.minor))
> + continue;
> + slot = get_imsm_disk_slot(map, dl->index);
> + idx = get_imsm_ord_tbl_ent(dev_new, slot);
> + tmp_ord_tbl[newdi->disk.raid_disk] = idx;
> + break;
> + }
> +
> + for (i = 0; i < disks; i++)
> + set_imsm_ord_tbl_ent(map_new, i, tmp_ord_tbl[i]);
> + free(tmp_ord_tbl);
> + map_new->raid_level = info->array.level;
> + map_new->num_members = info->array.raid_disks;
> + u->type = update_level;
> + u->delta_disks = info->array.raid_disks - map->num_members;
> + u->container_member = info->container_member;
> + append_metadata_update(st, u, len);
> +
> + return 0;
> +}
> +
> +
> +int imsm_reshape_super(struct supertype *st, long long size, int level,
> + int layout, int chunksize, int raid_disks,
> + char *backup, char *dev, int verbouse)
> +{
> + int ret_val = 1;
> + struct mdinfo *sra = NULL;
> + int fd = -1;
> + char buf[PATH_MAX];
> +
> + snprintf(buf, PATH_MAX, "/dev/md%i", st->devnum);
> + fd = open(buf , O_RDONLY | O_DIRECT);
> + if (fd < 0) {
> + dprintf("imsm: cannot open device: %s\n", buf);
> + goto imsm_reshape_super_exit;
> + }
> +
> + if ((size == -1) && (layout == UnSet) && (raid_disks == 0) && (level != UnSet)) {
> + /* ok - this is takeover */
> + int container_fd;
> + int dn;
> + int err;
> +
> + sra = sysfs_read(fd, 0, GET_VERSION | GET_LEVEL |
> + GET_LAYOUT | GET_DISKS | GET_DEVS);
> + if (sra == NULL) {
> + fprintf(stderr, Name ": Cannot read sysfs info (imsm)\n");
> + goto imsm_reshape_super_exit;
> + }
> + dn = devname2devnum(sra->text_version + 1);
> + container_fd = open_dev_excl(dn);
> + if (container_fd < 0) {
> + fprintf(stderr, Name ": Cannot get exclusive access "
> + "to container (imsm).\n");
> + goto imsm_reshape_super_exit;
> + }
> + st->ss->load_super(st, container_fd, NULL);
> + close(container_fd);
> + st->ss->getinfo_super(st, sra);
> +
> + /* send metadata update for raid10 takeover
> + * this means we are going from/to raid10
> + * to/from different than raid10 level
> + * if source level is raid0 mdmon is sterted only
> + */
> + if (((level == 10) || (sra->array.level == 10) || (sra->array.level == 0)) &&
> + (level != sra->array.level) &&
> + (level > 0)) {
> + st->update_tail = &st->updates;
> + err = update_level_imsm(st, sra, sra->name, 0, 0, NULL);
> + ret_val = 0;
> + }
> + sysfs_free(sra);
> + sra = NULL;
> + }
> +
> +imsm_reshape_super_exit:
> + sysfs_free(sra);
> + if (fd >= 0)
> + close(fd);
> +
> + dprintf("imsm: reshape_super Exit code = %i\n", ret_val);
> + return ret_val;
> +}
> +
> struct superswitch super_imsm = {
> #ifndef MDASSEMBLE
> .examine_super = examine_super_imsm,
> @@ -5592,6 +5866,7 @@ struct superswitch super_imsm = {
> .match_metadata_desc = match_metadata_desc_imsm,
> .container_content = container_content_imsm,
> .default_geometry = default_geometry_imsm,
> + .reshape_super = imsm_reshape_super,
>
> .external = 1,
> .name = "imsm",


Re: [PATCH 17/53] Disk removal support for Raid10->Raid0 takeover

on 29.11.2010 02:00:23 by NeilBrown

On Fri, 26 Nov 2010 09:06:00 +0100 Adam Kwolek wrote:

> Until now Raid10->Raid0 takeover was possible only if all the mirrors were removed before md starts the takeover.
> Now mdadm, when performing Raid10->Raid0 takeover, will remove all unwanted mirrors from the array before the actual md takeover is called.
>
> Signed-off-by: Maciej Trela
> Signed-off-by: Adam Kwolek

This patch makes sense, but I found the logic in choosing which devices to
remove to be very hard to follow.

I have replaced it with the following: We select one disk to keep from each
pair, then remove the remainder.
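
The "keep one per pair" selection can be sketched on its own, outside mdadm. The code below is an illustration under stated assumptions: `struct disk` is a simplified stand-in for mdadm's `struct mdinfo`, the single `working` flag replaces the real state-bit checks, and `pick_one_per_set()` is a hypothetical name.

```c
#include <stddef.h>

/* Simplified stand-in for struct mdinfo; only what the sketch needs. */
struct disk {
	int raid_disk;     /* slot in the RAID10 array */
	int working;       /* 1 if usable (in sync, not faulty) */
	struct disk *next;
};

/*
 * Move one working disk per mirror set from *remaining to *keep.
 * Slots [slot, slot + copies) form one mirror set.  Returns 0 on
 * success, or -1 if some set has no working disk; in that case the
 * lists may be partially rearranged and the caller must merge them.
 */
static int pick_one_per_set(struct disk **remaining, struct disk **keep,
			    int raid_disks, int copies)
{
	int slot;

	if (copies <= 0)
		return -1;

	for (slot = 0; slot < raid_disks; slot += copies) {
		struct disk **dp;
		int found = 0;

		for (dp = remaining; *dp; dp = &(*dp)->next) {
			struct disk *d = *dp;

			if (d->raid_disk < slot || d->raid_disk >= slot + copies)
				continue;
			if (!d->working)
				continue;
			/* Unlink from 'remaining' and push onto 'keep'. */
			*dp = d->next;
			d->next = *keep;
			*keep = d;
			found = 1;
			break;
		}
		if (!found)
			return -1;
	}
	return 0;
}
```

After a successful call, everything still on the `remaining` list is a redundant mirror that can be failed and removed before the level change.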

Thanks,
NeilBrown

commit 62a48395f60965d04da1a9cbb937bda79dd071c8
Author: Adam Kwolek
Date: Mon Nov 29 11:57:51 2010 +1100

Disk removal support for Raid10->Raid0 takeover

Until now Raid10->Raid0 takeover was possible only if all the mirrors
were removed before md starts the takeover. Now mdadm, when
performing Raid10->Raid0 takeover, will remove all unwanted mirrors
from the array before the actual md takeover is called.

Signed-off-by: Maciej Trela
Signed-off-by: Adam Kwolek
Signed-off-by: NeilBrown

diff --git a/Grow.c b/Grow.c
index ea8f493..1d92d68 100644
--- a/Grow.c
+++ b/Grow.c
@@ -771,6 +771,76 @@ static void revert_container_raid_disks(struct supertype *st, int fd, char *cont
free_mdstat(ent);
}

+int remove_disks_on_raid10_to_raid0_takeover(struct supertype *st,
+ struct mdinfo *sra,
+ int layout)
+{
+ int nr_of_copies;
+ struct mdinfo *remaining;
+ int slot;
+
+ nr_of_copies = layout & 0xff;
+
+ remaining = sra->devs;
+ sra->devs = NULL;
+ /* for each 'copy', select one device and remove from the list. */
+ for (slot = 0; slot < sra->array.raid_disks; slot += nr_of_copies) {
+ struct mdinfo **diskp;
+ int found = 0;
+
+ /* Find a working device to keep */
+ for (diskp = &remaining; *diskp ; diskp = &(*diskp)->next) {
+ struct mdinfo *disk = *diskp;
+
+ if (disk->disk.raid_disk < slot)
+ continue;
+ if (disk->disk.raid_disk >= slot + nr_of_copies)
+ continue;
+ if (disk->disk.state & (1<<MD_DISK_REMOVED))
+ continue;
+ if (disk->disk.state & (1<<MD_DISK_FAULTY))
+ continue;
+ if (!(disk->disk.state & (1<<MD_DISK_SYNC)))
+ continue;
+
+ /* We have found a good disk to use! */
+ *diskp = disk->next;
+ disk->next = sra->devs;
+ sra->devs = disk;
+ found = 1;
+ break;
+ }
+ if (!found)
+ break;
+ }
+
+ if (slot < sra->array.raid_disks) {
+ /* didn't find all slots */
+ struct mdinfo **e;
+ e = &remaining;
+ while (*e)
+ e = &(*e)->next;
+ *e = sra->devs;
+ sra->devs = remaining;
+ return 1;
+ }
+
+ /* Remove all 'remaining' devices from the array */
+ while (remaining) {
+ struct mdinfo *sd = remaining;
+ remaining = sd->next;
+
+ sysfs_set_str(sra, sd, "state", "faulty");
+ sysfs_set_str(sra, sd, "slot", "none");
+ sysfs_set_str(sra, sd, "state", "remove");
+ sd->disk.state |= (1<<MD_DISK_REMOVED);
+ sd->disk.state &= ~(1<<MD_DISK_SYNC);
+ sd->next = sra->devs;
+ sra->devs = sd;
+ }
+ return 0;
+}
+
int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
long long size,
int level, char *layout_str, int chunksize, int raid_disks)
@@ -902,7 +972,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
st->update_tail = &st->updates;
}

- sra = sysfs_read(fd, 0, GET_LEVEL);
+ sra = sysfs_read(fd, 0, GET_LEVEL | GET_DISKS | GET_DEVS | GET_STATE);
if (sra) {
if (st->ss->external && subarray == NULL) {
array.level = LEVEL_CONTAINER;
@@ -973,6 +1043,25 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
size = array.size;
}

+ /* ========= check for Raid10 -> Raid0 conversion ===============
+ * current implemenation assumes that following conditions must be met:
+ * - far_copies == 1
+ * - near_copies == 2
+ */
+ if (level == 0 && array.level == 10 &&
+ array.layout == ((1 << 8) + 2) && !(array.raid_disks & 1)) {
+ int err;
+ err = remove_disks_on_raid10_to_raid0_takeover(st, sra, array.layout);
+ if (err) {
+ dprintf(Name": Array cannot be reshaped\n");
+ if (container)
+ free(container);
+ if (cfd > -1)
+ close(cfd);
+ return 1;
+ }
+ }
+
/* ======= set level =========== */
if (level != UnSet && level != array.level) {
/* Trying to change the level.
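
The layout test in the hunk above, `array.layout == ((1 << 8) + 2)`, relies on how md packs the RAID10 layout word: near copies in the low byte and far copies in the next byte. The sketch below decodes that word and restates the condition; the helper names are hypothetical, and the predicate is meant only to mirror the check in the patch, not to describe what md itself accepts.

```c
/* RAID10 layout word: near copies in bits 0-7, far copies in bits 8-15. */
static int raid10_near_copies(int layout)
{
	return layout & 0xff;
}

static int raid10_far_copies(int layout)
{
	return (layout >> 8) & 0xff;
}

/* Hypothetical predicate restating the condition in the patch:
 * near == 2, far == 1, no higher layout bits set, even disk count. */
static int raid10_to_raid0_supported(int layout, int raid_disks)
{
	return raid10_near_copies(layout) == 2 &&
	       raid10_far_copies(layout) == 1 &&
	       (layout & ~0xffff) == 0 &&
	       (raid_disks & 1) == 0;
}
```

For example, the layout value `(1 << 8) + 2` decodes to two near copies and one far copy, which is why `layout & 0xff` yields the number of copies per mirror set in `remove_disks_on_raid10_to_raid0_takeover()`.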

Re: [PATCH 18/53] Treat feature as experimental

on 29.11.2010 02:13:17 by NeilBrown

On Fri, 26 Nov 2010 09:06:08 +0100 Adam Kwolek wrote:

> Because IMSM Windows compatibility has not been tested yet, the feature has to be treated as experimental until compatibility has been verified.

I've applied most of this patch.
The addition to imsm_reshape_super cannot be added until that function itself
is added, and I didn't like that patch.

Also it is not correct to mark the function as 'inline', and mentioning
'IMSM' in the error message is not appropriate as experimental() could
eventually be used by other metadata handlers.

But with those changes, I have applied it.
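
A sketch of what the two requested changes might look like: a plain, non-inline function with a message that names no particular metadata format. This is not the code that was applied. `env_is_set()` is a hypothetical stand-in for mdadm's `check_env()`, and treating any non-empty value as "enabled" is an assumption of the sketch.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for check_env(): any non-empty value counts. */
static int env_is_set(const char *name)
{
	const char *val = getenv(name);

	return val != NULL && val[0] != '\0';
}

/* Ordinary external function, no 'inline'; the message is generic so
 * that any metadata handler can use the same gate. */
int experimental(void)
{
	if (env_is_set("MDADM_EXPERIMENTAL"))
		return 1;
	fprintf(stderr, "mdadm: To use this feature the MDADM_EXPERIMENTAL "
		"environment variable has to be defined.\n");
	return 0;
}
```

A caller would then guard the new code path with `if (!experimental()) return 1;`. Dropping `inline` also avoids depending on the differing GNU89 and C99 rules for `extern inline`.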

Thanks,
NeilBrown

>
> Signed-off-by: Adam Kwolek
> ---
>
> mdadm.h | 1 +
> super-intel.c | 4 ++++
> util.c | 10 ++++++++++
> 3 files changed, 15 insertions(+), 0 deletions(-)
>
> diff --git a/mdadm.h b/mdadm.h
> index 64b32cc..bf3c1d3 100644
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -890,6 +890,7 @@ extern char *conf_word(FILE *file, int allow_key);
> extern int conf_name_is_free(char *name);
> extern int devname_matches(char *name, char *match);
> extern struct mddev_ident_s *conf_match(struct mdinfo *info, struct supertype *st);
> +extern inline int experimental(void);
>
> extern void free_line(char *line);
> extern int match_oneof(char *devices, char *devname);
> diff --git a/super-intel.c b/super-intel.c
> index 2434fa1..f092ccc 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -5780,6 +5780,10 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
> int fd = -1;
> char buf[PATH_MAX];
>
> +
> + if (experimental() == 0)
> + return ret_val;
> +
> snprintf(buf, PATH_MAX, "/dev/md%i", st->devnum);
> fd = open(buf , O_RDONLY | O_DIRECT);
> if (fd < 0) {
> diff --git a/util.c b/util.c
> index 8739278..f220792 100644
> --- a/util.c
> +++ b/util.c
> @@ -1859,3 +1859,13 @@ void append_metadata_update(struct supertype *st, void *buf, int len)
> unsigned int __invalid_size_argument_for_IOC = 0;
> #endif
>
> +inline int experimental(void)
> +{
> + if (check_env("MDADM_EXPERIMENTAL"))
> + return 1;
> + else {
> + fprintf(stderr, Name "(IMSM): To use this feature MDADM_EXPERIMENTAL enviroment variable has to defined.\n");
> + return 0;
> + }
> +}
> +


Re: [PATCH 19/53] imsm: Add support for general migration

on 29.11.2010 02:17:47 by NeilBrown

On Fri, 26 Nov 2010 09:06:15 +0100 Adam Kwolek wrote:

> Internal IMSM procedures need to support the General Migration.
> It is used during operations like:
> - Online Capacity Expansion,
> - migration initialization,
> - finishing migration,
> - apply changes to raid disks etc.
>
> Signed-off-by: Adam Kwolek
> ---
>
> mdmon.h | 2 +-
> super-intel.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++------
> 2 files changed, 56 insertions(+), 8 deletions(-)
>
> diff --git a/mdmon.h b/mdmon.h
> index 5c51566..8190358 100644
> --- a/mdmon.h
> +++ b/mdmon.h
> @@ -76,7 +76,7 @@ void do_monitor(struct supertype *container);
> void do_manager(struct supertype *container);
> extern int sigterm;
>
> -int read_dev_state(int fd);
> +extern int read_dev_state(int fd);
> int is_container_member(struct mdstat_ent *mdstat, char *container);

Here is another case of irrelevant extra changes. This change really isn't
needed. 'extern' before a function declaration is completely redundant.

It is true that in the various .h files, we sometimes do include 'extern' and
sometimes don't. It might be good to be consistent, so a single patch which
makes them all consistent, either removing 'extern' from function
declarations or adding it where missing, would be OK. But adding it here when
you aren't changing anything else about the declaration is noise, and I
really don't need more noise.
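
As a small illustration of why the keyword is redundant: function declarations in C have external linkage by default, so the two declarations below are equivalent and a single definition satisfies both. `read_dev_state_demo()` is a hypothetical name with a placeholder body, not the mdmon function.

```c
/* Both lines declare the same function with external linkage. */
int read_dev_state_demo(int fd);
extern int read_dev_state_demo(int fd);

/* One definition matches both declarations.  Placeholder body for
 * illustration only. */
int read_dev_state_demo(int fd)
{
	return fd >= 0 ? 0 : -1;
}
```

The compiler accepts the repeated declaration without complaint, which is exactly why adding `extern` to an otherwise unchanged prototype changes nothing.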

I haven't applied the rest of this patch as I suspect it should be applied
after the 'imsm_reshape_super' patch is applied.

NeilBrown


Re: [PATCH 19/53] imsm: Add support for general migration

on 29.11.2010 02:29:06 by NeilBrown

On Mon, 29 Nov 2010 12:17:47 +1100 Neil Brown wrote:

> I haven't applied the rest of this patch as I suspect it should be applied
> after the 'imsm_reshape_super' patch is applied.

Actually, I changed my mind. I have applied it, without the mdmon.h change.

NeilBrown


Re: [PATCH 20/53] imsm: Add reshape_update for grow array case

on 29.11.2010 02:48:14 by NeilBrown

On Fri, 26 Nov 2010 09:06:29 +0100 Adam Kwolek wrote:

> During Online Capacity Expansion initialization, store a metadata update for the array currently being reshaped in the container.
> A new update type, imsm_update_reshape, is added to perform this action.
> The active array is extended with a reshape_delta_disks variable that triggers additional actions in managemon.
>
> 1. reshape_super() prepares the metadata update and sends it to mdmon.
> 2. managemon, in prepare_update(), allocates the memory required for the bigger device object.
> 3. monitor, in process_update(), updates (replaces) the device object with information
>    passed from mdadm (the memory was allocated by managemon).
> 4. The reshape_delta_disks variable is set to the delta_disks value from the update.
>    This signals managemon to add devices to md and start the reshape for this array.

I haven't applied this patch because there is too much of it that doesn't
make sense to me. Maybe you need to break it up and explain it better. But
see below.

>
> Signed-off-by: Adam Kwolek
> Signed-off-by: Krzysztof Wojcik
> ---
>
> Makefile | 6
> managemon.c | 2
> mdadm.h | 4
> mdmon.h | 5
> super-intel.c | 792 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> sysfs.c | 144 ++++++++++
> util.c | 148 +++++++++++
> 7 files changed, 1094 insertions(+), 7 deletions(-)
>
> diff --git a/Makefile b/Makefile
> index e2c65a5..e3fb949 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -112,17 +112,17 @@ SRCS = mdadm.c config.c mdstat.c ReadMe.c util.c Manage.c Assemble.c Build.c \
> MON_OBJS = mdmon.o monitor.o managemon.o util.o mdstat.o sysfs.o config.o \
> Kill.o sg_io.o dlink.o ReadMe.o super0.o super1.o super-intel.o \
> super-ddf.o sha1.o crc32.o msg.o bitmap.o \
> - platform-intel.o probe_roms.o
> + platform-intel.o probe_roms.o mapfile.o
>
> MON_SRCS = mdmon.c monitor.c managemon.c util.c mdstat.c sysfs.c config.c \
> Kill.c sg_io.c dlink.c ReadMe.c super0.c super1.c super-intel.c \
> super-ddf.c sha1.c crc32.c msg.c bitmap.c \
> - platform-intel.c probe_roms.c
> + platform-intel.c probe_roms.c mapfile.c

I can see no justification for adding mapfile to mdmon. If you find you need
to do that, you have done something wrong.

>
> STATICSRC = pwgr.c
> STATICOBJS = pwgr.o
>
> -ASSEMBLE_SRCS := mdassemble.c Assemble.c Manage.c config.c dlink.c util.c \
> +ASSEMBLE_SRCS := mdassemble.c Assemble.c Manage.c config.c dlink.c util.c mapfile.c\
> super0.c super1.c super-ddf.c super-intel.c sha1.c crc32.c sg_io.c mdstat.c \
> platform-intel.c probe_roms.c sysfs.c
> ASSEMBLE_AUTO_SRCS := mdopen.c
> diff --git a/managemon.c b/managemon.c
> index 53ab4a9..d495014 100644
> --- a/managemon.c
> +++ b/managemon.c
> @@ -536,6 +536,8 @@ static void manage_new(struct mdstat_ent *mdstat,
>
> new->container = container;
>
> + new->reshape_state = reshape_not_active;
> +
> inst = to_subarray(mdstat, container->devname);
>
> new->info.array = mdi->array;
> diff --git a/mdadm.h b/mdadm.h
> index bf3c1d3..4777ad2 100644
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -447,6 +447,7 @@ extern int sysfs_disk_to_scsi_id(int fd, __u32 *id);
> extern int sysfs_unique_holder(int devnum, long rdev);
> extern int sysfs_freeze_array(struct mdinfo *sra);
> extern int load_sys(char *path, char *buf);
> +extern struct mdinfo *sysfs_get_unused_spares(int container_fd, int fd);
>
>
> extern int save_stripes(int *source, unsigned long long *offsets,
> @@ -473,6 +474,7 @@ extern char *map_dev(int major, int minor, int create);
>
> struct active_array;
> struct metadata_update;
> +enum state_of_reshape;

Why declare state_of_reshape like this in mdadm.h if it isn't used at all in
mdadm.h??

>
> /* A superswitch provides entry point the a metadata handler.
> *
> @@ -891,6 +893,8 @@ extern int conf_name_is_free(char *name);
> extern int devname_matches(char *name, char *match);
> extern struct mddev_ident_s *conf_match(struct mdinfo *info, struct supertype *st);
> extern inline int experimental(void);
> +extern int find_array_minor(char *text_version, int external, int container, int *minor);
> +extern int find_array_minor2(char *text_version, int external, int container, int *minor);
>
> extern void free_line(char *line);
> extern int match_oneof(char *devices, char *devname);
> diff --git a/mdmon.h b/mdmon.h
> index 8190358..9ea0b93 100644
> --- a/mdmon.h
> +++ b/mdmon.h
> @@ -24,6 +24,8 @@ enum array_state { clear, inactive, suspended, readonly, read_auto,
> enum sync_action { idle, reshape, resync, recover, check, repair, bad_action };
>
>
> +enum state_of_reshape { reshape_not_active, reshape_is_starting, reshape_in_progress, reshape_cancel_request };
> +
> struct active_array {
> struct mdinfo info;
> struct supertype *container;
> @@ -45,6 +47,9 @@ struct active_array {
> enum array_state prev_state, curr_state, next_state;
> enum sync_action prev_action, curr_action, next_action;
>
> + enum state_of_reshape reshape_state;
> + int reshape_delta_disks;
> +

Adding these fields seems correct, but I would really like a separate patch
which adds the fields, and the definition and the initialisation.

If it had a comment explaining the stages and how each state change happens,
that would be an added bonus.
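
The kind of commented definition being asked for might look like the sketch below. The state names are taken from the patch, but the meaning attached to each state and the set of allowed transitions are assumptions made for illustration; the patch does not document them, and `reshape_transition_ok()` is a hypothetical helper.

```c
enum state_of_reshape {
	reshape_not_active,     /* no reshape pending for this array */
	reshape_is_starting,    /* update processed; manager still has to
				 * add devices and start the reshape */
	reshape_in_progress,    /* md is reshaping the array */
	reshape_cancel_request  /* reshape should be abandoned */
};

/* Hypothetical check of the assumed state machine:
 *   not_active -> is_starting -> in_progress -> not_active
 * with cancel_request reachable from the two active states. */
static int reshape_transition_ok(enum state_of_reshape from,
				 enum state_of_reshape to)
{
	switch (from) {
	case reshape_not_active:
		return to == reshape_is_starting;
	case reshape_is_starting:
		return to == reshape_in_progress ||
		       to == reshape_cancel_request;
	case reshape_in_progress:
		return to == reshape_not_active ||
		       to == reshape_cancel_request;
	case reshape_cancel_request:
		return to == reshape_not_active;
	}
	return 0;
}
```

Writing the transitions down in one place, even as a comment, makes it possible to review which thread (manager or monitor) is allowed to make each change.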

> int check_degraded; /* flag set by mon, read by manage */
>
> int devnum;
> diff --git a/super-intel.c b/super-intel.c
> index 90faff6..98e4c6d 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -286,6 +286,7 @@ enum imsm_update_type {
> update_rename_array,
> update_add_disk,
> update_level,
> + update_reshape,
> };
>
> struct imsm_update_activate_spare {
> @@ -296,6 +297,43 @@ struct imsm_update_activate_spare {
> struct imsm_update_activate_spare *next;
> };
>
> +struct geo_params {
> + int dev_id;
> + char *dev_name;
> + long long size;
> + int level;
> + int layout;
> + int chunksize;
> + int raid_disks;
> +};
> +
> +
> +struct imsm_update_reshape {
> + enum imsm_update_type type;
> + int update_memory_size;
> + int reshape_delta_disks;
> + int disks_count;
> + int spares_in_update;
> + int devnum;
> + /* pointers to memory that will be allocated
> + * by manager during prepare_update()
> + */
> + struct intel_dev devs_mem;
> + /* status of update preparation
> + */
> + int update_prepared;
> + /* anchor data prepared by mdadm */
> + int upd_devs_offset;
> + int device_size;
> + struct dl upd_disks[1];
> + /* here goes added spares
> + */
> + /* and here goes imsm_devs pointed by upd_devs
> + * devs are put here as row data every device_size bytes
> + *
> + */
> +};
> +
> struct disk_info {
> __u8 serial[MAX_RAID_SERIAL_LEN];
> };
> @@ -5189,6 +5227,7 @@ static int disks_overlap(struct intel_super *super, int idx, struct imsm_update_
> }
>
> static void imsm_delete(struct intel_super *super, struct dl **dlp, unsigned index);
> +int imsm_get_new_device_name(struct dl *dl);
>
> static void imsm_process_update(struct supertype *st,
> struct metadata_update *update)
> @@ -5224,6 +5263,102 @@ static void imsm_process_update(struct supertype *st,
> mpb = super->anchor;
>
> switch (type) {
> + case update_reshape: {
> + struct imsm_update_reshape *u = (void *)update->buf;
> + struct dl *new_disk;
> + struct active_array *a;
> + int i;
> + __u32 new_mpb_size;
> + int new_disk_num;
> + struct intel_dev *current_dev;
> +
> + dprintf("imsm: imsm_process_update() for update_reshape [u->update_prepared = %i]\n", u->update_prepared);
> + if ((u->update_prepared == -1) ||
> + (u->devnum < 0)) {
> + dprintf("imsm: Error: update_reshape not prepared\n");
> + goto update_reshape_exit;
> + }
> +
> + if (u->spares_in_update) {
> + new_disk_num = mpb->num_disks + u->reshape_delta_disks;
> + new_mpb_size = disks_to_mpb_size(new_disk_num);
> + if (mpb->mpb_size < new_mpb_size)
> + mpb->mpb_size = new_mpb_size;
> +
> + /* enable spares to use in array
> + */
> + for (i = 0; i < u->reshape_delta_disks; i++) {
> + char buf[PATH_MAX];
> +
> + new_disk = super->disks;
> + while (new_disk) {
> + if ((new_disk->major == u->upd_disks[i].major) &&
> + (new_disk->minor == u->upd_disks[i].minor))
> + break;
> + new_disk = new_disk->next;
> + }
> + if (new_disk == NULL) {
> + u->update_prepared = -1;
> + goto update_reshape_exit;
> + }
> + if (new_disk->index < 0) {
> + new_disk->index = i + mpb->num_disks;
> + new_disk->raiddisk = new_disk->index; /* slot to fill in autolayout */
> + new_disk->disk.status |= CONFIGURED_DISK;
> + new_disk->disk.status &= ~SPARE_DISK;
> + }
> + sprintf(buf, "%d:%d", new_disk->major, new_disk->minor);
> + if (new_disk->fd < 0)
> + new_disk->fd = dev_open(buf, O_RDWR);
> + imsm_get_new_device_name(new_disk);
> + }
> + }
> +
> + dprintf("imsm: process_update(): update_reshape: volume set mpb->num_raid_devs = %i\n", mpb->num_raid_devs);
> + /* manage changes in volumes
> + */
> + /* check if array is in RESHAPE_NOT_ACTIVE reshape state
> + */
> + for (a = st->arrays; a; a = a->next)
> + if (a->devnum == u->devnum)
> + break;
> + if ((a == NULL) || (a->reshape_state != reshape_not_active)) {
> + u->update_prepared = -1;
> + goto update_reshape_exit;
> + }
> + /* find current dev in intel_super
> + */
> + dprintf("\t\tLooking for volume %s\n", (char *)u->devs_mem.dev->volume);
> + current_dev = super->devlist;
> + while (current_dev) {
> + if (strcmp((char *)current_dev->dev->volume,
> + (char *)u->devs_mem.dev->volume) == 0)
> + break;
> + current_dev = current_dev->next;
> + }
> + if (current_dev == NULL) {
> + u->update_prepared = -1;
> + goto update_reshape_exit;
> + }
> +
> + dprintf("Found volume %s\n", (char *)current_dev->dev->volume);
> + /* replace current device with provided in update
> + */
> + free(current_dev->dev);
> + current_dev->dev = u->devs_mem.dev;
> + u->devs_mem.dev = NULL;
> +
> + /* set reshape_delta_disks
> + */
> + a->reshape_delta_disks = u->reshape_delta_disks;
> + a->reshape_state = reshape_is_starting;
> +
> + super->updates_pending++;
> +update_reshape_exit:
> + if (u->devs_mem.dev)
> + free(u->devs_mem.dev);
> + break;
> + }
> case update_level: {
> struct imsm_update_level *u = (void *)update->buf;
> struct imsm_dev *dev_new, *dev = NULL;
> @@ -5592,8 +5727,58 @@ static void imsm_prepare_update(struct supertype *st,
> struct imsm_super *mpb = super->anchor;
> size_t buf_len;
> size_t len = 0;
> + void *upd_devs;
>
> switch (type) {
> + case update_reshape: {
> + struct imsm_update_reshape *u = (void *)update->buf;
> + struct dl *dl = NULL;
> +
> + u->update_prepared = -1;
> + u->devs_mem.dev = NULL;
> + dprintf("imsm: imsm_prepare_update() for update_reshape\n");
> + if (u->devnum < 0) {
> + dprintf("imsm: No passed device.\n");
> + break;
> + }
> + dprintf("imsm: reshape delta disks is = %i\n", u->reshape_delta_disks);
> + if (u->reshape_delta_disks < 0)
> + break;
> + u->update_prepared = 1;
> + if (u->reshape_delta_disks == 0) {
> + /* for non growing reshape buffers sizes are not affected
> + * but check some parameters
> + */
> + break;
> + }
> + /* count HDDs
> + */
> + u->disks_count = 0;
> + for (dl = super->disks; dl; dl = dl->next)
> + if (dl->index >= 0)
> + u->disks_count++;
> +
> + /* set pointer in monitor address space
> + */
> + upd_devs = (struct imsm_dev *)((void *)u + u->upd_devs_offset);
> + /* allocate memory for new volumes */
> + if (((struct imsm_dev *)(upd_devs))->vol.migr_type != MIGR_GEN_MIGR) {
> + dprintf("imsm: Error.Device is not in migration state.\n");
> + u->update_prepared = -1;
> + break;
> + }
> + dprintf("passed device : %s\n", ((struct imsm_dev *)(upd_devs))->volume);
> + u->devs_mem.dev = calloc(1, u->device_size);
> + if (u->devs_mem.dev == NULL) {
> + u->update_prepared = -1;
> + break;
> + }
> + dprintf("METADATA Copy - using it.\n");
> + memcpy(u->devs_mem.dev, upd_devs, u->device_size);
> + len = disks_to_mpb_size(u->spares_in_update + mpb->num_disks);
> + dprintf("New anchor length is %llu\n", (unsigned long long)len);
> + break;
> + }
> case update_level: {
> struct imsm_update_level *u = (void *) update->buf;
> struct active_array *a;
> @@ -5818,6 +6003,525 @@ static int update_level_imsm(struct supertype *st, struct mdinfo *info,
> return 0;
> }
>
> +int imsm_reshape_is_allowed_on_container(struct supertype *st,
> + struct geo_params *geo)
> +{
> + int ret_val = 0;
> + struct mdinfo *info = NULL;
> + char buf[PATH_MAX];
> + int fd = -1;
> + int device_num = -1;
> + int devices_that_can_grow = 0;
> +
> + dprintf("imsm: imsm_reshape_is_allowed_on_container(ENTER): st->devnum = (%i)\n", st->devnum);
> +
> + if (geo == NULL ||
> + (geo->size != -1) || (geo->level != UnSet) ||
> + (geo->layout != UnSet) || (geo->chunksize != 0)) {
> + dprintf("imsm: Container operation is allowed for raid disks number change only.\n");
> + return ret_val;
> + }
> +
> + snprintf(buf, PATH_MAX, "/dev/md%i", st->devnum);
> + dprintf("imsm: open device (%s)\n", buf);
> + fd = open(buf , O_RDONLY | O_DIRECT);
> + if (fd < 0) {
> + dprintf("imsm: cannot open device\n");
> + return ret_val;
> + }
> +
> + if (geo->raid_disks == UnSet) {
> + dprintf("imsm: for container operation raid disks change is required\n");
> + goto exit_imsm_reshape_is_allowed_on_container;
> + }
> +
> + device_num = 0; /* start from first device (skip container info) */
> + while (device_num > -1) {
> + int result;
> + int minor;
> + unsigned long long array_blocks;
> + struct imsm_map *map = NULL;
> + struct imsm_dev *dev = NULL;
> + struct intel_super *super = NULL;
> + int used_disks;
> +
> +
> + dprintf("imsm: checking device_num: %i\n", device_num);
> + sprintf(st->subarray, "%i", device_num);
> + st->ss->load_super(st, fd, NULL);
> + if (st->sb == NULL) {
> + if (device_num == 0) {
> + /* for the first checked device this is error
> + there should be at least one device to check
> + */
> + dprintf("imsm: error: superblock is NULL during container operation\n");
> + } else {
> + dprintf("imsm: no more devices to check, number of found devices: %i\n",
> + devices_that_can_grow);
> + /* check if any device in container can be grown
> + */
> + if (devices_that_can_grow)
> + ret_val = 1;
> + /* restore superblock, for last device not loaded */
> + sprintf(st->subarray, "%i", 0);
> + st->ss->load_super(st, fd, NULL);
> + }
> + break;
> + }
> + info = sysfs_read(fd, 0, GET_LEVEL|GET_VERSION|GET_DEVS|GET_STATE);
> + if (info == NULL) {
> + dprintf("imsm: Cannot get device info.\n");
> + break;
> + }
> + st->ss->getinfo_super(st, info);
> +
> + if (geo->raid_disks < info->array.raid_disks) {
> + /* we work on container for Online Capacity Expansion
> + * only so raid_disks has to grow
> + */
> + dprintf("imsm: for container operation raid disks increase is required\n");
> + break;
> + }
> + /* check if size is set correctly
> + * wrong conditions could happen when previous reshape was interrupted
> + */
> + super = st->sb;
> + dev = get_imsm_dev(super, device_num);
> + if (dev == NULL) {
> + dprintf("cannot get imsm device\n");
> + ret_val = 0;
> + break;
> + }
> + map = get_imsm_map(dev, 0);
> + if (map == NULL) {
> + dprintf("cannot get imsm device map\n");
> + ret_val = 0;
> + break;
> + }
> + used_disks = imsm_num_data_members(dev);
> + dprintf("read raid_disks = %i\n", used_disks);
> + dprintf("read requested disks = %i\n", geo->raid_disks);
> + array_blocks = map->blocks_per_member * used_disks;
> + /* round array size down to closest MB
> + */
> + array_blocks = (array_blocks >> SECT_PER_MB_SHIFT) << SECT_PER_MB_SHIFT;
> + if (sysfs_set_num(info, NULL, "array_size", array_blocks/2) < 0)
> + dprintf("cannot set array size to %llu\n", array_blocks/2);
> +
> + if (geo->raid_disks > info->array.raid_disks)
> + devices_that_can_grow++;
> +
> + if ((info->array.level != 0) &&
> + (info->array.level != 5)) {
> + /* we cannot use this container other raid level
> + */
> + dprintf("imsm: for container operation wrong raid level (%i) detected\n", info->array.level);
> + break;
> + } else {
> + /* check for platform support for this raid level configuration
> + */
> + struct intel_super *super = st->sb;
> + if (!is_raid_level_supported(super->orom, info->array.level, geo->raid_disks)) {
> + dprintf("platform does not support raid%d with %d disk%s\n",
> + info->array.level, geo->raid_disks, geo->raid_disks > 1 ? "s" : "");
> + break;
> + }
> + }
> +
> + /* all raid5 and raid0 volumes in container
> + * has to be ready for Online Capacity Expansion
> + */
> + result = find_array_minor2(info->text_version, st->ss->external, st->devnum, &minor);
> + if (result < 0) {
> + dprintf("imsm: cannot find array\n");
> + break;
> + }
> + sprintf(info->sys_name, "md%i", minor);
> + if (sysfs_get_str(info, NULL, "array_state", buf, 20) <= 0) {
> + dprintf("imsm: cannot read array state\n");
> + break;
> + }
> + if ((strncmp(buf, "clean", 5) != 0) &&
> + (strncmp(buf, "clear", 5) != 0) &&
> + (strncmp(buf, "active", 6) != 0)) {
> + int index = strlen(buf) - 1;
> +
> + if (index < 0)
> + index = 0;
> + *(buf + index) = 0;
> + fprintf(stderr, "imsm: Error: Array %s is not in proper state (current state: %s). Cannot continue.\n", info->sys_name, buf);
> + break;
> + }
> + if (info->array.level > 0) {
> + if (sysfs_get_str(info, NULL, "sync_action", buf, 20) <= 0) {
> + dprintf("imsm: for container operation no sync action\n");
> + break;
> + }
> + /* check if any reshape is not in progress
> + */
> + if (strncmp(buf, "reshape", 7) == 0) {
> + dprintf("imsm: for container operation reshape is currently in progress\n");
> + break;
> + }
> + }
> + sysfs_free(info);
> + info = NULL;
> + device_num++;
> + }
> + sysfs_free(info);
> + info = NULL;
> +
> +exit_imsm_reshape_is_allowed_on_container:
> + if (fd >= 0)
> + close(fd);
> +
> + dprintf("imsm: imsm_reshape_is_allowed_on_container(Exit) device_num = %i, ret_val = %i\n", device_num, ret_val);
> + if (ret_val)
> + dprintf("\tContainer operation allowed\n");
> + else
> + dprintf("\tError: %i\n", ret_val);
> +
> + return ret_val;
> +}
> +struct mdinfo *get_spares_imsm(int devnum)
> +{
> + int fd = -1;
> + char buf[PATH_MAX];
> + struct mdinfo *info = NULL;
> + struct mdinfo *ret_val = NULL;
> + int cont_id = -1;
> + struct supertype *st = NULL;
> + int find_result;
> +
> + dprintf("imsm: get_spares_imsm for device: %i.\n", devnum);
> +
> + sprintf(buf, "/dev/md%i", devnum);
> + dprintf("try to read container %s\n", buf);
> +
> + cont_id = open(buf, O_RDONLY);
> + if (cont_id < 0) {
> + dprintf("imsm: ERROR: Cannot open container.\n");
> + goto abort;
> + }
> +
> + /* get first volume */
> + st = super_by_fd(cont_id);
> + if (st == NULL) {
> + dprintf("imsm: ERROR: Cannot load container information.\n");
> + goto abort;
> + }
> + sprintf(buf, "/md%i/0", devnum);
> + find_result = find_array_minor2(buf, 1, devnum, &devnum);
> + if (find_result < 0) {
> + dprintf("imsm: ERROR: Cannot find array.\n");
> + goto abort;
> + }
> + sprintf(buf, "/dev/md%i", devnum);
> + fd = open(buf, O_RDONLY);
> + if (fd < 0) {
> + dprintf("imsm: ERROR: Cannot open device.\n");
> + goto abort;
> + }
> + sprintf(st->subarray, "0");
> + st->ss->load_super(st, cont_id, NULL);
> + if (st->sb == NULL) {
> + dprintf("imsm: ERROR: Cannot load array information.\n");
> + goto abort;
> + }
> + info = sysfs_read(fd, 0, GET_LEVEL | GET_VERSION | GET_DEVS | GET_STATE);
> + if (info == NULL) {
> + dprintf("imsm: Cannot get device info.\n");
> + goto abort;
> + }
> + st->ss->getinfo_super(st, info);
> + sprintf(buf, "/dev/md/%s", info->name);
> + ret_val = sysfs_get_unused_spares(cont_id, fd);
> + if (ret_val == NULL) {
> + dprintf("imsm: ERROR: Cannot get spare devices.\n");
> + goto abort;
> + }
> + if (ret_val->array.spare_disks == 0) {
> + dprintf("imsm: ERROR: No available spares.\n");
> + free(ret_val);
> + ret_val = NULL;
> + goto abort;
> + }
> +
> +abort:
> + if (st)
> + st->ss->free_super(st);
> + sysfs_free(info);
> + if (fd > -1)
> + close(fd);
> + if (cont_id > -1)
> + close(cont_id);
> +
> + return ret_val;
> +}
> +
> +/******************************************************************************
> + * function: imsm_create_metadata_update_for_reshape
> + * Function creates update for whole IMSM container.
> + * Slot numbers for new devices are guessed only. Managemon will correct them
> + * when reshape is triggered and md sets slot numbers.
> + * Slot numbers in metadata will be updated with stage_2 update
> + ******************************************************************************/
> +struct imsm_update_reshape *imsm_create_metadata_update_for_reshape(struct supertype *st, struct geo_params *geo)
> +{
> + struct imsm_update_reshape *ret_val = NULL;
> + struct intel_super *super = st->sb;
> + int update_memory_size = 0;
> + struct imsm_update_reshape *u = NULL;
> + struct imsm_map *new_map = NULL;
> + struct mdinfo *spares = NULL;
> + int i;
> + unsigned long long array_blocks;
> + int used_disks;
> + int delta_disks = 0;
> + struct dl *new_disks;
> + int device_size;
> + void *upd_devs;
> +
> + dprintf("imsm imsm_update_metadata_for_reshape(enter) raid_disks = %i\n", geo->raid_disks);
> +
> + if ((geo->raid_disks < super->anchor->num_disks) ||
> + (geo->raid_disks == UnSet))
> + geo->raid_disks = super->anchor->num_disks;
> + delta_disks = geo->raid_disks - super->anchor->num_disks;
> +
> + /* size of all update data without anchor */
> + update_memory_size = sizeof(struct imsm_update_reshape);
> + /* add space for all devices,
> + * then add maps space
> + */
> + device_size = sizeof(struct imsm_dev);
> + device_size += sizeof(struct imsm_map);
> + device_size += 2 * (geo->raid_disks - 1) * sizeof(__u32);
> +
> + update_memory_size += device_size * super->anchor->num_raid_devs;
> + if (delta_disks > 1) {
> + /* now add space for spare disks information
> + */
> + update_memory_size += sizeof(struct dl) * (delta_disks - 1);
> + }
> +
> + u = calloc(1, update_memory_size);
> + if (u == NULL) {
> + dprintf("error: cannot get memory for imsm_update_reshape update\n");
> + return ret_val;
> + }
> + u->reshape_delta_disks = delta_disks;
> + u->update_prepared = -1;
> + u->update_memory_size = update_memory_size;
> + u->type = update_reshape;
> + u->spares_in_update = 0;
> + u->upd_devs_offset = sizeof(struct imsm_update_reshape) + sizeof(struct dl) * (delta_disks - 1);
> + upd_devs = (struct imsm_dev *)((void *)u + u->upd_devs_offset);
> + u->device_size = device_size;
> +
> + for (i = 0; i < super->anchor->num_raid_devs; i++) {
> + struct imsm_dev *old_dev = __get_imsm_dev(super->anchor, i);
> + int old_disk_number;
> + int devnum = -1;
> +
> + u->devnum = -1;
> + if (old_dev == NULL)
> + break;
> +
> + find_array_minor((char *)old_dev->volume, 1, st->devnum, &devnum);
> + if (devnum == geo->dev_id) {
> + __u8 to_state;
> + struct imsm_map *new_map2;
> + int idx;
> +
> + new_map = NULL;
> + imsm_copy_dev(upd_devs, old_dev);
> + new_map = get_imsm_map(upd_devs, 0);
> + old_disk_number = new_map->num_members;
> + new_map->num_members = geo->raid_disks;
> + u->reshape_delta_disks = new_map->num_members - old_disk_number;
> + /* start migration on new device
> + * it puts second map there also
> + */
> +
> + to_state = imsm_check_degraded(super, old_dev, 0);
> + migrate(upd_devs, to_state, MIGR_GEN_MIGR);
> + /* second map length is equal to first map
> + * correct second map length to old value
> + */
> + new_map2 = get_imsm_map(upd_devs, 1);
> + if (new_map2) {
> + if (new_map2->num_members != old_disk_number) {
> + new_map2->num_members = old_disk_number;
> + /* guess new disk indexes
> + */
> + for (idx = new_map2->num_members; idx < new_map->num_members; idx++)
> + set_imsm_ord_tbl_ent(new_map, idx, idx);
> + }
> + u->devnum = geo->dev_id;
> + break;
> + }
> + }
> + }
> +
> + if (delta_disks <= 0) {
> + dprintf("imsm: reshape without grow (disk add).\n");
> + /* finalize update */
> + goto calculate_size_only;
> + }
> +
> + /* now get spare disks list
> + */
> + spares = get_spares_imsm(st->container_dev);
> +
> + if (spares == NULL) {
> + dprintf("imsm: ERROR: Cannot get spare devices.\n");
> + goto exit_imsm_create_metadata_update_for_reshape;
> + }
> + if ((spares->array.spare_disks == 0) ||
> + (u->reshape_delta_disks > spares->array.spare_disks)) {
> + dprintf("imsm: ERROR: No available spares.\n");
> + goto exit_imsm_create_metadata_update_for_reshape;
> + }
> + /* we have got spares
> + * update disk list in imsm_disk list table in anchor
> + */
> + dprintf("imsm: %i spares are available.\n\n", spares->array.spare_disks);
> + new_disks = u->upd_disks;
> + for (i = 0; i < u->reshape_delta_disks; i++) {
> + struct mdinfo *dev = spares->devs;
> + __u32 id;
> + int fd;
> + char buf[PATH_MAX];
> + int rv;
> + unsigned long long size;
> +
> + sprintf(buf, "%d:%d", dev->disk.major, dev->disk.minor);
> + dprintf("open spare disk %s (%s)\n", buf, dev->sys_name);
> + fd = dev_open(buf, O_RDWR);
> + if (fd < 0) {
> + dprintf("\topen failed\n");
> + goto exit_imsm_create_metadata_update_for_reshape;
> + }
> + if (sysfs_disk_to_scsi_id(fd, &id) == 0)
> + new_disks[i].disk.scsi_id = __cpu_to_le32(id);
> + else
> + new_disks[i].disk.scsi_id = __cpu_to_le32(0);
> + new_disks[i].disk.status = CONFIGURED_DISK;
> + rv = imsm_read_serial(fd, NULL, new_disks[i].disk.serial);
> + if (rv != 0) {
> + dprintf("\tcannot read disk serial\n");
> + close(fd);
> + goto exit_imsm_create_metadata_update_for_reshape;
> + }
> + dprintf("\tdisk serial: %s\n", new_disks[i].disk.serial);
> + get_dev_size(fd, NULL, &size);
> + size /= 512;
> + new_disks[i].disk.total_blocks = __cpu_to_le32(size);
> + new_disks[i].disk.owner_cfg_num = super->anchor->disk->owner_cfg_num;
> +
> + new_disks[i].major = dev->disk.major;
> + new_disks[i].minor = dev->disk.minor;
> + /* no relink in update
> + * use table access
> + */
> + new_disks[i].next = NULL;
> +
> + close(fd);
> + spares->devs = dev->next;
> + u->spares_in_update++;
> +
> + free(dev);
> + dprintf("\n");
> + }
> +calculate_size_only:
> + /* calculate new size
> + */
> + if (new_map != NULL) {
> +
> + used_disks = imsm_num_data_members(upd_devs);
> + if (used_disks) {
> + array_blocks = new_map->blocks_per_member * used_disks;
> + /* round array size down to closest MB
> + */
> + array_blocks = (array_blocks >> SECT_PER_MB_SHIFT) << SECT_PER_MB_SHIFT;
> + ((struct imsm_dev *)(upd_devs))->size_low = __cpu_to_le32((__u32)array_blocks);
> + ((struct imsm_dev *)(upd_devs))->size_high = __cpu_to_le32((__u32)(array_blocks >> 32));
> + /* finalize update */
> + ret_val = u;
> + }
> + }
> +
> +exit_imsm_create_metadata_update_for_reshape:
> + /* free spares
> + */
> + if (spares) {
> + while (spares->devs) {
> + struct mdinfo *dev = spares->devs;
> + spares->devs = dev->next;
> + free(dev);
> + }
> + free(spares);
> + }
> +
> + if (ret_val == NULL)
> + free(u);
> +
> + return ret_val;
> +}
> +
> +char *get_volume_for_olce(struct supertype *st, int raid_disks)
> +{
> + char *ret_val = NULL;
> + struct mdinfo *sra = NULL;
> + struct mdinfo info;
> + char *ret_buf;
> + struct intel_super *super = st->sb;
> + int i;
> + int fd = -1;
> + char buf[PATH_MAX];
> +
> + snprintf(buf, PATH_MAX, "/dev/md%i", st->devnum);
> + dprintf("imsm: open device (%s)\n", buf);
> + fd = open(buf , O_RDONLY | O_DIRECT);
> + if (fd < 0) {
> + dprintf("imsm: cannot open device\n");
> + return ret_val;
> + }
> +
> + ret_buf = malloc(PATH_MAX);
> + if (ret_buf == NULL)
> + goto exit_get_volume_for_olce;
> +
> + super = st->sb;
> + for (i = 0; i < super->anchor->num_raid_devs; i++) {
> + sprintf(st->subarray, "%i", i);
> + st->ss->load_super(st, fd, NULL);
> + if (st->sb == NULL)
> + goto exit_get_volume_for_olce;
> + info.devs = NULL;
> + st->ss->getinfo_super(st, &info);
> +
> + if (raid_disks > info.array.raid_disks) {
> + snprintf(ret_buf, PATH_MAX,
> + "%s", info.name);
> + dprintf("Found device for OLCE requested raid_disks = %i, array raid_disks = %i\n",
> + raid_disks, info.array.raid_disks);
> + ret_val = ret_buf;
> + break;
> + }
> + }
> +
> +exit_get_volume_for_olce:
> + if ((ret_val == NULL) && ret_buf)
> + free(ret_buf);
> + sysfs_free(sra);
> + if (fd > -1)
> + close(fd);
> +
> + return ret_val;
> +}
> +
>
> int imsm_reshape_super(struct supertype *st, long long size, int level,
> int layout, int chunksize, int raid_disks,
> @@ -5827,7 +6531,20 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
> struct mdinfo *sra = NULL;
> int fd = -1;
> char buf[PATH_MAX];
> + struct geo_params geo;
> +
> + memset(&geo, 0, sizeof(struct geo_params));
> +
> + geo.dev_name = dev;
> + geo.size = size;
> + geo.level = level;
> + geo.layout = layout;
> + geo.chunksize = chunksize;
> + geo.raid_disks = raid_disks;
>
> + dprintf("imsm: reshape_super called().\n");
> + dprintf("\tfor level : %i\n", geo.level);
> + dprintf("\tfor raid_disks : %i\n", geo.raid_disks);
>
> if (experimental() == 0)
> return ret_val;
> @@ -5839,7 +6556,46 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
> goto imsm_reshape_super_exit;
> }
>
> - if ((size == -1) && (layout == UnSet) && (raid_disks == 0) && (level != UnSet)) {
> + /* verify reshape conditions
> + * on container level we can do almost everything */
> + if (st->subarray[0] == 0) {
> + /* check for delta_disks > 0 and supported raid levels 0 and 5 only in container */
> + if (imsm_reshape_is_allowed_on_container(st, &geo)) {
> + struct imsm_update_reshape *u;
> + char *array;
> +
> + array = get_volume_for_olce(st, geo.raid_disks);
> + if (array) {
> + find_array_minor(array, 1, st->devnum, &geo.dev_id);
> + if (geo.dev_id > 0) {
> + dprintf("imsm: Preparing metadata update for: %s\n", array);
> +
> + st->update_tail = &st->updates;
> + u = imsm_create_metadata_update_for_reshape(st, &geo);
> +
> + if (u) {
> + ret_val = 0;
> + append_metadata_update(st, u, u->update_memory_size);
> + } else
> + dprintf("imsm: Cannot prepare update\n");
> + } else
> + dprintf("imsm: Cannot find array in container\n");
> + free(array);
> + }
> + } else
> + dprintf("imsm: Operation is not allowed on container\n");
> + *st->subarray = 0;
> + goto imsm_reshape_super_exit;
> + } else
> + dprintf("imsm: not a container operation\n");
> +
> + geo.dev_id = -1;
> + find_array_minor(geo.dev_name, 1, st->devnum, &geo.dev_id);
> +
> + /* we have volume so takeover can be performed for single volume only
> + */
> + if ((geo.size == -1) && (geo.layout == UnSet) && (geo.raid_disks == 0) && (geo.level != UnSet) &&
> + (geo.dev_id > -1)) {
> /* ok - this is takeover */
> int container_fd;
> int dn;
> @@ -5867,9 +6623,9 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
> * to/from different than raid10 level
> * if source level is raid0 mdmon is sterted only
> */
> - if (((level == 10) || (sra->array.level == 10) || (sra->array.level == 0)) &&
> - (level != sra->array.level) &&
> - (level > 0)) {
> + if (((geo.level == 10) || (sra->array.level == 10) || (sra->array.level == 0)) &&
> + (geo.level != sra->array.level) &&
> + (geo.level > 0)) {
> st->update_tail = &st->updates;
> err = update_level_imsm(st, sra, sra->name, 0, 0, NULL);
> ret_val = 0;
> @@ -5887,6 +6643,34 @@ imsm_reshape_super_exit:
> return ret_val;
> }
>
> +int imsm_get_new_device_name(struct dl *dl)
> +{
> + int rv;
> + char dv[PATH_MAX];
> + char nm[PATH_MAX];
> + char *dname;
> +
> + if (dl->devname != NULL)
> + return 0;
> +
> + sprintf(dv, "/sys/dev/block/%d:%d", dl->major, dl->minor);
> + memset(nm, 0, sizeof(nm));
> + rv = readlink(dv, nm, sizeof(nm));
> + if (rv > 0) {
> + nm[rv] = '\0';
> + dname = strrchr(nm, '/');
> + if (dname) {
> + char buf[PATH_MAX];
> +
> + dname++;
> + sprintf(buf, "/dev/%s", dname);
> + dl->devname = strdup(buf);
> + }
> + }
> +
> + return rv;
> +}
> +
> struct superswitch super_imsm = {
> #ifndef MDASSEMBLE
> .examine_super = examine_super_imsm,
> diff --git a/sysfs.c b/sysfs.c
> index 3582fed..e316785 100644
> --- a/sysfs.c
> +++ b/sysfs.c
> @@ -800,6 +800,150 @@ int sysfs_unique_holder(int devnum, long rdev)
> return found;
> }
>
> +int sysfs_is_spare_device_belongs_to(int fd, char *devname)
> +{
> + int ret_val = -1;
> + char fname[PATH_MAX];
> + char *base;
> + char *dbase;
> + struct mdinfo *sra;
> + DIR *dir = NULL;
> + struct dirent *de;
> +
> + sra = malloc(sizeof(*sra));
> + if (sra == NULL)
> + goto abort;
> + memset(sra, 0, sizeof(*sra));
> + sysfs_init(sra, fd, -1);
> + if (sra->sys_name[0] == 0)
> + goto abort;
> +
> + memset(fname, 0, PATH_MAX);
> + sprintf(fname, "/sys/block/%s/md/", sra->sys_name);
> + base = fname + strlen(fname);
> +
> + /* Get all the devices as well */
> + *base = 0;
> + dir = opendir(fname);
> + if (!dir)
> + goto abort;
> + while ((de = readdir(dir)) != NULL) {
> + if (de->d_ino == 0 ||
> + strncmp(de->d_name, "dev-", 4) != 0)
> + continue;
> + strcpy(base, de->d_name);
> + dbase = base + strlen(base);
> + *dbase = '\0';
> + dbase = strstr(fname, "/md/");
> + if (dbase && strcmp(devname, dbase) == 0) {
> + ret_val = 1;
> + goto abort;
> + }
> + }
> +abort:
> + if (dir)
> + closedir(dir);
> + sysfs_free(sra);
> +
> + return ret_val;
> +}

This at least needs a comment at the top saying what it does, and why.

Why don't you do a sysfs_read, then do a search based on device id rather
than name? It is always safer to use device id than device name if possible.
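
To make the suggestion concrete, here is a minimal sketch of a membership test keyed on device id. It assumes the member devices are already available as a linked list, as sysfs_read would provide; struct dev_entry and dev_list_contains() are hypothetical stand-ins invented for the example, not mdadm types.

```c
#include <assert.h>
#include <stddef.h>

/* HYPOTHETICAL stand-in for one entry of an array's device list */
struct dev_entry {
	int major;
	int minor;
	struct dev_entry *next;
};

/* Return 1 if device major:minor is in the list, 0 otherwise.
 * Comparing numbers avoids building and parsing sysfs path strings.
 */
static int dev_list_contains(const struct dev_entry *devs, int major, int minor)
{
	const struct dev_entry *d;

	assert(major >= 0 && minor >= 0);
	for (d = devs; d; d = d->next)
		if (d->major == major && d->minor == minor)
			return 1;
	return 0;
}
```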

> +
> +struct mdinfo *sysfs_get_unused_spares(int container_fd, int fd)
> +{
> + char fname[PATH_MAX];
> + char buf[PATH_MAX];
> + char *base;
> + char *dbase;
> + struct mdinfo *ret_val;
> + struct mdinfo *dev;
> + DIR *dir = NULL;
> + struct dirent *de;
> + int is_in;
> + char *to_check;
> +
> + ret_val = malloc(sizeof(*ret_val));
> + if (ret_val == NULL)
> + goto abort;
> + memset(ret_val, 0, sizeof(*ret_val));
> + sysfs_init(ret_val, container_fd, -1);
> + if (ret_val->sys_name[0] == 0)
> + goto abort;
> +
> + sprintf(fname, "/sys/block/%s/md/", ret_val->sys_name);
> + base = fname + strlen(fname);
> +
> + strcpy(base, "raid_disks");
> + if (load_sys(fname, buf))
> + goto abort;
> + ret_val->array.raid_disks = strtoul(buf, NULL, 0);
> +
> + /* Get all the devices as well */
> + *base = 0;
> + dir = opendir(fname);
> + if (!dir)
> + goto abort;
> + ret_val->array.spare_disks = 0;
> + while ((de = readdir(dir)) != NULL) {
> + char *ep;
> + if (de->d_ino == 0 ||
> + strncmp(de->d_name, "dev-", 4) != 0)
> + continue;
> + strcpy(base, de->d_name);
> + dbase = base + strlen(base);
> + *dbase = '\0';
> +
> + to_check = strstr(fname, "/md/");
> + is_in = sysfs_is_spare_device_belongs_to(fd, to_check);
> + if (is_in == -1) {
> + dev = malloc(sizeof(*dev));
> + if (!dev)
> + goto abort;
> + strncpy(dev->text_version, fname, 50);
> +
> + *dbase++ = '/';
> +
> + dev->disk.raid_disk = strtoul(buf, &ep, 10);
> + dev->disk.raid_disk = -1;
> +
> + strcpy(dbase, "block/dev");
> + if (load_sys(fname, buf)) {
> + free(dev);
> + continue;
> + }
> + sscanf(buf, "%d:%d", &dev->disk.major, &dev->disk.minor);
> + strcpy(dbase, "block/device/state");
> + if (load_sys(fname, buf) != 0) {
> + free(dev);
> + continue;
> + }
> + if (strncmp(buf, "offline", 7) == 0) {
> + free(dev);
> + continue;
> + }
> + if (strncmp(buf, "failed", 6) == 0) {
> + free(dev);
> + continue;
> + }
> +
> + /* add this disk to spares list */
> + dev->next = ret_val->devs;
> + ret_val->devs = dev;
> + ret_val->array.spare_disks++;
> + *(dbase-1) = '\0';
> + dprintf("sysfs: found spare: (%s)\n", fname);
> + }
> + }
> + closedir(dir);
> + return ret_val;
> +
> +abort:
> + if (dir)
> + closedir(dir);
> + sysfs_free(ret_val);
> +
> + return NULL;
> +}

Again, why not sysfs_read, then process that data structure?
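
Roughly, once both device lists are in memory, picking the unused spares is a set difference on device ids. The sketch below assumes the container's and the array's devices have been copied into plain arrays; struct dev_id and unused_spares() are hypothetical names for illustration only, and the offline/failed state checks from the patch are left out.

```c
#include <assert.h>
#include <stddef.h>

/* HYPOTHETICAL, simplified device record (major:minor only) */
struct dev_id {
	int major;
	int minor;
};

/* Copy into out[] every container device that is not a member of the
 * array. Returns the number of entries written (at most out_len).
 */
static size_t unused_spares(const struct dev_id *container, size_t n_container,
			    const struct dev_id *array, size_t n_array,
			    struct dev_id *out, size_t out_len)
{
	size_t i, j, n = 0;

	assert(out != NULL || out_len == 0);
	for (i = 0; i < n_container && n < out_len; i++) {
		int in_array = 0;

		for (j = 0; j < n_array; j++) {
			if (container[i].major == array[j].major &&
			    container[i].minor == array[j].minor) {
				in_array = 1;
				break;
			}
		}
		if (!in_array)
			out[n++] = container[i];
	}
	return n;
}
```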

> +
> int sysfs_freeze_array(struct mdinfo *sra)
> {
> /* Try to freeze resync/rebuild on this array/container.
> diff --git a/util.c b/util.c
> index f220792..396f6d8 100644
> --- a/util.c
> +++ b/util.c
> @@ -1869,3 +1869,151 @@ inline int experimental(void)
> }
> }
>
> +int path2devnum(char *pth)
> +{
> + char *ep;
> + int fd = -1;
> + char *dev_pth = NULL;
> + char *dev_str;
> + int dev_num = -1;
> +
> + fd = open(pth, O_RDONLY);
> + if (fd < 0)
> + return dev_num;
> + close(fd);
> + dev_pth = canonicalize_file_name(pth);
> + if (dev_pth == NULL)
> + return dev_num;
> + dev_str = strrchr(dev_pth, '/');
> + if (dev_str) {
> + while (!isdigit(dev_str[0]))
> + dev_str++;
> + dev_num = strtoul(dev_str, &ep, 10);
> + if (*ep != '\0')
> + dev_num = -1;
> + }
> +
> + if (dev_pth)
> + free(dev_pth);
> +
> + return dev_num;
> +}
> +
> +extern void map_read(struct map_ent **map);
> +extern void map_free(struct map_ent *map);
> +int find_array_minor(char *text_version, int external, int container, int *minor)
> +{
> + int i;
> + char path[PATH_MAX];
> + struct stat s;
> +
> + if (minor == NULL)
> + return -2;
> +
> + snprintf(path, PATH_MAX, "/dev/md/%s", text_version);
> + i = path2devnum(path);
> + if (i > -1) {
> + *minor = i;
> + return 0;
> + }
> +
> + i = path2devnum(text_version);
> + if (i > -1) {
> + *minor = i;
> + return 0;
> + }
> +
> + if (container > 0) {
> + struct map_ent *map = NULL;
> + struct map_ent *m;
> + char cont[PATH_MAX];
> +
> + snprintf(cont, PATH_MAX, "/md%i/", container);
> + map_read(&map);
> + for (m = map; m; m = m->next) {
> + int index;
> + unsigned int len = 0;
> + char buf[PATH_MAX];
> +
> + /* array has to belong to proper container
> + */
> + if (strncmp(cont, m->metadata, 6) != 0)
> + continue;
> + /* begin of array name in map have to be the same
> + * as array name in metadata
> + */
> + if (strncmp(m->path, path, strlen(path)) != 0)
> + continue;
> + /* array name has to be followed by '_' char
> + */
> + len = strlen(path);
> + if (*(m->path + len) != '_')
> + continue;
> + /* then we have to have valid index
> + */
> + len++;
> + if (strlen(m->path + len) <= 0)
> + continue;
> + /* index has to be last position in array name
> + */
> + index = atoi(m->path + strlen(path) + 1);
> + snprintf(buf, PATH_MAX, "%i", index);
> + len += strlen(buf);
> + if (len != strlen(m->path))
> + continue;
> + dprintf("Found %s device based on mdadm maps\n", m->path);
> + *minor = m->devnum;
> + map_free(map);
> + return 0;
> + }
> + map_free(map);
> + }
> +
> + for (i = 127; i >= 0; i--) {
> + char buf[PATH_MAX];
> +
> + snprintf(path, PATH_MAX, "/sys/block/md%d/md/", i);
> + if (stat(path, &s) != -1) {
> + strcat(path, "metadata_version");
> + if (load_sys(path, buf))
> + continue;
> + if (external) {
> + char *version = strchr(buf, ':');
> + if (version && strcmp(version + 1,
> + text_version))
> + continue;
> + } else {
> + if (strcmp(buf, text_version))
> + continue;
> + }
> + *minor = i;
> + return 0;
> + }
> + }
> +
> +
> + return -1;
> +}
> +
> +/* find_array_minor2 looks for frozen devices also
> + */
> +int find_array_minor2(char *text_version, int external, int container, int *minor)
> +{
> + int result;
> + char buf[PATH_MAX];
> +
> + strcpy(buf, text_version);
> + result = find_array_minor(text_version, external, container, minor);
> + if (result < 0) {
> + /* try to find frozen array also
> + */
> + char buf[PATH_MAX];
> +
> + strcpy(buf, text_version);
> +
> + *buf = '-';
> + result = find_array_minor(buf, external, container, minor);
> + }
> + return result;
> +}
> +

This all looks way too complicated, should be using map_read, and it is totally
undocumented, so it is too hard to figure out what you were really trying to
do.
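
As far as can be told from the code, the first loop accepts a map entry only when its name is the array name followed by '_' and a decimal index. A minimal sketch of that check, assuming this reading is right; match_member_name() is a hypothetical helper, not part of mdadm, and unlike the atoi() round trip in the patch it also accepts leading zeros:

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not an mdadm function): returns 1 and stores the
 * index when map_path is exactly "<name>_<decimal index>", else returns 0.
 */
static int match_member_name(const char *map_path, const char *name, int *index)
{
	size_t len = strlen(name);
	const char *p;
	char *end;
	long val;

	/* the map entry has to start with the array name ... */
	if (strncmp(map_path, name, len) != 0)
		return 0;
	/* ... followed by '_' ... */
	if (map_path[len] != '_')
		return 0;
	/* ... followed by at least one digit and nothing else */
	p = map_path + len + 1;
	if (!isdigit((unsigned char)*p))
		return 0;
	val = strtol(p, &end, 10);
	if (*end != '\0')
		return 0;
	*index = (int)val;
	return 1;
}
```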

NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH 21/53] imsm: FIX: core dump during imsm metadata writing

am 29.11.2010 02:54:21 von NeilBrown

On Fri, 26 Nov 2010 09:06:38 +0100 Adam Kwolek wrote:

> A wrong number of disks during a metadata update causes a core dump.
> The new disk count, based on internal mdmon information, has to be used for the calculation (not the count previously read from metadata).
>
> Signed-off-by: Adam Kwolek

I've applied this, thanks.

One small comment ....

> ---
>
> super-intel.c | 27 ++++++++++++++++++---------
> 1 files changed, 18 insertions(+), 9 deletions(-)
>
> diff --git a/super-intel.c b/super-intel.c
> index 98e4c6d..1231fa8 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -3510,8 +3510,9 @@ static int write_super_imsm_spares(struct intel_super *super, int doclose)
> return 0;
> }
>
> -static int write_super_imsm(struct intel_super *super, int doclose)
> +static int write_super_imsm(struct supertype *st, int doclose)
> {
> + struct intel_super *super = st->sb;
> struct imsm_super *mpb = super->anchor;
> struct dl *d;
> __u32 generation;
> @@ -3519,6 +3520,7 @@ static int write_super_imsm(struct intel_super *super, int doclose)
> int spares = 0;
> int i;
> __u32 mpb_size = sizeof(struct imsm_super) - sizeof(struct imsm_disk);
> + int num_disks = 0;
>
> /* 'generation' is incremented everytime the metadata is written */
> generation = __le32_to_cpu(mpb->generation_num);
> @@ -3531,21 +3533,28 @@ static int write_super_imsm(struct intel_super *super, int doclose)
> if (mpb->orig_family_num == 0)
> mpb->orig_family_num = mpb->family_num;
>
> - mpb_size += sizeof(struct imsm_disk) * mpb->num_disks;
> for (d = super->disks; d; d = d->next) {
> if (d->index == -1)
> spares++;
> - else
> + else {
> mpb->disk[d->index] = d->disk;
> + num_disks++;
> + }
> }
> - for (d = super->missing; d; d = d->next)
> + for (d = super->missing; d; d = d->next) {
> mpb->disk[d->index] = d->disk;
> + num_disks++;
> + }
> + mpb->num_disks = num_disks;
> + mpb_size += sizeof(struct imsm_disk) * mpb->num_disks;
>
> for (i = 0; i < mpb->num_raid_devs; i++) {
> struct imsm_dev *dev = __get_imsm_dev(mpb, i);
> -
> - imsm_copy_dev(dev, get_imsm_dev(super, i));
> - mpb_size += sizeof_imsm_dev(dev, 0);
> + struct imsm_dev *dev2 = get_imsm_dev(super, i);
> + if ((dev) && (dev2)) {

I don't like these extra parentheses. They are just noise and don't improve
clarity. Please just:
if (dev && dev2) {

NeilBrown

> + imsm_copy_dev(dev, dev2);
> + mpb_size += sizeof_imsm_dev(dev, 0);
> + }
> }
> mpb_size += __le32_to_cpu(mpb->bbm_log_size);
> mpb->mpb_size = __cpu_to_le32(mpb_size);
> @@ -3665,7 +3674,7 @@ static int write_init_super_imsm(struct supertype *st)
> struct dl *d;
> for (d = super->disks; d; d = d->next)
> Kill(d->devname, NULL, 0, 1, 1);
> - return write_super_imsm(st->sb, 1);
> + return write_super_imsm(st, 1);
> }
> }
> #endif
> @@ -4938,7 +4947,7 @@ static void imsm_sync_metadata(struct supertype *container)
> if (!super->updates_pending)
> return;
>
> - write_super_imsm(super, 0);
> + write_super_imsm(container, 0);
>
> super->updates_pending = 0;
> }


Re: [PATCH 22/53] Send information to managemon about reshape request

am 29.11.2010 02:56:46 von NeilBrown

On Fri, 26 Nov 2010 09:06:45 +0100 Adam Kwolek wrote:

> When the monitor has made a metadata update and indicates a request to managemon to continue reshape initialization, kick managemon to perform its action, unless the array is being deactivated.
>
> Signed-off-by: Adam Kwolek
> ---
>
> monitor.c | 14 +++++++++++++-
> super-intel.c | 14 ++++++++++++++
> 2 files changed, 27 insertions(+), 1 deletions(-)
>
> diff --git a/monitor.c b/monitor.c
> index 5705a9b..05bd96c 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -399,8 +399,20 @@ static int read_and_act(struct active_array *a)
> signal_manager();
> }
>
> - if (deactivate)
> + if (deactivate) {
> a->container = NULL;
> + /* break reshape also
> + */
> + if (a->reshape_state != reshape_in_progress)
> + a->reshape_state = reshape_not_active;
> + }
> +
> + /* signal manager when real delta_disks value is present
> + */
> + if ((a->reshape_state != reshape_not_active) &&
> + (a->reshape_state != reshape_in_progress)) {
> + signal_manager();
> + }

I suspect this should be in the same patch which I mentioned earlier which
introduces reshape_state etc.

I haven't applied it for now.

NeilBrown

>
> return dirty;
> }
> diff --git a/super-intel.c b/super-intel.c
> index 1231fa8..56f7ea4 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -4755,6 +4755,16 @@ static int imsm_set_array_state(struct active_array *a, int consistent)
> __u8 map_state = imsm_check_degraded(super, dev, failed);
> __u32 blocks_per_unit;
>
> + if (a->reshape_state != reshape_not_active) {
> + /* array state change is blocked due to reshape action
> + * metadata changes are during applying only before reshape.
> + *
> + * '1' is returned to indicate that array is clean
> + */
> + dprintf("imsm: prepare to reshape\n");
> + return 1;
> + }
> +
> /* before we activate this array handle any missing disks */
> if (consistent == 2)
> handle_missing(super, dev);
> @@ -5106,6 +5116,10 @@ static struct mdinfo *imsm_activate_spare(struct active_array *a,
>
> dprintf("imsm: activate spare: inst=%d failed=%d (%d) level=%d\n",
> inst, failed, a->info.array.raid_disks, a->info.array.level);
> +
> + if (a->reshape_state != reshape_not_active)
> + return NULL;
> +
> if (imsm_check_degraded(super, dev, failed) != IMSM_T_STATE_DEGRADED)
> return NULL;
>


Re: [PATCH 00/53] External Metadata Reshape

am 29.11.2010 04:32:24 von NeilBrown

On Fri, 26 Nov 2010 09:03:51 +0100 Adam Kwolek wrote:

> This patch series (combines 3 previous series in to one) for mdadm and introduces features:
> - Freeze array/container and new reshape vectors: patches 0001 to 0015
> mdadm devel 3.2 contains patches 0001 to 0013 already, patches 0014 and 0016 fixes 2 problems in this functionality
> - Takeover: patches 0016 to 0017
> - Online Capacity Expansion (OLCE): patches 0018 to 0036
> - Checkpointing: patches 0037 to 0045
> - Migrations: patches 0045 to 0053
> 1. raid0 to raid5 : patch 0051
> 2. raid5 to raid0 : patch 0052
> 3. chunk size migration) : patch 0053
>
> Patches are for mdadm 3.1.4 and Neil's feedback for 6 first OLCE patches is included.
> There should be no patch corruption problem now, as it is sent directly from stgit (not outlook).
>
> For checkpointing md patch "md: raid5: update suspend_hi during reshape" is required also (sent before).

I think I've decided that I don't want to apply this patch to raid5. I
discussed this with Dan Williams at the plumbers conference and he took
notes, so hopefully he can correct anything in the following.

I think it was me that suggested this patch in the first place, so it
probably seemed like a good idea at the time. But I no longer think so.

This is how I think it should work - which should probably go in
external-reshape-design.txt.

An important principle is that everything works much like it currently does
for the native metadata case except that some of the work normally performed
by the kernel is now performed by mdmon. So the only changes to mdadm needed
to work with external metadata in general involve communicating directly with
mdmon where it would normally only communicate with the kernel. (Of course
there will be other changes required to mdadm to deal with the specifics of
reshaping imsm and general container-based metadata.)

Also, the atomicity provided by the kernel may not be implicitly available to
the kernel+mdmon pairing, so mdadm may get involved in negotiating the
required atomicity.

Just to be explicit, we are talking here about a 'reshape' which requires
restriping the array, moving data around and taking a long time. Reshapes
which are atomic or just require a resync are much simpler than this.

1/ mdadm freezes the array so that no recovery or reshape can start.
2/ mdadm sets sync_max to 0 so even when the array is unfrozen, no data will
be relocated. It also sets suspend_lo and suspend_hi to zero.
3/ mdadm tells the kernel about the requested reshape, setting some or all of
chunk_size, layout, level, raid_disks (and later, data_offset for each
device).
4/ mdadm checks that mdmon has noticed the changes and has updated the
metadata to show a reshape-in-progress (ping_monitor).
5/ mdadm unfreezes the array for mdmon (change the '-' in metadata_version
back to '/') and calls ping_monitor
6/ mdmon assigns spares as appropriate and tells the kernel which slot to use
for each. This requires a kernel change. The slot number will be stored
in saved_raid_disk. ping_monitor doesn't complete until the spares have
been assigned.
7/ mdadm asks the kernel to start reshape (echo reshape > sync_action).
This causes md_check_recovery to call remove_and_add_spares which will
add the chosen spares to the required slots and will create the reshape
thread. That thread will not actually do anything yet as sync_max
is still 0.

8/ Now we loop, performing backups, reshaping data, and updating the metadata.
It proceeds in a 'double-buffered' process where we are backing up one
section while the previous section is being reshaped.

8a/ mdadm sets suspend_hi to a larger number. This blocks until intervening
IO is flushed.
8b/ mdadm makes a backup copy of the data up to the new suspend_hi
8c/ mdadm updates sync_max to match suspend_hi.
8d/ kernel starts reshaping data and periodically signals progress through
sync_completed
8e/ mdmon notices sync_completed changing and updates the metadata to
record how far the reshape has progressed.
8f/ mdadm notices sync_completed changing and when it passes the end of the
oldest of the two sections being worked on it uses ping_monitor to
ensure the metadata is up-to-date and then moves suspend_lo to the
beginning of the next section, and then goes back to 8a.

9/ When sync_completed reaches the end of the array, mdmon will notice and
update the metadata to show that the reshape has finished, and mdadm will
set both suspend_lo and suspend_hi to beyond the end of the array, and all
is done.
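
The window arithmetic in steps 8a to 8f can be sketched as a small model. This is only an illustration under simplifying assumptions: it is single-threaded, it pretends the kernel always reaches sync_max immediately, so it does not show the overlap of backup and reshape, and section must be non-zero. The struct and function names are hypothetical and not taken from mdadm.

```c
/* Hypothetical model of the three sysfs values mdadm moves in step 8. */
struct reshape_window {
	unsigned long long suspend_lo;
	unsigned long long suspend_hi;
	unsigned long long sync_max;
};

/* Steps 8a-8c: extend the suspended region by one section (clamped to the
 * end of the array) and let the kernel reshape up to that point.
 */
static void window_advance(struct reshape_window *w,
			   unsigned long long section,
			   unsigned long long array_end)
{
	unsigned long long hi = w->suspend_hi + section;

	if (hi > array_end)
		hi = array_end;
	w->suspend_hi = hi;	/* 8a */
	/* 8b: the backup of the newly suspended range would happen here */
	w->sync_max = hi;	/* 8c */
}

/* Step 8f: once sync_completed has passed the end of the oldest section,
 * release that section by moving suspend_lo forward.
 * Returns 1 if suspend_lo moved, 0 if the caller has to keep waiting.
 */
static int window_release(struct reshape_window *w,
			  unsigned long long section,
			  unsigned long long sync_completed)
{
	unsigned long long oldest_end = w->suspend_lo + section;

	if (oldest_end > w->suspend_hi)
		oldest_end = w->suspend_hi;
	if (sync_completed < oldest_end)
		return 0;
	w->suspend_lo = oldest_end;
	return 1;
}

/* Drive the model to the end of the array; returns the number of rounds. */
static int simulate_reshape(unsigned long long array_end,
			    unsigned long long section)
{
	struct reshape_window w = { 0, 0, 0 };
	int rounds = 0;

	while (w.suspend_lo < array_end) {
		window_advance(&w, section, array_end);
		/* assumption: sync_completed has caught up with sync_max */
		window_release(&w, section, w.sync_max);
		rounds++;
	}
	return rounds;
}
```

At every point in this model suspend_lo <= sync_max <= suspend_hi holds, which is the ordering the steps above rely on.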

In the case where the number of data devices is changing there are large
periods of time when no backup of data is needed. In this case mdmon still
needs to update the metadata from time to time, and the kernel needs to be
made to wait for that update. This is done with sync_max. So in those cases
the primary steps in the above become just 8c, 8d, 8e, 8f, and
suspend_lo,suspend_hi aren't changed.

It is tempting to have mdmon update sync_max, as then mdadm would not be
needed at all when no backup is happening. I think that is the path of
reasoning I followed previously which led to having the kernel update
suspend_hi. But I don't think that is a good design now.
Sometimes it really has to be mdadm updating sync_max, so it should always
be mdadm updating sync_max.
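
Updating sync_max from user space amounts to writing a decimal number to the attribute under the array's md directory in sysfs. A minimal sketch using plain POSIX calls; md_attr_write_ull() is a hypothetical name used only for illustration, and the directory is passed in by the caller (for a real array it would be something like /sys/block/md0/md):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not an mdadm function): write a decimal value to
 * <md_dir>/<attr>. Returns 0 on success, -1 on failure.
 */
static int md_attr_write_ull(const char *md_dir, const char *attr,
			     unsigned long long value)
{
	char path[256];
	char buf[32];
	int fd, len, n;

	n = snprintf(path, sizeof(path), "%s/%s", md_dir, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	len = snprintf(buf, sizeof(buf), "%llu\n", value);

	/* sysfs attributes already exist, so no O_CREAT here */
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	n = (int)write(fd, buf, (size_t)len);
	close(fd);
	return n == len ? 0 : -1;
}
```

Step 8c would then be a call such as md_attr_write_ull("/sys/block/md0/md", "sync_max", new_suspend_hi), with the array name being an assumption of the example.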

It should be a reasonably simple change to your code to follow this
pattern. If the only problem that I find in any of your patches is that they
don't quite follow this pattern properly I will happily fix them to follow
the pattern and apply them with the fix.

> New vectors (introduced by Dan Williams) reshape_super() and manage_reshape() are used in whole process.
>
> In the next step, I'll rebase it to mdadm devel 3.2, meanwhile Krzysztof Wojcik will prepare additional fixes for raid10<->raid0 takeover
>
> I think that few patches can be taken in to devel 3.2 at this moment i.e.:
> 0014-FIX-Cannot-exit-monitor-after-takeover.patch
> 0015-FIX-Unfreeze-not-only-container-for-external-metada.patch
> 0016-Add-takeover-support-for-external-meta.patch
> 0018-Treat-feature-as-experimental.patch
> 0033-Prepare-and-free-fdlist-in-functions.patch
> 0034-Compute-backup-blocks-in-function.patch

I would really rather take as much as is ready. The fewer times I have to
review a patch, the better.
So if a patch looks close enough that I can apply it as-is, or with just a
few fixes, then I will. That way you only need to resend the patches that
need serious work.

NeilBrown

Re: [PATCH 00/53] External Metadata Reshape

am 29.11.2010 05:07:30 von NeilBrown

On Mon, 29 Nov 2010 14:32:24 +1100 Neil Brown wrote:

> 6/ mdmon assigns spares as appropriate and tells the kernel which slot to use
> for each. This requires a kernel change. The slot number will be stored
> in saved_raid_disk. ping_monitor doesn't complete until the spares have
> been assigned.

Actually, this doesn't require a kernel change. We've had this functionality
since 2.6.26 - commit 6c2fce2ef6b4821c21b5c42c7207cb9cf8c87eda

If you add a spare to an array while sync_action is frozen, and then set
/sys/block/mdXX/md/dev-YYY/slot
to some number, it will move that device to fill that slot.
Then when you start a reshape (or whatever) it will leave the device in that
slot and 'do the right thing'.

So the "Verify slots in meta against slot numbers set by md" shouldn't be
needed. mdmon can explicitly request a slot number, and md will honour that.
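
For illustration, building the name of that slot attribute is a single snprintf(). A small sketch; slot_attr_path() is a hypothetical helper, and the array and device names are supplied by the caller:

```c
#include <stdio.h>

/* Hypothetical helper: build "/sys/block/<array>/md/dev-<dev>/slot".
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int slot_attr_path(char *buf, size_t len,
			  const char *array, const char *dev)
{
	int n = snprintf(buf, len, "/sys/block/%s/md/dev-%s/slot", array, dev);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Writing the wanted slot number to that path while sync_action is frozen is what requests the slot, as described above.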

Or have you tried that and found it doesn't work for some reason???

NeilBrown


Re: [PATCH 29/53] Add spares to raid0 array using takeover

am 30.11.2010 03:00:19 von NeilBrown

On Fri, 26 Nov 2010 09:07:38 +0100 Adam Kwolek wrote:

> Spares are used by Online Capacity Expansion to expand array.
> To run expansion on raid0, spares have to be added to raid0 volume also.
> Raid0 cannot have spares (no mdmon runs for raid0 array).
> To do this, takeover to raid5 (and back) is used. mdmon runs temporarily for raid5 and spare drives can be added to the container.
>
> Signed-off-by: Adam Kwolek

I don't like this patch at all.

There is a lot of code in here that is very specific to IMSM, that has been
placed directly in Manage.c. That is bad.

I gather you want to support

mdadm /dev/md/imsm --add /dev/newdisk

on a RAID0 and have it convert to RAID5, do the restripe, and convert back.

The IMSM specific code goes through a container and converts every member
array to raid5.

The
mdadm /dev/md/imsm --add /dev/newdisk

should just add the device to the container as a spare. This might require
updating the IMSM metadata in some way, but doesn't require changing anything
to RAID5 and back.

Then when you do

mdadm -G /dev/imsm --raid-disks=4

it will work with the imsm code to update every member array and add parts of
the newdisk to each array.

The conversion to RAID5 needs to happen in Grow_reshape when raid-disks is
changed on a RAID0. It needs to happen while the array is frozen.

Not in Manage.c

NeilBrown

> ---
>
> Manage.c | 154 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 153 insertions(+), 1 deletions(-)
>

Re: [PATCH 30/53] imsm: FIX: Fill sys_name field in getinfo_super()

am 30.11.2010 03:06:04 von NeilBrown

On Fri, 26 Nov 2010 09:07:46 +0100 Adam Kwolek wrote:

> sys_name field is not filled during getinfo_super() call.

No metadata ever sets sys_name. I agree this can be confusing, but it is the
sort of thing that needs to be tidied up across-the-board, not just piecemeal
like this.

What was your particular need for setting sys_name?

thanks,
NeilBrown

>
> Signed-off-by: Adam Kwolek
> ---
>
> super-intel.c | 11 +++++++++++
> 1 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/super-intel.c b/super-intel.c
> index 42219f6..3243132 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -1489,6 +1489,7 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info)
> struct imsm_map *map = get_imsm_map(dev, 0);
> struct dl *dl;
> char *devname;
> + int minor;
>
> for (dl = super->disks; dl; dl = dl->next)
> if (dl->raiddisk == info->disk.raid_disk)
> @@ -1560,6 +1561,11 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info)
> free(devname);
> info->safe_mode_delay = 4000; /* 4 secs like the Matrix driver */
> uuid_from_super_imsm(st, info->uuid);
> +
> + /* fill sys_name field
> + */
> + if (find_array_minor2(info->text_version, 1, st->devnum, &minor) == 0)
> + sprintf(info->sys_name, "md%i", minor);
> }
>
> /* check the config file to see if we can return a real uuid for this spare */
> @@ -1611,6 +1617,7 @@ static void getinfo_super_imsm(struct supertype *st, struct mdinfo *info)
> {
> struct intel_super *super = st->sb;
> struct imsm_disk *disk;
> + int minor;
>
> if (super->current_vol >= 0) {
> getinfo_super_imsm_volume(st, info);
> @@ -1712,6 +1719,10 @@ static void getinfo_super_imsm(struct supertype *st, struct mdinfo *info)
> memcpy(info->uuid, uuid_match_any, sizeof(int[4]));
> fixup_container_spare_uuid(info);
> }
> + /* fill sys_name field
> + */
> + if (find_array_minor2(info->text_version, 1, st->devnum, &minor) == 0)
> + sprintf(info->sys_name, "md%i", minor);
> }
>
> static int is_raid_level_supported(const struct imsm_orom *orom, int level, int raiddisks);


Re: [PATCH 31/53] imsm: FIX: Fill delta_disks field in getinfo_super()

am 30.11.2010 03:07:14 von NeilBrown

On Fri, 26 Nov 2010 09:07:53 +0100 Adam Kwolek wrote:

> delta_disks field is not always filled during getinfo_super() call.

delta_disks is only meaningful if reshape_active is set, so you would need
to set that too. As well as new_layout, new_chunk, and reshape_progress.
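
A minimal sketch of filling these fields together, so that delta_disks is never reported without reshape_active. The two structs are hypothetical stand-ins that only borrow the field names mentioned above; how the current and target shape would be derived from the imsm maps is not shown and is left as an assumption:

```c
/* Hypothetical subset of the reshape-related fields named above. */
struct reshape_fields {
	int reshape_active;
	int delta_disks;
	int new_layout;
	int new_chunk;
	unsigned long long reshape_progress;
};

/* Hypothetical description of an array shape. */
struct shape {
	int raid_disks;
	int layout;
	int chunk;
};

/* target == NULL means that no reshape is in progress. */
static void fill_reshape_fields(struct reshape_fields *out,
				const struct shape *cur,
				const struct shape *target,
				unsigned long long progress)
{
	if (!target) {
		/* without reshape_active, delta_disks carries no meaning */
		out->reshape_active = 0;
		out->delta_disks = 0;
		out->new_layout = cur->layout;
		out->new_chunk = cur->chunk;
		out->reshape_progress = 0;
		return;
	}
	out->reshape_active = 1;
	out->delta_disks = target->raid_disks - cur->raid_disks;
	out->new_layout = target->layout;
	out->new_chunk = target->chunk;
	out->reshape_progress = progress;
}
```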

Thanks,
NeilBrown

>
> Signed-off-by: Adam Kwolek
> ---
>
> super-intel.c | 7 +++++++
> 1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/super-intel.c b/super-intel.c
> index 3243132..3f75550 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -1487,6 +1487,7 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info)
> struct intel_super *super = st->sb;
> struct imsm_dev *dev = get_imsm_dev(super, super->current_vol);
> struct imsm_map *map = get_imsm_map(dev, 0);
> + struct imsm_map *map2 = get_imsm_map(dev, 1);
> struct dl *dl;
> char *devname;
> int minor;
> @@ -1566,6 +1567,12 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info)
> */
> if (find_array_minor2(info->text_version, 1, st->devnum, &minor) == 0)
> sprintf(info->sys_name, "md%i", minor);
> +
> + /* fill delta_disks field
> + */
> + info->delta_disks = 0;
> + if (map2)
> + info->delta_disks = map->num_members - map2->num_members;
> }
>
> /* check the config file to see if we can return a real uuid for this spare */


Re: [PATCH 32/53] imsm: FIX: spare list contains one device several times

am 30.11.2010 03:17:25 von NeilBrown

On Fri, 26 Nov 2010 09:08:01 +0100 Adam Kwolek wrote:

> The assumption in the spare search was that, after a new device is picked, it is added to the array before the next search.
> This is what makes each call return a different disk.
>
> When the spares list is created during Online Capacity Expansion, the device list is collected first and only then are all devices added to md.
> A device picked from the spares pool therefore has to be checked against the devices picked so far. Otherwise, the same disk will be returned every time.
> Already picked devices are stored in a list, and this list is also used to verify new devices.
>
> Signed-off-by: Adam Kwolek

I have applied this without the change in imsm_grow_array, because I haven't
yet accepted the patch which adds that function.
So when we do add imsm_grow_array, imsm_add_spare will be ready for it.
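
The check against already picked devices can be sketched on its own. The struct below is a hypothetical stand-in for the list the patch passes in. The sketch walks the list with a local cursor, so the list head is still intact when the next candidate is tested:

```c
#include <stddef.h>

/* Hypothetical stand-in for an entry in the list of picked devices. */
struct picked_dev {
	int major;
	int minor;
	struct picked_dev *next;
};

/* Returns 1 if (major, minor) is already in the list, 0 otherwise. */
static int already_picked(const struct picked_dev *picked, int major, int minor)
{
	const struct picked_dev *p;	/* local cursor, head is not modified */

	for (p = picked; p; p = p->next)
		if (p->major == major && p->minor == minor)
			return 1;
	return 0;
}
```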

Also the two conversions from dprintf to fprintf didn't apply to my current
tree so I dropped them. I mention it only because this is a perfect example
of when it might be appropriate to include those extra changes in this patch,
but it would have been nice to see it mentioned in the description:

Two dprintf calls change to fprintf because .....

Thanks,
NeilBrown

> ---
>
> super-intel.c | 24 +++++++++++++++++-------
> 1 files changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/super-intel.c b/super-intel.c
> index 3f75550..e4ba875 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -5006,7 +5006,8 @@ static struct dl *imsm_readd(struct intel_super *super, int idx, struct active_a
> }
>
> static struct dl *imsm_add_spare(struct intel_super *super, int slot,
> - struct active_array *a, int activate_new)
> + struct active_array *a, int activate_new,
> + struct mdinfo *additional_test_list)
> {
> struct imsm_dev *dev = get_imsm_dev(super, a->info.container_member);
> int idx = get_imsm_disk_idx(dev, slot);
> @@ -5032,6 +5033,16 @@ static struct dl *imsm_add_spare(struct intel_super *super, int slot,
> }
> if (d)
> continue;
> + while (additional_test_list) {
> + if (additional_test_list->disk.major == dl->major &&
> + additional_test_list->disk.minor == dl->minor) {
> + dprintf("%x:%x already in additional test list\n", dl->major, dl->minor);
> + break;
> + }
> + additional_test_list = additional_test_list->next;
> + }
> + if (additional_test_list)
> + continue;
>
> /* skip in use or failed drives */
> if (is_failed(&dl->disk) || idx == dl->index ||
> @@ -5165,9 +5176,9 @@ static struct mdinfo *imsm_activate_spare(struct active_array *a,
> */
> dl = imsm_readd(super, i, a);
> if (!dl)
> - dl = imsm_add_spare(super, i, a, 0);
> + dl = imsm_add_spare(super, i, a, 0, NULL);
> if (!dl)
> - dl = imsm_add_spare(super, i, a, 1);
> + dl = imsm_add_spare(super, i, a, 1, NULL);
> if (!dl)
> continue;
>
> @@ -6422,11 +6433,11 @@ struct mdinfo *get_spares_imsm(int devnum)
> sprintf(buf, "/dev/md/%s", info->name);
> ret_val = sysfs_get_unused_spares(cont_id, fd);
> if (ret_val == NULL) {
> - dprintf("imsm: ERROR: Cannot get spare devices.\n");
> + fprintf(stderr, Name": imsm: ERROR: Cannot get spare devices.\n");
> goto abort;
> }
> if (ret_val->array.spare_disks == 0) {
> - dprintf("imsm: ERROR: No available spares.\n");
> + fprintf(stderr, Name": imsm: ERROR: No available spares.\n");
> free(ret_val);
> ret_val = NULL;
> goto abort;
> @@ -7013,10 +7024,9 @@ struct mdinfo *imsm_grow_array(struct active_array *a)
> for (i = prev_raid_disks; i < new_raid_disks; i++) {
> /* OK, this device can be added. Try to add.
> */
> - dl = imsm_add_spare(super, i, a, 0);
> + dl = imsm_add_spare(super, i, a, 0, rv);
> if (!dl)
> continue;
> -
> if (dl->index < 0)
> dl->index = i;
> /* found a usable disk with enough space */


Re: [PATCH 33/53] Prepare and free fdlist in functions

am 30.11.2010 03:28:33 von NeilBrown

On Fri, 26 Nov 2010 09:08:08 +0100 Adam Kwolek wrote:

> Creation of the fd handle table is moved into a function for code reuse.
>
> In manage_reshape(), the child_grow() function from Grow.c will be reused.
> The code from Grow.c that prepares the parameters for this function can be reused as well.
>
> Signed-off-by: Adam Kwolek

I've applied this, though I changed the "_in" parameters to simple
pass-by-value parameters, and removed all the NULL checks which are fairly
pointless - if those values are zero it would be wrong to continue, you
should just abort, which is what will now happen anyway.

NeilBrown

> ---
>
> Grow.c | 136 ++++++++++++++++++++++++++++++++++++++++++++++++---------------
> mdadm.h | 11 +++++
> 2 files changed, 115 insertions(+), 32 deletions(-)
>
> diff --git a/Grow.c b/Grow.c
> index 347f07b..8cba82b 100644
> --- a/Grow.c
> +++ b/Grow.c
> @@ -832,6 +832,103 @@ int remove_disks_on_raid10_to_raid0_takeover(struct supertype *st,
> return 0;
> }
>
> +void reshape_free_fdlist(int **fdlist_in,
> + unsigned long long **offsets_in,
> + int size)
> +{
> + int i;
> + int *fdlist;
> + unsigned long long *offsets;
> + if ((fdlist_in == NULL) || (offsets_in == NULL)) {
> + dprintf(Name " Error: Parameters verification error #1.\n");
> + return;
> + }
> +
> + fdlist = *fdlist_in;
> + offsets = *offsets_in;
> + if ((fdlist == NULL) || (offsets == NULL)) {
> + dprintf(Name " Error: Parameters verification error #2.\n");
> + return;
> + }
> +
> + for (i = 0; i < size; i++) {
> + if (fdlist[i] > 0)
> + close(fdlist[i]);
> + }
> +
> + free(fdlist);
> + free(offsets);
> + *fdlist_in = NULL;
> + *offsets_in = NULL;
> +}
> +
> +int reshape_prepare_fdlist(char *devname,
> + struct mdinfo *sra,
> + int raid_disks,
> + int nrdisks,
> + unsigned long blocks,
> + char *backup_file,
> + int **fdlist_in,
> + unsigned long long **offsets_in)
> +{
> + int d = 0;
> + int *fdlist;
> + unsigned long long *offsets;
> + struct mdinfo *sd;
> +
> + if ((devname == NULL) || (sra == NULL) ||
> + (fdlist_in == NULL) || (offsets_in == NULL)) {
> + dprintf(Name " Error: Parameters verification error #1.\n");
> + d = -1;
> + goto release;
> + }
> +
> + fdlist = *fdlist_in;
> + offsets = *offsets_in;
> +
> + if ((fdlist == NULL) || (offsets == NULL)) {
> + dprintf(Name " Error: Parameters verification error #2.\n");
> + d = -1;
> + goto release;
> + }
> +
> + for (d = 0; d <= nrdisks; d++)
> + fdlist[d] = -1;
> + d = raid_disks;
> + for (sd = sra->devs; sd; sd = sd->next) {
> + if (sd->disk.state & (1<<MD_DISK_FAULTY))
> + continue;
> + if (sd->disk.state & (1<<MD_DISK_SYNC)) {
> + char *dn = map_dev(sd->disk.major,
> + sd->disk.minor, 1);
> + fdlist[sd->disk.raid_disk]
> + = dev_open(dn, O_RDONLY);
> + offsets[sd->disk.raid_disk] = sd->data_offset*512;
> + if (fdlist[sd->disk.raid_disk] < 0) {
> + fprintf(stderr, Name ": %s: cannot open component %s\n",
> + devname, dn ? dn : "-unknown-");
> + d = -1;
> + goto release;
> + }
> + } else if (backup_file == NULL) {
> + /* spare */
> + char *dn = map_dev(sd->disk.major,
> + sd->disk.minor, 1);
> + fdlist[d] = dev_open(dn, O_RDWR);
> + offsets[d] = (sd->data_offset + sra->component_size - blocks - 8)*512;
> + if (fdlist[d] < 0) {
> + fprintf(stderr, Name ": %s: cannot open component %s\n",
> + devname, dn ? dn : "-unknown-");
> + d = -1;
> + goto release;
> + }
> + d++;
> + }
> + }
> +release:
> + return d;
> +}
> +
> int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> long long size,
> int level, char *layout_str, int chunksize, int raid_disks)
> @@ -1547,38 +1644,13 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> rv = 1;
> break;
> }
> - for (d=0; d <= nrdisks; d++)
> - fdlist[d] = -1;
> - d = array.raid_disks;
> - for (sd = sra->devs; sd; sd=sd->next) {
> - if (sd->disk.state & (1<<MD_DISK_FAULTY))
> - continue;
> - if (sd->disk.state & (1<<MD_DISK_SYNC)) {
> - char *dn = map_dev(sd->disk.major,
> - sd->disk.minor, 1);
> - fdlist[sd->disk.raid_disk]
> - = dev_open(dn, O_RDONLY);
> - offsets[sd->disk.raid_disk] = sd->data_offset*512;
> - if (fdlist[sd->disk.raid_disk] < 0) {
> - fprintf(stderr, Name ": %s: cannot open component %s\n",
> - devname, dn?dn:"-unknown-");
> - rv = 1;
> - goto release;
> - }
> - } else if (backup_file == NULL) {
> - /* spare */
> - char *dn = map_dev(sd->disk.major,
> - sd->disk.minor, 1);
> - fdlist[d] = dev_open(dn, O_RDWR);
> - offsets[d] = (sd->data_offset + sra->component_size - blocks - 8)*512;
> - if (fdlist[d]<0) {
> - fprintf(stderr, Name ": %s: cannot open component %s\n",
> - devname, dn?dn:"-unknown");
> - rv = 1;
> - goto release;
> - }
> - d++;
> - }
> +
> + d = reshape_prepare_fdlist(devname, sra, array.raid_disks,
> + nrdisks, blocks, backup_file,
> + &fdlist, &offsets);
> + if (d < 0) {
> + rv = 1;
> + goto release;
> }
> if (backup_file == NULL) {
> if (st->ss->external && !st->ss->manage_reshape) {
> diff --git a/mdadm.h b/mdadm.h
> index 750afcc..698f1bf 100644
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -448,6 +448,17 @@ extern int sysfs_unique_holder(int devnum, long rdev);
> extern int sysfs_freeze_array(struct mdinfo *sra);
> extern int load_sys(char *path, char *buf);
> extern struct mdinfo *sysfs_get_unused_spares(int container_fd, int fd);
> +extern int reshape_prepare_fdlist(char *devname,
> + struct mdinfo *sra,
> + int raid_disks,
> + int nrdisks,
> + unsigned long blocks,
> + char *backup_file,
> + int **fdlist_in,
> + unsigned long long **offsets_in);
> +extern void reshape_free_fdlist(int **fdlist_in,
> + unsigned long long **offsets_in,
> + int size);
>
>
> extern int save_stripes(int *source, unsigned long long *offsets,


Re: [PATCH 34/53] Compute backup blocks in function.

am 30.11.2010 03:32:31 von NeilBrown

On Fri, 26 Nov 2010 09:08:16 +0100 Adam Kwolek wrote:

> Computation of the number of backup blocks is moved into a function for code reuse.

Applied, thanks.
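
As a worked example of what this helper computes: the backup size is the least common multiple of the old and the new stripe size in 512-byte sectors. The sketch below uses the modulo form of Euclid's algorithm instead of the subtraction loop in the patch. The function names are hypothetical, and it assumes chunk sizes that are non-zero multiples of 512 and non-zero data disk counts:

```c
/* Greatest common divisor, modulo form of Euclid's algorithm. */
static unsigned long gcd_ul(unsigned long a, unsigned long b)
{
	while (b != 0) {
		unsigned long t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/* Hypothetical equivalent of the helper in the patch: LCM of the old and
 * new stripe sizes, both expressed in 512-byte sectors.
 */
static unsigned long backup_blocks_sketch(unsigned long ochunk,
					  unsigned long nchunk,
					  unsigned int odata,
					  unsigned int ndata)
{
	unsigned long a = (ochunk / 512) * odata;	/* old stripe, sectors */
	unsigned long b = (nchunk / 512) * ndata;	/* new stripe, sectors */

	/* divide before multiplying to keep the intermediate value small */
	return a / gcd_ul(a, b) * b;
}
```

Growing from 3 to 4 data disks with 64KiB chunks gives stripes of 384 and 512 sectors, whose GCD is 128, so 384 / 128 * 512 = 1536 sectors need to be backed up.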

NeilBrown

>
> Signed-off-by: Adam Kwolek
> ---
>
> Grow.c | 44 +++++++++++++++++++++++++++-----------------
> mdadm.h | 4 +++-
> 2 files changed, 30 insertions(+), 18 deletions(-)
>
> diff --git a/Grow.c b/Grow.c
> index 8cba82b..8cc17d5 100644
> --- a/Grow.c
> +++ b/Grow.c
> @@ -929,6 +929,31 @@ release:
> return d;
> }
>
> +unsigned long compute_backup_blocks(int nchunk, int ochunk,
> + unsigned int ndata, unsigned int odata)
> +{
> + unsigned long a, b, blocks;
> + /* So how much do we need to backup.
> + * We need an amount of data which is both a whole number of
> + * old stripes and a whole number of new stripes.
> + * So LCM for (chunksize*datadisks).
> + */
> + a = (ochunk/512) * odata;
> + b = (nchunk/512) * ndata;
> + /* Find GCD */
> + while (a != b) {
> + if (a < b)
> + b -= a;
> + if (b < a)
> + a -= b;
> + }
> + /* LCM == product / GCD */
> + blocks = (ochunk/512) * (nchunk/512) * odata * ndata / a;
> +
> + return blocks;
> +}
> +
> +
> int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> long long size,
> int level, char *layout_str, int chunksize, int raid_disks)
> @@ -967,7 +992,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> int nrdisks;
> int err;
> int frozen;
> - unsigned long a,b, blocks, stripes;
> + unsigned long blocks, stripes;
> unsigned long cache;
> unsigned long long array_size;
> int changed = 0;
> @@ -1587,22 +1612,7 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
> break;
> }
>
> - /* So how much do we need to backup.
> - * We need an amount of data which is both a whole number of
> - * old stripes and a whole number of new stripes.
> - * So LCM for (chunksize*datadisks).
> - */
> - a = (ochunk/512) * odata;
> - b = (nchunk/512) * ndata;
> - /* Find GCD */
> - while (a != b) {
> - if (a < b)
> - b -= a;
> - if (b < a)
> - a -= b;
> - }
> - /* LCM == product / GCD */
> - blocks = (ochunk/512) * (nchunk/512) * odata * ndata / a;
> + blocks = compute_backup_blocks(nchunk, ochunk, ndata, odata);
>
> sysfs_free(sra);
> sra = sysfs_read(fd, 0,
> diff --git a/mdadm.h b/mdadm.h
> index 698f1bf..06195c8 100644
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -459,7 +459,8 @@ extern int reshape_prepare_fdlist(char *devname,
> extern void reshape_free_fdlist(int **fdlist_in,
> unsigned long long **offsets_in,
> int size);
> -
> +extern unsigned long compute_backup_blocks(int nchunk, int ochunk,
> + unsigned int ndata, unsigned int odata);
>
> extern int save_stripes(int *source, unsigned long long *offsets,
> int raid_disks, int chunk_size, int level, int layout,
> @@ -471,6 +472,7 @@ extern int restore_stripes(int *dest, unsigned long long *offsets,
> int source, unsigned long long read_offset,
> unsigned long long start, unsigned long long length);
>
> +
> #ifndef Sendmail
> #define Sendmail "/usr/lib/sendmail -t"
> #endif
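
For illustration, the LCM computation that this patch factors out can also be written with a remainder-based Euclid loop instead of repeated subtraction. This is only a minimal sketch, not the code that was applied: the names `gcd_ul` and `backup_blocks_sketch` are hypothetical, chunk sizes are assumed to be in bytes and non-zero multiples of 512, and the data-disk counts are assumed to be non-zero.

```c
#include <assert.h>

/* Hypothetical helper: greatest common divisor by Euclid's algorithm. */
unsigned long gcd_ul(unsigned long a, unsigned long b)
{
	while (b != 0) {
		unsigned long r = a % b;
		a = b;
		b = r;
	}
	return a;
}

/* Sketch only: smallest number of 512-byte sectors that is both a whole
 * number of old stripes and a whole number of new stripes.
 * Assumes chunk sizes in bytes (multiples of 512) and non-zero disk counts.
 */
unsigned long backup_blocks_sketch(int nchunk, int ochunk,
				   unsigned int ndata, unsigned int odata)
{
	unsigned long a = (unsigned long)(ochunk / 512) * odata; /* old stripe, sectors */
	unsigned long b = (unsigned long)(nchunk / 512) * ndata; /* new stripe, sectors */

	assert(a != 0 && b != 0);
	/* LCM == a * b / GCD; dividing first keeps the intermediate value small */
	return a / gcd_ul(a, b) * b;
}
```

With a 64 KiB chunk and a grow from 3 to 4 data disks, the old stripe is 384 sectors and the new one is 512 sectors, so the sketch yields 1536 sectors.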

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH 35/53] Control reshape in mdadm

on 30.11.2010 03:37:01 by NeilBrown

On Fri, 26 Nov 2010 09:08:23 +0100 Adam Kwolek wrote:

> When managemon starts a reshape while sync_max is set to 0, mdadm is already waiting for it in manage_reshape().
> When the array reaches the reshape state, the manage_reshape() handler checks whether all metadata updates are in place.
> If not, mdadm has to wait until the updates hit the array.
> It then starts the reshape using the common child_grow() code and waits until the reshape is finished.
> When that happens, it sets the size to the value specified in the metadata and performs a backward takeover to raid0 if necessary.
>
> If manage_reshape() finds the array in the idle state (instead of the reshape state), this is treated as an error condition and the process is terminated.
>
> Signed-off-by: Adam Kwolek
> ---
>
> Grow.c | 16 +-
> Makefile | 4
> mdadm.h | 6 +
> super-intel.c | 526 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 540 insertions(+), 12 deletions(-)
>
> diff --git a/Grow.c b/Grow.c
> index 8cc17d5..37bcfd6 100644
> --- a/Grow.c
> +++ b/Grow.c
> @@ -418,10 +418,6 @@ static __u32 bsb_csum(char *buf, int len)
> return __cpu_to_le32(csum);
> }
>
> -static int child_grow(int afd, struct mdinfo *sra, unsigned long blocks,
> - int *fds, unsigned long long *offsets,
> - int disks, int chunk, int level, int layout, int data,
> - int dests, int *destfd, unsigned long long *destoffsets);
> static int child_shrink(int afd, struct mdinfo *sra, unsigned long blocks,
> int *fds, unsigned long long *offsets,
> int disks, int chunk, int level, int layout, int data,
> @@ -451,7 +447,7 @@ static int freeze_container(struct supertype *st)
> return 1;
> }
>
> -static void unfreeze_container(struct supertype *st)
> +void unfreeze_container(struct supertype *st)
> {
> int container_dev = st->subarray[0] ? st->container_dev : st->devnum;
> char *container = devnum2devname(container_dev);
> @@ -505,7 +501,7 @@ static void unfreeze(struct supertype *st, int frozen)
> sysfs_free(sra);
> }
>
> -static void wait_reshape(struct mdinfo *sra)
> +void wait_reshape(struct mdinfo *sra)
> {
> int fd = sysfs_get_fd(sra, NULL, "sync_action");
> char action[20];
> @@ -2202,10 +2198,10 @@ static void validate(int afd, int bfd, unsigned long long offset)
> }
> }
>
> -static int child_grow(int afd, struct mdinfo *sra, unsigned long stripes,
> - int *fds, unsigned long long *offsets,
> - int disks, int chunk, int level, int layout, int data,
> - int dests, int *destfd, unsigned long long *destoffsets)
> +int child_grow(int afd, struct mdinfo *sra, unsigned long stripes,
> + int *fds, unsigned long long *offsets,
> + int disks, int chunk, int level, int layout, int data,
> + int dests, int *destfd, unsigned long long *destoffsets)
> {
> char *buf;
> int degraded = 0;
> diff --git a/Makefile b/Makefile
> index e3fb949..6527152 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -112,12 +112,12 @@ SRCS = mdadm.c config.c mdstat.c ReadMe.c util.c Manage.c Assemble.c Build.c \
> MON_OBJS = mdmon.o monitor.o managemon.o util.o mdstat.o sysfs.o config.o \
> Kill.o sg_io.o dlink.o ReadMe.o super0.o super1.o super-intel.o \
> super-ddf.o sha1.o crc32.o msg.o bitmap.o \
> - platform-intel.o probe_roms.o mapfile.o
> + platform-intel.o probe_roms.o mapfile.o Grow.o restripe.o
>
> MON_SRCS = mdmon.c monitor.c managemon.c util.c mdstat.c sysfs.c config.c \
> Kill.c sg_io.c dlink.c ReadMe.c super0.c super1.c super-intel.c \
> super-ddf.c sha1.c crc32.c msg.c bitmap.c \
> - platform-intel.c probe_roms.c mapfile.c
> + platform-intel.c probe_roms.c mapfile.c Grow.c restripe.c

Adding Grow.c and restripe.c to mdmon is definitely wrong.
I assume you are doing this so that super-intel can call some 'grow'
functions, and you need Grow.c and restripe.c to make mdmon link.

We need to find a different way to do this.

Maybe we could compile two different .o files from super-intel.c, one for
mdadm and one for mdmon.
Or maybe we could put some stubs in mdmon.c for the functions in Grow.c that
you want to have defined - because of course mdmon will never actually call
them.
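
The stub idea might look roughly like the sketch below. It is only an assumption about how such stubs could be written: the `struct mdinfo` and `struct supertype` declarations are opaque placeholders so that the fragment compiles on its own, the signatures are copied from the declarations this patch adds to mdadm.h, and returning -1 is an arbitrary choice since mdmon is never expected to call these functions.

```c
/* Opaque placeholders so this sketch is self-contained; in mdadm these
 * types come from mdadm.h. */
struct mdinfo;
struct supertype;

/* Stub: would let super-intel.o link into mdmon without Grow.o.
 * mdmon never runs a reshape itself, so this should be unreachable. */
int child_grow(int afd, struct mdinfo *sra, unsigned long stripes,
	       int *fds, unsigned long long *offsets,
	       int disks, int chunk, int level, int layout, int data,
	       int dests, int *destfd, unsigned long long *destoffsets)
{
	(void)afd; (void)sra; (void)stripes; (void)fds; (void)offsets;
	(void)disks; (void)chunk; (void)level; (void)layout; (void)data;
	(void)dests; (void)destfd; (void)destoffsets;
	return -1; /* arbitrary failure value chosen for this sketch */
}

/* Stub: nothing to wait for inside mdmon. */
void wait_reshape(struct mdinfo *sra)
{
	(void)sra;
}

/* Stub: unfreezing is driven by mdadm, not by mdmon. */
void unfreeze_container(struct supertype *st)
{
	(void)st;
}
```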

NeilBrown

>
> STATICSRC = pwgr.c
> STATICOBJS = pwgr.o
> diff --git a/mdadm.h b/mdadm.h
> index 06195c8..2c08ee6 100644
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -446,6 +446,7 @@ extern int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume);
> extern int sysfs_disk_to_scsi_id(int fd, __u32 *id);
> extern int sysfs_unique_holder(int devnum, long rdev);
> extern int sysfs_freeze_array(struct mdinfo *sra);
> +extern void wait_reshape(struct mdinfo *sra);
> extern int load_sys(char *path, char *buf);
> extern struct mdinfo *sysfs_get_unused_spares(int container_fd, int fd);
> extern int reshape_prepare_fdlist(char *devname,
> @@ -461,6 +462,11 @@ extern void reshape_free_fdlist(int **fdlist_in,
> int size);
> extern unsigned long compute_backup_blocks(int nchunk, int ochunk,
> unsigned int ndata, unsigned int odata);
> +extern int child_grow(int afd, struct mdinfo *sra, unsigned long stripes,
> + int *fds, unsigned long long *offsets,
> + int disks, int chunk, int level, int layout, int data,
> + int dests, int *destfd, unsigned long long *destoffsets);
> +extern void unfreeze_container(struct supertype *st);
>
> extern int save_stripes(int *source, unsigned long long *offsets,
> int raid_disks, int chunk_size, int level, int layout,
> diff --git a/super-intel.c b/super-intel.c
> index e4ba875..e57a127 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -26,6 +26,7 @@
> #include
> #include
> #include
> +#include
>
> /* MPB == Metadata Parameter Block */
> #define MPB_SIGNATURE "Intel Raid ISM Cfg Sig. "
> @@ -6780,6 +6781,8 @@ int imsm_reshape_super(struct supertype *st, long long size, int level,
> }
> } else
> dprintf("imsm: Operation is not allowed on container\n");
> + if (ret_val)
> + unfreeze_container(st);
> *st->subarray = 0;
> goto imsm_reshape_super_exit;
> } else
> @@ -6901,6 +6904,13 @@ int imsm_reshape_array_set_slots(struct active_array *a)
>
> return imsm_reshape_array_manage_new_slots(super, inst, a->devnum, 1);
> }
> +
> +int imsm_reshape_array_count_slots_mismatches(struct intel_super *super, int inst, int devnum)
> +{
> +
> + return imsm_reshape_array_manage_new_slots(super, inst, devnum, 0);
> +}
> +
> /* imsm_reshape_array_manage_new_slots()
> * returns: number of corrected slots for correct == 1
> * counted number of different slots for correct == 0
> @@ -7174,6 +7184,521 @@ imsm_reshape_array_exit:
> return disk_list;
> }
>
> +int imsm_grow_manage_size(struct supertype *st, struct mdinfo *sra)
> +{
> + int ret_val = 0;
> + struct mdinfo *info = NULL;
> + unsigned long long size;
> + int container_fd;
> + unsigned long long current_size = 0;
> +
> + /* finalize current volume reshape
> + * for external meta size has to be managed by mdadm
> + * read size set in meta and put it to md when
> + * reshape is finished.
> + */
> +
> + if (sra == NULL) {
> + dprintf("Error: imsm_grow_manage_size(): sra == NULL\n");
> + goto exit_grow_manage_size_ext_meta;
> + }
> + wait_reshape(sra);
> +
> + /* reshape has finished, update md size
> + * get per-device size and multiply by data disks
> + */
> + container_fd = open_dev(st->devnum);
> + if (container_fd < 0) {
> + dprintf("Error: imsm_grow_manage_size(): container_fd == 0\n");
> + goto exit_grow_manage_size_ext_meta;
> + }
> + if (st->loaded_container)
> + st->ss->load_super(st, container_fd, NULL);
> + info = sysfs_read(container_fd, 0, GET_LEVEL|GET_VERSION|GET_DEVS|GET_STATE);
> + close(container_fd);
> + if (info == NULL) {
> + dprintf("imsm: Cannot get device info.\n");
> + goto exit_grow_manage_size_ext_meta;
> + }
> + st->ss->getinfo_super(st, info);
> + size = info->custom_array_size/2;
> + sysfs_get_ll(sra, NULL, "array_size", &current_size);
> + dprintf("imsm_grow_manage_size(): current size is %llu, set size to %llu\n", current_size, size);
> + sysfs_set_num(sra, NULL, "array_size", size);
> +
> + ret_val = 1;
> +
> +exit_grow_manage_size_ext_meta:
> + sysfs_free(info);
> + return ret_val;
> +}
> +
> +int imsm_child_grow(struct supertype *st, char *devname, int validate_fd, struct mdinfo *sra)
> +{
> + int ret_val = 0;
> + int nrdisks;
> + int *fdlist;
> + unsigned long long *offsets;
> + unsigned int ndata, odata;
> + int ndisks, odisks;
> + unsigned long blocks, stripes;
> + int d;
> + struct mdinfo *sd;
> +
> + nrdisks = ndisks = odisks = sra->array.raid_disks;
> + odisks -= sra->delta_disks;
> + odata = odisks-1;
> + ndata = ndisks-1;
> + fdlist = malloc((1+nrdisks) * sizeof(int));
> + offsets = malloc((1+nrdisks) * sizeof(offsets[0]));
> + if (!fdlist || !offsets) {
> + fprintf(stderr, Name ": malloc failed: grow aborted\n");
> + ret_val = 1;
> + if (fdlist)
> + free(fdlist);
> + if (offsets)
> + free(offsets);
> + return ret_val;
> + }
> + blocks = compute_backup_blocks(sra->array.chunk_size,
> + sra->array.chunk_size,
> + ndata, odata);
> +
> + /* set MD_DISK_SYNC flag to open all devices that have to be backed up
> + */
> + for (sd = sra->devs; sd; sd = sd->next) {
> + if ((sd->disk.raid_disk > -1) &&
> + ((unsigned int)sd->disk.raid_disk < odata)) {
> + sd->disk.state |= (1<<[...]
> + sd->disk.state &= ~(1<<[...]
> + } else {
> + sd->disk.state |= (1<<[...]
> + sd->disk.state &= ~(1<<[...]
> + }
> + }
> +#ifdef DEBUG
> + dprintf("FD list disk inspection:\n");
> + for (sd = sra->devs; sd; sd = sd->next) {
> + char *dn = map_dev(sd->disk.major,
> + sd->disk.minor, 1);
> + dprintf("Disk %s", dn);
> + dprintf("\tstate = %i\n", sd->disk.state);
> + }
> +#endif
> + d = reshape_prepare_fdlist(devname, sra, odisks,
> + nrdisks, blocks, NULL,
> + &fdlist, &offsets);
> + if (d < 0) {
> + fprintf(stderr, Name ": cannot prepare device list\n");
> + ret_val = 1;
> + return ret_val;
> + }
> +
> + mlockall(MCL_FUTURE);
> + if (ret_val == 0) {
> + sra->array.raid_disks = odisks;
> + sra->new_level = sra->array.level;
> + sra->new_layout = sra->array.layout;
> + sra->new_chunk = sra->array.chunk_size;
> +
> + stripes = blocks / (sra->array.chunk_size/512) / odata;
> + /* child grow returns fixed value == 1
> + */
> + child_grow(validate_fd, sra, stripes,
> + fdlist, offsets,
> + odisks, sra->array.chunk_size,
> + sra->array.level, -1, odata,
> + d - odisks, NULL, offsets + odata);
> + imsm_grow_manage_size(st, sra);
> + }
> + reshape_free_fdlist(&fdlist, &offsets, d);
> +
> + return ret_val;
> +}
> +
> +void return_to_raid0(struct mdinfo *sra)
> +{
> + if (sra->array.level == 4) {
> + dprintf("Execute backward takeover to raid0\n");
> + sysfs_set_str(sra, NULL, "level", "raid0");
> + }
> +}
> +
> +int imsm_check_reshape_conditions(int fd, struct supertype *st, int current_array)
> +{
> + char buf[PATH_MAX];
> + struct mdinfo *info = NULL;
> + int arrays_in_reshape_state = 0;
> + int wait_counter = 0;
> + int i;
> + int ret_val = 0;
> + struct intel_super *super = st->sb;
> + struct imsm_super *mpb = super->anchor;
> + int wrong_slots_counter;
> +
> + /* wait until all arrays are in reshape state
> + * or an error occurs (idle state detected)
> + */
> + while ((arrays_in_reshape_state == 0) &&
> + (ret_val == 0)) {
> + arrays_in_reshape_state = 0;
> + int temp_array;
> +
> + if (wait_counter)
> + sleep(1);
> +
> + for (i = 0; i < mpb->num_raid_devs; i++) {
> + int sync_max;
> + int len;
> +
> + /* check array state in md
> + */
> + sprintf(st->subarray, "%i", i);
> + st->ss->load_super(st, fd, NULL);
> + if (st->sb == NULL) {
> + dprintf("cannot get sb\n");
> + ret_val = 1;
> + break;
> + }
> + info = sysfs_read(fd, 0, GET_LEVEL|GET_VERSION|GET_DEVS|GET_STATE);
> + if (info == NULL) {
> + dprintf("imsm: Cannot get device info.\n");
> + break;
> + }
> + st->ss->getinfo_super(st, info);
> +
> + find_array_minor(info->name, 1, st->devnum, &temp_array);
> + if (temp_array != current_array) {
> + if (temp_array < 0) {
> + ret_val = -1;
> + break;
> + }
> + sysfs_free(info);
> + info = NULL;
> + continue;
> + }
> +
> + if (sysfs_get_str(info, NULL, "raid_disks", buf, sizeof(buf)) < 0) {
> + dprintf("cannot get raid_disks\n");
> + ret_val = 1;
> + break;
> + }
> + /* sync_max should be always set to 0
> + */
> + if (sysfs_get_str(info, NULL, "sync_max", buf, sizeof(buf)) < 0) {
> + dprintf("cannot get sync_max\n");
> + ret_val = 1;
> + break;
> + }
> + len = strlen(buf)-1;
> + if (len < 0)
> + len = 0;
> + *(buf+len) = 0;
> + sync_max = atoi(buf);
> + if (sync_max != 0) {
> + dprintf("sync_max has wrong value (%s)\n", buf);
> + sysfs_free(info);
> + info = NULL;
> + continue;
> + }
> + if (sysfs_get_str(info, NULL, "sync_action", buf, sizeof(buf)) < 0) {
> + dprintf("cannot get sync_action\n");
> + ret_val = 1;
> + break;
> + }
> + len = strlen(buf)-1;
> + if (len < 0)
> + len = 0;
> + *(buf+len) = 0;
> + if (strncmp(buf, "idle", 7) == 0) {
> + dprintf("imsm: Error found array in idle state during reshape initialization\n");
> + ret_val = 1;
> + break;
> + }
> + if (strncmp(buf, "reshape", 7) == 0) {
> + arrays_in_reshape_state++;
> + } else {
> + if (strncmp(buf, "frozen", 6) != 0) {
> + *(buf+strlen(buf)) = 0;
> + dprintf("imsm: Error unexpected array state (%s) during reshape initialization\n",
> + buf);
> + ret_val = 1;
> + break;
> + }
> + }
> + /* this device looks ok, so
> + * check if slots are set correctly
> + */
> + super = st->sb;
> + wrong_slots_counter = imsm_reshape_array_count_slots_mismatches(super, i, atoi(info->sys_name+2));
> + sysfs_free(info);
> + info = NULL;
> + if (wrong_slots_counter != 0) {
> + dprintf("Slots for correction %i.\n", wrong_slots_counter);
> + ret_val = 1;
> + goto exit_imsm_check_reshape_conditions;
> + }
> + }
> + sysfs_free(info);
> + info = NULL;
> + wait_counter++;
> + if (wait_counter > 60) {
> + dprintf("exit on timeout, container is not prepared to reshape\n");
> + ret_val = 1;
> + }
> + }
> +
> +exit_imsm_check_reshape_conditions:
> + sysfs_free(info);
> + info = NULL;
> +
> + return ret_val;
> +}
> +
> +int imsm_manage_container_reshape(struct supertype *st)
> +{
> + int ret_val = 1;
> + char buf[PATH_MAX];
> + struct intel_super *super = st->sb;
> + struct imsm_super *mpb = super->anchor;
> + int fd;
> + struct mdinfo *info = NULL;
> + struct mdinfo info2;
> + int validate_fd;
> + int delta_disks;
> + struct geo_params geo;
> +#ifdef DEBUG
> + int i;
> +#endif
> +
> + memset(&geo, 0, sizeof (struct geo_params));
> + /* verify reshape conditions
> + * for single volume reshape exit only and reuse Grow_reshape() code
> + */
> + if (st->subarray[0] != 0) {
> + dprintf("imsm: imsm_manage_container_reshape() current volume: %s\n", st->subarray);
> + dprintf("imsm: imsm_manage_container_reshape() detects volume reshape (devnum = %i), exit.\n", st->devnum);
> + return ret_val;
> + }
> +
> + geo.dev_name = devnum2devname(st->devnum);
> + if (geo.dev_name == NULL) {
> + dprintf("imsm: Error: imsm_manage_reshape(): cannot get device name.\n");
> + return ret_val;
> + }
> +
> + snprintf(buf, PATH_MAX, "/dev/%s", geo.dev_name);
> + fd = open(buf , O_RDONLY | O_DIRECT);
> + if (fd < 0) {
> + dprintf("imsm: cannot open device\n");
> + goto imsm_manage_container_reshape_exit;
> + }
> +
> + /* send pings to roll managemon and monitor
> + */
> + ping_manager(geo.dev_name);
> + ping_monitor(geo.dev_name);
> +
> +#ifdef DEBUG
> + /* device list for reshape
> + */
> + dprintf("Arrays to run reshape (no: %i)\n", mpb->num_raid_devs);
> + for (i = 0; i < mpb->num_raid_devs; i++) {
> + struct imsm_dev *dev = get_imsm_dev(super, i);
> + dprintf("\tDevice: %s\n", dev->volume);
> + }
> +#endif
> +
> + info2.devs = NULL;
> + st->ss->getinfo_super(st, &info2);
> + geo.dev_id = -1;
> + find_array_minor(info2.name, 1, st->devnum, &geo.dev_id);
> + if (geo.dev_id < 0) {
> + dprintf("imsm. Error.Cannot get first array.\n");
> + goto imsm_manage_container_reshape_exit;
> + }
> + if (imsm_check_reshape_conditions(fd, st, geo.dev_id)) {
> + dprintf("imsm. Error. Wrong reshape conditions.\n");
> + goto imsm_manage_container_reshape_exit;
> + }
> + geo.raid_disks = info2.array.raid_disks;
> + dprintf("Container is ready for reshape ...\n");
> + switch (fork()) {
> + case 0:
> + fprintf(stderr, Name ": Child forked to run and monitor reshape\n");
> + while (geo.dev_id > -1) {
> + int fd2 = -1;
> + int i;
> + int temp_array = -1;
> + char *array;
> +
> + for (i = 0; i < mpb->num_raid_devs; i++) {
> + sprintf(st->subarray, "%i", i);
> + st->ss->load_super(st, fd, NULL);
> + if (st->sb == NULL) {
> + dprintf("cannot get sb\n");
> + ret_val = 1;
> + goto imsm_manage_container_reshape_exit;
> + }
> + info2.devs = NULL;
> + st->ss->getinfo_super(st, &info2);
> + dprintf("Checking slots for device %s\n", info2.sys_name);
> + find_array_minor(info2.name, 1, st->devnum, &temp_array);
> + if (temp_array == geo.dev_id)
> + break;
> + }
> + snprintf(buf, PATH_MAX, "/dev/%s", info2.sys_name);
> + dprintf("Prepare to reshape for device %s (md%i)\n", info2.sys_name, geo.dev_id);
> + fd2 = open(buf, O_RDWR | O_DIRECT);
> + if (fd2 < 0) {
> + dprintf("Reshape is broken (cannot open array)\n");
> + ret_val = 1;
> + goto imsm_manage_container_reshape_exit;
> + }
> + info = sysfs_read(fd2, 0, GET_VERSION | GET_LEVEL | GET_DEVS | GET_STATE |\
> + GET_COMPONENT | GET_OFFSET | GET_CACHE |\
> + GET_CHUNK | GET_DISKS | GET_DEGRADED |
> + GET_SIZE | GET_LAYOUT);
> + if (info == NULL) {
> + dprintf("Reshape is broken (cannot read sysfs)\n");
> + close(fd2);
> + ret_val = 1;
> + goto imsm_manage_container_reshape_exit;
> + }
> + delta_disks = info->delta_disks;
> + super = st->sb;
> + if (check_env("MDADM_GROW_VERIFY"))
> + validate_fd = fd2;
> + else
> + validate_fd = -1;
> +
> + if (sysfs_get_str(info, NULL, "sync_completed", buf, sizeof(buf)) >= 0) {
> + /* check if in previous pass we reshape any array
> + * if not we have to omit sync_complete condition
> + * and try to reshape arrays
> + */
> + if ((*buf == '0') ||
> + /* or this array was already reshaped */
> + (strncmp(buf, "none", 4) == 0)) {
> + dprintf("Skip this array, sync_completed is %s\n", buf);
> + geo.dev_id = -1;
> + sysfs_free(info);
> + info = NULL;
> + close(fd2);
> + continue;
> + }
> + } else {
> + dprintf("Reshape is broken (cannot read sync_complete)\n");
> + dprintf("Array level is: %i\n", info->array.level);
> + ret_val = 1;
> + close(fd2);
> + goto imsm_manage_container_reshape_exit;
> + }
> + snprintf(buf, PATH_MAX, "/dev/md/%s", info2.name);
> + info->delta_disks = info2.delta_disks;
> +
> + delta_disks = info->array.raid_disks - geo.raid_disks;
> + geo.raid_disks = info->array.raid_disks;
> + if (info->array.level == 4) {
> + geo.raid_disks--;
> + delta_disks--;
> + }
> +
> + ret_val = imsm_child_grow(st, buf,
> + validate_fd,
> + info);
> + return_to_raid0(info);
> + sysfs_free(info);
> + info = NULL;
> + close(fd2);
> + if (ret_val) {
> + dprintf("Reshape is broken (cannot reshape)\n");
> + ret_val = 1;
> + goto imsm_manage_container_reshape_exit;
> + }
> + geo.dev_id = -1;
> + sprintf(st->subarray, "%i", 0);
> + array = get_volume_for_olce(st, geo.raid_disks);
> + if (array) {
> + struct imsm_update_reshape *u;
> + dprintf("imsm: next volume to reshape is: %s\n", array);
> + info2.devs = NULL;
> + st->ss->getinfo_super(st, &info2);
> + find_array_minor(info2.name, 1, st->devnum, &geo.dev_id);
> + if (geo.dev_id > -1) {
> + /* send next array update
> + */
> + dprintf("imsm: Preparing metadata update for: %s (md%i)\n", array, geo.dev_id);
> + st->update_tail = &st->updates;
> + u = imsm_create_metadata_update_for_reshape(st, &geo);
> + if (u) {
> + u->reshape_delta_disks = delta_disks;
> + append_metadata_update(st, u, u->update_memory_size);
> + flush_metadata_updates(st);
> + /* send pings to roll managemon and monitor
> + */
> + ping_manager(geo.dev_name);
> + ping_monitor(geo.dev_name);
> +
> + if (imsm_check_reshape_conditions(fd, st, geo.dev_id)) {
> + dprintf("imsm. Error. Wrong reshape conditions.\n");
> + ret_val = 1;
> + geo.dev_id = -1;
> + }
> + } else
> + geo.dev_id = -1;
> + }
> + free(array);
> + }
> + }
> + unfreeze_container(st);
> + close(fd);
> + break;
> + case -1:
> + fprintf(stderr, Name ": Cannot run child to monitor reshape: %s\n",
> + strerror(errno));
> + ret_val = 1;
> + break;
> + default:
> + /* The child will take care of unfreezing the array */
> + break;
> + }
> +
> +imsm_manage_container_reshape_exit:
> + sysfs_free(info);
> + if (fd > -1)
> + close(fd);
> + if (geo.dev_name)
> + free(geo.dev_name);
> +
> + return ret_val;
> +}
> +
> +int imsm_manage_reshape(struct supertype *st, char *backup)
> +{
> + int ret_val = 0;
> +
> + dprintf("imsm: manage_reshape() called\n");
> +
> + if (experimental() == 0)
> + return ret_val;
> +
> + /* verify reshape conditions
> + * for single volume reshape exit only and reuse Grow_reshape() code
> + */
> + if (st->subarray[0] != 0) {
> + dprintf("imsm: manage_reshape() current volume: %s (devnum = %i)\n", st->subarray, st->devnum);
> + return ret_val;
> + }
> + ret_val = imsm_manage_container_reshape(st);
> + /* unfreeze on error and success
> + * for any result this is end of work
> + */
> + unfreeze_container(st);
> +
> + return ret_val;
> +}
> +
> struct superswitch super_imsm = {
> #ifndef MDASSEMBLE
> .examine_super = examine_super_imsm,
> @@ -7207,6 +7732,7 @@ struct superswitch super_imsm = {
> .default_geometry = default_geometry_imsm,
> .reshape_super = imsm_reshape_super,
> .reshape_array = imsm_reshape_array,
> + .manage_reshape = imsm_manage_reshape,
>
> .external = 1,
> .name = "imsm",
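
The per-array state check in imsm_check_reshape_conditions() boils down to a three-way decision on the sync_action value: "reshape" means the array is ready, "frozen" means keep polling, and "idle" or anything else is an error. A minimal sketch of that decision follows. The enum and the function `classify_sync_action` are hypothetical names, not part of the patch, and the sketch uses exact string matches where the patch uses prefix comparisons.

```c
#include <string.h>

enum reshape_check {
	RESHAPE_READY,	/* array reports "reshape": proceed */
	RESHAPE_WAIT,	/* array is still "frozen": poll again */
	RESHAPE_ERROR	/* "idle" or any other state: abort */
};

/* Hypothetical helper: classify the contents of the sysfs sync_action
 * attribute. A trailing newline, as returned by sysfs, is tolerated. */
enum reshape_check classify_sync_action(const char *action)
{
	size_t len = strcspn(action, "\n");

	if (len == 7 && strncmp(action, "reshape", len) == 0)
		return RESHAPE_READY;
	if (len == 6 && strncmp(action, "frozen", len) == 0)
		return RESHAPE_WAIT;
	return RESHAPE_ERROR;
}
```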


RE: [PATCH 14/53] FIX: Cannot exit monitor after takeover

on 30.11.2010 17:03:16 by adam.kwolek

The problem is that when a raid0 array is being unfrozen and it is the single/last array in the container,
a ping to this container prevents mdmon from exiting.
In this condition managemon receives the message and, in handle_message() for the ping case, calls wakeup_monitor()
and then goes into a loop waiting for the monitor_loop_cnt update:
1. this occurs after a timeout
2. when this happens managemon stops on pselect() and, as there is nothing to monitor, it never wakes up.
3. monitor waits to be allowed to exit on open handlers.

How this can be resolved:
1. do not ping for the last raid0 array during unfreezing (I've reworked the patch to meet this condition)
2. guard the wait for the monitor_loop_cnt change in handle_message() with:
if (container->arrays)

3. change the condition in manage member from:
if (sigterm)
wakeup_monitor();

to:
if (sigterm || (container->arrays == NULL))
wakeup_monitor();

This causes an additional monitor wakeup.

Any of these methods causes mdmon to exit as expected.
In cases 2 and 3 it takes a while (we are waiting on communication timeouts).
Method 1 is fast and we are not blocking mdmon exit by communication.
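
Options 2 and 3 above can be expressed as two small predicates. This is a sketch under stated assumptions, not the mdmon code: `struct supertype_sketch` is a hypothetical stand-in that models only the `arrays` pointer, and the two function names are invented for illustration.

```c
#include <stddef.h>

/* Placeholder types for the sketch; the real definitions live in mdadm/mdmon. */
struct active_array;
struct supertype_sketch {
	struct active_array *arrays;	/* NULL when nothing is left to monitor */
};

/* Option 3 as a predicate: wake the monitor on SIGTERM or when the
 * container has no arrays left. */
int should_wake_monitor(int sigterm, const struct supertype_sketch *container)
{
	return sigterm || container->arrays == NULL;
}

/* Option 2 as a predicate: only wait for monitor_loop_cnt to change
 * when there is at least one array in the container. */
int should_wait_for_monitor(const struct supertype_sketch *container)
{
	return container->arrays != NULL;
}
```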

BR
Adam

> -----Original Message-----
> From: Neil Brown [mailto:neilb@suse.de]
> Sent: Monday, November 29, 2010 12:38 AM
> To: Kwolek, Adam
> Cc: linux-raid@vger.kernel.org; Williams, Dan J; Ciechanowski, Ed
> Subject: Re: [PATCH 14/53] FIX: Cannot exit monitor after takeover
>
> On Fri, 26 Nov 2010 09:05:37 +0100 Adam Kwolek wrote:
>
> > When performing backward takeover to raid0 monitor cannot exit
> > for single raid0 array configuration.
> > Monitor is locked by communication (ping_manager()) after unfreeze()
>
> I think you are saying that when we convert a RAID5 to a RAID0, the mdmon
> notices that there is nothing more for it to do, so it exits. Then mdadm has
> problems contacting it. Is that right?
> It doesn't seem quite right as the 'ping_monitor' should simply fail if the
> mdmon has disappeared.
>
> Could you say a bit more about what you observe happening.
>
> >
> > Do not ping manager for raid0 array as they shouldn't be monitored.
>
> Only this isn't quite what the patch does. What it does is:
> if the 'last' subarray found is raid0, then don't ping the monitor.
> In general, (though possibly not in imsm) there could be multiple arrays,
> some RAID0, some not. So we would need to track if there are any with level > 0
> and ping_monitor if any such were found.
>
> I would be reasonably happy with such a patch, except that I cannot yet see
> exactly why it is needed. So could you explain exactly what you are seeing
> please?
>
> Thanks,
> NeilBrown
>
>
>
> >
> > Signed-off-by: Adam Kwolek
> > ---
> >
> > msg.c | 5 +++--
> > 1 files changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/msg.c b/msg.c
> > index 8e7ebfd..95c6f0b 100644
> > --- a/msg.c
> > +++ b/msg.c
> > @@ -385,11 +385,12 @@ void unblock_monitor(char *container, const int unfreeze)
> > if (!is_container_member(e, container))
> > continue;
> > sysfs_free(sra);
> > - sra = sysfs_read(-1, e->devnum, GET_VERSION);
> > + sra = sysfs_read(-1, e->devnum, GET_VERSION|GET_LEVEL);
> > if (unblock_subarray(sra, unfreeze))
> > fprintf(stderr, Name ": Failed to unfreeze %s\n", e->dev);
> > }
> > - ping_monitor(container);
> > + if (sra && sra->array.level > 0)
> > + ping_monitor(container);
> >
> > sysfs_free(sra);
> > free_mdstat(ent);
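
The review comment quoted above asks for tracking whether any member array has level > 0, rather than looking only at the last one read. A minimal sketch of that decision is shown here. The function `needs_monitor_ping` and its array-of-levels interface are hypothetical; the real code would collect the levels while walking the mdstat entries.

```c
#include <stddef.h>

/* Hypothetical helper: decide whether the monitor needs a ping after
 * unfreezing, given the RAID level of every member array in the container.
 * Returns 1 if any member has level > 0, otherwise 0. */
int needs_monitor_ping(const int *levels, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++)
		if (levels[i] > 0)
			return 1;
	return 0;
}
```

Unlike a check on the last subarray only, this returns 1 for a container holding a raid5 followed by a raid0.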


RE: [PATCH 15/53] FIX: Unfreeze not only container for external metadata

on 30.11.2010 17:03:54 by adam.kwolek

You missed nothing ;), I'm removing this patch.

BR
Adam

> -----Original Message-----
> From: Neil Brown [mailto:neilb@suse.de]
> Sent: Monday, November 29, 2010 12:49 AM
> To: Kwolek, Adam
> Cc: linux-raid@vger.kernel.org; Williams, Dan J; Ciechanowski, Ed
> Subject: Re: [PATCH 15/53] FIX: Unfreeze not only container for external metadata
>
> On Fri, 26 Nov 2010 09:05:45 +0100 Adam Kwolek wrote:
>
> > Unfreeze for the external metadata case should unfreeze the arrays and the
> > container, not only the container as it does so far. The unfreeze() function
> > doesn't know what changes to the configuration have been made so far, or
> > whether arrays have been pulled out of the frozen state in md.
> > unfreeze() has to make sure, by performing an array unfreeze, that no
> > arrays are frozen, and then unblock the monitor.
>
> unfreeze for external metadata case *does* unfreeze the arrays.
> unfreeze_container calls unblock_monitor which calls unblock_subarray
> for each subarray.
>
> So I cannot see that this patch changes anything. What have I missed?
>
> NeilBrown
>
>
> >
> > Signed-off-by: Adam Kwolek
> > ---
> >
> > Grow.c | 18 ++++++++----------
> > 1 files changed, 8 insertions(+), 10 deletions(-)
> >
> > diff --git a/Grow.c b/Grow.c
> > index 4060129..8ca1812 100644
> > --- a/Grow.c
> > +++ b/Grow.c
> > @@ -495,16 +495,14 @@ static void unfreeze(struct supertype *st, int frozen)
> > return;
> >
> > if (st->ss->external)
> > - return unfreeze_container(st);
> > - else {
> > - struct mdinfo *sra = sysfs_read(-1, st->devnum, GET_VERSION);
> > -
> > - if (sra)
> > - sysfs_set_str(sra, NULL, "sync_action", "idle");
> > - else
> > - fprintf(stderr, Name ": failed to unfreeze array\n");
> > - sysfs_free(sra);
> > - }
> > + unfreeze_container(st);
> > +
> > + struct mdinfo *sra = sysfs_read(-1, st->devnum, GET_VERSION);
> > + if (sra)
> > + sysfs_set_str(sra, NULL, "sync_action", "idle");
> > + else
> > + fprintf(stderr, Name ": failed to unfreeze array\n");
> > + sysfs_free(sra);
> > }
> >
> > static void wait_reshape(struct mdinfo *sra)


Re: [PATCH 14/53] FIX: Cannot exit monitor after takeover

on 30.11.2010 23:06:50 by NeilBrown

On Tue, 30 Nov 2010 16:03:16 +0000 "Kwolek, Adam" wrote:

> The problem is that when a raid0 array is being unfrozen and it is the single/last array in the container,
> a ping to this container prevents mdmon from exiting.
> In this condition managemon receives the message and, in handle_message() for the ping case, calls wakeup_monitor()
> and then goes into a loop waiting for the monitor_loop_cnt update:
> 1. this occurs after a timeout
> 2. when this happens managemon stops on pselect() and, as there is nothing to monitor, it never wakes up.
> 3. monitor waits to be allowed to exit on open handlers.
>
> How this can be resolved:
> 1. do not ping for the last raid0 array during unfreezing (I've reworked the patch to meet this condition)
> 2. guard the wait for the monitor_loop_cnt change in handle_message() with:
> if (container->arrays)
>
> 3. change the condition in manage member from:
> if (sigterm)
> wakeup_monitor();
>
> to:
> if (sigterm || (container->arrays == NULL))
> wakeup_monitor();
>
> This causes an additional monitor wakeup.
>
> Any of these methods causes mdmon to exit as expected.
> In cases 2 and 3 it takes a while (we are waiting on communication timeouts).
> Method 1 is fast and we are not blocking mdmon exit by communication.

Thanks for the explanation!
I definitely want to fix the managemon/monitor interaction so that it doesn't
hang as you describe. I might end up with something a lot more heavy-weight
than the changes you suggest.

It might still be OK to include your option '1' as well - I'll decide when you
post the patch.

thanks,
NeilBrown
