[PATCH 0/9] recovering an imsm raid5 array

am 26.08.2011 04:13:59 von dan.j.williams

I run imsm raid5 and raid1 arrays on my personal systems and was
recently bit by the bug that was fixed in commit 1a2487c2 "FIX: imsm:
OROM does not recognize degraded arrays (V2)". In the process of
investigating that I pulled the wrong disk and ended up in a dual
degraded situation.

These patches (ordered in roughly increasing order of required review)
are the features needed to get the array up and running again, as well
as some random fixes spotted along the way.

The most important patch for recovery being patch7 "imsm: support
'missing' devices at Create", allowing the mdadm raid5 recovery path of
recreating the raid array with what one thinks are the good disks /
order, and then attempt to mount the filesystem to see if the guess was
correct.

Example, create degraded raid5 with slot1 missing.

mdadm --create /dev/md0 /dev/sd[ac] -n 2 -e imsm
mdadm --create /dev/md1 /dev/sda missing /dev/sdc -n 3 -l 5

However, during this process I wanted to make sure I could get back to
the exact same metadata that was on the disks, to root cause what went
wrong, examine the metadata offline, or backup a step if I made a
mistake in the recovery process. Patch9 implements --dump support, the
fact that something like this has not been implemented already is maybe
a clue that it is not such a great idea? I can imagine someone messing
up their configuration if they restored a metadata image to the wrong
device, but if you know what you are doing it could be a useful hack.

---

Dan Williams (9):
imsm: fix max disks per array
imsm: fix, stop metadata updates to newly failed devices
imsm: fix display spares
sysfs: fix sysfs_disk_to_scsi_id
imsm: fix reserved sectors for spares
mdmon: fix, close spare activation race
imsm: support 'missing' devices at Create
util: allow regular files through test_partition()
mdadm: 'dump' support

Examine.c | 42 ++++++++++++++++
ReadMe.c | 1
managemon.c | 5 ++
mdadm.8.in | 13 +++++
mdadm.c | 11 ++++
mdadm.h | 3 +
super-intel.c | 146 +++++++++++++++++++++++++++++++++++++--------------------
sysfs.c | 30 ++++--------
util.c | 6 ++
9 files changed, 182 insertions(+), 75 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH 1/9] imsm: fix max disks per array

am 26.08.2011 04:14:04 von dan.j.williams

Validate geometry is incorrectly looking at max disks support which is
irrelevant for md/mdadm. ->dpa (disks per array) is how many disks the
orom will allow per volume.

Also cleanup an unnecessary ->orom check, is_raid_level_supported()
already does the right thing in the !orom case.

Cc: Marcin Labun
Signed-off-by: Dan Williams
---
super-intel.c | 19 +++++--------------
1 files changed, 5 insertions(+), 14 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index ddf4de9..a921cbc 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -4933,28 +4933,19 @@ static int imsm_default_chunk(const struct imsm_orom *orom)
}

#define pr_vrb(fmt, arg...) (void) (verbose && fprintf(stderr, Name fmt, ##arg))
-/*
- * validate volume parameters with OROM/EFI capabilities
- */
static int
validate_geometry_imsm_orom(struct intel_super *super, int level, int layout,
int raiddisks, int *chunk, int verbose)
{
-#if DEBUG
- verbose = 1;
-#endif
- /* validate container capabilities */
- if (super->orom && raiddisks > super->orom->tds) {
- if (verbose)
- fprintf(stderr, Name ": %d exceeds maximum number of"
- " platform supported disks: %d\n",
- raiddisks, super->orom->tds);
+ /* check/set platform and metadata limits/defaults */
+ if (super->orom && raiddisks > super->orom->dpa) {
+ pr_vrb(": platform supports a maximum of %d disks per array\n",
+ super->orom->dpa);
return 0;
}

/* capabilities of OROM tested - copied from validate_geometry_imsm_volume */
- if (super->orom && (!is_raid_level_supported(super->orom, level,
- raiddisks))) {
+ if (!is_raid_level_supported(super->orom, level, raiddisks)) {
pr_vrb(": platform does not support raid%d with %d disk%s\n",
level, raiddisks, raiddisks > 1 ? "s" : "");
return 0;

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH 2/9] imsm: fix, stop metadata updates to newly failed devices

am 26.08.2011 04:14:09 von dan.j.williams

We already refrain from updating metadata on disks that are failed at
load, need to do the same for new failures. This also reverts b4add146
as we *do* want to update other disks' view of the failed device as out of
date.

Cc: Krzysztof Wojcik
Signed-off-by: Dan Williams
---
super-intel.c | 9 ++++++++-
1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index a921cbc..2f11698 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -4584,7 +4584,7 @@ static int write_super_imsm(struct supertype *st, int doclose)

/* write the mpb for disks that compose raid devices */
for (d = super->disks; d ; d = d->next) {
- if (d->index < 0)
+ if (d->index < 0 || is_failed(&d->disk))
continue;
if (store_imsm_mpb(d->fd, mpb))
fprintf(stderr, "%s: failed for device %d:%d %s\n",
@@ -5840,6 +5840,8 @@ static int mark_failure(struct imsm_dev *dev, struct imsm_disk *disk, int idx)
__u32 ord;
int slot;
struct imsm_map *map;
+ char buf[MAX_RAID_SERIAL_LEN+3];
+ unsigned int len, shift = 0;

/* new failures are always set in map[0] */
map = get_imsm_map(dev, 0);
@@ -5852,6 +5854,11 @@ static int mark_failure(struct imsm_dev *dev, struct imsm_disk *disk, int idx)
if (is_failed(disk) && (ord & IMSM_ORD_REBUILD))
return 0;

+ sprintf(buf, "%s:0", disk->serial);
+ if ((len = strlen(buf)) >= MAX_RAID_SERIAL_LEN)
+ shift = len - MAX_RAID_SERIAL_LEN + 1;
+ strncpy((char *)disk->serial, &buf[shift], MAX_RAID_SERIAL_LEN);
+
disk->status |= FAILED_DISK;
set_imsm_ord_tbl_ent(map, slot, idx | IMSM_ORD_REBUILD);
if (map->failed_disk_num == 0xff)

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH 3/9] imsm: fix display spares

am 26.08.2011 04:14:14 von dan.j.williams

Commit 94827db3 "imsm: add spares to --examine output." may try to
display failed disks whose imsm_disk info is not uptodate (due to not
being able to look itself up by serial). The same effect can be had by
just loosening the restriction in print_imsm_disk().

Signed-off-by: Dan Williams
---
super-intel.c | 36 +++++++++++-------------------------
1 files changed, 11 insertions(+), 25 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index 2f11698..0347183 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -1059,18 +1059,20 @@ static void print_imsm_dev(struct intel_super *super,
printf(" Dirty State : %s\n", dev->vol.dirty ? "dirty" : "clean");
}

-static void print_imsm_disk(struct imsm_super *mpb, int index, __u32 reserved)
+static void print_imsm_disk(struct imsm_disk *disk, int index, __u32 reserved)
{
- struct imsm_disk *disk = __get_imsm_disk(mpb, index);
char str[MAX_RAID_SERIAL_LEN + 1];
__u64 sz;

- if (index < 0 || !disk)
+ if (index < -1 || !disk)
return;

printf("\n");
snprintf(str, MAX_RAID_SERIAL_LEN + 1, "%s", disk->serial);
- printf(" Disk%02d Serial : %s\n", index, str);
+ if (index >= 0)
+ printf(" Disk%02d Serial : %s\n", index, str);
+ else
+ printf(" Disk Serial : %s\n", str);
printf(" State :%s%s%s\n", is_spare(disk) ? " spare" : "",
is_configured(disk) ? " active" : "",
is_failed(disk) ? " failed" : "");
@@ -1254,7 +1256,7 @@ static void examine_super_imsm(struct supertype *st, char *homehost)
printf(" MPB Sectors : %d\n", mpb_sectors(mpb));
printf(" Disks : %d\n", mpb->num_disks);
printf(" RAID Devices : %d\n", mpb->num_raid_devs);
- print_imsm_disk(mpb, super->disks->index, reserved);
+ print_imsm_disk(__get_imsm_disk(mpb, super->disks->index), super->disks->index, reserved);
if (super->bbm_log) {
struct bbm_log *log = super->bbm_log;

@@ -1279,28 +1281,12 @@ static void examine_super_imsm(struct supertype *st, char *homehost)
for (i = 0; i < mpb->num_disks; i++) {
if (i == super->disks->index)
continue;
- print_imsm_disk(mpb, i, reserved);
+ print_imsm_disk(__get_imsm_disk(mpb, i), i, reserved);
}
- for (dl = super->disks ; dl; dl = dl->next) {
- struct imsm_disk *disk;
- char str[MAX_RAID_SERIAL_LEN + 1];
- __u64 sz;
-
- if (dl->index >= 0)
- continue;

- disk = &dl->disk;
- printf("\n");
- snprintf(str, MAX_RAID_SERIAL_LEN + 1, "%s", disk->serial);
- printf(" Disk Serial : %s\n", str);
- printf(" State :%s%s%s\n", is_spare(disk) ? " spare" : "",
- is_configured(disk) ? " active" : "",
- is_failed(disk) ? " failed" : "");
- printf(" Id : %08x\n", __le32_to_cpu(disk->scsi_id));
- sz = __le32_to_cpu(disk->total_blocks) - reserved;
- printf(" Usable Size : %llu%s\n", (unsigned long long)sz,
- human_size(sz * 512));
- }
+ for (dl = super->disks; dl; dl = dl->next)
+ if (dl->index == -1)
+ print_imsm_disk(&dl->disk, -1, reserved);

examine_migr_rec_imsm(super);
}

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH 4/9] sysfs: fix sysfs_disk_to_scsi_id

am 26.08.2011 04:14:19 von dan.j.williams

Not sure how this ever worked, but now we just try to parse a directory
name that looks like :::.

Array creation segfaults on Fedora 14 without this.

Signed-off-by: Dan Williams
---
sysfs.c | 30 ++++++++++--------------------
1 files changed, 10 insertions(+), 20 deletions(-)

diff --git a/sysfs.c b/sysfs.c
index 56813b7..2146264 100644
--- a/sysfs.c
+++ b/sysfs.c
@@ -709,9 +709,9 @@ int sysfs_disk_to_scsi_id(int fd, __u32 *id)
/* from an open block device, try to retrieve it scsi_id */
struct stat st;
char path[256];
- char *c1, *c2;
DIR *dir;
struct dirent *de;
+ int host, bus, target, lun;

if (fstat(fd, &st))
return 1;
@@ -723,32 +723,22 @@ int sysfs_disk_to_scsi_id(int fd, __u32 *id)
if (!dir)
return 1;

- de = readdir(dir);
- while (de) {
- if (strchr(de->d_name, ':'))
+ for (de = readdir(dir); de; de = readdir(dir)) {
+ int count;
+
+ if (de->d_type != DT_DIR)
+ continue;
+
+ count = sscanf(de->d_name, "%d:%d:%d:%d", &host, &bus, &target, &lun);
+ if (count == 4)
break;
- de = readdir(dir);
}
closedir(dir);

if (!de)
return 1;

- c1 = de->d_name;
- c2 = strchr(c1, ':');
- *c2 = '\0';
- *id = strtol(c1, NULL, 10) << 24; /* host */
- c1 = c2 + 1;
- c2 = strchr(c1, ':');
- *c2 = '\0';
- *id |= strtol(c1, NULL, 10) << 16; /* bus */
- c1 = c2 + 1;
- c2 = strchr(c1, ':');
- *c2 = '\0';
- *id |= strtol(c1, NULL, 10) << 8; /* target */
- c1 = c2 + 1;
- *id |= strtol(c1, NULL, 10); /* lun */
-
+ *id = (host << 24) | (bus << 16) | (target << 8) | (lun << 0);
return 0;
}

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH 5/9] imsm: fix reserved sectors for spares

am 26.08.2011 04:14:24 von dan.j.williams

Different OROMs reserve different amounts of space for the migration area.
When activating a spare minimize the reserved space otherwise a valid spare
can be prevented from joining an array with a migration area smaller than
IMSM_RESERVED_SECTORS.

This may result in an array that cannot be reshaped, but that is less
surprising than not being able to rebuild a degraded array.
imsm_reserved_sectors() already reports the minimal value which adds to
the confusion when trying rebuild an array because mdadm -E indicates
that the device has enough space.

Cc: Anna Czarnowska
Signed-off-by: Dan Williams
---
super-intel.c | 11 ++++++++++-
1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index 0347183..193e0d0 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -833,7 +833,16 @@ static struct extent *get_extents(struct intel_super *super, struct dl *dl)
struct extent *rv, *e;
int i;
int memberships = count_memberships(dl, super);
- __u32 reservation = MPB_SECTOR_CNT + IMSM_RESERVED_SECTORS;
+ __u32 reservation;
+
+ /* trim the reserved area for spares, so they can join any array
+ * regardless of whether the OROM has assigned sectors from the
+ * IMSM_RESERVED_SECTORS region
+ */
+ if (dl->index == -1)
+ reservation = MPB_SECTOR_CNT;
+ else
+ reservation = MPB_SECTOR_CNT + IMSM_RESERVED_SECTORS;

rv = malloc(sizeof(struct extent) * (memberships + 1));
if (!rv)

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH 6/9] mdmon: fix, close spare activation race

am 26.08.2011 04:14:29 von dan.j.williams

The following test fails when the md_check_recovery() event triggered by
the ro->rw transition causes remove_and_add_spares() to run while mdmon
is attempting spare activation.

Result is that the kernel races to set the slot immediately after
sysfs_add_disk() writes new_dev. mdmon thinks the spare activation
failed and declines to send the monitor a new acitve_array. We show
degraded after the wait because the monitor cannot notify the metadata
that all disks are in_sync.

#!/bin/bash
i=0
false
while [ $? == 1 ]
do
i=$((i+1))
mdadm -Ss
mdadm -CR /dev/md0 /dev/loop[0-2] -n 3 -e imsm
mdadm -CR /dev/md1 /dev/loop[01] missing -n 3 -l 5
mdadm --wait /dev/md1
mdadm -E /dev/loop2 | grep -i degraded
done
echo "failed: $i"

Signed-off-by: Dan Williams
---
managemon.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/managemon.c b/managemon.c
index 6662f67..d020f82 100644
--- a/managemon.c
+++ b/managemon.c
@@ -498,7 +498,10 @@ static void manage_member(struct mdstat_ent *mdstat,
newa = duplicate_aa(a);
if (!newa)
goto out;
- /* Cool, we can add a device or several. */
+ /* prevent the kernel from activating the disk(s) before we
+ * finish adding them
+ */
+ sysfs_set_str(&a->info, NULL, "sync_action", "frozen");

/* Add device to array and set offset/size/slot.
* and open files for each newdev */

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH 7/9] imsm: support "missing" devices at Create

am 26.08.2011 04:14:35 von dan.j.williams

Specifying missing devices at create is very useful for array recovery.

For imsm create dummy disk entries at init_super_imsm time, and then use
them to fill in unoccupied slots in the final array (if the container is
unpopulated).

If the container is already populated (has a subarray)
'missing' disks must be in reference to already recorded missing devices
in the metadata.

Cc: Dave Jiang
Signed-off-by: Dan Williams
---
super-intel.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 62 insertions(+), 9 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index 193e0d0..fee620d 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -4133,12 +4133,40 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
memset(mpb_new + size_old, 0, size_round - size_old);
}
super->current_vol = idx;
- /* when creating the first raid device in this container set num_disks
- * to zero, i.e. delete this spare and add raid member devices in
- * add_to_super_imsm_volume()
+
+ /* handle 'failed_disks' by either:
+ * a) create dummy disk entries in the table if this the first
+ * volume in the array. We add them here as this is the only
+ * opportunity to add them. add_to_super_imsm_volume()
+ * handles the non-failed disks and continues incrementing
+ * mpb->num_disks.
+ * b) validate that 'failed_disks' matches the current number
+ * of missing disks if the container is populated
*/
- if (super->current_vol == 0)
+ if (super->current_vol == 0) {
mpb->num_disks = 0;
+ for (i = 0; i < info->failed_disks; i++) {
+ struct imsm_disk *disk;
+
+ mpb->num_disks++;
+ disk = __get_imsm_disk(mpb, i);
+ disk->status = CONFIGURED_DISK | FAILED_DISK;
+ disk->scsi_id = __cpu_to_le32(~(__u32)0);
+ snprintf((char *) disk->serial, MAX_RAID_SERIAL_LEN,
+ "missing:%d", i);
+ }
+ find_missing(super);
+ } else {
+ int missing = 0;
+ struct dl *d;
+
+ for (d = super->missing; d; d = d->next)
+ missing++;
+ if (info->failed_disks > missing) {
+ fprintf(stderr, Name": unable to add 'missing' disk to container\n");
+ return 0;
+ }
+ }

if (!check_name(super, name, 0))
return 0;
@@ -4170,15 +4198,14 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
vol = &dev->vol;
vol->migr_state = 0;
set_migr_type(dev, MIGR_INIT);
- vol->dirty = 0;
+ vol->dirty = !info->state;
vol->curr_migr_unit = 0;
map = get_imsm_map(dev, 0);
map->pba_of_lba0 = __cpu_to_le32(super->create_offset);
map->blocks_per_member = __cpu_to_le32(info_to_blocks_per_member(info));
map->blocks_per_strip = __cpu_to_le16(info_to_blocks_per_strip(info));
map->failed_disk_num = ~0;
- map->map_state = info->level ? IMSM_T_STATE_UNINITIALIZED :
- IMSM_T_STATE_NORMAL;
+ map->map_state = info->failed_disks ? IMSM_T_STATE_DEGRADED : IMSM_T_STATE_NORMAL;
map->ddf = 1;

if (info->level == 1 && info->raid_disks > 2) {
@@ -4286,9 +4313,10 @@ static int add_to_super_imsm_volume(struct supertype *st, mdu_disk_info_t *dk,
{
struct intel_super *super = st->sb;
struct imsm_super *mpb = super->anchor;
- struct dl *dl;
+ struct imsm_disk *_disk;
struct imsm_dev *dev;
struct imsm_map *map;
+ struct dl *dl, *df;
int slot;

dev = get_imsm_dev(super, super->current_vol);
@@ -4335,12 +4363,37 @@ static int add_to_super_imsm_volume(struct supertype *st, mdu_disk_info_t *dk,
set_imsm_ord_tbl_ent(map, dk->raid_disk, dl->index);
dl->disk.status = CONFIGURED_DISK;

+ /* update size of 'missing' disks to be at least as large as the
+ * largest acitve member (we only have dummy missing disks when
+ * creating the first volume)
+ */
+ if (super->current_vol == 0) {
+ for (df = super->missing; df; df = df->next) {
+ if (dl->disk.total_blocks > df->disk.total_blocks)
+ df->disk.total_blocks = dl->disk.total_blocks;
+ _disk = __get_imsm_disk(mpb, df->index);
+ *_disk = df->disk;
+ }
+ }
+
+ /* refresh unset/failed slots to point to valid 'missing' entries */
+ for (df = super->missing; df; df = df->next)
+ for (slot = 0; slot < mpb->num_disks; slot++) {
+ __u32 ord = get_imsm_ord_tbl_ent(dev, slot, -1);
+
+ if ((ord & IMSM_ORD_REBUILD) == 0)
+ continue;
+ set_imsm_ord_tbl_ent(map, slot, df->index | IMSM_ORD_REBUILD);
+ dprintf("set slot:%d to missing disk:%d\n", slot, df->index);
+ break;
+ }
+
/* if we are creating the first raid device update the family number */
if (super->current_vol == 0) {
__u32 sum;
struct imsm_dev *_dev = __get_imsm_dev(mpb, 0);
- struct imsm_disk *_disk = __get_imsm_disk(mpb, dl->index);

+ _disk = __get_imsm_disk(mpb, dl->index);
if (!_dev || !_disk) {
fprintf(stderr, Name ": BUG mpb setup error\n");
return 1;

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH 8/9] util: allow regular files through test_partition()

am 26.08.2011 04:14:40 von dan.j.williams

When using the --dump capability the resulting files cannot be read
unless test_partition allows regular files.

Signed-off-by: Dan Williams
---
util.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/util.c b/util.c
index ce03239..5d573bc 100644
--- a/util.c
+++ b/util.c
@@ -279,6 +279,12 @@ int test_partition(int fd)
*/
struct blkpg_ioctl_arg a;
struct blkpg_partition p;
+ struct stat st;
+
+ /* ignore regular files */
+ if (fstat(fd, &st) != -1 && S_ISREG(st.st_mode))
+ return 0;
+
a.op = BLKPG_DEL_PARTITION;
a.data = (void*)&p;
a.datalen = sizeof(p);

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH 9/9] mdadm: "dump" support

am 26.08.2011 04:14:45 von dan.j.williams

mdadm -E /dev/sda --dump=foo

Creates a sparse file image of /dev/sda named foo with a copy of the
metadata instance found by -E.

When used in the opposite direction:

mdadm -E foo --dump=/dev/sda

....can restore metadata to a given block device, assuming it is
identical size to the dump file.

Signed-off-by: Dan Williams
---
Examine.c | 42 ++++++++++++++++++++++++++++++++++++++++--
ReadMe.c | 1 +
mdadm.8.in | 13 +++++++++++++
mdadm.c | 11 ++++++++++-
mdadm.h | 3 ++-
5 files changed, 66 insertions(+), 4 deletions(-)

diff --git a/Examine.c b/Examine.c
index 5d71e53..ac5f36e 100644
--- a/Examine.c
+++ b/Examine.c
@@ -32,7 +32,7 @@
#include "md_p.h"
int Examine(struct mddev_dev *devlist, int brief, int export, int scan,
int SparcAdjust, struct supertype *forcest,
- char *homehost)
+ char *homehost, char *dump)
{

/* Read the raid superblock from a device and
@@ -50,7 +50,8 @@ int Examine(struct mddev_dev *devlist, int brief, int export, int scan,
* If (brief) gather devices for same array and just print a mdadm.conf line including devices=
* if devlist==NULL, use conf_get_devs()
*/
- int fd;
+ unsigned long long size = 0;
+ int fd, dfd;
int rv = 0;
int err = 0;

@@ -107,6 +108,8 @@ int Examine(struct mddev_dev *devlist, int brief, int export, int scan,
}
err = 1;
}
+ if (dump)
+ get_dev_size(fd, devlist->devname, &size);
close(fd);
}
if (err)
@@ -117,6 +120,41 @@ int Examine(struct mddev_dev *devlist, int brief, int export, int scan,
devlist->devname, 0, 0, NULL);
/* Ok, its good enough to try, though the checksum could be wrong */

+ if (dump && (have_container || !size)) {
+ fprintf(stderr, Name ": cannot get source device size\n");
+ break;
+ }
+
+ if (dump) {
+ struct stat s;
+
+ rv = 1;
+ dfd = open(dump, O_RDWR|O_CREAT|O_EXCL, S_IRUSR|S_IWUSR);
+ if (dfd < 0 && errno == EEXIST) {
+ if (ask("dump file exists, overwrite? "))
+ dfd = open(dump, O_RDWR, S_IRUSR|S_IWUSR);
+ }
+ if (dfd < 0) {
+ fprintf(stderr, Name ": failed to open %s: %s\n",
+ dump, strerror(errno));
+ break;
+ }
+
+ if (fstat(dfd, &s) != -1 && S_ISREG(s.st_mode)) {
+ if (ftruncate(dfd, size) < 0) {
+ fprintf(stderr, Name ": failed to setup dump file %s: %s\n",
+ dump, strerror(errno));
+ break;
+ }
+ }
+
+ rv = st->ss->store_super(st, dfd);
+ if (rv)
+ fprintf(stderr, Name ": failed to store metadata to %s: %s\n",
+ dump, strerror(errno));
+ break;
+ }
+
if (brief && st->ss->brief_examine_super == NULL) {
if (!scan)
fprintf(stderr, Name ": No brief listing for %s on %s\n",
diff --git a/ReadMe.c b/ReadMe.c
index b658841..684fbc9 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -169,6 +169,7 @@ struct option long_options[] = {

/* For Detail/Examine */
{"brief", 0, 0, Brief},
+ {"dump", 1, 0, Dump},
{"export", 0, 0, 'Y'},
{"sparc2.2", 0, 0, Sparc22},
{"test", 0, 0, 't'},
diff --git a/mdadm.8.in b/mdadm.8.in
index e22fde4..6eb65c3 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -1283,6 +1283,19 @@ device (e.g.
does not report the bitmap for that array.

.TP
+.BR \-\-dump
+Augment
+.B \-\-examine
+to, instead of printing the contents of the metadata, write the
+metadata image to the specified argument. The argument
+can be another block device or a regular file. In the case of a regular
+file the dump file will be created as a sparse file of equal size to the
+source of the metadata image. Note, this is a raw debug feature no
+attempt is made to check the validity of writing a given metadata image
+to a given block device. This assumes that only a single device is passed to
+.B \-\-examine.
+
+.TP
.BR \-R ", " \-\-run
start a partially assembled array. If
.B \-\-assemble
diff --git a/mdadm.c b/mdadm.c
index fb51051..154e116 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -106,6 +106,7 @@ int main(int argc, char *argv[])
char *subarray = NULL;
char *remove_path = NULL;
char *udev_filename = NULL;
+ char *dump_filename = NULL;

int print_help = 0;
FILE *outf;
@@ -160,6 +161,9 @@ int main(int argc, char *argv[])
case Brief:
brief = 1;
continue;
+ case Dump:
+ dump_filename = optarg;
+ continue;

case 'Y': export++;
continue;
@@ -1400,11 +1404,16 @@ int main(int argc, char *argv[])
fprintf(stderr, Name ": No devices listed in %s\n", configfile?configfile:DefaultConfFile);
exit(1);
}
+ if (dump_filename && (devs_found > 1 || scan)) {
+ fprintf(stderr,
+ Name ": Only one device can be specifed with --dump\n");
+ exit(2);
+ }
if (brief && verbose)
brief = 2;
rv = Examine(devlist, scan?(verbose>1?0:verbose+1):brief,
export, scan,
- SparcAdjust, ss, homehost);
+ SparcAdjust, ss, homehost, dump_filename);
} else if (devmode == DetailPlatform) {
rv = Detail_Platform(ss ? ss->ss : NULL, ss ? scan : 1, verbose);
} else {
diff --git a/mdadm.h b/mdadm.h
index 8bd0077..f242223 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -313,6 +313,7 @@ enum special_options {
RebuildMapOpt,
InvalidBackup,
UdevRules,
+ Dump,
};

/* structures read from config file */
@@ -1049,7 +1050,7 @@ extern int Detail(char *dev, int brief, int export, int test, char *homehost);
extern int Detail_Platform(struct superswitch *ss, int scan, int verbose);
extern int Query(char *dev);
extern int Examine(struct mddev_dev *devlist, int brief, int export, int scan,
- int SparcAdjust, struct supertype *forcest, char *homehost);
+ int SparcAdjust, struct supertype *forcest, char *homehost, char *dump);
extern int Monitor(struct mddev_dev *devlist,
char *mailaddr, char *alert_cmd,
int period, int daemonise, int scan, int oneshot,

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/9] recovering an imsm raid5 array

am 26.08.2011 13:06:22 von linbloke

On 26/08/11 12:13 PM, Dan Williams wrote:

> mistake in the recovery process. Patch9 implements --dump support, the
> fact that something like this has not been implemented already is maybe
> a clue that it is not such a great idea? I can imagine someone messing
> up their configuration if they restored a metadata image to the wrong
> device, but if you know what you are doing it could be a useful hack.
>

I'm just a nobody mdadm user but I read nearly every word of this list
and read the above and thought, if Dan thinks it's a good idea, then it
probably is. I then read the summary of Patch 9, included below and
thought, this might be safer if it required a --force switch to actually
make the change, force usually being a sign that the user knows what
they are doing (;-) Perhaps without it, it could just check that foo is
valid (identical size) and suggest the force switch to make it so. Sorry
I'm not a coder that can supply a patch, this compromise to deliver the
functionality but with a safeguard in place occurred to me.

Best regards and kudos to our linux developers.

mdadm -E /dev/sda --dump=foo

Creates a sparse file image of /dev/sda named foo with a copy of the
metadata instance found by -E.

When used in the opposite direction:

mdadm -E foo --dump=/dev/sda

....can restore metadata to a given block device, assuming it is
identical size to the dump file.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/9] imsm: fix reserved sectors for spares

am 26.08.2011 21:51:18 von dan.j.williams

On Thu, Aug 25, 2011 at 7:14 PM, Dan Williams > wrote:
> Different OROMs reserve different amounts of space for the migration =
area.
> When activating a spare minimize the reserved space otherwise a valid=
spare
> can be prevented from joining an array with a migration area smaller =
than
> IMSM_RESERVED_SECTORS.
>
> This may result in an array that cannot be reshaped, but that is less
> surprising than not being able to rebuild a degraded array.
> imsm_reserved_sectors() already reports the minimal value which adds =
to
> the confusion when trying rebuild an array because mdadm -E indicates
> that the device has enough space.
>
> Cc: Anna Czarnowska
> Signed-off-by: Dan Williams
> ---
> =A0super-intel.c | =A0 11 ++++++++++-
> =A01 files changed, 10 insertions(+), 1 deletions(-)
>
> diff --git a/super-intel.c b/super-intel.c
> index 0347183..193e0d0 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -833,7 +833,16 @@ static struct extent *get_extents(struct intel_s=
uper *super, struct dl *dl)
> =A0 =A0 =A0 =A0struct extent *rv, *e;
> =A0 =A0 =A0 =A0int i;
> =A0 =A0 =A0 =A0int memberships =3D count_memberships(dl, super);
> - =A0 =A0 =A0 __u32 reservation =3D MPB_SECTOR_CNT + IMSM_RESERVED_SE=
CTORS;
> + =A0 =A0 =A0 __u32 reservation;
> +
> + =A0 =A0 =A0 /* trim the reserved area for spares, so they can join =
any array
> + =A0 =A0 =A0 =A0* regardless of whether the OROM has assigned sector=
s from the
> + =A0 =A0 =A0 =A0* IMSM_RESERVED_SECTORS region
> + =A0 =A0 =A0 =A0*/
> + =A0 =A0 =A0 if (dl->index == -1)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 reservation =3D MPB_SECTOR_CNT;

Anna rightly points out that this could probably be safely made
bigger. As it stands this applies to too broad an array of disks. I
think a happy medium (until we can nail down the forward compatibility
of older oroms, v8 in this case) is to detect the case where the disk
is being activated for rebuild and if it is at least as large as one
of the existing members truncate the reserved region to the same size
as the other member. That way we are at least compatible with
whatever agent created the array in the first instance.

But this also comes back to the problem of duplicating the array
configuration. It becomes difficult to make things the same size
unless the orom version (reserved region layout) is specified.

--
Dan

>
> =A0 =A0 =A0 =A0rv =3D malloc(sizeof(struct extent) * (memberships + 1=
));
> =A0 =A0 =A0 =A0if (!rv)
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" i=
n
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/9] imsm: fix reserved sectors for spares

am 30.08.2011 04:20:36 von NeilBrown

On Fri, 26 Aug 2011 12:51:18 -0700 "Williams, Dan J"
wrote:

> On Thu, Aug 25, 2011 at 7:14 PM, Dan Williams om> wrote:
> > Different OROMs reserve different amounts of space for the migratio=
n area.
> > When activating a spare minimize the reserved space otherwise a val=
id spare
> > can be prevented from joining an array with a migration area smalle=
r than
> > IMSM_RESERVED_SECTORS.
> >
> > This may result in an array that cannot be reshaped, but that is le=
ss
> > surprising than not being able to rebuild a degraded array.
> > imsm_reserved_sectors() already reports the minimal value which add=
s to
> > the confusion when trying rebuild an array because mdadm -E indicat=
es
> > that the device has enough space.
> >
> > Cc: Anna Czarnowska
> > Signed-off-by: Dan Williams
> > ---
> > =A0super-intel.c | =A0 11 ++++++++++-
> > =A01 files changed, 10 insertions(+), 1 deletions(-)
> >
> > diff --git a/super-intel.c b/super-intel.c
> > index 0347183..193e0d0 100644
> > --- a/super-intel.c
> > +++ b/super-intel.c
> > @@ -833,7 +833,16 @@ static struct extent *get_extents(struct intel=
_super *super, struct dl *dl)
> > =A0 =A0 =A0 =A0struct extent *rv, *e;
> > =A0 =A0 =A0 =A0int i;
> > =A0 =A0 =A0 =A0int memberships =3D count_memberships(dl, super);
> > - =A0 =A0 =A0 __u32 reservation =3D MPB_SECTOR_CNT + IMSM_RESERVED_=
SECTORS;
> > + =A0 =A0 =A0 __u32 reservation;
> > +
> > + =A0 =A0 =A0 /* trim the reserved area for spares, so they can joi=
n any array
> > + =A0 =A0 =A0 =A0* regardless of whether the OROM has assigned sect=
ors from the
> > + =A0 =A0 =A0 =A0* IMSM_RESERVED_SECTORS region
> > + =A0 =A0 =A0 =A0*/
> > + =A0 =A0 =A0 if (dl->index == -1)
> > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 reservation =3D MPB_SECTOR_CNT;
>=20
> Anna rightly points out that this could probably be safely made
> bigger. As it stands this applies to too broad an array of disks. I
> think a happy medium (until we can nail down the forward compatibilit=
y
> of older oroms, v8 in this case) is to detect the case where the disk
> is being activated for rebuild and if it is at least as large as one
> of the existing members truncate the reserved region to the same size
> as the other member. That way we are at least compatible with
> whatever agent created the array in the first instance.
>=20

I think you are saying that I can go ahead and apply this patch, but th=
at it
might get improved upon in the future .... I hope that is right ?

> But this also comes back to the problem of duplicating the array
> configuration. It becomes difficult to make things the same size
> unless the orom version (reserved region layout) is specified.

Yes, that is occasionally a serious problem. As metadata becomes more
flexible it become harder to reproduce exact configuration.

Maybe I need a --no-default option to --create to ensure no defaults ar=
e used.
Then some sort of generic=20
--config-param foo=3Dbar
which creates a dictionary of foo=3Dbar assignments which are used by t=
he
metadata to fill in any detail that they would normally fill from defau=
lts.

Sounds a bit heavy-weight though...

NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" i=
n
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH 7/9] imsm: support "missing" devices at Create

am 30.08.2011 04:26:34 von NeilBrown

On Thu, 25 Aug 2011 19:14:35 -0700 Dan Williams
wrote:

> Specifying missing devices at create is very useful for array recovery.
>
> For imsm create dummy disk entries at init_super_imsm time, and then use
> them to fill in unoccupied slots in the final array (if the container is
> unpopulated).
>
> If the container is already populated (has a subarray)
> 'missing' disks must be in reference to already recorded missing devices
> in the metadata.

It would appear you have also implement --assume-clean:

> - vol->dirty = 0;
> + vol->dirty = !info->state;

Thanks - that has been bothering me.

I'll update the commit-log to record this.

Thanks,
NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH 9/9] mdadm: "dump" support

am 30.08.2011 04:58:03 von NeilBrown

On Thu, 25 Aug 2011 19:14:45 -0700 Dan Williams
wrote:

> mdadm -E /dev/sda --dump=foo
>
> Creates a sparse file image of /dev/sda named foo with a copy of the
> metadata instance found by -E.
>
> When used in the opposite direction:
>
> mdadm -E foo --dump=/dev/sda
>
> ...can restore metadata to a given block device, assuming it is
> identical size to the dump file.

I like this functionality - I've thought about doing something like this
before but never quite done it.

I'm not convinced of the interface yet - especially the lack of checks. If
it can be misused it will be misused and we should try to avoid that.

Using a sparse file is probably a good idea - it removes the need to try to
invent a format for storing non-consecutive metadata and recording the offset.

dmraid has a --dump-metadata (-D) option which creates a directory and stores
one file per device there. I like that as it make it easy to get everything
in the right place.

Maybe the arg to --dump could be a directory name to store dump files in ...
but then we lose the pleasing symmetry of using the same command to dump and
restore. But is that symmetry only pleasing to us and would it be confusing
to new users?

Suppose we did have a separate 'restore' function - what would it look like?
An option to create?

mdadm --create --restore=some-directory /dev/sda /dev/sdb

Then the names of the dump files would need to indicate which array and which
position in the array... And this would make it hard to restore the metadata
to a single device.

So probably just
mdadm -E /dev/sda --restore=some-file
would restore the metadata then report it?? Maybe keeping the old metadata
as a backup.
Or maybe drop the -E and just have '--dump' and '--restore' as top-level
Misc options.

It would be good if --restore would create a backup of the in-disk metadata
if there is any - to "some-file.bak" maybe ??

So how about this:

- allow --examine to work on a file.
- new misc option --dump=foo will create directory foo and store an image of
any metadata on each device listed - filename matches basename of device.
- new misc option --restore=bar. For each device listed restore metadata
from bar (if it is a file, in which case only one device) or bar/basename
(if it is a directory). In either case if there is metadata on the device,
save it to the file + ".bak". If ".bak" exists and there is metadata -
abort.

Does that seem reasonable?

Thanks,
NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/9] recovering an imsm raid5 array

am 30.08.2011 05:13:31 von NeilBrown

On Thu, 25 Aug 2011 19:13:59 -0700 Dan Williams
wrote:

> I run imsm raid5 and raid1 arrays on my personal systems and was
> recently bit by the bug that was fixed in commit 1a2487c2 "FIX: imsm:
> OROM does not recognize degraded arrays (V2)". In the process of
> investigating that I pulled the wrong disk and ended up in a dual
> degraded situation.
>
> These patches (ordered in roughly increasing order of required review)
> are the features needed to get the array up and running again, as well
> as some random fixes spotted along the way.
>
> The most important patch for recovery being patch7 "imsm: support
> 'missing' devices at Create", allowing the mdadm raid5 recovery path of
> recreating the raid array with what one thinks are the good disks /
> order, and then attempt to mount the filesystem to see if the guess was
> correct.
>
> Example, create degraded raid5 with slot1 missing.
>
> mdadm --create /dev/md0 /dev/sd[ac] -n 2 -e imsm
> mdadm --create /dev/md1 /dev/sda missing /dev/sdc -n 3 -l 5
>
> However, during this process I wanted to make sure I could get back to
> the exact same metadata that was on the disks, to root cause what went
> wrong, examine the metadata offline, or backup a step if I made a
> mistake in the recovery process. Patch9 implements --dump support, the
> fact that something like this has not been implemented already is maybe
> a clue that it is not such a great idea? I can imagine someone messing
> up their configuration if they restored a metadata image to the wrong
> device, but if you know what you are doing it could be a useful hack.
>
> ---
>
> Dan Williams (9):
> imsm: fix max disks per array
> imsm: fix, stop metadata updates to newly failed devices
> imsm: fix display spares
> sysfs: fix sysfs_disk_to_scsi_id
> imsm: fix reserved sectors for spares
> mdmon: fix, close spare activation race
> imsm: support 'missing' devices at Create
> util: allow regular files through test_partition()
> mdadm: 'dump' support
>
>
> Examine.c | 42 ++++++++++++++++
> ReadMe.c | 1
> managemon.c | 5 ++
> mdadm.8.in | 13 +++++
> mdadm.c | 11 ++++
> mdadm.h | 3 +
> super-intel.c | 146 +++++++++++++++++++++++++++++++++++++--------------------
> sysfs.c | 30 ++++--------
> util.c | 6 ++
> 9 files changed, 182 insertions(+), 75 deletions(-)

Thanks.
I've applied all but the last two - which I might want done slightly
differently as discussed already.

I also fixed a build breakage in "make everything" .... though maybe I should
get rid of mdassemble - it is hard to maintain and I'm not really sure that
it is useful.

Thanks,
NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH 9/9] mdadm: "dump" support

am 30.08.2011 12:12:03 von alexander.kuehn

Zitat von NeilBrown :

> On Thu, 25 Aug 2011 19:14:45 -0700 Dan Williams
> wrote:
>
>> mdadm -E /dev/sda --dump=foo
>>
>> Creates a sparse file image of /dev/sda named foo with a copy of the
>> metadata instance found by -E.
>>
>> When used in the opposite direction:
>>
>> mdadm -E foo --dump=/dev/sda
>>
>> ...can restore metadata to a given block device, assuming it is
>> identical size to the dump file.
>
> Suppose we did have a separate 'restore' function - what would it look like?
> An option to create?
>
> mdadm --create --restore=some-directory /dev/sda /dev/sdb
>

Yes, from a UI perspective having separate dump/restore options is
better. Since there is also dump(8)/restore(8) it would also keep the
nomenclature.
Otherwise I would rather name it "copy-metadata" or something as with
copying the user is or should be aware that the order or arguments
matters.
Alex.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/9] imsm: fix reserved sectors for spares

am 06.09.2011 22:42:45 von dan.j.williams

On Mon, Aug 29, 2011 at 7:20 PM, NeilBrown wrote:
>> Anna rightly points out that this could probably be safely made
>> bigger. =A0As it stands this applies to too broad an array of disks.=
=A0I
>> think a happy medium (until we can nail down the forward compatibili=
ty
>> of older oroms, v8 in this case) is to detect the case where the dis=
k
>> is being activated for rebuild and if it is at least as large as one
>> of the existing members truncate the reserved region to the same siz=
e
>> as the other member. =A0That way we are at least compatible with
>> whatever agent created the array in the first instance.
>>
>
> I think you are saying that I can go ahead and apply this patch, but =
that it
> might get improved upon in the future .... I hope that is right ?

Yes, this may have corrected things too far in the other direction,
but is less surprising than what we had before.

>> But this also comes back to the problem of duplicating the array
>> configuration. =A0It becomes difficult to make things the same size
>> unless the orom version (reserved region layout) is specified.
>
> Yes, that is occasionally a serious problem. =A0As metadata becomes m=
ore
> flexible it become harder to reproduce exact configuration.
>
> Maybe I need a --no-default option to --create to ensure no defaults =
are used.
> Then some sort of generic
> =A0 =A0--config-param foo=3Dbar
> which creates a dictionary of foo=3Dbar assignments which are used by=
the
> metadata to fill in any detail that they would normally fill from def=
aults.
>
> Sounds a bit heavy-weight though...

I think having a metadata dump facility covers most the need for
duplicating a configuration.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" i=
n
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

RE: [PATCH 5/9] imsm: fix reserved sectors for spares

am 19.09.2011 14:57:48 von anna.czarnowska

The patch below makes a better decision as to the size of required rese=
rvation.

> -----Original Message-----
> From: Williams, Dan J [mailto:dan.j.williams@intel.com]
> Sent: Tuesday, September 06, 2011 10:43 PM
> To: NeilBrown
> Cc: linux-raid@vger.kernel.org; Czarnowska, Anna; Labun, Marcin;
> Ciechanowski, Ed
> Subject: Re: [PATCH 5/9] imsm: fix reserved sectors for spares
>=20
> On Mon, Aug 29, 2011 at 7:20 PM, NeilBrown wrote:
> >> Anna rightly points out that this could probably be safely made
> >> bigger. =A0As it stands this applies to too broad an array of disk=
s.
> =A0I
> >> think a happy medium (until we can nail down the forward
> compatibility
> >> of older oroms, v8 in this case) is to detect the case where the
> disk
> >> is being activated for rebuild and if it is at least as large as o=
ne
> >> of the existing members truncate the reserved region to the same
> size
> >> as the other member. =A0That way we are at least compatible with
> >> whatever agent created the array in the first instance.
> >>
> >
> > I think you are saying that I can go ahead and apply this patch, bu=
t
> that it
> > might get improved upon in the future .... I hope that is right ?
>=20
> Yes, this may have corrected things too far in the other direction,
> but is less surprising than what we had before.

=46rom bf5c919219d9947a0eba1ce2b450c8f663291f48 Mon Sep 17 00:00:00 200=
1
=46rom: Anna Czarnowska
Date: Mon, 19 Sep 2011 13:41:46 +0200
Subject: [PATCH] Calculate reservation for a spare based on active disk=
s in container

New function to calculate minimum reservation to expect from a spare
is introduced.

The required amount of space at the end of the disk depends on what we
plan to do with the spare and what array we want to use it in.
=46or creating new subarray in an empty container the full reservation =
of
MPB_SECTOR_COUNT + IMSM_RESERVED_SECTORS is required.

=46or recovery or OLCE on a volume using new metadata format at least
MPB_SECTOR_CNT + NUM_BLOCKS_DIRTY_STRIPE_REGION is required.
The additional space for migration optimization included in
IMSM_RESERVED_SECTORS is not necessary and is not reserved by some orom=
s.

MPB_SECTOR_CNT alone is not sufficient as it does not include the
reservation at the end of subarray.

However if the real reservation on active disks is smaller than this
(when the array uses old metadata format) we should use the real value.
This will allow OLCE and recovery to start on the spare even if the vol=
ume
doesn't have the reservation we normally use for new volumes.

Signed-off-by: Anna Czarnowska
---
super-intel.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++=
+++++--
1 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index f1c924f..8dad03c 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -88,6 +88,7 @@
=20
#define MPB_SECTOR_CNT 2210
#define IMSM_RESERVED_SECTORS 4096
+#define NUM_BLOCKS_DIRTY_STRIPE_REGION 2056
#define SECT_PER_MB_SHIFT 11
=20
/* Disk configuration info. */
@@ -827,6 +828,8 @@ static int count_memberships(struct dl *dl, struct =
intel_super *super)
return memberships;
}
=20
+static __u32 imsm_min_reserved_sectors(struct intel_super *super);
+
static struct extent *get_extents(struct intel_super *super, struct dl=
*dl)
{
/* find a list of used extents on the given physical device */
@@ -840,7 +843,7 @@ static struct extent *get_extents(struct intel_supe=
r *super, struct dl *dl)
* IMSM_RESERVED_SECTORS region
*/
if (dl->index == -1)
- reservation =3D MPB_SECTOR_CNT;
+ reservation =3D imsm_min_reserved_sectors(super);
else
reservation =3D MPB_SECTOR_CNT + IMSM_RESERVED_SECTORS;
=20
@@ -933,6 +936,51 @@ static int is_failed(struct imsm_disk *disk)
return (disk->status & FAILED_DISK) == FAILED_DISK;
}
=20
+/* try to determine how much space is reserved for metadata from
+ * the last get_extents() entry on the smallest active disk,
+ * otherwise fallback to the default
+ */
+static __u32 imsm_min_reserved_sectors(struct intel_super *super)
+{
+ struct extent *e;
+ int i;
+ __u32 min_active, remainder;
+ __u32 rv =3D MPB_SECTOR_CNT + IMSM_RESERVED_SECTORS;
+ struct dl *dl, *dl_min =3D NULL;
+
+ if (!super)
+ return rv;
+
+ min_active =3D 0;
+ for (dl =3D super->disks; dl; dl =3D dl->next) {
+ if (dl->index < 0)
+ continue;
+ if (dl->disk.total_blocks < min_active || min_active == 0) {
+ dl_min =3D dl;
+ min_active =3D dl->disk.total_blocks;
+ }
+ }
+ if (!dl_min)
+ return rv;
+
+ /* find last lba used by subarrays on the smallest active disk */
+ e =3D get_extents(super, dl_min);
+ if (!e)
+ return rv;
+ for (i =3D 0; e[i].size; i++)
+ continue;
+
+ remainder =3D min_active - e[i].start;
+ free(e);
+
+ /* to give priority to recovery we should not require full
+ IMSM_RESERVED_SECTORS from the spare */
+ rv =3D MPB_SECTOR_CNT + NUM_BLOCKS_DIRTY_STRIPE_REGION;
+
+ /* if real reservation is smaller use that value */
+ return (remainder < rv) ? remainder : rv;
+}
+
/* Return minimum size of a spare that can be used in this array*/
static unsigned long long min_acceptable_spare_size_imsm(struct supert=
ype *st)
{
@@ -941,6 +989,7 @@ static unsigned long long min_acceptable_spare_size=
_imsm(struct supertype *st)
struct extent *e;
int i;
unsigned long long rv =3D 0;
+ __u32 reservation;
=20
if (!super)
return rv;
@@ -958,9 +1007,12 @@ static unsigned long long min_acceptable_spare_si=
ze_imsm(struct supertype *st)
continue;
if (i > 0)
rv =3D e[i-1].start + e[i-1].size;
+ reservation =3D __le32_to_cpu(dl->disk.total_blocks) - e[i].start;
free(e);
+
/* add the amount of space needed for metadata */
- rv =3D rv + MPB_SECTOR_CNT + IMSM_RESERVED_SECTORS;
+ rv =3D rv + imsm_min_reserved_sectors(super);
+
return rv * 512;
}
=20
--=20
1.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" i=
n
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/9] imsm: fix reserved sectors for spares

am 21.09.2011 06:45:23 von NeilBrown

--Sig_/iBL6qLwcB48CLW7GajmIx63
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

On Mon, 19 Sep 2011 12:57:48 +0000 "Czarnowska, Anna"
wrote:

> The patch below makes a better decision as to the size of required reserv=
ation.

I've applied this, thanks.

NeilBrown

>=20
> > -----Original Message-----
> > From: Williams, Dan J [mailto:dan.j.williams@intel.com]
> > Sent: Tuesday, September 06, 2011 10:43 PM
> > To: NeilBrown
> > Cc: linux-raid@vger.kernel.org; Czarnowska, Anna; Labun, Marcin;
> > Ciechanowski, Ed
> > Subject: Re: [PATCH 5/9] imsm: fix reserved sectors for spares
> >=20
> > On Mon, Aug 29, 2011 at 7:20 PM, NeilBrown wrote:
> > >> Anna rightly points out that this could probably be safely made
> > >> bigger. =A0As it stands this applies to too broad an array of disks.
> > =A0I
> > >> think a happy medium (until we can nail down the forward
> > compatibility
> > >> of older oroms, v8 in this case) is to detect the case where the
> > disk
> > >> is being activated for rebuild and if it is at least as large as one
> > >> of the existing members truncate the reserved region to the same
> > size
> > >> as the other member. =A0That way we are at least compatible with
> > >> whatever agent created the array in the first instance.
> > >>
> > >
> > > I think you are saying that I can go ahead and apply this patch, but
> > that it
> > > might get improved upon in the future .... I hope that is right ?
> >=20
> > Yes, this may have corrected things too far in the other direction,
> > but is less surprising than what we had before.
>=20
> >From bf5c919219d9947a0eba1ce2b450c8f663291f48 Mon Sep 17 00:00:00 2001
> From: Anna Czarnowska
> Date: Mon, 19 Sep 2011 13:41:46 +0200
> Subject: [PATCH] Calculate reservation for a spare based on active disks =
in container
>=20
> New function to calculate minimum reservation to expect from a spare
> is introduced.
>=20
> The required amount of space at the end of the disk depends on what we
> plan to do with the spare and what array we want to use it in.
> For creating new subarray in an empty container the full reservation of
> MPB_SECTOR_COUNT + IMSM_RESERVED_SECTORS is required.
>=20
> For recovery or OLCE on a volume using new metadata format at least
> MPB_SECTOR_CNT + NUM_BLOCKS_DIRTY_STRIPE_REGION is required.
> The additional space for migration optimization included in
> IMSM_RESERVED_SECTORS is not necessary and is not reserved by some oroms.
>=20
> MPB_SECTOR_CNT alone is not sufficient as it does not include the
> reservation at the end of subarray.
>=20
> However if the real reservation on active disks is smaller than this
> (when the array uses old metadata format) we should use the real value.
> This will allow OLCE and recovery to start on the spare even if the volume
> doesn't have the reservation we normally use for new volumes.
>=20
> Signed-off-by: Anna Czarnowska
> ---
> super-intel.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++=
+++--
> 1 files changed, 54 insertions(+), 2 deletions(-)
>=20
> diff --git a/super-intel.c b/super-intel.c
> index f1c924f..8dad03c 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -88,6 +88,7 @@
> =20
> #define MPB_SECTOR_CNT 2210
> #define IMSM_RESERVED_SECTORS 4096
> +#define NUM_BLOCKS_DIRTY_STRIPE_REGION 2056
> #define SECT_PER_MB_SHIFT 11
> =20
> /* Disk configuration info. */
> @@ -827,6 +828,8 @@ static int count_memberships(struct dl *dl, struct in=
tel_super *super)
> return memberships;
> }
> =20
> +static __u32 imsm_min_reserved_sectors(struct intel_super *super);
> +
> static struct extent *get_extents(struct intel_super *super, struct dl *=
dl)
> {
> /* find a list of used extents on the given physical device */
> @@ -840,7 +843,7 @@ static struct extent *get_extents(struct intel_super =
*super, struct dl *dl)
> * IMSM_RESERVED_SECTORS region
> */
> if (dl->index == -1)
> - reservation =3D MPB_SECTOR_CNT;
> + reservation =3D imsm_min_reserved_sectors(super);
> else
> reservation =3D MPB_SECTOR_CNT + IMSM_RESERVED_SECTORS;
> =20
> @@ -933,6 +936,51 @@ static int is_failed(struct imsm_disk *disk)
> return (disk->status & FAILED_DISK) == FAILED_DISK;
> }
> =20
> +/* try to determine how much space is reserved for metadata from
> + * the last get_extents() entry on the smallest active disk,
> + * otherwise fallback to the default
> + */
> +static __u32 imsm_min_reserved_sectors(struct intel_super *super)
> +{
> + struct extent *e;
> + int i;
> + __u32 min_active, remainder;
> + __u32 rv =3D MPB_SECTOR_CNT + IMSM_RESERVED_SECTORS;
> + struct dl *dl, *dl_min =3D NULL;
> +
> + if (!super)
> + return rv;
> +
> + min_active =3D 0;
> + for (dl =3D super->disks; dl; dl =3D dl->next) {
> + if (dl->index < 0)
> + continue;
> + if (dl->disk.total_blocks < min_active || min_active == 0) {
> + dl_min =3D dl;
> + min_active =3D dl->disk.total_blocks;
> + }
> + }
> + if (!dl_min)
> + return rv;
> +
> + /* find last lba used by subarrays on the smallest active disk */
> + e =3D get_extents(super, dl_min);
> + if (!e)
> + return rv;
> + for (i =3D 0; e[i].size; i++)
> + continue;
> +
> + remainder =3D min_active - e[i].start;
> + free(e);
> +
> + /* to give priority to recovery we should not require full
> + IMSM_RESERVED_SECTORS from the spare */
> + rv =3D MPB_SECTOR_CNT + NUM_BLOCKS_DIRTY_STRIPE_REGION;
> +
> + /* if real reservation is smaller use that value */
> + return (remainder < rv) ? remainder : rv;
> +}
> +
> /* Return minimum size of a spare that can be used in this array*/
> static unsigned long long min_acceptable_spare_size_imsm(struct supertyp=
e *st)
> {
> @@ -941,6 +989,7 @@ static unsigned long long min_acceptable_spare_size_i=
msm(struct supertype *st)
> struct extent *e;
> int i;
> unsigned long long rv =3D 0;
> + __u32 reservation;
> =20
> if (!super)
> return rv;
> @@ -958,9 +1007,12 @@ static unsigned long long min_acceptable_spare_size=
_imsm(struct supertype *st)
> continue;
> if (i > 0)
> rv =3D e[i-1].start + e[i-1].size;
> + reservation =3D __le32_to_cpu(dl->disk.total_blocks) - e[i].start;
> free(e);
> +
> /* add the amount of space needed for metadata */
> - rv =3D rv + MPB_SECTOR_CNT + IMSM_RESERVED_SECTORS;
> + rv =3D rv + imsm_min_reserved_sectors(super);
> +
> return rv * 512;
> }
> =20

--Sig_/iBL6qLwcB48CLW7GajmIx63
Content-Type: application/pgp-signature; name=signature.asc
Content-Disposition: attachment; filename=signature.asc

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)

iD8DBQFOeWvjG5fc6gV+Wb0RAp1sAJ0cxMyKFWUQQeKsucJ5xGCyE6P6FACf X66X
aj0+5STTwmZ1G6+4WFGz0Ls=
=vV0v
-----END PGP SIGNATURE-----

--Sig_/iBL6qLwcB48CLW7GajmIx63--
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html