A policy framework for mdadm (incorporating domains and hotplug and such)

on 01.07.2010 08:50:07 by NeilBrown

Hi all,
I figured it was time to make a firm decision on what "domains" and related
things would look like in mdadm. In all the discussions so far I have just
been making suggestions and exploring possibilities and wandering around the
edges of the issue. But that cannot last forever as there is need for some
certainty.

I had a read through Doug's patch set and Przemyslaw's and Anna's work on
top of that and there were certain aspects of what I saw that I didn't
like.
In particular the model of what a 'domain' was seems to keep changing, first
growing special cases for partitions, and then growing subsets (which I admit
I didn't completely understand). When something grows and changes like that
so quickly there is a very real possibility that the final result won't meet
the original needs any more.

I think we need to start with something that is *right* - at least as far as
it goes. Refinements that are predictable are ok, but structural changes
aren't.

So here is my concrete proposal on how these things will work. I have
already started implementing it, which shows that I'm fairly committed to
this and would need a very strong argument for significant change to happen.

The first step is to forget about domains. We will come back to them later
as they are important and useful. But they are not central and we won't be
starting there. So forget them. (Forget what? I don't remember anything...)

What we need is a policy framework, for encoding policy about the various
automatic actions that mdadm performs. We already have bits of policy like
the spare-group tag (which guides automatic spare migration) and the 'auto'
mdadm.conf line (which guides automatic assembly). However that is all
ad-hoc and as the amount of policy increases, the amount of interaction
increases so we need a unifying platform. That is where we need to start.

So point 1 is that we need a policy framework.

Point 2 is that policy revolves primarily around devices (rather than
arrays) and to a lesser extent around metadata types.
It is devices that are migrated, devices that arrays are built from, devices
that are automatically made into spares etc.
Metadata types often encode some specific policy in the metadata, so they
need some fairly strong role in the policy framework too. Often the
metadata type is like a parameter to a policy. "You can incorporate this
device in any imsm array".

So Abstraction 1 is a "Policy statement".

A policy statement applies to a particular device, possibly in the context
of a particular metadata, and asserts that a particular name has a
particular value.
action=spare (ddf1)
might be a policy statement about a device. It says that where ddf1
metadata is involved, the device can be made a hot-spare when it is
hot-plugged.
auto=homehost (0.90)
might be another which says that auto-assembly may use a non-disambiguated
name (no trailing _NN) when assembling this device into a metadata=0.90
array providing the homehost information in the metadata matches this host.

A statement might not have any metadata type associated.
action=ignore
applies irrespective of metadata type.
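
A minimal sketch in Python of how such a statement could be represented. The class name, field names and the lookup helper are illustrative assumptions, not mdadm's actual data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PolicyStatement:
    """One 'name=value (metadata)' assertion about a device (hypothetical type)."""
    name: str
    value: str
    metadata: Optional[str] = None  # None: applies irrespective of metadata type

    def applies_to(self, metadata: str) -> bool:
        # A statement without a metadata type applies in every context.
        return self.metadata is None or self.metadata == metadata

    def __str__(self) -> str:
        suffix = f" ({self.metadata})" if self.metadata else ""
        return f"{self.name}={self.value}{suffix}"


def lookup(statements, name, metadata):
    """Return the values asserted for `name` in the context of `metadata`."""
    return [s.value for s in statements
            if s.name == name and s.applies_to(metadata)]


# The three example statements from the text above.
statements = [
    PolicyStatement("action", "spare", "ddf1"),
    PolicyStatement("auto", "homehost", "0.90"),
    PolicyStatement("action", "ignore"),
]

for s in statements:
    print(s)
```

Here lookup(statements, "action", "ddf1") returns both "spare" and "ignore", because the metadata-free statement applies everywhere. How such conflicts are resolved is a separate question.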

The policy names that I currently envisage are:

action= ignore, include, spare, force-spare

which covers the hotplug actions that --incremental might perform.

auto= yes, homehost, no

which covers the functionality currently in the AUTO mdadm.conf line

domain= arbitrary-string

This provides the 'domain' isolation functionality.
The semantics I have in mind (and the precise details here are fairly
important so this cannot be changed lightly) are:
A device can have a number of domains, possibly from various sources.
An array can have a number of domains, from the devices plus from
spare-group

A device may be attached to an array if all of the domains of the device
are also domains of the array. The array may have extra domains. The
device may not.

This requires that if there are overlapping domains, they must properly
nest. i.e. the intersection of two domains must be empty, or one of the
domains. It might make sense to have a domain 'global' which all
devices have, and some other domains which just subsets have.
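
The attach rule and the nesting requirement can be sketched as below. This is an illustration only: it assumes a domain can be modelled as a set of device names, and the function names are made up for the example.

```python
from itertools import combinations


def can_attach(device_domains, array_domains):
    """A device may be attached if all of its domains are also domains of
    the array. The array may have extra domains; the device may not."""
    return set(device_domains) <= set(array_domains)


def properly_nested(domains):
    """Check that any two domains are either disjoint or one contains the other.

    `domains` maps a domain name to the set of devices in it (an assumed
    representation for this sketch).
    """
    for (_, a), (_, b) in combinations(domains.items(), 2):
        common = a & b
        # The intersection must be empty, or equal to one of the two domains.
        if common and common != a and common != b:
            return False
    return True


nested = {"global": {"sda", "sdb", "sdc"}, "ahci": {"sda", "sdb"}}
crossing = {"x": {"sda", "sdb"}, "y": {"sdb", "sdc"}}
print(properly_nested(nested), properly_nested(crossing))
```

With these inputs the 'global' plus subset arrangement passes the nesting check, while two domains that share only one device do not.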

There is probably room for other policies like whether to start an
incrementally assembled degraded array early, or wait until it is not
degraded. Maybe some policy for handling "prodigal device" situations where
two halves of a mirror both think they are "it" and the other is "not".

By now Doug (hope your back is feeling better) will have noticed that
partitions haven't been mentioned yet. So it is time for them.

Point 3: partitions become a new metadata type (or types).

If we want mdadm to ensure there is a MBR partition table on a device, then
provide a policy statement like
action=spare (mbr)

so if the device doesn't have recognised metadata, mdadm configures it as a
spare of type mbr, getting the table from some compatible pre-existing device.
There is probably room to refine this to get the table from a file like
Doug's patches aimed to. That wouldn't be my first preference as it requires
extra configuration, but it might be necessary. That would require adding
some sort of argument to each policy statement, so that they become
name = value (metadata) other-arguments
I'd rather keep that to a very minimum though.

Note that the above syntax is all abstract syntax. It reflects the internal
data structures, but not necessarily the way that policy will be expressed to
mdadm. For that we need to start with some concrete syntax for mdadm.conf
So:

Point 4: policy is specified in mdadm.conf by "POLICY" lines (aka policy
rules)

A policy line contains match words, assignment words, and metadata words.
match words are name=value or possibly name==value - haven't decided
yet.
assignment words are name=value (or name:=value ... probably not)
metadata words are "metadata=foo"

A device matches a policy line if, for each match name that appears, the
device matches at least one of the values.
So if we have
POLICY a==1 a==2 b==3 b==4

then for a device to match it must have an 'a' of 1 or 2, and a 'b' of
3 or 4, but it doesn't matter what the device has for 'c'.

One device may match multiple POLICY lines and if it does so, it
accumulates all the assigned words. The ordering of policy lines is
irrelevant to the end result. For this to work we might need to add
a "word!=value" - I hope not, but it wouldn't be a big problem.

If a device matches a policy line then a separate policy statement is
created combining each assignment word with each metadata word (if there
are any). This list of policy statements is added to the device's policy.
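
A rough Python sketch of the matching and expansion just described. Since the concrete syntax is undecided, it assumes 'name==value' for match words and 'name=value' for assignment words. It also compares match values for plain equality, which leaves out the glob matching intended for 'path'. All function names are hypothetical.

```python
from collections import defaultdict


def parse_policy_line(line):
    """Split a POLICY line into match words, assignment words, metadata words."""
    words = line.split()
    if not words or words[0] != "POLICY":
        raise ValueError("not a POLICY line")
    match = defaultdict(set)
    assign, metadata = [], []
    for word in words[1:]:
        if "==" in word:                      # assumed match syntax
            name, value = word.split("==", 1)
            match[name].add(value)
        else:
            name, value = word.split("=", 1)
            if name == "metadata":
                metadata.append(value)
            else:
                assign.append((name, value))
    return dict(match), assign, metadata


def device_matches(device, match):
    """For each match name, the device must have one of the listed values."""
    return all(device.get(name) in values for name, values in match.items())


def statements_for(device, lines):
    """Accumulate (name, value, metadata) statements from every matching line."""
    result = []
    for line in lines:
        match, assign, metadata = parse_policy_line(line)
        if device_matches(device, match):
            for name, value in assign:
                # One statement per metadata word, or a single
                # metadata-free statement if the line has none.
                for md in metadata or [None]:
                    result.append((name, value, md))
    return result


rules = ["POLICY a==1 a==2 b==3 b==4 action=spare metadata=ddf1 metadata=imsm",
         "POLICY a==2 domain=global"]
print(statements_for({"a": "2", "b": "3", "c": "9"}, rules))
```

The example device matches both rules, so it accumulates two 'action=spare' statements (one per metadata word) and one metadata-free 'domain=global' statement.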

Sometimes policy is very metadata dependent so:

Point 5: policy can be specified by the metadata handler too.

If a device is found to have metadata on it, then when that metadata is
loaded (->load_super()) it might add some policy statements to the
device. If it does they will all be in the context of the relevant
metadata type. This will probably include 'domain' assignments to restrict
spare migration.

But wait, there's more

Point 6: We probably have platform policy too. I'm not really sure what
this will involve, and what if anything needs to be explicit. Maybe just

platform-policy imsm

in mdadm.conf tells mdadm to query the platform and deduce some policy
statements or policy rules.

There is a strong pattern that when a set of devices is partitioned, all the
'1' partitions go in one array, all the '2' partitions in another etc.
It might be useful to have config-file support for this pattern, so a
possible config file line would be:

partition-policy path=foo domain=bar

which effectively makes multiple policy lines each of which has '-partNN'
added to all 'path' values and all 'domain' values. But I'm getting ahead
of myself...
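
For illustration, the expansion could look like the Python sketch below. The set of partition numbers to generate, the unpadded '-part1' suffix form and the function name are all assumptions; the text above only says '-partNN'.

```python
def expand_partition_policy(line, partition_numbers):
    """Turn one partition-policy line into one POLICY line per partition.

    '-part<N>' is appended to every 'path' and 'domain' value; other words
    are copied unchanged.
    """
    words = line.split()
    if not words or words[0] != "partition-policy":
        raise ValueError("not a partition-policy line")
    expanded = []
    for n in partition_numbers:
        out = ["POLICY"]
        for word in words[1:]:
            name = word.split("=", 1)[0]
            if name in ("path", "domain"):
                word = f"{word}-part{n}"
            out.append(word)
        expanded.append(" ".join(out))
    return expanded


for rule in expand_partition_policy("partition-policy path=foo domain=bar", [1, 2]):
    print(rule)
```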

The 'match' names that I imagine are:
path= which is given a 'glob' pattern to match against the path name
from /dev/disk/by-path/
type= which is either 'disk' or 'partition'

We could also have size= which uses the standardised disk sizes so it
would be easy to say that all 2GB devices only migrate to arrays with 2GB
devices in them.
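
A small Python sketch of the 'path' and 'type' match names, using the standard fnmatch module for the glob. The device dictionary layout and the sample by-path name are invented for the example, and 'size' is left out because its details are not settled above.

```python
from fnmatch import fnmatchcase


def matches(device, path_glob=None, dev_type=None):
    """Check a device against optional path= and type= match words.

    `device` is a dict with 'path' (its name under /dev/disk/by-path/)
    and 'type' ('disk' or 'partition'); this layout is an assumption.
    """
    if path_glob is not None and not fnmatchcase(device["path"], path_glob):
        return False
    if dev_type is not None and device["type"] != dev_type:
        return False
    return True


# Hypothetical device entry for illustration.
sda = {"path": "pci-0000:00:1f.2-scsi-0:0:0:0", "type": "disk"}
print(matches(sda, path_glob="pci-0000:00:1f.2-*", dev_type="disk"))
```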

So: given a device we extract a bunch of policy statements from various
sources. Now we need to know how to apply those policy statements in
different situations. There are various contexts where we need to review
policy.

A/ When considering adding a device to an array.
This can happen at hot-plug either because the device looked like
a member of the array, or because the device is being added as a new
spare.

The primary policy information here is 'domain'.
We extract a list of domains that the device is in which are specific
to the metadata of the array (or are not metadata-specific)

We also get a list of domains for the array by extracting
a similar list for each device and including any spare-group
from mdadm.conf

Then we check if the set of domains for the device is a subset of the set
of domains for the array. If it is (and is non-empty), the addition is
allowed. If it isn't then the addition probably isn't allowed, though we
might invent some other policy like "no-strict-domains", or assert that
domains don't apply when the user explicitly makes a request or uses
--force. Or something.
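
The check described in A/ might be sketched in Python as follows. Statements are modelled as (name, value, metadata) tuples with metadata None meaning "not metadata-specific". The function names are hypothetical, and reading "non-empty" as applying to the device's domain set is an interpretation.

```python
def domains_of(statements, metadata):
    """Domains that are specific to `metadata` or not metadata-specific."""
    return {value for name, value, md in statements
            if name == "domain" and md in (None, metadata)}


def array_domains(member_statements, metadata, spare_group=None):
    """Union of the members' domains, plus any spare-group from mdadm.conf."""
    domains = set()
    for statements in member_statements:
        domains |= domains_of(statements, metadata)
    if spare_group is not None:
        domains.add(spare_group)
    return domains


def may_add(device_statements, member_statements, metadata, spare_group=None):
    """Allow the addition if the device's domains are a non-empty subset
    of the array's domains."""
    dev = domains_of(device_statements, metadata)
    arr = array_domains(member_statements, metadata, spare_group)
    return bool(dev) and dev <= arr


members = [[("domain", "global", None), ("domain", "ahci", "imsm")]]
new_disk = [("domain", "global", None)]
usb_disk = [("domain", "global", None), ("domain", "usb", None)]
print(may_add(new_disk, members, "imsm"), may_add(usb_disk, members, "imsm"))
```

Possible relaxations such as "no-strict-domains" or --force are not modelled here.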

This might have some variation depending on whether the 'add this to an
array' came from --create or --assemble or --add or --re-add or
--incremental or --monitor doing spare migration.
My point at the moment isn't to give the entire algorithm but to show how
the policy framework would inform that algorithm.

B/ when considering what to do with a device that has been passed to
--incremental.

For this we need to
1/ identify an array, and hence a metadata type
2/ find the 'action' policy for the device with that metadata type.
3/ if there is more than one, fail
4/ if the one is 'ignore' do nothing
5/ if 'A' above says we cannot add this device, then give up
6/ consider which of 'include', 'spare', 'force-spare' might apply
here.....

If the device has recognisable metadata, which identifies an array, then
the array identified in step 1 is just that array.
If the device does not have recognisable metadata, then we consider
each array in turn (though we might optimise out some easy cases like
if all metadatas say 'ignore' then don't bother listing arrays).

If multiple arrays all allow the device to be added, we would need to
choose the first which is degraded (unless we invented some other policy).
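
Steps 2 to 6 and the choice between arrays can be outlined in Python as below. This is only a sketch. Statements are (name, value, metadata) tuples, the return strings are invented labels, and the cases the text leaves open (no 'action' policy at all, no degraded candidate) are returned as explicit "unspecified"/None results rather than decided.

```python
def incremental_action(statements, metadata, addition_allowed):
    """Decide what --incremental should do with a device (steps 2 to 6).

    `addition_allowed` is the outcome of the domain check from 'A' above.
    """
    # Step 2: 'action' values that apply to this metadata type.
    actions = {value for name, value, md in statements
               if name == "action" and md in (None, metadata)}
    if len(actions) > 1:          # step 3: conflicting policies
        return "fail"
    if not actions:               # not covered by the text; left open
        return "unspecified"
    (action,) = actions
    if action == "ignore":        # step 4
        return "ignore"
    if not addition_allowed:      # step 5
        return "give-up"
    return action                 # step 6: include, spare or force-spare


def choose_array(candidates):
    """Pick the first degraded array among (name, is_degraded) candidates."""
    for name, degraded in candidates:
        if degraded:
            return name
    return None                   # no degraded candidate: left open


stmts = [("action", "spare", "ddf1"), ("action", "ignore", "imsm")]
print(incremental_action(stmts, "ddf1", True))
print(choose_array([("md0", False), ("md1", True), ("md2", True)]))
```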

So this is how I want these things to work, and this is what I'm going to be
coding. I should have the basic framework in place early next week (assuming
no major interruptions) at which point I'll make the code available.

The part of this that I'm least confident of is assigning domains to arrays.
Extracting a list of policy statements for each device sounds a bit
cumbersome. Maybe if I cache enough bits of it, it will work nicely.

Comments, as always, are most welcome.

Thanks,
NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: A policy framework for mdadm (incorporating domains and hotplug and such)

on 01.07.2010 10:26:45 by dan.j.williams

On 6/30/2010 11:50 PM, Neil Brown wrote:
> This requires that if there are overlapping domains, they must properly
> nest. i.e. the intersection of two domains must be empty, or one of the
> domains. It might make sense to have a domain 'global' which all
> devices have, and some other domains which just subsets have.

You lost me here: "or one of the domains..." Do you mean one must be a
superset of the other?

How do we know a priori which domain an array belongs to? Will we
require them to be tagged (which makes our job easier at the cost of some
configuration file maintenance for the administrator)? Taking the
domain == controller example, if a user identifies an array as
belonging to controller1 in the configuration file and later moves a set
of member devices to controller2, I assume we ignore those devices, right?

This would simplify things for the imsm assembly case because it
requires the array-to-domain association to be identified ahead of time
rather than arbitrarily autodetected by where we happen to find the
first array member.

If an assembly statement is ambiguous we fail and ask for the domain to
be clarified.

> There is probably room for other policies like whether to start an
> incrementally assembled degraded array early, or wait until it is not
> degraded. Maybe some policy of handling "prodigal device" situations where
> two halves of a mirror both think they are "it" and the other is "not".
>
> By now Doug (hope your back is feeling better) will have noticed that
> partitions haven't been mentioned yet. So it is time for them.
>
> Point 3: partitions become a new metadata type (or types).
>
> If we want mdadm to ensure there is a MBR partition table on a device, then
> provide a policy statement like
> action=spare (mbr)

Where the metadata type is determined by the current arrays in the
domain where the device was attached if I am following correctly.

[..]
> Point 6: We probably have platform policy too. I'm not really sure what
> this will involve, and what if anything needs to be explicit. Maybe just
>
> platform-policy imsm
>
> in mdadm.conf tell mdadm to query the platform and deduce some policy
> statements or policy rules.

I don't know if we need to add platform policy to the configuration
file, maybe we can revisit this when we have a metadata format where
"RAID mode" cannot be disabled in the firmware. For now the policies
enforced by the platform really are not optional (lest we confuse
firmware), so I'd just as soon not allow them to be configured. The
mitigations are to turn off RAID mode or to set the environment variable, which
should tell you that you are doing something tricky. I'll come back if
I think of a non-critical platform dependent policy.

[..]
> The part of this that I'm least confident of is assigning domains to arrays.

It would be nice if every array came pre-tagged with the domain it
belongs to, but that can't be a requirement. Conversely, users who don't
set up a domain will sometimes find one forced upon them by the
metadata. On such a platform where there are hardware defined domains I
think it would be reasonable for the user to identify which domain is
the context for the action.

For example (assuming an empty mdadm.conf), suppose sda has imsm metadata
and is attached to ahci, while sdb has imsm metadata but is attached to usb.

mdadm -A /dev/md0 /dev/sda /dev/sdb

....we fail with an error message like "/dev/sda was tagged as a member
of the ahci domain while /dev/sdb is only a member of the global domain,
aborting".

mdadm -A /dev/md0 /dev/sda /dev/sdb --domain ahci

....would succeed with a message like "/dev/sdb is not a member of the
ahci domain, ignoring."

> Extracting a list of policy statements for each device sounds a bit
> cumbersome. Maybe if I cache enough bits of it, it will work nicely.
>
> Comments, as always, are most welcome.

Thanks for the thoughtful write up, as always.

--
Dan