Inferring Relationships?
am 08.09.2006 01:00:09 von smorrey
Hello Everyone,
Recently I was assigned the joyful task of creating a modern work-alike
of an application from the 1980's. Without delving into too much
detail, my job is to create a web based application, which can infer
additional relationships based on current knowledge stored in a
database.
The original software did this pretty well, but used a proprietary DB
format that left much to be desired when scaled up to use the internet.
Because of this it was determined that it would be better to rewrite
the whole thing from scratch and use a modern storage engine such as
mySQL or PostGres.
I have managed to rewrite the vast majority of the this software but am
in a bit of a bind.
One of the primary functions of the software was to track kinship
relationships between breeding animals such as dogs and cats, to help
prevent inbreeding to the best extent possible.
Fortunately the dataset has been grown continously since the 1980s and
is rather large, however it really only contains 3 values relevant to
this geneaological purpose.
They are
id, mother, father
All of these are of course serial numbers, and there is an additional
table that can be used to tie id's to names.
Since a creatures chance of contracting or manifesting a genetic
illness increases significantly according to the level of inbreeding
present, one of the primary purposes of this database is to return a
value representitive of how closely related mother and father are. For
instance if mother and father were brother and sister, then we assign
an extremely significant value to that relationship. mother and son,
father and daughter also have a significant but slightly lesser value,
granparent,grandchild, uncle, niece, aunt, nephew an even lesser value
but all roughly on par with one another, and first cousins, second
cousins etc until we run out of data and say the mother and father are
effectively not related.
I've come to the conclusion that I can infer the relationship between
mother and father by selecting mother and father from the table called
geneaology and then select their mother and fathers respectively and
their mothers and fathers and so on ad-infinitum until I run out of
data or I get a hit where something on the mothers side matches
something on the fathers side. While this isn't 100% accurate,
(obviously if we stop there we don't factor in breeding any higher up),
it is close enough.
But I have no idea how to go about implementing a query like that in
mySQL. I can totally see how to do it in PHP, however given the size
of the dataset I think that would be cumbersome, and I just have a
programmers hunch that there is a nice clean and simple way to
implement this.
Essentially all it boils down to is this...
Given the dataset, how would one find a path from mother to father?
Once we have the path we can infer the relationship based on distance
and severity, and that could be done at the application layer.
On the surface at least this seems like a perfect case of using
Dijkstra but I can only seem to find a Dijkstra algorithim implemented
in PostGres and it looks like it would require an addition of a
secondary table, thats something we just don't have the room for,
besides that all the other Dijkstra algorithms I can find seem to also
be optimized for finding driving directions and the like.
Anyways this seems to me to be a simple problem and maybe I just can't
get my mind around it. Here is some sample data, with IDs swapped for
names to make the point a little clearer.
fido, lady, tramp
lady, princess, killroy
tramp, queen, spot
princess, queen, barky
This is an oversimplified example for clarity, but as you can clearly
see fido has a pretty serious risk of developing a genetic illness,
this is because his mothers line and fathers line, both meet up at
queen, albeit in different places. Queen is Fidos great grandmother on
his mothers side, and his grandmother on his fathers side.
There are 2 steps to Queen from Fido on his Mothers side, but only 1
step from Fido to Queen on his fathers side. Fidos father and mother
are therefore a level 1.5 Uncle/Niece relationship. Whereas if they
were brother and sister it would be a level 0 relationship.
Anyways I would appreciate any help that could be offered in how to
work this pathfinding and/or inferrence into a SQL statement. Even
just returning the complete path would be sufficient.
Regards,
DevNull
Re: Inferring Relationships?
am 08.09.2006 11:34:00 von Davie
DevNull wrote:
> Hello Everyone,
> Recently I was assigned the joyful task of creating a modern work-alike
> of an application from the 1980's. Without delving into too much
> detail, my job is to create a web based application, which can infer
> additional relationships based on current knowledge stored in a
> database.
> The original software did this pretty well, but used a proprietary DB
> format that left much to be desired when scaled up to use the internet.
> Because of this it was determined that it would be better to rewrite
> the whole thing from scratch and use a modern storage engine such as
> mySQL or PostGres.
>
> I have managed to rewrite the vast majority of the this software but am
> in a bit of a bind.
> One of the primary functions of the software was to track kinship
> relationships between breeding animals such as dogs and cats, to help
> prevent inbreeding to the best extent possible.
> Fortunately the dataset has been grown continously since the 1980s and
> is rather large, however it really only contains 3 values relevant to
> this geneaological purpose.
> They are
> id, mother, father
>
> All of these are of course serial numbers, and there is an additional
> table that can be used to tie id's to names.
>
> Since a creatures chance of contracting or manifesting a genetic
> illness increases significantly according to the level of inbreeding
> present, one of the primary purposes of this database is to return a
> value representitive of how closely related mother and father are. For
> instance if mother and father were brother and sister, then we assign
> an extremely significant value to that relationship. mother and son,
> father and daughter also have a significant but slightly lesser value,
> granparent,grandchild, uncle, niece, aunt, nephew an even lesser value
> but all roughly on par with one another, and first cousins, second
> cousins etc until we run out of data and say the mother and father are
> effectively not related.
>
> I've come to the conclusion that I can infer the relationship between
> mother and father by selecting mother and father from the table called
> geneaology and then select their mother and fathers respectively and
> their mothers and fathers and so on ad-infinitum until I run out of
> data or I get a hit where something on the mothers side matches
> something on the fathers side. While this isn't 100% accurate,
> (obviously if we stop there we don't factor in breeding any higher up),
> it is close enough.
>
> But I have no idea how to go about implementing a query like that in
> mySQL. I can totally see how to do it in PHP, however given the size
> of the dataset I think that would be cumbersome, and I just have a
> programmers hunch that there is a nice clean and simple way to
> implement this.
> Essentially all it boils down to is this...
> Given the dataset, how would one find a path from mother to father?
> Once we have the path we can infer the relationship based on distance
> and severity, and that could be done at the application layer.
>
> On the surface at least this seems like a perfect case of using
> Dijkstra but I can only seem to find a Dijkstra algorithim implemented
> in PostGres and it looks like it would require an addition of a
> secondary table, thats something we just don't have the room for,
> besides that all the other Dijkstra algorithms I can find seem to also
> be optimized for finding driving directions and the like.
>
> Anyways this seems to me to be a simple problem and maybe I just can't
> get my mind around it. Here is some sample data, with IDs swapped for
> names to make the point a little clearer.
>
> fido, lady, tramp
> lady, princess, killroy
> tramp, queen, spot
> princess, queen, barky
>
> This is an oversimplified example for clarity, but as you can clearly
> see fido has a pretty serious risk of developing a genetic illness,
> this is because his mothers line and fathers line, both meet up at
> queen, albeit in different places. Queen is Fidos great grandmother on
> his mothers side, and his grandmother on his fathers side.
> There are 2 steps to Queen from Fido on his Mothers side, but only 1
> step from Fido to Queen on his fathers side. Fidos father and mother
> are therefore a level 1.5 Uncle/Niece relationship. Whereas if they
> were brother and sister it would be a level 0 relationship.
>
> Anyways I would appreciate any help that could be offered in how to
> work this pathfinding and/or inferrence into a SQL statement. Even
> just returning the complete path would be sufficient.
>
> Regards,
> DevNull
This may help
Managing Hierarchical Data in MySQL
http://dev.mysql.com/tech-resources/articles/hierarchical-da ta.html
Re: Inferring Relationships?
am 08.09.2006 13:30:34 von zac.carey
DevNull wrote:
>I have no idea how to go about implementing a query like that in
>mySQL. I can totally see how to do it in PHP, however given the size
>of the dataset I think that would be cumbersome, and I just have a
>programmers hunch that there is a nice clean and simple way to
>implement this.
I'm neither a programmer nor a mathematician - nor a geneticist for
that matter! - but given that a child can have at most only two parents
and the number of relevant generations is probably quite limited (4 or
5 maybe !?!) are you sure that the algorithm is appropriate? Cumbersome
as it probably is, I suspect something along the lines of an iterative
query (whether in PHP or MySQL) might be simpler/quicker.
Apologies if I'm talking out my proverbial - it wouldn't be the first
time.