How to deal with 96 Dimensional Points ?
How to deal with 96 Dimensional Points ?
am 30.03.2010 11:39:26 von Werner Van Belle
--------------enig4B8B08591DE5B9B1877F0504
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Hello,
I have been pondering this for a while, but never really looked deeply
into the problem.
I have 96 dimensional points and I would like to pose queries such as:
'give me all points that are within such a radius of this one'. The gis
extensions to mysql might support such type of query. The problem is of
course that points are 2 dimensional and I'm not sure whether I can
extend it to more than 3 dimensions ?
Does anybody have an idea about this ?
Wkr,
--=20
http://werner.yellowcouch.org/
--------------enig4B8B08591DE5B9B1877F0504
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkuxxs4ACgkQ2TDF0fTmy3XGiwCdFIOKnrS7/1i1tkj8L0mC Cd/7
+VQAoIcaZ2eS6n3WdaN/3BN7oJGW1RON
=x1cg
-----END PGP SIGNATURE-----
--------------enig4B8B08591DE5B9B1877F0504--
Re: How to deal with 96 Dimensional Points ?
am 30.03.2010 12:26:17 von Werner Van Belle
--------------enig5DB7D65602615F5C7F441530
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Geert-Jan Brits wrote:
> You're most likely talking about something like consine-similarity on
> N-dimensional vectors.
> http://en.wikipedia.org/wiki/Cosine_similarity
> http://stackoverflow.com/search?q=3Dcosine+similarity
>
Cool links ! Although it is not why I need it for. I'm really talking
about an eucledian distance measure between vectors. So in a sense it is
simpler. Normally the gis extensions already provide the basic tools
necessary, if only the points could be extended to more than 2 dimensions=
> You could google to see strategies that exist for mysql.=20
> However, depending on your use-case (e.g: scalable recommender
> systems) mysql (or any other rdbms) may not be the best tool for the jo=
b.
--=20
http://werner.yellowcouch.org/
--------------enig5DB7D65602615F5C7F441530
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkux0coACgkQ2TDF0fTmy3VQuACfe3nnYBwwvs3V60yuu6GJ QiXR
ur4AniDqTTJLlwGCqHgfnQ/21qqGY5os
=LFp0
-----END PGP SIGNATURE-----
--------------enig5DB7D65602615F5C7F441530--
Re: How to deal with 96 Dimensional Points ?
am 30.03.2010 12:31:56 von Werner Van Belle
--------------enig5E8BDD27F50E7CDC58325575
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Johan De Meersman wrote:
> Well... a "point" in an n-dimensional space, is a location that has a
> defined value for each of it's n dimensions. If you have a value for
> each of your 96 dimensions, you have a point.
Well, it's fairly simple. If you have two points with 96 values in each.
Point1=3D(x1,...x96) and Point2=3D(y1,...,y96). The distance between thes=
e
two is
d=3Dsqrt( (x_1-y_1)^2 + ... + (x_96-y_96)^2 )
There is no magic in this.
> The mathematics of comparing distances in 96 dimensions is beyond me,
> though :-) I guess a good start would be looking at comparing
> distances in 2 and 3 dimensions (vector math, that is) and trying to
> extrapolate a method from that. Alternatively, hire a mathematician :-p=
Extrapolating from lower dimensions doesn't work too well. In this case
this would mean storing 48 different points and then trying to define a
distance measure based on each individual point. I'm not sure this is
feasable.
In general: KD-trees are quite good tools to deal with such large
dimensional spaces, but I see no possibility to use them in mysql,
Wkr,
>
>
> On Tue, Mar 30, 2010 at 11:39 AM, Werner Van Belle
> > wrote:
>
> Hello,
>
> I have been pondering this for a while, but never really looked dee=
ply
> into the problem.
>
> I have 96 dimensional points and I would like to pose queries such =
as:
> 'give me all points that are within such a radius of this one'.
> The gis
> extensions to mysql might support such type of query. The problem
> is of
> course that points are 2 dimensional and I'm not sure whether I can=
> extend it to more than 3 dimensions ?
>
> Does anybody have an idea about this ?
>
> Wkr,
>
> --
> http://werner.yellowcouch.org/
>
>
>
>
>
> --=20
> Bier met grenadyn
> Is als mosterd by den wyn
> Sy die't drinkt, is eene kwezel
> Hy die't drinkt, is ras een ezel
--=20
http://werner.yellowcouch.org/
--------------enig5E8BDD27F50E7CDC58325575
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkux0xwACgkQ2TDF0fTmy3UWVACfREgavKIIIkPsGCt4umqK Fzdt
x3IAn3LZ6BnEjiAvkm6t9jt1Ge069xVK
=jb5a
-----END PGP SIGNATURE-----
--------------enig5E8BDD27F50E7CDC58325575--
Re: How to deal with 96 Dimensional Points ?
am 30.03.2010 21:14:48 von Chris W
I'm not sure why, but it seems that some people, I don't mean to imply
that you are one of them, think there is some magic MySQL can preform to
find points with in a given radius using the GIS extension. There is no
magic. They simply use the well known math required to determine what
points are inside the circle. I could be wrong but I doubt there is any
way to create an index that can directly indicate points with a a
certain distance of other points unless that index included the distance
from every point to every other point. That is obviously not practical
since with a set of only 14 points the index would have over 6 billion
entries.
lets call each of your dimensions d1, d2, d3 .... d96.
If you create an index on d1, d2, .... d69, you can then create a simple
query that will quickly find all points that will find all points that
are with in a bounding box. Since this query is going to get a bit
large with 96 dimensions, I would use code to create the query. I will
use php. Let's start with the desired radius being r and the test point
dimensions being in an array TestPointD[1] = x, TestPointD[2] = . . .
$select = 'SELECT `PointID`, ';
$where = 'WHERE ';
foreach($TestPointD as $i => $d){
$di = 'd' . "$i";
$select .= "`$di`, "
$MinD = $d - $r;
$MaxD = $d + $r;
$where .= "`$di` >= '$MinD' AND `$di` <= '$MaxD' AND ";
}
$select = substr($select, 0, -2); //trim of the trailing comma and space
$where = substr($where, 0, -4); //trim off the trailing 'AND '
$query = "$select FROM `points` $where";
Obviously this is going to give you points outside the sphere but still
inside the cube. However it will narrow down the set so the further
math will not take as long. If this were 3 dimensions with an uniform
distribution of points, about 52% of the points returned by that query
will be inside the sphere. I'm not sure how to calculate the ratio of
the volume sphere to a cube in 96 dimensions. Then it will be a
simple loop to find the points you really want. While this query will
likely return a lot of points that you don't want especially in 96D
space, it will reduce it enough that the following loop will be much
faster than looking all points in the table.
$result = mysql_query($query) or die("DB error $query " . mysql_error() );
while(($row = mysql_fetch_row($result))){
$sum
foreach($row as $i => $d){
if($i == 0){
$PointID = $d;
continue; // skip point id at $row[0]
}
$SumSq += pow($TestPointD[$i] - $d, 2);
}
if(sqrt($SumSq) <= $r){
print "$PointID is with in $r of test point.\n";
}
}
In an application I had that was similar (but in 2D) I would insert the
id of the points that passed the condition into a temp table. Then I
could join that temp table to other tables do other queries I may need
on those points.
Chris W
Werner Van Belle wrote:
> Hello,
>
> I have been pondering this for a while, but never really looked deeply
> into the problem.
>
> I have 96 dimensional points and I would like to pose queries such as:
> 'give me all points that are within such a radius of this one'. The gis
> extensions to mysql might support such type of query. The problem is of
> course that points are 2 dimensional and I'm not sure whether I can
> extend it to more than 3 dimensions ?
>
> Does anybody have an idea about this ?
>
> Wkr,
>
>
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org
Re: How to deal with 96 Dimensional Points ?
am 31.03.2010 01:54:56 von Geert-Jan Brits
--001485f64b444deda404830d5b89
Content-Type: text/plain; charset=ISO-8859-1
Perhaps you could give us a (generalized) description of your use-case, so
we can better grasp what you want to achieve, and how you want to use it.
i.e: since I can't imagine/ envison a real 'eucledian distance' over 96
dimensions I bet you're talking a generalized distance function over N
dimenions.
This is usually only used in two general ways afaik: 1 calculating all
points that lie within a certain threshold (Chris' implementation prints
these points) 2 calculating an ordered top-M list of closests points to the
target point (Chris' implementation slightly altered) 3 (hmm maybe three:
clustering points based on their distance to eachother)
It helps if we know what you'r after. For instance: if you're points don't
change often and you want to achieve case 1 or 2 I would calculate these
once and all-at-once and save them in a seperate table, bc. the on-demand
variant may quickly become too slow. again depending on your case: option A.
{} <-- list of neigborids per point id, with
pointid as key.
option B {
pointid + neighborid as key.
perhaps also helpful foor google etc.: - a distance function if more often
called a similarity function
- top-n 'points' for a given point are usually called its neighbors. - in
most cases you don't have to take the sqrt in Chris' implementation which
can save a lot (but instead do: if($SumSq <=($r*$r)){//code here}
2010/3/30 Chris W <4rfvgy7@cox.net>
> I'm not sure why, but it seems that some people, I don't mean to imply that
> you are one of them, think there is some magic MySQL can preform to find
> points with in a given radius using the GIS extension. There is no magic.
> They simply use the well known math required to determine what points are
> inside the circle. I could be wrong but I doubt there is any way to create
> an index that can directly indicate points with a a certain distance of
> other points unless that index included the distance from every point to
> every other point. That is obviously not practical since with a set of only
> 14 points the index would have over 6 billion entries.
>
> lets call each of your dimensions d1, d2, d3 .... d96. If you create an
> index on d1, d2, .... d69, you can then create a simple query that will
> quickly find all points that will find all points that are with in a
> bounding box. Since this query is going to get a bit large with 96
> dimensions, I would use code to create the query. I will use php. Let's
> start with the desired radius being r and the test point dimensions being in
> an array TestPointD[1] = x, TestPointD[2] = . . .
>
>
> $select = 'SELECT `PointID`, ';
> $where = 'WHERE ';
> foreach($TestPointD as $i => $d){
> $di = 'd' . "$i";
> $select .= "`$di`, "
> $MinD = $d - $r;
> $MaxD = $d + $r;
> $where .= "`$di` >= '$MinD' AND `$di` <= '$MaxD' AND ";
> }
> $select = substr($select, 0, -2); //trim of the trailing comma and space
> $where = substr($where, 0, -4); //trim off the trailing 'AND '
>
> $query = "$select FROM `points` $where";
>
>
> Obviously this is going to give you points outside the sphere but still
> inside the cube. However it will narrow down the set so the further math
> will not take as long. If this were 3 dimensions with an uniform
> distribution of points, about 52% of the points returned by that query will
> be inside the sphere. I'm not sure how to calculate the ratio of the volume
> sphere to a cube in 96 dimensions. Then it will be a simple loop to find
> the points you really want. While this query will likely return a lot of
> points that you don't want especially in 96D space, it will reduce it enough
> that the following loop will be much faster than looking all points in the
> table.
>
>
>
> $result = mysql_query($query) or die("DB error $query " . mysql_error() );
> while(($row = mysql_fetch_row($result))){
> $sum
> foreach($row as $i => $d){
> if($i == 0){
> $PointID = $d;
> continue; // skip point id at $row[0]
> }
> $SumSq += pow($TestPointD[$i] - $d, 2);
> }
> if(sqrt($SumSq) <= $r){
> print "$PointID is with in $r of test point.\n";
> }
> }
>
>
> In an application I had that was similar (but in 2D) I would insert the id
> of the points that passed the condition into a temp table. Then I could
> join that temp table to other tables do other queries I may need on those
> points.
>
> Chris W
>
>
>
> Werner Van Belle wrote:
>
>> Hello,
>>
>> I have been pondering this for a while, but never really looked deeply
>> into the problem.
>>
>> I have 96 dimensional points and I would like to pose queries such as:
>> 'give me all points that are within such a radius of this one'. The gis
>> extensions to mysql might support such type of query. The problem is of
>> course that points are 2 dimensional and I'm not sure whether I can
>> extend it to more than 3 dimensions ?
>>
>> Does anybody have an idea about this ?
>>
>> Wkr,
>>
>>
>>
>
> --
> MySQL General Mailing List
> For list archives: http://lists.mysql.com/mysql
> To unsubscribe: http://lists.mysql.com/mysql?unsub=gbrits@gmail.com
>
>
--001485f64b444deda404830d5b89--
Re: How to deal with 96 Dimensional Points ?
am 31.03.2010 06:58:52 von Werner Van Belle
--------------enigE135D76F1B1C9B391221AC18
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Hello Chris,
The use case I'
m talking about is actually a typical usecase for GIS applications: give
me the x closest points to this one. E.g: give me the 10 points closest
to (1,2,79) or in my case: give me the 100 points closest to
(x1,....x96). A query like yours might be possible and might be a good
solution if we would know the radius in which we are looking for the
points, but this is not really the case: we merely want a list returned
ordered by distance. Solving this with your solution is possible but is
quite slow. There exists nice datastructures to deal with this type of
problem as said and these are used in the GIS implementation in MySql.
Chris W wrote:
> I'm not sure why, but it seems that some people, I don't mean to imply
> that you are one of them, think there is some magic MySQL can preform
> to find points with in a given radius using the GIS extension. There
> is no magic. They simply use the well known math required to
> determine what points are inside the circle.
GIS extenstions are also not only about distances: the above query is
better solved with specialized datastructures.
> I could be wrong but I doubt there is any way to create an index that
> can directly indicate points with a a certain distance of other points
> unless that index included the distance from every point to every
> other point. That is obviously not practical since with a set of only
> 14 points the index would have over 6 billion entries.
Partitioning of the space such as done in 3D render engines do solve
this problem more efficiently than having a list of all pairtwise
distances. So the question is not whether such algorithms exist, it is
rather whether they are available in/through MySql.
> lets call each of your dimensions d1, d2, d3 .... d96. If you create
> an index on d1, d2, .... d69, you can then create a simple query that
> will quickly find all points that will find all points that are with
> in a bounding box. Since this query is going to get a bit large with
> 96 dimensions, I would use code to create the query. I will use php.=20
> Let's start with the desired radius being r and the test point
> dimensions being in an array TestPointD[1] =3D x, TestPointD[2] =3D . .=
.
>
> $select =3D 'SELECT `PointID`, ';
> $where =3D 'WHERE ';
> foreach($TestPointD as $i =3D> $d){
> $di =3D 'd' . "$i";
> $select .=3D "`$di`, "
> $MinD =3D $d - $r;
> $MaxD =3D $d + $r;
> $where .=3D "`$di` >=3D '$MinD' AND `$di` <=3D '$MaxD' AND ";
> }
> $select =3D substr($select, 0, -2); //trim of the trailing comma and s=
pace
> $where =3D substr($where, 0, -4); //trim off the trailing 'AND '
>
> $query =3D "$select FROM `points` $where";
>
Thanks for the nice illustration. In this case with the proper indices
this will indeed split the space in sections; nevertheless this approach
has great difficulties returning an ordered list of distances and
prefereably only the 100 closest ones at that.
Wkr,
--=20
http://werner.yellowcouch.org/
--------------enigE135D76F1B1C9B391221AC18
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkuy1o8ACgkQ2TDF0fTmy3WyyACfbnJ5Zm9DnHmC/RgkOoxs SO3M
F70AnipAMM23zFmDNL6knaBO9SsTxUbQ
=sDFY
-----END PGP SIGNATURE-----
--------------enigE135D76F1B1C9B391221AC18--
Re: How to deal with 96 Dimensional Points ?
am 31.03.2010 07:05:37 von Werner Van Belle
--------------enigD424F525627623BE0E306EEF
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Geert-Jan Brits wrote:
> Perhaps you could give us a (generalized) description of your use-case,=
so
> we can better grasp what you want to achieve, and how you want to use i=
t.
> i.e: since I can't imagine/ envison a real 'eucledian distance' over 96=
> dimensions I bet you're talking a generalized distance function over N
> dimenions.
> This is usually only used in two general ways afaik:=20
> 2 calculating an ordered top-M list of closests points to the
> target point (Chris' implementation slightly altered)=20
This is indeed the situation. A small alteration to chris his
implementation won't do, since we do not know with radius to start with,
so it is not just a matter of adapting the post-filtering.
> 3 (hmm maybe three:
> clustering points based on their distance to eachother)
> =20
Yes, this is part of the usecase, but at the moment not my main focus. A
statistical approach will need to employed for that, without going for
full aggregation.
> It helps if we know what you'r after. For instance: if you're points do=
n't
> change often and you want to achieve case 1 or 2 I would calculate the=
se
> once and all-at-once and save them in a seperate table, bc. the on-dema=
nd
> variant may quickly become too slow. again depending on your case: opti=
on A.
> {} <-- list of neigborids per point id, with
> pointid as key.
> option B {
> pointid + neighborid as key.
> =20
That is not an option. Every 2 minutes or so the next point is randomly
choosen and we need a collection of points in the neighboorhood.
> perhaps also helpful foor google etc.: - a distance function if more of=
ten
> called a similarity function
> - top-n 'points' for a given point are usually called its neighbors. - =
in
> most cases you don't have to take the sqrt in Chris' implementation whi=
ch
> can save a lot (but instead do: if($SumSq <=3D($r*$r)){//code here}
> =20
Indeed, but this is only a fraction of the time. The larger problem lies
in searching all points that have potential. An idea that might work is
to modify the radius of what we are looking at while we are searching
based on the maximum radius we have so far and cut down distance
comparisons if they will surely fall outside the current N closest
neighbours.
--=20
http://werner.yellowcouch.org/
--------------enigD424F525627623BE0E306EEF
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkuy2CEACgkQ2TDF0fTmy3WmhQCfU2RGNQ99qaGEUljny9D6 3oR8
82wAoIPdW7DFkqLGHR2OOf9cYfcHxbr4
=aGDj
-----END PGP SIGNATURE-----
--------------enigD424F525627623BE0E306EEF--
Re: How to deal with 96 Dimensional Points ?
am 31.03.2010 07:22:41 von Chris W
Here is an idea, I'm not going to code this one:) It's still not an
ideal solution because it has to make assumptions about your data set.
Execute the algorithm I outlined previously with a very small r value,
if you didn't find the number of points you are looking for, increase r
and modify the query slightly so it doesn't return any of the points the
first query returned .... something like AND `PointID` NOT in ('34',
'56', '67', . . .).
At every step along the way insert the point id of the points inside of
r along with the distance they are from the test point, once you have
over 100 records in this table stop increasing r and query the temp
table sorted by distance with a limit of 100. Of course you have to
have some knowledge of your data set to get a reasonable start value for
r and a reasonable method for determining how much to increase it each time.
On the other hand a minor modification seems better. By inserting all
the points in the cube along with their distance in the temp table, a
query like SELECT count(*) FROM temp WHERE `Distance` <= r Would be a
good way to see if you need to continue to the next round. Also doing
it that way, instead of using the NOT IN syntax, which I understand can
be slow, you can modify the where condition to find points that are
inside the current cube of size r but are outside the previous cube.
Chris W
Werner Van Belle wrote:
> Hello Chris,
>
> The use case I'
> m talking about is actually a typical usecase for GIS applications: give
> me the x closest points to this one. E.g: give me the 10 points closest
> to (1,2,79) or in my case: give me the 100 points closest to
> (x1,....x96). A query like yours might be possible and might be a good
> solution if we would know the radius in which we are looking for the
> points, but this is not really the case: we merely want a list returned
> ordered by distance. Solving this with your solution is possible but is
> quite slow. There exists nice datastructures to deal with this type of
> problem as said and these are used in the GIS implementation in MySql.
>
> Chris W wrote:
>
>> I'm not sure why, but it seems that some people, I don't mean to imply
>> that you are one of them, think there is some magic MySQL can preform
>> to find points with in a given radius using the GIS extension. There
>> is no magic. They simply use the well known math required to
>> determine what points are inside the circle.
>>
> GIS extenstions are also not only about distances: the above query is
> better solved with specialized datastructures.
>
>> I could be wrong but I doubt there is any way to create an index that
>> can directly indicate points with a a certain distance of other points
>> unless that index included the distance from every point to every
>> other point. That is obviously not practical since with a set of only
>> 14 points the index would have over 6 billion entries.
>>
> Partitioning of the space such as done in 3D render engines do solve
> this problem more efficiently than having a list of all pairtwise
> distances. So the question is not whether such algorithms exist, it is
> rather whether they are available in/through MySql.
>
>
>> lets call each of your dimensions d1, d2, d3 .... d96. If you create
>> an index on d1, d2, .... d69, you can then create a simple query that
>> will quickly find all points that will find all points that are with
>> in a bounding box. Since this query is going to get a bit large with
>> 96 dimensions, I would use code to create the query. I will use php.
>> Let's start with the desired radius being r and the test point
>> dimensions being in an array TestPointD[1] = x, TestPointD[2] = . . .
>>
>> $select = 'SELECT `PointID`, ';
>> $where = 'WHERE ';
>> foreach($TestPointD as $i => $d){
>> $di = 'd' . "$i";
>> $select .= "`$di`, "
>> $MinD = $d - $r;
>> $MaxD = $d + $r;
>> $where .= "`$di` >= '$MinD' AND `$di` <= '$MaxD' AND ";
>> }
>> $select = substr($select, 0, -2); //trim of the trailing comma and space
>> $where = substr($where, 0, -4); //trim off the trailing 'AND '
>>
>> $query = "$select FROM `points` $where";
>>
>>
> Thanks for the nice illustration. In this case with the proper indices
> this will indeed split the space in sections; nevertheless this approach
> has great difficulties returning an ordered list of distances and
> prefereably only the 100 closest ones at that.
>
> Wkr,
>
>
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org