Join on a where clause.

Join on a where clause.

am 09.12.2009 13:29:07 von Paul Halliday

I have 2 tables:

1) Event Data
2) Mappings

The query should return something like this:

Hits IP Country Code
20 213.136.52.29 SE

I am trying this:

SELECT COUNT(event.src_ip) AS count, INET_NTOA(event.src_ip),
mappings.cc FROM event, mappings WHERE event.timestamp BETWEEN
'2009-12-06 20:00' and '2009-12-07 20:00:00' AND event.src_ip BETWEEN
'mappings.start_ip' AND 'mappings.end_ip' GROUP BY event.src_ip ORDER
BY count DESC LIMIT 20;

Am I supposed to do a join somewhere? Do joins even apply in a where
clause? or am I totally off the mark.

Singularly, the queries look like this:

SELECT cc FROM mappings WHERE INET_ATON('src_ip') BETWEEN start_ip AND end_ip;

SELECT COUNT(src_ip) AS count, INET_NTOA(src_ip) FROM event WHERE
timestamp BETWEEN '2009-12-06 20:00' and '2009-12-07 20:00:00' GROUP
BY src_ip ORDER BY count DESC LIMIT 20;


Thanks.

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

RE: Join on a where clause.

am 09.12.2009 15:46:18 von Neil Aggarwal

Paul:

> SELECT COUNT(event.src_ip) AS count, INET_NTOA(event.src_ip),
> mappings.cc FROM event, mappings WHERE event.timestamp BETWEEN
> '2009-12-06 20:00' and '2009-12-07 20:00:00' AND event.src_ip BETWEEN
> 'mappings.start_ip' AND 'mappings.end_ip' GROUP BY event.src_ip ORDER
> BY count DESC LIMIT 20;

Hmm.. The hard part is that your mappings
table is not a list of all IP addresses.
It has a range from start to end, but the
actual IP in the event table is not listed
there.

Joins require a column value from each table
to match. That is not the case for you.

I think you are going to have to do this in
your application code.

I hope this helps,
Neil

--
Neil Aggarwal, (281)846-8957, http://UnmeteredVPS.net
Host your MySQL database on a CentOS VPS for $25/mo
Unmetered bandwidth = no overage charges, 7 day free trial


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

Re: Join on a where clause.

am 09.12.2009 19:24:58 von Joerg Bruehe

Hi everybody!


Neil Aggarwal wrote:
> Paul:
>=20
>> SELECT COUNT(event.src_ip) AS count, INET_NTOA(event.src_ip),
>> mappings.cc FROM event, mappings WHERE event.timestamp BETWEEN
>> '2009-12-06 20:00' and '2009-12-07 20:00:00' AND event.src_ip BETW=
EEN
>> 'mappings.start_ip' AND 'mappings.end_ip' GROUP BY event.src_ip OR=
DER
>> BY count DESC LIMIT 20;

I am surprised by the quotes you have around the "start_ip" and "end_=
ip"
columns; to me, this makes that look like strings.

=46rom your posting, I see the result you hope to get but not the one=
you
actually get. IMO, just dropping the single quotes around the two col=
umn
names should produce the data you want to get.

Or what is the result you receive?

>=20
> Hmm.. The hard part is that your mappings
> table is not a list of all IP addresses.
> It has a range from start to end, but the
> actual IP in the event table is not listed
> there.
>=20
> Joins require a column value from each table
> to match. That is not the case for you.

A matching column is called an "equijoin" (from "equality") in SQL
slang, and this is the most common form of a join. (Also, it is the
fastest, if there are suitable indices which can be used.)
However, that is not mandatory / the only form.

You can have a join with any predicate combining columns of the (two)
involved tables.
You can even have a join without any such predicate, which means ever=
y
combination of any two rows of the tables is to be returned. This is
known as "Cartesian Product" and is in most cases not what you want.

>=20
> I think you are going to have to do this in
> your application code.

I never dealt with the assignment of IP addresses to countries.
As long as the problem can be solved using ranges (or multiple ranges=
)
which do not overlap, the join should solve it.


Regards,
Jörg

--=20
Joerg Bruehe, MySQL Build Team, Joerg.Bruehe@Sun.COM
(+49 30) 417 01 487
Sun Microsystems GmbH, Komturstraße 18a, D-12099 Berlin
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Wolf Frenkel
Vorsitzender des Aufsichtsrates: Martin Haering Muenchen: HRB1610=
28


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=3Dgcdmg-mysql-2@m.gmane.o rg

RE: Join on a where clause.

am 09.12.2009 20:42:55 von Neil Aggarwal

Joerg:

> A matching column is called an "equijoin"
> However, that is not mandatory / the only form.
> As long as the problem can be solved using ranges (or multiple ranges)
> which do not overlap, the join should solve it.

I just learned something. Thanks for the info!

Neil

--
Neil Aggarwal, (281)846-8957, http://UnmeteredVPS.net
Host your MySQL database on a CentOS VPS for $25/mo
Unmetered bandwidth = no overage charges, 7 day free trial


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

Re: Join on a where clause.

am 10.12.2009 19:11:39 von Paul Halliday

On Wed, Dec 9, 2009 at 2:24 PM, Joerg Bruehe wrote:
> Hi everybody!
>
>
> Neil Aggarwal wrote:
>> Paul:
>>
>>> SELECT COUNT(event.src_ip) AS count, INET_NTOA(event.src_ip),
>>> mappings.cc FROM event, mappings WHERE event.timestamp BETWEEN
>>> '2009-12-06 20:00' and '2009-12-07 20:00:00' AND event.src_ip BETWEEN
>>> 'mappings.start_ip' AND 'mappings.end_ip' GROUP BY event.src_ip ORDER
>>> BY count DESC LIMIT 20;
>
> I am surprised by the quotes you have around the "start_ip" and "end_ip"
> columns; to me, this makes that look like strings.

That because I don't know what I am doing :). No quotes on integers; got it!

>
> From your posting, I see the result you hope to get but not the one you
> actually get. IMO, just dropping the single quotes around the two column
> names should produce the data you want to get.
>
> Or what is the result you receive?

Removing the quotes does work. The query however took 1h15m to complete. Yuck.

I am guessing this is because even though there is a limit, it is
still doing the lookup on everything past 20. Also, if the first
address has a count of say 2000, it would be doing the lookup 2000
times for a single address.

Is that right?

If it is I guess I will have to post process the results. Which is
fine, I just like to keep as much in the queries as I can.

Thanks.

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

Re: Join on a where clause.

am 10.12.2009 19:48:05 von Andy Wallace

A couple of thoughts - it's not "no quotes on integers", but
"no quotes around column references". When you use 'mappings.end_ip',
you are saying "the string mappings.end_ip", and not referring to
a column in the mappings table. It just becomes a constant at that
point.

As for the performance, you should look at the indices on the tables
involved. Try doing "explain" on the query:

EXPLAIN SELECT COUNT(event.src_ip) AS count,
INET_NTOA(event.src_ip),
mappings.cc
FROM event join mappings ON event.src_ip between mappings.start_ip and mappings.end_ip
WHERE event.timestamp BETWEEN '2009-12-06 20:00' and '2009-12-07 20:00:00'
GROUP BY event.src_ip
ORDER BY count DESC
LIMIT 20;

That'll give you some more information on what mysql is doing with
your data.
andy


Paul Halliday wrote:
> On Wed, Dec 9, 2009 at 2:24 PM, Joerg Bruehe wrote:
>> Hi everybody!
>>
>>
>> Neil Aggarwal wrote:
>>> Paul:
>>>
>>>> SELECT COUNT(event.src_ip) AS count, INET_NTOA(event.src_ip),
>>>> mappings.cc FROM event, mappings WHERE event.timestamp BETWEEN
>>>> '2009-12-06 20:00' and '2009-12-07 20:00:00' AND event.src_ip BETWEEN
>>>> 'mappings.start_ip' AND 'mappings.end_ip' GROUP BY event.src_ip ORDER
>>>> BY count DESC LIMIT 20;
>> I am surprised by the quotes you have around the "start_ip" and "end_ip"
>> columns; to me, this makes that look like strings.
>
> That because I don't know what I am doing :). No quotes on integers; got it!
>
>> From your posting, I see the result you hope to get but not the one you
>> actually get. IMO, just dropping the single quotes around the two column
>> names should produce the data you want to get.
>>
>> Or what is the result you receive?
>
> Removing the quotes does work. The query however took 1h15m to complete. Yuck.
>
> I am guessing this is because even though there is a limit, it is
> still doing the lookup on everything past 20. Also, if the first
> address has a count of say 2000, it would be doing the lookup 2000
> times for a single address.
>
> Is that right?
>
> If it is I guess I will have to post process the results. Which is
> fine, I just like to keep as much in the queries as I can.
>
> Thanks.
>

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

Re: Join on a where clause.

am 12.12.2009 00:17:42 von Joerg Bruehe

Hi Paul, all!


Paul Halliday wrote:
> On Wed, Dec 9, 2009 at 2:24 PM, Joerg Bruehe =
wrote:
>> Hi everybody!
>>
>>
>> Neil Aggarwal wrote:
>>> Paul:
>>>
>>>> SELECT COUNT(event.src_ip) AS count, INET_NTOA(event.src_ip),
>>>> mappings.cc FROM event, mappings WHERE event.timestamp BETWEEN
>>>> '2009-12-06 20:00' and '2009-12-07 20:00:00' AND event.src_ip BE=
TWEEN
>>>> 'mappings.start_ip' AND 'mappings.end_ip' GROUP BY event.src_ip =
ORDER
>>>> BY count DESC LIMIT 20;
>> I am surprised by the quotes you have around the "start_ip" and "e=
nd_ip"
>> columns; to me, this makes that look like strings.
>=20
> That because I don't know what I am doing :). No quotes on integers=
; got it!

Like Andy replied: No quotes around column names (or expressions) you
want evaluated as variables, these quotes mark (constant) strings.
(There is another kind of quotes which serves to solve problems aroun=
d
using large and small letters in names, I'll ignore that here.)

>=20
>> From your posting, I see the result you hope to get but not the on=
e you
>> actually get. IMO, just dropping the single quotes around the two =
column
>> names should produce the data you want to get.
>>
>> Or what is the result you receive?
>=20
> Removing the quotes does work. The query however took 1h15m to comp=
lete. Yuck.

In general, Andy is right recommending "explain".


This case, however, may be a bit different:
>=20
> I am guessing this is because even though there is a limit, it is
> still doing the lookup on everything past 20. Also, if the first
> address has a count of say 2000, it would be doing the lookup 2000
> times for a single address.
>=20
> Is that right?

Yes, that may be.

First, don't forget that the query has to evaluate all events (within
your date/time limits) to find which IP addresses are the 20 with the
highest count.

Second, I fear the server may first perform the join on all your data
and only then do the WHERE, the GROUP BY, and the LIMIT. Which means =
it
would handle much more data than needed.

I guess you might see an enormous boost if you split up your statemen=
t
into two parts (syntax not checked):

INSERT INTO events_to_locate
SELECT COUNT(src_ip) AS count, src_ip FROM event
WHERE timestamp BETWEEN '2009-12-06 20:00' and '2009-12-07 20:00:0=
0'
GROUP BY src_ip ORDER BY count DESC LIMIT 20;

SELECT count, INET_NTOA(events_to_locate.src_ip), mappings.cc
FROM events_to_locate, mappings
WHERE events_to_locate.src_ip BETWEEN mappings.start_ip AND
mappings.end_ip;

If your "mappings" table does not have a "src_ip" column (your column
names are unique in the tables listed in the FROM clause), you need n=
ot
give the table names in that second statement:

SELECT count, INET_NTOA(src_ip), cc
FROM events_to_locate, mappings
WHERE src_ip BETWEEN start_ip AND end_ip;

"Division of complexity":
Maybe the optimizer is not clever enough to apply this optimization,
then using a temporary table which holds just the 20 IP addresses you
are interested in should significantly reduce the amount of work.

>=20
> If it is I guess I will have to post process the results. Which is
> fine, I just like to keep as much in the queries as I can.

In general, doing work in the server is better, because it means you
need to transfer less data between server and client.
As usual, there are exceptions.


HTH,
Jörg

--=20
Joerg Bruehe, MySQL Build Team, Joerg.Bruehe@Sun.COM
Sun Microsystems GmbH, Komturstraße 18a, D-12099 Berlin
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Wolf Frenkel
Vorsitzender des Aufsichtsrates: Martin Haering Muenchen: HRB1610=
28


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=3Dgcdmg-mysql-2@m.gmane.o rg

Re: Join on a where clause.

am 12.12.2009 20:49:33 von Shawn Green

Hello Paul,

Paul Halliday wrote:
> I have 2 tables:
>
> 1) Event Data
> 2) Mappings
>
> The query should return something like this:
>
> Hits IP Country Code
> 20 213.136.52.29 SE
>
> I am trying this:
>
> SELECT COUNT(event.src_ip) AS count, INET_NTOA(event.src_ip),
> mappings.cc FROM event, mappings WHERE event.timestamp BETWEEN
> '2009-12-06 20:00' and '2009-12-07 20:00:00' AND event.src_ip BETWEEN
> 'mappings.start_ip' AND 'mappings.end_ip' GROUP BY event.src_ip ORDER
> BY count DESC LIMIT 20;
>
> Am I supposed to do a join somewhere? Do joins even apply in a where
> clause? or am I totally off the mark.
>
> Singularly, the queries look like this:
>
> SELECT cc FROM mappings WHERE INET_ATON('src_ip') BETWEEN start_ip AND end_ip;
>
> SELECT COUNT(src_ip) AS count, INET_NTOA(src_ip) FROM event WHERE
> timestamp BETWEEN '2009-12-06 20:00' and '2009-12-07 20:00:00' GROUP
> BY src_ip ORDER BY count DESC LIMIT 20;
>

Technically speaking, you are already doing a JOIN in your query. The
comma operator in the FROM clause combined with the WHERE conditions
make your query logically equivalent to the following rewrite:

SELECT COUNT(event.src_ip) AS count
, INET_NTOA(event.src_ip)
, mappings.cc
FROM event
INNER JOIN mappings
ON event.src_ip BETWEEN 'mappings.start_ip' AND 'mappings.end_ip'
WHERE event.timestamp BETWEEN '2009-12-06 20:00' and '2009-12-07 20:00:00'
GROUP BY event.src_ip
ORDER BY count DESC
LIMIT 20;

However, as we document in the manual, we have demoted the execution
precedence of the comma operator to form what I like to call an
"implicit join" to be evaluated AFTER any explicit JOIN clauses.
http://dev.mysql.com/doc/refman/5.0/en/join.html

What this means is that you may get better performance out of an
explicit join than you do an implicit join.

Also, you may want to consider rewriting your matching condition so that
it can use an index or a combination of indexes on your `start_ip` and
`end_ip` columns (notice my use of backticks ` ` not single quotes ' '
to identify column names) by rewriting your condition as an AND
comparison instead of a BETWEEN comparison

ON event.src_ip >= mappings.start_ip
AND event.src_ip <= mappings.end_ip

The way it is written now: " BETWEEN and " is not one
of the cases that we have an optimized and generalized execution plan to
handle well. The separate conditions, like I wrote in my example, is an
optimized situation and has a much better likelihood of using an index
during its evaluation.

I would think that an index on both columns would work better than two
single-column indexes.

ALTER TABLE event ADD KEY(src_ip, end_ip);

Warmest regards,
--
Shawn Green, MySQL Senior Support Engineer
Sun Microsystems, Inc.
Office: Blountville, TN



--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org