Selecting distinct records
Selecting distinct records
am 21.11.2002 19:03:10 von dave
Have a table that is populated by external radius reporting which contains
name, ipaddress, sessionid, refid, and others
None of the fields are unique, but the combination of those 4 listed should
reasonably be unique. There is a certain type of interaction with external
hardware that results in multple entries into the table, resulting in multiple
duplicate entries.
What we need to to is build a select that omits any duplicate records from the
output, but I can't get a distinct to work given the way we need the data
sorted.
For example,
Name IPaddress sessionid refid
User1 201.201 1234 5678 first user
User2 201.201 1235 5679 same ip as first user
User3 201.202 3234 5670 same sessid as first user
User4 201.203 3236 5678 same refid as first user
User1 201.202 4234 5678 first user new entry
User1 201.202 4234 5678 DUPLICATE
User1 201.202 4234 5678 DUPLICATE
User2 201.203 1234 5671 same ip as user 4
User3 201.204 2234 5672 unique
if we do distinct on sessionid then put a where clause for the user=User1 we end
up missing some of user1's entries where another users identical sessionid
appears first.
What I am looking to do is
- grab every record for $user
- remove any records that have identical ipaddress+sessionid+refid
ie: turn
user1 201.102 1234 5678
user1 201.102 1234 5678
user1 201.102 1234 5678
into
user1 201.102 1234 5678
- then sort the results by date_time or something else
Just can't get it to do all those things at the same time. Any thoughts, or am
I not making sense?
thanks
Dave
---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly
Re: Selecting distinct records
am 21.11.2002 19:43:15 von dave
>What I am looking to do is
>- grab every record for $user
>- remove any records that have identical ipaddress+sessionid+refid
>- then sort the results by date_time or something else
this last requirement is where the problem is...
is you do a sum() or order by in the select statement you get and error, for
example;
select distinct on (sessionid) sum(sessiontime) from logs where
name='joeblowuser' and datetime > 1036040400;
ERROR: Attribute logs.sessionid must be GROUPed or used in an aggregate
function
Same if you have to order by datetime or something...
Dave
---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
Re: Selecting distinct records
am 21.11.2002 20:02:05 von David Busby
Dave,
Try:
select distinct on (sessionid) sum(sessiontime) from logs
where name='joeblowuser' and datetime > 1036040400
group by sessionid.
sum() is an aggerate (sp?) function, that means it munges a field
(sessiontime) from multiple records into one field in one record. Since you
are also selecting sessionid (from mulitple records) you need to munge it
some how, that munge is accomplished via 'group by'. From your previous
e-mail it seems that (IMHO) the real problem is that duplicates are getting
inserted via external hardware interaction, this select might be a bandage
on a wound whose true size isn't known...
PostgreSQL docs:
Distinct:
http://www.postgresql.org/idocs/index.php?queries-select-lis ts.html
Group By:
http://www.postgresql.org/idocs/index.php?sql-select.html
/B
----- Original Message -----
From: "Dave"
To:
Sent: Thursday, November 21, 2002 10:43
Subject: Re: [PHP] Selecting distinct records
> >What I am looking to do is
> >- grab every record for $user
> >- remove any records that have identical ipaddress+sessionid+refid
> >- then sort the results by date_time or something else
>
> this last requirement is where the problem is...
>
> is you do a sum() or order by in the select statement you get and error,
for
> example;
>
> select distinct on (sessionid) sum(sessiontime) from logs where
> name='joeblowuser' and datetime > 1036040400;
>
> ERROR: Attribute logs.sessionid must be GROUPed or used in an aggregate
> function
>
> Same if you have to order by datetime or something...
>
> Dave
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
Re: Selecting distinct records
am 21.11.2002 22:26:55 von dave
> Try:
>select distinct on (sessionid) sum(sessiontime) from logs
>where name='joeblowuser' and datetime > 1036040400
>group by sessionid.
sorry, I wasn't clear... the problem arises when we need to sort the data for
output. postgres demands that the distinct items be first in the ordering
process, which would not allow display by session, name etc...
the obvious result would be to nest the select statements but can't seem to get
that to work either (working with version 7.0.3)
eg/ select * from (select distinct on (sessionid) from logs where....) as
tempname order by sessiondate desc
gives me an error on the second select, so not sure that is a workable solution
either.
>sum() is an aggerate (sp?) function, that means it munges a field
>(sessiontime) from multiple records into one field in one record. Since you
>are also selecting sessionid (from mulitple records) you need to munge it
>some how, that munge is accomplished via 'group by'. From your previous
>e-mail it seems that (IMHO) the real problem is that duplicates are getting
>inserted via external hardware interaction, this select might be a bandage
>on a wound whose true size isn't known...
agreed, and that problem has been corrected, but we are dealing with close to a
million records which have these duplicates strewn about within... rather
annoying, looking for a bandaid in teh select to avoid intensive post-select
processing of the output.
Dave
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?
http://archives.postgresql.org
Re: Selecting distinct records
am 22.11.2002 02:37:25 von David Busby
Dave,
How about a job that runs once that cleans the whole thing?
Or:
[Warning, pseudo PHPish code]
$rs_sid = select distinct (sessionid) from "logs";
while ($s = pg_fetch_object($rs_sid)
{
$rs_stuff = select * from "logs" where .... and "sessionid" = $s->id
for (loop of $rs_stuff)
{
$total+=$rs_stuff->sessiontime
}
echo "Session $s->id has $total seconds used";
}
This method involves a select for each session (might be slow on millions of
records :) ). This might work once or twice but I don't see it as a
suitable solution if this has to happen on a daily basis.
I know that the latest PostgreSQL does nested selects just fine (I've
got one paticular statement that has three nested selects ) and it's
performance isn't as bad as you'd think, well at least after the query
analyzer has seen it once and it gets cached.
/B
----- Original Message -----
From: "Dave [Hawk-Systems]"
To: "David Busby" ;
Sent: Thursday, November 21, 2002 13:26
Subject: RE: Selecting distinct records
> > Try:
> >select distinct on (sessionid) sum(sessiontime) from logs
> >where name='joeblowuser' and datetime > 1036040400
> >group by sessionid.
>
> sorry, I wasn't clear... the problem arises when we need to sort the data
for
> output. postgres demands that the distinct items be first in the ordering
> process, which would not allow display by session, name etc...
>
> the obvious result would be to nest the select statements but can't seem
to get
> that to work either (working with version 7.0.3)
>
> eg/ select * from (select distinct on (sessionid) from logs where....) as
> tempname order by sessiondate desc
>
> gives me an error on the second select, so not sure that is a workable
solution
> either.
>
> >sum() is an aggerate (sp?) function, that means it munges a field
> >(sessiontime) from multiple records into one field in one record. Since
you
> >are also selecting sessionid (from mulitple records) you need to munge it
> >some how, that munge is accomplished via 'group by'. From your previous
> >e-mail it seems that (IMHO) the real problem is that duplicates are
getting
> >inserted via external hardware interaction, this select might be a
bandage
> >on a wound whose true size isn't known...
>
> agreed, and that problem has been corrected, but we are dealing with close
to a
> million records which have these duplicates strewn about within... rather
> annoying, looking for a bandaid in teh select to avoid intensive
post-select
> processing of the output.
>
> Dave
>
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster