Counting Distinct Records

am 16.11.2004 20:45:45 von tfo

I am wondering whether the following two forms of SELECT statements are
logically equivalent:

SELECT COUNT( DISTINCT table.column ) ...

and

SELECT DISTINCT COUNT( * ) ...

If they are the same, then why is the latter query much slower in
postgres when applied to the same FROM and WHERE clauses?

Furthermore, is there a better way of performing this sort of operation
in postgres (or just in SQL in general)?

Thanks!

-tfo

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Re: Counting Distinct Records

am 16.11.2004 21:03:35 von sszabo

On Tue, 16 Nov 2004, Thomas F.O'Connell wrote:

> I am wondering whether the following two forms of SELECT statements are
> logically equivalent:
>
> SELECT COUNT( DISTINCT table.column ) ...
>
> and
>
> SELECT DISTINCT COUNT( * ) ...

Not in general.

The former counts how many distinct table.column values there are. The
distinct in the latter would be basically meaningless unless there's a
group by involved.

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Re: Counting Distinct Records

am 16.11.2004 21:09:33 von tfo

Is there another way to accomplish what the former is doing, then?

For practical reasons, I'd like to come up with something better.

For theoretical curiosity, I'd like to know whether there's a way to
combine COUNT and DISTINCT that still allows one to reference * rather
than naming specific columns without grouping.

If I resort to GROUP BY, is there an efficient way of counting all the
groups, or would it just be something like:

SELECT COUNT ( * ) FROM ( SELECT ... GROUP BY ... );

-tfo

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005

On Nov 16, 2004, at 2:03 PM, Stephan Szabo wrote:

> On Tue, 16 Nov 2004, Thomas F.O'Connell wrote:
>
>> I am wondering whether the following two forms of SELECT statements
>> are
>> logically equivalent:
>>
>> SELECT COUNT( DISTINCT table.column ) ...
>>
>> and
>>
>> SELECT DISTINCT COUNT( * ) ...
>
> Not in general.
>
> The former counts how many distinct table.column values there are. The
> distinct in the latter would be basically meaningless unless there's a
> group by involved.
>

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Re: Counting Distinct Records

am 16.11.2004 23:34:31 von sszabo

On Tue, 16 Nov 2004, Thomas F.O'Connell wrote:

> Is there another way to accomplish what the former is doing, then?

The only thing I can think of is a subselect in from that uses distinct.
select count(*) from (select distinct ...) foo

That also theoretically allows you to use select distinct * inside the
subselect.

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

Re: Counting Distinct Records

am 17.11.2004 00:02:56 von tfo

Hmm. I was more interested in using COUNT( * ) than DISTINCT *.

I want a count of all rows, but I want to be able to specify which
columns are distinct.

That's definitely an interesting approach, but testing doesn't show it
to be appreciably faster.

If I do a DISTINCT *, postgres will attempt to guarantee that there are
no duplicate values across all columns rather than a subset of columns?
Is that right?

Anyway, I was just wondering if there were any best practices out there
for counting distinct values in sets of values that might not
themselves be distinct.

Thanks for the tips so far!

-tfo

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005

On Nov 16, 2004, at 4:34 PM, Stephan Szabo wrote:

> On Tue, 16 Nov 2004, Thomas F.O'Connell wrote:
>
>> Is there another way to accomplish what the former is doing, then?
>
> The only thing I can think of is a subselect in from that uses
> distinct.
> select count(*) from (select distinct ...) foo
>
> That also theoretically allows you to use select distinct * inside the
> subselect.

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Re: Counting Distinct Records

am 17.11.2004 15:52:54 von sszabo

On Tue, 16 Nov 2004, Thomas F.O'Connell wrote:

> Hmm. I was more interested in using COUNT( * ) than DISTINCT *.
>
> I want a count of all rows, but I want to be able to specify which
> columns are distinct.

I'm now a bit confused about exactly what you're looking for in the end.
Can you give a short example?

> That's definitely an interesting approach, but testing doesn't show it
> to be appreciably faster.
>
> If I do a DISTINCT *, postgres will attempt to guarantee that there are
> no duplicate values across all columns rather than a subset of columns?
> Is that right?

It guarantees one output row for each distinct set of column values across
all columns.

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Re: Counting Distinct Records

am 17.11.2004 20:27:17 von tfo

The specific problem I'm trying to solve involves a user table with
some history.

Something like this:

create table user_history (
user_id int
event_time_stamp timestamp
);

I'd like to be able to count the distinct user_ids in this table, even
if it were joined to other tables.

-tfo

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005

On Nov 17, 2004, at 8:52 AM, Stephan Szabo wrote:

> On Tue, 16 Nov 2004, Thomas F.O'Connell wrote:
>
>> Hmm. I was more interested in using COUNT( * ) than DISTINCT *.
>>
>> I want a count of all rows, but I want to be able to specify which
>> columns are distinct.
>
> I'm now a bit confused about exactly what you're looking for in the
> end.
> Can you give a short example?
>
>> That's definitely an interesting approach, but testing doesn't show it
>> to be appreciably faster.
>>
>> If I do a DISTINCT *, postgres will attempt to guarantee that there
>> are
>> no duplicate values across all columns rather than a subset of
>> columns?
>> Is that right?
>
> It guarantees one output row for each distinct set of column values
> across
> all columns.

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html