Aggregate query for multiple records

Aggregate query for multiple records

am 26.08.2004 08:23:05 von scott

Hello, I am new to the list, my apology if this question is beyond the
scope or charter of this list.

My questions is:
What is the best method to perform an aggregate query to calculate
sum() values for each distinct wid as in the example below, but except
for all wid's (not just WHERE wid='01/1-6-1-30w1/0').

Also, performance wise, would it be better to build a function for this
query. The table has 9 million records and these aggregate queries
take hours.


SELECT
SUM(oil) as sumoil, SUM(hours) as sumhours,
FROM
(SELECT oil, hours prd_data WHERE wid='01/1-6-1-30w1/0'
ORDER BY date LIMIT 6) subtable
;


Table description:
Table "prd_data"
Column | Type | Modifiers
--------+-----------------------+-----------
date | integer |
hours | real |
oil | real |
gas | real |
water | real |
pwid | integer |
wid | character varying(20) |
year | smallint |
Indexes: wid_index6


Actual table (prd_data), 9 million records:

date | hours | oil | gas | water | pwid | wid | year
--------+-------+-------+------+-------+------+------------- ----+------
196507 | 360 | 159.4 | 11.3 | 40.9 | 413 | 01/1-1-1-31w1/0 | 1965
196508 | 744 | 280 | 20 | 27.2 | 413 | 01/1-1-1-31w1/0 | 1965
196509 | 360 | 171.1 | 11.4 | 50.4 | 413 | 01/1-1-1-31w1/0 | 1965
196510 | 744 | 202.1 | 25 | 89.8 | 413 | 01/1-1-1-31w1/0 | 1965
196512 | 744 | 201.3 | 23.8 | 71.9 | 413 | 01/1-1-1-31w1/0 | 1965
196511 | 720 | 184 | 17.6 | 78.9 | 413 | 01/1-1-1-31w1/0 | 1965
196610 | 744 | 99.8 | 15.4 | 53.7 | 413 | 01/1-1-1-31w1/0 | 1966
196612 | 744 | 86 | 12.8 | 36.1 | 413 | 01/1-1-1-31w1/0 | 1966
196611 | 720 | 86 | 12.6 | 41.7 | 413 | 01/1-1-1-31w1/0 | 1966
196601 | 744 | 191.6 | 22.6 | 50.7 | 413 | 01/1-1-1-31w1/0 | 1966
200301 | 461 | 68.8 | 0 | 186.3 | 47899 | 9G/6-1-50-24w3/0 | 2003
200310 | 740 | 446.3 | 0 | 563.1 | 47899 | 9G/6-1-50-24w3/0 | 2003
200306 | 667 | 92.1 | 0 | 968.8 | 47899 | 9G/6-1-50-24w3/0 | 2003
200304 | 0 | 0 | 0 | 0 | 47899 | 9G/6-1-50-24w3/0 | 2003
200308 | 457 | 100.7 | 0 | 82.8 | 47899 | 9G/6-1-50-24w3/0 | 2003
200307 | 574 | 78 | 0 | 752 | 47899 | 9G/6-1-50-24w3/0 | 2003
200312 | 582 | 360.9 | 0 | 569 | 47899 | 9G/6-1-50-24w3/0 | 2003
200311 | 681 | 260.8 | 0 | 563.9 | 47899 | 9G/6-1-50-24w3/0 | 2003
200305 | 452 | 0 | 0 | 0 | 47899 | 9G/6-1-50-24w3/0 | 2003
200309 | 637 | 244.6 | 0 | 193.8 | 47899 | 9G/6-1-50-24w3/0 | 2003
(20 rows)



Thanks,

--
Scott A. Gerhardt, P.Geo.
Gerhardt Information Technologies


---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Re: Aggregate query for multiple records

am 27.08.2004 06:55:42 von Josh

Scott,

> Hello, I am new to the list, my apology if this question is beyond the
> scope or charter of this list.

We have a charter? Why didn't anyone tell me?

> My questions is:
> What is the best method to perform an aggregate query to calculate
> sum() values for each distinct wid as in the example below, but except
> for all wid's (not just WHERE wid='01/1-6-1-30w1/0').

> SELECT
> SUM(oil) as sumoil, SUM(hours) as sumhours,
> FROM
> (SELECT oil, hours prd_data WHERE wid='01/1-6-1-30w1/0'
> ORDER BY date LIMIT 6) subtable

Your problem here is the "limit 6" There isn't any good+fast way to take only
6 of each thing and total them. Also the above query is missing a FROM
clause, so I;ve had to guess at your table name below. Oh, and the word
"date" is a reserved word, better to quote it.

This is valid, but it won't be fast:

SELECT wid,
(SELECT SUM(oil) FROM prd_data pd2
WHERE pd2.wid = prd_data.wid ORDER BY "date" LIMIT 6) as tot_oil,
(SELECT SUM(hours) FROM prd_data pd2
WHERE pd2.wid = prd_data.wid ORDER BY "date" LIMIT 6) as tot_hours
FROM prd_data
ORDER BY wid;

However, that will be querying the prd_data table about (distinct wid)*2+1
times. Don't hold your breath.

> Also, performance wise, would it be better to build a function for this
> query. The table has 9 million records and these aggregate queries
> take hours.

It might. Not for the summaries themselves (which are fastest as aggregates),
but to build a table that has only 6 records per WID.

--
Josh Berkus
Aglio Database Solutions
San Francisco

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Re: Aggregate query for multiple records

am 27.08.2004 11:32:51 von tjk

shouldn't take that long, I would think.
You have indexes on wid and date?


Troy

>
> Hello, I am new to the list, my apology if this question is beyond the
> scope or charter of this list.
>
> My questions is:
> What is the best method to perform an aggregate query to calculate
> sum() values for each distinct wid as in the example below, but except
> for all wid's (not just WHERE wid='01/1-6-1-30w1/0').
>
> Also, performance wise, would it be better to build a function for this
> query. The table has 9 million records and these aggregate queries
> take hours.
>
>
> SELECT
> SUM(oil) as sumoil, SUM(hours) as sumhours,
> FROM
> (SELECT oil, hours prd_data WHERE wid='01/1-6-1-30w1/0'
> ORDER BY date LIMIT 6) subtable
> ;
>
>
> Table description:
> Table "prd_data"
> Column | Type | Modifiers
> --------+-----------------------+-----------
> date | integer |
> hours | real |
> oil | real |
> gas | real |
> water | real |
> pwid | integer |
> wid | character varying(20) |
> year | smallint |
> Indexes: wid_index6
>
>
> Actual table (prd_data), 9 million records:
>
> date | hours | oil | gas | water | pwid | wid | year
> --------+-------+-------+------+-------+------+------------- ----+------
> 196507 | 360 | 159.4 | 11.3 | 40.9 | 413 | 01/1-1-1-31w1/0 | 1965
> 196508 | 744 | 280 | 20 | 27.2 | 413 | 01/1-1-1-31w1/0 | 1965
> 196509 | 360 | 171.1 | 11.4 | 50.4 | 413 | 01/1-1-1-31w1/0 | 1965
> 196510 | 744 | 202.1 | 25 | 89.8 | 413 | 01/1-1-1-31w1/0 | 1965
> 196512 | 744 | 201.3 | 23.8 | 71.9 | 413 | 01/1-1-1-31w1/0 | 1965
> 196511 | 720 | 184 | 17.6 | 78.9 | 413 | 01/1-1-1-31w1/0 | 1965
> 196610 | 744 | 99.8 | 15.4 | 53.7 | 413 | 01/1-1-1-31w1/0 | 1966
> 196612 | 744 | 86 | 12.8 | 36.1 | 413 | 01/1-1-1-31w1/0 | 1966
> 196611 | 720 | 86 | 12.6 | 41.7 | 413 | 01/1-1-1-31w1/0 | 1966
> 196601 | 744 | 191.6 | 22.6 | 50.7 | 413 | 01/1-1-1-31w1/0 | 1966
> 200301 | 461 | 68.8 | 0 | 186.3 | 47899 | 9G/6-1-50-24w3/0 | 2003
> 200310 | 740 | 446.3 | 0 | 563.1 | 47899 | 9G/6-1-50-24w3/0 | 2003
> 200306 | 667 | 92.1 | 0 | 968.8 | 47899 | 9G/6-1-50-24w3/0 | 2003
> 200304 | 0 | 0 | 0 | 0 | 47899 | 9G/6-1-50-24w3/0 | 2003
> 200308 | 457 | 100.7 | 0 | 82.8 | 47899 | 9G/6-1-50-24w3/0 | 2003
> 200307 | 574 | 78 | 0 | 752 | 47899 | 9G/6-1-50-24w3/0 | 2003
> 200312 | 582 | 360.9 | 0 | 569 | 47899 | 9G/6-1-50-24w3/0 | 2003
> 200311 | 681 | 260.8 | 0 | 563.9 | 47899 | 9G/6-1-50-24w3/0 | 2003
> 200305 | 452 | 0 | 0 | 0 | 47899 | 9G/6-1-50-24w3/0 | 2003
> 200309 | 637 | 244.6 | 0 | 193.8 | 47899 | 9G/6-1-50-24w3/0 | 2003
> (20 rows)
>
>
>
> Thanks,
>
> --
> Scott A. Gerhardt, P.Geo.
> Gerhardt Information Technologies
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>


---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Re: Aggregate query for multiple records

am 27.08.2004 20:59:26 von Josh

Scott,

> >> I tried your query but it doesn't seem to work. The interpreter
> >> expects prd2.date to be grouped but this doesn't make sence, see
> >> below:
> >
> > Oh, yeah, darn it.
> >
> > Hmmm ... maybe you could explain the purpose of selecting just 6?
> > This query
> > is going to get a *lot* uglier if there isn't another way to
> > accomplish it.
>
> The purpose of selecting the first 6 is we need sum values for just the
> first 3, 6 and 12 months oil/water production for every oil well in the
> database to perform statistical calculations. Optionally, we also need
> the _last_ 6 months of oil and water production.
> The prd_data table has total oil for each month but some wells have 500
> months of data, we just want the first 6.

Well, here's my suggestion, to make this query both easier and faster:

1) Add a column called "month_prod", integer, to the table. This is the
"number of months in production".
2) Populate this query with a calculated difference on your "date" column
against the first appearance of that WID (I'm assuming that each increment of
"date" = 1 month)

UPDATE prd_data SET months_prod = prd_data."date" - prd2."date" + 1
FROM prd_data prd2
WHERE prd_data.wid = prd2.wid
AND prd2."date" = ( SELECT "date" FROM prd_data prd3
WHERE prd3.wid = prd2.wid
ORDER BY "date" LIMIT 1 );

(warning: the above will take a long time and floor your server. Make sure to
do VACUUM FULL ANALYZE prd_data afterwards.)
(if you can't get the above to complete, which is possible depending on your
hardware, you could do it procedurally in a programmming language)

3) Then running your aggregate becomes very easy/fast:

SELECT wid, sum(oil) as tot_oil, sum(hours) as tot_hours
FROM prd_data
WHERE months_prod < 7
GROUP BY wid
ORDER BY wid;

--
Josh Berkus
Aglio Database Solutions
San Francisco

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Re: Aggregate query for multiple records

am 27.08.2004 22:19:41 von Josh

Scott,

> Unfortunately, your revised query works like a charm except for the
> fact that prd_data."date" - prd2."date" + 1 give incorrect values when
> the year wraps, see in the output below. Need to conditionally
> subtract 88 from the date or use an incrementing count() function
> instead of date math to get the correct values.

Oh, that column is text, not a serial number. Well, it's your fault for
using wierd custom data types; you figure it out.

> I'm also wondering if using the date functions of Postgres would be
> helpful since the date column does have the YYYY-year and MM-month
> parts.

Well, you'd want to convert the column to a timestamp, and then you could
compute months. Or you could break it in seperate integer "year" and
"month" columns and do the same thing.

--
Josh Berkus
Aglio Database Solutions
San Francisco

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Re: Aggregate query for multiple records

am 28.08.2004 10:35:58 von gsstark

Scott Gerhardt writes:

> Hello, I am new to the list, my apology if this question is beyond the scope or
> charter of this list.
>
> My questions is:
> What is the best method to perform an aggregate query to calculate sum() values
> for each distinct wid as in the example below, but except for all wid's (not
> just WHERE wid='01/1-6-1-30w1/0').
>
> Also, performance wise, would it be better to build a function for this query.
> The table has 9 million records and these aggregate queries take hours.

The "top n" type query (or in this case "first n" or "last n" but it's the
same thing) is actually very tricky to do in standard SQL. The best solution
seen here for postgres is to use arrays and custom aggregate functions.

The following is based on a previous answer from Tom Lane to a similar
question. (I thought I already posted this for you on pgsql-general but the
list archives are down and you don't seem to have seen it, so I'm resending
it)

It allows you to do the whole query with a single sort for the grouping and
the ordering by date together. You would have to use it with something like:

SELECT sum_first_6(oil)
FROM (SELECT oil from prd_data ORDER BY wid, "date")
GROUP BY wid

If you pump up sort_mem enough -- you can do it within the session for the
single connection using "SET sort_mem" it should be pretty fast.

I think it's the best you're going to get. If you're absolutely sure the data
is physically stored in chronological order -- which I would only feel
comfortable with if you've never done any updates or deletes, only inserts and
perhaps occasional truncates, then you might be able to get by without
ordering and convince it to do a hash aggregate. That would be the optimal
result, no sorts at all. But it would be hard to make sure it would always
work.

test=> create or replace function first_6_accum (integer[], integer) returns integer[] language sql immutable as 'select case when array_upper($1,1)>=6 then $1 else $1||$2 end';
CREATE FUNCTION

test=> create function sum_6(integer[]) returns integer immutable language sql as 'select $1[1]+$1[2]+$1[3]+$1[4]+$1[5]+$1[6]';
CREATE FUNCTION

test=> create aggregate sum_first_6 (basetype=integer, sfunc=first_6_accum, stype=integer[],initcond='{}',finalfunc=sum_6);
CREATE AGGREGATE

test=> select sum_first_6(i) from (select i from (select 1 as i union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8) as x order by i desc) as x;
sum_first_6
-------------
33
(1 row)

test=> select sum_first_6(i) from (select i from (select 1 as i union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8) as x order by i asc) as x;
sum_first_6
-------------
21
(1 row)

This can easily be switched around to make it "last_6" and you can write
functions to handle 6 records or 9 records. And all of these could be combined
in a single query, so you only have to do the sort once.

Unfortunately you cannot make aggregate functions that take multiple
parameters, nor can you pass extra parameters to the state function. So you'll
have to create a separate set of functions for each variant.

Also, you'll have to change it to use reals.


--
greg


---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Re: Aggregate query for multiple records

am 28.08.2004 18:19:37 von tgl

Greg Stark writes:
> [ nice example snipped ]
> ... Also, you'll have to change it to use reals.

That part, at least, can be worked around as of 7.4: use polymorphic
functions. You can declare the functions and aggregate as working on
anyelement/anyarray, and then they will automatically work on any
datatype that has a + operator.

regression=# create or replace function first_6_accum (anyarray,anyelement) returns anyarray
regression-# language sql immutable as 'select case when array_upper($1,1)>=6 then $1 else $1||$2 end';
CREATE FUNCTION
regression=# create function sum_6(anyarray) returns anyelement immutable language sql as 'select $1[1]+$1[2]+$1[3]+$1[4]+$1[5]+$1[6]';
CREATE FUNCTION
regression=# create aggregate sum_first_6 (basetype=anyelement, sfunc=first_6_accum, stype=anyarray,initcond='{}',finalfunc=sum_6);
CREATE AGGREGATE
regression=# select sum_first_6(i) from (select i from (select 1 as i union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8) as x order by i desc) as x;
sum_first_6
-------------
33
(1 row)

regression=# select sum_first_6(i) from (select i from (select 1.1 as i union select 2 union select 3 union select 4 union select 5 union select 6 union select 7.7 union select 8) as x order by i desc) as x;
sum_first_6
-------------
33.7
(1 row)

regression=#

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Re: Aggregate query for multiple records

am 28.08.2004 20:08:58 von scott

> Greg Stark writes:
>> [ nice example snipped ]
>> ... Also, you'll have to change it to use reals.
>
> That part, at least, can be worked around as of 7.4: use polymorphic
> functions. You can declare the functions and aggregate as working on
> anyelement/anyarray, and then they will automatically work on any
> datatype that has a + operator.
>
> regression=# create or replace function first_6_accum
> (anyarray,anyelement) returns anyarray
> regression-# language sql immutable as 'select case when
> array_upper($1,1)>=6 then $1 else $1||$2 end';
> CREATE FUNCTION
> regression=# create function sum_6(anyarray) returns anyelement
> immutable language sql as 'select
> $1[1]+$1[2]+$1[3]+$1[4]+$1[5]+$1[6]';
> CREATE FUNCTION
> regression=# create aggregate sum_first_6 (basetype=anyelement,
> sfunc=first_6_accum, stype=anyarray,initcond='{}',finalfunc=sum_6);
> CREATE AGGREGATE
> regression=# select sum_first_6(i) from (select i from (select 1 as i
> union select 2 union select 3 union select 4 union select 5 union
> select 6 union select 7 union select 8) as x order by i desc) as x;
> sum_first_6
> -------------
> 33
> (1 row)
>
> regression=# select sum_first_6(i) from (select i from (select 1.1 as
> i union select 2 union select 3 union select 4 union select 5 union
> select 6 union select 7.7 union select 8) as x order by i desc) as x;
> sum_first_6
> -------------
> 33.7
> (1 row)
>
> regression=#
>
> regards, tom lane
>



An alternate solution I'm thinking is to add column to hold a
"total_months" value that could be used to simplify queries and speed
queries ( i.e. first month of oil productin = 1, second = 2 etc.) That
way I can use select the first 6 months by using "where < 6", or any
month interval for that matter.

The following query, suggested by another list member (thanks Josh
Berkus), to populate the "total_months" column sort of work but doesn't
handle the year wrapping as it adds 88 when the year wraps (see output
below).


UPDATE prd_data_test SET months_prod = prd_data_test."date" -
prd2."date" + 1
FROM prd_data_test prd2
WHERE prd_data_test.wid = prd2.wid
AND prd2."date" = ( SELECT "date" FROM prd_data_test prd3
WHERE prd3.wid = prd2.wid
ORDER BY "date" LIMIT 1 );


The results are:
SEM=# select * from prd_data_test order by wid, date limit 20;
date | hours | oil | gas | water | pwid | wid | year
| month_prd | months_prod
--------+-------+-------+------+-------+------+------------- ----+------
+-----------+-------------
196505 | 480 | 194.3 | 10.3 | 0 | 1 | 01/1-6-1-30w1/0 | 1965
| | 1
196506 | 600 | 279.4 | 13.1 | 0 | 1 | 01/1-6-1-30w1/0 | 1965
| | 2
196507 | 744 | 288.1 | 4.5 | 0 | 1 | 01/1-6-1-30w1/0 | 1965
| | 3
196508 | 720 | 234.6 | 9.4 | 2.9 | 1 | 01/1-6-1-30w1/0 | 1965
| | 4
196509 | 648 | 208.2 | 12.5 | 6 | 1 | 01/1-6-1-30w1/0 | 1965
| | 5
196510 | 744 | 209.8 | 15.3 | 0 | 1 | 01/1-6-1-30w1/0 | 1965
| | 6
196511 | 720 | 180.5 | 13.9 | 27.7 | 1 | 01/1-6-1-30w1/0 | 1965
| | 7
196512 | 744 | 227.4 | 22.8 | 5.2 | 1 | 01/1-6-1-30w1/0 | 1965
| | 8
196601 | 744 | 230.3 | 22.7 | 10 | 1 | 01/1-6-1-30w1/0 | 1966
| | 97
196602 | 672 | 173.2 | 16.5 | 17 | 1 | 01/1-6-1-30w1/0 | 1966
| | 98
196603 | 744 | 197.2 | 18.7 | 9.2 | 1 | 01/1-6-1-30w1/0 | 1966
| | 99
196604 | 720 | 168.1 | 14.1 | 3 | 1 | 01/1-6-1-30w1/0 | 1966
| | 100


Table description:
Table "prd_data"
Column | Type | Modifiers
--------+-----------------------+-----------
date | integer |
hours | real |
oil | real |
gas | real |
water | real |
pwid | integer |
wid | character varying(20) |
year | smallint |
Indexes: wid_index6


Actual table (prd_data), 9 million records:

date | hours | oil | gas | water | pwid | wid | year
--------+-------+-------+------+-------+------+------------- ----+------
196507 | 360 | 159.4 | 11.3 | 40.9 | 413 | 01/1-1-1-31w1/0 | 1965
196508 | 744 | 280 | 20 | 27.2 | 413 | 01/1-1-1-31w1/0 | 1965
196509 | 360 | 171.1 | 11.4 | 50.4 | 413 | 01/1-1-1-31w1/0 | 1965
196510 | 744 | 202.1 | 25 | 89.8 | 413 | 01/1-1-1-31w1/0 | 1965
196512 | 744 | 201.3 | 23.8 | 71.9 | 413 | 01/1-1-1-31w1/0 | 1965
196511 | 720 | 184 | 17.6 | 78.9 | 413 | 01/1-1-1-31w1/0 | 1965
196610 | 744 | 99.8 | 15.4 | 53.7 | 413 | 01/1-1-1-31w1/0 | 1966
196612 | 744 | 86 | 12.8 | 36.1 | 413 | 01/1-1-1-31w1/0 | 1966
196611 | 720 | 86 | 12.6 | 41.7 | 413 | 01/1-1-1-31w1/0 | 1966
196601 | 744 | 191.6 | 22.6 | 50.7 | 413 | 01/1-1-1-31w1/0 | 1966
200301 | 461 | 68.8 | 0 | 186.3 | 47899 | 9G/6-1-50-24w3/0 | 2003
200310 | 740 | 446.3 | 0 | 563.1 | 47899 | 9G/6-1-50-24w3/0 | 2003
200306 | 667 | 92.1 | 0 | 968.8 | 47899 | 9G/6-1-50-24w3/0 | 2003
200304 | 0 | 0 | 0 | 0 | 47899 | 9G/6-1-50-24w3/0 | 2003
200308 | 457 | 100.7 | 0 | 82.8 | 47899 | 9G/6-1-50-24w3/0 | 2003
200307 | 574 | 78 | 0 | 752 | 47899 | 9G/6-1-50-24w3/0 | 2003
200312 | 582 | 360.9 | 0 | 569 | 47899 | 9G/6-1-50-24w3/0 | 2003
200311 | 681 | 260.8 | 0 | 563.9 | 47899 | 9G/6-1-50-24w3/0 | 2003
200305 | 452 | 0 | 0 | 0 | 47899 | 9G/6-1-50-24w3/0 | 2003
200309 | 637 | 244.6 | 0 | 193.8 | 47899 | 9G/6-1-50-24w3/0 | 2003


Thanks,

--
Scott A. Gerhardt, P.Geo.
Gerhardt Information Technologies





---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings