Increasingly greater amounts of redundant aggregate data in queries?
on 02.09.2007 17:18:03 by csimam
Hello,
I'd like to start by saying I'm not a DBA; rather, I'm a Java
developer. But I'm learning more and more about database design, and
it's very interesting.
Here is the general issue I'm facing:
Which one is better?
1. To perform aggregate queries on data, and use that data to home in
and perform more detailed analysis? This gives result sets that are
potentially filled with columns that contain the aggregate data
repeated across many rows.
2. To perform the aggregate query once, and then hit the database
again for a detailed analysis, and return the detailed analysis
results in an order that is easily cross-correlated with the original
aggregate data? This means running the aggregate query essentially
twice; the second time it is just hidden as a subquery.
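For instance, with a made-up sales(store, day, amount) table, option 2
would be two round trips:

-- query 1: the aggregate
select store, sum(amount) as store_total
from sales
group by store
order by store

-- query 2: the detail, ordered the same way so the application can
-- cross-correlate its rows with the aggregate result
select store, day, amount
from sales
order by store, day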
Now I realize this is a very general question, and I have made the
kind of assumptions a junior DBA would make (e.g. that in the second
option the aggregate query must be run again - perhaps this is
false?).
Do I need to give a fuller example? It seems to me that you senior
DBAs would run across this a lot; I find these issues very fascinating
as they crop up.
Thanks for any opinions in advance. I will be happy to give an example
if needed.
- C
Re: Increasingly greater amounts of redundant aggregate data in queries?
on 02.09.2007 17:40:54 by csimam
On Sep 2, 6:18 pm, csi...@gmail.com wrote:
> Hello,
>
> I'd like to start by saying I'm not a DBA; rather, I'm a Java
> developer. But I'm learning more and more about database design, and
> it's very interesting.
>
> Here is the general issue I'm facing:
>
> Which one is better?
> 1. To perform aggregate queries on data, and use that data to home in
> and perform more detailed analysis? This gives result sets that are
> potentially filled with columns that contain the aggregate data
> repeated across many rows.
By the way, by homing in I meant: within the same query, using nested
SELECTs. For example, the inner query performs the aggregate analysis
and returns whatever data we are interested in, and then the outer
query performs another SELECT on the same data, resulting in the
aggregate data being distributed across the 'detailed results' from
the outer query.
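Roughly, with the same made-up sales(store, day, amount) table, the
shape I mean is:

select s.store, s.day, s.amount, agg.store_total
from sales s
join ( select store, sum(amount) as store_total
       from sales
       group by store ) agg
  on agg.store = s.store

Here store_total is repeated on every detail row of its store, which
is the redundancy I was asking about.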
Re: Increasingly greater amounts of redundant aggregate data in queries?
on 03.09.2007 00:05:03 by Erland Sommarskog
(csimam@gmail.com) writes:
> Which one is better?
> 1. To perform aggregate queries on data, and use that data to home in
> and perform more detailed analysis? This gives result sets that are
> potentially filled with columns that contain the aggregate data
> repeated across many rows.
And you clarified:
> By the way, by homing in I meant: within the same query, using nested
> SELECTs. For example, the inner query performs the aggregate analysis
> and returns whatever data we are interested in, and then the outer
> query performs another SELECT on the same data, resulting in the
> aggregate data being distributed across the 'detailed results' from
> the outer query.
Your question is quite abstract, but to take a concrete example, it
seems like you have sales data that you want to aggregate per month,
product, state, etc. You also want total sums per month, per product
and so on.
The best, of course, is to get it all at once in the same result set,
since then you only need to scan the data once. In SQL 2005 you can
achieve this with the proprietary extensions WITH CUBE and WITH ROLLUP
to the GROUP BY clause. SQL 2008, currently in beta, implements
GROUPING SETS from the ANSI standard, which are a lot more powerful.
(They are also a lot more confusing to use.)
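For example, on a hypothetical Sales(Store, SalesMonth, Amount) table:

-- SQL 2005: month-level rows plus per-store subtotals and a
-- grand total, all from one scan
SELECT Store, SalesMonth, SUM(Amount) AS Total
FROM   Sales
GROUP  BY Store, SalesMonth WITH ROLLUP

-- SQL 2008: the same result with ANSI GROUPING SETS
SELECT Store, SalesMonth, SUM(Amount) AS Total
FROM   Sales
GROUP  BY GROUPING SETS ((Store, SalesMonth), (Store), ())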
--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
Re: Increasingly greater amounts of redundant aggregate data in queries?
on 04.09.2007 00:43:39 by csimam
On Sep 3, 1:05 am, Erland Sommarskog wrote:
> The best, of course, is to get it all at once in the same result set,
> since then you only need to scan the data once. In SQL 2005 you can
> achieve this with the proprietary extensions WITH CUBE and WITH ROLLUP
> to the GROUP BY clause. SQL 2008, currently in beta, implements
> GROUPING SETS from the ANSI standard, which are a lot more powerful.
> (They are also a lot more confusing to use.)
Thank you Erland for your answer: I spent a bit of time reading about
ROLLUP and CUBE, and found them to be almost exactly what I needed.
But there is one obstacle that I am struggling with:
From the data I have, I want to generate totals and subtotals of,
say, sales: total sales per store, and subtotals of sales per store
per day, for example.
The problem is that I ultimately only want the top 25 stores in total
sales BUT with their subtotals as well... So I used ROLLUP to generate
the totals and subtotals... and in the same query I generated a
ranking of the data... and there I found my problem: the ranking does
not distinguish between totals and subtotals. So the results look
like:
store1, realTotal, 1
store1, subTotal, 2
store2, realTotal, 3
store7, realTotal, 4
store10, subTotal, 5
Clearly, the rows with subtotals should NOT be ranked... but I am
struggling with SQL to figure out how to get exactly what I want.
Mind you, I've already got the problem solved, but it involves a
nested query that hits the table twice... and I just wanted to use the
elegance of ROLLUP to avoid hitting that table twice.
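For reference, the two-pass shape I have working is roughly this
(names simplified):

select t.id, t.total, s.day, sum(s.sales) as day_subtotal
from stores s
join (
  select id, total
  from ( select id, sum(sales) as total
         from stores
         group by id
         order by sum(sales) desc )
  where rownum <= 25
) t on t.id = s.id
group by t.id, t.total, s.day

i.e. one aggregate pass to rank and keep the top 25 stores, and a
second pass over the same table for their per-day subtotals.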
Any ideas?
thanks in advance.
Re: Increasingly greater amounts of redundant aggregate data in queries?
on 04.09.2007 04:49:29 by csimam
On Sep 4, 1:43 am, csi...@gmail.com wrote:
> From the data I have, I want to generate totals and subtotals of,
> say, sales: total sales per store, and subtotals of sales per store
> per day, for example.
>
> The problem is that I ultimately only want the top 25 stores in total
> sales BUT with their subtotals as well... So I used ROLLUP to generate
> the totals and subtotals... and in the same query I generated a
> ranking of the data... and there I found my problem: the ranking does
> not distinguish between totals and subtotals. So the results look
> like:
>
> store1, realTotal, 1
> store1, subTotal, 2
> store2, realTotal, 3
> store7, realTotal, 4
> store10, subTotal, 5
>
> Clearly, the rows with subtotals should NOT be ranked... but I am
> struggling with SQL to figure out how to get exactly what I want.
Well... I have a solution. But I'm not sure about it. It involves
something I learned about only recently: a user-defined aggregate
function that concatenates values from across rows. Using this
aggregate function, stragg, I can do another pass through the results
mentioned above while grouping by store, and produce:
store1, 'realTotal subTotal', 1
store2, 'realTotal subTotal', 3
store7, 'realTotal subTotal', 4
store10, 'realTotal subTotal', 5
(it was a typo in the original example to list "store10, subTotal, 5"
- the realTotal would have been greater than or equal).
And another trick: use min( rank ) in the same aggregation, and order
by it as well.
Together that gives one a very compact result set that has everything
ranked and grouped perfectly.
...but... a worry: is this REALLY better than doing a nested query? I
wonder if the use of the user-defined aggregate function is very
expensive...
For the benefit of anyone wondering, here's the structure that
works... in one pass in theory... but again, I am now wondering if it
really is faster than the nested query equivalent that hits the
database twice on the table 'stores' (seems to me that the middle
select below is effectively hitting the table again, just in temporary
form):
select rownum as rank, id, days, totals
from
(
  -- second pass over the ROLLUP result: collapse each store into one
  -- row, concatenating its day labels and totals with stragg
  select id, stragg(day) as days, stragg(total) as totals,
         min(rank) as innerrank
  from
  (
    -- first pass: ROLLUP produces per-day subtotals plus a per-store
    -- total row (day is NULL there); rank all rows by total
    select id, day, sum(sales) as total,
           rank() over (order by sum(sales) desc) as rank
    from stores
    group by id, rollup(day)
  )
  group by id
  order by min(rank)
)
where rownum <= 25
Re: Increasingly greater amounts of redundant aggregate data in queries?
on 04.09.2007 23:49:13 by Erland Sommarskog
(csimam@gmail.com) writes:
>> The problem is that I ultimately only want the top 25 stores in total
>> sales BUT with their subtotals as well... So I used ROLLUP to generate
>> the totals and subtotals... and in the same query I generated a
>> ranking of the data... and there I found my problem: the ranking does
>> not distinguish between totals and subtotals. So the results look
>> like:
>>
>> store1, realTotal, 1
>> store1, subTotal, 2
>> store2, realTotal, 3
>> store7, realTotal, 4
>> store10, subTotal, 5
>>
>> Clearly, the rows with subtotals should NOT be ranked... but I am
>> struggling with SQL to figure out how to get exactly what I want.
An interesting challenge to do in one query, hitting the table only
once, and I was not able to think of a solution without using the
imaginary Previous() function that we have discussed internally among
us MVPs.
You can easily do it in one query if you hit the table twice: simply
one query with GROUP BY store and one with GROUP BY store, month.
Rank the first and join.
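In sketch form, with a hypothetical Sales(Store, SalesMonth, Amount)
table:

WITH StoreTotals AS (
   SELECT Store, SUM(Amount) AS GrandTotal,
          row_number() OVER (ORDER BY SUM(Amount) DESC) AS rk
   FROM   Sales
   GROUP  BY Store
), MonthTotals AS (
   SELECT Store, SalesMonth, SUM(Amount) AS MonthTotal
   FROM   Sales
   GROUP  BY Store, SalesMonth
)
SELECT S.Store, S.GrandTotal, M.SalesMonth, M.MonthTotal
FROM   StoreTotals S
JOIN   MonthTotals M ON M.Store = S.Store
WHERE  S.rk <= 25
ORDER  BY S.rk, M.SalesMonth;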
And it's possible to hit the table once, if you first get the
monthly aggregations into a temp table.
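Something like this, with the same hypothetical table:

SELECT Store, SalesMonth, SUM(Amount) AS MonthTotal
INTO   #monthly
FROM   Sales
GROUP  BY Store, SalesMonth;

-- the grand totals and the ranking now come from #monthly,
-- so Sales itself is scanned only once
SELECT Store, SUM(MonthTotal) AS GrandTotal,
       row_number() OVER (ORDER BY SUM(MonthTotal) DESC) AS rk
FROM   #monthly
GROUP  BY Store;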
But in a single query? I don't think it's possible if you want the
format:
Store1, Grand Total, 1000000
Store1, Jan 2000, 12000
Store1, Feb 2000, 10000,
Store2, Grand Total, ....
But I will have to see if I can get Itzik Ben-Gan to tackle this one.
> Well... I have a solution. But I'm not sure about it. It involves
> something I learned about only recently: a user-defined aggregate
> function that concatenates values from across rows. Using this
> aggregate function, stragg, I can do another pass through the results
> mentioned above while grouping by store, and produce:
>
> store1, 'realTotal subTotal', 1
> store2, 'realTotal subTotal', 3
> store7, 'realTotal subTotal', 4
> store10, 'realTotal subTotal', 5
One problem is that a UDA is limited to 8000 characters, so if there
are many subtotals, your query will bomb.
> ...but... a worry: is this REALLY better than doing a nested query? I
> wonder if the use of the user-defined aggregate function is very
> expensive...
The way to find out is to run both queries a couple of times on an
idle server. Just don't forget about the effects of caching: either
run the queries once first, so that you know the table is entirely
in cache, or, if it's too big for that, run DBCC DROPCLEANBUFFERS
between each run, so that both queries run under the same conditions.
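That is, a harness along these lines:

CHECKPOINT                 -- write dirty pages, so the drop is complete
DBCC DROPCLEANBUFFERS      -- start each run with a cold cache
SET STATISTICS TIME ON     -- report CPU and elapsed time per statement
-- run the query under test here
SET STATISTICS TIME OFF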
--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
Re: Increasingly greater amounts of redundant aggregate data in queries?
on 05.09.2007 01:37:46 by Steve Kass
Erland,
Thanks for pointing the MVPs to this thread. I think the following
solution is efficient and accesses the table only once. The query
below uses the Northwind sample database data.
It requires a permanent table of Years (year ranges, actually,
with one range for each year and one "grand total" range) that
can work generally for all queries like this, and it also uses
aspects of the data model in Northwind, specifically that there
is a Customers table.
It can probably also be done with some sort of ROLLUP and
PIVOT/UNPIVOT, as queries with cross joins and MAX's often
can. Also, I only tested it on the SQL Server 2008 preview,
because at the moment, no other server is handy.
Here's the query, preceded by the Years table creation (here with
inserts sufficient only for Northwind; you will have to add more).
Steve Kass
Drew University
http://www.stevekass.com
-- Create a permanent table like this that will work for
-- the entire range of time you will ever need
CREATE TABLE Years(
y0 datetime primary key,
y1 datetime,
tag varchar(20),
ord int
);
insert into Years values ('1996','1997','1996',1996);
insert into Years values ('1997','1998','1997',1997);
insert into Years values ('1998','1999','1998',1998);
insert into Years values ('1900','2100','Total',0);
go
-- The query:
-- The cross join with Years pairs every order with every range;
-- grouping by customer and tag then sums only the orders that fall
-- in that group's range, giving one row per year plus one 'Total'
-- row per customer. dense_rank() orders by the raw (unfiltered) sum,
-- which is the customer's grand total on all of that customer's
-- rows, so drk <= 25 keeps the top 25 customers with all their rows.
with T(CustomerID, Tag, Ord, Total, drk) as (
select
C.CustomerID,
max(case when O.OrderDate >= y0 and O.OrderDate < y1
then tag end
) as Tag,
max(case when O.OrderDate >= y0 and O.OrderDate < y1
then ord end
) as Ord,
sum(
case when O.OrderDate >= y0 and O.OrderDate < y1
then OD.Quantity * OD.UnitPrice end
) as Total,
dense_rank() over (order by sum(OD.Quantity * OD.UnitPrice) desc) as drk
from Orders O
join Customers C
on C.CustomerID = O.CustomerID
join [Order Details] OD
on O.OrderID = OD.OrderID
cross join Years
group by C.CustomerID, tag
)
select CustomerID, Tag, Total
from T
where drk <= 25
order by drk, Ord
;
go
-- DROP TABLE Years;
Re: Increasingly greater amounts of redundant aggregate data in queries?
on 06.09.2007 23:15:54 by Erland Sommarskog
Erland Sommarskog (esquel@sommarskog.se) writes:
> An interesting challenge to do in one query, hitting the table only
> once, and I was not able to think of a solution without using the
> imaginary Previous() function that we have discussed internally among
> us MVPs.
>
> You can easily do it in one query if you hit the table twice: simply
> one query with GROUP BY store and one with GROUP BY store, month.
> Rank the first and join.
>
> And it's possible to hit the table once, if you first get the
> monthly aggregations into a temp table.
>
> But in a single query? I don't think it's possible if you want the
> format:
>
> Store1, Grand Total, 1000000
> Store1, Jan 2000, 12000
> Store1, Feb 2000, 10000,
> Store2, Grand Total, ....
>
> But I will have to see if I can get Itzik Ben-Gan to tackle this one.
Indeed I did, and I also got a suggestion from Steve Kass that you
have already seen, and one from Umachandar Jaychandran, a former MVP
who is now with the Storage Engine team. All three are really bright
guys when it comes to writing queries. Both Umachandar and Itzik
composed queries that logically visit the table only once.
Note that the query Steve posted is a cheat: it assumes that you know all
involved subheadings in advance. So in practice, Steve's solution is a two-
pass solution, although retrieving all subheadings is fairly cheap if there
is a non-clustered index on that column.
But as I ran all the proposals, save one from Itzik that would require
dynamic SQL to get the subheadings dynamically, I got a big surprise
that compels me to revisit what I said earlier in the thread.
To wit, the fastest solution was one that I wrote that uses a temp
table to hold the subtotals, and then computes the grand totals from
that one. But the really big surprise was that a fairly naïve query
where I scan the table twice, once for the subtotals and once for the
grand totals, was a lot faster than the queries that were one-pass
without using a temp table. The problem they had is that the execution
plan included a spool operator, that is, a worktable introduced by the
optimizer.
The moral: you don't know what is best until you have benchmarked.
To run the tests, I used an inflated version of the Northwind database
which I call Northgale. You can find the script for it at
http://www.sommarskog.se/dynsearch/Northgale.sql.
The test script with the six queries that I tested is attached.
--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
Re: Increasingly greater amounts of redundant aggregate data in queries?
on 07.09.2007 23:16:59 by Erland Sommarskog
Erland Sommarskog (esquel@sommarskog.se) writes:
> To wit, the fastest solution was one that I wrote that uses a temp
> table to hold the subtotals, and then computes the grand totals from
> that one. But the really big surprise was that a fairly naïve query
> where I scan the table twice, once for the subtotals and once for the
> grand totals, was a lot faster than the queries that were one-pass
> without using a temp table. The problem they had is that the execution
> plan included a spool operator, that is, a worktable introduced by the
> optimizer.
But that was on SQL 2005. On SQL 2008 (currently in beta), solutions
that used WITH ROLLUP or GROUPING SETS were twice as fast as my naïve
query, and about the same speed as the temp-table batch. Also, another
query that was fairly slow on SQL 2005 was just a tad slower than the
temp table on SQL 2008.
> The moral: you don't know what is best until you have benchmarked.
Another moral: whatever conclusions you draw about performance apply
only to that version of SQL Server.
--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx