Strange performance issue with UPDATE FROM

am 26.06.2007 17:31:39 von Richard

Hello!

I have this piece of SQL code:

UPDATE a
SET Field1 = c.Field1
FROM a
INNER JOIN b ON a.GUID1 = b.GUID1
INNER JOIN c ON b.GUID2 = c.GUID2
WHERE c.Type = 1
AND @date BETWEEN b.DateFrom AND b.DateTo

This query takes hours to complete.

Now while trying to find out what's causing the poor performance (it
surely looks simple enough!) I've rewritten it to use temp tables:

SELECT a.GUID1, a.Field1, c.Type, b.DateFrom, b.DateTo INTO #temptable
FROM a
INNER JOIN b ON a.GUID1 = b.GUID1
INNER JOIN c ON b.GUID2 = c.GUID2
WHERE c.Type = 1
AND @date BETWEEN b.DateFrom AND b.DateTo

UPDATE a SET Field1 = subsel.Field1
FROM (SELECT * FROM #temptable) AS subsel
WHERE subsel.GUID1 = a.GUID1

Now it completes in 10 seconds.

My question is why? Am I wrong in saying that the two batches above
produce same results? Is there something I've missed about the UPDATE
FROM syntax? Why would the first query perform THAT poorly?

Table sizes:
a: 24k rows
b: 268k rows
c: 260k rows

GUIDs are of type uniqueidentifier.

Any answers appreciated!

Regards,
// Richard

Re: Strange performance issue with UPDATE FROM

am 26.06.2007 23:46:15 von Erland Sommarskog

Richard (nassegris@gmail.com) writes:
> I have this piece of SQL code:
>
> UPDATE a
> SET Field1 = c.Field1
> FROM a
> INNER JOIN b ON a.GUID1 = b.GUID1
> INNER JOIN c ON b.GUID2 = c.GUID2
> WHERE c.Type = 1
> AND @date BETWEEN b.DateFrom AND b.DateTo
>
> This query takes hours to complete.
>
> Now while trying to find out what's causing the poor performance (it
> surely looks simple enough!) I've rewritten it to use temp tables:
>
> SELECT a.GUID1, a.Field1, c.Type, b.DateFrom, b.DateTo INTO #temptable
> FROM a
> INNER JOIN b ON a.GUID1 = b.GUID1
> INNER JOIN c ON b.GUID2 = c.GUID2
> WHERE c.Type = 1
> AND @date BETWEEN b.DateFrom AND b.DateTo
>
> UPDATE a SET Field1 = subsel.Field1
> FROM (SELECT * FROM #temptable) AS subsel
> WHERE subsel.GUID1 = a.GUID1
>
> Now it completes in 10 seconds.
>
> My question is why? Am I wrong in saying that the two batches above
> produce same results? Is there something I've missed about the UPDATE
> FROM syntax? Why would the first query perform THAT poorly?

One problem with UPDATE FROM is that you can update the same row
several times if your join conditions are not unique. What happens if
you run:

UPDATE a
SET Field = (SELECT c.Field1
FROM c
JOIN b ON c.GUID2 = b.GUID2
WHERE a.GUID1 = b.GUID1
AND c.type = 1
AND @date BETWEEN b.DateFrom AND b.DateTo)

I'm most interested to know if the query succeds at all, or if it fails
with an error message. From the table sizes you have indicated, I would
expect an error, but I don't know how your tables are related.

As for the performance, investigating the query plan can give some ideas.
Without seeing query plans, the table definitions, the indexes etc,
it's difficult to say much useful.

Which version of SQL Server are you using?

Is @date a parameter to a stored procedure or a local variable?

--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downlo ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books .mspx

Re: Strange performance issue with UPDATE FROM

am 27.06.2007 02:15:28 von Roy Harvey

On Tue, 26 Jun 2007 21:46:15 +0000 (UTC), Erland Sommarskog
wrote:

>One problem with UPDATE FROM is that you can update the same row
>several times if your join conditions are not unique. What happens if
>you run:
>
> UPDATE a
> SET Field = (SELECT c.Field1
> FROM c
> JOIN b ON c.GUID2 = b.GUID2
> WHERE a.GUID1 = b.GUID1
> AND c.type = 1
> AND @date BETWEEN b.DateFrom AND b.DateTo)

BE CAREFUL WITH THIS!!

One thing that could happen from this UPDATE is that it sets Field =
NULL for rows that are untouched by the UPDATEs in the original post.
That happens if there are rows in the table being UPDATEd that do not
have matches in the subquery. The FROM/JOIN prevents that in the
original versions. I don't know if any such non-matching rows exist,
but it certainly seems possible with the date range and type tests
dropping rows from the subquery.

I believe this query would show if the original UPDATEs using FROM
result in the same row updated more than once, without possible impact
on the data.

SELECT A.PrimaryKey, count(*)
FROM A
JOIN B
ON A.GUID1 = B.GUID1
JOIN C
ON B.GUID2 = B.GUID2
WHERE C.type = 1
AND @date BETWEEN B.DateFrom AND B.DateTo)
GROUP BY A.PrimaryKey
HAVING COUNT(*) > 1

Roy Harvey
Beacon Falls, CT

Re: Strange performance issue with UPDATE FROM

am 27.06.2007 02:27:42 von Joe Celko

I hope you know better than to use GUIDs in an RDBMS except for
replication, never to use reserved words like "date" for data element
names or vague names like "type" -- the basic ISO-11179 rules, etc.
You also seem to confuse fields and columns, but let's skip the signs
of poor SQL practices for now.

If you rewrite this in statement into Standard SQL, you will see if
the join returns multiple rows instead of a scalar value. That would
let us know that the schema has serious design problems. The illegal
syntax you used can do multiple updates on each row; talk to an old
Sybase programmer about this problem.

UPDATE A SET field1
= (SELECT C.field1 FROM B, C
WHERE A.guid1 = B.guid1
AND B.guid2 = C.guid2 AND C.somekind_type = 1
AND @my_date BETWEEN B.start_date AND B.end_date);

Re: Strange performance issue with UPDATE FROM

am 27.06.2007 09:31:08 von Erland Sommarskog

Roy Harvey (roy_harvey@snet.net) writes:
> On Tue, 26 Jun 2007 21:46:15 +0000 (UTC), Erland Sommarskog
> wrote:
>
>>One problem with UPDATE FROM is that you can update the same row
>>several times if your join conditions are not unique. What happens if
>>you run:
>>
>> UPDATE a
>> SET Field = (SELECT c.Field1
>> FROM c
>> JOIN b ON c.GUID2 = b.GUID2
>> WHERE a.GUID1 = b.GUID1
>> AND c.type = 1
>> AND @date BETWEEN b.DateFrom AND b.DateTo)
>
> BE CAREFUL WITH THIS!!
>
> One thing that could happen from this UPDATE is that it sets Field =
> NULL for rows that are untouched by the UPDATEs in the original post.

Right, Roy. I didn't add a WHERE clause, because it was more intended as a
test to see if the query would work at all. But I should have included the
warning.

--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downlo ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books .mspx

Re: Strange performance issue with UPDATE FROM

am 27.06.2007 15:40:33 von Richard

On Jun 27, 2:27 am, --CELKO-- wrote:
> I hope you know better than to use GUIDs in an RDBMS except for
> replication, never to use reserved words like "date" for data element
> names or vague names like "type" -- the basic ISO-11179 rules, etc.
> You also seem to confuse fields and columns, but let's skip the signs
> of poor SQL practices for now.
>

Well, yes, I actually do know better. The columns, variables and
tables in the query are renamed as I don't want to post production
code on the Internet. Also excuse the mixup between fields and
columns, I'm not a native English speaker.

One big problem (as i see it, and I'm by no means a SQL expert) is
that the db in question uses uniqueidentifier primary keys with
clustered indexes on those almost EVERYWHERE, and there is nothing I
can do to change that at the moment...Constructs like

FROM z
INNER JOIN a ON ..GUID = ..GUID
INNER JOIN b ON ..GUID = ..GUID
INNER JOIN c ON ..GUID = ..GUID
INNER JOIN d ON ..GUID = ..GUID
LEFT OUTER JOIN eON ..GUID = ..GUID
AND VERSION = (
SELECT MAX(VERSION) FROM f WHERE ..GUID = ..GUID)

make the queries run painfuly slow.

So the question is, is there ANYTHING I can do to optimize this type
of queries or is a redesign the only thing that would help?

@Erland:
I use MS SQL 2000 server and @date is a local variable =)

Re: Strange performance issue with UPDATE FROM

am 27.06.2007 19:39:28 von Hugo Kornelis

On Tue, 26 Jun 2007 17:27:42 -0700, --CELKO-- wrote:

> The illegal
>syntax you used can do multiple updates on each row; talk to an old
>Sybase programmer about this problem.

Hi Joe,

It's not illegal. It's non-standard, but in SQL Server (check the name
of this group!!), it's legal and documented. And in many cases much
faster than the ANSI-standard equivalent - an ommission I keep kicking
the optimizer team for until they finallly get it right, but until that
day we developers have to code our way around it.

>UPDATE A SET field1
> = (SELECT C.field1 FROM B, C
> WHERE A.guid1 = B.guid1
> AND B.guid2 = C.guid2 AND C.somekind_type = 1
> AND @my_date BETWEEN B.start_date AND B.end_date);

Oh boy, I really hope that Richard runs a quick test or two (oh wait,
one should be enough with such glaring errors) before deploying this
"alternative" in production.

I think that this snippet of code is a good contender for not only the
worst code ever posted to usenet, but also for the worst formatting. If
I were a book author, I'd now have to ask you for permission to include
it as a sample of bad query writing (though the reader probably wouldn't
believe it -hah! as if I even *could* come up with something this bad).

--
Hugo Kornelis, SQL Server MVP
My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis

Re: Strange performance issue with UPDATE FROM

am 27.06.2007 23:27:05 von Erland Sommarskog

Richard (nassegris@gmail.com) writes:
> One big problem (as i see it, and I'm by no means a SQL expert) is
> that the db in question uses uniqueidentifier primary keys with
> clustered indexes on those almost EVERYWHERE, and there is nothing I
> can do to change that at the moment...Constructs like

Clustered indexes on GUIDs requires a lot of skill in monitoring
fragmentation.

With the standard setup with a high fill factor, clustering on GUIDs is bad,
because you get page splits and fragmentation galore.

SQL Server MVP Greg Linwood suggested to me that clustering on GUIDs may
still be good for INSERT performance, if you create the indexes with a low
fill factor, say 50%. Now when you add new rows, there are good chance
that there is a hole to plug into. When the table starts to fill up, you
need to reindex again. But this strategy requires careful planning, and is
nothing for the armchair DBA.

If you are stuck with these clustered indexes, make you sure you set up
a reindexing job that runs regularly, and you should probably aim at
a lower fill factor. If not 50%, maybe 75-80%. It depends a bit how
big the INSERT frequency is. And use DBCC SHOWCONTIG to monitor
fragmentation.

> FROM z
> INNER JOIN a ON ..GUID = ..GUID
> INNER JOIN b ON ..GUID = ..GUID
> INNER JOIN c ON ..GUID = ..GUID
> INNER JOIN d ON ..GUID = ..GUID
> LEFT OUTER JOIN eON ..GUID = ..GUID
> AND VERSION = (
> SELECT MAX(VERSION) FROM f WHERE ..GUID = ..GUID)
>
> make the queries run painfuly slow.
>
> So the question is, is there ANYTHING I can do to optimize this type
> of queries or is a redesign the only thing that would help?

With the information you have posted, it's impossible to tell. But
I would at least give defragmentation with DBCC DBREINDEX a chance
first if DBCC SHOWCONTIG show horrendeous numbers.

You could also consider adding covering non-clustered index on tables
where only a few columns of many are involved in the query.

--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downlo ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books .mspx

Re: Strange performance issue with UPDATE FROM

am 28.06.2007 21:51:19 von Joe Celko

>> Well, yes, I actually do know better. The columns, variables and tables in the query are renamed as I don't want to post production code on the Internet. <<

Understood. But look at how many production DBs posted here have such
flaws.

>> Also excuse the mix up between fields and columns, I'm not a native English speaker. <<

And that means that your English is probably better than a native
English speaker :) But my objection is not English; it is RDBMS
versus File Systems. A big problem I see when I teach SQL is students
using SQL as if it were a sequential file system -- no constraints,
improper data types. no DRI actions, depending on applications to do
what DDL should do, etc.

>> One big problem (as i see it, and I'm by no means a SQL expert) is that the db in question uses uniqueidentifier primary keys with clustered indexes on those almost EVERYWHERE,.. <<

You are doing very well for an amateur :) Yes, this is a major
problem and not just for performance. A uniqueidentifier cannot be a
key in a properly designed RDBMS by definition -- it is an attribute
of the hardware and not the data model. You cannot verify it with a
trusted external source, so you have no data integrity. And it is
bitch to write them out without making an error.

The programmers who do this are trying to mimic pointer chains and
build a linked list in SQL. They missed the whole idea of RDBMS.

>> So the question is, is there ANYTHING I can do to optimize this type of queries or is a redesign the only thing that would help? <<

Not much. Clustered indexes are not going to help with the random
nature of a uniqueidentifier. Perhaps the best thing you can do is
kill the guy that did this to you and prevent him from coding again.

Re: Strange performance issue with UPDATE FROM

am 28.06.2007 22:32:03 von Joe Celko

>> It's not illegal. It's non-standard, but in SQL Server (check the name of this group!!), it's legal and documented. And in many cases much faster than the ANSI-standard equivalent - an omission I keep kicking the optimizer team for until they finally get it right, but until that day we developers have to code our way around it. <<

Okay, would accept "proprietary and in total violation of the
programming model in ANSI/ISO Standard SQL and RDBMS concepts" instead
of "illegal"?

>> Oh boy, I really hope that Richard runs a quick test or two (oh wait, one should be enough with such glaring errors) before deploying this alternative" in production. <<

I am not happy with the idea that one value is being replicated among
multiple tables in the same schema. I have the feeling that we should
not be doing this UPDATE at all. But I have no DDL, so the best I can
do is translate from dialect to Standard directly.

>> I think that this snippet of code is a good contender for not only the worst code ever posted to Usenet, but also for the worst formatting. <<

Hey, I am not even close the worst code ever posted to Usenet!! You
ought to see the stuff people send me in emails :) I just finished
one "quickie Email consult" for an old friend where we removed 70-75%
of the lines of code and reduced the time from 98 minutes to 4 minutes
for time-to-completion for a complex report using CASE expressions
instead of sub-queries. It took three emails and ~15 minutes of my
time. If I could re-do the schema design, I could get it down to 1
minute or less.

What I was not expecting in that exercise was that replacing INNER
JOIN operators with the WHERE clause syntax improved performance in
DB2. I have no idea why, but perhaps their optimizer has a bias in
ordering the tables. The infixed JOIN operators are required to
behave as if they are executed in left-to-right order, while the old
"FROM.. WHERE.." syntax is not so required.

Re: Strange performance issue with UPDATE FROM

am 28.06.2007 23:12:55 von Alex Kuznetsov

On Jun 26, 10:31 am, Richard wrote:
> Hello!
>
> I have this piece of SQL code:
>
> UPDATE a
> SET Field1 = c.Field1
> FROM a
> INNER JOIN b ON a.GUID1 = b.GUID1
> INNER JOIN c ON b.GUID2 = c.GUID2
> WHERE c.Type = 1
> AND @date BETWEEN b.DateFrom AND b.DateTo
>
> This query takes hours to complete.
>
> Now while trying to find out what's causing the poor performance (it
> surely looks simple enough!) I've rewritten it to use temp tables:
>
> SELECT a.GUID1, a.Field1, c.Type, b.DateFrom, b.DateTo INTO #temptable
> FROM a
> INNER JOIN b ON a.GUID1 = b.GUID1
> INNER JOIN c ON b.GUID2 = c.GUID2
> WHERE c.Type = 1
> AND @date BETWEEN b.DateFrom AND b.DateTo
>
> UPDATE a SET Field1 = subsel.Field1
> FROM (SELECT * FROM #temptable) AS subsel
> WHERE subsel.GUID1 = a.GUID1
>
> Now it completes in 10 seconds.
>
> My question is why? Am I wrong in saying that the two batches above
> produce same results? Is there something I've missed about the UPDATE
> FROM syntax? Why would the first query perform THAT poorly?
>
> Table sizes:
> a: 24k rows
> b: 268k rows
> c: 260k rows
>
> GUIDs are of type uniqueidentifier.
>
> Any answers appreciated!
>
> Regards,
> // Richard

Not arguing with other party on correctness/standards etc., when you
create a temp table, you get statistics on it. So them optimizer has a
better estimate of number of rows to modify and may choose a better
plan.

http://sqlserver-tips.blogspot.com/