Millions of Delete Statements

Millions of Delete Statements

am 31.10.2007 23:08:35 von halftime57

I have a very large DB (>300GB). Each day I receive a file containing
an average of 2 million individual delete statements. The delete
statements all delete based on full primary key so only an @@rowcount
value of 0 or 1 is possible for each statement. The deletes can be
from any one of several hundred tables in the database.

Right now I'm just cursoring through that collection of delete
statements and executing them one at a time as dynamic sql (since I
get the whole SQL statement I don't know of another way to do it). It
takes 1-2 hours to complete, depending on other traffic on the host.
Batching them into a single trx gives me better performance, but I
lose the ability to know which statement within the transaction failed
if any in the batch do, or which ones come back with a zero row count
(i.e., the row wasn't there, which is information I need to capture).
I keep thinking there's an easy, elegant solution, but it's eluding me
so I thought I'd ask. My undying gratitude to anyone who can help.

Re: Millions of Delete Statements

am 31.10.2007 23:20:32 von David Portas

wrote in message
news:1193868515.262136.223710@k35g2000prh.googlegroups.com.. .
>I have a very large DB (>300GB). Each day I receive a file containing
> an average of 2 million individual delete statements. The delete
> statements all delete based on full primary key so only an @@rowcount
> value of 0 or 1 is possible for each statement. The deletes can be
> from any one of several hundred tables in the database.
>
> Right now I'm just cursoring through that collection of delete
> statements and executing them one at a time as dynamic sql (since I
> get the whole SQL statement I don't know of another way to do it). It
> takes 1-2 hours to complete, depending on other traffic on the host.
> Batching them into a single trx gives me better performance, but I
> lose the ability to know which statement within the transaction failed
> if any in the batch do, or which ones come back with a zero row count
> (i.e., the row wasn't there, which is information I need to capture).
> I keep thinking there's an easy, elegant solution, but it's eluding me
> so I thought I'd ask. My undying gratitude to anyone who can help.
>

Instead of executing each one individually it would typically be better to
have one DELETE statement for many rows. Maybe you can change whatever
produces the DELETE statements to do it that way. Identifying any possible
errors could be a separate step. For example a SELECT statement to identify
the rows that would otherwise cause foreign key violations when deleted and
then exclude those rows from the actual DELETE.

--
David Portas

Re: Millions of Delete Statements

am 01.11.2007 00:03:21 von Erland Sommarskog

halftime57@gmail.com (halftime57@gmail.com) writes:
> I have a very large DB (>300GB). Each day I receive a file containing
> an average of 2 million individual delete statements. The delete
> statements all delete based on full primary key so only an @@rowcount
> value of 0 or 1 is possible for each statement. The deletes can be
> from any one of several hundred tables in the database.
>
> Right now I'm just cursoring through that collection of delete
> statements and executing them one at a time as dynamic sql (since I
> get the whole SQL statement I don't know of another way to do it). It
> takes 1-2 hours to complete, depending on other traffic on the host.
> Batching them into a single trx gives me better performance, but I
> lose the ability to know which statement within the transaction failed
> if any in the batch do, or which ones come back with a zero row count
> (i.e., the row wasn't there, which is information I need to capture).
> I keep thinking there's an easy, elegant solution, but it's eluding me
> so I thought I'd ask. My undying gratitude to anyone who can help.

As David said, the best would be to change the source that emits
this beast.

One thing that worries me when you talk about transactions, is that some
deletes may be related. Say for instance that there are DELETE statements
to delete an old order and the order lines. What if for some reason
only some of the order lines are deleted?

If the process that generates all these statements cannot be changed,
the only way out I can see is to parse them and build statements
that delete many rows at a time. Then again, if you need to capture
that some DELETES did not hit any rows, you would also have to
rewrite them in to SELECTs to see which rows that actually are in
the database. Depending on how the file looks like, there could be
quite some development cost for this.

As for the transaction length, I think you could take a middle path and
delete say 100 at a time. But as I think of it, it becomes quite messy.
To make it easy you could submit one statement at a time and capture
row count, and trap any error. But the errors can be of different kind.
If it's an FK violation, you can just chalk up one error and move on.
But if the error is a deadlock, you lost the previous deletes in the
batch and need to redo them. This can be managed, but the level of
complexity becomes high.

--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downlo ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books .mspx

Re: Millions of Delete Statements

am 01.11.2007 02:21:44 von Roy Harvey

My first reaction was sympathy. What a mess to have to handle! The
rest is just a few thoughts off the top of my head.

As everyone else has said the proper place to fix this is at the
source. But I am betting that is not possible.

You mention a cursor. Are you loading these DELETE commands into a
table and cursoring through them in a script or stored procedure? That
would certainly be better than reading the file line by line and
executing them that way.

I think Erland raised the key question:

>One thing that worries me when you talk about transactions, is that some
>deletes may be related.

If the collection only works when executed IN ORDER that seriously
limits your options. Others have mentioned batching the commands, a
good idea, so I will throw out one that is less good (executing them
out of order) just to cover more bases. Run two procs, each with
their own cursor, one for the odd-numbered rows and the other for the
even-numbered rows. Or three, or four, whatever. If you are working
from a staging table of DELETE commands it is easy enough for it to
have an IDENTITY column to work from. Perhaps contention among the
cursors would be a problem, but it you are desperate enough it might
be worth a try.

Or if you could work out a good way to pick out the table name you
might set up one cursor for each table. If there is a hierarchy to
them you could start the process for deleting from the bottom-most
table first, then start each succeeding upper table after a pause,
running several at a time.

All of which is neither easy nor elegant, unfortunately.

Roy Harvey
Beacon Falls, CT

On Wed, 31 Oct 2007 15:08:35 -0700, "halftime57@gmail.com"
wrote:

>I have a very large DB (>300GB). Each day I receive a file containing
>an average of 2 million individual delete statements. The delete
>statements all delete based on full primary key so only an @@rowcount
>value of 0 or 1 is possible for each statement. The deletes can be
>from any one of several hundred tables in the database.
>
>Right now I'm just cursoring through that collection of delete
>statements and executing them one at a time as dynamic sql (since I
>get the whole SQL statement I don't know of another way to do it). It
>takes 1-2 hours to complete, depending on other traffic on the host.
>Batching them into a single trx gives me better performance, but I
>lose the ability to know which statement within the transaction failed
>if any in the batch do, or which ones come back with a zero row count
>(i.e., the row wasn't there, which is information I need to capture).
>I keep thinking there's an easy, elegant solution, but it's eluding me
>so I thought I'd ask. My undying gratitude to anyone who can help.

Re: Millions of Delete Statements

am 01.11.2007 04:36:00 von jhofmeyr

On Nov 1, 1:21 am, "Roy Harvey (SQL Server MVP)"
wrote:
> My first reaction was sympathy. What a mess to have to handle! The
> rest is just a few thoughts off the top of my head.
>
> As everyone else has said the proper place to fix this is at the
> source. But I am betting that is not possible.
>
> You mention a cursor. Are you loading these DELETE commands into a
> table and cursoring through them in a script or stored procedure? That
> would certainly be better than reading the file line by line and
> executing them that way.
>
> I think Erland raised the key question:
>
> >One thing that worries me when you talk about transactions, is that some
> >deletes may be related.
>
> If the collection only works when executed IN ORDER that seriously
> limits your options. Others have mentioned batching the commands, a
> good idea, so I will throw out one that is less good (executing them
> out of order) just to cover more bases. Run two procs, each with
> their own cursor, one for the odd-numbered rows and the other for the
> even-numbered rows. Or three, or four, whatever. If you are working
> from a staging table of DELETE commands it is easy enough for it to
> have an IDENTITY column to work from. Perhaps contention among the
> cursors would be a problem, but it you are desperate enough it might
> be worth a try.
>
> Or if you could work out a good way to pick out the table name you
> might set up one cursor for each table. If there is a hierarchy to
> them you could start the process for deleting from the bottom-most
> table first, then start each succeeding upper table after a pause,
> running several at a time.
>
> All of which is neither easy nor elegant, unfortunately.
>
> Roy Harvey
> Beacon Falls, CT
>
> On Wed, 31 Oct 2007 15:08:35 -0700, "halftim...@gmail.com"
>
>
>
> wrote:
> >I have a very large DB (>300GB). Each day I receive a file containing
> >an average of 2 million individual delete statements. The delete
> >statements all delete based on full primary key so only an @@rowcount
> >value of 0 or 1 is possible for each statement. The deletes can be
> >from any one of several hundred tables in the database.
>
> >Right now I'm just cursoring through that collection of delete
> >statements and executing them one at a time as dynamic sql (since I
> >get the whole SQL statement I don't know of another way to do it). It
> >takes 1-2 hours to complete, depending on other traffic on the host.
> >Batching them into a single trx gives me better performance, but I
> >lose the ability to know which statement within the transaction failed
> >if any in the batch do, or which ones come back with a zero row count
> >(i.e., the row wasn't there, which is information I need to capture).
> >I keep thinking there's an easy, elegant solution, but it's eluding me
> >so I thought I'd ask. My undying gratitude to anyone who can help.- Hide quoted text -
>
> - Show quoted text -

Hi Halftime57,

As the other guys have said, this sounds like a nasty problem that
would be best resolved at the source .. having said that, I know that
this is often not possible :(
If you're working on SQL Server 2005 and changing the source is not an
option, you could try using SSIS to do the deletes for you. This has
the added advantage of providing out-the-box row-level error handling.

I created a quick package to test the concept which basically
replicates what you're doing now. Basically, I read the delete
statements from a file as a single column row per statement (depending
on how your file is structured this could take a bit of work) and
store the resulting rows in an ADO recordset. I then used this
recordset to drive a ForEach loop which used each statement (row) to
run a number of Execute SQL tasks with the statement derived from the
row. So for example the first task statement was set to replace the
"DELETE " part of the statement with "SELECT COUNT(*) AS RecCount ",
and the 2nd task either deleted the row or logged it as missing based
on the results of the 1st task.

As I said, this particular example only mimics your current
behaviour. However, once you have the statements in the correct
format, you can pretty much do anything with them in your package
(e.g. you could split the statements by table and concatenate all the
WHERE clauses to batch up the deletes - just don't forget the 4000
character limit to string variables!)

Another thing I tried was to parse each row using a script
transformation to pull out the table name and key column value(s).
The script task was configured with a separate output for each table,
and each output had the same columns as the table key. Once all these
have were configured, it was a fairly simple task to split up the rows
by table name and stick the key values into the correct columns.

Downstream from the transformation you could do any number of logging
and validation tasks (rowcounts, check if the key exists, etc). I
simply ended each data stream with a parameterised DELETE statement in
an OLE DB Command task.

This solution should run quicker than looping through each row and
setting SQL statements dynamically but takes longer to develop,
Hopefully it gives you an idea of what's possible :)

Good luck!
J

Re: Millions of Delete Statements

am 01.11.2007 19:39:47 von halftime57

On Oct 31, 6:21 pm, "Roy Harvey (SQL Server MVP)"
wrote:
> My first reaction was sympathy. What a mess to have to handle! The
> rest is just a few thoughts off the top of my head.
>
> As everyone else has said the proper place to fix this is at the
> source. But I am betting that is not possible.
>
> You mention a cursor. Are you loading these DELETE commands into a
> table and cursoring through them in a script or stored procedure? That
> would certainly be better than reading the file line by line and
> executing them that way.
>
> I think Erland raised the key question:
>
> >One thing that worries me when you talk about transactions, is that some
> >deletes may be related.
>
> If the collection only works when executed IN ORDER that seriously
> limits your options. Others have mentioned batching the commands, a
> good idea, so I will throw out one that is less good (executing them
> out of order) just to cover more bases. Run two procs, each with
> their own cursor, one for the odd-numbered rows and the other for the
> even-numbered rows. Or three, or four, whatever. If you are working
> from a staging table of DELETE commands it is easy enough for it to
> have an IDENTITY column to work from. Perhaps contention among the
> cursors would be a problem, but it you are desperate enough it might
> be worth a try.
>
> Or if you could work out a good way to pick out the table name you
> might set up one cursor for each table. If there is a hierarchy to
> them you could start the process for deleting from the bottom-most
> table first, then start each succeeding upper table after a pause,
> running several at a time.
>
> All of which is neither easy nor elegant, unfortunately.
>
> Roy Harvey
> Beacon Falls, CT
>
> On Wed, 31 Oct 2007 15:08:35 -0700, "halftim...@gmail.com"
>
>
>
> wrote:
> >I have a very large DB (>300GB). Each day I receive a file containing
> >an average of 2 million individual delete statements. The delete
> >statements all delete based on full primary key so only an @@rowcount
> >value of 0 or 1 is possible for each statement. The deletes can be
> >from any one of several hundred tables in the database.
>
> >Right now I'm just cursoring through that collection of delete
> >statements and executing them one at a time as dynamic sql (since I
> >get the whole SQL statement I don't know of another way to do it). It
> >takes 1-2 hours to complete, depending on other traffic on the host.
> >Batching them into a single trx gives me better performance, but I
> >lose the ability to know which statement within the transaction failed
> >if any in the batch do, or which ones come back with a zero row count
> >(i.e., the row wasn't there, which is information I need to capture).
> >I keep thinking there's an easy, elegant solution, but it's eluding me
> >so I thought I'd ask. My undying gratitude to anyone who can help.- Hide quoted text -
>
> - Show quoted text -

Thanks to all! You guessed it Roy, the source file is immutable. I
don't control it and it would take an act of God (not to mention an
epoch and a king's ransom) to get it changed. I am bulk-loading the
delete statements into a table and cursoring through them. I think
your idea of doing multiple cursors with some kind of rownumber offset
is a great one! These statements can be executed in any order and
there are no foreign key constraints (again, not my design, but one
thrust upon me). The bulk load puts the individual delete statements
in with a rowID column, so I think I might have my solution. I know,
fixing it at the source would by far be the better way, but at some
point you just quit beating your head against immovable brick
walls. ;^

I very much appreciate the help everybody's chipped in with!

Re: Millions of Delete Statements

am 02.11.2007 17:49:38 von Joe Celko

>>.. the source file is immutable. I don't control it and it would take an act of God (not to mention an epoch and a king's ransom) to get it changed. <<

The fifth labor of Hercules was to clean the stables of King Augeas in
a single day. The Augean stables held thousands of animals and were
over a mile long. This story has a happy ending for three reasons: (1)
Hercules solved the problem in a clever way (2) Hercules got one tenth
of the cattle for his work (3) At the end of the story of the Labors
of Hercules, he got to kill the bastard that gave him this job.

>> I am bulk-loading the delete statements into a table and cursoring through them. <<

Write a routine that will pull off the keys from the statements and
load them into a working table. This SQL and not BASIC! The
statement is then something like this:

DELETE FROM Foobar
WHERE Foobar.keycol
IN (SELECT keycol FROM WorkingTable);

or use an EXISTS() instead of an IN().

>> These statements can be executed in any order and there are no foreign key constraints. <<

You can add a clustered index to the working table. Do you need to
check for orphaned rows in what should have been a referenced table?
That can be a bitch, depending on the depth of the referencing.

If the system continues to grow, you will benefit from hashing instead
of indexing to find matches.

>> The bulk load puts the individual delete statements in with a rowID column, so I think I might have my solution. <<

Row_id? You mean like a 1950's magnetic tape file sequential record
number? So you can do a sequential tape merge in SQL with a cursor?
You might want to move to an RDBMS mindset instead. Make the
identifier the primary key of the working table.

>> I know, fixing it at the source would by far be the better way, but at some point you just quit beating your head against immovable brick walls. ;^ <<

1) Read the Hercules story again. Plot revenge.
2) Hire an expensive outside consultant who will tell management the
same thing. They will suddenly grow ears!