SQL question

Brian Kelsay bkelsay at comcast.net
Sat Aug 9 02:44:57 CDT 2003


KRFinch at dstsystems.com wrote:
> 
> Getting the dusty old DBA hat out of mothballs:
> 
> 1) It looks like you are having the thing go through the entire first table
> comparing every record with all of the records in the second table.  So,
> for every one of those 180 million records, you are comparing it against
> every one of those hundred or so other values, meaning that the server is
> having to do about 18 billion operations to do what you asked it to.  I
> have yet to see a DB server that could hit anything close to a billion
> queries an hour, even straight out of RAM, so this will obviously take more
> than 18 hours.  I've seen less complex queries run for more than 18 hours
> on tables with only around 3 million records, so I would guess that this
> query you have here should take at least the better part of a week to run.
> 
> I think you would be better served running a single query first with only
> one of the elements you want to delete, and seeing how long that takes to
> rifle through your 180 million records.  That should give you a reasonable
> gauge of how long things will take once you multiply that time by 100.
> Once you run that single query, you should also take a look at the
> performance characteristics of your database based on that run-through and
> see which processes need more RAM to handle the query more efficiently.
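
Just to make that concrete, I'd try something like this first (a sketch
only, using the table names from Brian's schema below; exact syntax
depends on the RDBMS, and 'SOME_BAD_MARK' is a stand-in for one real
value out of bad_marks):

    -- time one element against the big table before trying all ~100
    SELECT COUNT(*)
    FROM msg_elements
    WHERE indice IN (SELECT indice FROM messages
                     WHERE mark = 'SOME_BAD_MARK');

If counting the candidate rows for a single mark already takes an hour,
the full delete was never going to finish overnight.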

> Well, although this is not Linux-related, it could easily be.
> I have a somewhat large database (approx 180 million records)
> that I need to delete certain "sets" from based on a foreign
> key in a smaller table (approx 4 million records) and the
> records I want to delete match a field in a third table
> (very small < 100 records). I ran this delete query and it ran
> for 18 hours and still didn't finish. We had to kill it
> (because the server was swapping back and forth and came
> to a standstill because we only have a 1/2 gig of RAM).
> Anyway, I have two questions:
> 1) Why should it take 18 hours to run?  (The delete query should
> have deleted about 1/3 of the data in the database.)
> 2) Is there any way to have a single delete statement do periodic
> commits (as dangerous as that is)?
> 
> basically what I have is this:
> 
> table bad_marks (uid,mark)
> table messages  (indice,mark,...)
> table msg_elements (id,indice,...)
> 
> where bad_marks.mark -> messages.mark is one-to-many (possibly
>       one-to-zero); bad_marks.uid = PK
>       messages.indice -> msg_elements.indice is one-to-many;
>       messages.indice = PK, messages.mark is indexed
>       msg_elements.id = PK, msg_elements.indice is indexed
> and there is no referential integrity turned on (yet).
> 
> Thanks,
> Brian
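
For reference, I'd guess the delete Brian ran looked something like the
following.  This is only my reconstruction from the schema above, not
his actual statement:

    -- remove every msg_elements row whose message carries a bad mark
    DELETE FROM msg_elements
    WHERE indice IN (SELECT m.indice
                     FROM messages m, bad_marks b
                     WHERE m.mark = b.mark);

One statement like that is a single giant transaction, which may be
part of why the box ended up swapping.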

I'm not even a database wannabe yet, but I seem to have an opinion on
everything today (it sometimes helps to get a simplified look at
things, though).  So here goes.  Kevin said to test with one element,
which is in keeping with the scientific method, but take that a little
farther: test one element against one table, and limit it to a subset
of the rows (e.g. rows 1-1000, or even 1-100) to get a time gauge.
That cuts down your wait time, and you find out whether the query is
even working.  When you run the actual query for real, you may need to
limit it to maybe 100,000-200,000 rows at a time so you don't
completely cook your RAM; there's a sketch of that idea below.  You may
also want to look into the details of your RDBMS and see whether there
are optimizations you need to set for your specific amount of RAM.  An
NT 4 instructor once told me that SQL Server (version 5, I think he
said) would take all available RAM when active: your OS got what it
could on boot-up, and the rest went to SQL and got locked in somehow.
Sounded like a recipe for disaster to me.
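
The batching might look like this.  It's a sketch only: I'm assuming
the id column is a dense numeric key, and commit behavior varies by
RDBMS (on MySQL you could use DELETE ... LIMIT instead):

    -- pass 1: only touch the first slice of the big table
    DELETE FROM msg_elements
    WHERE id BETWEEN 1 AND 100000
      AND indice IN (SELECT m.indice
                     FROM messages m, bad_marks b
                     WHERE m.mark = b.mark);
    COMMIT;
    -- then id BETWEEN 100001 AND 200000, and so on

Each slice is its own transaction, so you get the periodic commits
Brian asked about, and a failure only rolls back the current slice.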

I'm reminded of a set of queries Hal Duston once told me about, which
he ran on a Sprint billing database.  Millions and millions of rows (I
was impressed).  He worked on the query for days or weeks, and once he
got permission to run it at night, he had it spawn child processes up
to a certain limit, so the CPUs on the servers running it were soaked,
pegged at 100%, from 6pm until about 6am; then, as each child finished,
it stopped creating new ones and the queue started to empty as 7am
approached.  He got a call at home about 6:30am from a worried server
admin wondering if the query would be done in time for the start of the
day.  Just then the CPUs started to come down to about 90%, then 75%;
by 6:45 they were around 50%, and it finished by 7am on the nose.  It
was a thrilling story the way Hal told it (I've probably goofed up the
details, but you should get the gist).  I was just excited by the clean
way he planned it; since he had time to craft it well, all went as
planned.  Have him tell you about it sometime.

One other thing: do you have a backup copy of the data you are working
with?  That sounds like a good idea, since you said it took you a long
time to extract the data; see the sketch below for one cheap variant.
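
If a full dump is too expensive, you could at least keep a copy of just
the rows the delete will remove (CREATE TABLE ... AS SELECT is one
common spelling; some databases use SELECT ... INTO instead):

    -- stash the doomed rows somewhere restorable before deleting
    CREATE TABLE msg_elements_removed AS
    SELECT e.*
    FROM msg_elements e
    WHERE e.indice IN (SELECT m.indice
                       FROM messages m, bad_marks b
                       WHERE m.mark = b.mark);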
Brian
-- 
A Computer without Microsoft is like a chocolate cake without mustard.
-as seen on IRC



