Thursday, February 21, 2008

Databases differ. Problem solving approaches STILL don't.

Still in the SQL Server world, but this project is at its last step, so there's a chance I'll be back on Oracle in the near future.

A friend emailed me a useful link for equivalent functions between database products.
Given that anyone can add/edit entries in there, it should be more up-to-date than my previous resource here (which I still recommend for a read-through).

On the theme of equivalence, I've been looking at the SQL Server 2005 equivalent of an EXPLAIN PLAN.

I've had another issue with poor query performance, but this time with a SELECT.
In Oracle I would have nosed at the EXPLAIN PLAN, and the SQL Server equivalent seems to be "SET SHOWPLAN_ALL ON". Rather than generating a plan for one specific query though, once this is set in SQL Server the subsequent queries return their plans rather than their results, until you turn the setting off.
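
As a quick illustration (the query against sys.objects is just a placeholder of my own), the SET statement has to be the only statement in its batch, so in Management Studio or sqlcmd a session looks something like this:

SET SHOWPLAN_ALL ON;
GO
-- From here on, queries return their estimated plan as a rowset instead of running
SELECT COUNT(*) FROM sys.objects;
GO
SET SHOWPLAN_ALL OFF;
GO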

I'm old-school, and like to look at text rather than pretty pictures. A picture being worth a thousand words is only valid when you start with the picture. If you start with a description, a dozen different artists will give a dozen interpretations, and some will be worse than others. With the description, I can make up my own mind what's important. Apparently there is an XML version of the output, but I haven't experimented with that yet, so don't know how readable that gets or can be made.

The query was a simple join between two tables; one was a small load table (a few thousand rows) and the other a much larger 'target' table (several million rows). I expected a scan of the small table, with a Nested Loop containing an index scan and table access for the target table. It can be useful to start with an expectation of a 'good' plan: if the optimizer comes up with something different, you take a look and see where your assumptions differ from the optimizer's, and who is 'right'.

In this case I got pretty much what I expected.



--Filter(WHERE:....
  --Parallelism(Gather Streams)
    --Nested Loops(Inner Join, ...
      --Nested Loops(Inner Join, ...
        --Compute Scalar(...
          --Table Scan(OBJECT:LOAD_TABLE....
        --Index Seek(OBJECT:TARGET_TABLE_INDEX...
      --RID Lookup(OBJECT:TARGET_TABLE...

I could see (using the same analysis SQL as before) that it was doing a lot more work than I would have expected with that plan. The table scan shouldn't have been responsible, as it was a small table and only scanned once. The RID lookup on the target shouldn't have been responsible either: if there was an index match, it would have matched only one row, and if the index was matching lots of rows for this data, it would have been using the wrong index (or the right index in the wrong way).

So I took a harder look at the Index Seek line. It was the right index. It had a 'seek' component and a 'where' component, which I equated to the 'access predicate' and 'filter predicate' in an Oracle plan.

Index Seek(OBJECT:(TARG_TAB),
  SEEK:(TARG_TAB.COL_1=[Expr1008] AND TARG_TAB.COL_2=[Expr1009]),
  WHERE:([Expr1010]=TARG_TAB.COL_3 AND
    LOAD_TAB.COL_L1=CONVERT_IMPLICIT(numeric(15,0),TARG_TAB.COL_4,0) AND
    LOAD_TAB.COL_L2=CONVERT_IMPLICIT(numeric(15,0),TARG_TAB.COL_5,0))

The cause was obvious in this case. The implicit conversion, forced by joining the load table's numeric columns to the target table's character columns, meant that the seek only used the two leading columns of the index (which had poor selectivity) and not the third (and subsequent) columns, which would have made the index really useful.

I corrected the join to explicitly convert the number in the load table to a varchar, and the query that had taken an hour ran in just seconds.
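
For the record, the shape of the fix was something like the sketch below. The column names are the (anonymised) ones from the plan above, but the other join columns, the data types and the VARCHAR length are my own assumptions; the point is that the explicit CAST sits on the load table side, so the indexed character columns on the target are compared directly:

SELECT t.*
FROM LOAD_TAB l
JOIN TARG_TAB t
  ON  t.COL_1 = l.COL_A                          -- leading index columns (assumed names)
  AND t.COL_2 = l.COL_B
  AND t.COL_3 = l.COL_C
  AND t.COL_4 = CAST(l.COL_L1 AS VARCHAR(15))    -- was l.COL_L1 = t.COL_4, which forced
  AND t.COL_5 = CAST(l.COL_L2 AS VARCHAR(15))    -- CONVERT_IMPLICIT on the target columns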


Thursday, February 14, 2008

Databases differ. Problem solving approaches don't.

There's been no activity here in a while, because I haven't done much in the Oracle area. And no-one tagged me in the blog-tag game (so the story about my Dad and Princess Anne stays secret).

This entry is more SQL Server than Oracle, but it is generic in some ways, and it's also got the closest I've found to v$sql in SQL Server 2005.

I've been working with a tool that generates procedures for moving data around. You know the sort of thing: press the button and away it goes, and you end up with a plain vanilla script with a big SELECT cursor loop, an update and, if no rows were updated, an insert. It commits every thousand records and, when it inserts more than a thousand records, it gathers stats. Not at the end, but after that first thousand, so if you insert ten million rows the stats reflect only the first thousand. Odd, but there you go.
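
To make that shape concrete, here is a hypothetical sketch of the kind of loop such a tool generates. Every table, column and variable name is invented and the stats-gathering step is left out; it is only meant to show the update-then-insert-per-row pattern with a commit every thousand records:

DECLARE @key INT, @val VARCHAR(50), @rows INT;
SET @rows = 0;

DECLARE load_cur CURSOR FOR
    SELECT key_col, val_col FROM LOAD_TABLE;

BEGIN TRAN;
OPEN load_cur;
FETCH NEXT FROM load_cur INTO @key, @val;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Try the update first; if it touched nothing, insert instead
    UPDATE TARGET_TABLE SET val_col = @val WHERE key_col = @key;
    IF @@ROWCOUNT = 0
        INSERT INTO TARGET_TABLE (key_col, val_col) VALUES (@key, @val);

    SET @rows = @rows + 1;
    IF @rows % 1000 = 0
    BEGIN
        COMMIT;         -- commit every thousand records...
        BEGIN TRAN;     -- ...and start a new transaction
    END;

    FETCH NEXT FROM load_cur INTO @key, @val;
END;

CLOSE load_cur;
DEALLOCATE load_cur;
COMMIT;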

Anyway, it's just the sort of GUI-button-generated code that Noons remarked upon.
Guess what? It ran pretty slowly when dealing with millions of records.

The initial temptation was to throw it away and write custom code. While there are definite benefits to that, there's also the downside that you lose the automatic generation capability for the future, and since that was one reason for getting the tool, we needed a good reason to switch.

So in the spirit of BAAG, we took a different approach.

Firstly, we stuck in a bunch of debug messages to indicate how long each batch of 250,000 records took. Enough messages to see how it was going, not so many that we'd be overwhelmed.
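
Purely as an illustration (the variable names are mine, not the tool's), the sort of message we stuck into the loop looks like the snippet below. RAISERROR at severity 0 with the NOWAIT option flushes the message to the client straight away, whereas PRINT output can sit in the buffer until the batch finishes:

DECLARE @rows_done INT, @msg VARCHAR(200);
SET @rows_done = 250000;    -- in the real loop this is the running row count

IF @rows_done % 250000 = 0
BEGIN
    SET @msg = CONVERT(VARCHAR(30), GETDATE(), 120)
             + '  processed ' + CAST(@rows_done AS VARCHAR(12)) + ' rows so far';
    RAISERROR (@msg, 0, 1) WITH NOWAIT;
END;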

The result was that it started fast enough (about 1 minute per batch) and consistently decelerated (to 10-15 minutes a batch after several million records). The first suspect was the update statement. Initially the table would be empty, so there would be no updates and it should find that out quickly. As data was added, the update query would take longer even if no rows were found to update. In fact we knew that the update would never match rows, so one option would have been to remove it altogether.

But it was still a guess that the update was the problem. Or at least a hypothesis, as it was something we could test. So, with sufficient googling, I came up with a query like this:

select qs.creation_time, qs.last_execution_time, qs.execution_count,
       qs.total_logical_writes, qs.total_logical_reads, qs.total_physical_reads,
       (qs.total_worker_time/1000000) tot_wrk_sec,
       (qs.total_elapsed_time/1000000) tot_secs,
       SUBSTRING(st.text, (qs.statement_start_offset/2) + 1,
           ((CASE qs.statement_end_offset
               WHEN -1 THEN DATALENGTH(st.text)
               ELSE qs.statement_end_offset
             END - qs.statement_start_offset)/2) + 1) as statement_text
from sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) as st
where qs.last_execution_time > dateadd(Minute,-20,getdate())
order by qs.execution_count, qs.total_logical_writes, qs.total_logical_reads desc

which is sort of like looking at SQL Server's version of v$sql.

It indicated that the time was pretty much all taken up in the insert component. The update was innocent. Actually, given that B-Tree indexes are pretty good at finding stuff, the 'slowing down' should have eased up fairly quickly, going from one hit to two/three for each index check as the 'depth' of the index grew. And even doing three times as many reads for each index check wouldn't have explained a ten-fold decrease in performance.

The underlying culprit was half a dozen indexes on the table on low-cardinality columns. Whether those indexes are actually useful for the application is a wider question, but for this particular purpose it was simple enough to drop them beforehand and recreate them at the end.
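
Again as a sketch only (the index, table and column names below are made up rather than the real ones), the wrap-around amounts to:

DROP INDEX IX_TARGET_STATUS ON TARGET_TABLE;
DROP INDEX IX_TARGET_REGION ON TARGET_TABLE;

-- ... run the generated load procedure here ...

CREATE INDEX IX_TARGET_STATUS ON TARGET_TABLE (STATUS_CODE);
CREATE INDEX IX_TARGET_REGION ON TARGET_TABLE (REGION_CODE);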

Result - run time cut to about 15% of the original, even including the time taken to rebuild the indexes.

And while the generated code clearly isn't optimal, it is adequate (at least in this case).