Thursday, December 11, 2008

The Certainty Bottleneck (and ORMs)

Oraclenerd has opened that can of worms about OO, ORMs and Databases.

I'm of the school that says there are some (a lot?) of application developers who don't understand the role of the database, especially those who call it a persistence layer.
They complain that "the database is the bottleneck". Guess what guys. That's the whole point of the database. It is the SINGLE point of reference that stops your Application Servers tripping over each other with their processing. For there to be one correct data value, it can exist in only one place. It's the Certainty principle and the database is the Certainty Layer. And it doesn't matter how you persist your data, if you want a single point of truth, it will be the single bottleneck of truth.

When you just use an RDBMS to persist data, rather than supporting concurrency, consistency and integrity, you've missed the point. You can use XML, flat files or anything for persistence. In Oracle, persistence (or its ACID cousin, DURABILITY) is ultimately managed through a flat file, the redo log. Persistence isn't the hard bit.

There's a whole bunch of Oracle that is simply about fast access to that persisted data. Oracle data files are about structuring the data for fast access. Indexes and the buffer cache are all about performance. There's also a chunk of Oracle about sharing data, especially in regard to locking data.

But what about the rest of ACID? Atomicity is ALL about business processes. A process either succeeds or fails. And, to my mind, that is the true object-relational mapping. The OO layer changes an object, and atomicity turns that object-level change into a consistent set of changes to the underlying data (note: data, not tables). Consistency is the flip-side of that. It ensures that a view of data is an accurate representation of the truth at a point in time. If your 'order' object has a bunch of ordered items, consistency gives you back that order with all its items.
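To make that concrete, here's a minimal sketch in Python with SQLite (the orders/order_items schema is invented, not from any real application) of atomicity turning an object-level change into an all-or-nothing set of row changes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table orders (order_id integer primary key, customer text not null);
    create table order_items (order_id integer not null references orders,
                              item text not null);
""")

def save_order(customer, items):
    # One transaction = one atomic business operation
    try:
        with conn:  # commits on success, rolls back on any exception
            cur = conn.execute("insert into orders (customer) values (?)", (customer,))
            oid = cur.lastrowid
            for item in items:
                conn.execute("insert into order_items values (?, ?)", (oid, item))
    except sqlite3.Error:
        pass  # the order and all of its items vanish together

save_order("Alice", ["widget", "gadget"])
save_order("Bob", ["widget", None])   # second item violates NOT NULL
print(conn.execute("select count(*) from orders").fetchone()[0])       # 1
print(conn.execute("select count(*) from order_items").fetchone()[0])  # 2
```

Either the order and all of its items are recorded, or none of them are; the application never sees a half-saved order.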

Durability, atomicity and consistency are aspects of databases I have no problem with being 'invisible' to the application developer. A lot of Oracle developers don't know about REDO, UNDO and SCNs. I almost agree with "Persisting something to a database is really a low-level concern that as an application developer I shouldn't even have to be bothered with". From the application point of view, those parts should just work. I say "almost" because developers (database and application developers alike) need some understanding of the mechanism and its consequences. Anything that takes on the responsibility of consistency and atomicity BECOMES part of the database. And it doesn't have to be an RDBMS. I can see XML or OODBMS being able to do that.

The main advantage I see in the relational model is integrity. The normalisation process forces data to be stored in a way that embodies certain rules, specifically business rules (albeit not every business rule). Application logic can only enforce business rules for incoming data, and maybe filter out or raise alerts about non-conforming data already recorded.
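As a small illustration of rules living in the data model itself, here's a sketch in Python with SQLite (the payments/currencies schema is made up): the check and foreign key constraints reject non-conforming data no matter which application, script or ad-hoc session tries to write it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite needs FK enforcement enabled
conn.executescript("""
    create table currencies (code text primary key);
    create table payments (amount real not null check (amount > 0),
                           currency text not null references currencies);
""")
conn.execute("insert into currencies values ('AUD')")

def try_insert(amount, currency):
    try:
        with conn:
            conn.execute("insert into payments values (?, ?)", (amount, currency))
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

print(try_insert(10.0, 'AUD'))   # ok
print(try_insert(-5.0, 'AUD'))   # rejected: violates the check constraint
print(try_insert(10.0, 'XXX'))   # rejected: no such currency
```

The application never even gets the chance to record a negative payment or an unknown currency; the rules are enforced at the one place all data must pass through.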

There are applications where the integrity enforced by a normalised database is secondary to performance requirements. Data warehouses fit in here. So do a lot of 'Internet scale' databases, though here performance is more about scalability. A blog post may say '20 comments' but no-one really cares if only 19 show up because the database with the comments had to be refreshed from a backup and is 5 seconds out of step with the database with the posts.

Unfortunately a lot of web app developers seem to be under the mistaken assumption that they work for Google, eBay or Amazon and that they have to be able to scale to 100 million users. So when the big boys say "we sacrificed integrity for performance", they think that is "best practice". It isn't; it is a compromise. Oracle charge lots of money for RAC and people pay it when they need that level of integrity.

For the majority of business applications, scalability requirements can be met without sacrificing the integrity benefits of the relational model. For some of those, cost is a significant factor in sacrificing integrity too. I am confident that in five years' time, changes within the RDBMS software will increase its ability to scale out (plus increased cost pressures from Open Source) and there will be LESS reason not to use an RDBMS. There are limits to how far RAC will scale when all nodes are sharing entire databases, but how about if only some tables are shared between some nodes, or with distributed databases with some over-arching recovery mechanism for consistency? There is nothing fundamentally broken about relational theory or relational databases. [Okay, that's debatable. You can certainly get into a situation where old data is so out of step with current business practices that it is hard to reconcile the two into a workable data model.]

I'll agree that SQL is a poor language for application logic. I once coded in SQL*Forms before we had PL/SQL. Not good.
PL/SQL is a massive improvement, especially with frameworks like Apex and Quest Code Tester, but it has its limits, and is fundamentally a procedural bridge to SQL. Java is poor at handling large volumes of persistent data. It relies on an API (JDBC) to something else that can do the job. A database-resident language with static SQL (such as PL/SQL, or potentially Java+SQLJ, or anything else someone could write to run on Oracle's JVM) has advantages that code running on a separate server can't have (automatic validation against metadata, cursor caching, and full error/trace stacks come to mind).

The first move in resolving the weaknesses of Java has already been made. Java applications have given up using JDBC directly and separated out their database logic from the application logic into an ORM. The next logical move is to shift that ORM towards the database where it can be better managed. After all, if you want to separate application logic from relational dependencies, the dumbest place to put an ORM is in the application layer.

Thursday, October 02, 2008

Virtual XE not just a cloud dream

I know I'm not blogging much. I’ve got a new job, and my boss reads my blog :)
Hi Boss !

But one thing popped out at me from the streams of OOW output, and it hasn't generated much comment in the Oracle blogosphere.

The Oracle XE license states that it is limited to one instance per server. It wasn't specific in how that related to a physical box running multiple virtual servers. If the ‘server’ referred to the physical machine, then this would preclude having different XE instances on multiple virtual servers on the same physical box.

OOW revealed Oracle's new Amazon cloud offering. Amazon machines are, as far as I am aware, just another virtual box. Of course there is hardware somewhere, but if you have a VM (or AMI) on Amazon, you don’t know what else is happening on that physical box (or even where it is). That’s the whole point of a cloud.

This revelation clarified the virtual server licensing for Standard and Enterprise Editions, but there are also explicit references to XE.
Slide 8 here refers to XE as one of the cloud building blocks and goes on to discuss and walk through its deployment on EC2.
Similarly there is another Amazon / XE walkthrough on the oracle site.
And finally a prebuilt XE AMI from William Hodak, a Senior Product Manager from Oracle.

In my opinion, this settles the virtual server aspect of the XE license. Oracle are positively encouraging people to run XE on a virtual machine irrespective of whether this actually results in multiple XE instances running on the same physical box. Standard caveats apply. I'm not a lawyer, this is my opinion, jurisdiction may play a part....

The OTN license also permits distribution of XE, subject to the recipient agreeing to same terms and conditions. So a hosting service can supply XE on a virtual host as long as the customer accepts the XE license.

Technically, there may be a question about a hosting service offering only Apex workspaces in an XE database. The license does refer to use for 'internal data processing'. In practice, if a hosting company can offer a full XE environment, I can't see how Oracle could argue that partial access to an XE database is a breach… at least as long as the customer has agreed to the XE licence.

Of course if you try to run your MyFace-killing social app over a couple of hundred XE instances on Amazon, I wouldn’t be surprised if Oracle come knocking at your door extolling the benefits of SE or EE. But frankly, if you are getting that big, then you should be upgrading anyway, for the patches at least.

One more thing. This covers hosting, where the XE licensee doesn't have control of physical box. They may have a stricter interpretation if you are loading up VMs on your own hardware.

PS. If you are after an Oracle XE hosted solution, and can read German, look here

There are other Apex hosting services, including MaxApex where I've been trying some charts for my son's performance at Little Athletics. Yes, I'm a geek.

Friday, July 25, 2008

Apress books cheap in Sydney

A heads up for any techies in Sydney,
Basement Books (near Central Station) has a whole bunch of Apress books on sale.

There were half a dozen copies of Jonathan Lewis' Cost Based Oracle for $20, plus Beginning PL/SQL and Oracle PHP (though no sign of Tom's book). It isn't just Oracle. The stock includes SQL Server, MySQL, general database, Ruby, Joomla....

I picked up a couple of Javascript books myself, at about a quarter of the list price.

Worth a trip in, but I've got no idea how long they'll stay in stock.

Small slices of pie

Small post to get back into things.

I had to do a Pie Chart in Apex. The problem was, on the basic select, some of the labels ran together.

If you order the items by how big the slice of pie is, then all the items with small values will be next to each other, so their labels will be close and there's a good chance of them overlapping. What I wanted to do was re-order the items to minimize the chance of small items being placed together. Obviously if there are lots of small items and only a couple of big ones that won't always be possible, but it's worth a try.

What I came up with was analytics.

Firstly, I rank the items by the value, both ascending and descending. The item with the highest value has both the highest and lowest rankings, and so does the item with the lowest. Ones in the middle stay in the middle. I order by the greatest of the two rankings, with a slight adjustment to the 'descending' ranking to avoid ordering collisions.

select link, label, value
from  (select null link, label, count(*) value,
              rank() over (order by count(*)) rnk,
              rank() over (order by count(*) desc) rnk2
       from (select case when rownum /2 = trunc(rownum/2) then 'able'
                         when rownum /3 = trunc(rownum/3) then 'baker'
                         when rownum /5 = trunc(rownum/5) then 'charlie'
                         when rownum /7 = trunc(rownum/7) then 'delta'
                         when rownum /11 = trunc(rownum/11) then 'echo'
                         when rownum /13 = trunc(rownum/13) then 'foxtrot'
                         when rownum /17 = trunc(rownum/17) then 'golf'
                         else 'juliet' end label
             from dual
             connect by level < 1000) a
       group by label)
order by greatest(rnk, rnk2-.1) desc

This way, the biggest slice gets put next to the smallest and so on, until the two middling ones are next to each other. And so there's a much reduced chance of label overlaps.
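For anyone who prefers to see the trick outside SQL, here's the same interleaving idea sketched in plain Python (assuming distinct values, just to keep it short):

```python
# Rank each slice both ascending and descending by value, then sort by
# the larger of the two ranks so big and small slices alternate.
def interleave(values):
    asc = sorted(values)
    desc = sorted(values, reverse=True)
    key = {}
    for v in values:
        # the .1 nudge breaks ties the same way as
        # greatest(rnk, rnk2 - .1) in the query above
        key[v] = max(asc.index(v) + 1, desc.index(v) + 1 - 0.1)
    return sorted(values, key=lambda v: key[v], reverse=True)

print(interleave([100, 80, 60, 40, 20, 10]))  # [100, 10, 80, 20, 60, 40]
```

The biggest slice lands next to the smallest, the second biggest next to the second smallest, and so on.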

Tuesday, April 22, 2008

Switching jobs in Sydney revisited

I read up on Justin's post about staff retention here. This resonated with me as, after moving to a permanent job a year ago, I'm moving back into the contract market.

Life as a consultant is pretty odd. You get little say or control in your next assignment, so I was doing more SQL Server and Teradata than Oracle, and commuting to awkward locations. Cityrail has this wonderful ability to time its trains (at Hornsby at least) so that, just as your train draws slowly onto the platform, your connecting train is closing its doors on the opposite side (after having sat there for five minutes doing nothing). Very frustrating.

So I've found myself a gig in the City. Back to Oracle, though this time under the job title of "Database Administrator". It's a development database role, so I'm not being thrown into the scarily critical role of production backups. I hope to post more as I start to get to grips with my new role and tasks.

During my job search, I was encouraged to see that a few places are doing some Apex development, including the one I'm moving to.

Monday, April 21, 2008

Demo data with (some) grunt

Frustrated with SCOTT.EMP ? Are you looking for demo data with depth ?
Fed up of tables where you can count the data blocks on your fingers ?
Are you simply bored with the same old examples ?

I wondered whether I could find a fresh set of data that would be suitable for use in examples or demonstrations. I wanted something that would be big enough that indexes could be used (where appropriate) and where I could filter data and still have pagination. Something with volumes at least in the thousands of rows. I also wanted something where a general audience would easily understand the tables and relationships. And most of all I wanted something free.

Project Gutenberg has lots of free data but it's in the form of books which, while it may be handy for demonstrating Oracle Text, doesn't allow for much else. There are some census summaries available, but they are not especially queryable. My best discovery is the FAA in the US, who allow you to download a file of their aircraft registration details. If anyone has suggestions for other data sources, please add a comment.

The FAA data is over 350,000 records. The main data includes a unique registration number, aircraft type, model and name/address details. The zip file also includes reference data files for the engine and aircraft manufacturer (plus some others for documents lodged, dealers, deregistered aircraft and reserved registration numbers).

You don't have to click to agree to a licence or export restrictions or anything. I've looked and could not find any limits on what you are allowed to do with the data. There's a couple of web sites that have extracted it and allow searching on it. Given that some aircraft owners will be individuals, rather than corporations, your particular legal jurisdiction may have some relevant privacy legislation, which may be one downside to exploiting this resource. Or you can add an exercise to anonymise the data.

The concepts behind the data structures are familiar enough to be readily understood. People know what planes and helicopters are, they know about vehicle registrations, manufacturers, makes and models and so on.

The data content itself is pretty handy. You get skew in several columns (there are lots more planes than gliders and just a handful of blimps and so on) plus low-cardinality columns suitable for bitmap indexes.

The download file is zipped to less than 30 Mb, expanding to a couple of hundred megabytes of data. That's not 'production sized' data but is reasonable for running demonstrations or training on desktop/laptop level hardware. The FAA keep the data up to date, so you've got the potential to build up a library of 'update' files which you could use to demonstrate the concepts of batch/incremental updates.

Since most of the data is in a single Master file, we need to do some normalisation to separate out 'parent' owners from 'child' aircraft registrations. [I'm using the 'friendlier' term "owner" instead of the more accurate "registrant". Let's not be too particular here.] Some 'owner names' have multiple addresses, so we are free to treat these as a single owner with multiple locations, or as separate owners. I'm opting for the former as it gives an extra layer of complexity to what is a pretty simple data model.

Data Loading.
Start with a schema, and a directory from which we can load the data. Your directory will need to be wherever you unzip your data files. You may need to play with permissions, depending on your OS and setup.

create user faa identified by faa default tablespace users temporary tablespace temp;
alter user faa quota unlimited on USERS;
grant create table, create view, create synonym, create trigger, create procedure to faa;
grant create session, create sequence to faa;
CREATE OR REPLACE DIRECTORY "EXT_TABLES" AS '/home/gary/temp/Data_files';
grant all on directory "EXT_TABLES" to FAA;

The tables.
Firstly, there are some mini-reference tables that can be created from details in the PDF describing the files. I've created these as index-organized tables with the 'natural' key provided. They all fit in a single block, and using an index organized table means there's no separate index segment/block. The SQL for all these is in FAA_ref_data.sql.

Then I create the external tables for the aircraft, engine and master files (see faa_ext_table.sql). I follow that by converting the aircraft and engine external tables to regular tables with a sequence-generated primary key and a unique key on manufacturer/model/series code (faa_eng_ac.sql). I've normalised out Manufacturer Id and Name from both the Engine and Aircraft tables. That involved a bit of data cleansing where there were multiple names for the same id, and also the same name with different ids.

That's the reference data sorted out. Now onto the meat in faa_main.sql.

For the main data, I pull the master file into an internal staging table. Then I extract the unique owners (just a sequence generated key and the names) into one table. I'll keep the 'Other Names' as attributes, treating the data on the basis that the main name encompasses the other names (eg 'The Beatles' with 'John','Paul','Ringo','George') rather than partners (eg 'Abbott','Costello'). Then I go down to the addresses (locations) for those owners (again, a sequence generated key and street, street2, city, state, zip_code, region, county and country) into another table.

I'll pull all those foreign keys back into the staging table and then extract the aircraft themselves, with a unique natural key on N_Number and a bunch of foreign keys. I also add a check constraint on fract_owner ('Y' or null). I'm not going to 'parse' the certification column (as I don't really understand it).

I've used natural keys because this is data being generated externally. Any sequence generated surrogate keys would never be present in fresh incoming data, so I'd only ever be able to update this data from the same source based on the natural keys. In real terms, the aircraft is actually two entities; the physical aircraft, which should be uniquely identifiable by the manufacturer and serial number, and the registration, which is associated with an aircraft at a point in time but which can be transferred between aircraft. However the data I downloaded doesn't have that manufacturer/serial number uniqueness, and would need to be cleansed. For demonstration purposes, I'm going to be lazy and follow the data.

I then extend the model by including the 'documents lodged' file entries relating to aircraft. This is in faa_doc.sql. These get a sequence generated key, plus an N_Number foreign key referencing the aircraft table. All other fields are attributes. Any non-aircraft documents or those where we don't have the 'N_NUMBER' (presumably not currently registered, perhaps reserved) will be ignored.

If I wanted to go further with the data, I'd look at the 'de-registered aircraft' and 'reserved N Number' files and maybe add history tables under aircraft. However I think I'd need to properly separate out aircraft and registrations, which requires data cleansing.

As an alternative for additional data, there is aviation accident information here .
This is in Access 95 (MDB) format, so I'd need to go a different conversion route (eg SQL Developer), but I can tie the registration number of the aircraft in that database to the N_number in the registration data.

If there's any interest, I may post a separate blog entry on that.

Thursday, February 21, 2008

Databases differ. Problem solving approaches STILL don't.

Still in the SQL Server world, but this project is on the last step, so there's a chance I can be back on Oracle in the near future.

A friend emailed me a useful link for equivalent functions between database products.
Given that anyone can add/edit entries in there, it should be more up-to-date than my previous resource here (which I still recommend for a read-through).

On the theme of equivalence, I've been looking at the SQL Server 2005 equivalent of an EXPLAIN PLAN.

I've had another issue with poor query performance, but this time with a SELECT.
I would have nosed at the EXPLAIN PLAN in Oracle and the SQL Server version seems to be "SET SHOWPLAN_ALL ON". Rather than doing an EXPLAIN PLAN for a specific query though, once this is set in SQL Server the subsequent queries return the plan rather than the results, until you turn the setting off.

I'm old-school, and like to look at text rather than pretty pictures. A picture being worth a thousand words is only valid when you start with the picture. If you start with a description, a dozen different artists will give a dozen interpretations, and some will be worse than others. With the description, I can make up my own mind what's important. Apparently there is an XML version of the output, but I haven't experimented with that yet, so don't know how readable that gets or can be made.

The query was a simple join between two tables; one was a small load table (a few thousand rows) and the other a much larger 'target' table (several million rows). I expected a scan of the small table with a Nested Loop containing an Index Scan and table access for the target table. It can be useful to start with an expectation of a 'good' plan. If the optimizer comes up with something different, you take a look and see where your assumptions differ from the optimizer's, and who is 'right'.

In this case I got pretty much what I expected.

--Parallelism(Gather Streams)
--Nested Loops(Inner Join, ...
--Nested Loops(Inner Join, ...
--Compute Scalar(...
--Table Scan(OBJECT:LOAD_TABLE....


I could see (using the same analysis SQL as before) that it was doing a lot more work than I would have expected with that plan. The table scan shouldn't have been responsible, as it was a small table and only done once. The row lookup on the target shouldn't have been responsible as, if there was an index match, it would have matched only one row. If the index matched lots of rows for this data, it would have been using the wrong index (or the right index in the wrong way). So I took a harder look at the Index Seek line. It was the right index. It had a 'seek' component and a 'where' component, which I equated to the 'Access predicate' and 'Filter predicate' in an Oracle plan.

SEEK:(TARG_TAB.COL_1=[Expr1008] AND TARG_TAB.COL_2=[Expr1009]),
The cause was obvious in this case. The implicit conversion from a number to a character meant that it only used the two leading columns of the index (which had poor selectivity) and not the third (and subsequent) columns which would have made the index really useful.
I corrected the join to convert the number in the load table to a varchar, and the query that had taken an hour took just seconds.
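The same trap is easy to reproduce on any database. Here's a sketch using SQLite from Python (the table and column names are invented): applying a conversion to the indexed column forces a full scan, while converting the literal on the other side of the comparison leaves the index usable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table targ (col_1 text, payload text)")
conn.execute("create index targ_ix on targ (col_1)")
conn.executemany("insert into targ values (?, 'x')",
                 [(str(i),) for i in range(1000)])

def plan(sql):
    # Flatten the EXPLAIN QUERY PLAN detail text into one string
    rows = conn.execute("explain query plan " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

# Conversion applied to the indexed column: full table scan
p1 = plan("select * from targ where cast(col_1 as integer) = 42")
# Conversion applied to the literal instead: index seek
p2 = plan("select * from targ where col_1 = cast(42 as text)")
print("SEARCH" in p1)  # False - SCAN, the index can't be used
print("SEARCH" in p2)  # True - SEARCH ... USING INDEX targ_ix
```

The moral is the same in SQL Server, Oracle or SQLite: do the type conversion on the side of the join that isn't indexed.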

Thursday, February 14, 2008

Databases differ. Problem solving approaches don't.

There's been no activity here in a while, because I haven't done much in the Oracle area. And no-one tagged me in the blog-tag game (so the story about my Dad and Princess Anne stays secret).

This entry is more SQL Server than Oracle, but it is generic in some ways, and it's also got the closest I've found to v$sql in SQL Server 2005.

I've been working with a tool that generates procedures for moving data around. You know the sort of thing. Press the button and away it goes and you end up with a plain vanilla script with a big SELECT cursor loop, an update and, if no rows were updated, an insert. It commits every thousand records and, when it inserts more than a thousand records, it gathers stats. Not at the end, but after that first thousand, so if you insert ten million, the stats reflect the first thousand. Odd, but there you go.

Anyway, it's just the sort of GUI button generated code that Noons remarked upon.
Guess what ? It ran pretty slowly when dealing with millions of records.

The initial temptation was to throw it away and write custom code. While there are definitely benefits to that, there's also the downside that you lose that automatic generation capacity for the future, and since that was one reason for getting the tool, we need a good reason to switch.

So in the spirit of BAAG, we took a different approach.

Firstly, we stuck in a bunch of debugs to indicate how long it took for each 250,000 records. Enough messages to see how it was going, not so many that we are overwhelmed.

The result was that it started fast enough (about 1 min per batch) and consistently decelerated (to 10-15 minutes a batch after several million). The first suspect was the update statement. Initially the table would be empty, so there would be no updates and it should find that out quickly. As data is added, the update query would take longer even if no rows were found to update. In fact we knew that the update would never match rows, so one option would be to remove that altogether.

But it was still a guess that the update was the problem. Or at least a hypothesis as it was something we could test. So with sufficient googling, I came up with a query like :

select creation_time, last_execution_time, execution_count,
       total_logical_writes, total_logical_reads, total_physical_reads,
       (qs.total_worker_time/1000000) tot_wrk_sec, (qs.total_elapsed_time/1000000) tot_secs,
       SUBSTRING(st.text, (qs.statement_start_offset/2) + 1,
         ((CASE qs.statement_end_offset
             WHEN -1 THEN DATALENGTH(st.text)
             ELSE qs.statement_end_offset END - qs.statement_start_offset)/2) + 1) as statement_text
from sys.dm_exec_query_stats qs CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) as st
where last_execution_time > dateadd(Minute,-20,getdate())
order by execution_count, total_logical_writes, total_logical_reads desc

which is sort of like looking at SQL Server's version of v$sql.

It indicated that the time was pretty much all taken up in the insert component. The update was innocent. Actually, given that B-Tree indexes are pretty good at finding stuff, the 'slowing down' should have eased up fairly quickly, going from one hit to two/three for each index check as the 'depth' of the index grew. And even doing three times as many reads for each index check wouldn't have explained a ten-fold decrease in performance.

The underlying culprit was half-a-dozen indexes on the table on low-cardinality columns. Whether those indexes are actually useful or not for the application is a wider question, but for this particular purpose, it was simple enough to drop them before hand and recreate them at the end.
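That drop-and-recreate pattern, sketched against SQLite from Python (the index names and low-cardinality columns are invented): maintain no secondary indexes during the bulk insert, then rebuild each one in a single pass at the end.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table target (id integer primary key, status text, region text)")

index_ddl = [
    "create index target_status_ix on target (status)",
    "create index target_region_ix on target (region)",
]
for ddl in index_ddl:
    conn.execute(ddl)

# Bulk load: drop the secondary indexes first...
conn.execute("drop index target_status_ix")
conn.execute("drop index target_region_ix")
conn.executemany("insert into target (status, region) values (?, ?)",
                 [("OPEN" if i % 2 else "CLOSED", "NSW") for i in range(10000)])
# ...then rebuild them once, instead of maintaining them row by row
for ddl in index_ddl:
    conn.execute(ddl)
conn.commit()

print(conn.execute("select count(*) from target").fetchone()[0])  # 10000
```

Whether this wins depends on the volumes involved; in our case the one-off rebuild was far cheaper than ten million incremental index maintenance operations.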

Result - run time cut to about 15% of the original, even including the time taken to rebuild the indexes.

And while the generated code clearly isn't optimal, it is adequate (at least in this case).