Monday, April 21, 2008

Demo data with (some) grunt

Frustrated with SCOTT.EMP ? Are you looking for demo data with depth ?
Fed up of tables where you can count the data blocks on your fingers ?
Are you simply bored with the same old examples ?

I wondered whether I could find a fresh set of data that would be suitable for use in examples or demonstrations. I wanted something that would be big enough that indexes could be used (where appropriate) and where I could filter data and still have pagination. Something with volumes at least in the thousands of rows. I also wanted something where a general audience would easily understand the tables and relationships. And most of all I wanted something free.

Project Gutenburg has lots of free data but it's in the form of books which, while it may be handy for demonstrating Oracle Text, doesn't allow for much else. There's some census summaries available, but they are not especially queryable. My best discovery is the FAA in the US, who allow you to download a file of their aircraft registration details. If anyone has suggestions for other data sources, please add a comment.

The FAA data is over 350,000 records. The main data includes a unique registration number, aircraft type, model and name/address details. The zip file also includes reference data files for the engine and aircraft manufacturer (plus some others for documents lodged, dealers, deregistered aircraft and reserved registration numbers).

You don't have to click to agree to a licence or export restrictions or anything. I've looked and could not find any limits on what you are allowed to do with the data. There's a couple of web sites that have extracted it and allow searching on it. Given that some aircraft owners will be individuals, rather than corporations, your particular legal jurisdiction may have some relevant privacy legislation, which may be one downside to exploiting this resource. Or you can add an exercise to anonymise the data.

The concepts behind the data structures are familiar enough to be readily understood. People know what planes and helicopters are, they know about vehicle registrations, manufacturers, makes and models and so on.

The data content itself is pretty handy. You get skew in several columns (there are lots more planes than gliders and just a handful of blimps and so on) plus low-cardinality columns suitable for bitmap indexes.

The download file is zipped to less than 30 Mb, expanding to a couple of hundred megabytes of data. That's not 'production sized' data but is reasonable for running demonstrations or training on desktop/laptop level hardware. The FAA keep the data up to date, so you've got the potential to build up a library of 'update' files which you could use to demonstrate the concepts of batch/incremental updates.

Since most of the data is in a single Master file, we need to do some normalisation to separate out 'parent' owners to 'child' aricraft registrations. [I'm using the 'friendlier' term "owner" instead of the more accurate "registrant". Let's not be too particular here.] Some 'owner names' have multiple addresses, so we are free to treat these as a single owner with multiple locations, or as separate owners. I'm opting for the former as it gives an extra layer of complexity to what is a pretty simple data model.

Data Loading.
Start with a schema, and a directory from which we can load the data. Your directory will need to be wherever you unzip your data files. You may need to play with permissions, depending on your OS and setup.


create user faa identified by faa default tablespace users temporary tablespace temp;
alter user faa quota unlimited on USERS;
grant create table, create view, create synonym, create trigger, create procedure to faa;
grant create session, create sequence to faa;
CREATE OR REPLACE DIRECTORY "EXT_TABLES" AS '/home/gary/temp/Data_files';
grant all on directory "EXT_TABLES" to FAA;


The tables.
Firstly, there are some mini-reference tables that can be created from details in the PDF describing the files. I've created these as index-organized tables with the 'natural' key provided. They all fit in a single block, and using an index organized table means there's no separate index segment/block. The SQL for all these is in FAA_ref_data.sql.

Then I create the external tables for the aircraft, engine and master files (see faa_ext_table.sql). I follow that by converting the aircraft and engine external tables to regular tables with a sequence generated primary key and a unique key on manufacturer/model/series code (faa_eng_ac.sql). I've normalised out Manufacturer Id and Name from both Engine and Aircraft tables. That involves a bit of data cleansing where there where multiple names for the same id and also the same name with different ids.

That's the reference data sorted out. Now onto the meat in faa_main.sql.

For the main data, I pull the master file into an internal staging table. Then I extract the unique owners (just a sequence generated key and the names) into one table. I'll keep the 'Other Names' as attributes, treating the data on the basis that the main name encompasses the other names (eg 'The Beatles' with 'John','Paul','Ringo','George') rather than partners (eg 'Abbott','Costello'). Then I go down to the addresses (locations) for those owners (again, a sequence generated key and street, street2, ciy, state, zip_code, region, county and country) into another table.

I'll pull all those foreign keys back into the staging table and then extract the aircraft themselves, with a unique natural key on N_Number and a bunch of foreign keys. I also add a check constraint on fract_owner ('Y' or null). I'm not going to 'parse' the certification column (as I don't really understand it).

I've used natural keys because this is data being generated externally. Any sequence generated surrogate keys would never be present in fresh incoming data, so I'd only ever be able to update this data from the same source based on the natural keys. In real terms, the aircraft is actually two entities; the physical aircraft, which should be uniquely identifiable by the manufacturer and serial number, and the registration which is associated with an aircraft at a point it time but which can be transferred between aircraft. However the data I downloaded doesn't have that manufacturer/serial number uniqueness, and would need to be cleansed. For demonstration purposes, I'm going to be lazy and follow the data.

I then extend the model by including the 'documents lodged' file entries relating to aircraft. This is in faa_doc.sql. These get a sequence generated key, plus an N_Number foreign key referencing the aircraft table. All other fields are attributes. Any non-aircraft documents or those where we don't have the 'N_NUMBER' (presumably not currently registered, perhaps reserved) will be ignored.

If I wanted to go further with the data, I'd look at the 'de-registered aircraft' and 'reserved N Number' files and maybe add a history tables under aircraft. However I think I'd need to properly separate out aircraft and registrations which requires data cleansing.

As an alternative for additional data, there is aviation accident information here .
This is in Access 95 (MDB) format, so I'd need to go a different conversion route (eg SQL Developer), but I can tie the registration number of the aircraft in that database to the N_number in the registration data.

If there's any interest, I may post a separate blog entry on that.

3 comments:

Michael O'Neill said...

A very nice find indeed. I've often thought of duplicating the Microsoft AdventureWorks schema for Oracle. A mainstream sample database of merit is definitely missing in the Oracle realm.

lijialefw said...
This comment has been removed by a blog administrator.
Gary Myers said...

The files for this are now located on
http://www.sydoracle.com/Codespace

There is faa_sqls.zip containing all the SQL files.
Because the structure of the datafiles may have changed over the years, I've also included data copies from 2008.
These are in 2 files ( FAA_DATA.7z and FAA_MASTER.7z ) because of file size limits.