Thursday, December 09, 2010

Globally and Locally Structured Databases

I've been reading another article on NoSQL.

This one focuses on the 'No Join' aspect which the author relates to schema-less data.

Normalisation and joins don't go hand-in-hand though. If you are old enough, you might have had an address book (or a Filofax). In there, you would write addresses organised by name. Under "F" you'd write "FLINTSTONE, FRED and WILMA" followed by an address in Bedrock. That's pretty normalised. You haven't got one entry for Fred and another for Wilma, both with the same address. You might write Fred's birthday next to his name, and Wilma's next to hers, and have another date for their wedding anniversary.

Implicitly you've recognised that some items belong to the 'family' entity and others to the 'family member' entity. But you've made them part of a single object that provides the joining mechanism.

SQL databases use tables to represent entities, requiring joins to pull back related entities (which leads to concepts like primary keys). I started out on an IDMS network database which didn't have tables, but rather explicit relationships between entities. You would start with a parent, follow a relationship down to a child and then to siblings (either all of them, or as many as you wanted to find). You might then follow a different relationship from that child to another kind of parent. Still normalised, but a different linking mechanism.

There is a difference between what I'd term 'globally structured data' and 'locally structured data'. SQL databases have a global structure, in that a consistent structure is forced on the entire database. On the other hand, document databases allow each document to follow its own structure. Key-value stores don't have any explicit structure at all, leaving it up to the application to interpret.

Where you have orders consisting of order lines, a SQL database will have one Orders table and one Order Lines table; all the Orders have the same attributes (for example Customer Name), and all the Order Lines have the same attributes. A document database may have some orders with a "Customer Name", others with a "Customer Surname" and "Customer First Name", and yet others with a "Customer Corporate Name". This puts the onus on the application to cope with the potential varieties.
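To make that concrete, here is a minimal sketch in Python of what the application-side coping might look like. The documents are plain dicts, and the field names are the illustrative variants from the paragraph above, not from any particular document database:

```python
# Hypothetical order documents as they might sit in a document store.
# Each one names the customer differently; the store doesn't care.
orders = [
    {"id": 1, "Customer Name": "Fred Flintstone"},
    {"id": 2, "Customer Surname": "Rubble", "Customer First Name": "Barney"},
    {"id": 3, "Customer Corporate Name": "Slate Rock and Gravel Co."},
]

def customer_display_name(order):
    """Cope with the three naming variants at the application layer."""
    if "Customer Name" in order:
        return order["Customer Name"]
    if "Customer Surname" in order:
        return f'{order["Customer First Name"]} {order["Customer Surname"]}'
    return order["Customer Corporate Name"]
```

Every piece of code that touches a customer name needs this sort of branching; the schema hasn't gone away, it has just moved into the application.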

So what is the difference when it comes to the 'agility' of the implementations?

In the run-up to Christmas, your firm decides to change its business model. Rather than forcing people to make separate orders for each delivery address, you now allow each item to be delivered to a different location. In a normalised database you need to add the DELIVERY_LOCATION to the ORDER_LINE entity. You then update all existing orders so that the DELIVERY_LOCATION from the ORDER is copied down to the individual child ORDER_LINE records, and finally drop the DELIVERY_LOCATION from the ORDER table.
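Those three migration steps can be sketched against an in-memory SQLite database. The table and column names are illustrative; note that SQLite only gained DROP COLUMN in version 3.35, so older engines would rebuild the table instead:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (
        order_id          INTEGER PRIMARY KEY,
        delivery_location TEXT
    );
    CREATE TABLE order_lines (
        line_id  INTEGER PRIMARY KEY,
        order_id INTEGER REFERENCES orders,
        item     TEXT
    );
    INSERT INTO orders      VALUES (1, 'Bedrock');
    INSERT INTO order_lines VALUES (10, 1, 'Brontosaurus ribs');
""")

# Step 1: add DELIVERY_LOCATION to the order-line entity.
con.execute("ALTER TABLE order_lines ADD COLUMN delivery_location TEXT")

# Step 2: copy each order's location down to its child lines.
con.execute("""
    UPDATE order_lines
       SET delivery_location = (SELECT delivery_location
                                  FROM orders
                                 WHERE orders.order_id = order_lines.order_id)
""")

# Step 3: drop the column from the parent entity
# (skipped gracefully on SQLite builds older than 3.35).
try:
    con.execute("ALTER TABLE orders DROP COLUMN delivery_location")
except sqlite3.OperationalError:
    pass
con.commit()
```

Once the migration has run, every row is in the new shape, and the database itself guarantees it stays that way.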

In a document database you can change the application and start recording your new delivery location on your Order Lines. At the database layer, you don't have to worry that the structures are not consistent. The problem is that you still have to write code that delivers items to the right place, and that handles orders with a delivery location only at the order level as well as orders with it at the order-line level. Your application logic now has additional complexity, plus you need test cases that cover both sets of conditions. To manage this, you might version the document templates, so that 'v1.1' Orders have the single delivery location while the newer 'v2.0' Orders have it at the line level.
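A sketch of that dual-version application logic, again with plain dicts standing in for documents and with version tags and field names as assumed above:

```python
# Illustrative order documents in the two hypothetical template versions.
v1_order = {"version": "v1.1",
            "delivery_location": "Bedrock",
            "order_lines": [{"item": "Brontosaurus ribs"}]}

v2_order = {"version": "v2.0",
            "order_lines": [{"item": "Brontosaurus ribs",
                             "delivery_location": "Rock Vegas"}]}

def delivery_location(order, line):
    """Resolve where a line item ships, coping with both versions."""
    if order.get("version") == "v2.0":
        return line["delivery_location"]   # v2.0: per-line location
    return order["delivery_location"]      # v1.1: one location per order
```

That branch, and the test cases behind it, will live in your codebase for as long as both versions exist in the store.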

Worse yet, your application will have bugs in it. Maybe not today, maybe not tomorrow, but one day, and you'll regret the consequences for the rest of your life. Because you'll be stuck with some 'v1.1' Orders that have a delivery location on the Order Lines, or perhaps a 'v2.0' Order with no delivery location at all, or something that claims to be a 'v1.3' style which isn't documented anywhere but is something Barney had to write at five o'clock on Christmas Eve.

It is perfectly possible for you to run a 'migration' on your document database so that all your data is at the same version. This essentially re-creates the situation of a 'globally structured database' like SQL... except that you need to get all the code right yourself, because you won't get enforcement support from the database layer.
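Such a document-side migration is just the relational migration re-done in application code. A minimal sketch, assuming the same hypothetical version tags and field names as above:

```python
def migrate_to_v2(order):
    """Copy the order-level delivery location down to every line,
    then retag the document: the relational migration, re-done in code."""
    if order.get("version") == "v2.0":
        return order                      # already in the new shape
    location = order.pop("delivery_location")
    for line in order["order_lines"]:
        line["delivery_location"] = location
    order["version"] = "v2.0"
    return order

old = {"version": "v1.1",
       "delivery_location": "Bedrock",
       "order_lines": [{"item": "Brontosaurus ribs"}]}
new = migrate_to_v2(old)
```

The catch is exactly the one in the paragraph above: nothing in the database stops a buggy migration, or a writer that keeps producing 'v1.1' documents afterwards.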

Another solution might be to delete orders once they are delivered. After a few months, those 'v1.1' documents won't be an issue any more. At least in five years' time you won't have code that is built to deal with 20 different versions.

Generally, the problems of being Agile in an RDBMS are not to do with SQL or relational concepts. They are a consequence of being unable to restructure existing data (perhaps you can't work out how, or don't have the data you need). But they are ultimately problems in migration from the old business model to the new model, not database problems.

To be even handed, schema changes are still not seamless, especially with more primitive database engines, or when dealing with high-availability / multi-server / replicated databases. But that's a 'state of the art' issue, not a fundamental conceptual issue.

1 comment:

DomBrooks said...

Just been reading the article to which you refer.

I'm constantly amazed by people who just don't get data, don't get databases, don't get modelling.

It's easy to come up with an example of a bad data model - which is what the research paper example is - but to use that bad data model as an example that relational databases are bad is just b0ll0cks.

It just proves that good analysis and database design are key.

And there's more to agility than just being able to bash out the latest change quickly.

Long-term agility is about adapting quickly with the business and leaving behind something which performs, makes sense and is not storing up some great big problem that someone will have to deal with down the line.