Wednesday, August 05, 2009

I can see my house ... Part 3 - Sax and the Cities

In the previous post in the series, I mentioned trying to work with the raw OSM file from OpenStreetMap. Doing so would remove the 'dependency' on the extracts from CloudMade.

Whinge

But it isn't easy. The OSM format is XML, though that isn't my problem with it. Yes, XML is verbose, and that makes the files big (about 1GB for Australia). I actually think XML is appropriate for this sort of data interchange requirement, as people will be interested in various slices of it and new bits will need adding as time goes on.

No, what I object to is that practically everything that isn't a reference to a longitude/latitude is a key/value 'tag' element. While I don't like it, I do understand the reason for it given in the FAQ:

"Imposing too many rules on them would reduce mapping from a fun and creative activity to a dull and boring data collection task. We give our data away for free, and if our users need more structure, they can convert our data to match their needs."

At this stage in the project, I agree that data volume outweighs data structure. However I suspect (and hope) there will come a time when the project matures to the point where more structure is imposed.

Structure, what structure?

Currently, the OSM data is mostly a bunch of nodes/points (each with a latitude, longitude and multiple tags) and ways (an ordered set of points, plus multiple tags). Some 'ways' are tagged as highways, some are coastlines, some are rivers. A single 'way' can be multiple things (eg a road or river may serve as a boundary between regions or countries, or a 'way' may serve as both road and tramway). There are also relations, which can group nodes and ways (and have their own tags).
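
To make that concrete, here is a small hand-written fragment in the shape of the OSM XML (the ids, coordinates and tag values are invented for illustration):

<osm version="0.6">
  <node id="1001" lat="-33.8688" lon="151.2093" user="example" timestamp="2009-08-01T00:00:00Z">
    <tag k="created_by" v="SomeEditor"/>
  </node>
  <node id="1002" lat="-33.8690" lon="151.2100"/>
  <way id="2001">
    <nd ref="1001"/>
    <nd ref="1002"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Example Street"/>
  </way>
  <relation id="3001">
    <member type="way" ref="2001" role="outer"/>
    <tag k="boundary" v="administrative"/>
  </relation>
</osm>

Note that apart from the lat/lon attributes and the node references, everything of interest is buried in those key/value tags.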

Give this problem to a database designer and they would look at a map and think "Hmmm. We need to differentiate between roads, rivers, train lines etc. And most roads have names, maybe speed limits...". They'd come up with a data model. But rather than store these as XML attributes (or sub-elements), they have been relegated to catch-all key/value pairs.

That means, rather than being documented as part of the XML schema (XSD or DTD), it is all on a wiki. And there's no automated way to validate conformance.

But enough whinging. If you want to use the OSM, here's a few suggestions based on my attempts.

How to get the OSM

You can download the entire planet's OSM file from one of the mirrors listed, but our planet is a big place. Personally, I'm just interested in Australia at the moment, and I can get a download of that from OSM Australia.

If, as is likely, you have a different area of interest, you can download the whole planet and use a tidy java app called osmosis to extract a square, rectangle or other shape containing your region of interest.

They also do daily deltas for the planet.osm. Again, you'd chop out the rectangle of interest and apply that. Australia is surrounded by sea and it is pretty easy to come up with a rectangle that includes Australia and nothing else. You can do other shapes too, but I like to keep it simple.

java -jar osmosis.jar --read-xml file="/osm/australia.osm" --bounding-box top="-33.50" bottom="-34.12" left="150.54" right="151.45" --write-xml file="greater_sydney.osm"
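
For the daily deltas, the pattern should be something along these lines; I haven't run this exact command, so treat it as a sketch and check the osmosis task documentation (the file names are just placeholders):

java -jar osmosis.jar --read-xml-change file="daily_delta.osc" --read-xml file="australia.osm" --apply-change --write-xml file="australia_updated.osm"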

The osmosis instructions contain some other filters. Alas there isn't one to extract only highways, but it is open source and I've put in a request for that to the dev list.

How not to load the OSM

I tried to load part of Australia (the greater Sydney area - about 70MB) in one chunk into an Object-Relational XML schema and gave up. It did work for a smaller 4MB extract though, and I was only using a single-core laptop with 1GB of RAM; with a bigger system and/or a smaller region you may have better luck. The SQL for that is osm_schema.sql. You will have to amend the OSM file to put in the appropriate XML headers. I also had some character set issues, which I resolved through uconv (simply deleting the characters it didn't recognise).

How to load the OSM

After hunting around, I found Marco's blog entry on SAXLoader. Thanks Marco. I've copied the source code from the messages on the thread and saved it as separate files.
I didn't want to inflict JDeveloper on my little laptop, so I compiled them through the command line.
Your mileage may vary, but I set the following classpath, and compiled as follows:

export CLASSPATH=$CLASSPATH:.:$ORACLE_HOME/lib/xmlparserv2.jar:$ORACLE_HOME/lib/servlet.jar:$ORACLE_HOME/jdbc/lib/ojdbc6.jar
javac -d . *.java


As I said above, under the top-level "osm" element, the XML has three different child elements: "node", "way" and "relation". This means three passes through the SaxLoader. I used three destination tables, all simply created like CREATE TABLE OSM_XML_NODE OF "SYS"."XMLTYPE". The easiest way to script the run is with three separate config XML files, calling the java program once for each element type.
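
Spelt out, the three tables are no fancier than the following (only OSM_XML_NODE is a name used above; the other two are just a matching convention I've assumed):

CREATE TABLE OSM_XML_NODE OF "SYS"."XMLTYPE";
CREATE TABLE OSM_XML_WAY OF "SYS"."XMLTYPE";
CREATE TABLE OSM_XML_RELATION OF "SYS"."XMLTYPE";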

java -Xmx256M -Dcom.oracle.st.xmldb.pm.ConnectionParameters=/osm/SAXLoader/node.xml com.oracle.st.xmldb.pm.saxLoader.SaxProcessor > node.log

java -Xmx256M -Dcom.oracle.st.xmldb.pm.ConnectionParameters=/osm/SAXLoader/rel.xml com.oracle.st.xmldb.pm.saxLoader.SaxProcessor > rel.log

java -Xmx256M -Dcom.oracle.st.xmldb.pm.ConnectionParameters=/osm/SAXLoader/way.xml com.oracle.st.xmldb.pm.saxLoader.SaxProcessor > way.log

For me, the 'node' took several hours, 'way' was a couple of hours and 'relation' about ten minutes. Again, your mileage will vary depending on your hardware and input file size. The smaller daily delta files should breeze through.

How to extract relational data from the OSM XML elements

The script xml_to_rel.sql is my first effort at this.

There were almost eight million nodes and about 1.5 million tags for Australia (but 1.2 million of those tags were 'created_by', and I don't care about those). From the nodes I extracted the longitude, latitude, user and timestamp attributes. The tags are loaded into a child table. Again, this operation takes several hours.
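
The script has the details, but as a rough illustration of the shape of the extraction (the relational table and column names here are invented for the example, not necessarily what xml_to_rel.sql uses):

INSERT INTO osm_node (node_id, latitude, longitude, osm_user, osm_timestamp)
SELECT TO_NUMBER(extractValue(value(t), '/node/@id')),
       TO_NUMBER(extractValue(value(t), '/node/@lat')),
       TO_NUMBER(extractValue(value(t), '/node/@lon')),
       extractValue(value(t), '/node/@user'),
       extractValue(value(t), '/node/@timestamp')
FROM   osm_xml_node t;

-- and the tags into a child table, skipping the 'created_by' noise
INSERT INTO osm_node_tag (node_id, tag_key, tag_value)
SELECT TO_NUMBER(extractValue(value(t), '/node/@id')),
       extractValue(value(x), '/tag/@k'),
       extractValue(value(x), '/tag/@v')
FROM   osm_xml_node t,
       TABLE(XMLSequence(extract(value(t), '/node/tag'))) x
WHERE  extractValue(value(x), '/tag/@k') <> 'created_by';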

I extracted the details for ways in a similar manner (450,000 entries), except there are two child tables. The first is the relationship between WAY and NODE (8 million rows), and the second holds the tags for the WAY (1.5 million). I don't bother with the visible attribute for WAY either. The osm_way_node extraction took about seven hours. I believe a large chunk of that was the handful of XML documents over 100kB in size, with tens of thousands of nodes (I think the largest of those is the Tasmanian coastline). You could try filtering out based on length(s.SYS_NC_ROWINFO$.getclobval()) if that's a problem for you.
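
For what it's worth, here's a sketch of how that WAY-to-NODE shred could look with XMLTable, including that size filter (again, the relational table and column names are invented, and the 100kB cut-off is only an example):

INSERT INTO osm_way_node (way_id, node_ref, seq)
SELECT TO_NUMBER(extractValue(value(t), '/way/@id')),
       w.node_ref,
       w.seq
FROM   osm_xml_way t,
       XMLTABLE('/way/nd' PASSING value(t)
                COLUMNS node_ref NUMBER PATH '@ref',
                        seq      FOR ORDINALITY) w
WHERE  length(t.SYS_NC_ROWINFO$.getclobval()) < 100000;  -- skip the monster ways (eg coastlines)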

Finally, I 'promoted' the name and highway to attributes of the way.
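
That sort of 'promotion' can be as simple as an update from the tags child table (again using my invented names, not necessarily what the script does):

UPDATE osm_way w
SET    name    = (SELECT t.tag_value FROM osm_way_tag t
                  WHERE  t.way_id = w.way_id AND t.tag_key = 'name'),
       highway = (SELECT t.tag_value FROM osm_way_tag t
                  WHERE  t.way_id = w.way_id AND t.tag_key = 'highway');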

I haven't gone as far as pulling the longitude/latitudes into MDSYS.SDO_GEOMETRY datatypes, either on the nodes or ways. That would bring the data into line with that derived in the previous post, removing the reliance on the Cloudmade downloads.
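
For the nodes at least, that step should be straightforward: each longitude/latitude pair becomes an SDO_GEOMETRY point, something along these lines (8307 is the Oracle SRID for WGS84 longitude/latitude; osm_node is the invented table from the sketch above):

SELECT SDO_GEOMETRY(2001, 8307,
                    SDO_POINT_TYPE(n.longitude, n.latitude, NULL),
                    NULL, NULL) AS geom
FROM   osm_node n
WHERE  ROWNUM <= 10;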

3 comments:

Andrew said...

Hello Gary,

I wanted to filter some OSM data to find only ways tagged 'highway'... As you mentioned there was no simple way to do this. So I made a hybrid of the existing WayKeyValueFilter and NodeKeyFilter: the WayKeyFilter.

Here is the patch to add this plugin to Osmosis:

http://global_void.mailc.net/

Usage example:
osmosis --read-xml file="input.osm" --way-key keyList="boundary" --used-node --write-xml file="output.osm"

It seems to work correctly but I'll be testing it out more extensively over the next few days.

-Andrew

anon_anon said...

Have you looked at vtd-xml for big files?

http://vtd-xml.sf.net

sydoracle said...

Code now located at http://www.sydoracle.com/Codespace