Thursday, June 30, 2011

More on Checksumming

Prompted by a comment, I realise that I need to put some more flesh on the bones of checksumming.


A good use case is regression testing. A decent regression test will include a script to set up data so you can execute your task (preferably in an automated manner) and check your actual results match your expected results. That is pretty easy if the expected result is a value returned by a function. When it is a data change, it gets a bit trickier.

You can set up your expected results in a table (or a set of tables matching your main schema), run the tests and check the actual result data matches the expected result data. That can be as simple as

exec comp_tables('TEST_ACTUAL','TEST_EXPECTED');

Depending on your testing mechanisms, that can raise an exception if it fails, return an error code or record the results in a table. My example code just dumps it out to DBMS_OUTPUT. Also, a regression test will probably have expected results in several tables, so you'll need multiple comparisons.

Just to complicate things, often the tables will contain sequence-generated ids, or use the current date as a 'created/modified date'. So you might want to exclude some columns (or rows) from the comparison by passing queries rather than table names, with arguments along the lines of:

   'SELECT val_char, val_number from TEST_ACTUAL WHERE val_char=''ABC''',
   'SELECT val_char, val_number from TEST_EXPECTED WHERE val_char=''ABC''');


I've uploaded my code to the SydOracle CodeSpace location as Comp_Tables.sql
It works in several steps.

1. Take the entered query (eg 'SELECT 1 a, 2 b FROM DUAL') and determine the columns.

2. Create a wrapper query around that. That would give

SELECT ora_hash(a) a, ora_hash(b) b
FROM (SELECT 1 a, 2 b FROM DUAL)

The hashes turn all the values (of the main datatypes) into numbers, which makes them easy to aggregate, and comparing aggregated values is easy. The hashing also makes collisions very unlikely (though, as with any hash, not impossible). That would give a result set like

         A             B
2342552567    2064090006

3. Create another wrapper to mock up a hash for each row.
SELECT mod(a,100) * mod(b,100) row_hash, h.* FROM
  (SELECT ora_hash(a) a, ora_hash(b) b 
   FROM (SELECT 1 a, 2 b FROM DUAL)) h

Paraphrasing Eric Morecambe, it is fine having all the right columns, but you also want them in the right order. These result sets will show the same total hashes for columns 'A' and 'B' whenever the contents of those columns are identical, regardless of how the values are paired up. If you have one row of (1,'A') and a second of (2,'B'), a 'column-only' match would quite happily accept (1,'B') / (2,'A'), since all the column values are identical.

By creating another column based on all the values for the entire row, we get a summary of the row. We can't simply add the hashes from the columns together, as the sum of all the 'A's plus the sum of all the 'B's is mathematically the same as the sum of the (A+B)s. In short, we'd lose the checksumming effect as we aggregate the row values. So I've used multiplication, with a MOD so the numbers don't get too big.

ROW_HASH           A           B
     402  2342552567  2064090006

4. Finally we aggregate the values for all the rows. I've used SUM, though AVG would work just as well (and would give smaller, more manageable values for larger datasets). That gives an SQL like

SELECT SUM(row_hash),SUM(a),SUM(b)
FROM (SELECT mod(a,100) * mod(b,100) row_hash, h.*
      FROM(SELECT ora_hash(a) a, ora_hash(b) b
           FROM (SELECT rownum a, 2 b 
                 FROM DUAL CONNECT BY LEVEL < 5)) h)
(I've changed the SQL here to give multiple rows, so you can see the effect of the sum; the first four lines below are the individual row values and the last line is the summed result.)

ROW_HASH             A             B
     402    2342552567    2064090006
      36    2064090006    2064090006
     354    2706503459    2064090006
     186    3217185531    2064090006
     978   10330331563    8256360024

5. That gives us a one record encapsulation of an entire table (or subset of a table).
If we do that once for the 'actual' and once for the 'expected' we can compare those two records and see where they match.

So if we compare the previous SQL with a similar one that has a different value for 'B'

SELECT SUM(row_hash),SUM(a),SUM(b)
FROM (SELECT mod(a,100) * mod(b,100) row_hash, h.*
      FROM(SELECT ora_hash(a) a, ora_hash(b) b
           FROM (SELECT rownum a, 3 b 
                 FROM DUAL CONNECT BY LEVEL < 5)) h)

ROW_HASH            A            B
    9617  10330331563  10826013836

Comparing this result with the totals from the previous query, column 'A' matches but 'B' does not. We would then investigate the code to find out why 'B' wasn't being derived as expected.
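For anyone who wants to play with the mechanics outside the database, here is a rough sketch of steps 2 to 4 in Python. The names and the hash function are my own (zlib.crc32 stands in for ORA_HASH); the real implementation is the PL/SQL in Comp_Tables.sql.

```python
import zlib

def col_hash(value):
    # Step 2: turn each value into a number via a hash (ORA_HASH stand-in).
    return zlib.crc32(str(value).encode())

def table_checksum(rows):
    """Return (sum of row hashes, list of per-column hash sums)."""
    col_sums = [0] * len(rows[0])
    row_hash_sum = 0
    for row in rows:
        hashes = [col_hash(v) for v in row]
        # Step 3: multiply the MOD-reduced column hashes, so moving a
        # value to a different row changes the row hash.
        row_hash = 1
        for h in hashes:
            row_hash *= h % 100
        row_hash_sum += row_hash
        # Step 4: aggregate the column hashes with SUM.
        col_sums = [s + h for s, h in zip(col_sums, hashes)]
    return row_hash_sum, col_sums

actual  = [(1, 'A'), (2, 'B')]
swapped = [(1, 'B'), (2, 'A')]
# The column sums match (same column contents), but the row hash will
# almost certainly differ, catching the swapped rows.
print(table_checksum(actual))
print(table_checksum(swapped))
```

The MOD keeps the row products small at the cost of making a row-hash collision slightly more likely; the column sums still pinpoint which column disagrees.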


While you could just store the aggregate hash of the expected results, that would prevent you drilling down to investigate issues, so I don't recommend it. 

Also, since the code accepts SQLs, you can do exotic things like joins, functions and analytics. Again, I'd recommend against that on the 'Keep It Simple' principle.

And if you use weird column names (i.e. mixed or lower case, or strange characters that require quoted identifiers) you are on your own.

Tuesday, June 28, 2011

Check summing a table row

Oracle has a built-in package for optimistic locking called OWA_OPT_LOCK. This can be used to generate a checksum for any row like this:

select owa_opt_lock.checksum('SCOTT','EMP',ROWID)
from emp
where empno = 123;

This nugget fell into the bucket of "Things I discovered, but couldn't actually find when I wanted to use it". Then it cropped up as a Stackoverflow answer, so I'm putting it in here so I can find it.
Apparently it is a very simple checksum involving summing up the binary values and inverting it. It isn't a sophisticated hash designed to give low chances of collisions. The code isn't wrapped, so you can have a peek. It seems to do a simple concatenation of the values (implicitly converted to strings), so definitely isn't bulletproof or super-secure. Honestly, it wouldn't have done the job I was attempting when I couldn't find it.

If you want a simple sanity check that a small table on one database has the same contents as it does on another, then it will generally work okay. But for each call (and therefore each row, if you want it for an entire table's contents), it will do queries against all_tab_columns. So it won't be quick.
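To illustrate the weakness (a toy in the same spirit, not the actual OWA_OPT_LOCK algorithm), a checksum that concatenates the values and sums the bytes cannot tell anagrams apart:

```python
def naive_checksum(row_values):
    # Concatenate the (stringified) values and sum the bytes. The real
    # package also inverts the result, which doesn't reduce collisions.
    concatenated = ''.join(str(v) for v in row_values)
    return sum(concatenated.encode()) & 0xFFFF

# 'AB'||'C' and 'BA'||'C' contain the same bytes, so they collide:
print(naive_checksum(['AB', 'C']) == naive_checksum(['BA', 'C']))  # True
```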

Anything big, and I'd hack up my own select based on

select 'sum(ora_hash('||column_name||')),'
from all_tab_columns 
where table_name = '...'

which would also say WHICH columns had differences.
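If you are assembling that statement in a client language rather than by cut-and-paste, it is a one-liner (the table and column names here are hypothetical):

```python
def checksum_sql(table, columns):
    # Build one sum(ora_hash(...)) expression per column, comma-separated
    # (no trailing comma to trim, unlike the cut-and-paste version).
    exprs = ', '.join(f"sum(ora_hash({c})) {c}" for c in columns)
    return f"select {exprs} from {table}"

print(checksum_sql('emp', ['empno', 'ename']))
# select sum(ora_hash(empno)) empno, sum(ora_hash(ename)) ename from emp
```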

Wednesday, June 22, 2011

The exceptional case of the backwards null

Yes, Virginia, there is a unique NULL. The trick is to go backwards.

At my son's school, children in the younger years are buddied up with a child from an older year. Using this as example, I'm going to use the database to ensure that no child is buddied up to more than one other child. If we have an odd number of children, an unlucky one will have no buddy, but if our uniqueness constraint can ensure that there can only be one null, we can make sure we never have more than one 'un-buddied' child.

CREATE TABLE child
  (child_id    NUMBER NOT NULL,
   child_name  VARCHAR2(20) NOT NULL,
   buddy_id    NUMBER,
   CONSTRAINT child_pk PRIMARY KEY (child_id));

CREATE UNIQUE INDEX child_uk ON child (buddy_id);

insert into child  (child_id, child_name)
values  (1,'Adam');

insert into child  (child_id, child_name)
values  (2,'Bill');

insert into child  (child_id, child_name)
values  (3,'Chris');

update child set buddy_id = 3 where child_id != 3;
ERROR at line 1:
ORA-00001: unique constraint (PERFORM.CHILD_UK) violated

The UPDATE fails because of the uniqueness constraint. But it didn't object to multiple NULL values on the INSERTs.

Reading the documentation, we find out that DESC indexes will treat multiple NULLs as duplicates and so will not allow them.

Let's give it a try, clearing out the existing rows first (three NULLs would count as duplicates under the new index, and re-inserting child_id 1 would violate the primary key):

DROP INDEX child_uk;
DELETE FROM child;
CREATE UNIQUE INDEX child_uk ON child (buddy_id DESC);
insert into child  (child_id, child_name)
values  (1,'Adam');

insert into child  (child_id, child_name)
values  (2,'Bill');
ERROR at line 1:
ORA-00001: unique constraint (PERFORM.CHILD_UK) violated

Yup, documented behaviour matches actual behaviour.

Why ? Well, the DESC index is actually a function-based index using SYS_OP_DESCEND. That takes every byte of the input and subtracts its ascii value from 255. Then it sticks an extra byte of value 255 on the end (so the descending order has 'ABC' coming after 'ABCD'). The NULL gets converted to a single byte with the value 0.
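As a sketch of that transformation (my own Python approximation of the description above, not Oracle's actual implementation, which works on internal datatype formats):

```python
def sys_op_descend(value):
    if value is None:
        return bytes([0])            # NULL becomes a single 0x00 byte
    # Subtract each byte from 255, then append a 0xFF terminator.
    return bytes(255 - b for b in value.encode()) + bytes([255])

# Byte-ordering the transformed values gives descending order:
print(sys_op_descend('ABC') > sys_op_descend('ABCD'))   # True
# Every NULL maps to the same single byte, hence the 'duplicates':
print(sys_op_descend(None) == sys_op_descend(None))     # True
```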

At least in 10g, the optimizer isn't smart enough to use the index for a 'WHERE buddy_id IS NULL' predicate, but it will for one with a predicate of SYS_OP_DESCEND(buddy_id) = SYS_OP_DESCEND(null)

PS. I'm not enforcing referential integrity on the buddy_id, so children may end up with imaginary friends. I'm also not enforcing that if Fred's buddy is Barney, then Barney's buddy is Fred. Those are left as an exercise for the reader.

Monday, June 20, 2011

SQL Injection and the variable IN list

Pete Finnegan tweeted a link from a guy who discovered an SQL injection vulnerability on the website. Actually he discovered it over a year ago, and reported it. And still no-one has got around to fixing it yet.

It is an Oracle database (apparently - so they're not keeping up with patches either). And not only is there an SQL injection vulnerability, but if the SQL is invalid, it throws the whole statement out to the web-page as part of the error 'handling'.

The vulnerability is in a requirement which gets regular questions on stackoverflow. Specifically, given a string with a list of ids (such as '1,2,3,4'), how do I put that in an 'IN' list. 

The problem is thinking in specifics, not abstracts. You need to make the mental jump and realise that you have IN (:variable), not IN (:var1,:var2,:v3). 

So here's my recommendation for this. Firstly a function that will turn that CSV value into a collection. Here I use the built-in SYS.DBMS_DEBUG_VC2COLL, though it is tidier to create your own collection type.

create or replace function csv_to_list (i_list in varchar2)
  return sys.dbms_debug_vc2coll is
  v_list varchar2(4000) := i_list;
  v_ind  number;
  t_list sys.dbms_debug_vc2coll := sys.dbms_debug_vc2coll();
begin
  v_ind := instr(v_list,',');
  while v_ind > 0 loop
    t_list.extend;
    t_list(t_list.count) := substr(v_list,1,v_ind-1);
    v_list := substr(v_list,v_ind + 1);
    v_ind := instr(v_list,',');
  end loop;
  if v_list is not null then
    t_list.extend;
    t_list(t_list.count) := v_list;
  end if;
  return t_list;
end;

Next, you take that collection, and turn it into a dataset with the TABLE operator. Finally, you use that as the source for a single-column subquery.

select table_name from user_tables 
where blocks in 
  (select column_value from table(csv_to_list('1,2,3,4,5')));

Then you can BIND the variable rather than concatenating it into a dynamically executed statement.
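The same bind-don't-concatenate principle applies in any client language. As a sketch using Python's standard sqlite3 module (not Oracle, and a slightly different technique: one placeholder per value rather than a bound collection):

```python
import sqlite3

def names_for_ids(conn, csv):
    # Parse the CSV in code; the values only ever reach the SQL as binds.
    ids = [int(v) for v in csv.split(',')]
    placeholders = ','.join('?' * len(ids))
    sql = f"select name from t where id in ({placeholders}) order by id"
    return [row[0] for row in conn.execute(sql, ids)]

conn = sqlite3.connect(':memory:')
conn.execute("create table t (id integer, name text)")
conn.executemany("insert into t values (?, ?)",
                 [(1, 'a'), (2, 'b'), (3, 'c')])
print(names_for_ids(conn, '1,3'))  # ['a', 'c']
```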

Friday, June 03, 2011

Locked down in Oracle Amazon RDS

If you dig down into Oracle on Amazon RDS, you can find some security restrictions. These are things that Amazon won't let you do with your database, and they include the following:
·         Alter database
·         Alter system
·         Create any directory
·         Drop any directory
·         Grant any object privilege
·         Grant any privilege
·         Grant any role
·         Restricted session
BlueGecko notes a couple of others:
·         Datapump / imp / exp FULL_DATABASE privileges
·         Alter user, revoke or any DDL on SYS, SYSTEM or RDSADMIN users
·         Drop tablespace RDSADMIN
·         Create public synonyms for objects belonging to SYS, SYSTEM or RDSADMIN

Most of those are pretty obviously 'dangerous'. These are the items that, in an organisation with a hierarchy of DBAs, would only be available to the Top Gun. And in this environment, Amazon holds that position.

The 'CREATE ANY DIRECTORY' is one where you need to think evil before you work out why it is so dangerous. So I shall adopt the persona of the villainous toad, 'Baron Silas Greenback', sit back in my black leather chair and stroke my furry white pet while I explain the plot. Then you can escape my henchmen and save the day.

The plan is cunning in its simplicity. Simply create an Oracle directory that points to the OS directory containing the Oracle data files. Then I could use UTL_FILE to read and even update those datafiles directly, without going through the protective SQL layer. The operation through UTL_FILE is, as far as the operating system is concerned, indistinguishable from an official, controlled change through an INSERT statement. 

In fact any of the Oracle database files could be accessed this way. Control files, redo logs, audit trails... In short, directory privileges are REALLY nasty things.

Thursday, June 02, 2011

The www of

When I moved my blog to, I intentionally put it under a "blog" subdomain rather than at The idea was that I'd eventually get around to building a conventional website. By that I mean that I am in awe of Tim Hall's (content, structure, etc) and have delusions of grandeur about making something similar.

Apparently this subdomain split isn't great for search engine optimization as it treats them as different sites. I guess search engines don't want to draw assumptions for from So don't do this.

For over a year, it was pretty much Lorem Ipsum, with tumbleweeds rolling about and the sound of far-off whippoorwills in the hills.

Partly inspired by playing with the /seo and /seourl search tags in blekko, I spent some time thinking about Search Engine Optimization, and decided to get my www subdomain in some sort of order. It is all built using the standard Google Sites (which you can think of as a very simple, and free, Content Management System).

Most of it is pretty standard web site stuff. Bio, a couple of pages of links, some articles (mostly to see if I could write something with the aim of search results, rather than just as a brain dump) and copies of a few presentations I've made.

Then I tried to stretch things to some of the other Google gems.
Upcoming Events leverages Google Calendar, and currently shows up the Sydney Oracle Meetups calendar. I may merge in other calendars later.
The 'Contact Me' part is a form that feeds off to a Google Docs spreadsheet and notifies my GMail account.
The 'Book Me' part is a Google Apps gadget for "You Can Book Me", because I wanted to link in something from the Google Apps Marketplace.
Then there is which runs off Google Moderator. It's sort of a Q&A, crowd vote thingy. I'm not particularly keen on its fit for use. Apparently stuff there doesn't come up in search, you've got no control over formatting (which is a pain for anything code related) and the layout is pretty fixed. But if you've got any questions, feel free to sling them in.

I've also been trying out writing some Google Apps Engine code in Java. It's cool. I'm not sure whether it is the tutorial or the fact that I haven't had to try to install web servers, app servers and frameworks, but I've found it easier to focus on the coding process there than my previous attacks on the subject. Now if someone could write a decent Forum / FAQ / Q&A system in that.....

I fully intend to add some of my standard code packages to the site at some point. I've never been completely happy with large chunks of code in my blog, and this would be a tidier solution.

I'm open to other suggestions.

Wednesday, June 01, 2011

Licensing can be fun....NOT

Some notes to help me work out what I am, and more importantly, am NOT allowed to use.

  • Check your edition. V$VERSION will tell you if you are on Standard, Enterprise, Express or Personal Editions.
Enterprise Edition can use Diagnostics and Tuning Features if the option is licensed. Personal Edition includes "all of the options that are available with Enterprise Edition", except RAC. Shame it is Windows only. Standard and Express are definite No-No's

  • For 11gR2, check the parameter CONTROL_MANAGEMENT_PACK_ACCESS.
A canny DBA will have this set to NONE if the Diagnostics and Tuning pack are not licensed. The 11gR2 Beta for XE has this value set to NONE so maybe they'll do that as a default in future.

  • If the package dbms_awr is installed, then there's a good chance it was installed in order to disable AWR data gathering.
  • Check the name, detected_usages and currently_used columns from DBA_FEATURE_USAGE_STATISTICS

If a feature isn't disallowed or disabled, and it has been used, then there's a reasonable chance it is licensed; but check with the DBA anyway. At least you won't annoy them by asking about something that is obviously not allowed.

What is off limits (based on the 11th revision of 11gR2 licensing document - E10594-11) includes :

  • Reports :  awrrpt.sql, awrrpti.sql, awrgrpt.sql, awrgrpti.sql, awrgdrpt.sql, awrgdrpi.sql, addmrpt.sql, addmrpti.sql, ashrpt.sql, ashrpti.sql, awrddrpt.sql, awrddrpi.sql, awrsqrpi.sql, awrsqrpt.sql, awrextr.sql, awrload.sql, awrinfo.sql, spawrrac.sql

Generally you are licensed for "internal business operation". At the recent Sydney Oracle Meetup, Mogens Nørgaard pointed out that you aren't allowed to use your databases to train people outside your company. One exception is Express Edition which expressly (sorry) states "(c) you may use the programs to provide third party demonstrations and training; "

So if anyone around Sydney wants some training on Oracle SQL, PL/SQL or similar, I'm open to offers :) But you may want to wait until the XE 11gR2 goes live.