Thursday, June 30, 2011

More on Checksumming

Prompted by a comment I realise that I need to put some more flesh on Checksumming.

WHY ?

A good use case is regression testing. A decent regression test will include a script to set up data so you can execute your task (preferably in an automated manner) and check your actual results match your expected results. That is pretty easy if the expected result is a value returned by a function. When it is a data change, it gets a bit trickier.

You can set up your expected results in a table (or a set of tables matching your main schema), run the tests and check the actual result data matches the expected result data. That can be as simple as

exec comp_tables('TEST_ACTUAL','TEST_EXPECTED');

Depending on your testing mechanisms, that can raise an exception if it fails, return an error code or record the results in a table. My example code just dumps it out to DBMS_OUTPUT. Also, a regression test will probably have expected results in several tables, so you'll need multiple comparisons.

Just to complicate things, often the tables will contain sequence generated ids or use the current date as a 'created / modified date'. So you might want to exclude some columns (or rows) from comparison. 

begin
  comp_tables(
   'SELECT val_char, val_number from TEST_ACTUAL WHERE val_char=''ABC''',
   'SELECT val_char, val_number from TEST_EXPECTED WHERE val_char=''ABC''');
end;
/

HOW ?

I've uploaded my code to the SydOracle CodeSpace location as Comp_Tables.sql
It works in several steps.

1. Take the entered query (eg 'SELECT 1 a, 2 b FROM DUAL') and determine the columns.

2. Create a wrapper query around that. That would give
SELECT ora_hash(a) a, ora_hash(b) b 
FROM (SELECT 1 a, 2 b FROM DUAL)
The hashes turn all the values (of the main datatypes) into numbers, which makes them easy to aggregate, and comparing aggregated values is easy. The hashing also practically guarantees no collisions. That would give a result set like 

         A             B
2342552567    2064090006

3. Create another wrapper to mock up a hash for each row.
SELECT mod(a,100) * mod(b,100) row_hash, h.* FROM
  (SELECT ora_hash(a) a, ora_hash(b) b 
   FROM (SELECT 1 a, 2 b FROM DUAL)) h

Paraphrasing Eric Morecombe, it is fine having all the right columns, but you also want them in all the right order. These results sets will show the same total hashes for columns 'A' and 'B' as the contents of those columns are identical. if you have one row of (1,'A') and a second of (2,'B'), a 'column-only' match would quite happily accept (1,'B') / (2,'A') since all the column values are identical.

By creating another column based on all the values for the entire row, we get a summary of the row. We can't simply add the hashes from the columns together as the sum of all the 'A's plus the sum of all the 'B's is mathematically the same as the sum of the As+Bs. In short, we'd lose the checksumming effect as we aggregate the row values. So I've used multiplication with a MOD so the numbers are unlikely to get too big.

ROW_HASH           A           B
     402  2342552567  2064090006

4. Finally we aggregate the values for all the rows. I've used SUM, though AVG would work just as well (and would give smaller, more manageable values for larger datasets). That gives an SQL like

SELECT SUM(row_hash),SUM(a),SUM(b)
FROM (SELECT mod(a,100) * mod(b,100) row_hash, h.*
      FROM(SELECT ora_hash(a) a, ora_hash(b) b
           FROM (SELECT rownum a, 2 b 
                 FROM DUAL CONNECT BY LEVEL < 5)) h)
 
(I've changed the SQL here to give multiple rows, so you can see the effect of the sum).

ROW_HASH             A             B
     402    2342552567    2064090006
      36    2064090006    2064090006
     354    2706503459    2064090006
     186    3217185531    2064090006
========================================
     978   10330331563    8256360024


5. That gives us a one record encapsulation of an entire table (or subset of a table).
If we do that once for the 'actual' and once for the 'expected' we can compare those two records and see where they match.

So if we compare the previous SQL with a similar one with a different value from 'B'

SELECT SUM(row_hash),SUM(a),SUM(b)
FROM (SELECT mod(a,100) * mod(b,100) row_hash, h.*
      FROM(SELECT ora_hash(a) a, ora_hash(b) b
           FROM (SELECT rownum a, 3 b 
                 FROM DUAL CONNECT BY LEVEL < 5)) h)

ROW_HASH            A            B
    9617  10330331563  10826013836

In the above example, the final row shows us that column 'A' matches but 'B' does not. We would then investigate the code to find out why 'B' wasn't always being derived as expected.

Caveats

While you could just store the aggregate hash of the expected results, that would prevent you drilling down to investigate issues, so I don't recommend it. 

Also, since the code accepts SQLs, you can do exotic things like joins, functions and analytics. Again, I'd recommend against it on the 'Keep It Simple' principle.

And if you use weird column names (ie mixed or lower case with strange characters that require quoted identifiers) you are on your own.

No comments: