Thursday, January 07, 2010

Cardinality Estimates are not Probabilities

Reading a post by Charles Hooper, I had to express misgivings about the following paragraph. Discussion is continuing in the comments on that post.

"One of the problems with putting date values in number columns is this – if you select the range from 200810 to 200903, the optimizer will likely make the assumption that 200810 is just as likely of a number as 200808, 200812, 200814, 200816, 200820, 200890, 200900, etc. Some of those year/month combinations are simply not possible. In such a case, the optimizer should over-estimate the number of rows returned from that range when the column data type is NUMBER, and should be reasonably close when the column data type is DATE, since the optimizer knows that 200814 (2008-14-01), 200816 (2008-16-01), 200820 (2008-20-01), 200890 (2008-90-01), 200900 (2009-00-01), etc. could never be dates (and would be completely out of the serial sequence of dates). By putting the date type data into a DATE column, you have essentially added a constraint to the database to prevent invalid dates from being added. Additionally, date math, such as finding the number of days between 200802 and 200803 (compared to 200702 and 200703) is very simple – the answer is not 1 in both cases, but rather 29 and 28, respectively."

On the last point, while there may only be 28 or 29 days, that is still over 2.4 million seconds and there may be an entry for any second in that date range.
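
For anyone who wants to check that arithmetic, here is a quick side calculation of my own (not part of the original discussion), using the fact that subtracting two Oracle DATEs gives a number of days:

select (date '2008-03-01' - date '2008-02-01') * 24*60*60 as secs_2008,
       (date '2007-03-01' - date '2007-02-01') * 24*60*60 as secs_2007
from dual;

That gives 2,505,600 and 2,419,200 seconds respectively.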

My understanding is that the optimizer makes the assumption, which is generally valid, that the database actually contains the data you are looking for. It is not worried about possible (or even impossible) values, just recorded values. Anything you want to know is in the database.

When you make a query, the optimizer is not determining the probability of a match, but an estimate of the number of rows returned ASSUMING there is a match. For that it divides the number of rows by the number of distinct values.

If I have a date column in a table and query that table for a specific value of that column, the optimizer will never assume that it will return zero rows, or even less than one row. It doesn't matter if there are three or three million rows, or if there are two or two million distinct values. In fact, if I apply ten different predicates to that select, it will still always assume at least one match. Even if the predicates are mutually exclusive, it makes the same assumption.

Demo Time

drop table test_150 purge;
create table test_150 (id number(6,3));

insert into test_150 values (1);
insert into test_150 values (150);
insert into test_150 select dbms_random.value(1,150) from dual ;
insert into test_150 select dbms_random.value(1,150) from dual ;
insert into test_150 select dbms_random.value(1,150) from dual ;

insert into test_150 
select a.* from test_150 a, 
   (select 1 from dual connect by level < 30);
exec dbms_stats.gather_table_stats(user, 'TEST_150', -
      estimate_percent=>100, -
      method_opt=> 'FOR ALL COLUMNS SIZE 1');

select to_char(id,'999.999'), count(*) 
from test_150 group by id order by id;

TO_CHAR(    COUNT(*)
-------- -----------
   1.000       30.00
  60.781       30.00
  84.342       30.00
 111.778       30.00
 150.000       30.00

The optimizer knows there are five distinct values, ranging from 1 to 150, and 150 rows in total. With no histograms it will assume an even distribution, with every value occurring 30 times. That happens to be true in this case (but that shouldn't make any difference).
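
If you want to see the figures the optimizer is working from, you can query the data dictionary (a side query of mine, not part of the original demo):

select num_rows from user_tables where table_name = 'TEST_150';

select num_distinct, density
from   user_tab_col_statistics
where  table_name = 'TEST_150'
and    column_name = 'ID';

With 150 rows and 5 distinct values, a single-value predicate is estimated at 150/5 = 30 rows.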

set autotrace on
select count(*) from test_150 where id = 17.17;

Execution Plan
-------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |          |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| TEST_150 |    30 |   120 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------

Between 1 and 150, there are another 148 integers. Going down to three decimal places, that is roughly 150,000 potential values. Ignoring all that, it says that, in seeming defiance of the rules of probability, we will get 30 rows with that precise value. It is ignoring probability on the assumption that, if you are asking about a particular value, it is because it is in there.
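
As a quick check of the 'at least one match' claim from earlier, you can throw mutually exclusive predicates at the same table (a sketch to run yourself with autotrace on, not output from the demo above). However impossible the combination, the estimated Rows figure stays at one or more:

select count(*) from test_150 where id = 2 and id = 3;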

select count(*) from test_150 where id in (2,3,4,5);

Execution Plan
-------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |          |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| TEST_150 |   120 |   480 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------

In this case it assumes that all four values are in there, so 4 x 30 = 120 rows will be returned.
We can try something silly.

select count(*) from test_150 where id in (2,3,4,5,6,7);

Execution Plan
-------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |          |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| TEST_150 |   150 |   600 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------

It knows that there are 150 rows in the table, so obviously no more than that can be returned (at least as far as the estimates are concerned). Six values at 30 rows apiece would come to 180, so the estimate is capped at the 150 rows in the table.

Check constraints still obey the 'there must be one' rule. Rather than overriding the estimate, the optimizer simply adds a filter that stops that path being executed (which, incidentally, allows the cost to be ignored).

alter table test_150 add constraint tst_150_pos check (id > 0);

select count(*) from test_150 where id < -5;

Execution Plan
--------------------------------------------------------------------------------
| Id  | Operation           | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |          |     1 |     4 |     0   (0)|          |
|   1 |  SORT AGGREGATE     |          |     1 |     4 |            |          |
|*  2 |   FILTER            |          |       |       |            |          |
|*  3 |    TABLE ACCESS FULL| TEST_150 |    29 |   116 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(NULL IS NOT NULL)
   3 - filter("ID"<(-5))
  
An extension to this occurs with joins. Try this query against TEST_150 and any conveniently large table (here a table called C with about 2.3 million rows):

select to_char(count(*)) from c, test_150 t where t.id = 295 ;

Execution Plan
----------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |     1 |     4 |  1718   (6)| 00:00:21 |
|   1 |  SORT AGGREGATE       |          |     1 |     4 |            |          |
|   2 |   MERGE JOIN CARTESIAN|          |  1906K|  7448K|  1718   (6)| 00:00:21 |
|*  3 |    TABLE ACCESS FULL  | TEST_150 |     1 |     4 |     3   (0)| 00:00:01 |
|   4 |    BUFFER SORT        |          |  2367K|       |  1715   (6)| 00:00:21 |
|   5 |     TABLE ACCESS FULL | C        |  2367K|       |  1715   (6)| 00:00:21 |
----------------------------------------------------------------------------------

The id value is out of bounds, but it still shows up as 1 row. However, the 2.3 million rows in table "C" are reduced by about 20%: in effect it appears that the optimizer thinks 0.8 rows will be returned. These optimizer calculations probably vary between versions (this was XE).
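
One plausible reading (my assumption, not something stated in the plan) is the linear decay of selectivity the optimizer applies to values outside the known column range. The value 295 lies 145 beyond the high value of 150, against a column range of 149, and taking the 2367K rows of C from the plan at face value:

select round((1 - (295-150)/(150-1)) * (1/5) * 150, 3) as single_table_rows,
       round((1 - (295-150)/(150-1)) * (1/5) * 150 * 2367000) as join_rows
from dual;

That works out at roughly 0.8 rows for TEST_150 and about 1.9 million for the cartesian product, which lines up with the 1906K in the plan.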


Between


Earlier it estimated that the following would give 120 rows:
select count(*) from test_150 where id in (2,3,4,5);
However, if you try the following:
select count(*) from test_150 where id between 2 and 5;
-------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |          |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| TEST_150 |    34 |   136 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------


A different answer, and totally inconsistent with the estimate for the earlier IN query, which could only ever return a subset of these rows yet was estimated at 120. Firstly, it has assumed that 30 entries will match the value '2'. Then, with 150 entries spread between 1 and 150, it assumes roughly one entry per interval of '1', so it adds a potential match for 2, 2+1 (3), 2+2 (4) and 2+3 (5). It adds those 4 potential matches to the 30 to get 34.
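
The same arithmetic as a one-liner (my reading of the estimate above, not a statement of the exact internal formula):

select 30 + round(4 * 150/149) as estimated_rows from dual;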


select distinct id from test_150 where id between 2 and 16;
-------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |     2 |     8 |     4  (25)| 00:00:01 |
|   1 |  HASH UNIQUE       |          |     2 |     8 |     4  (25)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL| TEST_150 |    45 |   180 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------


You can see it now expects 45 matching rows with 2 distinct values for id (whereas with an upper bound of 15 it would only have 44 matches and 1 distinct value).
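
To see that for yourself, run the same query with the upper bound dropped to 15 and compare the autotrace plan:

select distinct id from test_150 where id between 2 and 15;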

PS. Depending on your database version, the optimizer may be completely different. I've used XE for this.

PPS. If you've read this far, go and get Jonathan Lewis' optimizer book.

1 comment:

Narendra said...

Gary,

Nice post. Thanks for the explanation.
"PS. Depending on your database version, the optimizer may be completely different. I've used XE for this."
That scares me the most and shakes my beliefs. :)