By Alexander Polishchuk
Selecting a contiguous data range, when the source data range is incomplete, requires producing the missing data and merging it with the source data set. The following article describes a method of generating the required data in Oracle and merging it via full outer join or union data set operator.
Consider an example when the source table has timestamp and item count columns and contains 200,000 rows. The timestamp column values are not contiguous. Here is a data sample from this table:
Timestamp |
Item Count |
10/01/2012 10:00 |
10 |
10/01/2012 10:01 |
3 |
10/01/2012 10:02 |
4 |
10/01/2012 10:05 |
1 |
We can use Oracle hierarchical query to generate contiguous timestamps between the start and end timestamp. In this query pseudo column LEVEL is used as minute increment from the starting timestamp, while the timestamp difference multiplied by number of minutes in a day (1440) serves as a selection range.
SELECT TO_DATE('10-JAN-2012') + (LEVEL - 1)/1440 FROM dual CONNECT BY LEVEL <= 1440 * (TO_DATE('11-JAN-2012') - TO_DATE('10-JAN-2012'))
Now both data sets have to be merged, so that all the data from the existing table is used and only the missing one is used from the generated one. Here is an example of such a query utilizing union data set operator.
SELECT time_stamp, item_count FROM items WHERE time_stamp >= TO_DATE('10-JAN-2012') AND time_stamp < TO_DATE('11-JAN-2012') UNION ALL SELECT ts, 0 FROM ( SELECT TO_DATE('10-JAN-2012') + (LEVEL - 1)/1440 AS ts FROM dual CONNECT BY LEVEL <= 1440 * (TO_DATE('11-JAN-2012') - TO_DATE('10-JAN-2012'))) WHERE NOT EXISTS ( SELECT 1 FROM items WHERE time_stamp = ts) ORDER BY 1
Here is a solution using full outer join to merge both data sets.
SELECT ts AS time_stamp, NVL(item_count, 0) AS item_count FROM ( SELECT TO_DATE('10-JAN-2012') + (LEVEL - 1)/1440 AS ts FROM dual CONNECT BY LEVEL <= 1440 * (TO_DATE('11-JAN-2012') - TO_DATE('10-JAN-2012'))) FULL OUTER JOIN items ON ( time_stamp = ts AND time_stamp >= TO_DATE('10-JAN-2012') AND time_stamp < TO_DATE('11-JAN-2012')) WHERE ts >= TO_DATE('10-JAN-2012') AND ts < TO_DATE('11-JAN-2012') ORDER BY 1
Both statements return the following result set with contiguous timestamp values when running on Oracle 11gR2.
Timestamp |
Item Count |
10/01/2012 10:00 |
10 |
10/01/2012 10:01 |
3 |
10/01/2012 10:02 |
4 |
10/01/2012 10:03 |
0 |
10/01/2012 10:04 |
0 |
10/01/2012 10:05 |
1 |
However, it is not the case on all Oracle releases. For example, Oracle 9i full outer join returns extra rows that should not be there. A workaround for this is to use the WITH clause, but it will create temporary tables and cause additional I/O.
The following execution plan is for the union statement:
Id |
Operation |
Name |
Rows |
Bytes |
Cost |
%CPU |
Time |
0 |
SELECT STATEMENT |
|
554 |
6089 |
9 |
12 |
00:00:01 |
1 |
SORT ORDER BY |
|
554 |
6089 |
8 |
38 |
00:00:01 |
2 |
UNION-ALL |
|
|
|
|
|
|
3 |
TABLE ACCESS BY INDEX ROWID |
ITEMS |
553 |
6083 |
5 |
0 |
00:00:01 |
*4 |
INDEX RANGE SCAN |
IND1 |
553 |
|
3 |
0 |
00:00:01 |
*5 |
FILTER |
|
|
|
|
|
|
6 |
VIEW |
|
1 |
6 |
2 |
0 |
00:00:01 |
*7 |
CONNECT BY WITHOUT FILTERING |
|
|
|
|
|
|
8 |
FAST DUAL |
|
1 |
|
2 |
0 |
00:00:01 |
*9 |
INDEX RANGE SCAN |
IND1 |
1 |
8 |
1 |
0 |
00:00:01 |
Predicate Information (identified by operation id): 4 - access("TIME_STAMP">=TO_DATE('2012-01-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND "TIME_STAMP"<TO_DATE('2012-01-11 00:00:00', 'syyyy-mm-dd hh24:mi:ss')) 5 - filter( NOT EXISTS (SELECT 0 FROM "ITEMS" "ITEMS" WHERE "TIME_STAMP"=:B1)) 7 - filter(LEVEL<=1440) 9 - access("TIME_STAMP"=:B1)
The following execution plan is for the full outer join statement:
Id |
Operation |
Name |
Rows |
Bytes |
Cost |
%CPU |
Time |
0 |
SELECT STATEMENT |
|
1 |
17 |
5 |
20 |
00:00:01 |
1 |
SORT ORDER BY |
|
1 |
17 |
5 |
20 |
00:00:01 |
2 |
NESTED LOOPS OUTER |
|
1 |
17 |
4 |
0 |
00:00:01 |
*3 |
VIEW |
|
1 |
6 |
2 |
0 |
00:00:01 |
*4 |
CONNECT BY WITHOUT FILTERING |
|
|
|
|
|
|
5 |
FAST DUAL |
|
1 |
|
2 |
0 |
00:00:01 |
6 |
TABLE ACCESS BY INDEX ROWID |
ITEMS |
1 |
11 |
2 |
0 |
00:00:01 |
*7 |
INDEX RANGE SCAN |
IND1 |
1 |
|
1 |
0 |
00:00:01 |
Predicate Information (identified by operation id): 3 - filter(INTERNAL_FUNCTION("from$_subquery$_001"."TS")>= TO_DATE('2012-01-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND INTERNAL_FUNCTION("from$_subquery$_001"."TS")< TO_DATE('2012-01-11 00:00:00', 'syyyy-mm-dd hh24:mi:ss')) 4 - filter(LEVEL<=1440) 7 - access("TIME_STAMP"(+)=INTERNAL_FUNCTION("TS")) filter("TIME_STAMP"(+)<TO_DATE('2012-01-11 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND "TIME_STAMP"(+)>=TO_DATE('2012-01-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
These execution plans show that the full outer join statement is better, than the union one. However, be aware that the execution plan may change in favor of union depending on data sets, query constraints, and indexes.
Using this approach allows efficient dynamic data range generation and saving the space, rather than storing unnecessary repetitive data in the database. Also, using the appropriate merging technique allows minimizing the required resources for the query execution.
About the author
Alex Polishchuk is the founder and president of Advanced Computer Consulting (www.advcomputerconsulting.com). Alex has over twenty years of professional experience designing, developing, and implementing database applications in various industries and companies ranging from small to Fortune 50 corporations. Alex’s primary area of expertise is in database performance optimization and tuning. He can be contacted at alex@advcomputerconsulting.com.