Oracle Contiguous Data Range via Full Outer Joins

By Alexander Polishchuk

Selecting a contiguous data range, when the source data range is incomplete, requires producing the missing data and merging it with the source data set. The following article describes a method of generating the required data in Oracle and merging it via full outer join or union data set operator.

Consider an example when the source table has timestamp and item count columns and contains 200,000 rows. The timestamp column values are not contiguous. Here is a data sample from this table:

 

 

Timestamp

Item Count

10/01/2012 10:00

10

10/01/2012 10:01

3

10/01/2012 10:02

4

10/01/2012 10:05

1

We can use Oracle hierarchical query to generate contiguous timestamps between the start and end timestamp. In this query pseudo column LEVEL is used as minute increment from the starting timestamp, while the timestamp difference multiplied by number of minutes in a day (1440) serves as a selection range.

SELECT TO_DATE('10-JAN-2012') + (LEVEL - 1)/1440
FROM dual
CONNECT BY LEVEL <= 1440 * (TO_DATE('11-JAN-2012') - TO_DATE('10-JAN-2012'))

Now both data sets have to be merged, so that all the data from the existing table is used and only the missing one is used from the generated one. Here is an example of such a query utilizing union data set operator.

SELECT time_stamp, item_count

FROM   items

WHERE  time_stamp >= TO_DATE('10-JAN-2012')

AND    time_stamp < TO_DATE('11-JAN-2012')

UNION ALL

SELECT ts, 0

FROM ( SELECT TO_DATE('10-JAN-2012') + (LEVEL - 1)/1440 AS ts

     FROM dual

     CONNECT BY LEVEL <= 1440 * (TO_DATE('11-JAN-2012') - TO_DATE('10-JAN-2012')))

WHERE NOT EXISTS (

     SELECT 1

     FROM   items

     WHERE  time_stamp = ts)

ORDER BY 1

Here is a solution using full outer join to merge both data sets.

SELECT ts AS time_stamp, NVL(item_count, 0) AS item_count

FROM ( SELECT TO_DATE('10-JAN-2012') + (LEVEL - 1)/1440 AS ts

     FROM dual

     CONNECT BY LEVEL <= 1440 * (TO_DATE('11-JAN-2012') - TO_DATE('10-JAN-2012')))

FULL OUTER JOIN items ON ( time_stamp = ts AND

     time_stamp >= TO_DATE('10-JAN-2012') AND

     time_stamp < TO_DATE('11-JAN-2012'))

WHERE  ts >= TO_DATE('10-JAN-2012')

AND    ts < TO_DATE('11-JAN-2012')

ORDER BY 1

Both statements return the following result set with contiguous timestamp values when running on Oracle 11gR2.

Timestamp

Item Count

10/01/2012 10:00

10

10/01/2012 10:01

3

10/01/2012 10:02

4

10/01/2012 10:03

0

10/01/2012 10:04

0

10/01/2012 10:05

1

However, it is not the case on all Oracle releases. For example, Oracle 9i full outer join returns extra rows that should not be there. A workaround for this is to use the WITH clause, but it will create temporary tables and cause additional I/O.

The following execution plan is for the union statement:

Id

Operation

Name

Rows

Bytes

Cost

%CPU

Time

0

SELECT STATEMENT

 

554

6089

9

12

00:00:01

1

 SORT ORDER BY

 

554

6089

8

38

00:00:01

2

  UNION-ALL

 

 

 

 

 

 

3

   TABLE ACCESS BY INDEX ROWID

ITEMS

553

6083

5

0

00:00:01

*4

    INDEX RANGE SCAN

IND1

553

 

3

0

00:00:01

*5

   FILTER

 

 

 

 

 

 

6

    VIEW

 

1

6

2

0

00:00:01

*7

     CONNECT BY WITHOUT FILTERING

 

 

 

 

 

 

8

      FAST DUAL

 

1

 

2

0

00:00:01

*9

    INDEX RANGE SCAN

IND1

1

8

1

0

00:00:01

Predicate Information (identified by operation id):

 

4 - access("TIME_STAMP">=TO_DATE('2012-01-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND

     "TIME_STAMP"<TO_DATE('2012-01-11 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

5 - filter( NOT EXISTS (SELECT 0 FROM "ITEMS" "ITEMS" WHERE "TIME_STAMP"=:B1))

7 - filter(LEVEL<=1440)

9 - access("TIME_STAMP"=:B1)

The following execution plan is for the full outer join statement:

Id

Operation

Name

Rows

Bytes

Cost

%CPU

Time

0

SELECT STATEMENT

 

1

17

5

20

00:00:01

1

  SORT ORDER BY

 

1

17

5

20

00:00:01

2

   NESTED LOOPS OUTER

 

1

17

4

0

00:00:01

*3

    VIEW

 

1

6

2

0

00:00:01

*4

     CONNECT BY WITHOUT FILTERING

 

 

 

 

 

 

5

      FAST DUAL

 

1

 

2

0

00:00:01

6

    TABLE ACCESS BY INDEX ROWID

ITEMS

1

11

2

0

00:00:01

*7

     INDEX RANGE SCAN

IND1

1

 

1

0

00:00:01

 

Predicate Information (identified by operation id):

 

3 - filter(INTERNAL_FUNCTION("from$_subquery$_001"."TS")>=

                     TO_DATE('2012-01-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND

     INTERNAL_FUNCTION("from$_subquery$_001"."TS")<

                     TO_DATE('2012-01-11 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

4 - filter(LEVEL<=1440)

7 - access("TIME_STAMP"(+)=INTERNAL_FUNCTION("TS"))

     filter("TIME_STAMP"(+)<TO_DATE('2012-01-11 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND

          "TIME_STAMP"(+)>=TO_DATE('2012-01-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

These execution plans show that the full outer join statement is better, than the union one. However, be aware that the execution plan may change in favor of union depending on data sets, query constraints, and indexes.

Using this approach allows efficient dynamic data range generation and saving the space, rather than storing unnecessary repetitive data in the database. Also, using the appropriate merging technique allows minimizing the required resources for the query execution.

About the author

Alex Polishchuk is the founder and president of Advanced Computer Consulting (www.advcomputerconsulting.com). Alex has over twenty years of professional experience designing, developing, and implementing database applications in various industries and companies ranging from small to Fortune 50 corporations. Alex’s primary area of expertise is in database performance optimization and tuning. He can be contacted at alex@advcomputerconsulting.com.

Get the Free Newsletter!

Subscribe to Cloud Insider for top news, trends & analysis

Latest Articles