Data Densification, Demystified: Oracle 10g SQL Enhancements, Part 2
August 26, 2005
Synopsis. Oracle 10g has extended the already robust features of Structured Query Language (SQL) with a plethora of intriguing new capabilities. This article - the second in this series - reviews the new capabilities Oracle 10g provides that, in concert with analytical functions, permit the creation of "densified" output with a few simple queries.
The previous article in this series reviewed several new SQL features in Oracle 10g, including improved query and access methods for nested tables, upgrades to the MERGE command, and hierarchical query enhancements. Impressive as these new capabilities are, Oracle 10g's new data modeling features allow any Oracle developer or DBA to produce sophisticated, complex calculations using just the database engine, thus eliminating the need to massage data output via a third-party tool. Many of the new Oracle 10g query features depend upon the analytical functions established in Oracle 9i, so I will first delve into those features before discussing partitioned outer joins and data densification.
Analytical Functions: A Brief Introduction
I have been using SQL since 1985, and in my humble but biased opinion, Oracle's extensions to standard ANSI SQL are superior to those of its competitors. Oracle 9i added a complete suite of analytical functions to provide some impressive reporting and calculation capabilities.
Just about every major system I have worked on has at least one historical table, i.e. a table that stores multiple historical entries to track changes to a particular data domain over time. A good example of an historical table is the JOB_HISTORY table in the HR demo schema, which contains multiple rows for each employee detailing that employee's job assignments over time. A typical reporting requirement usually required against historical tables is the retrieval of either the least or most recent entry- for example, the least recent job assignment for a specific employee based on the employee's start date in that position.
If I were using another DBMS, I would probably have little choice but to construct the following SQL query and utilize a subquery to find the minimum starting date, and then use the result from that subquery to isolate the remainder of the employee's job history information:
SQL> TTITLE 'Earliest Employee Job History Entry (Traditional Method)' SQL> COL employee_id FORMAT 9999 HEADING 'EmpID' SQL> COL full_name FORMAT A24 HEADING 'Employee Name' SQL> COL job_id FORMAT A10 HEADING 'JobID' SQL> COL start_date FORMAT A10 HEADING 'Start Date' SQL> COL end_date FORMAT A10 HEADING 'End Date' SQL> SQL> SELECT 2 E.employee_id 3 ,E.last_name || ', ' || E.first_name full_name 4 ,JH.job_id 5 ,TO_CHAR(JH.start_date, 'mm-dd-yyyy') start_date 6 ,TO_CHAR(JH.end_date, 'mm-dd-yyyy') end_date 7 FROM 8 hr.employees E 9 ,hr.job_history JH 10 WHERE E.employee_id = JH.employee_id 11 AND JH.start_date = ( 12 SELECT MIN(start_date) 13 FROM hr.job_history 14 WHERE employee_id = 114) 15 AND JH.employee_id = 114; Earliest Employee Job History Entry (Traditional Method) EmpID Employee Name JobID Start Date End Date ----- ------------------------ ---------- ---------- ---------- 114 Raphaely, Den ST_CLERK 03-24-1998 12-31-1999
As of Oracle 9i, however, I have an alternative to this method: I can use an Oracle analytical function to return just the first entry in the list of job history entries for the specific employee. Briefly, here is how analytical functions perform their tasks:
Execution Order. A SQL statement that uses an analytical function first executes all joins and then processes all WHERE, GROUP BY, and HAVING statements. Oracle then hands off the result set to the analytical function for calculation and processing, and then finally any query-level ORDER BY statements are processed.
Partitioning of the Result Set. If the analytical function's interior query has any GROUP BY statements, Oracle will logically divide the result set into partitions (not to be confused with partitioned tables or indexes) based on the columns specified. A single partition can be as large as all the rows in the result set, or the partitions can be relatively tiny, depending on the chosen columns. An analytical function can operate against aggregated values in each result set partition.
The Current Row and "Windowing." As the analytical function processes each row in the result set partition, that row is marked as the current row. The current row becomes a reference point for any windowing operations. Windowing operations occur whenever an analytical function defines a sliding window to limit the selection of rows within a result set partition. This means that analytical processing can be limited to a range of rows based on a displacement of a specific number of rows from the current row in the result set partition (e.g., plus or minus 3 rows from the current row). Analytical processing can also be based on specified inclusion criteria.
Analytical functions are typically used heavily in decision-support and data warehousing reporting because they are extremely powerful for calculating rankings and percentiles within a result set, as well as calculating moving and cumulative aggregations based on a displacement from a specific row in the result set. It is also possible to calculate lagging and leading values within a partition, or first and last values within a partition. Here is an example of how to return the same results using Oracle 9i's FIRST_VALUE analytical function:
SQL> TTITLE 'Earliest Employee Job History Entry (Analytic Function)' SQL> COL eid FORMAT 9999 HEADING 'EmpID' SQL> COL name FORMAT A24 HEADING 'Employee Name' SQL> COL jobid FORMAT A10 HEADING 'JobID' SQL> COL begin_date FORMAT A10 HEADING 'Start Date' SQL> COL stop_date FORMAT A10 HEADING 'End Date' SQL> SQL> SELECT 2 eid 3 ,NAME 4 ,jobid 5 ,begin_date 6 ,stop_date 7 FROM ( 8 SELECT 9 JH.employee_id eid 10 ,E.last_name || ', ' || E.first_name NAME 11 ,JH.job_id jobid 12 ,JH.START_DATE begin_date 13 ,JH.end_date stop_date 14 ,FIRST_VALUE(JH.start_date) OVER ( 15 ORDER BY JH.employee_id, JH.start_date 16 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 17 ) recent_history 18 FROM 19 hr.employees E 20 ,hr.job_history JH 21 WHERE E.employee_id = JH.employee_id 22 AND JH.employee_id IN (114) 23 ) 24 WHERE begin_date = recent_history 25 ORDER BY eid; Earliest Employee Job History Entry (Analytic Function) EmpID Employee Name JobID Start Date End Date ----- ------------------------ ---------- ---------- ---------- 114 Raphaely, Den ST_CLERK 03/24/1998 12/31/1999
In this example, note that the FIRST_VALUE analytical function indicates the retrieval of a single result set partition, sorted within Employee ID and Start Date based on its ORDER BY clause. The ROWS directive tells Oracle to use all entries in each result set partition to determine which entry has the first (and therefore earliest) start date in each partition.
Listing 2.1 shows the DML statements I issued to add rows into the original JOB_HISTORY table to demonstrate these queries and the original queries themselves. An expanded query that shows how the PARTITION BY clause could be used within an analytical query to return the first job history entry for each employee is shown in Listing 2.2.