Data Densification, Demystified: Oracle 10g SQL Enhancements, Part 2

August 26, 2005

Synopsis. Oracle 10g has extended the already robust features of Structured Query Language (SQL) with a plethora of intriguing new capabilities. This article - the second in this series - reviews the new capabilities Oracle 10g provides that, in concert with analytical functions, permit the creation of "densified" output with a few simple queries.

The previous article in this series reviewed several new SQL features in Oracle 10g, including improved query and access methods for nested tables, upgrades to the MERGE command, and hierarchical query enhancements. Impressive as these new capabilities are, Oracle 10g's new data modeling features allow any Oracle developer or DBA to produce sophisticated, complex calculations using just the database engine, thus eliminating the need to massage data output via a third-party tool. Many of the new Oracle 10g query features depend upon the analytical functions established in Oracle 9i, so I will first delve into those features before discussing partitioned outer joins and data densification.

Analytical Functions: A Brief Introduction

I have been using SQL since 1985, and in my humble but biased opinion, Oracle's extensions to standard ANSI SQL are superior to those of its competitors. Oracle 9i added a complete suite of analytical functions to provide some impressive reporting and calculation capabilities.

Just about every major system I have worked on has at least one historical table, i.e. a table that stores multiple historical entries to track changes to a particular data domain over time. A good example of an historical table is the JOB_HISTORY table in the HR demo schema, which contains multiple rows for each employee detailing that employee's job assignments over time. A typical reporting requirement usually required against historical tables is the retrieval of either the least or most recent entry- for example, the least recent job assignment for a specific employee based on the employee's start date in that position.

If I were using another DBMS, I would probably have little choice but to construct the following SQL query and utilize a subquery to find the minimum starting date, and then use the result from that subquery to isolate the remainder of the employee's job history information:

SQL> TTITLE 'Earliest Employee Job History Entry (Traditional Method)'
SQL> COL employee_id     FORMAT 9999     HEADING 'EmpID'
SQL> COL full_name       FORMAT A24      HEADING 'Employee Name'
SQL> COL job_id          FORMAT A10      HEADING 'JobID'
SQL> COL start_date      FORMAT A10      HEADING 'Start Date'
SQL> COL end_date        FORMAT A10      HEADING 'End Date'
SQL>
SQL> SELECT
  2      E.employee_id
  3     ,E.last_name || ', ' || E.first_name full_name
  4     ,JH.job_id
  5     ,TO_CHAR(JH.start_date, 'mm-dd-yyyy') start_date
  6     ,TO_CHAR(JH.end_date, 'mm-dd-yyyy') end_date
  7    FROM
  8      hr.employees E
  9     ,hr.job_history JH
 10   WHERE E.employee_id =  JH.employee_id
 11     AND JH.start_date = (
 12      SELECT MIN(start_date)
 13        FROM hr.job_history
 14       WHERE employee_id = 114)
 15     AND JH.employee_id = 114;
            Earliest Employee Job History Entry (Traditional Method)
EmpID Employee Name            JobID      Start Date End Date
----- ------------------------ ---------- ---------- ----------
  114 Raphaely, Den            ST_CLERK   03-24-1998 12-31-1999

As of Oracle 9i, however, I have an alternative to this method: I can use an Oracle analytical function to return just the first entry in the list of job history entries for the specific employee. Briefly, here is how analytical functions perform their tasks:

Execution Order. A SQL statement that uses an analytical function first executes all joins and then processes all WHERE, GROUP BY, and HAVING statements. Oracle then hands off the result set to the analytical function for calculation and processing, and then finally any query-level ORDER BY statements are processed.

Partitioning of the Result Set. If the analytical function's interior query has any GROUP BY statements, Oracle will logically divide the result set into partitions (not to be confused with partitioned tables or indexes) based on the columns specified. A single partition can be as large as all the rows in the result set, or the partitions can be relatively tiny, depending on the chosen columns. An analytical function can operate against aggregated values in each result set partition.

The Current Row and "Windowing." As the analytical function processes each row in the result set partition, that row is marked as the current row. The current row becomes a reference point for any windowing operations. Windowing operations occur whenever an analytical function defines a sliding window to limit the selection of rows within a result set partition. This means that analytical processing can be limited to a range of rows based on a displacement of a specific number of rows from the current row in the result set partition (e.g., plus or minus 3 rows from the current row). Analytical processing can also be based on specified inclusion criteria.

Analytical functions are typically used heavily in decision-support and data warehousing reporting because they are extremely powerful for calculating rankings and percentiles within a result set, as well as calculating moving and cumulative aggregations based on a displacement from a specific row in the result set. It is also possible to calculate lagging and leading values within a partition, or first and last values within a partition. Here is an example of how to return the same results using Oracle 9i's FIRST_VALUE analytical function:

SQL> TTITLE 'Earliest Employee Job History Entry (Analytic Function)'
SQL> COL eid             FORMAT 9999     HEADING 'EmpID'
SQL> COL name            FORMAT A24      HEADING 'Employee Name'
SQL> COL jobid           FORMAT A10      HEADING 'JobID'
SQL> COL begin_date      FORMAT A10      HEADING 'Start Date'
SQL> COL stop_date       FORMAT A10      HEADING 'End Date'
SQL> 
SQL> SELECT
  2       eid
  3      ,NAME
  4      ,jobid
  5      ,begin_date
  6      ,stop_date
  7    FROM (
  8      SELECT
  9          JH.employee_id eid
 10         ,E.last_name || ', ' || E.first_name  NAME
 11         ,JH.job_id jobid
 12         ,JH.START_DATE begin_date
 13         ,JH.end_date stop_date
 14         ,FIRST_VALUE(JH.start_date) OVER (
 15              ORDER BY JH.employee_id, JH.start_date
 16              ROWS BETWEEN UNBOUNDED PRECEDING
                      AND UNBOUNDED FOLLOWING
 17              ) recent_history
 18        FROM
 19          hr.employees E
 20         ,hr.job_history JH
 21       WHERE E.employee_id =  JH.employee_id
 22         AND JH.employee_id IN (114)
 23   )
 24   WHERE begin_date = recent_history
 25   ORDER BY eid;
            Earliest Employee Job History Entry (Analytic Function)
EmpID Employee Name            JobID      Start Date End Date
----- ------------------------ ---------- ---------- ----------
  114 Raphaely, Den            ST_CLERK   03/24/1998 12/31/1999

In this example, note that the FIRST_VALUE analytical function indicates the retrieval of a single result set partition, sorted within Employee ID and Start Date based on its ORDER BY clause. The ROWS directive tells Oracle to use all entries in each result set partition to determine which entry has the first (and therefore earliest) start date in each partition.

Listing 2.1 shows the DML statements I issued to add rows into the original JOB_HISTORY table to demonstrate these queries and the original queries themselves. An expanded query that shows how the PARTITION BY clause could be used within an analytical query to return the first job history entry for each employee is shown in Listing 2.2.








The Network for Technology Professionals

Search:

About Internet.com

Legal Notices, Licensing, Permissions, Privacy Policy.
Advertise | Newsletters | E-mail Offers