Data Densification, Demystified: Oracle 10g SQL Enhancements, Part 2 - Page 2
August 26, 2005
Oracle 10g Enhancements to Analytical Functions
One morning I arrived early at my client site to find a developer running multiple SQL queries, then feverishly transferring the resulting output into a Microsoft Excel spreadsheet. When I asked her the reason for her frantic activity, she told me she needed to perform some complex financial calculations on the results, but did not know how to perform them from within Oracle.
I took an informal poll of the developers I supported, and I found out that many of them were still creating queries to extract raw data from tables and/or views, but then dumping that output to comma-delimited files and then into a spreadsheet so that the raw data could be manipulated via Excel's analytical functions. Some typical requirements these developers were pursuing included these scenarios:
The good news is that Oracle 10g has beefed up analytical function processing so that all these scenarios can be handled by simply running an appropriate query directly against the database. I will cover the first scenario in the following sections, and I will tackle the last two scenarios in my next article when I cover the MODEL clause in depth.
Using Partitioned Outer Joins To "Densify" Data
Oracle 10g offers a new type of join, a partitioned outer join, that not only lets me join disparate sets of data, but also allows me to fill in "gaps" in the result sets because some data might be missing. For example, even though sales of a particular product may exist in most of the states in the Midwest, it is entirely possible that no sales exist in a particular calendar year for that product because of the time period in which it was introduced.
Oracle calls the concept of filling in missing data with partitioned outer joins data densification. To illustrate, I will create a new view, SH.CALENDAR_YEARS, that contains all existing calendar years in the SH.TIMES time periods table and adds in all current calendar years up to and including 2005. I will also create a new view named SH.SALES_MIDWEST_ONLY that will gather only a small subset of sales data from the sales history table, SALES. (I am using a small subset only to delineate the results of data densification; in real life, of course, I would use all the available sales data.)
Next, I will construct a query that uses a partitioned outer join to link together sales data from the SH.SALES_MIDWEST_ONLY view with a larger subset of Middle Western U.S. states to illustrate how data densification can be implemented:
SQL> TTITLE 'Partitioned Outer Join Example #1' SQL> COL prod FORMAT A32 HEADING 'Product' SQL> COL state FORMAT A4 HEADING 'State' SQL> COL year FORMAT 9999 HEADING 'JobID' SQL> COL tot_sales FORMAT 999,999,999.99 HEADING 'Total Sales' SQL> SQL> SELECT 2 SLS.prod 3 ,RGNS.state 4 ,SLS.year 5 ,NVL(SLS.sales,0) tot_sales 6 FROM 7 (SELECT 8 MWO.state 9 ,MWO.prod 10 ,MWO.year 11 ,SUM(MWO.sale) sales 12 FROM 13 sh.sales_midwest_only MWO 14 GROUP BY 15 MWO.state 16 ,MWO.prod 17 ,MWO.YEAR 18 ) SLS 19 PARTITION BY (SLS.prod) 20 RIGHT OUTER JOIN 21 (SELECT 22 DISTINCT cust_state_province state 23 FROM sh.customers 24 WHERE cust_state_province IN ('IA', 'IL','IN','MI','OH','WI') 25 ) RGNS 26 ON (RGNS.state = SLS.state) 27 ORDER BY SLS.prod, SLS.state, SLS.year; Partitioned Outer Join Example #1 Product Stat JobID Total Sales -------------------------------- ---- ----- --------------- And 2 Crosscourt Tee Kids IL 1998 826.00 And 2 Crosscourt Tee Kids IL 1999 1,187.90 And 2 Crosscourt Tee Kids IL 2000 2,619.40 And 2 Crosscourt Tee Kids IN 1998 266.00 And 2 Crosscourt Tee Kids IN 2000 2,086.00 And 2 Crosscourt Tee Kids MI 1999 53.20 And 2 Crosscourt Tee Kids MI 2000 1,209.60 And 2 Crosscourt Tee Kids WI 1998 686.00 And 2 Crosscourt Tee Kids WI 1999 504.00 And 2 Crosscourt Tee Kids WI 2000 378.00 And 2 Crosscourt Tee Kids IA .00 And 2 Crosscourt Tee Kids OH .00 Coin Pocket Twill Cargo Trousers MI 1999 234.00 Coin Pocket Twill Cargo Trousers MI 2000 78.00 Coin Pocket Twill Cargo Trousers WI 1999 722.15 Coin Pocket Twill Cargo Trousers WI 2000 390.00 Coin Pocket Twill Cargo Trousers IA .00 Coin Pocket Twill Cargo Trousers IL .00 Coin Pocket Twill Cargo Trousers OH .00 Coin Pocket Twill Cargo Trousers IN .00 Gurfield& Murks Pleated Trousers IL 1998 9,100.00 Gurfield& Murks Pleated Trousers IL 1999 13,825.00 Gurfield& Murks Pleated Trousers IL 2000 2,100.00 Gurfield& Murks Pleated Trousers IN 1999 7,525.00 Gurfield& Murks Pleated Trousers MI 1998 26,040.00 Gurfield& Murks Pleated Trousers MI 1999 12,600.00 Gurfield& Murks Pleated Trousers MI 2000 19,425.00 Gurfield& Murks Pleated Trousers WI 1998 175.00 Gurfield& Murks Pleated Trousers IA .00 Gurfield& Murks Pleated Trousers OH .00 Kahala Pleated Chino Short IL 1998 504.00 Kahala Pleated Chino Short IL 1999 3,738.00 Kahala Pleated Chino Short IA .00 Kahala Pleated Chino Short IN .00 Kahala Pleated Chino Short MI .00 Kahala Pleated Chino Short OH .00 Kahala Pleated Chino Short WI .00 37 rows selected.
Listing 2.3 shows the creation of both of these views, the initial query and this complete result set.
N-Dimensional Data Densification
The previous query does produce zeroed totals for those remaining states (i.e. Ohio and Iowa) which are present in the second subquery but not in the first subquery. That is better than nothing, but unfortunately, data is only present for the years 1999, 2000 and 2001 (the only time periods present in the SALES table's data). However, there is nothing to stop me from employing multiple partitioned outer joins to populate more than one data dimension.
Here is an example of how to accomplish this via one of my favorite, elegant SQL features that Oracle provides -- the WITH clause -- to specify the three component subqueries before invoking the multiple partitioned outer joins. The resulting output of this query will contain a fully "densified" result set:
SQL> TTITLE 'Partitioned Outer Join Densifying Across Two Dimensions' SQL> COL prod FORMAT A32 HEADING 'Product' SQL> COL state FORMAT A4 HEADING 'State' SQL> COL year FORMAT 9999 HEADING 'JobID' SQL> COL tot_sales FORMAT 999,999,999.99 HEADING 'Total Sales' SQL> SQL> WITH 2 -- Gather all Midwest-only sales data 3 Q1 AS ( 4 SELECT 5 prod 6 ,state 7 ,year 8 ,SUM(sale) sale 9 FROM 10 sh.sales_midwest_only 11 GROUP BY state, prod, year 12 ), 13 -- Gather a list of distinct states 14 Q2 AS ( 15 SELECT DISTINCT cust_state_province state 16 FROM sh.customers 17 WHERE cust_state_province IN ('IA', 'IL','IN','MI','OH','WI') 18 ), 19 -- Gather a list of distinct calendar years 20 Q3 AS ( 21 SELECT calendar_year year 22 FROM sh.calendar_years 23 ) 24 SELECT 25 Q4.prod 26 ,Q4.state 27 ,Q3.year 28 ,NVL(Q4.sale, 0) tot_sales 29 FROM 30 (SELECT 31 Q1.prod 32 ,Q2.state 33 ,Q1.year 34 ,Q1.sale 35 FROM Q1 36 PARTITION BY (prod) 37 RIGHT OUTER JOIN Q2 38 ON (Q1.state = Q2.state) 39 ) Q4 40 PARTITION BY (prod,state) 41 RIGHT OUTER JOIN Q3 42 ON (Q4.YEAR = Q3.year) 43 ORDER BY 1, 2, 3;
A fully documented version of this query as well as its complete result set is shown in Listing 2.4.
The new analytical function enhancements in Oracle 10g give developers and DBAs significantly more horsepower for advanced data modeling -- without the use of third-party software to perform these analyses. These new features make Structured Query Language within Oracle 10g even more robust and flexible. The next article in this series will concentrate on the new MODEL clause, an extremely powerful enhancement that Oracle 10g provides that significantly extends Oracle's robust analytical reporting capabilities.
References and Additional Reading
While there is no substitute for direct experience, reading the manual is not a bad idea, either. I have drawn upon the following Oracle 10g documentation for the deeper technical details of this article:
B10736-01 Oracle Database Data Warehousing Guide, Ch. 21
B10750-01 Oracle Database New Features Guide
B10759-01 SQL Reference