Data Densification, Demystified: Oracle 10g SQL Enhancements, Part 2 - Page 2August 26, 2005 Oracle 10g Enhancements to Analytical FunctionsOne morning I arrived early at my client site to find a developer running multiple SQL queries, then feverishly transferring the resulting output into a Microsoft Excel spreadsheet. When I asked her the reason for her frantic activity, she told me she needed to perform some complex financial calculations on the results, but did not know how to perform them from within Oracle. I took an informal poll of the developers I supported, and I found out that many of them were still creating queries to extract raw data from tables and/or views, but then dumping that output to comma-delimited files and then into a spreadsheet so that the raw data could be manipulated via Excel's analytical functions. Some typical requirements these developers were pursuing included these scenarios:
The good news is that Oracle 10g has beefed up analytical function processing so that all these scenarios can be handled by simply running an appropriate query directly against the database. I will cover the first scenario in the following sections, and I will tackle the last two scenarios in my next article when I cover the MODEL clause in depth. Using Partitioned Outer Joins To "Densify" DataOracle 10g offers a new type of join, a partitioned outer join, that not only lets me join disparate sets of data, but also allows me to fill in "gaps" in the result sets because some data might be missing. For example, even though sales of a particular product may exist in most of the states in the Midwest, it is entirely possible that no sales exist in a particular calendar year for that product because of the time period in which it was introduced. Oracle calls the concept of filling in missing data with partitioned outer joins data densification. To illustrate, I will create a new view, SH.CALENDAR_YEARS, that contains all existing calendar years in the SH.TIMES time periods table and adds in all current calendar years up to and including 2005. I will also create a new view named SH.SALES_MIDWEST_ONLY that will gather only a small subset of sales data from the sales history table, SALES. (I am using a small subset only to delineate the results of data densification; in real life, of course, I would use all the available sales data.) Next, I will construct a query that uses a partitioned outer join to link together sales data from the SH.SALES_MIDWEST_ONLY view with a larger subset of Middle Western U.S. states to illustrate how data densification can be implemented:
SQL> TTITLE 'Partitioned Outer Join Example #1'
SQL> COL prod FORMAT A32 HEADING 'Product'
SQL> COL state FORMAT A4 HEADING 'State'
SQL> COL year FORMAT 9999 HEADING 'JobID'
SQL> COL tot_sales FORMAT 999,999,999.99 HEADING 'Total Sales'
SQL>
SQL> SELECT
2 SLS.prod
3 ,RGNS.state
4 ,SLS.year
5 ,NVL(SLS.sales,0) tot_sales
6 FROM
7 (SELECT
8 MWO.state
9 ,MWO.prod
10 ,MWO.year
11 ,SUM(MWO.sale) sales
12 FROM
13 sh.sales_midwest_only MWO
14 GROUP BY
15 MWO.state
16 ,MWO.prod
17 ,MWO.YEAR
18 ) SLS
19 PARTITION BY (SLS.prod)
20 RIGHT OUTER JOIN
21 (SELECT
22 DISTINCT cust_state_province state
23 FROM sh.customers
24 WHERE cust_state_province IN ('IA', 'IL','IN','MI','OH','WI')
25 ) RGNS
26 ON (RGNS.state = SLS.state)
27 ORDER BY SLS.prod, SLS.state, SLS.year;
Partitioned Outer Join Example #1
Product Stat JobID Total Sales
-------------------------------- ---- ----- ---------------
And 2 Crosscourt Tee Kids IL 1998 826.00
And 2 Crosscourt Tee Kids IL 1999 1,187.90
And 2 Crosscourt Tee Kids IL 2000 2,619.40
And 2 Crosscourt Tee Kids IN 1998 266.00
And 2 Crosscourt Tee Kids IN 2000 2,086.00
And 2 Crosscourt Tee Kids MI 1999 53.20
And 2 Crosscourt Tee Kids MI 2000 1,209.60
And 2 Crosscourt Tee Kids WI 1998 686.00
And 2 Crosscourt Tee Kids WI 1999 504.00
And 2 Crosscourt Tee Kids WI 2000 378.00
And 2 Crosscourt Tee Kids IA .00
And 2 Crosscourt Tee Kids OH .00
Coin Pocket Twill Cargo Trousers MI 1999 234.00
Coin Pocket Twill Cargo Trousers MI 2000 78.00
Coin Pocket Twill Cargo Trousers WI 1999 722.15
Coin Pocket Twill Cargo Trousers WI 2000 390.00
Coin Pocket Twill Cargo Trousers IA .00
Coin Pocket Twill Cargo Trousers IL .00
Coin Pocket Twill Cargo Trousers OH .00
Coin Pocket Twill Cargo Trousers IN .00
Gurfield& Murks Pleated Trousers IL 1998 9,100.00
Gurfield& Murks Pleated Trousers IL 1999 13,825.00
Gurfield& Murks Pleated Trousers IL 2000 2,100.00
Gurfield& Murks Pleated Trousers IN 1999 7,525.00
Gurfield& Murks Pleated Trousers MI 1998 26,040.00
Gurfield& Murks Pleated Trousers MI 1999 12,600.00
Gurfield& Murks Pleated Trousers MI 2000 19,425.00
Gurfield& Murks Pleated Trousers WI 1998 175.00
Gurfield& Murks Pleated Trousers IA .00
Gurfield& Murks Pleated Trousers OH .00
Kahala Pleated Chino Short IL 1998 504.00
Kahala Pleated Chino Short IL 1999 3,738.00
Kahala Pleated Chino Short IA .00
Kahala Pleated Chino Short IN .00
Kahala Pleated Chino Short MI .00
Kahala Pleated Chino Short OH .00
Kahala Pleated Chino Short WI .00
37 rows selected.
Listing 2.3 shows the creation of both of these views, the initial query and this complete result set. N-Dimensional Data DensificationThe previous query does produce zeroed totals for those remaining states (i.e. Ohio and Iowa) which are present in the second subquery but not in the first subquery. That is better than nothing, but unfortunately, data is only present for the years 1999, 2000 and 2001 (the only time periods present in the SALES table's data). However, there is nothing to stop me from employing multiple partitioned outer joins to populate more than one data dimension. Here is an example of how to accomplish this via one of my favorite, elegant SQL features that Oracle provides -- the WITH clause -- to specify the three component subqueries before invoking the multiple partitioned outer joins. The resulting output of this query will contain a fully "densified" result set:
SQL> TTITLE 'Partitioned Outer Join Densifying Across Two Dimensions'
SQL> COL prod FORMAT A32 HEADING 'Product'
SQL> COL state FORMAT A4 HEADING 'State'
SQL> COL year FORMAT 9999 HEADING 'JobID'
SQL> COL tot_sales FORMAT 999,999,999.99 HEADING 'Total Sales'
SQL>
SQL> WITH
2 -- Gather all Midwest-only sales data
3 Q1 AS (
4 SELECT
5 prod
6 ,state
7 ,year
8 ,SUM(sale) sale
9 FROM
10 sh.sales_midwest_only
11 GROUP BY state, prod, year
12 ),
13 -- Gather a list of distinct states
14 Q2 AS (
15 SELECT DISTINCT cust_state_province state
16 FROM sh.customers
17 WHERE cust_state_province IN ('IA', 'IL','IN','MI','OH','WI')
18 ),
19 -- Gather a list of distinct calendar years
20 Q3 AS (
21 SELECT calendar_year year
22 FROM sh.calendar_years
23 )
24 SELECT
25 Q4.prod
26 ,Q4.state
27 ,Q3.year
28 ,NVL(Q4.sale, 0) tot_sales
29 FROM
30 (SELECT
31 Q1.prod
32 ,Q2.state
33 ,Q1.year
34 ,Q1.sale
35 FROM Q1
36 PARTITION BY (prod)
37 RIGHT OUTER JOIN Q2
38 ON (Q1.state = Q2.state)
39 ) Q4
40 PARTITION BY (prod,state)
41 RIGHT OUTER JOIN Q3
42 ON (Q4.YEAR = Q3.year)
43 ORDER BY 1, 2, 3;
A fully documented version of this query as well as its complete result set is shown in Listing 2.4. ConclusionThe new analytical function enhancements in Oracle 10g give developers and DBAs significantly more horsepower for advanced data modeling -- without the use of third-party software to perform these analyses. These new features make Structured Query Language within Oracle 10g even more robust and flexible. The next article in this series will concentrate on the new MODEL clause, an extremely powerful enhancement that Oracle 10g provides that significantly extends Oracle's robust analytical reporting capabilities. References and Additional ReadingWhile there is no substitute for direct experience, reading the manual is not a bad idea, either. I have drawn upon the following Oracle 10g documentation for the deeper technical details of this article: B10736-01 Oracle Database Data Warehousing Guide, Ch. 21 B10750-01 Oracle Database New Features Guide B10759-01 SQL Reference |