Using Oracle's SQL Functions - Page 2
February 26, 2004
Let's extract the data used to compute the linear regression output:
SQL> select s.quantity_sold, p.prod_list_price 2 FROM sales s, products p 3 WHERE s.prod_id=p.prod_id AND 4 p.prod_category='Men' AND 5 s.time_id=to_DATE('10-OCT-2000') 6 AND s.channel_id = 'C'; QUANTITY_SOLD PROD_LIST_PRICE ------------- --------------- 17 53.9 17 49.9 22 89.5 7 45 3 78 16 42.5 12 52.9 7 69.9 17 68 6 135 1 54 16 42.5 25 69.9 18 54 1 54 6 115 5 59.5 13 69.9 22 42.5 12 64 20 rows selected.
This looks much easier to deal with. In fact, any decent statistics book will show you step-by-step how to determine the linear regression line. You can even take this data and plug it into Excel, which, conveniently, is shown below.
The values shown in this table were computed using the formulas listed below, but they could have been calculated using the Analysis Toolpak in Excel.
Here is the output for "C" where all the SYY, SXY, and SXX values are shown:
SELECT s.channel_id, REGR_SYY(s.quantity_sold, p.prod_list_price) SYY, REGR_SXY(s.quantity_sold, p.prod_list_price) SXY, REGR_SXX(s.quantity_sold, p.prod_list_price) SXX FROM sales s, products p WHERE s.prod_id=p.prod_id AND p.prod_category='Men' AND s.time_id=to_DATE('10-OCT-2000') AND s.channel_id = 'C' GROUP BY s.channel_id ; C SYY SXY SXX - ---------- ---------- ---------- C 1006.55 -755.885 11056.0095
At this point, we have everything we need to complete the linear regression problem. You can read what the equation of the linear regression line is by examining the first or second SQL*Plus outputs shown earlier: Y = 16.627808 - 0.0683687(X).
Hidden in the documentation is the meaning of "expr1" and "expr2." In mathematical terms, one of these is the "X" and one of these is the "Y." The order shown in the "REGR_" line, using "(s.quantity_sold, p.prod_list_price)" really means (Y, X) pairings, not (X, Y) as you would probably expect. We usually expect to see the pairings as (X, Y), domain/range, independent/dependent variables or predictor/response. In regression terminology, I guess Oracle decided to go with the "regression of Y on X," so remember to order the expressions in EXPR_ as Y followed by X.
A picture is worth a thousand words (note to my editor: don't worry, I'm almost done), so here is a picture of the "C" data. The picture shows the scatter plot (the actual X/list price versus the Y/quantity sold), and what the linear regression predictive model shows (the actual X/list price versus what the model would have predicted for Y/quantity sold).
The widely scattered diamond-shaped (blue) points fall into the hazy cloud type of grouping mentioned earlier. The square points (magenta or purple-ish) along a line represent what the regression model would predict for a given list price. As you can readily see, the scatter plot is nowhere near the regression line, but without the picture, we already knew that because of the very low R-squared value (roughly speaking, the goodness of fit; the higher the value, the better the fit). In fact, none of the channel_id's used in the SQL Reference guide example have a good fit (the "best" being the one for "C," but even then, it is very low).
In summary, two things come to mind. First, Oracle can be used to perform data analysis that other tools, expensive or otherwise, can perform. Like any other tool, you have to become familiar with the format of the input and output, so there is some overhead on learning that, regardless of the tool.
Second and perhaps most importantly, in my opinion, is the impression that Oracle has made a colossal blunder by not making these tools more user friendly or surfaced in a GUI interface in their database product line. The Business Intelligence module in Oracle Apps references SQL Analytics. Good job of making that feature pretty obscure and offering practically next to nothing about what is included with it. The number crunching capability in Oracle is already there (for the most part), so why not go the little bit extra and make Oracle competitive with SAS, SPSS, Minitab, or whomever? It would be infinitely harder for SAS to become a competitive relational database management system, but it would be a drop in the bucket for Oracle to become a competitive data analysis tool. If done, the word "STATSPACK" could take on a whole new meaning.