# Using Oracle's SQL Functions - Page 2

February 26, 2004

Let's extract the data used to compute the linear regression output:

```SQL> select s.quantity_sold, p.prod_list_price
2  FROM  sales s, products p
3  WHERE s.prod_id=p.prod_id AND
4  p.prod_category='Men'  AND
5  s.time_id=to_DATE('10-OCT-2000')
6  AND s.channel_id = 'C';

QUANTITY_SOLD PROD_LIST_PRICE
------------- ---------------
17            53.9
17            49.9
22            89.5
7              45
3              78
16            42.5
12            52.9
7            69.9
17              68
6             135
1              54
16            42.5
25            69.9
18              54
1              54
6             115
5            59.5
13            69.9
22            42.5
12              64

20 rows selected.
```

This looks much easier to deal with. In fact, any decent statistics book will show you step-by-step how to determine the linear regression line. You can even take this data and plug it into Excel, which, conveniently, is shown below.

 Observation Product list price (X) Quantity sold (Y) X-squared Y-squared XY 1 53.90 17 2905.21 289 916.30 2 49.90 17 2490.01 289 848.30 3 89.50 22 8010.25 484 1969.00 4 45.00 7 2025.00 49 315.00 5 78.00 3 6084.00 9 234.00 6 42.50 16 1806.25 256 680.00 7 52.90 12 2798.41 144 634.80 8 69.90 7 4886.01 49 489.30 9 68.00 17 4624.00 289 1156.00 10 135.00 6 18225.00 36 810.00 11 54.00 1 2916.00 1 54.00 12 42.50 16 1806.25 256 680.00 13 69.90 25 4886.01 625 1747.50 14 54.00 18 2916.00 324 972.00 15 54.00 1 2916.00 1 54.00 16 115.00 6 13225.00 36 690.00 17 59.50 5 3540.25 25 297.50 18 69.90 13 4886.01 169 908.70 19 42.50 22 1806.25 484 935.00 20 64.00 12 4096.00 144 768.00 SUM 1309.90 243 96847.91 3959 15159.40 AVG 64.495 12.15

The values shown in this table were computed using the formulas listed below, but they could have been calculated using the Analysis Toolpak in Excel.

 Component Calculation n, number of observations Count Sxx, sum of squares related to X Sxx = ∑x2 - (∑x)2/n Syy, sum of squares related to Y Syy = ∑y2 - (∑y)2/n Sxy, related to interaction between X and Y Sxy = ∑xy - (∑x)( ∑y) / n b, the slope b = Sxy/Sxx a, the y-intercept a = y-bar - b(x-bar) ("bar" means average) Linear regression line Y = a + bX R-squared, goodness of fit R-squared = (Sxy)2/(Sxx)(Syy)

Here is the output for "C" where all the SYY, SXY, and SXX values are shown:

```SELECT
s.channel_id,
REGR_SYY(s.quantity_sold, p.prod_list_price)  SYY,
REGR_SXY(s.quantity_sold, p.prod_list_price)  SXY,
REGR_SXX(s.quantity_sold, p.prod_list_price)  SXX
FROM  sales s, products p
WHERE s.prod_id=p.prod_id AND
p.prod_category='Men'  AND
s.time_id=to_DATE('10-OCT-2000')
AND s.channel_id = 'C'
GROUP BY s.channel_id
;

C        SYY        SXY        SXX
- ---------- ---------- ----------
C    1006.55   -755.885 11056.0095
```

At this point, we have everything we need to complete the linear regression problem. You can read what the equation of the linear regression line is by examining the first or second SQL*Plus outputs shown earlier: Y = 16.627808 - 0.0683687(X).

Hidden in the documentation is the meaning of "expr1" and "expr2." In mathematical terms, one of these is the "X" and one of these is the "Y." The order shown in the "REGR_" line, using "(s.quantity_sold, p.prod_list_price)" really means (Y, X) pairings, not (X, Y) as you would probably expect. We usually expect to see the pairings as (X, Y), domain/range, independent/dependent variables or predictor/response. In regression terminology, I guess Oracle decided to go with the "regression of Y on X," so remember to order the expressions in EXPR_ as Y followed by X.

A picture is worth a thousand words (note to my editor: don't worry, I'm almost done), so here is a picture of the "C" data. The picture shows the scatter plot (the actual X/list price versus the Y/quantity sold), and what the linear regression predictive model shows (the actual X/list price versus what the model would have predicted for Y/quantity sold).

The widely scattered diamond-shaped (blue) points fall into the hazy cloud type of grouping mentioned earlier. The square points (magenta or purple-ish) along a line represent what the regression model would predict for a given list price. As you can readily see, the scatter plot is nowhere near the regression line, but without the picture, we already knew that because of the very low R-squared value (roughly speaking, the goodness of fit; the higher the value, the better the fit). In fact, none of the channel_id's used in the SQL Reference guide example have a good fit (the "best" being the one for "C," but even then, it is very low).

In summary, two things come to mind. First, Oracle can be used to perform data analysis that other tools, expensive or otherwise, can perform. Like any other tool, you have to become familiar with the format of the input and output, so there is some overhead on learning that, regardless of the tool.

Second and perhaps most importantly, in my opinion, is the impression that Oracle has made a colossal blunder by not making these tools more user friendly or surfaced in a GUI interface in their database product line. The Business Intelligence module in Oracle Apps references SQL Analytics. Good job of making that feature pretty obscure and offering practically next to nothing about what is included with it. The number crunching capability in Oracle is already there (for the most part), so why not go the little bit extra and make Oracle competitive with SAS, SPSS, Minitab, or whomever? It would be infinitely harder for SAS to become a competitive relational database management system, but it would be a drop in the bucket for Oracle to become a competitive data analysis tool. If done, the word "STATSPACK" could take on a whole new meaning.