Let's extract the data used to compute the linear regression
output:
SQL> select s.quantity_sold, p.prod_list_price
2 FROM sales s, products p
3 WHERE s.prod_id=p.prod_id AND
4 p.prod_category='Men' AND
5 s.time_id=to_DATE('10-OCT-2000')
6 AND s.channel_id = 'C';
QUANTITY_SOLD PROD_LIST_PRICE
------------- ---------------
17 53.9
17 49.9
22 89.5
7 45
3 78
16 42.5
12 52.9
7 69.9
17 68
6 135
1 54
16 42.5
25 69.9
18 54
1 54
6 115
5 59.5
13 69.9
22 42.5
12 64
20 rows selected.
This looks much easier to deal with. In fact, any decent
statistics book will show you step-by-step how to determine the linear
regression line. You can even take this data and plug it into Excel, which, conveniently,
is shown below.
Observation
|
Product list price
(X)
|
Quantity sold
(Y)
|
X-squared
|
Y-squared
|
XY
|
1
|
53.90
|
17
|
2905.21
|
289
|
916.30
|
2
|
49.90
|
17
|
2490.01
|
289
|
848.30
|
3
|
89.50
|
22
|
8010.25
|
484
|
1969.00
|
4
|
45.00
|
7
|
2025.00
|
49
|
315.00
|
5
|
78.00
|
3
|
6084.00
|
9
|
234.00
|
6
|
42.50
|
16
|
1806.25
|
256
|
680.00
|
7
|
52.90
|
12
|
2798.41
|
144
|
634.80
|
8
|
69.90
|
7
|
4886.01
|
49
|
489.30
|
9
|
68.00
|
17
|
4624.00
|
289
|
1156.00
|
10
|
135.00
|
6
|
18225.00
|
36
|
810.00
|
11
|
54.00
|
1
|
2916.00
|
1
|
54.00
|
12
|
42.50
|
16
|
1806.25
|
256
|
680.00
|
13
|
69.90
|
25
|
4886.01
|
625
|
1747.50
|
14
|
54.00
|
18
|
2916.00
|
324
|
972.00
|
15
|
54.00
|
1
|
2916.00
|
1
|
54.00
|
16
|
115.00
|
6
|
13225.00
|
36
|
690.00
|
17
|
59.50
|
5
|
3540.25
|
25
|
297.50
|
18
|
69.90
|
13
|
4886.01
|
169
|
908.70
|
19
|
42.50
|
22
|
1806.25
|
484
|
935.00
|
20
|
64.00
|
12
|
4096.00
|
144
|
768.00
|
SUM
|
1309.90
|
243
|
96847.91
|
3959
|
15159.40
|
AVG
|
64.495
|
12.15
|
|
|
|
The values shown in this table were computed using the
formulas listed below, but they could have been calculated using the Analysis Toolpak
in Excel.
Component
|
Calculation
|
n, number of observations
|
Count
|
Sxx, sum of squares related to X
|
Sxx = ∑x2 - (∑x)2/n
|
Syy, sum of squares related to Y
|
Syy = ∑y2 - (∑y)2/n
|
Sxy, related to interaction between X and Y
|
Sxy = ∑xy - (∑x)( ∑y) / n
|
b, the slope
|
b = Sxy/Sxx
|
a, the y-intercept
|
a = y-bar - b(x-bar) ("bar" means average)
|
Linear regression line
|
Y = a + bX
|
R-squared, goodness of fit
|
R-squared = (Sxy)2/(Sxx)(Syy)
|
Here is the output for "C" where all the SYY, SXY,
and SXX values are shown:
SELECT
s.channel_id,
REGR_SYY(s.quantity_sold, p.prod_list_price) SYY,
REGR_SXY(s.quantity_sold, p.prod_list_price) SXY,
REGR_SXX(s.quantity_sold, p.prod_list_price) SXX
FROM sales s, products p
WHERE s.prod_id=p.prod_id AND
p.prod_category='Men' AND
s.time_id=to_DATE('10-OCT-2000')
AND s.channel_id = 'C'
GROUP BY s.channel_id
;
C SYY SXY SXX
- ---------- ---------- ----------
C 1006.55 -755.885 11056.0095
At this point, we have everything we need to complete the
linear regression problem. You can read what the equation of the linear
regression line is by examining the first or second SQL*Plus outputs shown
earlier: Y = 16.627808 - 0.0683687(X).
Hidden in the documentation is the meaning of "expr1"
and "expr2." In mathematical terms, one of these is the "X"
and one of these is the "Y." The order shown in the "REGR_"
line, using "(s.quantity_sold,
p.prod_list_price)" really means (Y, X) pairings, not (X, Y) as you
would probably expect. We usually expect to see the pairings as (X, Y),
domain/range, independent/dependent variables or predictor/response. In
regression terminology, I guess Oracle decided to go with the "regression
of Y on X," so remember to order the expressions in EXPR_ as Y followed by
X.
A picture is worth a thousand words (note to my editor: don't
worry, I'm almost done), so here is a picture of the "C" data. The
picture shows the scatter plot (the actual X/list price versus the Y/quantity
sold), and what the linear regression predictive model shows (the actual X/list
price versus what the model would have predicted for Y/quantity sold).

The widely scattered diamond-shaped (blue) points fall into
the hazy cloud type of grouping mentioned earlier. The square points (magenta
or purple-ish) along a line represent what the regression model would predict
for a given list price. As you can readily see, the scatter plot is nowhere
near the regression line, but without the picture, we already knew that because
of the very low R-squared value (roughly speaking, the goodness of fit; the
higher the value, the better the fit). In fact, none of the channel_id's used
in the SQL Reference guide example have a good fit (the "best" being
the one for "C," but even then, it is very low).
In summary, two things come to mind. First, Oracle can be
used to perform data analysis that other tools, expensive or otherwise, can
perform. Like any other tool, you have to become familiar with the format of
the input and output, so there is some overhead on learning that, regardless of
the tool.
Second and perhaps most importantly, in my opinion, is the
impression that Oracle has made a colossal blunder by not making these tools
more user friendly or surfaced in a GUI interface in their database product
line. The Business Intelligence module in Oracle Apps references SQL Analytics.
Good job of making that feature pretty obscure and offering practically next to
nothing about what is included with it. The number crunching capability in
Oracle is already there (for the most part), so why not go the little bit extra
and make Oracle competitive with SAS, SPSS, Minitab, or whomever? It would be
infinitely harder for SAS to become a competitive relational database
management system, but it would be a drop in the bucket for Oracle to become a
competitive data analysis tool. If done, the word "STATSPACK" could
take on a whole new meaning.
»
See All Articles by Columnist Steve Callan