Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Links Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS SQL
Oracle
DB2
MS Access
MySQL
» RESOURCES
Database Tools
SQL Scripts & Samples
Links
» Database Forum
» Sitemap
Free Newsletters:
DatabaseDaily  
News Via RSS Feed


follow us on Twitter
Database Journal |DBA Support |SQLCourse |SQLCourse2
 

Featured Database Articles

Oracle

Posted Feb 26, 2004

Using Oracle's SQL Functions - Page 2

By Steve Callan



Let's extract the data used to compute the linear regression output:



SQL> select s.quantity_sold, p.prod_list_price
  2  FROM  sales s, products p
  3  WHERE s.prod_id=p.prod_id AND
  4  p.prod_category='Men'  AND
  5  s.time_id=to_DATE('10-OCT-2000')
  6  AND s.channel_id = 'C';


QUANTITY_SOLD PROD_LIST_PRICE
------------- ---------------
           17            53.9
           17            49.9
           22            89.5
            7              45
            3              78
           16            42.5
           12            52.9
            7            69.9
           17              68
            6             135
            1              54
           16            42.5
           25            69.9
           18              54
            1              54
            6             115
            5            59.5
           13            69.9
           22            42.5
           12              64

20 rows selected.


This looks much easier to deal with. In fact, any decent statistics book will show you step-by-step how to determine the linear regression line. You can even take this data and plug it into Excel, which, conveniently, is shown below.



Observation

Product list price

(X)

Quantity sold

(Y)

X-squared

Y-squared

XY

1

53.90

17

2905.21

289

916.30

2

49.90

17

2490.01

289

848.30

3

89.50

22

8010.25

484

1969.00

4

45.00

7

2025.00

49

315.00

5

78.00

3

6084.00

9

234.00

6

42.50

16

1806.25

256

680.00

7

52.90

12

2798.41

144

634.80

8

69.90

7

4886.01

49

489.30

9

68.00

17

4624.00

289

1156.00

10

135.00

6

18225.00

36

810.00

11

54.00

1

2916.00

1

54.00

12

42.50

16

1806.25

256

680.00

13

69.90

25

4886.01

625

1747.50

14

54.00

18

2916.00

324

972.00

15

54.00

1

2916.00

1

54.00

16

115.00

6

13225.00

36

690.00

17

59.50

5

3540.25

25

297.50

18

69.90

13

4886.01

169

908.70

19

42.50

22

1806.25

484

935.00

20

64.00

12

4096.00

144

768.00

SUM

1309.90

243

96847.91

3959

15159.40

AVG

64.495

12.15

The values shown in this table were computed using the formulas listed below, but they could have been calculated using the Analysis Toolpak in Excel.

Component

Calculation

n, number of observations

Count

Sxx, sum of squares related to X

Sxx = ∑x2 - (∑x)2/n

Syy, sum of squares related to Y

Syy = ∑y2 - (∑y)2/n

Sxy, related to interaction between X and Y

Sxy = ∑xy - (∑x)( ∑y) / n

b, the slope

b = Sxy/Sxx

a, the y-intercept

a = y-bar - b(x-bar) ("bar" means average)

Linear regression line

Y = a + bX

R-squared, goodness of fit

R-squared = (Sxy)2/(Sxx)(Syy)

Here is the output for "C" where all the SYY, SXY, and SXX values are shown:

SELECT
s.channel_id,
REGR_SYY(s.quantity_sold, p.prod_list_price)  SYY,
REGR_SXY(s.quantity_sold, p.prod_list_price)  SXY,
REGR_SXX(s.quantity_sold, p.prod_list_price)  SXX
FROM  sales s, products p
WHERE s.prod_id=p.prod_id AND
p.prod_category='Men'  AND
s.time_id=to_DATE('10-OCT-2000')
AND s.channel_id = 'C'
GROUP BY s.channel_id
;

C        SYY        SXY        SXX
- ---------- ---------- ----------
C    1006.55   -755.885 11056.0095

At this point, we have everything we need to complete the linear regression problem. You can read what the equation of the linear regression line is by examining the first or second SQL*Plus outputs shown earlier: Y = 16.627808 - 0.0683687(X).

Hidden in the documentation is the meaning of "expr1" and "expr2." In mathematical terms, one of these is the "X" and one of these is the "Y." The order shown in the "REGR_" line, using "(s.quantity_sold, p.prod_list_price)" really means (Y, X) pairings, not (X, Y) as you would probably expect. We usually expect to see the pairings as (X, Y), domain/range, independent/dependent variables or predictor/response. In regression terminology, I guess Oracle decided to go with the "regression of Y on X," so remember to order the expressions in EXPR_ as Y followed by X.

A picture is worth a thousand words (note to my editor: don't worry, I'm almost done), so here is a picture of the "C" data. The picture shows the scatter plot (the actual X/list price versus the Y/quantity sold), and what the linear regression predictive model shows (the actual X/list price versus what the model would have predicted for Y/quantity sold).

The widely scattered diamond-shaped (blue) points fall into the hazy cloud type of grouping mentioned earlier. The square points (magenta or purple-ish) along a line represent what the regression model would predict for a given list price. As you can readily see, the scatter plot is nowhere near the regression line, but without the picture, we already knew that because of the very low R-squared value (roughly speaking, the goodness of fit; the higher the value, the better the fit). In fact, none of the channel_id's used in the SQL Reference guide example have a good fit (the "best" being the one for "C," but even then, it is very low).

In summary, two things come to mind. First, Oracle can be used to perform data analysis that other tools, expensive or otherwise, can perform. Like any other tool, you have to become familiar with the format of the input and output, so there is some overhead on learning that, regardless of the tool.

Second and perhaps most importantly, in my opinion, is the impression that Oracle has made a colossal blunder by not making these tools more user friendly or surfaced in a GUI interface in their database product line. The Business Intelligence module in Oracle Apps references SQL Analytics. Good job of making that feature pretty obscure and offering practically next to nothing about what is included with it. The number crunching capability in Oracle is already there (for the most part), so why not go the little bit extra and make Oracle competitive with SAS, SPSS, Minitab, or whomever? It would be infinitely harder for SAS to become a competitive relational database management system, but it would be a drop in the bucket for Oracle to become a competitive data analysis tool. If done, the word "STATSPACK" could take on a whole new meaning.

» See All Articles by Columnist Steve Callan



Oracle Archives

Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 




Latest Forum Threads
Oracle Forum
Topic By Replies Updated
Oracle Data Mining: Classification jan.hasller 0 July 5th, 07:19 AM
Find duplicates - Unique IDs Lava 5 July 2nd, 08:30 AM
no matching unique or primary key rcanter 1 April 25th, 12:32 PM
Update values of one table based on condition of values in other table using Trigger Gladiator 3 February 29th, 06:01 PM