Free Newsletters:
DatabaseJournal  
DBANews
Database Journal
Search Database Journal:
 
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Links Database Forum DBA Videos
internet.com

» Database Journal Home
» Database News
» DBA Videos
» Database Articles
» Database Tutorials
MS SQL
Oracle
MS Access
MySQL
DB2
» RESOURCES
Database Tools
SQL Scripts & Samples
Links
» Database Forum
» DBA Jobs
» Sitemap

News Via RSS Feed



follow us on Twitter

Marketplace Partners
Be a Marketplace Partner

internet.commerce
Be a Commerce Partner


















Apple: 1M New iPhones Sold, Apologies for Snafus

T-Mobile's Next Android Phone: myTouch 3G

Firms Push Cloud, Virtualization for IT Niches

internet.com
IT
Developer
Internet News
Small Business
Personal Technology

Search internet.com
Advertise
Corporate Info
Newsletters
Tech Jobs
E-mail Offers


Database Journal | DBA Support | SQLCourse | SQLCourse2







Systems Architect Principal
The Computer Merchant, Ltd
US-NJ-Paramus

Justtechjobs.com Post A Job | Post A Resume

Featured Database Articles

Oracle

February 26, 2004

Using Oracle's SQL Functions

By Steve Callan



Let's extract the data used to compute the linear regression output:



SQL> select s.quantity_sold, p.prod_list_price
  2  FROM  sales s, products p
  3  WHERE s.prod_id=p.prod_id AND
  4  p.prod_category='Men'  AND
  5  s.time_id=to_DATE('10-OCT-2000')
  6  AND s.channel_id = 'C';


QUANTITY_SOLD PROD_LIST_PRICE
------------- ---------------
           17            53.9
           17            49.9
           22            89.5
            7              45
            3              78
           16            42.5
           12            52.9
            7            69.9
           17              68
            6             135
            1              54
           16            42.5
           25            69.9
           18              54
            1              54
            6             115
            5            59.5
           13            69.9
           22            42.5
           12              64

20 rows selected.


This looks much easier to deal with. In fact, any decent statistics book will show you step-by-step how to determine the linear regression line. You can even take this data and plug it into Excel, which, conveniently, is shown below.



Observation

Product list price

(X)

Quantity sold

(Y)

X-squared

Y-squared

XY

1

53.90

17

2905.21

289

916.30

2

49.90

17

2490.01

289

848.30

3

89.50

22

8010.25

484

1969.00

4

45.00

7

2025.00

49

315.00

5

78.00

3

6084.00

9

234.00

6

42.50

16

1806.25

256

680.00

7

52.90

12

2798.41

144

634.80

8

69.90

7

4886.01

49

489.30

9

68.00

17

4624.00

289

1156.00

10

135.00

6

18225.00

36

810.00

11

54.00

1

2916.00

1

54.00

12

42.50

16

1806.25

256

680.00

13

69.90

25

4886.01

625

1747.50

14

54.00

18

2916.00

324

972.00

15

54.00

1

2916.00

1

54.00

16

115.00

6

13225.00

36

690.00

17

59.50

5

3540.25

25

297.50

18

69.90

13

4886.01

169

908.70

19

42.50

22

1806.25

484

935.00

20

64.00

12

4096.00

144

768.00

SUM

1309.90

243

96847.91

3959

15159.40

AVG

64.495

12.15

The values shown in this table were computed using the formulas listed below, but they could have been calculated using the Analysis Toolpak in Excel.

Component

Calculation

n, number of observations

Count

Sxx, sum of squares related to X

Sxx = ∑x2 - (∑x)2/n

Syy, sum of squares related to Y

Syy = ∑y2 - (∑y)2/n

Sxy, related to interaction between X and Y

Sxy = ∑xy - (∑x)( ∑y) / n

b, the slope

b = Sxy/Sxx

a, the y-intercept

a = y-bar - b(x-bar) ("bar" means average)

Linear regression line

Y = a + bX

R-squared, goodness of fit

R-squared = (Sxy)2/(Sxx)(Syy)

Here is the output for "C" where all the SYY, SXY, and SXX values are shown:

SELECT
s.channel_id,
REGR_SYY(s.quantity_sold, p.prod_list_price)  SYY,
REGR_SXY(s.quantity_sold, p.prod_list_price)  SXY,
REGR_SXX(s.quantity_sold, p.prod_list_price)  SXX
FROM  sales s, products p
WHERE s.prod_id=p.prod_id AND
p.prod_category='Men'  AND
s.time_id=to_DATE('10-OCT-2000')
AND s.channel_id = 'C'
GROUP BY s.channel_id
;

C        SYY        SXY        SXX
- ---------- ---------- ----------
C    1006.55   -755.885 11056.0095

At this point, we have everything we need to complete the linear regression problem. You can read what the equation of the linear regression line is by examining the first or second SQL*Plus outputs shown earlier: Y = 16.627808 - 0.0683687(X).

Hidden in the documentation is the meaning of "expr1" and "expr2." In mathematical terms, one of these is the "X" and one of these is the "Y." The order shown in the "REGR_" line, using "(s.quantity_sold, p.prod_list_price)" really means (Y, X) pairings, not (X, Y) as you would probably expect. We usually expect to see the pairings as (X, Y), domain/range, independent/dependent variables or predictor/response. In regression terminology, I guess Oracle decided to go with the "regression of Y on X," so remember to order the expressions in EXPR_ as Y followed by X.

A picture is worth a thousand words (note to my editor: don't worry, I'm almost done), so here is a picture of the "C" data. The picture shows the scatter plot (the actual X/list price versus the Y/quantity sold), and what the linear regression predictive model shows (the actual X/list price versus what the model would have predicted for Y/quantity sold).

The widely scattered diamond-shaped (blue) points fall into the hazy cloud type of grouping mentioned earlier. The square points (magenta or purple-ish) along a line represent what the regression model would predict for a given list price. As you can readily see, the scatter plot is nowhere near the regression line, but without the picture, we already knew that because of the very low R-squared value (roughly speaking, the goodness of fit; the higher the value, the better the fit). In fact, none of the channel_id's used in the SQL Reference guide example have a good fit (the "best" being the one for "C," but even then, it is very low).

In summary, two things come to mind. First, Oracle can be used to perform data analysis that other tools, expensive or otherwise, can perform. Like any other tool, you have to become familiar with the format of the input and output, so there is some overhead on learning that, regardless of the tool.

Second and perhaps most importantly, in my opinion, is the impression that Oracle has made a colossal blunder by not making these tools more user friendly or surfaced in a GUI interface in their database product line. The Business Intelligence module in Oracle Apps references SQL Analytics. Good job of making that feature pretty obscure and offering practically next to nothing about what is included with it. The number crunching capability in Oracle is already there (for the most part), so why not go the little bit extra and make Oracle competitive with SAS, SPSS, Minitab, or whomever? It would be infinitely harder for SAS to become a competitive relational database management system, but it would be a drop in the bucket for Oracle to become a competitive data analysis tool. If done, the word "STATSPACK" could take on a whole new meaning.

» See All Articles by Columnist Steve Callan

Go to page: Prev  1  2  

Tools:
Add databasejournal.com to your favorites
Add databasejournal.com to your browser search box
IE 7 | Firefox 2.0 | Firefox 1.5.x
Receive news via our XML/RSS feed

Oracle Archives








Latest Forum Threads
Oracle Forum
Topic By Replies Updated
Problem with archive_log_format parameter in SPFILE dave_keyur 6 June 2nd, 02:14 AM
getting first date and last date of previous month in oracle Osho4U 8 May 26th, 05:29 PM
Removing middle initials antgaudi 1 May 12th, 10:40 AM
Extracting the smallest age. MIN with SYSDATE? antgaudi 2 May 11th, 01:45 AM