Oracle provides quite an array of functions when it comes to
manipulating data via SQL. The Oracle9i SQL Reference guide (for release 2)
lists five categories of SQL functions, with each category containing one or
more functions within a category. The major functions are single-row,
aggregate, analytic, object reference and user defined.
The single-row function category contains the following
functions: number, character, datetime, conversion and miscellaneous single
row. What are single-row functions? The definition shown in the SQL Reference
guide states that single-row "functions return a single result row for
every row of a queried table or view. These functions can appear in select
lists, WHERE
clauses, START
WITH
and CONNECT
BY
clauses, and HAVING
clauses." Aggregate
functions are also quite powerful. These functions "return a single result
row based on groups of rows, rather than on single rows." In general,
single-row and aggregate functions complement one another.
Of particular interest for this new series are the functions
related to numbers. How does this relate to your job as a DBA? Answer: in
several ways. First, you may have to support a Decision Support group. Decision
support groups frequently use analytic functions and statistical methods to
support a business decision. It helps to understand (or at least recognize the
name) some of the analytic tools being used. Second, you may be working in a
data warehouse type of environment where YOU are the decision support guru.
Granted, your job may be to simply massage or extract the data for others to
analyze, but you are the SQL expert by virtue of being the DBA. If you are
asked to find the regression line for quantity produced versus sales, you will
make a better impression by not having that deer in the headlights look come
across your face.
A third reason has to do to adding value to your company by
saving your company some money. How does that work? Well, companies that need
to analyze data often purchase sophisticated tools such as SPSS and SAS. These
products are not cheap. What frequently happens is that users of these high-end
statistical packages wind up using very few features. Many of these features
are the same ones found in Excel. Already you are thinking about how to extract
data into a comma-separated value flat file for use in Excel. However, what if
you didn’t need to do that "export" and could do the same exact thing
within Oracle itself? You have already paid for Oracle, and likely have Excel
on every PC, but have the overhead of extracting the data and then having to do
whatever in Excel. How about skipping the Excel step and performing the data
analysis within Oracle? Note: by "Oracle," I mean the RDBMS product
and not anything having to do with Oracle Apps. We are talking about using the
database as a high-end calculator, so to speak.
One thing that goes hand-in-hand with analysis is some type
of pictorial representation of what is being analyzed. Although Oracle does
provide other tools for accomplishing this, that falls more into the Forms
& Reports developer realm, so we’ll say that is a possible solution and
leave it at that. So even if there is a heavy chart/graph/histogram/whatever
requirement to be met by using Excel (or something else), you can still use
Oracle as a backup to check the work performed by someone else using a
different tool.
Let’s look at the linear regression aggregate function as an
example of what Oracle can do as an analysis engine. If you hate math, the next
few paragraphs may be painful, but you’ll be able to follow along without
getting buried in the theorems and formulas.
What is linear regression? Let’s say you have an input, like
list price of some product your company produces. The lower the price, the more
your company sells of that product. Conversely, the higher the price, the fewer
your company sells. Over time, you plot selling price versus quantity sold. When
you look at the dots or plots on a piece of graph paper, perhaps the
arrangement of the dots tends to suggest drawing a straight line, which
generally comes close to connecting most of the dots. If the dots are pretty
close to the line, then you may have a strong relationship between the X and Y
(selling price and quantity produced). In fact, the line may be such a good fit
that you can come close to predicting the quantity sold if given the selling
price, and vice versa.
On the other hand, the points, when plotted, may look like a
hazy cloud on the graph paper – there is no distinct line or trend between the
X and Y values. Because there is no line which "fits" or connects the
points, there is a weak relationship between X and Y (a list price of 52 may be
just as likely to result in a quantity sold of 12, 17, or 18 – there is no or
very little predictive power knowing one thing or the other).
The general equation of a line is y = mx + b. In English,
that means "Y" is equal to the slope ("m") times "X"
plus a constant "b." That constant is actually called the y-intercept
(if the input of x = 0, then the output of y = b, or that is where the line
crosses the Y axis). The y-intercept may or may not have any real meaning. The
slope is important because it shows the rate of change between X and Y, and the
sign of the slope reflects the direction of the relationship (a positive slope
shows that as X increases, so does Y; a negative slope shows that as X
increases, Y decreases).
So, getting back to what Oracle can do for you, with respect
to linear regression: quite a bit. The linear regression function, REGR_"component,"
outputs nine items of interest typically used when evaluating data. A component
is selected by specifying the related REGR phrase and the two inputs – expr1
and expr2 – or the independent and dependent variables. I could have said the "X"
and the "Y," and I will later, but not for now.
To look at the example Oracle provides, you will need to
install the sample schema named SH (for sales history). The sample schemas are
referenced in the Oracle9i Sample Schemas guide (http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96539/toc.htm).
If you installed the sample schemas during the Oracle9i installation, you will
also need to unlock the SH user account (and give it a password you can
remember, like "SH"). If you have never seen a million-row table,
this is your opportunity. As the user SH, doing a "select count(*) from
sales" shows just over a million rows (1016271, to be exact).
The example shown in the SQL Reference guide has five channel_id’s,
but we will concentrate on the first one (identified as "C"). Here is
the select statement and its output as shown in a SQL*Plus session as the user
named SH:
SQL> SELECT
2 s.channel_id,
3 REGR_SLOPE(s.quantity_sold, p.prod_list_price) SLOPE ,
4 REGR_INTERCEPT(s.quantity_sold, p.prod_list_price) INTCPT ,
5 REGR_R2(s.quantity_sold, p.prod_list_price) RSQR ,
6 REGR_COUNT(s.quantity_sold, p.prod_list_price) COUNT ,
7 REGR_AVGX(s.quantity_sold, p.prod_list_price) AVGLISTP ,
8 REGR_AVGY(s.quantity_sold, p.prod_list_price) AVGQSOLD
9 FROM sales s, products p
10 WHERE s.prod_id=p.prod_id AND
11 p.prod_category=’Men’ AND
12 s.time_id=to_DATE(’10-OCT-2000′)
13 GROUP BY s.channel_id;C SLOPE INTCPT RSQR COUNT AVGLISTP AVGQSOLD
– ———- ———- ———- —— ———- ———-
C -.0683687 16.627808 .051342581 20 65.495 12.15
I .019710295 14.8113924 .001631488 46 51.4804348 15.826087
P -.01247359 12.854546 .017039788 30 81.87 11.8333333
S .006155886 13.9919243 .000898438 83 69.813253 14.4216867
T -.00411314 5.22717214 .008132242 27 82.2444444 4.88888889
Let’s add another AND to the WHERE clause to get just the
first row:
SQL> SELECT
2 s.channel_id,
3 REGR_SLOPE(s.quantity_sold, p.prod_list_price) SLOPE ,
4 REGR_INTERCEPT(s.quantity_sold, p.prod_list_price) INTCPT ,
5 REGR_R2(s.quantity_sold, p.prod_list_price) RSQR ,
6 REGR_COUNT(s.quantity_sold, p.prod_list_price) COUNT ,
7 REGR_AVGX(s.quantity_sold, p.prod_list_price) AVGLISTP ,
8 REGR_AVGY(s.quantity_sold, p.prod_list_price) AVGQSOLD
9 FROM sales s, products p
10 WHERE s.prod_id=p.prod_id AND
11 p.prod_category=’Men’ AND
12 s.time_id=to_DATE(’10-OCT-2000′)
13 AND s.channel_id = ‘C’
14 group by s.channel_id;C SLOPE INTCPT RSQR COUNT AVGLISTP AVGQSOLD
– ———- ———- ———- —— ———- ———-
C -.0683687 16.627808 .051342581 20 65.495 12.15
Overall, there were 206 rows involved, but we are only
interested in the 20 rows for "C."