Using Oracle's SQL Functions
February 26, 2004
Oracle provides quite an array of functions when it comes to manipulating data via SQL. The Oracle9i SQL Reference guide (for release 2) lists five categories of SQL functions, with each category containing one or more functions within a category. The major functions are single-row, aggregate, analytic, object reference and user defined.
The single-row function category contains the following
functions: number, character, datetime, conversion and miscellaneous single
row. What are single-row functions? The definition shown in the SQL Reference
guide states that single-row "functions return a single result row for
every row of a queried table or view. These functions can appear in select
Of particular interest for this new series are the functions related to numbers. How does this relate to your job as a DBA? Answer: in several ways. First, you may have to support a Decision Support group. Decision support groups frequently use analytic functions and statistical methods to support a business decision. It helps to understand (or at least recognize the name) some of the analytic tools being used. Second, you may be working in a data warehouse type of environment where YOU are the decision support guru. Granted, your job may be to simply massage or extract the data for others to analyze, but you are the SQL expert by virtue of being the DBA. If you are asked to find the regression line for quantity produced versus sales, you will make a better impression by not having that deer in the headlights look come across your face.
A third reason has to do to adding value to your company by saving your company some money. How does that work? Well, companies that need to analyze data often purchase sophisticated tools such as SPSS and SAS. These products are not cheap. What frequently happens is that users of these high-end statistical packages wind up using very few features. Many of these features are the same ones found in Excel. Already you are thinking about how to extract data into a comma-separated value flat file for use in Excel. However, what if you didn't need to do that "export" and could do the same exact thing within Oracle itself? You have already paid for Oracle, and likely have Excel on every PC, but have the overhead of extracting the data and then having to do whatever in Excel. How about skipping the Excel step and performing the data analysis within Oracle? Note: by "Oracle," I mean the RDBMS product and not anything having to do with Oracle Apps. We are talking about using the database as a high-end calculator, so to speak.
One thing that goes hand-in-hand with analysis is some type of pictorial representation of what is being analyzed. Although Oracle does provide other tools for accomplishing this, that falls more into the Forms & Reports developer realm, so we'll say that is a possible solution and leave it at that. So even if there is a heavy chart/graph/histogram/whatever requirement to be met by using Excel (or something else), you can still use Oracle as a backup to check the work performed by someone else using a different tool.
Let's look at the linear regression aggregate function as an example of what Oracle can do as an analysis engine. If you hate math, the next few paragraphs may be painful, but you'll be able to follow along without getting buried in the theorems and formulas.
What is linear regression? Let's say you have an input, like list price of some product your company produces. The lower the price, the more your company sells of that product. Conversely, the higher the price, the fewer your company sells. Over time, you plot selling price versus quantity sold. When you look at the dots or plots on a piece of graph paper, perhaps the arrangement of the dots tends to suggest drawing a straight line, which generally comes close to connecting most of the dots. If the dots are pretty close to the line, then you may have a strong relationship between the X and Y (selling price and quantity produced). In fact, the line may be such a good fit that you can come close to predicting the quantity sold if given the selling price, and vice versa.
On the other hand, the points, when plotted, may look like a hazy cloud on the graph paper - there is no distinct line or trend between the X and Y values. Because there is no line which "fits" or connects the points, there is a weak relationship between X and Y (a list price of 52 may be just as likely to result in a quantity sold of 12, 17, or 18 - there is no or very little predictive power knowing one thing or the other).
The general equation of a line is y = mx + b. In English, that means "Y" is equal to the slope ("m") times "X" plus a constant "b." That constant is actually called the y-intercept (if the input of x = 0, then the output of y = b, or that is where the line crosses the Y axis). The y-intercept may or may not have any real meaning. The slope is important because it shows the rate of change between X and Y, and the sign of the slope reflects the direction of the relationship (a positive slope shows that as X increases, so does Y; a negative slope shows that as X increases, Y decreases).
So, getting back to what Oracle can do for you, with respect to linear regression: quite a bit. The linear regression function, REGR_"component," outputs nine items of interest typically used when evaluating data. A component is selected by specifying the related REGR phrase and the two inputs - expr1 and expr2 - or the independent and dependent variables. I could have said the "X" and the "Y," and I will later, but not for now.
To look at the example Oracle provides, you will need to install the sample schema named SH (for sales history). The sample schemas are referenced in the Oracle9i Sample Schemas guide (http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96539/toc.htm). If you installed the sample schemas during the Oracle9i installation, you will also need to unlock the SH user account (and give it a password you can remember, like "SH"). If you have never seen a million-row table, this is your opportunity. As the user SH, doing a "select count(*) from sales" shows just over a million rows (1016271, to be exact).
The example shown in the SQL Reference guide has five channel_id's, but we will concentrate on the first one (identified as "C"). Here is the select statement and its output as shown in a SQL*Plus session as the user named SH:
SQL> SELECT 2 s.channel_id, 3 REGR_SLOPE(s.quantity_sold, p.prod_list_price) SLOPE , 4 REGR_INTERCEPT(s.quantity_sold, p.prod_list_price) INTCPT , 5 REGR_R2(s.quantity_sold, p.prod_list_price) RSQR , 6 REGR_COUNT(s.quantity_sold, p.prod_list_price) COUNT , 7 REGR_AVGX(s.quantity_sold, p.prod_list_price) AVGLISTP , 8 REGR_AVGY(s.quantity_sold, p.prod_list_price) AVGQSOLD 9 FROM sales s, products p 10 WHERE s.prod_id=p.prod_id AND 11 p.prod_category='Men' AND 12 s.time_id=to_DATE('10-OCT-2000') 13 GROUP BY s.channel_id; C SLOPE INTCPT RSQR COUNT AVGLISTP AVGQSOLD - ---------- ---------- ---------- ------ ---------- ---------- C -.0683687 16.627808 .051342581 20 65.495 12.15 I .019710295 14.8113924 .001631488 46 51.4804348 15.826087 P -.01247359 12.854546 .017039788 30 81.87 11.8333333 S .006155886 13.9919243 .000898438 83 69.813253 14.4216867 T -.00411314 5.22717214 .008132242 27 82.2444444 4.88888889
Let's add another AND to the WHERE clause to get just the first row:
SQL> SELECT 2 s.channel_id, 3 REGR_SLOPE(s.quantity_sold, p.prod_list_price) SLOPE , 4 REGR_INTERCEPT(s.quantity_sold, p.prod_list_price) INTCPT , 5 REGR_R2(s.quantity_sold, p.prod_list_price) RSQR , 6 REGR_COUNT(s.quantity_sold, p.prod_list_price) COUNT , 7 REGR_AVGX(s.quantity_sold, p.prod_list_price) AVGLISTP , 8 REGR_AVGY(s.quantity_sold, p.prod_list_price) AVGQSOLD 9 FROM sales s, products p 10 WHERE s.prod_id=p.prod_id AND 11 p.prod_category='Men' AND 12 s.time_id=to_DATE('10-OCT-2000') 13 AND s.channel_id = 'C' 14 group by s.channel_id; C SLOPE INTCPT RSQR COUNT AVGLISTP AVGQSOLD - ---------- ---------- ---------- ------ ---------- ---------- C -.0683687 16.627808 .051342581 20 65.495 12.15
Overall, there were 206 rows involved, but we are only interested in the 20 rows for "C."