In this second article about SQL functions, we will look at 11

SQL-related functions commonly used in statistics: count, sum, average,

standard deviation, variance and covariance (standard deviation and variance

have three each; covariance has two).

Aside from other reasons (mentioned in the last SQL function

article) about why a DBA may need to be familiar with these tools, here is a

reason that pertains directly to what a DBA does: using "real"

statistics for performance tuning or design purposes. Oracle’s array of SQL

functions enables a DBA to compute meaningful statistics about almost any set

of input data. Computing the count, sum and average of some item of interest is

very straightforward, but how do you compute the *variability* of that

data?

As a point of clarification, when someone refers to

statistics about a set of data, the type of statistics being discussed is that

of descriptive statistics or simple data analysis. The counterpart of

descriptive statistics is inferential statistics. Inferential statistics

typically deal with a sample from a population, and from that sample, we try to

infer or answer questions about the entire population. The answers to questions

about the population are couched in terms of probability. For our purposes,

simple descriptive statistics will suffice because we have all the data.

When looking through the list of SQL functions in the SQL

Reference documentation, you will see "POP" and "SAMP"

appended to covariance, standard deviation and variance-related functions. If

you are dealing with a set of data consisting of more than 30 to 31 elements,

observations, or readings, you can use either one of the *function-name*_POP

or *function-name*_SAMP functions. The "SAMP" functions use a

population correction factor to provide a "better" or unbiased value

of the true population parameter. In simple terms, the denominator of whichever

value is being computed uses n-1 instead of n, where "n" is the

number of data points, readings, observations, and so on.

A quick numerical example shows the practical equivalence of

dividing by n-1 versus dividing by n. A large number divided by n-1 is

practically the same (for our purposes, anyway) as dividing by n when n is

greater than 30. The value of 34.483 (1000/(30-1)) versus that of 33.333

(1000/30) is around a 3% difference. However, when n is much higher, like 100,

then the "error" falls to less than 1%. We will see this lack of a

difference in the following examples using the sample schemas that ship with

Oracle9i.

Logging in as "SH" in the sales history schema,

and using the example shown in the SQL Reference guide, you can see there is no

practical difference between the sample standard deviation and the population

standard deviation.

SQL> SELECT STDDEV_POP (amount_sold) “Pop”,

2 STDDEV_SAMP (amount_sold) “Samp”

3 FROM sales;Pop Samp

———- ———-

896.355151 896.355592

How many records are in this table? If you do not recall from

the last SQL function article, this table has over a million rows of data.

SQL> SELECT count(amount_sold) count

2 from sales;COUNT

———-

1016271

As an illustrative example of how to use this particular SQL

function, the documentation is entirely correct and accurate. As a practical

example of using STDDEV_POP and STDDEV_SAMP, the documentation falls short.

After reading this article, or already knowing something about statistics, you

now know why the values returned are so similar, differing by slightly more

than 4/10,000, which is close enough to zero for all practical purposes.

Users of SQL functions should be grateful for these more

powerful, aggregate type of functions. If you were restricted to "simpler"

functions, the query to compute these standard deviations would look like what

is shown below. Note: the "-" is a minus sign, not a continuation

symbol, and if you cut and paste the code, you may need to reformat it for

SQL*Plus.

select sqrt((sum(power((amount_sold),2)) – (power(sum(amount_sold),2)/count(amount_sold)))/

count(amount_sold)) “Pop”

from sales;

and

select sqrt((sum(power((amount_sold),2)) – (power(sum(amount_sold),2)/count(amount_sold)))/

(count(amount_sold)-1) “Samp”

from sales;

Some of the parentheses were used to improve readability,

and as is readily apparent, the complexity of the queries is quite a bit more

involved than using the much simpler STDDEV variants. And as a bonus, the

STDDEV functions are faster (just slightly so) than the more computational looking

examples.

The VARIANCE, VAR_POP and VAR_SAMP functions are directly

related to their standard deviation counterparts, as standard deviation is

simply the square root of the variance. STDDEV and VARIANCE are similar in what

they return if there is only one element (both return a zero), and STDDEV_SAMP

and VAR_SAMP return the same overall values as STDDEV and VARIANCE, with the

exception of returning a null if there is only one element.

COUNT, SUM and AVERAGE are probably the most well known SQL

functions and there is nothing special about their use. There is one little

trick you can use concerning AVERAGE, and that trick has to do with verifying

the output given an input when using a regression line (see the previous

article on SQL functions). The data point of (X-bar, Y-bar) is a point on the

computed regression line even if the value of X-bar is not one of the original

observations or input values (or Y-bar is not an actual observed output).

Using the "C" channel from the sales table (there

were 20 data pairs), the computed regression line was Y = 16.627808 –

0.0683687(X). X-bar was 65.495 and Y-bar was 12.15. Plugging in 65.495 in the

equation, does, in fact, return a value of 12.15 for Y.

Without getting into details about what covariance is, you

should know that you have already seen it if you read the previous article. In

the regression line example, there was a computed value of -755.885 for the Sxy

term. This value was selected using the REGR_SXY function. If you select the

COVAR_POP value from the sales table (using the same "where" clause

as before):

SQL> SELECT

2 s.channel_id,

3 REGR_SXY(s.quantity_sold, p.prod_list_price) SXY,

4 COVAR_POP(s.quantity_sold, p.prod_list_price) COVAR_POP

5 FROM sales s, products p

6 WHERE s.prod_id=p.prod_id AND

7 p.prod_category=’Men’ AND

8 s.time_id=to_DATE(’10-OCT-2000′)

9 AND s.channel_id = ‘C’

10 group by s.channel_id;C SXY COVAR_POP

– ———- ———-

C -755.885 -37.79425

Guess what 20 times -37.79425 is? Answer: -755.885, which

just happens to be the REGR_SXY value.

The point of demonstrating how some of these SQL functions

are related is not to turn you into a statistics wizard, but rather, to help provide

more insight into Oracle’s analytic and data manipulation capabilities using

statistics-related SQL functions. As mentioned before, Oracle is rich with

analytic features (and poor in the interface) and knowing more about them

further enhances your skills as a DBA. A business analyst or report writer may

know the mathematical formula for computing some statistic, but not know how to

write the SQL to get it. Being more informed about SQL functions is, as Martha

Stewart would put it, "a good thing."

Bonus question: Knowing that the average is computed as the

sum of X divided by N, how else could you write the following:

select sqrt((sum(power((amount_sold),2)) – (power(sum(amount_sold),2)/count(amount_sold)))/

count(amount_sold)) “Pop”

from sales;

Answer:

select sqrt((sum(power((amount_sold),2)) – (avg(amount_sold)*sum(amount_sold)))/

count(amount_sold)) “Pop”

from sales;