In this second article about SQL functions, we will look at 11
SQL-related functions commonly used in statistics: count, sum, average,
standard deviation, variance and covariance (standard deviation and variance
have three each; covariance has two).

Aside from the other reasons (mentioned in the last SQL function
article) why a DBA may need to be familiar with these tools, here is one
that pertains directly to what a DBA does: using "real"
statistics for performance tuning or design purposes. Oracle's array of SQL
functions enables a DBA to compute meaningful statistics about almost any set
of input data. Computing the count, sum and average of some item of interest is
very straightforward, but how do you compute the *variability* of that
data?

As a point of clarification, when someone refers to
statistics about a set of data, the type of statistics being discussed is that
of descriptive statistics or simple data analysis. The counterpart of
descriptive statistics is inferential statistics. Inferential statistics
typically deal with a sample from a population, and from that sample, we try to
infer or answer questions about the entire population. The answers to questions
about the population are couched in terms of probability. For our purposes,
simple descriptive statistics will suffice because we have all the data.

When looking through the list of SQL functions in the SQL
Reference documentation, you will see "POP" and "SAMP"
appended to the covariance, standard deviation and variance-related functions. If
you are dealing with a set of data consisting of more than about 30 elements,
observations, or readings, you can use either the *function-name*_POP
or the *function-name*_SAMP function, because the two results will be nearly
identical. The "SAMP" functions apply a correction (known in statistics as
Bessel's correction) to provide a "better," or unbiased, estimate
of the true population parameter. In simple terms, the denominator of whichever
value is being computed uses n-1 instead of n, where "n" is the
number of data points, readings, observations, and so on.
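To make the two denominators concrete, here is a minimal sketch in Python, run outside the database. The function names and the readings are made up for illustration; this is not how Oracle implements VAR_POP or VAR_SAMP, only the arithmetic they are described as performing.

```python
import math

def var_pop(values):
    """Population variance: squared deviations divided by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def var_samp(values):
    """Sample variance: same numerator, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

readings = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data, n = 8, mean = 5

print(var_pop(readings))              # 32 / 8 = 4.0
print(var_samp(readings))             # 32 / 7, about 4.571
print(math.sqrt(var_pop(readings)))   # population standard deviation, 2.0
```

With only eight readings the two variances differ noticeably; the gap closes as n grows.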

A quick numerical example shows the practical equivalence of
dividing by n-1 versus dividing by n. A large number divided by n-1 is
practically the same (for our purposes, anyway) as that number divided by n
when n is greater than 30. The value of 34.483 (1000/(30-1)) versus that of 33.333
(1000/30) is around a 3% difference. When n is higher, like 100,
the "error" falls to about 1%, and it keeps shrinking as n grows. We will see
this lack of a difference in the following examples using the sample schemas
that ship with Oracle9i.
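The same comparison can be tabulated for several values of n. The sketch below is plain Python with a hypothetical helper name; it measures the gap relative to the n-based value, which works out to 1/(n-1).

```python
def relative_gap(n, total=1000.0):
    """Gap between total/(n-1) and total/n, as a fraction of total/n."""
    by_n = total / n
    by_n_minus_1 = total / (n - 1)
    return (by_n_minus_1 - by_n) / by_n

# Print the gap as a percentage for a few sample sizes.
for n in (30, 100, 1000, 1_000_000):
    print(n, f"{relative_gap(n):.5%}")
```

For n = 30 the gap is 1/29 (about 3.4%), and for n = 100 it is 1/99 (about 1%). A standard deviation takes the square root of the variance, so its gap is roughly half of these figures.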

Logging in as "SH" in the sales history schema,
and using the example shown in the SQL Reference guide, you can see there is no
practical difference between the sample standard deviation and the population
standard deviation.

SQL> SELECT STDDEV_POP (amount_sold) "Pop",
2 STDDEV_SAMP (amount_sold) "Samp"
3 FROM sales;

       Pop       Samp
---------- ----------
896.355151 896.355592

How many records are in this table? If you do not recall from
the last SQL function article, this table has over a million rows of data.

SQL> SELECT count(amount_sold) count
2 from sales;

     COUNT
----------
   1016271

As an illustrative example of how to use this particular SQL
function, the documentation is entirely correct and accurate. As a practical
example of when to choose STDDEV_POP over STDDEV_SAMP, the documentation falls
short. After reading this article, or if you already know something about
statistics, you know why the values returned are so similar: with over a
million rows, n and n-1 are nearly the same. The two values differ by slightly
more than 0.0004 (4/10,000), which is close enough to zero for all practical
purposes.
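Because both functions share the same numerator, the sample value should equal the population value multiplied by sqrt(n/(n-1)). The short Python check below uses only the row count and the "Pop" figure from the output above; the variable names are illustrative.

```python
import math

n = 1016271        # row count reported by the COUNT query
pop = 896.355151   # STDDEV_POP value from the query output

# The two standard deviations differ only by the factor sqrt(n / (n - 1)).
samp_estimate = pop * math.sqrt(n / (n - 1))

print(f"{samp_estimate:.6f}")         # 896.355592, the reported "Samp" value
print(f"{samp_estimate - pop:.6f}")   # 0.000441
```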

Users of SQL functions should be grateful for these more
powerful aggregate functions. If you were restricted to "simpler"
functions, the queries to compute these standard deviations would look like what
is shown below. Note: the "-" is a minus sign, not a continuation
symbol, and if you cut and paste the code, you may need to reformat it for
SQL*Plus.

select sqrt((sum(power((amount_sold),2)) - (power(sum(amount_sold),2)/count(amount_sold)))/
count(amount_sold)) "Pop"
from sales;

and

select sqrt((sum(power((amount_sold),2)) - (power(sum(amount_sold),2)/count(amount_sold)))/
(count(amount_sold)-1)) "Samp"
from sales;

Some of the parentheses were used to improve readability,
and as is readily apparent, these queries are considerably more
involved than the much simpler STDDEV variants. As a bonus, the
STDDEV functions are faster (just slightly so) than the long-hand
examples.
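The long-hand queries use the "sum of squares minus squared sum over n" form of the variance. As a rough cross-check outside the database, the Python sketch below applies the same formula to a handful of made-up amounts and compares the result with the standard library's `statistics.pstdev` and `statistics.stdev`. The helper name and data are assumptions for illustration.

```python
import math
import statistics

def stddev_by_sums(values, sample=False):
    """Standard deviation from running sums, mirroring the long-hand SQL."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(v * v for v in values)
    numerator = sum_x2 - (sum_x ** 2) / n
    denominator = n - 1 if sample else n
    return math.sqrt(numerator / denominator)

amounts = [10.0, 12.5, 9.75, 14.0, 11.25]   # made-up stand-in for amount_sold

# Each pair should agree to within floating-point rounding.
print(stddev_by_sums(amounts), statistics.pstdev(amounts))
print(stddev_by_sums(amounts, sample=True), statistics.stdev(amounts))
```

One caveat: in floating-point arithmetic this form can lose precision when the mean is very large compared with the spread, which is another reason to prefer the built-in functions.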

The VARIANCE, VAR_POP and VAR_SAMP functions are directly
related to their standard deviation counterparts, as standard deviation is
simply the square root of the variance. STDDEV and VARIANCE behave alike
when there is only one element (both return zero). STDDEV_SAMP
and VAR_SAMP return the same values as STDDEV and VARIANCE, except that
they return null when there is only one element.
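The single-element behavior is easy to lose track of, so here is a small Python model of the rules just described. It is only a sketch: the function names borrow the Oracle names for readability, `None` stands in for SQL NULL, and nothing here is Oracle's actual implementation.

```python
import math

def var_samp(values):
    """Sample variance (n - 1 denominator); None models NULL for one element."""
    n = len(values)
    if n < 2:
        return None
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def variance(values):
    """Same as var_samp, except a single element yields 0 instead of NULL."""
    return 0 if len(values) == 1 else var_samp(values)

def stddev_samp(values):
    v = var_samp(values)
    return None if v is None else math.sqrt(v)

def stddev(values):
    v = variance(values)
    return None if v is None else math.sqrt(v)

print(variance([42]), stddev([42]))         # 0 0.0
print(var_samp([42]), stddev_samp([42]))    # None None

data = [3, 5, 8]                            # made-up data
print(stddev(data) == stddev_samp(data))    # True: identical when n > 1
```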

COUNT, SUM and AVG (average) are probably the best-known SQL
functions, and there is nothing special about their use. There is one little
trick you can use concerning AVG, and that trick has to do with verifying
the output given an input when using a regression line (see the previous
article on SQL functions). The data point (X-bar, Y-bar) is a point on the
computed regression line even if the value of X-bar is not one of the original
observations or input values (or Y-bar is not an actual observed output).

Using the "C" channel from the sales table (there
were 20 data pairs), the computed regression line was Y = 16.627808 -
0.0683687(X). X-bar was 65.495 and Y-bar was 12.15. Plugging 65.495 into the
equation does, in fact, return a value of 12.15 for Y.
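The arithmetic can be checked with a few lines of Python. The first part plugs in the figures quoted above. The second part uses a hypothetical `fit_line` helper and made-up data to show why the property holds in general: the least-squares intercept is defined from the two means.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The intercept is computed from the means, which is why the fitted
    # line always passes through (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Figures quoted in the article for channel "C".
y_at_x_bar = 16.627808 - 0.0683687 * 65.495
print(round(y_at_x_bar, 4))   # 12.15

# Made-up data: x_bar = 2.5 and y_bar = 7.0.
xs, ys = [1, 2, 3, 4], [10, 8, 7, 3]
slope, intercept = fit_line(xs, ys)
print(abs(intercept + slope * 2.5 - 7.0) < 1e-12)   # True
```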

Without getting into details about what covariance is, you
should know that you have already seen it if you read the previous article. In
the regression line example, there was a computed value of -755.885 for the Sxy
term. This value was selected using the REGR_SXY function. If you select the
COVAR_POP value from the sales table (using the same "where" clause
as before):

SQL> SELECT
2 s.channel_id,
3 REGR_SXY(s.quantity_sold, p.prod_list_price) SXY,
4 COVAR_POP(s.quantity_sold, p.prod_list_price) COVAR_POP
5 FROM sales s, products p
6 WHERE s.prod_id=p.prod_id AND
7 p.prod_category='Men' AND
8 s.time_id=to_DATE('10-OCT-2000')
9 AND s.channel_id = 'C'
10 group by s.channel_id;

C        SXY  COVAR_POP
- ---------- ----------
C   -755.885  -37.79425

Guess what 20 times -37.79425 is? Answer: -755.885, which
is exactly the REGR_SXY value. This is no coincidence: REGR_SXY is the
number of data pairs (20 here) multiplied by COVAR_POP.
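The relationship between the two values follows from their definitions: Sxy is the sum of the products of deviations, and the population covariance is that same sum divided by n. A minimal Python sketch, with hypothetical helper names and made-up pairs, shows the identity and then re-checks the figures from the query output.

```python
def sxy(xs, ys):
    """Sum of products of deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

def covar_pop(xs, ys):
    """Population covariance: Sxy divided by the number of pairs."""
    return sxy(xs, ys) / len(xs)

# Made-up pairs, not the sales data.
xs, ys = [1, 2, 3, 4], [10, 8, 7, 3]
print(sxy(xs, ys), len(xs) * covar_pop(xs, ys))   # both -11.0

# Figures from the query output: 20 pairs, COVAR_POP = -37.79425.
print(round(20 * -37.79425, 3))                   # -755.885
```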

The point of demonstrating how some of these SQL functions
are related is not to turn you into a statistics wizard, but rather, to help provide
more insight into Oracle's analytic and data manipulation capabilities using
statistics-related SQL functions. As mentioned before, Oracle is rich with
analytic features (and poor in the interface) and knowing more about them
further enhances your skills as a DBA. A business analyst or report writer may
know the mathematical formula for computing some statistic, but not know how to
write the SQL to get it. Being more informed about SQL functions is, as Martha
Stewart would put it, "a good thing."

Bonus question: Knowing that the average is computed as the
sum of X divided by N, how else could you write the following:

select sqrt((sum(power((amount_sold),2)) - (power(sum(amount_sold),2)/count(amount_sold)))/
count(amount_sold)) "Pop"
from sales;

Answer:

select sqrt((sum(power((amount_sold),2)) - (avg(amount_sold)*sum(amount_sold)))/
count(amount_sold)) "Pop"
from sales;
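The rewrite works because avg(X) * sum(X) equals sum(X) squared divided by N. A quick Python check on made-up amounts (the variable names are illustrative) confirms that the two forms give the same result.

```python
import math

amounts = [10.0, 12.5, 9.75, 14.0, 11.25]   # made-up stand-in for amount_sold
n = len(amounts)
total = sum(amounts)
sum_sq = sum(a * a for a in amounts)
average = total / n

# Original form: (sum of X)^2 / N inside the numerator.
original = math.sqrt((sum_sq - total ** 2 / n) / n)

# Rewritten form: AVG(X) * SUM(X) replaces (sum of X)^2 / N.
rewritten = math.sqrt((sum_sq - average * total) / n)

print(math.isclose(original, rewritten))    # True
```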

See All Articles by Columnist **Steve Callan**