by Paul C. Zikopoulos and Roman B. Melnyk
Today’s databases (and especially data warehouses) have become so large, and their query workloads have become so complex, that it is often impractical and even unnecessary to retrieve all of the data that may be relevant to a particular query. Rather than use all of the data that satisfies some search condition, it might be possible in some cases to arrive at valid conclusions based on analysis of a relatively small sampling of the data. With such an approach, random sampling of a database could greatly increase query performance. In fact, one could reasonably expect the resulting performance increases to be more or less inversely proportional to the sample size.
This article outlines SQL query sampling support in the IBM DB2 Universal Database (DB2 UDB) products as of Version 8.1.2. Sampling is supported in DB2 UDB V8.1.2 with the new TABLESAMPLE clause, which has been added to the FROM clause of the SQL subselect. The TABLESAMPLE clause allows you to specify the percentage of a table that is to be sampled during query processing. The value that you specify here is called the “sampling percentage” or the “sampling rate”. For example, if the sampling rate is set to 0.1, then only 1/10 of 1% of the data will be sampled. This would mean that in a 10,000-row table, only 10 rows (approximately) would be accessed.
How could such a small sample produce valid results? There is empirical evidence demonstrating the accuracy of aggregate values (output from column functions such as AVG), even with low sampling rates such as 1% or less, when tables are large enough.
The most common application of query sampling is to queries that use aggregate functions such as AVG, SUM, or COUNT. In such cases, reasonably accurate results can be obtained from a sample of the data.
We can demonstrate this in a simple, yet effective way by running a set of 100 queries against the STAFF table of the SAMPLE database that comes with DB2 UDB. You can try this little exercise yourself; the easiest way to do it is to imbed the queries in a DB2 command script. Each query has the following structure:
db2 select avg(salary) as avg_salary
where n is the sampling rate, whose value ranges from 1 to 100, in increments of 1. Running such queries against a table such as STAFF, which has 35 records, shows that reasonably accurate results can be obtained, even with small tables, if the sampling rate is set high enough. Figure 1 shows that, in this case, that rate is still impressively low, somewhere around 25%.
Figure 1. A linear plot of the results from 100 queries against the STAFF table. The queries are identical except for the sampling rate specified on the TABLESAMPLE clause.