Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Tips Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS Access
SQL Scripts & Samples
» Database Forum
» Slideshows
» Sitemap
Free Newsletters:

By submitting your information, you agree that databasejournal.com may send you databasejournal offers via email, phone and text message, as well as email offers about other products and services that databasejournal believes may be of interest to you. databasejournal will process your information in accordance with the Quinstreet Privacy Policy.

News Via RSS Feed

Database Journal |DBA Support |SQLCourse |SQLCourse2

Featured Database Articles


Posted Mar 18, 2004

Speeding Up Performance with Query Sampling - Page 2

By DatabaseJournal.com Staff

by Paul C. Zikopoulos and Roman B. Melnyk

The TABLESAMPLE clause in DB2 UDB Version 8.1.2

Figure 2, which shows a partial syntax diagram for the SQL subselect, includes only those syntax fragments that form a path to the TABLESAMPLE clause. You can specify one of two different types of sampling, BERNOULLI or SYSTEM:

  • Row-level Bernoulli sampling. This method uses a "sargable" predicate to retrieve a sample of rows from a table. A sargable predicate is a predicate that can exploit an index to retrieve rows from a table. If no index is available, the performance associated with this approach can be relatively poor, because every row must be retrieved and the sampling predicate applied to it.
  • Page-level system sampling. In most cases, this method retrieves a sample of pages (rather than rows) from a table. If a page is selected, all of the rows in the page are selected. The performance associated with this approach is excellent, because of the minimal amount of input/output (I/O) activity that is required. If SYSTEM sampling is specified, the optimizer determines the most efficient sampling strategy in the given circumstances; in some cases, this may prove to be Bernoulli sampling.

Figure 2. A partial syntax diagram for the SQL subselect, showing only those syntax fragments that form a path to the TABLESAMPLE clause. The highlighted clause in each fragment points to the next fragment in the diagram.

Let's compare row-level sampling with page-level sampling by considering the following scenario. Assume that an index (to facilitate row-level sampling) exists, and that R represents the average number of rows per page.

In this scenario, for a given sampling rate:

  • The number of I/Os required for page-level sampling is 1/R times the number of I/Os required for row-level sampling; page-level sampling offers significant performance benefits.
  • The accuracy of aggregate estimates depends on the degree of data clustering if the rows are clustered on any columns referenced in the query. For example, suppose the query is computing SUM(x). If the rows are randomly distributed among pages, the accuracy of page-level sampling will be similar to the accuracy of row-level sampling. If, however, all of the x-values within a given page are about the same, but the x-values differ significantly from page to page, the accuracy of page-level sampling will be inferior to the accuracy of row-level sampling, because selecting more than one row per page would not add any new information to the sample. In this clustered case, it would be better to select many pages, and to pick only one row from each selected page (which is essentially row-level sampling).

Each execution of the query usually yields a different sample, and the number of rows in the sample may be slightly different. If you want the sampling to be repeatable from one execution of the query to the next (during testing, for example), you can specify the REPEATABLE keyword, followed by a numeric expression in parentheses, on the TABLESAMPLE clause. The numeric expression is used to ensure that repeated executions of the query return the same sample if the data has not been altered in any way.


As you can see, DB2 UDB query sampling support, which allows you to specify the percentage of a table that is to be sampled during query processing, can be a very useful way to relieve some of the pressure on heavy query workloads. Sampling can also be used to obtain a random subset of rows for auditing purposes. In general, it is good practice to use the TABLESAMPLE clause with only one table in a query, typically the fact table in a star schema. If you specify a TABLESAMPLE clause for more than one table, results could be difficult to interpret. We recommend that you experiment and learn more about this feature by observing the effects of applying different sampling rates to your test queries.

About the Authors

Paul C. Zikopoulos, BA, MBA, is with IBM Canada Ltd. Paul has written numerous magazine articles and books about DB2. Paul has co-authored the books: DB2 Version 8: The Official Guide, DB2: The Complete Reference, DB2 Fundamentals Certification for Dummies, DB2 for Dummies, and A DBA's Guide to Databases on Linux. Paul is a DB2 Certified Advanced Technical Expert (DRDA and Cluster/EEE) and a DB2 Certified Solutions Expert (Business Intelligence and Database Administration). You can reach him at: paulz_ibm at msn.com.

Roman B. Melnyk, PhD, is with IBM Canada Ltd., specializing in database administration, DB2 utilities, and SQL. Roman has written numerous DB2 books, articles, and other related materials. Roman co-authored DB2 Version 8: The Official Guide, DB2: The Complete Reference, DB2 Fundamentals Certification for Dummies, and DB2 for Dummies. You can reach him at roman_b_melnyk at hotmail.com.

DB2 Archives

Comment and Contribute


(Maximum characters: 1200). You have characters left.