SHARE

SqlCredit – Part 16: The Cost of Bloat

Written By

May 23, 2008

4 minute read

Download the code for this article.

This is part 16 of a series. If you have not read part 15, you can find it here, but this article does not require any knowledge of the previous work in the series.

Bloat: “We Never Delete Anything”

Every database designer, developer, and administrator I’ve talked to is dealing with a database that continues to grow over time, often at an alarming rate.

Applications often have tables that are fairly consistent (reference data), tables that grow fairly slowly over time (customers, stores), and tables that grow very quickly and to very large sizes (sales, purchases, web hits).

Database designers are more and more often receiving the orders, “Don’t delete anything.” This leads to database “bloat”. The question is, how does this affect performance? Can we reasonably support a million rows in a single table? How about a billion rows?

In this series’ current quest to understand costs of different architectures, we will look at what cost is paid in query performance as a table grows.

Building the Tests

Test Introduction

If you’ve been following the series, skip this section and move down to the Test Specifics section.

For this test, I built a database and added four sets of Card, Vendor, and Purchase tables. These tables were introduced in previous installments, but here is the DDL for review:

 CREATE TABLE dbo.Card1 (
     CardID         int          NOT NULL  IDENTITY,
     CardNumber1    bigint       NOT NULL,
     CardNumber2    bigint       NOT NULL,
     CardNumber3    bigint       NOT NULL,
     CardNumber4    bigint       NOT NULL,
     SecurityCode1  smallint     NOT NULL,
     SecurityCode2  smallint     NOT NULL,
     SecurityCode3  smallint     NOT NULL,
     SecurityCode4  smallint     NOT NULL,
     SecureString1  varchar(36)  NOT NULL,
     SecureString2  varchar(36)  NOT NULL,
     SecureString3  varchar(36)  NOT NULL,
     SecureString4  varchar(36)  NOT NULL,
 
     CONSTRAINT Card1_PK PRIMARY KEY CLUSTERED (CardID)
 )
 
 CREATE TABLE dbo.Vendor1 (
     VendorID         int          NOT NULL  IDENTITY,
     BusinessName     varchar(40)  NOT NULL,
     Address1         varchar(40)  NOT NULL,
     Address2         varchar(40)  NOT NULL,
     City             varchar(40)  NOT NULL,
     [State]          char(2)      NOT NULL,
     ZipCode          varchar(10)  NOT NULL,
     DateCreated      datetime     NOT NULL
         CONSTRAINT Vendor1_DateCreated_DF     DEFAULT GETUTCDATE(),
     DateLastUpdated  datetime     NOT NULL
         CONSTRAINT Vendor1_DateLastUpdated_DF DEFAULT GETUTCDATE(),
     StatusCd         tinyint      NOT NULL
         CONSTRAINT Vendor1_StatusCd_DF        DEFAULT 1,
 
     CONSTRAINT Vendor1_PK PRIMARY KEY CLUSTERED (VendorID)
 )
 
 CREATE TABLE dbo.Purchase1 (
     PurchaseID   int       NOT NULL  IDENTITY,
     CardID       int       NOT NULL,
     VendorID     int       NOT NULL,
     Amount       money     NOT NULL,
     DateCreated  datetime  NOT NULL
         CONSTRAINT Purchase1_DateCreated_DF DEFAULT GETUTCDATE(),
 
     CONSTRAINT Purchase1_PK PRIMARY KEY CLUSTERED (PurchaseID),
     CONSTRAINT Purchase1_Card_FK FOREIGN KEY (CardID)
         REFERENCES dbo.Card1 (CardID),
     CONSTRAINT Purchase1_Vendor_FK FOREIGN KEY (VendorID)
         REFERENCES dbo.Vendor1 (VendorID)
 )

For this test, I did not create any indexes other than the primary keys. The queries use the primary keys, so the other indexes are just space and processing overhead.

One thing I learned in building this set of tests is that creating a test table with 100 million records takes a long time. I put a reasonable amount of thought into the script that populated Purchase4 (the 100-million-record table). Still, populating the second half of the table (50 million records) started the morning of April 29th and finished the morning of May 3rd.

Test Specifics

For these tests, we keep the number of Card records and Vendor records the same and increase the number of Purchase records.

Card1: 100,000 records
Vendor1: 100,000 records
Purchase1: 100 records
Card2: 100,000 records
Vendor2: 100,000 records
Purchase2: 10,000 records (100 times the size of Purchase1)
Card3: 100,000 records
Vendor3: 100,000 records
Purchase3: 1,000,000 records (100 times the size of Purchase2)
Card4: 100,000 records
Vendor4: 100,000 records
Purchase4: 100,000,000 records (100 times the size of Purchase3)

Each test reads 1,000 random Purchase records JOINed to Card and Vendor. It is important to understand that the workload is the same regardless of the size of the Purchase table.

Given a standard workload (SELECTs only), what do you predict the penalty will be for significantly growing the size of just one of these tables? If we grow the record count 100 times, we don’t expect that reading 1,000 records will take 100 times as long, but we also don’t expect it to be free. I’ve given you the elapsed time for the 100-record test. Now make your informed guess as to what the other entries will be.

Test	Elapsed Time (ms)	Predicted Penalty
1 (100 Purchase records)	580	–
2 (10,000 Purchase records)
3(1,000,000 Purchase records)
4 (100,000,000 Purchase records)

Results

Here are the results of the tests.

Run 1

Test	Card Records	Vendor Records	Purchase Records	Elapsed Time (ms)	Penalty vs. Set 1	Penalty vs. Next-Smaller Set
1	100,000	100,000	100	580	–	–
2	100,000	100,000	10,000	2,313	3.988	3.988
3	100,000	100,000	1,000,000	5,936	10.234	2.566
4	100,000	100,000	100,000,000	11,763	20.281	1.982

Run 2

Test	Card Records	Vendor Records	Purchase Records	Elapsed Time (ms)	Penalty vs. Set 1	Penalty vs. Next-Smaller Set
1	100,000	100,000	100	546	–	–
2	100,000	100,000	10,000	2,156	3.949	3.949
3	100,000	100,000	1,000,000	5,766	10.560	2.674
4	100,000	100,000	100,000,000	11,796	21.604	2.046

Subsequent runs of the same test produced similar results.

Conclusion

Storing larger and larger amounts of data is often driven by the business needs of your organization. There are many different options for how to accommodate such a large data set, but it helps to know the affect of data growth given a simple architecture.

Interesting points from the results:

1. The larger the number of rows, the longer it takes to process a simple workload. (This is not surprising at all, but now there are specific numbers that show how much affect that bloat has.) This assumes that you are actually accessing all those rows. If 90% of your rows were old and not accessed in the SELECT workload, the results would be different.

2. The SQL Server team is doing a good job of optimization. Notice the last column in the results. As the record count grows 100 times larger, it takes less than 4 times as long to process the records. And the differential keeps going down as the table get bigger and bigger.

What are your thoughts here? Are you being asked to keep huge amounts of data? If so, are you being given the development time to modify your architecture to deal with the growth, or are you growing inside the same architecture? Drop a comment in the forum.

Download the code for this article.

» See All Articles by Columnist Rob Garrison

RG

Rob Garrison

Rob Garrison is a database architect at Fiserv in Portland, Oregon specializing in OLTP design and development. He has over twenty years of IT experience, with over ten years in database development and architecture.