SqlCredit - Part 16: The Cost of Bloat
May 23, 2008
Download the code for this article.
This is part 16 of a series. If you have not read part 15, you can find it here, but this article does not require any knowledge of the previous work in the series.
Bloat: “We Never Delete Anything”
Every database designer, developer, and administrator I’ve talked to is dealing with a database that continues to grow over time, often at an alarming rate.
Applications often have tables that are fairly consistent (reference data), tables that grow fairly slowly over time (customers, stores), and tables that grow very quickly and to very large sizes (sales, purchases, web hits).
Database designers are more and more often receiving the orders, “Don’t delete anything.” This leads to database “bloat”. The question is, how does this affect performance? Can we reasonably support a million rows in a single table? How about a billion rows?
In this series’ current quest to understand costs of different architectures, we will look at what cost is paid in query performance as a table grows.
Building the Tests
If you’ve been following the series, skip this section and move down to the Test Specifics section.
For this test, I built a database and added four sets of Card, Vendor, and Purchase tables. These tables were introduced in previous installments, but here is the DDL for review:
CREATE TABLE dbo.Card1 (
    CardID        int         NOT NULL IDENTITY,
    CardNumber1   bigint      NOT NULL,
    CardNumber2   bigint      NOT NULL,
    CardNumber3   bigint      NOT NULL,
    CardNumber4   bigint      NOT NULL,
    SecurityCode1 smallint    NOT NULL,
    SecurityCode2 smallint    NOT NULL,
    SecurityCode3 smallint    NOT NULL,
    SecurityCode4 smallint    NOT NULL,
    SecureString1 varchar(36) NOT NULL,
    SecureString2 varchar(36) NOT NULL,
    SecureString3 varchar(36) NOT NULL,
    SecureString4 varchar(36) NOT NULL,
    CONSTRAINT Card1_PK PRIMARY KEY CLUSTERED (CardID)
)

CREATE TABLE dbo.Vendor1 (
    VendorID        int         NOT NULL IDENTITY,
    BusinessName    varchar(40) NOT NULL,
    Address1        varchar(40) NOT NULL,
    Address2        varchar(40) NOT NULL,
    City            varchar(40) NOT NULL,
    [State]         char(2)     NOT NULL,
    ZipCode         varchar(10) NOT NULL,
    DateCreated     datetime    NOT NULL
        CONSTRAINT Vendor1_DateCreated_DF DEFAULT GETUTCDATE(),
    DateLastUpdated datetime    NOT NULL
        CONSTRAINT Vendor1_DateLastUpdated_DF DEFAULT GETUTCDATE(),
    StatusCd        tinyint     NOT NULL
        CONSTRAINT Vendor1_StatusCd_DF DEFAULT 1,
    CONSTRAINT Vendor1_PK PRIMARY KEY CLUSTERED (VendorID)
)

CREATE TABLE dbo.Purchase1 (
    PurchaseID  int      NOT NULL IDENTITY,
    CardID      int      NOT NULL,
    VendorID    int      NOT NULL,
    Amount      money    NOT NULL,
    DateCreated datetime NOT NULL
        CONSTRAINT Purchase1_DateCreated_DF DEFAULT GETUTCDATE(),
    CONSTRAINT Purchase1_PK PRIMARY KEY CLUSTERED (PurchaseID),
    CONSTRAINT Purchase1_Card_FK FOREIGN KEY (CardID)
        REFERENCES dbo.Card1 (CardID),
    CONSTRAINT Purchase1_Vendor_FK FOREIGN KEY (VendorID)
        REFERENCES dbo.Vendor1 (VendorID)
)
For this test, I did not create any indexes other than the primary keys. Since the queries use only the primary keys, any additional indexes would add space and processing overhead without providing any benefit.
One thing I learned in building this set of tests is that creating a test table with 100 million records takes a long time. I put a reasonable amount of thought into the script that populated Purchase4 (the 100-million-record table). Still, populating the second half of the table (50 million records) started the morning of April 29th and finished the morning of May 3rd.
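The actual population script ships with the article's code download, but a batched approach along these lines keeps the transaction log manageable compared with one enormous insert. This is only a sketch: the batch size, the auxiliary dbo.Numbers table, and the assumption of one million Card and Vendor rows are all illustrative, not taken from the real script.

```sql
-- Hypothetical batched population of Purchase4 (illustrative sketch only;
-- the real script is in the article's code download).
DECLARE @Batch int
SET @Batch = 0

WHILE @Batch < 10000  -- 10,000 batches x 10,000 rows = 100 million rows
BEGIN
    INSERT INTO dbo.Purchase4 (CardID, VendorID, Amount)
    SELECT
        CAST(RAND(CHECKSUM(NEWID())) * 1000000 AS int) + 1,  -- random CardID (assumes 1M cards)
        CAST(RAND(CHECKSUM(NEWID())) * 1000000 AS int) + 1,  -- random VendorID (assumes 1M vendors)
        CAST(RAND(CHECKSUM(NEWID())) * 500 AS money)         -- random purchase amount
    FROM dbo.Numbers n      -- assumed auxiliary table of sequential integers
    WHERE n.Number <= 10000

    SET @Batch = @Batch + 1
END
```

Note the RAND(CHECKSUM(NEWID())) pattern: a bare RAND() is evaluated once per statement, so every row in the batch would get the same "random" value.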
For these tests, we keep the number of Card records and Vendor records the same and increase the number of Purchase records.
Each test reads 1,000 random Purchase records JOINed to Card and Vendor. It is important to understand that the workload is the same regardless of the size of the Purchase table.
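One iteration of that workload might look like the following. This is a hedged sketch of the idea, not the actual test harness (which is in the code download); @PurchaseCount stands for the row count of whichever Purchase table is under test.

```sql
-- Hypothetical single iteration of the test workload: seek one random
-- Purchase row and join it to its Card and Vendor. Repeated 1,000 times.
DECLARE @RandomPurchaseID int
SET @RandomPurchaseID = CAST(RAND() * @PurchaseCount AS int) + 1

SELECT p.PurchaseID, p.Amount, p.DateCreated,
       c.CardNumber1, v.BusinessName
FROM dbo.Purchase1 p
JOIN dbo.Card1   c ON c.CardID   = p.CardID
JOIN dbo.Vendor1 v ON v.VendorID = p.VendorID
WHERE p.PurchaseID = @RandomPurchaseID
```

Every access path here is a clustered primary key seek, which is why no secondary indexes were needed for the test.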
Given a standard workload (SELECTs only), what do you predict the penalty will be for significantly growing the size of just one of these tables? If we grow the record count 100 times, we don’t expect that reading 1,000 records will take 100 times as long, but we also don’t expect it to be free. I’ve given you the elapsed time for the 100-record test. Now make your informed guess as to what the other entries will be.
Here are the results of the tests.
Subsequent runs of the same test produced similar results.
Storing larger and larger amounts of data is often driven by the business needs of your organization. There are many different options for accommodating such a large data set, but it helps to know the effect of data growth given a simple architecture.
Interesting points from the results:
1. The larger the number of rows, the longer it takes to process a simple workload. (This is not surprising at all, but now there are specific numbers that show how much effect that bloat has.) This assumes that you are actually accessing all those rows. If 90% of your rows were old and never touched by the SELECT workload, the results would be different.
2. The SQL Server team is doing a good job of optimization. Notice the last column in the results. As the record count grows 100 times larger, it takes less than 4 times as long to process the records. And the differential keeps going down as the table gets bigger and bigger.
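One plausible explanation for this sublinear growth (my own reasoning, not an analysis from the tests) is that each lookup is a clustered-index seek, and the depth of a B-tree grows only logarithmically with the row count:

```latex
% Cost of one key seek in a B-tree with fan-out f and N rows:
\text{page reads} \approx \lceil \log_{f} N \rceil
% With an illustrative fan-out of f = 300 entries per index page:
N = 10^{6}: \lceil \log_{300} 10^{6} \rceil = 3 \qquad
N = 10^{8}: \lceil \log_{300} 10^{8} \rceil = 4
```

Under those assumed numbers, growing the table 100-fold adds only about one extra page read per seek, which is consistent with the workload taking far less than 100 times as long (buffer-cache hit rates on the upper index levels help as well).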
What are your thoughts here? Are you being asked to keep huge amounts of data? If so, are you being given the development time to modify your architecture to deal with the growth, or are you growing inside the same architecture? Drop a comment in the forum.