Download the code for this article.
This is part 16 of a series. If you have not read part 15, you can find it here, but this article does not require any knowledge of the previous work in the series.
Bloat: “We Never Delete Anything”
Every database designer, developer, and administrator I’ve talked to is dealing with a database that continues to grow over time, often at an alarming rate.
Applications often have tables that are fairly consistent (reference data), tables that grow fairly slowly over time (customers, stores), and tables that grow very quickly and to very large sizes (sales, purchases, web hits).
Database designers are increasingly receiving the order, “Don’t delete anything.” This leads to database “bloat”. The question is, how does this affect performance? Can we reasonably support a million rows in a single table? How about a billion?
In this series’ current quest to understand costs of different architectures, we will look at what cost is paid in query performance as a table grows.
Building the Tests
Test Introduction
If you’ve been following the series, skip this section and move down to the Test Specifics section.
For this test, I built a database and added four sets of Card, Vendor, and Purchase tables. These tables were introduced in previous installments, but here is the DDL for review:
```sql
CREATE TABLE dbo.Card1
(
    CardID int NOT NULL IDENTITY,
    CardNumber1 bigint NOT NULL,
    CardNumber2 bigint NOT NULL,
    CardNumber3 bigint NOT NULL,
    CardNumber4 bigint NOT NULL,
    SecurityCode1 smallint NOT NULL,
    SecurityCode2 smallint NOT NULL,
    SecurityCode3 smallint NOT NULL,
    SecurityCode4 smallint NOT NULL,
    SecureString1 varchar(36) NOT NULL,
    SecureString2 varchar(36) NOT NULL,
    SecureString3 varchar(36) NOT NULL,
    SecureString4 varchar(36) NOT NULL,
    CONSTRAINT Card1_PK PRIMARY KEY CLUSTERED (CardID)
);

CREATE TABLE dbo.Vendor1
(
    VendorID int NOT NULL IDENTITY,
    BusinessName varchar(40) NOT NULL,
    Address1 varchar(40) NOT NULL,
    Address2 varchar(40) NOT NULL,
    City varchar(40) NOT NULL,
    [State] char(2) NOT NULL,
    ZipCode varchar(10) NOT NULL,
    DateCreated datetime NOT NULL
        CONSTRAINT Vendor1_DateCreated_DF DEFAULT GETUTCDATE(),
    DateLastUpdated datetime NOT NULL
        CONSTRAINT Vendor1_DateLastUpdated_DF DEFAULT GETUTCDATE(),
    StatusCd tinyint NOT NULL
        CONSTRAINT Vendor1_StatusCd_DF DEFAULT 1,
    CONSTRAINT Vendor1_PK PRIMARY KEY CLUSTERED (VendorID)
);

CREATE TABLE dbo.Purchase1
(
    PurchaseID int NOT NULL IDENTITY,
    CardID int NOT NULL,
    VendorID int NOT NULL,
    Amount money NOT NULL,
    DateCreated datetime NOT NULL
        CONSTRAINT Purchase1_DateCreated_DF DEFAULT GETUTCDATE(),
    CONSTRAINT Purchase1_PK PRIMARY KEY CLUSTERED (PurchaseID),
    CONSTRAINT Purchase1_Card_FK FOREIGN KEY (CardID)
        REFERENCES dbo.Card1 (CardID),
    CONSTRAINT Purchase1_Vendor_FK FOREIGN KEY (VendorID)
        REFERENCES dbo.Vendor1 (VendorID)
);
```
For this test, I did not create any indexes other than the primary keys. The queries use only the primary keys, so any additional indexes would just be space and processing overhead.
One thing I learned in building this set of tests is that creating a test table with 100 million records takes a long time. I put a reasonable amount of thought into the script that populated Purchase4 (the 100-million-record table). Still, populating the second half of the table (50 million records) started the morning of April 29th and finished the morning of May 3rd.
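The population script itself is not reproduced in this excerpt, but a batched loop along these lines is typical. This is a hypothetical sketch, not the article's actual script: the batch size, the `SET NOCOUNT` housekeeping, and the `NEWID()`-based random distribution are all my assumptions.

```sql
-- Hypothetical sketch of a batched population loop for Purchase4.
-- Assumes Card4 and Vendor4 each already hold 100,000 rows with
-- contiguous IDs starting at 1; the batch size is an arbitrary choice.
SET NOCOUNT ON;

DECLARE @TargetRows int = 100000000;
DECLARE @BatchSize  int = 100000;
DECLARE @Inserted   int = 0;

WHILE @Inserted < @TargetRows
BEGIN
    INSERT INTO dbo.Purchase4 (CardID, VendorID, Amount)
    SELECT TOP (@BatchSize)
        ABS(CHECKSUM(NEWID())) % 100000 + 1,                 -- random CardID
        ABS(CHECKSUM(NEWID())) % 100000 + 1,                 -- random VendorID
        CAST(ABS(CHECKSUM(NEWID())) % 10000 AS money) / 100  -- random amount
    FROM sys.all_columns AS a
    CROSS JOIN sys.all_columns AS b;                         -- cheap row source

    SET @Inserted += @BatchSize;
END;
```

Batching keeps each transaction (and the log growth per transaction) bounded, which matters far more at 100 million rows than at 100.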
Test Specifics
For these tests, we keep the number of Card records and Vendor records the same and increase the number of Purchase records.
- Card1: 100,000 records
- Vendor1: 100,000 records
- Purchase1: 100 records
- Card2: 100,000 records
- Vendor2: 100,000 records
- Purchase2: 10,000 records (100 times the size of Purchase1)
- Card3: 100,000 records
- Vendor3: 100,000 records
- Purchase3: 1,000,000 records (100 times the size of Purchase2)
- Card4: 100,000 records
- Vendor4: 100,000 records
- Purchase4: 100,000,000 records (100 times the size of Purchase3)
Each test reads 1,000 random Purchase records JOINed to Card and Vendor. It is important to understand that the workload is the same regardless of the size of the Purchase table.
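The workload query is not shown in this excerpt. One iteration might look like the sketch below; the column list and the `NEWID()`-based sampling are my assumptions, though the table and key names match the DDL above.

```sql
-- Hypothetical sketch of one of the 1,000 random reads against set 1.
DECLARE @MaxPurchaseID int = (SELECT MAX(PurchaseID) FROM dbo.Purchase1);
DECLARE @PurchaseID    int = ABS(CHECKSUM(NEWID())) % @MaxPurchaseID + 1;

SELECT p.PurchaseID, p.Amount, p.DateCreated,
       c.CardNumber1,
       v.BusinessName, v.City, v.[State]
FROM dbo.Purchase1 AS p
JOIN dbo.Card1     AS c ON c.CardID   = p.CardID
JOIN dbo.Vendor1   AS v ON v.VendorID = p.VendorID
WHERE p.PurchaseID = @PurchaseID;
```

Because every lookup is a primary-key seek, each read touches only a handful of index pages regardless of table size, which is exactly what the test is designed to measure.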
Given a standard workload (SELECTs only), what do you predict the penalty will be for significantly growing the size of just one of these tables? If we grow the record count 100 times, we don’t expect that reading 1,000 records will take 100 times as long, but we also don’t expect it to be free. I’ve given you the elapsed time for the 100-record test. Now make your informed guess as to what the other entries will be.
| Test | Elapsed Time (ms) | Predicted Penalty |
|---|---|---|
| 1 (100 Purchase records) | 580 | – |
| 2 (10,000 Purchase records) | | |
| 3 (1,000,000 Purchase records) | | |
| 4 (100,000,000 Purchase records) | | |
Results
Here are the results of the tests.
Run 1
| Test | Card | Vendor | Purchase | Elapsed Time (ms) | Penalty vs. Set 1 | Penalty vs. Next-Smaller Set |
|---|---|---|---|---|---|---|
| 1 | 100,000 | 100,000 | 100 | 580 | – | – |
| 2 | 100,000 | 100,000 | 10,000 | 2,313 | 3.988 | 3.988 |
| 3 | 100,000 | 100,000 | 1,000,000 | 5,936 | 10.234 | 2.566 |
| 4 | 100,000 | 100,000 | 100,000,000 | 11,763 | 20.281 | 1.982 |
Run 2
| Test | Card | Vendor | Purchase | Elapsed Time (ms) | Penalty vs. Set 1 | Penalty vs. Next-Smaller Set |
|---|---|---|---|---|---|---|
| 1 | 100,000 | 100,000 | 100 | 546 | – | – |
| 2 | 100,000 | 100,000 | 10,000 | 2,156 | 3.949 | 3.949 |
| 3 | 100,000 | 100,000 | 1,000,000 | 5,766 | 10.560 | 2.674 |
| 4 | 100,000 | 100,000 | 100,000,000 | 11,796 | 21.604 | 2.046 |
Subsequent runs of the same test produced similar results.
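The penalty columns are nothing more than ratios of the raw elapsed times; here is a quick sanity check of Run 1's figures (the numbers come straight from the table above):

```python
# Recompute Run 1's penalty columns from the raw elapsed times (ms).
elapsed = {1: 580, 2: 2313, 3: 5936, 4: 11763}

for test in (2, 3, 4):
    vs_set1 = elapsed[test] / elapsed[1]          # penalty vs. the 100-row set
    vs_prev = elapsed[test] / elapsed[test - 1]   # penalty vs. next-smaller set
    print(f"Test {test}: {vs_set1:.3f} vs. set 1, {vs_prev:.3f} vs. previous")
```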
Conclusion
Storing larger and larger amounts of data is often driven by the business needs of your organization. There are many options for accommodating such a large data set, but it helps to know the effect of data growth given a simple architecture.
Interesting points from the results:
1. The larger the number of rows, the longer it takes to process a simple workload. (This is not surprising at all, but now there are specific numbers that show how much effect that bloat has.) This assumes that you are actually accessing all those rows; if 90% of your rows were old and never touched by the SELECT workload, the results would be different.
2. The SQL Server team is doing a good job of optimization. Notice the last column in the results: as the record count grows 100 times larger, it takes less than 4 times as long to process the records, and the differential keeps going down as the table gets bigger and bigger.
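One plausible explanation for that shrinking differential (my reasoning, not confirmed by the tests) is B-tree depth: a clustered index seek costs roughly one page read per index level, and the number of levels grows with the logarithm of the row count, not linearly. A rough sketch, assuming a made-up but plausible fanout of about 500 keys per index page:

```python
import math

# Approximate B-tree depth for each Purchase table size, assuming
# each index page holds about 500 keys (an assumed fanout).
FANOUT = 500

for rows in (100, 10_000, 1_000_000, 100_000_000):
    depth = max(1, math.ceil(math.log(rows, FANOUT)))
    print(f"{rows:>11,} rows -> ~{depth} index levels per seek")
```

Under this model, growing from 1 million to 100 million rows barely adds an index level, which is consistent with the per-seek cost rising far more slowly than the row count.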
What are your thoughts here? Are you being asked to keep huge amounts of data? If so, are you being given the development time to modify your architecture to deal with the growth, or are you growing inside the same architecture? Drop a comment in the forum.