This is part 16 of a series. If you have not read part 15,
you can find it here,
but this article does not require any knowledge of the previous work in the
series.
Bloat: We Never Delete Anything
Every database designer, developer, and administrator
Ive talked to is dealing with a database that continues to grow over time,
often at an alarming rate.
Applications often have tables that are fairly consistent
(reference data), tables that grow fairly slowly over time (customers, stores),
and tables that grow very quickly and to very large sizes (sales, purchases,
web hits).
Database designers are more and more often receiving the
orders, Dont delete anything. This leads to database bloat. The question
is, how does this affect performance? Can we reasonably support a million rows
in a single table? How about a billion rows?
In this series current quest to understand costs of
different architectures, we will look at what cost is paid in query performance
as a table grows.
Building the Tests
Test Introduction
If youve been following the series, skip this section and
move down to the Test Specifics section.
For this test, I built a database and added four sets of
Card, Vendor, and Purchase tables. These tables were introduced in previous
installments, but here is the DDL for review:
CREATE TABLE dbo.Card1 (
CardID int NOT NULL IDENTITY,
CardNumber1 bigint NOT NULL,
CardNumber2 bigint NOT NULL,
CardNumber3 bigint NOT NULL,
CardNumber4 bigint NOT NULL,
SecurityCode1 smallint NOT NULL,
SecurityCode2 smallint NOT NULL,
SecurityCode3 smallint NOT NULL,
SecurityCode4 smallint NOT NULL,
SecureString1 varchar(36) NOT NULL,
SecureString2 varchar(36) NOT NULL,
SecureString3 varchar(36) NOT NULL,
SecureString4 varchar(36) NOT NULL,
CONSTRAINT Card1_PK PRIMARY KEY CLUSTERED (CardID)
)
CREATE TABLE dbo.Vendor1 (
VendorID int NOT NULL IDENTITY,
BusinessName varchar(40) NOT NULL,
Address1 varchar(40) NOT NULL,
Address2 varchar(40) NOT NULL,
City varchar(40) NOT NULL,
[State] char(2) NOT NULL,
ZipCode varchar(10) NOT NULL,
DateCreated datetime NOT NULL
CONSTRAINT Vendor1_DateCreated_DF DEFAULT GETUTCDATE(),
DateLastUpdated datetime NOT NULL
CONSTRAINT Vendor1_DateLastUpdated_DF DEFAULT GETUTCDATE(),
StatusCd tinyint NOT NULL
CONSTRAINT Vendor1_StatusCd_DF DEFAULT 1,
CONSTRAINT Vendor1_PK PRIMARY KEY CLUSTERED (VendorID)
)
CREATE TABLE dbo.Purchase1 (
PurchaseID int NOT NULL IDENTITY,
CardID int NOT NULL,
VendorID int NOT NULL,
Amount money NOT NULL,
DateCreated datetime NOT NULL
CONSTRAINT Purchase1_DateCreated_DF DEFAULT GETUTCDATE(),
CONSTRAINT Purchase1_PK PRIMARY KEY CLUSTERED (PurchaseID),
CONSTRAINT Purchase1_Card_FK FOREIGN KEY (CardID)
REFERENCES dbo.Card1 (CardID),
CONSTRAINT Purchase1_Vendor_FK FOREIGN KEY (VendorID)
REFERENCES dbo.Vendor1 (VendorID)
)
For this test, I did not create any indexes other than the
primary keys. The queries use the primary keys, so the other indexes are just space
and processing overhead.
One thing I learned in building this set of tests is that
creating a test table with 100 million records takes a long time. I put
a reasonable amount of thought into the script that populated Purchase4 (the
100-million-record table). Still, populating the second half of the table (50
million records) started the morning of April 29th and finished the morning of
May 3rd.
Test Specifics
For these tests, we keep the number of Card records and
Vendor records the same and increase the number of Purchase records.
Card1: 100,000 records
Vendor1: 100,000 records
Purchase1: 100 records
Card2: 100,000 records
Vendor2: 100,000 records
Purchase2: 10,000 records (100 times the size of
Purchase1)
Card3: 100,000 records
Vendor3: 100,000 records
Purchase3: 1,000,000 records (100 times the size of
Purchase2)
Card4: 100,000 records
Vendor4: 100,000 records
Purchase4: 100,000,000 records (100 times the size of
Purchase3)
Each test reads 1,000 random Purchase records JOINed to Card
and Vendor. It is important to understand that the workload is the same
regardless of the size of the Purchase table.
Given a standard workload (SELECTs only), what do you
predict the penalty will be for significantly growing the size of just one of
these tables? If we grow the record count 100 times, we dont expect that
reading 1,000 records will take 100 times as long, but we also dont expect it
to be free. Ive given you the elapsed time for the 100-record test. Now make
your informed guess as to what the other entries will be.
Test
Elapsed Time (ms)
Predicted Penalty
1 (100 Purchase
records)
580
-
2 (10,000 Purchase
records)
3(1,000,000 Purchase
records)
4 (100,000,000
Purchase records)
Results
Here are the results of the tests.
Run 1
Test
Card
Records
Vendor
Records
Purchase
Records
Elapsed Time (ms)
Penalty vs. Set 1
Penalty vs. Next-Smaller Set
1
100,000
100,000
100
580
-
-
2
100,000
100,000
10,000
2,313
3.988
3.988
3
100,000
100,000
1,000,000
5,936
10.234
2.566
4
100,000
100,000
100,000,000
11,763
20.281
1.982
Run 2
Test
Card
Records
Vendor
Records
Purchase
Records
Elapsed Time (ms)
Penalty vs. Set 1
Penalty vs. Next-Smaller Set
1
100,000
100,000
100
546
-
-
2
100,000
100,000
10,000
2,156
3.949
3.949
3
100,000
100,000
1,000,000
5,766
10.560
2.674
4
100,000
100,000
100,000,000
11,796
21.604
2.046
Subsequent runs of the same test produced similar results.
Conclusion
Storing larger and larger amounts of data is often driven by
the business needs of your organization. There are many different options for
how to accommodate such a large data set, but it helps to know the affect of
data growth given a simple architecture.
Interesting points from the results:
1. The
larger the number of rows, the longer it takes to process a simple workload.
(This is not surprising at all, but now there are specific numbers that
show how much affect that bloat has.) This assumes that you are actually
accessing all those rows. If 90% of your rows were old and not accessed in the
SELECT workload, the results would be different.
2. The
SQL Server team is doing a good job of optimization. Notice the last column in
the results. As the record count grows 100 times larger, it takes less than 4
times as long to process the records. And the differential keeps going down
as the table get bigger and bigger.
What are your
thoughts here? Are you being asked to keep huge amounts of data? If so, are
you being given the development time to modify your architecture to deal with
the growth, or are you growing inside the same architecture? Drop a comment in
the forum.
Add databasejournal.com to your favorites Add databasejournal.com to your browser search box IE 7 | Firefox 2.0 | Firefox 1.5.xReceive news via our XML/RSS feed