When you think of SQL Server performance components, you
think of the usage of CPU, memory, and the I/O it takes to process a query. However,
there is another component you should consider. That component is disk space
usage. In the old days disk space was expensive, so much so people spent great
effort to conserver disk space. Remember the Y2K issues with dates being
stored as a 6 character field. The ancient computer programmers left off the
first two digits of the year to save 2 bytes of space when storing each date.
Today disk drives are inexpensive, so we don’t spend great deals of time thinking
of ways to optimize our database design to minimize disk space usage. Nevertheless,
every extra byte of space you waste in your database causes a performance hit
to your application. This article looks at disk space usage and how it affects
performance.
Performance Impacts of Disk Space Usage
To discuss performance impacts of disk space usage we need
to review what it takes to retrieve and write data to disk drives. Each time
you need to read a piece of data from SQL Server you need to retrieve the
information from disk. This retrieval causes a disk I/O. Data in SQL Server
is stored in a number of different physical pages. Each single page is 8060
bytes long. For every page of data, SQL Server requires one I/O to retrieve
that data.
To better understand how I/O can impact performance let’s
consider how many I/O’s it would take to retrieve 10,000,000 records from a SQL
Server table. Say each record is 300 bytes long. This means you can store 26
different records per page. The entire 10,000,000 records would require
384,616 data pages just to store the raw data; this doesn’t take into account
space for indexes. So to read every record in my large 10 million record table
it would take 384,616 I/O.
Now say I saved just 2 bytes of data from every record,
making each record 298 bytes long. This would mean you could store 27 records
per SQL Server page. With that 2-byte savings, you now could retrieve 1 more
record with each I/O operation. The total I/O savings if you read the entire
10,000,000 record table would be 14,245 I/O’s. This is a lot of I/O you would
save for only 2 bytes of space saved per record.
So, each time you can save a few bytes of data from each
record stored in a SQL Server table you improve your performance. The larger
the table the bigger the performance gains you will see. Therefore, you want
to try to minimize your record size so you can maximize the number of records
that can be stored in each data page.
Now I/O is not your only savings when you minimize the space
it takes to store your data. Keep in mind each page that is read first needs
to be stored in the buffer pool. So the smaller the record sizes the more
records you can fit into a single buffer pool page. Therefore, by conserving
disk space for storing your data you are also conserving the amount of memory
you will need when reading your data.
Using Data Types to Minimize Disk Space Usage
When selecting a data type for a particular column you need
to make sure you select the right data type. It is easy to pick the wrong data
type and waste some disk space. Therefore, you need to be careful and make
sure you select a data type that meets your data requirements, and also
minimizes the amount of disk space required to store each data column. I’ll
review different data types and discuss space consideration for each.
First, let me talk about Unicode date types. Unicode data
types are NVARCHAR, NCHAR, and NTEXT. Unicode data types require 2 bytes to
store every character. Whereas non-Unicode date types like VARCHAR, CHAR, and
TEXT only take one byte to store each character. The non-Unicode data types
can only store 256 different characters. With Unicode data types, you can
store up to 65,536 different 2 byte patterns. Because of the limitation on the
number of unique characters that can be stored using non-Unicode data types,
the hexadecimal representation for a particular character is not the same
across different code pages. When you use Unicode data types, the character
representation for commonly used characters are the same across code pages.
Unicode data types are typically used for international applications. If your
application does not need to be supported internationally then you should
consider just using the VARCHAR, CHAR, and TEXT data type, provided the
characters your application uses can be represented using the 1 byte – 256
character set. By using the non-Unicode data types, you will use half the disk
space for each character-based column. If you store lots of character data
then your disk space consumption savings using non-Unicode characters could be
substantial. I converted one of our databases from using all Unicode data
types to using all non-Unicode data types and realized a 40% reduction in disk
space consumption. This kind of disk space savings can provide drastic
performance gains over using Unicode data types. This performance improvement
is made by maximizing the number of records that can be stored in a single SQL
Server page.
Something else to consider when storing character data is
your use of CHAR and VARCHAR data types. The CHAR data type is a fixed
formatted data type. So when you define a column as CHAR(50) it always stores
50 bytes of data. So if you only populate a CHAR(50) column with the value
“ABC”, SQL Server stores “ABC” followed by 47 spaces. On the other hand, a
VARCHAR column is variable length column. So when you create a column as VARCHAR(50)
and populate it with a value of “ABC” only 5 bytes are stored, 2 byte for the
length of the data, plus 3 bytes for the value “ABC”. If you have sparsely
populated character columns, then storing those columns as VARCHAR will help
reduce the disk space usage of those columns.
Now let’s consider how integer data types are stored. There
are 4 different integer data types: TINYINT, SMALLINT, INT, and BIGINT. Each
one of these data types requires a different number of bytes to store their
value. TINYINT takes 1 byte and supports values from 0 to 255. SMALLINT
takes 2 bytes and supports values from –32,768 to 32,767.
An INT date type takes 4 bytes and can handle values from -2,147,483,648 to
2,147,483,647. The BIGINT data type takes 8 bytes and can store values from
-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
When defining a column to store an integer value you need to
consider all the possible different values that your application might
require. If your column will represent a small set of 10 or so different
values from 0-9, then you should use a TINYINT column. If you used one of the other
integer data types, you would just be wasting storage space.
It is common practice to have an ID column in a table
defined as an identity column. When creating your identity columns consider
the maximum number of rows you expect in your table when determining the data
type. If you are not going to have more than 10,000 rows then you should
select a SMALLINT data type over an INT for your identity column to save
space. Commonly people define all identity columns as INT data types even when
the table is expected to have only a few rows. Make sure when you are
declaring the data type for an identity column you select to appropriate
integer data type.
Another data type to be considered for saving space is the
BIT data type. The BIT data type is for storing 1 or 0 column value. The BIT
data type is ideal for storing TRUE/FALSE conditions. Since the smallest unit
of storage is a byte, a single byte of storage can store up to 8 different BIT
columns. If you have TRUE/FALSE or 0/1 type conditions you want to remember to
use the BIT data type instead of one of the other integer data types. If you
only have a single BIT data type column in your table, then defining it as a
BIT, or TINYINT uses the same amount of disk storage space. This is because to
store a single BIT data type it requires 1 byte. Your space saving over
TINYINT comes into play when you have multiple BIT data type columns in a
single table.
The DATETIME data type is another commonly misused data type that wastes space in your database. There are two different date data types: SMALLDATETIME, and DATETIME. Like the other data types, these two different date types require a different number of bytes to store their value. The SMALLDATETIME data type requires 4 bytes of storage, and stores dates from January 1, 1900 through June 6, 2079 and the time portion is only accurate to the minutes. The DATETIME data type takes 8 bytes to store its value and supports dates from January 1, 1753 through December 31, 9999 and the time portion is accurate to 3 milliseconds. If your application only needs to store the current date or some future date that does not to go beyond the year 2079, and you don’t need a time portion then the SMALLDATETIME column data type is the correct data type to use. Most applications that deal with dates can save space by using the SMALLDATETIME data type, provided the application doesn’t need a time precision less than one minute. Anytime your application needs to store time down to the second, or have dates prior to 1900, or beyond 2079 then DATETIME data type must be used.
Sometimes people use the UNIQUEIDENTIFER data type for ID
columns within a table. This data type produces a GUID and takes 16 bytes to
store it. It is ideal when you want to create a unique value for a column that
needs to be unique regardless of which column is being defined across all
columns, tables, or SQL Server machines. One common reason for using this
data type is when you are doing replication. A UNIQUEIDENTIFIER column would be
unique across both the publisher and subscriber databases. I only mention
this data type here because if you are using a UNIQUEIDENTIFIER data type in a
non-replicated environment then you are wasting space. In a non-replicated
environment to save disk space you should use one of the integer data types
instead of a UNIQUEIDENTIFIER.
Some applications use MONEY and SMALLMONEY data types to store
currency amounts. Just like the other data types MONEY and SMALLMONEY take
different amounts of space to store their data values. SMALLMONEY only takes 4
bytes and supports values from -214,748.3538 to 214,748.3647, whereas the MONEY
data type takes 8 bytes and stores values from -922,337,203,685,477.5808 to
922,337,203,685,477.5807. If your application is storing monetary transactions
that will never exceed a little over 200,000 then you will save 4 bytes for
each transaction amount by using the SMALLMONEY data type.
Binary data can be stored using two different data types:
BINARY and VARBINARY. These two data types function just like CHAR and
VARCHAR. Meaning if you use a BINARY data type it is of fixed length whereas a
VARBINARY is a variable length data type. Just like the VARCHAR data type the
VARBINARY data type stores a 2 byte length along with the data, so SQL Server
can determine the actual length of the VARBINARY data. If you plan to have
variable length binary strings then the VARBINARY data type will conserve on
disk space usage.
Conclusion
Using an incorrect data type that requires SQL Server to use
additional disk space to store a given value not only wastes disk space, but
also causes performance issues. You should always pick a data type that uses
the least amount of disk space in order to implement all the possible
combinations of a given column. Next time you are contemplating which data
type should be used for a particular column remember the space and performance
issues that might occur by choosing the wrong data type.