Disk Space Usage and SQL Server Performance
December 26, 2007
When you think of SQL Server performance components, you think of the usage of CPU, memory, and the I/O it takes to process a query. However, there is another component you should consider. That component is disk space usage. In the old days disk space was expensive, so much so people spent great effort to conserver disk space. Remember the Y2K issues with dates being stored as a 6 character field. The ancient computer programmers left off the first two digits of the year to save 2 bytes of space when storing each date. Today disk drives are inexpensive, so we dont spend great deals of time thinking of ways to optimize our database design to minimize disk space usage. Nevertheless, every extra byte of space you waste in your database causes a performance hit to your application. This article looks at disk space usage and how it affects performance.
Performance Impacts of Disk Space Usage
To discuss performance impacts of disk space usage we need to review what it takes to retrieve and write data to disk drives. Each time you need to read a piece of data from SQL Server you need to retrieve the information from disk. This retrieval causes a disk I/O. Data in SQL Server is stored in a number of different physical pages. Each single page is 8060 bytes long. For every page of data, SQL Server requires one I/O to retrieve that data.
To better understand how I/O can impact performance lets consider how many I/Os it would take to retrieve 10,000,000 records from a SQL Server table. Say each record is 300 bytes long. This means you can store 26 different records per page. The entire 10,000,000 records would require 384,616 data pages just to store the raw data; this doesnt take into account space for indexes. So to read every record in my large 10 million record table it would take 384,616 I/O.
Now say I saved just 2 bytes of data from every record, making each record 298 bytes long. This would mean you could store 27 records per SQL Server page. With that 2-byte savings, you now could retrieve 1 more record with each I/O operation. The total I/O savings if you read the entire 10,000,000 record table would be 14,245 I/Os. This is a lot of I/O you would save for only 2 bytes of space saved per record.
So, each time you can save a few bytes of data from each record stored in a SQL Server table you improve your performance. The larger the table the bigger the performance gains you will see. Therefore, you want to try to minimize your record size so you can maximize the number of records that can be stored in each data page.
Now I/O is not your only savings when you minimize the space it takes to store your data. Keep in mind each page that is read first needs to be stored in the buffer pool. So the smaller the record sizes the more records you can fit into a single buffer pool page. Therefore, by conserving disk space for storing your data you are also conserving the amount of memory you will need when reading your data.
Using Data Types to Minimize Disk Space Usage
When selecting a data type for a particular column you need to make sure you select the right data type. It is easy to pick the wrong data type and waste some disk space. Therefore, you need to be careful and make sure you select a data type that meets your data requirements, and also minimizes the amount of disk space required to store each data column. Ill review different data types and discuss space consideration for each.
First, let me talk about Unicode date types. Unicode data types are NVARCHAR, NCHAR, and NTEXT. Unicode data types require 2 bytes to store every character. Whereas non-Unicode date types like VARCHAR, CHAR, and TEXT only take one byte to store each character. The non-Unicode data types can only store 256 different characters. With Unicode data types, you can store up to 65,536 different 2 byte patterns. Because of the limitation on the number of unique characters that can be stored using non-Unicode data types, the hexadecimal representation for a particular character is not the same across different code pages. When you use Unicode data types, the character representation for commonly used characters are the same across code pages. Unicode data types are typically used for international applications. If your application does not need to be supported internationally then you should consider just using the VARCHAR, CHAR, and TEXT data type, provided the characters your application uses can be represented using the 1 byte 256 character set. By using the non-Unicode data types, you will use half the disk space for each character-based column. If you store lots of character data then your disk space consumption savings using non-Unicode characters could be substantial. I converted one of our databases from using all Unicode data types to using all non-Unicode data types and realized a 40% reduction in disk space consumption. This kind of disk space savings can provide drastic performance gains over using Unicode data types. This performance improvement is made by maximizing the number of records that can be stored in a single SQL Server page.
Something else to consider when storing character data is your use of CHAR and VARCHAR data types. The CHAR data type is a fixed formatted data type. So when you define a column as CHAR(50) it always stores 50 bytes of data. So if you only populate a CHAR(50) column with the value ABC, SQL Server stores ABC followed by 47 spaces. On the other hand, a VARCHAR column is variable length column. So when you create a column as VARCHAR(50) and populate it with a value of ABC only 5 bytes are stored, 2 byte for the length of the data, plus 3 bytes for the value ABC. If you have sparsely populated character columns, then storing those columns as VARCHAR will help reduce the disk space usage of those columns.
Now lets consider how integer data types are stored. There are 4 different integer data types: TINYINT, SMALLINT, INT, and BIGINT. Each one of these data types requires a different number of bytes to store their value. TINYINT takes 1 byte and supports values from 0 to 255. SMALLINT takes 2 bytes and supports values from 32,768 to 32,767. An INT date type takes 4 bytes and can handle values from -2,147,483,648 to 2,147,483,647. The BIGINT data type takes 8 bytes and can store values from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
When defining a column to store an integer value you need to consider all the possible different values that your application might require. If your column will represent a small set of 10 or so different values from 0-9, then you should use a TINYINT column. If you used one of the other integer data types, you would just be wasting storage space.
It is common practice to have an ID column in a table defined as an identity column. When creating your identity columns consider the maximum number of rows you expect in your table when determining the data type. If you are not going to have more than 10,000 rows then you should select a SMALLINT data type over an INT for your identity column to save space. Commonly people define all identity columns as INT data types even when the table is expected to have only a few rows. Make sure when you are declaring the data type for an identity column you select to appropriate integer data type.
Another data type to be considered for saving space is the BIT data type. The BIT data type is for storing 1 or 0 column value. The BIT data type is ideal for storing TRUE/FALSE conditions. Since the smallest unit of storage is a byte, a single byte of storage can store up to 8 different BIT columns. If you have TRUE/FALSE or 0/1 type conditions you want to remember to use the BIT data type instead of one of the other integer data types. If you only have a single BIT data type column in your table, then defining it as a BIT, or TINYINT uses the same amount of disk storage space. This is because to store a single BIT data type it requires 1 byte. Your space saving over TINYINT comes into play when you have multiple BIT data type columns in a single table.
The DATETIME data type is another commonly misused data type that wastes space in your database. There are two different date data types: SMALLDATETIME, and DATETIME. Like the other data types, these two different date types require a different number of bytes to store their value. The SMALLDATETIME data type requires 4 bytes of storage, and stores dates from January 1, 1900 through June 6, 2079 and the time portion is only accurate to the minutes. The DATETIME data type takes 8 bytes to store its value and supports dates from January 1, 1753 through December 31, 9999 and the time portion is accurate to 3 milliseconds. If your application only needs to store the current date or some future date that does not to go beyond the year 2079, and you dont need a time portion then the SMALLDATETIME column data type is the correct data type to use. Most applications that deal with dates can save space by using the SMALLDATETIME data type, provided the application doesnt need a time precision less than one minute. Anytime your application needs to store time down to the second, or have dates prior to 1900, or beyond 2079 then DATETIME data type must be used.
Sometimes people use the UNIQUEIDENTIFER data type for ID columns within a table. This data type produces a GUID and takes 16 bytes to store it. It is ideal when you want to create a unique value for a column that needs to be unique regardless of which column is being defined across all columns, tables, or SQL Server machines. One common reason for using this data type is when you are doing replication. A UNIQUEIDENTIFIER column would be unique across both the publisher and subscriber databases. I only mention this data type here because if you are using a UNIQUEIDENTIFIER data type in a non-replicated environment then you are wasting space. In a non-replicated environment to save disk space you should use one of the integer data types instead of a UNIQUEIDENTIFIER.
Some applications use MONEY and SMALLMONEY data types to store currency amounts. Just like the other data types MONEY and SMALLMONEY take different amounts of space to store their data values. SMALLMONEY only takes 4 bytes and supports values from -214,748.3538 to 214,748.3647, whereas the MONEY data type takes 8 bytes and stores values from -922,337,203,685,477.5808 to 922,337,203,685,477.5807. If your application is storing monetary transactions that will never exceed a little over 200,000 then you will save 4 bytes for each transaction amount by using the SMALLMONEY data type.
Binary data can be stored using two different data types: BINARY and VARBINARY. These two data types function just like CHAR and VARCHAR. Meaning if you use a BINARY data type it is of fixed length whereas a VARBINARY is a variable length data type. Just like the VARCHAR data type the VARBINARY data type stores a 2 byte length along with the data, so SQL Server can determine the actual length of the VARBINARY data. If you plan to have variable length binary strings then the VARBINARY data type will conserve on disk space usage.
Using an incorrect data type that requires SQL Server to use additional disk space to store a given value not only wastes disk space, but also causes performance issues. You should always pick a data type that uses the least amount of disk space in order to implement all the possible combinations of a given column. Next time you are contemplating which data type should be used for a particular column remember the space and performance issues that might occur by choosing the wrong data type.