Getting Started with Hashing in SQL Server

Introduction

In my most recent articles, I’ve talked about encryption in detail and demonstrated its usage at the entire database level with Transparent Data Encryption and at the column level with granular\cell level encryption. In this article, I am going to discuss hashing in SQL Server and how it is different from encryption.

Encryption vs. Hashing

Encryption brings data into a state which cannot be interpreted by anyone who does not have access to the decryption key, password, or certificates. Though encryption does not restrict the access to the data, it ensures if data loss happens, then in that case data is useless for the person who does not have access to the decryption key\password\certificates. On the other hand, Hashing brings a string of characters of arbitrary size into a usually shorter fixed-length value or key that represents the original string and acts as a shortened reference to the original data. A slight change in the input string of characters produces a completely different hashed output.

To meet the demands of regulatory compliance and corporate data security standards, SQL Server allows you to enable encryption at the column\cell level or on the entire database level whereas hashing can be used for several purposes for example:

  • Identifying incremental data or changed data – Hash values generated for an entire row (by concatenating values of all the columns of the row and then generating hash key on it) are useful for efficiently searching for differences between rows in tables and identifying if the rows have changed, in case there is no mechanism in place to identify incremental data on the source table.
  • Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value.
  • If you have a composite key on several lengthy columns, you can concatenate them and generate hash key on the concatenated value and then you can use this hash key as a joining key.

Encryption is bidirectional, which means data encrypted can be decrypted back to the original string if you have access to the correct decryption key, whereas hashing is unidirectional, which means hashed data cannot be reversed back to the original string.

Getting Started with Hashing

SQL Server has the HASHBYTES inbuilt function to hash the string of characters using different hashing algorithms. The supported algorithms are MD2, MD4, MD5, SHA, SHA1, or SHA2. The hashed data conforms to the algorithm standard in terms of storage size i.e. 128 bits (16 bytes) for MD2, MD4, and MD5; 160 bits (20 bytes) for SHA and SHA1; 256 bits (32 bytes) for SHA2_256, and 512 bits (64 bytes) for SHA2_512. SHA2_256 and SHA2_512 algorithms are available in SQL Server 2012 and later versions.

The stronger hash function you use, the more storage space it takes, and performance is slower but it provides a stronger hash value with minimal chance of hash collision (generating the same hashed output for two different input string of characters). Hence, it’s recommended to use hashing algorithms depending on your workload and data to hash by making it an appropriate trade-off.

The example below, demonstrates the use of the HASHBYTES function to do hashing using MD5 algorithm. As mentioned before, a slight change in the input string of characters produces a completely different hashed output and this is what you could see in the second column. The only difference between input for the first column and input for the second column is an extra space at the end of the input string in the second input string:

SELECT HASHBYTES('MD5', 'Sample String to hash')   AS Col1, HASHBYTES('MD5', 'Sample String to hash ')   AS Col2
GO

HASHBYTES function example

No matter how many times you do the hashing, the hashed output will remain same for the same set of input strings and same hashing algorithm:

SELECT HASHBYTES('MD5', 'Sample String to hash')   AS Col1, HASHBYTES('MD5', 'Sample String to hash ')   AS Col2
GO
SELECT HASHBYTES('MD5', 'Sample String to hash')   AS Col1, HASHBYTES('MD5', 'Sample String to hash ')   AS Col2
GO
SELECT HASHBYTES('MD5', 'Sample String to hash')   AS Col1, HASHBYTES('MD5', 'Sample String to hash ')   AS Col2
GO

HASHBYTES function example

Based on the hashing algorithm you choose, your hashed output will change both in terms of value and size. For example, as demonstrated below, hashed output with MD5 algorithm produces a 16 bytes long value whereas SHA1 algorithm produces a 20 bytes long value:

SELECT HASHBYTES('MD5', 'Sample String to hash')   AS Col1, HASHBYTES('MD5', 'Sample String to hash ')   AS Col2
GO
SELECT HASHBYTES('SHA1', 'Sample String to hash')   AS Col1, HASHBYTES('SHA1', 'Sample String to hash ')   AS Col2
GO

HASHBYTES function example

You can use the HASHBYTES function in line with a SELECT statement when querying data from tables. It’s up to your requirement to hash a value from one column or multiple columns. For example, as you can see in the below query I am hashing combined values (by combining ProductNumber and Name) when querying data from the table.

SELECT HASHBYTES('MD5', ProductNumber   + ' : ' + Name) HashedValue, ProductNumber + ' : '  + Name  AS ActualValue 
FROM   [Production].[Product]

HASHBYTES function example

Limitation of the HASHBYTES Function and how to Overcome It

The maximum length of input values to be hashed is limited to 8000 bytes – the HASHBYTES function gives an error, “String or binary data would be truncated” when the length of the input string is more than 8000 bytes. To overcome this limitation, I have written this user defined function, which overcomes the HASHBYTES function by splitting chunks of 8000 bytes of a large string, hashing each chunk individually and then combining each hash key together and finally generating a hash key for the concatenated hash keys. This workaround seems to be working fine in my case, but I would highly recommend you evaluate it in your scenario for your suitability and before putting it in use.

CREATE FUNCTION [dbo].[fnHashBytes] (@EncryptAlgorithm   NVARCHAR(4), @DataToEncrypt VARBINARY(MAX))
RETURNS   VARBINARY(MAX)
-- Author          : Arshad Ali
-- Create date     : 06/18/2014
-- Description     : It calculate MD5 for Large Objects by   splitting chunks of 8000 bytes to overcome HASHBYTES function limitation.
-- Parameters      :
--                                    1. EncryptAlgorithm   (NVARCHAR(4))     : It can be either MD2 or MD4 or MD5 or SHA or SHA1.
--                                    2.   DataToEncrypt(VARBINARY(MAX))      : Plain text to encrypt, length of it will   be determined by the varbinary(max) data type size.
--                                    3. Return   Value(VARBINARY(MAX))                 : If the encryption succeeded, it will   return the encrypted data or else will return NULL.
-- History      :  
-- ModifiedBy           ModifiedOn                        Remarks    
--
AS
BEGIN
        DECLARE @Index INTEGER
        DECLARE @DataToEncryptLength INTEGER
        DECLARE @EncryptedResult VARBINARY(MAX)
 
          IF @DataToEncrypt IS   NOT NULL
          BEGIN
                   SET @Index = 1
                   SET @DataToEncryptLength =   DATALENGTH(@DataToEncrypt)
                   WHILE @Index <=   @DataToEncryptLength
                   BEGIN
                             IF(@EncryptedResult   IS NULL )
                                          SET @EncryptedResult =   HASHBYTES(@EncryptAlgorithm, SUBSTRING(@DataToEncrypt,   @Index, 8000))
                             ELSE
                                          SET @EncryptedResult =   @EncryptedResult + HASHBYTES(@EncryptAlgorithm,   SUBSTRING(@DataToEncrypt, @Index, 8000))
                       
                             SET @Index = @Index   + 8000
                   END 
                   SET @EncryptedResult =   HASHBYTES(@EncryptAlgorithm, @EncryptedResult)
          END
          RETURN @EncryptedResult
END

 Once the user defined function is created as above, you can use it similar to the way shown below:

--Usage Scenario for MD5 Hashing
UPDATE   <table_name>
SET   MD5Hash = CAST(dbo.fnHashData('MD5', 
CAST (ISNULL(<column_name_1>,'') AS VARBINARY(MAX))+ 
CAST (ISNULL(<column_name_2>,'') AS VARBINARY(MAX))+ 
CAST (ISNULL(<column_name_3>,'') AS VARBINARY(MAX))+ 
..............
..............
..............
CAST (ISNULL(<column_name_n>,'') AS VARBINARY(MAX)))
AS BINARY(16))
 
--Usage Scenario for SHA1 Hashing
UPDATE   <table_name>
SET   MD5Hash = CAST(dbo.fnHashData('SHA1', 
CAST (ISNULL(<column_name_1>,'') AS VARBINARY(MAX))+ 
CAST (ISNULL(<column_name_2>,'') AS VARBINARY(MAX))+ 
CAST (ISNULL(<column_name_3>,'') AS VARBINARY(MAX))+ 
..............
..............
..............
CAST (ISNULL(<column_name_n>,'') AS VARBINARY(MAX)))
AS BINARY(16))

Conclusion

In this article, I discussed hashing in SQL Server and how to use it. I also talked about how hashing is different from encryption along with some use cases where hashing would be a good approach to take.

Resources

HASHBYTES (Transact-SQL)

See all articles by Arshad Ali

Arshad Ali
Arshad Ali
Arshad Ali works with Microsoft India R&D Pvt Ltd. He has 8+ years of experience, mostly on Microsoft Technologies. Most recently, as a SQL Developer and BI Developer he has been working on a Data Warehousing project. Arshad is an MCSD, MCITP: Business Intelligence, MCITP: Database Developer 2008 and MCITP: Database Administrator 2008 certified and has presented at several technical events including SQL-School. On an educational front, he has an MCA (Master in Computer Applications) and MBA-IT. Disclaimer : I work for Microsoft and help people and businesses make better use of technology to realize their full potential. The opinions mentioned herein are solely mine and do not reflect those of my current employer or previous employers.

Get the Free Newsletter!

Subscribe to Cloud Insider for top news, trends & analysis

Latest Articles