Data Masking in the World of GDPR

Your production database environment stores personally identifiable information in an encrypted form, limits access to these fields and masks them when displayed. However, some types of data encryption create potential performance problems as well as security issues. Your data may be at risk unless you mask it effectively.

The State of Database Security

The European Union’s General Data Protection Regulation (GDPR) became effective worldwide in May 2018. It applies to any company that obtains data from any customer in the European Union and provides for extremely harsh penalties for companies that breach this new law. To comply with it, companies throughout the world increased their data security awareness, appointed data protection officers and updated their privacy policies. Because GDPR specifically prohibits companies from using sensitive personal data to make business decisions, IT must identify these elements in databases and ensure that decisions such as credit checks or customer purchases are not based on these values.

Most DBAs have ensured that any such personal data retained on disk storage is stored in files that are highly secured. Unfortunately, this is not enough. In particular, non-production environments are sometimes populated with production data, and these environments do not typically implement full production-level security. In addition, developers might sometimes have “public” access that allows them to see the values of all data elements in the database.

Out of this concern arose a key data security process called data masking, which involves obscuring, anonymizing or otherwise encrypting data elements in-place. While this adds a level of security it also has performance implications, especially in a database environment. For data in simple flat files it is usually feasible to partly or fully mask data items without issue. However, in a database environment such data changes may cause issues.

All database management systems contain features that support various categories of data integrity:

  • Entity integrity, including uniqueness and enforcing non-null values;
  • Domain integrity, which forbids certain invalid data values such as a date of 99/99/9999;
  • Referential integrity that defines primary and foreign keys and enforces data value relationships between tables.

In order to support these integrity rules, the DBMS includes various features such as unique and non-unique indexes, strict data typing for data elements (such as INTEGER and DATE), as well as various data column constraints. At times these features are at odds with the need to secure individual data items. How, for example, can you enforce domain integrity on a column if you encrypt the data?

Database Data Masking Concepts

There are multiple ways to mask data in order to be compliant with the GDPR. Each has advantages and disadvantages in performance and ease of implementation.

Data obscuring. Using the data obscuring method, the DBA and application developers work together to ensure that data elements are not stored or displayed in their entirety; rather, portions of the data elements are overlaid with masking characters. For example, and account number might be stored as XXXX-XXXX-XXXX-1234, where the X character overlays the initial digits of the number.

Data obscuring is easy for the application to implement when displaying or reporting such an element; however, there are problems with storing the number in this form. Multiple different account numbers that end in the same four digits will now seem equal, leading to duplication. Lookups on this number will return multiple records. It will also be impossible for the database to enforce uniqueness on this account number, and any data validation such as computation of check digits will not work. Problems such as these make data obscuring a poor choice for database storage.

Data anonymization. Using the data anonymization masking method, data elements are altered using a function that hashes or encrypts them prior to database storage. The encryption method must be reversible in order to retrieve the correct records. It is possible to implement anonymizing algorithms that preserve the uniqueness of the data element; however, it is extremely resource-intensive for data access that retrieves multiple elements and must sort them. For example, consider displaying all records for customers with Last Name of SMITH, sorted by First Name. The application must retrieve all records, decrypt the last name field, temporarily store all those with SMITH as Last Name, decrypt all the First Name values, sort the result, then finally display them in order.

This problem extends to the implementation of database indexes. Many times the DBA will create an index on a data element in order to increase SQL Select performance. However, if the data element is stored in encrypted form, such indexes may not provide any advantage.

Another issue with anonymization involves the relationships between data elements. Consider the elements in U.S. addresses of City, State and Zip-Code. These data elements are valid as a group because certain cities will not exist in some states, and every state occurs only with certain zip codes. If any of these elements is anonymized, the combined address will be invalid.

Anonymization provides a measure of security but results in too many performance and data validation disadvantages to be used on its own.

Masking tables. Using the masking tables method, IT implements a set of tables that contain valid-looking but otherwise harmless data element values. The masking algorithm first uses a hashing technique to convert the original data to an integer hash value and then uses the result as the key for lookup in the target table. For example, a Last Name of SMITH might hash to the value 123. The algorithm then references the 123rd row of the Name Masking Table and uses the name found therein.

The advantages of masking tables are many. These tables can be populated with unique values, thus allowing the results of the masking to be unique. If the masking and algorithm are properly secured, the final data cannot be decrypted by hackers. Finally, combinations of related fields (such as City, State and Zip-Code) can be hashed as a group, and the result used to retrieve a set of values in one row of a City-State-Zip masking table, thus preserving the validity of the data relationship.

Special Data Elements

Some data elements cannot be easily encrypted or masked due to their nature. Some of these are the following.

Bank account numbers. These data elements are typically stored in table columns with data types that make it tricky to validate. For many banks, not all account number combinations are valid. This is particularly true after a merger or acquisition, where there are two sets of valid account numbers, perhaps of differing lengths. Account numbers sometimes have a check digit, and some account numbers are associated with bank routing numbers. Masking these numbers will require a creative masking table approach, perhaps involving converting the combination of routing number and account number to results acceptable to applications.

E-mail addresses. Masking table entries for e-mail addresses should contain valid-looking addresses. However, it is possible that some applications may actually test validity by attempting to send an e-mail to the address. To prevent such issues, use the domain “example” in your masking table, as example.com, example.net, example.org and example.edu are all second-level domain names reserved for documentation purposes by the Internet Corporation for Assigned Names and Numbers (ICANN).

Social security numbers. There are only a few applications that test the validity of these numbers. Although the numbers are no longer tied to specific geographic locations, there are some digit combinations that are invalid. For example, in the three-group format “XXX-XX-XXXX” no single group can be all zeroes. Rather than implement as a masking table, it is faster to simply hash these numbers to a single nine-digit number and then check for validity.

Credit card numbers. These numbers contain substrings of digits that have several internal meanings. The first few digits (typically six) correspond to the issuer of the card, also called the Issuer Identification Number or IIN, and the last digit is typically a check digit calculated with the Luhn algorithm. The easiest masking method is to hash the number to a sixteen-digit result (or whatever length is required), replace the IIN with a valid value, then calculate the check digit.

Dates. Due to the way that applications handle dates, masking them is not easy. Certain dates such as BirthDate may require specialized encryption, as simply replacing them with a masked date will throw off calculations of Age, Years of Vesting, or Years Until Retirement. If masking is required, one simple way is to hash the entire date to a number in the range of one to 365 and then add the result to the date 01/01/nnnn, where “nnnn” is the original year.

Unstructured data. This includes text fields such as Remarks or Comments, where a user may deliberately or inadvertently enter a personal data element such as BirthDate. It also includes relatively new data element types such as still images (e.g. GIF, JPG or PDF documents) and data encoded as extensible markup language (XML). It is also possible, usually in a poorly-designed system, to have file names that contain this information (e.g. “Server01 / Customers / Smith / 2001-01-01”). For these data types that may contain hidden personal data you must combine good application and database design and constant monitoring to prevent security breaches.

Implementing Data Masking

In a database environment it is important to implement encryption and masking tables efficiently. The DBA should create masking tables as database tables; however, there are several options for the masking code. Some of these are:

Column constraints exist as part of the table column definition and can be used for simple data validation; however, constraints are limited to simple logic and any changes require a database change and corresponding outages for all applications.

In-application code is possible for columns that appear in a single table and are used by only a single application. However, any expansion of the columns across tables, changes to the column data type or length, or changes to the masking logic will require an application coding change, testing, and perhaps re-masking of the entire table, as well as an application outage.

Stored procedures are a possible choice for masking logic. Disadvantages include a CPU and elapsed time penalty for invoking them, as well as a small amount of added logic in the application for invocation and return code checking.

User-defined functions (UDFs) are the best method of implementing data masking logic. They provide a simple application coding interface combined with the ability to code them in the SQL/PL language. It is also possible that UDFs that are called through the distributed data facility (DDF) may be “zIIP-eligible”; that is, they may be able to use the System z Integrated Information Processor (zIIP) CPUs engines. These special-purpose CPUs can be installed on your z hardware and have the advantage of their work not being counted in most software license charges, which are commonly based on CPU usage.

Masking Next Steps

It is not enough to implement a set of masking tables and algorithms once and then forget about it. Several factors will require that you revisit your masking strategy from time to time. These factors include table column data type or length changes, migration of columns to additional tables and additions to column values.

Another issue is the status of your non-production environments. How do you create test data?

There are several ways to populate test database environments, including the following:

Masked test environment. Copy a set of data from production to a pre-production environment.  This requires enough data storage in pre-production to store an (almost) complete set of production data. Whenever test data is required, extract the needed data from pre-production and execute masking as you store the data in test. This requires CPU resources whenever you extract and mask your test data.

Masked pre-production. Copy a set of data from production to a pre-production environment and then execute masking there. You may now create test data by selecting what you need from pre-production. This requires executing masking only once.

Summary

GDPR requires that IT departments pay closer attention to how they process and store personal data. The way that you implement one or more forms of data encryption or masking can have an effect on database and application performance. Masking tables provide a convenient way to pre-store data elements that meet validation criteria but obscure personal information. Implementing masking logic with user-defined functions can avoid excessive general-purpose CPU usage and gives DBAs and application developers a powerful tool to execute masking effectively.

See all articles by Lockwood Lyon

Lockwood Lyon
Lockwood Lyon
Lockwood Lyon is a systems and database performance specialist. He has more than 20 years of experience in IT as a database administrator, systems analyst, manager, and consultant. Most recently, he has spent time on DB2 subsystem installation and performance tuning. He is also the author of The MIS Manager's Guide to Performance Appraisal (McGraw-Hill, 1993).

Get the Free Newsletter!

Subscribe to Cloud Insider for top news, trends & analysis

Latest Articles