GDPR for the DBA

The General Data Protection Regulation (GDPR) took effect in the European Union on May 25, 2018, and it reaches organizations anywhere in the world that handle the personal data of people in the EU. In response, companies throughout the world increased their data security awareness, appointed data protection officers and updated their privacy policies.

IT support staff responded with updated data dictionaries, flagging of personal data, encryption at various points (local and cloud storage, network traffic, etc.) and heightened security procedures. However, more work is needed. In this article we focus on what the DBA must do in the near term in order to anticipate and prevent performance and capacity issues.

The Basics
Notification of Rights
Concerns for the DBA
Data Protection Everywhere
Interface with Data Modelers
Data in the Cloud
Reviewing Automated Processes
Summary

The Basics

GDPR applies to any company that obtains data from any customer in the European Union, and provides for extremely harsh penalties for companies that breach this new law: fines can reach 20 million euros or 4% of worldwide annual turnover, whichever is higher. It affects how data is modelled and how metadata is kept and made available to applications. In addition, the use of encryption, key masking and other data security processes has increased resource usage.

One of the most interesting requirements involves stratifying data items into “personal data” and “sensitive personal data”. The first refers to items that can be used to identify you or your presence such as names, telephone numbers and IP addresses. Sensitive personal data, which GDPR calls “special categories” of personal data, includes religious beliefs, political opinions and sexual orientation; GDPR prohibits processing it except under narrow conditions such as explicit consent, so organizations should generally not use it to make business decisions. Therefore, IT must be able to identify data elements that fall into these categories and ensure that business decisions such as credit checking or approving customer purchases are not based on their values.

This leads to the first rule for IT for successful support of GDPR: an enterprise data dictionary with metadata for all data elements is a necessity. Further, all current applications may need to be reviewed to confirm that they are handling these data elements correctly. Last, data modelers must develop the appropriate procedures to be used by application developers when storing or retrieving data.
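As an illustration of how a data dictionary can support this rule, here is a minimal Python sketch. The table names, column names and the disallowed_inputs helper are hypothetical, and a real dictionary would live in a metadata repository rather than in code. The idea is simply that every data element carries a classification, and that a proposed decision process can be checked against it.

```python
from dataclasses import dataclass
from enum import Enum


class Classification(Enum):
    NON_PERSONAL = "non-personal"
    PERSONAL = "personal"
    SENSITIVE = "sensitive personal"


@dataclass(frozen=True)
class DataElement:
    table: str
    column: str
    classification: Classification


# Hypothetical dictionary entries, keyed by (table, column).
DATA_DICTIONARY = {
    (e.table, e.column): e
    for e in [
        DataElement("customer", "customer_id", Classification.PERSONAL),
        DataElement("customer", "last_name", Classification.PERSONAL),
        DataElement("customer", "religion", Classification.SENSITIVE),
        DataElement("orders", "order_total", Classification.NON_PERSONAL),
    ]
}


def disallowed_inputs(columns):
    """Return the (table, column) pairs a business decision should not use.

    A pair is flagged if it is classified as sensitive personal data, or if
    it is missing from the dictionary (an unclassified element cannot be
    assumed safe).
    """
    flagged = []
    for key in columns:
        element = DATA_DICTIONARY.get(key)
        if element is None or element.classification is Classification.SENSITIVE:
            flagged.append(key)
    return flagged
```

A credit-check routine that reads order_total, religion and an undocumented column would have the last two flagged for review. Treating unknown columns as disallowed is a design choice that keeps the dictionary honest.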

Notification of Rights

Another part of GDPR concerns notifying customers of their rights regarding their personal data. For example, you may have a website used for online order entry or for customer inquiries. If you implement Google Analytics to track visitors you must ensure that you set the “anonymize IP” flag. This ensures that Google stores the IP address with the last octet anonymized. The reason is that GDPR treats the IP address as personal data, which may not be used without users’ permission. The notification provision was so important (and the penalties for failure to notify quite significant) that many companies issued new privacy and cookie policies just prior to GDPR’s implementation.
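To make the effect of anonymizing the last octet concrete, the following Python sketch zeroes the host portion of an address using the standard ipaddress module. It illustrates the idea only and is not Google's implementation. The anonymize_ip function name is hypothetical, and the 48-bit prefix kept for IPv6 addresses is an assumption made for this example.

```python
import ipaddress


def anonymize_ip(address: str) -> str:
    """Zero the trailing bits of an IP address so it no longer identifies one host."""
    ip = ipaddress.ip_address(address)
    # IPv4: keep the first three octets. IPv6: keep the first 48 bits (assumed).
    prefix = 24 if ip.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)
```

With this sketch, 203.0.113.77 becomes 203.0.113.0, so every visitor in that block of 256 addresses is stored under the same value.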

Concerns for the DBA

Your enterprise should already have discussed the above issues and notified IT of any new policies or procedures. What, then, does the DBA need to be concerned with going forward?

Data Protection Everywhere

The two categories of personal data need to be protected, meaning restricted access and security measures against prying. Most DBAs will take this to mean encryption of the data on disk storage. Regrettably, this will not be sufficient. One potential problem involves test data; that is, consistent subsets of production data copied into the development system for use by application developers. Development environments typically do not have full-blown data security, nor are developers closely watched as to what data they may access.

Sometimes an effort is made to mask or anonymize key personal data items; however, some of these data elements may be used as keys or search terms, meaning that the DBA creates indexes on the columns to increase performance. If the data are masked or anonymized on disk storage, then performance indexes may not work as designed. For example, a customer ID of 123-456-789 may be masked and stored as 123-xxx-xxx, which will end up being the same value for some other customer IDs having the same first three digits.
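The collision problem described above can be shown in a few lines of Python. The sketch below compares the masking scheme from the example with one possible alternative, a keyed hash (HMAC), which is offered as an illustration rather than a recommendation. The function names and the secret key are hypothetical.

```python
import hashlib
import hmac


def mask_id(customer_id: str) -> str:
    """Mask as in the example above: keep the first group, replace later digits with 'x'."""
    head, _, tail = customer_id.partition("-")
    return head + "-" + "".join("x" if ch.isdigit() else ch for ch in tail)


def pseudonymize_id(customer_id: str, key: bytes) -> str:
    """Replace the ID with a keyed SHA-256 hash (hex), stable for a given key."""
    return hmac.new(key, customer_id.encode("utf-8"), hashlib.sha256).hexdigest()


SECRET_KEY = b"hypothetical-key-kept-outside-the-test-system"

ids = ["123-456-789", "123-000-111", "987-654-321"]
masked = [mask_id(i) for i in ids]
tokens = [pseudonymize_id(i, SECRET_KEY) for i in ids]
```

The first two IDs mask to the same value, so a unique index on the masked column would reject one of them and an equality search would return both customers. The keyed hash keeps the three values distinct, which preserves uniqueness and equality lookups. It does not preserve sort order, so range scans and ORDER BY on the column no longer behave as they did on the original values, and the key itself becomes something that must be protected.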

One possible mitigation for this performance issue may be coming soon from IBM, which has announced new hardware with an order-preserving data compression feature for indexes. This would give index data a minimal level of obfuscation (compression is not a substitute for true encryption) while preserving the order of index entries.

Data must also be protected in transit from storage to memory and thence to applications, some of which may be implemented in remote locations. Automated teller machines (ATMs) are a good example of this, where customer information is stored at a central site but is then sent to and retrieved from the remote hardware. End-to-end data security is a must in these situations. Similar issues surround data extracts from operational systems sent to the data warehouse, big data applications, and data marts. All of these files need protection as well.
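For client code that connects to a remote database or file transfer endpoint, protection in transit usually starts with a properly configured TLS context. The Python sketch below uses only the standard ssl module. The make_client_tls_context name is hypothetical, and the choice of TLS 1.2 as the minimum version is an assumption for illustration, not a GDPR requirement.

```python
import ssl


def make_client_tls_context() -> ssl.SSLContext:
    """Build a client-side TLS context that verifies the server it talks to."""
    # The default context requires a valid certificate and checks the host name.
    context = ssl.create_default_context()
    # Refuse protocol versions older than TLS 1.2 (assumed policy).
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

A context like this would then be passed to whatever driver or socket wrapper the application uses, so that an unverified or downgraded connection fails instead of silently carrying personal data in the clear.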

Finally, there is disaster recovery. Most enterprise disaster recovery involves a secondary site where data is sent and kept up-to-date in case of a major disaster. However, minor disasters occur as well. Consider a newly-implemented application that updates a customer master file incorrectly. Assuming that someone spots the error, the DBA must recover the master file based on database or file backups; hence, the backups themselves must be secured as well.

Interface with Data Modelers

As noted above, data modelers must ensure that all personal data items are identified in an enterprise data dictionary so that they can be used and protected properly. One byproduct of this will be ensuring that data modelers are involved in all application development and change decisions. This information will then be passed to the DBA, who must then incorporate correct protection of personal data into database design decisions. These typically involve decisions the DBA makes to ensure fast transaction performance, support of data integrity rules and purge or archive of stale data.

Performance issues. Performance-related tactics in database design usually boil down to creation of the appropriate indexes, data clustering and data partitioning. The DBA creates indexes on search fields in order to allow a fast data access path for queries that select specific values or value ranges, and to speed up sorting. Data clustering keeps rows with similar key values physically close together, which speeds up data retrieval for queries that access multiple rows (for example, a display of customers with last name of Smith). Finally, the DBA considers physically partitioning some tables by key value range to speed up disk storage access to multiple rows. Any of these tactics that incorporates table columns containing personal data must be reviewed to ensure that proper security is maintained.

Data integrity. There are times when business rules are implemented in the database definition rather than by the application. A requirement for a unique customer number may be enforced by the database (using a unique index) or by the application code. The same is true for domain integrity rules controlling valid column values. Some columns such as salary must be non-negative, date columns must contain valid dates (not 00/00/0000), and some columns may allow only specific values (such as Y or N). Again, these can be enforced in the database definition (e.g., using column constraints) or in the application code. If data integrity rules are enforced by the database definition, then the DBA must ensure that the GDPR rules for personal data are followed.
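The database-enforced variant of these rules might look like the following sketch, which uses Python's built-in sqlite3 module so that it can run anywhere. The employee table, its columns and the try_insert helper are hypothetical, and the constraint syntax for other database systems will differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employee (
        employee_no INTEGER NOT NULL UNIQUE,                          -- entity integrity
        salary      NUMERIC NOT NULL CHECK (salary >= 0),             -- non-negative
        active_flag TEXT    NOT NULL CHECK (active_flag IN ('Y', 'N'))  -- Y or N only
    )
    """
)


def try_insert(row):
    """Insert one row; return True if accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO employee VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False
```

Here the database rejects a duplicate employee number, a negative salary and a flag other than Y or N without any application code. One point for the GDPR review is that these rules operate on the stored values, so a column that is masked or encrypted before storage may no longer satisfy a constraint written for the original data.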

Stale data purge. When purging or archiving old data, the DBA should document the need for this during application development. The reason is that there are different database designs that can support such a purge, and they have different performance characteristics. One common method when data is to be purged by creation or expiration date is to physically partition the data by that date. Removing old data is then as simple as emptying the oldest partition, and this method is very fast. If data cannot be partitioned using the purge criteria, another common method is to create an index on the purge key. This index need not exist at all times: it is only necessary that it exist during the purge process, which uses the index to search for rows to purge. In either case, purged data may be subject to data retention rules or regulations. This means that the purged data must be saved, either in a queryable form or in a reloadable form, in case it is needed in the future. The purge file and these saved locations must be secured properly as well.
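The index-based purge described above can be sketched as follows, again with Python's sqlite3 module. The orders and orders_archive tables, the sample rows and the purge_expired function are hypothetical. The sketch creates the index only for the duration of the purge, copies the expiring rows to an archive table, and deletes them in the same transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY, customer_name TEXT, expires_on TEXT);
    CREATE TABLE orders_archive (
        order_id INTEGER PRIMARY KEY, customer_name TEXT, expires_on TEXT);
    INSERT INTO orders VALUES
        (1, 'Smith', '2017-01-31'),
        (2, 'Jones', '2018-06-30'),
        (3, 'Brown', '2019-12-31');
    """
)


def purge_expired(cutoff: str) -> int:
    """Archive, then delete, rows whose expires_on is before cutoff (ISO date string)."""
    # The index exists only while the purge runs.
    conn.execute("CREATE INDEX idx_orders_expires ON orders (expires_on)")
    try:
        with conn:  # archive and delete commit together or not at all
            conn.execute(
                "INSERT INTO orders_archive SELECT * FROM orders WHERE expires_on < ?",
                (cutoff,),
            )
            deleted = conn.execute(
                "DELETE FROM orders WHERE expires_on < ?", (cutoff,)
            ).rowcount
    finally:
        conn.execute("DROP INDEX idx_orders_expires")
    return deleted
```

Note that the archive table now holds the same personal data as the original rows (customer_name in this example), so it needs the same access controls and encryption as the table it was purged from.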

Data in the Cloud

Data stored at one or more off-site locations must be protected as well. This is especially true if you have implemented application as a service (AaaS), where your application code resides in the cloud as well. It is your responsibility to protect personal data, and simply encrypting the data in the cloud is not enough. The applications themselves must be able to decrypt the data for display and encrypt the data for storage, so the code must be well-protected. In addition, you may not be in control of all the points of access or network routes, so it is imperative that all data be encrypted during transit. Setup of this involves the DBA, since it is their responsibility to implement disaster recovery procedures. Of course, you can contract with a third-party vendor to do database maintenance for you (known as database as a service, or DBaaS); however, as the data owner it is your responsibility to ensure that GDPR data protections are in place, and that data backup and recovery procedures follow data protection practices as well.

Reviewing Automated Processes

There is more work for the DBA related to any automated processes that exist. These include regularly-scheduled database reorganizations, backups, statistics generation and so forth. Others are more complex, and may include analyses of data distributions and statistics that are then used to selectively generate jobs. For the advanced shop there may be hundreds of these processes, and they may have been enhanced to include data purges, exception reports, audit reports or data extracts. For each of these, the DBA needs to review the process and determine if personal data is being protected. For example, if data is automatically extracted to a file that is then transmitted to a remote data mart, is the data in the file encrypted, and is the file itself subject to the correct security?
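With hundreds of automated processes, a first pass at this review can itself be automated. The Python sketch below assumes a hypothetical inventory of extract jobs and a hypothetical set of personal data columns (which in practice would come from the enterprise data dictionary). It flags any job that writes personal data to an unencrypted file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractJob:
    name: str
    columns: tuple   # column names written to the extract file
    encrypted: bool  # True if the output file is encrypted


# Hypothetical list of columns classified as personal data.
PERSONAL_COLUMNS = {"customer_name", "ip_address", "phone"}


def jobs_needing_review(jobs):
    """Return the names of jobs that write personal data to an unencrypted file."""
    flagged = []
    for job in jobs:
        touches_personal = PERSONAL_COLUMNS.intersection(job.columns)
        if touches_personal and not job.encrypted:
            flagged.append(job.name)
    return flagged


JOBS = [
    ExtractJob("nightly_sales_totals", ("region", "order_total"), encrypted=False),
    ExtractJob("datamart_customer_feed", ("customer_name", "phone"), encrypted=False),
    ExtractJob("audit_ip_report", ("ip_address",), encrypted=True),
]
```

In this sample inventory only datamart_customer_feed is flagged. A check like this does not replace a manual review, but it narrows the list of processes the DBA has to examine by hand.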

Summary

The most important aspect of the General Data Protection Regulation for the DBA is the need to have approved procedures in place to protect personal data. These procedures may affect a database design, either by limiting options such as indexes or partitioning, or by causing possible performance issues. This makes it essential that the DBA engage with developers and data modelers as early as possible in the application development life cycle.

LINKS

https://en.wikipedia.org/wiki/General_Data_Protection_Regulation

https://gdpr-info.eu/

See all articles by Lockwood Lyon