In order to be effective, developers and the team performing system and QA testing need to have a test system that is realistic in terms of both volumes and look and feel. For example, it’s hard to do effective performance testing on a system that has 10 to 25 percent of a true production system. It’s also challenging to do testing on systems where names are obviously made up (although I kind of liked my cartoon based names from the past – Marvin the Martian was always a particular favorite LOL).
A frequent request made of the DBAs is to provide a full clone of a production system that will then be used for a development and/or testing system. While Oracle provides great tools (primarily RMAN) to do cloning…just giving a straightforward full clone of production, as is, is one of those actions that makes (and should make) IT Security folks shudder with fear.
At the same time, almost every production database has at least some sensitive data in it and the more copies of this data that exist in our organizations, the higher the risks of some sort of accidental exposure. Corporate rules often mandate that sensitive data is protected and secured at all times.
Needless to say, these two requirements pretty well directly conflict with one another. In order to address the balance between needing to protect the data and the requirement to provide realistic test data, some form of data masking (modifying the data in the development/test system) is required. Manual procedures, such as writing our own code, have often been used to modify the original data. The drawback to most manual approaches is that the possibility does exist that the masking process could potentially be “reverse” engineered.
Oracle has introduced their Data Masking Pack as a new feature in Oracle Database 11g that provides a permanent, one way mask of sensitive data. The Data Masking Pack is a separately licensed Oracle Enterprise Manager pack that has been included with both Database Control and Grid Control in Oracle Database 11g R2.
If you are using Data Masking through EM Grid Control you need to make sure the DM_FMTLIB package has been installed in the target database. They can be created by running the following two supplied scripts (copy them to the target system if necessary).
$ORACLE_HOME/sysman/admin/emdrep/sql/db/latest/masking/dm_fmtlib_pkgdef.sql
and
$ORACLE_HOME/sysman/admin/emdrep/sql/db/latest/masking/dm_fmtlib_pkgbody.plb
and ensure that the DM_FMTLIB package is owned by the DBSNMP schema.
The most important feature of the Data Masking Pack is that it provides an irreversible method for replacing the original data with realistic looking “scrubbed” data. A second great feature is that a sharable library of masking formats can be used to ensure that any masking rules and processes are applied consistently to all production data in an organization. It also means that custom masking formats only need to be developed once.
Because this is a non-reversible activity – ALWAYS make sure that you are running masking jobs against a clone of the production database.
Built In Data Masking Formats
Oracle provides many built in Masking Formats as part of their library. These include more complex masks such as
- American Express CC Numbers
- MasterCard CC Numbers
- Visa CC Numbers
- Discover CC Numbers
- Generic CC Numbers
- National Insurance Numbers
- Social Security Numbers
- Social Insurance Numbers
- ISBN Numbers
- UPC Numbers
- US Phone Numbers
Oracle also has “primitive” masks such as:
- Array List (a list of values that should be selected randomly)
- Fixed Number
- Fixed String
- Random Dates (a range can be specified)
- Random Digits (a range can be specified)
- Random Numbers ( a range can be specified)
- Random Strings (literals in a range)
There are also some built in routines such as:
- Shuffle (the original data)
- Substitute (deterministic masking that ensures the same value in yields the same result)
- Substring
- Table Column (select a random entry from another table)
Beyond the built in masks, we have the ability to create our own custom formats (which will be covered in my next article).
Steps Performed by A Data Masking Job
When a data masking job is executed against the database the following series of steps is performed.
1) A mapping table is built for each column being masked, which contains the following two columns – original_value, masked_value
2) Drop the constraints and grants on the table
3) Rename the table
4) Create a new table using the original table and the mapping table(s)
5) Create indexes on the new table
6) Gather new statistics on the table
7) Replace the constraints and grants on the table
8) Drop the indexes on the original table and the original table itself with the purge option
9) Drop the mapping tables created in step one
In order to help with the performance, the data masking job takes advantage of bulk operations, disables logging and runs in parallel. Because the original table with the original data is dropped with the purge option the sensitive data is removed completely.
For testing purposes only a DBA could override the behavior of disabling logging (both redo and flashback logging are disabled by default while a masking job runs). Additionally the automatic gathering of new statistics could be turned off, the automatic drop of the mapping tables could be turned off and the degree of parallel processing can be controlled.
Data Masking Reports for Auditors in Grid Control
While there are no predefined reports for Data Masking in Enterprise Manager Grid Control, it is very easy to create Data Masking reports if they are required. The steps to create a report are:
1) Click on the reports tab
2) Click Create
3) Specify what you want for your report title
4) Select “Security” from the Category drop down list (or create a custom category if preferred)
5) Select “Security Policy Overview” for the Sub-Category (or create a custom sub-category if preferred)
6) On the Elements page, click Add
7) Add a Table From SQL and click OK
8) Enter a header and the SQL, continue and OK
Here are some query examples. (Select on the tables must be given to the MGMT_VIEW user account)
This first query would show the information about the masking job(s) themselves including which database they ran in, the name of the job, which account ran it, when it ran and how long it took.
SELECT ds.source_name as "Database", ds.ss_name as "Mask Defn", ds.ss_owner as "Owner", je.submission_ts as "Run Date", j.job_owner as "Job Submitted By", (js.end-time - js.start-time) as "Job Duration" FROM sysman.mgmt_dm_scopespecs ds, sysman.mgmt_dm_job_executions je, sysman.mgmt_job j, sysman.mgmt_job_exec_summary js WHERE ds.ss_guid = je.ss_guid AND j.job_id = js.job_id AND je.execution_id = js.execution_id AND ds.source_id = ??EMID_BIND_TARGET_GUID??
This query could be used to provide information on what was actually masked and which mask formats were used
SELECT ds.ss_name as "Mask Defn", sc.table_schema as "Schema", sc.table_name as "Table", sc.column_name as "Column", re.entry_order as "Step", decode (re.rule_type, 'RN','Random Number', 'CC','Table Column', 'DT','Random Date', 'AL', Array List', 'FN','Fixed Number', 'RD','Random Digits', 'RS','Random String', 'SH','Shuffling', 'UF','User Function', 'UT','Post Process' as "Format Type", decode (re.rule_type, 'RN', 'Start Length: '||rule_low||' End Length: '||rule_high, 'CC','Schema Owner: '||re.table_schema||' Source table and column: '||re.table_name||'.'||e.column_name, 'AL',da.array_list.item, 'FS','Fixed String', 'RD','Start Digits:,||rule_low||' End Digits: '||rule_high, null as "Format Mask Parameters" FROM sysman.mgmt_dm_ruleentry re, sysman.mgmt_dm_scopespecs ds, sysman.mgmt_dm_ss_columns sc, sysman.mgmt_dm_alitems da, sysman.mgmt_dm_job_executions je WHERE re.rule_guid = sc.rule_guid AND re.rule_guid = a.rule_guid (+) AND re.entry_order = a.entry_order (+) AND ds.ss_guid = je.ss_guid AND ds.ss_guid = sc.ss_guid AND ds.source_id = ??EMIP_BIND_TARGET_GUID?? ORDER BY ds.ss_name, sc.table_schema, sc.table_name, sc.column_ name, re.entry_order;
Next month we’ll take a detailed look at creating custom format masks, sharing the format mask libraries and creating and running a data masking job. Until then…