The Complete Guide to DIY Static Data Masking

Mask your data manually in Oracle and SQL Server databases by downloading our free script.

Static Data Masking (SDM) is the process of replacing sensitive information with realistic but fictional data in non-production environments. The goal is simple: eliminate the security risk.

Since databases are built to manipulate data at scale, you may choose to take advantage of those capabilities and opt for a “Do-It-Yourself” (DIY) approach. The idea is to build SQL scripts that mask the data instead of leveraging the power of third-party tools. Here is how to navigate the complexities of DIY masking.

Core Masking Vs. DIY

It’s vital to understand that this DIY method works differently from a masking solution like Core Masking. It’s not just about having an easy-to-use UI, but about how the masking works.

Core Masking reads all the data in the column, applies the masking algorithm in software, and then updates each row individually with the new masked data. That means that for a million rows, Core Masking will issue a million individual updates. Core Masking leverages various techniques (like array binding) to make masking run fast. However, the important point is that Core Masking runs the masking algorithms in the solution. It means the masking algorithm can be complex and perform any functionality supported by the product.

The DIY method uses a single vertical update. That means that it runs very fast. However, it also means that it’s limited to masking algorithms that can be implemented using built-in database functions. We attempted to simulate as many of the masking algorithms as we could with built-in database functionality, but it’s not the same, and there are significant gaps that we will review.

In short, DIY is the best you can achieve without a solution. It might be sufficient for your purposes, but it’s not the same as a masking solution. Read the details below and try it out for yourself.

The “Wipe” Method: The Blunt Instrument

If your only goal is to remove the sensitive data and not replace it with anything else, the database can do that with a single simple command:

Data Type	SQL Command	Risk
Nullable	UPDATE tbl1 SET col1 = NULL;	Application must handle null values
String	UPDATE tbl1 SET col1 = ”;	Breaks logic that expects specific lengths/formats
Numeric	UPDATE tbl1 SET col1 = 0;	Breaks logic that expects specific numerical range

Balancing Risk vs. Data Quality

The wipe-out method described above is not very useful because it eliminates the data. The primary objective and challenge in masking is maintaining Data Quality. We move production data to dev/test environments because we need “real-world” examples with edge cases. We want to simulate production so that we find bugs earlier in the development and test cycles.

Therefore, we usually don’t want to wipe out the data, but deliver data that is a good fake of the original. Eliminate the sensitivity while retaining the utility.

When we mask, we must consider the level of “realism” we want to achieve by weighing the risk against the testing requirements. Here are some examples:

The sensitive data isn’t tested. The data just needs to exist, but anything would do. For example, a name and phone number must resemble a name and a phone, but not necessarily good fakes of the original.
The application has data requirements. The data must follow particular rules. For example, a credit card must pass the Luhn checksum, a phone number must fit the phone format, a ZIP code must match the City, or a social security number must match across various columns (known as consistency).
Realism. The masked data must retain certain properties from the original sensitive data. For example, maintain the order of magnitude of salaries or keep the phone number formats from the original data. Realism can also require retaining statistical properties such as the ratio between males and females, or the number of users in each country.

Five Essential DIY Masking Algorithms

For a DIY implementation, these five patterns cover a large percentage of use cases.

1. Noise Infusion (Numerical)

Type of data: Numerical quantities. Mostly salaries or other dollar amounts, but also useful for inventory, product counts, etc.
Method: Adding a random noise within a range. For example, within plus or minus 30% of the original value.
Pros: Preserves the order of magnitude, delivering data that is similar yet different from the original.
Limitation: Adding 10% to a 0 value results in 0.
DIY Limitation: The noise usually has a uniform distribution. A Gaussian (Normal) distribution or other distributions may result in better quality data with lower risk.
Risks: Outlier values (like a CEO’s salary) remain outliers even after adding noise, making them vulnerable to “re-identification attacks.”

2. Character Replacement (Patterns & Text)

Type of data: Textual data. Best results are achieved for patterns. For example, phone numbers, national ID, license plates, etc.
Method: Replacing specific characters with others. For example, converting all the digits to stars (*), or replacing digits with other digits.
The “Star” Strategy: replace designated characters with a star (*), and X, or similar placeholders.
The “White List” Strategy: Define safe characters (space, dot, hyphen) and replace everything else.
The “Black List” Strategy: Define the characters you wish to replace and keep everything else. For example, replace digits (0-9) and letters (A-Z, a-z).
The “Partial Field” Strategy: Keep the last 3 characters and mask the rest. For example, to keep the area code of a phone number.
Limitation: When applied to textual data that isn’t a pattern (e.g., a first name), the masked data will look like gibberish. It’s valid information, but it may be inconvenient for testing.
DIY Limitation: Replacement usually relies on TRANSLATE, which replaces specific characters with others. That means that all the occurrences of a digit or letter within the field will be replaced by the same digit or letter (for example, “212-555-1234” to “646-999-4673”).
Additionally, the partial field is usually based on character count. For example, the first 3 characters and not the first 3 digits or everything before the @.
Risks: The pattern that isn’t exposed. For example, an international phone number looks different than a domestic one. A two-letter last name is probably an asian name like “Ho” or “Li”. Long or hyphenated names can be recognizable in the employee directory.
In addition, DIY limitations may expose additional information such as insufficient masking in partial field strategy as well as statistical attacks based on character frequencies (e.g., vowels are more frequent and can be identified based on repeating characters) or predictable values (e.g., if many phone numbers start with +1 or a known area code, those digits are identifiable in the rest of the number).

3. Limited Data Sets

Type of data: Columns that can only contain a limited number of valid values. For example, gender, country, or order status. The values can be textual (e.g., ‘Male’, ‘Female’) or numerical (e.g., 1 for Male and 2 for Female).
Method: Create a table with a list of possible values, and during masking, use those values at random. The table can be created manually, from a lookup table, or based on the current values in the column.
DIY Limitation: Cannot maintain “Frequency Distribution”. For example, a gender column with two values will likely be assigned each one 50% of the time, regardless of the gender distribution in the data.Note: This method can be used to generate some limited textual data. For example, with a list of first names, this method can mask the first name column with readable names.

4. Dates

Type of data: Columns that contain dates and times. For example, birthdates, card issue date, expiration date, or transaction dates.
Method: Shift the date up or down within a range. For example, plus or minus one year. This is a type of noise infusion on date or time data.
Pros: Preserves an order of magnitude. For example, the approximate age of a person (an old person and a child remain old and young), or the age of an account.
DIY Limitation: If dates are stored as strings, you must first convert them into a date type, add the noise, and convert them back. This process depends on knowing the format of the textual field and having database functions that can parse it.

5. Number Generation

Type of data: Numerical.
Method: Using a random number generator to create data for the field.
Pros: The masked data is unrelated to the original data and doesn’t expose any information.
Cons: The masked data is unrelated to the original data and doesn’t provide any testing value.
DIY Limitation: Without control over the distribution, there is no control over the numerical profile except for the range.

Where DIY Hits the Wall

Beyond the lack of UI and the limitations we mentioned so far, the DIY approach has certain barriers that are impossible to overcome. If these are part of your requirements, a commercial solution is more appropriate.

Consistency, Primary/Foreign Keys: DIY cannot generate unique values for primary keys or mask the referencing foreign keys with the same values. Other types of consistency between columns or between databases are also not possible.

Data generation: Generating more complex and realistic textual data is outside the scope of a DIY project. For example, generating realistic email addresses or physical addresses.

Profiling: Profiling the data to generate masked information based on that profile is impossible with DIY. Whether it is to identify patterns in the data or to retain the frequency of the data

Other cases: There are many situations where a commercial solution is more appropriate. For example, to mask LOBs (large objects) like images, biometric data, etc., to parse dates from strings, to perform custom adjustments, and more.

Final Thoughts

DIY masking is a valid option for startups or small-scale projects with tight budgets. It forces you to write the SQLs manually, but the scripts we provide on the right panel will take you most of the way there. However, as your requirements for security and data quality increase or the availability of database personnel decreases, maintaining your own data masking scripts will often exceed the price of a solution like Core Masking.

Bottom line: Whether you build it or buy it, don’t leave your dev and test environments exposed. Mask your data.

Core Masking Vs. DIY

The “Wipe” Method: The Blunt Instrument

Balancing Risk vs. Data Quality

Five Essential DIY Masking Algorithms

1. Noise Infusion (Numerical)

2. Character Replacement (Patterns & Text)

3. Limited Data Sets

4. Dates

5. Number Generation

Where DIY Hits the Wall

Final Thoughts

Download File

AI Recommended

Ask a Question

Core Masking Vs. DIY

The “Wipe” Method: The Blunt Instrument

Balancing Risk vs. Data Quality

Five Essential DIY Masking Algorithms

1. Noise Infusion (Numerical)

2. Character Replacement (Patterns & Text)

3. Limited Data Sets

4. Dates

5. Number Generation

Where DIY Hits the Wall

Final Thoughts

Download File

AI Recommended

Ask a Question

You might also like

Best Database Activity Control Solutions

Static Data Masking: Build or Buy?

Application or Database Activity Control: How Do They Compare?