Data Masking, Anonymization, Obfuscation, and Privacy – Methods and Examples
Learn how to mask different types of data, understand common terminology and concepts, and figure out what matters to you most.

In a world driven by data, with threats lurking around every corner, keeping data safe is hard enough. We then make the challenge worse by copying the data for testing, development, training, and more. If protecting data in production is difficult, protecting these copies outside the secured production environment is nearly impossible. So, what can we do?

The solution is to remove the sensitive portion of the data from all the copies we make. That is called static data masking. There are many terms and techniques that blend together, from anonymization to obfuscation and more, and the definitions are not always consistent and often overlap. The objective, however, is the same: to allow us to use sensitive data outside of production without the security risk.

This is where the challenge comes in: to ensure the data remains valuable after masking. The masked data must allow for high-quality development, testing, training, etc. If the masked data fails to retain its unique qualities, teams that use it will demand access to the original sensitive data so they can do their work.

This article walks you through several common examples to explain how each type of data can be desensitized and what you can expect in terms of data quality and security risk.

Names

First names, last names, company names, and more are the most common requests when masking data, and there are multiple strategies you can employ.

Original Data    Masked Data
John             James
Paul             Mark
George           Luke
Ringo            Peter

The most popular strategy is data generation: creating random names from a name dictionary. You could use a dictionary provided by the masking solution or a custom dictionary you provide. The data generation engine can use a list of weighted patterns to make the data seem more realistic. For example, some of the names should be one word in lowercase (e.g., john), more should be capitalized (e.g., James), and some should be in all uppercase (e.g., JAKE). A few entries should contain two or three names (e.g., Smith-Jones) or initials (e.g., J.J.). By creating a weighted mix of patterns like that, the generated data will look more realistic.
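
To make this concrete, here is a minimal Python sketch of weighted pattern generation. The name dictionary, pattern list, and weights are all illustrative; a real masking solution ships far larger dictionaries:

```python
import random

# Illustrative dictionary; real solutions ship far larger ones.
NAMES = ["james", "mary", "john", "patricia", "jake", "smith", "jones"]

def one_lower():
    return random.choice(NAMES)                       # e.g., "john"

def capitalized():
    return random.choice(NAMES).capitalize()          # e.g., "James"

def all_upper():
    return random.choice(NAMES).upper()               # e.g., "JAKE"

def hyphenated():                                     # e.g., "Smith-Jones"
    return "-".join(random.choice(NAMES).capitalize() for _ in range(2))

def initials():                                       # e.g., "J.J."
    return ".".join(random.choices("ABCDEFGHIJKLMNOPQRSTUVWXYZ", k=2)) + "."

# (generator, weight) pairs: mostly capitalized, with rarer variations.
PATTERNS = [(capitalized, 80), (one_lower, 5), (all_upper, 5),
            (hyphenated, 7), (initials, 3)]

def generate_name() -> str:
    generators, weights = zip(*PATTERNS)
    return random.choices(generators, weights=weights, k=1)[0]()

print([generate_name() for _ in range(8)])
```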

The benefit of data generation is that the generated data is completely divorced from the original data; it cannot disclose any information about it. However, that is also its weakness: it retains no characteristics of the original data. If the data will be used, for example, to test something related to names, the test will not simulate real-world behavior. The generated data may lack names in Chinese characters, very long names, or very short ones, and it may skew Western rather than Latin or Asian. The closer you look at the synthetic data, the more differences you will find between it and your real data.

Original Data    Masked Data
John             Dhox
Paul             Pxje
George           Ehksor
Ringo            Najre

Another approach is character replacement. That will retain some aspects of the data, like character sets and length, but the result will be gibberish and completely incomprehensible. While it may aid in certain types of testing, it will not be readable, and users will generally dislike it.

From a security perspective, character replacement may divulge information about the original values, such as the character set and word length. Note that these are the exact characteristics we aimed to preserve to improve certain types of testing.
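
Here is a minimal sketch of character replacement that preserves length, case, digits, and punctuation:

```python
import random
import string

def replace_characters(value: str) -> str:
    """Swap each character for a random one of the same class, preserving
    length, case, digits, and punctuation."""
    masked = []
    for ch in value:
        if ch.isdigit():
            masked.append(random.choice(string.digits))
        elif ch.isupper():
            masked.append(random.choice(string.ascii_uppercase))
        elif ch.islower():
            masked.append(random.choice(string.ascii_lowercase))
        else:
            masked.append(ch)  # keep spaces, hyphens, dots, etc.
    return "".join(masked)

print(replace_characters("John"))         # e.g., "Dhox"
print(replace_characters("Smith-Jones"))  # e.g., "Qwkdr-Pbnme"
```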

Original Data    Masked Data
John             Paul
Paul             Ringo
George           John
Ringo            George

Other strategies, such as row randomization, may work well for first or last names. Randomization discloses the names that exist in the database, but not the rows they belong to. You could also preserve the gender associated with each first name by randomizing a composite of the first name and the gender (more about composites later).

There are more complex strategies that may benefit particular use cases, so it’s all about the data you have and the properties you need to preserve.

Gender

Original Data    Masked Data
Male             Male
Female           Female
Male             Female
Female           Male
Male             Male

The easiest way to mask genders is by randomizing the column. This kind of masking reuses the current values in the column but associates them with different rows. It automatically handles gender columns with more than two distinct values (more than Male and Female), and it works regardless of the language used in the column or whether the column holds a numerical representation (e.g., 1 for male and 2 for female).

Adding weights to the randomization will ensure the population ratio between the different genders remains the same. More complex variations can retain the gender population ratio within each state, city, and more. The question always revolves around how you use the data and what qualities you need to preserve.
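
Here is a minimal sketch of both approaches (plain randomization and weighted), assuming the column values have been loaded into a Python list:

```python
import random
from collections import Counter

genders = ["Male", "Female", "Male", "Female", "Male"]

# Row randomization: keep the existing values, shuffle which row gets which.
# Distinct values and their overall counts are preserved exactly.
masked = genders[:]
random.shuffle(masked)

# Weighted generation alternative: sample values using the observed
# frequencies, preserving the population ratio in expectation.
values, weights = zip(*Counter(genders).items())
masked_weighted = random.choices(values, weights=weights, k=len(genders))

print(masked, masked_weighted)
```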

Dates (e.g., Birthday)

Original Data    Masked Data
1971-06-14       1971-09-26
2013-02-23       2012-07-16
2007-10-19       2007-03-07
1983-08-03       1984-01-30

Dates and times are usually masked with noise infusion. Adding noise to dates means changing the value within a certain range. For example, shifting the date up or down by no more than one year. That ensures the actual birthday is unknown, but the date still makes sense. For example, it avoids situations where a 70-year-old person turns into a newborn, parents are younger than their children, things expire before they are produced or issued, and more.
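
A minimal sketch of date noise infusion, shifting each date by up to one year in either direction:

```python
import random
from datetime import date, timedelta

def add_date_noise(value: date, max_days: int = 365) -> date:
    """Shift a date up or down by a random amount within +/- max_days."""
    return value + timedelta(days=random.randint(-max_days, max_days))

print(add_date_noise(date(1971, 6, 14)))  # e.g., 1971-09-26
```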

Dates you might need to mask may include birthdays, issue and expiration dates of a passport or an ID, transaction dates, and more.

Original Data     Masked Data
06/14/1971        09/26/1971
2013-FEB-23       2012-JUL-16
20071019          20070307
August 3, 1983    January 1, 1984

An important aspect of date masking is how the date is stored in the database. Every database provides a dedicated date type, which is straightforward to mask. However, many databases store dates in textual format. In that case, the masking process needs to parse the date, mask it, and store it back in the same textual format. Dates stored as text can be formatted in many ways (e.g., 12-31-2025, Dec 31, 2025, 20251231, 31/12/2025, 31 DEC 25), and the masking process must be flexible enough to accommodate the format(s) used in your database.
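
A minimal sketch of masking textual dates; the format list is illustrative, and a real solution would profile the formats actually present in the column:

```python
import random
from datetime import datetime, timedelta

# Formats observed in the column; extend this list to match your database.
FORMATS = ["%m/%d/%Y", "%Y-%b-%d", "%Y%m%d", "%B %d, %Y"]

def mask_text_date(text: str) -> str:
    """Parse a textual date, add noise, and re-emit the same format."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        masked = parsed + timedelta(days=random.randint(-365, 365))
        return masked.strftime(fmt)
    raise ValueError(f"unrecognized date format: {text!r}")

print(mask_text_date("20071019"))     # e.g., "20070307"
print(mask_text_date("2013-FEB-23"))  # e.g., "2012-Jul-16" (case may need extra handling)
```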

Emails

The most common approach in email masking is data generation, creating fictional new emails. However, if it’s important to retain the domain names or some special formats, character replacement is a good option. You could, for example, replace all the alphanumeric characters before the @.

Beware of email columns that contain multiple emails (e.g., separated by commas, spaces, or semicolons). Character replacement may be a good option for masking emails when the format is unknown and you wish to preserve it.

Keep in mind that to preserve domain names, you should mask everything up to the last @, not the first (and to preserve the TLD, up to the last dot). Masking up to the last occurrence handles fields that contain multiple emails.
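
A minimal sketch that splits on the last @, so fields holding multiple emails don't leak earlier addresses:

```python
import random
import string

def mask_email(value: str) -> str:
    """Replace alphanumerics up to the LAST @, preserving the final domain.
    Everything before that @ is masked, including any earlier addresses."""
    local, _, domain = value.rpartition("@")
    masked_local = "".join(
        random.choice(string.ascii_lowercase) if ch.isalnum() else ch
        for ch in local
    )
    return f"{masked_local}@{domain}"

print(mask_email("john.smith@example.com"))  # e.g., "qwkr.vbnma@example.com"
print(mask_email("a@x.com; b@y.com"))        # e.g., "q@w.abc; r@y.com"
```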

Phone Numbers

Phone numbers follow patterns, and pattern profiling is a powerful method for masking them: it preserves the patterns while effectively hiding all the information.

However, if you wish to preserve portions of the phone number, such as the country code and area code, character replacement is probably the best option. Masking the last 7 digits is a popular option.

However, keep in mind that if a phone number contains an extension, the last 7 digits may include part of the extension.

Also, keep in mind that if the column contains multiple phone numbers, partial field masking will likely reveal some of the information. A good alternative is to mask everything except the first few digits. Another option is to apply different masking policies depending on the field length.

This brings up another common situation: the digit 0 indicating there is no phone. There are similar "fake" phone numbers that suggest the person doesn't have a phone or doesn't wish to disclose it. Masking such data should include a condition that skips those special entries.
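
A minimal sketch combining partial masking with such a condition; the "no phone" markers and the number of preserved digits are illustrative:

```python
import random
import string

FAKE_NUMBERS = {"0", "000-0000", "N/A"}  # illustrative "no phone" markers

def mask_phone(value: str, keep: int = 4) -> str:
    """Keep the first `keep` digits (country and area code), replace the
    rest, and skip special entries that mean 'no phone'."""
    if value.strip() in FAKE_NUMBERS:
        return value  # conditional masking: leave the marker untouched
    seen = 0
    masked = []
    for ch in value:
        if ch.isdigit():
            seen += 1
            masked.append(ch if seen <= keep else random.choice(string.digits))
        else:
            masked.append(ch)  # keep '+', separators, extension markers
    return "".join(masked)

print(mask_phone("+1 (212) 555-0187 x22"))  # digits after the 4th are replaced
print(mask_phone("0"))                      # stays "0", signaling 'no phone'
```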

Street Address

Street addresses usually include numbers and names in various formats. The most common masking method is data generation with a weighted pattern list that emulates the main address formats.

If these addresses are an important part of the test, you may also consider alphanumeric character replacement. That will retain the pattern and length but will generate addresses that look like gibberish.

City, State, Zipcode, and Country

Masking each column independently is simple, and the common approach is to randomize the values between the rows.

However, it is often desirable to retain the relationship between these columns. That means ensuring the state exists within the country, the city within the state, and that the zip code matches the rest. Composite masking is the solution as it can simultaneously operate on multiple values. The most common is a composite-randomize that switches the full set of values between the rows.

Adding weights will also retain the statistical properties, ensuring the same number of people remain in each city, state, zip code, and country.
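
A minimal sketch of the composite-randomize described above; a full shuffle preserves the per-location counts exactly:

```python
import random

rows = [
    {"city": "Springfield", "state": "IL", "zip": "62701", "country": "US"},
    {"city": "Portland",    "state": "OR", "zip": "97201", "country": "US"},
    {"city": "Austin",      "state": "TX", "zip": "73301", "country": "US"},
]

# Move the whole (city, state, zip, country) tuple between rows so every
# masked row still holds a valid combination, and the number of people
# per location is preserved exactly.
composites = [(r["city"], r["state"], r["zip"], r["country"]) for r in rows]
random.shuffle(composites)
for row, (city, state, zip_code, country) in zip(rows, composites):
    row.update(city=city, state=state, zip=zip_code, country=country)

print(rows)
```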

Another option is to ensure people remain within the same country or state. An invariant constraint can achieve that; invariants can also, for example, ensure that a country phone prefix matches the country.

Composites with weights and invariants are a powerful means of preserving individual privacy while producing highly realistic data: data that retains statistical properties and ensures column values match the rest of the record.

National ID

Character replacement is a good general solution for masking national IDs. In some ID schemes, the first few characters indicate attributes such as the type of ID or the person's age; in such cases, it may be desirable to retain those first few characters.

Sometimes, you’ll need to apply a custom function to calculate a checksum for the ID. If the checksum is based on the Luhn algorithm, that is an option you can enable in the masking policy. Otherwise, a database or Python function can perform the required adjustments.

There are, however, two subjects that are important to consider when masking such IDs:

First, IDs must be unique. Masking characters at random will likely generate duplicates even with relatively small datasets, a statistical effect commonly known as the birthday paradox. To give a sense of scale, a 9-digit ID has a 99% chance of producing a duplicate with just 96,000 rows, and a 10-digit ID will likely contain duplicates with 310,000 rows.
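
You can reproduce those figures with the standard birthday-paradox approximation:

```python
import math

def collision_probability(n_rows: int, digits: int) -> float:
    """Birthday-paradox approximation: p ≈ 1 - exp(-n^2 / (2N)),
    where N is the number of possible ID values."""
    space = 10 ** digits
    return 1 - math.exp(-(n_rows ** 2) / (2 * space))

print(f"{collision_probability(96_000, 9):.0%}")    # ~99% for a 9-digit ID
print(f"{collision_probability(310_000, 10):.0%}")  # ~99% for a 10-digit ID
```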

The correct solution to the uniqueness problem is a masking dictionary: a dictionary that keeps track of all the values and ensures new values are unique.

The second challenge in masking national IDs is that they usually appear in multiple locations in the database and often in other databases as well. To ensure consistency, the same ID must be masked the same way everywhere.

The solution is, again, a masking dictionary that keeps track of the values and can deliver consistent masking across one database or more databases.
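
A minimal in-memory sketch of the idea; a production masking dictionary is persisted so it can deliver the same mapping across runs, tables, and databases:

```python
import random

class MaskingDictionary:
    """Maps each original value to one masked value, guaranteeing
    consistency (same input, same output) and uniqueness."""

    def __init__(self, digits: int = 9):
        self.digits = digits
        self.mapping: dict[str, str] = {}
        self.used: set[str] = set()

    def mask(self, original: str) -> str:
        if original in self.mapping:      # consistent everywhere
            return self.mapping[original]
        while True:                       # retry until unique
            candidate = "".join(random.choices("0123456789", k=self.digits))
            if candidate not in self.used:
                break
        self.mapping[original] = candidate
        self.used.add(candidate)
        return candidate

d = MaskingDictionary()
assert d.mask("123-45-6789") == d.mask("123-45-6789")  # consistency
```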

Salaries

Salaries differ from the rest of the information we discussed because they are numerical quantities: dollar amounts, inventory counts, or any other number that represents an amount of something.

Numerical quantities can be masked in two primary ways: number generation or noise infusion.

Number generation creates random values within a particular range using a chosen distribution. It is very secure since the masked data is unrelated to the original, but that is also its weakness: the salary range, the average salary, and all other statistical attributes will not be retained.

Noise infusion modifies the original value by adding or subtracting random numbers. Adding noise masks the original value but retains its order of magnitude, and it usually maintains, approximately, statistical properties such as the range and the average.

While noise infusion is a great way to mask salaries in a similar range, it creates a potential security problem when a handful of executive salaries are significantly higher and stand out even after masking. The solution is conditional masking, which applies different masking policies to different salary ranges.

Keep in mind that there may also be values like zero to indicate interns, contractors, and various other individuals who are not on a salary. Such values would usually also need to be handled through conditional masking.
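
A minimal sketch of such a conditional policy; the ranges and the outlier treatment are illustrative:

```python
import random

def mask_salary(amount: float) -> float:
    """Apply a different masking policy per salary range."""
    if amount == 0:
        return 0.0  # interns/contractors: keep the special value
    if amount > 500_000:
        # Outliers would stand out after small noise; regenerate instead.
        return round(random.uniform(500_000, 2_000_000), 2)
    # Regular salaries: +/- 10% noise keeps range and average roughly intact.
    return round(amount * random.uniform(0.90, 1.10), 2)

print(mask_salary(85_000))  # e.g., 81350.0
print(mask_salary(0))       # 0.0
```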

Financial Data

Some databases contain significant financial data that is difficult to mask. For example, bank account transactions, credit card charges, etc. While it’s easy to mask each value, masking many columns that contain such values and that have a relationship between them can be a challenge. For example, masking credit card transactions should also recalculate the balance due, minimum payment, etc.

A simpler privacy-preserving solution is masking the identity of the individual rather than the details of their financial data. While it may not be appropriate or allowed in all cases, it is a common solution that simplifies the masking project.

However, masking only personally identifiable information (PII) requires diligent work to ensure nothing in the data can compromise an individual's identity. It also requires masking the account number, which, as mentioned earlier in the discussion about national IDs, requires handling uniqueness and consistency.

In addition to the basic requirements of a unique account number and consistent masking across all tables (and potentially all databases), account numbers are also likely to participate in a primary key / foreign key relationship constraint. These are generally handled by the masking solution, but keep them in mind and watch for gotchas.

Credit Cards

Credit cards are usually masked using character replacement or pattern profiling.

Character replacement is highly effective and also allows you to mask only a portion of the card number, leaving the first digits unchanged. That can be valuable since the first digits indicate the card type and similar information.

Pattern profiling is a more advanced method where the solution learns the patterns used to store cards and imitates them. It is a powerful masking strategy similar to data generation but based on the original information.

The only other complexity involving credit cards is that the last digit is a check digit computed with the Luhn algorithm, and you can easily enable that option in the masking policy.
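
A minimal sketch that keeps the first six digits, randomizes the middle, and recomputes the trailing Luhn check digit:

```python
import random

def luhn_check_digit(payload: str) -> str:
    """Compute the Luhn check digit for a string of digits."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:       # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def mask_card(card: str, keep: int = 6) -> str:
    """Keep the first `keep` digits (card type/issuer), randomize the
    middle, and append a freshly computed Luhn check digit."""
    middle = "".join(random.choices("0123456789", k=len(card) - keep - 1))
    payload = card[:keep] + middle
    return payload + luhn_check_digit(payload)

print(mask_card("4532015112830366"))  # starts with 453201, still Luhn-valid
```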

LOBs, BLOBs, etc.

Databases have column types that can store unstructured information and various files. For example, they can store documents, pictures, JSON, etc. Different database vendors name these columns differently, but they are sometimes known as LOB (Large Object) or BLOB (Binary Large Object), Image, or Binary.

Such columns store certain types of sensitive information. For example:

  • Pictures
  • Scans of documents, such as signed documents
  • Biometric data
  • Medical records and test results
  • Genetic information

Masking this type of information is crucial, but not trivial. There are several reasons for that:

  • The format of the content of a LOB or BLOB is unknown to the masking software and may require specialized software to open and modify.
  • The information must be saved in the correct format for the application software to read and process it successfully.
  • Reading and writing such columns requires specialized code. It’s not something you can usually do with regular SQL tools like Management Studio or SQL*Plus.

The solution to masking these types of columns is to provide a list of fake replacement documents. The masking solution will use these dummy documents at random, uploading them instead of the existing LOB values.

More Types of Data

There are other types of sensitive information we will not discuss in this article. For example, license plates, GPS data, IP information, and more. Ultimately, many organizations have unique data or special requirements, but those are handled by the same algorithms and principles we explained.

There is always the possibility that your data is unique enough and falls outside the scope of what the solution algorithms can do. That’s one more case where a strong vendor and partner will ensure you are successful.

Consistency and Referential Integrity

In some cases, you need to mask data in different tables or different databases in the same way. A simple example is the account ID or national ID mentioned earlier. On the technical side, you may think of primary and foreign keys.

There are two approaches for handling this situation in data masking solutions: deterministic masking and masking dictionaries. While the terminology may change, the algorithms and their limitations remain.

Deterministic masking uses the value (more accurately, a hash of the value) to ensure each value undergoes the same "random" transformation. This algorithm is flawed because of a statistical problem commonly known as the birthday paradox. It's a complex subject in probability theory, but it essentially means that even fairly modest datasets are likely to produce a value collision (see the explanation under National ID). A collision means two different original values mask into the same masked value and, therefore, break referential integrity.
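
A minimal sketch of deterministic masking; the salt and replacement list are illustrative:

```python
import hashlib

REPLACEMENTS = ["James", "Mary", "Robert", "Patricia", "Linda"]

def deterministic_mask(value: str, salt: str = "per-project-salt") -> str:
    """Hash the value and use the hash to pick a replacement. The same
    input always yields the same output, but two different inputs can
    land on the same replacement (the collision discussed above)."""
    digest = hashlib.sha256((salt + value).encode()).digest()
    return REPLACEMENTS[int.from_bytes(digest[:8], "big") % len(REPLACEMENTS)]

assert deterministic_mask("John") == deterministic_mask("John")  # consistent
```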

Masking dictionaries are a proper solution to consistency challenges. This is the method used in Core Masking, and it guarantees both uniqueness and referential integrity across multiple tables and even across various databases from different vendors.

Terminology

In recent years, the data masking world has been flooded with terminology. Every vendor invents its own terms and definitions to distinguish itself from the others, and as a result, not everyone interprets these terms the same way. That's why it's better to focus on specific algorithms and methods rather than on abstract and inconsistent concepts.

However, to help the discussion, we created a little glossary with reasonable interpretations:

  • Data Masking is often used as the main term for the market space of these types of solutions. As such, it refers to any data manipulation that aims to hide the original values. However, this term also refers to the technique of changing part of the value while retaining a portion of it.
  • Obfuscation often means just what it says: the original data will be difficult, if not impossible, to retrieve. However, it sometimes refers to particular techniques such as data generation (creating new data that is unrelated to the original).
  • Redaction is a masking technique in the narrow sense above: the process of wiping out a portion of the value. Not to be confused with reduction, which is the process of reducing the size of the dataset (creating a smaller database for testing).
  • Anonymization is the concept of ensuring data cannot be traced back to the person it relates to. Outside of databases, it is done by removing columns that contain PII. However, in databases, you must retain all the columns to ensure the application functions. Therefore, anonymization is performed by replacing personal information with fake data in an irreversible process. In other words, anonymization usually translates to generating PII data.
  • Pseudonymization is similar to anonymization, except that the masking process is reversible. Outside the database, PII columns are replaced with a single column containing a pseudonym, which can be manually converted back to PII using a translation table. Inside the database, pseudonymization replaces each piece of PII with a pseudonym (or token) that, again, can be reversed using the translation table. Pseudonymization produces data of poor testing quality because the pseudonyms offer no testing value; it is, therefore, uncommon in data masking for test and development.
  • Tokenization is similar to pseudonymization, except that it goes beyond PII and replaces any sensitive data with tokens. Again, the tokens can be resolved later if needed. Like pseudonymization, it is not a good solution for databases since the tokens offer no testing value.
  • Encryption is a variation on tokenization, where instead of tokens, the data is replaced with its encrypted form. To retrieve the original data, you need access to the decryption key. It is also a poor solution for database masking.

As mentioned above, various individuals and companies can use and understand these terms in different ways. Our best advice is to focus on specific functionality, algorithms, and the type of data they produce. It is far better to focus on evaluating the security and quality of the data for testing, development, etc.

Static vs. Dynamic

This article refers to static, not dynamic, data masking. However, it's important to understand the differences, so here is a short introduction to dynamic masking.

Dynamic masking changes data on the fly: the data stored in the database remains the original sensitive data, and it is masked as it is queried. The change is usually performed by modifying the query rather than the data fields in the result.

Changing data dynamically as part of executing the query has several direct ramifications:

  • The sensitive data is still stored in the database. That means you must still secure the database server, monitor administrator access, etc.
  • Modifying data is challenging. If the user or application queries masked information and then relies on that information to perform a change, the change will refer to masked data that does not exist in the database.
  • The data must be consistent. For example, when querying the name of person number 123, the original data must be masked on the fly and must always return the same masked name.

Because of these reasons, dynamic masking is:

  • Used only for production databases; it is not meant to help secure test or dev environments.
  • Applied only to a subset of the connections and a subset of the data: data that needs to be hidden from those particular connections and isn't updated by them.
  • Limited to trivial masking algorithms, like wiping out part or all of the information in a field; it cannot produce random or generated fake data, because such data would be different every time you query it.
  • Applied to connections that are usually identified by the user and program connecting to the database, not by the application end-user.

Dynamic masking is more complicated and costly than static masking and has a very narrow use case. Most masking is static masking.

Other Challenges

This article focuses on how to mask data. That is the primary concern for anyone working with data masking: providing good-quality data. The objective is that people who use the masked data are happy with the data and the results they get from using it. If the data is of low quality, users often reject it and demand access to the unmasked data.

However, when starting with data masking, organizations often dismiss the quality of the masked data and are concerned with other initial challenges, like:

  • Defining requirements. One of the initial challenges is the requirements. Many projects start with unclear objectives and aimlessly search for them. Sometimes, this search leads to compliance and legal teams, who are often viewed as responsible for these definitions. However, the guidance these teams eventually provide tends to be partial at best. This is the subject of a much longer article, and the short version is that proper requirements tend to come from a combination of technical and security teams. These teams are familiar with the data in the database and understand its security implications.
  • Locating sensitive data. As an extension of requirement definition comes the challenge of locating the tables and columns with the data we’d like to mask. That can be especially challenging in large applications where the teams are unfamiliar with the database data model. While there are some product features and practical methodologies that can assist in this endeavor, it is also a time-consuming effort. Again, this is the subject of another long article.
  • Performance. One of the practical challenges customers can face is that masking tasks take too long to complete. It is a challenge that relates to the solution, its implementation, and the database. For example, database triggers are a common problem in masking projects. Read more in our article about data masking performance.

While these challenges are transient, failing to meet them will likely result in a failed project. It is yet another reason for choosing committed vendors and partners who will help you overcome the obstacles you encounter.

Advanced Challenges

Some security professionals ask a great question: “How can I know the masked data is properly masked and safe to use?”

A common method for validating masking is to pick a few random rows and compare the data before and after masking. However, this cursory test doesn’t mean the entire dataset was properly masked. Here are a few examples of how reasonable masking policies can result in poorly masked data.

A masking policy that leaves the first few digits of a phone number unchanged may leave a short number mostly unmasked. On the flip side, a masking policy that only masks the last few digits may provide insufficient masking when the phone number is followed by an extension or when the field contains two or more numbers.

A masking policy that preserves the state a person resides in may unintentionally preserve the city. That can happen if all the residents in the state live in the same city (e.g., a prominent city in the state) or if there are few entries in the state that all come from the same city. Not to mention that a single person in the state will be easily identified.

As a final example, adding a percentage of noise to a salary or dollar amount will add very little noise to small numbers and preserve the value when the amount is zero.

Ultimately, the problem is that masking involves randomness that interacts with your data, so unexpected data will produce results you didn't consider. Equally challenging, the randomness may legitimately leave some values with little or no change (which is acceptable in itself), making spot checks hard to interpret.

The only way to ensure all the data is well masked is to apply the masking policy to all the data multiple times and analyze the runs, verifying that the average change is sufficient for every value. That is the task of a masking evaluation feature in a data masking solution.

Final Thoughts

Perhaps the main takeaway from this article is that there's no single "right way" to mask data. Masking is a balancing act between data needs and security requirements, and a vital component in this balance is the data quality requirements, which change between projects and evolve over time.

In other words, a masking project is about finding the “correct way” to mask your data so that it fits your customer’s current needs. Solutions that provide multiple masking options for each type of data offer you the flexibility to find the balance point between security and data quality.

While this article doesn't focus on challenges outside of data quality and security, it's vital to remember that masking projects can fail regardless of their ability to produce good masked data. For example, triggers on tables are a common cause of unbearable masking performance and consequent project failure.

For that, and many other reasons, experienced and committed partners and vendors are the critical difference between project success and failure.

Above all, remember to always mask your data. Keeping unmasked sensitive data outside of the secure production environment is irresponsible and invites catastrophic consequences.


If you have a question or a comment, please let us know. We’ll be happy to hear from you.