Introduction
In today’s data-driven world, privacy and security are more crucial than ever before. Data masking solutions help protect personal, financial, and business-critical information. Selecting the right solution is essential to a successful masking project and effective protection of your sensitive information.
Misleading Terminology
Many vendors use terms like Anonymization, Pseudonymization, Tokenization, Hashing, Encryption, Reduction, and more. The problem is that everyone has a different definition of what these mean.
These terms sound good because they relate to privacy or security and seem like features you’d want. However, they are non-descriptive in terms of functionality and can mean almost anything. They are often used as marketing buzzwords to lure prospective buyers.
When someone sells you a feature like Anonymization, ask for details about what it actually does.
![](https://bluecoreresearch.com/wp-content/uploads/sites/3/terminology4.png)
The Problem
Why do we need data masking? The underlying problem is that many organizations copy data out of the secured production environment, and it is difficult, if not impossible, to protect the data once it is out of production. There are many possible reasons to copy data out, but the most common is for application testing and development. Other reasons include analysis, training, providing it to 3rd parties, and more.
Production data is valuable and useful for many purposes. However, outside of production, the data resides in insecure systems, can be accessed by unapproved individuals, and creates unnecessary exposure with increased security risk.
![](https://bluecoreresearch.com/wp-content/uploads/sites/3/unsafe_copies6.png)
The Solution & Objectives
The solution is to remove the sensitive data from the non-production copies. That is called Static Data Masking. Once the data is masked, you can use it in many less secure environments.
However, wiping out the sensitive data is not enough. The masked data must also be of high quality to remain valuable. Whether you need it for testing or other purposes, it must continue to yield results similar to the original data. If the masked data does not provide equivalent results, the teams using it will demand access to the original sensitive information. That is one way a masking project can fail.
The masked data must also retain data validity and application integrity. Those objectives are important so the application continues to function correctly.
The problem is that removing sensitive data and retaining test quality can be in direct conflict. The better we scrub out the sensitive information, the less there is left for the testing team to test with. While data masking is not a trivial undertaking, it is not too difficult either. You need the right solution and should take the time to do it right.
There is also Dynamic Data Masking, but we will discuss it later since it is a different solution to a different problem. Yet another part of the confusing terminology.
Critical Features
These data masking capabilities are critical to evaluate because gaps in them can cause your project to fail. These features are at the core of what you need from data masking, so be sure to check them out.
![](https://bluecoreresearch.com/wp-content/uploads/sites/3/critical_features1.png)
1. Platform Support
The first requirement of a solution is that it supports the databases you need to mask. Whether it’s Oracle databases, SQL Server databases, CSV files, etc., the data masking solution must be able to connect to the relevant system and mask the data inside it.
The difficulty in this obvious requirement is that you may not know all the systems you need to support now and in the future. Vendors are also likely to introduce support for additional databases over time. That creates a fluid situation.
Best Practice: Evaluate solutions against your current needs, and discuss your potential future needs with the vendor to see how they fit with their roadmap.
2. Algorithms & Techniques
The algorithms & techniques available in a solution are critical to the quality of the masked data. They will determine the masking quality from a security perspective and how realistic it is in providing quality testing. It’s also central to data validity and application integrity.
In other words, the algorithms in the masking solution are the main feature affecting all the objectives.
Masking algorithms are a large subject, and we have many articles on this point alone. Generally speaking, however, algorithms fall into three main categories. Data Generation algorithms create synthetic data unrelated to the original data. Value Manipulation algorithms modify the original values in some way and retain some of their properties. Profiling algorithms are a middle point: they generate data, but the generation uses a profile of the original data. There are also custom profiles and other variations that can take things even further.
You should also consider the type of data you will mask since different algorithm classes have varying support across data types. For example, you cannot do value manipulation on a Gender column since it has only a handful of valid values. It is best to mask Gender using a profile of the original data.
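To make the three categories concrete, here is a minimal, vendor-neutral sketch in Python. The function names and the toy Gender and phone data are illustrative assumptions, not taken from any product.

```python
# Illustrative sketch of the three algorithm categories (not any vendor's API).
import random
from collections import Counter

# 1. Data Generation: synthetic values unrelated to the originals.
def generate_phone() -> str:
    return "555-" + "".join(random.choice("0123456789") for _ in range(7))

# 2. Value Manipulation: modify the original value while keeping some of its
#    properties (here, length and non-digit characters are preserved).
def manipulate_digits(value: str) -> str:
    return "".join(random.choice("0123456789") if c.isdigit() else c for c in value)

# 3. Profiling: build a frequency profile of the original column, then draw
#    replacement values from that profile so realistic frequencies are kept.
def build_profile(column: list[str]) -> tuple[list[str], list[int]]:
    counts = Counter(column)
    return list(counts.keys()), list(counts.values())

def mask_from_profile(profile: tuple[list[str], list[int]]) -> str:
    values, weights = profile
    return random.choices(values, weights=weights, k=1)[0]

if __name__ == "__main__":
    genders = ["F", "F", "M", "F", "M", "M", "F"]
    profile = build_profile(genders)
    print([mask_from_profile(profile) for _ in genders])  # realistic mix of F/M
    print(manipulate_digits("412-88-1931"))               # same shape, new digits
    print(generate_phone())                               # fully synthetic value
```

Note how value manipulation would be meaningless for Gender, while the profile keeps the original F/M ratio without copying any individual row.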
Best Practice: Experiment during a POC. Define a few realistic test cases and ask the vendor for several masking alternatives. For example, one option for better security, one for better test quality, and one balanced approach. Have security personnel evaluate the results from a security perspective, and the target teams (for example, QA) validate the data quality for their uses. You should also compare several solutions to see the range of data manipulation capabilities available. We recommend Core Masking since it has many powerful algorithms.
3. Performance & Scalability
Performance is third on the critical feature list because it’s one of the main reasons masking projects fail.
Data masking performance involves multiple elements, from the technology used by the solution to database tuning for write-intensive activity (unlike the regular read-optimized tuning).
However, the biggest problem in data masking is triggers. Triggers are small code fragments that execute in the database when you update tables. They only have a small overhead on each update. However, data masking executes millions of updates, and these small overheads accumulate to a massive slow-down. They can cause a masking process to take days or longer, rendering it impractical. The solution isn’t simple either since triggers are usually critical for data integrity and shouldn’t be disabled without a way to compensate for their functionality.
Best Practice: Ask your DBA team whether the databases you intend to mask have triggers on tables with sensitive data. Also, during the POC, you should insist on masking one of your large tables from beginning to end. If masking performance seems unacceptable for any reason (triggers or otherwise), make sure it is either resolved during the POC or, if the resolution is complex (as in the case of triggers), that the vendor will help resolve it as part of the implementation.
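As a starting point for that conversation, here is a minimal sketch that lists triggers on the tables you plan to mask. It assumes a SQL Server target reachable through pyodbc; the connection string and table names are placeholders, and the catalog query would need to be adapted for Oracle, PostgreSQL, and other databases.

```python
# Sketch: list triggers on the tables you plan to mask (SQL Server catalog views).
import pyodbc

CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=...;DATABASE=...;Trusted_Connection=yes"
SENSITIVE_TABLES = ["Customers", "Payments"]  # hypothetical table names

def triggers_on(tables):
    placeholders = ",".join("?" for _ in tables)
    sql = f"""
        SELECT t.name AS table_name, tr.name AS trigger_name, tr.is_disabled
        FROM sys.triggers tr
        JOIN sys.tables t ON tr.parent_id = t.object_id
        WHERE t.name IN ({placeholders})
    """
    conn = pyodbc.connect(CONN_STR)
    try:
        return conn.cursor().execute(sql, tables).fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for table_name, trigger_name, is_disabled in triggers_on(SENSITIVE_TABLES):
        status = "disabled" if is_disabled else "ENABLED"
        print(f"{table_name}: trigger {trigger_name} is {status}")
```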
4. Vendor Support
Vendor support can be critical in data masking projects, whether it’s to overcome performance problems or to help customize a masking policy that delivers quality data without exposing sensitive information. The vendor can make the difference between a failed and a successful project.
Best Practice: During the POC, challenge the vendor to provide more than canned solutions. Request multiple masking alternatives for the data you’re testing. Mask large volumes of data and expect performance to be reasonable. Don’t follow the vendor’s path for a POC, but chart your own in a way that will give you comfort they are equipped and willing to help you when you need it.
5. Ease of Implementation & Use
Usually, you will not use a data masking solution every day. These solutions are also used by personnel who have many other tasks and come from varying backgrounds (DBAs, QA, security personnel, etc.). In other words, a solution that requires specific knowledge, has a steep learning curve, or has a non-intuitive interface will be difficult to adopt and is less likely to be used.
Best Practice: During the POC, ensure your personnel are driving, not the vendor. After an initial meeting with the vendor to set things up and show you around, have your personnel spend a few days trying to perform a few masking tasks without assistance.
6. Masking Evaluation
One of the challenges is knowing whether the masking policies are doing their job. You will especially want to know if all the data is well-masked. That is not a trivial question. It depends on the data you mask just as much as on the policy that masks it. Core Masking has a feature to help answer this critical question.
For example, replacing digits with other digits is only effective if all the values contain digits, and a sufficient number of them. Another example: in many masking products, fixing the seed or ensuring consistency comes at the undocumented cost of leaving a few values unmasked. These are just two examples, but how will you know whether you masked all your sensitive data?
Best Practice: During a POC, security is often tested by sampling some values and comparing them before and after. That is not a good test. Look for a way to ensure all the values are well masked. That may be challenging because of the statistical differences between consecutive masking runs. Again, if the solution can resolve this problem for you, that will be much easier.
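One way to check coverage yourself during a POC is to export the sensitive column before and after masking and flag any value that passed through unchanged. Below is a minimal sketch assuming row order is preserved between the two exports; the file and column names are placeholders.

```python
# Sketch: verify that no original value survived masking unchanged.
import csv

def unmasked_values(before_path, after_path, column):
    """Return (row number, value) for every value that is identical before and after masking."""
    leaks = []
    with open(before_path, newline="") as f1, open(after_path, newline="") as f2:
        before, after = csv.DictReader(f1), csv.DictReader(f2)
        for row_num, (b, a) in enumerate(zip(before, after), start=1):
            if b[column] and b[column] == a[column]:  # value passed through unchanged
                leaks.append((row_num, b[column]))
    return leaks

if __name__ == "__main__":
    for row, value in unmasked_values("customers_before.csv", "customers_after.csv", "ssn"):
        print(f"Row {row}: value not masked: {value!r}")
```

A check like this catches values the policy silently skipped, which sampling a handful of rows will usually miss.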
Distracting Features
The following capabilities may be valuable in some cases, but vendors usually use them to distract customers from what matters. We’ll explain each one and why it distracts you from your objectives.
![](https://bluecoreresearch.com/wp-content/uploads/sites/3/distracting_features2.png)
7. Dynamic Masking
Dynamic masking doesn’t change the data in the database; instead, it returns masked data for some, but not all, database queries. The use case is pretty narrow: it’s for when some applications connecting to the production database need access to columns with sensitive data, but only to a masked version of those columns. The masking usually applies to an entire application, not to particular end users.
Dynamic masking is only relevant to production databases. You must still secure the database since the sensitive data is still inside it.
Additionally, because dynamic masking is a real-time operation, it offers a small number of algorithms with limited capabilities. For example, replacing characters with stars. These algorithms cannot create realistic fake data.
Dynamic masking is a complicated and expensive solution unrelated to non-production databases, and it doesn’t eliminate the need to secure the database.
8. Discovery
Discovery usually revolves around finding sensitive data in the database. It’s a good feature to have, and most products include it. However, its ability to correctly identify all the data is pretty limited.
One method used by discovery is to scan the data in the database and look for values that resemble specific patterns. That has two limitations. First, the sensitive information must follow specific patterns; salaries, for example, are just numbers and cannot be identified this way. Second, it produces a high number of false positives.
Another method used is looking at column names. That has a high chance of missing sensitive data and could also come with many false positives.
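The sketch below illustrates pattern-based scanning and both of its blind spots. The patterns, thresholds, and sample values are illustrative assumptions only.

```python
# Sketch: pattern-based discovery and where it falls short.
import re

PATTERNS = {
    "ssn":         re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "credit_card": re.compile(r"^\d{13,16}$"),
}

def classify(sample_values):
    """Return the pattern names that most of the sampled column values match."""
    hits = set()
    for name, pattern in PATTERNS.items():
        matches = sum(1 for v in sample_values if pattern.match(v))
        if sample_values and matches / len(sample_values) > 0.8:
            hits.add(name)
    return hits

if __name__ == "__main__":
    print(classify(["123-45-6789", "987-65-4321"]))  # {'ssn'}: correctly flagged
    print(classify(["4111111111111111"]))            # {'credit_card'}: could just as easily be an order number (false positive)
    print(classify(["85000", "92500"]))              # set(): salaries look like plain numbers and are missed
```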
While discovery is a nice feature, it’s not critical, and you cannot rely on it to locate your data.
9. Provisioning & Deployment
Some solutions allow you to copy data from production to non-production or between non-production systems.
While this seems like a valuable feature, every DBA knows how to deploy a copy of a production database. DBAs have methods that are faster and well-tested, like restoring a backup or using specific database features. In our experience, DBAs do not rely on data masking features to make database copies.
Another reason this feature is distracting is that there are solutions that specialize in data pipelines, system provisioning, or various other types of data management. Those are not security tools, and they have a different scope and purpose than data masking.
In other words, copying data is related to operations, not security, and therefore distracts from the critical requirements relating to masking.
10. Compliance
Compliance sounds like the perfect feature. Wouldn’t it be wonderful if the masking solution supported your specific compliance requirement? The reality is that all data masking solutions are equally suited for compliance, and solutions that claim support for one requirement or another don’t do anything different. More than that, no compliance requirement except PCI specifies how to mask data. Even with PCI, good practice is to mask more aggressively than the minimal masking requirement in the regulation.
In other words, compliance is a marketing pitch, not an actual feature.
Best Practices
We listed evaluation best practices for each critical feature, but the bottom line is to do a proper POC. Do the tests yourself and avoid internet reviews or rankings. Your POC should validate that the solution does what you need from beginning to end without cutting corners. Beware of vendors that try to convince you otherwise, regardless of their reputation.
You’ll need to mask complicated test cases that involve, for example, consistency across columns or retaining the statistical distribution of values like gender or state. You’ll need to validate that all the values are masked, check performance, and more. Most importantly, ensure you can use the masked data and that the end customer is happy with the data quality.
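As an example of one such test case, here is a minimal sketch of deterministic (consistent) masking, where the same input always maps to the same replacement across columns, tables, and runs. The key and name pool are hypothetical placeholders, not any product’s method.

```python
# Sketch: deterministic masking so the same input always yields the same output,
# preserving joins and cross-column consistency in test data.
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # hypothetical key; keep it out of non-production
FIRST_NAMES = ["Alex", "Dana", "Jordan", "Morgan", "Riley", "Sam", "Taylor"]

def consistent_pick(value, pool):
    """Map a value to a pool entry using a keyed hash: stable across runs, not reversible without the key."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).digest()
    return pool[int.from_bytes(digest[:8], "big") % len(pool)]

if __name__ == "__main__":
    print(consistent_pick("Alice", FIRST_NAMES))  # same output every run
    print(consistent_pick("Alice", FIRST_NAMES))  # identical to the line above
    print(consistent_pick("Bob", FIRST_NAMES))    # a different input maps independently
```

When you test consistency like this, also confirm it does not come at the cost of leaving some values unmasked, as noted under Masking Evaluation.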
Final Thoughts
Choosing the right data masking solution is crucial to a successful masking project: one you can use regularly and that will maintain the security and privacy of your sensitive information. Data masking will allow you to do much more with the data you already have without increasing your risk.