Static Data Masking: Build or Buy?
Data masking can feel like an afternoon project until you hit the performance challenge. Learn why building your own solution can be a costly trap, where the real risks are hidden, and when you should consider building one yourself.

A potential customer recently admitted to us that they were debating whether to buy our masking solution or build one themselves. Personally, I usually prefer to build things myself. However, as a development manager and software architect with 30 years of experience, for data masking, my suggestion is “buy”.

Data masking can be tricky. It can feel like a simple project you can finish in one afternoon. Before you dive into a development project, let me share some of my experiences in data masking and other projects. It may help you avoid a costly investment with zero return, one that ends at the same purchasing decision point anyway. I’ll explain why, when you should consider building, and some of the pitfalls your team will encounter along the way.

And yes, there is a case where building is a good idea. We have an article about DIY Data Masking, including sample scripts to get you started. However, if your requirements stretch beyond what those free scripts can accomplish, building a solution will likely be a long and costly endeavor.

The first half of this article discusses the business and software management perspective, and the second focuses on the software architecture and technical details.

Scope

Before going into pragmatic and technical considerations, it’s good to be upfront about the scope of the efforts. A “professional” solution you purchase will have everything you expect from a solution, like a UI, authentication, an audit trail, etc. It will also benefit from the perception and credibility that a homegrown solution lacks.

When you think of a “build it” option, you’re probably not trying to compete with a solution you could purchase, but to save costs by using internal resources. You must therefore consider that the reduced scope of your project will make it harder to win compliance approval now, not to mention in the future.

However, it’s important to acknowledge that there will always be gaps. A solution you buy may not do 100% of what you need, and neither will one that you build. The question of scope is which will be closer, and how difficult or necessary it is to bridge those gaps. The reason commercial masking solutions seem more palatable is that their gaps are usually smaller.

Risks & Unknowns

How do you make a build vs. buy decision? It may seem like a simple exercise of balancing pros and cons, but when you start a development project, there are many unknowns. The first step is, actually, to understand these unknowns and the risks they pose.

Thinking there are no unknowns is a dangerous trap. Managing unknowns is an integral part of the headache of software management. Until you finish building something, you cannot predict all the technological or algorithmic complexities, the severity or nature of the bugs, or the performance and stability challenges ahead. Engineering a software project is about managing these risks, not foreseeing them.

Developers often think that “it’s just a small thing” and “we’ll get it done quickly”. Sometimes it’s true. But sometimes it’s like stepping into a puddle only to realize it’s a hole and you’re neck-deep in water. In a data masking project, you realize the water is deeper than expected when you start testing it for real. That’s when performance problems and missing functionality hit (see later in the article). It’s just difficult to predict how deep the water is, as it depends on the project.

Business & Software Management

Managing a software project is a bit tricky. Many managers, including me, have made mistakes. For me, at least, it’s when I go with how complicated a project “feels” and downplay core principles I know, like the 80/20 rule and hidden unknowns. Collecting solid requirements and thinking everything through helps me mitigate those mistakes.

Economy of Scale

There is a good reason why most of the software in the world is off-the-shelf software: software development is easily profitable only when you have multiple customers paying for it. When the costs are divided across many customers, the value proposition to each customer is higher than the price they pay. You essentially have 1,000 customers chipping in to build a large, full-featured solution.

When the full financial burden falls on a single customer, the value from that software must be unique. It’s very difficult to gain sufficient value when 100% of the cost falls on you instead of spreading it across 1,000 or more companies.

The 80/20 Trap

The 80/20 rule is a well-known law in software development, and it applies to many different things. For example, 20% of the effort builds 80% of the functionality, and the remaining 80% of the effort goes into the final 20%. In general, 80/20 means that the time, effort, and costs you predict will probably be 4 times higher. Depending on the project, 80/20 could end up as 90/10, 95/5, or, in extreme cases, the project can fail and never deliver. It is also part of the unknown, as that last 20% is usually partially known and partially hidden.

It may sound counterintuitive, and many times people respond with “just stop after the 80% and don’t finish the last 20%”, or “My experienced team can better predict the schedule and costs”. Yes, it’s all true. An experienced development manager could factor in the unknowns and know how to descope functionality that requires disproportionate efforts. However, that development manager would almost always advise you to buy a solution if one already exists.

Done Vs. Ready

A common mental glitch is the gap between “we finished development” and the point in time when users can actually start using it. Development managers sometimes refer to “feature complete” as the point when the product is done. In reality, that’s when testing begins, along with the long journey of getting things to work.

I clearly remember a conversation as a young development manager with my seasoned project manager. I thought a project should take 1-2 months, but she set the delivery date to 6 months. She explained that when I say we finished coding, that’s when we start QA. With two rounds of QA, matching bug-fixing rounds, and acceptance checklists, 6 months was a realistic timeframe.

At the time, it felt like a lot of wasted time, and I was confident we could finish sooner. As it turned out, my experienced project manager knew exactly what she was doing, and we delivered, on schedule, exactly when she predicted.

So, when you get estimates, dig deeper. Try to understand the deliverables, what will be missing, and whether it will be “production tested” or just “feature complete”. The difference can be hundreds of percent.

The Low-Hanging Fruit

Another take on 80/20 is that 80% of the functionality requires 20% of the effort, and the last 20% consumes most of the time. In Data Masking, for example, converting characters to other characters or to stars is simple. There are many simple things you could do with a one-line script.

However, some things are incredibly difficult to accomplish, and if you require any of them, that’s not something you should consider for a homegrown project. A good example is uniqueness and consistency: making sure masked primary keys are unique and that foreign keys still point to the masked values.

If you start a masking project, don’t ignore those small, difficult requirements. Without them, your project is useless, and accomplishing them can be a monumental task.

Maintenance

When it comes to development, one of the first things managers forget is maintenance. It’s funny because it’s the reason you pay annual maintenance fees to software vendors. After all, you need those regular security updates, bug fixes, and the support line to call when things break.

In software engineering theory, 20% of the time is spent building the code, and 80% maintaining it. Those ratios are used to explain why investing in building quality software is worthwhile (it reduces the maintenance burden that follows). For most software, that’s an understatement, and 90/10 is the likely reality.

As you can see in every piece of software you use, there are always patches and new versions, and it never reaches a “finished” state. Just like adopting a pet, you have to feed it and clean up after it forever. But why? There are many reasons, from bugs and security flaws to evolving requirements and more. Not to mention the natural evolution of the tech stack with new versions of PHP, Python, Java, jQuery, .NET, operating systems, etc. It’s an undeniable reality that as long as you use the software, you need to continue investing in it.

When to Build

There are several known scenarios when building is better than buying. Most don’t apply to data masking, but they are worth reviewing anyway:

  • Trivial requirements may suggest you should consider building the masking yourself. When a script can do what you need, you don’t need to purchase a solution. In our DIY data masking article, we give examples of masking you can perform using internal resources. If these scripts do what you need, there’s no reason to buy. However, if there is a functionality gap between them and your requirements, you should buy.
  • Unique functionality is the most common reason for full-blown internal development projects. When no product can do what you need, you have no choice. If you require the functionality, you must pay for the development. It may be costly, but it’s the cost of being in business.
  • Significant customization is worth considering when a solution is available for purchase but requires a significant investment to make it fit your requirements. However, most examples, like ERPs or ticketing systems, show that even with massive customizations, it’s cheaper to purchase and customize than to build from scratch.
  • Free / Open-source software. Unfortunately, there is no free comprehensive enterprise-class masking solution. But even if there were, open-source projects require significantly more effort to implement and support. That’s why IBM acquired Red Hat for $34 billion. Red Hat’s entire business model is to convert free software into something businesses would be willing to implement.
  • Control & dependency. When the software is critical to the business, an organization may choose to develop it internally to maintain control over the code and data structures and to eliminate dependency on external vendors. That also helps maintain a business differentiator and a competitive advantage over the competition. For example, a bank may choose to build its own banking software rather than purchase and customize an off-the-shelf bank management application. However, data masking isn’t part of your core business.

The bottom line is that just like you wouldn’t build a ticketing system or a version control system, you really shouldn’t build a data masking solution.

Building Risks

There are three major risks in any build project: Time, Costs, and Deliverables.

When you buy, you know what you’re getting, when you’re getting it, and how much it costs. There are very few unknowns. And the timeline, specifically, is very short.

When you build, the timeline will always be longer, but in addition to waiting, you also run the risk that the project will take even longer than planned. That is extremely common, so it’s a realistic expectation you must factor in. And when a project takes longer, the salaries you keep paying mean it’s also over budget. That, too, is very common and something you must expect.

Lastly, projects tend to descope functionality when they are tight on time. Depending on the functionality they de-scope, it might be an acceptable compromise or render the solution useless. You must consider that some of the critical functionality you require may be de-scoped due to time constraints or technical complexity.

Vendor Support

There are endless pitfalls in data masking. You are attempting to manipulate production data. You are trying to find the balancing point between security and utility. On the one hand, strict security and compliance requirements. On the other hand, development and QA teams must find the masked data useful.

Beyond the algorithmic challenge is also the technical challenge of making it happen. When you run into trouble because you have, for example, triggers, you’ll need people who have done this before and know what to do.

One of the vital reasons to pay for a masking solution is that it comes with support. Be certain you work with a vendor that will ensure your project is successful and will not fail you.

Software Architecture

Building a data masking solution seems like a no-brainer. Read the data, run a simple algorithm on it, and write it back masked. What could possibly go wrong?

Contrary to this gut feeling, static data masking does pose a few challenges. One is the technical mechanism of reading and writing, and the rest relate to the masking algorithms we apply.

Writing Performance

Reading the data is pretty straightforward, but writing it is a challenge. It’s difficult because updating a million rows requires a million updates, and that many updates can take days or weeks to run.

There are 3 possible architectures for building a data masking solution, each with its own set of pitfalls:

  • External Solution: Independent software that connects to the database and sends SQL statements (SELECT and UPDATE). The challenge is that millisecond latencies in SQL executions aggregate to impossibly long overall execution times.
  • Vertical Updates: a single SQL that updates all the rows in the table. This is an elegant bypass to the million-update problem, but it creates another challenge relating to functionality.
  • Internal Solution: Software running inside the database engine, written, for example, in PL/SQL in Oracle, or TSQL in SQL Server. This is an attempt to overcome the latency problem, but it creates an insurmountable performance challenge.

In this explanation, I’ll use some examples and terminology from the Oracle world, but equivalent examples and terminology exist in any database.

Ultimately, real data masking solutions always run as external software. The reason is that it’s the only way to run proper masking algorithms. However, the latency is a deal killer.

If you see a 10ms ping time, it means the network time alone will not allow you to run more than 100 updates per second. Even with a 1ms ping, there is the time it takes the database to receive the SQL and execute it. With 100 updates per second, it will take 2 hours and 45 minutes to run a million updates. If you have 20 columns to update, we’re talking about over 2 days. That is, actually, optimistic, and you could end up with a week to run your masking.

But there is a solution. It’s called array binding. A prepared statement can be submitted not just with a single set of bind variables, but with an array of those. Sending an array of bind variables for execution means that in a single round-trip, you can execute thousands of SQLs, rendering latency a non-issue.
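To see how much latency dominates, here is a back-of-the-envelope model. This is a sketch only: it counts network round-trips and nothing else, and the 10 ms latency and one-million-row figures come from the example above.

```python
import math

def masking_time_seconds(rows: int, latency_s: float, batch_size: int) -> float:
    """Estimate network time alone: one round-trip per batch of bound rows."""
    round_trips = math.ceil(rows / batch_size)
    return round_trips * latency_s

# One UPDATE per row at 10 ms latency: 10,000 seconds of pure network time
print(masking_time_seconds(1_000_000, 0.010, batch_size=1))

# Array binding 10,000 rows per round-trip: latency becomes a non-issue
print(masking_time_seconds(1_000_000, 0.010, batch_size=10_000))
```

With python-oracledb, for example, `cursor.executemany()` accepts a list of bind-variable tuples and submits the whole batch in one call, which is the array-binding behavior described here.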

Also important is how you access the rows. Using block locators such as ROWIDs can make updates run blazing fast. There are a few more tricks, like committing every so often to avoid long transactions, etc.

While all this is doable, it isn’t trivial database code and requires experience, experimentation, and sufficient testing to get it done right. However, a well-coded solution can run almost any masking logic and update a million rows in 9 seconds.

To avoid challenges like latency and row addressing, you can use a single vertical update. A vertical update is a single UPDATE statement that modifies multiple rows or, in this case, all the rows in the table.

The challenge with a vertical update is that the transformation must rely on built-in database functions. For example, using a TRANSLATE function in an Oracle UPDATE statement with a few DBMS_RANDOM operations can process a million rows in about 54 seconds. It’s a lot slower than external software, but still within an acceptable range. The catch is that more complex transformations will take disproportionately longer.
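For illustration, here is a Python analogue of that kind of TRANSLATE-based masking. The substitution alphabet is arbitrary, and a real vertical update would run inside the database, e.g. `UPDATE customers SET ssn = TRANSLATE(ssn, '0123456789', '5728031946')` (table and column names hypothetical).

```python
# Python analogue of Oracle's TRANSLATE: a fixed character-for-character substitution.
# The substitution alphabet below is arbitrary; a real vertical update would combine
# TRANSLATE with DBMS_RANDOM so the mapping isn't fixed and guessable.
SRC = "0123456789"
DST = "5728031946"
TABLE = str.maketrans(SRC, DST)

def translate_mask(value: str) -> str:
    """Substitute each digit; characters outside SRC pass through unchanged."""
    return value.translate(TABLE)

print(translate_mask("555-01-2345"))  # 333-57-2803
```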

However, building an internal solution is much worse. Writing a PL/SQL function with a loop that iterates through the string (with the same TRANSLATE logic) increases the time to 9 minutes. Implementing it in a PL/SQL stored procedure instead of a function yields similar 9-minute results. While an internal solution isn’t limited by built-in database functions, complex PL/SQL code can take forever to run.

If we take the external solution as a baseline, here are the performance implications in the test we performed on a particular Oracle instance:

Method                                  Time        % Baseline   Comment
External Solution                       9 seconds   100%         Well-coded external masking software
Vertical Update                         54 seconds  600%         Using TRANSLATE
Internal Solution (SQL Function)        9 minutes   6,000%       Simple loop over a short string
Internal Solution (Stored Procedure)    9 minutes   6,000%       Simple loop over a short string

Note that these benchmarks were on a particular Oracle installation with a specific setup. Different Oracle databases, not to mention other types of databases like SQL Server, will result in dramatically different measurements. I suggest you perform your own benchmarks in the type of environments you plan to mask.

However, the general concept remains:

  • If you can overcome the latency issue, running the masking logic outside the database is the fastest.
  • Using a vertical update that relies on built-in database functions is slower, but it works well for a simple DIY project. It can deliver limited functionality, which may be acceptable.
  • Using internal software coded in the database scripting language (whether as a function or a procedure) is incredibly slow, and complex logic will make it impossibly slow.

Again, as the masking logic becomes more complex, the vertical update and internal solution become unacceptable paths.

Masking Algorithms

A DIY project is limited by the type of masking algorithms it can implement. And while complex algorithms aren’t needed for every field, the solution must still support them even if only a single field requires one.

Here are some examples of complex algorithms:

Uniqueness / Primary key: If one of the fields that requires masking must be unique, function as a primary key, or have a unique index, then value generation must include collision detection and re-masking. In other words, when a unique column is masked, the process must ensure that the same masked value doesn’t repeat. A “large value” is unlikely to help because of the birthday paradox:

Key Length   99% chance of a collision
8 digits     30,000 rows
9 digits     100,000 rows
10 digits    300,000 rows
11 digits    1,000,000 rows
12 digits    3,000,000 rows

Masking a social security number (9 digits) will almost guarantee a collision in 100,000 rows, and randomly changing a phone number (10 digits) will produce a duplicate within 300,000 rows.
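These figures follow from the standard birthday-paradox approximation, P(collision) ≈ 1 − e^(−n²/2N), for n masked values drawn at random from a space of N possible keys:

```python
import math

def collision_probability(rows: int, key_space: int) -> float:
    """Birthday-paradox approximation: chance at least two random keys collide."""
    return 1.0 - math.exp(-(rows ** 2) / (2 * key_space))

# 9-digit keys (like SSNs), 100,000 rows: collision is almost certain
print(round(collision_probability(100_000, 10 ** 9), 3))   # 0.993

# 10-digit keys (like phone numbers), 300,000 rows
print(round(collision_probability(300_000, 10 ** 10), 3))  # 0.989
```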

To ensure uniqueness, a solution must maintain its own value collision detection algorithm or rely on the database to detect duplicates and remask if there’s an error.
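A minimal sketch of the first approach, with an in-memory collision check and re-masking on conflict. It assumes the whole key set fits in memory, and `mask_unique` and its 9-digit key format are illustrative, not a real product's algorithm.

```python
import random

def mask_unique(values, digits=9, seed=None):
    """Mask each value with a random key, re-drawing whenever a collision occurs."""
    rng = random.Random(seed)
    used = set()     # in-memory collision detection
    masked = {}
    for value in values:
        while True:
            candidate = rng.randrange(10 ** digits)
            if candidate not in used:            # collision check; re-mask on conflict
                used.add(candidate)
                masked[value] = f"{candidate:0{digits}d}"
                break
    return masked

result = mask_unique(["111-11-1111", "222-22-2222", "333-33-3333"], seed=42)
assert len(set(result.values())) == len(result)  # masked keys are unique
```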

Consistency / Foreign key: To ensure referential integrity, certain values must be masked the same way in different columns and, potentially, different databases.

The only way to do this correctly is to use a masking dictionary that maps the original values to the masked values. This dictionary must be referenced for every value before applying the masking algorithm.
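The dictionary itself can be sketched as a memoized lookup. The `nine_digit` mask function here is an arbitrary placeholder; a real masking dictionary would be persisted and shared across runs, columns, and databases.

```python
import hashlib

class MaskingDictionary:
    """Maps each original value to one masked value, reused everywhere it appears."""
    def __init__(self, mask_fn):
        self._mask_fn = mask_fn
        self._mapping = {}

    def mask(self, value):
        if value not in self._mapping:   # first occurrence: compute and record
            self._mapping[value] = self._mask_fn(value)
        return self._mapping[value]      # every later occurrence is consistent

# Placeholder mask function: hash-derived 9-digit number
def nine_digit(value: str) -> str:
    digest = hashlib.sha256(value.encode()).hexdigest()
    return f"{int(digest, 16) % 10 ** 9:09d}"

md = MaskingDictionary(nine_digit)
# A primary key in one table and a foreign key in another get the same masked value
assert md.mask("123-45-6789") == md.mask("123-45-6789")
```

Note that it is the dictionary, not the mask function, that guarantees consistency: if a collision ever forces a re-mask, the replacement value is what gets recorded and reused.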

There is a common flawed alternative of relying on deterministic masking. This alternative breaks when the predicted masked value collides in the primary key and requires re-masking. In that case, the foreign key reference wouldn’t know about the alternate value used in the primary key.

Data generation: In many cases, there’s a need to generate realistic textual data, for example, first names, last names, full names, email addresses, street addresses, and more. Data generation relies on name dictionaries and patterns used to combine them (usually with weights that determine the frequency of each pattern).
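A toy sketch of weighted, pattern-based generation. The dictionaries, patterns, and weights are illustrative; production dictionaries contain thousands of entries.

```python
import random

# Tiny illustrative dictionaries (real solutions ship large ones)
FIRST = ["James", "Maria", "Wei", "Aisha"]
LAST = ["Smith", "Garcia", "Chen", "Khan"]

# Patterns with weights controlling how often each shape appears
PATTERNS = [
    ("{first} {last}", 0.8),          # most full names: "First Last"
    ("{first} {last}-{last2}", 0.2),  # some: hyphenated last names
]

def generate_full_name(rng: random.Random) -> str:
    """Pick a pattern by weight, then fill it from the name dictionaries."""
    pattern = rng.choices([p for p, _ in PATTERNS],
                          weights=[w for _, w in PATTERNS])[0]
    return pattern.format(first=rng.choice(FIRST),
                          last=rng.choice(LAST),
                          last2=rng.choice(LAST))

print(generate_full_name(random.Random(7)))
```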

Limited Data Sets: Some columns may only have a limited number of valid values. For example, a gender (male or female), a country name, a country code number, etc. While it is possible to mask limited data sets even with vertical updates, the performance will usually be extremely poor for large tables. Applying weights to ensure the masked data follows certain ratios makes this far more complicated.

Composites: In some cases, multiple columns must be masked together. For example, in the case of a city, state, and zip code. By masking them together, the masked data can retain the relationships between the columns. If the application requires the masked data to preserve such relationships (e.g., ensure a zip code matches the city and state), then this is essential.
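A sketch of the composite approach for city/state/zip. The location list is illustrative; the point is that by masking the triple as one unit, the relationships survive by construction.

```python
import random

# Masking city, state, and zip as one unit keeps them mutually consistent.
# The location list is illustrative; real solutions ship full geographic data.
LOCATIONS = [
    ("Springfield", "IL", "62701"),
    ("Austin", "TX", "73301"),
    ("Portland", "OR", "97201"),
]

def mask_location(rng: random.Random):
    """Replace the whole triple so the zip still matches the city and state."""
    return rng.choice(LOCATIONS)

city, state, zip_code = mask_location(random.Random(1))
assert (city, state, zip_code) in LOCATIONS  # relationship preserved by construction
```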

Pattern Profiling: In certain cases, it’s advisable to profile the original data so that the generated masked data will be similar to the original data. For example, to determine the credit card patterns or phone number patterns that exist in the original data, as well as their frequency.
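Profiling can be sketched as reducing each value to its "shape" and counting frequencies. The digit/letter encoding below is one common convention, not a standard.

```python
from collections import Counter

def profile_pattern(value: str) -> str:
    """Reduce a value to its shape: digits become '9', letters become 'A'."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c
                   for c in value)

phones = ["(212) 555-0100", "(415) 555-0123", "212-555-0199"]
profile = Counter(profile_pattern(p) for p in phones)
print(profile.most_common())
# [('(999) 999-9999', 2), ('999-999-9999', 1)]
```

The resulting pattern frequencies can then drive generation, so the masked data exhibits the same mix of formats as the original.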

Dates as Strings: Some datasets store date and time information as textual strings. In such cases, the textual data must be converted to some form of internal “date/time” data, masked, and then converted back into the original format.
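A minimal sketch of that round-trip for dates stored as text. The format string and the fixed-shift masking rule are illustrative; real solutions support many formats and smarter date transformations.

```python
from datetime import datetime, timedelta

def mask_date_string(value: str, fmt: str, shift_days: int) -> str:
    """Parse the textual date, mask it, and render it back in the original format."""
    parsed = datetime.strptime(value, fmt)        # text -> internal date
    masked = parsed + timedelta(days=shift_days)  # mask (here: a fixed shift)
    return masked.strftime(fmt)                   # internal date -> original format

print(mask_date_string("1985-03-14", "%Y-%m-%d", shift_days=100))  # 1985-06-22
```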

LOBs: In some datasets, sensitive data is stored in binary format. Pictures, fingerprints, and other biometric data are perfect examples. This type of data is usually stored as LOBs (Large OBjects), images, binary, etc.

Final Thoughts

When accounting for the cost of development, testing, maintenance, and support, DIY is much more expensive than it initially seems. Piling on the uncertain delivery timeline, risks, and limited capabilities makes it even less attractive. A DIY masking project is only viable if it’s a simple script that can be implemented in a matter of hours or days. Something you can easily discard if you realize you need more. Any longer timeline will likely result in a failed project that still requires you to purchase a solution.

Take some of the details in this article and test them against your intended project lead:

  • Do they have a list of requirements mapped against all the data you need to mask?
  • What is the timeline for them to deliver a tested and production-ready solution that fulfills those requirements?
  • Have they evaluated the performance of their proposed architecture?
  • Have you factored in long-term maintenance, support, and evolving future requirements?

The math of running internal development never adds up when there are off-the-shelf solutions that do what you need. Otherwise, those off-the-shelf solutions wouldn’t exist, and everyone would opt to build internally. Sooner or later, internal developments hit a ‘complexity wall’ or a ‘maintenance wall’, and it becomes cheaper to cut your losses and just buy a license. If that is the likely eventuality, it may be easier to skip the DIY phase.

The final question might be: “Is it worth it?” Aren’t the headache and resources you’d pour into a solution you could buy better spent on pursuing your core business?

If you have a question or a comment, please let us know. We’ll be happy to hear from you.