The Challenge: What Data Do I Have?
Most organizations face a critical compliance and security roadblock: they don’t actually know where their sensitive data lives. With AI, however, you can now easily scan an entire database schema in a few minutes or less.
The security roadblock: Locating sensitive information is an absolute prerequisite for data masking, and is incredibly valuable for all database security initiatives, activity control, and even application security.

The compliance landmine: Without a clear map of the sprawl of sensitive data in your organization, you don’t know the risk level of each database, what security controls are necessary, or even which compliance regulations (like GDPR, HIPAA, SOX, or PCI-DSS) you are currently violating.
This is all solvable in under 5 minutes and for free. It sounds a little unbelievable, but it’s simply the power of AI.
The AI Method: Turning Schema into Intelligence
We’ll later discuss traditional methods and why they fall short. The AI method is different and far more powerful. It involves a two-step process:
- Extract the Schema Layout: Have your DBA run a small, non-intrusive, quick script that reads the data dictionary and exports your table and column names (metadata), but not the actual data records.
- AI Classification: You feed that exported schema layout into an AI with a specialized prompt that instructs the Large Language Model (LLM) to look for columns that could contain sensitive information.
By examining the structure of your database schema, the AI can evaluate each column in every table, connect the dots, and identify columns you might consider sensitive.
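As a sketch of step one, the snippet below uses SQLite so it is self-contained and runnable; in practice, a DBA would run an equivalent query against Oracle’s ALL_TAB_COLUMNS or SQL Server’s INFORMATION_SCHEMA.COLUMNS. The table and column names here are illustrative. Note that only metadata is exported, never row data:

```python
import sqlite3

# Build a tiny demo database standing in for a production schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T_105 (NAME TEXT, DOB TEXT, USER_ID_EXT TEXT);
    CREATE TABLE PAYROLL (EMP_ID INTEGER, AMOUNT REAL);
""")

def export_schema_layout(conn):
    """Return table and column names only -- metadata, never data records."""
    layout = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns one row per column; index 1 is the name.
        cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        layout.append(f"{table}: {', '.join(cols)}")
    return "\n".join(layout)

print(export_schema_layout(conn))
# T_105: NAME, DOB, USER_ID_EXT
# PAYROLL: EMP_ID, AMOUNT
```

The resulting text block is what you paste, together with the prompt, into the AI.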
There are two reasons why this AI method is one of the most powerful approaches to sensitive data discovery:
- The AI understands context. For example, in a money-management schema, in a table dealing with transactions, the ‘amount’ column refers to money that changed hands. In a payroll schema, ‘amount’ could be the salary. But in a schema dealing with atmospheric data, in a table relating to rainfall, ‘amount’ is the number of millimeters of rain. Another example is a table named T_105 with columns that include NAME and DOB. The AI understands that DOB is the Date of Birth, so the table contains PII and should be carefully scrutinized.
- The AI understands what sensitive means. You don’t have to explain that transactions, salaries, credit cards, PII, healthcare data, and thousands of other pieces of information might be sensitive; the AI figures it out. In our tests, the AI often highlighted data that could be treated as sensitive that we had never even considered.
A traditional Regex method cannot do any of this. While Regex Data Discovery solutions may cost 6 figures, they cannot compete with what an AI can do for free. Even with a long list of Regex search keywords, the AI will win hands down. The AI will understand or infer table and column names using any term or abbreviation in any language. It will deduce what they mean based on context, how they might be related to data security, and which compliance regulations may apply. AI simply renders this type of Regex search obsolete.
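To illustrate the gap, here is a minimal keyword-based column-name scan of the kind a traditional tool might run. The keyword list and column names are hypothetical:

```python
import re

# Hypothetical keyword list of a traditional column-name scanner.
KEYWORDS = re.compile(r"(name|phone|ssn|salary|email)", re.IGNORECASE)

columns = ["CUST_NAME", "DOB", "NUM_TEL", "USER_ID_EXT", "AMOUNT"]
flagged = [c for c in columns if KEYWORDS.search(c)]
print(flagged)  # ['CUST_NAME'] -- misses the abbreviated DOB, the
# non-English NUM_TEL, the renamed USER_ID_EXT, and the
# context-dependent AMOUNT.
```

An AI reading the same list would recognize every one of those columns as potentially sensitive from context alone.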
You could develop the prompt and the scripts to export the schema layout yourself. It is not very difficult. However, we’ve already done the work for you and tested it. Just enter your email in the sidebar to get it in your mailbox.
The Power of the Prompt
We created a “good for everything” prompt designed to identify any type of sensitive info (PII, PHI, financial records, etc.). In our tests, we discovered that once the AI gets an idea of what you’re looking for, it will find pretty much anything.
However, if you know the type of data you’re looking for, or if your organization has unique requirements (such as project names, part numbers, etc.), you can easily extend the prompt by adding more explanations and examples of what to look for.
Because the AI understands language rather than just matching patterns, you can easily add to the list of examples: “Proprietary data: blueprints, part numbers, project codes”.
Also, if you need the report in another language, you add, for example, “Write the output in Spanish.”
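As an illustration of how such extensions can be assembled programmatically, the sketch below builds a prompt from a hypothetical base instruction (not the tested prompt offered by email), optional organization-specific examples, and an output language:

```python
# Hypothetical base instruction; the real tested prompt is more elaborate.
BASE_PROMPT = (
    "You are a data security analyst. Review the database schema below "
    "and list every column that could contain sensitive information "
    "(PII, PHI, financial records, credentials). For each, give the "
    "table, column, suspected data type, and applicable regulations."
)

def build_prompt(schema_text, extra_examples=None, language=None):
    parts = [BASE_PROMPT]
    if extra_examples:  # organization-specific categories
        parts.append("Also treat the following as sensitive: "
                     + "; ".join(extra_examples))
    if language:
        parts.append(f"Write the output in {language}.")
    parts.append("Schema:\n" + schema_text)
    return "\n\n".join(parts)

prompt = build_prompt(
    "T_105: NAME, DOB, USER_ID_EXT",
    extra_examples=["Proprietary data: blueprints, part numbers, project codes"],
    language="Spanish",
)
```

Because the extensions are plain language rather than patterns, adding a new category is a one-line change.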
Trusting AI: Security and Privacy
When using AI, you must be careful about what data you’re uploading and where your data is going.
This method never requires uploading your actual data records (rows). You are only exporting and uploading schema information (metadata), which significantly lowers the risk profile of using AI.
However, even just the structure of your database (table and column names) can be sensitive and reveal critical information about your application, the data you manage, and potentially, help penetrate your security.
There are three primary categories of AIs you could use:
- Company Internal AI: If your organization has an internal instance of a model, this is the ideal place to run these checks. Ensure it has a large enough context length to support your schema size.
- Paid / Enterprise AI: Usually the safest bet. Most paid tiers guarantee that your input is kept confidential and not used to train their models.
- Free AI: Many vendors allow you to improve privacy and opt out of data training. Check the privacy settings and terms of service of the AI you intend to use. For example, Google Gemini states (at the time of this writing) that they do not use your data for training if you turn off your history. In general, using a temporary chat with no history is advisable for sensitive data.
If your organization has an internal AI security policy, please follow it.
The Human Element: Discuss Your Findings
AI is a powerful starting point, but we recommend using it as a basis for a conversation. Once you have your AI-generated list of sensitive columns, take it to your Database Administrators (DBAs) and Application Owners.
The conversation serves two purposes:
- Validation: Verify that the columns you identified contain sensitive information, eliminate false positives, and resolve columns that are in question.
- Expansion: Once a DBA understands what you’re looking for, they may say something like, “Oh, if you care about that, you should also look at the ‘Legacy_Archive’ table, where we renamed the SSN column to USER_ID_EXT.”
In your conversation, actively seek:
- Additional sensitive data. They may know of other columns in the same tables, additional tables, or types of data you didn’t think about.
- Additional databases that may contain this type of data. It could be replicated copies, related databases used by the same applications, or other databases they used or heard about.
- Copies of this data. They could probably tell you when data is exported to spreadsheets or database copies given to development, QA, training, etc.
- Shadow Data. Ask the DBAs about “Temp”, “Backup”, or “Archive” tables created during migrations, saved when performing maintenance, or used for long-term retention. They often contain forgotten sensitive data.
This collaborative step turns a good list into a comprehensive, bulletproof audit. It also transforms DBAs and application administrators from gatekeepers into Data Stewards. By giving them an AI-generated head start, you are making this a quick review rather than an endless task. The turnaround will be faster, and the results more accurate and comprehensive.
Alternative Methods: How Does AI Compare?
| Method | How it Works | Pros | Cons |
|---|---|---|---|
| Column Name Regex Scan | Search column names for keywords like “Name” or “Phone”. | The traditional method you might already have in an existing solution. It has been superseded by the AI method below. | Can miss abbreviations, non-English names, etc. Lacks context from the table name and the names of other columns in the table. Column names may be inaccurate or non-descriptive. |
| Column Name AI Scan | Use AI to understand the schema’s meaning. | Powerful. Understands abbreviations, context, and intent. Free and fast. | AI security/privacy considerations. Schema names (tables and columns) may be non-descriptive or obsolete. |
| Data Scan | Performs pattern matching on actual data samples. | Complementary. Finds data even if the column names are inaccurate, non-descriptive, or misleading. | Cannot identify data that doesn’t have a distinctive pattern. For example, a salary is just a number with no special pattern. Tends to produce a high false positive rate. |
| SQL Analysis | Analyze SQLs with AI, leveraging the way data is joined and utilized. | Extremely Powerful. Provides the most context, conveying the meaning and usage of the schema. | Requires an Activity Control solution (like Core Audit) that can capture and store all the application SQL history. That is something you could use if you deployed an Activity Control solution, but not something you would deploy to locate sensitive information. |
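To make the Data Scan row concrete, here is a minimal sketch of one such pattern rule: a credit-card-like regex combined with a Luhn checksum to reduce false positives. Note how a salary value offers nothing to match:

```python
import re

# A credit-card-like pattern: 13-16 digits, optionally separated.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_ok(digits):
    """Luhn checksum, used to weed out random digit runs."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def scan_value(value):
    m = CARD_RE.search(value)
    return bool(m and luhn_ok(re.sub(r"[ -]", "", m.group())))

print(scan_value("4111 1111 1111 1111"))  # True: a well-known test card number
print(scan_value("84500"))                # False: a salary is just a number
```

This is why data scans complement, rather than replace, schema-level discovery: they catch mislabeled columns with distinctive patterns but are blind to pattern-free values.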
What’s Next?
- Run the Check: Don’t wait. Use our scripts today to discover sensitive data in your database. It is easy, fast, and free. You can also discuss your findings with the relevant DBAs and application managers.
- Assess Risk: Determine which compliance regulations apply to your findings and the damage potential of this data leaking or being modified (rendering it unreliable).
- Mask Non-Production Data: This is your most immediate next step. Ask your DBAs about copies of this sensitive data (in dev, test, or training environments). That data is the most exposed, with no security at all. Remove sensitive data and retain data quality with a static data masking solution like Core Masking.
- Secure Production: Monitor and control who touches sensitive data with an Activity Control solution like Core Audit. Also, ensure your DBAs have a list of the tables and columns that should require special authorization to grant access (Least Privileged Access).
Ready to start? Enter your email in the sidebar and receive the Oracle/SQL Server scripts and the AI prompt directly to your inbox.
