How to become (and stay) GDPR compliant in AI & ML models

Jonathan Whiteside
Global SVP Technology & Engineering
Length: 5 min read
Date: 9 January 2023

It’s one thing to get to GDPR compliance, and it’s another thing to stay in compliance, especially when your data is big, unstructured, and changing by the minute. This is a common challenge for data science teams around the world, and without systems and strategies in place, you could become subject to a hefty GDPR fine.  

This is a real challenge for AI and machine learning teams because they can use terabytes of data, in various formats, to create their models. And in many organisations, data scientists want the autonomy to create datasets as needed, without oversight.

The problem is that Personally Identifiable Information (PII) lurks within all that self-managed, ever-changing data, and this keeps teams out of compliance with GDPR.

Asking data scientists to self-manage PII is unfeasible, as is the notion of limiting them from creating datasets and models. 

A tailored solution: Automated and flexible

To design a data governance system for AI and machine learning teams, there are a few constraints to consider:

  • Don’t interfere with your team’s workflow or create more overhead or work for them
  • Any solution needs to be flexible enough to handle data that changes daily 
  • Get to compliance, and then stay in compliance, so you can pass audits
[Graphic: compliance in AI/ML models]

How it works: Scan and remove

This solution has three phases, all of which can be scheduled, and it requires minimal intervention from your data science team. It also keeps your team in compliance, even as you create new datasets that may contain PII.

1. Scan all datasets

The first step in the process is to crawl through and compile a list of every single dataset (this could be a table or any type of file) that exists in your data science team’s world. 

The benefit of this compilation is that it gives you a starting point for a data registry. It is generally useful for your team to know the sizes and types of all the datasets you work with, and when additional compliance laws or policies crop up, you now have a central place to look.

Often, teams don’t even know how much data they have, or in which formats (Parquet, Hive, JSON, CSV, etc.).
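As a rough illustration of the scan step, here is a minimal Python sketch that walks a directory tree and compiles a simple registry of dataset files. It assumes your datasets live on a filesystem; the `DatasetRecord` type, the `KNOWN_FORMATS` mapping and the `build_registry` function are hypothetical names for this example, and tables in a warehouse or Hive metastore would need their own crawler.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class DatasetRecord:
    path: str        # location of the file, relative to the scanned root
    fmt: str         # format inferred from the file extension
    size_bytes: int  # size on disk


# Hypothetical mapping of extensions to format names; extend as needed.
KNOWN_FORMATS = {".parquet": "parquet", ".json": "json", ".csv": "csv"}


def build_registry(root: str) -> List[DatasetRecord]:
    """Walk `root` and return one record per recognised dataset file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            fmt = KNOWN_FORMATS.get(full.suffix.lower())
            if fmt is None:
                continue  # skip files that are not a recognised dataset format
            records.append(
                DatasetRecord(
                    path=str(full.relative_to(root)),
                    fmt=fmt,
                    size_bytes=full.stat().st_size,
                )
            )
    # Sort by path so repeated scans produce a stable, comparable listing.
    return sorted(records, key=lambda r: r.path)
```

Running a scan like this on a schedule and storing the output gives you the central listing that the later phases work from.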

2. Flag all datasets that contain PII

Once you have your list, the next step is to comb through the datasets and identify which of them contain PII. Once identified, save the metadata for the flagged datasets, and log and store it for audit purposes.
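To make the flagging step concrete, here is a minimal sketch that checks a dataset's text for PII-like patterns and produces metadata you could append to an audit log. The two regular expressions are deliberately simple stand-ins, not a complete PII detector, and the names `PII_PATTERNS`, `scan_text_for_pii` and `append_audit_log` are assumptions made for this example.

```python
import json
import re
from datetime import datetime, timezone

# Deliberately simple, illustrative patterns; real PII detection needs far more.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scan_text_for_pii(dataset_name: str, text: str) -> dict:
    """Return audit metadata describing which PII patterns matched in `text`."""
    counts = {kind: len(p.findall(text)) for kind, p in PII_PATTERNS.items()}
    counts = {kind: n for kind, n in counts.items() if n > 0}
    return {
        "dataset": dataset_name,
        "contains_pii": bool(counts),
        "pii_counts": counts,  # counts only; the matched values are not stored
        "scanned_at": datetime.now(timezone.utc).isoformat(),
    }


def append_audit_log(log_path: str, entry: dict) -> None:
    """Append one JSON line per scan so the log can be kept for audits."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
```

Note that the metadata records only how many matches were found, so the audit log does not itself become another store of PII.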

3. Remove or redact PII from flagged datasets

Now you can remove your PII. This entails taking the list of flagged datasets and removing from each one the records of every user who must be removed under GDPR. You’ll also need to log the removal process for audit purposes.
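As one possible shape for the removal step, the sketch below drops every row belonging to a given set of user IDs from a CSV dataset and logs how many rows were removed. It assumes the dataset has an identifier column (here hypothetically called `user_id`) and that the list of users to remove comes from elsewhere in your process; `remove_users` is an illustrative name, not part of any library.

```python
import csv
import io
import logging

logger = logging.getLogger("pii_removal")


def remove_users(csv_text: str, user_ids: set, id_column: str = "user_id"):
    """Return (cleaned CSV text, number of rows removed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    removed = 0
    for row in reader:
        if row[id_column] in user_ids:
            removed += 1  # drop this row: it belongs to a user being removed
            continue
        writer.writerow(row)
    # Log counts only, so the audit trail does not contain the identifiers.
    logger.info("removed %d rows for %d user ids", removed, len(user_ids))
    return out.getvalue(), removed
```

For example, `remove_users("user_id,score\nu1,0.9\nu2,0.4\n", {"u1"})` returns the CSV text with only the `u2` row left, plus a removed count of 1.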

As a note, there are techniques where you can be in compliance without completely removing PII, such as obfuscation. Whether you completely remove or obfuscate depends on your team’s use case and how you use your data. 
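One common form of obfuscation is to replace identifiers with a keyed hash, so that records can still be joined and counted without exposing the original value. The sketch below uses HMAC-SHA256 from the Python standard library; the `pseudonymize` and `obfuscate_rows` names are hypothetical, and whether this technique is sufficient for your use case is a question for your legal review.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest (64 hex characters)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def obfuscate_rows(rows, pii_fields, secret_key):
    """Return copies of `rows` with the values of `pii_fields` pseudonymized."""
    return [
        {
            key: pseudonymize(val, secret_key) if key in pii_fields else val
            for key, val in row.items()
        }
        for row in rows
    ]
```

Because the same input and key always give the same digest, models and joins that rely on a stable identifier keep working, while anyone without the key cannot recompute the mapping from a known email address.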

To the finish line: GDPR audits and beyond

Putting this automated system into place gives you several advantages, all without forcing a change of workflow or adding onerous restrictions to your way of working.

First, in the short term, you can sail through an audit and legal review process and focus on the actual work of machine learning and creating models while your data automatically stays in compliance.

Second, this framework puts you in a good position to pass future audits. Audits occur regularly, and rather than having to scramble for every legal review, this framework makes it easy to document and provide evidence of compliance for subsequent audits.

Finally, the system also puts you in a strong position for any future laws or changes to current laws. The automated system serves as a foundation, and any changes or additional compliance requirements can be built on top of the current solution.

The future of data compliance

GDPR is not going away anytime soon. In fact, additional laws like CCPA are being added to the list of regulations your company may need to follow.

It’s best to have an established, flexible framework that can be tweaked to handle new laws. That way you can continue working on your core business, rather than scrambling to adjust to a new set of compliance laws each time they come out.

If you need help with compliance or any sort of data strategy, reach out to the data analysts and engineers at DEPT®.