This repository was archived by the owner on Aug 2, 2025. It is now read-only.

A Python library designed to detect and remove Personally Identifiable Information (PII) from CSV files.

ajschofield/gdpr-obfuscator

GDPR Obfuscator - Launchpad Project

Overview

A Python library designed to detect and remove Personally Identifiable Information (PII) from CSV files stored in an AWS S3 bucket.

Minimum Viable Product (MVP)

The MVP covers:

  1. Reading a JSON string containing the S3 location of the CSV file and the names of the fields that are required to be obfuscated
  2. Ingesting the CSV file containing data records (with a primary key) from an AWS S3 bucket
  3. Obfuscating chosen PII fields (e.g. name, email_address) by replacing their values with an obfuscated string (***)
  4. Returning the obfuscated data as a byte-stream that maintains the original structure but with sensitive fields changed

This supports the General Data Protection Regulation (GDPR) requirement that data which can be used to identify an individual is anonymised.
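
Steps 3 and 4 above can be sketched with the standard library alone. The function below is an illustration, not the library's implementation: the name `obfuscate_csv`, its parameters, and the decision to reject unknown field names are all assumptions made for this example.

```python
import csv
import io


def obfuscate_csv(data: bytes, pii_fields: list[str], mask: str = "***") -> bytes:
    """Replace the values of the given columns with a mask and return CSV bytes.

    Hypothetical helper for illustration; not part of the gdpr_obfuscator API.
    """
    reader = csv.DictReader(io.StringIO(data.decode("utf-8"), newline=""))
    fieldnames = reader.fieldnames or []

    # Fail early if a requested field is not a column in the file (an assumption
    # of this sketch; the library may handle this case differently).
    missing = [field for field in pii_fields if field not in fieldnames]
    if missing:
        raise ValueError(f"Fields not found in CSV header: {missing}")

    out = io.StringIO(newline="")
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Every value in a PII column is masked; other columns are left untouched.
        for field in pii_fields:
            row[field] = mask
        writer.writerow(row)
    return out.getvalue().encode("utf-8")


sample = b"student_id,name,email_address\n1,Jane Doe,jane@example.com\n"
print(obfuscate_csv(sample, ["name"]).decode("utf-8"))
```

The header row and column order are preserved, so the output keeps the structure of the input with only the selected fields changed.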

Setup

Prerequisites

  • Python >= 3.13
  • Poetry >= 2.0.1

Installation

There are two ways to install the package:

Source

git clone https://github.com/ajschofield/gdpr-obfuscator.git
cd gdpr-obfuscator
poetry install

Prebuilt Package

Download the latest release from the repository's releases page and install it using pip:

# The package filename may differ from the one shown below
pip install gdpr_obfuscator-0.1.0-py3-none-any.whl

Usage

The Obfuscator class can be imported directly into your Python code. Once instantiated, you can call either the process_s3 or the process_local method. Each method takes a JSON string as input, which must contain file_path and pii_fields.

{
    "file_path": "s3://bucket-name/file-name.csv",
    "pii_fields": ["name", "email_address"]
}

Both methods return a byte-stream containing the obfuscated data, which can be passed to the put_object method in the boto3 library to upload the data back to S3.

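import json

from gdpr_obfuscator import Obfuscator

# Build the JSON request string (named `request` to avoid shadowing the built-in `input`)
request = json.dumps({
    "file_path": "s3://bucket-name/file-name.csv",
    "pii_fields": ["name", "email_address"]
})

obfuscator = Obfuscator()
result = obfuscator.process_s3(request)

# The result is a byte-stream, so decode it before printing
print(result.decode("utf-8"))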
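
To write the returned bytes back to S3, something along the following lines should work. This is a sketch under assumptions: `split_s3_uri` and `upload_result` are hypothetical helpers that are not part of this package, and the destination URI is a placeholder. The only boto3 call relied on is the S3 client's `put_object` with its `Bucket`, `Key` and `Body` arguments.

```python
def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key). Hypothetical helper."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI must include a bucket and a key: {uri!r}")
    return bucket, key


def upload_result(s3_client, destination_uri: str, data: bytes) -> None:
    """Upload obfuscated bytes using the client's put_object method."""
    bucket, key = split_s3_uri(destination_uri)
    s3_client.put_object(Bucket=bucket, Key=key, Body=data)


# Example call (requires boto3 and configured AWS credentials):
#   import boto3
#   upload_result(boto3.client("s3"), "s3://bucket-name/obfuscated/file-name.csv", result)
```

Passing the client in as an argument keeps the helper easy to test with a stand-in object and leaves credential handling to the caller.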

Alternatively, a command-line interface (CLI) is available for using the package from the terminal. The CLI is not packaged with the library, so you must follow the steps in the source installation section to use it.

❯❯ poetry run python cli.py --help
usage: GDPR-Obfuscator [-h] (-l LOCAL | -s S3) -p PII [PII ...]

Obfuscate sensitive data stored locally or in an AWS environment

options:
  -h, --help            show this help message and exit
  -l, --local LOCAL     Local path to file
  -s, --s3 S3           URI path to file stored in S3
  -p, --pii PII [PII ...]
                        List of PII fields to obfuscate, separated by spaces
❯❯ poetry run python cli.py -l test/data/mock_data.csv -p name
student_id,name,course,cohort,graduation_date,email_address
1,***,UX/UI Design Bootcamp,2/29/2024,7/11/2024,jleger0@facebook.com
2,***,Digital Marketing Bootcamp,2/24/2024,9/6/2024,cadrian1@gizmodo.com
3,***,UX/UI Design Bootcamp,3/13/2024,10/24/2024,whugnin2@archive.org
4,***,Artificial Intelligence Bootcamp,2/24/2024,9/14/2024,aspight3@4shared.com
5,***,Artificial Intelligence Bootcamp,1/31/2024,9/4/2024,dcowpland4@dot.gov
6,***,Digital Marketing Bootcamp,2/8/2024,,gkliement5@auda.org.au
7,***,Internet of Things Bootcamp,2/21/2024,7/16/2024,
8,***,Mobile App Development Bootcamp,2/17/2024,7/15/2024,smyrkus7@i2i.jp
9,***,Game Development Bootcamp,3/1/2024,9/7/2024,nryal8@symantec.com
...
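
The help text above is consistent with an argparse parser that has one required, mutually exclusive source group and a multi-value PII option. The following is a reconstruction from that help output only, not the contents of cli.py; the function name `build_parser` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the usage line shown in the help output."""
    parser = argparse.ArgumentParser(
        prog="GDPR-Obfuscator",
        description="Obfuscate sensitive data stored locally or in an AWS environment",
    )
    # Exactly one of --local or --s3 must be supplied.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-l", "--local", help="Local path to file")
    source.add_argument("-s", "--s3", help="URI path to file stored in S3")
    # One or more field names, separated by spaces.
    parser.add_argument(
        "-p", "--pii", nargs="+", required=True,
        help="List of PII fields to obfuscate, separated by spaces",
    )
    return parser


args = build_parser().parse_args(["-l", "test/data/mock_data.csv", "-p", "name"])
print(args.local, args.s3, args.pii)
```

With this layout, supplying both -l and -s, or omitting -p, makes argparse exit with a usage error before any file is read.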
