Migrate Sensitive Data in BigQuery Using Dataflow & Cloud DLP

This repo contains a reference implementation of an end-to-end data tokenization solution designed to migrate sensitive data in BigQuery. Please check out the links below for reference guides:

  1. Concept & Overview.
  2. Create & Manage Cloud DLP Configurations.
  3. Automated Dataflow Pipeline to De-identify PII Dataset.
  4. Validate Dataset in BigQuery and Re-identify using Dataflow.

Table of Contents

  • Reference Architecture
  • Quick Start
  • Quick Start To S3 Inspection PoC
  • New S3 Scanner Build and Run
  • To Do

Reference Architecture

Quick Start

Open in Cloud Shell

Run the following commands to trigger an automated deployment in your GCP project. The script handles the required setup for you:

gcloud config set project <project_id>
sh deploy-data-tokeninzation-solution.sh
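
If you want to confirm that the de-identification Dataflow job was launched, one quick check is the command below; the us-central1 region is an assumption and may differ in your deployment:

gcloud dataflow jobs list --status=active --region=us-central1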

You can run some quick validations on the BigQuery table to check the tokenized data.
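
For example, a simple spot check with the bq CLI; the dataset and table names below are placeholders, so substitute the ones created by the deployment:

bq query --use_legacy_sql=false 'SELECT * FROM `<project_id>.<dataset>.<tokenized_table>` LIMIT 10'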

For re-identification (getting back the original data in a Pub/Sub topic), please follow the instructions here.
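
Once the re-identification pipeline is publishing, you can spot-check the topic by pulling from a subscription attached to it; the subscription name below is a placeholder:

gcloud pubsub subscriptions pull <reid_subscription> --auto-ack --limit=5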

Quick Start To S3 Inspection PoC

This is a hybrid solution for customers who would like to use Cloud DLP to scan PII data stored in an S3 bucket. The solution stores the inspection results in a BigQuery table.

Open in Cloud Shell

  • Note: Please modify the shell script below to set the required environment variables before executing it.
gcloud config set project <project_id>
sh deploy-s3-inspect-solution.sh

New S3 Scanner Build and Run

export AWS_ACCESS_KEY_ID="<access_key>"
export AWS_SECRET_ACCESS_KEY="<secret_key>"
export AWS_CRED="{\"@type\":\"AWSStaticCredentialsProvider\",\"awsAccessKeyId\":\"${AWS_ACCESS_KEY_ID}\",\"awsSecretKey\":\"${AWS_SECRET_ACCESS_KEY}\"}"
gradle spotlessApply -DmainClass=com.google.solutions.s3.scanner.DLPS3ScannerPipeline

gradle build -DmainClass=com.google.solutions.s3.scanner.DLPS3ScannerPipeline

gradle run -DmainClass=com.google.swarm.tokenization.DLPS3ScannerPipeline -Pargs="--runner=DataflowRunner \
 --project=<id> \
 --autoscalingAlgorithm=NONE \
 --workerMachineType=n1-standard-4 \
 --numWorkers=5 \
 --maxNumWorkers=5 \
 --region=us-central1 \
 --filePattern=gs://<bucket>/*.csv \
 --inspectTemplateName=projects/<id>/inspectTemplates/inspect-test1 \
 --tableSpec=project:demo_dataset.dlp_inspection_results \
 --auditTableSpec=project:demo_dataset.dlp_inspection_audit \
 --tempLocation=gs://dfs-temp-files/tmp \
 --batchSize=500000 \
 --usePublicIps=false \
 --diskSizeGb=500 \
 --workerDiskType=compute.googleapis.com/projects/id/zones/us-central1-b/diskTypes/pd-ssd"
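
After the pipeline finishes, the inspection findings land in the table passed as --tableSpec. A quick way to sample them is the bq command below; the project and dataset names mirror the placeholders in the run command above, so adjust them for your environment:

bq query --use_legacy_sql=false 'SELECT * FROM `project.demo_dataset.dlp_inspection_results` LIMIT 10'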

To Do

  • S3 Scanner accuracy.
  • Fault-tolerant deployment scripts.
