Migrate Sensitive Data in BigQuery Using Dataflow & Cloud DLP

This repo contains a reference implementation of an end-to-end data tokenization solution designed to migrate sensitive data in BigQuery. Please check out the links below for reference guides:

  1. Concept & Overview.
  2. Create & Manage Cloud DLP Configurations.
  3. Automated Dataflow Pipeline to De-identify PII Dataset.
  4. Validate Dataset in BigQuery and Re-identify using Dataflow.

Table of Contents

  • Reference Architecture
  • Quick Start
  • Quick Start To S3 Inspection PoC
  • New S3 Scanner Build and Run
  • To Do

Reference Architecture

Quick Start

Open in Cloud Shell

Run the following commands to trigger an automated deployment in your GCP project. The script handles the required setup for you:

gcloud config set project <project_id>
sh deploy-data-tokeninzation-solution.sh
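
If you want to confirm that the de-identification Dataflow job was launched, one quick check is the command below; the us-central1 region is an assumption and may differ in your deployment:

gcloud dataflow jobs list --status=active --region=us-central1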

You can run some quick validations on the BigQuery table to check the tokenized data.
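
For example, a simple spot check with the bq CLI; the dataset and table names below are placeholders, so substitute the ones created by the deployment:

bq query --use_legacy_sql=false 'SELECT * FROM `<project_id>.<dataset>.<tokenized_table>` LIMIT 10'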

For re-identification (getting back the original data in a Pub/Sub topic), please follow the instructions here.
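
Once the re-identification pipeline is publishing, you can spot-check the topic by pulling from a subscription attached to it; the subscription name below is a placeholder:

gcloud pubsub subscriptions pull <reid_subscription> --auto-ack --limit=5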

Quick Start To S3 Inspection PoC

This is a hybrid solution for customers who would like to use Cloud DLP to scan PII data stored in an S3 bucket. The solution stores the inspection results in a BigQuery table.

Open in Cloud Shell

  • Note: Please modify the shell script below to set the required environment variables before executing it.
gcloud config set project <project_id>
sh deploy-s3-inspect-solution.sh

New S3 Scanner Build and Run

export AWS_ACCESS_KEY_ID="<access_key>"
export AWS_SECRET_ACCESS_KEY="<secret_key>"
export AWS_CRED="{\"@type\":\"AWSStaticCredentialsProvider\",\"awsAccessKeyId\":\"${AWS_ACCESS_KEY_ID}\",\"awsSecretKey\":\"${AWS_SECRET_ACCESS_KEY}\"}"
gradle spotlessApply -DmainClass=com.google.solutions.s3.scanner.DLPS3ScannerPipeline

gradle build -DmainClass=com.google.solutions.s3.scanner.DLPS3ScannerPipeline

gradle run -DmainClass=com.google.swarm.tokenization.DLPS3ScannerPipeline -Pargs="--runner=DataflowRunner \
 --project=<id> \
 --autoscalingAlgorithm=NONE \
 --workerMachineType=n1-standard-4 \
 --numWorkers=5 \
 --maxNumWorkers=5 \
 --region=us-central1 \
 --filePattern=gs://<bucket>/*.csv \
 --inspectTemplateName=projects/<id>/inspectTemplates/inspect-test1 \
 --tableSpec=project:demo_dataset.dlp_inspection_results \
 --auditTableSpec=project:demo_dataset.dlp_inspection_audit \
 --tempLocation=gs://dfs-temp-files/tmp \
 --batchSize=500000 \
 --usePublicIps=false \
 --diskSizeGb=500 \
 --workerDiskType=compute.googleapis.com/projects/id/zones/us-central1-b/diskTypes/pd-ssd"
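
After the pipeline finishes, the inspection findings land in the table passed as --tableSpec. A quick way to sample them is the bq command below; the project and dataset names mirror the placeholders in the run command above, so adjust them for your environment:

bq query --use_legacy_sql=false 'SELECT * FROM `project.demo_dataset.dlp_inspection_results` LIMIT 10'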

To Do

  • S3 Scanner accuracy.
  • Fault-tolerant deployment scripts.
