This repo contains a reference implementation of an end-to-end data tokenization solution designed to migrate sensitive data to BigQuery. Please check out the links below for reference guides:
- Concept & Overview.
- Create & Manage Cloud DLP Configurations.
- Automated Dataflow Pipeline to De-identify PII Dataset.
- Validate Dataset in BigQuery and Re-identify using Dataflow.
Run the following commands to trigger an automated deployment in your GCP project. The script handles the following:
- Creates a bucket (`{project-id}-demo-data`) in us-central1 and uploads a sample dataset containing mock PII data.
- Creates a BigQuery dataset (`demo_dataset`) in the US to store the tokenized data.
- Creates a KMS-wrapped key (KEK) by creating an automatic TEK (Token Encryption Key); a sketch of this step follows the commands below.
- Creates DLP inspect and re-identification templates using the KEK and the crypto-based transformations identified in this section of the guide.
- Triggers an automated Dataflow pipeline, passing all the required parameters, e.g. data, configuration, and dataset name.

Please allow 5-10 minutes for the deployment to complete.
```
gcloud config set project <project_id>
sh deploy-data-tokeninzation-solution.sh
```
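For reference, the KEK-creation step resembles the sketch below. This is an illustration, not the script's exact commands; the key ring and key names (`demo-keyring`, `demo-key`) and file names are assumptions.

```
# Hypothetical sketch: wrap a locally generated TEK with Cloud KMS to produce the KEK.
openssl rand 32 > tek.bin                     # random 256-bit token encryption key (TEK)
gcloud kms keyrings create demo-keyring --location=global
gcloud kms keys create demo-key --keyring=demo-keyring --location=global --purpose=encryption
gcloud kms encrypt --key=demo-key --keyring=demo-keyring --location=global \
  --plaintext-file=tek.bin --ciphertext-file=kek.bin
base64 kek.bin                                # base64 form of the wrapped key for the DLP template
```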
You can run some quick validation queries in BigQuery to check the tokenized data.
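As a concrete starting point, a query like the one below previews a few rows; `<table>` is a placeholder for whichever table the deployment created in `demo_dataset`.

```
# Preview a few tokenized rows (substitute your table name).
bq query --use_legacy_sql=false 'SELECT * FROM `demo_dataset.<table>` LIMIT 10'
```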
For re-identification (getting back the original data in a Pub/Sub topic), please follow the instructions here.
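Once re-identified records are being published, one way to spot-check them is to pull from a subscription attached to that topic. A minimal sketch; the subscription name `reid-sub` is an assumption:

```
# Hypothetical subscription on the re-identification topic.
gcloud pubsub subscriptions pull reid-sub --auto-ack --limit=5
```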
This is a hybrid solution for customers who would like to use Cloud DLP to scan PII data stored in an S3 bucket. The solution stores the inspection results in a BigQuery table.
- Note: Please update the required environment variables in the shell script below before executing.
```
gcloud config set project <project_id>
sh deploy-s3-inspect-solution.sh
```
Export your AWS credentials so the pipeline can read from the S3 bucket:

```
export AWS_ACCESS_KEY_ID="<access_key>"
export AWS_SECRET_ACCESS_KEY="<secret_key>"
export AWS_CRED="{\"@type\":\"AWSStaticCredentialsProvider\",\"awsAccessKeyId\":\"${AWS_ACCESS_KEY_ID}\",\"awsSecretKey\":\"${AWS_SECRET_ACCESS_KEY}\"}"
```
Then format, build, and run the pipeline:

```
gradle spotlessApply -DmainClass=com.google.swarm.tokenization.DLPS3ScannerPipeline
gradle build -DmainClass=com.google.swarm.tokenization.DLPS3ScannerPipeline
gradle run -DmainClass=com.google.swarm.tokenization.DLPS3ScannerPipeline -Pargs="--runner=DataflowRunner --project=<id> --autoscalingAlgorithm=NONE --workerMachineType=n1-standard-4 --numWorkers=5 --maxNumWorkers=5 --region=us-central1 --filePattern=gs://<bucket>/*.csv --inspectTemplateName=projects/<id>/inspectTemplates/inspect-test1 --tableSpec=project:demo_dataset.dlp_inspection_results --auditTableSpec=project:demo_dataset.dlp_inspection_audit --tempLocation=gs://dfs-temp-files/tmp --batchSize=500000 --usePublicIps=false --diskSizeGb=500 --workerDiskType=compute.googleapis.com/projects/id/zones/us-central1-b/diskTypes/pd-ssd"
```
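After the pipeline completes, you can spot-check the findings table named by `--tableSpec` above:

```
# Sample the DLP inspection results written by the pipeline.
bq query --use_legacy_sql=false 'SELECT * FROM `demo_dataset.dlp_inspection_results` LIMIT 10'
```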
To do:
- S3 Scanner accuracy.
- Fault-tolerant deployment scripts.