Metadata synchronization for GCP Dataplex. Spring Boot microservices with Terraform IaC for automated data catalog integration from external sources. Features Cloud Run, Cloud Functions, and Data Catalog.

SergeyKirk/gcp-dataplex-metadata-connector

GCP Dataplex Metadata Connector

A solution for synchronizing metadata from external data catalogs to Google Cloud Dataplex and Data Catalog. The project automates extraction, transformation, and ingestion of catalog metadata using a serverless architecture on GCP.

Overview

This connector keeps metadata synchronized between external data sources and GCP Dataplex, giving organizations centralized data governance and discovery. The solution consists of two Spring Boot microservices, one deployed on Cloud Run and one as a Cloud Function, running on GCP infrastructure provisioned with Terraform.

Architecture

┌─────────────────┐         ┌──────────────────┐         ┌─────────────────┐
│   Data Source   │────────▶│   Metadata       │────────▶│  Cloud Storage  │
│   REST API      │         │   Extractor      │         │                 │
└─────────────────┘         │  (Cloud Run)     │         └────────┬────────┘
                            └──────────────────┘                  │
                                                                  │
                            ┌──────────────────┐                  │
                            │   Metadata       │◀─────────────────┘
                            │   Ingester       │
                            │ (Cloud Function) │
                            └────────┬─────────┘
                                     │
                            ┌────────▼─────────┐
                            │  GCP Dataplex    │
                            │  Data Catalog    │
                            └──────────────────┘

Components

1. Metadata Extractor (appdev/MetadataExtractor)

  • Technology: Spring Boot 3.0.2, Java 17
  • Deployment: Cloud Run
  • Purpose: Extracts metadata from external data sources via REST API
  • Features:
    • Connects to data source REST endpoints
    • Retrieves catalog, schema, table, and column metadata
    • Transforms metadata into standardized JSON format
    • Stores processed metadata in Cloud Storage
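
The transformation step listed above can be pictured with a small sketch. The README does not show the project's JSON schema, so the `Table` and `Column` records and the field names `catalog`, `schema`, `table`, and `columns` are assumptions made for illustration. The sketch uses only the Java standard library; the real service may well use a JSON library instead.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical metadata model: record and field names are illustrative only,
// not the project's actual schema.
final class MetadataJsonSketch {

    record Column(String name, String type) {}

    record Table(String catalog, String schema, String name, List<Column> columns) {}

    // Wrap a value in double quotes and escape characters that JSON forbids raw.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> {
                    if (c < 0x20) {
                        // Remaining control characters become numeric escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
                }
            }
        }
        return sb.append('"').toString();
    }

    // Serialize one table and its columns into a single JSON object.
    static String toJson(Table t) {
        String cols = t.columns().stream()
                .map(c -> "{\"name\":" + quote(c.name()) + ",\"type\":" + quote(c.type()) + "}")
                .collect(Collectors.joining(","));
        return "{\"catalog\":" + quote(t.catalog())
                + ",\"schema\":" + quote(t.schema())
                + ",\"table\":" + quote(t.name())
                + ",\"columns\":[" + cols + "]}";
    }

    public static void main(String[] args) {
        Table t = new Table("sales", "public", "orders",
                List.of(new Column("id", "bigint"), new Column("note", "varchar")));
        System.out.println(toJson(t));
    }
}
```

One JSON document per table keeps each Cloud Storage object small and lets a downstream consumer process tables independently, though the project may group metadata differently.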

2. Metadata Ingester (appdev/MetadataIngester)

  • Technology: Spring Boot 3.0.3, Java 17, Cloud Functions Framework
  • Deployment: Cloud Function (Gen 2)
  • Purpose: Ingests metadata from Cloud Storage into GCP Data Catalog
  • Features:
    • Triggered by Cloud Storage events
    • Reads processed metadata JSON files
    • Creates/updates Data Catalog entries
    • Applies taxonomy tags and metadata templates
    • Manages entry groups and tag templates
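
Creating entries and entry groups requires IDs that Data Catalog accepts. The sketch below assumes the commonly documented rule (letters, digits, and underscores only, not starting with a digit, at most 64 characters); verify it against the current API reference before relying on it. The class and method names are hypothetical and do not come from the project.

```java
// Hypothetical helper: normalizes source names into IDs under the ASSUMED rule
// "letters, digits, underscores; no leading digit; max 64 characters".
final class CatalogIdSketch {

    static final int MAX_ID_LENGTH = 64; // assumed limit, check the API reference

    static String sanitizeId(String raw) {
        // Replace every character outside the allowed set with an underscore.
        String cleaned = raw.replaceAll("[^A-Za-z0-9_]", "_");
        // Prefix an underscore when the result is empty or starts with a digit.
        if (cleaned.isEmpty() || Character.isDigit(cleaned.charAt(0))) {
            cleaned = "_" + cleaned;
        }
        // Truncate to the assumed maximum length.
        return cleaned.length() > MAX_ID_LENGTH
                ? cleaned.substring(0, MAX_ID_LENGTH)
                : cleaned;
    }

    // Build an entry ID for a table from its schema and table name.
    static String tableEntryId(String schema, String table) {
        return sanitizeId(schema + "_" + table);
    }

    public static void main(String[] args) {
        System.out.println(tableEntryId("public", "order-items"));
        System.out.println(tableEntryId("2024", "sales.daily"));
    }
}
```

Truncation and character replacement can map two different source names to the same ID, so a production version would likely append a short hash of the original name.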

3. Infrastructure (infra-terraform)

  • Technology: Terraform
  • Purpose: Provisions all required GCP infrastructure
  • Resources:
    • VPC and networking
    • Cloud Run services
    • Cloud Functions
    • Cloud Storage buckets
    • Dataplex lakes and zones
    • Data Catalog resources
    • IAM service accounts and permissions
    • Artifact Registry
    • Dataproc Metastore

Prerequisites

  • GCP Project with billing enabled
  • Terraform >= 1.0
  • Java 17
  • Maven 3.6+
  • Docker (for containerization)
  • gcloud CLI
  • External data source with REST API access

Quick Start

1. Configure Infrastructure

cd infra-terraform

# Copy the example variables file, then edit terraform.tfvars with your values
cp terraform.tfvars.example terraform.tfvars

# Initialize Terraform
terraform init

# Review planned changes
terraform plan

# Apply infrastructure
terraform apply

2. Build and Deploy Applications

Metadata Extractor

cd appdev/MetadataExtractor

# Build JAR
mvn clean package

# Build Docker image
docker build -t gcr.io/PROJECT_ID/metadata-extractor:latest .

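# Push the image (this is a gcr.io path; a native Artifact Registry repository would use a LOCATION-docker.pkg.dev path instead)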
docker push gcr.io/PROJECT_ID/metadata-extractor:latest

# Deploy to Cloud Run (handled by Terraform)

Metadata Ingester

cd appdev/MetadataIngester

# Build shaded JAR for Cloud Functions
mvn clean package

# Deploy to Cloud Functions
gcloud functions deploy metadata-ingester \
  --gen2 \
  --runtime java17 \
  --entry-point org.springframework.cloud.function.adapter.gcp.GcfJarLauncher \
  --source target/deployment/ \
  --trigger-bucket METADATA_BUCKET

3. Configure Data Source Connection

Set the following environment variables on the Cloud Run service:

  • DATA_SOURCE_API_URL: Your data source REST API endpoint
  • DATA_SOURCE_AUTH_TOKEN: Authentication token (stored in Secret Manager)
  • GCS_BUCKET_NAME: Cloud Storage bucket for metadata
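
A missing variable is easier to diagnose when the service fails at startup with a clear message. The following is a minimal sketch of that check, assuming the three variable names listed above; the `ExtractorEnvSketch` class and its `Config` record are hypothetical and not part of the project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical startup check for the three variables named in this README.
final class ExtractorEnvSketch {

    record Config(String apiUrl, String authToken, String bucket) {}

    static final List<String> REQUIRED =
            List.of("DATA_SOURCE_API_URL", "DATA_SOURCE_AUTH_TOKEN", "GCS_BUCKET_NAME");

    // Collect every missing or blank variable so one error reports them all.
    static Config fromEnv(Map<String, String> env) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = env.get(key);
            if (value == null || value.isBlank()) {
                missing.add(key);
            }
        }
        if (!missing.isEmpty()) {
            throw new IllegalStateException(
                    "Missing environment variables: " + String.join(", ", missing));
        }
        return new Config(env.get("DATA_SOURCE_API_URL"),
                env.get("DATA_SOURCE_AUTH_TOKEN"),
                env.get("GCS_BUCKET_NAME"));
    }

    public static void main(String[] args) {
        try {
            Config config = fromEnv(System.getenv());
            // Never log the token itself; only non-secret values are printed.
            System.out.println("Bucket: " + config.bucket());
        } catch (IllegalStateException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

Passing the environment in as a `Map` keeps the check testable without touching the real process environment.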

Configuration

Application Properties

Metadata Extractor

spring.cloud.gcp.project-id=YOUR_PROJECT_ID
spring.cloud.gcp.storage.bucket=metadata-bucket
datasource.api.base-url=https://datasource.example.com/api
datasource.api.timeout=30000
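
To show how `datasource.api.base-url` and `datasource.api.timeout` might be applied, here is a sketch using the JDK's built-in `java.net.http` client. The `catalogs` path, the `Bearer` authorization scheme, and the reading of the timeout as milliseconds are all assumptions; the README does not describe the data source API or which HTTP client the extractor uses.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical request builder; the "catalogs" path and Bearer scheme are assumed.
final class DataSourceRequestSketch {

    static HttpRequest catalogsRequest(String baseUrl, long timeoutMillis, String token) {
        // A trailing slash makes URI.resolve append to the base path
        // instead of replacing its last segment.
        String base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        return HttpRequest.newBuilder(URI.create(base).resolve("catalogs"))
                .timeout(Duration.ofMillis(timeoutMillis)) // assumes the property is in ms
                .header("Authorization", "Bearer " + token)
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = catalogsRequest(
                "https://datasource.example.com/api", 30000, "example-token");
        // Only the request is built here; nothing is sent over the network.
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request would use `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, which is left out so the sketch runs without network access.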

Metadata Ingester

spring.cloud.gcp.project-id=YOUR_PROJECT_ID
datacatalog.location=us-central1
datacatalog.entry-group=starburst-catalog
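
These three properties are enough to derive the entry group's full resource name, which in Data Catalog follows the pattern `projects/{project}/locations/{location}/entryGroups/{id}`. The sketch below assumes the properties are read with plain `java.util.Properties`; the class and helper names are hypothetical, and the real service presumably relies on Spring's property binding.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper that derives the entry group resource name from properties.
final class EntryGroupNameSketch {

    // Parse properties text; a StringReader should not fail, but load() declares IOException.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Fetch a property or fail with a message naming the missing key.
    static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return value.trim();
    }

    static String entryGroupName(Properties props) {
        return String.format("projects/%s/locations/%s/entryGroups/%s",
                require(props, "spring.cloud.gcp.project-id"),
                require(props, "datacatalog.location"),
                require(props, "datacatalog.entry-group"));
    }

    public static void main(String[] args) {
        Properties props = parse(
                "spring.cloud.gcp.project-id=YOUR_PROJECT_ID\n"
                + "datacatalog.location=us-central1\n"
                + "datacatalog.entry-group=starburst-catalog\n");
        System.out.println(entryGroupName(props));
    }
}
```

Data Catalog documents entry group IDs as limited to letters, digits, and underscores, so a hyphenated value such as `starburst-catalog` may need normalizing before it is used as an ID; check the current API reference.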

Terraform Variables

Key variables in terraform.tfvars:

project_id     = "your-gcp-project"
location       = "us-central1"
vpc_name       = "dataplex-vpc"
storage_bucket_names = ["metadata-staging", "metadata-processed"]

dataplex_lake_name = "enterprise-data-lake"
dataplex_zone_names = {
  "raw-zone"      = ["raw-bucket"]
  "curated-zone"  = ["curated-bucket"]
}

Metadata Flow

  1. Extraction:

    • Cloud Scheduler triggers Metadata Extractor (Cloud Run)
    • Extractor queries data source REST API
    • Metadata is transformed to JSON format
    • JSON files stored in Cloud Storage staging bucket
  2. Transformation:

    • Storage event triggers processing
    • Metadata enriched with additional context
    • Standardized for Data Catalog format
  3. Ingestion:

    • Cloud Function triggered by processed metadata files
    • Creates/updates Data Catalog entries
    • Applies taxonomy and tags
    • Updates Dataplex asset metadata
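
The hand-off between stages in the flow above can be sketched as a routing decision on each Cloud Storage event. This is an illustration only: it assumes the bucket names from the `terraform.tfvars` example (`metadata-staging`, `metadata-processed`) and a `.json` suffix on metadata files, and the `Stage` enum and `route` method are hypothetical rather than taken from the project.

```java
// Hypothetical routing of storage events to pipeline stages.
// Bucket names come from the example tfvars; the .json suffix is an assumption.
final class FlowRoutingSketch {

    enum Stage { TRANSFORM, INGEST, IGNORE }

    static Stage route(String bucket, String objectName) {
        // Anything that is not a metadata JSON file is skipped.
        if (!objectName.endsWith(".json")) {
            return Stage.IGNORE;
        }
        return switch (bucket) {
            case "metadata-staging" -> Stage.TRANSFORM;  // step 2: enrich and standardize
            case "metadata-processed" -> Stage.INGEST;   // step 3: write to Data Catalog
            default -> Stage.IGNORE;
        };
    }

    public static void main(String[] args) {
        System.out.println(route("metadata-staging", "sales/public/orders.json"));
        System.out.println(route("metadata-processed", "sales/public/orders.json"));
        System.out.println(route("metadata-processed", "README.txt"));
    }
}
```

Keeping staging and processed files in separate buckets means each trigger fires for one stage only, which avoids a function re-triggering itself on its own output.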

Features

Data Catalog Integration

  • Automatic entry group creation
  • Custom tag templates for source metadata
  • Taxonomy management for data classification
  • Column-level metadata and lineage

Dataplex Management

  • Lake and zone provisioning
  • Asset registration
  • Metadata synchronization
  • Asset discovery configuration

Security

  • Service account isolation
  • Secret Manager for credentials
  • VPC networking with private IPs
  • IAM least-privilege roles

Monitoring

  • Cloud Logging integration
  • Error tracking and alerting
  • Execution metrics
  • Audit logging for compliance

Development

Local Testing

Extractor

cd appdev/MetadataExtractor
mvn spring-boot:run -Dspring-boot.run.profiles=dev

Ingester

cd appdev/MetadataIngester
mvn function:run

Running Tests

mvn clean test

Terraform Modules

The infrastructure is modularized for reusability:

  • cloud_vpc - VPC and networking
  • cloud_artifact_registry - Container registry
  • cloud_storage - GCS buckets
  • cloud_run - Cloud Run services
  • cloud_function - Cloud Functions
  • cloud_dataplex - Dataplex resources
  • cloud_data_catalog - Data Catalog setup
  • cloud_dataproc_meta - Dataproc Metastore
  • cloud_service_account - IAM service accounts
  • cloud_secret_manager - Secret management

Troubleshooting

Common Issues

Metadata Extraction Fails

  • Verify data source API connectivity
  • Check authentication tokens in Secret Manager
  • Review Cloud Run logs for errors

Ingestion Not Triggered

  • Verify Cloud Storage bucket notifications
  • Check Cloud Function event triggers
  • Ensure proper IAM permissions

Data Catalog Errors

  • Validate entry group exists
  • Check tag template configurations
  • Review IAM roles for Data Catalog API

Cost Optimization

  • Set Cloud Run minimum instances to 0 in dev so idle services scale to zero
  • Enable Cloud Storage lifecycle policies to expire old metadata files
  • Use preemptible secondary workers on any Dataproc clusters
  • Right-size Cloud Function memory limits for the workload

Security Best Practices

  1. Store all secrets in Secret Manager
  2. Use VPC Service Controls for data exfiltration prevention
  3. Enable audit logging for all services
  4. Implement least-privilege IAM roles
  5. Use customer-managed encryption keys (CMEK)
