A solution for synchronizing metadata from external data catalogs to Google Cloud Dataplex and Data Catalog. This project provides automated extraction, transformation, and ingestion of data catalog metadata using a serverless architecture on GCP.

This connector lets organizations keep metadata synchronized between external data sources and GCP Dataplex, providing centralized data governance and discovery. The solution consists of two Spring Boot microservices, deployed on Cloud Run and Cloud Functions respectively, with the supporting GCP infrastructure provisioned via Terraform.
```
┌─────────────────┐         ┌──────────────────┐         ┌─────────────────┐
│   Data Source   │────────▶│     Metadata     │────────▶│  Cloud Storage  │
│    REST API     │         │    Extractor     │         │                 │
└─────────────────┘         │   (Cloud Run)    │         └────────┬────────┘
                            └──────────────────┘                  │
                                                                  │
                            ┌──────────────────┐                  │
                            │     Metadata     │◀─────────────────┘
                            │     Ingester     │
                            │ (Cloud Function) │
                            └────────┬─────────┘
                                     │
                            ┌────────▼─────────┐
                            │   GCP Dataplex   │
                            │   Data Catalog   │
                            └──────────────────┘
```
Metadata Extractor

- Technology: Spring Boot 3.0.2, Java 17
- Deployment: Cloud Run
- Purpose: Extracts metadata from external data sources via REST API
- Features:
  - Connects to data source REST endpoints
  - Retrieves catalog, schema, table, and column metadata
  - Transforms metadata into a standardized JSON format
  - Stores processed metadata in Cloud Storage (see the sketch after this list)
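A minimal sketch of the extract-and-stage step, using only the Java HTTP client and the Cloud Storage library. The `/catalogs` path and object name are illustrative assumptions, not documented endpoints; the actual service wraps this logic in Spring Boot components and reads the environment variables described under Configuration below.

```java
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class ExtractAndStage {
    public static void main(String[] args) throws Exception {
        String apiUrl = System.getenv("DATA_SOURCE_API_URL");
        String token  = System.getenv("DATA_SOURCE_AUTH_TOKEN");
        String bucket = System.getenv("GCS_BUCKET_NAME");

        // Fetch catalog metadata from the external REST API
        // ("/catalogs" is an illustrative path).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(apiUrl + "/catalogs"))
                .header("Authorization", "Bearer " + token)
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Stage the JSON payload in the Cloud Storage bucket, where the
        // ingester's storage trigger will pick it up.
        Storage storage = StorageOptions.getDefaultInstance().getService();
        BlobInfo blob = BlobInfo.newBuilder(BlobId.of(bucket, "staging/catalogs.json"))
                .setContentType("application/json")
                .build();
        storage.create(blob, response.body().getBytes(StandardCharsets.UTF_8));
    }
}
```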
Metadata Ingester

- Technology: Spring Boot 3.0.3, Java 17, Cloud Functions Framework
- Deployment: Cloud Function (Gen 2)
- Purpose: Ingests metadata from Cloud Storage into GCP Data Catalog
- Features:
  - Triggered by Cloud Storage events (see the sketch after this list)
  - Reads processed metadata JSON files
  - Creates/updates Data Catalog entries
  - Applies taxonomy tags and metadata templates
  - Manages entry groups and tag templates
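The deployed function is routed through Spring Cloud Function's `GcfJarLauncher` (see the deploy command below). As an illustration of the trigger mechanics only, here is an equivalent handler written directly against the Functions Framework, with Gson used to pull the bucket and object name out of the Cloud Storage event payload; the class name is hypothetical.

```java
import com.google.cloud.functions.CloudEventsFunction;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import io.cloudevents.CloudEvent;

import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the project's Spring-based ingester.
public class MetadataIngestFunction implements CloudEventsFunction {
    @Override
    public void accept(CloudEvent event) {
        // The Cloud Storage event payload carries the bucket and object name.
        JsonObject payload = JsonParser
                .parseString(new String(event.getData().toBytes(), StandardCharsets.UTF_8))
                .getAsJsonObject();
        String bucket = payload.get("bucket").getAsString();
        String object = payload.get("name").getAsString();

        // Read the processed metadata JSON that the extractor staged.
        Storage storage = StorageOptions.getDefaultInstance().getService();
        Blob blob = storage.get(bucket, object);
        String metadataJson = new String(blob.getContent(), StandardCharsets.UTF_8);

        // From here the function maps metadataJson onto Data Catalog
        // entries, tags, and templates (see the ingestion sketch below).
    }
}
```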
Infrastructure (Terraform)

- Technology: Terraform
- Purpose: Provisions all required GCP infrastructure
- Resources:
  - VPC and networking
  - Cloud Run services
  - Cloud Functions
  - Cloud Storage buckets
  - Dataplex lakes and zones
  - Data Catalog resources
  - IAM service accounts and permissions
  - Artifact Registry
  - Dataproc Metastore
Prerequisites

- GCP Project with billing enabled
- Terraform >= 1.0
- Java 17
- Maven 3.6+
- Docker (for containerization)
- gcloud CLI
- External data source with REST API access
Deploy the infrastructure:

```bash
cd infra-terraform

# Update terraform.tfvars with your values
cp terraform.tfvars.example terraform.tfvars

# Initialize Terraform
terraform init

# Review planned changes
terraform plan

# Apply infrastructure
terraform apply
```

Build the Metadata Extractor:

```bash
cd appdev/MetadataExtractor

# Build JAR
mvn clean package

# Build Docker image
docker build -t gcr.io/PROJECT_ID/metadata-extractor:latest .

# Push to Artifact Registry
docker push gcr.io/PROJECT_ID/metadata-extractor:latest

# Deploy to Cloud Run (handled by Terraform)
```

Build and deploy the Metadata Ingester:

```bash
cd appdev/MetadataIngester

# Build shaded JAR for Cloud Functions
mvn clean package

# Deploy to Cloud Functions
gcloud functions deploy metadata-ingester \
  --gen2 \
  --runtime java17 \
  --entry-point org.springframework.cloud.function.adapter.gcp.GcfJarLauncher \
  --source target/deployment/ \
  --trigger-bucket METADATA_BUCKET
```

Update environment variables in Cloud Run:
- `DATA_SOURCE_API_URL`: Your data source REST API endpoint
- `DATA_SOURCE_AUTH_TOKEN`: Authentication token (stored in Secret Manager)
- `GCS_BUCKET_NAME`: Cloud Storage bucket for metadata
Metadata Extractor configuration:

```properties
spring.cloud.gcp.project-id=YOUR_PROJECT_ID
spring.cloud.gcp.storage.bucket=metadata-bucket
datasource.api.base-url=https://datasource.example.com/api
datasource.api.timeout=30000
```

Metadata Ingester configuration:

```properties
spring.cloud.gcp.project-id=YOUR_PROJECT_ID
datacatalog.location=us-central1
datacatalog.entry-group=starburst-catalog
```
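As a sketch of how the extractor might bind the `datasource.api.*` keys, assuming standard Spring Boot configuration binding (the class name is hypothetical; relaxed binding maps `base-url` to `baseUrl`):

```java
import org.springframework.boot.context.properties.ConfigurationProperties;

// Hypothetical binding class, registered via @ConfigurationPropertiesScan
// or @EnableConfigurationProperties on the application class.
@ConfigurationProperties(prefix = "datasource.api")
public class DataSourceApiProperties {
    private String baseUrl;
    private int timeout = 30000; // milliseconds, matching the default above

    public String getBaseUrl() { return baseUrl; }
    public void setBaseUrl(String baseUrl) { this.baseUrl = baseUrl; }
    public int getTimeout() { return timeout; }
    public void setTimeout(int timeout) { this.timeout = timeout; }
}
```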
Key variables in terraform.tfvars:

```hcl
project_id           = "your-gcp-project"
location             = "us-central1"
vpc_name             = "dataplex-vpc"
storage_bucket_names = ["metadata-staging", "metadata-processed"]
dataplex_lake_name   = "enterprise-data-lake"

dataplex_zone_names = {
  "raw-zone"     = ["raw-bucket"]
  "curated-zone" = ["curated-bucket"]
}
```
Data Flow

1. Extraction:
   - Cloud Scheduler triggers the Metadata Extractor (Cloud Run)
   - Extractor queries the data source REST API
   - Metadata is transformed to JSON format
   - JSON files stored in the Cloud Storage staging bucket
2. Transformation:
   - Storage event triggers processing
   - Metadata enriched with additional context
   - Standardized for Data Catalog format
3. Ingestion:
   - Cloud Function triggered by processed metadata files
   - Creates/updates Data Catalog entries (sketched below)
   - Applies taxonomy and tags
   - Updates Dataplex asset metadata
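A minimal sketch of the entry upsert in the ingestion step, using the `com.google.cloud.datacatalog.v1` client. The `starburst-catalog` entry group comes from the configuration above; the `orders` entry ID and `external_catalog` system name are illustrative assumptions.

```java
import com.google.cloud.datacatalog.v1.DataCatalogClient;
import com.google.cloud.datacatalog.v1.Entry;
import com.google.cloud.datacatalog.v1.EntryGroupName;

public class CreateCatalogEntry {
    public static void main(String[] args) throws Exception {
        try (DataCatalogClient client = DataCatalogClient.create()) {
            // Parent entry group, e.g. the "starburst-catalog" group from the config.
            EntryGroupName parent =
                    EntryGroupName.of("YOUR_PROJECT_ID", "us-central1", "starburst-catalog");

            // Custom (user-specified) entry describing an external table.
            Entry entry = Entry.newBuilder()
                    .setDisplayName("orders")                   // illustrative table name
                    .setUserSpecifiedSystem("external_catalog") // illustrative system id
                    .setUserSpecifiedType("table")
                    .build();

            client.createEntry(parent, "orders", entry);
        }
    }
}
```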
Data Catalog Integration

- Automatic entry group creation
- Custom tag templates for source metadata (see the tag sketch below)
- Taxonomy management for data classification
- Column-level metadata and lineage
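A sketch of the tag-template pattern: create a template once, then attach tags built from it to catalog entries. The template ID, field name, and entry path are illustrative assumptions.

```java
import com.google.cloud.datacatalog.v1.DataCatalogClient;
import com.google.cloud.datacatalog.v1.FieldType;
import com.google.cloud.datacatalog.v1.LocationName;
import com.google.cloud.datacatalog.v1.Tag;
import com.google.cloud.datacatalog.v1.TagField;
import com.google.cloud.datacatalog.v1.TagTemplate;
import com.google.cloud.datacatalog.v1.TagTemplateField;

public class TagSourceMetadata {
    public static void main(String[] args) throws Exception {
        try (DataCatalogClient client = DataCatalogClient.create()) {
            // One-time setup: a template with a single string field.
            TagTemplate template = TagTemplate.newBuilder()
                    .setDisplayName("Source Metadata")
                    .putFields("source_system", TagTemplateField.newBuilder()
                            .setDisplayName("Source system")
                            .setType(FieldType.newBuilder()
                                    .setPrimitiveType(FieldType.PrimitiveType.STRING)
                                    .build())
                            .build())
                    .build();
            TagTemplate created = client.createTagTemplate(
                    LocationName.of("YOUR_PROJECT_ID", "us-central1"),
                    "source_metadata", template);

            // Attach a tag built from that template to an existing entry.
            Tag tag = Tag.newBuilder()
                    .setTemplate(created.getName())
                    .putFields("source_system", TagField.newBuilder()
                            .setStringValue("starburst")
                            .build())
                    .build();
            client.createTag("projects/YOUR_PROJECT_ID/locations/us-central1/"
                    + "entryGroups/starburst-catalog/entries/orders", tag);
        }
    }
}
```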
Dataplex Integration

- Lake and zone provisioning
- Asset registration
- Metadata synchronization
- Discovery asset configuration
Security

- Service account isolation
- Secret Manager for credentials (see the sketch below)
- VPC networking with private IPs
- IAM least-privilege roles
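For the Secret Manager pattern, a service can resolve credentials at startup along these lines; the secret ID `datasource-auth-token` is an illustrative assumption.

```java
import com.google.cloud.secretmanager.v1.AccessSecretVersionResponse;
import com.google.cloud.secretmanager.v1.SecretManagerServiceClient;
import com.google.cloud.secretmanager.v1.SecretVersionName;

public class ResolveAuthToken {
    public static void main(String[] args) throws Exception {
        try (SecretManagerServiceClient client = SecretManagerServiceClient.create()) {
            // "datasource-auth-token" is an illustrative secret ID.
            SecretVersionName name = SecretVersionName.of(
                    "YOUR_PROJECT_ID", "datasource-auth-token", "latest");
            AccessSecretVersionResponse response = client.accessSecretVersion(name);
            String token = response.getPayload().getData().toStringUtf8();
            System.out.println("Token length: " + token.length());
        }
    }
}
```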
Monitoring

- Cloud Logging integration
- Error tracking and alerting
- Execution metrics
- Audit logging for compliance
Run the Metadata Extractor locally:

```bash
cd appdev/MetadataExtractor
mvn spring-boot:run -Dspring-boot.run.profiles=dev
```

Run the Metadata Ingester locally:

```bash
cd appdev/MetadataIngester
mvn function:run
```

Run the test suite:

```bash
mvn clean test
```

The infrastructure is modularized for reusability:
- `cloud_vpc` - VPC and networking
- `cloud_artifact_registry` - Container registry
- `cloud_storage` - GCS buckets
- `cloud_run` - Cloud Run services
- `cloud_function` - Cloud Functions
- `cloud_dataplex` - Dataplex resources
- `cloud_data_catalog` - Data Catalog setup
- `cloud_dataproc_meta` - Dataproc Metastore
- `cloud_service_account` - IAM service accounts
- `cloud_secret_manager` - Secret management
Troubleshooting

Metadata Extraction Fails
- Verify data source API connectivity
- Check authentication tokens in Secret Manager
- Review Cloud Run logs for errors
Ingestion Not Triggered
- Verify Cloud Storage bucket notifications
- Check Cloud Function event triggers
- Ensure proper IAM permissions
Data Catalog Errors
- Validate entry group exists
- Check tag template configurations
- Review IAM roles for Data Catalog API
Cost Optimization

- Use Cloud Run minimum instances = 0 for dev
- Enable Cloud Storage lifecycle policies
- Use preemptible Dataproc clusters
- Configure appropriate Cloud Function memory limits
Security Best Practices

- Store all secrets in Secret Manager
- Use VPC Service Controls to prevent data exfiltration
- Enable audit logging for all services
- Implement least-privilege IAM roles
- Use customer-managed encryption keys (CMEK)