🎯 Perfect for data teams who know SQL and want to build production-ready GenAI pipelines without Python expertise
🤖 Data analytics and engineering teams in enterprises are finding themselves responsible for GenAI data pipelines, often without extensive machine learning expertise. Many of these teams are most comfortable working in SQL, and traditional approaches would require significant upskilling in Python and ML frameworks. This repository bridges that gap by showing how to build sophisticated GenAI pipelines using familiar SQL-based tools.
⚡ Get Started in Minutes: Deploy a complete GenAI pipeline with just SQL and a few commands.
📚 What You'll Learn:
✨ Build production-grade GenAI pipelines using SQL
🛡️ Implement best practices for data quality and security
📊 Monitor and evaluate GenAI model performance
Prerequisites: SQL knowledge
Not Required: Python, Machine Learning expertise
Transform your customer feedback data into actionable insights using state-of-the-art AI/ML techniques:
- 🎯 Sentiment analysis using Gemini 1.5 (see the SQL sketch after this list)
- 🔍 Vector embeddings for semantic search
- 🔮 Intelligent question clustering
- 📊 Automated theme identification
- ⚡ Analytics on RAG system usage, quality and performance
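To give a flavor of the SQL-first approach, here is a minimal sketch of Gemini-powered sentiment analysis in BigQuery. It assumes the `vertex-ai` connection and `consumer_reviews_dataset` created later in this guide; the column names (`review_id`, `review_text`) are illustrative, not the repository's exact schema:

```sql
-- Sketch: create a remote Gemini model over the BigQuery-to-Vertex AI connection.
CREATE OR REPLACE MODEL `your-project-id.consumer_reviews_dataset.gemini_model`
  REMOTE WITH CONNECTION `your-project-id.us-central1.vertex-ai`
  OPTIONS (ENDPOINT = 'gemini-1.5-flash');

-- Score each review's sentiment with a plain SQL query.
SELECT
  review_id,
  ml_generate_text_llm_result AS sentiment
FROM ML.GENERATE_TEXT(
  MODEL `your-project-id.consumer_reviews_dataset.gemini_model`,
  (
    SELECT
      review_id,
      CONCAT(
        'Classify the sentiment of this review as positive, negative, or neutral: ',
        review_text) AS prompt
    FROM `your-project-id.consumer_reviews_dataset.consumer_review_data`
  ),
  STRUCT(0.0 AS temperature, TRUE AS flatten_json_output)
);
```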
Built on enterprise-grade Google Cloud technologies:
- Dataform - Orchestration & transformation
- BigQuery - Serverless data warehouse
- Vertex AI - Machine learning operations
  - Gemini 1.5
  - Text Embedding API
  - Vector Search
The project supports two options for Dataform repository configuration:
1. **Default Repository (Recommended for Learning)**
   - Uses Dataform's built-in repository
   - Perfect for learning and experimentation
   - No additional Git setup required
   - Enabled by default

2. **Remote Git Repository (Optional)**
   - For production or team collaboration
   - Requires a GitHub repository and personal access token
   - Enable by setting `use_remote_git = true` in `terraform.tfvars`
Choose the option that best suits your needs. For most learning scenarios, the default repository is recommended.
Before starting the deployment, ensure you have the following prerequisites in place:
- A Google Cloud project with billing enabled
- Owner or Editor role on the project
- Git (version >= 2.0)
- Terraform (version >= 1.0)
- Google Cloud SDK (version >= 440.0.0)
Run this command to enable the necessary APIs:

```bash
gcloud services enable \
  secretmanager.googleapis.com \
  dataform.googleapis.com \
  bigquery.googleapis.com \
  artifactregistry.googleapis.com
```
1. **Clone the Repository**

   ```bash
   git clone <repository-url>
   cd genai_data_pipeline
   ```
2. **Create Required Google Groups**

   Create three Google Groups in your workspace:
   - Dataform Users (e.g., `dataform-users@your-domain.com`)
   - Data Readers (e.g., `data-readers@your-domain.com`)
   - Data Owner (use an existing team email)
3. **Configure Terraform Variables**

   ```bash
   cd terraform
   cp terraform.tfvars.example terraform.tfvars
   ```

   Edit `terraform.tfvars` and set:
   - Your `project_id`
   - Your Google Groups emails
   - Optionally, other settings you want to customize
4. **Initialize and Apply Terraform**

   ```bash
   terraform init
   terraform plan
   terraform apply
   ```
5. **Verify Deployment**
- Visit the Dataform UI
- Select your project and repository
- Try creating a new definition
This project supports two options for managing Terraform state:
By default, Terraform will store state locally in your workspace. This is suitable for:
- Individual learning and development
- Quick prototyping
- Local testing and experimentation
For team environments or production deployments, we recommend using Google Cloud Storage (GCS) for state management. This provides:
- 🤝 Team collaboration capabilities
- 🔒 State locking to prevent concurrent modifications
- 🔄 Version history and backup
- 🛡️ Better security through Google Cloud IAM
To enable GCS state storage:
```bash
# 1. Create a GCS bucket for state storage
export PROJECT_ID="your-project-id"
gsutil mb -l us-central1 gs://${PROJECT_ID}-terraform-state

# 2. Enable versioning for state history
gsutil versioning set on gs://${PROJECT_ID}-terraform-state
```

```hcl
# 3. Uncomment and configure the backend block in terraform/backend.tf:
terraform {
  backend "gcs" {
    bucket = "YOUR_PROJECT_ID-terraform-state"
    prefix = "genai-pipeline"
  }
}
```

Note: For this educational project, local state is perfectly fine for getting started. Consider switching to GCS state storage when working in a team or moving to production.
Choose your preferred setup path:
1. **Access Google Cloud Console**

   - Navigate to console.cloud.google.com
   - Create or select your project
   - Note your `Project ID` for later use
2. **Enable Required APIs**

   - Go to APIs & Services
   - Click "Enable APIs and Services"
   - Enable the following:
     - BigQuery API
     - BigQuery Connection API
     - Cloud Storage API
     - Vertex AI API
- Navigate to Cloud Storage
- Click "Create Bucket" and configure:
  - Name: `your-project-consumer-reviews`
  - Location: `us-central1`
  - Default storage class: `Standard`
  - Access control: `Uniform`
- Click "Create"
- Upload Data:
  - Open your new bucket
  - Click "Upload Files"
  - Select the sample data file from `genai_data_pipeline/data/consumer_review_data.parquet`
  - Wait for completion
- Open BigQuery Console
- Create Dataset:
  - Click your project name
  - Click "Create Dataset"
  - Dataset ID: `consumer_reviews_dataset`
  - Data location: `US (multi-region)`
  - Click "Create dataset"
- Load Data:
  - Click "Create Table"
  - Source: Select "Google Cloud Storage"
  - File format: `Parquet`
  - Source path: `gs://your-project-consumer-reviews/consumer_review_data.parquet`
  - Table name: `consumer_review_data`
  - Schema: Select "Auto detect"
  - Click "Create table"
- Create Connection:
  - In BigQuery, click "More" → "Connections"
  - Click "Create Connection"
  - Configure:
    - Connection type: `Cloud Resource`
    - Service: `Vertex AI`
    - Connection ID: `vertex-ai`
    - Location: `us-central1`
  - Click "Create"
- Set Up Permissions:
  - Go to IAM & Admin
  - Find: `bq-connection-sa@your-project-id.iam.gserviceaccount.com`
  - Add roles:
    - Vertex AI User
    - BigQuery Admin
- Edit `dataform.json`:

  ```json
  {
    "defaultSchema": "consumer_reviews_dataset",
    "defaultDatabase": "your-project-id",
    "defaultLocation": "US"
  }
  ```

- Edit `includes/constants.js`:

  ```js
  const PROJECT_ID = "your-project-id";
  const SCHEMA_NAME = "consumer_reviews_dataset";
  const REMOTE_CONNECTION =
    "projects/your-project-id/locations/us-central1/connections/vertex-ai";

  // Export the constants so SQLX definitions can reference them.
  module.exports = { PROJECT_ID, SCHEMA_NAME, REMOTE_CONNECTION };
  ```
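As a quick sanity check, a Dataform SQLX definition can then reference these constants. This is a hypothetical example, not one of the repository's actual definitions:

```sqlx
config {
  type: "table",
  schema: constants.SCHEMA_NAME
}

SELECT *
FROM `${constants.PROJECT_ID}.${constants.SCHEMA_NAME}.consumer_review_data`
```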
- In BigQuery Console, run:

  ```sql
  SELECT COUNT(*) FROM consumer_reviews_dataset.consumer_review_data
  ```

- Check Connection:
  - Go to "Connections"
  - Verify that the `vertex-ai` connection status is "Connected"
```bash
# Set environment variables
export PROJECT_ID="your-project-id"
export BUCKET_NAME="${PROJECT_ID}-consumer-reviews"

# Configure gcloud
gcloud config set project $PROJECT_ID

# Enable APIs
gcloud services enable bigquery.googleapis.com
gcloud services enable bigqueryconnection.googleapis.com
gcloud services enable storage.googleapis.com
gcloud services enable aiplatform.googleapis.com
```

```bash
# Create bucket
gsutil mb -l us-central1 gs://$BUCKET_NAME

# Upload data
gsutil cp genai_data_pipeline/data/consumer_review_data.parquet gs://$BUCKET_NAME/raw/reviews/
```

```bash
# Create dataset
bq mk --dataset \
  --location=US \
  ${PROJECT_ID}:consumer_reviews_dataset

# Load data
bq query --use_legacy_sql=false \
  "LOAD DATA INTO \`${PROJECT_ID}.consumer_reviews_dataset.consumer_review_data\`
  FROM FILES (
    format = 'PARQUET',
    uris = ['gs://${BUCKET_NAME}/raw/reviews/consumer_review_data.parquet']
  );"
```

```bash
# Create connection
bq mk --connection \
  --location=us-central1 \
  --project_id=${PROJECT_ID} \
  --connection_type=CLOUD_RESOURCE \
  vertex-ai

# Get the connection's service account
export CONNECTION_SA=$(bq show --format=prettyjson --connection ${PROJECT_ID}.us-central1.vertex-ai \
  | grep "serviceAccountId" | cut -d'"' -f4)

# Grant permissions
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${CONNECTION_SA}" \
  --role="roles/aiplatform.user"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${CONNECTION_SA}" \
  --role="roles/bigquery.admin"
```

```bash
# Get connection ID
export CONNECTION_ID=$(bq show --format=prettyjson --connection ${PROJECT_ID}.us-central1.vertex-ai \
  | grep '"name"' | cut -d'"' -f4)

# Update files (manual step)
echo "Update dataform.json and constants.js with your project details"
```

```bash
# Check data
bq query --use_legacy_sql=false \
  "SELECT COUNT(*) FROM ${PROJECT_ID}.consumer_reviews_dataset.consumer_review_data"

# Verify connection
bq show --connection ${PROJECT_ID}.us-central1.vertex-ai

# Test Vertex AI access
gcloud ai models list --region=us-central1
```

Review processing pipeline (sketched in SQL below):

- `incoming_reviews` - Data ingestion & validation
- `reviews_with_sentiment` - Sentiment analysis
- `reviews_with_embeddings` - Vector embedding generation
- `create_vector_index` - Similarity search indexing
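Under the hood, the embedding and indexing stages reduce to BigQuery SQL along these lines. This is a hedged sketch: `embedding_model` is assumed to be a remote model over the `vertex-ai` connection, and the column names are illustrative rather than the repository's exact definitions:

```sql
-- Sketch: generate review embeddings with a remote text-embedding model.
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `your-project-id.consumer_reviews_dataset.embedding_model`,
  (
    SELECT review_id, review_text AS content
    FROM `your-project-id.consumer_reviews_dataset.consumer_review_data`
  )
);

-- Sketch: index the embedding column for approximate similarity search.
CREATE VECTOR INDEX review_embedding_index
ON `your-project-id.consumer_reviews_dataset.reviews_with_embeddings`(ml_generate_embedding_result)
OPTIONS (index_type = 'IVF', distance_type = 'COSINE');
```

Queries can then call the `VECTOR_SEARCH` table function against the indexed column for semantic retrieval.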
Question analysis pipeline:

- `questions_with_embeddings` - Semantic embedding
- `questions_with_clusters` - K-means clustering
- `question_themes` - Theme generation
- `qa_with_evaluation` - Quality assessment
- `qa_with_product_type` - Product classification
- `qa_quality_data` - Analysis aggregation
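For instance, the `questions_with_clusters` step corresponds to k-means clustering in BigQuery ML, roughly as below. The cluster count and column names are assumptions for illustration, not the repository's exact settings:

```sql
-- Sketch: fit a k-means model on the question embeddings.
CREATE OR REPLACE MODEL `your-project-id.consumer_reviews_dataset.question_kmeans`
OPTIONS (model_type = 'KMEANS', num_clusters = 10) AS
SELECT ml_generate_embedding_result
FROM `your-project-id.consumer_reviews_dataset.questions_with_embeddings`;

-- Assign each question to its nearest cluster centroid.
SELECT *
FROM ML.PREDICT(
  MODEL `your-project-id.consumer_reviews_dataset.question_kmeans`,
  (SELECT * FROM `your-project-id.consumer_reviews_dataset.questions_with_embeddings`)
);
```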
Supporting operations:

- `process_reviews` - Review processing
- `quality_data_prep` - Question analysis
- `bqml_model` - Model operations
- `vector_index_creation` - Search setup
- `regenerate_question_themes` - Theme updates
Built-in data quality checks ensure:
- ✓ Key uniqueness
- ✓ Required field validation
- ✓ Row-level conditions
- ✓ Incremental processing
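In Dataform, these checks are declared directly in a SQLX config block. A minimal sketch follows; the column names (`review_id`, `review_text`, `load_time`) are illustrative assumptions:

```sqlx
config {
  type: "incremental",
  assertions: {
    uniqueKey: ["review_id"],                   // key uniqueness
    nonNull: ["review_id", "review_text"],      // required fields
    rowConditions: ["LENGTH(review_text) > 0"]  // row-level condition
  }
}

SELECT *
FROM ${ref("incoming_reviews")}
${when(incremental(), `WHERE load_time > (SELECT MAX(load_time) FROM ${self()})`)}
```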
- @dataform/core: 2.8.3
- Google Cloud Platform:
- BigQuery
- Vertex AI (Gemini)
- Cloud Storage
Common Issues & Solutions
```bash
# Verify IAM roles
gcloud projects get-iam-policy $PROJECT_ID \
  --flatten="bindings[].members" \
  --format='table(bindings.role)' \
  --filter="bindings.members:$(gcloud config get-value account)"

# Check API status
gcloud services list --enabled | grep -E "bigquery|aiplatform"

# Verify service account
gcloud iam service-accounts describe ${CONNECTION_SA}

# Check job status
bq show -j ${PROJECT_ID}:US.recent_job_id
```

This project uses Google Groups to manage Dataform access. Users need to be members of the Dataform users group to:
- Create and edit Dataform definitions
- Execute Dataform workflows
- View and query data in BigQuery
1. **Create a Google Group for Dataform users:**

   ```bash
   # Using Google Workspace Admin Console or gcloud
   gcloud identity groups create dataform-users@your-domain.com \
     --organization=your-org-id \
     --display-name="Dataform Users"
   ```

2. **Add members to the group:**

   ```bash
   gcloud identity groups memberships add \
     --group-email=dataform-users@your-domain.com \
     --member-email=user@your-domain.com
   ```

3. **Update `terraform.tfvars` with your group:**

   ```hcl
   dataform_users_group = "dataform-users@your-domain.com"
   ```

4. **Apply the Terraform configuration:**

   ```bash
   terraform apply
   ```
After setup, users can verify their access:
- Visit the Dataform UI
- Select your project and repository
- Try creating a new definition or running a workflow
If users can't access Dataform:
1. **Verify group membership:**

   ```bash
   gcloud identity groups memberships list \
     --group-email=dataform-users@your-domain.com
   ```

2. **Check IAM bindings:**

   ```bash
   gcloud projects get-iam-policy $PROJECT_ID \
     --flatten="bindings[].members" \
     --filter="bindings.role:dataform.developer"
   ```
