GenAI data pipeline that performs data preparation, management, and performance evaluation tasks for RAG systems, using SQL as the primary development language. Feel free to use this as a starting point for your own projects.

EddieAtGoogle/SQL-Based-GenAI-Data-Pipeline

Building GenAI Data Pipelines with SQL

🎯 Perfect for data teams who know SQL and want to build production-ready GenAI pipelines without Python expertise

🤖 Data analytics and engineering teams in enterprises are finding themselves responsible for GenAI data pipelines, often without extensive machine learning expertise. Many of these teams are most comfortable working in SQL, and traditional approaches would require significant upskilling in Python and ML frameworks. This repository bridges that gap by showing how to build sophisticated GenAI pipelines using familiar SQL-based tools.

Get Started in Minutes: Deploy a complete GenAI pipeline with just SQL and a few commands.

📚 What You'll Learn:

Build production-grade GenAI pipelines using SQL
🛡️ Implement best practices for data quality and security
📊 Monitor and evaluate GenAI model performance

Prerequisites: SQL knowledge
Not Required: Python or machine learning expertise

Built with Dataform · GCP Ready · Uses Gemini · SQL Only


📋 Overview

Transform your customer feedback data into actionable insights using state-of-the-art AI/ML techniques:

  • 🎯 Sentiment analysis using Gemini 1.5
  • 🔍 Vector embeddings for semantic search
  • 🔮 Intelligent question clustering
  • 📊 Automated theme identification
  • ⚡ Analytics on RAG system usage, quality, and performance

🏗️ Architecture

Architecture Diagram

Built on enterprise-grade Google Cloud technologies:

  • Dataform - Orchestration & transformation
  • BigQuery - Serverless data warehouse
  • Vertex AI - Machine learning operations
    • Gemini 1.5
    • Text Embedding API
    • Vector Search

Dataform Configuration

The project supports two options for Dataform repository configuration:

  1. Default Repository (Recommended for Learning)

    • Uses Dataform's built-in repository
    • Perfect for learning and experimentation
    • No additional Git setup required
    • Enabled by default
  2. Remote Git Repository (Optional)

    • For production or team collaboration
    • Requires a GitHub repository and personal access token
    • Enable by setting use_remote_git = true in terraform.tfvars

Choose the option that best suits your needs. For most learning scenarios, the default repository is recommended.

🚀 Prerequisites Checklist

Before starting the deployment, ensure you have the following prerequisites in place:

1. Google Cloud Environment

  • A Google Cloud project with billing enabled
  • Owner or Editor role on the project

2. Development Tools

  • Git (version >= 2.0)
  • Terraform (version >= 1.0)
  • Google Cloud SDK (version >= 440.0.0)

3. Required APIs

Run this command to enable necessary APIs:

gcloud services enable \
  secretmanager.googleapis.com \
  dataform.googleapis.com \
  bigquery.googleapis.com \
  artifactregistry.googleapis.com

🚀 Quick Start

  1. Clone the Repository

    git clone <repository-url>
    cd genai_data_pipeline
  2. Create Required Google Groups

    • Create three Google Groups in your workspace:
      • Dataform Users (e.g., dataform-users@your-domain.com)
      • Data Readers (e.g., data-readers@your-domain.com)
      • Data Owner (use an existing team email)
  3. Configure Terraform Variables

    cd terraform
    cp terraform.tfvars.example terraform.tfvars

    Edit terraform.tfvars and set:

    • Your project_id
    • Your Google Groups emails
    • Optionally customize other settings
  4. Initialize and Apply Terraform

    terraform init
    terraform plan
    terraform apply
  5. Verify Deployment

    • Visit the Dataform UI
    • Select your project and repository
    • Try creating a new definition

🚀 Terraform State Management

This project supports two options for managing Terraform state:

Local State (Default)

By default, Terraform will store state locally in your workspace. This is suitable for:

  • Individual learning and development
  • Quick prototyping
  • Local testing and experimentation

Remote State in Google Cloud Storage (Recommended for Teams)

For team environments or production deployments, we recommend using Google Cloud Storage (GCS) for state management. This provides:

  • 🤝 Team collaboration capabilities
  • 🔒 State locking to prevent concurrent modifications
  • 🔄 Version history and backup
  • 🛡️ Better security through Google Cloud IAM

To enable GCS state storage:

# 1. Create a GCS bucket for state storage
export PROJECT_ID="your-project-id"
gsutil mb -l us-central1 gs://${PROJECT_ID}-terraform-state

# 2. Enable versioning for state history
gsutil versioning set on gs://${PROJECT_ID}-terraform-state

# 3. Update backend configuration
# Uncomment and configure the backend block in terraform/backend.tf:
terraform {
  backend "gcs" {
    bucket = "YOUR_PROJECT_ID-terraform-state"
    prefix = "genai-pipeline"
  }
}

Note: For this educational project, local state is perfectly fine for getting started. Consider switching to GCS state storage when working in a team or moving to production.

🚀 Setup Guide

Choose your preferred setup path:

📱 Option A: Using Google Cloud Console (Recommended for Beginners)

Step 1: Initial Setup

  1. Access Google Cloud Console

  2. Enable Required APIs

    • Go to APIs & Services
    • Click "Enable APIs and Services"
    • Enable the following:
      • BigQuery API
      • BigQuery Connection API
      • Cloud Storage API
      • Vertex AI API

Step 2: Create Storage Bucket

  1. Navigate to Cloud Storage
  2. Click "Create Bucket"
    • Name: your-project-consumer-reviews
    • Location: us-central1
    • Default storage class: Standard
    • Access control: Uniform
  3. Click "Create"
  4. Upload Data:
    • Open your new bucket
    • Click "Upload Files"
    • Select the sample data file from genai_data_pipeline/data/consumer_review_data.parquet
    • Wait for completion

Step 3: Initialize BigQuery Dataset

  1. Open BigQuery Console
  2. Create Dataset:
    • Click your project name
    • Click "Create Dataset"
    • Dataset ID: consumer_reviews_dataset
    • Data location: US (multi-region)
    • Click "Create dataset"
  3. Load Data:
    • Click "Create Table"
    • Source: Select "Google Cloud Storage"
    • File format: Parquet
    • Source path: gs://your-project-consumer-reviews/consumer_review_data.parquet
    • Table name: consumer_review_data
    • Schema: Select "Auto detect"
    • Click "Create table"

Step 4: Configure Remote Connection

  1. Create Connection

    • In BigQuery, click "More" → "Connections"
    • Click "Create Connection"
    • Configure:
      Connection type: Cloud Resource
      Service: Vertex AI
      Connection ID: vertex-ai
      Location: us-central1
      
    • Click "Create"
  2. Set Up Permissions

    • Go to IAM & Admin
    • Find the service account created for the connection (copy it from the vertex-ai connection's details page in BigQuery)
    • Add roles:
      • Vertex AI User
      • BigQuery Admin
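
With the connection and its permissions in place, BigQuery can reach Gemini through a remote model. A minimal sketch of what this looks like in SQL (the model name and endpoint here are illustrative; the pipeline's Dataform definitions create the real models):

-- Create a remote model over the vertex-ai connection created above
-- (model name and endpoint are illustrative, not the repository's actual definitions)
CREATE OR REPLACE MODEL `your-project-id.consumer_reviews_dataset.gemini_model`
  REMOTE WITH CONNECTION `projects/your-project-id/locations/us-central1/connections/vertex-ai`
  OPTIONS (ENDPOINT = 'gemini-1.5-flash');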

Step 5: Update Configuration Files

  1. Edit dataform.json:

    {
      "defaultSchema": "consumer_reviews_dataset",
      "defaultDatabase": "your-project-id",
      "defaultLocation": "US"
    }
  2. Edit includes/constants.js:

    const PROJECT_ID = "your-project-id";
    const SCHEMA_NAME = "consumer_reviews_dataset";
    const REMOTE_CONNECTION = "projects/your-project-id/locations/us-central1/connections/vertex-ai";

Step 6: Verify Setup

  1. In BigQuery Console:
    • Run: SELECT COUNT(*) FROM consumer_reviews_dataset.consumer_review_data
  2. Check Connection:
    • Go to "Connections"
    • Verify vertex-ai status is "Connected"

💻 Option B: Using Command Line

Step 1: Initial Setup

# Set environment variables
export PROJECT_ID="your-project-id"
export BUCKET_NAME="${PROJECT_ID}-consumer-reviews"

# Configure gcloud
gcloud config set project $PROJECT_ID

# Enable APIs
gcloud services enable bigquery.googleapis.com
gcloud services enable bigqueryconnection.googleapis.com
gcloud services enable storage.googleapis.com
gcloud services enable aiplatform.googleapis.com

Step 2: Create Storage Bucket

# Create bucket
gsutil mb -l us-central1 gs://$BUCKET_NAME

# Upload data
gsutil cp genai_data_pipeline/data/consumer_review_data.parquet gs://$BUCKET_NAME/

Step 3: Initialize BigQuery Dataset

# Create dataset
bq mk --dataset \
  --location=US \
  ${PROJECT_ID}:consumer_reviews_dataset

# Load data
bq query --use_legacy_sql=false \
  "LOAD DATA INTO \`${PROJECT_ID}.consumer_reviews_dataset.consumer_review_data\`
   FROM FILES (
     format = 'PARQUET',
     uris = ['gs://${BUCKET_NAME}/consumer_review_data.parquet']
   );"

Step 4: Configure Remote Connection

# Create connection
bq mk --connection \
  --location=us-central1 \
  --project_id=${PROJECT_ID} \
  --connection_type=CLOUD_RESOURCE \
  vertex-ai

# Get service account
export CONNECTION_SA=$(bq show --format=prettyjson --connection ${PROJECT_ID}.us-central1.vertex-ai \
  | grep "serviceAccountId" | cut -d'"' -f4)

# Grant permissions
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${CONNECTION_SA}" \
  --role="roles/aiplatform.user"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${CONNECTION_SA}" \
  --role="roles/bigquery.admin"

Step 5: Update Configuration Files

# Get connection ID
export CONNECTION_ID=$(bq show --format=prettyjson --connection ${PROJECT_ID}.us-central1.vertex-ai \
  | grep '"name"' | cut -d'"' -f4)

# Update files (manual step)
echo "Update dataform.json and constants.js with your project details"

Step 6: Verify Setup

# Check data
bq query --use_legacy_sql=false \
  "SELECT COUNT(*) FROM ${PROJECT_ID}.consumer_reviews_dataset.consumer_review_data"

# Verify connection
bq show --connection ${PROJECT_ID}.us-central1.vertex-ai

# Test Vertex AI access
gcloud ai models list --region=us-central1
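
As an optional end-to-end smoke test, you can call Gemini directly from SQL, assuming you have created a remote model such as the gemini_model sketched in Option A, Step 4:

# Confirm BigQuery can reach Vertex AI through the connection
# (gemini_model is the illustrative remote model from the earlier sketch)
bq query --use_legacy_sql=false \
  "SELECT ml_generate_text_llm_result
   FROM ML.GENERATE_TEXT(
     MODEL \`${PROJECT_ID}.consumer_reviews_dataset.gemini_model\`,
     (SELECT 'Reply with the single word: ok' AS prompt),
     STRUCT(TRUE AS flatten_json_output))"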

🔄 Pipeline Components

Review Processing

  1. incoming_reviews - Data ingestion & validation
  2. reviews_with_sentiment - Sentiment analysis
  3. reviews_with_embeddings - Vector embedding generation
  4. create_vector_index - Similarity search indexing
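
Each of these steps is ordinary BigQuery SQL. A condensed sketch, assuming remote models named gemini_model and embedding_model over the vertex-ai connection and a review_id/review_text schema (all assumptions; the repository's SQLX files define the real, incremental versions):

-- reviews_with_sentiment: classify each review with Gemini
SELECT
  review_id,
  review_text,
  ml_generate_text_llm_result AS sentiment
FROM ML.GENERATE_TEXT(
  MODEL `your-project-id.consumer_reviews_dataset.gemini_model`,
  (SELECT review_id, review_text,
          CONCAT('Classify the sentiment of this review as positive, negative, or neutral: ',
                 review_text) AS prompt
   FROM `your-project-id.consumer_reviews_dataset.consumer_review_data`),
  STRUCT(0.0 AS temperature, TRUE AS flatten_json_output));

-- reviews_with_embeddings: generate a vector per review
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `your-project-id.consumer_reviews_dataset.embedding_model`,
  (SELECT review_id, review_text AS content
   FROM `your-project-id.consumer_reviews_dataset.consumer_review_data`),
  STRUCT(TRUE AS flatten_json_output));

-- create_vector_index: index the embeddings for similarity search
CREATE VECTOR INDEX review_embedding_index
ON `your-project-id.consumer_reviews_dataset.reviews_with_embeddings`(ml_generate_embedding_result)
OPTIONS (index_type = 'IVF', distance_type = 'COSINE');

With the index in place, semantic search is a single VECTOR_SEARCH query:

SELECT base.review_text, distance
FROM VECTOR_SEARCH(
  TABLE `your-project-id.consumer_reviews_dataset.reviews_with_embeddings`,
  'ml_generate_embedding_result',
  (SELECT ml_generate_embedding_result
   FROM ML.GENERATE_EMBEDDING(
     MODEL `your-project-id.consumer_reviews_dataset.embedding_model`,
     (SELECT 'battery drains too fast' AS content),
     STRUCT(TRUE AS flatten_json_output))),
  top_k => 5,
  distance_type => 'COSINE');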

Question Analysis

  1. questions_with_embeddings - Semantic embedding
  2. questions_with_clusters - K-means clustering
  3. question_themes - Theme generation
  4. qa_with_evaluation - Quality assessment
  5. qa_with_product_type - Product classification
  6. qa_quality_data - Analysis aggregation
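
Clustering stays in SQL as well, via BigQuery ML's k-means. A sketch of the clustering steps, assuming questions_with_embeddings carries an ml_generate_embedding_result column and a question_id/question_text schema (the cluster count is illustrative):

-- bqml_model: train k-means on the question embeddings
CREATE OR REPLACE MODEL `your-project-id.consumer_reviews_dataset.question_cluster_model`
OPTIONS (
  model_type = 'KMEANS',
  num_clusters = 10,               -- illustrative; tune for your data
  kmeans_init_method = 'KMEANS++'
) AS
SELECT ml_generate_embedding_result
FROM `your-project-id.consumer_reviews_dataset.questions_with_embeddings`;

-- questions_with_clusters: assign each question to its nearest centroid
SELECT question_id, question_text, centroid_id AS cluster_id
FROM ML.PREDICT(
  MODEL `your-project-id.consumer_reviews_dataset.question_cluster_model`,
  TABLE `your-project-id.consumer_reviews_dataset.questions_with_embeddings`);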

🏷️ Tags

  • process_reviews - Review processing
  • quality_data_prep - Question analysis
  • bqml_model - Model operations
  • vector_index_creation - Search setup
  • regenerate_question_themes - Theme updates

✅ Data Quality

Built-in data quality checks ensure:

  • ✓ Key uniqueness
  • ✓ Required field validation
  • ✓ Row-level conditions
  • ✓ Incremental processing
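
In Dataform these checks are declared as assertions in each table's SQLX config block, and each assertion compiles to an ordinary query that must return zero rows. The key-uniqueness check, for example, is equivalent to the following (review_id is an assumed key column):

-- Assertion passes when no key appears more than once
SELECT review_id, COUNT(*) AS duplicate_count
FROM `your-project-id.consumer_reviews_dataset.consumer_review_data`
GROUP BY review_id
HAVING COUNT(*) > 1;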

📦 Dependencies

  • @dataform/core: 2.8.3
  • Google Cloud Platform:
    • BigQuery
    • Vertex AI (Gemini)
    • Cloud Storage

🆘 Troubleshooting

Common Issues & Solutions

Permission Errors

# Verify IAM roles
gcloud projects get-iam-policy $PROJECT_ID \
  --flatten="bindings[].members" \
  --format='table(bindings.role)' \
  --filter="bindings.members:$(gcloud config get-value account)"

Connection Issues

# Check API status
gcloud services list --enabled | grep -E "bigquery|aiplatform"

# Verify service account
gcloud iam service-accounts describe ${CONNECTION_SA}

Data Loading Issues

# Check job status (replace RECENT_JOB_ID with an ID from `bq ls -j`)
bq show -j ${PROJECT_ID}:US.RECENT_JOB_ID

🔑 Access Management

Dataform User Access

This project uses Google Groups to manage Dataform access. Users must be members of the Dataform Users group to:

  • Create and edit Dataform definitions
  • Execute Dataform workflows
  • View and query data in BigQuery

Setting up Dataform Access

  1. Create a Google Group for Dataform users:

    # Using Google Workspace Admin Console or gcloud
    gcloud identity groups create dataform-users@your-domain.com \
      --organization=your-org-id \
      --display-name="Dataform Users"
  2. Add members to the group:

    gcloud identity groups memberships add \
      --group-email=dataform-users@your-domain.com \
      --member-email=user@your-domain.com
  3. Update terraform.tfvars with your group:

    dataform_users_group = "dataform-users@your-domain.com"
  4. Apply the Terraform configuration:

    terraform apply

Verifying Access

After setup, users can verify their access:

  1. Visit the Dataform UI
  2. Select your project and repository
  3. Try creating a new definition or running a workflow

Troubleshooting Access Issues

If users can't access Dataform:

  1. Verify group membership:

    gcloud identity groups memberships list \
      --group-email=dataform-users@your-domain.com
  2. Check IAM bindings:

    gcloud projects get-iam-policy $PROJECT_ID \
      --flatten="bindings[].members" \
      --filter="bindings.role:dataform"
