This project is an archived reference implementation of a local-first data pipeline development platform with AI assistance and automated production promotion.
This repository contains the source code and documentation for the AI-Driven Analytics Engineering Platform. The project's goal was to enable data engineers to develop, test, and deploy data transformation pipelines locally using DuckDB, and then promote validated changes to production environments like BigQuery or Snowflake.
The core workflow of the platform was:
Natural Language → AI-Generated dbt Models → Local Testing → Human Validation → Production Deployment
All the original documentation, specifications, and source code have been moved to the archive directory for reference.
The AI-Driven Analytics Engineering Platform enabled data engineers to develop, test, and deploy data transformation pipelines locally using DuckDB, then promote validated changes to production BigQuery/Snowflake with full infrastructure as code.
- Local-First Development: Develop and test on DuckDB for fast, cost-effective iteration.
- AI-Assisted Code Generation: Convert natural language descriptions into dbt models, tests, and documentation.
- User-in-the-Loop Validation: Human review checkpoints at every critical stage.
- Infrastructure as Code: Automated Terraform configuration generation.
- Automated Production Promotion: One-command promotion from local to production.
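The generate → test → review → promote flow above can be sketched as a small state machine in which promotion is unreachable without an explicit human sign-off. The stage names and `advance` helper below are illustrative only, not the platform's actual API:

```python
from enum import Enum, auto

class Stage(Enum):
    GENERATED = auto()       # AI produced the dbt model
    LOCALLY_TESTED = auto()  # dbt run/test passed on DuckDB
    APPROVED = auto()        # human reviewer signed off
    PROMOTED = auto()        # deployed to BigQuery/Snowflake

# Legal transitions: there is no path to PROMOTED that skips APPROVED.
TRANSITIONS = {
    Stage.GENERATED: {Stage.LOCALLY_TESTED},
    Stage.LOCALLY_TESTED: {Stage.APPROVED, Stage.GENERATED},  # feedback loops back
    Stage.APPROVED: {Stage.PROMOTED},
    Stage.PROMOTED: set(),
}

def advance(current: Stage, target: Stage) -> Stage:
    """Move to the next stage, rejecting any transition not in the table."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target

stage = Stage.GENERATED
stage = advance(stage, Stage.LOCALLY_TESTED)
stage = advance(stage, Stage.APPROVED)
stage = advance(stage, Stage.PROMOTED)
```

Modeling the checkpoints as data (a transition table) rather than ad-hoc `if`s makes the "human review at every critical stage" guarantee auditable in one place.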
- `/archive`: Contains all the original source code, documentation, scripts, and specifications.
- `README.md`: This file.
- `.gitignore`: Git ignore file.
Topics: data-engineering, ai, dbt, duckdb, bigquery, snowflake, iac, local-first, analytics-engineering, archived
```
services/analytics-engineering/
├── ai-engines/              # Python AI processing
│   ├── clients/             # Claude API integration
│   ├── dbt_generation/      # AI model generation
│   ├── validation/          # Code validation
│   ├── lineage/             # Data lineage analysis
│   └── deployment/          # IaC generation
│
├── orchestrator/            # TypeScript coordination
│   ├── src/
│   │   ├── agents/          # Multi-agent coordination
│   │   ├── dbt-interface/   # dbt Core integration
│   │   ├── duckdb-manager/  # Local database
│   │   └── promotion/       # Production pipeline
│   └── tests/
│
├── local-environment/       # Development environment
│   ├── duckdb/              # Local databases
│   ├── dbt-project/         # dbt project
│   └── sample-data/         # Test datasets
│
├── infrastructure/          # Infrastructure as Code
│   ├── terraform/           # Cloud infrastructure
│   ├── dbt-profiles/        # Environment configs
│   └── ci-cd/               # GitHub Actions
│
└── tools/                   # Utilities
    ├── sample-data/         # Data generation
    └── cli/                 # Command-line tools
```
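The tree separates a TypeScript orchestrator from Python ai-engines, which implies a cross-process boundary between the two. The real interface is not documented here; purely as an illustration, a JSON-over-stdio exchange between an orchestrator and a stub engine could look like:

```python
import json
import subprocess
import sys

# Hypothetical engine: reads a JSON request on stdin, writes a JSON reply on stdout.
ENGINE = r'''
import json, sys
req = json.load(sys.stdin)
print(json.dumps({"model_name": req["prompt"].lower().replace(" ", "_"),
                  "status": "generated"}))
'''

def call_engine(prompt: str) -> dict:
    """Send one generation request to the (stub) engine subprocess."""
    proc = subprocess.run(
        [sys.executable, "-c", ENGINE],
        input=json.dumps({"prompt": prompt}),
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)

reply = call_engine("Monthly Active Users")
print(reply["model_name"])  # monthly_active_users
```

A line-delimited JSON protocol like this keeps the language boundary thin and testable; the production code could equally use HTTP or a message queue.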
- DuckDB: Local development database
- dbt Core: Data transformation framework
- Claude API: AI model generation
- TypeScript 5.x: Orchestration layer
- Python 3.11+: AI engines
- BigQuery/Snowflake: Production data warehouses
- Terraform: Infrastructure as code
- Airflow: Orchestration (optional)
- dbt Cloud: Managed dbt (optional)
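The dbt profiles shown later in this README template credentials with `{{ env_var('…') }}` placeholders. As a rough sketch of what that substitution does (this is an approximation for illustration, not dbt's actual implementation):

```python
import os
import re

def render_env_vars(text: str) -> str:
    """Replace {{ env_var('NAME') }} placeholders with values from os.environ."""
    pattern = re.compile(r"\{\{\s*env_var\(\s*'([^']+)'\s*\)\s*\}\}")
    return pattern.sub(lambda m: os.environ[m.group(1)], text)

# Demo value only; in practice this comes from the shell or a secrets manager.
os.environ["BIGQUERY_PROJECT"] = "demo-project"
print(render_env_vars('project: "{{ env_var(\'BIGQUERY_PROJECT\') }}"'))
# project: "demo-project"
```

Keeping secrets out of `profiles.yml` this way is what lets the same file be committed for local, BigQuery, and Snowflake targets.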
Generate a dbt model from a natural-language prompt:

```bash
npm run generate:dbt "Calculate monthly active users by cohort"
```

```sql
-- AI generates dbt model
{{ config(materialized='table') }}

WITH user_activity AS (
    SELECT
        user_id,
        DATE_TRUNC('month', activity_date) AS activity_month,
        DATE_TRUNC('month', first_seen_date) AS cohort_month
    FROM {{ ref('user_events') }}
    WHERE event_type = 'active'
)

SELECT
    cohort_month,
    activity_month,
    COUNT(DISTINCT user_id) AS active_users
FROM user_activity
GROUP BY 1, 2
```

Test and review locally:

```bash
# Run on local DuckDB
npm run dbt:run --target local

# Validate results
npm run dbt:test --target local
```

```bash
npm run review:feedback "Add cohort retention rate calculation"
# AI regenerates with improvements
```

Promote to production:

```bash
# Generate infrastructure configs
npm run generate:iac

# Deploy to production
npm run promote:prod

# Monitor deployment
npm run status
```

Environment variables:

```env
# AI Configuration
ANTHROPIC_API_KEY=sk-ant-...
AI_MODEL=claude-3-5-sonnet-20241022

# Local Development
DUCKDB_PATH=./local-environment/duckdb/analytics.db
DBT_PROFILES_DIR=./local-environment/dbt-project

# Production (BigQuery)
BIGQUERY_PROJECT=your-project-id
BIGQUERY_DATASET=analytics
GOOGLE_APPLICATION_CREDENTIALS=./credentials.json

# Production (Snowflake)
SNOWFLAKE_ACCOUNT=your-account
SNOWFLAKE_DATABASE=ANALYTICS
SNOWFLAKE_WAREHOUSE=COMPUTE_WH
```

dbt profiles:

```yaml
analytics_platform:
  target: local
  outputs:
    local:
      type: duckdb
      path: ./local-environment/duckdb/analytics.db
    production_bq:
      type: bigquery
      project: "{{ env_var('BIGQUERY_PROJECT') }}"
      dataset: analytics
      method: service-account
      keyfile: "{{ env_var('GOOGLE_APPLICATION_CREDENTIALS') }}"
    production_sf:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      database: ANALYTICS
      warehouse: COMPUTE_WH
      schema: PUBLIC
```

- AI Analytics Pipeline Overview - Comprehensive project documentation
- Feature Specification - User stories and requirements
- Implementation Plan - Technical architecture
- Data Model - Entity relationships
- Quickstart Guide - Step-by-step tutorial
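The AI-generated cohort model shown earlier can be sanity-checked without a database. As an illustration, here is a pure-Python equivalent of its aggregation, with invented sample rows and the `event_type` filter omitted for brevity:

```python
from collections import defaultdict
from datetime import date

# (user_id, activity_date, first_seen_date) rows, mirroring user_events
events = [
    ("u1", date(2024, 1, 5),  date(2024, 1, 5)),
    ("u1", date(2024, 2, 10), date(2024, 1, 5)),
    ("u2", date(2024, 2, 20), date(2024, 2, 1)),
    ("u2", date(2024, 2, 25), date(2024, 2, 1)),  # same user, same month: counted once
]

def month(d: date) -> str:
    """Analogue of DATE_TRUNC('month', ...)."""
    return d.strftime("%Y-%m")

# COUNT(DISTINCT user_id) grouped by (cohort_month, activity_month)
groups: dict[tuple[str, str], set[str]] = defaultdict(set)
for user_id, activity, first_seen in events:
    groups[(month(first_seen), month(activity))].add(user_id)

active_users = {k: len(v) for k, v in sorted(groups.items())}
print(active_users)
# {('2024-01', '2024-01'): 1, ('2024-01', '2024-02'): 1, ('2024-02', '2024-02'): 1}
```

Checking a tiny hand-computable input like this against the model's DuckDB output is a quick way to validate AI-generated SQL before approving it.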
| Metric | Target | Status |
|---|---|---|
| Model generation time | <5 minutes | 🚧 In Progress |
| Local-to-prod fidelity | 95%+ | ⏳ Planned |
| Human review time | <10 minutes | ⏳ Planned |
| Production promotion | <15 minutes | ⏳ Planned |
| AI code pass rate | 90%+ | ⏳ Planned |
| IaC generation | 100% automated | ⏳ Planned |
| Cycle time reduction | 60%+ | ⏳ Planned |
This project follows SPEC-KIT methodology:
- Create specification (spec.md)
- Design implementation (plan.md)
- Break down tasks (tasks.md)
- Implement with tests
- Validate against success criteria
Why DuckDB:
- Embedded analytics database (no server required)
- Handles 1GB-100GB datasets efficiently
- SQL dialect similar to BigQuery/Snowflake
- Perfect for local development and testing

Why AI-assisted generation:
- Dramatically reduces time from idea to working code
- Generates tests and documentation automatically
- Learns from feedback to improve over time
- Handles repetitive boilerplate work

Why human-in-the-loop validation:
- Ensures business logic accuracy
- Builds trust in AI-generated code
- Enables gradual adoption
- Provides safety net before production
- ✅ Project structure and architecture
- 🚧 AI-powered dbt model generation
- 🚧 Local DuckDB management
- ⏳ Validation gates
- ⏳ Infrastructure as code generation
- ⏳ Deployment pipeline
- ⏳ Rollback capabilities
- ⏳ Performance recommendations
- ⏳ Cost optimization
- ⏳ Data quality improvements
- ⏳ Airflow integration
- ⏳ dlt pipelines
- ⏳ dbt Cloud compatibility
License: MIT
For questions or issues, see project documentation or create an issue.