matt-strautmann/agentic-builder

This repository was archived by the owner on Nov 30, 2025. It is now read-only.
AI-Driven Analytics Engineering Platform (Archived)

This project is an archived reference implementation of a local-first data pipeline development platform with AI assistance and automated production promotion.


Repository Overview

This repository contains the source code and documentation for the AI-Driven Analytics Engineering Platform. The project's goal was to enable data engineers to develop, test, and deploy data transformation pipelines locally using DuckDB, and then promote validated changes to production environments like BigQuery or Snowflake.

The core workflow of the platform was:

Natural Language → AI-Generated dbt Models → Local Testing → Human Validation → Production Deployment
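The pipeline above can be sketched as a small driver that pushes a run through each gate in order and stops at the first failure. This is an illustrative sketch only; none of these function or class names correspond to actual modules in this repository:

```python
# Hypothetical sketch of the five-stage workflow above. Stage names
# mirror the pipeline; the callables are injected so the sketch stays
# independent of any real AI or dbt integration.
from dataclasses import dataclass, field

@dataclass
class PipelineRun:
    description: str                 # natural-language request
    model_sql: str = ""              # AI-generated dbt model
    tests_passed: bool = False       # local DuckDB test result
    approved: bool = False           # human validation checkpoint
    stages: list = field(default_factory=list)

def run_pipeline(run: PipelineRun, generate, test, review, deploy) -> PipelineRun:
    """Drive a run through the four gates in order, stopping at the
    first gate that fails."""
    run.model_sql = generate(run.description)
    run.stages.append("generated")

    run.tests_passed = test(run.model_sql)
    run.stages.append("tested")
    if not run.tests_passed:
        return run                   # stop before human review

    run.approved = review(run.model_sql)
    run.stages.append("reviewed")
    if not run.approved:
        return run                   # human rejected: no deploy

    deploy(run.model_sql)
    run.stages.append("deployed")
    return run
```

The key property of the design is that a failed test or a human rejection halts the run before anything reaches production.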

All the original documentation, specifications, and source code have been moved to the /archive directory for reference.

Original Project Description

The AI-Driven Analytics Engineering Platform enabled data engineers to develop, test, and deploy data transformation pipelines locally using DuckDB, then promote validated changes to production BigQuery/Snowflake with full infrastructure as code.

Key Features

  • Local-First Development: Develop and test on DuckDB for fast, cost-effective iteration.
  • AI-Assisted Code Generation: Convert natural language descriptions into dbt models, tests, and documentation.
  • User-in-the-Loop Validation: Human review checkpoints at every critical stage.
  • Infrastructure as Code: Automated Terraform configuration generation.
  • Automated Production Promotion: One-command promotion from local to production.

Repository Structure

  • /archive: Contains all the original source code, documentation, scripts, and specifications.
  • README.md: This file.
  • .gitignore: Git ignore file.

Suggested Tags

data-engineering, ai, dbt, duckdb, bigquery, snowflake, iac, local-first, analytics-engineering, archived

Project Structure

services/analytics-engineering/
├── ai-engines/              # Python AI processing
│   ├── clients/             # Claude API integration
│   ├── dbt_generation/      # AI model generation
│   ├── validation/          # Code validation
│   ├── lineage/             # Data lineage analysis
│   └── deployment/          # IaC generation
│
├── orchestrator/            # TypeScript coordination
│   ├── src/
│   │   ├── agents/          # Multi-agent coordination
│   │   ├── dbt-interface/   # dbt Core integration
│   │   ├── duckdb-manager/  # Local database
│   │   └── promotion/       # Production pipeline
│   └── tests/
│
├── local-environment/       # Development environment
│   ├── duckdb/              # Local databases
│   ├── dbt-project/         # dbt project
│   └── sample-data/         # Test datasets
│
├── infrastructure/          # Infrastructure as Code
│   ├── terraform/           # Cloud infrastructure
│   ├── dbt-profiles/        # Environment configs
│   └── ci-cd/               # GitHub Actions
│
└── tools/                   # Utilities
    ├── sample-data/         # Data generation
    └── cli/                 # Command-line tools

Technology Stack

Core

  • DuckDB: Local development database
  • dbt Core: Data transformation framework
  • Claude API: AI model generation
  • TypeScript 5.x: Orchestration layer
  • Python 3.11+: AI engines

Production

  • BigQuery/Snowflake: Production data warehouses
  • Terraform: Infrastructure as code
  • Airflow: Orchestration (optional)
  • dbt Cloud: Managed dbt (optional)

Development Workflow

1. Describe Transformation

npm run generate:dbt -- "Calculate monthly active users by cohort"

2. Review AI-Generated Code

-- AI generates dbt model
{{ config(materialized='table') }}

WITH user_activity AS (
  SELECT
    user_id,
    DATE_TRUNC('month', activity_date) AS activity_month,
    DATE_TRUNC('month', first_seen_date) AS cohort_month
  FROM {{ ref('user_events') }}
  WHERE event_type = 'active'
)

SELECT
  cohort_month,
  activity_month,
  COUNT(DISTINCT user_id) AS active_users
FROM user_activity
GROUP BY 1, 2
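The generator was also meant to emit tests and documentation alongside each model. A plausible companion schema.yml for the model above might look like the following; the column names come from the SQL, but the model name and the specific test choices are illustrative, not actual generator output:

```yaml
# Illustrative schema.yml for the generated model above.
# Model name and tests are assumptions, not real repository output.
version: 2

models:
  - name: monthly_active_users
    description: "Monthly active users broken down by signup cohort."
    columns:
      - name: cohort_month
        description: "Month the user was first seen."
        tests:
          - not_null
      - name: activity_month
        description: "Month of the activity being counted."
        tests:
          - not_null
      - name: active_users
        description: "Distinct active users in the cohort/month cell."
        tests:
          - not_null
```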

3. Test Locally

# Run on local DuckDB (note the "--" so npm forwards the flag to the script)
npm run dbt:run -- --target local

# Validate results
npm run dbt:test -- --target local

4. Provide Feedback (if needed)

npm run review:feedback -- "Add cohort retention rate calculation"
# AI regenerates with improvements

5. Promote to Production

# Generate infrastructure configs
npm run generate:iac

# Deploy to production
npm run promote:prod

# Monitor deployment
npm run status

Configuration

Environment Variables

# AI Configuration
ANTHROPIC_API_KEY=sk-ant-...
AI_MODEL=claude-3-5-sonnet-20241022

# Local Development
DUCKDB_PATH=./local-environment/duckdb/analytics.db
DBT_PROFILES_DIR=./local-environment/dbt-project

# Production (BigQuery)
BIGQUERY_PROJECT=your-project-id
BIGQUERY_DATASET=analytics
GOOGLE_APPLICATION_CREDENTIALS=./credentials.json

# Production (Snowflake)
SNOWFLAKE_ACCOUNT=your-account
SNOWFLAKE_DATABASE=ANALYTICS
SNOWFLAKE_WAREHOUSE=COMPUTE_WH
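A small loader can validate these variables up front and fail fast when a required key is missing. The variable names and defaults below match the list above; the loader itself is an illustrative sketch, not code from this repository:

```python
# Sketch: collect the environment variables listed above into one
# config dict, failing fast when a required key is absent.
import os

REQUIRED = ["ANTHROPIC_API_KEY", "DUCKDB_PATH"]
OPTIONAL_DEFAULTS = {
    "AI_MODEL": "claude-3-5-sonnet-20241022",
    "DBT_PROFILES_DIR": "./local-environment/dbt-project",
}

def load_config(env=os.environ) -> dict:
    missing = [k for k in REQUIRED if k not in env]
    if missing:
        raise RuntimeError(f"missing env vars: {', '.join(missing)}")
    cfg = {k: env[k] for k in REQUIRED}
    for key, default in OPTIONAL_DEFAULTS.items():
        cfg[key] = env.get(key, default)
    return cfg
```

The warehouse-specific variables (BigQuery vs. Snowflake) would be validated the same way once a production target is chosen.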

dbt Profiles

analytics_platform:
  target: local

  outputs:
    local:
      type: duckdb
      path: ./local-environment/duckdb/analytics.db

    production_bq:
      type: bigquery
      project: "{{ env_var('BIGQUERY_PROJECT') }}"
      dataset: analytics
      method: service-account
      keyfile: "{{ env_var('GOOGLE_APPLICATION_CREDENTIALS') }}"

    production_sf:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      database: ANALYTICS
      warehouse: COMPUTE_WH
      schema: PUBLIC

Documentation

All detailed documentation has been moved to the /archive directory.

Success Criteria

Metric                  Target            Status
Model generation time   <5 minutes        🚧 In Progress
Local-to-prod fidelity  95%+              ⏳ Planned
Human review time       <10 minutes       ⏳ Planned
Production promotion    <15 minutes       ⏳ Planned
AI code pass rate       90%+              ⏳ Planned
IaC generation          100% automated    ⏳ Planned
Cycle time reduction    60%+              ⏳ Planned

Contributing

This project follows SPEC-KIT methodology:

  1. Create specification (spec.md)
  2. Design implementation (plan.md)
  3. Break down tasks (tasks.md)
  4. Implement with tests
  5. Validate against success criteria

Architecture Highlights

Why DuckDB?

  • Embedded analytics database (no server required)
  • Handles 1GB-100GB datasets efficiently
  • SQL dialect similar to BigQuery/Snowflake
  • Perfect for local development and testing

Why AI-Assisted?

  • Dramatically reduces time from idea to working code
  • Generates tests and documentation automatically
  • Learns from feedback to improve over time
  • Handles repetitive boilerplate work

Why User-in-the-Loop?

  • Ensures business logic accuracy
  • Builds trust in AI-generated code
  • Enables gradual adoption
  • Provides safety net before production

Roadmap

Phase 1: Local Development (Current)

  • ✅ Project structure and architecture
  • 🚧 AI-powered dbt model generation
  • 🚧 Local DuckDB management
  • ⏳ Validation gates

Phase 2: Production Promotion

  • ⏳ Infrastructure as code generation
  • ⏳ Deployment pipeline
  • ⏳ Rollback capabilities

Phase 3: Optimization

  • ⏳ Performance recommendations
  • ⏳ Cost optimization
  • ⏳ Data quality improvements

Phase 4: Modern Stack Integration

  • ⏳ Airflow integration
  • ⏳ dlt pipelines
  • ⏳ dbt Cloud compatibility

License

MIT

Support

For questions, see the documentation in the /archive directory. The repository is archived and read-only, so new issues cannot be opened.
