Orchestration framework for AI pipelines with built-in evaluation
Loom is the "dbt for AI(E)TL" - a declarative orchestration framework for AI pipelines.
Traditional ETL becomes AI(E)TL: Extract → Transform → Evaluate → Load
Declarative YAML pipelines with built-in quality gates ensure your AI outputs meet quality thresholds before reaching production.
Core Value: Production-grade AI pipeline orchestration without complexity, vendor lock-in, or hidden evaluation gaps.
Status: Alpha software (v0.1.0-alpha). Functional but early-stage. Best suited for evaluation, experimentation, and development.
The Problem: Building production AI pipelines requires orchestration AND evaluation. Existing tools do one or the other, not both.
What Loom Provides:
- Declarative Pipelines: Define AI workflows as version-controlled YAML
- Built-in Evaluation: Quality gates using Arbiter prevent bad outputs from reaching production
- Provider-Agnostic: Works with OpenAI, Anthropic, Google, Groq - no vendor lock-in
- Production-Ready: Circuit breakers, retry logic, timeout enforcement
Use Case Example: A sentiment analysis pipeline needs quality assurance. Loom provides:
- Declarative YAML pipeline definition (Extract → Transform → Evaluate → Load)
- Automatic evaluation with configurable quality gates
- Quarantine pattern for failed records
- Complete audit trail of transformations and evaluations
```yaml
# pipelines/customer_sentiment.yaml
name: customer_sentiment
version: 2.1.0

extract:
  source: postgres://customers/reviews

transform:
  prompt: prompts/classify_sentiment.txt
  model: gpt-4o-mini
  batch_size: 50

evaluate:
  evaluators:
    - type: semantic
      threshold: 0.8
    - type: custom_criteria
      criteria: "Accurate, no hallucination"
      threshold: 0.75
  quality_gate: all_pass

load:
  destination: postgres://analytics/sentiment_scores
```

Run it:

```bash
loom run customer_sentiment
```

- ✅ Declarative Pipelines: YAML-based pipeline definitions (Extract, Transform, Evaluate, Load)
- ✅ Built-in Evaluation: Arbiter integration with quality gates (all_pass, majority_pass, any_pass, weighted)
- ✅ Provider-Agnostic LLMs: OpenAI, Anthropic, Google, Groq support
- ✅ Multiple Data Formats: CSV, JSON, JSONL, Parquet support
- ✅ Quality Gates: Four gate types with precise mathematical definitions (see the first sketch after this list)
- ✅ Circuit Breaker Pattern: Production resilience for LLM calls (see the second sketch after this list)
- ✅ Quarantine Pattern: Failed records logged with failure reasons for investigation
- ✅ CLI Interface: `loom run` and `loom validate` commands
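
The four gate types have simple set-based meanings. Below is a minimal sketch, assuming each evaluator produces a score that is compared against its own threshold; the `EvalResult` dataclass, the `gate_passes` function, and the weighted gate's 0.8 global threshold are illustrative assumptions, not Loom's actual API.

```python
# Illustrative only: one possible reading of the four quality gate types.
# Names, signatures, and thresholds are hypothetical, not Loom's API.
from dataclasses import dataclass


@dataclass
class EvalResult:
    score: float       # evaluator score in [0, 1]
    threshold: float   # per-evaluator threshold from the pipeline YAML
    weight: float = 1.0

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def gate_passes(results: list[EvalResult], gate: str) -> bool:
    """Decide whether a record clears the configured quality gate."""
    if gate == "all_pass":       # every evaluator must pass
        return all(r.passed for r in results)
    if gate == "any_pass":       # at least one evaluator must pass
        return any(r.passed for r in results)
    if gate == "majority_pass":  # more than half of the evaluators must pass
        return sum(r.passed for r in results) > len(results) / 2
    if gate == "weighted":       # weighted mean score must clear a global bar
        total = sum(r.weight for r in results)
        mean = sum(r.score * r.weight for r in results) / total
        return mean >= 0.8       # hypothetical global threshold
    raise ValueError(f"unknown quality gate: {gate}")
```

Under this reading, the `all_pass` gate in the example pipeline above loads a record only when both the semantic and custom_criteria evaluators clear their thresholds.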
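
The README does not spell out how the circuit breaker is configured, so the sketch below shows only the generic pattern applied to an LLM call: a count-based failure threshold trips the breaker, and a reset timeout lets a trial call through later. The class, defaults, and names are hypothetical, not Loom's implementation.

```python
# Illustrative only: a generic circuit breaker wrapped around an LLM call.
# Thresholds, timeouts, and names are assumptions, not Loom's configuration.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # time the breaker opened, or None if closed

    def call(self, fn, *args, **kwargs):
        # While open, refuse calls until the reset timeout elapses.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: skipping LLM call")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the breaker again
        return result
```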
```bash
# Clone the repository
git clone https://github.com/evanvolgas/loom.git
cd loom

# Install dependencies
uv venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
uv pip install -e ".[dev]"

# Run tests
pytest
```

Loom uses Arbiter as its evaluation engine:
- Arbiter: Evaluates individual LLM outputs (what)
- Loom: Orchestrates pipelines with evaluation gates (when/how)
Separate projects, complementary goals.
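
To make the split concrete, here is a conceptual sketch of an evaluate step: evaluator callables play the Arbiter role (scoring individual outputs), while the surrounding orchestration plays the Loom role (deciding when to evaluate and how to route records, including quarantining failures). Every name and signature below is hypothetical; neither project's real API is shown.

```python
# Conceptual sketch of the division of labor between evaluation and
# orchestration. All names are hypothetical, not either project's API.
from typing import Callable

Evaluator = Callable[[str], float]  # scores a single LLM output in [0, 1]


def run_evaluate_step(records: list[dict], evaluators: list[Evaluator],
                      threshold: float) -> tuple[list[dict], list[dict]]:
    """Score each record and split the batch into loadable vs. quarantined."""
    loadable, quarantined = [], []
    for record in records:
        scores = [ev(record["output"]) for ev in evaluators]
        if all(s >= threshold for s in scores):  # simple all_pass gate
            loadable.append(record)
        else:
            # Quarantine pattern: keep the record plus the reason it failed,
            # so it can be inspected later instead of silently dropped.
            quarantined.append({**record, "failure_reason": f"scores={scores}"})
    return loadable, quarantined
```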
Note: This is a personal project. Roadmap items are ideas and explorations, not commitments. Priorities and timelines may change based on what's useful.
Phase 1 - Foundation ✅ (Completed)
- Core pipeline engine (Extract, Transform, Evaluate, Load)
- YAML pipeline parser
- Arbiter integration with quality gates
- Circuit breaker and resilience patterns
- Basic CLI (`loom run`, `loom validate`)
Future Ideas (No timeline, exploring as needed)
- Database connectors (PostgreSQL, MySQL)
- Cost tracking and monitoring
- Semantic caching for duplicate inputs
- Smart retry logic with failure-type awareness
- Testing framework for pipelines
- More advanced monitoring and alerting
Contributions welcome! This is a personal project, but if you find it useful and want to contribute, pull requests are appreciated.
MIT License - see LICENSE file for details.
Inspired by dbt's declarative approach to data pipelines and built on top of Arbiter for evaluation.