GitHub - baselanaya/Mercer: Text-to-SQL for messy production schemas — 6-stage agentic pipeline, local GPU inference, no vector DB required

Mercer converts plain English questions into accurate SQL against production schemas — including the messy ones with cryptic abbreviations (cust_seg_cd, e_add, p_spec), missing foreign keys, legacy denormalization, and inconsistent naming. It runs a 6-stage agentic pipeline entirely on a consumer GPU (8 GB VRAM) with no vector database required.

How It Works

Each question passes through six stages before a SQL query is returned:

Entity Retrieval — BM25 + LSH matches natural language tokens against actual sampled cell values (RETAIL, CORP, etc., not just column names) and a business glossary
Schema Linking — 3-step CHESS-inspired linker: column pre-filter → table selection → final column selection
Query Decomposition — Breaks complex questions into subproblems with a chain-of-thought reasoning plan; the glossary-expanded form of the question (E-SQL pattern) is forwarded into Stage 4
Candidate Generation — Three SQL strategies run in parallel at distinct temperatures (direct CoT @ 0.0, divide-and-conquer @ 0.2, plan-execute @ 0.3) using the XiYan-SQL M-Schema prompt format, which renders schemas with inline sample values for each column
Execution + Selection — Each candidate is gated by a cheap EXPLAIN pre-flight; survivors execute against the database in a read-only sandbox; the best result is selected by consistency scoring
Taxonomy Correction — Classifies and fixes errors by type: schema errors, join errors, filter errors, aggregation errors, syntax errors, logic errors

Key Design Decisions

No vector database. Schema navigation uses BM25, LSH against sampled cell values, an FK graph, and LLM reasoning. Vector RAG remains opt-in for glossary expansion only.

Multi-candidate selection with real diversity. Three SQL queries are generated with distinct strategies and distinct temperatures (0.0, 0.2, 0.3), all gated by EXPLAIN, all executed, and the best is picked. One-shot generation produces worse results on messy schemas; collapsing the three strategies to a single temperature kills the CHASE-SQL diversity signal.

Local-first inference, SQL-tuned by default. llama.cpp with GGUF IQ4_XS + q8_0 KV cache (TurboQuant profile) runs Snowflake's Arctic-Text2SQL-R1-7B — a Qwen2.5-Coder-7B-Instruct fine-tune trained with execution-based GRPO on BIRD/Spider — fully on an RTX 4070 (8 GB VRAM, ~5.4 GB working set). Cloud API (Anthropic Claude Opus 4.7 / OpenAI GPT-4.1) is an optional fallback.

M-Schema prompting with inline sample values. Schemas are rendered to the LLM in XiYan-SQL's M-Schema format, with each column's actual sampled values shown next to its definition ((cust_seg_cd:TEXT, "RETAIL", "CORP", "SMB")). Cryptic abbreviations stop being cryptic when the model can see what real values look like.

Read-only sandbox with structural validation. SQL is parsed by sqlglot and rejected before execution if it isn't a single SELECT (or WITH ... SELECT); DDL/DML nodes anywhere in the AST are rejected, even inside subqueries or CTEs. PostgreSQL connections additionally issue SET TRANSACTION READ ONLY as a second line of defense. 5-second statement timeout, 100-row fetch limit. The previous keyword-grep approach had false positives on string literals containing keywords (WHERE body LIKE '%please DROP%'); the AST validator does not.

Quick Start

Prerequisites: Python 3.11+, Redis, Node.js 18+ (for the UI)

git clone https://github.com/baselanaya/Mercer.git
cd Mercer

# Install Python dependencies
python -m venv .venv && source .venv/bin/activate

# Runtime install:
pip install -e . -r requirements.txt

# Development install (adds pytest, ruff, mypy):
pip install -e ".[dev]" -r requirements.txt

# Copy and configure environment
cp .env.example .env

Edit .env and set at minimum:

DATABASE_URL=sqlite+aiosqlite:///data/chinook/Chinook.sqlite  # or your database
INFERENCE_BACKEND=anthropic                                    # or llamacpp / openai
ANTHROPIC_API_KEY=sk-ant-...

Start the server:

uvicorn app.api.main:app --port 8000

Open http://localhost:8000 — the API is live. For the full UI, see Usage → UI below.

Usage

Local GPU (llama.cpp)

Requires CUDA 12.4+ and ~5.4 GB VRAM. Tested on RTX 4070 Laptop.

# Install GPU stack
pip install -r requirements-gpu.txt

# Download the recommended SQL-tuned model (Apache 2.0)
pip install huggingface-hub
huggingface-cli download mradermacher/Arctic-Text2SQL-R1-7B-i1-GGUF \
  Arctic-Text2SQL-R1-7B.i1-IQ4_XS.gguf --local-dir models/
mv models/Arctic-Text2SQL-R1-7B.i1-IQ4_XS.gguf \
   models/Arctic-Text2SQL-R1-7B-IQ4_XS.gguf

# Start the inference server (TurboQuant: IQ4_XS weights + q8_0 KV cache)
bash scripts/serve_llamacpp_turboquant.sh

Already have Qwen2.5-Coder-7B downloaded? Override LOCAL_MODEL_PATH in .env to point at it. See docs/custom-models.md for the full model-swap matrix.

The server starts on http://localhost:8080 (OpenAI-compatible API). Set in .env:

INFERENCE_BACKEND=llamacpp
LLAMACPP_URL=http://localhost:8080
LOCAL_MODEL_PATH=models/Arctic-Text2SQL-R1-7B-IQ4_XS.gguf

Cloud API

Set the appropriate key and backend in .env:

# Anthropic
INFERENCE_BACKEND=anthropic
ANTHROPIC_API_KEY=sk-ant-...

# OpenAI
INFERENCE_BACKEND=openai
OPENAI_API_KEY=sk-proj-...

No model download required. Start the server normally:

uvicorn app.api.main:app --port 8000

API Server

# Start (production)
uvicorn app.api.main:app --port 8000

# Start (development, auto-reload)
uvicorn app.api.main:app --reload --port 8000

REST API

# Run a query
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "Top 5 customers by total spend"}'

# Get schema-aware suggested questions
curl http://localhost:8000/suggest

# Health check
curl http://localhost:8000/health

WebSocket (Streaming)

Connect to /ws/query to receive pipeline progress in real time:

Client → {"question": "How many tracks are in each genre?"}

Server → {"type": "started",        "question": "..."}
Server → {"type": "stage_complete", "stage": "entity_retrieval",    "latency_ms": 9}
Server → {"type": "stage_complete", "stage": "schema_linking",      "latency_ms": 3700}
Server → {"type": "stage_complete", "stage": "query_decomposition", "latency_ms": 4000}
Server → {"type": "stage_complete", "stage": "candidate_generation","latency_ms": 5000}
Server → {"type": "stage_complete", "stage": "execution_scoring",   "latency_ms": 7}
Server → {"type": "stage_complete", "stage": "correction",          "latency_ms": 0}
Server → {"type": "complete",       "sql": "SELECT ...", "execution_result": {...}, "latency_ms": 12722}

UI Development Server

The chat interface is a separate Vite + React app that proxies to the API backend:

cd app/ui
npm install
npm run dev        # starts on http://localhost:5173

The UI proxies /api, /query, /ws, /suggest, /schema, /health, and /setup to localhost:8000.

Ingesting a Schema

Preload the schema graph and entity index for faster queries:

python scripts/ingest_schema.py --db-url $DATABASE_URL

This caches the full schema in Redis and pre-builds the BM25/LSH entity indices.

Configuration

Environment Variables

Variable	Description	Default
`DATABASE_URL`	SQLAlchemy async connection string	—
`REDIS_URL`	Redis for schema cache	`redis://localhost:6379/0`
`INFERENCE_BACKEND`	`llamacpp` \| `anthropic` \| `openai` \| `ollama`	`llamacpp`
`LLAMACPP_URL`	llama.cpp server URL	`http://localhost:8080`
`LOCAL_MODEL_PATH`	GGUF file served by llama.cpp	`models/Arctic-Text2SQL-R1-7B-IQ4_XS.gguf`
`ANTHROPIC_API_KEY`	Anthropic API key	—
`ANTHROPIC_MODEL`	Anthropic model snapshot	`claude-opus-4-7`
`OPENAI_API_KEY`	OpenAI API key	—
`OPENAI_MODEL`	OpenAI model name	`gpt-4.1`
`LOG_LEVEL`	`DEBUG` \| `INFO` \| `WARNING` \| `ERROR`	`INFO`

Connection string examples:

# SQLite
DATABASE_URL=sqlite+aiosqlite:///path/to/database.db

# PostgreSQL
DATABASE_URL=postgresql+asyncpg://user:password@localhost:5432/mydb

# MySQL
DATABASE_URL=mysql+aiomysql://user:password@localhost:3306/mydb

Messy Schema Annotations

Add semantic context for cryptic column names in config/mappings.yaml:

tables:
  cust_mstr:
    description: "Main customer master table"
    columns:
      cust_seg_cd: "Customer segment code (RETAIL, CORP, SMB)"
      e_add:       "Employee address — legacy field, rarely populated"
      p_spec:      "Product specialization code"

  ord_hdr:
    description: "Order header — one row per order"
    columns:
      ord_dt:   "Order date"
      cust_fk:  "Foreign key to cust_mstr.cust_id"
      tot_amt:  "Total order amount in USD"

glossary:
  revenue:  "SUM of ord_hdr.tot_amt"
  churn:    "Customers with no orders in the past 90 days"
  segment:  "cust_mstr.cust_seg_cd"

Evaluation

Baseline Measurement

The numbers below were measured against the prior Mercer setup — Qwen2.5-Coder-7B-Instruct (general-purpose code model), keyword-grep sandbox, single-temperature candidate generation, no inline sample values, flat-prose schema format. They establish a baseline; they have not yet been re-measured against the current configuration (Arctic-Text2SQL-R1-7B + M-Schema + per-strategy temperatures + sample-value LSH + EXPLAIN gate + sqlglot AST sandbox).

The component literature suggests the current configuration should produce a meaningful gain, but until a re-measurement is run on the same suites and committed alongside the code, treat the 74% as the previous floor rather than the current ceiling.

External Benchmark (baseline)

Benchmarked with Qwen2.5-Coder-7B-Instruct-Q4_K_M on an RTX 4070 Laptop (8 GB VRAM) via llama.cpp.

Dataset	Questions	Execution Accuracy	Exact Match
gretelai/synthetic_text_to_sql	50 stratified	74.0%	34.0%

Breakdown by SQL complexity:

Complexity	Execution	Exact Match
Basic SQL	6/8 (75%)	3/8 (38%)
Aggregation	7/8 (88%)	5/8 (63%)
Single join	6/8 (75%)	3/8 (38%)
Subqueries	3/7 (43%)	1/7 (14%)
Window functions	7/7 (100%)	3/7 (43%)
Multiple joins	4/6 (67%)	1/6 (17%)
Set operations	4/4 (100%)	1/4 (25%)
CTEs	0/2	0/2

5 of 50 questions were DDL/DML (DELETE/INSERT) — blocked by the read-only sandbox and counted as failures. These are outside scope for a read-only NL→SQL pipeline.

Live Queries (Chinook Database, baseline)

Question	Correct	Latency
How many customers are there?	✓ (59)	7.6s
Top 5 artists by album count	✓ (Iron Maiden, 21)	13.6s
Employee with most invoices + revenue	✓ (Jane Peacock, $833)	16.8s
Genres for artist with most tracks	✓ (Rock, 1297 tracks)	22.7s

All four queries required zero correction steps.

Re-running the Eval Suite

python scripts/benchmark.py --suite regression
python scripts/benchmark.py --suite mercer_messy
python eval/gretelai_eval.py

If you re-measure on the current configuration, please open a PR with the updated numbers and the configuration matrix used. BIRD-dev is the more discriminating benchmark for SOTA comparisons (gretelai is synthetic and easier); a good follow-up benchmarking effort starts there.

Architecture

mercer/
├── core/
│   ├── pipeline.py              # 6-stage orchestrator
│   ├── entity_retriever.py      # Stage 1: BM25 + LSH (over sampled values) + glossary
│   ├── schema_linker.py         # Stage 2: 3-step CHESS linker
│   ├── query_decomposer.py      # Stage 3: subproblem decomposition
│   ├── candidate_generator.py   # Stage 4: per-strategy temps (0.0/0.2/0.3)
│   ├── executor.py              # Stage 5: EXPLAIN-gated execution + selection
│   └── corrector.py             # Stage 6: taxonomy-guided correction
│
├── kernels/                     # GPU kernels with CPU fallbacks
│   ├── lsh_hash.py              # Batch LSH for entity matching
│   ├── schema_encode.py         # Batch schema tokenization
│   └── result_score.py          # Result consistency scoring
│
├── inference/                   # LLM backend abstraction
│   ├── base.py                  # LLMBackend protocol (per-prompt temperatures)
│   ├── llamacpp_backend.py      # llama.cpp async client
│   ├── api_backend.py           # Anthropic / OpenAI client
│   └── router.py                # Complexity-based model routing
│
├── schema/
│   ├── introspector.py          # SQLAlchemy introspection + bounded value sampling
│   ├── semantic_mapper.py       # Glossary + column description injection
│   ├── graph_builder.py         # FK graph (networkx)
│   └── cache.py                 # Redis schema cache
│
├── db/
│   ├── connectors/              # PostgreSQL, MySQL, SQLite, DuckDB
│   ├── sandbox.py               # sqlglot AST-validated read-only sandbox
│   └── explain.py               # Dialect-aware EXPLAIN pre-flight
│
├── prompts/
│   ├── m_schema.py              # XiYan-SQL M-Schema renderer (single source of truth)
│   ├── sql_generation.py        # Stage 4 prompts
│   ├── query_plan.py            # Stage 3 prompts
│   ├── column_selection.py      # Stage 2 prompts
│   └── correction.py            # Stage 6 prompts
│
├── app/
│   ├── api/                     # FastAPI + WebSocket streaming
│   └── ui/                      # React + TailwindCSS chat UI
│
├── eval/                        # Benchmark runners
├── config/                      # mappings.yaml, inference.yaml, settings.py
└── data/                        # Chinook, DVDRental, Northwind test databases

Stack

Layer	Technology
Language	Python 3.11+
API backend	FastAPI + uvicorn
Frontend	React 19 + TailwindCSS v4 + Vite
LLM serving (local)	llama.cpp (GGUF, IQ4_XS + q8_0 KV)
Local model (default)	Snowflake Arctic-Text2SQL-R1-7B (Apache 2.0)
API fallback	Claude Opus 4.7 / GPT-4.1
GPU kernels	Triton (with CPU fallbacks)
SQL parsing	sqlglot 25+
DB abstraction	SQLAlchemy 2.0 (async)
Schema graph	networkx
Entity retrieval	rank-bm25 + custom LSH over sampled values
Cache	Redis (schema) + DuckDB (audit log)
Databases supported	PostgreSQL, MySQL, SQLite, DuckDB
CI	GitHub Actions (ruff, mypy strict, pytest matrix on 3.11/3.12)

Research Foundation

Paper	Contribution	Pipeline Stage
CHESS (Talaei et al. 2024)	3-step schema linking + LSH entity retrieval	Stage 1 + Stage 2
E-SQL (NeurIPS 2024)	Question enrichment via glossary expansion	Stage 1 → Stage 4
XiYan-SQL (Liu et al. 2024)	M-Schema prompt format with inline sample values	Stage 2 + Stage 4 + Stage 6
CHASE-SQL (ICLR 2025)	Multi-path candidate generation at distinct temperatures	Stage 4
SQL-of-Thought (2025)	Taxonomy-guided error correction	Stage 6
Arctic-Text2SQL-R1 (Snowflake 2025)	Execution-reward GRPO fine-tune of Qwen2.5-Coder	Local inference

Contributing

See CONTRIBUTING.md. All PRs are gated by CI: ruff lint, mypy strict, and pytest on Python 3.11 and 3.12. Run locally before pushing:

ruff check .
mypy .
pytest

High-value areas:

Re-running eval/gretelai_eval.py against the current configuration (Arctic-Text2SQL-R1-7B + M-Schema + EXPLAIN gate) and committing the updated numbers
BIRD-dev evaluation harness — currently no benchmark on the harder of the two main public suites
Additional database connectors (BigQuery, Snowflake, SQL Server)
Improved messy-schema test cases
Better FK graph traversal for complex join paths
Data catalog integrations (Amundsen, DataHub, OpenMetadata)

License

MIT — see LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

How It Works

Key Design Decisions

Quick Start

Usage

Local GPU (llama.cpp)

Cloud API

API Server

REST API

WebSocket (Streaming)

UI Development Server

Ingesting a Schema

Configuration

Environment Variables

Messy Schema Annotations

Evaluation

Baseline Measurement

External Benchmark (baseline)

Live Queries (Chinook Database, baseline)

Re-running the Eval Suite

Architecture

Stack

Research Foundation

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
.github		.github
app		app
config		config
core		core
data		data
db		db
docs		docs
eval		eval
inference		inference
kernels		kernels
logs		logs
prompts		prompts
schema		schema
scripts		scripts
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
banner.svg		banner.svg
pyproject.toml		pyproject.toml
requirements-gpu.txt		requirements-gpu.txt
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

How It Works

Key Design Decisions

Quick Start

Usage

Local GPU (llama.cpp)

Cloud API

API Server

REST API

WebSocket (Streaming)

UI Development Server

Ingesting a Schema

Configuration

Environment Variables

Messy Schema Annotations

Evaluation

Baseline Measurement

External Benchmark (baseline)

Live Queries (Chinook Database, baseline)

Re-running the Eval Suite

Architecture

Stack

Research Foundation

Contributing

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages