Mercer converts plain English questions into accurate SQL against production schemas — including the messy ones with cryptic abbreviations (cust_seg_cd, e_add, p_spec), missing foreign keys, legacy denormalization, and inconsistent naming. It runs a 6-stage agentic pipeline entirely on a consumer GPU (8 GB VRAM) with no vector database required.
Each question passes through six stages before a SQL query is returned:
- Entity Retrieval — BM25 + LSH matches natural language tokens against actual sampled cell values (
RETAIL,CORP, etc., not just column names) and a business glossary - Schema Linking — 3-step CHESS-inspired linker: column pre-filter → table selection → final column selection
- Query Decomposition — Breaks complex questions into subproblems with a chain-of-thought reasoning plan; the glossary-expanded form of the question (E-SQL pattern) is forwarded into Stage 4
- Candidate Generation — Three SQL strategies run in parallel at distinct temperatures (direct CoT @ 0.0, divide-and-conquer @ 0.2, plan-execute @ 0.3) using the XiYan-SQL M-Schema prompt format, which renders schemas with inline sample values for each column
- Execution + Selection — Each candidate is gated by a cheap
EXPLAINpre-flight; survivors execute against the database in a read-only sandbox; the best result is selected by consistency scoring - Taxonomy Correction — Classifies and fixes errors by type: schema errors, join errors, filter errors, aggregation errors, syntax errors, logic errors
No vector database. Schema navigation uses BM25, LSH against sampled cell values, an FK graph, and LLM reasoning. Vector RAG remains opt-in for glossary expansion only.
Multi-candidate selection with real diversity. Three SQL queries are generated with distinct strategies and distinct temperatures (0.0, 0.2, 0.3), all gated by EXPLAIN, all executed, and the best is picked. One-shot generation produces worse results on messy schemas; collapsing the three strategies to a single temperature kills the CHASE-SQL diversity signal.
Local-first inference, SQL-tuned by default. llama.cpp with GGUF IQ4_XS + q8_0 KV cache (TurboQuant profile) runs Snowflake's Arctic-Text2SQL-R1-7B — a Qwen2.5-Coder-7B-Instruct fine-tune trained with execution-based GRPO on BIRD/Spider — fully on an RTX 4070 (8 GB VRAM, ~5.4 GB working set). Cloud API (Anthropic Claude Opus 4.7 / OpenAI GPT-4.1) is an optional fallback.
M-Schema prompting with inline sample values. Schemas are rendered to the LLM in XiYan-SQL's M-Schema format, with each column's actual sampled values shown next to its definition ((cust_seg_cd:TEXT, "RETAIL", "CORP", "SMB")). Cryptic abbreviations stop being cryptic when the model can see what real values look like.
Read-only sandbox with structural validation. SQL is parsed by sqlglot and rejected before execution if it isn't a single SELECT (or WITH ... SELECT); DDL/DML nodes anywhere in the AST are rejected, even inside subqueries or CTEs. PostgreSQL connections additionally issue SET TRANSACTION READ ONLY as a second line of defense. 5-second statement timeout, 100-row fetch limit. The previous keyword-grep approach had false positives on string literals containing keywords (WHERE body LIKE '%please DROP%'); the AST validator does not.
Prerequisites: Python 3.11+, Redis, Node.js 18+ (for the UI)
git clone https://github.com/baselanaya/Mercer.git
cd Mercer
# Install Python dependencies
python -m venv .venv && source .venv/bin/activate
# Runtime install:
pip install -e . -r requirements.txt
# Development install (adds pytest, ruff, mypy):
pip install -e ".[dev]" -r requirements.txt
# Copy and configure environment
cp .env.example .envEdit .env and set at minimum:
DATABASE_URL=sqlite+aiosqlite:///data/chinook/Chinook.sqlite # or your database
INFERENCE_BACKEND=anthropic # or llamacpp / openai
ANTHROPIC_API_KEY=sk-ant-...Start the server:
uvicorn app.api.main:app --port 8000Open http://localhost:8000 — the API is live. For the full UI, see Usage → UI below.
Requires CUDA 12.4+ and ~5.4 GB VRAM. Tested on RTX 4070 Laptop.
# Install GPU stack
pip install -r requirements-gpu.txt
# Download the recommended SQL-tuned model (Apache 2.0)
pip install huggingface-hub
huggingface-cli download mradermacher/Arctic-Text2SQL-R1-7B-i1-GGUF \
Arctic-Text2SQL-R1-7B.i1-IQ4_XS.gguf --local-dir models/
mv models/Arctic-Text2SQL-R1-7B.i1-IQ4_XS.gguf \
models/Arctic-Text2SQL-R1-7B-IQ4_XS.gguf
# Start the inference server (TurboQuant: IQ4_XS weights + q8_0 KV cache)
bash scripts/serve_llamacpp_turboquant.shAlready have Qwen2.5-Coder-7B downloaded? Override LOCAL_MODEL_PATH in .env to point at it. See docs/custom-models.md for the full model-swap matrix.
The server starts on http://localhost:8080 (OpenAI-compatible API). Set in .env:
INFERENCE_BACKEND=llamacpp
LLAMACPP_URL=http://localhost:8080
LOCAL_MODEL_PATH=models/Arctic-Text2SQL-R1-7B-IQ4_XS.ggufSet the appropriate key and backend in .env:
# Anthropic
INFERENCE_BACKEND=anthropic
ANTHROPIC_API_KEY=sk-ant-...
# OpenAI
INFERENCE_BACKEND=openai
OPENAI_API_KEY=sk-proj-...No model download required. Start the server normally:
uvicorn app.api.main:app --port 8000# Start (production)
uvicorn app.api.main:app --port 8000
# Start (development, auto-reload)
uvicorn app.api.main:app --reload --port 8000# Run a query
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question": "Top 5 customers by total spend"}'
# Get schema-aware suggested questions
curl http://localhost:8000/suggest
# Health check
curl http://localhost:8000/healthConnect to /ws/query to receive pipeline progress in real time:
Client → {"question": "How many tracks are in each genre?"}
Server → {"type": "started", "question": "..."}
Server → {"type": "stage_complete", "stage": "entity_retrieval", "latency_ms": 9}
Server → {"type": "stage_complete", "stage": "schema_linking", "latency_ms": 3700}
Server → {"type": "stage_complete", "stage": "query_decomposition", "latency_ms": 4000}
Server → {"type": "stage_complete", "stage": "candidate_generation","latency_ms": 5000}
Server → {"type": "stage_complete", "stage": "execution_scoring", "latency_ms": 7}
Server → {"type": "stage_complete", "stage": "correction", "latency_ms": 0}
Server → {"type": "complete", "sql": "SELECT ...", "execution_result": {...}, "latency_ms": 12722}
The chat interface is a separate Vite + React app that proxies to the API backend:
cd app/ui
npm install
npm run dev # starts on http://localhost:5173The UI proxies /api, /query, /ws, /suggest, /schema, /health, and /setup to localhost:8000.
Preload the schema graph and entity index for faster queries:
python scripts/ingest_schema.py --db-url $DATABASE_URLThis caches the full schema in Redis and pre-builds the BM25/LSH entity indices.
| Variable | Description | Default |
|---|---|---|
DATABASE_URL |
SQLAlchemy async connection string | — |
REDIS_URL |
Redis for schema cache | redis://localhost:6379/0 |
INFERENCE_BACKEND |
llamacpp | anthropic | openai | ollama |
llamacpp |
LLAMACPP_URL |
llama.cpp server URL | http://localhost:8080 |
LOCAL_MODEL_PATH |
GGUF file served by llama.cpp | models/Arctic-Text2SQL-R1-7B-IQ4_XS.gguf |
ANTHROPIC_API_KEY |
Anthropic API key | — |
ANTHROPIC_MODEL |
Anthropic model snapshot | claude-opus-4-7 |
OPENAI_API_KEY |
OpenAI API key | — |
OPENAI_MODEL |
OpenAI model name | gpt-4.1 |
LOG_LEVEL |
DEBUG | INFO | WARNING | ERROR |
INFO |
Connection string examples:
# SQLite
DATABASE_URL=sqlite+aiosqlite:///path/to/database.db
# PostgreSQL
DATABASE_URL=postgresql+asyncpg://user:password@localhost:5432/mydb
# MySQL
DATABASE_URL=mysql+aiomysql://user:password@localhost:3306/mydbAdd semantic context for cryptic column names in config/mappings.yaml:
tables:
cust_mstr:
description: "Main customer master table"
columns:
cust_seg_cd: "Customer segment code (RETAIL, CORP, SMB)"
e_add: "Employee address — legacy field, rarely populated"
p_spec: "Product specialization code"
ord_hdr:
description: "Order header — one row per order"
columns:
ord_dt: "Order date"
cust_fk: "Foreign key to cust_mstr.cust_id"
tot_amt: "Total order amount in USD"
glossary:
revenue: "SUM of ord_hdr.tot_amt"
churn: "Customers with no orders in the past 90 days"
segment: "cust_mstr.cust_seg_cd"The numbers below were measured against the prior Mercer setup — Qwen2.5-Coder-7B-Instruct (general-purpose code model), keyword-grep sandbox, single-temperature candidate generation, no inline sample values, flat-prose schema format. They establish a baseline; they have not yet been re-measured against the current configuration (Arctic-Text2SQL-R1-7B + M-Schema + per-strategy temperatures + sample-value LSH + EXPLAIN gate + sqlglot AST sandbox).
The component literature suggests the current configuration should produce a meaningful gain, but until a re-measurement is run on the same suites and committed alongside the code, treat the 74% as the previous floor rather than the current ceiling.
Benchmarked with Qwen2.5-Coder-7B-Instruct-Q4_K_M on an RTX 4070 Laptop (8 GB VRAM) via llama.cpp.
| Dataset | Questions | Execution Accuracy | Exact Match |
|---|---|---|---|
| gretelai/synthetic_text_to_sql | 50 stratified | 74.0% | 34.0% |
Breakdown by SQL complexity:
| Complexity | Execution | Exact Match |
|---|---|---|
| Basic SQL | 6/8 (75%) | 3/8 (38%) |
| Aggregation | 7/8 (88%) | 5/8 (63%) |
| Single join | 6/8 (75%) | 3/8 (38%) |
| Subqueries | 3/7 (43%) | 1/7 (14%) |
| Window functions | 7/7 (100%) | 3/7 (43%) |
| Multiple joins | 4/6 (67%) | 1/6 (17%) |
| Set operations | 4/4 (100%) | 1/4 (25%) |
| CTEs | 0/2 | 0/2 |
5 of 50 questions were DDL/DML (DELETE/INSERT) — blocked by the read-only sandbox and counted as failures. These are outside scope for a read-only NL→SQL pipeline.
| Question | Correct | Latency |
|---|---|---|
| How many customers are there? | ✓ (59) | 7.6s |
| Top 5 artists by album count | ✓ (Iron Maiden, 21) | 13.6s |
| Employee with most invoices + revenue | ✓ (Jane Peacock, $833) | 16.8s |
| Genres for artist with most tracks | ✓ (Rock, 1297 tracks) | 22.7s |
All four queries required zero correction steps.
python scripts/benchmark.py --suite regression
python scripts/benchmark.py --suite mercer_messy
python eval/gretelai_eval.pyIf you re-measure on the current configuration, please open a PR with the updated numbers and the configuration matrix used. BIRD-dev is the more discriminating benchmark for SOTA comparisons (gretelai is synthetic and easier); a good follow-up benchmarking effort starts there.
mercer/
├── core/
│ ├── pipeline.py # 6-stage orchestrator
│ ├── entity_retriever.py # Stage 1: BM25 + LSH (over sampled values) + glossary
│ ├── schema_linker.py # Stage 2: 3-step CHESS linker
│ ├── query_decomposer.py # Stage 3: subproblem decomposition
│ ├── candidate_generator.py # Stage 4: per-strategy temps (0.0/0.2/0.3)
│ ├── executor.py # Stage 5: EXPLAIN-gated execution + selection
│ └── corrector.py # Stage 6: taxonomy-guided correction
│
├── kernels/ # GPU kernels with CPU fallbacks
│ ├── lsh_hash.py # Batch LSH for entity matching
│ ├── schema_encode.py # Batch schema tokenization
│ └── result_score.py # Result consistency scoring
│
├── inference/ # LLM backend abstraction
│ ├── base.py # LLMBackend protocol (per-prompt temperatures)
│ ├── llamacpp_backend.py # llama.cpp async client
│ ├── api_backend.py # Anthropic / OpenAI client
│ └── router.py # Complexity-based model routing
│
├── schema/
│ ├── introspector.py # SQLAlchemy introspection + bounded value sampling
│ ├── semantic_mapper.py # Glossary + column description injection
│ ├── graph_builder.py # FK graph (networkx)
│ └── cache.py # Redis schema cache
│
├── db/
│ ├── connectors/ # PostgreSQL, MySQL, SQLite, DuckDB
│ ├── sandbox.py # sqlglot AST-validated read-only sandbox
│ └── explain.py # Dialect-aware EXPLAIN pre-flight
│
├── prompts/
│ ├── m_schema.py # XiYan-SQL M-Schema renderer (single source of truth)
│ ├── sql_generation.py # Stage 4 prompts
│ ├── query_plan.py # Stage 3 prompts
│ ├── column_selection.py # Stage 2 prompts
│ └── correction.py # Stage 6 prompts
│
├── app/
│ ├── api/ # FastAPI + WebSocket streaming
│ └── ui/ # React + TailwindCSS chat UI
│
├── eval/ # Benchmark runners
├── config/ # mappings.yaml, inference.yaml, settings.py
└── data/ # Chinook, DVDRental, Northwind test databases
| Layer | Technology |
|---|---|
| Language | Python 3.11+ |
| API backend | FastAPI + uvicorn |
| Frontend | React 19 + TailwindCSS v4 + Vite |
| LLM serving (local) | llama.cpp (GGUF, IQ4_XS + q8_0 KV) |
| Local model (default) | Snowflake Arctic-Text2SQL-R1-7B (Apache 2.0) |
| API fallback | Claude Opus 4.7 / GPT-4.1 |
| GPU kernels | Triton (with CPU fallbacks) |
| SQL parsing | sqlglot 25+ |
| DB abstraction | SQLAlchemy 2.0 (async) |
| Schema graph | networkx |
| Entity retrieval | rank-bm25 + custom LSH over sampled values |
| Cache | Redis (schema) + DuckDB (audit log) |
| Databases supported | PostgreSQL, MySQL, SQLite, DuckDB |
| CI | GitHub Actions (ruff, mypy strict, pytest matrix on 3.11/3.12) |
| Paper | Contribution | Pipeline Stage |
|---|---|---|
| CHESS (Talaei et al. 2024) | 3-step schema linking + LSH entity retrieval | Stage 1 + Stage 2 |
| E-SQL (NeurIPS 2024) | Question enrichment via glossary expansion | Stage 1 → Stage 4 |
| XiYan-SQL (Liu et al. 2024) | M-Schema prompt format with inline sample values | Stage 2 + Stage 4 + Stage 6 |
| CHASE-SQL (ICLR 2025) | Multi-path candidate generation at distinct temperatures | Stage 4 |
| SQL-of-Thought (2025) | Taxonomy-guided error correction | Stage 6 |
| Arctic-Text2SQL-R1 (Snowflake 2025) | Execution-reward GRPO fine-tune of Qwen2.5-Coder | Local inference |
See CONTRIBUTING.md. All PRs are gated by CI: ruff lint, mypy strict, and pytest on Python 3.11 and 3.12. Run locally before pushing:
ruff check .
mypy .
pytestHigh-value areas:
- Re-running
eval/gretelai_eval.pyagainst the current configuration (Arctic-Text2SQL-R1-7B + M-Schema + EXPLAIN gate) and committing the updated numbers - BIRD-dev evaluation harness — currently no benchmark on the harder of the two main public suites
- Additional database connectors (BigQuery, Snowflake, SQL Server)
- Improved messy-schema test cases
- Better FK graph traversal for complex join paths
- Data catalog integrations (Amundsen, DataHub, OpenMetadata)
MIT — see LICENSE.