perf(map): Optimize LLM pipeline with merged prompts, model tiering, and async by JoshDoesIT · Pull Request #49 · JoshDoesIT/ctrlmap

JoshDoesIT · 2026-03-08T22:01:44Z

Summary

Reduce total LLM calls from ~573 to ~100-150 per run through three phases of optimization, targeting both the map enrichment pipeline and the parse pipeline.

Changes

Phase 1 — Concurrency Foundation

Add async LLM client methods (call_llm_async, generate_async, etc.) using ollama.AsyncClient
Add SQLite-backed LLM response cache (LLMCache in cache.py)
Add query_by_embedding() for batch vector queries
Batch embedding in mapper via embed_batch()
Add --concurrency and --cache CLI options to ctrlmap map
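
For reviewers unfamiliar with the caching approach, here is a minimal sketch of what a SQLite-backed response cache can look like. It is not the actual `LLMCache` from cache.py: the class name `SketchLLMCache`, the table schema, and the choice to key on a SHA-256 hash of model plus prompt are all assumptions made for illustration.

```python
import hashlib
import json
import sqlite3
from typing import Optional


class SketchLLMCache:
    """Hypothetical stand-in for LLMCache; the real schema in cache.py may differ."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS responses "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    @staticmethod
    def make_key(model: str, prompt: str) -> str:
        # Hash model and prompt together so the same prompt sent to a
        # different model tier gets its own cache entry.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, model: str, prompt: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT value FROM responses WHERE key = ?",
            (self.make_key(model, prompt),),
        ).fetchone()
        return row[0] if row else None

    def put(self, model: str, prompt: str, response: str) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO responses (key, value) VALUES (?, ?)",
                (self.make_key(model, prompt), response),
            )
```

Including the model name in the key matters once calls are split across model tiers, since an answer cached from one model should not be served for another.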

Phase 2 — Call Reduction

Merged relevance+rationale prompt — combines two separate LLM calls into one via merged_relevance_rationale.txt and evaluate_chunk_async(), eliminating ~200 calls
Streaming per-control pipeline — each control flows through all steps independently instead of waiting for all controls per step
Raise min_score from 0.35 → 0.50 to skip weak matches before they reach the LLM
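
The streaming per-control flow and the score cutoff can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `fake_evaluate_chunk` is a hypothetical stub standing in for the merged relevance+rationale call, the control IDs in the usage example are made up, and using an `asyncio.Semaphore` to bound in-flight calls is one plausible way to honor a concurrency limit.

```python
import asyncio

MIN_SCORE = 0.50  # threshold named in this PR (raised from 0.35)


async def fake_evaluate_chunk(control_id: str, chunk: str) -> dict:
    """Hypothetical stub for the single merged relevance+rationale LLM call."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return {"control": control_id, "chunk": chunk, "relevant": True, "rationale": "stub"}


async def process_control(control_id, scored_chunks, semaphore):
    # Drop weak matches before any LLM call is made.
    strong = [chunk for chunk, score in scored_chunks if score >= MIN_SCORE]
    results = []
    for chunk in strong:
        async with semaphore:  # cap the number of in-flight LLM calls
            results.append(await fake_evaluate_chunk(control_id, chunk))
    return control_id, results


async def run_pipeline(controls, concurrency=4):
    semaphore = asyncio.Semaphore(concurrency)
    # One task per control: each control runs its own steps without
    # waiting for the other controls to finish a given step.
    tasks = [process_control(cid, chunks, semaphore) for cid, chunks in controls.items()]
    return dict(await asyncio.gather(*tasks))
```

Example usage with made-up inputs:

```python
controls = {"CTRL-A": [("chunk 1", 0.9), ("chunk 2", 0.4)], "CTRL-B": [("chunk 3", 0.5)]}
mapped = asyncio.run(run_pipeline(controls, concurrency=2))
```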

Phase 3 — Model Tiering & Remaining Optimizations

Model tiering — use qwen2.5:7b for simple tasks (meta-classify, gap rationale, parse) and keep qwen2.5:14b for accuracy-critical compliance evaluation; the roughly 60% of calls that move to the smaller model run ~2x faster
Async parse pipeline — llm_chunker.py now uses asyncio.gather() for concurrent page extraction (3-5x parse speedup)
Cache wired into call_llm_async() — transparent get/put means the --cache flag now takes effect, giving near-instant re-runs
Async gap rationale — true async instead of a sync-in-async wrapper
Top-K 10 → 5 — fewer chunks sent to LLM evaluation
Embedder singleton — @functools.cache shares the SentenceTransformer model across pipeline stages
Skip meta-classify for mapped controls — ~20 fewer LLM calls
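
Two of the items above, model tiering and the embedder singleton, are easy to picture in code. The sketch below is illustrative only: the task labels, `pick_model`, `get_embedder`, and `FakeEmbedder` are hypothetical names, and `FakeEmbedder` stands in for the SentenceTransformer model so the example runs without extra dependencies.

```python
import functools

FAST_MODEL = "qwen2.5:7b"       # simple tasks
ACCURATE_MODEL = "qwen2.5:14b"  # accuracy-critical compliance evaluation

# Task labels are illustrative, not identifiers from the codebase.
SIMPLE_TASKS = frozenset({"meta_classify", "gap_rationale", "parse"})


def pick_model(task: str) -> str:
    """Route simple tasks to the smaller model, everything else to the larger one."""
    return FAST_MODEL if task in SIMPLE_TASKS else ACCURATE_MODEL


class FakeEmbedder:
    """Stand-in for an expensive-to-load embedding model."""

    load_count = 0

    def __init__(self) -> None:
        FakeEmbedder.load_count += 1  # count how often the "model" is loaded


@functools.cache
def get_embedder() -> FakeEmbedder:
    # functools.cache memoizes the zero-argument call, so every pipeline
    # stage that calls get_embedder() receives the same instance.
    return FakeEmbedder()
```

Defaulting unknown tasks to the larger model is a deliberate choice in this sketch: a task that is missing from the simple set costs speed rather than accuracy.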

Test Plan

All existing tests pass (uv run pytest) — 229/229
New tests written following TDD (Red → Green → Refactor)
Linting passes (uv run ruff check .)
Type checking passes (uv run mypy src/)
Eval suite: relevance/compliance/meta at 100%, 2 pre-existing faithfulness flakes unchanged

Checklist

Tests written before implementation (TDD)
Documentation updated (ARCHITECTURE.md)
No unused code, console logs, or dead comments
Follows existing project coding style

…and async

Reduce total LLM calls from ~573 to ~100-150 per run through three phases of optimization:

Phase 1 — Concurrency foundation:
- Add async LLM client methods (call_llm_async, generate_async, etc.)
- Add SQLite-backed LLM response cache (LLMCache)
- Add query_by_embedding() for batch vector queries
- Batch embedding in mapper via embed_batch()
- Add --concurrency and --cache CLI options

Phase 2 — Call reduction:
- Merge relevance check + rationale into single LLM call (merged_relevance_rationale.txt, evaluate_chunk_async)
- Streaming per-control pipeline (no cross-control blocking)
- Raise min_score 0.35 to 0.50 (skip weak matches before LLM)

Phase 3 — Model tiering and remaining optimizations:
- Use qwen2.5:7b for simple tasks (meta-classify, gap, parse)
- Keep qwen2.5:14b for accuracy-critical compliance evaluation
- Async parse pipeline in llm_chunker.py (concurrent page extraction)
- Wire cache transparently into call_llm_async (get/put)
- Async gap rationale generation
- Top-K 10 to 5 (fewer chunks reach LLM)
- Embedder singleton via @functools.cache
- Skip meta-classify for controls that already have rationales

JoshDoesIT merged commit 1082995 into main Mar 8, 2026
5 checks passed

JoshDoesIT deleted the perf/pipeline-optimization branch March 8, 2026 22:04

JoshDoesIT merged 1 commit into main from perf/pipeline-optimization