perf(map): Optimize LLM pipeline with merged prompts, model tiering, and async #49
Merged
…and async

Reduce total LLM calls from ~573 to ~100-150 per run through three phases of optimization:

Phase 1 — Concurrency foundation:
- Add async LLM client methods (call_llm_async, generate_async, etc.)
- Add SQLite-backed LLM response cache (LLMCache)
- Add query_by_embedding() for batch vector queries
- Batch embedding in mapper via embed_batch()
- Add --concurrency and --cache CLI options

Phase 2 — Call reduction:
- Merge relevance check + rationale into single LLM call (merged_relevance_rationale.txt, evaluate_chunk_async)
- Streaming per-control pipeline (no cross-control blocking)
- Raise min_score 0.35 to 0.50 (skip weak matches before LLM)

Phase 3 — Model tiering and remaining optimizations:
- Use qwen2.5:7b for simple tasks (meta-classify, gap, parse)
- Keep qwen2.5:14b for accuracy-critical compliance evaluation
- Async parse pipeline in llm_chunker.py (concurrent page extraction)
- Wire cache transparently into call_llm_async (get/put)
- Async gap rationale generation
- Top-K 10 to 5 (fewer chunks reach LLM)
- Embedder singleton via @functools.cache
- Skip meta-classify for controls that already have rationales
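For reviewers who want a mental model of "wire cache transparently into call_llm_async (get/put)" combined with a concurrency limit, here is a minimal sketch. It is not the code in this PR: `SketchCache`, `fake_llm`, `cached_call`, and `run_batch` are hypothetical names, the model call is a placeholder rather than a real Ollama request, and the cache schema is an assumption.

```python
import asyncio
import hashlib
import sqlite3
from typing import List, Optional

# Counts how many times the placeholder "model" is actually invoked.
CALLS = {"count": 0}


class SketchCache:
    """Tiny SQLite-backed response cache (hypothetical stand-in for LLMCache)."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, value TEXT)"
        )

    @staticmethod
    def make_key(model: str, prompt: str) -> str:
        # Key on model and prompt so different model tiers never share entries.
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[str]:
        row = self._db.execute(
            "SELECT value FROM responses WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def put(self, key: str, value: str) -> None:
        self._db.execute(
            "INSERT OR REPLACE INTO responses (key, value) VALUES (?, ?)",
            (key, value),
        )
        self._db.commit()


async def fake_llm(model: str, prompt: str) -> str:
    """Placeholder for the real model call (assumption, not the Ollama API)."""
    CALLS["count"] += 1
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"[{model}] {prompt.upper()}"


async def cached_call(
    cache: SketchCache, sem: asyncio.Semaphore, model: str, prompt: str
) -> str:
    key = cache.make_key(model, prompt)
    hit = cache.get(key)
    if hit is not None:
        return hit  # cache hit: no model call, no semaphore slot used
    async with sem:  # at most `concurrency` model calls in flight
        response = await fake_llm(model, prompt)
    cache.put(key, response)
    return response


async def run_batch(
    cache: SketchCache, prompts: List[str], concurrency: int = 4
) -> List[str]:
    sem = asyncio.Semaphore(concurrency)
    tasks = [cached_call(cache, sem, "qwen2.5:7b", p) for p in prompts]
    return await asyncio.gather(*tasks)  # results keep the input order
```

Running the same batch twice against one `SketchCache` should trigger model calls only on the first pass, which is the behaviour the `--cache` option is described as enabling for re-runs. Note that identical prompts issued concurrently within a single batch would each miss the cache in this sketch.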
Summary
Reduce total LLM calls from ~573 to ~100-150 per run through three phases of optimization, targeting both the map enrichment pipeline and the parse pipeline.
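Part of the reduction comes from filtering before the LLM is ever called: the changes listed below raise `min_score` from 0.35 to 0.50 and lower Top-K from 10 to 5. A minimal sketch of that kind of pre-filter follows, assuming scores are similarities where higher is better. The function name `select_chunks` and the `(chunk, score)` tuple shape are hypothetical, not taken from the PR.

```python
from typing import List, Tuple


def select_chunks(
    scored: List[Tuple[str, float]],
    min_score: float = 0.50,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Keep only chunks worth sending to the LLM.

    Drops anything below min_score, then keeps the top_k best of the rest.
    """
    kept = [(chunk, score) for chunk, score in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    return kept[:top_k]


candidates = [
    ("c1", 0.91), ("c2", 0.48), ("c3", 0.77), ("c4", 0.36),
    ("c5", 0.62), ("c6", 0.55), ("c7", 0.50), ("c8", 0.83),
]
old = select_chunks(candidates, min_score=0.35, top_k=10)
new = select_chunks(candidates)  # defaults mirror the new thresholds
```

With these made-up candidates, the old settings would pass all eight chunks to the LLM, while the new defaults pass five.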
Changes
Phase 1 — Concurrency Foundation
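- Add async LLM client methods (`call_llm_async`, `generate_async`, etc.) using `ollama.AsyncClient`
- Add SQLite-backed LLM response cache (`LLMCache` in `cache.py`)
- Add `query_by_embedding()` for batch vector queries
- Batch embedding in mapper via `embed_batch()`
- Add `--concurrency` and `--cache` CLI options to `ctrlmap map`

Phase 2 — Call Reduction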
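- Merge relevance check + rationale into a single LLM call via `merged_relevance_rationale.txt` and `evaluate_chunk_async()`, eliminating ~200 calls
- Streaming per-control pipeline (no cross-control blocking)
- Raise `min_score` from 0.35 → 0.50 to skip weak matches before they reach the LLM

Phase 3 — Model Tiering & Remaining Optimizations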
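- Use `qwen2.5:7b` for simple tasks (meta-classify, gap rationale, parse), `qwen2.5:14b` for accuracy-critical compliance evaluation (~2x faster for 60% of calls)
- Async parse pipeline: `llm_chunker.py` now uses `asyncio.gather()` for concurrent page extraction (3-5x parse speedup)
- Wire cache into `call_llm_async()`: transparent get/put makes the `--cache` flag actually work for near-instant re-runs
- Async gap rationale generation
- Top-K 10 to 5 (fewer chunks reach the LLM)
- Embedder singleton via `@functools.cache` shares the SentenceTransformer model across pipeline stages
- Skip meta-classify for controls that already have rationales

Test Plan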
- Tests (`uv run pytest`): 229/229
- Lint (`uv run ruff check .`)
- Type check (`uv run mypy src/`)

Checklist