perf(map): Optimize LLM pipeline with merged prompts, model tiering, and async #49

Merged
JoshDoesIT merged 1 commit into main from perf/pipeline-optimization on Mar 8, 2026

@JoshDoesIT
Owner

Summary

Reduce total LLM calls from ~573 to ~100-150 per run through three phases of optimization, targeting both the map enrichment pipeline and the parse pipeline.

Changes

Phase 1 — Concurrency Foundation

  • Add async LLM client methods (call_llm_async, generate_async, etc.) using ollama.AsyncClient
  • Add SQLite-backed LLM response cache (LLMCache in cache.py)
  • Add query_by_embedding() for batch vector queries
  • Batch embedding in mapper via embed_batch()
  • Add --concurrency and --cache CLI options to ctrlmap map
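
A minimal sketch of how a SQLite-backed response cache and a concurrency limit could fit together is shown here. It is not the code from this PR: `SketchLLMCache`, `fake_llm`, `cached_call`, and `run_prompts` are hypothetical names, the table schema is assumed, and `fake_llm` stands in for a real call made through `ollama.AsyncClient`.

```python
import asyncio
import hashlib
import json
import sqlite3
from typing import Dict, List, Optional, Tuple


class SketchLLMCache:
    """Hypothetical stand-in for LLMCache; the real schema is not shown in the PR."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, value TEXT)"
        )

    @staticmethod
    def make_key(model: str, prompt: str) -> str:
        # Hash model and prompt together so a model change invalidates old entries.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT value FROM responses WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def put(self, key: str, value: str) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO responses (key, value) VALUES (?, ?)",
            (key, value),
        )
        self._conn.commit()


async def fake_llm(model: str, prompt: str) -> str:
    # Placeholder for a real async LLM request (assumption, not the PR's client).
    await asyncio.sleep(0)
    return f"{model}:{prompt.upper()}"


async def cached_call(
    cache: SketchLLMCache,
    semaphore: asyncio.Semaphore,
    model: str,
    prompt: str,
    stats: Dict[str, int],
) -> str:
    key = cache.make_key(model, prompt)
    hit = cache.get(key)
    if hit is not None:
        stats["hits"] += 1
        return hit
    # Only cache misses compete for a concurrency slot.
    async with semaphore:
        stats["misses"] += 1
        response = await fake_llm(model, prompt)
    cache.put(key, response)
    return response


async def run_prompts(
    prompts: List[str], cache: SketchLLMCache, concurrency: int = 4
) -> Tuple[List[str], Dict[str, int]]:
    semaphore = asyncio.Semaphore(concurrency)
    stats = {"hits": 0, "misses": 0}
    results = await asyncio.gather(
        *(cached_call(cache, semaphore, "demo-model", p, stats) for p in prompts)
    )
    return list(results), stats
```

Running the same prompts twice against one cache instance should send every prompt to the model on the first pass and none on the second, which is the behavior the `--cache` flag is described as enabling for re-runs.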

Phase 2 — Call Reduction

  • Merged relevance+rationale prompt — combines two separate LLM calls into one via merged_relevance_rationale.txt and evaluate_chunk_async(), eliminating ~200 calls
  • Streaming per-control pipeline — each control flows through all steps independently instead of waiting for all controls per step
  • Raise min_score from 0.35 to 0.50 so weak matches are skipped before they reach the LLM
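
The two call-reduction ideas above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `Candidate`, `filter_candidates`, and `parse_merged_response` are hypothetical names, and the JSON shape of the merged response is assumed, since the contents of merged_relevance_rationale.txt are not shown here.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """Hypothetical retrieval hit: a chunk identifier and its similarity score."""

    chunk_id: str
    score: float


def filter_candidates(
    candidates: List[Candidate], min_score: float = 0.50
) -> List[Candidate]:
    # Drop weak matches up front so they never cost an LLM call.
    return [c for c in candidates if c.score >= min_score]


def parse_merged_response(raw: str) -> Tuple[bool, str]:
    # Assumed response shape: {"relevant": bool, "rationale": str}.
    # One response carries both answers, replacing two separate calls.
    data = json.loads(raw)
    relevant = bool(data.get("relevant", False))
    rationale = str(data.get("rationale", "")) if relevant else ""
    return relevant, rationale
```

With this shape, a chunk scoring 0.45 would have been evaluated under the old 0.35 threshold but is filtered out under 0.50, and each surviving chunk needs one LLM round trip instead of two.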

Phase 3 — Model Tiering & Remaining Optimizations

  • Model tiering: qwen2.5:7b for simple tasks (meta-classify, gap rationale, parse), qwen2.5:14b for accuracy-critical compliance evaluation (~2x faster for 60% of calls)
  • Async parse pipeline: llm_chunker.py now uses asyncio.gather() for concurrent page extraction (3-5x parse speedup)
  • Cache wired into call_llm_async() — transparent get/put makes --cache flag actually work for near-instant re-runs
  • Async gap rationale — true async instead of sync-in-async wrapper
  • Top-K 10 → 5 — fewer chunks sent to LLM evaluation
  • Embedder singleton: @functools.cache shares the SentenceTransformer model across pipeline stages
  • Skip meta-classify for mapped controls — ~20 fewer LLM calls
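
As a rough illustration of model tiering and the embedder singleton, the sketch below routes tasks to a model name and caches a loader with `functools.cache`. The task labels, `model_for_task`, `FakeEmbedder`, and `get_embedder` are hypothetical; `FakeEmbedder` stands in for the SentenceTransformer model so the example runs without downloading weights.

```python
import functools

# Hypothetical task labels; the PR names the tasks but not their identifiers.
SIMPLE_TASKS = frozenset({"meta_classify", "gap_rationale", "parse"})


def model_for_task(task: str) -> str:
    # Simple tasks go to the smaller model; everything else stays on the larger one.
    return "qwen2.5:7b" if task in SIMPLE_TASKS else "qwen2.5:14b"


class FakeEmbedder:
    """Stand-in for an expensive model load; counts how often it is constructed."""

    load_count = 0

    def __init__(self, name: str) -> None:
        FakeEmbedder.load_count += 1
        self.name = name


@functools.cache
def get_embedder(name: str = "demo-embedder") -> FakeEmbedder:
    # The first call constructs the embedder; later calls with the same
    # arguments return the cached instance.
    return FakeEmbedder(name)
```

One detail worth keeping in mind with this pattern: `functools.cache` keys on the arguments as passed, so `get_embedder()` and `get_embedder("demo-embedder")` are cached separately. Calling the accessor the same way everywhere keeps it a true singleton.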

Test Plan

  • All existing tests pass (uv run pytest) — 229/229
  • New tests written following TDD (Red → Green → Refactor)
  • Linting passes (uv run ruff check .)
  • Type checking passes (uv run mypy src/)
  • Eval suite: relevance/compliance/meta at 100%, 2 pre-existing faithfulness flakes unchanged

Checklist

  • Tests written before implementation (TDD)
  • Documentation updated (ARCHITECTURE.md)
  • No unused code, console logs, or dead comments
  • Follows existing project coding style

…and async

Reduce total LLM calls from ~573 to ~100-150 per run through three
phases of optimization:

Phase 1 — Concurrency foundation:
- Add async LLM client methods (call_llm_async, generate_async, etc.)
- Add SQLite-backed LLM response cache (LLMCache)
- Add query_by_embedding() for batch vector queries
- Batch embedding in mapper via embed_batch()
- Add --concurrency and --cache CLI options

Phase 2 — Call reduction:
- Merge relevance check + rationale into single LLM call
  (merged_relevance_rationale.txt, evaluate_chunk_async)
- Streaming per-control pipeline (no cross-control blocking)
- Raise min_score 0.35 to 0.50 (skip weak matches before LLM)

Phase 3 — Model tiering and remaining optimizations:
- Use qwen2.5:7b for simple tasks (meta-classify, gap, parse)
- Keep qwen2.5:14b for accuracy-critical compliance evaluation
- Async parse pipeline in llm_chunker.py (concurrent page extraction)
- Wire cache transparently into call_llm_async (get/put)
- Async gap rationale generation
- Top-K 10 to 5 (fewer chunks reach LLM)
- Embedder singleton via @functools.cache
- Skip meta-classify for controls that already have rationales
@JoshDoesIT JoshDoesIT merged commit 1082995 into main Mar 8, 2026
5 checks passed
@JoshDoesIT JoshDoesIT deleted the perf/pipeline-optimization branch March 8, 2026 22:04