feat(eval): Chat Eval Pipeline Phase 1-4#546

Merged
mangowhoiscloud merged 10 commits into main from feat/chat-eval-pipeline on Feb 9, 2026

Conversation

@mangowhoiscloud (Contributor)

Summary

  • Phase 1: Domain + Application layer (EvalResult VO, EvalConfig DTO, L1/L2/L3 services)
  • Phase 2: BARS prompts, LangGraph eval subgraph (Send API fan-out), main-graph integration
  • Phase 3: Gateway adapters (Redis + PG composite), DI wiring, V005 migration, calibration fixture
  • Phase 4: PG pool DI (asyncpg), SSE eval stage, structured logging, eval-feedback-loop skill

Key Numbers

  • 36 files across Domain/Application/Infrastructure layers
  • 165 tests ALL PASS (pytest -m eval_unit)
  • 5-expert review loop: Design R5(99.8) / Code R2(97.1)

Architecture

  • Swiss Cheese 3-Tier: L1 Code Grader → L2 LLM BARS Grader → L3 Calibration CUSUM
  • Clean Architecture: Port/Adapter, CQRS (Command/Query Gateway), frozen VO
  • FAIL_OPEN: on eval failure, fall back to B-grade (65.0); the user response is never blocked
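The FAIL_OPEN policy above can be sketched in a few lines; the function and constant names here are illustrative, not the repository's actual identifiers:

```python
# Minimal sketch of the FAIL_OPEN policy: any evaluation failure degrades
# to the B-grade score (65.0) instead of blocking the user response.
FAIL_OPEN_SCORE = 65.0  # B-grade fallback, as described in the PR summary

def evaluate_with_fail_open(run_eval, response: str) -> float:
    """Run the eval pipeline; on any error, return the B-grade fallback."""
    try:
        return run_eval(response)
    except Exception:
        # FAIL_OPEN: swallow the failure, hand back a neutral B-grade score
        return FAIL_OPEN_SCORE

def broken_grader(_response: str) -> float:
    raise RuntimeError("LLM grader timeout")
```

Because regeneration is never triggered on failure, the fallback score is deliberately mid-band rather than punitive.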

Test plan

  • pytest -m eval_unit -v --tb=short — 165 tests ALL PASS
  • black --check && ruff check — CI lint clean
  • Verified Redis-only mode works when eval_postgres_dsn is empty
  • Verified the eval stage appears in SSE events
  • Verified default behavior with enable_eval_pipeline=True

🤖 Generated with Claude Code

mangowhoiscloud and others added 10 commits February 10, 2026 00:58
Implements the Domain layer of the Swiss Cheese 3-Tier Eval Pipeline.
Pure Python, no external dependencies.

- EvalGrade (S/A/B/C): enum mapping continuous scores to letter grades
- AxisScore: frozen VO for a single BARS axis evaluation result
- ContinuousScore: 0-100 continuous-score VO (tracks information loss)
- CalibrationSample: calibration-sample VO (validated via Cohen's kappa)
- EvalScoringService: asymmetric weighted aggregation (faith=0.30, safe=0.15)
- Domain exceptions: InvalidBARSScoreError and friends

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
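The continuous-score→grade mapping this commit describes could look like the sketch below; the grade cutoffs (S≥90, A≥75, B≥60) are assumptions for illustration, not the repository's actual thresholds:

```python
# Sketch of EvalGrade + ContinuousScore from the Domain-layer commit.
# Thresholds are illustrative; the frozen dataclass mirrors the "frozen VO"
# convention and keeps the raw score to avoid information loss.
from dataclasses import dataclass
from enum import Enum

class EvalGrade(Enum):
    S = "S"
    A = "A"
    B = "B"
    C = "C"

    @classmethod
    def from_score(cls, score: float) -> "EvalGrade":
        # Assumed cutoffs; chosen so the B band contains the 65.0 fallback
        if score >= 90:
            return cls.S
        if score >= 75:
            return cls.A
        if score >= 60:
            return cls.B
        return cls.C

@dataclass(frozen=True)
class ContinuousScore:
    value: float  # 0-100; retained alongside the grade

    def grade(self) -> EvalGrade:
        return EvalGrade.from_score(self.value)
```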
Defines the EvalConfig/EvalResult DTOs and Protocol-based Ports.

- EvalConfig: 11 feature flags (mode, sampling, cost guardrails)
- EvalResult: unified evaluation-result DTO (frozen, FAIL_OPEN=B-grade)
- BARSEvaluator Port: Protocol for 5-axis LLM evaluation
- EvalResultCommandGateway: result-persistence Protocol (CQS Command)
- EvalResultQueryGateway: result lookup + daily cost Protocol (CQS Query)
- CalibrationDataGateway: calibration-data Protocol

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
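The sampling and cost-guardrail flags mentioned above suggest a gate like the following; the function signature and parameter names are hypothetical, and real code would fetch the daily cost through the Query Gateway rather than take it as an argument:

```python
# Hypothetical sampling gate + daily cost guardrail.
# The budget check runs first so an exhausted budget always wins.
import random

def should_run_eval(sample_rate: float,
                    daily_cost_usd: float,
                    daily_budget_usd: float,
                    rng=random.random) -> bool:
    """Decide whether to evaluate this response at all."""
    if daily_cost_usd >= daily_budget_usd:
        return False                # cost guardrail: budget exhausted, skip
    return rng() < sample_rate      # sampling gate: evaluate a fraction of traffic
```

Injecting `rng` keeps the gate deterministic under test.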
Implements the L1/L2/L3 evaluation services and the orchestrator Command.

- CodeGraderService (L1): deterministic evaluation over 6 orthogonal slices (<50ms)
- LLMGraderService (L2): BARS 5-axis + Self-Consistency (re-evaluates boundary scores)
- ScoreAggregatorService: merges L1+L2 via asymmetric weighted aggregation
- CalibrationMonitorService (L3): Two-Sided CUSUM drift detection
- EvaluateResponseCommand: 3-Tier orchestrator
  - sampling gate, daily cost guardrail, periodic calibration
  - FAIL_OPEN policy (on failure → B-grade, no regeneration triggered)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
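A textbook two-sided CUSUM, as named in the L3 bullet, can be sketched as below; the slack and threshold values are assumptions, and the residual here is imagined as human score minus pipeline score from calibration samples:

```python
# Illustrative Two-Sided CUSUM drift detector for calibration residuals.
# Standard form: S+ accumulates upward deviations, S- downward ones;
# either crossing the threshold signals drift. Parameters are assumed.
class TwoSidedCUSUM:
    def __init__(self, target: float = 0.0, slack: float = 0.5,
                 threshold: float = 5.0):
        self.target = target        # expected residual mean under no drift
        self.slack = slack          # allowance k: ignore small fluctuations
        self.threshold = threshold  # decision limit h
        self.pos = 0.0              # S+: cumulative upward drift
        self.neg = 0.0              # S-: cumulative downward drift

    def update(self, residual: float) -> bool:
        """Feed one residual; return True when drift is detected."""
        self.pos = max(0.0, self.pos + residual - self.target - self.slack)
        self.neg = max(0.0, self.neg - residual + self.target - self.slack)
        return self.pos > self.threshold or self.neg > self.threshold
```

With slack 0.5 and threshold 5.0, a sustained residual of 2.0 fires after four samples, while small noise around zero never accumulates.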
OpenAIBARSEvaluator adapter implementing BARSEvaluator Protocol.
Pydantic schemas for LLM Structured Output parsing.
6 BARS rubric prompt files (system + 5 axis anchors).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…egration

eval_node.py: ChatState → EvalState field mapping adapter.
eval_graph_factory.py: Eval subgraph with grader node factories.
state.py, contracts.py: EvalState fields + node contracts.
node_policy.py: Eval node FAIL_OPEN policy.
factory.py: Conditional eval subgraph wiring in main graph.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Domain: EvalGrade, AxisScore, ContinuousScore, CalibrationSample, EvalScoringService.
Application: CodeGrader, LLMGrader, ScoreAggregator, CalibrationMonitor, EvaluateResponseCommand.
Infrastructure: BARSEvaluator, eval_node, eval_subgraph_keys.
pyproject.toml: eval_unit, eval_regression, eval_capability markers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Swiss Cheese 3-Tier Grader architecture, expert review scores (97.1/100),
108 unit tests (100% pass), BARS rubric design, known limitations, next steps.
Related: ADR blog posts #273-#276.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
contracts.py: frozenset multiline formatting.
test files: trailing blank lines, duplicate import removal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_make_failed_eval_result now delegates to EvalResult.failed() DTO
instead of hardcoding a raw dict. Eliminates stale stopgap comment.
Tests updated for EvalResult.to_dict() structure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… skill

- Config: enable_eval_pipeline default True, add PG DSN fields
- DI: get_eval_pg_pool() with conditional asyncpg pool creation
- SSE: eval stage in STAGE_ORDER, PHASE_PROGRESS, NODE_MESSAGES
- Logging: structured extra dicts in eval_entry, graders, aggregator
- Skill: eval-feedback-loop 5-expert review guide
- Tests: 5 new tests (PG wiring + eval progress), 165 total ALL PASS
- Report: Phase 4 section added, test counts updated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
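The conditional asyncpg pool wiring described above might look roughly like this; the `get_eval_pg_pool` name comes from the commit message, but the module-level cache, pool sizes, and lazy import are assumptions:

```python
# Sketch of conditional PG pool DI: an empty/missing eval_postgres_dsn
# means Redis-only mode, so no pool (and no asyncpg import) is needed.
import asyncio
from typing import Optional

_pool = None  # process-wide cache so the pool is created once

async def get_eval_pg_pool(dsn: Optional[str]):
    """Return a shared asyncpg pool, or None in Redis-only mode."""
    global _pool
    if not dsn:
        return None  # Redis-only mode: PG gateway is simply not wired
    if _pool is None:
        import asyncpg  # lazy import: Redis-only deployments need no driver
        _pool = await asyncpg.create_pool(dsn, min_size=1, max_size=5)
    return _pool
```

Keeping the import lazy matches the Redis-only test-plan item: the app boots without asyncpg installed when the DSN is unset.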
@mangowhoiscloud mangowhoiscloud self-assigned this Feb 9, 2026
@github-project-automation github-project-automation bot moved this to Backlog in @Eco² Feb 9, 2026
@mangowhoiscloud mangowhoiscloud merged commit 1a5463a into main Feb 9, 2026
15 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in @Eco² Feb 9, 2026