
feat(eval): Chat Eval Pipeline — Swiss Cheese 3-Tier Grader#545

Merged
mangowhoiscloud merged 9 commits into develop from feat/chat-eval-pipeline on Feb 9, 2026

Conversation

@mangowhoiscloud
Contributor

Summary

  • Implements the Swiss Cheese 3-Tier Grader (L1 Code + L2 LLM BARS + L3 Calibration CUSUM)
  • Clean Architecture 4-layer structure: Domain → Application → Infrastructure (LangGraph)
  • 65 files, +6,789 lines, 108 unit tests (100% pass)

Architecture

L1 Code Grader (deterministic, 6 slices)
  ↓ priority_preemptive_reducer
L2 LLM Grader (BARS 5-axis, Self-Consistency, SDK Structured Output)
  ↓ asymmetric weighted sum
L3 Calibration Monitor (Two-sided CUSUM, k=0.5, h=4.0)
  ↓ EvalGrade (A/B/C) + regeneration decision

Commits (7)

| # | Commit | Scope |
|---|--------|-------|
| 1 | feat(eval): Domain layer | EvalGrade, AxisScore, ContinuousScore, CalibrationSample, EvalScoringService |
| 2 | feat(eval): Application layer — DTOs, Ports, Exceptions | EvalConfig, EvalResult, 4 Ports, eval exceptions |
| 3 | feat(eval): Application Services + Command | CodeGrader, LLMGrader, ScoreAggregator, CalibrationMonitor, EvaluateResponseCommand |
| 4 | feat(eval): Infrastructure — BARS evaluator + prompts | OpenAIBARSEvaluator adapter, Pydantic schemas, 6 BARS rubric prompts |
| 5 | feat(eval): Infrastructure — LangGraph eval subgraph | eval_node, eval_graph_factory, state/contracts, main graph integration |
| 6 | test(eval): Unit tests + pytest markers | 108 tests across all layers, eval_unit/eval_regression/eval_capability markers |
| 7 | docs(eval): Implementation report | Phase 1+2 report with expert review scores (97.1/100) |

Key Design Decisions

  • FAIL_OPEN policy: on eval failure, fall back to grade B (score 65.0) without triggering regeneration
  • Self-Consistency: on boundary scores (2, 4), re-evaluate N times and adopt the median
  • Bias mitigation: randomly shuffle axis order to reduce positional bias
  • Cost guardrail: skip the LLM Grader when eval_max_cost_usd is exceeded
  • SDK Structured Output: use SDK-level constrained decoding for both OpenAI and Gemini
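The Self-Consistency decision above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `evaluate_once` stands in for a single LLM grading call, and the boundary set {2, 4} and default N come from the PR summary:

```python
import statistics

BOUNDARY_SCORES = {2, 4}  # per the PR: boundary axis scores trigger re-evaluation

def self_consistent_score(evaluate_once, n: int = 3) -> int:
    """Score one BARS axis. If the first score lands on a decision boundary,
    sample a total of n scores and adopt the median; otherwise keep the first."""
    first = evaluate_once()
    if first not in BOUNDARY_SCORES:
        return first
    samples = [first] + [evaluate_once() for _ in range(n - 1)]
    return int(statistics.median(samples))
```

Re-sampling only at the boundaries keeps the extra LLM cost proportional to the ambiguous cases rather than the whole traffic.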

Phase 3 (Next)

  • Gateway adapters (PostgreSQL, Redis)
  • DI wiring (setup/dependencies.py)
  • Integration tests

Test plan

  • pytest -m eval_unit — 108 tests, 100% pass
  • black --check — clean
  • ruff check — clean
  • Run the CI pipeline (before PR merge)
  • Run the full test suite in a Docker environment
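For illustration, a test selected by `pytest -m eval_unit` might look like the following. The test name and body are hypothetical; only the `eval_unit` marker and the FAIL_OPEN values (grade B, 65.0) come from this PR:

```python
import pytest

@pytest.mark.eval_unit
def test_fail_open_defaults_to_grade_b():
    # FAIL_OPEN: an eval failure should yield grade B at 65.0 (per the PR summary)
    failed_score, failed_grade = 65.0, "B"
    assert failed_grade == "B"
    assert failed_score == 65.0
```

Registering `eval_unit`, `eval_regression`, and `eval_capability` in pyproject.toml (as commit 6 does) keeps `pytest --strict-markers` from rejecting them.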

🤖 Generated with Claude Code

mangowhoiscloud and others added 6 commits February 10, 2026 00:58
Implements the Domain layer of the Swiss Cheese 3-Tier Eval Pipeline.
Pure Python, no external dependencies.

- EvalGrade(S/A/B/C): Enum mapping continuous scores to grades
- AxisScore: value object for a single BARS axis result (frozen)
- ContinuousScore: 0-100 continuous-score value object (tracks information loss)
- CalibrationSample: calibration-sample value object (Cohen's kappa validation)
- EvalScoringService: asymmetric weighted sum (faith=0.30, safe=0.15)
- Domain exceptions: InvalidBARSScoreError, etc.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
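A sketch of the asymmetric weighted sum described for EvalScoringService. Only the faithfulness (0.30) and safety (0.15) weights appear in this PR; the other three axis names and weights below are placeholder assumptions chosen to sum to 1.0, and the 1-5 → 0-100 rescaling is illustrative:

```python
# Hypothetical 5-axis weight vector; only faithfulness=0.30 and safety=0.15
# come from the PR, the remaining three are illustrative placeholders.
AXIS_WEIGHTS = {
    "faithfulness": 0.30,
    "safety": 0.15,
    "helpfulness": 0.20,   # assumed
    "coherence": 0.20,     # assumed
    "conciseness": 0.15,   # assumed
}

def weighted_axis_sum(axis_scores: dict[str, float]) -> float:
    """Combine per-axis 1-5 BARS scores into a 0-100 continuous score.
    The weighting is 'asymmetric' in that axes carry unequal weights."""
    raw = sum(AXIS_WEIGHTS[axis] * score for axis, score in axis_scores.items())
    return (raw - 1.0) / 4.0 * 100.0  # rescale the 1-5 range onto 0-100
```

Keeping the result continuous (rather than collapsing straight to S/A/B/C) is what lets ContinuousScore track information loss before the final grade mapping.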
Defines the EvalConfig/EvalResult DTOs and Protocol-based Ports.

- EvalConfig: 11 feature flags (mode, sampling, cost guardrail)
- EvalResult: unified evaluation-result DTO (frozen, FAIL_OPEN=grade B)
- BARSEvaluator Port: Protocol for 5-axis LLM evaluation
- EvalResultCommandGateway: result-persistence Protocol (CQS Command)
- EvalResultQueryGateway: result-lookup + daily-cost Protocol (CQS Query)
- CalibrationDataGateway: calibration-data Protocol

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements the L1/L2/L3 evaluation services and the orchestrator Command.

- CodeGraderService (L1): deterministic evaluation over 6 orthogonal slices (<50ms)
- LLMGraderService (L2): BARS 5-axis + Self-Consistency (boundary-score re-evaluation)
- ScoreAggregatorService: L1+L2 aggregation, asymmetric weighted sum
- CalibrationMonitorService (L3): Two-Sided CUSUM drift detection
- EvaluateResponseCommand: 3-tier orchestrator
  - Sampling gate, daily cost guardrail, periodic calibration
  - FAIL_OPEN policy (failure → grade B, no regeneration triggered)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
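The two-sided CUSUM named above (k=0.5, h=4.0, per the PR summary) can be sketched as a small detector. What it consumes is an assumption here: the classic parameterization expects standardized residuals (mean 0, unit variance under no drift), e.g. standardized L2-vs-calibration score differences:

```python
class TwoSidedCUSUM:
    """Two-sided CUSUM drift detector with reference value k and threshold h.
    Assumes standardized inputs: mean 0, unit variance when no drift is present."""

    def __init__(self, k: float = 0.5, h: float = 4.0):
        self.k, self.h = k, h
        self.s_pos = 0.0  # accumulates evidence of upward drift
        self.s_neg = 0.0  # accumulates evidence of downward drift

    def update(self, x: float) -> bool:
        """Feed one observation; return True once either side crosses h."""
        self.s_pos = max(0.0, self.s_pos + x - self.k)
        self.s_neg = max(0.0, self.s_neg - x - self.k)
        return self.s_pos > self.h or self.s_neg > self.h
```

With k=0.5 and h=4.0, a sustained one-sigma shift accumulates 0.5 per sample, so an alarm fires after roughly nine drifted observations, while zero-mean noise keeps both sums pinned near zero.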
OpenAIBARSEvaluator adapter implementing BARSEvaluator Protocol.
Pydantic schemas for LLM Structured Output parsing.
6 BARS rubric prompt files (system + 5 axis anchors).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
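As a rough illustration of the "Pydantic schemas for LLM Structured Output parsing" mentioned above, a 5-axis BARS result might be modeled like this. Every field name below is a hypothetical assumption; the actual schemas in the OpenAIBARSEvaluator adapter may differ:

```python
from pydantic import BaseModel, Field

class BARSAxisResult(BaseModel):
    """One axis of a BARS evaluation (field names are illustrative)."""
    axis: str
    score: int = Field(ge=1, le=5)  # BARS anchor scale
    rationale: str                  # grader's justification for the anchor chosen

class BARSEvaluation(BaseModel):
    """Full 5-axis evaluation, constrained to exactly five axis results."""
    axes: list[BARSAxisResult] = Field(min_length=5, max_length=5)
```

Constraining the schema this way is what lets SDK-level constrained decoding (noted under Key Design Decisions) reject out-of-range scores at parse time rather than in application code.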
…egration

eval_node.py: ChatState → EvalState field mapping adapter.
eval_graph_factory.py: Eval subgraph with grader node factories.
state.py, contracts.py: EvalState fields + node contracts.
node_policy.py: Eval node FAIL_OPEN policy.
factory.py: Conditional eval subgraph wiring in main graph.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Domain: EvalGrade, AxisScore, ContinuousScore, CalibrationSample, EvalScoringService.
Application: CodeGrader, LLMGrader, ScoreAggregator, CalibrationMonitor, EvaluateResponseCommand.
Infrastructure: BARSEvaluator, eval_node, eval_subgraph_keys.
pyproject.toml: eval_unit, eval_regression, eval_capability markers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Swiss Cheese 3-Tier Grader architecture, expert review scores (97.1/100),
108 unit tests (100% pass), BARS rubric design, known limitations, next steps.
Related: ADR blog posts #273-#276.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mangowhoiscloud and others added 2 commits February 10, 2026 01:19
contracts.py: frozenset multiline formatting.
test files: trailing blank lines, duplicate import removal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_make_failed_eval_result now delegates to EvalResult.failed() DTO
instead of hardcoding a raw dict. Eliminates stale stopgap comment.
Tests updated for EvalResult.to_dict() structure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mangowhoiscloud mangowhoiscloud self-assigned this Feb 9, 2026
@mangowhoiscloud mangowhoiscloud merged commit 4ac9d6b into develop Feb 9, 2026
5 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in @Eco² Feb 9, 2026