
feat(eval): Chat Eval Pipeline — Swiss Cheese 3-Tier Grader#545

Merged
mangowhoiscloud merged 9 commits into develop from feat/chat-eval-pipeline on Feb 9, 2026

Conversation

@mangowhoiscloud
Contributor

Summary

  • Implements the Swiss Cheese 3-Tier Grader (L1 Code + L2 LLM BARS + L3 Calibration CUSUM)
  • Clean Architecture 4-layer structure: Domain → Application → Infrastructure (LangGraph)
  • 65 files, +6,789 lines, 108 unit tests (100% pass)

Architecture

L1 Code Grader (deterministic, 6 slices)
  ↓ priority_preemptive_reducer
L2 LLM Grader (BARS 5-axis, Self-Consistency, SDK Structured Output)
  ↓ asymmetric weighted sum
L3 Calibration Monitor (Two-sided CUSUM, k=0.5, h=4.0)
  ↓ EvalGrade (A/B/C) + regeneration decision

Commits (7)

| # | Commit | Scope |
|---|--------|-------|
| 1 | feat(eval): Domain layer | EvalGrade, AxisScore, ContinuousScore, CalibrationSample, EvalScoringService |
| 2 | feat(eval): Application layer — DTOs, Ports, Exceptions | EvalConfig, EvalResult, 4 Ports, eval exceptions |
| 3 | feat(eval): Application Services + Command | CodeGrader, LLMGrader, ScoreAggregator, CalibrationMonitor, EvaluateResponseCommand |
| 4 | feat(eval): Infrastructure — BARS evaluator + prompts | OpenAIBARSEvaluator adapter, Pydantic schemas, 6 BARS rubric prompts |
| 5 | feat(eval): Infrastructure — LangGraph eval subgraph | eval_node, eval_graph_factory, state/contracts, main graph integration |
| 6 | test(eval): Unit tests + pytest markers | 108 tests across all layers, eval_unit/eval_regression/eval_capability markers |
| 7 | docs(eval): Implementation report | Phase 1+2 report with expert review scores (97.1/100) |

Key Design Decisions

  • FAIL_OPEN policy: on eval failure, fall back to grade B (score 65.0) without triggering regeneration
  • Self-Consistency: on boundary scores (2, 4), re-evaluate N times and adopt the median
  • Bias mitigation: randomly shuffle axis order to reduce positional bias
  • Cost guardrail: skip the LLM Grader when eval_max_cost_usd is exceeded
  • SDK Structured Output: use SDK-level constrained decoding for both OpenAI and Gemini
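The Self-Consistency decision above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `evaluate_once` stands in for a single LLM grading call, and the boundary set {2, 4} and default N come from the PR summary:

```python
import statistics

BOUNDARY_SCORES = {2, 4}  # per the PR: boundary axis scores trigger re-evaluation

def self_consistent_score(evaluate_once, n: int = 3) -> int:
    """Score one BARS axis. If the first score lands on a decision boundary,
    sample a total of n scores and adopt the median; otherwise keep the first."""
    first = evaluate_once()
    if first not in BOUNDARY_SCORES:
        return first
    samples = [first] + [evaluate_once() for _ in range(n - 1)]
    return int(statistics.median(samples))
```

Re-sampling only at the boundaries keeps the extra LLM cost proportional to the ambiguous cases rather than the whole traffic.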

Phase 3 (Next)

  • Gateway adapters (PostgreSQL, Redis)
  • DI wiring (setup/dependencies.py)
  • Integration tests

Test plan

  • pytest -m eval_unit — 108 tests, 100% pass
  • black --check — clean
  • ruff check — clean
  • Run the CI pipeline (before PR merge)
  • Run the full test suite in a Docker environment
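For illustration, a test selected by `pytest -m eval_unit` might look like the following. The test name and body are hypothetical; only the `eval_unit` marker and the FAIL_OPEN values (grade B, 65.0) come from this PR:

```python
import pytest

@pytest.mark.eval_unit
def test_fail_open_defaults_to_grade_b():
    # FAIL_OPEN: an eval failure should yield grade B at 65.0 (per the PR summary)
    failed_score, failed_grade = 65.0, "B"
    assert failed_grade == "B"
    assert failed_score == 65.0
```

Registering `eval_unit`, `eval_regression`, and `eval_capability` in pyproject.toml (as commit 6 does) keeps `pytest --strict-markers` from rejecting them.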

🤖 Generated with Claude Code

mangowhoiscloud and others added 6 commits February 10, 2026 00:58
Implements the Domain layer of the Swiss Cheese 3-Tier Eval Pipeline.
Pure Python, no external dependencies.

- EvalGrade(S/A/B/C): Enum mapping continuous scores to grades
- AxisScore: value object for a single BARS axis result (frozen)
- ContinuousScore: 0-100 continuous-score value object (tracks information loss)
- CalibrationSample: calibration-sample value object (Cohen's kappa validation)
- EvalScoringService: asymmetric weighted sum (faith=0.30, safe=0.15)
- Domain exceptions: InvalidBARSScoreError, etc.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
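A sketch of the asymmetric weighted sum described for EvalScoringService. Only the faithfulness (0.30) and safety (0.15) weights appear in this PR; the other three axis names and weights below are placeholder assumptions chosen to sum to 1.0, and the 1-5 → 0-100 rescaling is illustrative:

```python
# Hypothetical 5-axis weight vector; only faithfulness=0.30 and safety=0.15
# come from the PR, the remaining three are illustrative placeholders.
AXIS_WEIGHTS = {
    "faithfulness": 0.30,
    "safety": 0.15,
    "helpfulness": 0.20,   # assumed
    "coherence": 0.20,     # assumed
    "conciseness": 0.15,   # assumed
}

def weighted_axis_sum(axis_scores: dict[str, float]) -> float:
    """Combine per-axis 1-5 BARS scores into a 0-100 continuous score.
    The weighting is 'asymmetric' in that axes carry unequal weights."""
    raw = sum(AXIS_WEIGHTS[axis] * score for axis, score in axis_scores.items())
    return (raw - 1.0) / 4.0 * 100.0  # rescale the 1-5 range onto 0-100
```

Keeping the result continuous (rather than collapsing straight to S/A/B/C) is what lets ContinuousScore track information loss before the final grade mapping.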
Defines the EvalConfig/EvalResult DTOs and Protocol-based Ports.

- EvalConfig: 11 feature flags (mode, sampling, cost guardrail)
- EvalResult: unified evaluation-result DTO (frozen, FAIL_OPEN=grade B)
- BARSEvaluator Port: Protocol for 5-axis LLM evaluation
- EvalResultCommandGateway: result-persistence Protocol (CQS Command)
- EvalResultQueryGateway: result-lookup + daily-cost Protocol (CQS Query)
- CalibrationDataGateway: calibration-data Protocol

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements the L1/L2/L3 evaluation services and the orchestrator Command.

- CodeGraderService (L1): deterministic evaluation over 6 orthogonal slices (<50ms)
- LLMGraderService (L2): BARS 5-axis + Self-Consistency (boundary-score re-evaluation)
- ScoreAggregatorService: L1+L2 aggregation, asymmetric weighted sum
- CalibrationMonitorService (L3): Two-Sided CUSUM drift detection
- EvaluateResponseCommand: 3-tier orchestrator
  - Sampling gate, daily cost guardrail, periodic calibration
  - FAIL_OPEN policy (failure → grade B, no regeneration triggered)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
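The two-sided CUSUM named above (k=0.5, h=4.0, per the PR summary) can be sketched as a small detector. What it consumes is an assumption here: the classic parameterization expects standardized residuals (mean 0, unit variance under no drift), e.g. standardized L2-vs-calibration score differences:

```python
class TwoSidedCUSUM:
    """Two-sided CUSUM drift detector with reference value k and threshold h.
    Assumes standardized inputs: mean 0, unit variance when no drift is present."""

    def __init__(self, k: float = 0.5, h: float = 4.0):
        self.k, self.h = k, h
        self.s_pos = 0.0  # accumulates evidence of upward drift
        self.s_neg = 0.0  # accumulates evidence of downward drift

    def update(self, x: float) -> bool:
        """Feed one observation; return True once either side crosses h."""
        self.s_pos = max(0.0, self.s_pos + x - self.k)
        self.s_neg = max(0.0, self.s_neg - x - self.k)
        return self.s_pos > self.h or self.s_neg > self.h
```

With k=0.5 and h=4.0, a sustained one-sigma shift accumulates 0.5 per sample, so an alarm fires after roughly nine drifted observations, while zero-mean noise keeps both sums pinned near zero.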
OpenAIBARSEvaluator adapter implementing BARSEvaluator Protocol.
Pydantic schemas for LLM Structured Output parsing.
6 BARS rubric prompt files (system + 5 axis anchors).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
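As a rough illustration of the "Pydantic schemas for LLM Structured Output parsing" mentioned above, a 5-axis BARS result might be modeled like this. Every field name below is a hypothetical assumption; the actual schemas in the OpenAIBARSEvaluator adapter may differ:

```python
from pydantic import BaseModel, Field

class BARSAxisResult(BaseModel):
    """One axis of a BARS evaluation (field names are illustrative)."""
    axis: str
    score: int = Field(ge=1, le=5)  # BARS anchor scale
    rationale: str                  # grader's justification for the anchor chosen

class BARSEvaluation(BaseModel):
    """Full 5-axis evaluation, constrained to exactly five axis results."""
    axes: list[BARSAxisResult] = Field(min_length=5, max_length=5)
```

Constraining the schema this way is what lets SDK-level constrained decoding (noted under Key Design Decisions) reject out-of-range scores at parse time rather than in application code.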
…egration

eval_node.py: ChatState → EvalState field mapping adapter.
eval_graph_factory.py: Eval subgraph with grader node factories.
state.py, contracts.py: EvalState fields + node contracts.
node_policy.py: Eval node FAIL_OPEN policy.
factory.py: Conditional eval subgraph wiring in main graph.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Domain: EvalGrade, AxisScore, ContinuousScore, CalibrationSample, EvalScoringService.
Application: CodeGrader, LLMGrader, ScoreAggregator, CalibrationMonitor, EvaluateResponseCommand.
Infrastructure: BARSEvaluator, eval_node, eval_subgraph_keys.
pyproject.toml: eval_unit, eval_regression, eval_capability markers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Swiss Cheese 3-Tier Grader architecture, expert review scores (97.1/100),
108 unit tests (100% pass), BARS rubric design, known limitations, next steps.
Related: ADR blog posts #273-#276.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mangowhoiscloud and others added 2 commits February 10, 2026 01:19
contracts.py: frozenset multiline formatting.
test files: trailing blank lines, duplicate import removal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_make_failed_eval_result now delegates to EvalResult.failed() DTO
instead of hardcoding a raw dict. Eliminates stale stopgap comment.
Tests updated for EvalResult.to_dict() structure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mangowhoiscloud mangowhoiscloud self-assigned this Feb 9, 2026
@mangowhoiscloud mangowhoiscloud merged commit 4ac9d6b into develop Feb 9, 2026
5 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in @Eco² Feb 9, 2026