feat(eval): Chat Eval Pipeline Phase 1-4#546

Merged
mangowhoiscloud merged 10 commits into main from feat/chat-eval-pipeline on Feb 9, 2026

Conversation

@mangowhoiscloud (Contributor)

Summary

  • Phase 1: Domain + Application layer (EvalResult VO, EvalConfig DTO, L1/L2/L3 services)
  • Phase 2: BARS prompts, LangGraph eval subgraph (Send API fan-out), main-graph integration
  • Phase 3: Gateway adapters (Redis + PG composite), DI wiring, V005 migration, calibration fixture
  • Phase 4: PG pool DI (asyncpg), SSE eval stage, structured logging, eval-feedback-loop skill

Key Numbers

  • 36 files across Domain/Application/Infrastructure layers
  • 165 tests ALL PASS (pytest -m eval_unit)
  • 5-expert review loop: Design R5(99.8) / Code R2(97.1)

Architecture

  • Swiss Cheese 3-Tier: L1 Code Grader → L2 LLM BARS Grader → L3 Calibration CUSUM
  • Clean Architecture: Port/Adapter, CQRS (Command/Query Gateway), frozen VO
  • FAIL_OPEN: on eval failure, fall back to B-grade (65.0); the user response is never blocked
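The FAIL_OPEN policy above can be sketched in a few lines; the function and constant names here are illustrative, not the repository's actual identifiers:

```python
# Minimal sketch of the FAIL_OPEN policy: any evaluation failure degrades
# to the B-grade score (65.0) instead of blocking the user response.
FAIL_OPEN_SCORE = 65.0  # B-grade fallback, as described in the PR summary

def evaluate_with_fail_open(run_eval, response: str) -> float:
    """Run the eval pipeline; on any error, return the B-grade fallback."""
    try:
        return run_eval(response)
    except Exception:
        # FAIL_OPEN: swallow the failure, hand back a neutral B-grade score
        return FAIL_OPEN_SCORE

def broken_grader(_response: str) -> float:
    raise RuntimeError("LLM grader timeout")
```

Because regeneration is never triggered on failure, the fallback score is deliberately mid-band rather than punitive.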

Test plan

  • pytest -m eval_unit -v --tb=short — 165 tests ALL PASS
  • black --check && ruff check — CI lint clean
  • Verified Redis-only mode works when eval_postgres_dsn is empty
  • Verified the eval stage appears in SSE events
  • Verified default behavior with enable_eval_pipeline=True

🤖 Generated with Claude Code

mangowhoiscloud and others added 10 commits February 10, 2026 00:58
Implements the Domain layer of the Swiss Cheese 3-Tier Eval Pipeline.
Pure Python, no external dependencies.

- EvalGrade (S/A/B/C): enum mapping continuous scores to letter grades
- AxisScore: frozen VO for a single BARS axis evaluation result
- ContinuousScore: 0-100 continuous-score VO (tracks information loss)
- CalibrationSample: calibration-sample VO (validated via Cohen's kappa)
- EvalScoringService: asymmetric weighted aggregation (faith=0.30, safe=0.15)
- Domain exceptions: InvalidBARSScoreError and friends

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
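The continuous-score→grade mapping this commit describes could look like the sketch below; the grade cutoffs (S≥90, A≥75, B≥60) are assumptions for illustration, not the repository's actual thresholds:

```python
# Sketch of EvalGrade + ContinuousScore from the Domain-layer commit.
# Thresholds are illustrative; the frozen dataclass mirrors the "frozen VO"
# convention and keeps the raw score to avoid information loss.
from dataclasses import dataclass
from enum import Enum

class EvalGrade(Enum):
    S = "S"
    A = "A"
    B = "B"
    C = "C"

    @classmethod
    def from_score(cls, score: float) -> "EvalGrade":
        # Assumed cutoffs; chosen so the B band contains the 65.0 fallback
        if score >= 90:
            return cls.S
        if score >= 75:
            return cls.A
        if score >= 60:
            return cls.B
        return cls.C

@dataclass(frozen=True)
class ContinuousScore:
    value: float  # 0-100; retained alongside the grade

    def grade(self) -> EvalGrade:
        return EvalGrade.from_score(self.value)
```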
Defines the EvalConfig/EvalResult DTOs and Protocol-based Ports.

- EvalConfig: 11 feature flags (mode, sampling, cost guardrails)
- EvalResult: unified evaluation-result DTO (frozen, FAIL_OPEN=B-grade)
- BARSEvaluator Port: Protocol for 5-axis LLM evaluation
- EvalResultCommandGateway: result-persistence Protocol (CQS Command)
- EvalResultQueryGateway: result lookup + daily cost Protocol (CQS Query)
- CalibrationDataGateway: calibration-data Protocol

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
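The sampling and cost-guardrail flags mentioned above suggest a gate like the following; the function signature and parameter names are hypothetical, and real code would fetch the daily cost through the Query Gateway rather than take it as an argument:

```python
# Hypothetical sampling gate + daily cost guardrail.
# The budget check runs first so an exhausted budget always wins.
import random

def should_run_eval(sample_rate: float,
                    daily_cost_usd: float,
                    daily_budget_usd: float,
                    rng=random.random) -> bool:
    """Decide whether to evaluate this response at all."""
    if daily_cost_usd >= daily_budget_usd:
        return False                # cost guardrail: budget exhausted, skip
    return rng() < sample_rate      # sampling gate: evaluate a fraction of traffic
```

Injecting `rng` keeps the gate deterministic under test.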
Implements the L1/L2/L3 evaluation services and the orchestrator Command.

- CodeGraderService (L1): deterministic evaluation over 6 orthogonal slices (<50ms)
- LLMGraderService (L2): BARS 5-axis + Self-Consistency (re-evaluates boundary scores)
- ScoreAggregatorService: merges L1+L2 via asymmetric weighted aggregation
- CalibrationMonitorService (L3): Two-Sided CUSUM drift detection
- EvaluateResponseCommand: 3-Tier orchestrator
  - sampling gate, daily cost guardrail, periodic calibration
  - FAIL_OPEN policy (on failure → B-grade, no regeneration triggered)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
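A textbook two-sided CUSUM, as named in the L3 bullet, can be sketched as below; the slack and threshold values are assumptions, and the residual here is imagined as human score minus pipeline score from calibration samples:

```python
# Illustrative Two-Sided CUSUM drift detector for calibration residuals.
# Standard form: S+ accumulates upward deviations, S- downward ones;
# either crossing the threshold signals drift. Parameters are assumed.
class TwoSidedCUSUM:
    def __init__(self, target: float = 0.0, slack: float = 0.5,
                 threshold: float = 5.0):
        self.target = target        # expected residual mean under no drift
        self.slack = slack          # allowance k: ignore small fluctuations
        self.threshold = threshold  # decision limit h
        self.pos = 0.0              # S+: cumulative upward drift
        self.neg = 0.0              # S-: cumulative downward drift

    def update(self, residual: float) -> bool:
        """Feed one residual; return True when drift is detected."""
        self.pos = max(0.0, self.pos + residual - self.target - self.slack)
        self.neg = max(0.0, self.neg - residual + self.target - self.slack)
        return self.pos > self.threshold or self.neg > self.threshold
```

With slack 0.5 and threshold 5.0, a sustained residual of 2.0 fires after four samples, while small noise around zero never accumulates.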
OpenAIBARSEvaluator adapter implementing BARSEvaluator Protocol.
Pydantic schemas for LLM Structured Output parsing.
6 BARS rubric prompt files (system + 5 axis anchors).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…egration

eval_node.py: ChatState → EvalState field mapping adapter.
eval_graph_factory.py: Eval subgraph with grader node factories.
state.py, contracts.py: EvalState fields + node contracts.
node_policy.py: Eval node FAIL_OPEN policy.
factory.py: Conditional eval subgraph wiring in main graph.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Domain: EvalGrade, AxisScore, ContinuousScore, CalibrationSample, EvalScoringService.
Application: CodeGrader, LLMGrader, ScoreAggregator, CalibrationMonitor, EvaluateResponseCommand.
Infrastructure: BARSEvaluator, eval_node, eval_subgraph_keys.
pyproject.toml: eval_unit, eval_regression, eval_capability markers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Swiss Cheese 3-Tier Grader architecture, expert review scores (97.1/100),
108 unit tests (100% pass), BARS rubric design, known limitations, next steps.
Related: ADR blog posts #273-#276.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
contracts.py: frozenset multiline formatting.
test files: trailing blank lines, duplicate import removal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_make_failed_eval_result now delegates to EvalResult.failed() DTO
instead of hardcoding a raw dict. Eliminates stale stopgap comment.
Tests updated for EvalResult.to_dict() structure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… skill

- Config: enable_eval_pipeline default True, add PG DSN fields
- DI: get_eval_pg_pool() with conditional asyncpg pool creation
- SSE: eval stage in STAGE_ORDER, PHASE_PROGRESS, NODE_MESSAGES
- Logging: structured extra dicts in eval_entry, graders, aggregator
- Skill: eval-feedback-loop 5-expert review guide
- Tests: 5 new tests (PG wiring + eval progress), 165 total ALL PASS
- Report: Phase 4 section added, test counts updated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
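The conditional asyncpg pool wiring described above might look roughly like this; the `get_eval_pg_pool` name comes from the commit message, but the module-level cache, pool sizes, and lazy import are assumptions:

```python
# Sketch of conditional PG pool DI: an empty/missing eval_postgres_dsn
# means Redis-only mode, so no pool (and no asyncpg import) is needed.
import asyncio
from typing import Optional

_pool = None  # process-wide cache so the pool is created once

async def get_eval_pg_pool(dsn: Optional[str]):
    """Return a shared asyncpg pool, or None in Redis-only mode."""
    global _pool
    if not dsn:
        return None  # Redis-only mode: PG gateway is simply not wired
    if _pool is None:
        import asyncpg  # lazy import: Redis-only deployments need no driver
        _pool = await asyncpg.create_pool(dsn, min_size=1, max_size=5)
    return _pool
```

Keeping the import lazy matches the Redis-only test-plan item: the app boots without asyncpg installed when the DSN is unset.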
@mangowhoiscloud mangowhoiscloud self-assigned this Feb 9, 2026
@github-project-automation github-project-automation bot moved this to Backlog in @Eco² Feb 9, 2026
@mangowhoiscloud mangowhoiscloud merged commit 1a5463a into main Feb 9, 2026
15 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in @Eco² Feb 9, 2026