
Conversation

@sahusiddharth (Contributor) commented Oct 12, 2025

Resolves: #3738

Summary

This PR introduces a pairwise evaluation framework for judging two model generations against each other using an LLM. It includes support for consensus checks by allowing the evaluator to judge both the original and flipped outputs.

  • Pairwise comparison of two responses using an LLM as judge.
  • `consensus` parameter that forces a second, order-flipped evaluation.
  • Integration with existing evaluation experiments for expected outputs.
  • Documentation and usage example provided.

Pairwise Evaluation Consensus Logic

This table breaks down all possible outcomes from the consensus evaluation method. Scores correspond to: output wins (1.0), second_output wins (0.0), and Tie (0.5).

| Case | Initial Score (s) | Flipped Score (fs) | Votes for `output` (s + (1−fs)) | Votes for `second_output` ((1−s) + fs) | Final Verdict | Tie Type |
|---|---|---|---|---|---|---|
| 1 | 1.0 (output) | 1.0 (second_output) | 1.0 + (1−1.0) = 1.0 | (1−1.0) + 1.0 = 1.0 | Tie | Disagreement |
| 2 | 1.0 (output) | 0.0 (output) | 1.0 + (1−0.0) = 2.0 | (1−1.0) + 0.0 = 0.0 | `output` wins | N/A |
| 3 | 0.0 (second_output) | 1.0 (second_output) | 0.0 + (1−1.0) = 0.0 | (1−0.0) + 1.0 = 2.0 | `second_output` wins | N/A |
| 4 | 0.0 (second_output) | 0.0 (output) | 0.0 + (1−0.0) = 1.0 | (1−0.0) + 0.0 = 1.0 | Tie | Disagreement |
| 5 | 1.0 (output) | 0.5 (Tie) | 1.0 + (1−0.5) = 1.5 | (1−1.0) + 0.5 = 0.5 | `output` wins | N/A |
| 6 | 0.5 (Tie) | 1.0 (second_output) | 0.5 + (1−1.0) = 0.5 | (1−0.5) + 1.0 = 1.5 | `second_output` wins | N/A |
| 7 | 0.0 (second_output) | 0.5 (Tie) | 0.0 + (1−0.5) = 0.5 | (1−0.0) + 0.5 = 1.5 | `second_output` wins | N/A |
| 8 | 0.5 (Tie) | 0.0 (output) | 0.5 + (1−0.0) = 1.5 | (1−0.5) + 0.0 = 0.5 | `output` wins | N/A |
| 9 | 0.5 (Tie) | 0.5 (Tie) | 0.5 + (1−0.5) = 1.0 | (1−0.5) + 0.5 = 1.0 | Tie | Actual |
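
The vote-counting rule in the table can be expressed directly in code. The sketch below is illustrative only; the function name `resolve_consensus` and the string verdicts are not taken from the PR's implementation, but the arithmetic reproduces every row of the table.

```python
def resolve_consensus(s: float, fs: float) -> str:
    """Combine an initial score `s` and a flipped score `fs` into a final verdict.

    Scores follow the table above: 1.0 means `output` wins, 0.0 means
    `second_output` wins, and 0.5 is a tie. Because `fs` comes from the
    order-flipped evaluation, it is inverted when counting votes for `output`.
    """
    votes_output = s + (1 - fs)        # votes for `output`
    votes_second = (1 - s) + fs        # votes for `second_output`
    if votes_output > votes_second:
        return "output"
    if votes_second > votes_output:
        return "second_output"
    return "tie"                       # disagreement tie (cases 1, 4) or actual tie (case 9)


# Case 5 from the table: initial run picks `output`, flipped run is a tie.
assert resolve_consensus(1.0, 0.5) == "output"
# Case 1: the judge picks whichever response comes first in both orderings.
assert resolve_consensus(1.0, 1.0) == "tie"
```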

Example

```python
# Usage sketch. The import path is assumed from this PR's export of PairwiseEvaluator
# in metrics/__init__.py (e.g. phoenix.evals.metrics); adjust to your installed package.
from phoenix.evals.metrics import PairwiseEvaluator

# response_a / response_b are the two candidate generations; expected_output is the reference.
evaluator = PairwiseEvaluator(consensus=True)
result = evaluator.evaluate(response_a, response_b, expected_output)
```

---

> [!NOTE]
> Introduces `PairwiseEvaluator` for LLM-judged pairwise comparisons with optional consensus (order-flip) to reduce positional bias and exports it in `metrics/__init__.py`.
> 
> - **Metrics**:
>   - **New `PairwiseEvaluator`** (`metrics/pairwise.py`): LLM-based pairwise comparison over labels `output`/`second_output`/`tie`, optional explanations, and consensus mode that flips response order and resolves results.
>   - **API Export**: Adds `PairwiseEvaluator` to `metrics/__init__.py` `__all__`.
> 
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 7be6e581a238482339a14823effc1d393937dd6f. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>

@sahusiddharth sahusiddharth requested a review from a team as a code owner October 12, 2025 21:09
@github-project-automation github-project-automation bot moved this to 📘 Todo in phoenix Oct 12, 2025
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Oct 12, 2025

@sahusiddharth (Contributor, Author) commented:

Hi @mikeldking,

Please take a look at this PR when you have a chance and let me know if you’d recommend any changes.

@axiomofjoy (Contributor) commented:

Hi @sahusiddharth, Mikyo is out on leave. We'll take a look at the PR, but we may not be ready to accept this change since our evals library is currently under migration.

@sahusiddharth (Contributor, Author) commented Oct 13, 2025

Hi @axiomofjoy, thanks for the update, and no problem at all.

Understood about the migration; I was under the impression that this change was part of the planned work for the evals 2.0 migration (#8243).

Would it be possible to revisit this PR after the evals library migration is complete?

