
Conversation

@sahusiddharth (Contributor) commented Oct 12, 2025

Resolves: #3738

Summary

This PR introduces a pairwise evaluation framework for judging two model generations against each other using an LLM. It includes support for consensus checks by allowing the evaluator to judge both the original and flipped outputs.

  • Pairwise comparison of two responses using an LLM as judge.
  • `consensus` parameter that forces a second, order-flipped evaluation.
  • Integration with existing evaluation experiments for expected outputs.
  • Documentation and usage example provided.

Pairwise Evaluation Consensus Logic

This table breaks down all possible outcomes from the consensus evaluation method. Scores correspond to: output wins (1.0), second_output wins (0.0), and Tie (0.5).

| Case | Initial Score (s) | Flipped Score (fs) | Votes for `output` (s + (1−fs)) | Votes for `second_output` ((1−s) + fs) | Final Verdict | Tie Type |
|---|---|---|---|---|---|---|
| 1 | 1.0 (output) | 1.0 (second_output) | 1.0 + (1−1.0) = 1.0 | (1−1.0) + 1.0 = 1.0 | Tie | Disagreement |
| 2 | 1.0 (output) | 0.0 (output) | 1.0 + (1−0.0) = 2.0 | (1−1.0) + 0.0 = 0.0 | `output` wins | N/A |
| 3 | 0.0 (second_output) | 1.0 (second_output) | 0.0 + (1−1.0) = 0.0 | (1−0.0) + 1.0 = 2.0 | `second_output` wins | N/A |
| 4 | 0.0 (second_output) | 0.0 (output) | 0.0 + (1−0.0) = 1.0 | (1−0.0) + 0.0 = 1.0 | Tie | Disagreement |
| 5 | 1.0 (output) | 0.5 (Tie) | 1.0 + (1−0.5) = 1.5 | (1−1.0) + 0.5 = 0.5 | `output` wins | N/A |
| 6 | 0.5 (Tie) | 1.0 (second_output) | 0.5 + (1−1.0) = 0.5 | (1−0.5) + 1.0 = 1.5 | `second_output` wins | N/A |
| 7 | 0.0 (second_output) | 0.5 (Tie) | 0.0 + (1−0.5) = 0.5 | (1−0.0) + 0.5 = 1.5 | `second_output` wins | N/A |
| 8 | 0.5 (Tie) | 0.0 (output) | 0.5 + (1−0.0) = 1.5 | (1−0.5) + 0.0 = 0.5 | `output` wins | N/A |
| 9 | 0.5 (Tie) | 0.5 (Tie) | 0.5 + (1−0.5) = 1.0 | (1−0.5) + 0.5 = 1.0 | Tie | Actual |
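
The vote-counting rule in the table can be expressed directly in code. The sketch below is illustrative only; the function name `resolve_consensus` and the string verdicts are not taken from the PR's implementation, but the arithmetic reproduces every row of the table.

```python
def resolve_consensus(s: float, fs: float) -> str:
    """Combine an initial score `s` and a flipped score `fs` into a final verdict.

    Scores follow the table above: 1.0 means `output` wins, 0.0 means
    `second_output` wins, and 0.5 is a tie. Because `fs` comes from the
    order-flipped evaluation, it is inverted when counting votes for `output`.
    """
    votes_output = s + (1 - fs)        # votes for `output`
    votes_second = (1 - s) + fs        # votes for `second_output`
    if votes_output > votes_second:
        return "output"
    if votes_second > votes_output:
        return "second_output"
    return "tie"                       # disagreement tie (cases 1, 4) or actual tie (case 9)


# Case 5 from the table: initial run picks `output`, flipped run is a tie.
assert resolve_consensus(1.0, 0.5) == "output"
# Case 1: the judge picks whichever response comes first in both orderings.
assert resolve_consensus(1.0, 1.0) == "tie"
```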

Example

```python
# Usage sketch. The import path is assumed from this PR's export of PairwiseEvaluator
# in metrics/__init__.py (e.g. phoenix.evals.metrics); adjust to your installed package.
from phoenix.evals.metrics import PairwiseEvaluator

# response_a / response_b are the two candidate generations; expected_output is the reference.
evaluator = PairwiseEvaluator(consensus=True)
result = evaluator.evaluate(response_a, response_b, expected_output)
```

---

> [!NOTE]
> Introduces `PairwiseEvaluator` for LLM-judged pairwise comparisons with optional consensus (order-flip) to reduce positional bias and exports it in `metrics/__init__.py`.
> 
> - **Metrics**:
>   - **New `PairwiseEvaluator`** (`metrics/pairwise.py`): LLM-based pairwise comparison over labels `output`/`second_output`/`tie`, optional explanations, and consensus mode that flips response order and resolves results.
>   - **API Export**: Adds `PairwiseEvaluator` to `metrics/__init__.py` `__all__`.
> 
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 7be6e581a238482339a14823effc1d393937dd6f. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>

@sahusiddharth sahusiddharth requested a review from a team as a code owner October 12, 2025 21:09
@github-project-automation github-project-automation bot moved this to 📘 Todo in phoenix Oct 12, 2025
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Oct 12, 2025

@sahusiddharth (Contributor, Author) commented:

Hi @mikeldking,

Please take a look at this PR when you have a chance and let me know if you’d recommend any changes.

@axiomofjoy (Contributor) commented:

Hi @sahusiddharth, Mikyo is out on leave. We'll take a look at the PR, but we may not be ready to accept this change since our evals library is currently under migration.

@sahusiddharth (Contributor, Author) commented Oct 13, 2025

Hi @axiomofjoy, thanks for the update, and no problem at all.

Understood about the migration; I was under the impression that this change was part of the planned work for the evals 2.0 migration (#8243).

Would it be possible to revisit this PR after the evals library migration is complete?

