
[AI Evaluation] EquivalenceEvaluator does not appear to support reasoning models due to very low MaxOutputTokens value #7002

@richstokoe

Description

When using a reasoning model to perform the EquivalenceEvaluator evaluation, the reasoning tokens exceed the MaxOutputTokens limit set in the evaluator's private ChatOptions (https://github.com/dotnet/extensions/blob/main/src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/EquivalenceEvaluator.cs#L51), and this limit is not overridable.

The MaxOutputTokens limit is set to 16, which is nowhere near enough for a reasoning model. The model returns only the start of a reasoning monologue, such as "We", resulting in the evaluator's diagnostics stating "Failed to parse numeric score for 'Equivalence' from the following text:\r\n".

I believe this may also be the root cause of issue #6814.

Reproduction Steps

I'm using OpenAI's GPT-OSS-20b hosted locally in LM Studio, which may be part of the problem. Due to organisational limitations, I am not able to use hosted LLMs.

1. Create a simple Equivalence evaluation test with the local LLM configured (http://localhost:1234/v1 as the OpenAIClientOptions.Endpoint property and "gpt-oss-20b" as the selected model).

2. Run the test.

3. Observe that the EvaluationResult.Interpretation.Rating is "Inconclusive". Viewing the Diagnostics[0] property of the EvaluationResult shows that the evaluator was unable to parse the result, because the LLM stopped responding early in its reasoning phase.
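
Roughly along these lines (a trimmed sketch, not my exact test; the exact overloads and extension method names, e.g. AsIChatClient and the EvaluateAsync parameters, may differ slightly between preview versions of Microsoft.Extensions.AI.Evaluation, Microsoft.Extensions.AI.Evaluation.Quality, Microsoft.Extensions.AI.OpenAI and the OpenAI SDK):

```csharp
using System;
using System.ClientModel;
using System.Linq;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;
using OpenAI;

// LM Studio exposes an OpenAI-compatible endpoint; the API key is a placeholder it ignores.
IChatClient chatClient =
    new OpenAIClient(
            new ApiKeyCredential("lm-studio"),
            new OpenAIClientOptions { Endpoint = new Uri("http://localhost:1234/v1") })
        .GetChatClient("gpt-oss-20b")
        .AsIChatClient();

var chatConfiguration = new ChatConfiguration(chatClient);
IEvaluator evaluator = new EquivalenceEvaluator();

// A trivially equivalent answer / ground-truth pair.
var messages = new[] { new ChatMessage(ChatRole.User, "What is the capital of France?") };
var response = new ChatResponse(new ChatMessage(ChatRole.Assistant, "Paris is the capital of France."));
var groundTruth = new EquivalenceEvaluatorContext("The capital of France is Paris.");

EvaluationResult result =
    await evaluator.EvaluateAsync(messages, response, chatConfiguration, [groundTruth]);

// With gpt-oss-20b the metric comes back rated "Inconclusive" and the first diagnostic
// reports that no numeric score could be parsed from the (truncated) output.
NumericMetric metric = result.Get<NumericMetric>(EquivalenceEvaluator.EquivalenceMetricName);
Console.WriteLine(metric.Interpretation?.Rating);
Console.WriteLine(metric.Diagnostics?.FirstOrDefault()?.Message);
```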

Expected behavior

The evaluator should support reasoning models and allow them to reason before giving their response. Reasoning has been demonstrated to improve response accuracy.

The evaluator should not limit the number of output tokens, because the limit also constrains the reasoning output of reasoning models and prevents them from returning an acceptable response to the evaluator.

Perhaps the evaluator could permit developers to override the ChatOptions.MaxOutputTokens value used by the EquivalenceEvaluator?
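
For illustration only, something along these lines would be enough. The EquivalenceEvaluatorOptions type and the constructor overload shown here are invented for this sketch and do not exist in the library today:

```csharp
// Purely illustrative - neither this options type nor this constructor overload exists today.
public sealed class EquivalenceEvaluatorOptions
{
    // null could mean "don't set MaxOutputTokens at all and let the model/server decide".
    public int? MaxOutputTokens { get; set; }
}

// e.g. a hypothetical overload:
// var evaluator = new EquivalenceEvaluator(new EquivalenceEvaluatorOptions { MaxOutputTokens = 4096 });
```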

Actual behavior

The EvaluationResult.Interpretation.Rating is "Inconclusive". Viewing the Diagnostics[0] property of the EvaluationResult shows that the evaluator was unable to parse the result, because the LLM stopped responding early in its reasoning phase.

Regression?

No, but recent changes and attempted fixes have not addressed the need for reasoning models to be able to output additional tokens.

Known Workarounds

We are building our own EquivalenceEvaluator.
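
A lighter-weight interception may also be possible. This is an untested sketch, and it assumes the evaluator sends its scoring request through the IChatClient supplied via ChatConfiguration, so a DelegatingChatClient wrapper (the UncappedOutputChatClient name below is made up) can rewrite the options before they reach the model:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

// Clears (or raises) any MaxOutputTokens cap on the way to the model - e.g. the
// 16-token cap set by EquivalenceEvaluator - so a reasoning model has room to think.
public sealed class UncappedOutputChatClient(IChatClient innerClient, int? maxOutputTokens = null)
    : DelegatingChatClient(innerClient)
{
    public override Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        ChatOptions patched = options?.Clone() ?? new ChatOptions();
        patched.MaxOutputTokens = maxOutputTokens; // null = no explicit cap
        return base.GetResponseAsync(messages, patched, cancellationToken);
    }

    // GetStreamingResponseAsync could be overridden the same way if needed.
}

// Usage: wrap the client before handing it to the evaluation pipeline.
// var chatConfiguration = new ChatConfiguration(new UncappedOutputChatClient(chatClient));
```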

Configuration

.NET 9
macOS 26 (Apple Silicon)
LM Studio 0.30.0
openai/gpt-oss-20b model from Hugging Face (downloaded through the LM Studio model search tool)

Other information

Related to:


Labels

area-ai-eval (Microsoft.Extensions.AI.Evaluation and related), bug (This issue describes a behavior which is not expected - a bug)
