
[AI Evaluation] EquivalenceEvaluator does not appear to support reasoning models due to very low MaxOutputTokens value #7002

@richstokoe

Description

When using a reasoning model to perform the EquivalenceEvaluator evaluation, the reasoning tokens exceed the MaxOutputTokens limit set in the evaluator's private ChatOptions (https://github.com/dotnet/extensions/blob/main/src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/EquivalenceEvaluator.cs#L51), and this limit is not overridable.

The MaxOutputTokens limit is set to 16, which is nowhere near enough for a reasoning model. The model returns only the start of a reasoning monologue, such as "We", resulting in the evaluator's diagnostics stating "Failed to parse numeric score for 'Equivalence' from the following text:\r\n".

I believe this may also be the root cause of issue #6814.

Reproduction Steps

I'm using OpenAI's GPT-OSS-20b hosted locally in LM Studio, which may be part of the problem. Due to organisational limitations, I am not able to use hosted LLMs.

1. Create a simple Equivalence evaluation test with the local LLM configured (http://localhost:1234/v1 as the OpenAIClientOptions.Endpoint property and "gpt-oss-20b" as the selected model).

2. Run the test.

3. Observe that the EvaluationResult.Interpretation.Rating is "Inconclusive". Viewing the Diagnostics[0] property of the EvaluationResult shows that the evaluator was unable to parse the result, because the LLM stopped responding early in its reasoning phase.
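
Roughly along these lines (a trimmed sketch, not my exact test; the exact overloads and extension method names, e.g. AsIChatClient and the EvaluateAsync parameters, may differ slightly between preview versions of Microsoft.Extensions.AI.Evaluation, Microsoft.Extensions.AI.Evaluation.Quality, Microsoft.Extensions.AI.OpenAI and the OpenAI SDK):

```csharp
using System;
using System.ClientModel;
using System.Linq;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;
using OpenAI;

// LM Studio exposes an OpenAI-compatible endpoint; the API key is a placeholder it ignores.
IChatClient chatClient =
    new OpenAIClient(
            new ApiKeyCredential("lm-studio"),
            new OpenAIClientOptions { Endpoint = new Uri("http://localhost:1234/v1") })
        .GetChatClient("gpt-oss-20b")
        .AsIChatClient();

var chatConfiguration = new ChatConfiguration(chatClient);
IEvaluator evaluator = new EquivalenceEvaluator();

// A trivially equivalent answer / ground-truth pair.
var messages = new[] { new ChatMessage(ChatRole.User, "What is the capital of France?") };
var response = new ChatResponse(new ChatMessage(ChatRole.Assistant, "Paris is the capital of France."));
var groundTruth = new EquivalenceEvaluatorContext("The capital of France is Paris.");

EvaluationResult result =
    await evaluator.EvaluateAsync(messages, response, chatConfiguration, [groundTruth]);

// With gpt-oss-20b the metric comes back rated "Inconclusive" and the first diagnostic
// reports that no numeric score could be parsed from the (truncated) output.
NumericMetric metric = result.Get<NumericMetric>(EquivalenceEvaluator.EquivalenceMetricName);
Console.WriteLine(metric.Interpretation?.Rating);
Console.WriteLine(metric.Diagnostics?.FirstOrDefault()?.Message);
```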

Expected behavior

The evaluator should support reasoning models and allow them to reason before giving their response. Reasoning has been demonstrated to improve response accuracy.

The evaluator should not limit the number of output tokens, because the limit also constrains the reasoning output of reasoning models and prevents them from returning an acceptable response to the evaluator.

Perhaps the evaluator could permit developers to override the ChatOptions.MaxOutputTokens value used by the EquivalenceEvaluator?
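
For illustration only, something along these lines would be enough. The EquivalenceEvaluatorOptions type and the constructor overload shown here are invented for this sketch and do not exist in the library today:

```csharp
// Purely illustrative - neither this options type nor this constructor overload exists today.
public sealed class EquivalenceEvaluatorOptions
{
    // null could mean "don't set MaxOutputTokens at all and let the model/server decide".
    public int? MaxOutputTokens { get; set; }
}

// e.g. a hypothetical overload:
// var evaluator = new EquivalenceEvaluator(new EquivalenceEvaluatorOptions { MaxOutputTokens = 4096 });
```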

Actual behavior

The EvaluationResult.Interpretation.Rating is "Inconclusive". Viewing the Diagnostics[0] property of the EvaluationResult shows that the evaluator was unable to parse the result, because the LLM stopped responding early in its reasoning phase.

Regression?

No, but recent changes and attempted fixes have not addressed the need for reasoning models to be able to output additional tokens.

Known Workarounds

We are building our own EquivalenceEvaluator.
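
A lighter-weight interception may also be possible. This is an untested sketch, and it assumes the evaluator sends its scoring request through the IChatClient supplied via ChatConfiguration, so a DelegatingChatClient wrapper (the UncappedOutputChatClient name below is made up) can rewrite the options before they reach the model:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

// Clears (or raises) any MaxOutputTokens cap on the way to the model - e.g. the
// 16-token cap set by EquivalenceEvaluator - so a reasoning model has room to think.
public sealed class UncappedOutputChatClient(IChatClient innerClient, int? maxOutputTokens = null)
    : DelegatingChatClient(innerClient)
{
    public override Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        ChatOptions patched = options?.Clone() ?? new ChatOptions();
        patched.MaxOutputTokens = maxOutputTokens; // null = no explicit cap
        return base.GetResponseAsync(messages, patched, cancellationToken);
    }

    // GetStreamingResponseAsync could be overridden the same way if needed.
}

// Usage: wrap the client before handing it to the evaluation pipeline.
// var chatConfiguration = new ChatConfiguration(new UncappedOutputChatClient(chatClient));
```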

Configuration

.NET 9
macOS 26 (Apple Silicon)
LM Studio 0.30.0
openai/gpt-oss-20b model from Hugging Face (downloaded through the LM Studio model search tool)

Other information

Related to:


Labels

area-ai-eval (Microsoft.Extensions.AI.Evaluation and related), bug (This issue describes a behavior which is not expected - a bug)
