Description
When using a reasoning model to perform the EquivalenceEvaluator evaluation, the reasoning tokens exceed the MaxOutputTokens limit set in the evaluator's private ChatOptions (https://github.com/dotnet/extensions/blob/main/src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/EquivalenceEvaluator.cs#L51), and this limit cannot be overridden.
MaxOutputTokens is set to 16, which is nowhere near enough for a reasoning model. The model returns only the start of a reasoning monologue, such as "We", resulting in the evaluator's diagnostics stating "Failed to parse numeric score for 'Equivalence' from the following text:\r\n".
I believe this may also be the root cause of issue #6814.
Reproduction Steps
I'm using OpenAI's GPT-OSS-20b hosted locally in LM Studio, which may be part of the problem. Due to organisational limitations, I am not able to use hosted LLMs.
Create a simple Equivalence evaluation test with the local LLM configured (http://localhost:1234/v1 as the OpenAIClientOptions.Endpoint property and "gpt-oss-20b" as the selected model); a configuration sketch is included after these steps.
Run the test.
Observe that the EvaluationResult.Interpretation.Rating is "Inconclusive". Viewing the Diagnostics[0] property of the EvaluationResult shows that the evaluator was unable to parse the result, because the LLM stopped responding early in its reasoning phase.
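For reference, a minimal sketch of the setup, assuming the OpenAI .NET SDK plus the Microsoft.Extensions.AI.OpenAI and Microsoft.Extensions.AI.Evaluation.Quality packages; exact member names (e.g. AsIChatClient, EquivalenceEvaluatorContext) may vary slightly between package versions:

```csharp
// Minimal repro sketch (assumptions noted above; not verbatim from the real test).
using System.ClientModel;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;
using OpenAI;

// Point the OpenAI client at the local LM Studio server.
var openAiClient = new OpenAIClient(
    new ApiKeyCredential("lm-studio"), // LM Studio ignores the API key
    new OpenAIClientOptions { Endpoint = new Uri("http://localhost:1234/v1") });

IChatClient chatClient = openAiClient
    .GetChatClient("gpt-oss-20b")
    .AsIChatClient();

var chatConfiguration = new ChatConfiguration(chatClient);
var evaluator = new EquivalenceEvaluator();

// Evaluate a model response against a ground-truth answer.
EvaluationResult result = await evaluator.EvaluateAsync(
    messages: [new ChatMessage(ChatRole.User, "What is 2 + 2?")],
    modelResponse: new ChatResponse(new ChatMessage(ChatRole.Assistant, "4")),
    chatConfiguration: chatConfiguration,
    additionalContext: [new EquivalenceEvaluatorContext("Four")]);

// With gpt-oss-20b the rating comes back "Inconclusive" and the diagnostics
// report that the numeric score could not be parsed.
```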
Expected behavior
The evaluator should support reasoning models and allow them to reason before giving their response. Reasoning has been demonstrated to improve response accuracy.
The evaluator should not limit the number of output tokens as this also limits the output of reasoning models, preventing them from returning an acceptable response to the evaluator.
Perhaps the evaluator could permit developers to override the ChatOptions.MaxOutputTokens value used by the EquivalenceEvaluator? A hypothetical shape for such an API is sketched below.
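For example (purely hypothetical shape; the configureChatOptions parameter below does not exist in the current API and is only meant to illustrate the ask):

```csharp
// Hypothetical API sketch - not the current Microsoft.Extensions.AI.Evaluation API.
// The idea: let callers adjust the ChatOptions the evaluator builds internally,
// e.g. to raise or clear MaxOutputTokens for reasoning models.
var evaluator = new EquivalenceEvaluator(
    configureChatOptions: options =>
    {
        options.MaxOutputTokens = 2048; // leave room for reasoning tokens plus the score
    });
```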
Actual behavior
The EvaluationResult.Interpretation.Rating is "Inconclusive". Viewing the Diagnostics[0] property of the EvaluationResult shows that the evaluator was unable to parse the result, because the LLM stopped responding early in its reasoning phase.
Regression?
No, but recent changes / attempted fixes have not addressed the need for reasoning models to be able to output additional tokens.
Known Workarounds
We are building our own EquivalenceEvaluator.
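A lighter-weight interim workaround might be to wrap the IChatClient passed to the evaluation ChatConfiguration in a delegating client that raises MaxOutputTokens before forwarding the request. A rough, untested sketch, assuming the Microsoft.Extensions.AI 9.x DelegatingChatClient and ChatOptions.Clone() shapes (member names may differ between package versions):

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

// Wraps an IChatClient and gives evaluator requests more output-token headroom.
public sealed class ReasoningFriendlyChatClient(IChatClient innerClient)
    : DelegatingChatClient(innerClient)
{
    public override Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        if (options?.MaxOutputTokens is int max && max < 2048)
        {
            // Clone so the evaluator's private ChatOptions instance is not mutated.
            options = options.Clone();
            options.MaxOutputTokens = 2048;
        }

        return base.GetResponseAsync(messages, options, cancellationToken);
    }

    // GetStreamingResponseAsync could be overridden the same way if needed.
}

// Usage:
// var chatConfiguration = new ChatConfiguration(new ReasoningFriendlyChatClient(chatClient));
```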
Configuration
.NET 9
MacOS 26 (Apple Silicon)
LM Studio 0.30.0
openai/gpt-oss-20b model from Huggingface (downloaded through the LM Studio model search tool)
Other information
Related to: