
RULER produces scores >1, arbitrary outputs (7, 10, 15…), nondeterministic behavior, and malformed JSON when judge model is Qwen3-32B running on Ollama #477

@ansh-info

Description


What happened?

I’m running ART’s RULER scoring step using a local Ollama server (on a machine with 2× H100 80GB) hosting:

  • Qwen3-32B (tried both a quantized build and full FP16; FP16 uses ~66 GB VRAM)
  • Used as the LLM judge inside ruler() / ruler_score_group()

However, RULER often produces invalid or nonsensical outputs, specifically:

1. RULER returns scores >1 (e.g., 7, 10, 15)

Even though the schema requires:

score: float  # between 0 and 1

I frequently get outputs like:

  • score: 7
  • score: 10
  • score: 15
  • or other arbitrary integers / floats

Sometimes the score arrives as a string: "10", "1.", "0.8x", etc.

This violates the API contract and breaks training.
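
For reference, the response schema appears to be shaped roughly like this (a reconstruction for illustration, not the actual RULER source; field names are taken from the validation errors below):

from pydantic import BaseModel

# Reconstruction, not the real RULER code. If the 0-1 range is only
# documented in a comment/description rather than enforced (e.g. with
# Field(ge=0.0, le=1.0)), then a score of 7 passes Pydantic validation
# silently -- which would be consistent with what I'm seeing.
class TrajectoryScore(BaseModel):
    trajectory_id: str
    explanation: str
    score: float  # documented as "between 0 and 1", apparently not enforced

class Response(BaseModel):
    scores: list[TrajectoryScore]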


2. RULER gives hallucinated/meaningless scores per trajectory

Even when trajectories look similar and all follow the system instructions, the judge:

  • Randomly assigns very high or very low scores
  • Ignores the rubric
  • Produces wildly different outputs across runs (no determinism)
  • Sometimes gives no comparison at all, just arbitrary values

3. JSON returned by judge is malformed or incomplete

Very often, the content inside:

first_choice.message.content

is:

  • Partial JSON
  • Badly formatted JSON
  • JSON containing stray characters
  • Missing required fields (trajectory_id, explanation, etc.)
  • A mixture of plaintext + JSON fragments
  • Sometimes entirely non-JSON despite RULER passing a Pydantic response_format

This triggers failures like:

pydantic_core._pydantic_core.ValidationError: JSON parsing error

or:

json.decoder.JSONDecodeError

I even had to introduce a fallback parser because well-formed JSON simply does not arrive reliably.
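
For illustration, a simplified version of the fallback I added (extract_json is my helper, not part of ART):

import json
import re

def extract_json(raw: str) -> dict:
    # Tolerates markdown fences, leading plaintext, and trailing garbage.
    # Genuinely truncated JSON still fails, as it should.
    try:
        return json.loads(raw)  # fast path: payload is already clean JSON
    except json.JSONDecodeError:
        pass
    raw = re.sub(r"```(?:json)?", "", raw)  # strip markdown code fences
    start = raw.find("{")                   # skip any leading plaintext
    if start == -1:
        raise ValueError("no JSON object found in judge output")
    obj, _ = json.JSONDecoder().raw_decode(raw[start:])  # ignore trailing junk
    return obj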


4. RULER’s behavior is nondeterministic even with same prompt + same trajectories

Across repeated runs:

  • Same trajectories produce different scores
  • Sometimes >1, sometimes valid (0.3), sometimes JSON errors
  • Even with the full-precision FP16 variant of Qwen3-32B (~66 GB VRAM) and temperature = 0

This strongly suggests the judge model is not following the system prompt reliably.


Why I believe this is an ART / RULER robustness issue

Looking at the code inside ruler():

response = await acompletion(
    model=judge_model,
    messages=messages,
    response_format=Response,
    caching=False,
)
...
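# the next two lines assume schema-compliant JSON actually arrived: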
content = first_choice.message.content or "{}"
parsed = Response.model_validate_json(content)

RULER assumes:

  1. The backend model supports strict JSON mode
  2. The returned content is fully compliant with the Pydantic schema
  3. The judge will obey the system prompt, including scoring between 0 and 1

But these assumptions do not always hold for:

  • Ollama models
  • Qwen3-32B, which does not have deterministic structured output
  • Non-OpenAI providers through LiteLLM

Ollama cannot guarantee strict JSON output, and model behavior varies even at temperature 0.

As a result:

  • RULER collapses when JSON isn’t perfect
  • Scoring becomes arbitrary / hallucinated
  • Training reward curves become meaningless

This matches exactly the issues I’m seeing.


What I expected to happen

  • If response_format=Response is provided, the judge should reliably produce JSON
  • Scores should always be in [0, 1]
  • Behavior should be deterministic (or configurable)
  • RULER should gracefully handle imperfect JSON (or allow a custom parser)
  • Models that cannot reliably produce JSON should not break the entire training run

Actual behavior

  • Scores frequently fall outside the allowed [0, 1] range
  • JSON formatting frequently invalid
  • Judge fails to follow rubric
  • Behavior nondeterministic across runs
  • Leads to failed training steps, invalid reward gradients, and incorrect trajectory ranking

Why this matters

RULER is advertised as:

“A general-purpose reward function for RL agents”

But in practice, with non-OpenAI models (e.g., local HPC deployment via Ollama), RULER’s strict JSON assumptions break repeatedly.

A growing number of users want:

  • local inference
  • open models
  • offline training
  • privacy and cost control

But RULER currently assumes an OpenAI-style JSON-mode model.


Suggested solutions

Here are concrete changes that would fix or improve the situation:

1. Add clamping for scores

Automatically:

score = max(0.0, min(1.0, score))

This immediately resolves the "score > 1" failures.
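
A slightly more defensive version (coerce_score is a hypothetical helper, not existing ART code) would also absorb the string variants above ("10", "1.", "0.8x"):

import re

def coerce_score(value: object) -> float:
    # hypothetical helper: normalize the malformed variants I've observed
    if isinstance(value, str):
        match = re.match(r"[-+]?\d*\.?\d+", value)  # "0.8x" -> "0.8", "1." -> "1"
        if match is None:
            raise ValueError(f"unparseable score: {value!r}")
        value = match.group()
    return max(0.0, min(1.0, float(value)))

Whether a 7 should clamp to 1.0 or be read as a 0-10 scale is debatable, but either choice beats breaking the run.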

2. Add a robust JSON extraction / correction layer

e.g.:

  • extract the first valid JSON block
  • filter out stray characters
  • fallback to regex extraction
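
(The extract_json fallback sketched under section 3 above is essentially this.)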

3. Allow a user-defined parse_response() hook

For example:

async def ruler(..., parse_response: Callable[[ModelResponse], Response] | None = None)
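
Users on backends without reliable JSON mode could then plug in their own recovery logic, e.g. by combining the sketches above (hypothetical code, reusing extract_json and coerce_score from earlier):

def lenient_parse(response: ModelResponse) -> Response:
    content = response.choices[0].message.content or "{}"
    data = extract_json(content)  # salvage JSON from messy judge output
    for item in data.get("scores", []):
        item["score"] = coerce_score(item.get("score", 0.0))  # clamp to [0, 1]
    return Response.model_validate(data)

# scored = await ruler(..., parse_response=lenient_parse)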

4. Add a validation step in RULER that warns when the judge cannot follow JSON-mode

Instead of crashing training.
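
Something like this inside ruler() (a sketch; content and judge_model come from the surrounding scope):

import logging
from pydantic import ValidationError

logger = logging.getLogger(__name__)

try:
    parsed = Response.model_validate_json(content)
except ValidationError as err:
    logger.warning("judge %s returned non-schema output: %s", judge_model, err)
    return None  # callers can treat the group as unscored instead of aborting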

5. Recommend judges known to produce reliable JSON

Or explicitly warn that models like Qwen3-32B (via Ollama) may not be JSON-mode safe.


Environment

  • 2× NVIDIA H100 80GB
  • Ollama server hosting Qwen3-32B (both quantized + full FP16 variants)
  • LiteLLM router
  • ART latest version (2025)
  • RULER used via ruler_score_group()
  • Temperature = 0

I can also provide logs + trajectories if needed.

But the core issue is that RULER's strict JSON-mode expectations do not hold for many models, leading to:

  • hallucinated scores
  • invalid reward signals
  • malformed JSON
  • nondeterministic judge behavior
