Description
What happened?
I’m running ART’s RULER scoring step using a local Ollama server (on a machine with 2× H100 80GB) hosting:
- Qwen3-32B (tried both quantized + full FP16, which uses ~66GB VRAM)
- Serving as the LLM judge inside `ruler()` / `ruler_score_group()`
However, RULER often produces invalid or nonsensical outputs, specifically:
1. RULER returns scores >1 (e.g., 7, 10, 15)
Even though the schema requires:
`score: float  # between 0 and 1`
I frequently get outputs like:
- `score: 7`
- `score: 10`
- `score: 15`
- other arbitrary integers / floats
- sometimes a string such as `"10"`, `"1."`, or `"0.8x"`
This violates the API contract and breaks training.
2. RULER gives hallucinated/meaningless scores per trajectory
Even when trajectories look similar or follow the system instructions, the judge:
- Randomly assigns very high or very low scores
- Ignores the rubric
- Produces wildly different outputs across runs (no determinism)
- Sometimes gives no comparison at all, just arbitrary values
3. JSON returned by judge is malformed or incomplete
Very often, the content inside `first_choice.message.content` is:
- Partial JSON
- Badly formatted JSON
- JSON containing stray characters
- Missing required fields (`trajectory_id`, `explanation`, etc.)
- A mixture of plaintext + JSON fragments
- Sometimes entirely non-JSON, despite RULER passing a Pydantic `response_format`
This triggers failures like:
`pydantic_core._pydantic_core.ValidationError: JSON parsing error`
or:
`json.decoder.JSONDecodeError`
I even had to introduce a fallback parser because well-formed JSON simply does not arrive reliably.
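For illustration, here is a minimal reproduction of the failure mode with a hand-made payload (the schema below is a simplified stand-in, not ART's actual `Response` model, and the content string is hypothetical, not a real judge transcript):

```python
from pydantic import BaseModel, ValidationError

class Response(BaseModel):  # simplified stand-in for RULER's judge schema
    trajectory_id: str
    explanation: str
    score: float

# the kind of prose-prefixed, truncated content the judge often returns (hypothetical example)
content = 'Here are my scores:\n{"trajectory_id": "t1", "score": 7'

try:
    Response.model_validate_json(content)
except ValidationError as e:
    print(e)  # fails with a JSON parsing error, as described above
```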
4. RULER’s behavior is nondeterministic even with same prompt + same trajectories
Across repeated runs:
- Same trajectories produce different scores
- Sometimes >1, sometimes valid (0.3), sometimes JSON errors
- Even with the full-precision FP16 version of Qwen3-32B (~66GB VRAM) and temperature = 0
This strongly suggests the judge model is not following the system prompt reliably.
Why I believe this is an ART / RULER robustness issue
Looking at the code inside `ruler()`:
```python
response = await acompletion(
    model=judge_model,
    messages=messages,
    response_format=Response,
    caching=False,
)
...
content = first_choice.message.content or "{}"
parsed = Response.model_validate_json(content)
```
RULER assumes:
- The backend model supports strict JSON mode
- The returned content is fully compliant with the Pydantic schema
- The judge will obey the system prompt, including scoring between 0 and 1
But these assumptions do not always hold for:
- Ollama models
- Qwen3-32B, which does not have deterministic structured output
- Non-OpenAI providers through LiteLLM
Ollama cannot guarantee strict JSON output, and model behavior varies even at temperature 0.
As a result:
- RULER collapses when JSON isn’t perfect
- Scoring becomes arbitrary / hallucinated
- Training reward curves become meaningless
This matches exactly the issues I’m seeing.
What I expected to happen
- If `response_format=Response` is provided, the judge should reliably produce JSON
- Scores should always be in [0, 1]
- Behavior should be deterministic (or configurable)
- RULER should gracefully handle imperfect JSON (or allow a custom parse)
- Models that cannot reliably produce JSON should not break the entire training run
Actual behavior
- Scored values frequently outside the allowed range
- JSON formatting frequently invalid
- Judge fails to follow rubric
- Behavior nondeterministic across runs
- Leads to failed training steps, invalid reward gradients, and incorrect trajectory ranking
Why this matters
RULER is advertised as:
“A general-purpose reward function for RL agents”
But in practice, with non-OpenAI models (e.g., local HPC deployment via Ollama), RULER’s strict JSON assumptions break repeatedly.
A growing number of users want:
- local inference
- open models
- offline training
- privacy and cost control
But RULER currently assumes an OpenAI-style JSON-mode model.
Suggested solutions
Here are concrete changes that would fix or improve the situation:
1. Add clamping for scores
Automatically:
`score = max(0.0, min(1.0, score))`
This immediately resolves the "score > 1" crashes.
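A minimal sketch of how clamping could live in the Pydantic schema itself (the class and field names here are illustrative, not ART's actual `Response` definition):

```python
from pydantic import BaseModel, field_validator

class TrajectoryScore(BaseModel):  # illustrative stand-in for RULER's score schema
    trajectory_id: str
    explanation: str
    score: float

    @field_validator("score", mode="before")
    @classmethod
    def clamp_score(cls, v):
        # clamp out-of-range judge outputs into [0, 1] instead of raising a ValidationError
        return max(0.0, min(1.0, float(v)))
```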
2. Add a robust JSON extraction / correction layer
e.g.:
- extract the first valid JSON block
- filter out stray characters
- fallback to regex extraction
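As a sketch of what that layer could look like (a standalone helper written for this report, not ART's existing code):

```python
import json
import re

def extract_first_json(text: str) -> dict:
    """Best-effort recovery of the first JSON object embedded in judge output."""
    try:
        return json.loads(text)  # fast path: content is already clean JSON
    except json.JSONDecodeError:
        pass
    # fallback: grab the outermost {...} block and ignore surrounding stray characters
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        return json.loads(match.group(0))
    raise ValueError("no JSON object found in judge response")
```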
3. Allow a user-defined `parse_response()` hook
For example:
```python
async def ruler(..., parse_response: Callable[[ModelResponse], Response] = None)
```
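A rough sketch of a user-supplied hook (assumptions: `extract_first_json` is the hypothetical helper above, and `Response` is treated here as carrying a single `score` field for brevity):

```python
def lenient_parse(model_response) -> Response:
    # hypothetical hook: salvage whatever JSON the judge produced and clamp the score
    content = model_response.choices[0].message.content or "{}"
    data = extract_first_json(content)
    if "score" in data:
        data["score"] = max(0.0, min(1.0, float(data["score"])))
    return Response.model_validate(data)
```

RULER would call this hook instead of `Response.model_validate_json()` whenever it is provided.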
4. Add a validation step in RULER that warns when the judge cannot follow JSON mode
Instead of crashing training.
5. Recommend judges known to produce reliable JSON
Or explicitly warn that models like Qwen3-32B (via Ollama) may not be JSON-mode safe.
Environment
- 2× NVIDIA H100 80GB
- Ollama server hosting Qwen3-32B (both quantized + full FP16 variants)
- LiteLLM router
- ART latest version (2025)
- RULER used via `ruler_score_group()`
- Temperature = 0
I can also provide logs + trajectories if needed.
But the core issue is that RULER's strict JSON-mode expectations do not hold for many models, leading to:
- hallucinated scores
- invalid reward signals
- malformed JSON
- nondeterministic judge behavior