
RULER produces scores >1, arbitrary outputs (7, 10, 15…), nondeterministic behavior, and malformed JSON when judge model is Qwen3-32B running on Ollama #477

@ansh-info

Description


What happened?

I’m running ART’s RULER scoring step using a local Ollama server (on a machine with 2× H100 80GB) hosting:

  • Qwen3-32B (tried both a quantized build and full FP16; FP16 uses ~66 GB VRAM)
  • Used as the LLM judge inside ruler() / ruler_score_group()

However, RULER often produces invalid or nonsensical outputs, specifically:

1. RULER returns scores >1 (e.g., 7, 10, 15)

Even though the schema requires:

score: float  # between 0 and 1

I frequently get outputs like:

  • score: 7
  • score: 10
  • score: 15
  • or other arbitrary integers / floats

Sometimes the score arrives as a string: "10", "1.", "0.8x", etc.

This violates the API contract and breaks training.
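
For reference, the response schema appears to be shaped roughly like this (a reconstruction for illustration, not the actual RULER source; field names are taken from the validation errors below):

from pydantic import BaseModel

# Reconstruction, not the real RULER code. If the 0-1 range is only
# documented in a comment/description rather than enforced (e.g. with
# Field(ge=0.0, le=1.0)), then a score of 7 passes Pydantic validation
# silently -- which would be consistent with what I'm seeing.
class TrajectoryScore(BaseModel):
    trajectory_id: str
    explanation: str
    score: float  # documented as "between 0 and 1", apparently not enforced

class Response(BaseModel):
    scores: list[TrajectoryScore]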


2. RULER gives hallucinated/meaningless scores per trajectory

Even when trajectories look similar and all follow the system instructions, the judge:

  • Randomly assigns very high or very low scores
  • Ignores the rubric
  • Produces wildly different outputs across runs (no determinism)
  • Sometimes gives no comparison at all, just arbitrary values

3. JSON returned by judge is malformed or incomplete

Very often, the content inside:

first_choice.message.content

is:

  • Partial JSON
  • Badly formatted JSON
  • JSON containing stray characters
  • Missing required fields (trajectory_id, explanation, etc.)
  • A mixture of plaintext + JSON fragments
  • Sometimes entirely non-JSON despite RULER passing a Pydantic response_format

This triggers failures like:

pydantic_core._pydantic_core.ValidationError: JSON parsing error

or:

json.decoder.JSONDecodeError

I even had to introduce a fallback parser because well-formed JSON simply does not arrive reliably.
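
For illustration, a simplified version of the fallback I added (extract_json is my helper, not part of ART):

import json
import re

def extract_json(raw: str) -> dict:
    # Tolerates markdown fences, leading plaintext, and trailing garbage.
    # Genuinely truncated JSON still fails, as it should.
    try:
        return json.loads(raw)  # fast path: payload is already clean JSON
    except json.JSONDecodeError:
        pass
    raw = re.sub(r"```(?:json)?", "", raw)  # strip markdown code fences
    start = raw.find("{")                   # skip any leading plaintext
    if start == -1:
        raise ValueError("no JSON object found in judge output")
    obj, _ = json.JSONDecoder().raw_decode(raw[start:])  # ignore trailing junk
    return obj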


4. RULER’s behavior is nondeterministic even with same prompt + same trajectories

Across repeated runs:

  • Same trajectories produce different scores
  • Sometimes >1, sometimes valid (0.3), sometimes JSON errors
  • Even with the full-precision FP16 variant of Qwen3-32B (~66 GB VRAM) and temperature = 0

This strongly suggests the judge model is not following the system prompt reliably.


Why I believe this is an ART / RULER robustness issue

Looking at the code inside ruler():

response = await acompletion(
    model=judge_model,
    messages=messages,
    response_format=Response,
    caching=False,
)
...
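# the next two lines assume schema-compliant JSON actually arrived: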
content = first_choice.message.content or "{}"
parsed = Response.model_validate_json(content)

RULER assumes:

  1. The backend model supports strict JSON mode
  2. The returned content is fully compliant with the Pydantic schema
  3. The judge will obey the system prompt, including scoring between 0 and 1

But these assumptions do not always hold for:

  • Ollama models
  • Qwen3-32B, which does not have deterministic structured output
  • Non-OpenAI providers through LiteLLM

Ollama cannot guarantee strict JSON output, and model behavior varies even at temperature 0.

As a result:

  • RULER collapses when JSON isn’t perfect
  • Scoring becomes arbitrary / hallucinated
  • Training reward curves become meaningless

This matches exactly the issues I’m seeing.


What I expected to happen

  • If response_format=Response is provided, the judge should reliably produce JSON
  • Scores should always be in [0, 1]
  • Behavior should be deterministic (or configurable)
  • RULER should gracefully handle imperfect JSON (or allow a custom parser)
  • Models that cannot reliably produce JSON should not break the entire training run

Actual behavior

  • Scores frequently fall outside the allowed [0, 1] range
  • JSON formatting frequently invalid
  • Judge fails to follow rubric
  • Behavior nondeterministic across runs
  • Leads to failed training steps, invalid reward gradients, and incorrect trajectory ranking

Why this matters

RULER is advertised as:

“A general-purpose reward function for RL agents”

But in practice, with non-OpenAI models (e.g., local HPC deployment via Ollama), RULER’s strict JSON assumptions break repeatedly.

A growing number of users want:

  • local inference
  • open models
  • offline training
  • privacy and cost control

But RULER currently assumes an OpenAI-style JSON-mode model.


Suggested solutions

Here are concrete changes that would fix or improve the situation:

1. Add clamping for scores

Automatically:

score = max(0.0, min(1.0, score))

This immediately resolves the "score > 1" failures.
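
A slightly more defensive version (coerce_score is a hypothetical helper, not existing ART code) would also absorb the string variants above ("10", "1.", "0.8x"):

import re

def coerce_score(value: object) -> float:
    # hypothetical helper: normalize the malformed variants I've observed
    if isinstance(value, str):
        match = re.match(r"[-+]?\d*\.?\d+", value)  # "0.8x" -> "0.8", "1." -> "1"
        if match is None:
            raise ValueError(f"unparseable score: {value!r}")
        value = match.group()
    return max(0.0, min(1.0, float(value)))

Whether a 7 should clamp to 1.0 or be read as a 0-10 scale is debatable, but either choice beats breaking the run.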

2. Add a robust JSON extraction / correction layer

e.g.:

  • extract the first valid JSON block
  • filter out stray characters
  • fallback to regex extraction
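
(The extract_json fallback sketched under section 3 above is essentially this.)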

3. Allow a user-defined parse_response() hook

For example:

async def ruler(..., parse_response: Callable[[ModelResponse], Response] | None = None)
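
Users on backends without reliable JSON mode could then plug in their own recovery logic, e.g. by combining the sketches above (hypothetical code, reusing extract_json and coerce_score from earlier):

def lenient_parse(response: ModelResponse) -> Response:
    content = response.choices[0].message.content or "{}"
    data = extract_json(content)  # salvage JSON from messy judge output
    for item in data.get("scores", []):
        item["score"] = coerce_score(item.get("score", 0.0))  # clamp to [0, 1]
    return Response.model_validate(data)

# scored = await ruler(..., parse_response=lenient_parse)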

4. Add a validation step in RULER that warns when the judge cannot follow JSON-mode

Instead of crashing training.
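
Something like this inside ruler() (a sketch; content and judge_model come from the surrounding scope):

import logging
from pydantic import ValidationError

logger = logging.getLogger(__name__)

try:
    parsed = Response.model_validate_json(content)
except ValidationError as err:
    logger.warning("judge %s returned non-schema output: %s", judge_model, err)
    return None  # callers can treat the group as unscored instead of aborting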

5. Recommend judges known to produce reliable JSON

Or explicitly warn that models like Qwen3-32B (via Ollama) may not be JSON-mode safe.


Environment

  • 2× NVIDIA H100 80GB
  • Ollama server hosting Qwen3-32B (both quantized + full FP16 variants)
  • LiteLLM router
  • ART latest version (2025)
  • RULER used via ruler_score_group()
  • Temperature = 0

I can also provide logs + trajectories if needed.

But the core issue is that RULER's strict JSON-mode expectations do not hold for many models, leading to:

  • hallucinated scores
  • invalid reward signals
  • malformed JSON
  • nondeterministic judge behavior
