Problem
trainer.py:510-513 extracts inner_tok = getattr(self._tokenizer, "tokenizer", self._tokenizer) and uses it to tokenize action text directly. For VLM processors (e.g. Qwen2VLProcessor), this inner tokenizer bypasses the processor's own text handling (such as chat-template normalization), so the token ids it produces may differ from what the processor would produce for the same text. Any such mismatch biases the log-probabilities fed into the GRPO loss.
Proposed Fix
Either verify at setup time that the inner tokenizer and the full processor produce identical token ids for representative action text, or tokenize action text through the processor's own text pipeline so the ids match what was used to compute the policy's logits.
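A minimal sketch of the verification side of this fix. It assumes a HF-style processor: a callable that accepts text= and returns a dict with "input_ids" (possibly batched), with an optional .tokenizer attribute mirroring the getattr pattern in trainer.py. The helper names (encode_via_processor, inner_tokenizer_matches) are hypothetical, not part of the codebase:

```python
from typing import List


def encode_via_processor(processor, text: str) -> List[int]:
    """Tokenize text through the processor's own text pipeline.

    Assumes a HF-style processor whose __call__ returns a dict
    containing "input_ids"; whether add_special_tokens is forwarded
    to the underlying tokenizer depends on the processor class.
    """
    out = processor(text=text, add_special_tokens=False)
    ids = out["input_ids"]
    # Processors typically return a batch dimension; unwrap it.
    if ids and isinstance(ids[0], list):
        ids = ids[0]
    return list(ids)


def inner_tokenizer_matches(processor, text: str) -> bool:
    """Check the shortcut used in trainer.py: does the inner tokenizer
    produce the same ids as the full processor for this text?"""
    inner_tok = getattr(processor, "tokenizer", processor)
    inner_ids = list(inner_tok.encode(text, add_special_tokens=False))
    return inner_ids == encode_via_processor(processor, text)
```

Running this check once at trainer setup (over a sample of action strings) would surface the divergence early; if it fails, switching the tokenization at trainer.py:510-513 to go through encode_via_processor keeps the GRPO log-probabilities consistent with the forward pass.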