Problem
trainer.py:510-513 extracts inner_tok = getattr(self._tokenizer, "tokenizer", self._tokenizer) and uses it to tokenize action text directly. For VLM processors (e.g. Qwen2VLProcessor), this inner tokenizer bypasses the processor's own text handling (such as chat-template normalization), so the token ids it produces may differ from what the processor would produce for the same text. Any such mismatch biases the log-probabilities fed into the GRPO loss.
Proposed Fix
Either verify at setup time that the inner tokenizer and the full processor produce identical token ids for representative action text, or tokenize action text through the processor's own text pipeline so the ids match what was used to compute the policy's logits.
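A minimal sketch of the verification side of this fix. It assumes a HF-style processor: a callable that accepts text= and returns a dict with "input_ids" (possibly batched), with an optional .tokenizer attribute mirroring the getattr pattern in trainer.py. The helper names (encode_via_processor, inner_tokenizer_matches) are hypothetical, not part of the codebase:

```python
from typing import List


def encode_via_processor(processor, text: str) -> List[int]:
    """Tokenize text through the processor's own text pipeline.

    Assumes a HF-style processor whose __call__ returns a dict
    containing "input_ids"; whether add_special_tokens is forwarded
    to the underlying tokenizer depends on the processor class.
    """
    out = processor(text=text, add_special_tokens=False)
    ids = out["input_ids"]
    # Processors typically return a batch dimension; unwrap it.
    if ids and isinstance(ids[0], list):
        ids = ids[0]
    return list(ids)


def inner_tokenizer_matches(processor, text: str) -> bool:
    """Check the shortcut used in trainer.py: does the inner tokenizer
    produce the same ids as the full processor for this text?"""
    inner_tok = getattr(processor, "tokenizer", processor)
    inner_ids = list(inner_tok.encode(text, add_special_tokens=False))
    return inner_ids == encode_via_processor(processor, text)
```

Running this check once at trainer setup (over a sample of action strings) would surface the divergence early; if it fails, switching the tokenization at trainer.py:510-513 to go through encode_via_processor keeps the GRPO log-probabilities consistent with the forward pass.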