fix(grpo): inner tokenizer extraction bypasses processor preprocessing #46

@abrichr

Description

Problem

`trainer.py:510-513` extracts `inner_tok = getattr(self._tokenizer, "tokenizer", self._tokenizer)` and uses it to tokenize action text directly, bypassing the processor. For VLM processors (like `Qwen2VLProcessor`), the inner tokenizer's output can differ from the processor's text handling (e.g., chat template normalization), so the tokens produced by `inner_tok` may not match what the processor would produce for the same text. The log-probabilities are then computed over a different token sequence than the one the model was conditioned on, biasing the GRPO loss.
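A minimal sketch of the divergence, using stub classes in place of the real processor (the stubs are illustrative, not from the codebase; the actual `Qwen2VLProcessor` behavior depends on its chat template):

```python
class InnerTokenizerStub:
    """Stands in for the inner tokenizer: raw text, no normalization."""
    def __call__(self, text):
        # Toy id scheme (token length) just to make divergence visible.
        return {"input_ids": [len(t) for t in text.split()]}

class ProcessorStub:
    """Stands in for a VLM processor that normalizes text before delegating,
    mimicking e.g. chat-template handling that appends a suffix token."""
    def __init__(self):
        self.tokenizer = InnerTokenizerStub()

    def __call__(self, text):
        normalized = " ".join(text.split())  # collapse whitespace
        return self.tokenizer(normalized + " <eos>")

processor = ProcessorStub()
# The pattern from trainer.py:510-513: grab the inner tokenizer if present.
inner_tok = getattr(processor, "tokenizer", processor)

action_text = "click  button"
inner_ids = inner_tok(action_text)["input_ids"]
proc_ids = processor(action_text)["input_ids"]

# The sequences differ, so log-probs scored over inner_ids do not match
# the token stream the processor would have fed the model.
print(inner_ids, proc_ids)  # [5, 6] vs [5, 6, 5]
```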

Proposed Fix

Either verify token equivalence between the inner tokenizer and the processor before using `inner_tok`, or take a more robust approach that routes action text through the processor's text-handling pipeline so the ids match what the model saw.
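One possible shape for the fix, as a hedged sketch: guard with an equivalence check and fall back to the processor path on divergence. The helper names (`tokens_match`, `tokenize_action`) and the stub class are illustrative, not from the repo.

```python
def tokens_match(processor, text):
    """True iff the inner tokenizer and the processor produce identical ids."""
    inner_tok = getattr(processor, "tokenizer", processor)
    return inner_tok(text)["input_ids"] == processor(text)["input_ids"]

def tokenize_action(processor, text):
    """Tokenize `text` so the ids match what the model was conditioned on."""
    if not tokens_match(processor, text):
        # Divergence detected: the inner tokenizer would bias the GRPO
        # log-probs, so go through the processor's text pipeline instead.
        return processor(text)["input_ids"]
    return getattr(processor, "tokenizer", processor)(text)["input_ids"]

# Tiny stub to exercise the guard (stands in for a VLM processor).
class DivergingProcessor:
    class _Tok:
        def __call__(self, text):
            return {"input_ids": [len(t) for t in text.split()]}
    def __init__(self):
        self.tokenizer = self._Tok()
    def __call__(self, text):
        # Mimic chat-template normalization: append an EOS-like id.
        return {"input_ids": self.tokenizer(text)["input_ids"] + [0]}

proc = DivergingProcessor()
print(tokenize_action(proc, "click button"))  # takes the processor path
```

Checking equivalence per batch (rather than once at init) costs an extra tokenizer call but catches template-dependent divergence, e.g. special tokens injected only for certain inputs.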

Metadata

Labels: bug (Something isn't working)
