Closed
Description
I want to get the character offset mapping of tokens when decoding for model-generated ids, similar to the return_offsets_mapping in tokenizer.call. But I cannot find out how to do it. For some tokenizers, I try first decoding the generated ids to text and encoding it back and get the char offset mapping. However , input_ids == tokenizer.encode(tokenizer.decode(input_ids)
is not always true.
Metadata
Metadata
Assignees
Labels
No labels