Description
System Info
- `transformers` version: 4.35.2
- Platform: Linux-6.1.58+-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.1
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.0+cu121 (False)
- Tensorflow version (GPU?): 2.15.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.7.5 (cpu)
- Jax version: 0.4.20
- JaxLib version: 0.4.20
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
@ArthurZucker (tokenizers) @Vaibhavs10 @sanchit-gandhi (audio team)
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('facebook/mms-tts-eng')
>>> tokenizer.encode('hello world')
[0, 6, 0, 7, 0, 21, 0, 21, 0, 22, 0, 19, 0, 9, 0, 22, 0, 25, 0, 21, 0, 5, 0]
>>> tokenizer.decode(tokenizer.encode('hello world'), skip_special_tokens=False)
'hello world'
>>> tokenizer.decode(tokenizer.encode('hello world'), skip_special_tokens=True)
'el ol'
>>> tokenizer.decode(tokenizer.encode('abcdefghijklmnopqrstuvwxyz'), skip_special_tokens=True)
'bdfhjmoqsuwy'

From the last example, it looks like it's taking the even-positioned elements.
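To make that observation concrete, here is a small inspection sketch. It assumes the same `facebook/mms-tts-eng` checkpoint and environment as above and only uses standard `PreTrainedTokenizer` methods (`encode`, `convert_ids_to_tokens`), nothing specific to this model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")

# Encode the same alphabet string and look at ids and tokens side by side.
ids = tokenizer.encode("abcdefghijklmnopqrstuvwxyz")
tokens = tokenizer.convert_ids_to_tokens(ids)

# id 0 is interleaved between the real characters, matching the
# 'hello world' ids shown above; the interleaved entry decodes to 'k'
# (see the tokenized version under "Expected behavior").
print(ids[:8])
print(tokens[:8])
```

So the real characters sit at the odd indices, which is consistent with the decoded output above containing roughly every other character.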
Expected behavior
The ids [0, 6, 0, 7, 0, 21, 0, 21, 0, 22, 0, 19, 0, 9, 0, 22, 0, 25, 0, 21, 0, 5, 0], whose tokenized form is
['k', 'h', 'k', 'e', 'k', 'l', 'k', 'l', 'k', 'o', 'k', ' ', 'k', 'w', 'k', 'o', 'k', 'r', 'k', 'l', 'k', 'd', 'k'],
should be decoded as 'hello world', or at least as something more informative than 'el ol'.
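A quick way to sanity-check that expectation with standard tokenizer methods (this is only a sketch of the expected relationship, not a proposed fix; the odd-index slicing simply mirrors the interleaving pattern visible in the ids above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")

ids = tokenizer.encode("hello world")
tokens = tokenizer.convert_ids_to_tokens(ids)
print(tokens)
# ['k', 'h', 'k', 'e', ..., 'k', 'd', 'k'] as listed above

# The real characters sit at the odd indices; joining them gives the string
# that skip_special_tokens=True should arguably return.
print("".join(tokens[1::2]))  # 'hello world'
```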