[BUG] The Tokenizer `is_split_into_words` option does not behave as expected. #3195
Comments
This issue led to
Description: PaddleNLP's tokenizer is currently aligned with transformers. I did reproduce this issue with paddlenlp, and the result matches HF. Test code:

```python
>>> from transformers.models.bert.tokenization_bert import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
>>> tokenizer(["241", "##5", "人"], is_split_into_words=True)
{'input_ids': [101, 22343, 1001, 1001, 1019, 1756, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
```

That said, this issue is fairly easy to work around.
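The behavior above can be illustrated without downloading any model. The sketch below is hypothetical, not the real transformers code: with `is_split_into_words=True`, each list element is treated as raw *word* text and re-tokenized, so the `##` continuation marker is seen as two punctuation characters rather than a subword prefix.

```python
# Hypothetical sketch of why is_split_into_words=True mangles "##5":
# BERT's basic tokenization splits punctuation off each word, so the
# "##" prefix is broken into two separate "#" tokens.
import unicodedata

def basic_split(word: str) -> list[str]:
    # Split off punctuation characters, mimicking BERT's BasicTokenizer.
    out, buf = [], ""
    for ch in word:
        if unicodedata.category(ch).startswith("P"):
            if buf:
                out.append(buf)
                buf = ""
            out.append(ch)
        else:
            buf += ch
    if buf:
        out.append(buf)
    return out

# With is_split_into_words=True, each element is re-tokenized as a word:
tokens = []
for word in ["241", "##5", "人"]:
    tokens.extend(basic_split(word))

print(tokens)  # ['241', '#', '#', '5', '人'] -- "##5" is not kept intact
```

This is consistent with the ids in the output above: `1001` appears twice (`#`, `#`) followed by `1019` (`5`), instead of a single id for `##5`.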
Example code:

```python
>>> import paddlenlp
>>> ernie_tokenizer = paddlenlp.transformers.AutoTokenizer.from_pretrained("ernie-2.0-base-zh")
>>> sentence = ernie_tokenizer.convert_tokens_to_string(["241", "##5", "人"])
>>> sentence
'2415 人'
>>> ernie_tokenizer(sentence, return_attention_mask=True)
{'input_ids': [1, 5494, 9486, 8, 2], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}
```

Summary: the new tokenizer version fixes some issues and adds some new features, so there may be incompatibilities with older versions. In fact, some patterns in examples could be simplified to reduce conceptual inconsistencies with huggingface transformers, such as Pad, Stack, etc.
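For WordPiece-style vocabularies, the `convert_tokens_to_string` step used in the workaround above amounts to joining tokens on spaces and folding `##`-prefixed continuations back onto the previous token. A minimal sketch (the helper name is illustrative, not PaddleNLP's actual implementation):

```python
# Illustrative detokenizer for WordPiece-style tokens: join on spaces,
# then glue "##" continuation pieces back onto the preceding token.
def tokens_to_string(tokens: list[str]) -> str:
    return " ".join(tokens).replace(" ##", "")

print(tokens_to_string(["241", "##5", "人"]))  # '2415 人'
```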
PaddleNLP/examples/benchmark/clue/mrc/run_c3.py Lines 250 to 261 in c195224
The offline test code for this PR is as follows:

```python
def test_bad_case():
    ernie_tokenizer: PretrainedTokenizer = paddlenlp.transformers.AutoTokenizer.from_pretrained("ernie-2.0-base-zh")
    tokens = ["241", "##5", "人"]
    assert ernie_tokenizer([tokens], is_split_into_words='token')['input_ids'] == [[1, 5494, 9486, 8, 2]]
```
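The semantics the PR's test expects can be sketched with a toy vocabulary: with `is_split_into_words='token'` the inputs are treated as already-tokenized subwords and looked up directly, while `True` re-tokenizes each element as raw word text. The vocab, ids, and helper below are made up for illustration and are not PaddleNLP's API:

```python
# Toy vocabulary; the subword ids mirror the ones in the example output
# above purely for illustration.
TOY_VOCAB = {"[CLS]": 1, "[SEP]": 2, "241": 5494, "##5": 9486,
             "人": 8, "#": 100, "5": 101}

def encode(items, mode):
    if mode == "token":
        # Items are finished subword tokens: direct vocab lookup.
        body = [TOY_VOCAB[t] for t in items]
    else:
        # mode=True: each item is re-tokenized as raw word text,
        # so "##5" falls apart into "#", "#", "5".
        body = []
        for word in items:
            if word.startswith("##"):
                body += [TOY_VOCAB["#"], TOY_VOCAB["#"], TOY_VOCAB[word[2:]]]
            else:
                body.append(TOY_VOCAB[word])
    return [TOY_VOCAB["[CLS]"]] + body + [TOY_VOCAB["[SEP]"]]

print(encode(["241", "##5", "人"], mode="token"))  # [1, 5494, 9486, 8, 2]
print(encode(["241", "##5", "人"], mode=True))     # [1, 5494, 100, 100, 101, 8, 2]
```

The `'token'` mode preserves `##5` as one id, which is exactly what the PR's assertion checks.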
Thank you for reporting this PaddleNLP usage issue. Please provide the following information so we can locate and resolve the problem quickly:

The Tokenizer `is_split_into_words` option does not behave as expected. Even though the `is_split_into_words` option is set, `##5` here is not converted into a single token; instead it is split into `#`, `#`, `5`.