Dear authors,
I have two questions.
First, how can I use the multilingual pre-trained BERT in PyTorch? Is it enough to download the model to $BERT_BASE_DIR?
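Here is a minimal sketch of what I have in mind, assuming `from_pretrained` accepts the multilingual shortcut names (`bert-base-multilingual-cased` / `bert-base-multilingual-uncased`) the same way it accepts `bert-base-uncased`, and that the `do_lower_case` keyword passes through:

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

# Assumption: the multilingual shortcut name downloads and caches the
# converted PyTorch weights just like the English shortcut names do.
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased',
                                          do_lower_case=False)
model = BertModel.from_pretrained('bert-base-multilingual-cased')
model.eval()

tokens = tokenizer.tokenize("Hello world")
ids = tokenizer.convert_tokens_to_ids(tokens)
with torch.no_grad():
    # Returns one hidden-state tensor per layer plus the pooled output.
    encoded_layers, pooled = model(torch.tensor([ids]))
print(len(encoded_layers), pooled.shape)
```

Is this the intended way, or do I need to convert the TensorFlow checkpoint from $BERT_BASE_DIR myself?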
Second is a tokenization issue.
For Chinese and Japanese the tokenizer seems to work, but for Korean it gives a different result than I expected:
```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "안녕하세요"
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)
```
```
['ᄋ', '##ᅡ', '##ᆫ', '##ᄂ', '##ᅧ', '##ᆼ', '##ᄒ', '##ᅡ', '##ᄉ', '##ᅦ', '##ᄋ', '##ᅭ']
```
The tokens are not Hangul syllable characters but decomposed jamo.
Maybe this comes from a Unicode normalization issue. (I expected `['안녕', '##하세요']`.)
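My guess at the cause (not confirmed): with `do_lower_case=True`, the basic tokenizer strips accents via NFD normalization, and NFD decomposes each Hangul syllable into its jamo, which matches the twelve tokens above. A small sketch of the check I have in mind, assuming the cased multilingual tokenizer skips that step:

```python
import unicodedata
from pytorch_pretrained_bert import BertTokenizer

text = "안녕하세요"

# NFD splits each Hangul syllable into jamo; this reproduces exactly
# the twelve pieces printed above.
print(list(unicodedata.normalize("NFD", text)))

# Assumption: with do_lower_case=False the accent-stripping (and thus
# NFD) step is skipped, so syllables survive, even if WordPiece then
# splits per syllable rather than into ['안녕', '##하세요'].
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased',
                                          do_lower_case=False)
print(tokenizer.tokenize(text))
```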