
Can not find vocabulary file for Chinese model #34

Closed
zlinao opened this issue Nov 18, 2018 · 5 comments

Comments


zlinao commented Nov 18, 2018

After converting the TF model to a PyTorch model, I ran a classification task on a new Chinese dataset, but got this error:

CUDA_VISIBLE_DEVICES=3 python run_classifier.py --task_name weibo --do_eval --do_train --bert_model chinese_L-12_H-768_A-12 --max_seq_length 128 --train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 3.0 --output_dir bert_result

11/18/2018 21:56:59 - INFO - main - device cuda n_gpu 1 distributed training False
11/18/2018 21:56:59 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file chinese_L-12_H-768_A-12
Traceback (most recent call last):
File "run_classifier.py", line 661, in <module>
main()
File "run_classifier.py", line 508, in main
tokenizer = BertTokenizer.from_pretrained(args.bert_model)
File "/home/lin/jpmorgan/pytorch-pretrained-BERT/pytorch_pretrained_bert/tokenization.py", line 141, in from_pretrained
tokenizer = cls(resolved_vocab_file, do_lower_case)
File "/home/lin/jpmorgan/pytorch-pretrained-BERT/pytorch_pretrained_bert/tokenization.py", line 94, in __init__
"model use tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)".format(vocab_file))
ValueError: Can't find a vocabulary file at path 'chinese_L-12_H-768_A-12'. To load the vocabulary from a Google pretrained model use tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)

@zlinao zlinao closed this as completed Nov 19, 2018

zlinao commented Nov 19, 2018

You need to specify the path to vocab.txt for:
tokenizer = BertTokenizer.from_pretrained(args.bert_model)
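As a sketch of that fix (assuming the extracted checkpoint directory from the command above, which contains vocab.txt next to the converted weights), you can build the explicit vocab path and hand it to the tokenizer; the library call itself is left commented out so the snippet runs without the package or checkpoint present:

```python
import os

# Directory produced by extracting the Google TF checkpoint; it must
# contain vocab.txt alongside the converted PyTorch weights.
model_dir = "chinese_L-12_H-768_A-12"
vocab_path = os.path.join(model_dir, "vocab.txt")

# The tokenizer can then be loaded from the explicit vocab path:
# from pytorch_pretrained_bert.tokenization import BertTokenizer
# tokenizer = BertTokenizer.from_pretrained(vocab_path)
print(vocab_path)
```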

@coddinglxf

@zlinao, I tried to load the vocab using the following code:
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt")

However, I get this error:
11/19/2018 15:33:13 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file bert-base-chinese//vocab.txt
Traceback (most recent call last):
File "E:/PythonWorkSpace/PytorchBert/BertTest/torchTest.py", line 6, in <module>
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt")
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 141, in from_pretrained
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 95, in __init__
File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 70, in load_vocab
UnicodeDecodeError: 'gbk' codec can't decode byte 0x81 in position 1564: illegal multibyte sequence

Do you have the same problem?

@thomwolf (Member)

Hi,
Why don't you guys just do tokenizer = BertTokenizer.from_pretrained('bert-base-chinese') as indicated in the readme and the run_classifier.py example?
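For illustration (a sketch, not output from an actual run; downloading 'bert-base-chinese' requires pytorch_pretrained_bert and network access): the shortcut name resolves both the weights and the matching vocab.txt automatically, and the Chinese BERT vocab is character-level, so tokenizing plain CJK text yields roughly one token per character:

```python
# Shortcut-name usage (commented out; needs the package and a
# network connection to download and cache the vocab):
# from pytorch_pretrained_bert import BertTokenizer
# tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
# tokens = tokenizer.tokenize("今天天气很好")

# For plain CJK text the result is close to a per-character split:
tokens = list("今天天气很好")
print(tokens)
```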


zlinao commented Nov 19, 2018

> Hi,
> Why don't you guys just do tokenizer = BertTokenizer.from_pretrained('bert-base-chinese') as indicated in the readme and the run_classifier.py example?

Yes, it is easier to use the shortcut name. Thanks for your great work.


zlinao commented Nov 19, 2018

> @zlinao, I tried to load the vocab with tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt"), but got UnicodeDecodeError: 'gbk' codec can't decode byte 0x81 in position 1564: illegal multibyte sequence. Do you have the same problem?

You can change the encoding to 'utf-8' when you load vocab.txt.
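A minimal sketch of that fix, modeled loosely on the library's load_vocab with the encoding made explicit (the function body here is illustrative, not the exact upstream code): on Windows, open() defaults to the locale codec (here 'gbk'), which cannot decode the Chinese vocab file.

```python
import collections

def load_vocab(vocab_file):
    """Load a BERT vocab file into an OrderedDict of token -> id."""
    vocab = collections.OrderedDict()
    # Explicit utf-8: without it, Windows falls back to the locale
    # codec (e.g. 'gbk') and raises UnicodeDecodeError on this file.
    with open(vocab_file, "r", encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.strip()] = index
    return vocab
```

The same UnicodeDecodeError does not appear on most Linux setups because their locale default is already utf-8, which is why the path-based loading worked for some users and not others.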

stevezheng23 added a commit to stevezheng23/transformers that referenced this issue Mar 24, 2020
update adversarial training for roberta question answering
Narsil pushed a commit to Narsil/transformers that referenced this issue Jan 25, 2022
jameshennessytempus pushed a commit to jameshennessytempus/transformers that referenced this issue Jun 1, 2023
jonb377 added a commit to jonb377/hf-transformers that referenced this issue Nov 3, 2023
* Replace matmul with einsum

* Fix assertion