
Commit 48e58f0

iverxin and yingyibiao authored
[PaddlePaddle Hackathon] Task 51 (#1115)
* add bert japanese
* fix model-weight files position
* add weights files url
* create package: bert_japanese
* update weights readme
* update weights files
* update config pretrain weights https
* fix the weight configuration files
* retest CI
* update
* update
* fix docstring
* update
* update pretrained weights
* update weights readme
* remove weights url in codes
* update...
* update...
* update weights readme
* update
* update
* update docstring
* clean up redundant code

Co-authored-by: yingyibiao <yyb0576@163.com>
1 parent 20acd16 commit 48e58f0

15 files changed: +785 −7 lines

Lines changed: 64 additions & 0 deletions
# BERT base Japanese (character tokenization, whole word masking enabled)

This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by character-level tokenization.

Additionally, the model is trained with whole word masking enabled for the masked language modeling (MLM) objective.

The code for the pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain texts from a dump file of Wikipedia articles.

The text files used for the training are 2.6 GB in size, consisting of approximately 17M sentences.

## Tokenization

The texts are first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into characters.

The vocabulary size is 4000.

## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

For the MLM (masked language modeling) objective, we introduced **Whole Word Masking**, in which all of the subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once.
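To make the whole word masking idea concrete, the sketch below shows the selection logic only (this is not the actual pretraining code, and the word/character grouping is hypothetical): when a word is chosen, every one of its pieces is masked together.

```python
import random

# Hypothetical output of MeCab word segmentation followed by character splitting:
# each inner list holds the pieces of one word.
words = [["日", "本", "語"], ["の"], ["文", "章"], ["を"], ["解", "析"]]

def whole_word_mask(words, mask_rate=0.15, mask_token="[MASK]"):
    """Mask all pieces of a selected word at once, never just one piece of it."""
    masked = []
    for pieces in words:
        if random.random() < mask_rate:
            masked.extend([mask_token] * len(pieces))  # the whole word is masked
        else:
            masked.extend(pieces)
    return masked

print(whole_word_mask(words, mask_rate=0.3))
```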
## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.

## Usage

```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese-char-whole-word-masking/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
# [1, sequence_length, 4000]; the last dimension equals the vocabulary size
```

## Weights source

https://huggingface.co/cl-tohoku/bert-base-japanese-char-whole-word-masking
Lines changed: 6 additions & 0 deletions
{
  "model_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/model_config.json",
  "model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/model_state.pdparams",
  "tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/tokenizer_config.pdparams",
  "vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/vocab.txt"
}
Lines changed: 60 additions & 0 deletions
# BERT base Japanese (character tokenization)

This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by character-level tokenization.

The code for the pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain texts from a dump file of Wikipedia articles.

The text files used for the training are 2.6 GB in size, consisting of approximately 17M sentences.

## Tokenization

The texts are first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into characters.

The vocabulary size is 4000.
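The two tokenization stages can be inspected directly with the tokenizers shipped in this commit. The snippet below is a sketch, assuming MeCab with the IPA dictionary is installed locally and that the default `MecabTokenizer` constructor selects it; the exact token strings depend on the dictionary version.

```python
from paddlenlp.transformers import BertJapaneseTokenizer, MecabTokenizer

text = "お寿司が食べたい。"

# Stage 1: MeCab word-level segmentation with the IPA dictionary.
word_tokenizer = MecabTokenizer()
print(word_tokenizer.tokenize(text))

# Stage 2: the pretrained tokenizer applies MeCab and then splits each word into characters.
tokenizer = BertJapaneseTokenizer.from_pretrained("iverxin/bert-base-japanese-char/")
print(tokenizer.tokenize(text))
```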
## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.

## Usage

```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM, MecabTokenizer

path = "iverxin/bert-base-japanese-char/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"
text2 = "櫓を飛ばす"

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
```

## Weights source

https://huggingface.co/cl-tohoku/bert-base-japanese-char
Lines changed: 6 additions & 0 deletions
{
  "model_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/model_config.json",
  "model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/model_state.pdparams",
  "tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/tokenizer_config.pdparams",
  "vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/vocab.txt"
}
Lines changed: 63 additions & 0 deletions
# BERT base Japanese (IPA dictionary, whole word masking enabled)

This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by the WordPiece subword tokenization.

Additionally, the model is trained with whole word masking enabled for the masked language modeling (MLM) objective.

The code for the pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.
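For reference, these hyperparameters correspond to the standard BERT-base configuration. The dict below is illustrative only; values not stated in this README, such as the feed-forward size, are assumptions, and the shipped model_config.json is authoritative.

```python
# Illustrative BERT-base hyperparameters for this model.
bert_base_japanese_config = {
    "num_hidden_layers": 12,         # stated above
    "hidden_size": 768,              # stated above
    "num_attention_heads": 12,       # stated above
    "intermediate_size": 3072,       # assumption: standard BERT-base feed-forward size
    "max_position_embeddings": 512,  # matches the 512-token training instances
    "vocab_size": 32000,             # see the Tokenization section below
}
```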
## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain texts from a dump file of Wikipedia articles.

The text files used for the training are 2.6 GB in size, consisting of approximately 17M sentences.

## Tokenization

The texts are first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into subwords by the WordPiece algorithm.

The vocabulary size is 32000.

## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

For the MLM (masked language modeling) objective, we introduced **Whole Word Masking**, in which all of the subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once.

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.

## Usage

```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese-whole-word-masking/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
```

## Weights source

https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking
Lines changed: 6 additions & 0 deletions
{
  "model_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/model_config.json",
  "model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/model_state.pdparams",
  "tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/tokenizer_config.pdparams",
  "vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/vocab.txt"
}
Lines changed: 59 additions & 0 deletions
# BERT base Japanese (IPA dictionary)

This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by the WordPiece subword tokenization.

The code for the pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain texts from a dump file of Wikipedia articles.

The text files used for the training are 2.6 GB in size, consisting of approximately 17M sentences.

## Tokenization

The texts are first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into subwords by the WordPiece algorithm.

The vocabulary size is 32000.

## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.

## Usage

```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
```
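Beyond printing the output shape, the MLM head can be used to fill in a masked token. The following is a minimal sketch: the sentence, the masked position, and the resulting prediction are purely illustrative.

```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
model.eval()

# Tokenize, mask one subword, and add the special tokens manually.
tokens = tokenizer.tokenize("今日はいい天気です。")
tokens[2] = tokenizer.mask_token  # position chosen only for illustration
ids = [tokenizer.cls_token_id] + tokenizer.convert_tokens_to_ids(tokens) + [tokenizer.sep_token_id]

input_ids = paddle.to_tensor([ids])
with paddle.no_grad():
    logits = model(input_ids)  # shape: [1, len(ids), 32000]

# Take the most likely token at the masked position and map it back to text.
masked_pos = ids.index(tokenizer.mask_token_id)
predicted_id = int(paddle.argmax(logits[0, masked_pos]).numpy())
print(tokenizer.convert_ids_to_tokens([predicted_id]))
```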
## Weights source

https://huggingface.co/cl-tohoku/bert-base-japanese
Lines changed: 6 additions & 0 deletions
{
  "model_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/model_config.json",
  "model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/model_state.pdparams",
  "tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/tokenizer_config.pdparams",
  "vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/vocab.txt"
}

paddlenlp/transformers/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@

 from .bert.modeling import *
 from .bert.tokenizer import *
+from .bert_japanese.tokenizer import *
 from .ernie.modeling import *
 from .ernie.tokenizer import *
 from .gpt.modeling import *
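With this re-export in place, the tokenizer added by the new `bert_japanese` package becomes importable from the top-level `paddlenlp.transformers` namespace. A quick smoke test (using one of the community model names from the READMEs above):

```python
from paddlenlp.transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained("iverxin/bert-base-japanese/")
print(tokenizer("こんにちは"))  # dict with input_ids and token_type_ids
```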

paddlenlp/transformers/bert/tokenizer.py

Lines changed: 8 additions & 7 deletions
@@ -14,16 +14,17 @@
 # limitations under the License.

 import copy
-import io
-import json
 import os
-import six
 import unicodedata

 from .. import PretrainedTokenizer
 from ..tokenizer_utils import convert_to_unicode, whitespace_tokenize, _is_whitespace, _is_control, _is_punctuation

-__all__ = ['BasicTokenizer', 'BertTokenizer', 'WordpieceTokenizer']
+__all__ = [
+    'BasicTokenizer',
+    'BertTokenizer',
+    'WordpieceTokenizer',
+]


 class BasicTokenizer(object):
@@ -290,9 +291,9 @@ class BertTokenizer(PretrainedTokenizer):
         .. code-block::

             from paddlenlp.transformers import BertTokenizer
-            berttokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

-            inputs = berttokenizer.tokenize('He was a puppeteer')
+            inputs = tokenizer('He was a puppeteer')
             print(inputs)

         '''
@@ -554,7 +555,7 @@ def create_token_type_ids_from_sequences(self,
             0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
             | first sequence | second sequence |

-        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
+        If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).

         Args:
             token_ids_0 (List[int]):
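The docstring fix above switches the example from `tokenize()` to calling the tokenizer directly; the two return different things. A rough comparison (exact subword pieces and ids depend on the vocabulary):

```python
from paddlenlp.transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# tokenize() only splits the text into subword strings.
print(tokenizer.tokenize('He was a puppeteer'))
# e.g. ['he', 'was', 'a', 'puppet', '##eer']

# Calling the tokenizer returns model-ready ids.
print(tokenizer('He was a puppeteer'))
# e.g. {'input_ids': [...], 'token_type_ids': [...]}
```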
