
Commit 48e58f0

iverxin and yingyibiao authored
[PaddlePaddle Hackathon] Task 51 (#1115)
* add bert japanese
* fix model-weight files position
* add weights files url
* create package: bert_japanese
* update weights readme
* update weights files
* update config pretrain weights https
* fix the weight configuration files
* retest CI
* update
* update
* fix docstring
* update
* update pretrained weights
* update weights readme
* remove weights url in codes
* update...
* update...
* update weights readme
* update
* update
* update docstring
* clean up redundant code

Co-authored-by: yingyibiao <yyb0576@163.com>
1 parent 20acd16 commit 48e58f0

15 files changed: +785 −7 lines

Lines changed: 64 additions & 0 deletions
# BERT base Japanese (character tokenization, whole word masking enabled)

This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by character-level tokenization.

Additionally, the model is trained with whole word masking enabled for the masked language modeling (MLM) objective.

The code for the pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain texts from a dump file of Wikipedia articles.

The text files used for the training are 2.6 GB in size, consisting of approximately 17M sentences.

## Tokenization

The texts are first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into characters.

The vocabulary size is 4000.

## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

For the MLM (masked language modeling) objective, we introduced **Whole Word Masking**, in which all of the subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once.
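To make the whole word masking idea concrete, the sketch below shows the selection logic only (this is not the actual pretraining code, and the word/character grouping is hypothetical): when a word is chosen, every one of its pieces is masked together.

```python
import random

# Hypothetical output of MeCab word segmentation followed by character splitting:
# each inner list holds the pieces of one word.
words = [["日", "本", "語"], ["の"], ["文", "章"], ["を"], ["解", "析"]]

def whole_word_mask(words, mask_rate=0.15, mask_token="[MASK]"):
    """Mask all pieces of a selected word at once, never just one piece of it."""
    masked = []
    for pieces in words:
        if random.random() < mask_rate:
            masked.extend([mask_token] * len(pieces))  # the whole word is masked
        else:
            masked.extend(pieces)
    return masked

print(whole_word_mask(words, mask_rate=0.3))
```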
## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.

## Usage

```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese-char-whole-word-masking/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
# [1, sequence_length, 4000]; the last dimension equals the vocabulary size
```

## Weights source

https://huggingface.co/cl-tohoku/bert-base-japanese-char-whole-word-masking
Lines changed: 6 additions & 0 deletions
{
  "model_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/model_config.json",
  "model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/model_state.pdparams",
  "tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/tokenizer_config.pdparams",
  "vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/vocab.txt"
}
Lines changed: 60 additions & 0 deletions
# BERT base Japanese (character tokenization)

This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by character-level tokenization.

The code for the pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain texts from a dump file of Wikipedia articles.

The text files used for the training are 2.6 GB in size, consisting of approximately 17M sentences.

## Tokenization

The texts are first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into characters.

The vocabulary size is 4000.
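The two tokenization stages can be inspected directly with the tokenizers shipped in this commit. The snippet below is a sketch, assuming MeCab with the IPA dictionary is installed locally and that the default `MecabTokenizer` constructor selects it; the exact token strings depend on the dictionary version.

```python
from paddlenlp.transformers import BertJapaneseTokenizer, MecabTokenizer

text = "お寿司が食べたい。"

# Stage 1: MeCab word-level segmentation with the IPA dictionary.
word_tokenizer = MecabTokenizer()
print(word_tokenizer.tokenize(text))

# Stage 2: the pretrained tokenizer applies MeCab and then splits each word into characters.
tokenizer = BertJapaneseTokenizer.from_pretrained("iverxin/bert-base-japanese-char/")
print(tokenizer.tokenize(text))
```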
## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.

## Usage

```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM, MecabTokenizer

path = "iverxin/bert-base-japanese-char/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"
text2 = "櫓を飛ばす"

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
```

## Weights source

https://huggingface.co/cl-tohoku/bert-base-japanese-char
Lines changed: 6 additions & 0 deletions
{
  "model_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/model_config.json",
  "model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/model_state.pdparams",
  "tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/tokenizer_config.pdparams",
  "vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/vocab.txt"
}
Lines changed: 63 additions & 0 deletions
# BERT base Japanese (IPA dictionary, whole word masking enabled)

This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by the WordPiece subword tokenization.

Additionally, the model is trained with whole word masking enabled for the masked language modeling (MLM) objective.

The code for the pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.
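For reference, these hyperparameters correspond to the standard BERT-base configuration. The dict below is illustrative only; values not stated in this README, such as the feed-forward size, are assumptions, and the shipped model_config.json is authoritative.

```python
# Illustrative BERT-base hyperparameters for this model.
bert_base_japanese_config = {
    "num_hidden_layers": 12,         # stated above
    "hidden_size": 768,              # stated above
    "num_attention_heads": 12,       # stated above
    "intermediate_size": 3072,       # assumption: standard BERT-base feed-forward size
    "max_position_embeddings": 512,  # matches the 512-token training instances
    "vocab_size": 32000,             # see the Tokenization section below
}
```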
## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain texts from a dump file of Wikipedia articles.

The text files used for the training are 2.6 GB in size, consisting of approximately 17M sentences.

## Tokenization

The texts are first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into subwords by the WordPiece algorithm.

The vocabulary size is 32000.

## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

For the MLM (masked language modeling) objective, we introduced **Whole Word Masking**, in which all of the subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once.

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.

## Usage

```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese-whole-word-masking/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
```

## Weights source

https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking
Lines changed: 6 additions & 0 deletions
{
  "model_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/model_config.json",
  "model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/model_state.pdparams",
  "tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/tokenizer_config.pdparams",
  "vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/vocab.txt"
}
Lines changed: 59 additions & 0 deletions
# BERT base Japanese (IPA dictionary)

This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by the WordPiece subword tokenization.

The code for the pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain texts from a dump file of Wikipedia articles.

The text files used for the training are 2.6 GB in size, consisting of approximately 17M sentences.

## Tokenization

The texts are first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into subwords by the WordPiece algorithm.

The vocabulary size is 32000.

## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.

## Usage

```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
```
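Beyond printing the output shape, the MLM head can be used to fill in a masked token. The following is a minimal sketch: the sentence, the masked position, and the resulting prediction are purely illustrative.

```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
model.eval()

# Tokenize, mask one subword, and add the special tokens manually.
tokens = tokenizer.tokenize("今日はいい天気です。")
tokens[2] = tokenizer.mask_token  # position chosen only for illustration
ids = [tokenizer.cls_token_id] + tokenizer.convert_tokens_to_ids(tokens) + [tokenizer.sep_token_id]

input_ids = paddle.to_tensor([ids])
with paddle.no_grad():
    logits = model(input_ids)  # shape: [1, len(ids), 32000]

# Take the most likely token at the masked position and map it back to text.
masked_pos = ids.index(tokenizer.mask_token_id)
predicted_id = int(paddle.argmax(logits[0, masked_pos]).numpy())
print(tokenizer.convert_ids_to_tokens([predicted_id]))
```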
## Weights source

https://huggingface.co/cl-tohoku/bert-base-japanese
Lines changed: 6 additions & 0 deletions
{
  "model_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/model_config.json",
  "model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/model_state.pdparams",
  "tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/tokenizer_config.pdparams",
  "vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/vocab.txt"
}

paddlenlp/transformers/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@

 from .bert.modeling import *
 from .bert.tokenizer import *
+from .bert_japanese.tokenizer import *
 from .ernie.modeling import *
 from .ernie.tokenizer import *
 from .gpt.modeling import *
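With this re-export in place, the tokenizer added by the new `bert_japanese` package becomes importable from the top-level `paddlenlp.transformers` namespace. A quick smoke test (using one of the community model names from the READMEs above):

```python
from paddlenlp.transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained("iverxin/bert-base-japanese/")
print(tokenizer("こんにちは"))  # dict with input_ids and token_type_ids
```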

paddlenlp/transformers/bert/tokenizer.py

Lines changed: 8 additions & 7 deletions
@@ -14,16 +14,17 @@
 # limitations under the License.

 import copy
-import io
-import json
 import os
-import six
 import unicodedata

 from .. import PretrainedTokenizer
 from ..tokenizer_utils import convert_to_unicode, whitespace_tokenize, _is_whitespace, _is_control, _is_punctuation

-__all__ = ['BasicTokenizer', 'BertTokenizer', 'WordpieceTokenizer']
+__all__ = [
+    'BasicTokenizer',
+    'BertTokenizer',
+    'WordpieceTokenizer',
+]


 class BasicTokenizer(object):
@@ -290,9 +291,9 @@ class BertTokenizer(PretrainedTokenizer):
         .. code-block::

             from paddlenlp.transformers import BertTokenizer
-            berttokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

-            inputs = berttokenizer.tokenize('He was a puppeteer')
+            inputs = tokenizer('He was a puppeteer')
             print(inputs)

         '''
@@ -554,7 +555,7 @@ def create_token_type_ids_from_sequences(self,
             0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
             | first sequence | second sequence |

-        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
+        If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).

         Args:
             token_ids_0 (List[int]):
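The docstring fix above switches the example from `tokenize()` to calling the tokenizer directly; the two return different things. A rough comparison (exact subword pieces and ids depend on the vocabulary):

```python
from paddlenlp.transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# tokenize() only splits the text into subword strings.
print(tokenizer.tokenize('He was a puppeteer'))
# e.g. ['he', 'was', 'a', 'puppet', '##eer']

# Calling the tokenizer returns model-ready ids.
print(tokenizer('He was a puppeteer'))
# e.g. {'input_ids': [...], 'token_type_ids': [...]}
```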
