NEZHA

  • NEZHA (NEural contextualiZed representation for CHinese lAnguage understanding) is a Chinese pretrained language model based on BERT, developed by Huawei Noah's Ark Lab.
  • Please note that this code is for training NEZHA on normal GPU clusters; it is not identical to the code we used to train NEZHA on ModelArts provided by Huawei Cloud.
  • To make our results easier to reproduce, this code is revised from early versions of NVIDIA's code and Google's code, with all the techniques we adopted integrated.

1. Prepare data

Following the data preparation procedure in BERT, run the command below:

python utils/create_pretraining_data.py \
  --input_file=./sample_text.txt \
  --output_file=/tmp/tf_examples.tfrecord \
  --vocab_file=./your/path/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5

2. Pretrain

First, prepare the Horovod distributed training environment, and then run scripts/run_pretraining.sh.
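
A minimal sketch of a Horovod launch, assuming 8 local GPUs and BERT-style pretraining flags; the script name, flag names, and values below are assumptions carried over from Google BERT's run_pretraining.py, and the authoritative settings are those inside scripts/run_pretraining.sh:

# Hypothetical launch; adjust process count, paths, and flags to your setup.
horovodrun -np 8 -H localhost:8 \
  python run_pretraining.py \
    --input_file=/tmp/tf_examples.tfrecord \
    --output_dir=/tmp/pretraining_output \
    --bert_config_file=./nezha/bert_config.json \
    --do_train=True \
    --train_batch_size=32 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --learning_rate=1e-4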

3. Fine-tune NEZHA

For the time being, we support three kinds of fine-tuning tasks: text classification, sequence labelling, and SQuAD-like MRC.
Our fine-tuning code is mainly based on Google BERT, BERT NER, and CMRC2018-DRCD-BERT.

  • Download the pretrained model and unpack the model file, vocab file, and config file into nezha/.
  • Build the fine-tuning task:
    (1) scripts/run_clf.sh is for text classification tasks such as LCQMC, ChnSenti, and XNLI (see the example invocation below).
    (2) scripts/run_seq_labelling.sh is for sequence labelling tasks such as Peoples-daily-NER.
    (3) scripts/run_reading.sh is for SQuAD-like MRC tasks such as CMRC2018 (https://github.com/ymcui/cmrc2018).
  • Get the evaluation and test results from the corresponding output directory.
  • Note that the CMRC task evaluation is a little different. Please run this script separately:
python cmrc2018_evaluate.py data/cmrc/cmrc2018_dev.json output/cmrc/dev_predictions.json output/cmrc/metric.txt

cmrc2018_evaluate.py can be found here.
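
As an illustration, a text classification run built on the Google BERT conventions might look like the command below. The flag names follow Google BERT's run_classifier.py; the task name, paths, and hyperparameters are placeholders, and the authoritative settings are those inside scripts/run_clf.sh:

# Hypothetical invocation; see scripts/run_clf.sh for the actual script and flags.
python run_classifier.py \
  --task_name=lcqmc \
  --do_train=true \
  --do_eval=true \
  --data_dir=./data/lcqmc \
  --vocab_file=./nezha/vocab.txt \
  --bert_config_file=./nezha/bert_config.json \
  --init_checkpoint=./nezha/model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=./output/lcqmc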

4. NEZHA model download

We released four Chinese pretrained models, NEZHA-base and NEZHA-large; models with the WWM tag were trained with Whole Word Masking.

We further released a multilingual pretrained model, NEZHA-base-multilingual-11-cased, tokenized with Byte BPE. Currently the model covers 11 languages (in alphabetical order): Arabic, English, French, German, Italian, Malay, Polish, Portuguese, Russian, Spanish, and Thai. Please use tokenizationBBPE.py in Byte BPE as the tokenizer (i.e., replace the original tokenization.py with tokenizationBBPE.py) if you want to use our multilingual pretrained NEZHA model (a sketch of this swap follows the download link below).

  • NEZHA-base-multilingual-11-cased: Baidu Yun download, password: gs31; Google Drive download
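
A sketch of the tokenizer swap described above, assuming both files sit in the directory the fine-tuning scripts import from:

# Keep the original WordPiece tokenizer as a backup, then drop in the Byte BPE one.
cp tokenization.py tokenization_wordpiece.py.bak
cp tokenizationBBPE.py tokenization.py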

5. References

Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen, Qun Liu. NEZHA: Neural Contextualized Representation for Chinese Language Understanding. arXiv preprint arXiv:1909.00204

@article{wei2019nezha,
  title   = {NEZHA: Neural Contextualized Representation for Chinese Language Understanding},
  author  = {Junqiu Wei and Xiaozhe Ren and Xiaoguang Li and Wenyong Huang and Yi Liao and Yasheng Wang and Jiashu Lin and Xin Jiang and Xiao Chen and Qun Liu},
  journal = {arXiv preprint arXiv:1909.00204},
  year    = {2019}
}