Commit 9476795

Merge pull request #107 from TomNong/bert-add-TFrecord-module-refined

- Add TFRecordData module
- Refactor BERT example using TFRecordData
- MultialignedData adds support for TFRecordData

Fixes #60
2 parents b8bab09 + 1a4b299 commit 9476795

13 files changed (+1383, -274 lines)

examples/bert/README.md (+48, -22)
@@ -1,6 +1,6 @@
 # BERT: Pre-trained models and downstream applications
 
-This is a Texar implementation of Google's BERT model, which allows to load pre-trained model parameters downloaded from the [official releaes](https://github.com/google-research/bert) and build/fine-tune arbitrary downstream applications with **distributed training** (This example showcases BERT for sentence classification).
+This is a Texar implementation of Google's BERT model, which allows loading pre-trained model parameters downloaded from the [official release](https://github.com/google-research/bert) and building/fine-tuning arbitrary downstream applications with **distributed training** (this example showcases BERT for sentence classification).
 
 With Texar, building the BERT model is as simple as creating a [`TransformerEncoder`](https://texar.readthedocs.io/en/latest/code/modules.html#transformerencoder) instance. We can initialize the parameters of the TransformerEncoder using a pre-trained BERT checkpoint by calling `init_bert_checkpoint(path_to_bert_checkpoint)`.
 
@@ -9,47 +9,73 @@ In sum, this example showcases:
 * Use of pre-trained Google BERT models in Texar
 * Building and fine-tuning on downstream tasks
 * Distributed training of the models
+* Use of the Texar `TFRecordData` module for data loading and processing
 
 ## Quick Start
 
+### Download BERT Pre-trained Model
+
+```
+sh bert_pretrained_models/download_model.sh
+```
+By default, it will download a pretrained model (BERT-Base Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters) named `uncased_L-12_H-768_A-12` to `bert_pretrained_models/`.
+
+Under `bert_pretrained_models/uncased_L-12_H-768_A-12`, you can find 5 files, where
+- `bert-config.json` is the model configuration of the BERT model. For the particular model we just downloaded, it is an uncased-vocabulary, 12-layer, 768-hidden, 12-heads Transformer model.
+
 ### Download Dataset
 
-We explain the use of the example code based on the Microsoft Research Paraphrase Corpus (MRPC) corpus for sentence classification.
+We explain the use of the example code based on the Microsoft Research Paraphrase Corpus (MRPC) for sentence classification.
 
 Download the data with the following cmd:
 ```
 python data/download_glue_data.py --tasks=MRPC
 ```
 By default, it will download the MRPC dataset into the `data` directory. FYI, the MRPC dataset is part of the [GLUE](https://gluebenchmark.com/tasks) dataset collection.
 
-### Download BERT Pre-train Model
+### Prepare Data
 
+We first preprocess the downloaded raw data into [TFRecord](https://www.tensorflow.org/tutorials/load_data/tf_records) files. The preprocessing tokenizes raw text with BPE encoding, truncates sequences, adds special tokens, etc.
+Run the following cmd to this end:
 ```
-sh bert_pretrained_models/download_model.sh
+python prepare_data.py --task=MRPC
+[--max_seq_length=128]
+[--vocab_file=bert_pretrained_models/uncased_L-12_H-768_A-12/vocab.txt]
+[--tfrecords_output_dir=data/MRPC]
 ```
-By default, it will download a pretrained model (BERT-Base Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters) named `uncased_L-12_H-768_A-12` to `bert_pretrained_models/`.
-
-Under `bert_pretrained_models/uncased_L-12_H-768_A-12`, you can find 5 files, where
-- `bert-config.json` is the model configuration of the BERT model. For the particular model we just downloaded, it is an uncased-vocabulary, 12-layer, 768-hidden, 12-heads Transformer model.
+- `task`: Specifies the dataset name to preprocess. The script provides default support for `{'CoLA', 'MNLI', 'MRPC', 'XNLI', 'SST'}` data.
+- `max_seq_length`: The maximum length of sequence. This includes BERT special tokens that will be automatically added. Longer sequences will be trimmed.
+- `vocab_file`: Path to a vocabulary file used for tokenization.
+- `tfrecords_output_dir`: The output path where the resulting TFRecord files will be put. By default, it is set to `data/{task}`, where `{task}` is the (upper-cased) dataset name specified in `--task` above. So in the above cmd, the TFRecord files are output to `data/MRPC`.
+
+**Outcome of the Preprocessing**:
+- The preprocessing will output 3 TFRecord data files `{train.tf_record, eval.tf_record, test.tf_record}` in the specified output directory.
+- The cmd also prints logs as follows:
+```
+INFO:tensorflow:Loading data
+INFO:tensorflow:num_classes:2; num_train_data:3668
+INFO:tensorflow:config_data.py has been updated
+INFO:tensorflow:Data preparation finished
+```
+**Note that** the data info `num_classes` and `num_train_data`, as well as the `max_seq_length` specified in the cmd, are required for BERT training in the following. They should be specified in the data configuration file passed to BERT training (see below).
+- For convenience, the above cmd automatically writes `num_classes`, `num_train_data` and `max_seq_length` to `config_data.py`.
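As a quick sanity check on what the preprocessing wrote, the generated records can be dumped with plain TensorFlow; the path below assumes the default `--tfrecords_output_dir=data/MRPC` from the cmd above.

```python
import tensorflow as tf

# Print the first serialized tf.train.Example from the generated training split.
# The path assumes the default --tfrecords_output_dir=data/MRPC used above.
for record in tf.python_io.tf_record_iterator("data/MRPC/train.tf_record"):
    example = tf.train.Example.FromString(record)
    print(example)  # shows the feature names and values written by prepare_data.py
    break
```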
 
 ### Train and Evaluate
 
 For **single-GPU** training (and evaluation), run the following cmd. The training updates the classification layer and fine-tunes the pre-trained BERT parameters.
 ```
 python bert_classifier_main.py --do_train --do_eval
-[--task=mrpc]
 [--config_bert_pretrain=uncased_L-12_H-768_A-12]
 [--config_downstream=config_classifier]
-[--config_data=config_data_mrpc]
-[--output_dir=output]
+[--config_data=config_data]
+[--output_dir=output]
 ```
 Here:
 
-- `task`: Specifies which dataset to experiment on.
-- `config_bert_pretrain`: Specifies the architecture of pre-trained BERT model to use.
-- `config_downstream`: Configuration of the downstream part. In this example, [`config_classifier.py`](https://github.com/asyml/texar/blob/master/examples/bert/bert_classifier_main.py) configs the classification layer and the optimization method.
-- `config_data`: The data configuration.
-- `output_dir`: The output path where checkpoints and summaries for tensorboard visualization are saved.
+- `config_bert_pretrain`: Specifies the architecture of the pre-trained BERT model. Used to find the architecture configs under `bert_pretrained_models/{config_bert_pretrain}`.
+- `config_downstream`: Configuration of the downstream part. In this example, [`config_classifier.py`](https://github.com/asyml/texar/blob/master/examples/bert/config_classifier.py) configures the classification layer and the optimization method.
+- `config_data`: The data configuration. See the default [`config_data.py`](./config_data.py) for an example. Make sure to specify `num_classes`, `num_train_data`, `max_seq_length`, and `tfrecord_data_dir` as used or output in the above [data preparation](#prepare-data) step.
+- `output_dir`: The output path where checkpoints and TensorBoard summaries are saved.
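For reference, a minimal `config_data.py` along these lines could carry just the values named above. The four variable names come from the `config_data` bullet, and the concrete numbers are the MRPC defaults reported by `prepare_data.py`; the actual file in the example may define additional data hparams.

```python
# config_data.py -- minimal sketch; the real config in the example may contain
# more settings (batch sizes, Texar data hparams, etc.).
num_classes = 2          # printed by prepare_data.py for MRPC
num_train_data = 3668    # printed by prepare_data.py for MRPC
max_seq_length = 128     # must match the value used during data preparation
tfrecord_data_dir = "data/MRPC"  # where the *.tf_record files were written
```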
 
 For **Multi-GPU training** on one or multiple machines, first install the prerequisite OpenMPI and Horovod packages, as detailed in the [distributed_gpu](https://github.com/asyml/texar/tree/master/examples/distributed_gpu) example.
 
@@ -62,10 +88,9 @@ mpirun -np 2 \
 -mca pml ob1 -mca btl tcp,self \
 -mca btl_tcp_if_include ens3 \
 python bert_classifier_main.py --do_train --do_eval --distributed
-[--task=mrpc]
 [--config_bert_pretrain=uncased_L-12_H-768_A-12]
 [--config_downstream=config_classifier]
-[--config_data=config_data_mrpc]
+[--config_data=config_data]
 [--output_dir=output]
 ```
 The key configurations of multi-gpu training:
@@ -75,7 +100,7 @@ The key configurations of multi-gpu training:
 
 Please refer to the [distributed_gpu](https://github.com/asyml/texar/tree/master/examples/distributed_gpu) example for more details of the other multi-gpu configurations.
 
-Note that we also specified the `--distributed` flag for multi-gpu training.
+Make sure to specify the `--distributed` flag as above for multi-gpu training.
 
 
 
@@ -95,10 +120,11 @@ The output is by default saved in `output/test_results.tsv`, where each line con
 
 ## Use other datasets/tasks
 
-`bert_classifier_main.py` also support other datasets/tasks. To do this, specify a different value to the `--task` flag, and use a corresponding data configuration file.
+`bert_classifier_main.py` also supports other datasets/tasks. To do this, specify a different value for the `--task` flag when running [data preparation](#prepare-data).
 
-For example, use the following commands to download the SST (Stanford Sentiment Treebank) dataset and run for sentence classification.
+For example, use the following commands to download the SST (Stanford Sentiment Treebank) dataset and run sentence classification on it. Make sure to specify the correct data path and other info in the data configuration file.
 ```
 python data/download_glue_data.py --tasks=SST
-python bert_classifier_main.py --do_train --do_eval --task=sst --config_data=config_data_sst
+python prepare_data.py --task=SST
+python bert_classifier_main.py --do_train --do_eval --config_data=config_data
 ```
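Under the hood, Texar's `TFRecordData` module reads the generated `*.tf_record` files much like the plain-TensorFlow sketch below; the feature names and fixed-length shapes follow the usual BERT convention and are assumptions here, not the module's documented hparams.

```python
import tensorflow as tf

# Plain-TF sketch of the record parsing that TFRecordData wraps for this example.
MAX_SEQ_LENGTH = 128  # must match --max_seq_length used in prepare_data.py

def parse_record(record):
    features = {
        "input_ids": tf.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
        "input_mask": tf.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
        "segment_ids": tf.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
        "label_ids": tf.FixedLenFeature([], tf.int64),
    }
    return tf.parse_single_example(record, features)

dataset = (tf.data.TFRecordDataset("data/MRPC/train.tf_record")
           .map(parse_record)
           .shuffle(1000)
           .batch(32))
batch = dataset.make_one_shot_iterator().get_next()  # dict of batched tensors
```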
