# BERT: Pre-trained models and downstream applications
This is a Texar implementation of Google's BERT model, which allows loading pre-trained model parameters downloaded from the [official release](https://github.com/google-research/bert) and building/fine-tuning arbitrary downstream applications with **distributed training** (this example showcases BERT for sentence classification).

With Texar, building the BERT model is as simple as creating a [`TransformerEncoder`](https://texar.readthedocs.io/en/latest/code/modules.html#transformerencoder) instance. We can initialize the parameters of the TransformerEncoder using a pre-trained BERT checkpoint by calling `init_bert_checkpoint(path_to_bert_checkpoint)`.
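
A minimal sketch of this pattern is shown below (TF1-style setup; the encoder hyperparameters and the exact way `init_bert_checkpoint` is imported in this example are assumptions, see `bert_classifier_main.py` for the complete pipeline):

```
# Sketch only: embed token ids, encode them with a BERT-sized TransformerEncoder,
# then overwrite the parameters with the downloaded Google BERT checkpoint.
# (Segment and position embeddings are omitted here for brevity.)
import tensorflow as tf
import texar as tx

input_ids = tf.placeholder(tf.int64, shape=[None, None])
seq_length = tf.placeholder(tf.int64, shape=[None])

embedder = tx.modules.WordEmbedder(
    vocab_size=30522, hparams={"dim": 768})      # uncased BERT vocab size
encoder = tx.modules.TransformerEncoder(
    hparams={"num_blocks": 12, "dim": 768})      # BERT-Base-like sizes (assumed)

outputs = encoder(embedder(input_ids), sequence_length=seq_length)

# `init_bert_checkpoint` is the helper mentioned above (provided by this
# example's utilities); point it at the downloaded checkpoint before training.
init_bert_checkpoint(
    "bert_pretrained_models/uncased_L-12_H-768_A-12/bert_model.ckpt")
```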

In sum, this example showcases:

* Use of pre-trained Google BERT models in Texar
* Building and fine-tuning on downstream tasks
* Distributed training of the models
* Use of Texar `TFRecordData` module for data loading and processing
## Quick Start
### Download BERT Pre-train Model
```
sh bert_pretrained_models/download_model.sh
```
By default, it will download a pretrained model (BERT-Base Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters) named `uncased_L-12_H-768_A-12` to `bert_pretrained_models/`.
Under `bert_pretrained_models/uncased_L-12_H-768_A-12`, you can find 5 files, where
- `bert_config.json` is the model configuration of the BERT model. For the particular model we just downloaded, it is an uncased-vocabulary, 12-layer, 768-hidden, 12-heads Transformer model.
### Download Dataset
We explain the use of the example code based on the Microsoft Research Paraphrase Corpus (MRPC) for sentence classification.
Download the data with the following cmd
```
python data/download_glue_data.py --tasks=MRPC
```
By default, it will download the MRPC dataset into the `data` directory. FYI, the MRPC dataset is part of the [GLUE](https://gluebenchmark.com/tasks) dataset collection.
### Prepare data
We first preprocess the downloaded raw data into [TFRecord](https://www.tensorflow.org/tutorials/load_data/tf_records) files. The preprocessing tokenizes raw text with BPE encoding, truncates sequences, adds special tokens, etc.
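
A sketch of the preprocessing cmd, assuming the script in this example is named `prepare_data.py` (the flag names are described below; the values shown are illustrative):

```
python prepare_data.py --task=MRPC \
    --max_seq_length=128 \
    --vocab_file=bert_pretrained_models/uncased_L-12_H-768_A-12/vocab.txt
```

Here: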
- `task`: Specifies the dataset name to preprocess. BERT provides default support for `{'CoLA', 'MNLI', 'MRPC', 'XNLI', 'SST'}` data.
- `max_seq_length`: The maximum length of a sequence. This includes the BERT special tokens that are automatically added. Longer sequences will be trimmed.
- `vocab_file`: Path to a vocabulary file used for tokenization.
- `tfrecords_output_dir`: The directory where the resulting TFRecord files are written. By default, it is set to `data/{task}`, where `{task}` is the (upper-cased) dataset name specified in `--task` above. So in the above cmd, the TFRecord files are output to `data/MRPC`.
**Outcome of the Preprocessing**:
- The preprocessing will output 3 TFRecord data files `{train.tf_record, eval.tf_record, test.tf_record}` in the specified output directory.
- **Note that** the data info `num_classes` and `num_train_data`, as well as the `max_seq_length` specified in the cmd, are required for BERT training below. They should be specified in the data configuration file passed to BERT training (see below).
- For convenience, the above cmd automatically writes `num_classes`, `num_train_data` and `max_seq_length` to `config_data.py`.
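
For reference, the corresponding entries in `config_data.py` then look roughly like the following (the values are illustrative placeholders):

```
# Data info written/updated by the preprocessing step (illustrative values).
max_seq_length = 128   # must match the value used in preprocessing
num_classes = 2        # MRPC is a binary (paraphrase vs. non-paraphrase) task
num_train_data = 3668  # number of training examples reported by preprocessing
```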
### Train and Evaluate
For **single-GPU** training (and evaluation), run the following cmd. The training updates the classification layer and fine-tunes the pre-trained BERT parameters.
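
A sketch of such a cmd is shown below, using the flags described next (the `--do_train`/`--do_eval` switches for selecting the train/eval modes are assumptions):

```
python bert_classifier_main.py --do_train --do_eval \
    --config_bert_pretrain=uncased_L-12_H-768_A-12 \
    --config_downstream=config_classifier \
    --config_data=config_data \
    --output_dir=output
```

Here: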
- `config_bert_pretrain`: Specifies the architecture of the pre-trained BERT model. Used to find the architecture configs under `bert_pretrained_models/{config_bert_pretrain}`.
- `config_downstream`: Configuration of the downstream part. In this example, [`config_classifier.py`](https://github.com/asyml/texar/blob/master/examples/bert/bert_classifier_main.py) configures the classification layer and the optimization method.
- `config_data`: The data configuration. See the default [`config_data.py`](./config_data.py) for an example. Make sure to specify `num_classes`, `num_train_data`, `max_seq_length`, and `tfrecord_data_dir` as used or produced in the [data preparation](#prepare-data) step above.
- `output_dir`: The output path where checkpoints and TensorBoard summaries are saved.
For **Multi-GPU training** on one or multiple machines, first install the prerequisite OpenMPI and Horovod packages, as detailed in the [distributed_gpu](https://github.com/asyml/texar/tree/master/examples/distributed_gpu) example.
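
Then launch `bert_classifier_main.py` through `mpirun`. A sketch for two GPUs on one machine is shown below (the MPI/NCCL options are typical Horovod launch settings and may need to be adapted to your environment; the classifier flags mirror the single-GPU cmd above):

```
mpirun -np 2 \
    -H localhost:2 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    python bert_classifier_main.py --do_train --do_eval --distributed \
        --config_data=config_data --output_dir=output
```

Here `-np` sets the total number of processes and `-H` lists each host together with the number of processes to run on it.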
Please refer to the [distributed_gpu](https://github.com/asyml/texar/tree/master/examples/distributed_gpu) example for more details on the other multi-GPU configurations.
Make sure to specify the `--distributed` flag as above for multi-GPU training.
## Use other datasets/tasks
`bert_classifier_main.py` also supports other datasets/tasks. To do this, specify a different value for the `--task` flag when running [data preparation](#prepare-data).
For example, use the following commands to download the SST (Stanford Sentiment Treebank) dataset and run sentence classification on it. Make sure to specify the correct data path and other info in the data configuration file.
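
A sketch of those commands (reusing the assumed `prepare_data.py` script and train/eval switches from above):

```
python data/download_glue_data.py --tasks=SST
python prepare_data.py --task=SST
python bert_classifier_main.py --do_train --do_eval --config_data=config_data
```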