Merge pull request #47 from yuzhou03/bert-docs
add Bert model && case readme
upvenly authored Apr 20, 2023
2 parents 6a5be21 + 564f1c7 commit e3c7c07
Showing 6 changed files with 245 additions and 131 deletions.
@@ -1,131 +1,90 @@

### Model Checkpoint Download

● Download link:
`https://drive.google.com/drive/u/0/folders/1oQF4diVHNPCclykwdvQJw8n_VIWwV0PT`


```
File list:
tf1_ckpt
vocab.txt
bert_config.json
```


● Model format conversion:

```
git clone https://github.com/mlcommons/training_results_v1.0.git
cd training_results_v1.0/NVIDIA/benchmarks/bert/implementations/pytorch/
docker build --pull -t mlperf-nvidia:language_model .
```

Start the container with the checkpoint directory mounted at /cks:

```
python convert_tf_checkpoint.py --tf_checkpoint /cks/model.ckpt-28252.index --bert_config_path /cks/bert_config.json --output_checkpoint model.ckpt-28252.pt
```

### Test Dataset Download

● Download link: `https://drive.google.com/drive/folders/1cywmDnAsrP5-2vsr8GDc6QUc7VWe-M3v`

```
File list:
results_text.tar.gz
bert_reference_results_text_md5.txt
```

● Dataset format conversion:

```
cd /data && tar xf results_text.tar.gz
cd results4
md5sum --check ../bert_reference_results_text_md5.txt
cd ..
cp training_results_v1.0/NVIDIA/benchmarks/bert/implementations/pytorch/input_preprocessing/* ./
```

Start the container again with the /data directory mounted at /data:

```
cd /data
./parallel_create_hdf5.sh
mkdir -p 2048_shards_uncompressed
python3 ./chop_hdf5_files.py
mkdir eval_set_uncompressed
python3 create_pretraining_data.py \
--input_file=results4/eval.txt \
--output_file=eval_all \
--vocab_file=vocab.txt \
--do_lower_case=True \
--max_seq_length=512 \
--max_predictions_per_seq=76 \
--masked_lm_prob=0.15 \
--random_seed=12345 \
--dupe_factor=10
python3 pick_eval_samples.py \
--input_hdf5_file=eval_all.hdf5 \
--output_hdf5_file=eval_set_uncompressed/part_eval_10k.hdf5 \
--num_examples_to_pick=10000
```

> Note: for details, see https://github.com/mlcommons/training_results_v1.0/tree/master/NVIDIA/benchmarks/bert/implementations/pytorch

### Paddle Version Run Guide

Single-GPU run command:
● Dependency: paddlepaddle-gpu

```
python -m pip install paddlepaddle-gpu==2.4.0rc0 -i https://pypi.tuna.tsinghua.edu.cn/simple
```

● Bash environment variables:
```
export MASTER_ADDR=user_ip
export MASTER_PORT=user_port
export WORLD_SIZE=1
export NODE_RANK=0
export CUDA_VISIBLE_DEVICES=0,1  # indices of the available GPUs
export RANK=0
export LOCAL_RANK=0
```
example:
```
export MASTER_ADDR=10.21.226.184
export MASTER_PORT=29501
export WORLD_SIZE=1
export NODE_RANK=0
export CUDA_VISIBLE_DEVICES=0,1  # indices of the available GPUs
export RANK=0
export LOCAL_RANK=0
```

● Run the script:

From this directory:

```
python run_pretraining.py \
    --data_dir data_path \
    --extern_config_dir config_path \
    --extern_config_file config_file.py
```

example:
```
python run_pretraining.py \
    --data_dir /ssd2/yangjie40/data_config \
    --extern_config_dir /ssd2/yangjie40/flagperf/training/nvidia/bert-pytorch/config \
    --extern_config_file config_A100x1x2.py
```


### License

This project is licensed under the Apache 2.0 license.
Parts of this project's code are based on MLCommons (https://github.com/mlcommons/training_results_v1.0/tree/master/NVIDIA).
## Model Information
### Model Introduction

BERT stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Please refer to this paper for a detailed description of BERT:
[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)


### Model Code Source
[Bert MLPerf](https://github.com/mlcommons/training_results_v1.0/tree/master/NVIDIA/benchmarks/bert/implementations)


### Model Checkpoint Download

● Download link:
`https://drive.google.com/drive/u/0/folders/1oQF4diVHNPCclykwdvQJw8n_VIWwV0PT`

```
File list:
tf1_ckpt
vocab.txt
bert_config.json
```

● Model format conversion:

```
git clone https://github.com/mlcommons/training_results_v1.0.git
cd training_results_v1.0/NVIDIA/benchmarks/bert/implementations/pytorch/
docker build --pull -t mlperf-nvidia:language_model .
```

Start the container with the checkpoint directory mounted at /cks:

```
python convert_tf_checkpoint.py --tf_checkpoint /cks/model.ckpt-28252.index --bert_config_path /cks/bert_config.json --output_checkpoint model.ckpt-28252.pt
```

### Test Dataset Download

● Download link: `https://drive.google.com/drive/folders/1cywmDnAsrP5-2vsr8GDc6QUc7VWe-M3v`

```
File list:
results_text.tar.gz
bert_reference_results_text_md5.txt
```

● Dataset format conversion:

```
cd /data && tar xf results_text.tar.gz
cd results4
md5sum --check ../bert_reference_results_text_md5.txt
cd ..
cp training_results_v1.0/NVIDIA/benchmarks/bert/implementations/pytorch/input_preprocessing/* ./
```

Start the container again with the /data directory mounted at /data:

```
cd /data
./parallel_create_hdf5.sh
mkdir -p 2048_shards_uncompressed
python3 ./chop_hdf5_files.py
mkdir eval_set_uncompressed
python3 create_pretraining_data.py \
--input_file=results4/eval.txt \
--output_file=eval_all \
--vocab_file=vocab.txt \
--do_lower_case=True \
--max_seq_length=512 \
--max_predictions_per_seq=76 \
--masked_lm_prob=0.15 \
--random_seed=12345 \
--dupe_factor=10
python3 pick_eval_samples.py \
--input_hdf5_file=eval_all.hdf5 \
--output_hdf5_file=eval_set_uncompressed/part_eval_10k.hdf5 \
--num_examples_to_pick=10000
```
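
The values `--max_seq_length=512`, `--masked_lm_prob=0.15` and `--max_predictions_per_seq=76` are related: roughly 15% of 512 tokens is about 76 masked positions. The sketch below shows that arithmetic. It assumes the clamping rule used in the original Google BERT `create_pretraining_data.py`; the MLCommons copy may differ, and `num_masked_positions` is a hypothetical name.

```python
def num_masked_positions(num_tokens, masked_lm_prob=0.15,
                         max_predictions_per_seq=76):
    """Number of tokens to mask: a rounded fraction, clamped to [1, cap].

    Assumed rule, modelled on the original BERT preprocessing script.
    """
    wanted = int(round(num_tokens * masked_lm_prob))
    return min(max_predictions_per_seq, max(1, wanted))


# A full-length sequence hits the cap; shorter ones mask about 15%.
for length in (512, 100, 3):
    print(length, num_masked_positions(length))
```

Setting the cap well below `max_seq_length * masked_lm_prob` would truncate masking on long sequences, while setting it much higher only wastes padding in the stored examples.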

### Framework and Chip Support
| | Pytorch |Paddle|TensorFlow2|
| ---- | ---- | ---- | ---- |
| Nvidia GPU | N/A |[](../../nvidia/bert-paddle/README.md) |N/A|
87 changes: 87 additions & 0 deletions training/nvidia/bert-paddle/README.md
@@ -0,0 +1,87 @@

### Model Checkpoint Download
[Model Checkpoint Download](../../benchmarks/bert/README.md#模型checkpoint下载)


### Test Dataset Download
[Test Dataset Download](../../benchmarks/bert/README.md#测试数据集下载)


### Paddle Version Run Guide

Single-GPU run command:
● Dependency: paddlepaddle-gpu

```
python -m pip install paddlepaddle-gpu==2.4.0rc0 -i https://pypi.tuna.tsinghua.edu.cn/simple
```

● Bash environment variables:
```
export MASTER_ADDR=user_ip
export MASTER_PORT=user_port
export WORLD_SIZE=1
export NODE_RANK=0
export CUDA_VISIBLE_DEVICES=0,1  # indices of the available GPUs
export RANK=0
export LOCAL_RANK=0
```
example:
```
export MASTER_ADDR=10.21.226.184
export MASTER_PORT=29501
export WORLD_SIZE=1
export NODE_RANK=0
export CUDA_VISIBLE_DEVICES=0,1  # indices of the available GPUs
export RANK=0
export LOCAL_RANK=0
```
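
The exported variables describe the distributed launch: where the master process listens, how many nodes take part, and which GPUs are visible. The sketch below shows one way a script could read and validate them. It assumes the values are taken from `os.environ` (the source does not show how `run_pretraining.py` consumes them), and `read_dist_env` is a hypothetical helper. It also shows why a `#` comment in bash must be separated from the value by whitespace: otherwise the comment text becomes part of the value.

```python
import os


def read_dist_env(env=None):
    """Parse the launch variables into a dict.

    Raises KeyError for a missing variable and ValueError for a bad value.
    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    cfg = {"master_addr": env["MASTER_ADDR"]}
    for key in ("MASTER_PORT", "WORLD_SIZE", "NODE_RANK", "RANK", "LOCAL_RANK"):
        cfg[key.lower()] = int(env[key])
    if not 0 < cfg["master_port"] < 65536:
        raise ValueError("MASTER_PORT must be a valid TCP port")
    # A trailing '#comment' with no space before it stays in the value,
    # so int() rejects it here instead of failing later during training.
    cfg["gpus"] = [int(i) for i in env["CUDA_VISIBLE_DEVICES"].split(",")]
    return cfg
```

Running such a check before launching training turns a malformed export into an immediate, readable error.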

● Run the script:

From this directory:

```
python run_pretraining.py \
    --data_dir data_path \
    --extern_config_dir config_path \
    --extern_config_file config_file.py
```

example:
```
python run_pretraining.py \
    --data_dir /ssd2/yangjie40/data_config \
    --extern_config_dir /ssd2/yangjie40/flagperf/training/nvidia/bert-pytorch/config \
    --extern_config_file config_A100x1x2.py
```
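
The `--extern_config_dir` and `--extern_config_file` flags point at a plain Python file of assignments, such as `config_A100x1x2.py`. How `run_pretraining.py` actually loads that file is not shown in this commit. The sketch below is one plausible approach using the standard `importlib` machinery, and `load_extern_config` is a hypothetical helper.

```python
import importlib.util
import os


def load_extern_config(config_dir, config_file):
    """Execute a Python config file and return its public names as a dict.

    Hypothetical loader; the real training script may work differently.
    """
    path = os.path.join(config_dir, config_file)
    spec = importlib.util.spec_from_file_location("extern_config", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the assignments in the file
    return {k: v for k, v in vars(module).items() if not k.startswith("_")}
```

Because the file is executed as Python, derived settings such as `eval_batch_size = train_batch_size` resolve naturally. The same property means a config file should only be loaded from a trusted location.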


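### Nvidia GPU Configuration and Run Information Reference
#### Environment Configuration
- ##### Hardware environment
    - Machine and accelerator model: NVIDIA_A100-SXM4-40GB
    - Multi-node network type and bandwidth: InfiniBand, 200Gb/s
- ##### Software environment
    - OS version: Ubuntu 20.04
    - OS kernel version: 5.4.0-113-generic
    - Accelerator driver version: 470.129.06
    - Docker version: 20.10.16
    - Training framework version: paddle-2.4.0-rc
    - Dependency versions:
        - cuda: cuda_11.2.r11.2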


### Run Results
| Training resources | Config file | Run time (s) | Target accuracy | Converged accuracy | Steps | Throughput (samples/s) |
| -------- | --------------- | ----------- | -------- | -------- | ------- | ---------------- |
| 1 node, 1 GPU | config_A100x1x1 | N/A | 0.67 | N/A | N/A | N/A |
| 1 node, 2 GPUs | config_A100x1x2 | N/A | 0.67 | N/A | N/A | N/A |
| 1 node, 4 GPUs | config_A100x1x4 | 1715.28 | 0.67 | 0.6809 | 6250 | 180.07 |
| 1 node, 8 GPUs | config_A100x1x8 | 1315.42 | 0.67 | 0.6818 | 4689 | 355.63 |
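
The throughput column gives a quick way to gauge how well training scales from 4 to 8 GPUs. The sketch below computes the speedup and the scaling efficiency from the two measured rows. The figures are copied from the table above, and `scaling_efficiency` is a hypothetical helper.

```python
def scaling_efficiency(base_gpus, base_rate, gpus, rate):
    """Return (speedup, efficiency) of `gpus` relative to `base_gpus`."""
    speedup = rate / base_rate
    ideal = gpus / base_gpus  # perfect linear scaling
    return speedup, speedup / ideal


# Throughput (samples/s) for config_A100x1x4 and config_A100x1x8.
speedup, efficiency = scaling_efficiency(4, 180.07, 8, 355.63)
print(f"speedup: {speedup:.3f}x, efficiency: {efficiency:.1%}")
```

Doubling the GPU count yields a speedup of about 1.97x, which is close to 99% of ideal linear scaling for these two runs.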

### License

This project is licensed under the Apache 2.0 license.

Parts of this project's code are based on MLCommons (https://github.com/mlcommons/training_results_v1.0/tree/master/NVIDIA/benchmarks/).
17 changes: 17 additions & 0 deletions training/nvidia/bert-paddle/config/config_A100x1x1.py
@@ -0,0 +1,17 @@
target_mlm_accuracy = 0.67
gradient_accumulation_steps = 1
max_steps = 10000000
start_warmup_step = 0
warmup_proportion = 0
warmup_steps = 2000

learning_rate = 1e-4
weight_decay_rate = 0.01
opt_lamb_beta_1 = 0.9
opt_lamb_beta_2 = 0.999
train_batch_size = 12
eval_batch_size = train_batch_size
max_samples_termination = 450000000
cache_eval_data = False

seed = 9031
17 changes: 17 additions & 0 deletions training/nvidia/bert-paddle/config/config_A100x1x2.py
@@ -0,0 +1,17 @@
target_mlm_accuracy = 0.67
gradient_accumulation_steps = 1
max_steps = 10000000
start_warmup_step = 0
warmup_proportion = 0
warmup_steps = 2000

learning_rate = 1e-4
weight_decay_rate = 0.01
opt_lamb_beta_1 = 0.9
opt_lamb_beta_2 = 0.999
train_batch_size = 12
eval_batch_size = train_batch_size
max_samples_termination = 450000000
cache_eval_data = False

seed = 9031
17 changes: 17 additions & 0 deletions training/nvidia/bert-paddle/config/config_A100x1x4.py
@@ -0,0 +1,17 @@
target_mlm_accuracy = 0.67
gradient_accumulation_steps = 1
max_steps = 10000000
start_warmup_step = 0
warmup_proportion = 0
warmup_steps = 2000

learning_rate = 1e-4
weight_decay_rate = 0.01
opt_lamb_beta_1 = 0.9
opt_lamb_beta_2 = 0.999
train_batch_size = 12
eval_batch_size = train_batch_size
max_samples_termination = 450000000
cache_eval_data = False

seed = 9031
17 changes: 17 additions & 0 deletions training/nvidia/bert-paddle/config/config_A100x2x8.py
@@ -0,0 +1,17 @@
target_mlm_accuracy = 0.67
gradient_accumulation_steps = 1
max_steps = 10000
start_warmup_step = 0
warmup_proportion = 0
warmup_steps = 2000

learning_rate = 1e-4
weight_decay_rate = 0.01
opt_lamb_beta_1 = 0.9
opt_lamb_beta_2 = 0.999
train_batch_size = 12
eval_batch_size = train_batch_size
max_samples_termination = 4500000
cache_eval_data = False

seed = 9031
