This is a step-by-step tutorial on how to get started with LiBai:
- We have prepared relevant datasets, which can be downloaded from the following links:
- Download the dataset and move the data file to the folder. The file structure should be like:
```
$ tree data
path/to/bert_data
├── bert-base-chinese-vocab.txt
├── loss_compara_content_sentence.bin
└── loss_compara_content_sentence.idx
```

We provide `train.sh` for executing training. Before invoking the script, perform the following steps.
Step 1. Set data path and vocab path
- Update the data path and vocab path in the `bert_large_pretrain` config file:

```python
# Refine data path and vocab path to data folder
vocab_file = "/path/to/bert_data/bert-base-chinese-vocab.txt"
data_prefix = "/path/to/bert_data/loss_compara_content_sentence"
```

Step 2. Configure your parameters
- The provided `configs/bert_large_pretrain.py` defines a set of parameters, including the training scheme, the model, etc.
- You can also modify these parameter settings. For example, if you want to use 8 GPUs for training, you can refer to `configs/common/train.py`. If you want to train the model with 2D mesh hybrid parallelism (4 groups for data parallelism and 2 groups for tensor parallelism), you can set the parameters as follows:
```python
train.dist.data_parallel_size = 4
train.dist.tensor_parallel_size = 2
```

Step 3. Invoke parallel training
- To train the `BertForPreTraining` model on a single node with 8 GPUs, run:

```bash
bash tools/train.sh tools/train_net.py configs/bert_large_pretrain.py 8
```
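As a sanity check for the settings in Step 2, the product of the parallel sizes must match the number of GPUs passed to the launch script (here 4 × 2 = 8). A minimal standalone sketch of that constraint (plain Python for illustration, not LiBai's API):

```python
# Illustrative check: a 2D device mesh of data-parallel x tensor-parallel
# groups must cover exactly the number of GPUs used for the launch.
def mesh_covers_gpus(data_parallel_size, tensor_parallel_size, num_gpus):
    return data_parallel_size * tensor_parallel_size == num_gpus

# 4 data-parallel groups x 2 tensor-parallel groups on 8 GPUs
print(mesh_covers_gpus(4, 2, 8))  # True
```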
- To train the `BertForPreTraining` model on 2 nodes with 16 GPUs, on `node0`, run:

```bash
NODE=2 NODE_RANK=0 ADDR=192.168.0.0 PORT=12345 bash tools/train.sh tools/train_net.py configs/bert_large_pretrain.py 8
```

  - `NODE=2` means the total number of nodes is 2
  - `NODE_RANK=0` means the current node is node0
  - `ADDR=192.168.0.0` means the IP address of node0
  - `PORT=12345` means the port of node0

  On `node1`, run:

```bash
NODE=2 NODE_RANK=1 ADDR=192.168.0.0 PORT=12345 bash tools/train.sh tools/train_net.py configs/bert_large_pretrain.py 8
```

  - `NODE=2` means the total number of nodes is 2
  - `NODE_RANK=1` means the current node is node1
  - `ADDR=192.168.0.0` means the IP address of node0
  - `PORT=12345` means the port of node0
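In a launch like the one above, each process's global rank is derived from its node rank and its local GPU index. An illustrative sketch (plain Python, not LiBai's launcher code) of how the 16 ranks are laid out across the 2 nodes:

```python
# Illustrative only: global rank = node_rank * gpus_per_node + local_rank.
def global_rank(node_rank, local_rank, gpus_per_node=8):
    return node_rank * gpus_per_node + local_rank

# node0 (NODE_RANK=0) hosts global ranks 0-7,
# node1 (NODE_RANK=1) hosts global ranks 8-15.
print([global_rank(1, r) for r in range(8)])  # [8, 9, 10, 11, 12, 13, 14, 15]
```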
For ImageNet, we use the standard ImageNet dataset, which can be downloaded from http://image-net.org/.
- For the standard folder dataset, move validation images to labeled sub-folders. The file structure should be like:
```
$ tree data
imagenet
├── train
│   ├── class1
│   │   ├── img1.jpeg
│   │   ├── img2.jpeg
│   │   └── ...
│   ├── class2
│   │   ├── img3.jpeg
│   │   └── ...
│   └── ...
└── val
    ├── class1
    │   ├── img4.jpeg
    │   ├── img5.jpeg
    │   └── ...
    ├── class2
    │   ├── img6.jpeg
    │   └── ...
    └── ...
```
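Before training, it is worth verifying that `train` and `val` expose the same class sub-folders, since ImageFolder-style loaders infer labels from folder names. A small stdlib-only sketch (illustrative, not part of LiBai):

```python
import os

def list_classes(root, split):
    """Return the sorted class sub-folder names under root/<split>."""
    split_dir = os.path.join(root, split)
    return sorted(
        d for d in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, d))
    )

# Usage (path is a placeholder):
# assert list_classes("/path/to/imagenet", "train") == list_classes("/path/to/imagenet", "val")
```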
- Update the data path in the `vit_imagenet` config file:
```python
# Refine data path to imagenet data folder
dataloader.train.dataset[0].root = "/path/to/imagenet"
dataloader.test[0].dataset.root = "/path/to/imagenet"
```

- To train the `vit_tiny_patch16_224` model on ImageNet on a single node with 8 GPUs for 300 epochs, run:

```bash
bash tools/train.sh tools/train_net.py configs/vit_imagenet.py 8
```

- The default ViT model in LiBai is set to `vit_tiny_patch16_224`. To train other ViT models, update the `vit_imagenet` config file by importing another model as follows:

```python
# from .common.models.vit.vit_tiny_patch16_224 import model
from .common.models.vit.vit_base_patch16_224 import model
```
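This swap works because LiBai configs are ordinary Python modules: whichever import binds the name `model` determines which architecture is built. A toy illustration of the pattern (dicts standing in for the real config objects; the values here are assumptions, not LiBai's):

```python
# Toy stand-ins for two model config modules (not LiBai's real objects).
vit_tiny_patch16_224 = {"name": "vit_tiny_patch16_224"}
vit_base_patch16_224 = {"name": "vit_base_patch16_224"}

# Commenting out one import and enabling another simply rebinds `model`:
# model = vit_tiny_patch16_224
model = vit_base_patch16_224
print(model["name"])  # vit_base_patch16_224
```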