Commit c42434d

collective
yinhaofeng committed Apr 28, 2021
1 parent 9b286ef commit c42434d
Showing 4 changed files with 76 additions and 3 deletions.
51 changes: 51 additions & 0 deletions doc/collective_mode.md
@@ -0,0 +1,51 @@
# Running in Collective Mode
If you want to use several GPUs at once to train your model faster, try running in `single-machine multi-GPU` or `multi-machine multi-GPU` mode.

## Version Requirements
Make sure paddlepaddle-2.0.0-rc-gpu or a later version of the PaddlePaddle open-source framework is installed.

## Configuring config.yaml
First, add the use_fleet parameter to the runner section of the model's yaml configuration and set its value to True.
```yaml
runner:
  # common settings omitted here
  ...
  # use fleet
  use_fleet: True
```
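As the trainer diffs below show, the flag is read with `config.get("runner.use_fleet", False)`, so it defaults to False when it is omitted.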
## Single-Machine Multi-GPU Training
### Selecting the GPUs to use
By default, all GPUs on the machine are used. To run on only some of them, set the CUDA_VISIBLE_DEVICES environment variable. For example, on a machine with 8 GPUs, to train on just the first 4, set `export CUDA_VISIBLE_DEVICES=0,1,2,3` and then run the training script.
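For example, a minimal sketch (assuming an 8-GPU machine and the dynamic-graph trainer used below):
```bash
# Expose only the first four GPUs to the training job
export CUDA_VISIBLE_DEVICES=0,1,2,3
# paddle.distributed.launch then starts one worker per visible GPU
python -m paddle.distributed.launch ../../../tools/trainer.py -m config.yaml
```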
### Running the training
```bash
# Dynamic-graph training
python -m paddle.distributed.launch ../../../tools/trainer.py -m config.yaml
# Static-graph training
python -m paddle.distributed.launch ../../../tools/static_trainer.py -m config.yaml
```

Note: when training with the static graph, make sure the create_optimizer function in the model's static_model.py sets up the distributed optimizer.
```python
def create_optimizer(self, strategy=None):
    optimizer = paddle.optimizer.Adam(learning_rate=self.learning_rate, lazy_mode=True)
    # Obtain a distributed optimizer via the Fleet API, wrapping the base Paddle optimizer
    if strategy is not None:
        import paddle.distributed.fleet as fleet
        optimizer = fleet.distributed_optimizer(optimizer, strategy)
    optimizer.minimize(self._cost)
```
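For dynamic-graph training no model-side change is needed; the equivalent wiring lives in tools/trainer.py (see its diff below). A minimal sketch of what that code does, assuming `dy_model` and `optimizer` have already been built:
```python
from paddle.distributed import fleet

# Initialize collective (GPU) training; paddle.distributed.launch
# starts one such process per card
strategy = fleet.DistributedStrategy()
fleet.init(is_collective=True, strategy=strategy)

# Wrap the optimizer and model so gradients are synchronized across cards
optimizer = fleet.distributed_optimizer(optimizer)
dy_model = fleet.distributed_model(dy_model)
```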

## Multi-Machine Multi-GPU Training
To train across machines you need one or more additional machines that can ping each other. Every machine must have paddlepaddle-2.0.0-rc-gpu or a later version of the PaddlePaddle open-source framework installed, and the PaddleRec model and dataset to be run must be copied to each machine.
Moving from single-machine to multi-machine training requires no code changes; you only need to additionally pass the ips parameter, a list of the machines' IP addresses, as shown below:
```bash
# Dynamic-graph training
python -m paddle.distributed.launch --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus 0,1,2,3,4,5,6,7 ../../../tools/trainer.py -m config.yaml
# Static-graph training
python -m paddle.distributed.launch --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus 0,1,2,3,4,5,6,7 ../../../tools/static_trainer.py -m config.yaml
```
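Note: this assumes paddle.distributed.launch's usual collective workflow, which this commit does not spell out: the same command is run on every machine listed in --ips, and each node works out its own rank from its local IP.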
6 changes: 4 additions & 2 deletions models/rank/wide_deep/config.yaml
@@ -17,9 +17,9 @@
 runner:
   train_data_dir: "data/sample_data/train"
   train_reader_path: "criteo_reader" # importlib format
-  use_gpu: False
+  use_gpu: True
   use_auc: True
-  train_batch_size: 2
+  train_batch_size: 50
   epochs: 3
   print_interval: 2
   #model_init_path: "output_model/0" # init model
@@ -34,6 +34,8 @@ runner:
   use_inference: False
   save_inference_feed_varnames: ["label","C1","C2","C3","C4","C5","C6","C7","C8","C9","C10","C11","C12","C13","C14","C15","C16","C17","C18","C19","C20","C21","C22","C23","C24","C25","C26","dense_input"]
   save_inference_fetch_varnames: ["cast_0.tmp_0"]
+  #use fleet
+  use_fleet: False

# hyper parameters of user-defined network
hyper_parameters:
13 changes: 12 additions & 1 deletion tools/static_trainer.py
@@ -63,9 +63,9 @@ def main(args):
     input_data_names = [data.name for data in input_data]
 
     fetch_vars = static_model_class.net(input_data)
 
     #infer_target_var = model.infer_target_var
     logger.info("cpu_num: {}".format(os.getenv("CPU_NUM")))
-    static_model_class.create_optimizer()
 
     use_gpu = config.get("runner.use_gpu", True)
     use_auc = config.get("runner.use_auc", False)
@@ -79,6 +79,7 @@ def main(args):
     model_init_path = config.get("runner.model_init_path", None)
     batch_size = config.get("runner.train_batch_size", None)
     reader_type = config.get("runner.reader_type", "DataLoader")
+    use_fleet = config.get("runner.use_fleet", False)
     os.environ["CPU_NUM"] = str(config.get("runner.thread_num", 1))
     logger.info("**************common.configs**********")
     logger.info(
@@ -88,6 +89,16 @@
     logger.info("**************common.configs**********")
 
     place = paddle.set_device('gpu' if use_gpu else 'cpu')
+
+    if use_fleet:
+        from paddle.distributed import fleet
+        strategy = fleet.DistributedStrategy()
+        fleet.init(is_collective=True, strategy=strategy)
+    if use_fleet:
+        static_model_class.create_optimizer(strategy)
+    else:
+        static_model_class.create_optimizer()
+
     exe = paddle.static.Executor(place)
     # initialize
     exe.run(paddle.static.default_startup_program())
9 changes: 9 additions & 0 deletions tools/trainer.py
@@ -79,6 +79,7 @@ def main(args):
     train_batch_size = config.get("runner.train_batch_size", None)
     model_save_path = config.get("runner.model_save_path", "model_output")
     model_init_path = config.get("runner.model_init_path", None)
+    use_fleet = config.get("runner.use_fleet", False)
 
     logger.info("**************common.configs**********")
     logger.info(
@@ -102,6 +103,14 @@
     # to do : add optimizer function
     optimizer = dy_model_class.create_optimizer(dy_model, config)
 
+    # use fleet run collective
+    if use_fleet:
+        from paddle.distributed import fleet
+        strategy = fleet.DistributedStrategy()
+        fleet.init(is_collective=True, strategy=strategy)
+        optimizer = fleet.distributed_optimizer(optimizer)
+        dy_model = fleet.distributed_model(dy_model)
+
     logger.info("read data")
     train_dataloader = create_data_loader(config=config, place=place)

