[TIPC-Benchmark] Support @to_static training for Benchmark #1756

Merged
merged 3 commits into from
Mar 15, 2022

@Aurelius84 Aurelius84 commented Mar 14, 2022

What's New?

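This PR implements a monitoring mechanism for @to_static (dynamic-to-static) training based on the new Benchmark specification. It is a backward-compatible upgrade to the existing functionality.

1. Usage

On top of dynamic-graph training, dynamic-to-static training is enabled as follows: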

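  • Configuration parameter name: to_static (it must be spelled exactly; it is case-sensitive)
# Option 1: enable dynamic-to-static training for all configuration combinations of a model
bash test_tipc/benchmark_train.sh test_tipc/config/ResNet/ResNet50_train_infer_python.txt benchmark_train to_static

# Option 2: enable dynamic-to-static training for one specified configuration combination of a model
bash test_tipc/benchmark_train.sh test_tipc/config/ResNet/ResNet50_train_infer_python.txt benchmark_train dynamic_bs8_fp32_DP_N1C2 to_static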
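To make the two invocation forms concrete, here is a minimal Python sketch of how the trailing arguments could be interpreted. It is not the logic in benchmark_train.sh; `BenchmarkArgs` and `parse_benchmark_args` are hypothetical names introduced only for illustration, and the only behavior taken from this PR is that the keyword `to_static` is matched exactly and case-sensitively.

```python
from typing import List, NamedTuple, Optional


class BenchmarkArgs(NamedTuple):
    """Hypothetical container for the arguments shown in the commands above."""
    config_file: str
    mode: str
    model_params: Optional[str]  # None means "all configuration combinations"
    to_static: bool


def parse_benchmark_args(argv: List[str]) -> BenchmarkArgs:
    """Split the arguments into config file, mode, optional combination, and flag."""
    if len(argv) < 2:
        raise ValueError("expected at least a config file and a mode")
    config_file, mode, *rest = argv
    # Exact, case-sensitive match: "To_Static" or "tostatic" would not enable it.
    to_static = "to_static" in rest
    remaining = [arg for arg in rest if arg != "to_static"]
    model_params = remaining[0] if remaining else None
    return BenchmarkArgs(config_file, mode, model_params, to_static)


# Option 1: no combination given, so all combinations run with to_static enabled.
print(parse_benchmark_args(["cfg.txt", "benchmark_train", "to_static"]))
# Option 2: only the named combination runs with to_static enabled.
print(parse_benchmark_args(
    ["cfg.txt", "benchmark_train", "dynamic_bs8_fp32_DP_N1C2", "to_static"]))
```

Under this sketch, a misspelled keyword is not recognized as the flag, which is why the exact spelling matters.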

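2. Verification logs

This PR was verified on the ResNet and MobileNet models with single-machine single-GPU and multi-GPU training.

To tell whether dynamic-to-static conversion took effect, look for the line Successfully to apply @to_static with specs XX in the log. A sample log follows: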

[2022/03/14 08:13:16] root INFO: profiler_options : None
[2022/03/14 08:13:16] root INFO: train with paddle 0.0.0 and device Place(gpu:0)
W0314 08:13:24.551266 13696 gpu_context.cc:244] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0314 08:13:24.556710 13696 gpu_context.cc:272] device: 0, cuDNN Version: 8.1.
[2022/03/14 08:13:28] root INFO: Successfully to apply @to_static with specs: [InputSpec(shape=(-1, 3, 224, 224), dtype=paddle.float32, name=None)]
[2022/03/14 08:13:28] root WARNING: The training strategy in config files provided by PaddleClas is based on 4 gpus. But the number of gpus is 1 in current training. Please modify the stategy (learning rate, batch size and so on) if use config files in PaddleClas to train.
[2022/03/14 08:13:32] root INFO: [Train][Epoch 1/1][Iter: 0/160146]lr: 0.10000, top1: 0.00000, top5: 0.00000, CELoss: 7.45196, loss: 7.45196, batch_cost: 3.92930s, reader_cost: 0.99335, ips: 2.03599 samples/s, eta: 7 days, 6:47:41
[2022/03/14 08:13:32] root INFO: [Train][Epoch 1/1][Iter: 1/160146]lr: 0.10000, top1: 0.00000, top5: 0.00000, CELoss: 21.29071, loss: 21.29071, batch_cost: 2.01570s, reader_cost: 0.49713, ips: 3.96885 samples/s, eta: 3 days, 17:40:03
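The check described above can be automated. The following is a minimal sketch that scans log text for the success message and returns the reported specs; the function name `find_to_static_specs` is hypothetical, and the pattern assumes the message format matches the sample log exactly.

```python
import re
from typing import Optional

# Matches the success message from the sample log and captures the specs text.
TO_STATIC_PATTERN = re.compile(r"Successfully to apply @to_static with specs: (.+)$")


def find_to_static_specs(log_text: str) -> Optional[str]:
    """Return the specs string from the first matching log line, or None."""
    for line in log_text.splitlines():
        match = TO_STATIC_PATTERN.search(line)
        if match:
            return match.group(1)
    return None


sample = (
    "[2022/03/14 08:13:16] root INFO: profiler_options : None\n"
    "[2022/03/14 08:13:28] root INFO: Successfully to apply @to_static with specs: "
    "[InputSpec(shape=(-1, 3, 224, 224), dtype=paddle.float32, name=None)]\n"
)
specs = find_to_static_specs(sample)
print("to_static enabled" if specs else "to_static NOT enabled", specs)
```

A return value of None means the message was not found, so the run should be treated as plain dynamic-graph training.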

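3. Design

The existing Benchmark scheme is implemented by running the script bash test_train_inference_python.sh.

Training configurations are dispatched by parsing trainer:norm_train (line 15) in test_tipc/config/test_xxx.txt. Here we extend the configuration on line 20 to add a new dynamic-to-static trainer: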

  • to_static_train:-o Global.to_static=True

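This reuses the trainer:norm_train configuration and appends -o Global.to_static=True to it to enable dynamic-to-static training, which keeps the basic configuration parameters of dynamic-to-static training aligned with those of dynamic-graph training.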
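The reuse-and-append idea can be sketched as follows. This is an illustration only, not the shell code in this PR: `parse_config_line` and `build_to_static_command` are hypothetical helpers, and the norm_train value used here is a made-up placeholder, not the content of any real config file.

```python
from typing import Tuple


def parse_config_line(line: str) -> Tuple[str, str]:
    """Split a 'key:value' config line at the first colon."""
    key, _, value = line.partition(":")
    return key.strip(), value.strip()


def build_to_static_command(norm_train_cmd: str, to_static_opts: str) -> str:
    """Reuse the norm_train command and append the to_static override."""
    return f"{norm_train_cmd.rstrip()} {to_static_opts}"


# The to_static_train line is taken from this PR; the norm_train value is a placeholder.
key, opts = parse_config_line("to_static_train:-o Global.to_static=True")
norm_train_cmd = "tools/train.py -c some_config.yaml"  # hypothetical placeholder
print(key, "->", build_to_static_command(norm_train_cmd, opts))
```

Because the override is appended to the same base command, any change to the norm_train settings carries over to the dynamic-to-static run automatically.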

@Aurelius84 Aurelius84 requested a review from LDOUBLEV March 14, 2022 12:42
@Aurelius84 Aurelius84 requested review from weisy11 and LDOUBLEV March 15, 2022 06:20

@weisy11 weisy11 left a comment


LGTM
