add launcher, update multi client launch and exit #5414

Merged
merged 10 commits into from
Jul 9, 2021


@daquexian (Contributor) commented Jul 7, 2021

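  1. Ported the launcher implementation from PyTorch, with the following changes relative to PyTorch:

    1. Removed the --use_env argument (we only support use_env=True).
    2. Changed the meaning of --logdir: instead of redirecting stdout and stderr into logdir, it now saves log files in logdir (via the GLOG_log_dir environment variable mentioned in item 3). Added a --redirect_stdout_and_stderr argument that provides the original redirection behavior.

    Basic launcher usage: python3 -m oneflow.distributed.launch --nproc_per_node 2 xxx.py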

  2. Moved the is_multi_client flag out of ProcessCtx into a dedicated Global<bool, MultiClient>, so that in single-client mode the flag's value can be read before env.init runs.

  3. In multi-client mode, standard glog environment variables such as GLOG_log_dir are read and used to modify the CppLoggingConf in the env proto.

  4. When oneflow exits, MasterSendAbort is called only in single-client mode.
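To illustrate item 3, here is a minimal sketch of how glog-style environment variables could be mapped onto a logging config in multi-client mode. It is not the code in this PR: `LoggingConf`, its field names and defaults, and `logging_conf_from_env` are hypothetical stand-ins for CppLoggingConf, and the boolean parsing is simplified relative to glog's own.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class LoggingConf:
    """Hypothetical stand-in for the CppLoggingConf part of the env proto."""

    log_dir: str = "./log"      # assumed default, for illustration only
    logtostderr: bool = False   # assumed default
    minloglevel: int = 0        # assumed default (0 = INFO in glog)


def logging_conf_from_env(
    is_multi_client: bool, environ: Optional[Mapping[str, str]] = None
) -> LoggingConf:
    """Build a LoggingConf, applying GLOG_* overrides only in multi-client mode."""
    environ = os.environ if environ is None else environ
    conf = LoggingConf()
    if not is_multi_client:
        # Single-client mode: the environment variables are ignored here.
        return conf
    if "GLOG_log_dir" in environ:
        conf.log_dir = environ["GLOG_log_dir"]
    if "GLOG_logtostderr" in environ:
        # Simplified truthiness check; glog itself inspects the first character.
        conf.logtostderr = environ["GLOG_logtostderr"].lower() in ("1", "true", "yes")
    if "GLOG_minloglevel" in environ:
        conf.minloglevel = int(environ["GLOG_minloglevel"])
    return conf
```

Passing the environment as a parameter keeps the function easy to test without mutating `os.environ`.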

Test cases for the launcher have not been added yet, because that requires changes to the CI test scripts.
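For readers unfamiliar with PyTorch-style launchers, the sketch below shows the general idea behind item 1: spawn nproc_per_node worker processes and pass each one its rank through environment variables (the use_env=True behavior). It is an illustration, not the code added in this PR. The variable names MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK and LOCAL_RANK follow the PyTorch convention and are assumed here; `build_worker_envs`, `launch`, the default port, and the per-rank `local_rank_N` log subdirectory are hypothetical.

```python
import os
import subprocess
import sys
from typing import Dict, List, Mapping, Optional


def build_worker_envs(
    nproc_per_node: int,
    nnodes: int = 1,
    node_rank: int = 0,
    master_addr: str = "127.0.0.1",
    master_port: int = 29500,  # assumed default, for illustration only
    logdir: Optional[str] = None,
    base_env: Optional[Mapping[str, str]] = None,
) -> List[Dict[str, str]]:
    """Return one environment dict per local worker process."""
    base = dict(os.environ if base_env is None else base_env)
    world_size = nproc_per_node * nnodes
    envs = []
    for local_rank in range(nproc_per_node):
        env = dict(base)
        env["MASTER_ADDR"] = master_addr
        env["MASTER_PORT"] = str(master_port)
        env["WORLD_SIZE"] = str(world_size)
        env["RANK"] = str(nproc_per_node * node_rank + local_rank)
        env["LOCAL_RANK"] = str(local_rank)
        if logdir is not None:
            # Hypothetical layout: one glog directory per local rank.
            env["GLOG_log_dir"] = os.path.join(logdir, f"local_rank_{local_rank}")
        envs.append(env)
    return envs


def launch(script_args: List[str], **kwargs) -> int:
    """Start all workers, wait for them, and return the first non-zero exit code."""
    procs = [
        subprocess.Popen([sys.executable, "-u", *script_args], env=env)
        for env in build_worker_envs(**kwargs)
    ]
    codes = [p.wait() for p in procs]
    return next((code for code in codes if code != 0), 0)
```

Under these assumptions, `launch(["xxx.py"], nproc_per_node=2)` would roughly correspond to the command line shown in item 1.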

daquexian added 3 commits July 7, 2021 14:27
Signed-off-by: daquexian <daquexian566@gmail.com>
Signed-off-by: daquexian <daquexian566@gmail.com>
Signed-off-by: daquexian <daquexian566@gmail.com>
@chengtbf chengtbf added this to the v0.5.0 milestone Jul 7, 2021
Signed-off-by: daquexian <daquexian566@gmail.com>
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 9, 2021 02:37
@oneflow-ci-bot oneflow-ci-bot merged commit 53501b0 into master Jul 9, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the multi_client_update branch July 9, 2021 04:50
@@ -62,11 +62,6 @@ bool GlobalProcessCtx::IsThisProcessMaster() {
  return Global<ProcessCtx>::Get()->rank() == 0;
}

bool GlobalProcessCtx::IsMultiClient() {
  CHECK_NOTNULL(Global<ProcessCtx>::Get());
  return Global<ProcessCtx>::Get()->is_multi_client();

Only the implementation of IsMultiClient was removed here, but the declaration in the .h file was not removed? @daquexian

4 participants