Support CPU Parallel in DataParallel Interface by GLOO to speed up training #35745
Thanks for your contribution!
LGTM
LGTM for
bash_test_modules(test_cpuonly_launch START_BASH test_cpuonly_launch.sh SERIAL LABELS "RUN_TYPE=EXCLUSIVE" ENVS "PADDLE_DIST_UT_PORT=${dist_ut_port}" PADDLE_BINARY_DIR=${PADDLE_BINARY_DIR}
LGTM
LGTM for
set_tests_properties(test_parallel_dygraph_unused_variables_gloo PROPERTIES TIMEOUT 120)
set_tests_properties(test_parallel_dygraph_sparse_embedding_gloo PROPERTIES TIMEOUT 120)
set_tests_properties(test_parallel_dygraph_sparse_embedding_over_height_gloo PROPERTIES TIMEOUT 120)
LGTM
LGTM
LGTM for const_cast
LGTM
opts.setOutput(output_ptr, element_num * size_);
gloo::allgather(opts);
#else
LOG(WARNING) << "AllGather does nothing when WITH_GLOO=OFF";
Should this throw an exception here instead?
This follows the same handling pattern as the other gloo_wrapper interfaces; see line 221.
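@2742195759 Hello! After building the latest develop branch from source, I ran training with python -m paddle.distributed.launch --nproc_per_node=4 --backend=gloo train.py, and when execution reached dist.init_parallel_env() it failed with the following error:
I built the CPU version in a Docker environment:
You need to add -DWITH_GLOO=ON to enable CPU parallelism.
Hello, I have already added it
What is causing this? Could it be that some third-party libraries were not downloaded completely?
@2742195759 The complete error message is as follows:
It looks like some cmake option is still not enabled, so the c broadcast attribute was not compiled in. I will take a look for you tonight. In the meantime, you could check whether -DWITH_DISTRIBUTE=ON helps.
I just ran
Could this be the cause?
I am now trying to add another
@2742195759 Hello! The problems above are resolved. They were all caused by network issues while downloading third-party libraries, which left the corresponding packages incompletely installed. Thank you for your replies!
OK~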
PR types
New features
PR changes
APIs
Describe
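Background
This PR lets users pass a custom backend argument when calling spawn or launch. The backend argument indicates whether the user wants CPU, GPU, XPU, or PS parallelism. The available backend choices are "gloo", "nccl", "bkcl", and "auto".
Example scenarios
Impact
gloo / nccl / bkcl -> Collective
auto -> inferred using the previous logic.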
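The mapping above can be sketched roughly as follows. This is only an illustration of the rule described in this PR, not Paddle's actual code: `resolve_mode`, `COLLECTIVE_BACKENDS`, and the `legacy_infer` callback are hypothetical names, and the pre-existing inference logic is represented by a caller-supplied function.

```python
# Hypothetical sketch of the backend -> mode rule; not Paddle's internals.
COLLECTIVE_BACKENDS = {"gloo", "nccl", "bkcl"}


def resolve_mode(backend, legacy_infer):
    """Return the run mode for a backend string.

    legacy_infer is an assumed stand-in for the inference logic that
    existed before this PR; it is only consulted for backend="auto".
    """
    if backend in COLLECTIVE_BACKENDS:
        return "collective"
    if backend == "auto":
        return legacy_infer()
    raise ValueError(
        "backend must be one of 'gloo', 'nccl', 'bkcl', 'auto', got %r" % backend
    )
```

An explicit backend always selects collective mode, so only "auto" depends on the older inference path.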
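Exception handling
1. Error when using Gloo on Mac and Windows
Paddle currently supports GLOO only on Linux. When a user migrating code tries to use gloo as the backend on macOS or Windows, a ValueError is raised.
2. Error when the user-specified backend does not match the Paddle build
For example, if the user passes backend='nccl' with a CPU build of Paddle, an error is raised.
3. Argument checks in gloo mode
For example, if the user sets backend=gloo and also passes other arguments that CPU mode does not support, these cases are caught and reported as errors.
4. NPU mode does not currently support parallel
NPU does not currently support parallel training, but the parallel interface already contains some NPU logic, so that logic is kept where possible. In this case, when backend is auto, it is inferred as unknown so that a later step can report the error.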
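A minimal sketch of the four checks, under stated assumptions: `validate_backend` and all of its parameters are hypothetical and do not come from the PR's code. The platform string, the set of backends compiled into the build, and the list of CPU-unsupported arguments are supplied by the caller rather than queried from Paddle.

```python
# Hypothetical sketch of the four checks; names are illustrative only.
VALID_BACKENDS = ("gloo", "nccl", "bkcl", "auto")


def validate_backend(backend, platform, available_backends,
                     is_npu=False, unsupported_args=()):
    """Return the backend to use, or raise ValueError.

    platform: a sys.platform-style string such as "linux" or "darwin".
    available_backends: backends the installed build is assumed to support.
    unsupported_args: names of user-passed options that CPU mode cannot honor.
    """
    if backend not in VALID_BACKENDS:
        raise ValueError("unknown backend: %r" % backend)
    if backend == "auto":
        # Case 4: NPU has no parallel support, so defer the error.
        return "unknown" if is_npu else "auto"
    # Case 1: gloo is only supported on Linux.
    if backend == "gloo" and not platform.startswith("linux"):
        raise ValueError("the gloo backend is only supported on Linux")
    # Case 2: the backend must match the installed build.
    if backend not in available_backends:
        raise ValueError("this build does not support backend %r" % backend)
    # Case 3: reject options that CPU parallel mode cannot use.
    if backend == "gloo" and unsupported_args:
        raise ValueError(
            "arguments not supported with backend='gloo': %s"
            % ", ".join(unsupported_args)
        )
    return backend
```

Passing the environment in as arguments keeps the sketch testable on any platform; a real implementation would read it from the runtime.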
#### Usage example
Launch the code above with launch: