-
Notifications
You must be signed in to change notification settings - Fork 825
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use attr nd_sbp to check consistent #6222
Conversation
leaves-zwx
commented
Sep 9, 2021
- 使用 data reader op 的属性 nd_sbp 来判断 op 是否是 consistent,如果 nd_sbp 为空,说明 op 不为 consistent,此时我们认为 data reader op 工作在 DDP 下
- 导出 amp_white_identity function
Speed stats:
|
oneflow/user/data/ofrecord_dataset.h
Outdated
if (IsMirroredParallelContext(ctx->parallel_ctx())) { | ||
// NOTE(zwx): OFRecordDataset is not consistent since attr nd_sbp is empty, | ||
// we assume that it works in DDP | ||
auto nd_sbp_str_vec = ctx->Attr<std::vector<std::string>>("nd_sbp"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
反正都是从 KernelInitContext 里取,是不是还是可以抽成一个单独的函数?IsMirroredKernelContext 之类的?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
可以
// NOTE(zwx): OFRecordDataset is not consistent since attr nd_sbp is empty, | ||
// we assume that it works in DDP | ||
bool is_local = false; | ||
if (ctx->op_type_name() == "OFRecordReader") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里对 OFRecordReader 做了特判,因为 OFRecordDataset 除了被 OFRecordDataReader 使用还被 OFRecordImageClassificationDataReader 使用。
前者需要兼容 DDP 的用法,所以这里判断了 is local;而后者目前还未被移动到 multi-client 下,且其没有 nd_sbp attr,而且看起来可能是个将要被废弃的接口,所以暂时不对其做修改(添加 nd_sbp attr),而是选择在这里做特判,如果 OFRecordImageClassificationDataReader 将来被删除或移植到 multi-client 下,这里的特判可以去掉。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里差一点就分析到问题的本质了。不应该加这个特判;而是有两种解决方案:
- 判断是否为 Multi-Client
- 修改 Single-Client 下的 reader op 的 python 脚本,全部加上 nd_sbp(直接加默认值 S(0) 就行 )的默认配置。
我倾向于选择方案 2,并删掉此处的特判。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OFRecordImageClassificationDataReader 没这个 attr nd_sbp,不是没传
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
好的
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里的问题在于没有判断 IsMultiClient,如果是 single client,那么 is_local 应该总为 false,提交了一个修复 pr #6288
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
你可以先这么修。
更合适的方案应该是修改 Single-Client 下的 op 使之支持这个 nd_sbp 的约束吧。
同时还需要检查 parallel num。在 is local 下, parallel num 一定是 1; 如果 parallel num > 1,那么一定是 Consistent 模式;但是如果 parallel num = 1 ,则不一定是 local 模式。所以 parallel num 是判断的必要不充分条件?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
嗯,不过 local 模式下,本来 parallel num 就是人工设置的一个假的,主要是为了兼容旧的写法,新的方式下 local 应该避免访问 parallel_ctx
Speed stats:
|
CI failed, removing label automerge |
Speed stats:
|
// OFRecordImageClassificationDataReader had supported DDP (add attr nd_sbp) | ||
// or been deprecated. | ||
if (ctx->op_type_name() == "OFRecordReader") { | ||
auto nd_sbp_str_vec = ctx->Attr<std::vector<std::string>>("nd_sbp"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个判断条件仍然是有问题的。在 Single-Client 下,老版本的 reader op 没有配置 nd_sbp 属性(因为 nd_sbp 是后来单独加的)。 但 Single-Client 只支持 Consistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
是,已在 #6288 提交了修复,去判断是否在 multi-client 下,如果在 single client 下,is_local 总为 false