-
Notifications
You must be signed in to change notification settings - Fork 825
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/logical nccl send recv #8318
Conversation
@@ -366,7 +386,7 @@ bool TryBuildNcclLogicalOpConf(OperatorConf* ret, const OpNode* src_node, const | |||
std::shared_ptr<Shape> dst_reduced_hierarchy = dst_reduced_parallel_desc->hierarchy(); | |||
|
|||
if ((*src_reduced_hierarchy) == (*dst_reduced_hierarchy) | |||
&& src_reduced_nd_sbp == dst_reduced_nd_sbp) { | |||
&& (*src_reduced_nd_sbp) == (*dst_reduced_nd_sbp)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix ndsbp equal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个 BUG 之前居然没遇到 😂
if (!got_nccl && ParseBooleanFromEnv("LOGICAL_SR", false)) { | ||
got_nccl = TryBuildNcclBy2DHierarchyOthers(ret, *src_reduced_nd_sbp, *dst_reduced_nd_sbp, | ||
src_reduced_hierarchy, lbn, scope_symbol_id, | ||
logical_blob_desc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Insert nccl logical send recv
oneflow/core/job/nd_sbp_util.cpp
Outdated
// Go through all the ranks while transfer between two nd sbps with no PartialSum under the same | ||
// placement. | ||
// NOTE: We need to make sure no partial sums in the sbps of the producer and consumer. | ||
void DfsTraverseRanks4NdSbp( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eager_out = x.to_global(sbp=dst_nd_sbp, placement=placement) | ||
test_case.assertTrue(np.array_equal(eager_out.numpy(), x.numpy())) | ||
|
||
# bad case of graph: S with P |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bad cases
// Note: when src_nd_sbp has partial_sum, need a out_size buffer to copy and add to out. | ||
buf_count += out_shape->elem_cnt(); | ||
} | ||
return buf_count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里错了,这里应该乘GetSizeOfDataType(data_type)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. 现在单测和集成测试都正常了。
@@ -493,6 +496,20 @@ Maybe<void> BoxingCollector::AskSbpCombination(const NdSbp& sbp_producer, const | |||
if (ParseBooleanFromEnv("ONEFLOW_BOXING_DISABLE_MIDDLE_NODE_AND_CHECK", false)) { | |||
return Maybe<void>::Ok(); | |||
} | |||
// If compute_cost==false + 2D sbp + same placment + nccl logical + not (p->b), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这种条件下,middle node 不工作,交给 nccl logical send recv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src 中有 p, dst 中 有 b时,不使用 send recv 而使用 middle node,发现 send recv 测试中存在无法通过的测试例子,在send recv测试中关闭 nccl use compute stream 就能复现问题。
所以这里 (p->b) 还是使用 send recv。存在的问题,后面的 PR 来修复。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里解释一下存在的问题,2个:
- 传输速度由O(n+m)变为O(nm),n是上游节点P的维度分量乘积,m是下游节点B的维度分量乘积。
例如: [2, 2]: (P, S0) -> [2, 2]: (B, B), O(2+4) -> O(2*4), 卡数少的时候差不了多少。 - 最后的B在各卡上数据不一致,会有1e-15的相对误差。
@@ -5329,6 +5329,24 @@ def OneFlow__ncclLogicalS2sOp : OneFlow_BaseOp<"_nccl_logical_s2s", [NoSideEffec | |||
let has_nd_sbp_infer_fn = 1; | |||
} | |||
|
|||
def OneFlow__ncclLogicalSendRecvOp : OneFlow_BaseOp<"_nccl_logical_send_recv", [NoSideEffect, NoGrad, DeclareOpInterfaceMethods<UserOpCompatibleInterface>]> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nccl logical send recv op
bool AlwaysComputeWhenAllOutputsEmpty() const override { return false; } | ||
}; | ||
|
||
void NcclLogicalSendRecv::Compute(user_op::KernelComputeContext* ctx, user_op::OpKernelState* state, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nccl logical send recv kernel
|
||
# check eager boxing | ||
eager_out = x.to_global(sbp=dst_nd_sbp, placement=placement) | ||
test_case.assertTrue(np.array_equal(eager_out.numpy(), x.numpy())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
确保 eager 是正确的,因为后面 graph 的输出会调用 eager numpy(),里面隐式的调用了 dst_nd_sbp -> B
test_case.assertTrue(np.array_equal(eager_out.numpy(), x.numpy())) | ||
|
||
# check graph boxing | ||
flow.boxing.nccl.enable_use_compute_stream(True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
打开 nccl logical 开关
#if flow.env.get_rank() == 0: | ||
# print("src sbp ", src_nd_sbp, ", dst sbp ", dst_nd_sbp) | ||
|
||
test_case.assertTrue(np.array_equal(out_np, in_np)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
graph check
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally. |
Speed stats:
|
Static analysis with clang failed. PR label automerge has been removed |
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally. |
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8318/ |
CI failed when running job: cuda-module. PR label automerge has been removed |
Speed stats:
|
https://github.com/Oneflow-Inc/oneflow/runs/6678073040?check_suite_focus=true
一个无关的单测,本地没有复现。先重跑一下。 |
Speed stats:
|
bool has_independent_stream_; | ||
std::string stream_name_; | ||
std::unique_ptr<ParallelDesc> parallel_desc_; | ||
mutable std::unique_ptr<Comm> comm_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里为啥是 mutable ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ncclComm_t comm() const { return GetOrCreateComm().comm; }
印象中好像是这样的 const 接口,会改动 comm_
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
那其实是不是就不需要作为 const 接口呢? 😂
参考其他的 nccl logical kernel:
ncclComm_t comm() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
是的,看了下,可以去掉 const
Insert send recv in nccl logical pass to enable memory reuse for special SBP.