new_X_to_B #5987

Merged
merged 23 commits into from Aug 23, 2021

Conversation

clackhan
Contributor

No description provided.

@@ -33,6 +34,7 @@ Maybe<one::Tensor> EagerBoxingInterpreter::Interpret(const std::shared_ptr<one::
Symbol<ParallelDesc> in_parallel_desc,
Symbol<ParallelDesc> out_parallel_desc) const {
JUST(CheckEagerBoxingDataType(input->dtype()->data_type()));
DisableCheckConsistentTensorMetaScope disable_meta_check;
Contributor Author

Disable the ConsistentTensorMeta check inside the interpreter.
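
As context, a minimal sketch of how an RAII scope like DisableCheckConsistentTensorMetaScope can work: it flips a thread-local flag for its lifetime and restores it on destruction. The flag name and class body below are hypothetical, not OneFlow's actual implementation.

// Hypothetical sketch of an RAII "disable check" scope.
thread_local bool g_check_consistent_tensor_meta = true;  // assumed flag

class DisableCheckConsistentTensorMetaScope {
 public:
  DisableCheckConsistentTensorMetaScope() : saved_(g_check_consistent_tensor_meta) {
    g_check_consistent_tensor_meta = false;  // checks are skipped inside this scope
  }
  ~DisableCheckConsistentTensorMetaScope() {
    g_check_consistent_tensor_meta = saved_;  // restore the previous state
  }

 private:
  bool saved_;
};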

Comment on lines 123 to 124
&& (in_parallel_desc->device_type() == DeviceType::kGPU
&& out_parallel_desc->device_type() == DeviceType::kGPU)) {
Contributor Author

Currently only the GPU version is supported.

Contributor

Why support only the GPU version? Only the broadcast op is used here, and that op exists under CPU as well.

Comment on lines 42 to 43
Maybe<int64_t> GetBroadcastRoot(Symbol<ParallelDesc> src_parallel_desc,
Symbol<ParallelDesc> dst_parallel_desc) {
Contributor Author

Compute the root node of the broadcast from the src placement and the dst placement.
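
The diff does not show the selection policy itself. As a minimal sketch, one plausible policy (an assumption for illustration, not taken from the PR) is to prefer a source rank that the destination placement also covers, so the data travels from a rank that already holds it:

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch: placements modeled as plain rank lists instead of
// Symbol<ParallelDesc>. Assumes src_ranks is non-empty.
int64_t GetBroadcastRootSketch(const std::vector<int64_t>& src_ranks,
                               const std::vector<int64_t>& dst_ranks) {
  for (int64_t rank : src_ranks) {
    // Prefer a rank that both holds valid data and needs the result.
    if (std::find(dst_ranks.begin(), dst_ranks.end(), rank) != dst_ranks.end()) { return rank; }
  }
  return src_ranks.front();  // fallback: any rank holding valid data
}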

Comment on lines 83 to 84
const auto& new_tag_in_parallel_desc =
JUST(ReplaceDeviceType(in_parallel_desc, out_parallel_desc->device_type()));
Contributor Author

When the device types differ, the first step, converting to a tensor whose sbp is broadcast, must take the change of device tag into account.
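
Conceptually, ReplaceDeviceType keeps the machine/rank layout of the input placement but swaps in the output's device type, so the broadcast-sbp tensor already carries the right device tag. A toy sketch with a hypothetical PlacementSketch struct (not OneFlow's ParallelDesc):

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for Symbol<ParallelDesc>.
struct PlacementSketch {
  std::string device_type;     // e.g. "cpu" or "cuda"
  std::vector<int64_t> ranks;  // process ranks covered by the placement
};

// Same ranks, new device type: the layout is preserved, only the tag changes.
PlacementSketch ReplaceDeviceTypeSketch(const PlacementSketch& in,
                                        const std::string& new_device_type) {
  return PlacementSketch{new_device_type, in.ranks};
}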

std::shared_ptr<one::Tensor> local_tensor = JUST(broadcast_input->cur_rank_phy_tensor());
{
const auto& out_parallel_id = JUST(GetParallelId4CurrentProcessCtx(out_parallel_desc));
if (out_parallel_id->has_value()) {
Contributor Author

Only the ranks covered by out_parallel_desc execute the broadcast.
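
The has_value() test above is exactly that coverage check. A self-contained sketch of the same idea, with the placement modeled as a plain rank list (a hypothetical helper, not OneFlow's GetParallelId4CurrentProcessCtx):

#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Returns this process's index within the placement, or nullopt when the
// placement does not cover it; callers skip the broadcast in the latter case.
std::optional<int64_t> ParallelIdForRankSketch(int64_t my_rank,
                                               const std::vector<int64_t>& placement_ranks) {
  for (std::size_t i = 0; i < placement_ranks.size(); ++i) {
    if (placement_ranks[i] == my_rank) { return static_cast<int64_t>(i); }
  }
  return std::nullopt;
}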

Comment on lines 93 to 97
if (!new_in_parallel_id->has_value()) {
std::string device_type = Device::Type4DeviceTag(new_tag_in_parallel_desc->device_tag());
local_tensor = JUST(one::functional::Empty(*input->shape(), input->dtype(),
JUST(Device::New(device_type))));
}
Contributor Author

When the input tensor is not valid on the current rank, an empty tensor must be created; its main purpose is to complete the inference of the output.

Comment on lines 101 to 104
Symbol<ParallelDesc> broadcast_parallel_desc_cur_rank =
JUST(MapAt(*broadcast_grop, GlobalProcessCtx::Rank()));
int64_t root =
JUST(CachedGetBroadcastRoot(new_tag_in_parallel_desc, broadcast_parallel_desc_cur_rank));
Contributor Author

Compute the placement and root required by each broadcast group.
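
A sketch of the lookup this comment describes: every process finds the broadcast group it belongs to in a rank-to-group map built beforehand, then derives that group's root. The map type here is a hypothetical stand-in for broadcast_grop:

#include <cstdint>
#include <stdexcept>
#include <unordered_map>
#include <vector>

using RankGroup = std::vector<int64_t>;  // the placement of one broadcast group

// Mirrors MapAt(*broadcast_grop, GlobalProcessCtx::Rank()): look up the group
// for this rank, failing loudly if the rank is not in any group.
const RankGroup& GroupForRankSketch(
    const std::unordered_map<int64_t, RankGroup>& broadcast_groups, int64_t my_rank) {
  auto it = broadcast_groups.find(my_rank);
  if (it == broadcast_groups.end()) { throw std::out_of_range("rank not in any broadcast group"); }
  return it->second;
}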

Comment on lines 102 to 104
int64_t dev_id = GlobalProcessCtx::LocalRank(root);
int64_t parallel_id =
CHECK_JUST(kernel_state->parallel_desc()->ParallelId4MachineDeviceId(root, dev_id));
Contributor Author

In NCCL, the corresponding device rank within the communicator must be computed from the root.
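
A sketch of that mapping: NCCL addresses participants by their index inside the communicator, so the root's (machine, device) pair has to be translated into a parallel_id. The pair list below is a hypothetical stand-in for the kernel state's ParallelDesc:

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Find the communicator index (parallel_id) of the root, given the ordered
// (machine_id, device_id) pairs that make up the placement.
int64_t NcclRootParallelIdSketch(
    int64_t root_machine, int64_t root_dev_id,
    const std::vector<std::pair<int64_t, int64_t>>& machine_device) {
  for (std::size_t i = 0; i < machine_device.size(); ++i) {
    if (machine_device[i] == std::make_pair(root_machine, root_dev_id)) {
      return static_cast<int64_t>(i);  // root's rank within the communicator
    }
  }
  return -1;  // root is not part of this communicator
}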

Comment on lines 42 to 43
Maybe<int64_t> CalBroadcastRoot(Symbol<ParallelDesc> src_parallel_desc,
Symbol<ParallelDesc> dst_parallel_desc) {
Contributor Author

Compute the root node of the broadcast from src_parallel_desc and dst_parallel_desc.

@@ -99,15 +99,18 @@ class EagerNcclBroadcastKernel final : public user_op::OpKernel {
const user_op::Tensor* in = ctx->Tensor4ArgNameAndIndex("in", 0);
user_op::Tensor* out = ctx->Tensor4ArgNameAndIndex("out", 0);
int64_t root = ctx->Attr<int64_t>("root");
int64_t dev_id = GlobalProcessCtx::LocalRank(root);
int64_t parallel_id =
Contributor

Suggest naming this nccl_root.

@@ -116,6 +117,13 @@ Maybe<EagerBoxingInterpreter> GetBoxingInterpreter(Symbol<cfg::NdSbp> in_nd_sbp,
in_nd_sbp, out_nd_sbp, in_parallel_desc, out_parallel_desc));
if (interpreter.IsOk()) { return JUST(interpreter); }
}
if (in_parallel_desc->parallel_num() != out_parallel_desc->parallel_num()
Contributor

Move this logic into GetOneDimNcclCollectiveEagerBoxingInterpreter; that way this path directly supports the CPU version too.

Contributor

My wording here may not have been quite right; CPU support may not be achievable right away.

…new_X_to_B

Conflicts:
	oneflow/core/framework/op_interpreter/boxing/eager_boxing_interpreter_mgr.cpp
Comment on lines +35 to +40
if (!out_parallel_id->has_value()) {
std::string device_type = Device::Type4DeviceTag(in_parallel_desc->device_tag());
local_tensor = JUST(one::functional::Empty(
*JUST(GetPhysicalShape(*input->shape(), *in_nd_sbp, *in_parallel_desc, 0)), input->dtype(),
JUST(Device::New(device_type))));
}
Contributor Author

When the placement does not cover the current process, the local tensor is an empty tensor, so the local tensor must be initialized with a real tensor. This prevents an error when the mirror copy op is executed inside the local-to-consistent function at line 43 of this file.

Contributor

What role does this local tensor play here? I see that parallel_id is always 0 when its shape is computed.

Contributor Author

What role does this local tensor play here? I see that parallel_id is always 0 when its shape is computed.

This guards against a bug in the ToConsistent function called at line 43. When the tensor's placement does not cover the current rank, the local tensor obtained is an empty tensor. ToConsistent executes a copy op to convert the device tag, and if its input is an empty tensor, the copy op fails on that process; therefore local_tensor must be reassigned so that it becomes a meaningful tensor.
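
A minimal sketch of the guard being described, with the tensor modeled as a plain buffer just to show the shape of the fix (hypothetical, not OneFlow's tensor API):

#include <cstddef>
#include <vector>

using TensorSketch = std::vector<float>;  // stand-in for a local tensor

// If this rank holds no data, substitute a correctly-sized placeholder so the
// downstream copy op has a valid (if meaningless) input to operate on.
TensorSketch EnsureNonEmptySketch(TensorSketch local, std::size_t expected_elem_cnt) {
  if (local.empty()) { local.assign(expected_elem_cnt, 0.0f); }
  return local;
}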

Contributor

Then why is parallel_id always 0 when its shape is computed?

@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 142.5ms (= 7123.2ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 128.3ms (= 6414.9ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.11 (= 142.5ms / 128.3ms)

PyTorch resnet50 time: 85.3ms (= 4266.5ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 74.3ms (= 3716.8ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.15 (= 85.3ms / 74.3ms)

PyTorch resnet50 time: 62.9ms (= 3147.0ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 47.4ms (= 2372.4ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 1.33 (= 62.9ms / 47.4ms)

PyTorch resnet50 time: 50.1ms (= 2503.2ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 38.1ms (= 1907.3ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 1.31 (= 50.1ms / 38.1ms)

PyTorch resnet50 time: 43.9ms (= 2193.8ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 45.6ms (= 2278.6ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 0.96 (= 43.9ms / 45.6ms)

@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 141.2ms (= 7060.0ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 128.4ms (= 6420.7ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.10 (= 141.2ms / 128.4ms)

PyTorch resnet50 time: 84.6ms (= 4228.4ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 75.5ms (= 3776.6ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.12 (= 84.6ms / 75.5ms)

PyTorch resnet50 time: 56.3ms (= 2817.3ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 48.3ms (= 2414.3ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 1.17 (= 56.3ms / 48.3ms)

PyTorch resnet50 time: 48.3ms (= 2417.3ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 37.3ms (= 1862.9ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 1.30 (= 48.3ms / 37.3ms)

PyTorch resnet50 time: 45.0ms (= 2250.9ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 36.3ms (= 1814.2ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 1.24 (= 45.0ms / 36.3ms)


static constexpr auto* CheckSymXToB = DECORATE(&RawCheckSymXToB, ThreadLocal);

Maybe<one::UserOpExpr> EagerNcclAllReduce(Symbol<ParallelDesc> parallel_desc) {
Contributor

oneflow/core/framework/op_interpreter/boxing/collective_boxing_interpreter.cpp contains exactly the same code; can it be reused?

Contributor Author

oneflow/core/framework/op_interpreter/boxing/collective_boxing_interpreter.cpp contains exactly the same code; can it be reused?

All boxing interpreters will later be converted to a registration-based form, at which point the collective_boxing_interpreter.cpp file will be deleted; it is kept for now.

Comment on lines +35 to +40
if (!out_parallel_id->has_value()) {
std::string device_type = Device::Type4DeviceTag(in_parallel_desc->device_tag());
local_tensor = JUST(one::functional::Empty(
*JUST(GetPhysicalShape(*input->shape(), *in_nd_sbp, *in_parallel_desc, 0)), input->dtype(),
JUST(Device::New(device_type))));
}
Contributor

What role does this local tensor play here? I see that parallel_id is always 0 when its shape is computed.

JUST(SymXToBBoxingFunction(tensor, in, broadcast_in_placed_nd_sbp));

const auto& AsymBoxingFunction =
*JUST(GetBoxingFunction("asymmetric-x-to-b", broadcast_in_placed_nd_sbp, out));
Contributor

Why are there two boxing functions, asymmetric-x-to-b and asym-x-to-b? Is one of the names wrong?

Contributor Author

Why are there two boxing functions, asymmetric-x-to-b and asym-x-to-b? Is one of the names wrong?

That was a typo; it should be asymmetric-broadcast. It has been corrected.

@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 140.7ms (= 7034.5ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 128.6ms (= 6429.7ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.09 (= 140.7ms / 128.6ms)

PyTorch resnet50 time: 82.5ms (= 4123.1ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 74.8ms (= 3739.2ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.10 (= 82.5ms / 74.8ms)

PyTorch resnet50 time: 58.7ms (= 2935.8ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 48.0ms (= 2401.0ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 1.22 (= 58.7ms / 48.0ms)

PyTorch resnet50 time: 50.2ms (= 2508.1ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 39.4ms (= 1971.9ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 1.27 (= 50.2ms / 39.4ms)

PyTorch resnet50 time: 43.2ms (= 2158.1ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 38.3ms (= 1916.9ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 1.13 (= 43.2ms / 38.3ms)

@oneflow-ci-bot oneflow-ci-bot merged commit c13629f into master Aug 23, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the new_X_to_B branch August 23, 2021 19:35