Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support consistent_tensor.to(copy=True) #6122

Merged
merged 17 commits into from
Sep 2, 2021
Merged

Conversation

lixinqi
Copy link
Contributor

@lixinqi lixinqi commented Sep 1, 2021

No description provided.

@CLAassistant
Copy link

CLAassistant commented Sep 1, 2021

CLA assistant check
All committers have signed the CLA.

@@ -185,7 +185,7 @@ def to_op(input, *args, **kwargs):
)

if copy is True:
raise TypeError("A consistent tensor do not support to(copy=True)")
return input.detach().clone()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里直接这样改不对?还是需要传到 _consistent_tensor_to 里,因为上面的 device dtype 可能有值

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里应该把 copy 传到 _consistent_tensor_to 里面去,然后在 55 行做下改动

if device_type == input.placement.device_type and dtype == input.dtype:
    if copy:
        return input.detach().clone()
    else:
        return input

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

是不是应该直接把copy的逻辑放到_consistent_tensor_to里面去,和_tensor_to接口对齐。

目前这个to的逻辑在python里面有点重,后续找人挪到C++吧。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

看上去其实还不能有detach。因为原本有可能requires_grad就是为True。

@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot September 1, 2021 06:43
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot September 1, 2021 07:13
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot September 1, 2021 08:07
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot September 1, 2021 09:44
@oneflow-ci-bot oneflow-ci-bot self-requested a review September 1, 2021 11:01
@github-actions
Copy link
Contributor

github-actions bot commented Sep 1, 2021

CI failed, removing label automerge

@github-actions github-actions bot removed the automerge label Sep 1, 2021
@oneflow-ci-bot oneflow-ci-bot removed their request for review September 1, 2021 12:11
@@ -53,7 +53,7 @@ def _consistent_tensor_to(input, device_type, dtype):
assert isinstance(dtype, flow.dtype)

if device_type == input.placement.device_type and dtype == input.dtype:
return input
return input if not copy else input.clone()
Copy link
Contributor

@strint strint Sep 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert这行,失败的那个test_graph_asymmetric_io.py测试就可以过

cur_rank_phy_tensor = std::make_shared<MirroredTensor>(cur_rank_phy_tensor_impl);
} else {
const auto& dtype_symbol = JUST(DType::Get(dtype));
const auto& empty = JUST(functional::Empty(*cur_rank_phy_shape, dtype_symbol, device));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

虽然cur_shape_phy_shape为全0 shape。但如果不执行Empty op,将导致eager_blob_object的blob对象为空。

@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot September 1, 2021 17:02
@oneflow-ci-bot oneflow-ci-bot removed their request for review September 2, 2021 01:04
BlockingCounter bc(1);
JUST(PhysicalRun([&bc](InstructionsBuilder* builder) -> Maybe<void> {
JUST(builder->ComputeGlobalFrontSeqBarrier());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NNGraph同步vm的时候不应该操心其他rank的vm。

@@ -40,9 +40,11 @@ Maybe<void> Run(vm::InstructionMsgList* instr_msg_list) {
return Maybe<void>::Ok();
}

Maybe<void> SingleClientSync() {
Maybe<void> ClusterSync() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

重名的原因是这里兼职了旧版的SingleClientSync和MultiClientSync

@oneflow-ci-bot oneflow-ci-bot self-requested a review September 2, 2021 02:16
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot September 2, 2021 04:16
@github-actions
Copy link
Contributor

github-actions bot commented Sep 2, 2021

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 128.2ms (= 6409.3ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 140.7ms (= 7034.5ms / 50, input_shape=[16, 3, 224, 224])
Relative speed: 1.10 (= 140.7ms / 128.2ms)

OneFlow resnet50 time: 74.3ms (= 3714.7ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.3ms (= 4216.7ms / 50, input_shape=[8, 3, 224, 224])
Relative speed: 1.14 (= 84.3ms / 74.3ms)

OneFlow resnet50 time: 47.6ms (= 2381.1ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 58.9ms (= 2946.2ms / 50, input_shape=[4, 3, 224, 224])
Relative speed: 1.24 (= 58.9ms / 47.6ms)

OneFlow resnet50 time: 42.8ms (= 2142.0ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 51.2ms (= 2561.6ms / 50, input_shape=[2, 3, 224, 224])
Relative speed: 1.20 (= 51.2ms / 42.8ms)

OneFlow resnet50 time: 40.4ms (= 2018.3ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 42.4ms (= 2121.3ms / 50, input_shape=[1, 3, 224, 224])
Relative speed: 1.05 (= 42.4ms / 40.4ms)

OneFlow resnet50 time: 141.5ms (= 7075.2ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 157.0ms (= 7848.8ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
Relative speed: 1.11 (= 157.0ms / 141.5ms)

OneFlow resnet50 time: 92.9ms (= 4646.2ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 100.8ms (= 5037.7ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
Relative speed: 1.08 (= 100.8ms / 92.9ms)

OneFlow resnet50 time: 68.8ms (= 3437.6ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.8ms (= 3737.9ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
Relative speed: 1.09 (= 74.8ms / 68.8ms)

OneFlow resnet50 time: 64.4ms (= 3222.5ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.6ms (= 3528.6ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
Relative speed: 1.10 (= 70.6ms / 64.4ms)

OneFlow resnet50 time: 63.3ms (= 3163.1ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.6ms (= 3627.5ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
Relative speed: 1.15 (= 72.6ms / 63.3ms)

@oneflow-ci-bot oneflow-ci-bot merged commit 5987caf into master Sep 2, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the consistent_to branch September 2, 2021 05:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants