
Graph mode non contiguous tensor issue #8281

Merged · 19 commits into master, May 25, 2022

Conversation

xiacijie
Contributor

Make sure input tensors and parameter/buffer tensors are all contiguous in graph mode

@xiacijie xiacijie closed this May 24, 2022
@xiacijie xiacijie force-pushed the graph_mode_non_contiguous_tensor_issue branch from 303ede6 to dde8b8d Compare May 24, 2022 06:46
@xiacijie
Contributor Author

reopen

@xiacijie xiacijie reopened this May 24, 2022
@xiacijie xiacijie requested a review from strint May 24, 2022 07:18

def leaf_node_fn(node):
    if isinstance(node._value, Tensor) and not node._value.is_contiguous():
        node._value.contiguous_()
Contributor

@Flowingsun007 Flowingsun007 May 24, 2022

For input tensors, is the in-place contiguous call really needed?

Contributor Author

@xiacijie xiacijie May 24, 2022

For input tensors, both the in-place and the out-of-place versions work.

Contributor

I would still recommend the out-of-place version: it saves one op, so it is more efficient, and it also looks cleaner.
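To make the contiguity question above concrete, here is a minimal sketch using NumPy as a stand-in for OneFlow tensors. NumPy only offers the out-of-place form (`np.ascontiguousarray`); OneFlow's `Tensor.contiguous_()` is the in-place counterpart being discussed.

```python
import numpy as np

# NumPy stand-in for the tensors discussed above: a transpose is a strided
# view and therefore not C-contiguous.
x = np.arange(6, dtype=np.float32).reshape(2, 3)
t = x.T
assert not t.flags["C_CONTIGUOUS"]

# Out-of-place: returns a fresh contiguous copy; the original view is untouched.
c = np.ascontiguousarray(t)
assert c.flags["C_CONTIGUOUS"]
assert (c == t).all()
```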

@strint
Contributor

strint commented May 24, 2022

Besides inputs and state, there is a special kind of tensor called a free eager tensor:

https://github.com/Oneflow-Inc/oneflow/blob/master/python/oneflow/test/graph/test_graph_free_eager_tensor.py#L104

A free eager tensor is a Python global variable.

A free eager tensor should be made contiguous in AddFreeEagerTensorToVariableOp() in lazy_op_interpreter.cpp:

if (!input_tensor->is_contiguous()) {
    auto lazy_mode_disabled_guard = LazyMode::Guard(/*is_enabled*/ false);
    JUST(one::functional::InplaceToContiguous(input_tensor));
    JUST(vm::CurrentRankSync());
}
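For readers unfamiliar with the term, here is a sketch of the capture pattern behind a free eager tensor. This is purely illustrative: a plain Python list stands in for the tensor and `FakeGraph` is a hypothetical stand-in for `flow.nn.Graph`; it only shows how a module-level value gets pulled into `build()`.

```python
# Hypothetical illustration only: a plain list stands in for a tensor, and
# FakeGraph stands in for flow.nn.Graph.
free_eager_value = [1.0, 2.0, 3.0]  # module-level (global) variable

class FakeGraph:
    def build(self, x):
        # free_eager_value is neither a graph input nor module state; it is
        # looked up from the enclosing module scope, which is why the lazy
        # interpreter has to special-case such tensors.
        return [a + b for a, b in zip(x, free_eager_value)]

out = FakeGraph().build([1.0, 1.0, 1.0])
assert out == [2.0, 3.0, 4.0]
```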

@strint
Contributor

strint commented May 24, 2022

The contiguity guarantee on inputs here means that, whenever an input is non-contiguous, two extra vm instruction calls are potentially inserted on each graph input (a to_contiguous op plus the assign op from #8275).

This may break the earlier guarantee that no vm instruction calls occur between the dataloader and the graph input, which the pipelining dependency relies on.

So when later combining a dataloader with a graph for pipeline parallelism, the user needs to ensure that the dataloader's output tensors are contiguous.
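Under that constraint, the user-side guard could look like the sketch below. `ensure_contiguous` is a hypothetical helper and NumPy again stands in for OneFlow tensors; the real check would use `tensor.is_contiguous()` on the dataloader output before it enters the graph.

```python
import numpy as np

def ensure_contiguous(batch):
    # Hypothetical helper: copy the batch only when it is non-contiguous, so
    # no extra to_contiguous/assign instructions are triggered inside the graph.
    if not batch.flags["C_CONTIGUOUS"]:
        batch = np.ascontiguousarray(batch)
    return batch

# A channels-last permutation produces a non-contiguous view.
batch = np.zeros((4, 3, 8, 8), dtype=np.float32).transpose(0, 2, 3, 1)
out = ensure_contiguous(batch)
assert out.flags["C_CONTIGUOUS"]
```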

@xiacijie
Contributor Author

If the free eager tensor is non-contiguous, for example:

bias = flow.tensor(
    [[1, 2, 3], [3, 4, 5], [7, 7, 7]], dtype=flow.float32, device=device
)
free_eager_bias_non_contiguous = bias.transpose(0, 1)

class GraphTestNonContiguousTensors(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.model = ModuleTest(False, device)

    def build(self, input):
        res = self.model(input) + free_eager_bias_non_contiguous
        return res

Running the graph above causes this segmentation fault:

0x00007fffe60f0817 in oneflow::(anonymous namespace)::ToContiguousKernel<(oneflow::DeviceType)1, float>::Compute(oneflow::user_op::KernelComputeContext*) const () from /home/xiacijie/Project/oneflow/build/liboneflow.so
(gdb) bt
#0  0x00007fffe60f0817 in oneflow::(anonymous namespace)::ToContiguousKernel<(oneflow::DeviceType)1, float>::Compute(oneflow::user_op::KernelComputeContext*) const () from /home/xiacijie/Project/oneflow/build/liboneflow.so
#1  0x00007fffe4051f52 in oneflow::UserKernel::ForwardUserKernel(std::function<oneflow::Blob* (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&, oneflow::user_op::OpKernelState*) const ()
   from /home/xiacijie/Project/oneflow/build/liboneflow.so
#2  0x00007fffe405202f in oneflow::UserKernel::ForwardDataContent(oneflow::KernelContext*) const ()
   from /home/xiacijie/Project/oneflow/build/liboneflow.so
#3  0x00007fffe402fffb in oneflow::Kernel::Forward(oneflow::KernelContext*) const ()
   from /home/xiacijie/Project/oneflow/build/liboneflow.so
#4  0x00007fffe403052e in oneflow::Kernel::Launch(oneflow::KernelContext*) const ()
   from /home/xiacijie/Project/oneflow/build/liboneflow.so
#5  0x00007fffe40d7f00 in oneflow::(anonymous namespace)::LightActor<1, 0, signed char, oneflow::(anonymous namespace)::ArrayBaseIndex<signed char, 2>, oneflow::(anonymous namespace)::ArrayBaseStateContainer<signed char, 2> >::ProcessMsg(oneflow::ActorMsg const&) () from /home/xiacijie/Project/oneflow/build/liboneflow.so
#6  0x00007fffe4ef3f47 in oneflow::Thread::PollMsgChannel() ()
   from /home/xiacijie/Project/oneflow/build/liboneflow.so
#7  0x00007fffe4ef4168 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<oneflow::Thread::Thread(oneflow::StreamId const&)::{lambda()#1}> > >::_M_run() () from /home/xiacijie/Project/oneflow/build/liboneflow.so
#8  0x00007fffde3c2de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff7e3b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#10 0x00007ffff7d62293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

If the non-contiguous input tensor is instead handled in the AddFreeEagerTensorToVariableOp function in lazy_op_interpreter.cpp:

Maybe<void> AddFreeEagerTensorToVariableOp(const std::shared_ptr<Tensor>& input_tensor) {
  if (!input_tensor->is_contiguous()) {
    bool prev_mode = LazyMode::is_enabled();
    LazyMode::Guard lazy_mode_disabled_guard(false);
    JUST(functional::InplaceToContiguous(input_tensor));
    JUST(vm::CurrentRankSync());
    // note: this guard is scoped to the braces and destroyed immediately
    if (prev_mode) { LazyMode::Guard lazy_mode_enable_guard(true); }
  }
  // ...

Running the graph above still causes the same segmentation fault.

@strint
Contributor

strint commented May 25, 2022

I20220525 11:52:26.701298 3601208 lazy_op_interpreter.cpp:450] Lazy nn.Graph name GraphTestNonContiguousTensors_0 add op :
FreeEagerTensor-2 for FreeEagerTensor.
I20220525 11:52:26.701333 3601208 lazy_op_interpreter.cpp:938] Lazy nn.Graph name GraphTestNonContiguousTensors_0 try to add op:
name: "to_contiguous-1"
device_tag: "cpu"
scope_symbol_id: 4611686018427412479
loc: "Python Stack[-2]: <frame at 0x7f5ba00eed60, file \'/home/xiacijie/Project/oneflow/python/oneflow/test/graph/test_graph_non_contiguous_tensors.py\', line 62, code build>; Python Stack[-1]: <frame at 0x7f5ba00f75b0, file \'/home/xiacijie/Project/oneflow/python/oneflow/framework/tensor.py\', line 235, code _add>;  ... 17 more; "
user_conf {
  op_type_name: "to_contiguous"
  input {
    key: "in"
    value {
      s: "FreeEagerTensor-2/out"
    }
  }
  output {
    key: "out"
    value {
      s: "to_contiguous-1/out_0"
    }
  }
  input_order: "in"
  output_order: "out"
}

A to_contiguous op is still being inserted after the free eager tensor.

@xiacijie please push the code that reproduces the problem so we can take a look; the ordering in the code may be the issue.

@xiacijie xiacijie requested a review from chengtbf as a code owner May 25, 2022 04:40
@xiacijie xiacijie requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 25, 2022 07:14
@github-actions
Contributor

CI failed when running job: cpu-module. PR label automerge has been removed

@github-actions
Contributor

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8281/

@github-actions
Contributor

CI failed when running job: cpu-misc. PR label automerge has been removed


@github-actions
Contributor

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 130.2ms (= 13019.1ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.2ms (= 14217.4ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.09 (= 142.2ms / 130.2ms)

OneFlow resnet50 time: 78.4ms (= 7844.2ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 88.6ms (= 8864.2ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.13 (= 88.6ms / 78.4ms)

OneFlow resnet50 time: 53.9ms (= 10776.9ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 58.9ms (= 11787.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.09 (= 58.9ms / 53.9ms)

OneFlow resnet50 time: 41.5ms (= 8309.8ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 45.7ms (= 9130.5ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.10 (= 45.7ms / 41.5ms)

OneFlow resnet50 time: 36.9ms (= 7384.7ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 36.8ms (= 7364.5ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.00 (= 36.8ms / 36.9ms)

OneFlow swin dataloader time: 0.251s (= 50.116s / 200, num_workers=1)
PyTorch swin dataloader time: 0.149s (= 29.877s / 200, num_workers=1)
Relative speed: 0.596 (= 0.149s / 0.251s)

OneFlow swin dataloader time: 0.067s (= 13.418s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.295s / 200, num_workers=4)
Relative speed: 0.618 (= 0.041s / 0.067s)

OneFlow swin dataloader time: 0.036s (= 7.182s / 200, num_workers=8)
PyTorch swin dataloader time: 0.023s (= 4.581s / 200, num_workers=8)
Relative speed: 0.638 (= 0.023s / 0.036s)

❌ OneFlow resnet50 time: 146.8ms (= 14681.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 172.7ms (= 17267.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 172.7ms / 146.8ms)

OneFlow resnet50 time: 99.2ms (= 9915.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 112.9ms (= 11293.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.14 (= 112.9ms / 99.2ms)

OneFlow resnet50 time: 75.4ms (= 15078.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 88.8ms (= 17751.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 88.8ms / 75.4ms)

OneFlow resnet50 time: 59.9ms (= 11985.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 76.4ms (= 15286.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.28 (= 76.4ms / 59.9ms)

OneFlow resnet50 time: 54.6ms (= 10918.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.0ms (= 13804.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.26 (= 69.0ms / 54.6ms)


@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 25, 2022 17:50
@github-actions
Contributor

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 130.7ms (= 13067.6ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 146.1ms (= 14607.8ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.12 (= 146.1ms / 130.7ms)

OneFlow resnet50 time: 77.8ms (= 7775.1ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 89.6ms (= 8962.6ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.15 (= 89.6ms / 77.8ms)

OneFlow resnet50 time: 54.6ms (= 10916.1ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 61.1ms (= 12222.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.12 (= 61.1ms / 54.6ms)

OneFlow resnet50 time: 43.1ms (= 8617.9ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 45.1ms (= 9018.3ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.05 (= 45.1ms / 43.1ms)

OneFlow resnet50 time: 37.3ms (= 7451.9ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 36.3ms (= 7253.1ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 0.97 (= 36.3ms / 37.3ms)

OneFlow swin dataloader time: 0.251s (= 50.248s / 200, num_workers=1)
PyTorch swin dataloader time: 0.152s (= 30.416s / 200, num_workers=1)
Relative speed: 0.605 (= 0.152s / 0.251s)

OneFlow swin dataloader time: 0.065s (= 13.095s / 200, num_workers=4)
PyTorch swin dataloader time: 0.043s (= 8.695s / 200, num_workers=4)
Relative speed: 0.664 (= 0.043s / 0.065s)

OneFlow swin dataloader time: 0.056s (= 11.167s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.402s / 200, num_workers=8)
Relative speed: 0.394 (= 0.022s / 0.056s)

❌ OneFlow resnet50 time: 147.6ms (= 14762.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 167.0ms (= 16696.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.13 (= 167.0ms / 147.6ms)

OneFlow resnet50 time: 97.6ms (= 9757.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 113.2ms (= 11320.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 113.2ms / 97.6ms)

OneFlow resnet50 time: 73.4ms (= 14680.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 87.1ms (= 17421.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 87.1ms / 73.4ms)

OneFlow resnet50 time: 63.4ms (= 12677.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 73.5ms (= 14709.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.16 (= 73.5ms / 63.4ms)

OneFlow resnet50 time: 54.9ms (= 10980.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.8ms (= 15952.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.45 (= 79.8ms / 54.9ms)

@xiacijie xiacijie merged commit f338fab into master May 25, 2022
@xiacijie xiacijie deleted the graph_mode_non_contiguous_tensor_issue branch May 25, 2022 18:36