
Graph mode non contiguous tensor issue #8281

Merged · 19 commits into master, May 25, 2022

Conversation

xiacijie
Contributor

Make sure input tensors and parameter/buffer tensors are all contiguous in graph mode

@xiacijie xiacijie closed this May 24, 2022
@xiacijie xiacijie force-pushed the graph_mode_non_contiguous_tensor_issue branch from 303ede6 to dde8b8d Compare May 24, 2022 06:46
@xiacijie
Contributor Author

reopen

@xiacijie xiacijie reopened this May 24, 2022
@xiacijie xiacijie requested a review from strint May 24, 2022 07:18

def leaf_node_fn(node):
    if isinstance(node._value, Tensor) and not node._value.is_contiguous():
        node._value.contiguous_()
Contributor

@Flowingsun007 Flowingsun007 May 24, 2022

For input tensors, is the in-place contiguous call really needed?

Contributor Author

@xiacijie xiacijie May 24, 2022

For input tensors, both the in-place and the out-of-place versions work.

Contributor

I would still recommend the out-of-place version: it saves one op, so it is more efficient, and it also looks cleaner.
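To make the contiguity question above concrete, here is a minimal sketch using NumPy as a stand-in for OneFlow tensors. NumPy only offers the out-of-place form (`np.ascontiguousarray`); OneFlow's `Tensor.contiguous_()` is the in-place counterpart being discussed.

```python
import numpy as np

# NumPy stand-in for the tensors discussed above: a transpose is a strided
# view and therefore not C-contiguous.
x = np.arange(6, dtype=np.float32).reshape(2, 3)
t = x.T
assert not t.flags["C_CONTIGUOUS"]

# Out-of-place: returns a fresh contiguous copy; the original view is untouched.
c = np.ascontiguousarray(t)
assert c.flags["C_CONTIGUOUS"]
assert (c == t).all()
```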

@strint
Contributor

strint commented May 24, 2022

Besides inputs and state, there is a special kind of tensor called a free eager tensor:

https://github.com/Oneflow-Inc/oneflow/blob/master/python/oneflow/test/graph/test_graph_free_eager_tensor.py#L104

A free eager tensor is a Python global variable.

A free eager tensor should be made contiguous in AddFreeEagerTensorToVariableOp() in lazy_op_interpreter.cpp:

if (!input_tensor->is_contiguous()) {
    auto lazy_mode_disabled_guard = LazyMode::Guard(/*is_enabled*/ false);
    JUST(one::functional::InplaceToContiguous(input_tensor));
    JUST(vm::CurrentRankSync());
}
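For readers unfamiliar with the term, here is a sketch of the capture pattern behind a free eager tensor. This is purely illustrative: a plain Python list stands in for the tensor and `FakeGraph` is a hypothetical stand-in for `flow.nn.Graph`; it only shows how a module-level value gets pulled into `build()`.

```python
# Hypothetical illustration only: a plain list stands in for a tensor, and
# FakeGraph stands in for flow.nn.Graph.
free_eager_value = [1.0, 2.0, 3.0]  # module-level (global) variable

class FakeGraph:
    def build(self, x):
        # free_eager_value is neither a graph input nor module state; it is
        # looked up from the enclosing module scope, which is why the lazy
        # interpreter has to special-case such tensors.
        return [a + b for a, b in zip(x, free_eager_value)]

out = FakeGraph().build([1.0, 1.0, 1.0])
assert out == [2.0, 3.0, 4.0]
```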

@strint
Contributor

strint commented May 24, 2022

The contiguity guarantee on inputs here means that, whenever an input is non-contiguous, two extra vm instruction calls are potentially inserted on each graph input (a to_contiguous op plus the assign op from #8275).

This may break the earlier guarantee that no vm instruction calls occur between the dataloader and the graph input, which the pipelining dependency relies on.

So when later combining a dataloader with a graph for pipeline parallelism, the user needs to ensure that the dataloader's output tensors are contiguous.
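Under that constraint, the user-side guard could look like the sketch below. `ensure_contiguous` is a hypothetical helper and NumPy again stands in for OneFlow tensors; the real check would use `tensor.is_contiguous()` on the dataloader output before it enters the graph.

```python
import numpy as np

def ensure_contiguous(batch):
    # Hypothetical helper: copy the batch only when it is non-contiguous, so
    # no extra to_contiguous/assign instructions are triggered inside the graph.
    if not batch.flags["C_CONTIGUOUS"]:
        batch = np.ascontiguousarray(batch)
    return batch

# A channels-last permutation produces a non-contiguous view.
batch = np.zeros((4, 3, 8, 8), dtype=np.float32).transpose(0, 2, 3, 1)
out = ensure_contiguous(batch)
assert out.flags["C_CONTIGUOUS"]
```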

@xiacijie
Contributor Author

If the free eager tensor is non-contiguous, for example:

bias = flow.tensor(
    [[1, 2, 3], [3, 4, 5], [7, 7, 7]], dtype=flow.float32, device=device
)
free_eager_bias_non_contiguous = bias.transpose(0, 1)

class GraphTestNonContiguousTensors(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.model = ModuleTest(False, device)

    def build(self, input):
        res = self.model(input) + free_eager_bias_non_contiguous
        return res

Running the graph above causes this segmentation fault:

0x00007fffe60f0817 in oneflow::(anonymous namespace)::ToContiguousKernel<(oneflow::DeviceType)1, float>::Compute(oneflow::user_op::KernelComputeContext*) const () from /home/xiacijie/Project/oneflow/build/liboneflow.so
(gdb) bt
#0  0x00007fffe60f0817 in oneflow::(anonymous namespace)::ToContiguousKernel<(oneflow::DeviceType)1, float>::Compute(oneflow::user_op::KernelComputeContext*) const () from /home/xiacijie/Project/oneflow/build/liboneflow.so
#1  0x00007fffe4051f52 in oneflow::UserKernel::ForwardUserKernel(std::function<oneflow::Blob* (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&, oneflow::user_op::OpKernelState*) const ()
   from /home/xiacijie/Project/oneflow/build/liboneflow.so
#2  0x00007fffe405202f in oneflow::UserKernel::ForwardDataContent(oneflow::KernelContext*) const ()
   from /home/xiacijie/Project/oneflow/build/liboneflow.so
#3  0x00007fffe402fffb in oneflow::Kernel::Forward(oneflow::KernelContext*) const ()
   from /home/xiacijie/Project/oneflow/build/liboneflow.so
#4  0x00007fffe403052e in oneflow::Kernel::Launch(oneflow::KernelContext*) const ()
   from /home/xiacijie/Project/oneflow/build/liboneflow.so
#5  0x00007fffe40d7f00 in oneflow::(anonymous namespace)::LightActor<1, 0, signed char, oneflow::(anonymous namespace)::ArrayBaseIndex<signed char, 2>, oneflow::(anonymous namespace)::ArrayBaseStateContainer<signed char, 2> >::ProcessMsg(oneflow::ActorMsg const&) () from /home/xiacijie/Project/oneflow/build/liboneflow.so
#6  0x00007fffe4ef3f47 in oneflow::Thread::PollMsgChannel() ()
   from /home/xiacijie/Project/oneflow/build/liboneflow.so
#7  0x00007fffe4ef4168 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<oneflow::Thread::Thread(oneflow::StreamId const&)::{lambda()#1}> > >::_M_run() () from /home/xiacijie/Project/oneflow/build/liboneflow.so
#8  0x00007fffde3c2de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff7e3b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#10 0x00007ffff7d62293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

If the non-contiguous input tensor is instead handled in the AddFreeEagerTensorToVariableOp function in lazy_op_interpreter.cpp:

Maybe<void> AddFreeEagerTensorToVariableOp(const std::shared_ptr<Tensor>& input_tensor) {
  if (!input_tensor->is_contiguous()) {
    bool prev_mode = LazyMode::is_enabled();
    LazyMode::Guard lazy_mode_disabled_guard(false);
    JUST(functional::InplaceToContiguous(input_tensor));
    JUST(vm::CurrentRankSync());
    // note: this guard is scoped to the braces and destroyed immediately
    if (prev_mode) { LazyMode::Guard lazy_mode_enable_guard(true); }
  }
  // ...

Running the graph above still causes the same segmentation fault.

@strint
Contributor

strint commented May 25, 2022

I20220525 11:52:26.701298 3601208 lazy_op_interpreter.cpp:450] Lazy nn.Graph name GraphTestNonContiguousTensors_0 add op :
FreeEagerTensor-2 for FreeEagerTensor.
I20220525 11:52:26.701333 3601208 lazy_op_interpreter.cpp:938] Lazy nn.Graph name GraphTestNonContiguousTensors_0 try to add op:
name: "to_contiguous-1"
device_tag: "cpu"
scope_symbol_id: 4611686018427412479
loc: "Python Stack[-2]: <frame at 0x7f5ba00eed60, file \'/home/xiacijie/Project/oneflow/python/oneflow/test/graph/test_graph_non_contiguous_tensors.py\', line 62, code build>; Python Stack[-1]: <frame at 0x7f5ba00f75b0, file \'/home/xiacijie/Project/oneflow/python/oneflow/framework/tensor.py\', line 235, code _add>;  ... 17 more; "
user_conf {
  op_type_name: "to_contiguous"
  input {
    key: "in"
    value {
      s: "FreeEagerTensor-2/out"
    }
  }
  output {
    key: "out"
    value {
      s: "to_contiguous-1/out_0"
    }
  }
  input_order: "in"
  output_order: "out"
}

A to_contiguous op is still being inserted after the free eager tensor.

@xiacijie please push the code that reproduces the problem so we can take a look; the ordering in the code may be the issue.

@xiacijie xiacijie requested a review from chengtbf as a code owner May 25, 2022 04:40
@xiacijie xiacijie requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 25, 2022 07:14
@github-actions
Contributor

CI failed when running job: cpu-module. PR label automerge has been removed

@github-actions
Contributor

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8281/

@github-actions
Contributor

CI failed when running job: cpu-misc. PR label automerge has been removed


@github-actions
Contributor

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 130.2ms (= 13019.1ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.2ms (= 14217.4ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.09 (= 142.2ms / 130.2ms)

OneFlow resnet50 time: 78.4ms (= 7844.2ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 88.6ms (= 8864.2ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.13 (= 88.6ms / 78.4ms)

OneFlow resnet50 time: 53.9ms (= 10776.9ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 58.9ms (= 11787.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.09 (= 58.9ms / 53.9ms)

OneFlow resnet50 time: 41.5ms (= 8309.8ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 45.7ms (= 9130.5ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.10 (= 45.7ms / 41.5ms)

OneFlow resnet50 time: 36.9ms (= 7384.7ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 36.8ms (= 7364.5ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.00 (= 36.8ms / 36.9ms)

OneFlow swin dataloader time: 0.251s (= 50.116s / 200, num_workers=1)
PyTorch swin dataloader time: 0.149s (= 29.877s / 200, num_workers=1)
Relative speed: 0.596 (= 0.149s / 0.251s)

OneFlow swin dataloader time: 0.067s (= 13.418s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.295s / 200, num_workers=4)
Relative speed: 0.618 (= 0.041s / 0.067s)

OneFlow swin dataloader time: 0.036s (= 7.182s / 200, num_workers=8)
PyTorch swin dataloader time: 0.023s (= 4.581s / 200, num_workers=8)
Relative speed: 0.638 (= 0.023s / 0.036s)

❌ OneFlow resnet50 time: 146.8ms (= 14681.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 172.7ms (= 17267.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 172.7ms / 146.8ms)

OneFlow resnet50 time: 99.2ms (= 9915.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 112.9ms (= 11293.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.14 (= 112.9ms / 99.2ms)

OneFlow resnet50 time: 75.4ms (= 15078.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 88.8ms (= 17751.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 88.8ms / 75.4ms)

OneFlow resnet50 time: 59.9ms (= 11985.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 76.4ms (= 15286.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.28 (= 76.4ms / 59.9ms)

OneFlow resnet50 time: 54.6ms (= 10918.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.0ms (= 13804.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.26 (= 69.0ms / 54.6ms)


@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 25, 2022 17:50
@github-actions
Contributor

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 130.7ms (= 13067.6ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 146.1ms (= 14607.8ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.12 (= 146.1ms / 130.7ms)

OneFlow resnet50 time: 77.8ms (= 7775.1ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 89.6ms (= 8962.6ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.15 (= 89.6ms / 77.8ms)

OneFlow resnet50 time: 54.6ms (= 10916.1ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 61.1ms (= 12222.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.12 (= 61.1ms / 54.6ms)

OneFlow resnet50 time: 43.1ms (= 8617.9ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 45.1ms (= 9018.3ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.05 (= 45.1ms / 43.1ms)

OneFlow resnet50 time: 37.3ms (= 7451.9ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 36.3ms (= 7253.1ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 0.97 (= 36.3ms / 37.3ms)

OneFlow swin dataloader time: 0.251s (= 50.248s / 200, num_workers=1)
PyTorch swin dataloader time: 0.152s (= 30.416s / 200, num_workers=1)
Relative speed: 0.605 (= 0.152s / 0.251s)

OneFlow swin dataloader time: 0.065s (= 13.095s / 200, num_workers=4)
PyTorch swin dataloader time: 0.043s (= 8.695s / 200, num_workers=4)
Relative speed: 0.664 (= 0.043s / 0.065s)

OneFlow swin dataloader time: 0.056s (= 11.167s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.402s / 200, num_workers=8)
Relative speed: 0.394 (= 0.022s / 0.056s)

❌ OneFlow resnet50 time: 147.6ms (= 14762.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 167.0ms (= 16696.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.13 (= 167.0ms / 147.6ms)

OneFlow resnet50 time: 97.6ms (= 9757.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 113.2ms (= 11320.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 113.2ms / 97.6ms)

OneFlow resnet50 time: 73.4ms (= 14680.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 87.1ms (= 17421.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 87.1ms / 73.4ms)

OneFlow resnet50 time: 63.4ms (= 12677.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 73.5ms (= 14709.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.16 (= 73.5ms / 63.4ms)

OneFlow resnet50 time: 54.9ms (= 10980.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.8ms (= 15952.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.45 (= 79.8ms / 54.9ms)

@xiacijie xiacijie merged commit f338fab into master May 25, 2022
@xiacijie xiacijie deleted the graph_mode_non_contiguous_tensor_issue branch May 25, 2022 18:36