Graph mode non contiguous tensor issue #8281
…neflow-Inc/oneflow into graph_mode_non_contiguous_tensor_issue
def leaf_node_fn(node):
    if isinstance(node._value, Tensor) and not node._value.is_contiguous():
        node._value.contiguous_()
For input tensors, is the in-place contiguous call really necessary?
For input tensors, either the in-place or the non-in-place version works.
I would still recommend the non-in-place version: it saves one op, so it is more efficient, and it also reads cleaner.
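To make the in-place versus non-in-place distinction concrete, here is a minimal sketch on a toy strided tensor. `ToyTensor` and this value-based `leaf_node_fn` are hypothetical stand-ins written for illustration; they are not OneFlow's `Tensor` or the function in this PR, and only mirror the `is_contiguous()` / `contiguous_()` names seen in the diff.

```python
from itertools import product


class ToyTensor:
    """Hypothetical strided tensor used only for illustration."""

    def __init__(self, storage, shape, strides):
        self.storage = list(storage)
        self.shape = tuple(shape)
        self.strides = tuple(strides)

    def _row_major_strides(self):
        # Strides a densely packed row-major tensor of this shape would have.
        strides, step = [], 1
        for dim in reversed(self.shape):
            strides.append(step)
            step *= dim
        return tuple(reversed(strides))

    def is_contiguous(self):
        return self.strides == self._row_major_strides()

    def _elements_in_logical_order(self):
        # Walk every index in row-major order and read through the strides.
        out = []
        for index in product(*(range(d) for d in self.shape)):
            offset = sum(i * s for i, s in zip(index, self.strides))
            out.append(self.storage[offset])
        return out

    def contiguous(self):
        """Non-in-place: return a contiguous tensor, leaving self untouched."""
        if self.is_contiguous():
            return self
        return ToyTensor(
            self._elements_in_logical_order(), self.shape, self._row_major_strides()
        )

    def contiguous_(self):
        """In-place: rewrite this tensor's own storage and strides."""
        if not self.is_contiguous():
            self.storage = self._elements_in_logical_order()
            self.strides = self._row_major_strides()
        return self


def leaf_node_fn(value):
    # Non-in-place variant: hand back a contiguous tensor, do not mutate the input.
    if isinstance(value, ToyTensor) and not value.is_contiguous():
        return value.contiguous()
    return value


# A 2x3 buffer viewed as its 3x2 transpose: same storage, swapped strides.
transposed = ToyTensor([0, 1, 2, 3, 4, 5], shape=(3, 2), strides=(1, 3))
result = leaf_node_fn(transposed)
```

In this sketch the caller's tensor keeps its original layout, while `result` is the densely packed copy that would be passed on.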
Other than inputs and states, there is a special kind of tensor called a free eager tensor: a free eager tensor is a Python global variable. Free eager tensors should be made contiguous in
    if (!input_tensor->is_contiguous()) {
      // Temporarily leave lazy mode so the contiguous op runs eagerly.
      auto lazy_mode_disabled_guard = LazyMode::Guard(/*is_enabled*/ false);
      JUST(one::functional::InplaceToContiguous(input_tensor));
      JUST(vm::CurrentRankSync());
    }
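This guarantees that the input is contiguous. If the input is non-contiguous, however, every graph input potentially triggers two extra vm instruction calls (a to-contiguous op and an assign op, #8275). That may break the earlier arrangement in which no vm instruction call happens between the dataloader and the graph input, which is what guarantees the pipelining dependency. Consequently, when the dataloader and the graph are later combined for pipeline parallelism, users need to make sure that the tensors output by the dataloader are contiguous.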
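One way a user could guard against this is to check batches before they reach the graph. The sketch below only reads the contiguity flag and does not itself call any contiguous op. The helper names `non_contiguous_positions` and `check_batch` are hypothetical, and `StubTensor` is a stand-in used only to exercise them; the only assumption about real tensors is that they expose `is_contiguous()`, as in the diff above.

```python
def non_contiguous_positions(batch):
    """Return positions of items in `batch` that report themselves non-contiguous."""
    items = batch if isinstance(batch, (tuple, list)) else (batch,)
    return [
        pos
        for pos, item in enumerate(items)
        if hasattr(item, "is_contiguous") and not item.is_contiguous()
    ]


def check_batch(batch):
    """Raise early instead of letting extra ops be inserted at graph input."""
    bad = non_contiguous_positions(batch)
    if bad:
        raise ValueError(
            f"non-contiguous tensors at positions {bad}; "
            "make them contiguous inside the data-loading stage"
        )
    return batch


class StubTensor:
    """Stand-in with just enough surface to exercise the check; not a real tensor."""

    def __init__(self, contiguous):
        self._contiguous = contiguous

    def is_contiguous(self):
        return self._contiguous


good_batch = (StubTensor(True), 7)
bad_batch = (StubTensor(True), StubTensor(False))

checked = check_batch(good_batch)
bad_positions = non_contiguous_positions(bad_batch)
```

The actual fix (for example calling a contiguous op in the dataset or collate step) is left to the data-loading code, since where that call runs determines whether the pipeline dependency is preserved.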
If the free eager tensor is non-contiguous, for example:
Running the graph above causes this segmentation fault:
If, in lazy_op_interpreter.cpp,
running the graph above still causes the same segmentation fault.
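A contiguous op still gets inserted after the free eager tensor. @xiacijie, could you push the code that reproduces the problem so we can take a look? The statements may simply be in the wrong order.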
CI failed when running job: cpu-module. PR label automerge has been removed
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8281/
CI failed when running job: cpu-misc. PR label automerge has been removed
Make sure input tensors and parameter/buffer tensors are all contiguous in graph mode