inductor: make onednn linear inputs always real contiguous #108560
Conversation
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/108560
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ No Failures as of commit 1f9bb88 with merge base ff38c0e.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 57fdce6b16c1bc5db5100956ea5b297b02f13a6f Pull Request resolved: #108560
Is it a regression due to the onednn upgrade? Perhaps it would be better to address this in onednn instead?
I realized that we require the input to be contiguous in the C++ kernel, so doing require_contiguous in the inductor should be correct.
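For context, a minimal sketch (editor-added, not part of this PR) of why a plain `is_contiguous()` check is not enough here: a tensor can report `is_contiguous() == True` while its strides still differ from the default row-major layout that the oneDNN kernel expects.

```python
import torch

# Hedged illustration only: a [1, 16] strided view that passes is_contiguous()
# because size-1 dimensions are ignored by the contiguity check, yet whose
# strides (0, 1) differ from the default row-major strides (16, 1).
temp = torch.randn(1, 16, 1, 1)
v = torch.as_strided(temp, [1, 16], [0, 1])

print(v.is_contiguous())            # True
print(v.stride())                   # (0, 1)
print(torch.empty(1, 16).stride())  # (16, 1), the default contiguous strides

# contiguous() is a no-op here because is_contiguous() is already True,
# so the non-default strides survive:
print(v.contiguous().stride())      # (0, 1)
```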
If you are going to address my comment on test_linear_with_no_default_contiguous_input, you may also consider fixing the PR title (as in grammar).
test/inductor/test_cpu_repro.py (Outdated)
```python
def test_linear_with_no_default_contiguous_input(self):
    mod = torch.nn.Sequential(torch.nn.Linear(16, 16)).eval()
    temp = torch.randn(1, 16, 1, 1)
    v = torch.ops.inductor._reinterpret_tensor(temp, [1, 16], [0, 1], 0)
```
Nit: why use _reinterpret_tensor instead of as_strided? _reinterpret_tensor is meant to be an internal op.
Updated, using as_strided now.
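For reference, a sketch of the as_strided-based input construction (the authoritative version lives in test/inductor/test_cpu_repro.py; the compile/assert lines below are illustrative, not the exact test body):

```python
import torch

mod = torch.nn.Sequential(torch.nn.Linear(16, 16)).eval()
temp = torch.randn(1, 16, 1, 1)
# public op instead of the internal _reinterpret_tensor
v = torch.as_strided(temp, [1, 16], [0, 1], 0)

with torch.no_grad():
    opt_mod = torch.compile(mod)
    # The compiled model should match eager despite the non-default input strides.
    torch.testing.assert_close(opt_mod(v), mod(v))
```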
…uous" For OneDNN linear, if packed linear inputs are not the default contiguous tensor, it always calls in ref pat and gets a worse performance, this PR will force its inputs to the actual default contiguous tensor. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov [ghstack-poisoned]
…uous" For OneDNN linear, if packed linear inputs are not the default contiguous tensor, it always calls in ref pat and gets a worse performance, this PR will force its inputs to the actual default contiguous tensor. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov [ghstack-poisoned]
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 2 jobs have failed; the first few of them are: inductor / linux-jammy-cpu-py3.8-gcc11-inductor / test (inductor_torchbench_cpu_accuracy, 1, 1, linux.4xlarge), inductor / linux-jammy-cpu-py3.8-gcc11-inductor / test (inductor_torchbench_dynamic_cpu_accuracy, 1, 1, linux.12xlarge). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…buffer (#108635) When viewing an ExternKernelAlloc buffer, there is always a redundant memory copy:

```
buf0: ExternKernelSchedulerNode(MKLPackedLinear)
buf0.writes = [StarDep(name='buf0')]
buf0.unmet_dependencies = []
buf0.met_dependencies = [StarDep(name='arg1_1'), StarDep(name='constant0'), StarDep(name='constant1')]
buf0.users = [NodeUser(node=SchedulerNode(name='buf1'), can_inplace=True, is_weak=False)]
buf0.node.kernel = torch.ops.mkl._mkl_linear
buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep('buf1', c0, {c0: 64})]
buf1.unmet_dependencies = [MemoryDep('buf0', c0, {c0: 64})]
buf1.met_dependencies = []
buf1.users = [NodeUser(node=OUTPUT, can_inplace=False, is_weak=False)]
buf1.group.device = cpu
buf1.group.iteration = ((64,), ())
buf1.sizes = ([64], [])
class buf1_loop_body:
    var_ranges = {z0: 64}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf0', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store
```

and the cpp backend-generated code is:

```
cpp_fused_view_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h"
extern "C" void kernel(float* in_out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(16L))
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + static_cast<long>(i0));
                tmp0.store(in_out_ptr0 + static_cast<long>(i0));
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg1_1, = args
    args.clear()
    assert_size_stride(arg1_1, (4, 16), (16, 1))
    buf0 = torch.ops.mkl._mkl_linear(arg1_1, constant1, constant0, None, 4)
    del arg1_1
    buf1 = reinterpret_tensor(buf0, (4, 4, 4), (16, 4, 1)); del buf0  # reuse
    cpp_fused_view_0(c_void_p(buf1.data_ptr()))
    return (buf1, )
```

For the ExternKernelAlloc buffer, we can do a real view rather than a memory copy.

Pull Request resolved: #108635
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #108560
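A rough repro sketch of the pattern above (a packed linear whose output is immediately viewed). The module, shapes, and the freezing flag below are assumptions inferred from the generated code and are not taken from this PR; whether the packed torch.ops.mkl._mkl_linear path actually kicks in also depends on an MKL-enabled CPU build.

```python
import torch
import torch._inductor.config as inductor_config

# Assumption: weight freezing is what lets inductor pack the linear weight
# into MKL constants (constant0/constant1 above); the exact setup used for
# the scheduler dump in this PR is not shown here.
inductor_config.freezing = True

class LinearView(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # bias=False mirrors the `None` bias argument in the generated _mkl_linear call
        self.linear = torch.nn.Linear(16, 16, bias=False)

    def forward(self, x):
        # the view right after the extern (packed linear) kernel is what
        # previously produced the extra cpp_fused_view_0 copy
        return self.linear(x).view(4, 4, 4)

mod = LinearView().eval()
x = torch.randn(4, 16)
with torch.no_grad():
    out = torch.compile(mod)(x)
```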
…ch#108560) For OneDNN linear, if the packed linear inputs are not default-contiguous tensors, it always falls back to the reference path and gets worse performance; this PR forces its inputs to be actual default-contiguous tensors.

Pull Request resolved: pytorch#108560
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
Stack from ghstack (oldest at bottom):
For OneDNN linear, if the packed linear inputs are not default-contiguous tensors, it always falls back to the reference path and gets worse performance; this PR forces its inputs to be actual default-contiguous tensors.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov