
inductor: make onednn linear inputs are always real contiguous #108560

Closed

Conversation


@XiaobingSuper XiaobingSuper commented Sep 5, 2023

Stack from ghstack (oldest at bottom):

For oneDNN linear, if the packed linear's inputs are not default-contiguous tensors, it always falls back to the reference path and gets worse performance. This PR forces its inputs to be truly default-contiguous tensors.
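To illustrate the underlying issue: a tensor can report `is_contiguous() == True` while its strides differ from the default row-major layout, because size-1 dimensions accept any stride. A minimal eager-mode sketch (the shapes and the `clone` workaround here are illustrative, not the PR's actual inductor-side fix):

```python
import torch

# A size-1 dimension accepts any stride, so this view reports contiguous
# even though its strides are not the default row-major ones.
base = torch.randn(1, 16, 1, 1)
v = base.as_strided([1, 16], [0, 1])

assert v.is_contiguous()          # True despite the non-default stride
assert v.stride() == (0, 1)       # default strides for [1, 16] would be (16, 1)

# .contiguous() is a no-op here (the tensor already reports contiguous),
# so an explicit copy with the default layout is needed to get the
# strides a packed kernel expects.
fixed = v.clone(memory_format=torch.contiguous_format)
assert fixed.stride() == (16, 1)
assert torch.equal(fixed, v)
```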

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov

@pytorch-bot

pytorch-bot bot commented Sep 5, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/108560

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 1f9bb88 with merge base ff38c0e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

XiaobingSuper added a commit that referenced this pull request Sep 5, 2023
ghstack-source-id: 57fdce6b16c1bc5db5100956ea5b297b02f13a6f
Pull Request resolved: #108560
@XiaobingSuper XiaobingSuper changed the title from "inductor: make onednn fuse linear inputs are always real contiguous" to "inductor: make onednn linear inputs are always real contiguous" Sep 5, 2023
@XiaobingSuper XiaobingSuper requested a review from jgong5 September 5, 2023 09:43
Collaborator

@jgong5 jgong5 left a comment


Is it a regression due to the onednn upgrade? Perhaps it would be better to address this in onednn instead?

Collaborator

@jgong5 jgong5 left a comment


Realized that we require the input to be contiguous in the C++ kernel, so doing require_contiguous in the inductor should be correct.

Contributor

@desertfire desertfire left a comment


If you are going to address my comment on test_linear_with_no_default_contiguous_input, you may also consider fixing the PR title's grammar.

```python
def test_linear_with_no_default_contiguous_input(self):
    mod = torch.nn.Sequential(torch.nn.Linear(16, 16)).eval()
    temp = torch.randn(1, 16, 1, 1)
    v = torch.ops.inductor._reinterpret_tensor(temp, [1, 16], [0, 1], 0)
```
Contributor


Nit: why use _reinterpret_tensor instead of as_strided? _reinterpret_tensor is meant to be an internal op.

Collaborator Author


Updated, using as_strided now.
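For reference, a standalone eager-mode version of the revised input construction might look like the following sketch (the module and shapes mirror the test snippet above; the forward pass and checks are illustrative):

```python
import torch

# as_strided builds the same non-default-contiguous view that the internal
# torch.ops.inductor._reinterpret_tensor call produced.
mod = torch.nn.Sequential(torch.nn.Linear(16, 16)).eval()
temp = torch.randn(1, 16, 1, 1)
v = temp.as_strided([1, 16], [0, 1])
assert v.stride() == (0, 1)

with torch.no_grad():
    out = mod(v)
assert out.shape == (1, 16)
```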

@XiaobingSuper XiaobingSuper added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 8, 2023
@XiaobingSuper
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.


@XiaobingSuper
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Sep 11, 2023
…buffer (#108635)

When viewing an ExternKernelAlloc buffer, there is always a redundant memory copy:
```
buf0: ExternKernelSchedulerNode(MKLPackedLinear)
buf0.writes = [StarDep(name='buf0')]
buf0.unmet_dependencies = []
buf0.met_dependencies = [StarDep(name='arg1_1'), StarDep(name='constant0'), StarDep(name='constant1')]
buf0.users = [NodeUser(node=SchedulerNode(name='buf1'), can_inplace=True, is_weak=False)]
buf0.node.kernel = torch.ops.mkl._mkl_linear

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep('buf1', c0, {c0: 64})]
buf1.unmet_dependencies = [MemoryDep('buf0', c0, {c0: 64})]
buf1.met_dependencies = []
buf1.users = [NodeUser(node=OUTPUT, can_inplace=False, is_weak=False)]
buf1.group.device = cpu
buf1.group.iteration = ((64,), ())
buf1.sizes = ([64], [])
class buf1_loop_body:
    var_ranges = {z0: 64}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf0', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store
```

and the cpp backend-generated code is:
```
cpp_fused_view_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h"
extern "C" void kernel(float* in_out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(16L))
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + static_cast<long>(i0));
                tmp0.store(in_out_ptr0 + static_cast<long>(i0));
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg1_1, = args
    args.clear()
    assert_size_stride(arg1_1, (4, 16), (16, 1))
    buf0 = torch.ops.mkl._mkl_linear(arg1_1, constant1, constant0, None, 4)
    del arg1_1
    buf1 = reinterpret_tensor(buf0, (4, 4, 4), (16, 4, 1)); del buf0  # reuse
    cpp_fused_view_0(c_void_p(buf1.data_ptr()))
    return (buf1, )
```

For the ExternKernelAlloc buffer, we can do a real view, rather than a memory copy.
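In eager terms, the difference is between running a copy kernel and taking a zero-copy view that merely reinterprets strides; a small sketch (buffer names and contents are illustrative stand-ins for the generated code above):

```python
import torch

# When the producing kernel's output already has the default contiguous
# layout, reshaping it is a real view: the storage is shared, so no copy
# kernel (like cpp_fused_view_0 above) needs to run.
buf0 = torch.randn(4, 16)          # stand-in for the _mkl_linear output
buf1 = buf0.view(4, 4, 4)          # zero-copy reinterpretation

assert buf1.data_ptr() == buf0.data_ptr()   # same storage, new strides
assert buf1.shape == (4, 4, 4)
```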

Pull Request resolved: #108635
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #108560
@facebook-github-bot facebook-github-bot deleted the gh/XiaobingSuper/162/head branch September 14, 2023 14:22
michiboo pushed a commit to michiboo/pytorch that referenced this pull request Sep 17, 2023
…ch#108560)

For oneDNN linear, if the packed linear's inputs are not default-contiguous tensors, it always falls back to the reference path and gets worse performance. This PR forces its inputs to be truly default-contiguous tensors.

Pull Request resolved: pytorch#108560
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
michiboo pushed a commit to michiboo/pytorch that referenced this pull request Sep 17, 2023
…buffer (pytorch#108635)
