Optimized GEMM & GEMV for Intel platforms #11


Closed
wants to merge 11 commits

Conversation

@gongzg gongzg commented Jan 9, 2017

This PR implements optimized GEMV and GEMM kernels for Intel Gen Graphics. For GEMM, we force the profile to always choose the image-based GEMM implementation, because on real workloads the image-based kernels consistently perform better. With the default tuning mechanism, isaac may choose different implementations that are much slower.

Lin, Lixiang and others added 11 commits December 15, 2016 09:31
Change-Id: Ic9edf18a3ae0f41b21c2ac374d50000fc5d4e6f3
Change-Id: I89f632e2598594805e24b6aa2d084dcfa1c4f218
v2: by zhigang, fix some warnings and remove half-related code.
v3: by lixiang, modify json file
v4: by junkai, optimize gemm image kernel and force isaac to run gemm image kernel.
v5: by junkai, change json file to force isaac to run gemm image kernel.
Change-Id: Ieab41924476bfc001f7026fbea3b5ea5e56eb00b
ptillet commented Jan 9, 2017

Hi,

Thanks a lot for the PR. I went through it fairly quickly; here are some comments. Let's make sure it eventually gets integrated at just a tiny maintenance cost.

  • I don't see the GEMV using any Intel OpenCL extension. The GEMV improvements should probably be analyzed in more depth to see what's wrong with the current template. Upgrading the current template would massively simplify the PR (no need to special-case GEMV), and the changes would carry over to row/col-wise reductions and any input shape.
  • Isaac should not depend on pthread or OpenCL at compile time. If there are missing dynamic loads for OpenCL, they should be added in driver/dispatch.cpp (a sketch of the idea follows this list). Currently this breaks compilation on NVidia machines and on machines that don't have OpenCL installed in the system's default path.
  • The new .json profile for Broadwell seems to suggest that the image kernel beats Isaac for every single shape. This is a bit surprising, especially considering that this kernel requires a copy from the OpenCL buffers to images. It also means that the improved buffer kernels are never used. I want to check what's going on on my Broadwell machine.
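For reference, the dynamic-load approach mentioned in the second bullet typically looks like the following minimal sketch; the names (load_opencl, clGetPlatformIDs_ptr) are hypothetical and not taken from the actual driver/dispatch.cpp:

```cpp
// Minimal sketch of runtime OpenCL loading via libdl (hypothetical names).
// Only the headers are needed at build time; there is no link-time
// dependency on libOpenCL.
#include <dlfcn.h>
#include <CL/cl.h>

// Function pointer matching the real clGetPlatformIDs signature.
using clGetPlatformIDs_t = cl_int (*)(cl_uint, cl_platform_id*, cl_uint*);
static clGetPlatformIDs_t clGetPlatformIDs_ptr = nullptr;

bool load_opencl() {
  // Try the usual SONAMEs; failure just means OpenCL is unavailable
  // and the library can fall back to another backend (e.g. CUDA).
  void* handle = dlopen("libOpenCL.so.1", RTLD_LAZY | RTLD_LOCAL);
  if (!handle) handle = dlopen("libOpenCL.so", RTLD_LAZY | RTLD_LOCAL);
  if (!handle) return false;
  clGetPlatformIDs_ptr = reinterpret_cast<clGetPlatformIDs_t>(
      dlsym(handle, "clGetPlatformIDs"));
  return clGetPlatformIDs_ptr != nullptr;
}
```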

gongzg commented Jan 10, 2017

  • The GEMV kernel indeed doesn't use any Intel-specific extension. There are some vector load/store considerations that make it more efficient; I'm not sure whether this change will benefit other platforms.
  • You are correct. The pthread dependency must come from clBLAS; I will fix it.
  • Actually, this json file is mainly for SKL, but it is also good for BDW. In fact, the image-based GEMM kernel is only faster for relatively large matrices; for small matrices, the generic kernel may be faster. But once we enabled other kernels during the auto-tuning phase and used that json file, we found that at runtime isaac may choose the generic kernel rather than the image-based kernel even for some relatively large matrices, and thus get much worse performance than when we force it to the image-based GEMM kernel.

ptillet commented Jan 10, 2017

I was able to try the branch at work today, and indeed saw some good 20-30% improvements in some cases for GEMM. Good job :)
However, I get some crashes in the tests/benchmarks, and a substantial decrease in performance for some shapes.

For GEMV, there is only a single case in the benchmarks where the Intel kernel provides gains:
(M, N) = (1024, 60000), A_trans = 1
This corresponds to a case where the matrix is too large to fit in the eDRAM, which is a bit strange since the two algorithms seem to be the same. I'll investigate this in more detail.

gongzg commented Jan 10, 2017

@ptillet Could you let me know how to reproduce the crashes? Which platform are you working on? @listenlink will take care of those crashes.

As for the GEMV kernels, we found the Intel GEMV wins in many cases. @listenlink, could you share some cases where GEMV wins on both BDW and SKL? Thanks.

BTW, we found that the master branch of isaac causes random failures with clCaffe. If you build clCaffe with isaac, enable the Intel spatial engine, and then run clCaffe's test suite, it may sometimes crash or produce incorrect results. But if we run those crashing or failing cases directly, they pass. Do you have time to look into this issue as well?

ptillet commented Jan 10, 2017

For GEMM:
Can you see any HW reason why the image kernels would be faster than the buffer kernels? On NVidia hardware, the compiler will use the read-only texture cache when the pointers are marked restrict (see the snippet below). I would be surprised if there were no instruction in the GEN ISA to use the texture cache while fetching from global memory.
Also, there are indeed some problems with the prediction mechanism, hence the labels-cache branch, which bypasses the prediction mechanism. It is not in production yet, but it's definitely useful for testing new kernels.
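To make the restrict point concrete, here is a minimal, purely illustrative kernel signature of the kind being discussed; marking the read-only operands const ... restrict is what allows a compiler to route their loads through a read-only/texture cache path where the hardware supports one:

```c
// Illustrative only: 'const' + 'restrict' promise the compiler that A and B
// are read-only and never aliased by C, enabling read-only cache paths.
__kernel void gemm_nn_naive(__global const float* restrict A,
                            __global const float* restrict B,
                            __global float* C,
                            int M, int N, int K)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    if (i >= M || j >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[i * K + k] * B[k * N + j];  // row-major, no tiling
    C[i * N + j] = acc;
}
```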

For GEMV:
ISAAC's GEMV kernel also uses vector loads and vector stores (a minimal sketch of the pattern follows). Keep in mind that the .json file on isaac-master was generated for my platform, a BDW GT3e with 2x DDR3 sticks. If you have SKL, DDR4, or quad-channel memory, the results could be very different. Unfortunately I can't benchmark Skylake; I have no machine.
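For reference, the vectorized load pattern in OpenCL C looks like the following; this is a minimal illustrative sketch of a GEMV inner loop (y = A*x, row-major A, N assumed divisible by 4), not the actual ISAAC template:

```c
// Minimal sketch of a vectorized GEMV inner loop; assumes N % 4 == 0
// (a real template would need remainder handling).
__kernel void gemv_vload4(__global const float* restrict A,
                          __global const float* restrict x,
                          __global float* y,
                          int M, int N)
{
    int row = get_global_id(0);
    if (row >= M) return;
    float4 acc = (float4)(0.0f);
    for (int c = 0; c < N / 4; ++c)
        acc += vload4(c, A + row * N) * vload4(c, x);  // 4-wide loads
    y[row] = acc.x + acc.y + acc.z + acc.w;            // horizontal sum
}
```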

For the crashes:
I get crashes at the end of test-blas-2 and test-blas-3, with "The supplied parameters for this template are invalid". I will get the error code tomorrow.
./bench/bench-blas hangs on GEMM. I will try with hangcheck disabled tomorrow, but it's a bad sign, since no kernel should take long enough to trigger the watchdog.
Do you have the guarantee that the subgroup size will indeed be what you assume it to be in the kernels?

For the bug:
Can you open another issue for the error with caffe? A commit SHA1 that doesn't cause the error would also be useful. The test suite is pretty strict on the kernels now, but it is not impossible that things get messed up with queues at some point.

@listenlink

Hi,
For the crashes:
that's because the hardware supports the double datatype, so the double tests run, but the broadwell.json file doesn't have "float64" profiles. The crashes also occur on the original master branch.

For GEMV perf:
the original benchmark was against a commit from January last year. After rebasing onto the latest code, I also find it no longer gets a benefit over the current reduce_2d kernel.

gongzg commented Jan 11, 2017

@ptillet
For GEMM:
There is indeed a texture-only cache for images on Intel Gen graphics. In addition, the tiled layout of images works very well with the subgroup block read extension, which achieves better I/O efficiency than the buffer-based implementation (a sketch follows). This is the major reason why image is better.
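To illustrate what a subgroup block read looks like, here is a minimal sketch using cl_intel_subgroups; the kernel is purely illustrative (not the PR's GEMM kernel), and the image is assumed to hold 32-bit texels:

```c
// Minimal sketch of an image block read from cl_intel_subgroups.
#pragma OPENCL EXTENSION cl_intel_subgroups : enable

__kernel void block_read_demo(__read_only image2d_t A, __global float* out)
{
    int row = get_group_id(1);
    // One coalesced transaction per subgroup: the x coordinate is a byte
    // offset into the row, y is the image row; lane i receives element i.
    uint raw = intel_sub_group_block_read(A, (int2)(0, row));
    int lane = get_sub_group_local_id();
    out[row * get_sub_group_size() + lane] = as_float(raw);
}
```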

For GEMV, @listenlink already explained that we were comparing against the original implementation. As you may know, it took us a long way to open-source our contribution, and you have already made some nice improvements in the meantime, so we can't see much difference now. What's your suggestion for this case?

ptillet commented Jan 11, 2017

So at this point it's probably useful that I summarize my comments on the PR:

1 - There shouldn't be any static dependency for Isaac -- not even OpenCL. I use libdl to load OpenCL and/or CUDA at runtime (depending on what's installed); see driver/dispatch.cpp for details. The pthread dependency is also unnecessary.

2 - Actually, I have not touched the GEMV kernels in a very long time, although I may have re-run the auto-tuner. At this point it seems that the maintenance trouble associated with Intel-specific GEMV kernels is not worth it. However, I think your expertise could be useful to improve the current GEMV template if you wish, though it already uses vectorized load/store. Is there any way to get a global atomic add on GEN? (i.e., is it supported by the hardware, and is there any way to access it via OpenCL without repeated use of atomic_inc? A sketch of the usual CAS workaround follows this list.)

3 - I think the GEMM kernels are very valuable, as they yield some good improvements for some input shapes. Here are my worries, though:

  • Is there any guarantee that the sub-group size will always be set by the compiler to the value required by the kernel?
  • I think it's time for Isaac to find a way to differentiate between GEN architectures. It seems the best profiles for BDW and SKL are different.
  • The file generation/gemm.cpp is becoming quite big; perhaps moving the Intel-specific definitions somewhere else (gemm/intel.cpp) would be good.
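On the float side of the atomic question: core OpenCL 1.x provides atomic_add for 32-bit integers in __global memory, but not for floats. The usual workaround, shown as a minimal illustrative sketch below (not code from this PR), is a compare-and-swap loop with atomic_cmpxchg:

```c
// Minimal sketch: emulate a global float atomic add with a CAS loop.
// Core OpenCL 1.x only provides integer atomic_add.
inline void atomic_add_float(volatile __global float* addr, float val)
{
    union { unsigned int u; float f; } old_val, new_val;
    do {
        old_val.f = *addr;            // snapshot the current value
        new_val.f = old_val.f + val;  // value we would like to write
        // Retry if another work-item updated *addr in the meantime.
    } while (atomic_cmpxchg((volatile __global unsigned int*)addr,
                            old_val.u, new_val.u) != old_val.u);
}
```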

Overall, I think the PR could be simplified to just a few files:
runtime/profiles.cpp
jit/gemm.h
jit/gemm/intel.cpp (new file)
driver/dispatch.h|cpp
driver/device.h|cpp (for checking BDW vs SKL)
python/src/bind/kernels.cpp

I think that changes in the auto-tuner should not be needed, but I may be wrong.

gongzg commented Jan 12, 2017

@ptillet Thanks for the comments; we are working on that. One question which I forgot to answer:
"Do you have the guarantee that the subgroup size will indeed be what you assume it to be in the kernels?"

There is a subgroup extension function attribute that ensures the kernel is built with the specified subgroup size:

```c
__attribute__((intel_reqd_sub_group_size(16)))
__kernel void foo() {
...
}
```
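To double-check at runtime (an illustrative sketch, not code from this PR), a kernel can also report the subgroup size it was actually compiled with, so the host can assert it matches the required value:

```c
// Minimal sketch: write out the compiled subgroup size; the host then
// asserts that out[0] == 16.
#pragma OPENCL EXTENSION cl_intel_subgroups : enable

__attribute__((intel_reqd_sub_group_size(16)))
__kernel void check_sg_size(__global uint* out)
{
    if (get_global_id(0) == 0)
        out[0] = get_sub_group_size();
}
```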

@gongzg gongzg closed this Jan 21, 2017
goostavz pushed a commit to goostavz/triton that referenced this pull request Aug 4, 2023
jlebar pushed a commit that referenced this pull request Jun 21, 2024
When running
[convert_blocked1d_to_slice0](https://github.com/triton-lang/triton/blob/0ba5f0c3cd029d5c3d1f01b9bf29dac32c27345e/test/Conversion/tritongpu_to_llvm.mlir#L924)
Triton ends up computing a rank of a matrix with 0 columns during linear
layout lowering, which trips up f2reduce, and causes undefined behavior,
detectable through
[UBSAN](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html).

Fix this by returning the rank (0) early in these cases, without calling
f2reduce.

<details><summary>Stack trace</summary>
<p>

```
third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30: runtime error: shift exponent 18446744073709551615 is too large for 64-bit type 'unsigned long long'
    #0 0x556ee2fea3be in inplace_rref_small third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30
    #1 0x556ee2fea3be in f2reduce::inplace_rref_strided(unsigned long*, unsigned long, unsigned long, unsigned long) third_party/triton/third_party/f2reduce/f2reduce.cpp:470:9
    #2 0x556ee2ea70da in getMatrixRank third_party/triton/lib/Tools/LinearLayout.cpp:125:3
    #3 0x556ee2ea70da in mlir::triton::LinearLayout::checkInvariants(bool) third_party/triton/lib/Tools/LinearLayout.cpp:299:7
    #4 0x556ee2ea656d in mlir::triton::LinearLayout::tryCreate(llvm::MapVector<mlir::StringAttr, std::__u::vector<std::__u::vector<int, std::__u::allocator<int>>, std::__u::allocator<std::__u::vector<int, std::__u::allocator<int>>>>, llvm::DenseMap<mlir::StringAttr, unsigned int, llvm::DenseMapInfo<mlir::StringAttr, void>, llvm::detail::DenseMapPair<mlir::StringAttr, unsigned int>>, llvm::SmallVector<std::__u::pair<mlir::StringAttr, std::__u::vector<std::__u::vector<int, std::__u::allocator<int>>, std::__u::allocator<std::__u::vector<int, std::__u::allocator<int>>>>>, 0u>>, llvm::ArrayRef<std::__u::pair<mlir::StringAttr, int>>, bool) third_party/triton/lib/Tools/LinearLayout.cpp:190:41
    #5 0x556ee2eb2150 in mlir::triton::LinearLayout::divideRight(mlir::triton::LinearLayout const&) third_party/triton/lib/Tools/LinearLayout.cpp:654:51
    #6 0x556ee2ee1c39 in mlir::cvtNeedsSharedMemory(mlir::RankedTensorType, mlir::RankedTensorType) third_party/triton/lib/Analysis/Utility.cpp:652:14
    #7 0x556ee2cf38fd in mlir::triton::getRepShapeForCvtLayout(mlir::triton::gpu::ConvertLayoutOp) third_party/triton/lib/Analysis/Allocation.cpp:66:8
    #8 0x556ee2cf3efa in mlir::triton::getScratchConfigForCvtLayout(mlir::triton::gpu::ConvertLayoutOp, unsigned int&, unsigned int&) third_party/triton/lib/Analysis/Allocation.cpp:95:19
    #9 0x556ee2cf6057 in mlir::triton::AllocationAnalysis::getScratchValueSize(mlir::Operation*) third_party/triton/lib/Analysis/Allocation.cpp:272:24
    #10 0x556ee2cf5499 in operator() third_party/triton/lib/Analysis/Allocation.cpp:343:7
    #11 0x556ee2cf5499 in void llvm::function_ref<void (mlir::Operation*)>::callback_fn<mlir::triton::AllocationAnalysis::getValuesAndSizes()::'lambda'(mlir::Operation*)>(long, mlir::Operation*) third_party/llvm/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:45:12
    #12 0x556edeeee7a9 in operator() third_party/llvm/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:68:12
    #13 0x556edeeee7a9 in void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:174:5
    #14 0x556edeeee87c in void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:182:9
    #15 0x556ee2cf49e7 in walk<(mlir::WalkOrder)0, mlir::ForwardIterator, (lambda at third_party/triton/lib/Analysis/Allocation.cpp:341:42), mlir::Operation *, void> third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:313:10
    #16 0x556ee2cf49e7 in walk<(mlir::WalkOrder)0, mlir::ForwardIterator, (lambda at third_party/triton/lib/Analysis/Allocation.cpp:341:42), void> third_party/llvm/llvm-project/mlir/include/mlir/IR/Operation.h:794:12
    #17 0x556ee2cf49e7 in mlir::triton::AllocationAnalysis::getValuesAndSizes() third_party/triton/lib/Analysis/Allocation.cpp:341:16
    #18 0x556ee2cf4852 in run third_party/triton/lib/Analysis/Allocation.cpp:182:5
    #19 0x556ee2cf4852 in AllocationAnalysis third_party/triton/lib/Analysis/Allocation.cpp:169:5
    #20 0x556ee2cf4852 in mlir::Allocation::run(llvm::DenseMap<mlir::FunctionOpInterface, mlir::Allocation, llvm::DenseMapInfo<mlir::FunctionOpInterface, void>, llvm::detail::DenseMapPair<mlir::FunctionOpInterface, mlir::Allocation>>&) third_party/triton/lib/Analysis/Allocation.cpp:627:3
    #21 0x556ee1677402 in operator() third_party/triton/include/triton/Analysis/Allocation.h:227:26
    #22 0x556ee1677402 in void mlir::CallGraph<mlir::Allocation>::doWalk<(mlir::WalkOrder)0, (mlir::WalkOrder)1, mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::CallOpInterface, mlir::FunctionOpInterface), mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::FunctionOpInterface)>(mlir::FunctionOpInterface, llvm::DenseSet<mlir::FunctionOpInterface, llvm::DenseMapInfo<mlir::FunctionOpInterface, void>>&, mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::CallOpInterface, mlir::FunctionOpInterface), mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::FunctionOpInterface)) third_party/triton/include/triton/Analysis/Utility.h:350:7
    #23 0x556ee16756b3 in walk<(mlir::WalkOrder)0, (mlir::WalkOrder)1, (lambda at third_party/triton/include/triton/Analysis/Allocation.h:222:9), (lambda at third_party/triton/include/triton/Analysis/Allocation.h:224:9)> third_party/triton/include/triton/Analysis/Utility.h:242:7
    #24 0x556ee16756b3 in mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp) third_party/triton/include/triton/Analysis/Allocation.h:220:5
    #25 0x556ee2c2bf18 in (anonymous namespace)::AllocateSharedMemory::runOnOperation() third_party/triton/lib/Conversion/TritonGPUToLLVM/AllocateSharedMemory.cpp:26:22
...
UndefinedBehaviorSanitizer: invalid-shift-exponent third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30 
```
</p>
</details>
oraluben pushed a commit to oraluben/triton that referenced this pull request Sep 11, 2024
…iton-lang#11)

* [CPU] Support flexible active driver + update vector-add tutorial

* Update vector-add to run CPU always + optional GPU

* Update do_bench for CPU
gglin001 pushed a commit to gglin001/triton that referenced this pull request Nov 13, 2024
bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024
stephen-huan pushed a commit to stephen-huan/triton that referenced this pull request Dec 24, 2024
stephen-huan pushed a commit to stephen-huan/triton that referenced this pull request Mar 11, 2025