Optimized GEMM & GEMV for Intel platforms #11

gongzg · 2017-01-09T09:12:20Z

This PR implements optimized GEMV and GEMM kernels for Intel Gen Graphics. For GEMM, we force the profile to always choose the image-based implementation, because on real workloads we found that the image-based kernels always perform better. With the default tuning mechanism, isaac may choose other implementations that are much slower.

v2: by zhigang, fix some warnings and remove half-related code.
v3: by lixiang, modify json file.
v4: by junkai, optimize gemm image kernel and force isaac to run gemm image kernel.
v5: by junkai, change json file to force isaac to run gemm image kernel.
Change-Id: Ieab41924476bfc001f7026fbea3b5ea5e56eb00b

ptillet · 2017-01-09T11:56:11Z

Hi,

Thanks a lot for the PR. I went through it quite quickly and here are some comments. Let's make sure it eventually gets integrated at just a tiny maintenance cost.

I don't see the GEMV using any Intel OpenCL extension. The improvements for GEMV should probably be analyzed in more depth to see what's wrong with the current template. Upgrading the current template would massively simplify the PR (no need to check for GEMV), and the changes would carry over to row/col-wise reductions and any input shape.
Isaac should not depend on pthread and OpenCL at compile-time. If there are missing dynamic loads for OpenCL, they should be added in driver/dispatch.cpp. Currently this breaks compilation on NVidia machines, and those that don't have OpenCL installed in the system's default path.
So the new .json profile for Broadwell seems to suggest that the Image kernel beats Isaac for every single shape. This is a bit surprising, especially considering that this kernel requires a copy from the OpenCL buffers to images. It also means that the improved buffer kernels are never used. I want to check what's going on on my Broadwell machine.

gongzg · 2017-01-10T00:32:03Z

The GEMV kernel indeed doesn't use any Intel-specific extension. It includes some vector load/store considerations to make it more efficient. We are not sure whether this change will benefit other platforms.
You are correct. The pthread dependency should be from clBLAS; I will fix it.
Actually, this json file is mainly for SKL, but it also works well for BDW. In fact, the image-based GEMM kernel is only faster for relatively large matrices; for small matrices, the generic kernel may be faster. But once we enabled the other kernels during the auto-tuning phase and used the resulting json file, we found that at runtime isaac may choose the generic kernel rather than the image-based kernel even for some relatively large matrices, and thus gets much worse performance than when we force it to use the image-based GEMM kernel.

ptillet · 2017-01-10T00:33:43Z

So I could try the branch today at work, and saw indeed some good 20-30% improvements in some cases for GEMM. Good job :)
However, I get some crashes in the tests/benchmarks, and substantial decrease in performance for some shapes.

For GEMV, there is only one single case in the benchmarks where the Intel kernel provides gains:
(M, N)=(1024,60000), A_trans = 1
This corresponds to a case where the matrix is too large to fit in the eDRAM, which is a bit strange since the two algorithms seem to be the same. I'll investigate this in more detail.
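To make the eDRAM remark concrete, here is a rough footprint calculation for the (1024, 60000) case. It is only a sketch: the element type (float32) and the 128 MiB eDRAM capacity are assumptions that the thread does not state, and the helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bytes occupied by a dense M x N matrix whose elements are elem_size bytes each.
constexpr std::uint64_t matrix_bytes(std::uint64_t M, std::uint64_t N,
                                     std::uint64_t elem_size) {
    return M * N * elem_size;
}

// ASSUMPTION: 128 MiB of eDRAM. This figure is not given in the thread.
constexpr std::uint64_t kAssumedEdramBytes = 128ull * 1024 * 1024;

// True when a buffer of the given size would fit entirely in the assumed eDRAM.
constexpr bool fits_in_assumed_edram(std::uint64_t bytes) {
    return bytes <= kAssumedEdramBytes;
}
```

Under these assumptions, matrix_bytes(1024, 60000, sizeof(float)) is 245,760,000 bytes (about 234 MiB), so the matrix alone would exceed the assumed eDRAM capacity.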

gongzg · 2017-01-10T01:52:17Z

@ptillet Could you let me know how to reproduce the crashes? Which platform are you working on? @listenlink will take care of those crashes.

As for the GEMV kernels, we found that the Intel GEMV wins in many cases. @listenlink, could you share some cases where GEMV wins on both BDW and SKL? Thanks.

BTW, we found that isaac on the master branch causes random failures with clcaffe. If you build clcaffe with isaac, enable the Intel spatial engine, and then run clcaffe's test suite, it may crash some of the time or produce incorrect results. But if we run the crashing or failing cases directly, they pass. Do you have time to look into this issue as well?

ptillet · 2017-01-10T02:45:02Z

For GEMM:
Can you see any HW reason why the image kernels would be faster than the buffer kernels? On NVidia Hardware, the compiler will use the read-only texture-cache when the pointers are restrict. I would be surprised if there was no instruction in the ISA to use the texture cache while fetching from global memory.
Also, there are indeed some problems with the prediction mechanism, hence the labels-cache branch, which obviates the prediction mechanism. It is not in production yet, but it's definitely useful for testing new kernels.

For GEMV:
ISAAC's GEMV kernel also uses vector loads and vector stores. Keep in mind as well that the .json file on isaac-master was generated for my platform, which is a BDW GT3e with 2x DDR3 sticks. If you have SKL, DDR4 or quad-channel memory, the results could be very different. Unfortunately I can't benchmark Skylake; I have no machine.

For the crashes:
I get crashes at the end of test-blas-2 and test-blas-3, with "The supplied parameters for this template are invalid". I will get the error code tomorrow.
./bench/bench-blas hangs on GEMM. I will try with hangcheck disabled tomorrow, but it's a bad sign since no kernel should take long enough to activate the watchdog.
Do you have the guarantee that the subgroup size will indeed be what you assume it to be in the kernels?

For the bug:
Can you open another issue for the error with caffe? A commit SHA1 that doesn't cause the error would also be useful. The test-suite is pretty strict for the kernels now, but after all it is not impossible that things get messed up with queues at some point.

listenlink · 2017-01-10T12:36:45Z

Hi,
For the crashes:
They happen because the hardware supports the double datatype tests while the broadwell.json file doesn't have the "float64" profiles. The crashes also occur on the original master branch.
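One way to see the failure mode described above is a profile lookup that has no entry for a datatype. The sketch below is not ISAAC's actual profile code: the `KernelParams` fields, the `Profile` map and both functions are hypothetical, and the fallback values are placeholders. It only illustrates returning a default instead of failing when a section such as "float64" is missing.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for one tuned entry of a device profile.
struct KernelParams {
    int vector_width;
    int local_size;
};

// Hypothetical profile: datatype name -> tuned parameters, as if parsed from a .json file.
using Profile = std::map<std::string, KernelParams>;

// Returns a pointer to the tuned parameters for dtype, or nullptr when the
// profile has no section for that datatype.
const KernelParams* find_params(const Profile& profile, const std::string& dtype) {
    auto it = profile.find(dtype);
    return it == profile.end() ? nullptr : &it->second;
}

// Falls back to conservative defaults (placeholder values) instead of failing
// when the requested datatype is absent from the profile.
KernelParams params_or_default(const Profile& profile, const std::string& dtype) {
    const KernelParams fallback{1, 64};
    const KernelParams* found = find_params(profile, dtype);
    return found ? *found : fallback;
}
```

With a profile that only contains "float32", find_params(profile, "float64") returns nullptr, which is the situation a caller has to handle explicitly.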

For GEMV perf:
The original benchmark compared against a commit from January of last year. After rebasing onto the latest code, I also find that it brings no benefit over the current reduce_2d kernel.

gongzg · 2017-01-11T08:24:04Z

@ptillet
For GEMM:
There is indeed a texture-only cache for images on Intel Gen graphics. The tiling mode of images also works very well with the sub-group block read extension, which gives better I/O efficiency than the buffer-based implementation. This is the major reason why the image kernels are better.

For GEMV, @listenlink already explained that we were comparing with the original implementation. But as you may know, it took us a long time to get to the point of open-sourcing our contribution, and in the meantime you had already made some nice improvements, so we can't see much difference now. What's your suggestion for this case?

ptillet · 2017-01-11T15:55:15Z

So at this point it's probably useful that I summarize my comments on the PR:

1 - There shouldn't be any static dependency for Isaac -- not even OpenCL. I use libdl to load OpenCL and/or CUDA at runtime (depending on what's installed), see driver/dispatch.cpp for details. The pthread dependency is also not necessary.

2 - Actually, I have not touched the GEMV kernels in a very long time, although I may have re-run the auto-tuner. At this point it seems that the maintenance burden of Intel-specific GEMV kernels is not worth it. However, I think your expertise could be useful for improving the current GEMV template if you wish, though it already uses vectorized loads/stores. Is there any way to have a global atomic add on GEN? (i.e., is it supported by the hardware, and is there any way to access it via OpenCL without repeated use of atomic_inc?)

3 - I think the GEMM kernels are very valuable, as they yield some good improvements for some input shapes. Here are my worries, though:

Is there any guarantee that the sub-group size will always be set by the compiler to the value required by the kernel?
I think it's time for Isaac to find a way to differentiate between different GEN architectures. It seems like the best profiles for BDW and SKL are different.
The file generation/gemm.cpp is becoming quite big; perhaps moving the Intel-specific definitions somewhere else (gemm/intel.cpp) would be good.

Overall, I think the PR could be simplified to just a few files:
runtime/profiles.cpp
jit/gemm.h
jit/gemm/intel.cpp (new file)
driver/dispatch.h|cpp
driver/device.h|cpp (for checking BDW vs SKL)
python/src/bind/kernels.cpp

I think that changes in the auto-tuner should not be needed, but I may be wrong
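Point 1 above (no static dependency on OpenCL, loading it through libdl at runtime) can be sketched roughly as follows. This is not the code in driver/dispatch.cpp: the `DynLib` wrapper, `count_opencl_platforms` and the library name "libOpenCL.so.1" are assumptions for illustration, and the OpenCL typedefs are replaced by plain fixed-width types so that no OpenCL header is needed at compile time. Only `dlopen`, `dlsym`, `dlclose` and the `clGetPlatformIDs` entry point are real APIs.

```cpp
#include <dlfcn.h>

#include <cassert>
#include <cstdint>

// Thin RAII wrapper around dlopen/dlclose (hypothetical helper).
class DynLib {
public:
    explicit DynLib(const char* name)
        : handle_(dlopen(name, RTLD_LAZY | RTLD_LOCAL)) {}
    ~DynLib() {
        if (handle_) dlclose(handle_);
    }
    DynLib(const DynLib&) = delete;
    DynLib& operator=(const DynLib&) = delete;

    bool loaded() const { return handle_ != nullptr; }

    // Address of the named symbol, or nullptr if the library or symbol is missing.
    void* symbol(const char* name) const {
        return handle_ ? dlsym(handle_, name) : nullptr;
    }

private:
    void* handle_;
};

// clGetPlatformIDs(cl_uint, cl_platform_id*, cl_uint*) spelled with plain types:
// cl_int -> int32_t, cl_uint -> uint32_t, cl_platform_id -> opaque pointer.
using GetPlatformIDsFn = std::int32_t (*)(std::uint32_t, void**, std::uint32_t*);

// Number of OpenCL platforms, or -1 when OpenCL cannot be loaded or queried.
// The default library name is an assumption (typical Linux ICD loader name).
int count_opencl_platforms(const char* libname = "libOpenCL.so.1") {
    DynLib lib(libname);
    auto fn = reinterpret_cast<GetPlatformIDsFn>(lib.symbol("clGetPlatformIDs"));
    if (!fn) return -1;                        // library or entry point absent
    std::uint32_t n = 0;
    if (fn(0, nullptr, &n) != 0) return -1;    // 0 is CL_SUCCESS
    return static_cast<int>(n);
}
```

The point of the pattern is that a machine without OpenCL installed still compiles and runs the binary; the missing library only shows up as a -1 at the call site. On older glibc versions the program may need to be linked with -ldl.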

gongzg · 2017-01-12T09:02:53Z

@ptillet Thanks for the comments, we are working on that. One question which I forgot to answer:
"Do you have the guarantee that the subgroup size will indeed be what you assume it to be in the kernels?"

There is a function attribute in the sub-group extension that makes sure the kernel is built with the specified sub-group size:

__attribute__((intel_reqd_sub_group_size(16)))
__kernel void foo() {
...
}

Lin, Lixiang and others added 11 commits December 15, 2016 09:31

Added intelblas_gemv template (4b1b0eb)
Enable the intelblas_gemv to pass test_suite with beignet. Will investigate VPG ocl drive failing bug. (f1afd47)
temp enable gemv on vpg driver by uncomment the renew output. should find root cause (02c5b38)
Added clFlush to fix issue over VPG ocl driver (941a3ee)
Enable external intelblas_gemv training on python (d353da2)
comment out finding library of OpenCL (9b5d7fd)
integrate intelblas_gemm. Change-Id: Ic9edf18a3ae0f41b21c2ac374d50000fc5d4e6f3 (12eb106)
Updated tuned database json for broadwell (14d7354)
Added log printing of external kernels. Change-Id: I89f632e2598594805e24b6aa2d084dcfa1c4f218 (86150a0)
Modified external template bench print for more than one template for the same operatin. (8af6c2d)

gongzg closed this Jan 21, 2017
