Enable the case N != ldc in EigenBlasGemm. #5976
7306a34 to 6040f3a
… temporary memory.
paddle/math/Allocator.h
Outdated
    #include <mutex>
    #include "hl_gpu.h"
    #include "paddle/utils/Logging.h"
    #ifdef PADDLE_WITH_CUDA
When compiling with WITH_GPU=OFF, hl_cuda_stub.h is included, so shouldn't the code below compile correctly even without these macros?
I added this because EigenGemm.cpp now has #include "paddle/math/MemoryHandle.h", which indirectly pulls in #include "hl_base.h". That header defines using real float, which causes compilation problems with Eigen.
paddle/function/EigenGemm.cpp
Outdated
    sizeC[1] = N;
    CHECK_EQ(N, ldc);
    T* gemmC = C;
    if (N != ldc) {
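Fixing it this way introduces a performance risk down the road, and people unfamiliar with this code will not easily notice that there is a performance problem here. Could you check whether Eigen has another way to express this that supports a stride parameter directly?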
I checked: Eigen's stride mechanism is not suitable for this case.
Done.
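For readers unfamiliar with the leading-dimension convention, here is a minimal plain C++ sketch of the two options weighed in this thread: writing the product directly into a C whose row stride ldc exceeds N, versus computing into a dense temporary and copying it back row by row. The names gemmStrided and gemmViaTemp are hypothetical and do not come from the Paddle source. Row-major storage is assumed, and the naive triple loop merely stands in for the Eigen contraction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference GEMM (not Paddle's EigenBlasGemm).
// Computes C[0:M, 0:N] = A * B for row-major A (M x K, row stride lda),
// B (K x N, row stride ldb) and C (row stride ldc >= N).
// Columns N..ldc-1 of every row of C are left untouched.
template <typename T>
void gemmStrided(int M, int N, int K,
                 const T* A, int lda,
                 const T* B, int ldb,
                 T* C, int ldc) {
  assert(lda >= K && ldb >= N && ldc >= N);
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      T sum = T(0);
      for (int k = 0; k < K; ++k) {
        sum += A[i * lda + k] * B[k * ldb + j];
      }
      C[i * ldc + j] = sum;  // the row stride is ldc, not N
    }
  }
}

// Temporary-buffer variant (assumed shape of the first fix): compute into
// a dense M x N scratch buffer, then copy each row into the strided C.
template <typename T>
void gemmViaTemp(int M, int N, int K,
                 const T* A, int lda,
                 const T* B, int ldb,
                 T* C, int ldc) {
  std::vector<T> tmp(static_cast<std::size_t>(M) * N);
  gemmStrided(M, N, K, A, lda, B, ldb, tmp.data(), N);
  for (int i = 0; i < M; ++i) {
    std::copy(tmp.begin() + i * N, tmp.begin() + (i + 1) * N, C + i * ldc);
  }
}
```

Both functions produce the same C, but the temporary-buffer variant pays for an extra M x N allocation and copy on every call, which is presumably the hidden cost the reviewer is concerned about.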
f41e4cd to 16bdb47
paddle/function/EigenGemm.cpp
Outdated
    Eigen::DefaultDevice device;
    if (alpha == T(1) && beta == T(0)) {
    c.device(device) = a.contract(b, dims);
    c.slice(offsetC, extentC).device(device) = a.contract(b, dims);
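Consider two cases:
1. ldc_1 > N, using the operation c.slice(offsetC, extentC).device(device) = a.contract(b, dims);
2. ldc_2 == N, using the operation c.device(device) = a.contract(b, dims);
M, N, and K are the same in cases 1 and 2, but ldc > N in case 1 and ldc == N in case 2. Do these two computation paths deliver the same performance (gflops) in their respective cases?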
Done. This makes the code a bit long (repetitive). I will measure the timing later.
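One way to answer the performance question above is to time both paths with identical M, N, K and compare GFLOPS. The sketch below is only a possible harness: the helper names gemmGflops and meanSeconds are hypothetical, and the 2*M*N*K flop count is the usual convention for GEMM, not a figure taken from this PR.

```cpp
#include <cassert>
#include <chrono>

// Conventional GEMM flop count: one multiply and one add per (i, j, k).
inline double gemmGflops(int M, int N, int K, double seconds) {
  return 2.0 * M * N * K / seconds / 1e9;
}

// Runs fn() iters times and returns the mean wall-clock seconds per call.
template <typename Fn>
double meanSeconds(Fn&& fn, int iters) {
  assert(iters > 0);
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    fn();
  }
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count() / iters;
}
```

Wrapping each of the two code paths in a lambda, passing it to meanSeconds, and feeding the result to gemmGflops gives directly comparable numbers for the ldc > N and ldc == N cases.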
Fix #5997