added fp16 kernel support #1
base: conv2d-cuda-opt
Conversation
@Green-Sky, thanks for testing. It's weird that it failed in sd.cpp. This PR passed all tests in test-backend-ops. Could you try @etasnadi's branch https://github.com/etasnadi/llama.cppxx/tree/conv2d-cuda-opt? Mine is on top of his and I hope I didn't break anything.
Well, without the f16 support, it falls back to the slow naive impl, as expected. (working as intended)
Your changes look alright too. The only thing I think might be wrong somewhere is that we might not be accounting for the halving of the size of the kernel somehow.
Looks like it is not your change that causes the issue. I will report in the PR in a sec.
Don't forget to update this if you want to use f16: llama.cppxx/ggml/src/ggml-cuda/ggml-cuda.cu, line 2465 at 819c40e.
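A minimal sketch of the kind of gate that line points at (illustrative only; the actual code in ggml-cuda.cu differs and the helper name below is made up): whatever check reports `GGML_OP_CONV_2D` support for the CUDA backend also has to accept an F16 kernel tensor, otherwise ggml keeps routing the op to the fallback path even though the new kernel could handle it.

```cpp
// Illustrative sketch only -- not the actual ggml-cuda.cu code. The point is
// that the backend's supports_op decision for GGML_OP_CONV_2D must also
// accept an F16 kernel tensor, or the new kernel is never selected.
#include "ggml.h"

static bool conv2d_supported_sketch(const struct ggml_tensor * op) {
    if (op->op != GGML_OP_CONV_2D) {
        return false;
    }
    const enum ggml_type kernel_type = op->src[0]->type;  // src[0] is the kernel
    return kernel_type == GGML_TYPE_F32 ||
           kernel_type == GGML_TYPE_F16;                  // <- the F16 case to allow
}
```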
In my tests, I don't have …
I am also puzzled: no matter what test cases I throw at it in test-backend-ops, they all pass; in a customized test I created for my implicit PR, it always failed. Edit: I fixed my test case and got the benchmark comparison with im2col+gemm.
Is it the f16 or the f32 perf? The perf is not looking too good. Can you test what the case is with the Vulkan backend?
It is the same for me: for the tensor core kernel the test cases passed, but when I tried it in sd.cpp, it didn't work.
These are fp16 results. I don't have a Vulkan dev env, sorry.
OK, I thought these were sd.cpp results; I guess it is test-backend-ops with custom shapes. Can you share the test cases you added to test-backend-ops? Are these numbers for this PR or for your previous implicit GEMM implementation? Thanks.
Please see https://github.com/bssrdf/llama.cpp/tree/add-conv2d-test-case for a test I added. The above numbers are for this PR. Thanks.
That's reasonable. The CUDA backend uses cuBLAS by default for performing the matrix multiplication, and it is highly optimized for each shape class on each single arch. I knew this was the case, which is why I did not share the CUDA kernel in the first place: in order to beat the im2col+gemm implementation we would also need to optimize the code for each device, which seemed almost impossible to me. I know that many GitHub repos claim they outperform cuBLAS, but I doubt they can keep their advantage on every device and every shape. Nevertheless, the memory saving is huge, so it might be worth adding the direct conv2d code even though it is considerably slower in several cases. Now I'll try to reproduce the crash with stable diffusion, and once it is fixed, we are ready.
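To put a rough number on the memory saving (using the full-resolution VAE shape from the test cases further down, a single image, and stride 1 / padding 1 so the output keeps the spatial size): a 3x3 conv over a 768x1024 input with Cin = 128 needs an im2col matrix of (Cin*KH*KW) x (OW*OH) = 1152 x 786432 ≈ 906M elements, roughly 1.8 GB in fp16, while the input tensor itself is only 768*1024*128 ≈ 100M elements (about 200 MB). That is the KH*KW = 9x blow-up the direct kernel avoids entirely.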
@etasnadi, I agree with everything you said. However, @Green-Sky showed that Vulkan's direct conv2d can be faster than CUDA's im2col. I looked at the Vulkan code and found nothing specially optimized. Maybe the Vulkan compiler does a magical job of optimizing the hell out of it.
I think im2col is somewhat memory bound, and the fact that the Vulkan direct implementation uses less memory helps. (with cm2)
I figured out what the problem is. I mistakenly forgot to use the proper stream for copying to the symbols. @bssrdf, can you update the code and test whether it works on your device if you use the async alternative of cudaMemcpyToSymbol? This is not an issue with test-backend-ops because it probably uses the same (default) stream for both the kernel call and the copy to symbols.
Vulkan already has matrix core support; I guess that's the reason why it is faster. The scalar CUDA kernel has very similar performance to Vulkan on my device, and it is faster with the bank-conflict fix since the Vulkan code has not received it yet.
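For reference, the usual shape of such a bank-conflict fix (an illustrative sketch, not necessarily what either kernel does): pad the innermost dimension of the shared-memory tile by one element so that a warp walking a "column" of the tile hits different banks instead of serializing on one.

```cpp
// Illustrative only: the classic +1 padding trick for shared-memory tiles.
// Without the padding, a warp reading tile[threadIdx.x][...] down a column
// maps 32 accesses to the same 32-bit bank; the extra element skews the rows
// so the accesses spread across banks. Launch with blockDim = (TILE, TILE).
#define TILE 32

__global__ void transpose_padded(const float * src, float * dst, int n) {
    __shared__ float tile[TILE][TILE + 1];  // +1 column of padding

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n) {
        tile[threadIdx.y][threadIdx.x] = src[y * n + x];
    }
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n) {
        dst[y * n + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free read
    }
}
```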
@etasnadi, I tried replacing

```diff
- cudaMemcpyToSymbol(dp, &p, sizeof(Params));
+ cudaMemcpyToSymbolAsync(dp, &p, sizeof(Params), 0, cudaMemcpyHostToDevice, ctx.stream());
```

and it actually works for sd.cpp.
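A minimal sketch of why the async variant matters (hypothetical names, not the PR's actual code): the kernel reads its parameters from a `__constant__` symbol, and a plain cudaMemcpyToSymbol() is only ordered on the default stream, so a kernel previously enqueued on the backend's own stream can still be reading the old parameters when the symbol is overwritten. Enqueueing the copy on the same stream as the launches restores the ordering.

```cpp
// Sketch under assumed names (Params, dp, conv2d_stub are illustrative).
#include <cuda_runtime.h>

struct Params { int iw, ih, kw, kh; };  // hypothetical parameter block
__constant__ Params dp;                 // device-side copy of the parameters

__global__ void conv2d_stub(float * dst) {
    // stand-in for the real kernel body, which would read dp.* throughout
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        dst[0] = (float) (dp.iw * dp.ih);
    }
}

static void launch(const Params & p, float * dst, cudaStream_t stream) {
    // Stream-ordered: runs after earlier kernels on `stream`, before later ones.
    cudaMemcpyToSymbolAsync(dp, &p, sizeof(Params), 0, cudaMemcpyHostToDevice, stream);
    conv2d_stub<<<1, 32, 0, stream>>>(dst);
}
```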
768x1024 sd1 fp16 vae:
Edit: it is within error of @bssrdf's implicit GEMM PR, for thermal reasons.
Interesting. For me, the Vulkan direct and indirect have the same perf, and CUDA is considerably slower in SD. I tested on a 2060. What command line are you using exactly?
should be https://huggingface.co/cyberdelia/latest_s15_models/blob/main/CyberRealistic_V9_FP16.safetensors
I probably need to rerun everything, to make sure I used the same resolution everywhere (likely) and because sd.cpp/ggml got updated in between.
For some reason, … Since @Green-Sky confirmed …
I intercepted the conv2d calls when executing your command line and added them as test cases to test-backend-ops. The Vulkan kernel is indeed faster in a few less important cases, but in the first 5 the CUDA kernel is somewhat faster. The additional test cases I added:

```cpp
// Stable-diffusion layers
std::map<std::string, uint32_t> idx_sd{
    { "iw",   0 },
    { "ih",   1 },
    { "kw",   2 },
    { "kh",   3 },
    { "Cout", 4 },
    { "Cin",  5 },
    { "B",    6 },
};

// Input image size
uint32_t w = 768;
uint32_t h = 1024;

// Number of filters (base)
uint32_t Cout_b = 128;
uint32_t Cin_b  = 128;

std::vector<std::array<uint32_t, 7>> cases_sd = {
    { w / 8, h / 8, 3, 3, Cout_b * 4, Cin_b * 4, 1 },  // x10 (called 10 times)
    { w / 4, h / 4, 3, 3, Cout_b * 4, Cin_b * 4, 1 },  // x7
    { w / 2, h / 2, 3, 3, Cout_b * 2, Cin_b * 2, 1 },  // x5
    { w,     h,     3, 3, Cout_b,     Cin_b,     1 },  // x5
    { w / 8, h / 8, 1, 1, Cout_b * 4, Cin_b * 4, 1 },  // x4
    { w / 8, h / 8, 1, 1, 4,          4,         1 },
    { w / 8, h / 8, 3, 3, Cout_b * 4, 4,         1 },
    { w / 2, h / 2, 3, 3, Cout_b * 4, Cin_b * 4, 1 },
    { w / 2, h / 2, 3, 3, Cout_b * 2, Cin_b * 4, 1 },
    { w / 2, h / 2, 1, 1, Cout_b * 2, Cin_b * 4, 1 },
    { w,     h,     3, 3, Cout_b,     Cin_b * 2, 1 },
    { w,     h,     1, 1, Cout_b,     Cin_b * 2, 1 },
    { w,     h,     3, 3, Cout_b * 2, Cin_b * 2, 1 },
    { w,     h,     3, 3, 3,          Cin_b,     1 },
};

for (auto act_case : cases_sd) {
    GGML_ASSERT(act_case[idx_sd["kw"]] == 3 || act_case[idx_sd["kw"]] == 1);
    GGML_ASSERT(act_case[idx_sd["kh"]] == 3 || act_case[idx_sd["kh"]] == 1);

    uint32_t p0 = act_case[idx_sd["kw"]] == 3 ? 1 : 0;
    uint32_t p1 = act_case[idx_sd["kh"]] == 3 ? 1 : 0;

    test_cases.emplace_back(new test_conv_2d(
        { act_case[idx_sd["iw"]], act_case[idx_sd["ih"]], act_case[idx_sd["Cin"]], act_case[idx_sd["B"]] },
        { act_case[idx_sd["kw"]], act_case[idx_sd["kh"]], act_case[idx_sd["Cin"]], act_case[idx_sd["Cout"]] },
        GGML_TYPE_F16, 1, 1, p0, p1, 1, 1, false));
}
```
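(With these cases added, the comparison can presumably be reproduced with the harness's perf mode filtered to the conv op, e.g. something like `test-backend-ops perf -o CONV_2D` against the CUDA and Vulkan builds; exact flags and paths may differ between revisions.)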
I think you all measured the execution time wrong. I manually merged @bssrdf's PR, and they created a different op for the implicit conv2d. I was aware of this because such algs (cuDNN, and https://zhuanlan.zhihu.com/p/661879423, which the implicit code is forked from) usually treat the channel dim as contiguous, and that is more efficient in most cases. It is still slower in some cases, but it can be similarly optimized with the tricks used in the Vulkan implementation.

implicit conv:

Vulkan translation:

Stable-diffusion decoding is also faster with the implicit conv (but it is still far from im2col+gemm):
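To make the channel-contiguous point above concrete, here is a rough CPU-side sketch of the implicit-GEMM view of conv2d (illustrative only, not the kernel from either PR; the NHWC-style layout and all names are assumptions): the GEMM's inner dimension K = Cin*KH*KW is decomposed into (c, kx, ky) on the fly, so no im2col buffer is materialized, and with the channel innermost the K loop walks contiguous memory for both input and weights.

```cpp
// Reference sketch of implicit-GEMM conv2d, NHWC-like layout, stride s, padding p.
#include <vector>

void conv2d_implicit_ref(const std::vector<float> & input,   // [N, IH, IW, Cin]
                         const std::vector<float> & weight,  // [Cout, KH, KW, Cin]
                         std::vector<float> & output,        // [N, OH, OW, Cout]
                         int N, int IH, int IW, int Cin,
                         int Cout, int KH, int KW,
                         int OH, int OW, int s, int p) {
    for (int n = 0; n < N; ++n)
    for (int oy = 0; oy < OH; ++oy)
    for (int ox = 0; ox < OW; ++ox)
    for (int co = 0; co < Cout; ++co) {
        float acc = 0.0f;
        for (int k = 0; k < Cin * KH * KW; ++k) {  // the implicit GEMM K loop
            const int c  = k % Cin;                // channel is the fastest dim
            const int kx = (k / Cin) % KW;
            const int ky =  k / (Cin * KW);
            const int ix = ox * s + kx - p;
            const int iy = oy * s + ky - p;
            if (ix < 0 || ix >= IW || iy < 0 || iy >= IH) continue;  // padding
            acc += input [((n  * IH + iy) * IW + ix) * Cin + c]
                 * weight[((co * KH + ky) * KW + kx) * Cin + c];
        }
        output[((n * OH + oy) * OW + ox) * Cout + co] = acc;
    }
}
```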
Pretty sure my sd.cpp VAE tables are correct. They are just very flaky (I need to repaste the device; it drops by maybe 20% perf over the first ~5 min or so).
The problem is not how you tested but how the implicit alg was plugged into ggml. There was a different op added for the implicit conv2d, so by default sd.cpp did not use the implicit alg. If you explicitly modified the code, then it is surprising to me that the implicit alg is not significantly faster than the Vulkan translation, because it uses warptiling and double buffering, and both could have a significant positive effect on perf. I created a branch in my repo where both algs are added, https://github.com/etasnadi/llama.cppxx/tree/conv2d-implicit, and you can switch between them by setting …
I added support for fp16. Please review. Thanks.