make ggml_conv_2d faster #483
Conversation
By the way, I believe that the matrix multiplication used in both the MUL_MAT and CONV_2D operations should be unified for easier optimization.
Nice speedup! I am thinking how to avoid the 2-step process. Do you need it just because the INIT step is not multi-threaded? I think it should be possible to reuse

Btw, I'm not 100% sure that the strategy of turning the Conv2D into a matrix multiplication is the best thing to do. I initially did it like this since mul mat is familiar, but we waste a lot of extra memory this way. Maybe there is something better to do here - not sure. On the other hand, if we reuse the mul mat implementation it would be great, because we won't have to write GPU code for Conv2D. Will need some time to figure out how to refactor and merge this.
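To make the memory concern concrete (a worked example of my own, not from the thread, with assumed layer shapes): im2col stores one unfolded patch per output pixel, so the buffer grows by roughly a kernel-size factor over the input activation.

```python
# Hypothetical shapes for one stable-diffusion-like conv layer, for
# illustration only: im2col keeps C_in*KH*KW floats per output pixel.
C_in, H, W = 512, 64, 64      # assumed input activation (C, H, W)
KH = KW = 3                   # 3x3 kernel; with stride 1, padding 1: OH=H, OW=W
input_floats  = C_in * H * W
im2col_floats = (C_in * KH * KW) * (H * W)
print(im2col_floats // input_floats)  # 9 -> about 9x the activation memory
```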
If the INIT step is multithreaded, then indeed there might be no need to split it into two stages.
Im2col is a common practice in many frameworks like Caffe, PyTorch, and others. However, these frameworks often implement additional optimization techniques such as Winograd, choosing between calculation methods based on the input format.
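For reference, here is a minimal NumPy sketch of the im2col-plus-GEMM lowering being discussed (my own illustration with assumed shapes, stride 1 and no padding; not the ggml code):

```python
import numpy as np

def im2col(x, kh, kw):
    # Unfold every kh x kw patch of x (C, H, W) into one column of a
    # (C*kh*kw, OH*OW) matrix; stride 1, no padding.
    C, H, W = x.shape
    OH, OW = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, OH * OW), dtype=x.dtype)
    row = 0
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                cols[row] = x[c, i:i + OH, j:j + OW].reshape(-1)
                row += 1
    return cols

def conv2d_gemm(x, w):
    # Stage 0: im2col. Stage 1: one GEMM against the flattened kernels.
    C_out, C_in, kh, kw = w.shape
    cols = im2col(x, kh, kw)
    out = w.reshape(C_out, -1) @ cols
    OH, OW = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    return out.reshape(C_out, OH, OW)
```

Building `cols` plays the role of CONV_2D_STAGE_0 and the single matrix product the role of CONV_2D_STAGE_1 in the split this PR proposes.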
I think for now we'll merge this as proposed. At some point, we have to make Stage 1 reuse the matrix multiplication operator, because that will bring a massive speed-up on Apple Silicon via the Accelerate framework.

@leejet Let me know if this is still the version you want to have upstream.
@ggerganov The code in this PR is up to date now. |
@leejet, could you please check whether there is a race condition? I tried to compare the results of the old conv2d with the new gemm-based one and got the same values single-threaded, but different values multithreaded (and I know for sure that the old code was correct). The new values are randomly 2560,... or 2240,..., or even 0.000000,... and only when OpenMP is enabled. With a single thread the results are the same.
@AngryLoki Because you changed the shared
Ouch, my mistake... Yes, the new version is correct.
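For readers following along, a tiny sketch of the failure mode (my own illustration, not the actual ggml code): if every thread unfolds patches into the same shared scratch buffer, one thread can overwrite it while another is still reading, so multithreaded results become nondeterministic; the fix is to give each thread a disjoint region to write.

```python
import threading

# Hypothetical illustration: two threads reuse ONE shared patch buffer.
scratch = [0.0] * 4                 # shared scratch -- the bug
out = [0.0] * 8

def worker(tid):
    for k in range(tid * 4, tid * 4 + 4):
        for i in range(4):
            scratch[i] = float(k)   # another thread may overwrite this...
        out[k] = sum(scratch)       # ...before we reduce, mixing patches

threads = [threading.Thread(target=worker, args=(t,)) for t in (0, 1)]
for t in threads: t.start()
for t in threads: t.join()
print(out)  # without the race each entry is 4*k; under contention runs differ
```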
I have a question. I see in stable-diffusion.cpp that it doesn't run on GPUs. Why does ggml_conv_2d not support the GPU? I wish someone could help me. What's holding it back? I think convolution is a pretty general operator. Thanks.
We will eventually support GPU implementations of the convolution operators, but it's not the focus yet. Contributions are welcome.

Sorry for delaying this merge for so long. After merging #547, I will focus on bringing this up to date and merging it.
I need some feedback to finish the implementation in #556 |
I would like to merge this. @PABannier It would be nice if you could test that this branch works for your cases, as it touches conv_2d and the change is not trivial. My SAM example still works, but I want to make sure nothing else broke.
Hey @ggerganov! I used the following two-step implementation for the

Yet, since neither encodec nor bark uses 2d convolution, I can't check this specific implementation of 2d convolution.
I would like to ask a question: would it be possible to go faster with Winograd convolution?
It might be faster, but I'm not sure; I haven't tried.
Im2col is quite fast on CUDA, but it consumes a lot of memory. I'm thinking of adding an option to split the input into tiles and process them sequentially to save memory when the input is very large. I would then concatenate the results of the matrix multiplications. However, it's somewhat challenging to perform the data shifting in ggml_mul_mat.

Some useful info: I'm trying this:

```python
def winograd_1d(data, filter):
    # Winograd F(2,3): a "valid" 3-tap 1D convolution that produces two
    # outputs per 4-element tile with 4 multiplications instead of 6.
    N = len(data)
    M = N - 2  # number of valid outputs
    b1 = filter[0] + filter[2]
    b2 = 0.5 * (b1 + filter[1])  # (g0 + g1 + g2) / 2
    b3 = 0.5 * (b1 - filter[1])  # (g0 - g1 + g2) / 2
    output = [0.0] * M
    i = 0
    while i + 1 < M:  # full tiles: outputs i, i+1 come from data[i..i+3]
        a1 = (data[i + 1] + data[i + 2]) * b2
        a2 = (data[i + 2] - data[i + 1]) * b3
        output[i] = (data[i] - data[i + 2]) * filter[0] + a1 + a2
        output[i + 1] = a1 - a2 - (data[i + 1] - data[i + 3]) * filter[2]
        i += 2
    if i < M:  # odd tail: direct formula for the final output
        output[i] = (data[i] * filter[0] + data[i + 1] * filter[1]
                     + data[i + 2] * filter[2])
    return output

# Example usage:
data = [1, 2, 3, 4, 5, 6, 7, 8]
filter = [0.1, 0.2, 0.3]
result = winograd_1d(data, filter)
print(result)
```
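As a sanity check (my addition, not part of the original comment), the Winograd output can be compared against a direct valid convolution; the two should agree up to floating-point rounding:

```python
# Direct 3-tap valid convolution, assuming the same correlation convention
# as winograd_1d above.
def direct_1d(data, filter):
    return [data[i] * filter[0] + data[i + 1] * filter[1] + data[i + 2] * filter[2]
            for i in range(len(data) - 2)]

print(direct_1d(data, filter))  # should match winograd_1d(data, filter)
```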
I've split the CONV_2D operation into two parts: im2col (CONV_2D_STAGE_0) and GEMM (CONV_2D_STAGE_1). I've enabled multi-threading for im2col and restructured the loops to make them more cache-friendly. For clarity, I've used more intuitive and general variable names.

As an example, take the decode_first_stage phase in stable-diffusion when generating a 512x512 image. For a detailed comparison, you can refer to the information provided here: leejet/stable-diffusion.cpp#30

before:

after:

The time for CONV_2D has been reduced from the original 43098.730 ms to 18442.602 ms (CONV_2D_STAGE_0 + CONV_2D_STAGE_1), about a 2.3x speedup.