Conv2D direct support #744


Merged: 13 commits, Aug 2, 2025

Conversation

@daniandtheweb (Contributor) commented Jul 28, 2025

This PR adds two runtime flags to enable the new Conv2D Direct implementation (#739) on the diffusion model and on the VAE model.

This new implementation improves VAE decoding speed and memory usage.

Tested on a Ryzen 7 4700U (Vega 7 iGPU).

Master
SD 1.5 - 512 x 512: vae compute buffer size 1664.00 MB (VRAM), 20.0s
SD XL - 1024 x 1024: ErrorOutOfDeviceMemory

PR
SD 1.5 - 512 x 512: vae compute buffer size 640.00 MB (VRAM), 6.13s
SD XL - 1024 x 1024: vae compute buffer size 2560.00 MB (VRAM), 26.78s

Unfortunately I won't be home for a while, so the only machine I have to test the changes on is this laptop. If anyone could post their test results here, it would help a lot.

@daniandtheweb changed the title from "Conv2DDirect for VAE stage" to "Conv2DDirect for VAE stage - Vulkan" on Jul 28, 2025
@stduhpf (Contributor) commented Jul 28, 2025

Just a thought, wouldn't it be better to just add a bool direct = false argument to the existing Conv2d to minimise duplicated code?
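
For illustration, such a flag could simply choose between the two ggml builders inside the underlying conv helper. A minimal sketch, assuming the direct builder takes the same arguments as ggml_conv_2d; the helper name and defaults are illustrative, not the code from this PR:

    #include "ggml.h"

    // Sketch: dispatch between the im2col-based conv and the direct kernel
    // based on a flag. Everything except ggml_conv_2d / ggml_conv_2d_direct
    // is illustrative.
    static ggml_tensor* nn_conv_2d(ggml_context* ctx,
                                   ggml_tensor* x,  // input  [W, H, C_in, N]
                                   ggml_tensor* w,  // kernel [KW, KH, C_in, C_out]
                                   ggml_tensor* b,  // optional bias [C_out]
                                   int s0 = 1, int s1 = 1,
                                   int p0 = 0, int p1 = 0,
                                   int d0 = 1, int d1 = 1,
                                   bool direct = false) {
        ggml_tensor* h = direct
                             ? ggml_conv_2d_direct(ctx, w, x, s0, s1, p0, p1, d0, d1)
                             : ggml_conv_2d(ctx, w, x, s0, s1, p0, p1, d0, d1);
        if (b != nullptr) {
            b = ggml_reshape_4d(ctx, b, 1, 1, b->ne[0], 1);  // broadcast bias over W and H
            h = ggml_add(ctx, h, b);
        }
        return h;
    }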

@wbruna (Contributor) commented Jul 28, 2025

SD XL - 1024 x1024 = ErrorOutOfDeviceMemory

This usually means it needs more than ~4 GB. I'd suggest testing with something like 576x1024, or maybe 640x1024.

Just a thought, wouldn't it be better to just add a bool direct = false argument to the existing Conv2d to minimise duplicated code?

Adding to that, a new overloaded constructor could help avoid all the new optional parameters at the call sites; e.g. something like:

    Conv2d(int64_t in_channels,
           int64_t out_channels,
           bool direct,
           std::pair<int, int> kernel_size,
           (...)

(although this is already asking for some kind of named parameter idiom).
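
One way to read that last remark, sketched with hypothetical names rather than the actual Conv2d interface: bundle the optional settings into a small options struct, so call sites only name the parameters they change.

    #include <cstdint>
    #include <utility>

    // Sketch only: illustrative names, not code from this PR.
    struct Conv2dOptions {
        std::pair<int, int> stride   = {1, 1};
        std::pair<int, int> padding  = {0, 0};
        std::pair<int, int> dilation = {1, 1};
        bool bias   = true;
        bool direct = false;
    };

    class Conv2d {
    public:
        Conv2d(int64_t in_channels,
               int64_t out_channels,
               std::pair<int, int> kernel_size,
               Conv2dOptions opts = {})
            : in_channels_(in_channels),
              out_channels_(out_channels),
              kernel_size_(kernel_size),
              opts_(opts) {}

    private:
        int64_t in_channels_;
        int64_t out_channels_;
        std::pair<int, int> kernel_size_;
        Conv2dOptions opts_;
    };

    // A call site then only spells out what it overrides (C++20 designated initializers):
    // Conv2d conv(320, 320, {3, 3}, {.padding = {1, 1}, .direct = true});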

@daniandtheweb (Contributor, Author)

Just a thought, wouldn't it be better to just add a bool direct = false argument to the existing Conv2d to minimise duplicated code?

Doing that right now. I just wanted it to be more explicit so the changes are easy to test.

@daniandtheweb (Contributor, Author)

I have reduced the duplicated code and added a check that should only allow the direct mode on Vulkan and CPU.
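
For context, such a guard might look roughly like this; a sketch based on matching the backend name, with the helper name being hypothetical and not necessarily how this PR implements the check:

    #include <cstring>
    #include "ggml-backend.h"

    // Sketch: restrict the direct path to backends where the kernel is known
    // to be implemented (CPU and Vulkan at the time of writing).
    static bool conv2d_direct_supported(ggml_backend_t backend) {
        const char* name = ggml_backend_name(backend);
        return std::strncmp(name, "CPU", 3) == 0 ||
               std::strncmp(name, "Vulkan", 6) == 0;
    }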

@daniandtheweb marked this pull request as ready for review on Jul 28, 2025 at 17:29
@daniandtheweb changed the title from "Conv2DDirect for VAE stage - Vulkan" to "Conv2D direct support" on Jul 28, 2025
@daniandtheweb changed the title from "Conv2D direct support" to "Conv2D direct support - WIP" on Jul 28, 2025
@wbruna (Contributor) commented Jul 28, 2025

Testing f5b5f5c:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7600 XT (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

SDXL:

| SD_CONV2D_DIRECT | Tiling | 576x1024 | 640x1024 | 1024x1024 |
| --- | --- | --- | --- | --- |
| Off | Off | 3.75G, 8.40s | OOM | OOM |
| On | Off | 1.50G, 2.18s | 1.65G, 2.37s | 2.59G, 3.78s |
| Off | On | 510M, 16.62s | 510M, 16.61s | 510M, 28.64s |
| On | On | 254M, 6.73s | 254M, 6.76s | 254M, 11.63s |

SD1.5:

| SD_CONV2D_DIRECT | Tiling | 512x512 | 512x768 | 640x960 |
| --- | --- | --- | --- | --- |
| Off | Off | 1.72G, 3.60s | 2.53G, 5.44s | 3.90G, 9.14s |
| On | Off | 734M, 0.94s | 1.65G, 1.41s | 2.59G, 2.12s |
| Off | On | 510M, 5.73s | 510M, 9.17s | 510M, 16.63s |
| On | On | 254M, 2.33s | 254M, 3.72s | 254M, 6.75s |

@daniandtheweb (Contributor, Author)

@Green-Sky what do you think about how I implemented the flags? It's still a WIP, I think I may have missed some Conv2D uses.

@Green-Sky (Contributor) commented Jul 30, 2025

@Green-Sky what do you think about how I implemented the flags? It's still a WIP, I think I may have missed some Conv2D uses.

Looks good.

--vae-conv-direct --diffusion-fa is the new goat.

@daniandtheweb (Contributor, Author) commented Jul 31, 2025

--vae-conv-direct --diffusion-fa is the new goat.

Do you think --diffusion-conv-direct should also be used for ESRGAN and ControlNet?
Also, I'm not an expert on how the Flux models work; does Flux use Conv2D at all? I can't seem to find any reference to it in the code.

@stduhpf (Contributor) commented Jul 31, 2025

I think Flux models only use Conv2D in the VAE.

@Green-Sky (Contributor)

Do you think --diffusion-conv-direct should also be used for ESRGAN and ControlNet?

I am not very familiar with those, but probably.

Also, I'm not an expert on how the Flux models work; does Flux use Conv2D at all? I can't seem to find any reference to it in the code.

It does not use conv2d, last time I checked. (I also included it in the benchmarks I did here)

@daniandtheweb changed the title from "Conv2D direct support - WIP" to "Conv2D direct support" on Jul 31, 2025
@daniandtheweb (Contributor, Author)

This probably still needs some cleanup, and I may have missed some cases, but it's almost ready.

@leejet (Owner) commented Jul 31, 2025

I think it would be better to automatically select whether to use ggml_conv2d or ggml_conv2d_direct within ggml_nn_conv_2d based on the shapes and sizes of the input and weight, as well as whether the backend supports it.
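
As an illustration of that idea (a sketch under assumed heuristics, not an actual implementation): the helper could consult a backend capability flag and a size heuristic before choosing the op.

    #include "ggml.h"

    // Sketch: `supports_direct` and the size threshold are placeholders for
    // whatever policy would actually be used.
    static ggml_tensor* nn_conv_2d_auto(ggml_context* ctx,
                                        ggml_tensor* x, ggml_tensor* w,
                                        int s0, int s1, int p0, int p1, int d0, int d1,
                                        bool supports_direct) {
        // Example heuristic: prefer the direct kernel for the large activations
        // typical of the VAE, where the im2col buffer dominates memory use.
        bool use_direct = supports_direct && (x->ne[0] * x->ne[1] >= 256 * 256);
        return use_direct
                   ? ggml_conv_2d_direct(ctx, w, x, s0, s1, p0, p1, d0, d1)
                   : ggml_conv_2d(ctx, w, x, s0, s1, p0, p1, d0, d1);
    }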

@daniandtheweb (Contributor, Author)

I've just tested this PR with the latest changes from llama.cpp's new conv2d PR, and it seems the slowdown in the diffusion process when using conv2d direct is almost gone (it's still slightly slower, but only minimally).

Applying conv2d direct to the upscaling process seems quite helpful.

conv2d

SD 1.5, 512x512 -> 2048x2048, ESRGAN, compute buffer size: 416.00 MB, 36 tiles, 2.41 s/tile

conv2d direct

SD 1.5, 512x512 -> 2048x2048, ESRGAN, compute buffer size: 164.19 MB, 36 tiles, 2.05 s/tile

@daniandtheweb (Contributor, Author) commented Jul 31, 2025

I think it would be better to automatically select whether to use ggml_conv2d or ggml_conv2d_direct within ggml_nn_conv_2d based on the shapes and sizes of the input and weight, as well as whether the backend supports it.

The idea behind this PR is to leave the choice to the user for now, given that Conv2D direct is new and there isn't enough data to know whether it will be faster for everyone or whether it may cause regressions. Once it's clear that this is an improvement for everyone, the logic can be moved into ggml-extend.cpp to select it automatically.

Isn't it a safer choice, at least for now?

@Green-Sky (Contributor)

The idea behind this PR is to leave the choice to the user for now, given that Conv2D direct is new and there isn't enough data to know whether it will be faster for everyone or whether it may cause regressions. Once it's clear that this is an improvement for everyone, the logic can be moved into ggml-extend.cpp to select it automatically.

Isn't it a safer choice, at least for now?

I agree. Current state:

CPU: generally slower in my tests, including VAE.
Vulkan: no gain in sampling, or worse (might be card dependent); good wins for VAE.
OpenCL: better in all cases (as far as tested, on limited hardware).
CUDA: missing (hopefully soon; probably slower in sampling, same as Vulkan).
SYCL: missing.
CANN: missing.
...: missing.

I think it would be better to automatically select whether to use ggml_conv2d or ggml_conv2d_direct within ggml_nn_conv_2d based on the shapes and sizes of the input and weight, as well as whether the backend supports it.

This is still hard, I think. It should probably be done in the ggml backends themselves.

@leejet (Owner) commented Aug 2, 2025

I updated the code and used a more concise approach to enable conv2d_direct.

@daniandtheweb (Contributor, Author) commented Aug 2, 2025

I updated the code and used a more concise approach to enable conv2d_direct.

That's definitely way cleaner, thanks.

@leejet merged commit 5b8996f into leejet:master on Aug 2, 2025
9 checks passed
@leejet (Owner) commented Aug 2, 2025

Thank you for your contribution.
