@wbruna commented Aug 4, 2025

This updates the embedded stable-diffusion.cpp code and applies the necessary changes to the API adapter.

I've tried to preserve the local changes to the codebase:

  • get_num_physical_cores got renamed to sd_get_num_physical_cores
  • UTF-8 filename handling on Windows
  • a few header file changes for third-party libs (json lib path, STB_IMAGE_ macros)
  • logging / LOG_DEBUG changes
  • automatic TAESD model path
  • a fixed LoRA instead of enabling LoRAs through the prompt
  • Kontext and Photomaker images passed directly
  • auto-disabling Photomaker for non-SDXL models
  • the fix for VAE tiling already applied
  • something else I'm probably forgetting

I've applied other minor changes, like formatting, to minimize the diff against sd.cpp mainline.

Right now, the code merely builds successfully, but I'll keep working on it over the next few days.

@LostRuins (Owner) commented

Thanks. I think most of the files should be relatively easy to upgrade, except for stable-diffusion.cpp itself, which has a few key points of divergence, mostly to do with disk access and image data passing.

@LostRuins LostRuins added the enhancement label on Aug 5, 2025
@wbruna (Author) commented Aug 6, 2025

The current code seems to be working for basic gens + LoRAs, except for an upstream issue with img2img + flash attention (leejet/stable-diffusion#756). I still need to test the more extensively changed code paths (Kontext, Photomaker, TAESD, ...).

I've tried to move some local changes from stable-diffusion.cpp to sdtype_adapter.cpp, to facilitate future updates. The Chroma detection got a bit more complex, so let me know if you prefer the previous approach.

For removing the LoRAs from the prompt, I've opted for emptying the LoRA list instead, to make the diff cleaner. That also makes a nice place to print a warning, in case the user tries to include a LoRA through the prompt.
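
A minimal sketch of that approach (the names are illustrative, not the actual sdtype_adapter.cpp symbols): warn about each LoRA tag collected from the prompt, then drop the list so only the fixed LoRA applies.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: warn about prompt-supplied LoRAs, then discard them.
static void drop_prompt_loras(std::vector<std::string> & prompt_loras) {
    for (const auto & name : prompt_loras) {
        fprintf(stderr, "warning: LoRA '%s' requested in the prompt is ignored; "
                        "load LoRAs through the launch options instead\n",
                name.c_str());
    }
    prompt_loras.clear(); // emptying the list keeps the diff against upstream small
}
```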

It may be worth changing the Kontext and Photomaker image lists to use a single image reference list, instead of one list per model. That could avoid further interface changes to support other edit models in the future, like InstructPix2Pix and CosXL. But of course, that could be done after this update.
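
For illustration, a single-list interface could look something like this (hypothetical types, not the current API): each reference image carries a role tag, so supporting a new edit model adds an enum value rather than another list.

```cpp
#include <vector>

// Hypothetical role tag: tells the backend how to use each reference image.
enum class ref_image_role { kontext, photomaker };

struct ref_image {
    ref_image_role role;
    const unsigned char * rgb; // raw RGB pixel data, not owned
    int width;
    int height;
};

// One list for all edit models, instead of one list per model family.
using ref_image_list = std::vector<ref_image>;
```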

@wbruna wbruna force-pushed the kcpp_update_sdcpp branch from f39ada1 to aabe701 on August 6, 2025 20:34
@wbruna (Author) commented Aug 7, 2025

Chroma isn't working with flash attention (black images), but it could be that same upstream issue. Maybe the safest option would be a separate flash attention control for image generation... I can add one later, along with configs for the new conv2d_direct optimizations.

I've also included here the fix for #1672, to facilitate the tests. Apart from that: SD1.5, SDXL, SD3.5, Chroma, Flux, Kontext, Photomaker, inpaint, LoRAs and TAESD seem to be working so far (Linux + Vulkan).

@wbruna wbruna force-pushed the kcpp_update_sdcpp branch from 6ebede9 to faad964 on August 8, 2025 13:56
@LostRuins (Owner) commented

I decided to test flash attention on Chroma in v1.97.1 (the latest release) with CUDA, and I also got a black square, so at least that appears to be a pre-existing issue for now.

@wbruna (Author) commented Aug 8, 2025

I've rebased onto 1.97.1, and folded the missing stuff into the original commit, still keeping the refactors separate in case we need to tweak them.

I could add a workaround to disable flash attention for Chroma. For img2img / inpaint it's trickier, since the flag is set at model loading time; but maybe it could be changed inside the ctx at inference time, like the VAE tiling flag.

@LostRuins (Owner) commented

We can do enhancements as a separate PR; let's get full feature parity and no regressions first.

@wbruna (Author) commented Aug 8, 2025

I took a look at enabling flash attention at inference time. It's doable, but we'd need to touch a lot of objects on the model tree to propagate the flag. Unless upstream is willing to switch to that approach (as was done for the enable_conv2d_direct flag), it may be best to simply disable flash attention for image gen until leejet/stable-diffusion.cpp#756 moves forward.
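
A sketch of what that guard could look like (the names and exact conditions are assumptions, not the actual adapter code):

```cpp
#include <cstdio>

// Hypothetical helper: force flash attention off for the known-bad cases
// until leejet/stable-diffusion.cpp#756 moves forward.
static bool effective_flash_attn(bool requested, bool is_chroma, bool is_img2img) {
    if (requested && (is_chroma || is_img2img)) {
        fprintf(stderr, "warning: flash attention disabled for image generation "
                        "(known upstream issue)\n");
        return false;
    }
    return requested;
}
```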

So, I believe the current code is ready to be reviewed.

@wbruna wbruna marked this pull request as ready for review on August 8, 2025 19:44
@wbruna wbruna force-pushed the kcpp_update_sdcpp branch from e50114a to 9bf5e31 on August 8, 2025 20:00
@wbruna wbruna force-pushed the kcpp_update_sdcpp branch from 9bf5e31 to 5f104d2 on August 9, 2025 13:08
@wbruna wbruna force-pushed the kcpp_update_sdcpp branch from 5f104d2 to 4d71dac on August 10, 2025 13:28
@wbruna (Author) commented Aug 10, 2025

I've fixed the pretty_progress throttling, added workarounds for the flash attention issues with Chroma and img2img, and cleaned up the workaround for clip_skip that got fixed upstream. I don't have much time right now to do full tests (Flux is really slow here :-) ), but I'll do it later.
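
For reference, that kind of throttling usually comes down to rate-limiting redraws on a monotonic clock; a minimal sketch, assuming that's all it does (the actual pretty_progress code in sd.cpp may differ):

```cpp
#include <chrono>

// Hypothetical gate in front of the progress redraw: skip updates that
// arrive within 100 ms of the previous one, but never skip the last step.
static bool should_redraw(int step, int steps) {
    using clock = std::chrono::steady_clock;
    static clock::time_point last{};
    const auto now = clock::now();
    if (step < steps && now - last < std::chrono::milliseconds(100)) {
        return false; // too soon since the previous redraw
    }
    last = now;
    return true;
}
```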

@LostRuins (Owner) commented

Thanks for everything so far.

@wbruna wbruna force-pushed the kcpp_update_sdcpp branch from 4d71dac to 1bd0b46 on August 11, 2025 11:24
@wbruna wbruna force-pushed the kcpp_update_sdcpp branch from 1bd0b46 to c433877 on August 11, 2025 14:42
@wbruna (Author) commented Aug 11, 2025

I redid the tests with the CLIP weights change, and just found a minor bug in the pretty_progress throttling. ROCm is working fine, too. As far as I can tell, it's ready!

@LostRuins (Owner) commented Aug 12, 2025

Tested SD1.5, SDXL, Flux, and Kontext; they seem OK so far.

Are you able to use SD3.5 files? I think some time back there was a regression that messed up SD3.5 for me; I also get a black square in the latest release, so it's not the fault of this PR.

Testing on this file https://huggingface.co/Comfy-Org/stable-diffusion-3.5-fp8/blob/main/sd3.5_medium_incl_clips_t5xxlfp8scaled.safetensors

It does work back in 1.78, when it was introduced.
1.80.3 is still working.
It is broken in 1.81.1.
Trying to figure out when and why it broke.

Edit: looks like it broke at 2a890ec

@wbruna (Author) commented Aug 12, 2025

At least SD3.5 medium is working for me. sd3.5m_turbo-Q4_K_M.gguf:

[image: test_sd3_turbo]

and sd3.5_medium-Q8_0.gguf:

[image: test_sd3_medium]

My sd3.5-large gguf is failing right now, but it's a failure to load, not garbage rendering; probably a messed up config.

Edit: using this VAE (I don't remember why I picked this one, I just found it again by the SHA256 hash).

@wbruna (Author) commented Aug 12, 2025

I can't get SD3.5-Large-Turbo-GGUF-mixed-sdcpp to load on Koboldcpp. And it's working fine on sd.cpp:

[image: test_sd3_large_sdcpp]

I think Koboldcpp was falling back to another previously working configuration during my tests :-(

(also downloading the sd3.5_medium_incl_clips_t5xxlfp8scaled.safetensors file to test it here)

@LostRuins (Owner) commented Aug 12, 2025

When I use TAESD I get a brown square; otherwise I get a black square:
leejet/stable-diffusion.cpp#560

Edit: this was on medium

@LostRuins (Owner) left a comment

Approving first; let's see if we can figure it out. If not, we can still merge it, since all the other stuff seems good to go.

@LostRuins (Owner) commented Aug 12, 2025

Okay, looking at your GGUF example, I think the fallback code is failing, so the tensor names are appended twice:
model.diffusion_model.model.diffusion_model...
Let me test in an older KoboldCpp first.

I think the solution is to check for such tensor names before allowing the prefix to be appended.
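
Something along these lines (an illustrative sketch, not the actual fix):

```cpp
#include <string>

// Only prepend the diffusion-model prefix when the tensor name doesn't
// already carry it, so a failed fallback can't produce
// "model.diffusion_model.model.diffusion_model..." names.
static std::string with_diffusion_prefix(const std::string & name) {
    static const std::string prefix = "model.diffusion_model.";
    if (name.compare(0, prefix.size(), prefix) == 0) {
        return name; // already prefixed
    }
    return prefix + name;
}
```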

@wbruna (Author) commented Aug 12, 2025

I did a few tests with sd3.5_medium_incl_clips_t5xxlfp8scaled on sd.cpp:

  • rocm, plain model: black image
  • rocm, model + external vae: black image
  • rocm, model + external vae + external t5xxl: works
  • vulkan, model + external t5xxl: works

The external t5xxl comes from t5xxl-Q4_K.gguf.

Edit: t5xxl-Q8_0.gguf works fine too.

@LostRuins (Owner) commented Aug 12, 2025

@wbruna, alright: after I added my fix for the tensor names, I was able to load the model.

However, the results are very weird. I don't know if I am doing it wrong or if I used bad files, since I loaded my own T5 and CLIP, using your SD3.5-Large-Turbo-GGUF-mixed-sdcpp.

Can you merge my changes from 7b5cf71 and check if it works fine for you?
Prompt: Cat with Hat

[image]

I think maybe I have a faulty T5, CLIP, or VAE. But if it works for you, I think that's good enough. Please merge and try my fix.

@LostRuins (Owner) commented

> vulkan, model + external t5xxl: works

Could you also check in sd.cpp:
vulkan, model + internal t5xxl (all in one; this fails for me)?

@wbruna (Author) commented Aug 12, 2025

> I think maybe I have a faulty T5, CLIP, or VAE. But if it works for you, I think that's good enough. Please merge and try my fix.

It works! This was with the just-pushed merge 9c039b2, on Vulkan:

[image: test_sd35m_kobold]

@LostRuins (Owner) commented

Alright, if there are no other issues, I will merge this now.

@wbruna (Author) commented Aug 12, 2025

> Could you also check in sd.cpp:
> vulkan, model + internal t5xxl (all in one; this fails for me)?

Black image for me too.

@LostRuins LostRuins merged commit 5de7ed3 into LostRuins:concedo_experimental Aug 12, 2025
@wbruna (Author) commented Aug 13, 2025

@LostRuins, the target sdmain is currently failing to build. I was able to fix a few issues, but then I got an undefined reference to stbi_load; and when I tried to fix that, I noticed conflicts between otherarch/sdcpp/thirdparty and vendor/stb due to version changes. Perhaps it'd be best to keep only a single copy, in vendor/stb?

@LostRuins (Owner) commented

Alright, I fixed the issues I could find, and it compiles fine for me. Please pull the latest experimental. Does it work for you now?

@wbruna (Author) commented Aug 13, 2025

Yeah, it builds fine now; thanks!

@LostRuins (Owner) commented

Btw, I've been keeping my eye on the upstream Qwen Image and WAN developments. Thanks to your PR, kcpp is now in sync with the latest API, so merging it shouldn't be too hard.

However, we'd probably want to think of a good approach for the API and frontend side, since video files will no doubt be massive. AVI is probably not the best format due to its size, though transcoding without FFMPEG (which cannot be used) will be tricky. Open to suggestions if you have any.

@wbruna (Author) commented Aug 31, 2025

> Btw, I've been keeping my eye on the upstream Qwen Image and WAN developments. Thanks to your PR, kcpp is now in sync with the latest API, so merging it shouldn't be too hard.

Yeah... The main issue is the needed ggml changes; I don't know how close leejet is to upstreaming them. And I'm already missing fixes on that branch (he's working on a version with that slow Vulkan build time 😢). Unrelated breakage could be an issue, too... But I suspect leejet already had Wan in mind for that big refactor, so it could be less of an issue now.

And I'm considering again that old idea of keeping an sd branch as an 'upstream' for Kobold. Two, actually: one branch just to pick up still-unapplied upstream PRs, and another with the specific tweaks for Koboldcpp. I kinda did that locally for the VAE tiling fixes, but making it 'official' would make the changes easier to track (since it doesn't look like we'll have those applied to sd master any time soon...).

> However, we'd probably want to think of a good approach for the API and frontend side, since video files will no doubt be massive. AVI is probably not the best format due to its size, though transcoding without FFMPEG (which cannot be used) will be tricky. Open to suggestions if you have any.

Koboldcpp can't embed libavcodec? (honest question; I know very little about video encoding. I confess I'm afraid of even trying video gen on my poor card 😅)

@LostRuins (Owner) commented

It's really big because it supports dozens of codecs. And it has multiple mixed licenses; LGPL is technically usable within AGPL, but I'd prefer to avoid it if I can. I only need ONE single codec that can be played in a browser, that's it.

@ehoogeveen-medweb commented

Would it be possible to use the CLI if it's available, or would that be too indirect? I already have ffmpeg on my PATH anyway for yt-dlp, for example.

@LostRuins (Owner) commented

Most likely, if I cannot find something suitable, the output will remain as MJPEG, which the clients can then convert on their side to whatever format they prefer.

@wbruna (Author) commented Aug 31, 2025

> It's really big because it supports dozens of codecs. And it has multiple mixed licenses; LGPL is technically usable within AGPL, but I'd prefer to avoid it if I can. I only need ONE single codec that can be played in a browser, that's it.

Maybe OpenH264? It's BSD licensed, and the binaries seem relatively small:
https://github.com/cisco/openh264/releases/tag/v2.6.0

@wbruna (Author) commented Sep 6, 2025

Wan support just got merged into master. There are a ton of changes, so it may be reasonable to just update the sd.cpp code for now, and add video gen support later.

Code-wise, the main issue is that the VAE tiling fixes conflict with some of the Wan changes; I asked stduhpf to update the PR. main.cpp also got a lot of changes, but the API itself didn't change that much, so sdtype_adapter may not need too many fixes right now.

@LostRuins (Owner) commented

Alright cool.

@wbruna (Author) commented Sep 7, 2025

I think I've found a way to better track those changes. I'm keeping a branch with the upstream code + pending PRs, which will act as a vendor branch:

https://github.com/wbruna/stable-diffusion.cpp/tree/koboldcpp_sd_base

And another branch with the needed local changes:

https://github.com/wbruna/stable-diffusion.cpp/tree/koboldcpp_sd_changes

The idea is to keep koboldcpp_sd_base up-to-date with upstream first, then merge it into koboldcpp_sd_changes.

Right now, that merge is just a first try; I'll review it later and open a PR, possibly including those FA changes that just arrived on master...
