@wbruna commented Aug 4, 2025

This updates the embedded stable-diffusion.cpp code and applies the necessary changes to the API adapter.

I've tried to preserve the local changes to the codebase:

  • get_num_physical_cores got renamed to sd_get_num_physical_cores
  • UTF-8 filename handling on Windows
  • a few header file changes for third-party libs (json lib path, STB_IMAGE_ macros)
  • logging / LOG_DEBUG changes
  • automatic TAESD model path
  • a fixed LoRA instead of enabling LoRAs through the prompt
  • Kontext and Photomaker images passed directly
  • auto-disabling Photomaker for non-SDXL models
  • the fix for VAE tiling already applied
  • something else I'm probably forgetting

I've applied other minor changes, like formatting, to minimize the diff against sd.cpp mainline.

Right now, the code merely builds successfully, but I'll keep working on it over the next few days.

@LostRuins (Owner) commented

Thanks. I think most of the files should be relatively easy to upgrade, except for stable-diffusion.cpp itself, which has a few key points of divergence, mostly to do with disk access and image data passing.

@LostRuins LostRuins added the enhancement label on Aug 5, 2025
@wbruna (Author) commented Aug 6, 2025

The current code seems to be working for basic gens + LoRAs, except for an upstream issue with img2img + flash attention (leejet/stable-diffusion#756). I still need to test the more extensively changed code paths (Kontext, Photomaker, TAESD, ...).

I've tried to move some local changes from stable-diffusion.cpp to sdtype_adapter.cpp, to facilitate future updates. The Chroma detection got a bit more complex, so let me know if you prefer the previous approach.

For removing the LoRAs from the prompt, I've opted for emptying the LoRA list instead, to make the diff cleaner. That also makes a nice place to print a warning, in case the user tries to include a LoRA through the prompt.
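
A minimal sketch of that approach (the names are illustrative, not the actual sdtype_adapter.cpp symbols): warn about each LoRA tag collected from the prompt, then drop the list so only the fixed LoRA applies.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: warn about prompt-supplied LoRAs, then discard them.
static void drop_prompt_loras(std::vector<std::string> & prompt_loras) {
    for (const auto & name : prompt_loras) {
        fprintf(stderr, "warning: LoRA '%s' requested in the prompt is ignored; "
                        "load LoRAs through the launch options instead\n",
                name.c_str());
    }
    prompt_loras.clear(); // emptying the list keeps the diff against upstream small
}
```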

It may be worth changing the Kontext and Photomaker image lists to use a single image reference list, instead of one list per model. That could avoid further interface changes to support other edit models in the future, like InstructPix2Pix and CosXL. But of course, that could be done after this update.
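
For illustration, a single-list interface could look something like this (hypothetical types, not the current API): each reference image carries a role tag, so supporting a new edit model adds an enum value rather than another list.

```cpp
#include <vector>

// Hypothetical role tag: tells the backend how to use each reference image.
enum class ref_image_role { kontext, photomaker };

struct ref_image {
    ref_image_role role;
    const unsigned char * rgb; // raw RGB pixel data, not owned
    int width;
    int height;
};

// One list for all edit models, instead of one list per model family.
using ref_image_list = std::vector<ref_image>;
```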

@wbruna wbruna force-pushed the kcpp_update_sdcpp branch from f39ada1 to aabe701 on August 6, 2025 20:34
@wbruna (Author) commented Aug 7, 2025

Chroma isn't working with flash attention (black images), but it could be that same upstream issue. Maybe the safest option would be a separate flash attention control for image generation... I can add one later, along with configs for the new conv2d_direct optimizations.

I've also included here the fix for #1672, to facilitate the tests. Apart from that: SD1.5, SDXL, SD3.5, Chroma, Flux, Kontext, Photomaker, inpaint, LoRAs and TAESD seem to be working so far (Linux + Vulkan).

@wbruna wbruna force-pushed the kcpp_update_sdcpp branch from 6ebede9 to faad964 on August 8, 2025 13:56
@LostRuins (Owner) commented

I decided to test flash attention on Chroma in v1.97.1 (the latest release) with CUDA, and I also got a black square, so at least that appears to be a pre-existing issue for now.

@wbruna (Author) commented Aug 8, 2025

I've rebased onto 1.97.1, and folded the missing stuff into the original commit, still keeping the refactors separate in case we need to tweak them.

I could add a workaround to disable flash attention for Chroma. For img2img / inpaint it's trickier, since the flag is set at model loading time; but maybe it could be changed inside the ctx at inference time, like the VAE tiling flag.

@LostRuins (Owner) commented

We can do enhancements as a separate PR; let's get full feature parity and no regressions first.

@wbruna (Author) commented Aug 8, 2025

I took a look at enabling flash attention at inference time. It's doable, but we'd need to touch a lot of objects on the model tree to propagate the flag. Unless upstream is willing to switch to that approach (as was done for the enable_conv2d_direct flag), it may be best to simply disable flash attention for image gen until leejet/stable-diffusion.cpp#756 moves forward.
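
A sketch of what that guard could look like (the names and exact conditions are assumptions, not the actual adapter code):

```cpp
#include <cstdio>

// Hypothetical helper: force flash attention off for the known-bad cases
// until leejet/stable-diffusion.cpp#756 moves forward.
static bool effective_flash_attn(bool requested, bool is_chroma, bool is_img2img) {
    if (requested && (is_chroma || is_img2img)) {
        fprintf(stderr, "warning: flash attention disabled for image generation "
                        "(known upstream issue)\n");
        return false;
    }
    return requested;
}
```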

So, I believe the current code is ready to be reviewed.

@wbruna wbruna marked this pull request as ready for review on August 8, 2025 19:44
@wbruna wbruna force-pushed the kcpp_update_sdcpp branch from e50114a to 9bf5e31 on August 8, 2025 20:00
@wbruna wbruna force-pushed the kcpp_update_sdcpp branch from 9bf5e31 to 5f104d2 on August 9, 2025 13:08
@wbruna wbruna force-pushed the kcpp_update_sdcpp branch from 5f104d2 to 4d71dac on August 10, 2025 13:28
@wbruna (Author) commented Aug 10, 2025

I've fixed the pretty_progress throttling, added workarounds for the flash attention issues with Chroma and img2img, and cleaned up the workaround for clip_skip that got fixed upstream. I don't have much time right now to do full tests (Flux is really slow here :-) ), but I'll do it later.
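
For reference, that kind of throttling usually comes down to rate-limiting redraws on a monotonic clock; a minimal sketch, assuming that's all it does (the actual pretty_progress code in sd.cpp may differ):

```cpp
#include <chrono>

// Hypothetical gate in front of the progress redraw: skip updates that
// arrive within 100 ms of the previous one, but never skip the last step.
static bool should_redraw(int step, int steps) {
    using clock = std::chrono::steady_clock;
    static clock::time_point last{};
    const auto now = clock::now();
    if (step < steps && now - last < std::chrono::milliseconds(100)) {
        return false; // too soon since the previous redraw
    }
    last = now;
    return true;
}
```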

@LostRuins (Owner) commented

Thanks for everything so far.

@wbruna wbruna force-pushed the kcpp_update_sdcpp branch from 4d71dac to 1bd0b46 on August 11, 2025 11:24
@wbruna wbruna force-pushed the kcpp_update_sdcpp branch from 1bd0b46 to c433877 on August 11, 2025 14:42
@wbruna (Author) commented Aug 11, 2025

I redid the tests with the CLIP weights change, and just found a minor bug in the pretty_progress throttling. ROCm is working fine, too. As far as I can tell, it's ready!

@LostRuins (Owner) commented Aug 12, 2025

Tested SD1.5, SDXL, Flux, and Kontext; they seem OK so far.

Are you able to use SD3.5 files? I think some time back there was a regression that messed up SD3.5 for me; I also get a black square in the latest release, so it's not the fault of this PR.

Testing on this file https://huggingface.co/Comfy-Org/stable-diffusion-3.5-fp8/blob/main/sd3.5_medium_incl_clips_t5xxlfp8scaled.safetensors

It does work back in 1.78, when it was introduced.
1.80.3 is still working.
It is broken in 1.81.1.
Trying to figure out when and why it broke.

Edit: looks like it broke at 2a890ec

@wbruna (Author) commented Aug 12, 2025

At least SD3.5 medium is working for me. sd3.5m_turbo-Q4_K_M.gguf:

[image: test_sd3_turbo]

and sd3.5_medium-Q8_0.gguf:

[image: test_sd3_medium]

My sd3.5-large gguf is failing right now, but it's a failure to load, not garbage rendering; probably a messed up config.

Edit: using this VAE (I don't remember why I picked this one, I just found it again by the SHA256 hash).

@wbruna (Author) commented Aug 12, 2025

I can't get SD3.5-Large-Turbo-GGUF-mixed-sdcpp to load on Koboldcpp. And it's working fine on sd.cpp:

[image: test_sd3_large_sdcpp]

I think Koboldcpp was falling back to another previously working configuration during my tests :-(

(also downloading the sd3.5_medium_incl_clips_t5xxlfp8scaled.safetensors file to test it here)

@LostRuins (Owner) commented Aug 12, 2025

When I use TAESD I get a brown square; otherwise I get a black square:
leejet/stable-diffusion.cpp#560

Edit: this was on medium

@LostRuins (Owner) left a comment

Approving first; let's see if we can figure it out. If not, we can still merge it, since all the other stuff seems good to go.

@LostRuins (Owner) commented Aug 12, 2025

Okay, looking at your GGUF example, I think the fallback code is failing, so the tensor names are appended twice:
model.diffusion_model.model.diffusion_model...
Let me test in an older KoboldCpp first.

I think the solution is to check for such tensor names before allowing the prefix to be appended.
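
Something along these lines (an illustrative sketch, not the actual fix):

```cpp
#include <string>

// Only prepend the diffusion-model prefix when the tensor name doesn't
// already carry it, so a failed fallback can't produce
// "model.diffusion_model.model.diffusion_model..." names.
static std::string with_diffusion_prefix(const std::string & name) {
    static const std::string prefix = "model.diffusion_model.";
    if (name.compare(0, prefix.size(), prefix) == 0) {
        return name; // already prefixed
    }
    return prefix + name;
}
```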

@wbruna (Author) commented Aug 12, 2025

I did a few tests with sd3.5_medium_incl_clips_t5xxlfp8scaled on sd.cpp:

  • rocm, plain model: black image
  • rocm, model + external vae: black image
  • rocm, model + external vae + external t5xxl: works
  • vulkan, model + external t5xxl: works

The external t5xxl comes from t5xxl-Q4_K.gguf.

Edit: t5xxl-Q8_0.gguf works fine too.

@LostRuins (Owner) commented Aug 12, 2025

@wbruna, alright: after I added my fix for the tensor names, I was able to load the model.

However, the results are very weird. I don't know if I am doing it wrong or if I used bad files, since I loaded my own T5 and CLIP, using your SD3.5-Large-Turbo-GGUF-mixed-sdcpp.

Can you merge my changes from 7b5cf71 and check if it works fine for you?
Prompt: Cat with Hat

[image]

I think maybe I have a faulty T5, CLIP, or VAE. But if it works for you, I think that's good enough. Please merge and try my fix.

@LostRuins (Owner) commented

> vulkan, model + external t5xxl: works

Could you also check in sd.cpp:
vulkan, model + internal t5xxl (all in one; this fails for me)?

@wbruna (Author) commented Aug 12, 2025

> I think maybe I have a faulty T5, CLIP, or VAE. But if it works for you, I think that's good enough. Please merge and try my fix.

It works! This was with the just-pushed merge 9c039b2, on Vulkan:

[image: test_sd35m_kobold]

@LostRuins (Owner) commented

Alright, if there are no other issues, I will merge this now.

@wbruna (Author) commented Aug 12, 2025

> Could you also check in sd.cpp:
> vulkan, model + internal t5xxl (all in one; this fails for me)?

Black image for me too.

@LostRuins LostRuins merged commit 5de7ed3 into LostRuins:concedo_experimental Aug 12, 2025
@wbruna (Author) commented Aug 13, 2025

@LostRuins, the target sdmain is currently failing to build. I was able to fix a few issues, but then I got an undefined reference to stbi_load; and when I tried to fix that, I noticed conflicts between otherarch/sdcpp/thirdparty and vendor/stb due to version changes. Perhaps it'd be best to keep only a single copy, in vendor/stb?

@LostRuins (Owner) commented

Alright, I fixed the issues I could find, and it compiles fine for me. Please pull the latest experimental. Does it work for you now?

@wbruna (Author) commented Aug 13, 2025

Yeah, it builds fine now; thanks!

@LostRuins (Owner) commented

Btw, I've been keeping my eye on the upstream Qwen Image and WAN developments. Thanks to your PR, kcpp is now in sync with the latest API, so merging it shouldn't be too hard.

However, we'd probably want to think of a good approach for the API and frontend side, since video files will no doubt be massive. AVI is probably not the best format due to its size, though transcoding without FFMPEG (which cannot be used) will be tricky. Open to suggestions if you have any.

@wbruna (Author) commented Aug 31, 2025

> Btw, I've been keeping my eye on the upstream Qwen Image and WAN developments. Thanks to your PR, kcpp is now in sync with the latest API, so merging it shouldn't be too hard.

Yeah... The main issue is the needed ggml changes; I don't know how close leejet is to upstreaming them. And I'm already missing fixes on that branch (he's working on a version with that slow Vulkan build time 😢). Unrelated breakage could be an issue, too... But I suspect leejet already had Wan in mind for that big refactor, so it could be less of an issue now.

And I'm considering again that old idea of keeping an sd branch as an 'upstream' for Kobold. Two, actually: one branch just to pick up still-unapplied upstream PRs, and another with the specific tweaks for Koboldcpp. I kinda did that locally for the VAE tiling fixes, but making it 'official' would make the changes easier to track (since it doesn't look like we'll have those applied to sd master any time soon...).

> However, we'd probably want to think of a good approach for the API and frontend side, since video files will no doubt be massive. AVI is probably not the best format due to its size, though transcoding without FFMPEG (which cannot be used) will be tricky. Open to suggestions if you have any.

Koboldcpp can't embed libavcodec? (honest question; I know very little about video encoding. I confess I'm afraid of even trying video gen on my poor card 😅)

@LostRuins (Owner) commented

It's really big because it supports dozens of codecs. And it has multiple mixed licenses; LGPL is technically usable within AGPL, but I'd prefer to avoid it if I can. I only need ONE single codec that can be played in a browser, that's it.

@ehoogeveen-medweb commented

Would it be possible to use the CLI if it's available, or would that be too indirect? I already have ffmpeg on my PATH anyway for yt-dlp, for example.

@LostRuins (Owner) commented

Most likely, if I cannot find something suitable, the output will remain as MJPEG, which the clients can then convert on their side to whatever format they prefer.

@wbruna (Author) commented Aug 31, 2025

> It's really big because it supports dozens of codecs. And it has multiple mixed licenses; LGPL is technically usable within AGPL, but I'd prefer to avoid it if I can. I only need ONE single codec that can be played in a browser, that's it.

Maybe OpenH264? It's BSD licensed, and the binaries seem relatively small:
https://github.com/cisco/openh264/releases/tag/v2.6.0

@wbruna (Author) commented Sep 6, 2025

Wan support just got merged into master. There are a ton of changes, so it may be reasonable to just update the sd.cpp code for now, and add video gen support later.

Code-wise, the main issue is that the VAE tiling fixes conflict with some of the Wan changes; I asked stduhpf to update the PR. main.cpp also got a lot of changes, but the API itself didn't change that much, so sdtype_adapter may not need too many fixes right now.

@LostRuins (Owner) commented

Alright cool.

@wbruna (Author) commented Sep 7, 2025

I think I've found a way to better track those changes. I'm keeping a branch with the upstream code + pending PRs, which will act as a vendor branch:

https://github.com/wbruna/stable-diffusion.cpp/tree/koboldcpp_sd_base

And another branch with the needed local changes:

https://github.com/wbruna/stable-diffusion.cpp/tree/koboldcpp_sd_changes

The idea is to keep koboldcpp_sd_base up-to-date with upstream first, then merge it into koboldcpp_sd_changes.

Right now, that merge is just a first try; I'll review it later and open a PR, possibly including those FA changes that just arrived on master...
