PictureColorDiffusion was born after multiple attempts to colorize 2D grayscale images, mainly manga and comics, by using and editing other open source projects (such as GAN models working on LAB channels). After poor results, I tried Stable Diffusion's img2img generation; ControlNet quickly joined the process, and I switched to txt2img generation. Once I had found settings that generally worked well across models, I decided to automate the generation with an application, as well as adding some options on the application side to improve the end result.
PictureColorDiffusion is a program that automates 2D colorization of drawings / manga / comics using Stable Diffusion's WebUI API, its interrogation feature, the ControlNet extension, and other features on the application side.
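To give an idea of what the application automates, the sketch below shows the kind of txt2img request with a ControlNet unit that can be sent to the AUTOMATIC1111 WebUI API. It is only an illustrative example: the endpoint and payload structure follow the public WebUI / ControlNet extension API (the image key name can differ between extension versions), while the prompt, preprocessor, and model names are placeholders, not the exact values PictureColorDiffusion uses.

```python
import base64
import requests

WEBUI_URL = "http://127.0.0.1:7860"  # assumes a local WebUI started with --api

# Encode the grayscale input image so ControlNet can use it as guidance.
with open("page.png", "rb") as f:
    input_image = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "prompt": "masterpiece, colored, 1girl",   # placeholder prompt
    "negative_prompt": "monochrome, greyscale",
    "steps": 25,
    "width": 512,
    "height": 768,
    "alwayson_scripts": {
        "controlnet": {
            "args": [{
                "image": input_image,
                "module": "lineart_anime",              # placeholder preprocessor
                "model": "control_v11p_sd15_lineart",   # placeholder ControlNet model name
                "weight": 1.0,
            }]
        }
    },
}

# The WebUI returns the generated image(s) as base64 strings.
r = requests.post(f"{WEBUI_URL}/sdapi/v1/txt2img", json=payload, timeout=600)
r.raise_for_status()
with open("colored.png", "wb") as f:
    f.write(base64.b64decode(r.json()["images"][0]))
```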
- AUTOMATIC1111 Stable Diffusion WebUI (can be run locally or remotely, like on Google Colab)
  - Needs to be run with the `--api` argument (a quick way to check this is shown in the sketch after this list).
- ControlNet extension for the Stable Diffusion WebUI
  - If you plan to use an SD model, you will need ControlNet SD models. For SDXL, there are no official models, but some were made by the community, like bdsqlsz or MistoLine (recommended)[^1].
  - For more information on how to install the ControlNet models and where to put them, please read their own Wiki.
- An SD / SDXL model related to 2D drawing or anime, preferably one trained on danbooru tags, like AOM3.
  - This model needs to be put into the `models\Stable-Diffusion` directory of the AUTOMATIC1111 Stable Diffusion WebUI.
- A VAE model, if there isn't one baked into the SD / SDXL model. For SD1.x based models like AOM3, the stabilityai mse-840000-ema VAE seems to give good results.
  - The VAE model needs to be put into the `models\VAE` directory of the AUTOMATIC1111 Stable Diffusion WebUI.
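A quick way to confirm that the WebUI was actually started with `--api` is to query one of its REST endpoints before launching PictureColorDiffusion. The sketch below is a minimal check, assuming a default local WebUI address; `/sdapi/v1/sd-models` is part of the standard WebUI API and only responds when the API is enabled.

```python
import requests

WEBUI_URL = "http://127.0.0.1:7860"  # adjust if the WebUI runs remotely (e.g. a Google Colab tunnel)

# /sdapi/v1/sd-models is only exposed when the WebUI was launched with --api.
try:
    models = requests.get(f"{WEBUI_URL}/sdapi/v1/sd-models", timeout=10).json()
    print("API reachable, available checkpoints:")
    for m in models:
        print(" -", m["title"])
except requests.RequestException as e:
    print("WebUI API not reachable, was it started with --api?", e)
```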
Tip
You can bypass the Stable Diffusion API endpoint verification in the application with the shortcut Ctrl+Shift+B. Keep in mind that some issues will arise if you do so: the colorization of images won't work, but you will be able to try your YoloV8 model by right-clicking on the inference button.
For AUTOMATIC1111 Stable Diffusion WebUI installation, please read their own Wiki.
For the configuration of the ControlNet Extension, please read their own Wiki.
You can download the latest build of PictureColorDiffusion by clicking here.
To run the application, unzip the `release.zip` file and execute `PictureColorDiffusion.exe`.
This is a list of features implemented directly in PictureColorDiffusion.
- Dynamic resizing of the image size depending on the selected mode.
- A bad-word filter applied to the interrogation model (deepdanbooru) output.
- Image segmentation on the input picture with a YoloV8 onnx model to keep parts of the original image in the output image (a minimal compositing sketch follows the note below). I've created an example model for detecting speech bubbles, available on huggingface.
Note
The application does not offer the possibility of targeting specific classes from a YoloV8 model during image segmentation.
All YoloV8 models must be placed in the `models` directory, located in the same directory as the executable. Only `onnx` models are supported.
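To illustrate the idea behind keeping parts of the original image, here is a minimal compositing sketch: wherever a segmentation mask is set, the original pixels are pasted back over the generated picture. It assumes you already have a boolean mask (for example, speech-bubble regions decoded from the YoloV8 output); it is not the exact code used by PictureColorDiffusion.

```python
import numpy as np
from PIL import Image

def keep_original_regions(original_path, generated_path, mask):
    """Wherever mask is True, keep the original image's pixels."""
    original = np.asarray(Image.open(original_path).convert("RGB"))
    generated = np.asarray(
        Image.open(generated_path).convert("RGB").resize(
            (original.shape[1], original.shape[0])))  # match sizes before compositing
    # mask: boolean array of shape (H, W), e.g. speech-bubble regions
    result = np.where(mask[..., None], original, generated)
    return Image.fromarray(result.astype(np.uint8))

# Hypothetical usage with a mask produced by your own YoloV8 decoding step:
# mask = decode_yolov8_segmentation("speech_bubbles.onnx", "page.png")  # placeholder helper
# keep_original_regions("page.png", "colored.png", mask).save("final.png")
```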
I tried to make every mode of the application produce reasonably good results with popular 2D/anime-related models from huggingface and civitai. In the end, I realized that the results seem to depend on the following:
- Has the SD / SDXL model been trained on colored images resembling your grayscale image & prompt?
  - Example: a grayscale comic with manga mode, but the model does not know manga-related words well enough.
- Does the PictureColorDiffusion mode you selected match your grayscale image?
  - Example: manga mode for a drawing could cause poor results and turn the drawing into a manga-like image.
There are some workarounds: you could train a LoRA with colored images of what you want specifically for your model, then use it via the additional prompt section of the application (format: `<lora:LORA_NAME_HERE:WEIGHT_HERE>`).
You can also use the additional prompt & negative prompt sections to add information about what you are trying to colorize.
Keeping the `Use interrogation` feature enabled can also help, as it automatically adds information about what you are trying to colorize.
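For reference, the interrogation the application relies on is exposed by the WebUI API itself. The sketch below is an assumed, minimal example of asking deepdanbooru for tags and filtering out unwanted words before reusing them as additional prompt information; the word list and file name are placeholders.

```python
import base64
import requests

WEBUI_URL = "http://127.0.0.1:7860"
BAD_WORDS = {"monochrome", "greyscale", "comic"}  # placeholder filter list

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# /sdapi/v1/interrogate returns a comma-separated caption of danbooru tags.
resp = requests.post(
    f"{WEBUI_URL}/sdapi/v1/interrogate",
    json={"image": image_b64, "model": "deepdanbooru"},
    timeout=120,
)
resp.raise_for_status()
tags = [t.strip() for t in resp.json()["caption"].split(",")]
kept = [t for t in tags if t not in BAD_WORDS]
print("Tags to append to the prompt:", ", ".join(kept))
```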
If your generated image is completely different from your input image, it means ControlNet probably wasn't used for the generation. You can easily check this by opening your web UI console and searching for errors.
The error typically ends with `Exception: ControlNet model [MODEL-NAME](StableDiffusionVersion.SDXL) is not compatible with sd model(StableDiffusionVersion.SD1x)` or something similar. In this example, the web UI is indicating that you are using a Stable Diffusion SD1.x model (`sd model(StableDiffusionVersion.SD1x)`), but that the ControlNet model you selected was made for SDXL (`ControlNet model [MODEL-NAME](StableDiffusionVersion.SDXL)`). ControlNet therefore failed to load, and the web UI continued without it, generating a picture completely different from the input image. Make sure that your ControlNet model supports the same version as your Stable Diffusion (SD) model.
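One way to double-check the pairing without digging through the console is to ask the API which checkpoint is active and which ControlNet models are installed, then compare their naming (SD1.x vs SDXL). The endpoints below come from the WebUI and the ControlNet extension; whether the version is obvious depends on how the model files were named.

```python
import requests

WEBUI_URL = "http://127.0.0.1:7860"

# Active Stable Diffusion checkpoint (WebUI API).
options = requests.get(f"{WEBUI_URL}/sdapi/v1/options", timeout=10).json()
print("Active SD checkpoint:", options.get("sd_model_checkpoint"))

# Installed ControlNet models (ControlNet extension API).
cn = requests.get(f"{WEBUI_URL}/controlnet/model_list", timeout=10).json()
for name in cn.get("model_list", []):
    print("ControlNet model:", name)
```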
I'm not sure of the exact cause of this issue, but I've concluded that some community-made ControlNet models are more compatible with SDXL Pony-based models than others. During my tests, the best results I achieved were with MistoLine[^1].
Why do objects detected by YOLOv8 occasionally show up duplicated in the output results when using an SDXL mode?
I encountered this issue while testing the `MangaXL` mode with my YOLOv8 model for speech bubble segmentation. I concluded that the combination of certain SDXL models and ControlNet models causes ControlNet to attempt to recreate the object in an incorrect position. For reference, this issue often occurred with bdsqlsz models but rarely happened with MistoLine[^1]. So using a different ControlNet model that is compatible with SDXL should resolve the issue.
Footnotes

[^1]: MistoLine is an SDXL ControlNet model that can adapt to any type of line art input, which means it can be used for multiple modules (anime_denoise, canny, etc.). From some quick tests, I've found that this model tends to produce outputs that are closer to the original image compared to others. It also performs slightly better with SDXL Pony-based models.