Generate caption/TEnc-only LoRAs easily for influencing model CLIP bias / creating associations (e.g. removing girls, gaining SFW results, reducing lowres bias), proof of concept + current method included #295
SwiftIllusion started this conversation in Ideas
Replies: 1 comment
-
Got to do another experiment, this time seeing what working against the "lowres" TEnc bias (training via the above process on a "lowres" caption) could do, to prove the concept further. Comparisons: "portrait of man waving, wearing casual clothing" without a negative prompt in 'LifelikeDiffusion', another comparison with negatives, and "doorstep" with another model with negatives.
-
Through experiments I've found a method to improve people's outputs and control over a model without having to influence the positive prompt/output by attempting to sway it with negative prompts (which can then be left to affect the result in more meaningful ways), hopefully increasing the utility and control people have over their generations.
However, the current method involves a lot of steps and tools to get the final LoRA, and though I don't understand what's required to generate these LoRAs, an easier method could make this a new, accessible tool for people to use.
Demonstrations of potential without ever adjusting the negative prompt:
Removing female bias
"night sky" in Anythingv3.1, without LoRAs / with "girl"-captioned LoRA 0.8 / with "girl"-captioned LoRA 0.8 + "girls"-captioned LoRA 1.1
"man, flower garden" in Anythingv3.1, without LoRAs / with "girl"-captioned LoRA 0.1 + "girls"-captioned LoRA 0.3
Favoring SFW output by creating associations
I put a black bar over their parts, but they're still NSFW, so I'm linking to an imgur gallery of the images instead for these:
https://imgur.com/a/R5zJrbE
In order of comparison, with a LoRA trained on multiple captions: "man" and "woman" (to try to isolate their concept/dilute it away from all the captions that likely pair man/woman with naked), and "person-wearing-clothes" at double the repeats (to build a stronger association that people should more commonly be wearing clothes); a sketch of that folder layout follows after these comparisons.
"sexy woman, beautiful, standing next to desk" in ChilloutMix, without LoRAs / with LoRA 1.4
"elegant woman" in ChilloutMix, without LoRAs / with LoRA 1.2
"muscular man, standing" in ChilloutMix, without LoRAs / with LoRA 1.3
"woman standing" in LifeLikeDiffusion, without LoRAs / with LoRA 0.8
Distancing associations
"rabbit" in MareAcernis (the most female-biased model I've used), without LoRAs / with "person" LoRA -0.3 + "people" LoRA -0.25 + "girl" LoRA 1
Current method:
At the moment it takes many steps and going back and forth between having the kohya_ss GUI running and the automatic1111 GUI running. The instructions are simplified just to explain what I'm doing and aren't meant to teach.
I did this with just 50 repeats and 1 epoch, and that was much more than necessary every time.
Now you have a LoRA you can use to influence your output. Most importantly, it only influences what your prompt is searching for/what it thinks is relevant; it does not take away the quality or style of anything that does end up in the image (that still, of course, depends on the UNet training of the model to produce it). This can be used on any model, and the only difference will be the weight values you choose, since they depend on the bias of each individual model.
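As a sanity check on that claim, here is a minimal sketch that inspects an extracted LoRA (hypothetical filename, kohya safetensors format assumed) and reports how much of it sits on the text encoder versus the UNet, via the "lora_te_"/"lora_unet_" key prefixes.

```python
# Sketch: count text-encoder vs. UNet tensors in a kohya-format LoRA file.
# A caption/TEnc-only LoRA should be dominated by (or contain only) lora_te_ keys.
from safetensors import safe_open

with safe_open("girl_caption_lora.safetensors", framework="pt") as f:
    keys = list(f.keys())

te_keys = [k for k in keys if k.startswith("lora_te_")]
unet_keys = [k for k in keys if k.startswith("lora_unet_")]
print(f"text-encoder tensors: {len(te_keys)}")
print(f"unet tensors:         {len(unet_keys)}")
```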
Idea:
The finetuning of the model takes less time than generating an image, and extracting the LoRA difference takes a little more than that, but the whole process of moving between all the interfaces and steps, preparing/changing the captions to train on in the directories, etc., when I'm still not always clear on which caption will get me to my end goal, is a journey.
I don't understand what is required for a LoRA to be built (would it require a reference point if it's just influencing the CLIP, etc.?), so I'm not sure how much can be accomplished.
However, if there were a way to generate this directly, especially in sd-webui-additional-networks (noting again that the 'training' part of this process took less time than an image generation), that would be incredible. Even just being able to generate a LoRA in isolation on the TEnc (with adjustable repeats to make the starting strength flexible), so you don't have to go through all the steps required to prepare training and extract a model difference, would make the process a lot more approachable; a rough sketch of that idea is below.
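To make the idea concrete, here is a rough sketch of what "a LoRA in isolation on the TEnc" could boil down to: take the weight difference between a briefly finetuned text encoder and the original, and low-rank factor it into LoRA up/down matrices, never touching the UNet. Everything here is an assumption for illustration (the module filtering, the key naming, the hypothetical finetuned-encoder path), not how sd-scripts actually implements extraction.

```python
# Sketch: build a text-encoder-only LoRA from the difference between a base
# CLIP text encoder and a briefly finetuned copy. Key naming follows kohya's
# "lora_te_..." convention but is illustrative, not a drop-in implementation.
import torch
from safetensors.torch import save_file
from transformers import CLIPTextModel

RANK = 8  # LoRA dimension for the low-rank factorisation

base = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
tuned = CLIPTextModel.from_pretrained("path/to/finetuned-text-encoder")  # hypothetical path

state = {}
for (name, w_base), (_, w_tuned) in zip(base.named_parameters(), tuned.named_parameters()):
    # Only the 2-D linear weights (attention/MLP projections) get LoRA pairs.
    if w_base.dim() != 2 or "embeddings" in name:
        continue
    delta = (w_tuned.detach() - w_base.detach()).float()
    # Low-rank approximation of the weight change: delta ~ up @ down.
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    up = (U[:, :RANK] * S[:RANK]).contiguous()   # (out_features, rank)
    down = Vh[:RANK, :].contiguous()             # (rank, in_features)
    key = "lora_te_" + name.replace(".weight", "").replace(".", "_")
    state[f"{key}.lora_up.weight"] = up
    state[f"{key}.lora_down.weight"] = down
    state[f"{key}.alpha"] = torch.tensor(float(RANK))

save_file(state, "caption_bias_te_only.safetensors")
```

If something along these lines ran right after the quick caption finetune, the GUI round trips and the full model-difference extraction could be skipped.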