What does the R in LoRA mean? And why tweaking it is cool! #37
-
Wow thank you, this is incredibly helpful! I will add this to the README in the Guides and Tips section!! Thank you so much for the awesome experiments and results again!
-
@brian6091 would you be kind enough to let me use one of your scaling parameter comparison figures in the README for this repo?
-
@brian6091 Wow, amazing work! Could you tell us roughly how your training times differed between Dreambooth and LoRA?
-
Awesome insights! How were you able to do this in @cloneofsimo's repo?
And how can I decrease it with a linear schedule?
-
Just wanted to add this to the discussion on the learning rate: each column represents 1000 steps.
-
Far too much stuff to input compared to Dreambooth. Where is a 2.x LoRA notebook for Colab?
-
I am very sad this was never made for 2.x on Colab, as it is far superior to Kohya's version.
-
So @cloneofsimo recently accepted a pull request that allows changing the rank of the LoRA approximation. I thought I'd kick off some discussion about what the rank parameter is, and what it allows us to do. Long story short, the compression (space savings) you get with LoRA may be even crazier than you thought (we're talking from ~3.7 GB to just over 1 MB!).
As nicely explained in the README, given some pre-trained weight matrix $W \in \mathbb{R}^{n \times m}$, we seek to avoid training $W$ directly, and instead adjust it using another matrix $\Delta W$ that is the product of two low-rank matrices: $\Delta W = A B^T$, where $A \in \mathbb{R}^{n \times r}$, $B \in \mathbb{R}^{m \times r}$, and $r \ll n$. So how should we choose the rank $r$? I wasn't sure, so I wanted to find out how low we could go. Intuitively, reducing $r$ will result in loss of information, but perhaps we don't need all of it when we're inserting a small number of objects/concepts into such a huge model.
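To make the bookkeeping concrete, here is a minimal PyTorch sketch of the update for a single weight matrix. This is just an illustration of the math above, not the repo's actual implementation; the dimensions are arbitrary:

```python
import torch

n, m, r = 768, 768, 4         # arbitrary example dimensions; r is the LoRA rank

W = torch.randn(n, m)          # frozen pre-trained weight
A = torch.randn(n, r) * 0.01   # trainable low-rank factor
B = torch.zeros(m, r)          # trainable low-rank factor (one factor is typically
                               # zero-initialized so that ΔW starts at zero)

delta_W = A @ B.T              # rank-r update, shape (n, m)
alpha = 0.8                    # scale factor applied to ΔW (more on this below)
W_adapted = W + alpha * delta_W

# Full fine-tuning trains n*m values per matrix; LoRA trains only r*(n + m).
print(n * m, r * (n + m))      # 589824 vs 6144
```

Decreasing $r$ shrinks both factors linearly, which is where the file-size numbers below come from.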
So I ran a small experiment to compare LoRA with Dreambooth-style fine-tuning. Here's the setup for all training:
For Dreambooth:
For LoRA:
The first figure below gives an overview of representative results (at around 2400 iterations for all models). The first two columns represent outputs for base SDv1-5 across a few different prompts. The first column is just a test of whether the token I chose produced anything coherent (it does not seem to), and the second shows images produced when using the actress's name. The following six columns compare Dreambooth and LoRA with different rank approximations. Recall that the impact of $\Delta W$ can be adjusted with a scale factor $\alpha$, one each for the UNET and the text encoder, so I just fix both at 0.8 here. Finally, the bottom row is a test for bleeding onto another person.
full-size image here for the pixel-peepers
The first take-home is that all the fine-tunings did a reasonable job, producing images that more closely resembled Caterina Murino than what the pre-trained model produced. The second is that decreasing the rank $r$ doesn't degrade quality very much at all. I ran $r=1$ expecting it to do terribly, but I was shocked. I mean, we've reduced $\Delta W$ to the outer product of two skinny vectors!
To get a sense of how insane that is, here is a table showing the tally and percentages of parameters being trained (relative to the total that could be trained). Since Dreambooth trains all parameters of the UNET and text encoder, it gets 100%. Note that for all the LoRA configurations tested, we are training less than 0.5% of the potentially trainable parameters (across the UNET and text encoder). In the case of $r=1$, we are training just 0.03% of the trainable parameters! This translates directly into crazy efficiency, with the combined weights totalling 1.145 MB, again 0.03% of the ~3.75 GB needed to store a Dreambooth fine-tuning.
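If you want to sanity-check these percentages yourself, a rough way is to compare the model's total parameter count against $r \cdot (n + m)$ summed over the adapted matrices. The sketch below is hypothetical and assumes LoRA is attached to every `nn.Linear`; the repo actually only adapts the attention projections, so the real numbers are smaller still:

```python
import torch.nn as nn

def lora_vs_full(model: nn.Module, r: int):
    """Rough trainable-parameter comparison (assumes LoRA on every nn.Linear)."""
    full = sum(p.numel() for p in model.parameters())   # Dreambooth trains all of these
    lora = sum(
        r * (mod.out_features + mod.in_features)         # r*(n + m) per adapted matrix
        for mod in model.modules()
        if isinstance(mod, nn.Linear)
    )
    return full, lora

# e.g. for the UNET of a Stable Diffusion pipeline:
# full, lora = lora_vs_full(pipe.unet, r=1)
# print(f"LoRA trains {100 * lora / full:.3f}% of the parameters")
```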
Now, the caveat is that really comparing the quality of Dreambooth and LoRA outputs requires further experiments. That's because I didn't try at all to optimize training, so nothing is matched for validation loss, etc. That said, I found the Dreambooth results somewhat better in general (although Keanu is somewhat less himself with Dreambooth). It seemed like texture and photorealism were slightly but consistently better, but this might just be the result of not tweaking the training for all the models. It's probably worth doing a deeper comparison, at least across different ranks in LoRA (it's so easy to keep all the checkpoints around!).
Also, I did not play very much with the scaling parameters, which the outputs seem to be quite sensitive to. So here are a few figures to look a bit more closely at these parameters in LoRA. Recall that we've got two scale parameters that we can adjust, one each for the UNET and the text encoder. The next two figures show images generated for prompts from the first figure, sweeping over both scale parameters at two different ranks (a sketch of what such a sweep looks like at inference time follows the figure links below). The final figure is again Keanu Reeves, and reassuringly, he remains Keanu Reeves despite having applied $\Delta W$ when generating images of him.
full-size image here
full-size image here
full-size image here
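For anyone who wants to reproduce this kind of sweep, here is a rough sketch of what it can look like at inference time. Treat the helper names (`monkeypatch_lora`, `tune_lora_scale`), the weight file paths, and the prompt/token as assumptions based on the repo's README at the time of writing, not as the exact code I ran:

```python
import torch
from diffusers import StableDiffusionPipeline
from lora_diffusion import monkeypatch_lora, tune_lora_scale

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Inject the trained low-rank factors into the UNET and the text encoder
monkeypatch_lora(pipe.unet, torch.load("lora_weight.pt"))
monkeypatch_lora(
    pipe.text_encoder,
    torch.load("lora_weight.text_encoder.pt"),
    target_replace_module=["CLIPAttention"],
)

prompt = "a photo of <token> woman"  # placeholder prompt/token
for unet_scale in (0.2, 0.5, 0.8, 1.0):
    for text_scale in (0.2, 0.5, 0.8, 1.0):
        tune_lora_scale(pipe.unet, unet_scale)          # alpha for the UNET
        tune_lora_scale(pipe.text_encoder, text_scale)  # alpha for the text encoder
        image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
        image.save(f"unet{unet_scale}_text{text_scale}.png")
```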
Anyways, I hope this gives you some sense of the range for exploration we get from @cloneofsimo's brilliant insight. There are one or two other hyperparameters I want to pull out for all of us to play with, so stay tuned. In case you want to play with the notebook I used for fine-tuning, you can find it in this Github repository, or follow the links directly:
Notebook for training with either Dreambooth or Low-rank Adaptation (LoRA), link to repo
Thanks for reading, and let me know if you are interested in seeing anything else!