What does the R in LoRA mean? And why tweaking it is cool! #37
-
Wow thank you, this is incredibly helpful! I will add this to the README in the Guides and Tips section!! Thank you so much for the awesome experiments and results again!
-
@brian6091 would you be kind enough to let me use one of your scaling parameter comparison figures in the README for this repo?
-
@brian6091 Wow, amazing work! Could you tell us roughly how your training times differed between Dreambooth and LoRA?
-
Awesome insights! How were you able to do this in @cloneofsimo's repo?
And how can I decrease it with a linear schedule?
-
Just wanted to add this to the discussion on the learning rate: each column represents 1000 steps.
-
Far too much stuff to input compared to Dreambooth. Where is a 2.x LoRA notebook for Colab?
-
I am very sad this was never made for 2.x on Colab, as it is far superior to Kohya's version.
-
So @cloneofsimo recently accepted a pull request that allows changing the rank of the LoRA approximation. I thought I'd kick off some discussion about what the rank parameter is, and what it allows us to do. Long story short, the compression (space savings) you get with LoRA may be even crazier than you thought (we're talking from ~3.7 GB to just over 1 MB!).
As nicely explained in the README, given some pre-trained weight matrix $W \in \mathbb{R}^{n \times m}$, we seek to avoid training $W$ directly, and instead adjust it using another matrix $\Delta W$ that is the product of two low-rank matrices: $\Delta W = A B^T$, where $A \in \mathbb{R}^{n \times r}$, $B \in \mathbb{R}^{m \times r}$, and $r \ll n$. So how should we choose the rank $r$? I wasn't sure, so I wanted to find out how low we could go. Intuitively, reducing $r$ will result in loss of information, but perhaps we don't need all of it when we're inserting a small number of objects/concepts into such a huge model.
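To make the bookkeeping concrete, here is a minimal PyTorch sketch of the update for a single weight matrix. This is just an illustration of the math above, not the repo's actual implementation; the dimensions are arbitrary:

```python
import torch

n, m, r = 768, 768, 4         # arbitrary example dimensions; r is the LoRA rank

W = torch.randn(n, m)          # frozen pre-trained weight
A = torch.randn(n, r) * 0.01   # trainable low-rank factor
B = torch.zeros(m, r)          # trainable low-rank factor (one factor is typically
                               # zero-initialized so that ΔW starts at zero)

delta_W = A @ B.T              # rank-r update, shape (n, m)
alpha = 0.8                    # scale factor applied to ΔW (more on this below)
W_adapted = W + alpha * delta_W

# Full fine-tuning trains n*m values per matrix; LoRA trains only r*(n + m).
print(n * m, r * (n + m))      # 589824 vs 6144
```

Decreasing $r$ shrinks both factors linearly, which is where the file-size numbers below come from.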
So I ran a small experiment to compare LoRA with Dreambooth-style fine-tuning. Here's the setup for all training:
For Dreambooth:
For LoRA:
The first figure below gives an overview of representative results (at around 2400 iterations for all models). The first two columns represent outputs for base SDv1-5 across a few different prompts. The first column is just a test of whether the token I chose produced anything coherent (it does not seem to), and the second shows images produced when using the actress's name. The following six columns compare Dreambooth and LoRA with different rank approximations. Recall that the impact of $\Delta W$ can be adjusted with a scale factor $\alpha$, one each for the UNET and the text encoder, so I just fix both at 0.8 here. Finally, the bottom row is a test for bleeding onto another person.
full-size image here for the pixel-peepers
The first take-home is that all the fine-tunings did a reasonable job, producing images that more closely resembled Caterina Murino than what the pre-trained model produced. The second is that decreasing the rank $r$ doesn't degrade quality very much at all. I ran $r=1$ expecting it to do terribly, but I was shocked. I mean, we've reduced $\Delta W$ to the outer product of two skinny vectors!
To get a sense of how insane that is, here is a table showing the tally and percentages of parameters being trained (relative to the total that could be trained). Since Dreambooth trains all parameters of the UNET and text encoder, it gets 100%. Note that for all the LoRA configurations tested, we are training less than 0.5% of the potentially trainable parameters (across the UNET and text encoder). In the case of $r=1$, we are training just 0.03% of the trainable parameters! This translates directly into crazy efficiency, with the combined weights totalling 1.145 MB, again 0.03% of the ~3.75 GB needed to store a Dreambooth fine-tuning.
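If you want to sanity-check these percentages yourself, a rough way is to compare the model's total parameter count against $r \cdot (n + m)$ summed over the adapted matrices. The sketch below is hypothetical and assumes LoRA is attached to every `nn.Linear`; the repo actually only adapts the attention projections, so the real numbers are smaller still:

```python
import torch.nn as nn

def lora_vs_full(model: nn.Module, r: int):
    """Rough trainable-parameter comparison (assumes LoRA on every nn.Linear)."""
    full = sum(p.numel() for p in model.parameters())   # Dreambooth trains all of these
    lora = sum(
        r * (mod.out_features + mod.in_features)         # r*(n + m) per adapted matrix
        for mod in model.modules()
        if isinstance(mod, nn.Linear)
    )
    return full, lora

# e.g. for the UNET of a Stable Diffusion pipeline:
# full, lora = lora_vs_full(pipe.unet, r=1)
# print(f"LoRA trains {100 * lora / full:.3f}% of the parameters")
```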
Now, the caveat is that really comparing the quality of Dreambooth and LoRA outputs requires further experiments. That's because I didn't try at all to optimize training, so nothing is matched for validation loss, etc. That said, I found the Dreambooth results somewhat better in general (although Keanu is somewhat less himself with Dreambooth). It seemed like texture and photorealism were slightly but consistently better, but this might just be the result of not tweaking the training for all the models. It's probably worth doing a deeper comparison, at least across different ranks in LoRA (it's so easy to keep all the checkpoints around!).
Also, I did not play very much with the scaling parameters, which the outputs seem to be quite sensitive to. So here are a few figures to look a bit more closely at these parameters in LoRA. Recall that we've got two scale parameters that we can adjust, one each for the UNET and the text encoder. The next two figures show images generated for prompts from the first figure, sweeping over both scale parameters at two different ranks (a sketch of what such a sweep looks like at inference time follows the figure links below). The final figure is again Keanu Reeves, and reassuringly, he remains Keanu Reeves despite having applied $\Delta W$ when generating images of him.
full-size image here
full-size image here
full-size image here
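For anyone who wants to reproduce this kind of sweep, here is a rough sketch of what it can look like at inference time. Treat the helper names (`monkeypatch_lora`, `tune_lora_scale`), the weight file paths, and the prompt/token as assumptions based on the repo's README at the time of writing, not as the exact code I ran:

```python
import torch
from diffusers import StableDiffusionPipeline
from lora_diffusion import monkeypatch_lora, tune_lora_scale

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Inject the trained low-rank factors into the UNET and the text encoder
monkeypatch_lora(pipe.unet, torch.load("lora_weight.pt"))
monkeypatch_lora(
    pipe.text_encoder,
    torch.load("lora_weight.text_encoder.pt"),
    target_replace_module=["CLIPAttention"],
)

prompt = "a photo of <token> woman"  # placeholder prompt/token
for unet_scale in (0.2, 0.5, 0.8, 1.0):
    for text_scale in (0.2, 0.5, 0.8, 1.0):
        tune_lora_scale(pipe.unet, unet_scale)          # alpha for the UNET
        tune_lora_scale(pipe.text_encoder, text_scale)  # alpha for the text encoder
        image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
        image.save(f"unet{unet_scale}_text{text_scale}.png")
```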
Anyways, I hope this gives you some sense of the range for exploration we get from @cloneofsimo's brilliant insight. There are one or two other hyperparameters I want to pull out for all of us to play with, so stay tuned. In case you want to play with the notebook I used for fine-tuning, you can find it in this Github repository, or follow the links directly:
Notebook for training with either Dreambooth or Low-rank Adaptation (LoRA), link to repo
Thanks for reading, and let me know if you are interested in seeing anything else!