LoRA Fine Tuning #82
base: main
Conversation
This is just the first draft so we can start building this feature.

- Added dataloader.py, which loads data for training
- Added train.py, with the current training loop
- Added lora.py, for the LoRA wrapper of the stage 1 Transformer
- Added a dummy_dataset folder with 25 data samples to work with when testing (VCTK --> p311)
- Commented out the initial inference code that runs when the stage 1 model is built

There is no batch processing in the training loop currently (was getting some dimension mismatching in KVCache.update, probably not that difficult to solve). The dataloader works fine, but everything else requires some work. This is just an initial draft so we can start working on this thing together! :-)

Hope to hear some insights! I have time to put into this feature, so any pointers would be great! Am not super well-versed in AI, so bear with me.
I'm going to try getting a very basic thing to run. Currently, there are a couple issues:
Can make it clean once we've got a basic loop running successfully
I think the main thing is that it looks like you're trying to finetune the whole network end-to-end from first stage -> second stage -> mbd -> deepfilternet with L2 loss... I would recommend just finetuning the first stage using next-token prediction loss... (ref: https://github.com/karpathy/nanoGPT/blob/master/model.py#L184)...
It also looks like you might be trying to finetune through a file load operation, which would kill the gradient (https://github.com/metavoiceio/metavoice-src/pull/82/files?file-filters%5B%5D=.py&file-filters%5B%5D=dotfile&show-viewed-files=true#diff-ed183d67207df065a11e1289f19d34cc2abbc5448dea952683cfe9728c342b95R270)?
For this, you'll need to change your data loader slightly to output only the first two hierarchies of the encodec tokens in a flattened interleaved manner.
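For illustration, a minimal sketch of what that flattening/interleaving could look like, assuming the Encodec tokens come out as a (num_codebooks, T) tensor (any vocab offsetting the first stage applies to the second hierarchy is not shown here):

```python
import torch

def flatten_first_two_hierarchies(encodec_tokens: torch.Tensor) -> torch.Tensor:
    """encodec_tokens: (num_codebooks, T) -> 1D tensor of length 2*T,
    ordered as [h1_t0, h2_t0, h1_t1, h2_t1, ...]."""
    first_two = encodec_tokens[:2]        # keep only the first two hierarchies
    return first_two.t().reshape(-1)      # interleave along time, then flatten
```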
Last thing is - why are you trying to finetune? If it's mostly for generalisation to a new speaker, it should be sufficient to finetune only the first stage...
dataloader.py (outdated)
wav = wav.mean(axis=0, keepdims=True)
wav = wav.unsqueeze(0)  # Add batch dimension
wav = wav.unsqueeze(0)  # Add channel dimension
This would cause an error if wav.ndim == 2?
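For what it's worth, a hedged sketch of handling both cases before adding the batch dimension (assuming torchaudio-style loading where wav arrives as either (time,) or (channels, time)):

```python
import torch

def to_batched_mono(wav: torch.Tensor) -> torch.Tensor:
    """Normalise to (1, 1, time) whether wav arrives as (time,) or (channels, time)."""
    if wav.ndim == 1:
        wav = wav.unsqueeze(0)                 # (time,) -> (1, time)
    if wav.ndim == 2 and wav.shape[0] > 1:
        wav = wav.mean(dim=0, keepdim=True)    # downmix to mono -> (1, time)
    return wav.unsqueeze(0)                    # add batch dim -> (1, 1, time)
```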
dataloader.py (outdated)
# Padding for text tokens
text_tokens = pad_sequence(text_tokens, batch_first=True, padding_value=0)

# Audio waveform - padding to longest waveform on 4th dimension
Why do we have 4 dimensions? Shouldn't it be either (batch_size, time) or (batch_size, channels=1, time)?
Since we only need the Encodec tokens for training stage 1, I have removed the audio waveforms from the dataloader.
dataloader.py (outdated)
MBD_SAMPLE_RATE = 24000

MAX_DURATION_IN_SECONDS = 15
MAX_INPUT_LENGTH = int(MBD_SAMPLE_RATE * MAX_DURATION_IN_SECONDS)
Where is this being used?
Nowhere. I just removed it in the latest commit :-)
fam/llm/fast_inference_utils.py (outdated)
guidance_scale=torch.tensor(3.0, device=device, dtype=precision),
end_of_audio_token=9999,  # don't end early for compilation stage.
)
# y = generate(
Why comment this out? How do you test the model without this?
I'll add this back in; it was only commented out for faster initialization on my local setup (slow GPU)
train.py (outdated)
print("Setting model to training mode...")
self.model.train()  # Set the model to training mode
self.llm_second_stage.model.train()  # Set the model to training mode
We probably don't need to train this stage at all if you're only adapting speaker identities
Agree. Removed stage 2 in latest commit :-)
Got you. Are you not currently having to load all the models onto GPU VRAM though? Given the constraint of 16GB of VRAM, I would probably:
Hey @vatsalaggarwal, these are all very helpful insights - much appreciated! I will implement the points you mentioned:
Makes sense to just work with stage 1 with cross entropy loss - I just wasn't able to get there, so thanks for that insight. Should be able to commit these changes later today or tomorrow!
Not sure what you mean, but does https://github.com/karpathy/nanoGPT/blob/master/model.py#L184 help?
It massively helped - I've not worked with transformer networks before (only diffusion & GANs), so studying nanoGPT after you sent it today was very helpful! Am just making a few more changes for today, then I'm pushing an update :-)
Just committed - here's an overview:

- Switched from Adam to the SGD optimizer
- Modified the DataLoader to return the first two encodec token hierarchies as a flattened, interleaved tensor (let me know if that looks ok to you?)
- Modified the LoRA wrapper to only fine tune the speaker_cond_pos layer. In nanoGPT-LoRA only the causal attention layer is fine tuned (https://github.com/danielgrittner/nanoGPT-LoRA/blob/master/model.py#L128) - would it be worth trying something similar? (A rough sketch of the kind of LoRA layer I mean follows below.)
- Modified the training loop to forward pass with entire batches at a time. Loss calculation doesn't work yet; I need to match the GT labels with the generated probabilities. Need some direction here.
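For reference, the kind of LoRA layer I mean is roughly this - a sketch in the spirit of nanoGPT-LoRA; names and hyperparameters are illustrative, not the exact ones in this PR:

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16, dropout: float = 0.05):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # freeze the original weights
        self.lora_a = nn.Parameter(torch.zeros(r, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.lora_a, a=math.sqrt(5))  # B stays zero -> no-op at init
        self.scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.dropout(x) @ self.lora_a.t() @ self.lora_b.t() * self.scaling
```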
EDIT 1: After sleeping on it, I think it is an issue with the way the GT Encodec tokens are extracted and prepared in the DataLoader which is causing the problem, and not the format of the model output.

EDIT 2: Was wondering if you have more detailed documentation about the model architecture / diagrams to help further understanding?
I might start to understand a little bit here. Would the idea be to generate a single token at a random location for each batch by giving the model:

So we would concat

Then

This would then give us prediction tensor for the batch

Then the GT labels would be

Is that the right intuition? If so, would the GT Encodec labels be determined by the raw output of the EncodecModel.encode function, or do we need to run it through the first stage adapter to get the correct vocab indices? If you're not sure, I'll see if I can find out.

Again, thanks a lot for pointing me in the right direction, it's very helpful given my limited knowledge!
Almost trains, I have just made a mistake somewhere causing:

RuntimeError: Trying to backward through the graph a second time

I'm guessing it's because we need to do all the preprocessing in the dataloader rather than in the training loop. Let me know any thoughts. It's getting close :-)
I've modified the training loop to be as I described above, and it seems to work (although I'm not 100% sure that the GT Encodec indices are correctly determined). Somewhere I have made a mistake, causing tensors to be retained in the computational graph across batches, which leads to the backward error above. Let me know any thoughts! :-)

The script can be launched with mixed precision using accelerate now.
Moved loss calculation to LoRA wrapper model. Modified training loop to be similar to that of nanoGPT. This involves using a sliding window for prompts & labels, which should more accurately replicate what the model is actually producing at logits level. If my intuition is wrong about this, please correct me.
Still some work to do in the dataloader to ensure proper windows (aka blocks in nanoGPT) of X,Y data are prepared for good training. Right now useless data might be used due to randomly selecting padded zero values. This can be fixed in the dataloader, or I might end up creating a get_batch function similar to that of nanoGPT. I haven't been able to solve the error which occurs when calling .backward for the 2nd time. I'm sure it's a very obvious error I've made, but I have not been able to isolate it.
I think I've isolated the backward error (see above) to the caching mechanism in the model. Will work on a solution today, and then we should be able to have something training. EDIT: It is definitely the caching mechanism. I just got it to run when clearing and detaching the K,V tensors in KVCache between iterations.
in the middle of finishing something, haven't had time to look at this, will do it soon, sorry!
Completely understand, no pressure! I'm just posting updates for whenever you have the time :-)
the updates are super helpful though!
I have the model training the LoRA layers now, but the data preparation process is currently garbage and I'm probably also calculating the loss with unit mismatch between the prompt input, inference output, and GT labels. Will play with this for a bit, but I might need some help correctly interpreting the "units" of the variables here so we can prepare them correctly for the prompt as well as for the loss function. Gonna draw some inspiration from nanoGPT and improve the data preparation, and then we're getting close to running some tests.
The model is fine tuning now. Not correctly, but it is fine tuning. Data must be prepared correctly now. But first, Attention and KVCache must be modified to be compatible with processing batches where batch size > 1.
Training loop is pretty clean now, and all data preparation is now done in the dataloader.py file. Loss becomes "nan" when entries in a batch have a lot of variance between each other (e.g. one entry had to be padded a lot during collation due to a big difference in lengths of either the prompt or encodec token tensors, or both). The issue could perhaps be solved by grouping data points of similar length together to avoid a lot of padding. Would love to hear any thoughts here.
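One way to do that grouping, as a sketch (assuming we can compute a per-item token length up front; names here are illustrative):

```python
import random

def length_bucketed_batches(lengths, batch_size, shuffle=True):
    """Group dataset indices into batches of similar length to minimise padding."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])   # sort indices by length
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    if shuffle:
        random.shuffle(batches)   # shuffle batch order, keep within-batch similarity
    return batches
```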
b is batch size, t is timesteps, vocab_size is vocab_size... the hierarchies are flattened and interleaved into one for the first stage model
does my previous comment help?
KVCaching isn't relevant during training, that should be switched off...
this shouldn't be used during training
Do you mean for the second iteration? Have you zeroed gradients?
Didn't get this... best thing might be to work through the nanoGPT training loop... but the rough idea is to create a row of shape (B, S) which contains all the text and audio tokens concatenated together. Then you apply next-token prediction. This does the right thing because of causal masking.
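A rough sketch of that loss, assuming padded positions are marked with a padding id that gets masked out (tensor names are illustrative):

```python
import torch.nn.functional as F

def next_token_loss(logits, tokens, pad_id):
    """logits: (B, S, vocab_size); tokens: (B, S) rows of text+audio tokens concatenated."""
    shift_logits = logits[:, :-1, :]                 # position t predicts token t+1
    shift_labels = tokens[:, 1:].clone()
    shift_labels[shift_labels == pad_id] = -100      # no loss on padding
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```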
Thanks for explaining, that makes sense. I've implemented it, similar to NanoGPT. Pushing momentarily.
Data loading should be more memory optimized, but this runs. Gonna run some tests to ensure this correctly trains the LoRA. Might wanna test LoRAs on different layers.
Corrected mistake in data preparation. Will start training some LoRAs now and see if this fine tuning code is correctly set up.
The LoRA layer (1st layer in model) is 98k trainable parameters. Will try training now to validate the current code.
Disabled Accelerate for now. Properly aligned all dtypes between the model and dataloader. Previously loaded speaker embeddings were not converted to the correct dtype. Copied most of the nanoGPT-LoRA training parameters. Renamed "epochs" to "iters" like in nanoGPT.
update to further mimic the nanoGPT-LoRA training process
Currently sweeping learning rate and LoRA rank, alpha, dropout. Graphs look like this so far:

I'm unsure whether the data is fed into the model properly @vatsalaggarwal. I measure very high loss on the frozen foundational model (loss 8-12) when evaluating it, but the audio sounds just fine. During training of the LoRA layers I can reduce the loss down to 4-5, indicating to me that the LoRAs work but that the data is formatted differently than how it was formatted when training the foundational model. The data is prepared like in nanoGPT, but with audio tokens added after the text tokens.

During training:

- input_pos is always set to torch.arange(0, block_size), since we are sliding a block_size window over the data to get the input

Is this the same way that the foundational model was trained @vatsalaggarwal? I would just expect the loss to start out very low, given its ability to generalize extremely well. Were there any nuances when you trained the model that this doesn't account for?
I think @lucapericlp is close to turning this into a working solution (without LoRA), so might be better to add LoRA to that... hopefully it should be out by EOD. The way you mentioned works for training text-only models, but we had some trouble training text+audio models that way... so the data formats etc. are slightly different; hopefully the upcoming PR should clarify!
That is great to hear! I am very interested to see how the solution looks. Should quickly be able to follow up with added LoRAs to @lucapericlp's solution once pushed. Can run some sweeps to find us the best configuration as well :-) Thanks!
Eagerly awaiting this as well! Was fun reading the progress here too @danablend :)
@danablend / @makorihi check out #93 (review)
Have just been reading through it! I'll get it up on my system and add LoRAs to this as soon as possible, hopefully today or tomorrow! This is great - cheers @lucapericlp @vatsalaggarwal.
@danablend I think the main thing to keep in mind is: each row is composed of "text tokens | audio tokens | padding"... I think you were taking arbitrary segments of these (per NanoGPT), as that's what they do in text-LLM trainings, but it doesn't work so great for text->speech training...
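A sketch of building one such row under those assumptions (block_size and the helper name are illustrative; pad_id assumes the 2048 padding token mentioned below):

```python
import torch

def build_row(text_tokens: torch.Tensor, audio_tokens: torch.Tensor,
              block_size: int, pad_id: int = 2048) -> torch.Tensor:
    """One training row: text tokens | audio tokens | padding, up to block_size."""
    row = torch.cat([text_tokens, audio_tokens])
    if row.numel() > block_size:
        row = row[:block_size]                      # or skip over-long examples entirely
    pad = torch.full((block_size - row.numel(),), pad_id, dtype=row.dtype)
    return torch.cat([row, pad])
```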
ah, sorry, I misunderstood 😅
I believe padding is appended in the data preparation step here. Padding token is 2048, as set here.
I have gotten LoRAs to train based off of @lucapericlp's awesome work. Gonna clean it up and prepare for review. Probably going to happen over the weekend.
NICE! what sort of losses are you getting, and audio? Did you use https://github.com/huggingface/peft, or did you wrap your own as in this PR in the end?
I've seen losses down at around 0.42 with 15m parameters, just on the last block's attention layer. I haven't actually written the code that loads the model with the added LoRA layers yet, so I'm going to get some audio samples once that's there. I took the LoRA from earlier, which is an adaptation of the one from nanoGPT-LoRA. Would you prefer if I used the Peft library from HF? Could probably change it without too much hassle. It's not so much extra code, now that the fine tuning & data loading has been cracked.
@lucapericlp is the LoRA whiz... what do you think we should do?
That sounds pretty good!! Would just be worried about overfitting, but otherwise excited to hear the samples!!
Re LoRA suggestions, from my experience, the two most impactful factors when comparing vs a full param finetune were:
With those insights, I was able to get within a < 1.0pp "accuracy" degradation on the task I was training on. Increasing rank does improve performance (we started with 8) but, I had found, has diminishing returns & results in marginal gains vs the other levers. Alpha, in that case, was kept static as per the paper (iirc), and we found following that heuristic worked well. Might be useful: I had posted this thread that has articles I found helpful (even though some were published after & replicated my private findings).
@lucapericlp we were also wondering if it’s worth integrating with PEFT or is the current way of doing things fine? cf #82 (comment)
I'd say using PEFT is preferable, purely from the standpoint of reducing the codebase footprint for "standard" ops where possible.
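For context, the PEFT route is fairly compact; a hedged sketch (the target_modules names are placeholders and would need to match the first-stage model's actual attention projection layers):

```python
from peft import LoraConfig, get_peft_model

# Assumes `model` is the loaded stage 1 transformer; target_modules below are hypothetical.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["wqkv", "wo"],
)
model = get_peft_model(model, lora_config)   # wraps the matching nn.Linear layers with LoRA
model.print_trainable_parameters()
```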
@danablend any news?