How is it possible to further tune GPT-2 (or GPT) in a seq2seq manner? #1464
Comments
Hi, this is on our mid-term roadmap (seq2seq models).
@Hannabrahman In the original GPT-2 paper (section 3.7, Translation) the authors used the format "english sentence = french sentence" to produce translations. You can definitely fine-tune the model with the existing scripts if you structure your seq2seq data this way.
@dvaltchanov and @thomwolf thanks for pointing this out to me. Should we mask out the loss on the source tokens, or does it not matter if we include the source tokens' loss in our total loss?
@Hannabrahman Based on my tests, it doesn't matter if you include them. Your total loss will be higher, but you're mainly interested in the validation loss on the translations anyway. As long as you use the "start of text" and "end of text" tokens to wrap your "sequence = sequence" text, the model seems to be able to figure it out after a little bit of fine-tuning.
@dvaltchanov Thanks. 1- Should I add special tokens ([SOS], a separator token between source and target, [EOS]) and train on instances formatted like "[SOS] source sentence [SEP] target sentence [EOS]"?
2- The instances in my dataset have different lengths (60-85 tokens). I either have to trim them to the same size (which is not really good for my use case) or pad them to the same length. However, I read somewhere in this repo that GPT and GPT-2 don't handle right padding. How did you solve this issue while fine-tuning GPT on your own use case and dataset? Many thanks in advance.
@Hannabrahman Great questions: yes, I wrapped each pair as a single string with the special tokens, along the lines of "[SOS] english sentence = french sentence [EOS]". This then gets tokenized and truncated to the max length. This will allow the model to learn variable-length sequences. You can accomplish the same effect by concatenating all of your text into a single string and sampling sections of it. However, if you do this the model will learn associations between neighbouring samples over multiple epochs, so I recommend having something that shuffles the order of concatenated samples each epoch. During generation you prompt with "[SOS] something in English = " and stop generating when it produces an [EOS] token.
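Not @dvaltchanov's actual code, but a minimal sketch of the kind of dataset described above. It assumes the Hugging Face GPT-2 tokenizer; the [SOS]/[EOS] markers and the "source = target" layout are illustrative choices, not library defaults.

```python
import torch
from torch.utils.data import Dataset
from transformers import GPT2Tokenizer


class TranslationLMDataset(Dataset):
    """Wraps (source, target) pairs as '[SOS] source = target [EOS]' strings."""

    def __init__(self, pairs, max_len=128):
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        # Register the illustrative special tokens; remember to call
        # model.resize_token_embeddings(len(tokenizer)) on the model afterwards.
        self.tokenizer.add_special_tokens(
            {"additional_special_tokens": ["[SOS]", "[EOS]"]}
        )
        self.pairs = pairs
        self.max_len = max_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        src, tgt = self.pairs[idx]
        text = f"[SOS] {src} = {tgt} [EOS]"
        ids = self.tokenizer.encode(text)[: self.max_len]  # truncate to max length
        input_ids = torch.tensor(ids)
        # Plain LM objective: labels are the inputs; the model shifts them
        # internally when computing the loss. Batching variable-length examples
        # still needs a padding collate_fn (or batch_size=1), since GPT-2 has
        # no pad token by default.
        return {"input_ids": input_ids, "labels": input_ids.clone()}
```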
@dvaltchanov Do you have your custom data loader code somewhere so that I can take a look?
@Hannabrahman See my edited response above. I hope my clarification helps.
@dvaltchanov Thanks. Basically you followed the same approach as in here. They read all the input into one long string and then truncate it to max_len. However, that script doesn't do any sampling or shuffling.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, is there a seq2seq example of GPT2 now?
Hi, any updates?
Hi everyone, given that Alpaca (a decoder-only model like GPT) was trained in a seq2seq manner, I realised we can learn from their code (cheers to OS!).

Approach

The naive solution is to concatenate the source and target strings. However, the main issue here is that the loss is incurred on the next-word prediction of the source strings. To circumvent this, Alpaca simply ignored the loss on the source strings. Concretely:
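A minimal sketch of this loss-masking idea (not Alpaca's verbatim code), assuming the Hugging Face GPT-2 tokenizer and the IGNORE_INDEX = -100 convention, i.e. the label value that PyTorch's cross-entropy loss skips:

```python
import copy

import torch
from transformers import GPT2Tokenizer

IGNORE_INDEX = -100  # labels with this value are skipped by the LM loss
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")


def build_example(source: str, target: str):
    # Concatenate source and target into a single causal-LM training sequence.
    text = source + target + tokenizer.eos_token
    input_ids = torch.tensor(tokenizer.encode(text))
    labels = copy.deepcopy(input_ids)
    # Mask out the source positions so only the target contributes to the loss.
    source_len = len(tokenizer.encode(source))
    labels[:source_len] = IGNORE_INDEX
    return {"input_ids": input_ids, "labels": labels}
```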
Note how the source string's loss is ignored by setting those label positions to IGNORE_INDEX.

Implications

Seq2Seq prompting. In concatenating the source and target strings, it may not be obvious to the model how to differentiate the source from the target strings. I suspect that Alpaca/self-instruct circumvented this by making the differentiation clear via prompts:
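An instruction-style prompt in the spirit of Alpaca's template; the wording below is illustrative rather than the verbatim Alpaca prompt:

```python
# Illustrative instruction-style prompt; only the response tokens keep their
# labels, so the loss is computed on the target alone.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that completes the request.\n\n"
    "### Instruction:\n{source}\n\n"
    "### Response:\n"
)

source_part = PROMPT_TEMPLATE.format(source="Translate to French: I like tea.")
target_part = "J'aime le thé."
training_text = source_part + target_part
```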
Notice how the prompt makes it explicit where the source ends and the response begins.

Increased GPU memory usage. To my understanding, concatenating the source and target makes each training sequence longer, which increases GPU memory usage during training.

Packing is more intuitive with causal LM. Packing is the act of packing training examples together to avoid padding. In causal LM, we can pack by simple concatenation:
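A sketch of the idea; the <|endoftext|> separator and the "source = target" layout are my own illustrative choices:

```python
# Causal-LM packing: examples are concatenated back to back, each target
# immediately following its source, and the long string is later chunked.
examples = [
    ("I like tea", "J'aime le thé"),
    ("Good morning", "Bonjour"),
]
packed = " <|endoftext|> ".join(f"{src} = {tgt}" for src, tgt in examples)
# -> "I like tea = J'aime le thé <|endoftext|> Good morning = Bonjour"
```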
Notice how each target string immediately comes after its source. In contrast, packing for a seq2seq (encoder-decoder) LM will look like:
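A sketch, where <sep> is an illustrative separator token:

```python
# Seq2seq (encoder-decoder) packing: the sources are packed together on the
# encoder side and the targets on the decoder side, so the model has to line
# up the i-th target with the i-th source.
encoder_input = "I like tea <sep> Good morning"
decoder_labels = "J'aime le thé <sep> Bonjour"
```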
To me, it's not intuitive that the model can match the i-th target to the i-th source string.

Credits

Cheers to Alpaca, LLaMA, and OS for finally solving this engineering puzzle for me! Do LMK if any parts don't make sense to you - I'm still learning myself.
Created training examples by concatenating inputs and targets like this: 'Document:{document}\nSummary:{Summary}'
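For example (a sketch; the document and summary strings here are made up):

```python
document = "The quick brown fox jumped over the lazy dog near the river bank."
summary = "A fox jumped over a dog."
training_text = f"Document:{document}\nSummary:{summary}"
```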
Hi,
Can we further fine-tune a pretrained GPT-2 model in a sequence-to-sequence manner, where we want to minimize the loss -log p(y|x)?
In other words, our dataset has both source and target, and we want to generate the target given the source.
But I want to start from the GPT-2 weights and then tune them.