Investigating the test set #3
Conversation
The mask rate discrepancy does not seem to cause it: training on 15% and evaluating on 15% still shows poor validation/test loss convergence when using `.inference`.
I'm looking into this now, but I see the test loss for ESM2-150 (132M) seems sane? Is the loss still a problem, or is it just a performance issue now? If so, what was the fix?
The 132M run was fine; I processed it just like the validation set. But it would be good to return the logits too, for more metrics. Returning the logits takes more memory, so we need a smaller batch size, and with the smaller batch size the metrics and loss are much worse, which is the confusing bit. The sequence length shouldn't affect performance that much with the document mask...
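For context on why returning the logits drives memory up: the full `(batch, seq, vocab)` tensor has to stay alive (and usually get copied off the GPU) for every evaluation batch. A minimal sketch of the two paths, assuming a hypothetical `model(input_ids, labels=...)` that returns `(loss, logits)` rather than this repo's actual interface:

```python
import torch

@torch.no_grad()
def eval_step(model, batch, return_logits: bool = False):
    # Hypothetical interface: `model` returns (loss, logits); not this repo's actual API.
    loss, logits = model(batch["input_ids"], labels=batch["labels"])
    if not return_logits:
        # Loss-only path: the (batch, seq, vocab) logits tensor can be freed right away.
        return {"loss": loss.item()}
    # Keeping the logits for extra metrics holds the full (batch, seq, vocab) tensor
    # (plus a CPU copy) per batch -- the reason a batch_size // 4 is needed.
    return {"loss": loss.item(), "logits": logits.float().cpu()}
```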
Perhaps the issue is truncation of documents hurting performance (1/4 of the batch size implies ~4x as many sequences are truncated). During validation we could right-pad the sequences s.t. none are truncated. Let me see if this resolves the problem and prevents any batch-size related issues.
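For concreteness, a minimal sketch of right-padding a validation batch so nothing is truncated (the `pad_id` argument and the boolean attention mask are assumptions, not necessarily how this repo builds its batches):

```python
import torch

def right_pad_batch(seqs: list[torch.Tensor], pad_id: int):
    # Pad every sequence to the longest one in the batch instead of truncating.
    max_len = max(s.size(0) for s in seqs)
    input_ids = torch.full((len(seqs), max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros(len(seqs), max_len, dtype=torch.bool)
    for i, s in enumerate(seqs):
        input_ids[i, : s.size(0)] = s
        attention_mask[i, : s.size(0)] = True   # real tokens only
    return input_ids, attention_mask
```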
I see. It may be worth changing the data loading a bit: keep the sequences (or tokens) separate and stack them together up to the batch-size token budget, but never exceed it, so nothing is ever truncated.
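Something like a greedy token-budget packer, sketched below (the `max_tokens` budget and the choice that an over-long sequence simply becomes its own batch are my assumptions):

```python
def pack_into_batches(seqs, max_tokens: int):
    # Greedily group sequences into batches whose total token count stays
    # at or under max_tokens, so no sequence is ever truncated.
    batches, current, current_tokens = [], [], 0
    for seq in seqs:
        if current and current_tokens + len(seq) > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(seq)   # a sequence longer than max_tokens gets its own batch
        current_tokens += len(seq)
    if current:
        batches.append(current)
    return batches
```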
I trained a tiny model with |
Currently the test set is much more memory intensive than training and also inaccurate compared to the valid set.
Looks like it has something to do with moving the logits and labels around in CUDA memory: when calling `.inference` on the valid or test set we need `batch_size // 4` to not OOM. Have done experiments trying the validation set with `.inference` as well, and the loss convergence is much worse than using the regular forward pass. The only difference here is that we mask at 15% instead of 20% and are using `batch_size // 4`. So it seems either that the input length strongly affects performance because of flex attention (which would be a major problem for the usability of the actual model),
or
that training at 20% mask rate and evaluating at 15% leads to much worse performance (this is also not expected vs. normal pLM experiments).
To try and figure this out, I'm considering training an SDPA version with a consistent masking rate.
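On the consistent masking rate, a minimal sketch of making the rate a single parameter shared by train and eval (the simplified scheme below skips the usual 80/10/10 replacement split and uses placeholder token ids, so it illustrates the knob rather than this repo's exact masking):

```python
import torch

def mask_tokens(input_ids: torch.Tensor, special_mask: torch.Tensor,
                mask_token_id: int, mask_rate: float = 0.15):
    # Select positions to mask at `mask_rate`, never touching special tokens.
    probs = torch.full(input_ids.shape, mask_rate)
    probs.masked_fill_(special_mask, 0.0)
    selected = torch.bernoulli(probs).bool()

    labels = input_ids.clone()
    labels[~selected] = -100                 # only masked positions contribute to the loss

    masked_inputs = input_ids.clone()
    masked_inputs[selected] = mask_token_id  # simplified: always replace with <mask>
    return masked_inputs, labels
```

Passing the same `mask_rate` to both the training and evaluation data pipelines would remove the 20% vs. 15% discrepancy as a variable.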