Add support for XLM-R XL and XXL models #13210

Conversation
Hi @Soonhwan-Kwon, sorry for the late reply! I discussed this topic with @patrickvonplaten a while ago and we came to the conclusion that it would be better to have a new model/class name for it. I've also tested the model implementation on a GLUE task, but the result was not very good. The model is so large that it was impossible for me to test it on a GPU - even with batch size 1. Then I did some DeepSpeed tests, but on my V100 I would have to wait more than 3 days for the smallest GLUE task - and the final result was not performing well 🤔
@stefan-it Thank you for the reply. I have an A100 80GB machine if you need any cross-check.
@Soonhwan-Kwon @stefan-it Can you share your DeepSpeed configuration for loading XLMR-XL? I'm getting NaN as the loss from DeepSpeed after using your code changes for the conversion. @Soonhwan-Kwon Do you have a plan to create a standalone file for XLMRobertaExtraLarge? I ask because your current file change breaks the conversion for the large and base models.
Maybe I can paste my fine-tuning command, which loads the XLM-Roberta-XLarge model converted with @Soonhwan-Kwon's script. You could run it and double-check against your setup:

deepspeed --num_gpus=8 run_xnli.py --model_name_or_path /mnt/xlm-roberta-xlarge \
--deepspeed ds_config_zero3.json \
--language zh \
--train_language en \
--do_predict \
--max_seq_length 128 \
--per_device_train_batch_size 4 \
--learning_rate 2e-6 \
--logging_steps 100 \
--eval_steps 100 \
--save_steps 5000 \
--num_train_epochs 5 \
--output_dir /mnt/output_xlmr \
--cache_dir /mnt/cache \
--fp16 \
--overwrite_output_dir \
--evaluation_strategy "steps" \
--dataloader_num_workers 8 \
--use_fast_tokenizer False
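For reference, the ds_config_zero3.json referenced in the command above is a standard ZeRO-3 configuration. A minimal sketch is given below as a Python script that writes the JSON file; the values are assumptions for illustration, not the exact config used in this thread (the "auto" entries are filled in by the transformers Trainer integration).

```python
# Illustrative ZeRO-3 DeepSpeed config for the command above; values are
# assumptions, not the exact file used by the commenters.
import json

ds_config = {
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "initial_scale_power": 16,
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("ds_config_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```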
@@ -81,7 +81,9 @@ def __init__(self, config):

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.normalize_embeddings = config.normalize_embeddings
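For context on the added line, a flag like normalize_embeddings would typically make the embedding-level LayerNorm conditional, since the XL/XXL checkpoints apply normalization differently than the base/large models. A simplified, hypothetical sketch (not the actual transformers class):

```python
# Sketch only: how a config flag such as normalize_embeddings could gate the
# embedding LayerNorm. This is a simplified stand-in, not the real module.
import torch
from torch import nn


class SketchEmbeddings(nn.Module):
    def __init__(self, vocab_size, hidden_size, layer_norm_eps, normalize_embeddings):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.LayerNorm = nn.LayerNorm(hidden_size, eps=layer_norm_eps)
        self.normalize_embeddings = normalize_embeddings

    def forward(self, input_ids):
        embeddings = self.word_embeddings(input_ids)
        # XL/XXL checkpoints place LayerNorm differently than base/large,
        # so the embedding-level normalization becomes optional.
        if self.normalize_embeddings:
            embeddings = self.LayerNorm(embeddings)
        return embeddings


# Quick check of both code paths:
emb = SketchEmbeddings(vocab_size=100, hidden_size=8, layer_norm_eps=1e-5, normalize_embeddings=True)
print(emb(torch.tensor([[1, 2, 3]])).shape)  # torch.Size([1, 3, 8])
```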
Thanks a lot for the PR @Soonhwan-Kwon! For transformers we have a rather strict rule not to adapt existing modeling files for new model checkpoints, so in this case it would be great if you could create a new modeling_xlm_roberta_xl.py file.
Thank you for the review; I have started creating xlm_roberta_xl as you suggested.
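For illustration, a bare-bones skeleton of what such a standalone module could start from is sketched below. The class and file names follow the reviewer's suggestion, the vocabulary size matches the converted checkpoint's output shape shown further down, and the remaining defaults follow the XL settings reported in the paper; everything else is an assumption rather than the final implementation.

```python
# configuration_xlm_roberta_xl.py -- skeleton only, not the final implementation.
from transformers import PretrainedConfig


class XLMRobertaXLConfig(PretrainedConfig):
    model_type = "xlm-roberta-xl"

    def __init__(
        self,
        vocab_size=250880,        # matches the converted checkpoint's output shape
        hidden_size=2560,         # XL settings as reported in the paper
        num_hidden_layers=36,
        num_attention_heads=32,
        intermediate_size=10240,
        layer_norm_eps=1e-5,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.layer_norm_eps = layer_norm_eps
```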
This PR adds support for the newly released XL and XXL models for XLM-R. These models are described in the "Larger-Scale Transformers for Multilingual Masked Language Modeling" paper.
I compared the fairseq and transformers implementations side by side and verified that they produce the same output:

torch.Size([1, 10, 250880]) torch.Size([1, 10, 250880])
max_absolute_diff = 0.00022125244140614
Do both models output the same tensors? 🔥
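A rough sketch of that side-by-side check, in the spirit of the existing RoBERTa conversion scripts; the checkpoint paths and the use of XLMRobertaForMaskedLM for the converted model are assumptions:

```python
# Compare fairseq and transformers outputs on the same input; paths are placeholders.
import torch
from fairseq.models.roberta import XLMRModel
from transformers import XLMRobertaForMaskedLM

fairseq_model = XLMRModel.from_pretrained("/path/to/xlmr.xl", checkpoint_file="model.pt")
fairseq_model.eval()

hf_model = XLMRobertaForMaskedLM.from_pretrained("/path/to/converted-xlm-roberta-xlarge")
hf_model.eval()

input_ids = fairseq_model.encode("Hello world!").unsqueeze(0)  # shape: (1, seq_len)

with torch.no_grad():
    fairseq_logits = fairseq_model.model(input_ids)[0]
    hf_logits = hf_model(input_ids)[0]

print(fairseq_logits.shape, hf_logits.shape)
max_absolute_diff = torch.max(torch.abs(fairseq_logits - hf_logits)).item()
print(f"max_absolute_diff = {max_absolute_diff}")
print("Do both models output the same tensors?", "🔥" if max_absolute_diff < 1e-3 else "💩")
```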
Since the fairseq RoBERTa to transformers conversion was written a long time ago, the transformers architecture has drifted quite far from the fairseq code it originally started from, which makes it confusing to get the conversion code right. I adjusted the transformers code to accommodate the fairseq model structure.
The original PR #12082 was closed by its author @stefan-it, and the PR (https://github.com/stefan-it/transformers/pull/1) that I opened against his fork about 40 days ago got no response, so I opened this new PR.