
Add support for XLM-R XL and XXL models #13210


Closed · wants to merge 2 commits

Conversation

@Soonhwan-Kwon (Contributor) commented on Aug 21, 2021

This PR adds support for the newly released XL and XXL models for XLM-R. These models are described in the "Larger-Scale Transformers for Multilingual Masked Language Modeling" paper.

I compared fairseq and transformers side by side and confirmed that the outputs match:

torch.Size([1, 10, 250880]) torch.Size([1, 10, 250880])
max_absolute_diff = 0.00022125244140614
Do both models output the same tensors? 🔥

Since the fairseq RoBERTa to transformers conversion was written a long time ago, the transformers architecture has diverged considerably from the fairseq code it originally started from, which made it quite confusing to write the conversion correctly. I synced the transformers code to accommodate the fairseq model structure.

The original PR #12082 (comment) was closed by its author @stefan-it, and the PR (https://github.com/stefan-it/transformers/pull/1) I opened against his repo about 40 days ago got no response, so I opened this new PR.
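For reference, the kind of side-by-side check described above usually looks something like the minimal sketch below; the checkpoint paths, input sentence, and tolerance are illustrative assumptions, not the exact script used in this PR.

# Rough sketch of a fairseq-vs-transformers logit comparison (paths and inputs are placeholders).
import torch
from fairseq.models.roberta import XLMRModel
from transformers import XLMRobertaForMaskedLM

fairseq_model = XLMRModel.from_pretrained("/path/to/xlmr.xl", checkpoint_file="model.pt")
fairseq_model.eval()
hf_model = XLMRobertaForMaskedLM.from_pretrained("/path/to/converted-xlm-roberta-xl")
hf_model.eval()

# Encode one sentence with the fairseq dictionary and feed the same ids to both models.
input_ids = fairseq_model.encode("Hello world!").unsqueeze(0)

with torch.no_grad():
    their_output = fairseq_model.model(input_ids)[0]  # fairseq LM-head logits
    our_output = hf_model(input_ids).logits           # transformers LM-head logits

print(their_output.shape, our_output.shape)
max_absolute_diff = torch.max(torch.abs(our_output - their_output)).item()
print(f"max_absolute_diff = {max_absolute_diff}")
print("Do both models output the same tensors?", "🔥" if max_absolute_diff < 1e-3 else "💩")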

@Soonhwan-Kwon Soonhwan-Kwon marked this pull request as draft August 22, 2021 09:56
@stefan-it (Collaborator) commented on Aug 27, 2021

Hi @Soonhwan-Kwon ,

sorry for the late reply! I discussed this topic with @patrickvonplaten a while ago and we came to the conclusion that it would be better to have a new model/class name for it, such as XLMRobertaExtraLarge to avoid these if self.normalize_before switches.

I've also tested the model implementation on a GLUE task, but the result was not very good. The model is so large that it was impossible for me to test it on a GPU, even with batch size 1. Then I did some DeepSpeed tests, but on my V100 I would have to wait more than 3 days for the smallest GLUE task, and the final result did not perform well 🤔
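For context, normalize_before controls whether LayerNorm runs before or after each sub-block (the XL/XXL checkpoints are pre-LayerNorm, base/large are post-LayerNorm). The sketch below is a hypothetical, simplified attention block showing the two orderings; it is not code from this PR, only an illustration of why a dedicated class can hard-code one ordering instead of branching.

# Hypothetical attention sub-block illustrating the normalize_before switch (not the PR's code).
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, hidden_size, num_heads, normalize_before):
        super().__init__()
        self.normalize_before = normalize_before
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.layer_norm = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states):
        residual = hidden_states
        if self.normalize_before:  # pre-LN: normalize the input (XL/XXL style)
            hidden_states = self.layer_norm(hidden_states)
        hidden_states, _ = self.attn(hidden_states, hidden_states, hidden_states)
        hidden_states = residual + hidden_states
        if not self.normalize_before:  # post-LN: normalize the residual sum (base/large style)
            hidden_states = self.layer_norm(hidden_states)
        return hidden_states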

@Soonhwan-Kwon (Contributor, Author) commented:

@stefan-it thank you for the reply. I have an A100 80GB machine if you need any cross-check.

@mdavoudi90 commented:

@Soonhwan-Kwon @stefan-it Can you share your DeepSpeed configuration for loading the XLMR-xl model? I'm getting NaN as the loss from DeepSpeed after using your code changes for the conversion. @Soonhwan-Kwon Do you have a plan to create a standalone file for XLMRobertaExtraLarge? The reason is that your current file change breaks the conversion for the large and base models.

@ccclyu commented on Sep 3, 2021

> @Soonhwan-Kwon @stefan-it Can you share your DeepSpeed configuration for loading the XLMR-xl model? I'm getting NaN as the loss from DeepSpeed after using your code changes for the conversion. @Soonhwan-Kwon Do you have a plan to create a standalone file for XLMRobertaExtraLarge? The reason is that your current file change breaks the conversion for the large and base models.

Let me paste my fine-tuning script for loading the XLM-Roberta-XLarge model, which was converted with @Soonhwan-Kwon's script. You could run it and double-check against your setup.

deepspeed --num_gpus=8 run_xnli.py --model_name_or_path /mnt/xlm-roberta-xlarge \
  --deepspeed ds_config_zero3.json \
  --language zh \
  --train_language en \
  --do_predict \
  --max_seq_length 128 \
  --per_device_train_batch_size 4 \
  --learning_rate 2e-6 \
  --logging_steps 100 \
  --eval_steps 100 \
  --save_steps 5000 \
  --num_train_epochs 5 \
  --output_dir /mnt/output_xlmr \
  --cache_dir  /mnt/cache  \
  --fp16   \
  --overwrite_output_dir \
  --evaluation_strategy "steps" \
  --dataloader_num_workers 8 \
  --use_fast_tokenizer False 
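The ds_config_zero3.json referenced above is not shown in the thread. As a rough sketch, a ZeRO stage-3 configuration along the lines of the transformers DeepSpeed examples might look like the following; all values are assumptions, and the "auto" entries are filled in by the Trainer from its own arguments.

# Sketch of a plausible ZeRO stage-3 config for ds_config_zero3.json (values are guesses, not the poster's file).
import json

ds_config_zero3 = {
    "fp16": {"enabled": "auto"},  # toggled by the Trainer's --fp16 flag
    "zero_optimization": {
        "stage": 3,  # partition parameters, gradients, and optimizer states across GPUs
        "offload_optimizer": {"device": "cpu"},  # offload optimizer states to CPU memory
        "offload_param": {"device": "cpu"},      # offload parameters to CPU memory
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

with open("ds_config_zero3.json", "w") as f:
    json.dump(ds_config_zero3, f, indent=2)

As for the NaN losses mentioned above, fp16 overflow is a common suspect with very large models; where the hardware allows it (e.g. an A100), trying bf16 or full fp32 is a usual sanity check, though that is not something verified in this thread.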

@@ -81,7 +81,9 @@ def __init__(self, config):

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.normalize_embeddings = config.normalize_embeddings
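For context, a flag like normalize_embeddings would typically gate the embedding LayerNorm in the forward pass, roughly as in the hypothetical, simplified sketch below (token type embeddings and edge cases omitted; not the PR's actual code).

# Hypothetical, simplified embeddings forward showing what normalize_embeddings could gate.
def forward(self, input_ids):
    seq_length = input_ids.size(1)
    embeddings = self.word_embeddings(input_ids) + self.position_embeddings(self.position_ids[:, :seq_length])
    if self.normalize_embeddings:
        # base/large-style checkpoints normalize right after the embeddings
        embeddings = self.LayerNorm(embeddings)
    # checkpoints exported without an embedding LayerNorm simply skip the step above
    return self.dropout(embeddings)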
Contributor
Thanks a lot for the PR @Soonhwan-Kwon! For transformers we have a rather strict rule not to adapt existing modeling files for new model checkpoints, so in this case it would be great if you could create a new modeling_xlm_roberta_xl.py file.

@Soonhwan-Kwon (Contributor, Author) commented on Sep 19, 2021

Thank you for the review; I have started creating xlm_roberta_xl as you suggested.
