
Add support for XLM-R XL and XXL models by modeling_xlm_roberta_xl.py #13727

Merged: 27 commits, Jan 29, 2022

Conversation

Soonhwan-Kwon
Contributor

@Soonhwan-Kwon Soonhwan-Kwon commented Sep 24, 2021

This PR adds support for the newly released XL and XXL models for XLM-R. These models are described in the "Larger-Scale Transformers for Multilingual Masked Language Modeling" paper.

Thank you to @patrickvonplaten and @stefan-it for the review I got on #13210. Based on that review, I added modeling_xlm_roberta_xl.py and a conversion script, convert_xlm_roberta_xl_original_pytorch_checkpoint_to_pytorch.py.

I compared fairseq and transformers side by side and verified that the outputs match:

torch.Size([1, 11, 250880]) torch.Size([1, 11, 250880])
max_absolute_diff = 0.000186920166015625
Do both models output the same tensors? 🔥
Saving model to converted_xlmr_xl
Configuration saved in converted_xlmr_xl/config.json
Model weights saved in converted_xlmr_xl/pytorch_model.bin
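
(For context, a minimal sketch of how such a side-by-side check can be run, modeled on the existing fairseq-to-transformers RoBERTa conversion scripts; the torch.hub model name, local path, and sample sentence here are assumptions, not taken verbatim from this PR:)

```python
import torch
from transformers import XLMRobertaXLForMaskedLM

# Original fairseq checkpoint (cached under ~/.cache/torch/pytorch_fairseq);
# the hub name "xlmr.xl" is an assumption.
xlmr = torch.hub.load("pytorch/fairseq", "xlmr.xl")
xlmr.eval()

# Checkpoint written by the conversion script in this PR.
hf_model = XLMRobertaXLForMaskedLM.from_pretrained("converted_xlmr_xl")
hf_model.eval()

# The converted model keeps fairseq's vocabulary numbering, so the same ids
# can be fed to both models.
input_ids = xlmr.encode("Hello world! cécé herlolip").unsqueeze(0)

with torch.no_grad():
    their_output = xlmr.model(input_ids)[0]   # fairseq masked-LM logits
    our_output = hf_model(input_ids).logits   # transformers masked-LM logits

print(their_output.shape, our_output.shape)
max_absolute_diff = (their_output - our_output).abs().max().item()
print(f"max_absolute_diff = {max_absolute_diff}")
print("Do both models output the same tensors?",
      "🔥" if torch.allclose(their_output, our_output, atol=1e-3) else "💩")
```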

Since the fairseq RoBERTa to transformers conversion was written a long time ago, the transformers architecture has drifted quite far from the fairseq code it originally started from, which made it confusing to write the conversion correctly. I synced the transformers code to accommodate the fairseq model structure.

  • add test for XLM-R XL and XXL
  • upload model for XLM-R XL and XXL to official repo

@patrickvonplaten
Contributor

Thanks for the PR @Soonhwan-Kwon!

Could you also add a test file and some integration tests? :-)

@Soonhwan-Kwon
Contributor Author

Soonhwan-Kwon commented Oct 13, 2021

@patrickvonplaten I started working on the test file. It seems the tests need the models uploaded to the official repo, but how can I upload the model files for xlm-roberta-xl or xlm-roberta-xxl to the official repo?

@huggingface huggingface deleted a comment from github-actions bot Nov 11, 2021
@patrickvonplaten
Contributor

Hey @Soonhwan-Kwon,

Thanks a lot for working on this and sorry for replying so late!
Would it be ok to upload the checkpoints under your name on the Hub for now so that the tests pass? Then, as a last step, we will move the checkpoints to the official organization.

Let me know if you need some help fixing the last steps :-)
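
(For reference, a minimal sketch of what uploading a converted checkpoint under a personal Hub namespace could look like, assuming the standard push_to_hub workflow; the repo name and local path are illustrative, not the exact ones used here:)

```python
# Requires being logged in first, e.g. via `huggingface-cli login`.
from transformers import XLMRobertaTokenizer, XLMRobertaXLForMaskedLM

model = XLMRobertaXLForMaskedLM.from_pretrained("converted_xlmr_xl")
# The sentencepiece model is the same as XLM-R's, so the existing tokenizer can be reused.
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")

model.push_to_hub("Soonhwan-Kwon/xlm-roberta-xlarge")      # hypothetical repo id
tokenizer.push_to_hub("Soonhwan-Kwon/xlm-roberta-xlarge")
```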

@Soonhwan-Kwon
Contributor Author


Thank you for the reply. I'm in the middle of uploading the models, but it takes time for the xxlarge model (over 24 GB).

@Soonhwan-Kwon
Contributor Author

Soonhwan-Kwon commented Nov 15, 2021

@patrickvonplaten I have uploaded all the models, but I have no idea how to handle the last steps because I'm kind of a newbie here. How can I finish them? Thank you in advance.

@patrickvonplaten
Contributor

@Soonhwan-Kwon, could you maybe also add a configuration file (just copy the xlm-roberta one) and also add a full test suite for the model? :-)

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Soonhwan-Kwon
Contributor Author


Sorry for the late response. I already added config.json, so is it the tokenizer.json you're talking about? I also added a simple test for the models (tests/test_modeling_xlm_roberta_xl.py), but where can I find the full test suite?

@patrickvonplaten
Contributor

Hey @Soonhwan-Kwon,

I meant more a new configuration_xlm_roberta_xl.py Python file that is more or less a copy of configuration_xlm_roberta.py :-) But I see that the configs are exactly the same, so maybe we can leave it as is.

@sgugger @LysandreJik - This PR adds the checkpoints of https://ai.facebook.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/ to transformers. The model is essentially a "scaled-up" version of https://huggingface.co/docs/transformers/master/en/model_doc/xlmroberta#overview . Since the "scaled-up" version has a significantly different architecture (layer_norm is used very differently, among other things), we decided to make a new modeling_xlm_roberta_xl.py model file. Now, would it be ok for you to a) not have a corresponding configuration_xlm_roberta_xl.py and just reuse the configuration_xlm_roberta.py code, or would you prefer b) adding a new configuration_xlm_roberta_xl.py file for consistency? I'm a bit indifferent here, but do slightly prefer b). What do you think?

@Soonhwan-Kwon - there are some failing tests which I think can partly be solved by rebasing onto current master. Otherwise, if it's ok with you, I'm also happy to dive into the PR and help you finish the last parts of it. Let me know what you prefer :-)
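
(To illustrate the layer_norm difference mentioned above: XLM-R uses the usual post-LayerNorm residual blocks, while the XL/XXL checkpoints use pre-LayerNorm blocks, fairseq's normalize_before setting, plus a final LayerNorm after the last layer. A deliberately simplified sketch, not the actual transformers code:)

```python
import torch
from torch import nn

def post_ln_block(x, sublayer, norm):
    # XLM-R / RoBERTa style: sublayer, residual, then LayerNorm.
    return norm(x + sublayer(x))

def pre_ln_block(x, sublayer, norm):
    # XLM-R XL/XXL style: LayerNorm on the sublayer input, residual kept on
    # the un-normalized stream.
    return x + sublayer(norm(x))

# Tiny usage example with a feed-forward sublayer.
hidden = torch.randn(1, 11, 16)
norm = nn.LayerNorm(16)
ffn = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))

out_post = post_ln_block(hidden, ffn, norm)  # RoBERTa-style
out_pre = pre_ln_block(hidden, ffn, norm)    # XL/XXL-style
```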

@LysandreJik
Member

Hey @Soonhwan-Kwon, thanks a lot for your PR!!

@patrickvonplaten, I prefer b): a lot of the library is built on the assumption that you have one configuration file/object per modeling file/model object. Since we've authorized auto models to map one configuration to multiple models this isn't as much of an issue as it could have been in the past, but I'm positive we'll find edge cases where it doesn't work as well as we expect, simply because that assumption no longer holds.

@sgugger
Collaborator

sgugger commented Dec 13, 2021

Also, since this counts as a new architecture, there should be a new folder grouping this modeling file and configuration file together, instead of putting everything in the xlm-roberta folder.
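
(Following the usual one-folder-per-model layout in transformers, that would give roughly the structure below; the folder and file names mirror the ones touched by this PR:)

```
src/transformers/models/xlm_roberta_xl/
├── __init__.py
├── configuration_xlm_roberta_xl.py
├── convert_xlm_roberta_xl_original_pytorch_checkpoint_to_pytorch.py
└── modeling_xlm_roberta_xl.py
```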

@Soonhwan-Kwon
Contributor Author


@patrickvonplaten Sure, I would be glad if you could help with the last parts; feel free to dive into this PR.

@Soonhwan-Kwon
Contributor Author

@patrickvonplaten I added you as a collaborator on my repo, in case you need access.

@patrickvonplaten
Contributor

Thanks @Soonhwan-Kwon - I'll try to tackle this tomorrow :-)

@stefan-it
Collaborator

Thanks for working on that @Soonhwan-Kwon. I made some minor suggestions and will look at the tokenization part now (to check if there are any differences between XLM-R and XLM-R-XL/XXL) :)

@stefan-it
Collaborator

The tokenization part is working as expected. Here are some details:

  • The underlying SentencePiece models (XLM-R and XLM-R-XL) are identical (checked via torch.hub.load, which downloads the models and stores them under ~/.cache/torch/pytorch_fairseq); the checksums are the same.

  • Tokenizer mapping (fairseq to spm model) is thankfully the same as for XLM-R; here is the mapping I documented:

# Original fairseq vocab and spm vocab must be "aligned":
# Vocab | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
# -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----
# fairseq | '<s>' | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's' | '▁de' | '-'
# spm | '<unk>' | '<s>' | '</s>' | ',' | '.' | '▁' | 's' | '▁de' | '-' | '▁a'

  • But where does this vocab size "mismatch" come from? XLM-R has a vocab size of 250,002, whereas XLM-R-XL has 250,880.

Fairseq comes with its own dictionary file (dict.txt): for XLM-R it has 249,997 entries, and for XLM-R-XL it has 250,875 entries.

The dictionary file for XLM-R-XL is the same as for XLM-R, except that it additionally contains madeupword tokens, ranging from madeupword0 to madeupword877, at the end (the sketch below shows how these numbers line up).
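
(A short sketch of how those numbers add up, assuming the same id-alignment convention as the existing XLMRobertaTokenizer; the helper below is illustrative, not the tokenizer's actual code:)

```python
# fairseq reserves the first four ids for its own special tokens, so ordinary
# sentencepiece pieces are shifted by +1 (spm id 0 is '<unk>', which fairseq
# maps to id 3 instead).
FAIRSEQ_SPECIAL_IDS = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
FAIRSEQ_OFFSET = 1

def piece_to_fairseq_id(piece, sp_model):
    if piece in FAIRSEQ_SPECIAL_IDS:
        return FAIRSEQ_SPECIAL_IDS[piece]
    return sp_model.PieceToId(piece) + FAIRSEQ_OFFSET

# The vocab sizes then add up as follows:
#   XLM-R:    249,997 dict entries + 4 specials + 1 <mask> = 250,002
#   XLM-R-XL: 250,875 dict entries + 4 specials + 1 <mask> = 250,880
# i.e. the extra 878 entries are exactly madeupword0 ... madeupword877.
```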

@sgugger sgugger left a comment (Collaborator)

Looking good! Careful with some `Copied from` statements that haven't been adapted to the model name.

(Excerpt from the documentation under review:)
"…not require `lang` tensors to understand which language is used, and should be able to determine the correct language from the input ids."

This model was contributed by [Soonhwan-Kwon](https://github.com/Soonhwan-Kwon) and [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).

Mentioned you here @Soonhwan-Kwon and @stefan-it

@patrickvonplaten patrickvonplaten left a comment (Contributor)

UPDATE: The PR should be ready for merge now. The checkpoints have been moved to Facebook's org (https://huggingface.co/models?other=xlm-roberta-xl) and I added some model cards. @Soonhwan-Kwon, I've made you the "main" contributor for this model.
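
(For reference, a minimal usage sketch after the move; the exact repo ids below are my assumption of how the checkpoints are named under the Facebook org:)

```python
import torch
from transformers import AutoTokenizer, XLMRobertaXLModel

tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-roberta-xl")
model = XLMRobertaXLModel.from_pretrained("facebook/xlm-roberta-xl")

inputs = tokenizer("Hello world!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```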

@patrickvonplaten
Contributor

@stefan-it - it would also be great if you could do a final review :-)

@sgugger sgugger left a comment (Collaborator)

Looking good!

patrickvonplaten and others added 3 commits January 29, 2022 13:04
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@patrickvonplaten patrickvonplaten merged commit e09473a into huggingface:master Jan 29, 2022
@patrickvonplaten patrickvonplaten deleted the xlm_xl branch January 29, 2022 12:42
@stefan-it
Collaborator

Really cool! I'm currently running experiments on token classification with that new model 🤗

@Soonhwan-Kwon
Contributor Author

@patrickvonplaten @sgugger @stefan-it Thank you for the merge, it was a great experience, and I came to respect the transformers committers. Also, the revert below was just a misclick, sorry.

@stefan-it stefan-it mentioned this pull request Feb 3, 2023