
Add Nystromformer #14659

Merged (62 commits) on Jan 11, 2022

Conversation

novice03 (Contributor) commented Dec 7, 2021

What does this PR do?

This PR adds the Nystromformer transformer model to the repository.

Paper: https://arxiv.org/abs/2102.03902
Code: https://github.com/mlpen/Nystromformer
Checkpoint: Nystromformer sequence length 512
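For context, the core idea of the paper can be sketched in a few lines of NumPy: the n×n softmax attention matrix is approximated via m landmark points, as a product of three smaller matrices joined through a pseudo-inverse. This is a simplified illustration only (single head, no masking, segment-mean landmarks, and `np.linalg.pinv` in place of the paper's iterative Moore-Penrose approximation); none of these names come from the PR itself.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nystrom_attention(Q, K, V, num_landmarks=8):
    # Q, K, V: (n, d); n must be divisible by num_landmarks for this sketch.
    n, d = Q.shape
    m = num_landmarks
    # Landmarks: segment means over the sequence dimension.
    Q_l = Q.reshape(m, n // m, d).mean(axis=1)  # (m, d)
    K_l = K.reshape(m, n // m, d).mean(axis=1)  # (m, d)
    scale = 1.0 / np.sqrt(d)
    F = softmax(Q @ K_l.T * scale)    # (n, m): queries vs. landmark keys
    A = softmax(Q_l @ K_l.T * scale)  # (m, m): landmark queries vs. landmark keys
    B = softmax(Q_l @ K.T * scale)    # (m, n): landmark queries vs. all keys
    # softmax(QK^T) V  ≈  F @ pinv(A) @ (B @ V), costing O(n*m) instead of O(n^2).
    return F @ np.linalg.pinv(A) @ (B @ V)
```

With m fixed, memory and time scale linearly in the sequence length n, which is what makes the 512-length checkpoint cheap to extend to longer inputs.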

Who can review?

@NielsRogge

@novice03 novice03 changed the title [WIP] Add Nystromformer Add Nystromformer Jan 2, 2022
stefan-it (Collaborator) commented Jan 2, 2022

Hey @novice03, really cool PR, I can't wait to fine-tune models with it :)

I have some questions about the tokenization part: it seems that the ALBERT tokenizer (and an ALBERT SentencePiece model) is used and uploaded to the model hub, so I think the entry "tokenizer_class": "AlbertTokenizer" should be added to config.json so that AutoTokenizer works.

But I've looked at the original pre-processing and pre-training code, and it seems that the RoBERTa tokenizer is used for pre-training a Nyströmformer model. So why is an ALBERT SentencePiece model used here? 🤔
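The suggested fix is a one-key change to the checkpoint's config.json. A minimal sketch of what that edit looks like (the config dict here is a hypothetical stub, not the checkpoint's real config, which has many more entries):

```python
import json, os, tempfile

# Hypothetical minimal config; a real config.json contains many more entries.
config = {"model_type": "nystromformer", "vocab_size": 30000}

# The entry stefan-it suggests, so AutoTokenizer resolves the right class.
config["tokenizer_class"] = "AlbertTokenizer"

path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as f:
    json.dump(config, f, indent=2)

with open(path) as f:
    print(json.load(f)["tokenizer_class"])  # → AlbertTokenizer
```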

novice03 (Contributor, Author) commented Jan 2, 2022

Hello, thanks for taking a look at this PR. I've added the tokenizer_class entry to config.json. It's true that the code seems to use the RoBERTa tokenizer. However, the released checkpoint has vocab_size=30,000, and the authors also use the ALBERT tokenizer in their code. I've tried BertTokenizer and RobertaTokenizer, but they give errors or incorrect results; the released checkpoint only works with this specific ALBERT tokenizer configuration.

@novice03 novice03 marked this pull request as ready for review January 2, 2022 13:12
SaulLu (Contributor) commented Jan 3, 2022

Very cool to have this new model @novice03 🤗 .

Let me share with you a little information about the tokenizers of uw-madison/nystromformer-512 on this PR.

@NielsRogge showed me that there is currently an inconsistency in the treatment of special tokens with the slow and fast versions of uw-madison/nystromformer-512 tokenizers.
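To illustrate the kind of inconsistency meant here (the concrete token values below are assumptions for illustration, not taken from the checkpoint): the fast ALBERT tokenizer's default special tokens can differ from the ones the slow tokenizer was saved with, so the same text would be wrapped with different special tokens.

```python
# Special tokens the slow tokenizer is configured with (per the fix below).
slow_special_tokens = {
    "bos_token": "[CLS]", "eos_token": "[SEP]", "unk_token": "<unk>",
    "sep_token": "[SEP]", "pad_token": "<pad>", "cls_token": "[CLS]",
    "mask_token": "[MASK]",
}

# Hypothetical fast-tokenizer defaults that disagree on bos/eos (assumed).
fast_defaults = dict(slow_special_tokens, bos_token="<s>", eos_token="</s>")

mismatches = {
    k: (slow_special_tokens[k], fast_defaults[k])
    for k in slow_special_tokens
    if slow_special_tokens[k] != fast_defaults[k]
}
print(sorted(mismatches))  # → ['bos_token', 'eos_token']
```

Passing the special tokens explicitly when building the fast tokenizer, as in the snippet below, removes the discrepancy.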

I think that to generate the files for the fast tokenizer into the slow_and_fast folder, we should do:

from transformers import AlbertTokenizer, AlbertTokenizerFast

tokenizer = AlbertTokenizer.from_pretrained("uw-madison/nystromformer-512")
tokenizer.save_pretrained("slow_tokenizer")

fast_tokenizer = AlbertTokenizerFast.from_pretrained(
    "slow_tokenizer",
    bos_token="[CLS]",
    eos_token="[SEP]",
    unk_token="<unk>",
    sep_token="[SEP]",
    pad_token="<pad>",
    cls_token="[CLS]",
    mask_token="[MASK]",
)

fast_tokenizer.save_pretrained("slow_and_fast")

I hope this is helpful!

LysandreJik (Member) left a comment

This looks great! Only left a few comments.

Would you be open to also contributing the TensorFlow version of the model, or would you rather we start with the PyTorch model?

@LysandreJik LysandreJik requested a review from sgugger January 10, 2022 14:16
sgugger (Collaborator) left a comment

Thanks a lot for working on this model, this PR is in great shape! I just left styling suggestions.

docs/source/model_doc/nystromformer.mdx Outdated Show resolved Hide resolved
tests/test_modeling_nystromformer.py Outdated Show resolved Hide resolved
tests/test_modeling_nystromformer.py Outdated Show resolved Hide resolved
@NielsRogge NielsRogge merged commit 28e0914 into huggingface:master Jan 11, 2022
@novice03 novice03 deleted the add_nystromformer branch January 11, 2022 15:00
6 participants