Add Perceiver IO #14487

Merged: 147 commits into huggingface:master on Dec 8, 2021

Conversation

Contributor

@NielsRogge NielsRogge commented Nov 22, 2021

What does this PR do?

This PR implements Perceiver IO, by Google Deepmind.

Fixes #12996

It's a pretty cool piece of work: it applies a Transformer encoder to any kind of modality (images, text, audio, video) and to any problem (text classification, image classification, audio classification, video autoencoding, optical flow, ...)! Perceiver is basically BERT, ViT and Wav2Vec2 in one. However, the authors applied the Perceiver to each problem separately, although I believe you could train a single (shared) encoder as backbone and have different pre- and postprocessors for the different modalities.

The memory and time requirements of the self-attention mechanism don't depend on the length of the inputs, as the inputs are only used for cross-attention. The bulk of the computation happens on a set of latent variables, which are randomly initialized at the beginning of training (i.e. self.latents = nn.Parameter(torch.randn((num_latents, d_latents)))).
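For readers skimming this thread, here is a minimal, self-contained sketch of that idea (illustrative module and argument names, not the actual classes in modeling_perceiver.py): the inputs enter only through a single cross-attention, and all subsequent self-attention operates on a fixed-size latent array, so compute is independent of the input length.

```python
import torch
import torch.nn as nn

class PerceiverStyleEncoderSketch(nn.Module):
    """Toy sketch: inputs of arbitrary length enter only via cross-attention;
    the bulk of compute is self-attention over a fixed number of latents."""

    def __init__(self, num_latents=256, d_latents=512, d_model=256, num_self_attends=8, num_heads=8):
        super().__init__()
        # Randomly initialized, learned latents (as described above).
        self.latents = nn.Parameter(torch.randn(num_latents, d_latents))
        # Latents (queries) cross-attend to the preprocessed inputs (keys/values).
        self.cross_attend = nn.MultiheadAttention(
            d_latents, num_heads, kdim=d_model, vdim=d_model, batch_first=True
        )
        # Self-attention over the latents only.
        self.self_attends = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_latents, num_heads, batch_first=True) for _ in range(num_self_attends)]
        )

    def forward(self, inputs):  # inputs: (batch, seq_len, d_model); seq_len can be anything
        latents = self.latents.unsqueeze(0).expand(inputs.shape[0], -1, -1)
        latents, _ = self.cross_attend(latents, inputs, inputs)  # ~O(num_latents * seq_len)
        for layer in self.self_attends:
            latents = layer(latents)  # ~O(num_latents ** 2), independent of seq_len
        return latents

# e.g. a batch of 2 sequences of length 2048, already projected to d_model=256:
out = PerceiverStyleEncoderSketch()(torch.randn(2, 2048, 256))  # -> (2, 256, 512)
```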

Some demo notebooks to showcase what Perceiver can do:

  • image classification & masked language modeling: colab
  • optical flow: colab
  • video autoencoding (and video classification): colab

These colab notebooks are based on the original ones provided by Deepmind, in their (very clean) JAX/Haiku implementation which can be found here.

cc @esceptico

@NielsRogge NielsRogge force-pushed the modeling_perceiver_rebase branch from ce65e58 to 3397375 on November 22, 2021 at 11:24
Member

@LysandreJik LysandreJik left a comment

This looks great, @NielsRogge! This PR is especially clean. I left mostly nits, with a few points I'd like to check with you however:

  • The configuration values of the Preprocessor objects are hardcoded in the code. I think it would make sense to control these attributes from the configuration object (see the sketch at the end of this comment).
  • I'm not sure it makes sense to expose the pre- and postprocessors to the end user. Do they serve a purpose? If so, they should probably have code examples and documentation that explain how they work in detail.
  • Have you tried using the image classification models with the image classification example?
  • I have personally only played with the optical flow model, but it seems to have quite a few quirks regarding the input dimensions. Also, a lot of helper methods were needed to compute the optical flow and related quantities, without which the model is quite complex to use. Would it be possible to reference this notebook in the docs (alongside your other notebooks), to add a script in research-projects, or at least to show the full pipeline when using that model? I think that without these, it will be very complex for users to leverage the model, which would be a shame.

Great work on the perceiver model!
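Regarding the first point above, here is a rough illustration of the idea (hypothetical attribute names, not the actual PerceiverConfig fields): the preprocessor reads its values from the configuration object instead of hardcoding them.

```python
from dataclasses import dataclass

@dataclass
class ToyPerceiverConfig:
    # Hypothetical preprocessor-related attributes, exposed on the config
    # instead of being hardcoded inside the preprocessor.
    d_model: int = 768
    image_size: int = 224
    num_channels: int = 3

class ToyImagePreprocessor:
    def __init__(self, config: ToyPerceiverConfig):
        # Values come from the configuration object, so users can override them.
        self.image_size = config.image_size
        self.num_channels = config.num_channels
        self.d_model = config.d_model
```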

Collaborator

@sgugger sgugger left a comment

Thanks for adding this new model.

It would be nice if the doc page could explain the differences between the three different image classification models and when one should be used over the others. Also, it looks like the type of image classification is picked by the task attribute of the config, which does not make sense to me, as those are all performing the same task.

Then, the extra dependency (einops) should be removed.

@NielsRogge NielsRogge force-pushed the modeling_perceiver_rebase branch 2 times, most recently from 80ce222 to d5b714d on December 1, 2021 at 12:40
max_position_embeddings (:obj:`int`, `optional`, defaults to 2048):
    The maximum sequence length that this model might ever be used with. Typically set this to something large
    just in case (e.g., 512 or 1024 or 2048).
train_size (:obj:`List[int]`, `optional`, defaults to [368, 496]):

If it's just used for the optical flow model, might it make sense to let this default to None, so as not to confuse readers of the config of models that have nothing to do with optical flow?
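A small sketch of what that suggestion could look like (hypothetical code, not the actual diff): train_size defaults to None and is only set for optical flow checkpoints.

```python
from typing import List, Optional

class PerceiverConfigSketch:
    def __init__(self, max_position_embeddings: int = 2048,
                 train_size: Optional[List[int]] = None, **kwargs):
        self.max_position_embeddings = max_position_embeddings
        # Only relevant for the optical flow model, e.g. [368, 496]; None otherwise.
        self.train_size = train_size
```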


def __init__(self, config):
    super().__init__()
    self.latents = nn.Parameter(torch.randn(config.num_latents, config.d_latents))

I think in general we try to do the initialization in the _init_weights() function, but it's no big deal for me.

config_class = PerceiverConfig
base_model_prefix = "perceiver"

def _init_weights(self, module):

I would also init the latents of the PerceiverEmbeddings here instead of directly in the module
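A sketch of what that suggestion could look like (assuming this method lives in the same file as the PerceiverEmbeddings snippet above, with nn imported from torch and initializer_range available on the config):

```python
def _init_weights(self, module):
    """Sketch of the suggestion above: keep all weight initialization here."""
    if isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, PerceiverEmbeddings):
        # Initialize the latents here rather than in PerceiverEmbeddings.__init__.
        module.latents.data.normal_(mean=0.0, std=self.config.initializer_range)
```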

Contributor

@patrickvonplaten patrickvonplaten left a comment

Amazing PR! Really impressed by how this super complex model was made so easy to read!

I strongly agree with @LysandreJik and @sgugger regarding not adding a new einops dependency here.

Besides that, I'm not super happy with abstracting the model output classes, as I don't think this is common practice in transformers and it goes a bit against the "everything in one file" philosophy. What do you think here, @sgugger?

The rest are only nits.

@NielsRogge NielsRogge force-pushed the modeling_perceiver_rebase branch from 9cbb4c4 to 83a6776 on December 2, 2021 at 10:53
Contributor Author

NielsRogge commented Dec 2, 2021

Thanks for the reviews, I've:

  • removed the einops dependency and replaced it with plain PyTorch (an illustrative sketch follows below this list).
  • removed the pre- and postprocessors from the main init; they can still be imported using from transformers.models.perceiver.modeling_perceiver import xxx
  • copied all arguments for the Perceiver model-specific outputs, to better reflect the existing philosophy.
  • fixed the tests on GPU.
  • renamed PerceiverForImageClassification to PerceiverForImageClassificationLearned, to better reflect that this one uses learned position embeddings, which is more in line with the other models, namely PerceiverForImageClassificationFourier and PerceiverForImageClassificationConvProcessing.
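As an illustration of the einops removal (a made-up example, not the PR's actual diff), a space-to-depth style rearrange can be written with plain reshape/permute:

```python
import torch

x = torch.randn(2, 3, 16, 16)  # (batch, channels, height, width)

# With einops this could be written as:
#   rearrange(x, "b c (h dh) (w dw) -> b (dh dw c) h w", dh=2, dw=2)
# The same operation with plain PyTorch ops:
b, c, h, w = x.shape
dh, dw = 2, 2
out = (
    x.reshape(b, c, h // dh, dh, w // dw, dw)
    .permute(0, 3, 5, 1, 2, 4)
    .reshape(b, dh * dw * c, h // dh, w // dw)
)  # -> (2, 12, 8, 8)
```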

To do:

  • now, it's mainly about writing more docstrings.

@tonibagur

Hi, reading this nice Perceiver IO implementation I came up with a question: it seems that in Perceiver the input is iteratively mixed with the latents via cross-attention, but in Perceiver IO the input is fed into the network only once. Does somebody know whether I'm right, and the reason for doing it this way? If I got it right, it seems that in this implementation the input is also fed only once...

@esceptico

Hi @tonibagur
As mentioned in the paper:

The encode and processing stages re-use the basic structure of the original Perceiver, with the exception of the repeated encoder cross-attends. We found it simpler to omit these modules, as they increase compute without dramatically increasing performance, as reported in Appendix Table 6 of the original Perceiver paper.

@tonibagur

Thanks esceptico, I completely missed this paragraph :'
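To make the difference concrete, here is a rough sketch (illustrative names, inputs assumed to already be projected to the latent dimension; not the actual transformers code) contrasting the original Perceiver's repeated cross-attends with Perceiver IO's single initial cross-attend:

```python
import torch.nn as nn

def self_block(dim=256, heads=8):
    return nn.TransformerEncoderLayer(dim, heads, batch_first=True)

class RepeatedCrossAttendEncoder(nn.Module):
    """Original Perceiver: the inputs are cross-attended again in every block."""
    def __init__(self, dim=256, num_blocks=4, layers_per_block=2):
        super().__init__()
        self.cross_attends = nn.ModuleList(
            [nn.MultiheadAttention(dim, 8, batch_first=True) for _ in range(num_blocks)]
        )
        self.self_blocks = nn.ModuleList(
            [nn.ModuleList([self_block(dim) for _ in range(layers_per_block)]) for _ in range(num_blocks)]
        )

    def forward(self, latents, inputs):
        for cross, block in zip(self.cross_attends, self.self_blocks):
            latents, _ = cross(latents, inputs, inputs)  # inputs re-enter every block
            for layer in block:
                latents = layer(latents)
        return latents

class SingleCrossAttendEncoder(nn.Module):
    """Perceiver IO (as in this PR): inputs enter once, then only latent self-attention."""
    def __init__(self, dim=256, num_layers=8):
        super().__init__()
        self.cross_attend = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.self_attends = nn.ModuleList([self_block(dim) for _ in range(num_layers)])

    def forward(self, latents, inputs):
        latents, _ = self.cross_attend(latents, inputs, inputs)  # inputs used only here
        for layer in self.self_attends:
            latents = layer(latents)
        return latents
```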

Contributor

@Narsil Narsil left a comment

This overall looks OK, but there are no tests for the tokenizer?

Some comments.

Contributor

@Narsil Narsil left a comment

Last nit on my end.

Contributor

@Narsil Narsil left a comment

LGTM

Member

@LysandreJik LysandreJik left a comment

Great work, @NielsRogge!

@NielsRogge NielsRogge merged commit 65b20b7 into huggingface:master Dec 8, 2021