Add Perceiver IO #14487
Conversation
Force-pushed from ce65e58 to 3397375
This looks great, @NielsRogge! This PR is especially clean. I left mostly nits, with a few points I'd like to check with you, however:
- The configuration values of the `Preprocessor` objects are hardcoded in the code. I think it would make sense to control these attributes from the configuration object (see the sketch after this comment).
- I'm not sure it makes sense to expose these to the end-user. Do they serve a purpose? If so, they should probably have code examples and documentation that explain how they work in detail.
- Have you tried using the image classification models with the image classification example?
- I have personally only played with the optical flow model, but it seems to have quite a few quirks regarding the input dimensions. It also needed a lot of helper methods to compute the optical flow, without which the model is quite complex to use. Would it be possible to reference this notebook in the docs (alongside your other notebooks), to add a script in `research-projects`, or at least to show the full pipeline when using that model? I think that without these, it will be very hard for users to leverage the model, which would be a shame.
Great work on the perceiver model!
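Regarding the first point above, here is a minimal sketch of what reading those values from the configuration could look like. The class name matches the PR, but the attribute names and fallback defaults are illustrative assumptions, not the PR's actual code:

```python
import torch.nn as nn

# Hedged sketch: preprocessor settings come from the config instead of
# being hardcoded. Attribute names and defaults here are assumptions.
class PerceiverImagePreprocessor(nn.Module):
    def __init__(self, config):
        super().__init__()
        # previously hardcoded constants, kept as fallbacks
        self.image_size = getattr(config, "image_size", 56)
        self.num_frames = getattr(config, "num_frames", 16)
```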
Thanks for adding this new model.
It would be nice if the doc page could explain the differences between the three image classification models and when one should be used over the others. Also, it looks like the type of image classification is picked by the `task` attribute of the config, which does not make sense to me, as those models all perform the same task (see the note after this comment).
Finally, the extra dependency should be removed.
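For reference, the variants could instead be exposed as separate classes rather than switched via `config.task`. The class names below are the ones current versions of transformers expose; whether this PR already defines them at this point in the review is an assumption:

```python
# The three image-classification variants differ only in how pixels are
# preprocessed before the encoder cross-attention:
from transformers import (
    PerceiverForImageClassificationLearned,         # learned 1D position embeddings
    PerceiverForImageClassificationFourier,         # 2D Fourier position encodings
    PerceiverForImageClassificationConvProcessing,  # conv + maxpool preprocessing
)
```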
Force-pushed from 80ce222 to d5b714d
max_position_embeddings (:obj:`int`, `optional`, defaults to 2048):
    The maximum sequence length that this model might ever be used with. Typically set this to something large
    just in case (e.g., 512 or 1024 or 2048).
train_size (:obj:`List[int]`, `optional`, defaults to [368, 496]):
If it's just used for the optical flow model, might it make sense to let this default to None, so as not to confuse readers of the config of models that have nothing to do with optical flow?
def __init__(self, config):
    super().__init__()
    self.latents = nn.Parameter(torch.randn(config.num_latents, config.d_latents))
I think in general we try to do the initialization part in the `_init_weights()` function. But no big deal for me.
config_class = PerceiverConfig
base_model_prefix = "perceiver"

def _init_weights(self, module):
I would also init the latents of `PerceiverEmbeddings` here instead of directly in the module (a sketch follows below).
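A minimal sketch of that suggestion, assuming transformers' usual `_init_weights` pattern and a `config.initializer_range` attribute (not the PR's exact code):

```python
import torch.nn as nn

# Sketch: this method lives on the PerceiverPreTrainedModel subclass;
# PerceiverEmbeddings is the module holding the learned latents.
def _init_weights(self, module):
    if isinstance(module, PerceiverEmbeddings):
        # initialize the latents here rather than inside the module itself
        module.latents.data.normal_(mean=0.0, std=self.config.initializer_range)
    elif isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        if module.bias is not None:
            module.bias.data.zero_()
```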
Amazing PR! Really impressed by how this super complex model was made so easy to read!
I strongly agree with @LysandreJik and @sgugger regarding not adding a new `einops` dependency here.
Besides that, I'm not super happy with abstracting the model output classes, as I don't think this is common practice in transformers, and it goes a bit against the everything-in-one-file philosophy. What do you think, @sgugger?
The rest is only nits.
Force-pushed from 9cbb4c4 to 83a6776
Thanks for the reviews, I've:
To do:
Hi, reading this nice Perceiver IO implementation, I came up with a question: it seems that in Perceiver the input is iteratively mixed with the latents via cross-attention, but in Perceiver IO the input seems to be fed into the network only once. Does somebody know if I am right, and the reason for doing it this way? If I got it right, it seems that in this implementation too, the input is fed in only once.
Hi @tonibagur
Thanks @esceptico, I completely missed this paragraph :')
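For readers with the same question: as described in the Perceiver IO paper, the encoder cross-attends to the inputs a single time, unlike the original Perceiver, which could repeat the cross-attend. A pseudocode sketch of the dataflow, with illustrative names rather than this PR's actual classes:

```python
latents = self.latents                                    # (num_latents, d_latents), learned
latents = encoder_cross_attention(q=latents, kv=inputs)   # inputs are consumed once, here
for block in self.latent_self_attention_blocks:           # bulk of the compute
    latents = block(latents)
outputs = decoder_cross_attention(q=output_queries, kv=latents)
```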
This overall looks OK, but there are no tests for the tokenizer?
Some comments.
Last nit on my end.
LGTM
Great work, @NielsRogge !
What does this PR do?
This PR implements Perceiver IO, by Google Deepmind.
Fixes #12996
It's a pretty cool piece of work: it applies a Transformer encoder to any kind of modality (images, text, audio, video) and to any kind of problem (text classification, image classification, audio classification, video autoencoding, optical flow, ...)! Perceiver is basically BERT, ViT and Wav2Vec2 in one. However, the authors applied the Perceiver to each problem separately, although I believe you could train a single (shared) encoder as a backbone and have different pre- and postprocessors for the different modalities.
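A quick usage sketch for the text model, assuming the checkpoint and class names as they appear in the library (note the classification head on top of this MLM checkpoint would be randomly initialized):

```python
from transformers import PerceiverTokenizer, PerceiverForSequenceClassification

# The Perceiver tokenizer operates directly on raw UTF-8 bytes,
# so there is no subword vocabulary to manage.
tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
model = PerceiverForSequenceClassification.from_pretrained("deepmind/language-perceiver")

inputs = tokenizer("Perceiver works on raw bytes.", return_tensors="pt")
logits = model(**inputs).logits
```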
The memory and time requirements of the self-attention mechanism don't depend on the length of the inputs, as the inputs are only used for cross-attention. The bulk of the computation happens on a set of latent variables, which are simply randomly initialized at the beginning of training, i.e.

self.latents = nn.Parameter(torch.randn((num_latents, d_latents)))

Some demo notebooks to showcase what Perceiver can do:
These Colab notebooks are based on the original ones provided by Deepmind, in their (very clean) JAX/Haiku implementation, which can be found here.
cc @esceptico