🌟 New model addition
Model description
T5 version t5.1.1.* is very similar to the original T5 model, with the following differences:
- GEGLU activation in the feed-forward hidden layer, rather than ReLU - see https://arxiv.org/abs/2002.05202 (a rough sketch of this block follows the list below).
- Dropout was turned off in pre-training (quality win). Dropout should be re-enabled during fine-tuning.
- Pre-trained on C4 only without mixing in the downstream tasks.
- No parameter sharing between the embedding and classifier layers.
- "xl" and "xxl" replace "3B" and "11B". The model shapes are a bit different - larger d_model and smaller num_heads and d_ff.
The key reason these models are interesting is that, unlike the originally released checkpoints, they were trained only on unlabeled data and never on any labeled data, which makes them suitable for few-shot learning experiments. Since they are very similar to the original T5 models, I assume they would be relatively easy to implement.
Open source status
- the model implementation is available: yes, see https://github.com/google-research/text-to-text-transfer-transformer/
- the model weights are available: yes, see https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md
- who are the authors: Colin Raffel (@craffel), Noam Shazeer (@nshazeer), Adam Roberts (@adarob), Katherine Lee, Sharan Narang, Michael Matena (@mmatena), Yanqi Zhou, Wei Li, Peter J. Liu
(Also tagging @patrickvonplaten, as he is mentioned in the "who to tag" guide for T5.)