Hi,
First of all, great work. I am a big proponent of Flan-T5 and use it in my projects. For multilingual use cases, the mT5 and bigscience/mt0 models provide a good baseline and are truly multilingual. Does Flash Attention work with the mT5 architecture? It seems only T5 is supported at the moment.
https://huggingface.co/bigscience/mt0-large is the model I am looking at; it is based on mT5.
Thanks for the great work!