A new video-text dataset that may help

[Vript](https://github.com/mutonix/Vript) is a fine-grained video-text dataset of 12K annotated high-resolution videos (~400K clips). Each clip comes with a detailed caption of about 145 words.

<p align="center">
<img src="https://github.com/mutonix/Vript/blob/main/assets/Vript-overview_00.png?raw=true" width="500">  
</p>