
Awesome Video-Text Retrieval by Deep Learning

A curated list of deep learning resources for video-text retrieval.

Contributing

Please feel free to open pull requests to add papers.

Markdown format:

- `[Conference/Trans Year]` Author. Title. Trans Year. [[paper]](link) [[code]](link) [[homepage]](link)
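For example, filling this template with the Dual Encoding paper listed below (the `(link)` placeholders stand in for the actual paper and code URLs):

```markdown
- `[CVPR2019]` Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, Xun Wang. Dual Encoding for Zero-Example Video Retrieval. CVPR, 2019. [[paper]](link) [[code]](link)
```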

Table of Contents

  • Popular Implementations
    • PyTorch
    • TensorFlow
    • Others
  • Papers
  • Datasets
  • Licenses

Papers

2019

  • [CVPR2019] Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, Xun Wang. Dual Encoding for Zero-Example Video Retrieval. CVPR, 2019. [paper] [code]
  • [CVPR2019] Yale Song, and Mohammad Soleymani. Polysemous visual-semantic embedding for cross-modal retrieval. CVPR, 2019. [paper]
  • [ICCV2019] Michael Wray, Diane Larlus, Gabriela Csurka, and Dima Damen. Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings. ICCV, 2019. [paper]
  • [ICCV2019] Yu Xiong, Qingqiu Huang, Lingfeng Guo, Hang Zhou, Bolei Zhou, and Dahua Lin. A Graph-Based Framework to Bridge Movies and Synopses. ICCV, 2019. [paper]
  • [ACMMM2019] Xirong Li, Chaoxi Xu, Gang Yang, Zhineng Chen, and Jianfeng Dong. W2VV++ Fully Deep Learning for Ad-hoc Video Search. ACM Multimedia, 2019. [paper] [code]
  • [BMVC2019] Yang Liu, Samuel Albanie, Arsha Nagrani, Andrew Zisserman. Use What You Have: Video Retrieval Using Representations From Collaborative Experts. BMVC, 2019. [paper] [code]
  • [BigMM2019] Jaeyoung Choi, Martha Larson, Gerald Friedland, and Alan Hanjalic. From Intra-Modal to Inter-Modal Space: Multi-Task Learning of Shared Representations for Cross-Modal Retrieval. International Conference on Multimedia Big Data, 2019. [paper]

2018

  • [TMM2018] Jianfeng Dong, Xirong Li, Cees GM Snoek. Predicting visual features from text for image and video caption retrieval. IEEE Transactions on Multimedia, 2018. [paper] [code]
  • [ECCV2018] Bowen Zhang, Hexiang Hu, Fei Sha. Cross-Modal and Hierarchical Modeling of Video and Text. ECCV, 2018. [paper] [code]
  • [ECCV2018] Youngjae Yu, Jongseok Kim, Gunhee Kim. A Joint Sequence Fusion Model for Video Question Answering and Retrieval. ECCV, 2018. [paper]
  • [ECCV2018] Dian Shao, Yu Xiong, Yue Zhao, Qingqiu Huang, Yu Qiao, and Dahua Lin. Find and focus: Retrieve and localize video events with natural language queries. ECCV, 2018. [paper]
  • [ICMR2018] Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, Amit K. Roy-Chowdhury. Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval. ICMR, 2018. [paper] [code]
  • [arXiv2018] Antoine Miech, Ivan Laptev, Josef Sivic. Learning a Text-Video Embedding from Incomplete and Heterogeneous Data. arXiv preprint arXiv:1804.02516, 2018. [paper] [code]

Before

  • [CVPR2017] Youngjae Yu, Hyungjin Ko, Jongwook Choi, Gunhee Kim. End-to-end concept word detection for video captioning, retrieval, and question answering. CVPR, 2017. [paper] [code]
  • [ECCVW2016] Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkilä, Naokazu Yokoya. Learning joint representations of videos and sentences with web image search. ECCV Workshop, 2016. [paper]
  • [AAAI2015] Ran Xu, Caiming Xiong, Wei Chen, Jason J Corso. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. AAAI, 2015. [paper]

Ad-hoc Video Search

  • For papers targeting ad-hoc video search in the context of [TRECVID], please refer to [here]

Other Related

  • [arXiv2020] Tianhao Li, and Limin Wang. Learning Spatiotemporal Features via Video and Text Pair Discrimination. arXiv preprint arXiv:2001.05691, 2020. [paper]
  • [arXiv2019] Hazel Doughty, Ivan Laptev, Walterio Mayol-Cuevas, and Dima Damen. Action Modifiers: Learning from Adverbs in Instructional Videos. arXiv preprint arXiv:1912.06617, 2019. [paper]
  • [arXiv2019] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-End Learning of Visual Representations from Uncurated Instructional Videos. arXiv preprint arXiv:1912.06430, 2019. [paper]

Datasets

  • [MSVD] David L. Chen and William B. Dolan. Collecting Highly Parallel Data for Paraphrase Evaluation. ACL, 2011. [paper] [dataset]
  • [MSRVTT] Jun Xu, Tao Mei, Ting Yao, Yong Rui. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. CVPR, 2016. [paper] [dataset]
  • [TGIF] Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. TGIF: A new dataset and benchmark on animated GIF description. CVPR, 2016. [paper] [homepage]
  • [AVS] George Awad, et al. Trecvid 2016: Evaluating video search, video event detection, localization, and hyperlinking. TRECVID Workshop, 2016. [paper] [dataset]
  • [LSMDC] Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. Movie description. IJCV, 2017. [paper] [dataset]
  • [ActivityNet Captions] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. ICCV, 2017. [paper] [dataset]
  • [DiDeMo] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell. Localizing Moments in Video with Natural Language. ICCV, 2017. [paper] [code]
  • [HowTo100M] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. ICCV, 2019. [paper] [homepage]
  • [VATEX] Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, William Yang Wang. VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. ICCV, 2019. [paper] [homepage]

Licenses

CC0

To the extent possible under law, danieljf24 has waived all copyright and related or neighboring rights to this repository.
