This is a curated list of audio-visual learning methods and datasets, based on our survey: "Learning in Audio-visual Context: A Review, Analysis, and New Perspective". This list will continue to be updated; please feel free to nominate good related works via Pull Requests!
[Website of Our Survey], [arXiv]
- Overview
- Table of contents
[Applied Intelligence-2015]
Audio-visual Speech Recognition Using Deep Learning
Authors: Kuniaki Noda, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G. Okuno, Tetsuya Ogata
Institution: Waseda University; Kyoto University; Honda Research Institute Japan Co., Ltd.
[CVPR-2016]
Temporal Multimodal Learning in Audiovisual Speech Recognition
Authors: Di Hu, Xuelong Li, Xiaoqiang Lu
Institution: Northwestern Polytechnical University; Chinese Academy of Sciences
[AVSP-2017]
End-To-End Audiovisual Fusion With LSTMs
Authors: Stavros Petridis, Yujiang Wang, Zuwei Li, Maja Pantic
Institution: Imperial College London; University of Twente
[IEEE TPAMI-2018]
Deep Audio-visual Speech Recognition
Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman
Institution: University of Oxford; Google Inc.
[2019]
Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection
Authors: Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, Xu Sun
Institution: Peking University
[IEEE TNNLS-2022]
Multimodal Sparse Transformer Network for Audio-visual Speech Recognition
Authors: Qiya Song, Bin Sun, Shutao Li
Institution: Hunan University
[Interspeech-2022]
Robust Self-Supervised Audio-Visual Speech Recognition
Authors: Bowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed
Institution: Toyota Technological Institute at Chicago; Meta AI
[2022]
Bayesian Neural Network Language Modeling for Speech Recognition
Authors: Boyang Xue, Shoukang Hu, Junhao Xu, Mengzhe Geng, Xunying Liu, Helen Meng
Institution: The Chinese University of Hong Kong
[Interspeech-2022]
Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition
Authors: Joanna Hong, Minsu Kim, Daehun Yoo, Yong Man Ro
Institution: KAIST; Genesis Lab Inc.
[MLSP-2022]
Rethinking Audio-visual Synchronization for Active Speaker Detection
Authors: Abudukelimu Wuerkaixi, You Zhang, Zhiyao Duan, Changshui Zhang
Institution: Tsinghua University; Beijing National Research Center for Information Science and Technology; University of Rochester
[NeurIPS-2022]
A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer
Authors: Wei-Ning Hsu, Bowen Shi
Institution: Toyota Technological Institute at Chicago
[ITOEC-2022]
FSMS: An Enhanced Polynomial Sampling Fusion Method for Audio-Visual Speech Recognition
Authors: Chenghan Li, Yuxin Zhang, Huaichang Du
Institution: Communication University of China
[IJCNN-2022]
Continuous Phoneme Recognition based on Audio-Visual Modality Fusion
Authors: Julius Richter, Jeanine Liebold, Timo Gerkmann
Institution: Universität Hamburg
[ICIP-2022]
Learning Contextually Fused Audio-Visual Representations For Audio-Visual Speech Recognition
Authors: Zi-Qiang Zhang, Jie Zhang, Jian-Shu Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai
Institution: University of Science and Technology of China; Chinese Academy of Sciences; iFLYTEK Co., Ltd.
[2022]
Self-Supervised Audio-Visual Speech Representations Learning By Multimodal Self-Distillation
Authors: Jing-Xuan Zhang, Genshun Wan, Zhen-Hua Ling, Jia Pan, Jianqing Gao, Cong Liu
Institution: University of Science and Technology of China; iFLYTEK Co. Ltd.
[CVPR-2022]
Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations
Authors: Dan Oneaţă, Horia Cucu
Institution: University POLITEHNICA of Bucharest
[AAAI-2022]
Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading
Authors: Minsu Kim, Jeong Hun Yeo, Yong Man Ro
Institution: Korea Advanced Institute of Science and Technology
[AAAI-2023]
Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning
Authors: Chen Chen, Yuchen Hu, Qiang Zhang, Heqing Zou, Beier Zhu, Eng Siong Chng
Institution: Nanyang Technological University; ZJU-Hangzhou Global Scientific and Technological Innovation Center; Zhejiang University
[WACV-2023]
Audio-Visual Efficient Conformer for Robust Speech Recognition
Authors: Maxime Burchi, Radu Timofte
Institution: University of Würzburg
[2023]
Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition
Authors: Minsu Kim, Hyung-Il Kim, Yong Man Ro
Institution: Korea Advanced Institute of Science and Technology; Electronics and Telecommunications Research Institute
[2023]
Multimodal Speech Recognition for Language-Guided Embodied Agents
Authors: Allen Chang, Xiaoyuan Zhu, Aarav Monga, Seoho Ahn, Tejas Srinivasan, Jesse Thomason
Institution: University of Southern California
[2023]
MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
Authors: Mohamed Anwar, Bowen Shi, Vedanuj Goswami, Wei-Ning Hsu, Juan Pino, Changhan Wang
Institution: Meta AI
[ICASSP-2023]
The NPU-ASLP System for Audio-Visual Speech Recognition in MISP 2022 Challenge
Authors: Pengcheng Guo, He Wang, Bingshen Mu, Ao Zhang, Peikun Chen
Institution: Northwestern Polytechnical University
[CVPR-2023]
Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring
Authors: Joanna Hong, Minsu Kim, Jeongsoo Choi, Yong Man Ro
Institution: KAIST
[MTA-2016]
Audio-visual Speaker Diarization Using Fisher Linear Semi-discriminant Analysis
Authors: Nikolaos Sarafianos, Theodoros Giannakopoulos, Sergios Petridis
Institution: National Center for Scientific Research “Demokritos”
[ICASSP-2018]
Audio-visual Person Recognition in Multimedia Data From the Iarpa Janus Program
Authors: Gregory Sell, Kevin Duh, David Snyder, Dave Etter, Daniel Garcia-Romero
Institution: The Johns Hopkins University
[ICASSP-2019]
Noise-tolerant Audio-visual Online Person Verification Using an Attention-based Neural Network Fusion
Authors: Suwon Shon, Tae-Hyun Oh, James Glass
Institution: MIT Computer Science and Artificial Intelligence Laboratory, Cambridge
[Interspeech-2019]
Who Said That?: Audio-visual Speaker Diarisation Of Real-World Meetings
Authors: Joon Son Chung, Bong-Jin Lee, Icksang Han
Institution: Naver Corporation
[ICASSP-2020]
Self-Supervised Learning for Audio-visual Speaker Diarization
Authors: Yifan Ding, Yong Xu, Shi-Xiong Zhang, Yahuan Cong, Liqiang Wang
Institution: University of Central Florida; Tencent AI Lab; Beijing University of Posts and Telecommunications
[ICASSP-2021]
A Multi-View Approach to Audio-visual Speaker Verification
Authors: Leda Sari, Kritika Singh, Jiatong Zhou, Lorenzo Torresani, Nayan Singhal, Yatharth Saraf
Institution: University of Illinois at Urbana-Champaign; Facebook AI Research
[IEEE/ACM TASLP-2021]
Audio-visual Deep Neural Network for Robust Person Verification
Authors: Yanmin Qian, Zhengyang Chen, Shuai Wang
Institution: Shanghai Jiao Tong University
[ICDIP-2022]
End-To-End Audiovisual Feature Fusion for Active Speaker Detection
Authors: Fiseha B. Tesema, Zheyuan Lin, Shiqiang Zhu, Wei Song, Jason Gu, Hong Wu
Institution: Interdisciplinary Innovation Research Institute, Zhejiang Lab; Dalhousie University; University of Electronic Science and Technology of China; Zhejiang University
[EUVIP-2022]
Active Speaker Recognition using Cross Attention Audio-Video Fusion
Authors: Bogdan Mocanu, Ruxandra Tapu
Institution: University "Politehnica" of Bucharest; Télécom SudParis
[2022]
Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection
Authors: Rahul Sharma, Shrikanth Narayanan
Institution: University of Southern California
[SLT-2023]
Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection
Authors: Xuanjun Chen, Haibin Wu, Helen Meng, Hung-yi Lee, Jyh-Shing Roger Jang
Institution: National Taiwan University; The Chinese University of Hong Kong
[ICAI-2023]
Speaker Recognition in Realistic Scenario Using Multimodal Data
Authors: Saqlain Hussain Shah, Muhammad Saad Saeed, Shah Nawaz, Muhammad Haroon Yousaf
Institution: University of Engineering and Technology Taxila; Swarm Robotics Lab NCRA; Deutsches Elektronen-Synchrotron DESY
[CVPR-2023]
A Light Weight Model for Active Speaker Detection
Authors: Junhua Liao, Haihan Duan, Kanghui Feng, Wanbing Zhao, Yanbing Yang, Liangyin Chen
Institution: Sichuan University; The Chinese University of Hong Kong
[ICASSP-2023]
The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition
Authors: Zhe Wang, Shilong Wu, Hang Chen, Mao-Kui He, Jun Du, Chin-Hui Lee, Jingdong Chen, Shinji Watanabe, Sabato Siniscalchi, Odette Scharenborg, Diyuan Liu, Baocai Yin, Jia Pan, Jianqing Gao, Cong Liu
Institution: University of Science and Technology of China; Georgia Institute of Technology; Carnegie Mellon University; Kore University of Enna; iFlytek; Northwestern Polytechnical University; Delft University of Technology
[IJCNN-2016]
Exploring Multimodal Video Representation For Action Recognition
Authors: Cheng Wang, Haojin Yang, Christoph Meinel
Institution: University of Potsdam
[CVPR-2018]
The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary
Authors: Bernard Ghanem, Juan Carlos Niebles, Cees Snoek, Fabian Caba Heilbron, Humam Alwassel, Victor Escorcia, Ranjay Krishna, Shyamal Buch, Cuong Duc Dao
Institution: King Abdullah University of Science and Technology; Stanford University; Universidad del Norte; Universiteit van Amsterdam
[ICCV-2019]
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
Authors: Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
Institution: University of Bristol; University of Oxford
[ICCV-2019]
SCSampler: Sampling Salient Clips From Video for Efficient Action Recognition
Authors: Bruno Korbar, Du Tran, Lorenzo Torresani
Institution: Facebook AI Research
[ICCV-2019]
Uncertainty-Aware Audiovisual Activity Recognition Using Deep Bayesian Variational Inference
Authors: Mahesh Subedar, Ranganath Krishnan, Paulo Lopez Meyer, Omesh Tickoo, Jonathan Huang
Institution: Intel Labs
[CVPR-2020]
Listen to Look: Action Recognition by Previewing Audio
Authors: Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani
Institution: The University of Texas at Austin; Facebook AI Research
[2020]
Audiovisual SlowFast Networks for Video Recognition
Authors: Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, Christoph Feichtenhofer
Institution: University of California; Facebook AI Research
[ICCV-2021]
AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition
Authors: Rameswar Panda, Chun-Fu (Richard) Chen, Quanfu Fan, Ximeng Sun, Kate Saenko, Aude Oliva, Rogerio Feris
Institution: MIT-IBM Watson AI Lab; Boston University; Massachusetts Institute of Technology
[2021]
Cross-Domain First Person Audio-Visual Action Recognition through Relative Norm Alignment
Authors: Mirco Planamente, Chiara Plizzari, Emanuele Alberti, Barbara Caputo
Institution: Politecnico di Torino; Istituto Italiano di Tecnologia
[WACV-2022]
Domain Generalization Through Audio-Visual Relative Norm Alignment in First Person Action Recognition
Authors: Mirco Planamente, Chiara Plizzari, Emanuele Alberti, Barbara Caputo
Institution: Politecnico di Torino; Istituto Italiano di Tecnologia; CINI Consortium
[CVPR-2022]
Audio-Adaptive Activity Recognition Across Video Domains
Authors: Yunhua Zhang, Hazel Doughty, Ling Shao, Cees G. M. Snoek
Institution: University of Amsterdam; Inception Institute of Artificial Intelligence
[WACV-2022]
MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition
Authors: Jiawei Chen, Chiu Man Ho
Institution: OPPO US Research Center
[CVPR-2022]
Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos
Authors: Saghir Alfasly, Jian Lu, Chen Xu, Yuru Zou
Institution: Shenzhen University; Guangdong Key Laboratory of Intelligent Information Processing; Pazhou Lab
[2022]
Noise-Tolerant Learning for Audio-Visual Action Recognition
Authors: Haochen Han, Qinghua Zheng, Minnan Luo, Kaiyao Miao, Feng Tian, Yan Chen
Institution: Xi'an Jiaotong University; Shaanxi Provincial Key Laboratory of Multimedia Knowledge Fusion and Engineering; Ministry of Education Key Laboratory for Intelligent Networks and Network Security
[ICLR-2023]
Exploring Temporally Dynamic Data Augmentation for Video Recognition
Authors: Taeoh Kim, Jinhyung Kim, Minho Shim, Sangdoo Yun, Myunggu Kang, Dongyoon Wee, Sangyoun Lee
Institution: NAVER Clova; KAIST; NAVER AI Lab; Yonsei University
[2023]
Epic-Sounds: A Large-scale Dataset of Actions That Sound
Authors: Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, Andrew Zisserman
Institution: University of Oxford; University of Bristol
[EMNLP-2017]
Tensor Fusion Network for Multimodal Sentiment Analysis
Authors: Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, Louis-Philippe Morency
Institution: Carnegie Mellon University; Nanyang Technological University
[AAAI-2018]
Multi-attention Recurrent Network for Human Communication Comprehension
Authors: Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, Louis-Philippe Morency
Institution: Carnegie Mellon University; Nanyang Technological University
[AAAI-2018]
Memory Fusion Network for Multi-view Sequential Learning
Authors: Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, Louis-Philippe Morency
Institution: Carnegie Mellon University; Instituto Politécnico Nacional; Nanyang Technological University
[NAACL-2018]
Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos
Authors: Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, Roger Zimmermann
Institution: National University of Singapore
[EMNLP-2018]
Contextual Inter-modal Attention for Multi-modal Sentiment Analysis
Authors: Deepanway Ghosal, Md Shad Akhtar, Dushyant Chauhan, Soujanya Poria, Asif Ekbal, Pushpak Bhattacharyya
Institution: Indian Institute of Technology Patna; Nanyang Technological University
[ACL-2019]
Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model
Authors: Yitao Cai, Huiyu Cai, Xiaojun Wan
Institution: Peking University
[ACL-2020]
Sentiment and Emotion help Sarcasm? A Multi-task Learning Framework for Multi-Modal Sarcasm, Sentiment and Emotion Analysis
Authors: Dushyant Singh Chauhan, Dhanush S R, Asif Ekbal, Pushpak Bhattacharyya
Institution: Indian Institute of Technology Patna
[ACL-2020]
A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis
Authors: Jean-Benoit Delbrouck, Noé Tits, Mathilde Brousmiche, Stéphane Dupont
Institution: University of Mons
[ACL-2020]
Multilogue-Net: A Context Aware RNN for Multi-modal Emotion Detection and Sentiment Analysis in Conversation
Authors: Aman Shenoy, Ashish Sardana
Institution: Birla Institute of Technology and Science, Pilani; NVIDIA Graphics
[CVPR-2021]
Progressive Modality Reinforcement for Human Multimodal Emotion Recognition From Unaligned Multimodal Sequences
Authors: Fengmao Lv, Xiang Chen, Yanyong Huang, Lixin Duan, Guosheng Lin
Institution: Southwest Jiaotong University; Southwestern University of Finance and Economics; Tencent; University of Electronic Science and Technology of China; Nanyang Technological University
[IEEE TAC-2021]
Multi-modal Sarcasm Detection and Humor Classification in Code-mixed Conversations
Authors: Manjot Bedi, Shivani Kumar, Md Shad Akhtar, Tanmoy Chakraborty
Institution: Indraprastha Institute of Information Technology, Delhi
[IEEE SLT-2021]
Detecting expressions with multimodal transformers
Authors: Srinivas Parthasarathy, Shiva Sundaram
Institution: Amazon
[CVPR-2022]
M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation
Authors: Vishal Chudasama, Purbayan Kar, Ashish Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Naoyuki Onoe
Institution: Sony Research India
[CCC-2022]
A Multimodal Emotion Perception Model based on Context-Aware Decision-Level Fusion
Authors: Yishan Chen, Zhiyang Jia, Kaoru Hirota, Yaping Dai
Institution: Beijing Institute of Technology; State Key Laboratory of Intelligent Control and Decision of Complex Systems
[IJCNN-2022]
Sense-aware BERT and Multi-task Fine-tuning for Multimodal Sentiment Analysis
Authors: Lingyong Fang, Gongshen Liu, Ru Zhang
Institution: Shanghai Jiao Tong University; Beijing University of Posts and Telecommunications
[IEEE TASLP-2022]
EmoInt-Trans: A Multimodal Transformer for Identifying Emotions and Intents in Social Conversations
Authors: Gopendra Vikram Singh, Mauajama Firdaus, Asif Ekbal, Pushpak Bhattacharyya
Institution: Indian Institute of Technology
[ICPR-2022]
Self-attention fusion for audiovisual emotion recognition with incomplete data
Authors: Kateryna Chumachenko, Alexandros Iosifidis, Moncef Gabbouj
Institution: Tampere University; Aarhus University
[IEEE TAC-2023]
Audio-Visual Emotion Recognition With Preference Learning Based on Intended and Multi-Modal Perceived Labels
Authors: Yuanyuan Lei, Houwei Cao
Institution: Texas A&M University; New York Institute of Technology
[IEEE T-BIOM-2023]
Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention
Authors: R Gnana Praveen, Patrick Cardinal, Eric Granger
Institution: École de technologie supérieure
[Interspeech-2018]
Visual Speech Enhancement
Authors: Aviv Gabbay, Asaph Shamir, Shmuel Peleg
Institution: The Hebrew University of Jerusalem
[Interspeech-2018]
The Conversation: Deep Audio-Visual Speech Enhancement
Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
Institution: University of Oxford
[IEEE TETCI-2018]
Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks
Authors: Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Yu Tsao, Hsiu-Wen Chang, Hsin-Min Wang
Institution: Research Center for Information Technology Innovation; National Taiwan University; National Yang-Ming University; Mackay Medical College; Academia Sinica
[ICASSP-2018]
Seeing Through Noise: Visually Driven Speaker Separation And Enhancement
Authors: Aviv Gabbay, Ariel Ephrat, Tavi Halperin, Shmuel Peleg
Institution: The Hebrew University of Jerusalem
[GlobalSIP-2019]
Visually Assisted Time-Domain Speech Enhancement
Authors: Elham Ideli, Bruce Sharpe, Ivan V. Bajić, Rodney G. Vaughan
Institution: Simon Fraser University; SingSoftNext
[ICASSP-2019]
On Training Targets and Objective Functions for Deep-learning-based Audio-visual Speech Enhancement
Authors: Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen
Institution: Aalborg University; Oticon A/S
[InterSpeech-2019]
Multimodal SpeakerBeam: Single Channel Target Speech Extraction with Audio-Visual Speaker Clues
Authors: Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Atsunori Ogawa, Tomohiro Nakatani
Institution: Nippon Telegraph & Telephone Corporation
[Interspeech-2019]
My Lips Are Concealed: Audio-Visual Speech Enhancement Through Obstructions
Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
Institution: University of Oxford; Naver Corporation
[2020]
Facefilter: Audio-Visual Speech Separation Using Still Images
Authors: Soo-Whan Chung, Soyeon Choe, Joon Son Chung, Hong-Goo Kang
Institution: Yonsei University; Naver Corporation
[ICASSP-2020]
Robust Unsupervised Audio-Visual Speech Enhancement Using a Mixture of Variational Autoencoders
Authors: Mostafa Sadeghi, Xavier Alameda-Pineda
Institution: Inria Grenoble Rhône-Alpes
[CVPR-2021]
Looking Into Your Speech: Learning Cross-Modal Affinity for Audio-Visual Speech Separation
Authors: Jiyoung Lee, Soo-Whan Chung, Sunok Kim, Hong-Goo Kang, Kwanghoon Sohn
Institution: Yonsei University; Naver Corporation; Korea Aerospace University
[ISCAS-2021]
Audio-Visual Target Speaker Enhancement on Multi-Talker Environment using Event-Driven Cameras
Authors: Ander Arriandiaga, Giovanni Morrone, Luca Pasa, Leonardo Badino, Chiara Bartolozzi
Institution: Istituto Italiano di Tecnologia; University of Modena and Reggio Emilia
[ICASSP-2022]
The Impact of Removing Head Movements on Audio-Visual Speech Enhancement
Authors: Zhiqi Kang, Mostafa Sadeghi, Radu Horaud, Xavier Alameda-Pineda, Jacob Donley, Anurag Kumar
Institution: Inria Grenoble; Université Grenoble Alpes; Inria Nancy Grand-Est; Reality Labs Research
[2022]
Dual-path Attention is All You Need for Audio-Visual Speech Extraction
Authors: Zhongweiyang Xu, Xulin Fan, Mark Hasegawa-Johnson
Institution: University of Illinois at Urbana-Champaign
[ICASSP-2022]
Audio-visual multi-channel speech separation, dereverberation and recognition
Authors: Guinan Li, Jianwei Yu, Jiajun Deng, Xunying Liu, Helen Meng
Institution: The Chinese University of Hong Kong; Tencent AI lab
[2022]
Audio-visual speech separation based on joint feature representation with cross-modal attention
Authors: Junwen Xiong, Peng Zhang, Lei Xie, Wei Huang, Yufei Zha, Yanning Zhang
Institution: Northwestern Polytechnical University; Nanchang University
[CVPR-2022]
Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis
Authors: Karren Yang, Dejan Marković, Steven Krenn, Vasu Agrawal, Alexander Richard
Institution: Massachusetts Institute of Technology; Meta Reality Labs Research
[IEEE MMSP-2022]
As We Speak: Real-Time Visually Guided Speaker Separation and Localization
Authors: Piotr Czarnecki, Jakub Tkaczuk
Institution: Warsaw University of Technology
[IEEE HEALTHCOM-2022]
A Novel Frame Structure for Cloud-Based Audio-Visual Speech Enhancement in Multimodal Hearing-aids
Authors: Abhijeet Bishnu, Ankit Gupta, Mandar Gogate, Kia Dashtipour, Ahsan Adeel, Amir Hussain, Mathini Sellathurai, Tharmalingam Ratnarajah
Institution: University of Edinburgh; Heriot-Watt University; Edinburgh Napier University; University of Wolverhampton
[CVPR-2022]
Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation
Authors: Akam Rahimi, Triantafyllos Afouras, Andrew Zisserman
Institution: University of Oxford
[WACV-2023]
BirdSoundsDenoising: Deep Visual Audio Denoising for Bird Sounds
Authors: Youshan Zhang, Jialu Li
Institution: Yeshiva University; Cornell University
[SLT-2023]
AVSE Challenge: Audio-Visual Speech Enhancement Challenge
Authors: Andrea Lorena Aldana Blanco, Cassia Valentini-Botinhao, Ondrej Klejch, Mandar Gogate, Kia Dashtipour, Amir Hussain, Peter Bell
Institution: University of Edinburgh
[ICLR-2023]
Filter-Recovery Network for Multi-Speaker Audio-Visual Speech Separation
Authors: Haoyue Cheng, Zhaoyang Liu, Wayne Wu, Limin Wang
Institution: Nanjing University; SenseTime
[WACV-2023]
Unsupervised Audio-Visual Lecture Segmentation
Authors: Darshan Singh S, Anchit Gupta, C. V. Jawahar, Makarand Tapaswi
Institution: International Institute of Information Technology, Hyderabad
[ISCSLP-2022]
Multi-Task Joint Learning for Embedding Aware Audio-Visual Speech Enhancement
Authors: Chenxi Wang, Hang Chen, Jun Du, Baocai Yin, Jia Pan
Institution: University of Science and Technology of China; iFlytek
[ICASSP-2023]
Real-Time Audio-Visual End-to-End Speech Enhancement
Authors: Zirun Zhu, Hemin Yang, Min Tang, Ziyi Yang, Sefik Emre Eskimez, Huaming Wang
Institution: Microsoft
[ECCV-2018]
Learning to Separate Object Sounds by Watching Unlabeled Video
Authors: Ruohan Gao, Rogerio Feris, Kristen Grauman
Institution: The University of Texas at Austin; IBM Research; Facebook AI Research
[ECCV-2018]
The Sound of Pixels
Authors: Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab; Columbia University
[ICASSP-2019]
Self-supervised Audio-visual Co-segmentation
Authors: Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[ICCV-2019]
The Sound of Motions
Authors: Hang Zhao, Chuang Gan, Wei-Chiu Ma, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[ICCV-2019]
Recursive Visual Sound Separation Using Minus-Plus Net
Authors: Xudong Xu, Bo Dai, Dahua Lin
Institution: The Chinese University of Hong Kong
[ICCV-2019]
Co-Separating Sounds of Visual Objects
Authors: Ruohan Gao, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research
[ACCV-2020]
Visually Guided Sound Source Separation using Cascaded Opponent Filter Network
Authors: Lingyu Zhu, Esa Rahtu
Institution: Tampere University
[CVPR-2020]
Music Gesture for Visual Sound Separation
Authors: Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[ICCV-2021]
Visual Scene Graphs for Audio Source Separation
Authors: Moitreya Chatterjee, Jonathan Le Roux, Narendra Ahuja, Anoop Cherian
Institution: University of Illinois at Urbana-Champaign; Mitsubishi Electric Research Laboratories
[CVPR-2021]
Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation
Authors: Yapeng Tian, Di Hu, Chenliang Xu
Institution: University of Rochester; Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods
[ECCV-2022]
AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation
Authors: Efthymios Tzinis, Scott Wisdom, Tal Remez, John R. Hershey
Institution: Google Research; University of Illinois Urbana-Champaign
[ICIP-2022]
Visual Sound Source Separation with Partial Supervision Learning
Authors: Huasen Wang, Lingling Gao, Qianchao Tan, Luping Ji
Institution: University of Electronic Science and Technology of China
[NeurIPS-2022]
Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation
Authors: Moitreya Chatterjee, Narendra Ahuja, Anoop Cherian
Institution: University of Illinois; Mitsubishi Electric Research Labs
[ICLR-2023]
CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos
Authors: Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, Taylor Berg-Kirkpatrick
Institution: Sony Group Corporation; University of California San Diego
[CVPR-2020]
Learning to Have an Ear for Face Super-Resolution
Authors: Givi Meishvili, Simon Jenni, Paolo Favaro
Institution: University of Bern
[IEEE TCSVT-2021]
Appearance Matters, So Does Audio: Revealing the Hidden Face via Cross-Modality Transfer
Authors: Chenqi Kong, Baoliang Chen, Wenhan Yang, Haoliang Li, Peilin Chen, Shiqi Wang
Institution: City University of Hong Kong; Nanyang Technological University
[CVPR-2022]
Deep Video Inpainting Guided by Audio-Visual Self-Supervision
Authors: Kyuyeon Kim, Junsik Jung, Woo Jae Kim, Sung-Eui Yoon
Institution: Korea Advanced Institute of Science and Technology
[CVPR-2022]
Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices?
Authors: Cho-Ying Wu, Chin-Cheng Hsu, Ulrich Neumann
Institution: University of Southern California
[WACV-2023]
Audio-Visual Face Reenactment
Authors: Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar
Institution: International Institute of Information Technology, Hyderabad; University of Bath
[ICASSP-2017]
Vid2speech: Speech Reconstruction From Silent Video
Authors: Ariel Ephrat, Shmuel Peleg
Institution: The Hebrew University of Jerusalem
[ICCV-2017]
Improved Speech Reconstruction From Silent Video
Authors: Ariel Ephrat, Tavi Halperin, Shmuel Peleg
Institution: The Hebrew University of Jerusalem
[ICASSP-2018]
Lip2Audspec: Speech Reconstruction from Silent Lip Movements Video
Authors: Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani
Institution: Columbia University
[MM-2018]
Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed
Authors: Yaman Kumar, Mayank Aggarwal, Pratham Nawal, Shin'ichi Satoh, Rajiv Ratn Shah, Roger Zimmermann
Institution: Netaji Subhas Institute of Technology; National Institute of Informatics; Indraprastha Institute of Information Technology; National University of Singapore
[2019]
Video-Driven Speech Reconstruction using Generative Adversarial Networks
Authors: Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantic
Institution: Imperial College London; Samsung AI Centre
[Interspeech-2019]
Hush-Hush Speak: Speech Reconstruction Using Silent Videos
Authors: Shashwat Uttam, Yaman Kumar Singla, Dhruva Sahrawat, Mansi Agarwal
Institution: Netaji Subhas Institute of Technology; Adobe Research; National University of Singapore; Delhi Technological University
[ICASSP-2021]
Learning Audio-Visual Correlations From Variational Cross-Modal Generation
Authors: Ye Zhu, Yu Wu, Hugo Latapie, Yi Yang, Yan Yan
Institution: Illinois Institute of Technology; University of Technology Sydney; Cisco
[IEEE TCYB-2022]
End-to-End Video-to-Speech Synthesis Using Generative Adversarial Networks
Authors: Rodrigo Mira, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Björn W. Schuller, Maja Pantic
Institution: Imperial College London; University of Augsburg; Meta AI
[ICPR-2022]
Learning Speaker-specific Lip-to-Speech Generation
Authors: Munender Varshney, Ravindra Yadav, Vinay P. Namboodiri, Rajesh M Hegde
Institution: Indian Institute of Technology; University of Bath
[ICASSP-2023]
Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech
Authors: Jiyoung Lee, Joon Son Chung, Soo-Whan Chung
Institution: NAVER AI Lab; Korea Advanced Institute of Science and Technology; NAVER Cloud
[IEEE TMM-2015]
Real-Time Piano Music Transcription Based on Computer Vision
Authors: Mohammad Akbari, Howard Cheng
Institution: Simon Fraser University; University of Lethbridge
[MM-2017]
Deep Cross-Modal Audio-Visual Generation
Authors: Lele Chen, Sudhanshu Srivastava, Zhiyao Duan, Chenliang Xu
Institution: University of Rochester
[NeurIPS-2020]
Audeo: Audio Generation for a Silent Performance Video
Authors: Kun Su, Xiulong Liu, Eli Shlizerman
Institution: University of Washington
[ECCV-2020]
Foley Music: Learning to Generate Music from Videos
Authors: Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[ICASSP-2020]
Sight to Sound: An End-to-End Approach for Visual Piano Transcription
Authors: A. Sophia Koepke, Olivia Wiles, Yael Moses, Andrew Zisserman
Institution: University of Oxford; The Interdisciplinary Center
[2020]
Multi-Instrumentalist Net: Unsupervised Generation of Music from Body Movements
Authors: Kun Su, Xiulong Liu, Eli Shlizerman
Institution: University of Washington
[ICASSP-2021]
Collaborative Learning to Generate Audio-Video Jointly
Authors: Vinod K Kurmi, Vipul Bajaj, Badri N Patro, K S Venkatesh, Vinay P Namboodiri, Preethi Jyothi
Institution: Indian Institute of Technology Kanpur; University of Bath; Indian Institute of Technology Bombay
[MM-2021]
Video Background Music Generation with Controllable Music Transformer
Authors: Shangzhe Di, Zeren Jiang, Si Liu, Zhaokai Wang, Leyan Zhu, Zexin He, Hongming Liu, Shuicheng Yan
Institution: Beihang University; Charterhouse School, Godalming, Surrey; Sea AI Lab
[2022]
Vis2Mus: Exploring Multimodal Representation Mapping for Controllable Music Generation
Authors: Runbang Zhang, Yixiao Zhang, Kai Shao, Ying Shan, Gus Xia
Institution: New York University, Shanghai; Queen Mary University of London; Tencent Inc.; Mohamed bin Zayed University of Artificial Intelligence
[CVPR-2016]
Visually Indicated Sounds
Authors: Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, William T. Freeman
Institution: Massachusetts Institute of Technology; U.C. Berkeley; Google Research
[CVPR-2018]
Visual to Sound: Generating Natural Sound for Videos in the Wild
Authors: Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, Tamara L. Berg
Institution: University of North Carolina at Chapel Hill; Adobe Research
[IEEE TIP-2020]
Generating Visually Aligned Sound From Videos
Authors: Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, Chuang Gan
Institution: South China University of Technology; China Pazhou Laboratory; MIT-IBM Watson AI Lab
[BMVC-2021]
Taming Visually Guided Sound Generation
Authors: Vladimir Iashin, Esa Rahtu
Institution: Tampere University
[IEEE TCSVT-2022]
Towards an End-to-End Visual-to-Raw-Audio Generation With GAN
Authors: Shiguang Liu, Sijia Li, Haonan Cheng
Institution: Tianjin University
[2022]
I Hear Your True Colors: Image Guided Audio Generation
Authors: Roy Sheffer, Yossi Adi
Institution: The Hebrew University of Jerusalem
[ACM TOG-2018]
Scene-aware audio for 360° videos
Authors: Dingzeyu Li, Timothy R. Langlois, Changxi Zheng
Institution: Columbia University; Adobe Research
[NeurIPS-2018]
Self-Supervised Generation of Spatial Audio for 360° Video
Authors: Pedro Morgado, Nuno Vasconcelos, Timothy Langlois, Oliver Wang
Institution: University of California San Diego; Adobe Research
[CVPR-2019]
2.5D Visual Sound
Authors: Ruohan Gao, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research
[ICIP-2019]
Self-Supervised Audio Spatialization with Correspondence Classifier
Authors: Yu-Ding Lu, Hsin-Ying Lee, Hung-Yu Tseng, Ming-Hsuan Yang
Institution: University of California at Merced
[ECCV-2020]
Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation
Authors: Hang Zhou, Xudong Xu, Dahua Lin, Xiaogang Wang, Ziwei Liu
Institution: The Chinese University of Hong Kong
[CVPR-2021]
Visually Informed Binaural Audio Generation without Binaural Audios
Authors: Xudong Xu, Hang Zhou, Ziwei Liu, Bo Dai, Xiaogang Wang, Dahua Lin
Institution: The Chinese University of Hong Kong; Nanyang Technological University
[AAAI-2021]
Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation
Authors: Yan-Bo Lin, Yu-Chiang Frank Wang
Institution: National Taiwan University; ASUS Intelligent Cloud Services
[TOG-2021]
Binaural Audio Generation via Multi-task Learning
Authors: Sijia Li, Shiguang Liu, Dinesh Manocha
Institution: Tianjin University; University of Maryland at College Park
[WACV-2022]
Beyond Mono to Binaural: Generating Binaural Audio From Mono Audio With Depth and Cross Modal Attention
Authors: Kranti Kumar Parida, Siddharth Srivastava, Gaurav Sharma
Institution: Indian Institute of Technology Kanpur; CDAC Noida; TensorTour Inc.
[ACM TOG-2017]
Synthesizing Obama: learning lip sync from audio
Authors: Supasorn Suwajanakorn, Steven Maxwell Seitz, Ira Kemelmacher-Shlizerman
Institution: University of Washington
[ECCV-2018]
Lip Movements Generation at a Glance
Authors: Lele Chen, Zhiheng Li, Ross K Maddox, Zhiyao Duan, Chenliang Xu
Institution: University of Rochester
[IJCV-2019]
You Said That?: Synthesising Talking Faces from Audio
Authors: Amir Jamaludin, Joon Son Chung, Andrew Zisserman
Institution: University of Oxford
[ICCV-2019]
Few-Shot Adversarial Learning of Realistic Neural Talking Head Models
Authors: Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, Victor Lempitsky
Institution: Samsung AI Center; Skolkovo Institute of Science and Technology
[IJCV-2020]
Realistic Speech-Driven Facial Animation with GANs
Authors: Konstantinos Vougioukas, Stavros Petridis, Maja Pantic
Institution: Imperial College London; Samsung AI Research Centre Cambridge
[IJCV-2020]
GANimation: One-Shot Anatomically Consistent Facial Animation
Authors: Albert Pumarola, Antonio Agudo, Aleix M. Martinez, Alberto Sanfeliu, Francesc Moreno-Noguer
Institution: Institut de Robòtica i Informàtica Industrial; The Ohio State University
[ACM TOG-2020]
MakeItTalk: Speaker-Aware Talking-Head Animation
Authors: Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, Dingzeyu Li
Institution: University of Massachusetts Amherst; Huya Inc.; Adobe Research
[CVPR-2020]
FReeNet: Multi-Identity Face Reenactment
Authors: Jiangning Zhang, Xianfang Zeng, Mengmeng Wang, Yusu Pan, Liang Liu, Yong Liu, Yu Ding, Changjie Fan
Institution: Zhejiang University; Fuxi AI Lab
[ECCV-2020]
Neural Voice Puppetry: Audio-driven Facial Reenactment
Authors: Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, Matthias Nießner
Institution: Technical University of Munich; Saarland Informatics Campus
[CVPR-2020]
Rotate-and-Render: Unsupervised Photorealistic Face Rotation from Single-View Images
Authors: Hang Zhou, Jihao Liu, Ziwei Liu, Yu Liu, Xiaogang Wang
Institution: The Chinese University of Hong Kong; SenseTime Research
[ECCV-2020]
MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation
Authors: Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, Chen Change Loy
Institution: SenseTime Research; Carnegie Mellon University; Center for Research on Intelligent Perception and Computing, CASIA; University of Chinese Academy of Sciences; Shenzhen Institutes of Advanced Technology, Chinese Academy of Science; Nanyang Technological University
[AAAI-2021]
Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation
Authors: Lincheng Li, Suzhen Wang, Zhimeng Zhang, Yu Ding, Yixing Zheng, Xin Yu, Changjie Fan
Institution: Netease Fuxi AI Lab; University of Technology Sydney
[CVPR-2021]
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
Authors: Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, Ziwei Liu
Institution: The Chinese University of Hong Kong; SenseTime Research; Tokyo Institute of Technology; Nanyang Technological University
[CVPR-2021]
Audio-Driven Emotional Video Portraits
Authors: Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, Feng Xu
Institution: Nanjing University; The Chinese University of Hong Kong; The University of Sydney; SenseTime Research; Nanyang Technological University; Tsinghua University
[AAAI-2022]
One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning
Authors: Suzhen Wang, Lincheng Li, Yu Ding, Xin Yu
Institution: Netease Fuxi AI Lab; University of Technology Sydney
[TVCG-2022]
Generating talking face with controllable eye movements by disentangled blinking feature
Authors: Shiguang Liu, Jiaqi Hao
Institution: Tianjin University
[AAAI-2022]
SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory
Authors: Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro
Institution: Korea Advanced Institute of Science and Technology
[CVPR-2022]
FaceFormer: Speech-Driven 3D Facial Animation with Transformers
Authors: Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, Taku Komura
Institution: The University of Hong Kong; The Hong Kong University of Science and Technology; Adobe Research; Texas A&M University
[IVA-2018]
Evaluation of Speech-to-Gesture Generation Using Bi-Directional LSTM Network
Authors: Dai Hasegawa, Naoshi Kaneko, Shinichi Shirakawa, Hiroshi Sakuta, Kazuhiko Sumi
Institution: Hokkai Gakuen University Sapporo; Aoyama Gakuin University; Yokohama National University
[IVA-2019]
Analyzing Input and Output Representations for Speech-Driven Gesture Generation
Authors: Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, Hedvig Kjellström
Institution: KTH Royal Institute of Technology in Stockholm; Hokkai Gakuen University; Aoyama Gakuin University
[CVPR-2019]
Learning Individual Styles of Conversational Gesture
Authors: Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, Jitendra Malik
Institution: University of California, Berkeley; Zebra Medical Vision; Massachusetts Institute of Technology
[ICMI-2019]
To React or not to React: End-to-End Visual Pose Forecasting for Personalized Avatar during Dyadic Conversations
Authors: Chaitanya Ahuja, Shugao Ma, Louis-Philippe Morency, Yaser Sheikh
Institution: Carnegie Mellon University; Facebook Reality Labs
[EUROGRAPHICS-2020]
Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows
Authors: Simon Alexanderson, Gustav Eje Henter, Taras Kucherenko, Jonas Beskow
Institution: KTH Royal Institute of Technology
[ICMI-2020]
Gesticulator: A Framework For Semantically-Aware Speech-Driven Gesture Generation
Authors: Taras Kucherenko, Patrik Jonell, Sanne van Waveren, Gustav Eje Henter, Simon Alexanderson, Iolanda Leite, Hedvig Kjellström
Institution: KTH Royal Institute of Technology
[2020]
Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach
Authors: Chaitanya Ahuja, Dong Won Lee, Yukiko I. Nakano, Louis-Philippe Morency
Institution: Carnegie Mellon University; Seikei University
[ACM TOG-2020]
Speech Gesture Generation From The Trimodal Context Of Text, Audio, And Speaker Identity
Authors: Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, Geehyuk Lee
Institution: Korea Advanced Institute of Science and Technology; University of Science and Technology; Electronics and Telecommunications Research Institute
[CVPR-2022]
SEEG: Semantic Energized Co-Speech Gesture Generation
Authors: Yuanzhi Liang, Qianyu Feng, Linchao Zhu, Li Hu, Pan Pan, Yi Yang
Institution: Alibaba; University of Technology Sydney; Zhejiang University
[IEEE TNNLS-2022]
VAG: A Uniform Model for Cross-Modal Visual-Audio Mutual Generation
Authors: Wangli Hao, He Guan, Zhaoxiang Zhang
Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences
[MM-2018]
Dance with Melody: An LSTM-autoencoder Approach to Music-oriented Dance Synthesis
Authors: Taoran Tang, Jia Jia, Hanyang Mao
Institution: Tsinghua University
[CVPR-2018]
Audio to Body Dynamics
Authors: Eli Shlizerman, Lucio Dery, Hayden Schoen, Ira Kemelmacher-Shlizerman
Institution: Facebook Inc.; Stanford University; University of Washington
[NeurIPS-2019]
Dancing to Music
Authors: Hsin-Ying Lee, Xiaodong Yang, Ming-Yu Liu, Ting-Chun Wang, Yu-Ding Lu, Ming-Hsuan Yang, Jan Kautz
Institution: University of California; NVIDIA
[ICLR-2021]
Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning
Authors: Ruozi Huang, Huang Hu, Wei Wu, Kei Sawada, Mi Zhang, Daxin Jiang
Institution: Fudan University; Microsoft STCA; Meituan; Rinna AI
[ICCV-2021]
AI Choreographer: Music Conditioned 3D Dance Generation With AIST++
Authors: Ruilong Li, Shan Yang, David A. Ross, Angjoo Kanazawa
Institution: University of Southern California; Google Research; University of California, Berkeley
[ICASSP-2022]
Genre-Conditioned Long-Term 3D Dance Generation Driven by Music
Authors: Yuhang Huang, Junjie Zhang, Shuyan Liu, Qian Bao, Dan Zeng, Zhineng Chen, Wu Liu
Institution: Shanghai University; University of Chinese Academy of Sciences; JD AI Research; Fudan University
[CVPR-2022]
Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory
Authors: Li Siyao, Weijiang Yu, Tianpei Gu, Chunze Lin, Quan Wang, Chen Qian, Chen Change Loy, Ziwei Liu
Institution: Nanyang Technological University; Sun Yat-Sen University; University of California, Los Angeles; SenseTime Research
[ICRA-2020]
BatVision: Learning to See 3D Spatial Layout with Two Ears
Authors: Jesper Haahr Christensen, Sascha Hornauer, Stella X. Yu
Institution: Technical University of Denmark; University of California
[ECCV-2020]
VISUALECHOES: Spatial Image Representation Learning Through Echolocation
Authors: Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, Kristen Grauman
Institution: The University of Texas at Austin; Facebook Reality Lab; Facebook AI Research
[CVPR-2021]
Beyond Image to Depth: Improving Depth Prediction Using Echoes
Authors: Kranti Kumar Parida, Siddharth Srivastava, Gaurav Sharma
Institution: Indian Institute of Technology Kanpur; Centre for Development of Advanced Computing Noida; TensorTour Inc.
[ICASSP-2022]
Co-Attention-Guided Bilinear Model for Echo-Based Depth Estimation
Authors: Go Irie, Takashi Shibata, Akisato Kimura
Institution: Nippon Telegraph & Telephone Corporation
[NeurIPS-2022]
Learning Neural Acoustic Fields
Authors: Andrew Luo, Yilun Du, Michael Tarr, Josh Tenenbaum, Antonio Torralba, Chuang Gan
Institution: Carnegie Mellon University; Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[NeurIPS-2022]
Few-Shot Audio-Visual Learning of Environment Acoustics
Authors: Sagnik Majumder, Changan Chen, Ziad Al-Halah, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research
[NeurIPS-2016]
SoundNet: Learning Sound Representations from Unlabeled Video
Authors: Yusuf Aytar, Carl Vondrick, Antonio Torralba
Institution: Massachusetts Institute of Technology
[ICCV-2019]
Self-Supervised Moving Vehicle Tracking With Stereo Sound
Authors: Chuang Gan, Hang Zhao, Peihao Chen, David Cox, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab; IBM Research AI
[CVPR-2021]
There Is More Than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking With Sound by Distilling Multimodal Knowledge
Authors: Francisco Rivera Valverde, Juana Valeria Hurtado, Abhinav Valada
Institution: University of Freiburg
[AAAI-2021]
Enhanced Audio Tagging via Multi- to Single-Modal Teacher-Student Mutual Learning
Authors: Yifang Yin, Harsh Shrivastava, Ying Zhang, Zhenguang Liu, Rajiv Ratn Shah, Roger Zimmermann
Institution: National University of Singapore; Northwestern Polytechnical University; Zhejiang Gongshang University; Indraprastha Institute of Information Technology, Delhi
[Interspeech-2021]
Knowledge Distillation from Multi-Modality to Single-Modality for Person Verification
Authors: Leying Zhang, Zhengyang Chen, Yanmin Qian
Institution: Shanghai Jiao Tong University
[ICCV-2021]
Multimodal Knowledge Expansion
Authors: Zihui Xue, Sucheng Ren, Zhengqi Gao, Hang Zhao
Institution: Shanghai Qi Zhi Institute; UT Austin; South China University of Technology; Massachusetts Institute of Technology; Tsinghua University
[CVPR-2021]
Distilling Audio-visual Knowledge by Compositional Contrastive Learning
Authors: Yanbei Chen, Yongqin Xian, A. Sophia Koepke, Ying Shan, Zeynep Akata
Institution: University of Tübingen; MPI for Informatics; Tencent; Max Planck Institute for Intelligent Systems
[2022]
Estimating Visual Information From Audio Through Manifold Learning
Authors: Fabrizio Pedersoli, Dryden Wiebe, Amin Banitalebi, Yong Zhang, George Tzanetakis, Kwang Moo Yi
Institution: University of British Columbia; Huawei Technologies Canada Co., Ltd; University of Victoria
[DCASE-2021]
Audio-Visual Scene Classification Using A Transfer Learning Based Joint Optimization Strategy
Authors: Chengxin Chen, Meng Wang, Pengyuan Zhang
Institution: Institute of Acoustics, CAS; University of Chinese Academy of Sciences
[Interspeech-2021]
Audiovisual transfer learning for audio tagging and sound event detection
Authors: Wim Boes, Hugo Van hamme
Institution: ESAT, KU Leuven
[2023]
Revisiting Pre-training in Audio-Visual Learning
Authors: Ruoxuan Feng, Wenke Xia, Di Hu
Institution: Hunan University; Renmin University of China
[2017]
Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint
Authors: Sungeun Hong, Woobin Im, Hyun S. Yang
Institution: Korea Advanced Institute of Science and Technology
[ICCV-2017]
Image2song: Song Retrieval via Bridging Image Content and Lyric Words
Authors: Xuelong Li, Di Hu, Xiaoqiang Lu
Institution: Chinese Academy of Sciences; Northwestern Polytechnical University
[CVPR-2018]
Seeing voices and hearing faces: Cross-modal biometric matching
Authors: Arsha Nagrani, Samuel Albanie, Andrew Zisserman
Institution: University of Oxford
[ECCV-2018]
Cross-modal Embeddings for Video and Audio Retrieval
Authors: Didac Suris, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giro-i-Nieto
Institution: Universitat Politecnica de Catalunya; Barcelona Supercomputing Center
[ISM-2018]
Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA
Authors: Donghuo Zeng, Yi Yu, Keizo Oyama
Institution: National Institute of Informatics
[TOMCCAP-2020]
Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-Modal Retrieval
Authors: Donghuo Zeng, Yi Yu, Keizo Oyama
Institution: National Institute of Informatics
[IEEE TGRS-2020]
Deep Cross-Modal Image–Voice Retrieval in Remote Sensing
Authors: Yaxiong Chen, Xiaoqiang Lu, Shuai Wang
Institution: University of Chinese Academy of Sciences; Chinese Academy of Sciences
[2021]
Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval
Authors: Donghuo Zeng, Jianming Wu, Gen Hattori, Yi Yu, Rong Xu
Institution: KDDI Research, Inc.; National Institute of Informatics, SOKENDAI
[ICCV-2021]
Temporal Cue Guided Video Highlight Detection With Low-Rank Audio-Visual Fusion
Authors: Qinghao Ye, Xiyue Shen, Yuan Gao, Zirui Wang, Qi Bi, Ping Li, Guang Yang
Institution: Hangzhou Dianzi University; University of California; East China Normal University; University of Oxford; Wuhan University; Imperial College London
[IJCAI-2022]
Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast
Authors: Boqing Zhu, Kele Xu, Changjian Wang, Zheng Qin, Tao Sun, Huaimin Wang, Yuxing Peng
Institution: National University of Defense Technology
[IEEE ISM-2022]
Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval
Authors: Donghuo Zeng, Yanan Wang, Jianming Wu, Kazushi Ikeda
Institution: KDDI Research, Inc.
[IEEE SMC-2022]
Graph Network based Approaches for Multi-modal Movie Recommendation System
Authors: Daipayan Chakder, Prabir Mondal, Subham Raj, Sriparna Saha, Angshuman Ghosh, Naoyuki Onoe
Institution: Indian Institute of Technology; Sony Research
[CVPR-2022]
Visual Acoustic Matching
Authors: Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman
Institution: University of Texas at Austin; Stanford University; Meta AI
[ICCV-2017]
Look, Listen and Learn
Authors: Relja Arandjelovic, Andrew Zisserman
Institution: Google Inc.; University of Oxford
[NeurIPS-2018]
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
Authors: Bruno Korbar, Du Tran, Lorenzo Torresani
Institution: Dartmouth College; Facebook Research
[NeurIPS-2020]
Learning Representations from Audio-Visual Spatial Alignment
Authors: Pedro Morgado, Yi Li, Nuno Vasconcelos
Institution: University of California
[NeurIPS-2020]
Self-Supervised Learning by Cross-Modal Audio-Video Clustering
Authors: Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran
Institution: King Abdullah University of Science and Technology; Facebook AI Research
[NeurIPS-2020]
Labelling Unlabelled Videos From Scratch With Multi-Modal Self-Supervision
Authors: Yuki Asano, Mandela Patrick, Christian Rupprecht, Andrea Vedaldi
Institution: University of Oxford; Facebook AI Research
[CVPR-2021]
Audio-Visual Instance Discrimination with Cross-Modal Agreement
Authors: Pedro Morgado, Nuno Vasconcelos, Ishan Misra
Institution: University of California San Diego; Facebook AI Research
[CVPR-2021]
Robust Audio-Visual Instance Discrimination
Authors: Pedro Morgado, Ishan Misra, Nuno Vasconcelos
Institution: University of California San Diego; Facebook AI Research
[2021]
Unsupervised Sound Localization via Iterative Contrastive Learning
Authors: Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, Ming-Hsuan Yang
Institution: National Yang Ming Chiao Tung University; University of California; Snap Inc.; Google Research
[ICCV-2021]
Multimodal Clustering Networks for Self-Supervised Learning From Unlabeled Videos
Authors: Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang
Institution: Columbia University; Massachusetts Institute of Technology; University of Central Florida; Goethe University Frankfurt; IBM Research AI; MIT-IBM Watson AI Lab; The University of Texas at Austin; NYU-Courant CS & CDS
[2021]
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
Authors: Jing Liu, Xinxin Zhu, Fei Liu, Longteng Guo, Zijia Zhao, Mingzhen Sun, Weining Wang, Hanqing Lu, Shiyu Zhou, Jiajun Zhang, Jinqiao Wang
Institution: Chinese Academy of Sciences
[NeurIPS-2021]
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Authors: Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong
Institution: Columbia University; Google Inc.; Cornell University
[2021]
Audio-visual Representation Learning for Anomaly Events Detection in Crowds
Authors: Junyu Gao, Maoguo Gong, Xuelong Li
Institution: Xidian University; Northwestern Polytechnical University
[ICASSP-2022]
Audioclip: Extending Clip to Image, Text and Audio
Authors: Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel
Institution: TU Kaiserslautern; Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
[CVPR-2022]
MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound
Authors: Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, Yejin Choi
Institution: University of Washington; Allen Institute for Artificial Intelligence; University of Edinburgh
[2022]
Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning
Authors: Shuaicheng Li, Feng Zhang, Kunlin Yang, Lingbo Liu, Shinan Liu, Jun Hou, Shuai Yi
Institution: SenseTime Research; The Hong Kong Polytechnic University
[NeurIPS-2022]
Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings
Authors: Yiren Jian, Chongyang Gao, Soroush Vosoughi
Institution: Dartmouth College; Northwestern University
[IEEE TMM-2022]
Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal and Multimodal Representations
Authors: Sijie Mai, Ying Zeng, Haifeng Hu
Institution: Sun Yat-sen University; National Natural Science Foundation of China
[CVPR-2022]
Audiovisual Generalised Zero-shot Learning with Cross-modal Attention and Language
Authors: Otniel-Bogdan Mercea, Lukas Riesch, A. Sophia Koepke, Zeynep Akata
Institution: University of Tübingen; Robert Bosch GmbH; Max Planck Institute
[CVPRW-2022]
Multi-task Learning for Human Affect Prediction with Auditory–Visual Synchronized Representation
Authors: Euiseok Jeong, Geesung Oh, Sejoon Lim
Institution: Kookmin University
[CVPR-2023]
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Authors: Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, Gedas Bertasius
Institution: The University of North Carolina at Chapel Hill
[ECCV-2022]
Temporal and cross-modal attention for audio-visual zero-shot learning
Authors: Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata
Institution: University of Tübingen; Max Planck Institute
[NeurIPS-2022]
u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality
Authors: Wei-Ning Hsu, Bowen Shi
Institution: Meta AI
[NeurIPS-2022]
Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
Authors: Junru Wu, Yi Liang, Feng Han, Hassan Akbari, Zhangyang Wang, Cong Yu
Institution: Texas A&M University; Google Research; University of Texas at Austin; Celonis Inc.
[AAAI-2023]
Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
Authors: Pritam Sarkar, Ali Etemad
Institution: Queen’s University; Vector Institute
[ICLR-2023]
Contrastive Audio-Visual Masked Autoencoder
Authors: Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James R. Glass
Institution: Massachusetts Institute of Technology; The University of Texas at Austin; MIT-IBM Watson AI Lab; Goethe University Frankfurt
[ICLR-2023]
Jointly Learning Visual and Auditory Speech Representations from Raw Data
Authors: Alexandros Haliassos, Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Maja Pantic
Institution: Imperial College London; Meta AI
[WACV-2023]
Audio Representation Learning by Distilling Video as Privileged Information
Authors: Amirhossein Hajavi, Ali Etemad
Institution: Queen's University, Canada
[2023]
AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
Authors: Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli
Institution: University of California, Berkeley; Meta AI
[AAAI-2023]
Audio-Visual Contrastive Learning with Temporal Self-Supervision
Authors: Simon Jenni, Alexander Black, John Collomosse
Institution: Adobe Research; University of Surrey
[ECCV-2018]
Objects that Sound
Authors: Relja Arandjelovic, Andrew Zisserman
Institution: Google Inc.; University of Oxford
[ECCV-2018]
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
Authors: Andrew Owens, Alexei A. Efros
Institution: University of California, Berkeley
[ECCV-2018]
The Sound of Pixels
Authors: Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab; Columbia University
[ICASSP-2019]
Self-supervised Audio-visual Co-segmentation
Authors: Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[ICCV-2019]
The Sound of Motions
Authors: Hang Zhao, Chuang Gan, Wei-Chiu Ma, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[CVPR-2019]
Deep Multimodal Clustering for Unsupervised Audiovisual Learning
Authors: Di Hu, Feiping Nie, Xuelong Li
Institution: Northwestern Polytechnical University
[CVPR-2021]
Localizing Visual Sounds the Hard Way
Authors: Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
Institution: University of Oxford
[IEEE TPAMI-2021]
Class-aware Sounding Objects Localization via Audiovisual Correspondence
Authors: Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song, Ji-Rong Wen
Institution: Renmin University of China; Shanghai Jiao Tong University
[IEEE TPAMI-2021]
Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications
Authors: Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon
Institution: Korea Advanced Institute of Science and Technology; Pohang University of Science and Technology; University of California, Merced
[CVPR-2022]
Mix and Localize: Localizing Sound Sources in Mixtures
Authors: Xixi Hu, Ziyang Chen, Andrew Owens
Institution: University of Michigan; The University of Texas at Austin
[ECCV-2022]
Audio-Visual Segmentation
Authors: Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong
Institution: Hefei University of Technology; SenseTime Research; Australian National University; Beihang University; NVIDIA; The University of Hong Kong; Shanghai Artificial Intelligence Laboratory
[2022]
Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
Authors: Hao Jiang, Calvin Murdock, Vamsi Krishna Ithapu
Institution: Meta Reality Labs
[MM-2022]
Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation
Authors: Jinxiang Liu, Chen Ju, Weidi Xie, Ya Zhang
Institution: Shanghai Jiao Tong University; Shanghai AI Laboratory
[CVPR-2022]
Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes
Authors: Zengjie Song, Yuxi Wang, Junsong Fan, Tieniu Tan, Zhaoxiang Zhang
Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences
[CVPR-2022]
Self-supervised object detection from audio-visual correspondence
Authors: Triantafyllos Afouras, Yuki M. Asano, Francois Fagan, Andrea Vedaldi, Florian Metze
Institution: University of Oxford; University of Amsterdam; Meta AI
[EUSIPCO-2022]
Visually Assisted Self-supervised Audio Speaker Localization and Tracking
Authors: Jinzheng Zhao, Peipei Wu, Shidrokh Goudarzi, Xubo Liu, Jianyuan Sun, Yong Xu, Wenwu Wang
Institution: University of Surrey; Tencent AI Lab, Bellevue
[2022]
MarginNCE: Robust Sound Localization with a Negative Margin
Authors: Sooyoung Park, Arda Senocak, Joon Son Chung
Institution: Korea Advanced Institute of Science and Technology; Electronics and Telecommunications Research Institute, South Korea
[IEEE TMM-2022]
Cross modal video representations for weakly supervised active speaker localization
Authors: Rahul Sharma, Krishna Somandepalli, Shrikanth Narayanan
Institution: University of Southern California; Google Inc.
[NeurIPS-2022]
A Closer Look at Weakly-Supervised Audio-Visual Source Localization
Authors: Shentong Mo, Pedro Morgado
Institution: Carnegie Mellon University; University of Wisconsin-Madison
[AAAI-2022]
Visual Sound Localization in the Wild by Cross-Modal Interference Erasing
Authors: Xian Liu, Rui Qian, Hang Zhou, Di Hu, Weiyao Lin, Ziwei Liu, Bolei Zhou, Xiaowei Zhou
Institution: The Chinese University of Hong Kong; Zhejiang University; Shanghai Jiao Tong University; Renmin University of China; Nanyang Technological University
[ECCV-2022]
Sound Localization by Self-Supervised Time Delay Estimation
Authors: Ziyang Chen, David F. Fouhey, Andrew Owens
Institution: University of Michigan
[IEEE TASLP-2023]
Audio-Visual Cross-Attention Network for Robotic Speaker Tracking
Authors: Xinyuan Qian, Zhengdong Wang, Jiadong Wang, Guohui Guan, Haizhou Li
Institution: University of Science and Technology Beijing; Chinese University of Hong Kong; Shenzhen Research Institute of Big Data; National University of Singapore; University of California, Berkeley; University of Bremen
[WACV-2023]
Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization
Authors: Dennis Fedorishin, Deen Dayal Mohan, Bhavin Jawade, Srirangaraj Setlur, Venu Govindaraju
Institution: University at Buffalo
[WACV-2023]
Exploiting Visual Context Semantics for Sound Source Localization
Authors: Xinchi Zhou, Dongzhan Zhou, Di Hu, Hang Zhou, Wanli Ouyang
Institution: The University of Sydney; Renmin University of China; Baidu Inc.
[2023]
Audio-Visual Segmentation with Semantics
Authors: Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong
Institution: Hefei University of Technology; SenseTime Research; University of Oxford; Australian National University; Beihang University; NVIDIA; The University of Hong Kong; Shanghai Artificial Intelligence Laboratory
[CVPR-2023]
Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning
Authors: Weixuan Sun, Jiayi Zhang, Jianyuan Wang, Zheyuan Liu, Yiran Zhong, Tianpeng Feng, Yandong Guo, Yanhao Zhang, Nick Barnes
Institution: Australian National University; Beihang University; The University of Oxford; Shanghai AI Lab; OPPO Research Institute
[CVPR-2023]
Egocentric Audio-Visual Object Localization
Authors: Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Institution: University of Rochester; Meta Reality Labs Research
[2019]
DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction
Authors: Hamed R. Tavakoli, Ali Borji, Esa Rahtu, Juho Kannala
Institution: Aalto University; Tampere University
[CVPR-2020]
STAViS: Spatio-Temporal AudioVisual Saliency Network
Authors: Antigoni Tsiami, Petros Koutras, Petros Maragos
Institution: National Technical University of Athens
[IEEE TIP-2020]
A Multimodal Saliency Model for Videos With High Audio-visual Correspondence
Authors: Xiongkuo Min, Guangtao Zhai, Jiantao Zhou, Xiao-Ping Zhang, Xiaokang Yang, Xinping Guan
Institution: Shanghai Jiao Tong University; University of Macau; Ryerson University
[IROS-2021]
ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction
Authors: Samyak Jain, Pradeep Yarlagadda, Shreyank Jyoti, Shyamgopal Karthik, Ramanathan Subramanian, Vineet Gandhi
Institution: International Institute of Information Technology, Hyderabad; University of Canberra
[CVPR-2021]
From Semantic Categories to Fixations: A Novel Weakly-Supervised Visual-Auditory Saliency Detection Approach
Authors: Guotao Wang, Chenglizhao Chen, Deng-Ping Fan, Aimin Hao, Hong Qin
Institution: Beihang University; Qingdao University; Chinese Academy of Medical Sciences
[ICME-2021]
Lavs: A Lightweight Audio-Visual Saliency Prediction Model
Authors: Dandan Zhu, Defang Zhao, Xiongkuo Min, Tian Han, Qiangqiang Zhou, Shaobo Yu, Yongqing Chen, Guangtao Zhai, Xiaokang Yang
Institution: Shanghai Jiao Tong University; Tongji University; Stevens Institute of Technology; Jiangxi Normal University; East China Normal University; Hainan University
[2022]
A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key!
Authors: Chenglizhao Chen, Mengke Song, Wenfeng Song, Li Guo, Muwei Jian
Institution: China University of Petroleum; Shandong University of Finance and Economics; Beijing Information Science and Technology University
[TOMCCAP-2022]
PAV-SOD: A New Task Towards Panoramic Audiovisual Saliency Detection
Authors: Yi Zhang, Fang-Yi Chao, Wassim Hamidouche, Olivier Deforges
Institution: University of Rennes; Institut National des Sciences Appliquées Rennes; Centre national de la recherche scientifique; Trinity College Dublin
[CVPR-2023]
CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective
Authors: Junwen Xiong, Ganglai Wang, Peng Zhang, Wei Huang, Yufei Zha, Guangtao Zhai
Institution: Northwestern Polytechnical University; Ningbo Institute of Northwestern Polytechnical University; Nanchang University; Shanghai Jiao Tong University
[ECCV-2020]
SoundSpaces: Audio-Visual Navigation in 3D Environments
Authors: Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, Kristen Grauman
Institution: The University of Texas at Austin; University of Illinois at Urbana-Champaign; Facebook Reality Labs; Facebook AI Research
[ICRA-2020]
Look, Listen, and Act: Towards Audio-Visual Embodied Navigation
Authors: Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, Joshua B. Tenenbaum
Institution: MIT-IBM Watson AI Lab; Tsinghua University; Massachusetts Institute of Technology; Google Inc.
[ICLR-2021]
Learning to Set Waypoints for Audio-Visual Navigation
Authors: Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research
[CVPR-2021]
Semantic Audio-Visual Navigation
Authors: Changan Chen, Ziad Al-Halah, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research
[ICCV-2021]
Move2Hear: Active Audio-Visual Source Separation
Authors: Sagnik Majumder, Ziad Al-Halah, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research
[2022]
Sound Adversarial Audio-Visual Navigation
Authors: Yinfeng Yu, Wenbing Huang, Fuchun Sun, Changan Chen, Yikai Wang, Xiaohong Liu
Institution: Tsinghua University; Xinjiang University; The University of Texas at Austin; JD Explore Academy
[CVPR-2022]
Towards Generalisable Audio Representations for Audio-Visual Navigation
Authors: Shunqi Mao, Chaoyi Zhang, Heng Wang, Weidong Cai
Institution: University of Sydney
[NeurIPS-2022]
SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning
Authors: Changan Chen, Carl Schissler, Sanchit Garg, Philip Kobernik, Alexander Clegg, Paul Calamia, Dhruv Batra, Philip W. Robinson, Kristen Grauman
Institution: The University of Texas at Austin; Reality Labs at Meta; Georgia Tech; Meta AI
[NeurIPS-2022]
AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments
Authors: Sudipta Paul, Amit K. Roy-Chowdhury, Anoop Cherian
Institution: University of California, Riverside; Mitsubishi Electric Research Labs, Cambridge
[BMVC-2022]
Pay Self-Attention to Audio-Visual Navigation
Authors: Yinfeng Yu, Lele Cao, Fuchun Sun, Xiaohong Liu, Liejun Wang
Institution: Tsinghua University; Motherbrain, EQT; Xinjiang University
[CVPR-2022]
Finding Fallen Objects Via Asynchronous Audio-Visual Integration
Authors: Chuang Gan, Yi Gu, Siyuan Zhou, Jeremy Schwartz, Seth Alter, James Traer, Dan Gutfreund, Joshua B. Tenenbaum, Josh McDermott, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[CVPR-2022]
ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer
Authors: Ruohan Gao, Zilin Si, Yen-Yu Chang, Samuel Clarke, Jeannette Bohg, Li Fei-Fei, Wenzhen Yuan, Jiajun Wu
Institution: Stanford University; Carnegie Mellon University
[IEEE RAL-2023]
Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds
Authors: Abdelrahman Younes, Daniel Honerkamp, Tim Welschehold, Abhinav Valada
Institution: University of Freiburg
[2023]
Audio Visual Language Maps for Robot Navigation
Authors: Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard
Institution: University of Freiburg; Google Research; University of Technology Nuremberg
[ECCV-2018]
Audio-visual Event Localization in Unconstrained Videos
Authors: Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu
Institution: University of Rochester
[ICASSP-2019]
Dual-modality Seq2Seq Network for Audio-visual Event Localization
Authors: Yan-Bo Lin, Yu-Jhe Li, Yu-Chiang Frank Wang
Institution: National Taiwan University
[ICCV-2019]
Dual Attention Matching for Audio-Visual Event Localization
Authors: Yu Wu, Linchao Zhu, Yan Yan, Yi Yang
Institution: Baidu Research; University of Technology Sydney; Texas State University
[AAAI-2020]
Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization
Authors: Hanyu Xuan, Zhenyu Zhang, Shuo Chen, Jian Yang, Yan Yan
Institution: Nanjing University of Science and Technology
[ACCV-2020]
Audiovisual Transformer with Instance Attention for Audio-Visual Event Localization
Authors: Yan-Bo Lin, Yu-Chiang Frank Wang
Institution: National Taiwan University; ASUS Intelligent Cloud Services
[WACV-2021]
Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention
Authors: Bin Duan, Hao Tang, Wei Wang, Ziliang Zong, Guowei Yang, Yan Yan
Institution: Illinois Institute of Technology; University of Trento; Texas State University
[CVPR-2021]
Positive Sample Propagation along the Audio-Visual Event Line
Authors: Jinxing Zhou, Liang Zheng, Yiran Zhong, Shijie Hao, Meng Wang
Institution: Hefei University of Technology; Intelligent Interconnected Systems Laboratory of Anhui Province; Australian National University
[AIKE-2021]
Audio-Visual Event Localization based on Cross-Modal Interacting Guidance
Authors: Qiurui Yue, Xiaoyu Wu, Jiayi Gao
Institution: Communication University of China
[TMM-2021]
Audio-Visual Event Localization by Learning Spatial and Semantic Co-attention
Authors: Cheng Xue, Xionghu Zhong, Minjie Cai, Hao Chen, Wenwu Wang
Institution: Hunan University; University of Surrey
[CVPR-2022]
Cross-Modal Background Suppression for Audio-Visual Event Localization
Authors: Yan Xia, Zhou Zhao
Institution: Zhejiang University
[ICASSP-2022]
Bi-Directional Modality Fusion Network For Audio-Visual Event Localization
Authors: Shuo Liu, Weize Quan, Yuan Liu, Dong-Ming Yan
Institution: Chinese Academy of Sciences; Alibaba Group
[ICSIP-2022]
Audio-Visual Event and Sound Source Localization Based on Spatial-Channel Feature Fusion
Authors: Xiaolong Zheng, Ying Wei
Institution: Shandong University
[IJCNN-2022]
Look longer to see better: Audio-visual event localization by exploiting long-term correlation
Authors: Longyin Guo, Qijun Zhao, Hongmei Gao
Institution: Sichuan University; Tibet University
[EUSIPCO-2022]
Audio Visual Graph Attention Networks for Event Detection in Sports Video
Authors: Taichi Ishiwatari, Makiko Azuma, Takuya Handa, Masaki Takahashi, Takahiro Mochizuki, Masanori Sano
Institution: Science and Technology Research Laboratories, NHK; Tokyo Institute of Technology
[IEEE TPAMI-2022]
Contrastive Positive Sample Propagation along the Audio-Visual Event Line
Authors: Jinxing Zhou, Dan Guo, Meng Wang
Institution: Hefei University of Technology
[IEEE TPAMI-2022]
Semantic and Relation Modulation for Audio-Visual Event Localization
Authors: Hao Wang, Zheng-Jun Zha, Liang Li, Xuejin Chen, Jiebo Luo
Institution: University of Science and Technology of China; Chinese Academy of Sciences; University of Rochester
[WACV-2023]
AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization
Authors: Tanvir Mahmud, Diana Marculescu
Institution: The University of Texas at Austin
[WACV-2023]
Event-Specific Audio-Visual Fusion Layers: A Simple and New Perspective on Video Understanding
Authors: Arda Senocak, Junsik Kim, Tae-Hyun Oh, Dingzeyu Li, In So Kweon
Institution: Korea Advanced Institute of Science & Technology; Harvard University; Pohang University of Science and Technology; Adobe Research
[2023]
A dataset for Audio-Visual Sound Event Detection in Movies
Authors: Rajat Hebbar, Digbalay Bose, Krishna Somandepalli, Veena Vijai, Shrikanth Narayanan
Institution: University of Southern California
[CVPR-2023]
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline
Authors: Tiantian Geng, Teng Wang, Jinming Duan, Runmin Cong, Feng Zheng
Institution: Southern University of Science and Technology; University of Birmingham; The University of Hong Kong; Shandong University; Peng Cheng Laboratory
[ECCV-2020]
Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing
Authors: Yapeng Tian, Dingzeyu Li, Chenliang Xu
Institution: University of Rochester; Adobe Research
[CVPR-2021]
Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing
Authors: Yu Wu, Yi Yang
Institution: Baidu Research; University of Technology Sydney
[NeurIPS-2021]
Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing
Authors: Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, Ming-Hsuan Yang
Institution: National Yang Ming Chiao Tung University; UNC Chapel Hill; University of California, Merced; Snap Research; Google Research; Yonsei University
[2022]
Investigating Modality Bias in Audio Visual Video Parsing
Authors: Piyush Singh Pasi, Shubham Nemani, Preethi Jyothi, Ganesh Ramakrishnan
Institution: Indian Institute of Technology Bombay
[ICASSP-2022]
Distributed Audio-Visual Parsing Based On Multimodal Transformer and Deep Joint Source Channel Coding
Authors: Penghong Wang, Jiahui Li, Mengyao Ma, Xiaopeng Fan
Institution: Harbin Institute of Technology; Wireless Technology Lab
[ECCV-2022]
Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing
Authors: Haoyue Cheng, Zhaoyang Liu, Hang Zhou, Chen Qian, Wayne Wu, Limin Wang
Institution: Nanjing University; SenseTime Research; The Chinese University of Hong Kong; Shanghai AI Laboratory
[NeurIPS-2022]
Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing
Authors: Shentong Mo, Yapeng Tian
Institution: Carnegie Mellon University; University of Texas at Dallas
[2023]
Improving Audio-Visual Video Parsing with Pseudo Visual Labels
Authors: Jinxing Zhou, Dan Guo, Yiran Zhong, Meng Wang
Institution: Hefei University of Technology; Shanghai AI Lab
[ICCV-2021]
Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos
Authors: Heeseung Yun, Youngjae Yu, Wonsuk Yang, Kangil Lee, Gunhee Kim
Institution: Seoul National University; Allen Institute for AI; University of Oxford; Hyundai Motor Company
[CVPR-2022]
Learning To Answer Questions in Dynamic Audio-Visual Scenarios
Authors: Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu
Institution: Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods; University of Rochester
[NeurIPS-2022]
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
Authors: Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, Heng Ji
Institution: University of Illinois at Urbana-Champaign; Microsoft Research; The University of North Carolina at Chapel Hill; Columbia University
[CVPR-2019]
Audio Visual Scene-Aware Dialog
Authors: Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh
Institution: Georgia Institute of Technology; Mitsubishi Electric Research Laboratories
[Interspeech-2019]
Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog
Authors: Chiori Hori, Anoop Cherian, Tim K. Marks, Takaaki Hori
Institution: Mitsubishi Electric Research Laboratories, Inc.
[ICASSP-2019]
End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features
Authors: Chiori Hori, Huda Alamri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K. Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Irfan Essa, Dhruv Batra, Devi Parikh
Institution: Mitsubishi Electric Research Laboratories; Georgia Institute of Technology
[CVPR-2019]
A Simple Baseline for Audio-Visual Scene-Aware Dialog
Authors: Idan Schwartz, Alexander G. Schwing, Tamir Hazan
Institution: Technion; University of Illinois at Urbana-Champaign
[CVPR-2019]
Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog
Authors: Shachi H Kumar, Eda Okur, Saurav Sahay, Jonathan Huang, Lama Nachman
Institution: Intel Labs, Anticipatory Computing Lab
[2020]
TMT: A Transformer-based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-aware Dialog
Authors: Wubo Li, Dongwei Jiang, Wei Zou, Xiangang Li
Institution: Didi Chuxing
[AAAI-2021]
Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers
Authors: Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, Anoop Cherian
Institution: Rutgers University; The Chinese University of Hong Kong; University of Illinois at Urbana Champaign; Mitsubishi Electric Research Laboratories
[2021]
VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs
Authors: Xudong Lin, Gedas Bertasius, Jue Wang, Shih-Fu Chang, Devi Parikh, Lorenzo Torresani
Institution: Columbia University; Facebook AI; Georgia Tech; Dartmouth
[ICASSP-2022]
Audio-Visual Scene-Aware Dialog and Reasoning Using Audio-Visual Transformers with Joint Student-Teacher Learning
Authors: Ankit Shah, Shijie Geng, Peng Gao, Anoop Cherian, Takaaki Hori, Tim K. Marks, Jonathan Le Roux, Chiori Hori
Institution: Mitsubishi Electric Research Laboratories; Carnegie Mellon University; Rutgers University; The Chinese University of Hong Kong
[WACV-2022]
QUALIFIER: Question-Guided Self-Attentive Multimodal Fusion Network for Audio Visual Scene-Aware Dialog
Authors: Muchao Ye, Quanzeng You, Fenglong Ma
Institution: The Pennsylvania State University; Microsoft Azure Computer Vision
[TACL-2022]
Learning English with Peppa Pig
Authors: Mitja Nikolaus, Afra Alishahi, Grzegorz Chrupała
Institution: Aix-Marseille University; Tilburg University
[2022]
End-to-End Multimodal Representation Learning for Video Dialog
Authors: Huda Alamri, Anthony Bilic, Michael Hu, Apoorva Beedu, Irfan Essa
Institution: Georgia Institute of Technology
[AAAI-2022]
Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations
Authors: Yoshihiro Yamazaki, Shota Orihashi, Ryo Masumura, Mihiro Uchida, Akihiko Takashima
Institution: Nippon Telegraph and Telephone Corporation
Dataset | Year | Videos | Length | Data form | Video source | Task |
---|---|---|---|---|---|---|
LRW, LRS2 and LRS3 | 2016, 2018, 2018 | - | 800h+ | video | in the wild | Speech-related, speaker-related, face generation-related tasks |
VoxCeleb, VoxCeleb2 | 2017, 2018 | - | 2,000h+ | video | YouTube | Speech-related, speaker-related, face generation-related tasks |
AVA-ActiveSpeaker | 2019 | - | 38.5h | video | YouTube | Speech-related task, speaker-related task |
Kinetics-400 | 2017 | 306,245 | 850h+ | video | YouTube | Action recognition |
EPIC-KITCHENS | 2018 | 39,594 | 55h | video | Recorded videos | Action recognition |
CMU-MOSI | 2016 | 2,199 | 2h+ | video | YouTube | Emotion recognition |
CMU-MOSEI | 2018 | 23,453 | 65h+ | video | YouTube | Emotion recognition |
VGGSound | 2020 | 200k+ | 550h+ | video | YouTube | Action recognition, sound localization |
AudioSet | 2017 | 2M+ | 5,800h+ | video | YouTube | Action recognition, sound separation |
Greatest Hits | 2016 | 977 | 9h+ | video | Recorded videos | Sound generation |
MUSIC | 2018 | 714 | 23h+ | video | YouTube | Sound separation, sound localization |
FAIR-Play | 2019 | 1,871 | 5.2h | video with binaural sound | Recorded videos | Spatial sound generation |
YT-ALL | 2018 | 1,146 | 113.1h | 360° video | YouTube | Spatial sound generation |
Replica | 2019 | - | - | 3D environment | 3D simulator | Depth estimation |
AIST++ | 2021 | - | 5.2h | 3D video | Recorded videos | Dance generation |
TED | 2019 | - | 52h | video | TED talks | Gesture generation |
SumMe | 2014 | 25 | 1h+ | video with eye-tracking | User videos | Saliency detection |
AVE | 2018 | 4,143 | 11h+ | video | YouTube | Event localization |
LLP | 2020 | 11,849 | 32.9h | video | YouTube | Event parsing |
SoundSpaces | 2020 | - | - | 3D environment | 3D simulator | Audio-visual navigation |
AVSD | 2019 | 11,816 | 98h+ | video with dialog | Crowd-sourced | Audio-visual dialog |
Pano-AVQA | 2021 | 5.4k | 7.7h | 360° video with QA | Video-sharing platforms | Audio-visual question answering |
MUSIC-AVQA | 2022 | 9,288 | 150h+ | video with QA | YouTube | Audio-visual question answering |
AVSBench | 2022 | 5,356 | 14.8h+ | video | YouTube | Audio-visual segmentation, sound localization |