Sound event detection aims to process a continuous acoustic signal and convert it into symbolic descriptions of the sound events present in the auditory scene. It can be used in a variety of applications, including context-based indexing and retrieval in multimedia databases, unobtrusive monitoring in health care, and surveillance. Since 2017, learning acoustic information from weak annotations has been formulated as a way to utilise the large amount of multimedia data available. This reading list collects papers on sound event detection and Sound AI.
Papers covering multiple sub-areas are listed in each relevant section. If I have missed any areas, papers, or datasets, please let me know or feel free to make a pull request.
The reading list is no longer actively maintained. However, PRs for relevant papers are welcome.
INTERSPEECH 2022 papers added
ICASSP 2022 papers added
The reading list is expanded to include topics in Sound AI
WASPAA 2021 papers added
INTERSPEECH 2021 papers added
ICASSP 2021 papers added
- Survey Papers
- Areas
- Learning formulation
- Network architecture
- Pooling functions
- Missing or noisy audio
- Data Augmentation
- Audio Generation
- Representation Learning
- Multi-Task Learning
- Adversarial Attacks
- Few-Shot
- Zero-Shot
- Knowledge-transfer
- Polyphonic SED
- Loss function
- Audio and Visual
- Audio Captioning
- Audio Retrieval
- Healthcare
- Robotics
- Dataset
- Workshops/Conferences/Journals
- More
Sound event detection and time–frequency segmentation from weakly labelled data, TASLP 2019
Sound Event Detection: A Tutorial, IEEE Signal Processing Magazine, Volume 38, Issue 5
Automated Audio Captioning: an Overview of Recent Progress and New Challenges, EURASIP Journal on Audio Speech and Music Processing 2022
Weakly supervised scalable audio content analysis, ICME 2016
Audio Event Detection using Weakly Labeled Data, 24th ACM Multimedia Conference 2016
An approach for self-training audio event detectors using web data, 25th EUSIPCO 2017
A joint detection-classification model for audio tagging of weakly labelled data, ICASSP 2017
Connectionist Temporal Localization for Sound Event Detection with Sequential Labeling, ICASSP 2019
Multi-Task Learning for Interpretable Weakly Labelled Sound Event Detection, ArXiv 2020
A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition, ICML 2020
Non-Negative Matrix Factorization-Convolutional Neural Network (NMF-CNN) For Sound Event Detection, ArXiv 2020
Duration robust weakly supervised sound event detection, ICASSP 2020
SeCoST: Sequential Co-Supervision for Large Scale Weakly Labeled Audio Event Detection, ICASSP 2020
Guided Learning for Weakly-Labeled Semi-Supervised Sound Event Detection, ICASSP 2020
Unsupervised Contrastive Learning of Sound Event Representations, ICASSP 2021
Sound Event Detection Based on Curriculum Learning Considering Learning Difficulty of Events, ICASSP 2021
Comparison of Deep Co-Training and Mean-Teacher Approaches for Semi-Supervised Audio Tagging, ICASSP 2021
Enhancing Audio Augmentation Methods with Consistency Learning, ICASSP 2021
Weakly-supervised audio event detection using event-specific Gaussian filters and fully convolutional networks, ICASSP 2017
Deep CNN Framework for Audio Event Recognition using Weakly Labeled Web Data, NIPS Workshop on Machine Learning for Audio 2017
Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network, ICASSP 2018
Orthogonality-Regularized Masked NMF for Learning on Weakly Labeled Audio Data, ICASSP 2018
Sound event detection and time–frequency segmentation from weakly labelled data, TASLP 2019
Attention-based Atrous Convolutional Neural Networks: Visualisation and Understanding Perspectives of Acoustic Scenes, ICASSP 2019
Sound Event Detection of Weakly Labelled Data With CNN-Transformer and Automatic Threshold Optimization, TASLP 2020
DD-CNN: Depthwise Disout Convolutional Neural Network for Low-complexity Acoustic Scene Classification, ArXiv 2020
Effective Perturbation based Semi-Supervised Learning Method for Sound Event Detection, INTERSPEECH 2020
Weakly-Supervised Sound Event Detection with Self-Attention, ICASSP 2020
Improving Deep Learning Sound Events Classifiers using Gram Matrix Feature-wise Correlations, ICASSP 2021
An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection, ICASSP 2021
AST: Audio Spectrogram Transformer, INTERSPEECH 2021
Event Specific Attention for Polyphonic Sound Event Detection, INTERSPEECH 2021
Sound Event Detection with Adaptive Frequency Selection, WASPAA 2021
SSAST: Self-Supervised Audio Spectrogram Transformer, AAAI 2022
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection, ICASSP 2022
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer, INTERSPEECH 2022
Efficient Training of Audio Transformers with Patchout, INTERSPEECH 2022
BEATs: Audio Pre-Training with Acoustic Tokenizers, ArXiv 2022
Adaptive Pooling Operators for Weakly Labeled Sound Event Detection, TASLP 2018
Comparing the Max and Noisy-Or Pooling Functions in Multiple Instance Learning for Weakly Supervised Sequence Learning Tasks, INTERSPEECH 2018
A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling, ICASSP 2019
Hierarchical Pooling Structure for Weakly Labeled Sound Event Detection, INTERSPEECH 2019
Weakly labelled audioset tagging with attention neural networks, TASLP 2019
Sound event detection and time–frequency segmentation from weakly labelled data, TASLP 2019
Multi-Task Learning for Interpretable Weakly Labelled Sound Event Detection, ArXiv 2020
A Global-Local Attention Framework for Weakly Labelled Audio Tagging, ICASSP 2021
Sound event detection and time–frequency segmentation from weakly labelled data, TASLP 2019
Multi-Task Learning for Interpretable Weakly Labelled Sound Event Detection, ArXiv 2020
Improving weakly supervised sound event detection with self-supervised auxiliary tasks, INTERSPEECH 2021
SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification, INTERSPEECH 2021
Contrastive Predictive Coding of Audio with an Adversary, INTERSPEECH 2020
Towards Learning a Universal Non-Semantic Representation of Speech, INTERSPEECH 2021
ACCDOA: Activity-Coupled Cartesian Direction of Arrival Representation for Sound Event Localization and Detection, ICASSP 2021
FRILL: A Non-Semantic Speech Embedding for Mobile Devices, INTERSPEECH 2021
HEAR 2021: Holistic Evaluation of Audio Representations, PMLR: NeurIPS 2021 Competition Track
Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks, ICASSP 2022
Towards Learning Universal Audio Representations, ICASSP 2022
SSAST: Self-Supervised Audio Spectrogram Transformer, AAAI 2022
A Joint Separation-Classification Model for Sound Event Detection of Weakly Labelled Data, ICASSP 2018
Multi-Task Learning for Interpretable Weakly Labelled Sound Event Detection, ArXiv 2020
Multi-Task Learning and post processing optimisation for sound event detection, DCASE 2019
Label-efficient audio classification through multitask learning and self-supervision, ICLR 2019
A Joint Framework for Audio Tagging and Weakly Supervised Acoustic Event Detection Using DenseNet with Global Average Pooling, INTERSPEECH 2020
Improving weakly supervised sound event detection with self-supervised auxiliary tasks, INTERSPEECH 2021
Identifying Actions for Sound Event Classification, WASPAA 2021
Impact of Acoustic Event Tagging on Scene Classification in a Multi-Task Learning Framework, INTERSPEECH 2022
Few-Shot Audio Classification with Attentional Graph Neural Networks, INTERSPEECH 2019
Continual Learning of New Sound Classes Using Generative Replay, WASPAA 2019
Few-Shot Sound Event Detection, ICASSP 2020
Few-Shot Continual Learning for Audio Classification, ICASSP 2021
Unsupervised and Semi-Supervised Few-Shot Acoustic Event Classification, ICASSP 2021
Who Calls the Shots? Rethinking Few-Shot Learning for Audio, WASPAA 2021
A Mutual Learning Framework For Few-Shot Sound Event Detection, ICASSP 2022
Active Few-Shot Learning for Sound Event Detection, INTERSPEECH 2022
Adapting Language-Audio Models as Few-Shot Audio Learners, INTERSPEECH 2023
AudioCLIP: Extending CLIP to Image, Text and Audio, ICASSP 2022
Wav2CLIP: Learning Robust Audio Representations From CLIP, ICASSP 2022
CLAP 👏: Learning Audio Concepts From Natural Language Supervision, ICASSP 2023
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation, ICASSP 2023
Listen, Think, and Understand, ArXiv 2023
Pengi 🐧: An Audio Language Model for Audio Tasks, ArXiv 2023
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action, ArXiv 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities, ArXiv 2023
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models, ArXiv 2023
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities, ArXiv 2024
Transfer learning of weakly labelled audio, WASPAA 2017
Knowledge Transfer from Weakly Labeled Audio Using Convolutional Neural Network for Sound Events and Scenes, ICASSP 2018
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition, TASLP 2020
Do sound event representations generalize to other audio tasks? A case study in audio transfer learning, INTERSPEECH 2021
A first attempt at polyphonic sound event detection using connectionist temporal classification, ICASSP 2017
Polyphonic Sound Event Detection with Weak Labeling, Thesis 2018
Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy, DCASE 2019
Evaluation of Post-Processing Algorithms for Polyphonic Sound Event Detection, WASPAA 2019
Specialized Decision Surface and Disentangled Feature for Weakly-Supervised Polyphonic Sound Event Detection, TASLP 2020
Spatial Data Augmentation with Simulated Room Impulse Responses for Sound Event Localization and Detection, ICASSP 2022
Impact of Sound Duration and Inactive Frames on Sound Event Detection Performance, ICASSP 2021
A Light-Weight Multimodal Framework for Improved Environmental Audio Tagging, ICASSP 2018
Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data, IJCAI 2020
Labelling unlabelled videos from scratch with multi-modal self-supervision, NeurIPS 2020
Audio-Visual Event Recognition Through the Lens of Adversary, ICASSP 2021
Taming Visually Guided Sound Generation, BMVC 2021
Learning Audio-Video Modalities from Image Captions, ECCV 2022
UAVM: Towards Unifying Audio and Visual Models, IEEE Signal Processing Letters
Contrastive Audio-Visual Masked Autoencoder, ICLR 2023
Automated audio captioning with recurrent neural networks, WASPAA 2017
Audio caption: Listen and tell, ICASSP 2018
AudioCaps: Generating captions for audios in the wild, NAACL 2019
Audio Captioning Based on Combined Audio and Semantic Embeddings, ISM 2020
Clotho: An Audio Captioning Dataset, ICASSP 2020
A Transformer-based Audio Captioning Model with Keyword Estimation, INTERSPEECH 2020
Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events, ICASSP 2021
Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags, ICASSP 2021
Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization, ICASSP 2022
Sound Event Detection Guided by Semantic Contexts of Scenes, ICASSP 2022
Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning, INTERSPEECH 2022
Audio Retrieval with Natural Language Queries: A Benchmark Study, IEEE Transactions on Multimedia 2022
On Metric Learning for Audio-Text Cross-Modal Retrieval, INTERSPEECH 2022
Introducing Auxiliary Text Query-modifier to Content-based Audio Retrieval, INTERSPEECH 2022
Audio Retrieval with WavText5K and CLAP Training, ArXiv 2022
Acoustic Scene Generation with Conditional SampleRNN, ICASSP 2019
Conditional Sound Generation Using Neural Discrete Time-Frequency Representation Learning, MLSP 2021
Taming Visually Guided Sound Generation, BMVC 2021
Diffsound: Discrete Diffusion Model for Text-to-sound Generation, ArXiv 2022
AudioGen: Textually Guided Audio Generation, ICML 2023
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models, ArXiv 2023
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models, ICML 2023
AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models, ArXiv 2023
Diverse and Vivid Sound Generation from Text Descriptions, ICASSP 2023
Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation, ArXiv 2023
Simple and Controllable Music Generation, ArXiv 2023
Audiobox: Unified Audio Generation with Natural Language Prompts, ArXiv 2023
Masked Audio Generation using a Single Non-Autoregressive Transformer, ArXiv 2024
Audio event and scene recognition: A unified approach using strongly and weakly labeled data, IJCNN 2017
Sound Event Detection Using Point-Labeled Data, WASPAA 2019
An Investigation of the Effectiveness of Phase for Audio Classification, ICASSP 2022
Task | Dataset | Source | Num. Files |
---|---|---|---|
Sound Event Classification | ESC-50 | freesound.org | 2k files |
Sound Event Classification | DCASE17 Task 4 | YT videos | 2k files |
Sound Event Classification | US8K | freesound.org | 8k files |
Sound Event Classification | FSD50K | freesound.org | 50k files |
Sound Event Classification | AudioSet | YT videos | 2M files |
COVID-19 Detection using Coughs | DiCOVA | Volunteers recording audio via a website | 1k files |
Few-shot Bioacoustic Event Detection | DCASE21 Task 5 | audio | 4k+ files |
Acoustic Scene Classification | DCASE18 Task 1 | Recorded by TUT | 1.5k files |
Various | VGG-Sound | Web videos | 200k files |
Audio Captioning | Clotho | freesound.org | 5k files |
Audio Captioning | AudioCaps | YT videos | 51k files |
Audio-text | SoundDescs | BBC Sound Effects | 32k files |
Audio-text | WavText5K | Varied | 5k files |
Audio-text | LAION 630k | Varied | 630k files |
Audio-text | WavCaps | Varied | 400k files |
Action Recognition | UCF101 | Web videos | 13k files |
Unlabeled | YFCC100M | Yahoo videos | 1M files |
Other audio-based datasets to consider
DCASE dataset list
List of past workshops (archived) and ongoing workshops, conferences, and journals:
Venue | Link |
---|---|
Machine Learning for Audio Signal Processing, NIPS 2017 workshop | https://nips.cc/Conferences/2017/Schedule?showEvent=8790 |
MLSP: Machine Learning for Signal Processing | https://ieeemlsp.cc/ |
WASPAA: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics | https://www.waspaa.com |
ICASSP: IEEE International Conference on Acoustics Speech and Signal Processing | https://2021.ieeeicassp.org/ |
INTERSPEECH | https://www.interspeech2021.org/ |
IEEE/ACM Transactions on Audio, Speech and Language Processing | https://dl.acm.org/journal/taslp |
DCASE | http://dcase.community/ |
Computational Analysis of Sound Scenes and Events
- If you are interested in audio-captioning, K. Drossos maintains a detailed reading list here
- A tracker of the state of the art and recent results (bibliography) on sound AI topics and audio tasks is available here