This project aims to develop a robust ensemble-based audio classification system that distinguishes between infant cries, screams, and normal utterances. The approach integrates two models, YAMNet and Wav2Vec2, combining their general-purpose pre-trained audio knowledge with task-specific fine-tuning.
The system processes raw audio input, extracts relevant features, and classifies the sounds into three categories:
- Crying
- Screaming
- Normal Utterances
An ensemble technique combines the predictions from YAMNet and Wav2Vec2 to improve accuracy and robustness.

Key features:
- Multi-dataset integration for diverse and balanced training.
- Preprocessing techniques including noise reduction and pitch normalization.
- Fine-tuned YAMNet and Wav2Vec2 models.
- Ensemble learning to improve classification accuracy.
- Google Colab integration for ease of execution.
- Performance evaluation through accuracy, precision, recall, and confusion matrices.
YAMNet is a deep neural network trained on AudioSet that recognizes 521 audio event classes. It takes log-mel spectrograms as input features and uses a MobileNetV1 architecture.
Modifications in this project:
- Fine-tuned the model on our specific dataset.
- Adjusted output layers to classify three specific categories.
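As a rough sketch of how a fine-tuned YAMNet classifier can be assembled (the TensorFlow Hub handle is the public YAMNet release; the embedding pooling and the small dense head are illustrative assumptions, not the project's exact code):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Load the pre-trained YAMNet model from TensorFlow Hub.
yamnet = hub.load('https://tfhub.dev/google/yamnet/1')

def clip_embedding(waveform):
    """waveform: 1-D float32 tensor/array, 16 kHz, scaled to [-1, 1]."""
    # YAMNet returns (class scores, 1024-d frame embeddings, log-mel spectrogram).
    scores, embeddings, spectrogram = yamnet(waveform)
    # Pool frame-level embeddings into a single clip-level vector.
    return tf.reduce_mean(embeddings, axis=0)

# Small classification head trained on the three target classes
# (cry, scream, normal utterance) on top of the frozen YAMNet embeddings.
classifier = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,)),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax'),
])
classifier.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
```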
Wav2Vec2 is a self-supervised speech representation learning model by Facebook AI. Unlike YAMNet, it learns contextualized representations directly from raw audio signals, making it robust to noise and variations.
Modifications in this project:
- Fine-tuned on labeled infant audio data.
- Adjusted classification head for cry, scream, and normal utterance detection.
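A minimal sketch of the corresponding Wav2Vec2 setup with the Hugging Face transformers library (the `facebook/wav2vec2-base` checkpoint and the helper function are assumptions for illustration, not the project's exact configuration):

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

# Pre-trained Wav2Vec2 backbone with a freshly initialized 3-way classification head.
model_name = "facebook/wav2vec2-base"  # assumed base checkpoint
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3,  # cry, scream, normal utterance
)

def classify(waveform, sampling_rate=16000):
    """waveform: 1-D numpy array of raw audio samples."""
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate,
                               return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Per-class probabilities for the ensemble stage.
    return torch.softmax(logits, dim=-1).squeeze().numpy()
```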
We implemented an ensemble approach by combining outputs from both models using:
- Averaging Probabilities: Taking the mean of both models' predictions.
- Majority Voting: Selecting the most frequent predicted class.
This reduces the impact of individual model errors and improves overall accuracy.
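As an illustration, the two fusion strategies can be written as follows (a sketch with placeholder inputs; note that with only two models, majority voting degenerates to a tie-break, handled here by the averaged probabilities):

```python
import numpy as np

def ensemble_predict(yamnet_probs, wav2vec2_probs):
    """Both inputs: per-class probability arrays of shape (num_classes,)."""
    # Averaging probabilities: mean of the two distributions.
    avg_probs = (np.asarray(yamnet_probs) + np.asarray(wav2vec2_probs)) / 2.0
    avg_label = int(np.argmax(avg_probs))

    # Majority voting: each model votes for its top class;
    # on disagreement, fall back to the averaged probabilities.
    votes = [int(np.argmax(yamnet_probs)), int(np.argmax(wav2vec2_probs))]
    vote_label = votes[0] if votes[0] == votes[1] else avg_label

    return avg_label, vote_label

# Example with three classes [cry, scream, normal]:
print(ensemble_predict([0.7, 0.2, 0.1], [0.4, 0.5, 0.1]))  # -> (0, 0)
```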
The project utilizes multiple datasets:
- Infant Cry Audio Corpus (Kaggle)
- Human Screaming Detection Dataset (Kaggle)
- Children's speech from AudioSet
All datasets were preprocessed to ensure:
- Consistent sample rates
- Uniform bit-depth normalization
- Proper segmentation and labeling
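A sketch of the kind of per-clip preprocessing described above; the 16 kHz target rate, 4-second segments, and the noisereduce library are illustrative assumptions rather than the project's exact pipeline:

```python
import librosa
import noisereduce as nr
import numpy as np

TARGET_SR = 16000   # assumed common sample rate
CLIP_SECONDS = 4    # assumed fixed segment length

def preprocess(path):
    # Load as mono and resample to the common rate.
    audio, sr = librosa.load(path, sr=TARGET_SR, mono=True)
    # Spectral-gating noise reduction.
    audio = nr.reduce_noise(y=audio, sr=sr)
    # Peak-normalize amplitude so all clips sit at a uniform level.
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak
    # Pad or trim to a fixed-length segment for batching.
    audio = librosa.util.fix_length(audio, size=TARGET_SR * CLIP_SECONDS)
    return audio.astype(np.float32)
```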
This project is implemented in Google Colab for ease of execution.
Steps to Run on Colab:

- Open Google Colab: Colab Link
- Upload the dataset to Google Drive.
- Mount Google Drive in Colab:

  ```python
  from google.colab import drive
  drive.mount('/content/drive')
  ```

- Clone the GitHub repository:

  ```
  !git clone https://github.com/Arjun-08/Developing-an-Ensemble-Model-for-Detecting-Infant-Cries-Screams-and-Normal-Utterances.git
  cd Developing-an-Ensemble-Model-for-Detecting-Infant-Cries-Screams-and-Normal-Utterances
  ```

- Install required dependencies:

  ```
  !pip install -r requirements.txt
  ```

- Run the training and inference scripts (detailed below).
The dataset was split into:
- 70% Training
- 15% Validation
- 15% Testing
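For example, a stratified 70/15/15 split can be produced with two calls to scikit-learn's `train_test_split` (placeholder file lists and labels shown):

```python
from sklearn.model_selection import train_test_split

# Placeholder data: clip paths and integer labels (0=cry, 1=scream, 2=normal).
files = [f"clip_{i}.wav" for i in range(100)]
labels = [i % 3 for i in range(100)]

# 70% train, then split the remaining 30% evenly into validation and test.
X_train, X_temp, y_train, y_temp = train_test_split(
    files, labels, test_size=0.30, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```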
Model performance is evaluated using:
- Accuracy
- Precision, Recall, F1-score
- Confusion Matrices
- ROC Curves
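These metrics can be computed with scikit-learn once the ensemble's test-set predictions are available (placeholder labels shown for illustration):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Placeholder ground truth and ensemble predictions (0=cry, 1=scream, 2=normal).
y_true = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1, 0, 2])

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='weighted', zero_division=0)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```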
Once the model is trained, you can perform inference on new audio files.
```python
from model import run_inference

prediction = run_inference('/content/drive/MyDrive/frontera/extracted_data/Screaming/---1_cCGK4M_out.wav')
print("Predicted Label:", prediction)
```

Example output:

```
Predicted Label: [3] (crying)
```
Sample training output:

```
Epoch 1/10
accuracy: 0.6756 - loss: 0.9839 - val_accuracy: 0.7826 - val_loss: 0.9807
...
Epoch 10/10
accuracy: 0.8676 - loss: 0.4984 - val_accuracy: 0.7826 - val_loss: 0.7915

Epoch 1 Validation Loss: 0.815121
Epoch 2 Validation Loss: 0.833583
Epoch 3 Validation Loss: 0.837320
```
Final results:

- Train Loss: 0.6879
- Test Accuracy: 0.7681
- Test Precision: 0.5900
- Test Recall: 0.7681
- Test F1 Score: 0.6674
For questions or collaborations, reach out via nvarjunmani07@gmail.com.
Special thanks to Team FRONTERA HEALTH and the dataset providers who made this research possible!