This project aims to detect audio DeepFakes by leveraging the ASVspoof 2019 dataset. The focus is on distinguishing bona fide (real) speech from spoofed audio using various preprocessing techniques and machine learning models. By exploring different architectures and feature extraction methods, we aim to address the growing challenges posed by synthetic and manipulated audio in the field of digital forensics.
Authors: Tomas Lovato, Alisea Bovo
Course: Digital Forensics, Academic Year 2024/2025
GitHub Repository: DF_AudioDeepfakeDetection
The project is based on the ASVspoof 2019 Logical Access (LA) dataset. This dataset includes three partitions:
- Training Set: 25,380 samples (20 speakers: 8 male, 12 female).
- Development Set: 24,844 samples (20 speakers: 8 male, 12 female).
- Evaluation Set: Approximately 72,000 samples (48 speakers: 21 male, 27 female).
- Known Attacks: 6 types (4 TTS, 2 VC).
- Unknown Attacks: 11 types (6 TTS, 2 VC, 3 hybrids).
- The evaluation set contains attacks generated by unseen algorithms, testing the generalization of models.
The project's success hinges on effective preprocessing of audio data. Three main preprocessing strategies were used:
- Mel Spectrograms
  - Purpose: Feature extraction for CNN-based models.
  - Steps:
    - Audio loaded using Librosa with a fixed sample rate of 16 kHz.
    - Temporal normalization and padding/truncation to a standard duration.
    - Mel Spectrogram extraction.
- MFCCs
  - Purpose: Input features for SVM and One-Class SVM (OCSVM) models.
  - Steps:
    - Temporal normalization and scaling.
    - MFCC feature extraction.
- STFT
  - Purpose: Input for SVM models after dimensionality reduction.
  - Steps:
    - Extraction of STFT features.
    - Dimensionality reduction using Autoencoders or PCA.
- Unbalanced Training:
- Trained on the unbalanced dataset.
- Achieved high accuracy but biased predictions (tended to classify most samples as spoofed).
- Balanced Training:
- Classes balanced by oversampling or undersampling.
- Significant improvement in confusion matrix and ROC-AUC scores.
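Class balancing of the kind described above can be done with simple random oversampling; this is a plain-NumPy sketch, and the class counts are illustrative, not the dataset's real proportions.

```python
import numpy as np

rng = np.random.default_rng(42)

def oversample(X: np.ndarray, y: np.ndarray):
    """Randomly duplicate minority-class samples until all classes match the largest."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c in classes:
        c_idx = np.flatnonzero(y == c)
        # Sample with replacement up to the majority-class count.
        idx.append(rng.choice(c_idx, size=n_max, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

# Illustrative imbalance (far more spoofed than bona fide, as in ASVspoof LA):
X = np.arange(12).reshape(12, 1)
y = np.array([0, 0] + [1] * 10)   # 2 bona fide vs 10 spoofed
Xb, yb = oversample(X, y)
print(np.bincount(yb))  # both classes equally represented
```

Undersampling is the mirror image: draw `counts.min()` samples from each class without replacement.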
- Advanced CNN:
- Larger convolution kernels for capturing temporal dependencies.
- Achieved near-perfect results on both seen and unseen datasets.
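The idea of wider convolution kernels along the time axis can be sketched as a small PyTorch model; the layer sizes, kernel shapes, and input dimensions here are illustrative assumptions, not the project's actual architecture.

```python
import torch
import torch.nn as nn

class SpectroCNN(nn.Module):
    """Minimal CNN over log-Mel spectrograms. Kernels are wider along the
    time axis (second spatial dim) to capture longer temporal dependencies."""
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(3, 9), padding=(1, 4)),  # wide in time
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=(3, 9), padding=(1, 4)),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling: input length agnostic
        )
        self.classifier = nn.Linear(32, n_classes)  # bona fide vs spoof

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.features(x).flatten(1)
        return self.classifier(z)

model = SpectroCNN()
batch = torch.randn(4, 1, 64, 251)   # (batch, channel, n_mels, n_frames)
logits = model(batch)
print(logits.shape)
```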
- MFCC Features:
- Moderate ROC-AUC (~0.7), with challenges in generalization.
- Future work: Hyperparameter tuning with grid search.
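The grid search mentioned above could look like the sketch below with scikit-learn; the toy data and the parameter grid are illustrative, since the document does not specify the actual search space.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy features standing in for per-utterance MFCC vectors (illustrative data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 20)),    # "bona fide"
               rng.normal(1.5, 1.0, (30, 20))])   # "spoofed"
y = np.array([0] * 30 + [1] * 30)

# Grid over the usual RBF-SVM hyperparameters, scored by ROC-AUC as in the report.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    cv=3,
    scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```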
- STFT Features:
- Dimensionality reduced via Autoencoders or PCA.
- Performance improved with PCA, but computational cost remains high.
Despite the gender imbalance in the dataset:
- The model did not exhibit significant bias.
- Slightly better performance was observed for female voices.
- Successfully implemented CNNs with Mel Spectrograms, achieving robust detection performance.
- Developed insights into feature extraction and dimensionality reduction techniques.
- Difficulty in generalizing to unseen spoofing techniques.
- High computational requirements for feature extraction and dimensionality reduction.
