# YOHO: You Only Hear Once

Real-time audio event detection inspired by YOLO's philosophy, adapted for temporal audio processing.
## Installation

```bash
# Clone the repository
git clone https://github.com/armanrasta/yoho
cd yoho

# Install dependencies
pip install -r requirements.txt

# Install in development mode
pip install -e .
```

## Training

```bash
python train.py \
--audio_dir /path/to/audio/files \
--annotations /path/to/annotations.json \
--num_classes 10 \
--epochs 100 \
--batch_size 16 \
--save_dir checkpoints
```

## Detection

```bash
python detect.py \
--model_path checkpoints/yoho_best.pth \
--audio_path test_audio.wav \
--num_classes 10 \
--confidence_thresh 0.5 \
--visualize
```
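The detector can also be scripted. The snippet below is a hypothetical sketch: the class name `Detector` and the `detect()` method are assumptions made for illustration, not a verified API; consult `yoho/detector.py` for the actual interface.

```python
# Hypothetical usage sketch -- `Detector` and `detect()` are assumed names,
# not the verified API; see yoho/detector.py for the real interface.
import torch
from yoho.detector import Detector  # assumed import path

detector = Detector(
    model_path="checkpoints/yoho_best.pth",
    num_classes=10,
    confidence_thresh=0.5,
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# Each detection is (start_time, end_time, class_id, confidence)
for start, end, class_id, conf in detector.detect("test_audio.wav"):
    print(f"{start:6.2f}s-{end:6.2f}s  class={class_id}  conf={conf:.2f}")
```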
## Project Structure

```
yoho/
├── 🐍 train.py                  # Training script
├── 🔍 detect.py                 # Detection script
├── 📚 requirements.txt          # Dependencies
├── ⚙️ setup.py                  # Package setup
├── 📖 README.md                 # This file
├── 📊 example_annotations.json  # Example data format
├── 🎵 yoho/                     # Core YOHO package
│   ├── __init__.py
│   ├── 🧠 model.py              # YOHO architecture
│   ├── 📉 loss.py               # YOHO loss function
│   ├── 🎼 data.py               # Dataset & feature extraction
│   ├── 🏋️ trainer.py            # Training utilities
│   └── 🔮 detector.py           # Inference engine
└── 🔧 utils/                    # Utility functions
    ├── __init__.py
    ├── ⚓ anchors.py             # Anchor calculation
    └── 📈 evaluation.py         # Evaluation metrics
```
## Features

- ⚡ Real-time Detection: Single-pass inference like YOLO
- 🎵 Multi-scale Architecture: Detects events at different temporal resolutions
- 🔊 Professional Audio Processing: Mel-spectrograms, MFCCs, and more
- 🔄 Data Augmentation: Audio-specific augmentations for robustness
- 📊 Visualization: Detection results with audio waveform and spectrogram
- 🏭 Production Ready: Proper training pipeline and model checkpointing
## Architecture

YOHO adapts YOLO's core principles for audio (see the sketch after this list):
- Backbone: CNN with residual connections for temporal feature extraction
- Neck: Feature pyramid network for multi-scale feature fusion
- Heads: Multiple detection heads for different temporal resolutions
- Anchors: Optimized for typical audio event durations
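As a concrete illustration, here is a minimal single-scale sketch of this backbone/neck/head layout in PyTorch. It is not the actual `yoho/model.py`: the real model adds residual connections and multiple heads at different temporal resolutions, and all layer sizes below are assumptions.

```python
import torch
import torch.nn as nn

class TinyYOHO(nn.Module):
    """Illustrative backbone -> neck -> head layout; not the real yoho/model.py."""
    def __init__(self, num_classes=10, anchors_per_cell=3):
        super().__init__()
        # Backbone: strided 1D convs over spectrogram frames extract temporal features
        self.backbone = nn.Sequential(
            nn.Conv1d(128, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Neck: fuse backbone output into a shared representation
        self.neck = nn.Conv1d(128, 128, 3, padding=1)
        # Head: per time step, each anchor predicts (center, width, objectness) + class scores
        self.head = nn.Conv1d(128, anchors_per_cell * (3 + num_classes), 1)

    def forward(self, mel):          # mel: (batch, n_mels=128, time)
        features = self.backbone(mel)
        features = torch.relu(self.neck(features))
        return self.head(features)   # (batch, anchors*(3+classes), time/4)

x = torch.randn(2, 128, 512)         # batch of mel-spectrograms
print(TinyYOHO()(x).shape)           # torch.Size([2, 39, 128])
```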
## Annotation Format

Annotations map each audio file to a list of events, each given as `[start_time, end_time, class_id, confidence]`:

```json
{
  "audio1.wav": [
    [1.2, 2.5, 0, 1.0],
    [3.1, 4.0, 2, 1.0]
  ],
  "audio2.wav": [
    [0.5, 1.8, 1, 1.0]
  ]
}
```
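Loading such a file needs only the standard library. A minimal sketch (the filename follows `example_annotations.json` from the project tree):

```python
import json

with open("example_annotations.json") as f:
    annotations = json.load(f)

# Flatten to one event list: (file, start, end, class_id, confidence)
for audio_file, events in annotations.items():
    for start, end, class_id, confidence in events:
        print(f"{audio_file}: class {int(class_id)} at "
              f"{start:.1f}-{end:.1f}s (conf {confidence})")
```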
## Supported Audio Features

- Mel-spectrograms (recommended)
- MFCCs with delta features
- Log-spectrograms
- Combined features (Mel + MFCC)
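For reference, mel-spectrogram features of this kind can be computed with torchaudio. This is a minimal sketch; the parameters are common defaults, not necessarily those used in `yoho/data.py`:

```python
import torchaudio
import torchaudio.transforms as T

waveform, sample_rate = torchaudio.load("test_audio.wav")

# Mel-spectrogram in decibels -- common parameter choices, not yoho/data.py's
mel = T.MelSpectrogram(sample_rate=sample_rate, n_fft=1024,
                       hop_length=256, n_mels=128)(waveform)
mel_db = T.AmplitudeToDB()(mel)
print(mel_db.shape)  # (channels, n_mels, time_frames)
```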
## Detection Capabilities

- Real-time capable on modern GPUs
- Multi-event detection in a single audio clip
- Temporal localization with start/end times
- Class confidence scores for each detection
## Training Configuration

Key training parameters:

```bash
--num_classes 10                 # Number of event classes
--batch_size 16                  # Training batch size
--lr 1e-4                        # Learning rate
--epochs 100                     # Training epochs
--feature_type mel_spectrogram   # Feature extraction method
```

## Evaluation Metrics

- Event-based F1 Score: Temporal matching with tolerance
- Precision/Recall: Standard detection metrics
- Temporal IoU: Intersection-over-Union for time segments
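Temporal IoU is the one-dimensional analogue of box IoU. A minimal implementation (not necessarily the one in `utils/evaluation.py`):

```python
def temporal_iou(a, b):
    """IoU of two time segments, each given as (start, end) in seconds."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0

print(temporal_iou((1.2, 2.5), (1.8, 3.0)))  # ~0.389
```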
## Applications

- 🎵 Music Analysis: Chord detection, beat tracking
- 🔊 Sound Event Detection: Environmental sounds, alarms
- 🎬 Audio Analysis: Scene segmentation, event tagging
- 🦻 Healthcare: Cough detection, heart sound analysis
- 🐾 Bioacoustics: Animal call detection
## Customization

Custom feature extractor (import paths below are assumed from the project layout):

```python
from yoho.data import AudioFeatureExtractor  # assumed module path

class CustomFeatureExtractor(AudioFeatureExtractor):
    def forward(self, waveform, feature_type='custom'):
        if feature_type == 'custom':
            # Your custom feature extraction
            custom_features = ...
            return custom_features
        return super().forward(waveform, feature_type)
```

Custom model backbone:

```python
from yoho.model import YOHO  # assumed module path

class CustomYOHO(YOHO):
    def _build_backbone(self):
        # Your custom backbone
        custom_backbone = ...
        return custom_backbone
```

## Citation

If you use YOHO in your research, please cite:
```bibtex
@software{yoho2024,
  title  = {YOHO: You Only Hear Once for Audio Event Detection},
  author = {Your Name},
  year   = {2024},
  url    = {https://github.com/armanrasta/yoho}
}
```

## Contributing

We welcome contributions! Please see our Contributing Guidelines for details.
## License

This project is licensed under the MIT License; see the LICENSE file for details.
YOHO - Because you should only have to hear it once! 🎯