Solution for https://www.kaggle.com/competitions/birdclef-2025
This repository uses the uv package manager to install the required packages. For installation details, please see the uv docs. On Linux you can install uv and synchronize the environment by running:
```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
source .venv/bin/activate
```

Place your Kaggle API token in the `~/.kaggle` directory. You can find instructions on how to do it here.
```shell
mv kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
```

Place your WandB API token in the `.env` file so that it logs in automatically:

```
WANDB_API_KEY=your_wandb_api_key
```
```shell
kaggle competitions download -p input -c birdclef-2025
unzip input/birdclef-2025.zip -d input/birdclef-2025/
kaggle datasets download kdmitrie/bc25-separation-voice-from-data-by-silero-vad -p input/voice_data --unzip
```

If you wish to do manual data precomputation, you can run the following command:

```shell
uv run python -m src.audio_processing
```

To train a model, run:

```shell
uv run python -m src.train -c configs/your_config.yaml
```

Looked at different `librosa.feature.melspectrogram` params:
- Changed `N_FFT` to `2048` from `1024`
- Changed `HOP_LENGTH` to `512` from `256`
- Changed `FMIN` from `20` to `50`
- `FMAX` seems to be better at `15000`, not `14000`

These changes were made because the resulting spectrogram image looked clearer and fuller; however, they still have to be validated with training and validation.
- Tested mel spectrogram precomputation; indeed, it increased the training speed.
- Conducted experiments `001`, `002`, `003`, `004`.
- Conducted experiments `005`, `006`.
- `N_MELS` has to be 256!!! Because it translates into the height of the melspec image.
- Changed short audio processing from copying to constant padding
- Tested `FocalLossBCE`.
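The copy-to-pad change for short clips can be sketched as follows (`pad_short_audio` is a hypothetical helper; the 5 s / 32 kHz target is an assumption):

```python
import numpy as np

TARGET_LEN = 32000 * 5  # assumed: 5 s at 32 kHz

def pad_short_audio(y: np.ndarray) -> np.ndarray:
    """Zero-pad (constant padding) clips shorter than the target length."""
    if len(y) >= TARGET_LEN:
        return y[:TARGET_LEN]
    # The previous behaviour would have been np.tile(y, ...); constant padding instead:
    return np.pad(y, (0, TARGET_LEN - len(y)), mode="constant")
```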
`001-1.yaml` – `001-8.yaml` are related to the melspec settings. The best ones are:

```yaml
N_FFT: 2048
HOP_LENGTH: 1024  # but 512 on Public LB
N_MELS: 128
FMIN: 50          # but 20 on Public LB
FMAX: 14000       # but 16000 on Public LB
MINMAX_NORM: true
```
`002-1.yaml` – `002-7.yaml` are related to the optimizer/scheduler settings and batch size.

- Larger batch size (32 -> 128) can yield better results; needs further investigation
- `weight_decay: 1.0e-2` is better than `weight_decay: 1.0e-5`
- `OneCycleLR` is really bad
- `ReduceLROnPlateau` can also get good results
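A sketch of the combination these notes favour; the choice of `AdamW` and the `factor`/`patience` values are assumptions, not the repo's config:

```python
import torch
from torch import nn

model = nn.Linear(16, 4)  # stand-in for the real backbone
optimizer = torch.optim.AdamW(model.parameters(), lr=3.0e-3, weight_decay=1.0e-2)
# mode="max" because the monitored metric is validation AUC.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2
)

# In the training loop, after each validation epoch:
# scheduler.step(val_auc)
```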
`003-13.yaml`, `003-17.yaml`, `003-18.yaml` change the resolution of the `001-3`, `001-7`, `001-8` experiments from `256x256` to `224x224`.

- Based on both Local AUC and Public AUC, `256x256` is a better choice
`004-13.yaml`, `004-17.yaml`, `004-18.yaml` change the number of epochs from 10 to 15 and `min_lr: 1.0e-6` to `min_lr: 1.0e-7`.

- Based on both Local AUC and Public AUC, it is not clear whether these changes actually improve generalization ability, but it's clear that they overfit much worse
`005-1.yaml` – `005-10.yaml` change `lr` from `1.0e-3` to `1.0e-2` with a step of `0.1`.

- `lr: 3.0e-3` is the best one
`006-1.yaml` – `006-4.yaml` change `in_channels` from `1` to `3` and `pretrained` from `True` to `False`.

- `in_channels: 3` with ImageNet normalization works fine
`011-1.yaml` – `011-3.yaml` change voice processing: `1` does not filter out human voice, `2` filters out human voice and keeps only the longest segment without voice, `3` filters out human voice and keeps the concatenated segments without voice.

- `1` is the best one based on Local AUC
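Given the voice timestamps from the Silero VAD dataset, the three variants could look like this (`remove_voice` is a hypothetical helper; segments are (start, end) sample-index pairs):

```python
import numpy as np

def remove_voice(y: np.ndarray, voice_segments: list[tuple[int, int]], mode: int) -> np.ndarray:
    """mode 1: keep audio as-is; mode 2: longest voice-free segment;
    mode 3: concatenation of all voice-free segments."""
    if mode == 1 or not voice_segments:
        return y
    # Build the complementary (voice-free) segments.
    free, prev = [], 0
    for start, end in sorted(voice_segments):
        if start > prev:
            free.append((prev, start))
        prev = max(prev, end)
    if prev < len(y):
        free.append((prev, len(y)))
    if not free:
        return y
    if mode == 2:
        s, e = max(free, key=lambda seg: seg[1] - seg[0])
        return y[s:e]
    return np.concatenate([y[s:e] for s, e in free])
```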
`012-*` change the base model from `EfficientNet0` to `1`, `2`, `3` and `4` with `1` and `3` input channels. Other training parameters are the same.

- Only `EfficientNet3` with 3 channels is better than `EfficientNet0` with 1 channel.
`013-a*` change the `aug_prob` and `mixup_alpha` parameters.

- Turns out that different values of `mixup_alpha` do not change anything. Only the value `0` could make a difference, but I have never tested it.
- Also, an `aug_prob` of `0.5` is the best.
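For context, `mixup_alpha` presumably parameterizes the `Beta(alpha, alpha)` draw of standard mixup (a generic sketch, not the repo's implementation):

```python
import numpy as np

def mixup(x: np.ndarray, y: np.ndarray, alpha: float, rng: np.random.Generator):
    """Mix the batch with a shuffled copy of itself; alpha == 0 disables mixing."""
    lam = rng.beta(alpha, alpha) if alpha > 0 else 1.0
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]
```

Under this formulation `alpha == 0` is the only setting that switches mixing off entirely, consistent with the note above that only `0` could behave differently.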
`013-d*` change the `drop_rate` and `drop_path_rate` parameters.

- `drop_rate: 0.5` and `drop_path_rate: 0.2` are the best ones. The next best is `drop_rate: 0.2` and `drop_path_rate: 0.35`.
`014-*` change the `precompute_data` parameter and add all 5 folds to the training.

- The `precompute_data` param does not change the results, only the training speed (about 30x faster).
- All 5 folds give 5 different results, from `0.94544` to `0.95592`.
`015-*` change the `BCEWithLogitsLoss` to `FocalLossBCE` with different params.

- `BCEWithLogitsLoss` seems to be the best single loss.
- `FocalLossBCE`'s reduction `sum` is worse than `mean`.
- `FocalLossBCE`'s gamma `3` is worse than `2`.
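One common shape for a `FocalLossBCE` with the `gamma` and `reduction` knobs mentioned above (a generic sketch; the repo's exact variant may add alpha-weighting or mix in plain BCE):

```python
import torch
from torch import nn
import torch.nn.functional as F

class FocalLossBCE(nn.Module):
    """BCE-with-logits modulated by the focal term (1 - p_t) ** gamma."""

    def __init__(self, gamma: float = 2.0, reduction: str = "mean"):
        super().__init__()
        self.gamma = gamma
        self.reduction = reduction

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = torch.exp(-bce)  # probability assigned to the true label
        loss = (1 - p_t) ** self.gamma * bce
        return loss.mean() if self.reduction == "mean" else loss.sum()
```

With `gamma = 0` this reduces to plain `BCEWithLogitsLoss`, which makes the two losses directly comparable.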
`016-*` change the `HOP_LENGTH` from `512` to `16`, and also change `N_MELS` from `256` to `512` and `MINMAX_NORM` from `True` to `False`.
| Experiment name, fold | Local AUC | Public AUC | Details |
|---|---|---|---|
| 001-1, 0 | 0.94536 | 0.747 | - |
| 001-2, 0 | 0.94777 | - | - |
| 001-3, 0 | 0.95217 | 0.751 | - |
| 001-4, 0 | 0.94892 | - | - |
| 001-5, 0 | 0.94621 | - | - |
| 001-6, 0 | 0.94896 | - | - |
| 001-7, 0 | 0.95190 | 0.779 | - |
| 001-8, 0 | 0.95055 | - | - |
| 002-1, 0 | 0.94536 | - | - |
| 002-2, 0 | 0.94487 | - | - |
| 002-3, 0 | 0.94789 | - | - |
| 002-4, 0 | 0.94842 | - | - |
| 002-5, 0 | 0.94152 | - | - |
| 002-6, 0 | 0.94143 | - | - |
| 002-7, 0 | 0.94771 | - | - |
| 003-13, 0 | 0.95079 | - | - |
| 003-17, 0 | 0.95095 | 0.774 | - |
| 003-18, 0 | 0.94993 | - | - |
| 004-13, 0 | 0.95005 | - | - |
| 004-17, 0 | 0.95103 | 0.765 | - |
| 004-18, 0 | 0.95146 | - | - |
| 005-1, 0 | 0.94534 | - | - |
| 005-2, 0 | 0.94883 | - | - |
| 005-3, 0 | 0.95130 | - | - |
| 005-4, 0 | 0.94950 | - | - |
| 005-5, 0 | 0.94731 | - | - |
| 005-6, 0 | 0.93399 | - | - |
| 005-7, 0 | 0.93703 | - | - |
| 005-8, 0 | 0.92751 | - | - |
| 005-9, 0 | 0.91582 | - | - |
| 005-10, 0 | 0.91446 | - | - |
| 006-1, 0 | 0.95190 | - | - |
| 006-2, 0 | 0.92699 | - | - |
| 006-3, 0 | 0.92410 | - | - |
| 006-4, 0 | 0.95110 | - | - |
| 011-1, 0 | 0.95403 | 0.732 | - |
| 011-2, 0 | 0.95004 | 0.762 | - |
| 011-3, 0 | 0.94818 | 0.763 | - |
| 012-10, 0 | 0.95403 | - | - |
| 012-11, 0 | 0.85695 | - | - |
| 012-12, 0 | 0.94916 | - | - |
| 012-13, 0 | 0.95021 | - | - |
| 012-14, 0 | 0.94993 | - | - |
| 012-30, 0 | 0.94919 | - | - |
| 012-31, 0 | 0.90809 | - | - |
| 012-32, 0 | 0.95438 | 0.746 | - |
| 012-33, 0 | 0.95227 | - | - |
| 012-34, 0 | 0.95086 | 0.742 | - |
| 013-a0, 0 | 0.95162 | - | - |
| 013-a5, 0 | 0.94826 | - | - |
| 013-a10, 0 | 0.95403 | - | - |
| 013-a15, 0 | 0.95085 | - | - |
| 013-a20, 0 | 0.95222 | - | - |
| 013-d0, 0 | 0.94865 | - | - |
| 013-d1, 0 | 0.94600 | - | - |
| 013-d2, 0 | 0.95319 | - | - |
| 013-d3, 0 | 0.95376 | - | - |
| 013-d4, 0 | 0.95186 | - | - |
| 013-d5, 0 | 0.94825 | - | - |
| 013-d6, 0 | 0.94824 | - | - |
| 013-d7, 0 | 0.95255 | - | - |
| 013-d8, 0 | 0.95196 | - | - |
| 013-d9, 0 | 0.95023 | - | - |
| 013-d10, 0 | 0.94682 | - | - |
| 013-d11, 0 | 0.95443 | 0.759 | - |
| 013-d12, 0 | 0.95529 | 0.764 | - |
| 013-d13, 0 | 0.95204 | - | - |
| 013-d14, 0 | 0.94606 | - | - |
| 013-d15, 0 | 0.94986 | - | - |
| 013-d16, 0 | 0.95272 | - | - |
| 013-d17, 0 | 0.95022 | - | - |
| 013-d18, 0 | 0.95159 | - | - |
| 013-d19, 0 | 0.94326 | - | - |
| 013-d20, 0 | 0.95247 | - | - |
| 013-d21, 0 | 0.95592 | 0.773 | - |
| 013-d22, 0 | 0.95044 | - | - |
| 013-d23, 0 | 0.95028 | - | - |
| 015-1, 0 | 0.95592 | - | - |
| 015-2, 0 | 0.94985 | - | - |
| 015-3, 0 | 0.9462 | - | - |
| 015-4, 0 | error | - | - |
| 015-5, 0 | 0.94688 | - | - |
| 015-6, 0 | 0.95063 | - | - |
| 015-7, 0 | 0.94945 | - | - |
| 015-8, 0 | 0.95122 | - | - |
| 015-9, 0 | - | - | - |
- Remove human voice link1 link2
- Remove `mel_spec_norm`
- Test padding audio if it is less than 5s instead of copying it link
- Maybe use TTA link
- Test `HOP_LENGTH` up to 16
- Test `FMIN` up to 20
- Test `FMAX` up to 16000
- Test `N_MELS` up to 128
- Test model's `drop_rate` to something other than `0.2`
- Test model's `drop_path_rate` to something other than `0.2`
- Test a different model's classifier link
```python
self.classifier = nn.Sequential(
    nn.Linear(backbone_out, 512),
    nn.BatchNorm1d(512),
    nn.LeakyReLU(0.1),
    nn.Dropout(0.15),
    nn.Linear(512, 256),
    nn.BatchNorm1d(256),
    nn.LeakyReLU(0.1),
    nn.Dropout(0.1),
    nn.Linear(256, num_classes)
)
```

- Test audio denoising link
- Test `FocalLossBCE` link
- Make prediction based on all 5s segments of the audio link
- Add albumentations
- Test extracting not the center 5 seconds, but the first 5 seconds
- Test 3 channels
- Test ImageNet normalization for 3 channels if the weights are pretrained: `T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])`
- Test melspec more thoroughly (`N_MELS`, `HOP_LENGTH`)
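The "predict on all 5s segments" idea above can be sketched as follows (`segment_predictions` is a hypothetical helper; mean aggregation is one option, max over segments is another):

```python
import numpy as np

def segment_predictions(y: np.ndarray, sr: int, predict_fn, clip_seconds: int = 5) -> np.ndarray:
    """Split audio into consecutive clip_seconds-long segments and
    average the per-segment predictions from predict_fn."""
    seg_len = sr * clip_seconds
    n = max(1, len(y) // seg_len)
    preds = [predict_fn(y[i * seg_len:(i + 1) * seg_len]) for i in range(n)]
    return np.mean(preds, axis=0)
```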