More info can be found on our project page
🌟 We won 2nd place in the UDA track of the EPIC-Kitchens Challenge @CVPR 2022. 🌟
Audio model: link
SlowFast model for RGB modality: link
Slow-Only model for optical flow modality: link
- There are two streams in total: one is the audio-adaptive model with RGB and audio modalities, and the other is the audio-adaptive model with optical flow and audio modalities.
- We average the predictions from the two streams at the end, for a mean accuracy of 61.0%.
- Prepare the audio files (.wav) from the videos:
python generate_sound_files.py
- Environments:
PyTorch 1.7.0
mmcv-full 1.2.7
mmaction2 0.13.0
cudatoolkit 10.1.243
- The directory structure should be modified to match:
├── rgb
| ├── train
| | ├── D1
| | | ├── P08_01
| | | | ├── frame_0000000000.jpg
| | | | ├── ...
| | | ├── P08_02
| | | ├── ...
| | ├── D2
| | ├── D3
| ├── test
| | ├── D1
| | ├── D2
| | ├── D3
├── flow
| ├── train
| | ├── D1
| | | ├── P08_01
| | | | ├── u
| | | | | ├── frame_0000000000.jpg
| | | | | ├── ...
| | | | ├── v
| | | ├── P08_02
| | | ├── ...
| | ├── D2
| | ├── D3
| ├── test
| | ├── D1
| | ├── D2
| | ├── D3
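The two-stream averaging mentioned above can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `fuse_streams`, the assumption that each stream outputs per-class logits, and the choice to average softmax probabilities rather than raw logits are all assumptions.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one clip's class logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_streams(rgb_audio_logits, flow_audio_logits):
    """Average the class probabilities of the two streams, clip by clip.

    Both arguments are lists of per-clip logit lists (hypothetical format).
    Returns the predicted class index per clip and the fused probabilities.
    """
    fused = []
    for rgb_row, flow_row in zip(rgb_audio_logits, flow_audio_logits):
        p_rgb, p_flow = softmax(rgb_row), softmax(flow_row)
        fused.append([(a + b) / 2 for a, b in zip(p_rgb, p_flow)])
    labels = [max(range(len(p)), key=p.__getitem__) for p in fused]
    return labels, fused

# Toy example with 2 clips and 3 classes (values are made up).
rgb_audio = [[2.0, 0.5, 0.1], [0.2, 0.3, 0.1]]
flow_audio = [[1.5, 0.2, 0.3], [0.1, 0.2, 2.5]]
labels, probs = fuse_streams(rgb_audio, flow_audio)
```

In the toy example the second clip is nearly undecided in the RGB stream, so the confident flow stream determines the fused prediction.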
This is the demo code for training the audio-adaptive model with RGB (SlowFast backbone) and audio modalities on the EPIC-Kitchens dataset, reproducing a mean accuracy of 59.2%.
- Before running the code, change the data paths to your own in dataloader_*.py, train_*.py, test_*.py and get_*.py.
- First download the data using the code provided by an existing work: https://github.com/jonmun/MM-SADA-code
- Go to the sub-directory:
cd EPIC-rgb-audio
- To run the code on 4 NVIDIA 1080Ti GPUs:
sh bash.sh
This is the demo code for training the audio-adaptive model with optical flow (Slow-Only backbone) and audio modalities on the EPIC-Kitchens dataset, reproducing a mean accuracy of 53.9%.
- Before running the code, change the data paths to your own in dataloader_*.py, train_*.py, test_*.py and get_*.py.
Note that the clusters and absent-pseudo labels generated by audio are the same as those in the "RGB and audio" code.
- Go to the sub-directory:
- Go to the sub-directory
cd EPIC-flow-audio
- To run the code on 4 NVIDIA 1080Ti GPUs:
sh bash.sh
This code conducts semi-supervised domain adaptation with all the source (3rd-person view) data and half of the target (1st-person view) data, based on RGB (SlowFast backbone) and audio modalities, reproducing an mAP of 26.3%.
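For reference, the mAP metric reported above is the mean over classes of the per-class average precision. The sketch below uses the standard textbook definition in plain Python; it is an illustration only, and the official CharadesEgo evaluation script may differ in details such as tie handling or the treatment of classes without positives. The function names and the input layout (one row per video, one column per class) are assumptions.

```python
def average_precision(scores, targets):
    """AP for one class: mean of the precision at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, idx in enumerate(order, start=1):
        if targets[idx]:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positive example has no defined AP.
    return sum(precisions) / len(precisions) if precisions else None

def mean_average_precision(score_matrix, target_matrix):
    """Average the per-class AP over all classes that have a positive."""
    num_classes = len(score_matrix[0])
    aps = []
    for c in range(num_classes):
        ap = average_precision([row[c] for row in score_matrix],
                               [row[c] for row in target_matrix])
        if ap is not None:
            aps.append(ap)
    return sum(aps) / len(aps)

# Toy example: 3 videos, 2 classes, multi-label targets (values are made up).
scores = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.6]]
targets = [[1, 0], [0, 1], [1, 1]]
toy_map = mean_average_precision(scores, targets)
```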
- The directory structure should be modified to match:
├── CharadesEgo
| ├── audio
| | ├── 005BUEGO.wav
| | ├── 005BU.wav
| | ├── ...
| ├── CharadesEgo_v1_rgb
| | ├── 005BU
| | | ├── 005BU-000001.jpg
| | | ├── 005BU-000002.jpg
| | | ├── ...
| | ├── 005BUEGO
| | ├── ...
| ├── Labels
| | ├── 005BU
| | | ├── frame_0000000001_0000000174.csv
| | | ├── ...
| | ├── 005BUEGO
| | ├── ...
| ├── CharadesEgo_v1_train_only1st.csv
| ├── CharadesEgo_v1_train_only3rd.csv
| ├── CharadesEgo_v1_test_only1st.csv
| ├── CharadesEgo_v1_test_only3rd.csv
Here the "Labels" directory contains the labels that we generated ourselves from the csv files provided by the CharadesEgo dataset. You can download it directly from this link or run generate_labels.py to create it yourself.
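The contents of generate_labels.py are not reproduced here. As a rough illustration of the kind of parsing involved, the sketch below reads the `actions` column of a Charades-style csv, assuming the usual format of semicolon-separated `class start end` triplets (for example `c092 11.90 21.20;c147 0.00 12.60`). The function name `parse_actions`, the in-memory sample rows, and the returned tuple layout are hypothetical and do not describe the repository's script or the layout of the "Labels" directory.

```python
import csv
import io

def parse_actions(actions_field):
    """Turn 'c092 11.90 21.20;c147 0.00 12.60' into (class_index, start_s, end_s) tuples."""
    segments = []
    if not actions_field:
        return segments  # some videos have no annotated actions
    for item in actions_field.split(";"):
        cls, start, end = item.split()
        segments.append((int(cls[1:]), float(start), float(end)))
    return segments

# Made-up rows standing in for a file such as CharadesEgo_v1_train_only1st.csv.
sample_csv = io.StringIO(
    "id,actions\n"
    "005BUEGO,c092 11.90 21.20;c147 0.00 12.60\n"
    "XXXXXEGO,\n"
)
labels_per_video = {
    row["id"]: parse_actions(row["actions"]) for row in csv.DictReader(sample_csv)
}
```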
- Before running the code, change the data paths to your own in dataloader_*.py, train_*.py, test_*.py and get_*.py.
- Go to the sub-directory:
cd CharadesEgo
- To run the code on 4 NVIDIA 1080Ti GPUs:
sh bash.sh
The ActorShift dataset can be downloaded at https://uvaauas.figshare.com/articles/dataset/ActorShift_zip/19387046
If you have any questions, you can send an email to y.zhang9@uva.nl
If you find the code useful in your research, please cite:
@inproceedings{ZhangCVPR2022,
title = {Audio-Adaptive Activity Recognition Across Video Domains},
author = {Yunhua Zhang and Hazel Doughty and Ling Shao and Cees G M Snoek},
year = {2022},
date = {2022-06-02},
urldate = {2022-06-01},
booktitle = {CVPR},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}