This repository implements the model proposed in the paper:
Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen, EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition, ICCV, 2019
When using this code, kindly reference:
@InProceedings{kazakos2019TBN,
author = {Kazakos, Evangelos and Nagrani, Arsha and Zisserman, Andrew and Damen, Dima},
title = {EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition},
booktitle = {IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2019}
}
- We now provide support for training/evaluating on the newly released dataset EPIC-KITCHENS-100, as well as a pretrained model on EPIC-KITCHENS-100.
- Install the project's requirements in a separate conda environment. In your terminal:
$ conda env create -f environment.yml
- CUDA 10.0
This step assumes that you've downloaded the RGB and Flow frames of the EPIC-KITCHENS-100/EPIC-KITCHENS-55 dataset using the script found here, where you can also find instructions on how to use it. Your copy of the dataset (either EPIC-KITCHENS-100 or EPIC-KITCHENS-55) should have the same folder structure as the one provided by the script (which can be found here). You should also untar each video's frames into its corresponding folder, e.g. for P01_101.tar you should create a folder P01_101 and put the contents of the tar file inside.
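The untar step can be sketched in a few lines of Python. This is only an illustration, not a script shipped with the repository: the helper name `untar_video_frames` is hypothetical, and it assumes each tar file sits where its frame folder should be created.

```python
import tarfile
from pathlib import Path


def untar_video_frames(tar_path):
    """Extract <video>.tar into a sibling folder named <video> (hypothetical helper)."""
    tar_path = Path(tar_path)
    out_dir = tar_path.with_suffix("")  # e.g. P01_101.tar -> P01_101
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        tar.extractall(out_dir)
    return out_dir
```

To process a whole download, one option is to call it for every archive, e.g. `for tar_path in Path("/path/to/dataset").rglob("*.tar"): untar_video_frames(tar_path)`.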
dataset.py uses a unified folder structure for all datasets, which is the same as the one used in the TSN code. Example of the folder structure for RGB and Flow:
├── dataset_root
| ├── video1
| | ├── img_0000000000
| | ├── x_0000000000
| | ├── y_0000000000
| | ├── .
| | ├── .
| | ├── .
| | ├── img_0000000100
| | ├── x_0000000100
| | ├── y_0000000100
| ├── .
| ├── .
| ├── .
| ├── video10000
| | ├── img_0000000000
| | ├── x_0000000000
| | ├── y_0000000000
| | ├── .
| | ├── .
| | ├── .
| | ├── img_0000000250
| | ├── x_0000000250
| | ├── y_0000000250
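A quick sanity check of this layout might look like the sketch below. It is not part of the repository; the function names are hypothetical, and it only relies on the `img_`, `x_` and `y_` file name prefixes shown in the tree above (file extensions are not assumed).

```python
from pathlib import Path


def count_frames(video_dir):
    """Count files per prefix (img_, x_, y_) inside one video folder."""
    counts = {"img_": 0, "x_": 0, "y_": 0}
    for entry in Path(video_dir).iterdir():
        for prefix in counts:
            if entry.name.startswith(prefix):
                counts[prefix] += 1
    return counts


def check_layout(dataset_root):
    """Return the names of video folders that look incomplete.

    A folder is flagged when it holds no frames at all or when the number
    of x_ and y_ flow files differs.
    """
    problems = []
    for video_dir in sorted(Path(dataset_root).iterdir()):
        if not video_dir.is_dir():  # is_dir() follows symlinks
            continue
        counts = count_frames(video_dir)
        if sum(counts.values()) == 0 or counts["x_"] != counts["y_"]:
            problems.append(video_dir.name)
    return problems
```

An empty list from `check_layout("/path/to/output")` suggests that every video folder has at least some frames and matching flow components.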
To map the folder structure of EPIC-KITCHENS to the above folder structure I've used symlinks. Use the following script to convert the original folder structure of EPIC-KITCHENS to the folder structure above:
python preprocessing_epic/symlinks.py /path/to/dataset/ /path/to/output
This step assumes that you've downloaded the videos of EPIC-KITCHENS using this script. It is the same script as the one that you will use to download RGB/Flow frames, shown above.
To extract the audio from the videos, run:
python preprocessing_epic/extract_audio.py /path/to/videos /path/to/output
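For readers who want to see what such an extraction involves, here is a minimal sketch that calls ffmpeg from Python. It is not the contents of extract_audio.py: the function names, the mono down-mix and the 24000 Hz sample rate are assumptions made for illustration, and ffmpeg is assumed to be on your PATH.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, output_dir, sample_rate=24000):
    """Build an ffmpeg command that writes <video name>.wav into output_dir."""
    video_path = Path(video_path)
    wav_path = Path(output_dir) / (video_path.stem + ".wav")
    return [
        "ffmpeg",
        "-i", str(video_path),
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM audio
        "-ac", "1",              # mono (assumption)
        "-ar", str(sample_rate), # sample rate in Hz (assumption)
        str(wav_path),
    ]


def extract_audio(video_path, output_dir, sample_rate=24000):
    """Run ffmpeg and raise CalledProcessError if it fails."""
    command = build_ffmpeg_command(video_path, output_dir, sample_rate)
    subprocess.run(command, check=True)
```

Keeping the command construction separate from the call to `subprocess.run` makes it easy to print and inspect the command before running it on thousands of videos.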
To load the audio in dataset.py, I'm using a dictionary, where the keys are the video names and the values are the audio extracted in the previous step. To save the extracted audio into a dictionary, run:
python preprocessing_epic/wav_to_dict.py /path/to/audio /path/to/output
This is done because the untrimmed videos of EPIC-KITCHENS are very large, and loading the untrimmed wav files in each training iteration is very slow. For other datasets with short audio clips, if you don't want to save the audio in a dictionary and prefer to load the wav files directly in dataset.py, you can set use_audio_dict=False in TBNDataset in dataset.py.
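The idea of the audio dictionary can be illustrated with the standard library alone. The sketch below is not wav_to_dict.py and does not claim to produce the format that dataset.py expects; the function names are hypothetical and it assumes 16-bit PCM wav files named after their videos.

```python
import array
import pickle
import sys
import wave
from pathlib import Path


def wav_folder_to_dict(audio_dir):
    """Map each video name (wav file stem) to its 16-bit PCM samples."""
    audio = {}
    for wav_path in sorted(Path(audio_dir).glob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav:
            if wav.getsampwidth() != 2:
                raise ValueError(f"{wav_path} is not 16-bit PCM")
            samples = array.array("h")
            samples.frombytes(wav.readframes(wav.getnframes()))
        if sys.byteorder == "big":
            samples.byteswap()  # wav data is little-endian
        audio[wav_path.stem] = samples
    return audio


def save_audio_dict(audio, output_path):
    """Pickle the dictionary so it can be loaded once at start-up."""
    with open(output_path, "wb") as f:
        pickle.dump(audio, f, protocol=pickle.HIGHEST_PROTOCOL)
```

Loading one pickle at start-up trades memory for speed: every training iteration then slices an in-memory array instead of opening a large wav file.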
- TBN-epic-kitchens-55.pth: Download link. This is the full TBN model (RGB, Flow, Audio) trained on EPIC-KITCHENS-55, which we use to report results in our paper.
- TBN-epic-kitchens-100.pth: Download link. This is the full TBN model (RGB, Flow, Audio) trained on EPIC-KITCHENS-100.
- TSN-kinetics-flow.pth: Download link. This is a TSN Flow model, trained on Kinetics, downloaded from here. The original model was on Caffe and I converted it to a PyTorch model. This can be used for initialising the Flow stream from Kinetics when training TBN, as we observed an increase in performance in preliminary experiments in comparison to initialising Flow from ImageNet.
Basic steps:
- Extract the audio in a similar way to the one shown above (.wav files for the whole dataset in a single folder). Have a look at preprocessing_epic/extract_audio.py for help.
- Visual data should have the same folder structure as the one shown above. To do that, map your original folder structure to the one above using symlinks, similarly to preprocessing_epic/symlinks.py.
- In both train.py and test.py, register the number of classes of your dataset in the variable num_class at the top of main().
- Under video_records/, create your_record.py, which should inherit from VideoRecord. This should parse the lines of a file that contains info about your dataset (paths, labels etc.). Have a look at epickitchens100_record.py as an example.
- Add your dataset in _parse_list() in dataset.py, by parsing each line of list_file and storing it in a list, where list_file is the file that contains info for your dataset.
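The last two steps could look roughly like the sketch below. Everything in it is hypothetical: the `VideoRecord` class is a stand-in so that the example runs on its own (the repository's base class may define a different interface), and the comma-separated line format is invented for illustration.

```python
class VideoRecord:
    """Stand-in for the repository's base class (interface assumed)."""

    def __init__(self, fields):
        self._fields = fields


class MyDatasetRecord(VideoRecord):
    """Parses one line of the form: video_name,start_frame,end_frame,label."""

    def __init__(self, line):
        super().__init__(line.strip().split(","))

    @property
    def untrimmed_video_name(self):
        return self._fields[0]

    @property
    def start_frame(self):
        return int(self._fields[1])

    @property
    def end_frame(self):
        return int(self._fields[2])

    @property
    def num_frames(self):
        return self.end_frame - self.start_frame

    @property
    def label(self):
        return int(self._fields[3])


def parse_list(list_file):
    """Turn every non-empty line of list_file into a record."""
    with open(list_file) as f:
        return [MyDatasetRecord(line) for line in f if line.strip()]
```

Check epickitchens100_record.py for the properties that dataset.py actually reads before copying this pattern.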
To train the full RGB, Flow, Audio model, run:
python train.py epic-kitchens-55 RGB Flow Spec --train_list train_val/EPIC_train_action_labels.pkl --val_list train_val/EPIC_val_action_labels.pkl \
--visual_path /path/to/rgb+flow --audio_path /path/to/audio --arch BNInception --num_segments 3 --dropout 0.5 --epochs 80 -b 128 --lr 0.01 --lr_steps 60 \
--gd 20 --partialbn --eval-freq 1 -j 40 --pretrained_flow /path/to/pretrained/kinetics/flow/model
In the paper, results are reported by training on the whole training set. The pretrained model in pretrained/ is the result of training on the whole training set. Train/val sets were used for development and hyperparameter tuning. To train on the whole dataset, concatenate EPIC_train_action_labels.pkl and EPIC_val_action_labels.pkl, found under train_val/, into EPIC_train+val_action_labels.pkl and run:
python train.py epic-kitchens-55 RGB Flow Spec --train_list train_val/EPIC_train+val_action_labels.pkl --val_list train_val/EPIC_val_action_labels.pkl \
--visual_path /path/to/rgb+flow --audio_path /path/to/audio --arch BNInception --num_segments 3 --dropout 0.5 --epochs 80 -b 128 --lr 0.01 --lr_steps 60 \
--gd 20 --partialbn --eval-freq 1 -j 40 --pretrained_flow /path/to/pretrained/kinetics/flow/model
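One possible way to build the concatenated annotation file is sketched below. It assumes that the two .pkl files are pickled pandas DataFrames with the same columns, which is an assumption about the file format rather than something stated here; the function name is hypothetical.

```python
import pandas as pd


def concat_annotations(train_pkl, val_pkl, output_pkl):
    """Stack the train and val annotation tables and save the result."""
    train = pd.read_pickle(train_pkl)
    val = pd.read_pickle(val_pkl)
    merged = pd.concat([train, val])  # keeps the original index values
    merged.to_pickle(output_pkl)
    return merged
```

For example: `concat_annotations("train_val/EPIC_train_action_labels.pkl", "train_val/EPIC_val_action_labels.pkl", "train_val/EPIC_train+val_action_labels.pkl")`.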
Individual modalities can be trained, as well as any combination of 2 modalities. To train audio, run:
python train.py epic-kitchens-55 Spec --train_list train_val/EPIC_train_action_labels.pkl --val_list train_val/EPIC_val_action_labels.pkl \
--audio_path /path/to/audio --arch BNInception --num_segments 3 --dropout 0.5 --epochs 80 -b 128 --lr 0.001 --lr_steps 60 --gd 20 \
--partialbn --eval-freq 1 -j 40
To train RGB, run:
python train.py epic-kitchens-55 RGB --train_list train_val/EPIC_train_action_labels.pkl --val_list train_val/EPIC_val_action_labels.pkl \
--visual_path /path/to/rgb+flow --arch BNInception --num_segments 3 --dropout 0.5 --epochs 80 -b 128 --lr 0.01 --lr_steps 60 --gd 20 \
--partialbn --eval-freq 1 -j 40
To train flow, run:
python train.py epic-kitchens-55 Flow --train_list train_val/EPIC_train_action_labels.pkl --val_list train_val/EPIC_val_action_labels.pkl \
--visual_path /path/to/rgb+flow --arch BNInception --num_segments 3 --dropout 0.5 --epochs 80 -b 128 --lr 0.001 --lr_steps 60 --gd 20 \
--partialbn --eval-freq 1 -j 40 --pretrained_flow /path/to/pretrained/kinetics/flow/model
Example of training RGB+Audio (any other combination can be used):
python train.py epic-kitchens-55 RGB Spec --train_list train_val/EPIC_train_action_labels.pkl --val_list train_val/EPIC_val_action_labels.pkl \
--visual_path /path/to/rgb+flow --audio_path /path/to/audio --arch BNInception --num_segments 3 --dropout 0.5 --epochs 80 -b 128 --lr 0.01 --lr_steps 60 --gd 20 \
--partialbn --eval-freq 1 -j 40
EPIC_train_action_labels.pkl and EPIC_val_action_labels.pkl can be found under train_val/. They are the result of splitting the original EPIC_train_action_labels.pkl into a training and a validation set, by randomly holding out one untrimmed video from each participant for the 14 kitchens (out of 32) with the largest number of untrimmed videos.
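The hold-out procedure described above can be sketched as follows. This is an illustration of the idea, not the code that produced the files in train_val/: the function name and the seed are hypothetical, and it assumes video ids of the form `P01_05`, where the part before the underscore identifies the participant.

```python
import random
from collections import defaultdict


def hold_out_split(video_ids, num_kitchens=14, seed=0):
    """Hold out one random video from each of the participants with most videos."""
    unique_ids = sorted(set(video_ids))
    by_participant = defaultdict(list)
    for video_id in unique_ids:
        by_participant[video_id.split("_")[0]].append(video_id)

    # Rank participants by number of videos (ties broken by name).
    ranked = sorted(by_participant, key=lambda p: (-len(by_participant[p]), p))

    rng = random.Random(seed)
    held_out = {rng.choice(by_participant[p]) for p in ranked[:num_kitchens]}
    train = [v for v in unique_ids if v not in held_out]
    return train, sorted(held_out)
```

All action segments belonging to a held-out video would then go to the validation set, and the remaining segments to the training set.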
To compute scores, save scores and labels, and print the accuracy of the validation set using all modalities, run:
python test.py epic-kitchens-55 RGB Flow Spec path/to/checkpoint --test_list train_val/EPIC_val_action_labels.pkl --visual_path /path/to/rgb+flow --audio_path /path/to/audio --arch BNInception --scores_root scores/ --test_segments 25 --test_crops 1 --dropout 0.5 -j 40
To compute and save scores of the test sets (S1/S2) (since we do not have access to the labels), run:
python test.py epic-kitchens-55 RGB Flow Spec path/to/checkpoint --test_list EPIC_test_s1_timestamps.pkl --visual_path /path/to/rgb+flow --audio_path /path/to/audio --arch BNInception --scores_root scores/ --test_segments 25 --test_crops 1 --dropout 0.5 -j 40
For S2, replace EPIC_test_s1_timestamps.pkl with EPIC_test_s2_timestamps.pkl. These 2 files can be found in the repository of EPIC-KITCHENS-55 annotations (link).
Similarly, testing can be done for any combination of modalities, or for individual modalities.
Furthermore, you can use fuse_results_epic.py to fuse modalities' scores with late fusion, assuming that you trained individual modalities (similarly to TSN). Lastly, submission_json.py can be used for preparing your scores in json format to submit them to the EPIC-Kitchens Action Recognition Challenge.
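As a rough illustration of late fusion, the sketch below averages per-class scores across modalities. It does not reproduce fuse_results_epic.py, whose exact fusion rule is not described here; the function names and the optional weights are assumptions.

```python
def late_fusion(scores_per_modality, weights=None):
    """Weighted average of per-class scores across modalities.

    scores_per_modality maps a modality name to a list of class scores.
    """
    modalities = list(scores_per_modality)
    if weights is None:
        weights = {m: 1.0 for m in modalities}
    total_weight = sum(weights[m] for m in modalities)
    num_classes = len(scores_per_modality[modalities[0]])

    fused = [0.0] * num_classes
    for modality in modalities:
        scores = scores_per_modality[modality]
        if len(scores) != num_classes:
            raise ValueError("all modalities must score the same classes")
        for i, score in enumerate(scores):
            fused[i] += weights[modality] * score / total_weight
    return fused


def predict(fused_scores):
    """Index of the class with the highest fused score."""
    return max(range(len(fused_scores)), key=fused_scores.__getitem__)
```

Unlike the mid-level fusion of the full TBN model, this combines modalities only after each one has produced its own class scores.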
The following table contains the results of training and evaluating EPIC-KITCHENS-55 on the splits from train_val/.
Top-1 Accuracy:
VERB | NOUN | ACTION |
---|---|---|
63.31 | 46.00 | 34.83 |
Top-5 Accuracy:
VERB | NOUN | ACTION |
---|---|---|
88.29 | 68.31 | 54.09 |
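For reference, top-k accuracy as a percentage can be computed along the lines of the sketch below. It is a generic definition written for illustration (the function name is hypothetical) and is not taken from test.py.

```python
def topk_accuracy(scores, labels, k=1):
    """Percentage of samples whose true label is among the k highest scores.

    scores is a list of per-class score lists, labels a list of class indices.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    hits = 0
    for sample_scores, label in zip(scores, labels):
        ranked = sorted(
            range(len(sample_scores)),
            key=lambda c: sample_scores[c],
            reverse=True,
        )
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)
```

With k=1 this is the usual classification accuracy; with k=5 a prediction counts as correct whenever the true class is among the five best-scoring classes.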
To train the full RGB, Flow, Audio model on EPIC-KITCHENS-100, run:
python train.py epic-kitchens-100 RGB Flow Spec --train_list EPIC_100_train.pkl --val_list EPIC_100_validation.pkl \
--visual_path /path/to/rgb+flow --audio_path /path/to/audio --arch BNInception --num_segments 6 --dropout 0.5 --epochs 80 -b 64 --lr 0.01 --lr_steps 40 60 \
--gd 20 --partialbn --eval-freq 1 -j 40 --pretrained_flow /path/to/pretrained/kinetics/flow/model
EPIC_100_train.pkl and EPIC_100_validation.pkl can be found in the annotations repository of EPIC-KITCHENS-100 (link).
To compute and save scores on the validation set of EPIC-KITCHENS-100, run:
python test.py epic-kitchens-100 RGB Flow Spec path/to/checkpoint --test_list EPIC_100_validation.pkl --visual_path /path/to/rgb+flow --audio_path /path/to/audio --arch BNInception --scores_root scores/ --test_segments 25 --test_crops 1 --dropout 0.5 -j 40
Top-1 Accuracy:
VERB | NOUN | ACTION |
---|---|---|
65.26 | 47.49 | 36.08 |
Top-5 Accuracy:
VERB | NOUN | ACTION |
---|---|---|
90.32 | 73.94 | 58.04 |
NOTE: For official comparisons with TBN, please submit your results to the test server of EPIC-KITCHENS.
The code is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, found here.