QueryMamba: A Mamba-Based Encoder-Decoder Architecture with a Statistical Verb-Noun Interaction Module for Video Action Forecasting @ Ego4D Long-Term Action Anticipation Challenge 2024
conda install -c nvidia cuda-toolkit
conda env create -f environment.yaml  # the environment uses Python 3.9
conda activate mamba
This repo works on pre-extracted video features. You can download the official Ego4D features here or extract the features yourself.
Since a clip is much shorter than an entire video, we convert video features to clip features for faster I/O.
python tools/ego4d_video_to_clip_features.py
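Conceptually, the conversion slices each clip's frame range out of the video-level feature tensor and saves it under the clip UID. Below is a minimal sketch of that idea, not the actual script: the directory names, the metadata file, and the `video_uid` / `clip_uid` / `start_frame` / `end_frame` fields are assumptions for illustration.

```python
import json
import torch

# Hypothetical paths and metadata fields -- adjust to your local setup.
video_feat_dir = "/home/user/datasets/ego4d/version/omnivore_video_swinl"
clip_feat_dir = "/home/user/datasets/ego4d/version/omnivore_video_swinl_clips"
with open("clip_metadata.json") as f:
    clips = json.load(f)  # list of {video_uid, clip_uid, start_frame, end_frame}

for clip in clips:
    # Load the full video-level feature tensor of shape (num_frames, feat_dim).
    video_feats = torch.load(f"{video_feat_dir}/{clip['video_uid']}.pt")
    # Keep only the frames belonging to this clip.
    clip_feats = video_feats[clip["start_frame"]:clip["end_frame"] + 1]
    # Save the much smaller per-clip tensor for faster I/O during training.
    torch.save(clip_feats, f"{clip_feat_dir}/{clip['clip_uid']}.pt")
```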
We define the action taxonomy as a dictionary keyed by "verb_label,noun_label" pairs. The action taxonomy looks like:
{
"0,10": {
"verb": "adjust_(regulate,_increase/reduce,_change)",
"noun": "bag_(bag,_grocery,_nylon,_polythene,_pouch,_sachet,_sack,_suitcase)",
"freq": 28,
"action_label": 0
},
...,
}
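The taxonomy can then be used, for example, to map a (verb, noun) pair to its action label. A minimal usage sketch based on the JSON shown above (adjust the file path to your setup):

```python
import json

with open("annotations/action_taxonomy.json") as f:
    taxonomy = json.load(f)

# Keys are "verb_label,noun_label" strings, e.g. verb 0 + noun 10 -> "0,10".
verb_label, noun_label = 0, 10
entry = taxonomy[f"{verb_label},{noun_label}"]
print(entry["action_label"])              # 0
print(entry["verb"], "|", entry["noun"])  # adjust_(...) | bag_(...)
```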
Each annotation file contains an L×15 tensor, where L denotes the number of frames and 15 is the maximum number of labels per frame. For instance, if a frame has the annotation [98,11,101,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1], then it has three "true" labels: 98, 11, and 101.
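The valid labels of a frame can be recovered by dropping the -1 padding, as in this small sketch (the file name follows the directory layout shown below):

```python
import torch

# Per-frame annotation tensor for one clip, shape (L, 15).
anno = torch.load("action_anno_perfram/clip_uid.pt")

frame_labels = anno[0]                    # e.g. [98, 11, 101, -1, ..., -1]
valid = frame_labels[frame_labels != -1]  # strip the -1 padding
print(valid.tolist())                     # [98, 11, 101]
```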
You can generate both the action taxonomy and the GT files with:
python tools/create_ego4d_gt_files.py
After setting up features and gt files, your data structure may look like this:
Dataset root path (e.g., /home/user/datasets)
├── ego4d
│   └── version
│       ├── annotations
│       │   ├── action_taxonomy.json
│       │   ├── fho_lta_train.json
│       │   └── ...
│       ├── action_anno_perfram
│       │   └── clip_uid.pt
│       ├── noun_anno_perfram
│       │   └── clip_uid.pt
│       ├── verb_anno_perfram
│       │   └── clip_uid.pt
│       └── omnivore_video_swinl_clips
│           └── clip_uid.pt
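If you want, you can sanity-check the layout before launching training. A small helper sketch, assuming the tree above ("version" stands for your local Ego4D version folder; the root path is an example):

```python
from pathlib import Path

root = Path("/home/user/datasets/ego4d/version")  # adjust to your dataset root
expected = [
    "annotations/action_taxonomy.json",
    "action_anno_perfram",
    "noun_anno_perfram",
    "verb_anno_perfram",
    "omnivore_video_swinl_clips",
]
for rel in expected:
    path = root / rel
    print(f"{'OK     ' if path.exists() else 'MISSING'} {path}")
```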
All config params are defined in default_config.py. Start training or testing with:
bash expts/train_ego4d_querymamba.sh
# or testing
bash expts/test_ego4d_querymamba.sh