AMEGO: Active Memory from long EGOcentric videos
Gabriele Goletto, Tushar Nagarajan, Giuseppe Averta, Dima Damen
This project provides tools for extracting and processing location segments and hand-object interaction tracklets in egocentric videos, the AMEGO representation.
1. Clone the Repository and Set Up Environment
Clone this repository and create a Conda environment:
git clone --recursive https://github.com/gabrielegoletto/AMEGO
cd AMEGO
conda env create -f amego.yml
conda activate amego
pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.8/index.html
cd submodules/epic-kitchens-100-hand-object-bboxes
python setup.py install
Download Tracker Weights
Download the weights from this link and save them in model_checkpoints/.
The expected data structure for EPIC-KITCHENS videos is:
│
├── EPIC-KITCHENS/
│ ├── <p_id>/
│ │ ├── rgb_frames/
│ │ │ └── <video_id>/
│ │ │ ├── frame_0000000000.jpg
│ │ │ ├── frame_0000000001.jpg
│ │ │ └── ...
│ │ │
│ │ ├── flowformer/
│ │ │ └── <video_id>/
│ │ │ ├── flow_0000000000.pth
│ │ │ ├── flow_0000000001.pth
│ │ │ └── ...
│ │ │
│ │ └── hand-objects/
│ │ └── <video_id>.pkl
│ │
│ └── ...
│
└── ...
The expected data structure for new videos is:
│
├── <video_id>/
│ ├── rgb_frames/
│ │ ├── frame_0000000000.jpg
│ │ ├── frame_0000000001.jpg
│ │ └── ...
│ │
│ ├── flowformer/
│ │ ├── flow_0000000000.pth
│ │ ├── flow_0000000001.pth
│ │ └── ...
│ │
│ └── hand-objects/
│ └── <video_id>.pkl
│
└── ...
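Before running the extraction scripts, it can help to confirm that the inputs are where the trees above say they should be. The following is a minimal sketch, not part of the repository: the helper names `expected_inputs` and `missing_inputs` are hypothetical, and the code assumes that `<root>` is the directory that directly contains `EPIC-KITCHENS/` (for EPIC videos) or `<video_id>/` (for new videos).

```python
import tempfile
from pathlib import Path


def expected_inputs(root, video_id, dset="video", p_id=None):
    """Return the input paths implied by the two directory trees above."""
    root = Path(root)
    if dset == "epic":
        # EPIC layout: <root>/EPIC-KITCHENS/<p_id>/<kind>/<video_id>...
        if p_id is None:
            raise ValueError("p_id is required for the EPIC-KITCHENS layout")
        base = root / "EPIC-KITCHENS" / p_id
        return {
            "rgb_frames": base / "rgb_frames" / video_id,
            "flowformer": base / "flowformer" / video_id,
            "hand-objects": base / "hand-objects" / f"{video_id}.pkl",
        }
    # New-video layout: <root>/<video_id>/<kind>/...
    base = root / video_id
    return {
        "rgb_frames": base / "rgb_frames",
        "flowformer": base / "flowformer",
        "hand-objects": base / "hand-objects" / f"{video_id}.pkl",
    }


def missing_inputs(root, video_id, dset="video", p_id=None):
    """List the names of expected inputs that do not exist on disk."""
    paths = expected_inputs(root, video_id, dset, p_id)
    return [name for name, path in paths.items() if not path.exists()]


# Demo on a throwaway directory that only contains rgb_frames/.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "my_video" / "rgb_frames").mkdir(parents=True)
    report = missing_inputs(tmp, "my_video")

print(report)  # ['flowformer', 'hand-objects']
```

An empty list means all three inputs are present. The check only looks at the top-level folders and the `.pkl` file, not at individual frame or flow files.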
2. (Optional) Extract optical flow
2a. Prepare Flowformer Model
Download the Flowformer model trained on the Sintel dataset from this link. Place the model files in submodules/flowformer/models/.
2b. Extracting Flowformer Flow
Run the following command to extract flow data:
python -m tools.generate_flowformer_flow --root <root> --v_id <video_id> --dset <epic|video> --models_root submodules/flowformer/models --model sintel --video_fps <video_fps>
video_fps is needed only for non-EPIC videos.
3. (Optional) Extract HOI detections (already given for EPIC-KITCHENS videos)
3a. Download Hand-Object Model
Download the model from this link and place it in submodules/hand_object_detector/models/.
3b. Create and Activate Environment (an ad-hoc environment is required)
conda create --name handobj python=3.8
conda activate handobj
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
cd submodules/hand_object_detector/
pip install -r requirements.txt
cd lib
python setup.py build develop
pip install protobuf==3.20.3
pip install imageio
3c. Extracting Hand-Object Bounding Boxes
Run Extraction Script
python -m tools.extract_bboxes --image_dir <root>/<video_id>/rgb_frames --cuda --mGPUs --checksession 1 --checkepoch 8 --checkpoint 132028 --bs 32 --detections_pb <video_id>.pb2
Format Bounding Boxes
mkdir -p <root>/<video_id>/hand-objects/
python -m submodules.epic-kitchens-100-hand-object-bboxes.src.scripts.convert_raw_to_releasable_detections <video_id>.pb2 <root>/<video_id>/hand-objects/<video_id>.pkl --frame-height <video_height> --frame-width <video_width>
If there are issues with the detections_pb2 file, run:
protoc -I ./tools/detection_types/ --python_out=. ./tools/detection_types/detections.proto
AMEGO extraction can be customized by adjusting configuration parameters. You can modify the configuration either by directly changing the values in the default.yaml file or by passing arguments via the command line interface (CLI).
1. Preparation (for new videos only)
This script extracts frames from a video (resized to 456x256), computes the optical flow using FlowFormer, and extracts hand-object bounding boxes. It is a shortcut for automatically computing steps 2 and 3 above.
bash prepare_video.py <video_path> <video_fps>
2. Extract Interaction Tracklets
python HOI_AMEGO.py
for new videos:
python HOI_AMEGO.py --dset video --v_id <video_id> --video_fps <video_fps>
3. Extract Location Segments
python LS_AMEGO.py
for new videos:
python LS_AMEGO.py --dset video --v_id <video_id> --video_fps <video_fps>
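For a new video, the two extraction commands above share the same arguments, so they can be assembled in one place. This is a minimal sketch using only the flags shown above. The helper name `amego_commands` is hypothetical, and the order simply follows this README.

```python
import sys


def amego_commands(video_id, video_fps):
    """Build the two extraction commands for a new (non-EPIC) video."""
    shared = ["--dset", "video", "--v_id", video_id, "--video_fps", str(video_fps)]
    scripts = ("HOI_AMEGO.py", "LS_AMEGO.py")  # tracklets first, then location segments
    return [[sys.executable, script, *shared] for script in scripts]


commands = amego_commands("my_video", 30)
for cmd in commands:
    print(" ".join(cmd))
```

Each entry can be passed to `subprocess.run(cmd, check=True)` from the repository root.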
The output for interaction tracklets is saved as a JSON file with the following fields:
track_id: The unique identifier for each interaction track.
obj_bbox: The bounding box of the object involved in the interaction. The bounding box is not normalized, and coordinates are in xywh format (x-coordinate, y-coordinate, width, height) relative to the frame dimensions.
num_frame: The list of frames where the object is detected during the interaction.
features: DINO features extracted for the interaction track, providing detailed object representations.
cluster: The object instance assigned to the track, which helps group similar object interactions.
last_frame: The final frame in which the interaction is considered active.
side: For each frame, this field contains information on the side of the hand (left or right) interacting with the object.
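Many drawing and IoU utilities expect corner coordinates rather than the xywh format used for the obj_bbox field. The conversion can be sketched as follows. It assumes that (x, y) is the top-left corner of the box, which the field description does not state explicitly, and the helper name `xywh_to_xyxy` is hypothetical.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, width, height) to (x_min, y_min, x_max, y_max).

    Assumes (x, y) is the top-left corner, in unnormalized frame coordinates.
    """
    x, y, w, h = box
    return (x, y, x + w, y + h)


corners = xywh_to_xyxy([10, 20, 30, 40])
print(corners)  # (10, 20, 40, 60)
```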
The output for location segments is also saved as a JSON file, but with a simplified structure containing the following fields:
cluster: The object instance assigned to the segment, representing a unique object or cluster of objects.
features: DINO features extracted for the segment, providing visual representations of the object.
num_frame: The list of frames where the object appears in the segment.
Both can be easily read using pandas (pd.read_json(<filename>)).
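As an illustration, the sketch below loads a tracklet file with pandas and groups track IDs by their assigned object instance. The records are toy values invented for the example and include only a subset of the fields described above. The file name and the helper `tracks_per_cluster` are hypothetical, and the exact JSON layout written by the extraction scripts may differ from this list-of-records form.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Toy tracklets, invented for illustration only (not real AMEGO output).
toy_tracklets = [
    {"track_id": 0, "cluster": 3, "num_frame": [10, 11, 12], "last_frame": 12},
    {"track_id": 1, "cluster": 3, "num_frame": [40, 41], "last_frame": 41},
    {"track_id": 2, "cluster": 7, "num_frame": [15], "last_frame": 15},
]


def tracks_per_cluster(df):
    """Map each object instance (cluster) to the track_ids assigned to it."""
    return df.groupby("cluster")["track_id"].apply(list).to_dict()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_tracklets.json"
    path.write_text(json.dumps(toy_tracklets))
    df = pd.read_json(path)

summary = tracks_per_cluster(df)
print(summary)
```

Here the two tracks with cluster 3 are grouped as interactions with the same object instance.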
To query AMEGO for AMB:
python -m querying.Q* --root <root> --AMEGO <AMEGO root>
Replace * with the query type number, ranging from 1 to 8, depending on the specific query.
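To run several query types in a row, the command above can be generated for each number. This is a minimal sketch: the helper name `query_commands` is hypothetical, and it assumes the modules are named `querying.Q1` through `querying.Q8`, as the `Q*` pattern suggests.

```python
import sys


def query_commands(root, amego_root, query_ids=range(1, 9)):
    """Build one `python -m querying.Q<n>` command per requested query type."""
    commands = []
    for q in query_ids:
        if not 1 <= q <= 8:
            raise ValueError(f"query type must be between 1 and 8, got {q}")
        commands.append([
            sys.executable, "-m", f"querying.Q{q}",
            "--root", str(root),
            "--AMEGO", str(amego_root),
        ])
    return commands


all_queries = query_commands("data", "amego_out")
print(len(all_queries))  # 8
```

Each command can then be executed from the repository root with `subprocess.run(cmd, check=True)`.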
This repository builds upon previous works, specifically FlowFormer for optical flow extraction, HOI Detector for hand-object interaction detection, and EgoSTARK for tracking.
If you use AMEGO in your research or applications, please cite our paper:
@inproceedings{goletto2024amego,
title={AMEGO: Active Memory from long EGOcentric videos},
author={Goletto, Gabriele and Nagarajan, Tushar and Averta, Giuseppe and Damen, Dima},
booktitle={European Conference on Computer Vision},
year={2024}
}
This project is licensed under the MIT License - see the LICENSE file for details.