VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI

🌐 Homepage | 🤗 Paper | 📖 arXiv | 🏆 Leaderboard

Figure 1: The main tasks of the VidEgoThink benchmark, which comprehensively assess egocentric video understanding capabilities in Embodied AI. There are four types of tasks: video question answering, hierarchy planning, visual grounding, and reward modeling. The four tasks complement one another in accomplishing a complete goal for Embodied AI.

🔔 News

[2024-10]: VidEgoThink was the Top-1 paper on Hugging Face on Oct-17. 🔥
[2024-10]: Our paper VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI has been released.
[2024-09]: EgoThink and VidEgoThink have been invited to be presented at ZhiDX.

💾 VidEgoThink Benchmark

Given that the utilization of foundation models in Embodied AI remains an open research question, we carefully design four types of interrelated tasks for comprehensive assessment: (i) video question-answering, (ii) hierarchy planning, (iii) visual grounding, (iv) reward modeling.

1. Video Question Answering

Figure 2: Case of video question answering.

2. Hierarchy Planning

Figure 3: Case of hierarchy planning.

3. Visual Grounding

Figure 4: Case of visual grounding.

4. Reward Modeling

Figure 5: Case of reward modeling.

💾 Dataset

1. Download Original Egocentric Videos

You can use the Ego4D CLI to get the original egocentric videos of Ego4D GoalStep.

# download goalstep videos
ego4d --datasets full_scale --benchmark goalstep -o <out-dir>

2. Download Our Annotations

Please clone our GitHub repository directly.

git clone https://github.com/AdaCheng/VidEgoThink.git
cd VidEgoThink/data

The format of our annotations is as follows, where video_path indicates the video clipped from start_time to end_time of the original video_uid in Ego4D GoalStep. The image_path field lists the keyframes uniformly sampled from our clipped videos.

[
    {
        "video_uid": "a13a145f-920a-44ec-8aef-b489c097f4a7",
        "start_time": 294.21739,
        "end_time": 341.15273,
        "video_path": "151.mp4",
        "image_path": [
            "151/frame_0001.png",
            "151/frame_0015.png",
            "151/frame_0030.png",
            "151/frame_0045.png",
            "151/frame_0060.png",
            "151/frame_0074.png",
            "151/frame_0089.png",
            "151/frame_0104.png"
        ],
        "question": "How many times did I adjust a container in the cupboard with my right hand?",
        "answer": "Twice."
    }
]
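
As a rough sketch of how these records might be consumed, the snippet below parses an annotation list and derives each clip's duration from start_time and end_time. The helper name `summarize` and the shortened inline sample are illustrative assumptions, not part of the repository; only the field names come from the example above.

```python
import json

# Inline sample mirroring the record above (image_path shortened for brevity).
SAMPLE = """
[
    {
        "video_uid": "a13a145f-920a-44ec-8aef-b489c097f4a7",
        "start_time": 294.21739,
        "end_time": 341.15273,
        "video_path": "151.mp4",
        "image_path": ["151/frame_0001.png", "151/frame_0104.png"],
        "question": "How many times did I adjust a container in the cupboard with my right hand?",
        "answer": "Twice."
    }
]
"""


def summarize(record):
    """Hypothetical helper: return (clip name, duration in seconds, keyframe count)."""
    duration = record["end_time"] - record["start_time"]
    return record["video_path"], round(duration, 5), len(record["image_path"])


records = json.loads(SAMPLE)
for rec in records:
    name, duration, n_frames = summarize(rec)
    print(f"{name}: {duration:.2f}s clip of {rec['video_uid']}, {n_frames} keyframes")
```

To read a real annotation file instead, replace `json.loads(SAMPLE)` with `json.load(f)` on the opened file.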

3. Prepare Videos and Images

Given the Ego4D license and the large file sizes, readers need to use our scripts to process the original egocentric videos. 😎 We will also try to share our videos and images via an external cloud service soon.

  • Prepare clipped videos.
python video_clip.py \
    --data_path /VidEgoThink/data/${annotation_file} \
    --video_folder /goal_step/v2/full_scale/ \
    --output_folder /data/${clipped_video_folder}
  • Prepare sampled keyframes. (Optional: we use the same keyframes for all multi-image MLLMs to ensure fairness. You may choose a better sampling strategy.)
python keyframe_extract.py \
    --input_folder /data/${clipped_video_folder} \
    --output_folder /data/${keyframe_folder}
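
For readers who want to try their own strategy, here is a minimal sketch of uniform keyframe sampling. It is not the logic of keyframe_extract.py, whose exact rounding may differ; the function name `uniform_frame_indices` and the default of 8 keyframes are assumptions based on the annotation example, which lists eight frames.

```python
def uniform_frame_indices(num_frames, num_keyframes=8):
    """Return 1-based frame indices spread evenly from the first to the last frame."""
    if num_frames < 1 or num_keyframes < 1:
        raise ValueError("num_frames and num_keyframes must be positive")
    if num_keyframes == 1 or num_frames == 1:
        return [1]
    # Never request more keyframes than the clip has frames.
    k = min(num_keyframes, num_frames)
    # Integer arithmetic keeps both endpoints exact and avoids float rounding.
    return [1 + (i * (num_frames - 1)) // (k - 1) for i in range(k)]


indices = uniform_frame_indices(104, 8)
# File naming follows the frame_XXXX.png pattern seen in the annotations.
names = [f"frame_{i:04d}.png" for i in indices]
print(names)
```

Because the first and last frames are always included, short clips degrade gracefully: a clip with fewer frames than requested simply returns every frame.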

📊 Evaluation

Add New Open-Source Models

🫰 Thank you very much if you would like to contribute code for a new model you have deployed!

  1. Create test_{new_model}.py in /models.
  2. Add the new model in get_model() in /models/__init__.py.
# Qwen2-VL-7B-Instruct
if model_name == 'qwen2_vl':
    from .test_qwen2vl import TestQwen2VL
    return TestQwen2VL(device)
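
The two steps above might look roughly like the skeleton below. This is only a sketch: the class name `TestNewModel`, the `generate` method and its signature, and the `'new_model'` key are all hypothetical, and the interface that eval.py actually expects should be taken from an existing file in /models such as test_qwen2vl.py.

```python
class TestNewModel:
    """Hypothetical wrapper; the real interface expected by eval.py may differ."""

    def __init__(self, device=None):
        # The get_model() snippet above passes a device to the constructor.
        self.device = device
        self.model = None  # load the model weights here in a real implementation

    def generate(self, image_paths, question):
        # Assumed method name and signature: take keyframe paths and a question,
        # and return the model's answer as a string. This stub runs no model.
        return f"[stub answer on device {self.device}] {question} ({len(image_paths)} frames)"


def get_model(model_name, device=None):
    # Mirrors the dispatch pattern shown above; 'new_model' is a placeholder key.
    if model_name == 'new_model':
        return TestNewModel(device)
    raise ValueError(f"Unknown model: {model_name}")


model = get_model('new_model', device=0)
print(model.generate(["151/frame_0001.png"], "What am I holding?"))
```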

Inference

  • API-based Model

Please update the API keys and base_urls of the API-based models on lines 23 to 33 of gpt_eval.py.

# dataset: Activity, Object/existence, etc.
# MODEL: GPT series models, such as gpt-4o
# INFERENCE_TYPE: {caption, frames, 32-frames, text}
# TASK: {vqa, hp_high2mid, hp_mid2low, rm_critique, rm_feedback}
python gpt_eval.py \
    --model_name $MODEL \
    --inference_type $INFERENCE_TYPE \
    --annotation_path /${dataset}/annotations.json \
    --video_folder /data/${clipped_video_folder} \
    --image_folder /data/${keyframe_folder} \
    --answer_path /answer/${dataset} \
    --task $TASK
  • Open-Source Model (@TODO: double check)
# dataset: Activity, Object/existence, etc.
# MODEL: models defined in the models file
# DEVICE: GPU id, 0/1/2..., currently only single card can run
python eval.py \
    --model_name $MODEL \
    --annotation_path /${dataset}/annotations.json \
    --answer_path /answer/${dataset} \
    --batch_size 1 \
    --device $DEVICE
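
To run the API-based inference over several settings, a small driver can assemble the command shown above. The sketch below uses only the flags from that command; the helper name `build_gpt_eval_command` and the folder paths `/data/clipped_videos` and `/data/keyframes` are placeholders chosen for illustration.

```python
import shlex


def build_gpt_eval_command(model, inference_type, task, dataset,
                           video_folder, image_folder):
    """Hypothetical helper that assembles the gpt_eval.py call shown above."""
    return [
        "python", "gpt_eval.py",
        "--model_name", model,
        "--inference_type", inference_type,
        "--annotation_path", f"/{dataset}/annotations.json",
        "--video_folder", video_folder,
        "--image_folder", image_folder,
        "--answer_path", f"/answer/{dataset}",
        "--task", task,
    ]


# Inference types listed in the comments above.
INFERENCE_TYPES = ["caption", "frames", "32-frames", "text"]

commands = [
    build_gpt_eval_command("gpt-4o", inference_type, "vqa", "Activity",
                           "/data/clipped_videos", "/data/keyframes")
    for inference_type in INFERENCE_TYPES
]
for cmd in commands:
    # Print only; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Keeping each command as an argument list rather than a single string avoids shell-quoting problems when dataset names contain characters such as `/`.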

Evaluation

Please update the keys and API bases of the API-based models on lines 463 to 546 of common.py.

# data-folder: the folder name of answer.
# bench-name: Activity, Object/existence, etc.
# EVA_MODELS: a list of models to be evaluated (separated by spaces), for example "llava-13b-llama2 llava-1.5-13b llava-1.5-7b"
# $EVA_JUDGE_MODEL: gpt-4o (default), gpt-3.5-turbo, claude-2, etc.
python gen_judgment.py \
    --data-folder /answer \
    --bench-name $dataset \
    --mode single \
    --model-list $EVA_MODELS \
    --judge-model $EVA_JUDGE_MODEL \
    --parallel 4 \
    --judge-file judge_prompts.jsonl

Show Results

# EVA_MODELS: a list of models to be evaluated (separated by spaces), for example "llava-13b-llama2 llava-1.5-13b llava-1.5-7b"
# $EVA_JUDGE_MODEL: gpt-4 (default), gpt-3.5-turbo, claude-2, etc.
python show_result.py \
    --input-file {data_folder}/{bench-name}/model_judgment/{judge-model}_single.jsonl \
    --judge-model $EVA_JUDGE_MODEL \
    --model-list  $EVA_MODELS \
    --mode single
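
If you want to inspect the judgment file directly, per-model averages can be computed with a few lines of Python. The sketch below assumes each JSONL line carries a `model` and a `score` field and that a negative score marks an unparsable judgment, as in FastChat's single-mode judgments; neither assumption is confirmed for this repository, and the sample rows are made up.

```python
import json
from collections import defaultdict


def average_scores(lines):
    """Average the 'score' field per 'model' over JSONL judgment lines."""
    totals = defaultdict(lambda: [0.0, 0])  # model -> [score sum, count]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        score = row.get("score", -1)
        if score < 0:
            # Assumed convention: a negative score means the judge output failed to parse.
            continue
        totals[row["model"]][0] += score
        totals[row["model"]][1] += 1
    return {model: total / count for model, (total, count) in totals.items()}


# Made-up rows for illustration only.
sample = [
    '{"model": "model-a", "score": 1}',
    '{"model": "model-a", "score": 0}',
    '{"model": "model-b", "score": 1}',
    '{"model": "model-b", "score": -1}',
]
print(average_scores(sample))
```

For a real run, pass the opened `{judge-model}_single.jsonl` file object in place of `sample`.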

🏆 Leaderboard

Overview

Table 1: Experimental results of video question answering. OE, OO, OI, OC, OS, OP denote object existence, object order, object interaction, object count, object state, object prediction. AE, AS, AC denote action existence, action sequence, action count. SE, ST, SP denote scene existence, scene transition, scene prediction. Bold denotes the best performance and underline denotes the second-best performance.

Table 2: Experimental results of video question answering, hierarchy planning, visual grounding, and reward modeling tasks. Bold denotes the best performance and underline denotes the second-best performance.

Contact

Citation

@article{cheng2024videgothink,
title={VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI},
author={Cheng, Sijie and Fang, Kechen and Yu, Yangyang and Zhou, Sicheng and Li, Bohao and Tian, Ye and Li, Tingguang and Han, Lei and Liu, Yang},
journal={arXiv preprint arXiv:2410.11623},
year={2024}
}

If you are interested in VidEgoThink, we strongly recommend reading our previous related work, EgoThink. 🥰

@InProceedings{Cheng_2024_CVPR,
    author    = {Cheng, Sijie and Guo, Zhicheng and Wu, Jingwen and Fang, Kechen and Li, Peng and Liu, Huaping and Liu, Yang},
    title     = {EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {14291-14302}
}

Acknowledgements

Thanks to Yuyang You for his support in data collection and inference. Thanks to Xiang Yue, Yuanzhi Li, Jiangjie Chen for their early discussion.

Furthermore, we appreciate the developers behind the following projects for their significant contributions to our research: EgoThink, Ego4D, Multi-Modality-Arena, FastChat.