StoryBench: A Multifaceted Benchmark for Continuous Story Visualization

This is the implementation of the approaches described in the paper:

Emanuele Bugliarello, Hernan Moraldo, Ruben Villegas, Mohammad Babaeizadeh, Mohammad Taghi Saffar, Han Zhang, Dumitru Erhan, Vittorio Ferrari, Pieter-Jan Kindermans, Paul Voigtlaender. StoryBench: A Multifaceted Benchmark for Continuous Story Visualization. 2023.

We provide our text annotations, guidelines for human evaluation, and the code for computing automatic metrics.

Data

data/ contains the evaluation data for StoryBench.

  • data/llm_outputs/ contains the captions split by our instruction-tuned LLM
  • data/tasks/ contains the evaluation data formatted for the StoryBench tasks of action_exe, story_cont and story_gen

Training data can be downloaded from the following links:

We also share our Oops validation data used to assess the robustness of our data transformation pipeline:

Metrics

metrics/ contains the source code to perform automatic evaluation of generated videos.

To set up your Python environment, install the required packages:

pip install -r metrics/requirements.txt
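
If you prefer an isolated environment, a minimal sketch using Python's built-in venv module (the environment name below is arbitrary) is:

# Optional: create and activate a virtual environment before installing.
python3 -m venv storybench_env
source storybench_env/bin/activate
pip install -r metrics/requirements.txt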

To compute a given metric (e.g., FID with InceptionV3), run as follows:

MODEL_NAME="phenaki"
TASK="action_exe"  # [action_exe, story_cont, story_gen]
DATA_SPLIT="oops_test"  # [{oops,uvo,didemo}_{val,test}]
DATA_DIR="/tmp/datadir/"
OUT_DIR="/tmp/out/"

python3 -m metrics.fid_inception --batch_size=256 --model="ground_truth" --task=${TASK} --dataset=${DATA_SPLIT} --data_dir=${DATA_DIR} --output_dir=${OUT_DIR} --num_videos=1

python3 -m metrics.fid_inception --batch_size=256 --model=${MODEL_NAME} --task=${TASK} --dataset=${DATA_SPLIT} --data_dir=${DATA_DIR} --output_dir=${OUT_DIR} --num_videos=4

In this example, we run the same script twice, first to extract the features from the ground-truth videos, and then to extract the features from the videos generated by a text-to-video model (phenaki here). Note that we set --num_videos=4 in the latter case as we sample four videos per text prompt when we generate videos with our models.

If you do not use our pre-extracted features (see above), you only need to run the first command (which extracts the ground-truth features) once.
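
To evaluate the same model on all three tasks, you can wrap the command above in a shell loop; this is only an illustrative sketch reusing the metrics.fid_inception example (other metric modules are not shown here):

# Illustrative sweep over the three StoryBench tasks for one model,
# reusing the flags from the fid_inception example above.
for TASK in action_exe story_cont story_gen; do
  python3 -m metrics.fid_inception --batch_size=256 --model=${MODEL_NAME} --task=${TASK} --dataset=${DATA_SPLIT} --data_dir=${DATA_DIR} --output_dir=${OUT_DIR} --num_videos=4
done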

The inputs to these scripts are .npz files, each containing a (ground-truth or generated) video as a NumPy array.
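
As a rough sketch of how such a file could be written (the array shape, dtype, key, and file path below are assumptions; check the metrics scripts for the exact format they expect):

python3 - <<'EOF'
import numpy as np

# Hypothetical example: write one generated clip as an .npz file.
# The (frames, height, width, RGB) layout, uint8 dtype, and default array
# key are assumptions; the metrics scripts define what they actually load.
video = np.zeros((16, 160, 256, 3), dtype=np.uint8)
np.savez("/tmp/datadir/phenaki/action_exe/oops_test/raw/fn0.npz", video)
EOF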

We rely on publicly available models and code to compute our automatic metrics. For reference, our working directory is structured as follows.

checkpoints/
    | DOVER.pth
    | InternVideo-MM-L-14.ckpt
    | ViT-L-14-336px.pt
    | convnext_tiny_1k_224_ema.pth
    | i3d_torchscript.pt
    | pt_inception-2015-12-05-6726825d.pth
data/
    | ground_truth/
    |   | action_exe/
    |   |   | oops_test/
    |   |   |   | raw/
    |   |   |   |   | fn0.npz
    |   |   |   |   | ...
    |   |   |   | features/
    |   |   |   |   | fid_clip/
    |   |   |   |   |   | embeddings_0.npz
    |   |   |   |   | fid_inception/
    |   |   |   |   |   | embeddings_0.npz
    |   |   |   |   | ...
    |   |   |   |   | vtm_internvideo/
    |   |   |   |   |   | embeddings_0.npz
    |   |   | ...
    |   | ...
    | phenaki/
    |   | action_exe/
    |   |   | oops_test/
    |   |   |   | raw/
    |   |   |   |   | fn0.npz
    |   |   |   |   | ...
    |   |   | ...
    |   | ...
outputs/
    | phenaki/
    |   | action_exe/
    |   |   | oops_test/
    |   |   |   | features/
    |   |   |   |   | embeddings_0.npz
    |   |   |   |   | embeddings_1.npz
    |   |   |   |   | embeddings_2.npz
    |   |   |   |   | embeddings_3.npz
    |   |   | ...
    |   | ...

Note that:

  • checkpoints can be downloaded from the corresponding repositories (see metrics/third_party/)
  • after extracting the features for the ground-truth data, we move them from their ${OUT_DIR} to the features/ directory under ${DATA_DIR} (see the example below)
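
For example (illustrative paths only, following the layout above; the exact file names written to ${OUT_DIR} may differ):

# Illustrative: move the extracted ground-truth features into place.
mkdir -p ${DATA_DIR}/ground_truth/${TASK}/${DATA_SPLIT}/features/fid_inception/
mv ${OUT_DIR}/embeddings_0.npz ${DATA_DIR}/ground_truth/${TASK}/${DATA_SPLIT}/features/fid_inception/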

License

This work is licensed under the Apache License. See LICENSE for details.

We rely on third-party software and models, released under MIT and Apache licenses, to compute our automatic evaluation metrics.

The annotations are licensed by Google LLC under the CC BY 4.0 license.