
DAVE: Diagnostic Benchmark for Audio-Visual Evaluation

Overview of DAVE

This repository contains the official inference code for DAVE: Diagnostic Benchmark for Audio-Visual Evaluation. Use this code to evaluate your audio-visual models on the DAVE dataset.


🧩 Overview

DAVE is a diagnostic benchmark that tests audio-visual models by ensuring both modalities (audio and video) are required for successful inference. This repository provides tools to:

  • Load and iterate over the dataset
  • Construct multimodal prompts
  • Run inference with Gemini/OpenAI or your own models
  • Evaluate predictions against ground truth
  • Reproduce the main results in the paper

📦 Installation

git clone https://github.com/gorjanradevski/dave.git
cd dave
pip install torch datasets google-generativeai google-genai openai opencv-python Pillow

If you want to regenerate the dataset from scratch, also install:

pip install moviepy ffmpeg-python

📂 Dataset Setup

You can load the dataset via Hugging Face:

from datasets import load_dataset

# split="epic" or split="ego4d"
dataset = load_dataset("gorjanradevski/dave", split="epic", trust_remote_code=True)
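Once loaded, the dataset behaves like a regular sequence of Python dicts, so standard tooling applies. A minimal sketch (using the `audio_class` field that also appears in the inference example below) counting samples per audio class; the same function works on a hand-made list of dicts:

```python
from collections import Counter

def count_audio_classes(dataset):
    """Count how many samples carry each audio class label."""
    return Counter(sample["audio_class"] for sample in dataset)

# Works identically on an in-memory list of samples:
fake = [
    {"audio_class": "water"},
    {"audio_class": "water"},
    {"audio_class": "chopping"},
]
print(count_audio_classes(fake))  # Counter({'water': 2, 'chopping': 1})
```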

🚀 Inference with Gemini/OpenAI Models

  1. Set your API keys
export OPENAI_API_KEY="..."
export GOOGLE_API_KEY="..."
  2. Upload the dataset to Google for Gemini-based inference
python src/upload_dataset_gemini.py --split epic --output_path data/epic_gemini_mapping.json
  3. Run inference

Using Gemini

python src/inference.py --split epic \
  --google_id_mapping_path data/epic_gemini_mapping.json \
  --model_names gemini-1.5-flash-latest \
  --prompt_types multimodal

Using OpenAI

python src/inference.py --split epic \
  --model_names openai \
  --prompt_types multimodal

This saves the predictions for the epic split in the results/ folder.


  4. Evaluate predictions

To reproduce our main results:

python src/evaluate_predictions.py --result_dir results/

This will generate results across:

  • DAVE's three question types: multimodal synchronization, sound absence detection, and sound discrimination;
  • DAVE's atomic tasks: temporal ordering, audio classification, and action recognition;
  • different input modalities: video + text, audio + text, and text only.

To evaluate your own predictions file:

python src/evaluate_predictions.py --predictions_file /path/to/predictions.json
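At its core, evaluation compares predictions against ground truth per question type. A minimal sketch, assuming a hypothetical record schema with `prediction`, `ground_truth`, and `question_type` keys (the actual format is defined by src/evaluate_predictions.py):

```python
from collections import defaultdict

def accuracy_by_type(records):
    """Compute accuracy per question type from prediction records.

    The record schema here is an assumption for illustration, not
    the format that src/evaluate_predictions.py expects.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["question_type"]] += 1
        if r["prediction"] == r["ground_truth"]:
            correct[r["question_type"]] += 1
    return {t: correct[t] / total[t] for t in total}

# Usage with an in-memory example:
records = [
    {"question_type": "multimodal", "prediction": "B", "ground_truth": "B"},
    {"question_type": "multimodal", "prediction": "A", "ground_truth": "C"},
]
print(accuracy_by_type(records))  # {'multimodal': 0.5}
```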

🧪 Inference with Your Own Model

import random

# `dataset` is loaded as shown in Dataset Setup above
sample = random.choice(dataset)

# Access necessary fields
audio_class = sample["audio_class"]
options = sample["raw_choices_multimodal"]
video_path = sample["video_with_overlayed_audio_path"]
ground_truth = options[sample["overlayed_event_index"]]

# Prompt for the model
prompt = f"""What is the person in the video doing when {audio_class} is heard? Choose one:
(A) {options[0]}
(B) {options[1]}
(C) {options[2]}
(D) {options[3]}
(E) none of the above
"""

# Run your model
# prediction = your_model.predict(video_path, prompt)
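Since most models return free text, you will typically need to map the answer back to one of the lettered options before scoring. A minimal, hypothetical parsing heuristic (not part of the DAVE codebase; `options` is the four-element choice list from the sample above):

```python
import re

LETTERS = ["A", "B", "C", "D", "E"]

def parse_choice(answer, options):
    """Map a model's free-text answer to a letter A-E, or None.

    Heuristic sketch: first look for an explicit letter such as
    "(B)" or "B.", then fall back to matching an option's text.
    """
    match = re.search(r"\(?([A-E])\)?[.):\s]", answer.strip() + " ")
    if match:
        return match.group(1)
    lowered = answer.lower()
    for letter, option in zip(LETTERS, options + ["none of the above"]):
        if option.lower() in lowered:
            return letter
    return None
```

For example, `parse_choice("(B) washing dishes", options)` returns `"B"`, and an answer quoting an option verbatim resolves to that option's letter.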

🐍 Inference with Open-Source Models

To run inference using open-source audio-visual models such as Video-LLaMA, PandaGPT, or Video-SALMONN, follow the setup instructions in their respective repositories.

Once set up, place each model inside the src/external/ directory using the following structure:

src/external/video_llama/
src/external/pandagpt/
src/external/video_salmonn/

📄 Citation

If you use this benchmark or codebase, please cite:

@article{radevski2025dave,
  title={DAVE: Diagnostic benchmark for Audio Visual Evaluation},
  author={Radevski, Gorjan and Popordanoska, Teodora and Blaschko, Matthew B and Tuytelaars, Tinne},
  journal={arXiv preprint arXiv:2503.09321},
  year={2025}
}

📫 Contact

For questions, open an issue or contact: firstname.lastname@kuleuven.be

📝 License

This project is licensed under the MIT License.

About

Codebase for "DAVE: Diagnostic benchmark for Audio Visual Evaluation" (NeurIPS 2025 Datasets & Benchmarks)
