This repository contains the official inference code for DAVE: Diagnostic Benchmark for Audio-Visual Evaluation. Use this code to evaluate your audio-visual models on the DAVE dataset.
DAVE is a diagnostic benchmark that tests audio-visual models by ensuring both modalities (audio and video) are required for successful inference. This repository provides tools to:
- Load and iterate over the dataset
- Construct multimodal prompts
- Run inference with Gemini/OpenAI or your own models
- Evaluate predictions against ground truth
- Reproduce the main results in the paper
To get started, clone the repository and install the core dependencies:

```bash
git clone https://github.com/gorjanradevski/dave.git
cd dave
pip install torch datasets google-generativeai google-genai openai opencv-python Pillow
```
If you want to regenerate the dataset from scratch, you will also need:

```bash
pip install moviepy ffmpeg-python
```
You can load the dataset via Hugging Face:
```python
from datasets import load_dataset

# split="epic" or split="ego4d"
dataset = load_dataset("gorjanradevski/dave", split="epic", trust_remote_code=True)
```
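For a quick sanity check, you can inspect the loaded split directly (the per-sample fields are shown in the usage example further below):

```python
# Quick sanity check of the loaded split
print(len(dataset))       # number of samples in the split
print(dataset[0].keys())  # available fields for each sample
```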
- Set your API keys:

```bash
export OPENAI_API_KEY="..."
export GOOGLE_API_KEY="..."
```
- Upload the dataset to Google for Gemini-based inference:

```bash
python src/upload_dataset_gemini.py --split epic --output_path data/epic_gemini_mapping.json
```
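Under the hood, this step relies on the Gemini File API. Below is a minimal sketch of uploading a single video and recording its Google file ID; the mapping layout is an illustrative assumption, not the script's exact schema:

```python
import json
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Upload one video via the Gemini File API and record its Google file ID.
# NOTE: the path and mapping layout are hypothetical; the actual mapping is
# produced by src/upload_dataset_gemini.py.
uploaded = genai.upload_file(path="data/example_video.mp4")
mapping = {"example_video.mp4": uploaded.name}  # e.g. "files/abc123"

with open("data/epic_gemini_mapping.json", "w") as f:
    json.dump(mapping, f, indent=2)
```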
- Run inference:

Using Gemini:

```bash
python src/inference.py --split epic \
    --google_id_mapping_path data/epic_gemini_mapping.json \
    --model_names gemini-1.5-flash-latest \
    --prompt_types multimodal
```

Using OpenAI:

```bash
python src/inference.py --split epic \
    --model_names openai \
    --prompt_types multimodal
```
This will save the predictions for the `epic` split in a `results/` folder.
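For reference, a single Gemini call on one sample looks roughly like the sketch below, assuming the video was already uploaded and its file ID comes from the mapping JSON; this is a sketch, not the exact logic of `src/inference.py`:

```python
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Retrieve the previously uploaded video by its Google file ID
# (hypothetical ID; in practice it comes from the mapping JSON)
video_file = genai.get_file("files/abc123")

model = genai.GenerativeModel("gemini-1.5-flash-latest")
response = model.generate_content([video_file, "What is the person in the video doing? ..."])
print(response.text)
```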
- Evaluate predictions:

To reproduce our main results:

```bash
python src/evaluate_predictions.py --result_dir results/
```
This will generate results across:
- DAVE's three question types: multimodal synchronization, sound absence detection, and sound discrimination;
- DAVE's atomic tasks: temporal ordering, audio classification, and action recognition;
- different modality combinations: video + text, audio + text, and text only.
To evaluate your own predictions file:

```bash
python src/evaluate_predictions.py --predictions_file /path/to/predictions.json
```
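The evaluation script expects a JSON file with one prediction per sample. The exact schema is defined in `src/evaluate_predictions.py`; the layout below is a hypothetical minimal example for illustration only:

```python
import json

# Hypothetical predictions layout: one entry per sample ID with the predicted
# option letter. Check src/evaluate_predictions.py for the actual schema.
predictions = {
    "sample_0001": {"prediction": "B", "prompt_type": "multimodal"},
    "sample_0002": {"prediction": "E", "prompt_type": "multimodal"},
}

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```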
A minimal end-to-end example of loading a sample and building a prompt:

```python
import random

sample = random.choice(dataset)

# Access the fields needed to build the prompt
audio_class = sample["audio_class"]
options = sample["raw_choices_multimodal"]
video_path = sample["video_with_overlayed_audio_path"]
ground_truth = options[sample["overlayed_event_index"]]

# Prompt for the model
prompt = f"""What is the person in the video doing when {audio_class} is heard? Choose one:
(A) {options[0]}
(B) {options[1]}
(C) {options[2]}
(D) {options[3]}
(E) none of the above
"""

# Run your model
# prediction = your_model.predict(video_path, prompt)
```
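To score the response, map the predicted letter back to the option list and compare it with the ground truth. A minimal sketch, continuing the snippet above and assuming the model replies with a single option letter:

```python
# Map option letters back to option strings ("E" covers "none of the above")
letter_to_option = dict(zip("ABCD", options))
letter_to_option["E"] = "none of the above"

prediction = "B"  # e.g. your_model.predict(video_path, prompt).strip()
is_correct = letter_to_option.get(prediction) == ground_truth
print(f"Predicted: {letter_to_option.get(prediction)} | Correct: {is_correct}")
```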
To run inference using open-source audio-visual models such as Video-LLaMA, PandaGPT, or Video-SALMONN, follow the setup instructions in their respective repositories. Once set up, place each model inside the `src/external/` directory using the following structure:
```
src/external/video_llama/
src/external/pandagpt/
src/external/video_salmonn/
```
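The inference code can then import the vendored models. A hypothetical sketch of making one of them importable (the actual modules and entry points depend on each upstream repository):

```python
import sys

# Make the vendored Video-LLaMA code importable (hypothetical entry point)
sys.path.append("src/external/video_llama")
```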
If you use this benchmark or codebase, please cite:
```bibtex
@article{radevski2025dave,
  title={DAVE: Diagnostic benchmark for Audio Visual Evaluation},
  author={Radevski, Gorjan and Popordanoska, Teodora and Blaschko, Matthew B and Tuytelaars, Tinne},
  journal={arXiv preprint arXiv:2503.09321},
  year={2025}
}
```
For questions, open an issue or contact: firstname.lastname@kuleuven.be
This project is licensed under the MIT License.