This is the code repository for the paper:
BagelScore: Visual-Language Evaluation Made Easy
Shuo Yin*, Zeyu Zhang*†, Huacan Wang, Qizhen Lan, Ronghao Chen, and Hao Tang#
*Equal contribution. †Project lead. #Corresponding author.
If you use any content of this repo in your work, please cite our paper:
placeholder
BagelScore is a reference-free evaluation metric that leverages the BAGEL multimodal model to assess:
- Image-Text Matching: Semantic alignment between images and captions
- Image Editing Quality: Quality of AI-generated image edits
Unlike traditional embedding-based metrics (e.g., CLIPScore), BagelScore uses inference-based semantic judgment to capture fine-grained semantic mismatches like negations and substitutions.
✅ Reference-Free: No need for ground-truth images
✅ Semantic Understanding: Captures complex semantic relationships
✅ Multi-Task: Supports both matching and editing evaluation
✅ High Correlation: Strong alignment with human judgments
- Python 3.8+
- CUDA 11.8+ (for GPU support)
- 80GB+ GPU memory (for BAGEL-7B model)
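Before loading the 7B model, you can sanity-check available GPU memory; a quick helper (not part of the repo):

```python
import torch

# Report the total memory of the first CUDA device; BAGEL-7B needs ~80 GB.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.0f} GB")
else:
    print("No CUDA device found")
```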
```bash
# Clone the repository
git clone https://github.com/YOUR_ORG/BAGELSCORE.git
cd BAGELSCORE

# Install dependencies
pip install -r requirements.txt

# Install flash-attention (required)
pip install flash_attn==2.5.8 --no-build-isolation
```

Download the BAGEL-7B-MoT weights from Hugging Face:

```python
from huggingface_hub import snapshot_download

save_dir = "models/BAGEL-7B-MoT"
repo_id = "ByteDance-Seed/BAGEL-7B-MoT"

snapshot_download(
    cache_dir=save_dir + "/cache",
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
```

```python
from bagelscore import BagelScorer
from PIL import Image

# Initialize scorer
scorer = BagelScorer(
    model_path="./models/BAGEL-7B-MoT",
    device_id=0
)

# Load image and caption
image = Image.open("example.jpg")
caption = "A cat sitting on a couch"

# Calculate BagelScore
score, info = scorer.calculate_bagelscore(image, caption)
print(f"BagelScore: {score:.3f}")
```

To score an image edit, use the EditScore components:

```python
from edit_score_calculator import EditScoreCalculator
from inferencer import InterleaveInferencer

# Initialize components (see evaluate_editscore_metrics.py for full setup)
calculator = EditScoreCalculator()

# Run inference with editing
output = inferencer(
    image=original_image,
    text="Apply a cartoon style to the whole image.",
    think=True,
    return_edit_score_data=True
)

# Calculate EditScore metrics
scores = calculator.compute_base_metrics(
    original_vae_latent=output['edit_score_data']['original_vae_latent'],
    generated_latent=output['edit_score_data']['generated_latent'],
    input_text_emb=output['edit_score_data']['input_text_emb'],
    think_text_emb=output['edit_score_data']['think_text_emb']
)

print(f"Image RLS: {scores['image_rls']:.4f}")
print(f"Image Cosine Sim: {scores['image_cosine_sim']:.4f}")
print(f"Text Similarity: {scores['text_similarity']:.4f}")
```

```bash
# Evaluate BagelScore on a dataset
python bagelscore.py \
    --model_path ./models/BAGEL-7B-MoT \
    --data_file dataset.json \
    --images_dir ./images \
    --output_file results/bagelscore_results.csv \
    --device_id 0

# Evaluate EditScore metrics
python evaluate_editscore_metrics.py \
    --mode batch \
    --model_path ./models/BAGEL-7B-MoT \
    --images_dir ./images \
    --results_dir ./results \
    --prompt "Apply a cartoon style to the whole image." \
    --limit 100
```

BagelScore uses a binary query approach:
- Asks the model: "Are the IMAGE and TEXT describing the same content?"
- Extracts logits for "Yes" tokens
- Applies sigmoid function to get final score:
  S(x, y) = σ(ℓ_yes)
Score Range: [0, 1]
- 1.0: Perfect semantic match
- 0.0: Complete mismatch
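As a minimal sketch of this scoring step (assuming the model exposes next-token logits and that "Yes" maps to a single token id; both are assumptions, see `bagelscore.py` for the actual implementation):

```python
import torch

def score_from_logits(next_token_logits: torch.Tensor, yes_token_id: int) -> float:
    """Map the 'Yes' logit to a BagelScore in [0, 1]: S(x, y) = sigmoid(l_yes)."""
    # next_token_logits: vocab-sized logit vector after the binary matching query
    # yes_token_id: tokenizer id of the "Yes" token (hypothetical names)
    yes_logit = next_token_logits[yes_token_id]
    return torch.sigmoid(yes_logit).item()
```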
EditScore provides three fundamental metrics:

- `image_rls` (Relative Latent Shift): Measures editing magnitude

  RLS = ||generated − original||₂ / ||original||₂

- `image_cosine_sim` (Cosine Similarity): Measures content preservation
  - Cosine similarity between original and edited image latents
- `text_similarity`: Measures instruction consistency
  - Cosine similarity between the input prompt and the model's "think" text
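These definitions map directly onto a few tensor operations. A minimal sketch of the computation (assuming PyTorch tensors as inputs; the repo's `compute_base_metrics` in `edit_score_calculator.py` is the authoritative version):

```python
import torch
import torch.nn.functional as F

def base_metrics(original_vae_latent, generated_latent, input_text_emb, think_text_emb):
    # Relative Latent Shift: edit magnitude relative to the original latent
    image_rls = (torch.norm(generated_latent - original_vae_latent)
                 / torch.norm(original_vae_latent)).item()

    # Content preservation: cosine similarity of the flattened image latents
    image_cosine_sim = F.cosine_similarity(
        original_vae_latent.flatten(), generated_latent.flatten(), dim=0
    ).item()

    # Instruction consistency: cosine similarity of the text embeddings
    text_similarity = F.cosine_similarity(
        input_text_emb.flatten(), think_text_emb.flatten(), dim=0
    ).item()

    return {"image_rls": image_rls,
            "image_cosine_sim": image_cosine_sim,
            "text_similarity": text_similarity}
```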
```
BAGELSCORE/
├── bagelscore.py                  # Main BagelScore implementation
├── edit_score_calculator.py       # EditScore base metrics calculator
├── inferencer.py                  # BAGEL model inference wrapper
├── evaluate_editscore_metrics.py  # EditScore evaluation script
├── batch_gpt_image_scoring.py     # GPT-4 scoring for comparison
├── modeling/                      # BAGEL model architecture
├── data/                          # Data loading utilities
├── eval/                          # Evaluation benchmarks
├── train/                         # Training scripts
├── requirements.txt               # Python dependencies
└── LICENSE                        # Apache 2.0 License
```
We evaluate BagelScore on:
- Flickr8k-Expert: Expert-annotated image-caption pairs (1-4 scale)
- Flickr8k-CF: CrowdFlower-annotated pairs (0-1 scale)
- Edit-1K: Image editing quality dataset
| Metric | Flickr8K-Expert | Flickr8K-CF | Composite |
|---|---|---|---|
| BAGELScore | 53.2 | 38.0 | 55.9 |
| CLIPScore | 51.2 | 34.4 | 53.8 |
| RefCLIPScore | 53.0 | 36.4 | 55.4 |
| ViLBERTScore-F | 50.1 | N/A | 52.4 |
| SPICE | 44.9 | 24.4 | 40.3 |
| CIDEr | 43.9 | 24.6 | 37.7 |
| METEOR | 41.8 | 22.2 | 38.9 |
| ROUGE-L | 32.3 | 19.9 | 32.4 |
| BLEU-1 | 32.3 | N/A | 31.3 |
| BLEU-4 | 30.8 | 16.9 | 30.6 |
| BERTScore (RoBERTa-F) | 39.2 | 22.8 | 30.1 |
| TIGEr | N/A | N/A | 45.4 |
| BERTScore++ | N/A | N/A | 44.9 |
| LEIC* | N/A | 29.5 | N/A |
| Metric | EditScore | Image RLS | Image Cosine | Text Sim. | Human Score |
|---|---|---|---|---|---|
| EditScore | 1.00 | -0.78 | 0.78 | 0.05 | 0.14 |
| Image RLS | -0.78 | 1.00 | -0.74 | 0.00 | -0.12 |
| Image Cosine Sim. | 0.78 | -0.74 | 1.00 | 0.01 | 0.09 |
| Text Similarity | 0.05 | 0.00 | 0.01 | 1.00 | 0.05 |
| Human Score | 0.14 | -0.12 | 0.09 | 0.05 | 1.00 |
| Metric | Kendall Tau-b | Kendall Tau-c |
|---|---|---|
| Human Score | 1.000 | 1.000 |
| EditScore | 0.259 | 0.253 |
| GPT-based Score | 0.192 | 0.189 |
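For reference, Kendall rank correlations like these can be computed with SciPy; a small illustration with made-up scores (not the repo's evaluation pipeline):

```python
from scipy.stats import kendalltau

# Hypothetical per-sample scores; substitute real metric and human ratings.
edit_scores = [0.81, 0.42, 0.67, 0.55, 0.90]
human_scores = [5, 2, 4, 3, 5]

tau_b, _ = kendalltau(edit_scores, human_scores, variant="b")
tau_c, _ = kendalltau(edit_scores, human_scores, variant="c")
print(f"Kendall tau-b: {tau_b:.3f}, tau-c: {tau_c:.3f}")
```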
For multi-GPU evaluation:

```bash
python evaluate_editscore_metrics.py \
    --mode multi_gpu \
    --num_gpus 4 \
    --images_dir ./images \
    --results_dir ./results \
    --limit 1000
```

```python
# Custom editing prompt
prompt = "Transform the image into a watercolor painting style"

output = inferencer(
    image=image,
    text=prompt,
    think=True,
    cfg_text_scale=4.0,
    cfg_img_scale=1.5,
    num_timesteps=50
)
```

For limited GPU memory:
- Use batch processing with `--batch_size 1`
- Enable memory cleanup with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` (see the sketch below)
- Process data incrementally with the `--resume_from` flag
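If you drive evaluation from Python rather than the shell, the allocator setting from the second tip can be applied before `torch` is imported; a minimal sketch:

```python
import os

# Must be set before torch initializes CUDA for the setting to take effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the env var on purpose
```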
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- BAGEL Model: Based on ByteDance-Seed/BAGEL-7B-MoT
- Datasets: Flickr8k, SEED-Data-Edit
- Inspiration: CLIPScore, PAC-S, and other vision-language metrics
This repository contains the essential components for BagelScore evaluation:
- Core implementation files in the root directory
- Model architecture in
modeling/ - Evaluation tools in
eval/ - Training scripts in
train/
For questions and feedback:
- Issues: GitHub Issues
- Email: yins25@tsinghua.mails.edu.cn
- Paper: arXiv:XXXX.XXXXX
⭐ Star us on GitHub if you find this project helpful!