cmubig/ECB

Evaluation Cultural Bias 🌍

A Comprehensive Framework for Assessing Cultural Representation in Generative Image Models

Python 3.12.4+ License: MIT

This repository contains the implementation and evaluation framework for Evaluation Cultural Bias (ECB), a comprehensive methodology for assessing cultural representation and bias in generative image models across multiple countries and cultural contexts.

🎯 Project Overview

ECB introduces a comprehensive evaluation framework that includes:

  • Image Generation: T2I (Text-to-Image) and I2I (Image-to-Image) pipelines for multiple models
  • Cultural Metrics: Cultural appropriateness, representation accuracy, and contextual sensitivity
  • General Metrics: Technical quality, prompt adherence, and perceptual fidelity

Key Components

  1. Multi-Model Image Generation: T2I and I2I pipelines for 5 different generative models
  2. Structured Cultural Evaluation: Context-aware assessment using cultural knowledge bases
  3. VLM-based Evaluation: Vision-Language Models for cultural understanding
  4. Model Comparison: Audit across 6 countries and 8 cultural categories
  5. Human Survey Platform: Web-based interface for collecting human evaluation data
  6. Analysis Pipeline: Statistical analysis and visualization tools

📁 Repository Structure

ECB/
├── 📊 dataset/                    # Generated images and metadata
│   ├── flux/                      # FLUX model outputs
│   ├── hidream/                   # HiDream model outputs
│   ├── qwen/                      # Qwen-VL model outputs
│   ├── nextstep/                  # NextStep model outputs
│   └── sd35/                      # Stable Diffusion 3.5 outputs
│
├── 🔬 evaluation/                 # Evaluation framework
│   ├── cultural_metric/           # Cultural assessment pipeline
│   │   ├── enhanced_cultural_metric_pipeline.py  # Main evaluation script
│   │   ├── build_cultural_index.py               # Knowledge base builder
│   │   └── vector_store/          # FAISS-based cultural knowledge index
│   ├── general_metric/            # Technical quality assessment
│   │   └── multi_metric_evaluation.py            # CLIP, FID, LPIPS metrics
│   ├── analysis/                  # Statistical analysis and visualization
│   │   ├── scripts/               # All analysis scripts
│   │   │   ├── core/              # Core analysis scripts
│   │   │   ├── single_model/      # Individual model analysis
│   │   │   ├── multi_model_*_analysis.py  # Cross-model comparisons
│   │   │   └── run_analysis.py    # Main execution interface
│   │   └── results/               # All analysis results
│   │       ├── individual/        # Individual model charts (5 models × 2 types)
│   │       ├── comparison/        # Multi-model comparison charts
│   │       └── summary/           # Summary charts
│   └── survey_app/                # Human evaluation interface
│       ├── app.py                 # Flask web application
│       └── responses/             # Human survey responses
│
├── 🏭 generator/                  # Image generation pipelines
│   ├── T2I/                       # Text-to-Image generation
│   │   ├── flux/                  # FLUX T2I implementation
│   │   ├── hidream/               # HiDream T2I implementation
│   │   ├── qwen/generate_qwen_image.py    # Qwen-VL T2I implementation
│   │   ├── nextstep/generate_nextstep.py  # NextStep T2I implementation
│   │   └── sd35/                  # Stable Diffusion 3.5 T2I
│   └── I2I/                       # Image-to-Image editing
│       ├── flux/                  # FLUX I2I implementation
│       ├── hidream/               # HiDream I2I implementation
│       ├── qwen/edit_qwen_image.py        # Qwen-VL I2I implementation
│       ├── nextstep/edit_nextstep.py      # NextStep I2I implementation
│       └── sd35/                  # Stable Diffusion 3.5 I2I
│
├── 🌐 ecb-human-survey/           # Next.js web application
│   ├── src/                       # React components and logic
│   ├── public/                    # Static assets
│   └── firebase.json              # Firebase configuration
│
├── 📚 external_data/              # Cultural reference documents
│   ├── China.pdf                  # Cultural knowledge sources
│   ├── India.pdf
│   └── [Other countries...]
│
├── 📄 iaseai26-paper/             # Research paper and documentation
│   └── IASEAI26.pdf               # Academic publication
│
└── 🔧 Configuration Files
    ├── requirements.txt           # Python dependencies
    └── run_*.sh                   # Execution scripts

🚀 Quick Start

Prerequisites

# Python environment
conda create -n ecb python=3.12
conda activate ecb

# Install dependencies
pip install -r evaluation/cultural_metric/requirements.txt
pip install -r evaluation/general_metric/requirements.txt

1. Image Generation (Optional - if you want to generate new images)

# Text-to-Image generation (run from the repository root)
cd generator/T2I/flux/
python generate_t2i.py --prompts prompts.json --output ../../../dataset/flux/base/

# Image-to-Image editing
cd ../../I2I/flux/
python generate_i2i.py --base-images ../../../dataset/flux/base/ --edit-prompts edit_prompts.json --output ../../../dataset/flux/edit_1/

2. Cultural Knowledge Base Setup

# From the repository root
cd evaluation/cultural_metric/
python build_cultural_index.py \
    --data-dir ../../external_data/ \
    --output-dir vector_store/
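
To give a feel for what a vector index does at query time, here is a minimal pure-Python sketch of exact inner-product search, which is the operation a flat FAISS index such as `IndexFlatIP` performs at scale. This is not the code in build_cultural_index.py, and the repository may use a different index type or similarity measure. The passages, the three-dimensional embeddings, and the `top_k` helper are all hypothetical; a real pipeline would obtain embeddings from a text encoder.

```python
from typing import List, Sequence, Tuple

# Hypothetical passages with made-up 3-dimensional embeddings.
PASSAGES: List[Tuple[str, Sequence[float]]] = [
    ("Hanok houses use curved tiled roofs.", (0.9, 0.1, 0.0)),
    ("Jollof rice is a popular West African dish.", (0.0, 0.2, 0.9)),
    ("Siheyuan are courtyard residences.", (0.8, 0.3, 0.1)),
]


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query: Sequence[float], k: int = 2) -> List[Tuple[float, str]]:
    """Score every passage against the query and return the k best."""
    scored = [(inner_product(query, emb), text) for text, emb in PASSAGES]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


results = top_k((1.0, 0.0, 0.0), k=2)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

Brute-force scoring like this is fine for a handful of passages; a dedicated index matters once the knowledge base grows to many thousands of chunks.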

3. Run Cultural Evaluation

python enhanced_cultural_metric_pipeline.py \
    --input-csv ../../dataset/flux/prompt-img-path.csv \
    --image-root ../../dataset/flux/ \
    --summary-csv results/flux_cultural_summary.csv \
    --detail-csv results/flux_cultural_details.csv \
    --index-dir vector_store/ \
    --load-in-4bit \
    --max-samples 50
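
Once the summary CSV exists, a quick per-country average is often the first thing worth checking. The sketch below shows one way to do that with the standard library only. The column names (`country`, `category`, `cultural_representative`) and the sample rows are assumptions made for illustration; check the header of the file the pipeline actually writes before reusing this.

```python
import csv
import io
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Hypothetical stand-in for a summary CSV; the real column names may differ.
SAMPLE_CSV = """country,category,cultural_representative
China,food,4
China,art,3
Kenya,food,2
Kenya,art,3
"""


def mean_score_by_country(csv_text: str, score_column: str) -> Dict[str, float]:
    """Group rows by country and average the chosen score column."""
    scores: Dict[str, List[float]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        scores[row["country"]].append(float(row[score_column]))
    return {country: mean(values) for country, values in scores.items()}


averages = mean_score_by_country(SAMPLE_CSV, "cultural_representative")
print(averages)
```

To run it on a real file, read the file contents into a string (or adapt the function to take an open file handle) and pass the actual score column name.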

4. Run General Metrics Evaluation

# From the repository root
cd evaluation/general_metric/
python multi_metric_evaluation.py \
    --input-csv ../../dataset/flux/prompt-img-path.csv \
    --image-root ../../dataset/flux/ \
    --output-csv results/flux_general_metrics.csv

5. Generate Analysis Reports

# From the repository root
cd evaluation/analysis/scripts/
python3 run_analysis.py  # Run all analyses
python3 run_analysis.py --analysis-type single --single-type cultural --models flux
python3 run_analysis.py --analysis-type multi  # Cross-model comparison
python3 run_analysis.py --analysis-type core   # Summary analysis
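
If you prefer to drive the single-model analysis from Python, the commands above can be wrapped in a small loop. This is a sketch, not part of the repository: `build_command` and `run_all` are hypothetical helpers, only the flags shown in this README are used, and the script is assumed to be started from evaluation/analysis/scripts/. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys
from typing import List

# Model keys as they appear in this README's commands.
MODELS = ["flux", "hidream", "qwen", "nextstep", "sd35"]


def build_command(model: str) -> List[str]:
    """Build the single-model cultural analysis command for one model."""
    return [
        sys.executable, "run_analysis.py",
        "--analysis-type", "single",
        "--single-type", "cultural",
        "--models", model,
    ]


def run_all(dry_run: bool = True) -> List[List[str]]:
    """Print (dry run) or execute the analysis command for every model."""
    commands = [build_command(model) for model in MODELS]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if a run fails.
            subprocess.run(command, check=True)
    return commands


commands = run_all(dry_run=True)
```

Call `run_all(dry_run=False)` to actually execute the analyses one model at a time.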

📊 Evaluation Metrics

Cultural Metrics

| Metric | Description | Range | Evaluator |
|--------|-------------|-------|-----------|
| Cultural Representative | How well the image represents cultural elements | 1-5 | Qwen2-VL |
| Prompt Alignment | Alignment with cultural context prompts | 1-5 | Qwen2-VL |
| Cultural Accuracy | Binary classification accuracy (yes/no questions) | 0-1 | LLM-generated Q&A |
| Group Ranking | Best/worst selection within cultural groups | Rank | Multi-image VLM |
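
As an illustration of the Cultural Accuracy row, the score for one image can be read as the fraction of yes/no questions whose answer matches the expected one. The sketch below assumes exactly that definition; the function name and the sample answers are hypothetical, and the pipeline's own scoring may normalize or weight answers differently.

```python
from typing import Sequence


def cultural_accuracy(
    model_answers: Sequence[str], expected_answers: Sequence[str]
) -> float:
    """Fraction of yes/no questions where the answer matches the expected one."""
    if len(model_answers) != len(expected_answers):
        raise ValueError("answer lists must have the same length")
    if not expected_answers:
        raise ValueError("at least one question is required")
    matches = sum(
        got.strip().lower() == want.strip().lower()
        for got, want in zip(model_answers, expected_answers)
    )
    return matches / len(expected_answers)


# Hypothetical answers to four yes/no questions about one image.
score = cultural_accuracy(["yes", "No", "yes", "no"], ["yes", "no", "no", "no"])
print(f"Cultural Accuracy: {score:.2f}")
```

Three of the four answers match here, so the result lands inside the 0-1 range listed in the table.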

General Metrics

| Metric | Description | Range | Method |
|--------|-------------|-------|--------|
| CLIP Score | Semantic similarity to prompt | 0-1 | CLIP ViT-L/14 |
| Aesthetic Score | Perceptual aesthetic quality | 0-10 | LAION Aesthetic |
| FID | Image distribution similarity | 0-∞ | Inception features |
| LPIPS | Perceptual distance | 0-1 | AlexNet features |
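
For intuition about the CLIP Score row: a CLIP-style score is typically the cosine similarity between an image embedding and a text embedding. Raw cosine similarity lies in [-1, 1], so the sketch below assumes negative values are clamped to 0 to match the 0-1 range in the table. That clamping, the function name, and the toy vectors are assumptions; multi_metric_evaluation.py may scale the score differently.

```python
import math
from typing import Sequence


def clamped_cosine(image_emb: Sequence[float], text_emb: Sequence[float]) -> float:
    """Cosine similarity between two embeddings, clamped to the range [0, 1]."""
    if len(image_emb) != len(text_emb):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm = math.sqrt(sum(a * a for a in image_emb)) * math.sqrt(
        sum(b * b for b in text_emb)
    )
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return max(0.0, dot / norm)


# Toy 2-dimensional vectors standing in for real CLIP embeddings.
print(clamped_cosine([1.0, 1.0], [1.0, 0.0]))
print(clamped_cosine([1.0, 0.0], [-1.0, 0.0]))
```

Real embeddings come from the CLIP image and text encoders and have hundreds of dimensions, but the arithmetic is the same.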

🌍 Evaluation Scope

Countries Covered

  • 🇨🇳 China
  • 🇮🇳 India
  • 🇰🇷 South Korea
  • 🇰🇪 Kenya
  • 🇳🇬 Nigeria
  • 🇺🇸 United States

Cultural Categories

  • 🏛️ Architecture (Traditional/Modern Houses, Landmarks)
  • 🎨 Art (Dance, Painting, Sculpture)
  • 🎉 Events (Festivals, Weddings, Funerals, Sports)
  • 👗 Fashion (Clothing, Accessories, Makeup)
  • 🍜 Food (Dishes, Desserts, Beverages, Staples)
  • 🏞️ Landscape (Cities, Countryside, Nature)
  • 👥 People (Various Professions and Roles)
  • 🦁 Wildlife (Animals, Plants)

Models Evaluated

  • FLUX: State-of-the-art diffusion model
  • HiDream: High-resolution generation model
  • Qwen-VL: Vision-language multimodal model
  • NextStep: Advanced editing-focused model
  • Stable Diffusion 3.5: Popular open-source model
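
The scope above can be enumerated as a grid of model, country, and category combinations. The sketch below uses the lowercase identifiers that appear in this README's batch generation command; the constant and variable names are hypothetical, and how the repository iterates the grid internally is not shown here.

```python
from itertools import product

MODELS = ["flux", "hidream", "qwen", "nextstep", "sd35"]
COUNTRIES = ["china", "india", "korea", "kenya", "nigeria", "usa"]
CATEGORIES = [
    "architecture", "art", "event", "fashion",
    "food", "landscape", "people", "wildlife",
]

# Every (model, country, category) combination covered by the audit.
grid = list(product(MODELS, COUNTRIES, CATEGORIES))

print(len(grid))  # 5 models * 6 countries * 8 categories = 240
print(grid[0])    # ('flux', 'china', 'architecture')
```

Iterating over such a grid is a convenient way to check that every cell has generated images and evaluation results.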

🔧 Advanced Usage

Batch Generation Pipeline

# Generate images for all models and all cultural categories
cd generator/
python batch_generation.py \
    --models flux hidream qwen nextstep sd35 \
    --countries china india korea kenya nigeria usa \
    --categories architecture art event fashion food landscape people wildlife \
    --output-dir ../dataset/

Custom Image Generation

from generator.T2I.flux import FluxT2IGenerator
from generator.I2I.flux import FluxI2IGenerator

# T2I generation: create a base image from a text prompt
t2i_gen = FluxT2IGenerator()
image = t2i_gen.generate("Traditional Chinese architecture house, photorealistic")

# I2I editing: reuse the T2I output as the base image
base_image = image
i2i_gen = FluxI2IGenerator()
edited_image = i2i_gen.edit(base_image, "Change to represent Korean architecture")

Custom Cultural Knowledge Integration

from evaluation.cultural_metric.build_cultural_index import CulturalIndexBuilder

builder = CulturalIndexBuilder()
builder.add_cultural_documents(
    country="MyCountry",
    documents=["path/to/cultural_doc.pdf"],
    categories=["architecture", "food", "art"]
)
builder.build_index("custom_vector_store/")

Batch Evaluation Pipeline

# Evaluate all models with cultural and general metrics
cd evaluation/analysis/scripts/
python3 run_analysis.py  # Run complete analysis for all 5 models
python3 run_analysis.py --models flux hidream nextstep qwen sd35 --analysis-type all

Human Survey Integration

cd ecb-human-survey/
npm install
npm run dev  # Start web interface on localhost:3000

📈 Results and Analysis

Key Findings

  1. Cultural Representation Gaps: Representation quality varies across countries and categories
  2. Model-Specific Biases: Different models show different cultural blind spots
  3. Category-Dependent Performance: Architecture and food show better representation than people and events
  4. Editing Consistency: How well cultural consistency survives progressive editing differs across models

Visualization Outputs

  • Individual Model Charts: 13 cultural + 6 general charts per model (5 models total)
  • Multi-Model Comparison: Cross-model performance comparison charts
  • Summary Charts: Core metrics overview and insights
  • Organized Structure: Clean separation of scripts and results in evaluation/analysis/

Analysis Structure

evaluation/analysis/
├── scripts/          # All analysis scripts
├── results/          # All generated charts
│   ├── individual/   # Individual model results (5 models × 2 types)
│   ├── comparison/   # Multi-model comparison charts
│   └── summary/      # Summary and overview charts

🤝 Contributing

Contributions welcome! Please see our Contributing Guidelines for details.

Areas for Contribution

  • Additional cultural knowledge sources
  • New evaluation metrics
  • Model integration
  • Visualization improvements
  • Survey interface enhancements

📚 Citation

If you use ECB in your research, please cite:

@inproceedings{ecb2024,
  title={Exposing Cultural Blindspots: A Structured Audit of Generative Image Models},
  author={[Author Names]},
  booktitle={Proceedings of IASEAI 2026},
  year={2024}
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Cultural knowledge sources from international organizations
  • Open-source model providers (FLUX, Stable Diffusion, Qwen)
  • Human evaluation participants
  • Academic collaborators and reviewers

📞 Contact

For questions, issues, or collaboration:


Evaluation Cultural Bias: Making Cultural Representation Visible, Measurable, and Improvable 🌍
