This repository contains the implementation and evaluation framework for Evaluation Cultural Bias (ECB), a comprehensive methodology for assessing cultural representation and bias in generative image models across multiple countries and cultural contexts.
ECB introduces a comprehensive evaluation framework that includes:
- Multi-Model Image Generation: T2I (Text-to-Image) and I2I (Image-to-Image) pipelines for 5 generative models
- Cultural Metrics: Cultural appropriateness, representation accuracy, and contextual sensitivity
- General Metrics: Technical quality, prompt adherence, and perceptual fidelity
- Structured Cultural Evaluation: Context-aware assessment using cultural knowledge bases
- VLM-based Evaluation: Vision-Language Models for cultural understanding
- Model Comparison: Audit across 6 countries and 8 cultural categories
- Human Survey Platform: Web-based interface for collecting human evaluation data
- Analysis Pipeline: Statistical analysis and visualization tools
```
ECB/
├── dataset/                         # Generated images and metadata
│   ├── flux/                        # FLUX model outputs
│   ├── hidream/                     # HiDream model outputs
│   ├── qwen/                        # Qwen-VL model outputs
│   ├── nextstep/                    # NextStep model outputs
│   └── sd35/                        # Stable Diffusion 3.5 outputs
│
├── evaluation/                      # Evaluation framework
│   ├── cultural_metric/             # Cultural assessment pipeline
│   │   ├── enhanced_cultural_metric_pipeline.py  # Main evaluation script
│   │   ├── build_cultural_index.py  # Knowledge base builder
│   │   └── vector_store/            # FAISS-based cultural knowledge index
│   ├── general_metric/              # Technical quality assessment
│   │   └── multi_metric_evaluation.py  # CLIP, FID, LPIPS metrics
│   ├── analysis/                    # Statistical analysis and visualization
│   │   ├── scripts/                 # All analysis scripts
│   │   │   ├── core/                # Core analysis scripts
│   │   │   ├── single_model/        # Individual model analysis
│   │   │   ├── multi_model_*_analysis.py  # Cross-model comparisons
│   │   │   └── run_analysis.py      # Main execution interface
│   │   └── results/                 # All analysis results
│   │       ├── individual/          # Individual model charts (5 models × 2 types)
│   │       ├── comparison/          # Multi-model comparison charts
│   │       └── summary/             # Summary charts
│   └── survey_app/                  # Human evaluation interface
│       ├── app.py                   # Flask web application
│       └── responses/               # Human survey responses
│
├── generator/                       # Image generation pipelines
│   ├── T2I/                         # Text-to-Image generation
│   │   ├── flux/                    # FLUX T2I implementation
│   │   ├── hidream/                 # HiDream T2I implementation
│   │   ├── qwen/generate_qwen_image.py    # Qwen-VL T2I implementation
│   │   ├── nextstep/generate_nextstep.py  # NextStep T2I implementation
│   │   └── sd35/                    # Stable Diffusion 3.5 T2I
│   └── I2I/                         # Image-to-Image editing
│       ├── flux/                    # FLUX I2I implementation
│       ├── hidream/                 # HiDream I2I implementation
│       ├── qwen/edit_qwen_image.py  # Qwen-VL I2I implementation
│       ├── nextstep/edit_nextstep.py  # NextStep I2I implementation
│       └── sd35/                    # Stable Diffusion 3.5 I2I
│
├── ecb-human-survey/                # Next.js web application
│   ├── src/                         # React components and logic
│   ├── public/                      # Static assets
│   └── firebase.json                # Firebase configuration
│
├── external_data/                   # Cultural reference documents
│   ├── China.pdf                    # Cultural knowledge sources
│   ├── India.pdf
│   └── [Other countries...]
│
├── iaseai26-paper/                  # Research paper and documentation
│   └── IASEAI26.pdf                 # Academic publication
│
└── Configuration Files
    ├── requirements.txt             # Python dependencies
    └── run_*.sh                     # Execution scripts
```
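The `vector_store/` directory above holds a FAISS-based nearest-neighbour index over the cultural reference documents, which the cultural metric pipeline queries for context. The retrieval idea can be sketched with plain NumPy; the passages, 4-dimensional embeddings, and `retrieve` function below are illustrative stand-ins, not the repository's actual API:

```python
import numpy as np

# Toy "cultural knowledge index": one embedding per reference passage.
# The real index is built by build_cultural_index.py over PDF documents;
# these vectors and passages are made up for illustration.
passages = [
    "Hanok: traditional Korean house with curved tiled roofs",
    "Siheyuan: traditional Chinese courtyard residence",
    "Brownstone row houses common in US cities",
]
index = np.array([
    [0.9, 0.1, 0.0, 0.1],
    [0.1, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.1],
])

def retrieve(query_emb, k=1):
    """Return the k passages whose embeddings have highest cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = (index / np.linalg.norm(index, axis=1, keepdims=True)) @ q
    top = np.argsort(-sims)[:k]
    return [passages[i] for i in top]

# A query embedding close to the first index row retrieves the Hanok passage
print(retrieve(np.array([1.0, 0.2, 0.0, 0.0])))
```

In the actual pipeline the retrieved passages ground the VLM's cultural judgments, so the scores are conditioned on documented cultural knowledge rather than the model's priors alone.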
```bash
# Python environment
conda create -n ecb python=3.8
conda activate ecb

# Install dependencies
pip install -r evaluation/cultural_metric/requirements.txt
pip install -r evaluation/general_metric/requirements.txt
```

```bash
# Text-to-Image generation
cd generator/T2I/flux/
python generate_t2i.py --prompts prompts.json --output ../../dataset/flux/base/

# Image-to-Image editing
cd generator/I2I/flux/
python generate_i2i.py --base-images ../../dataset/flux/base/ --edit-prompts edit_prompts.json --output ../../dataset/flux/edit_1/
```

```bash
cd evaluation/cultural_metric/
python build_cultural_index.py \
    --data-dir ../../external_data/ \
    --output-dir vector_store/

python enhanced_cultural_metric_pipeline.py \
    --input-csv ../../dataset/flux/prompt-img-path.csv \
    --image-root ../../dataset/flux/ \
    --summary-csv results/flux_cultural_summary.csv \
    --detail-csv results/flux_cultural_details.csv \
    --index-dir vector_store/ \
    --load-in-4bit \
    --max-samples 50
```

```bash
cd evaluation/general_metric/
python multi_metric_evaluation.py \
    --input-csv ../../dataset/flux/prompt-img-path.csv \
    --image-root ../../dataset/flux/ \
    --output-csv results/flux_general_metrics.csv
```

```bash
cd evaluation/analysis/scripts/
python3 run_analysis.py                          # Run all analyses
python3 run_analysis.py --analysis-type single --single-type cultural --models flux
python3 run_analysis.py --analysis-type multi    # Cross-model comparison
python3 run_analysis.py --analysis-type core     # Summary analysis
```

| Metric | Description | Range | Evaluator |
|---|---|---|---|
| Cultural Representativeness | How well the image represents cultural elements | 1-5 | Qwen2-VL |
| Prompt Alignment | Alignment with cultural context prompts | 1-5 | Qwen2-VL |
| Cultural Accuracy | Binary classification accuracy (yes/no questions) | 0-1 | LLM-generated Q&A |
| Group Ranking | Best/worst selection within cultural groups | Rank | Multi-image VLM |
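The Cultural Accuracy metric above is a plain classification accuracy over LLM-generated yes/no questions. As an illustration (the function name and data shapes are assumptions, not the pipeline's actual interface), it reduces to:

```python
def cultural_accuracy(qa_results):
    """Fraction of yes/no questions answered correctly by the VLM.

    qa_results: list of (predicted, expected) answer pairs, e.g. ("yes", "yes").
    Returns a score in [0, 1], matching the table's 0-1 range.
    """
    if not qa_results:
        return 0.0
    correct = sum(
        1 for pred, gold in qa_results
        if pred.strip().lower() == gold.strip().lower()
    )
    return correct / len(qa_results)

# Hypothetical VLM answers for one image: 3 of 4 questions correct
answers = [("yes", "yes"), ("no", "yes"), ("yes", "yes"), ("no", "no")]
print(cultural_accuracy(answers))  # 0.75
```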
| Metric | Description | Range | Method |
|---|---|---|---|
| CLIP Score | Semantic similarity to prompt | 0-1 | CLIP ViT-L/14 |
| Aesthetic Score | Perceptual aesthetic quality | 0-10 | LAION Aesthetic |
| FID | Fréchet distance between real and generated feature distributions (lower is better) | 0-∞ | Inception features |
| LPIPS | Perceptual distance | 0-1 | AlexNet features |
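The CLIP Score in the table is the cosine similarity between the CLIP image and text embeddings. A minimal sketch of the scoring step, using dummy embeddings in place of real CLIP ViT-L/14 features and clamping negatives to zero to match the table's 0-1 range:

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between L2-normalized embeddings, clamped to [0, 1]."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(max(0.0, image_emb @ text_emb))

# Dummy 4-d embeddings (real CLIP ViT-L/14 embeddings are 768-d)
img = np.array([1.0, 0.0, 1.0, 0.0])
txt = np.array([1.0, 0.0, 0.0, 0.0])
print(clip_score(img, txt))  # ≈ 0.7071
```

In practice the embeddings come from a pretrained CLIP model applied to the generated image and its prompt; the arithmetic above is all the metric itself does.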
- China
- India
- South Korea
- Kenya
- Nigeria
- United States

- Architecture (Traditional/Modern Houses, Landmarks)
- Art (Dance, Painting, Sculpture)
- Events (Festivals, Weddings, Funerals, Sports)
- Fashion (Clothing, Accessories, Makeup)
- Food (Dishes, Desserts, Beverages, Staples)
- Landscape (Cities, Countryside, Nature)
- People (Various Professions and Roles)
- Wildlife (Animals, Plants)
- FLUX: State-of-the-art diffusion model
- HiDream: High-resolution generation model
- Qwen-VL: Vision-language multimodal model
- NextStep: Advanced editing-focused model
- Stable Diffusion 3.5: Popular open-source model
```bash
# Generate images for all models and all cultural categories
cd generator/
python batch_generation.py \
    --models flux hidream qwen nextstep sd35 \
    --countries china india korea kenya nigeria usa \
    --categories architecture art event fashion food landscape people wildlife \
    --output-dir ../dataset/
```

```python
from generator.T2I.flux import FluxT2IGenerator
from generator.I2I.flux import FluxI2IGenerator

# T2I Generation
t2i_gen = FluxT2IGenerator()
image = t2i_gen.generate("Traditional Chinese architecture house, photorealistic")

# I2I Editing
i2i_gen = FluxI2IGenerator()
edited_image = i2i_gen.edit(base_image, "Change to represent Korean architecture")
```

```python
from evaluation.cultural_metric.build_cultural_index import CulturalIndexBuilder

builder = CulturalIndexBuilder()
builder.add_cultural_documents(
    country="MyCountry",
    documents=["path/to/cultural_doc.pdf"],
    categories=["architecture", "food", "art"],
)
builder.build_index("custom_vector_store/")
```

```bash
# Evaluate all models with cultural and general metrics
cd evaluation/analysis/scripts/
python3 run_analysis.py    # Run complete analysis for all 5 models
python3 run_analysis.py --models flux hidream nextstep qwen sd35 --analysis-type all
```

```bash
cd ecb-human-survey/
npm install
npm run dev    # Start web interface on localhost:3000
```

- Cultural Representation Gaps: Variations across countries and categories
- Model-Specific Biases: Different models show different cultural blind spots
- Category-Dependent Performance: Architecture and food show better representation than people and events
- Editing Consistency: Progressive editing maintains cultural consistency differently across models
- Individual Model Charts: 13 cultural + 6 general charts per model (5 models total)
- Multi-Model Comparison: Cross-model performance comparison charts
- Summary Charts: Core metrics overview and insights
- Organized Structure: Clean separation of scripts and results in `evaluation/analysis/`

```
evaluation/analysis/
├── scripts/        # All analysis scripts
└── results/        # All generated charts
    ├── individual/ # Individual model results (5 models × 2 types)
    ├── comparison/ # Multi-model comparison charts
    └── summary/    # Summary and overview charts
```
Contributions welcome! Please see our Contributing Guidelines for details.
- Additional cultural knowledge sources
- New evaluation metrics
- Model integration
- Visualization improvements
- Survey interface enhancements
If you use ECB in your research, please cite:
```bibtex
@inproceedings{ecb2026,
  title     = {Exposing Cultural Blindspots: A Structured Audit of Generative Image Models},
  author    = {[Author Names]},
  booktitle = {Proceedings of IASEAI 2026},
  year      = {2026}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
- Cultural knowledge sources from international organizations
- Open-source model providers (FLUX, Stable Diffusion, Qwen)
- Human evaluation participants
- Academic collaborators and reviewers
For questions, issues, or collaboration:
- Email: [contact@ecb-project.org]
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Evaluation Cultural Bias: Making Cultural Representation Visible, Measurable, and Improvable