Lylll9436/pixels-to-predicates

Graph-based Urban Perception Prediction

License: MIT Python 3.8+ PyTorch

A deep learning framework for predicting human perceptual preferences for urban street scenes using graph-based scene representations. The pipeline combines scene graph generation, graph masked autoencoders (GraphMAE), and Bradley-Terry comparison models.

🎯 Overview

This research project addresses the challenge of understanding and predicting human perception of urban environments. By modeling street scenes as structured graphs rather than raw pixel arrays, we capture the semantic relationships between objects, their attributes, and spatial arrangements.

Key Features

  • Scene Graph Generation: Converts natural language image descriptions into structured entity-relationship graphs
  • Graph Representation Learning: Self-supervised pre-training using GraphMAE on scene graph structures
  • Preference Prediction: Bradley-Terry pairwise comparison model for perceptual quality ranking
  • Comprehensive Baselines: Fair comparisons with CNN (ResNet50), Vision Transformer (ViT), and CLIP-based approaches
  • Multi-dimensional Analysis: Evaluation across multiple perceptual dimensions (safety, liveliness, beauty, etc.)

🏗️ Architecture

Input Images → Scene Descriptions → Scene Graphs → Graph Encoding → Preference Prediction
     ↓              (Gemini)           (NLP)        (GraphMAE)     (Bradley-Terry)
  Raw Pixels    Text Descriptions   Structured    Vector Repr.    Pairwise Scores
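
To make the "Scene Graphs" stage more concrete, the sketch below turns (subject, relation, object) triples into a node list and a two-row edge list, roughly the `[sources, targets]` layout that graph libraries such as PyTorch Geometric use for `edge_index`. The example triples, the `build_graph` helper, and the output layout are illustrative assumptions, not the format actually produced by `src/03_build_scene_graphs.py`.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def build_graph(triples: List[Triple]):
    """Map entity names to integer ids and collect directed edges."""
    node_ids: Dict[str, int] = {}
    sources: List[int] = []
    targets: List[int] = []
    relations: List[str] = []
    for subj, rel, obj in triples:
        for name in (subj, obj):
            # Assign ids in order of first appearance.
            node_ids.setdefault(name, len(node_ids))
        sources.append(node_ids[subj])
        targets.append(node_ids[obj])
        relations.append(rel)
    nodes = sorted(node_ids, key=node_ids.get)
    # Two-row [sources, targets] layout (hypothetical output format).
    return nodes, [sources, targets], relations


# Hypothetical triples for one street scene.
triples = [
    ("tree", "next to", "sidewalk"),
    ("car", "parked on", "road"),
    ("sidewalk", "along", "road"),
]
nodes, edge_index, relations = build_graph(triples)
print(nodes)
print(edge_index)
```

Each relation label stays aligned with its edge by position, so edge attributes (for example, text embeddings of the relation) could be attached in the same order.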

📋 Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • PyTorch Geometric 2.0+
  • Additional dependencies listed in requirements.txt

🚀 Installation

1. Clone the repository

git clone https://github.com/Lylll9436/structure_image.git
cd structure_image

2. Create virtual environment (recommended)

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install dependencies

pip install -r requirements.txt

4. Configure API credentials

Copy the example configuration and add your API keys:

cp config/.env.example config/.env

Edit config/.env with your actual API credentials:

API_BASE=https://your-api-endpoint.com
MODEL=gemini-2.5-flash
API_KEYS=your_api_key_1,your_api_key_2
PER_KEY_WORKERS=1
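
The four settings above suggest that description requests are spread across several API keys, with a fixed number of workers per key. The following is a minimal sketch of how such a file could be read, assuming `API_KEYS` is a comma-separated list and that the total worker count is the number of keys times `PER_KEY_WORKERS`. The `parse_env` helper is hypothetical; the repository may load its configuration differently.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


example = """
API_BASE=https://your-api-endpoint.com
MODEL=gemini-2.5-flash
API_KEYS=your_api_key_1,your_api_key_2
PER_KEY_WORKERS=1
"""

settings = parse_env(example)
# Assumption: API_KEYS is a comma-separated list.
api_keys = [k.strip() for k in settings["API_KEYS"].split(",") if k.strip()]
# Assumption: each key gets PER_KEY_WORKERS concurrent workers.
total_workers = len(api_keys) * int(settings["PER_KEY_WORKERS"])
print(api_keys, total_workers)
```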

📊 Data Preparation

This project expects image data organized in specific directories. Please refer to data/README.md for detailed information on data structure and preparation.

🔬 Usage

Complete Pipeline

Run the full pipeline from image description to evaluation:

# Step 1: Generate scene descriptions
python src/01_describe_pic.py

# Step 2: Merge data with metadata
python src/02_merge_data.py csv output/stage_01_descriptions/PP2/*.json data/PP2/metadata/final_data.csv -o output/stage_02_merged/pp2_full.json

# Step 3: Build scene graphs
python src/03_build_scene_graphs.py --input output/stage_02_merged/pp2_full.json --output output/stage_03_scene_graphs

# Step 4: Convert to PyTorch format
python src/04_convert_to_pytorch.py --inputs output/stage_03_scene_graphs --output-dir output/stage_04_pytorch

# Step 5: Pre-train GraphMAE
python src/05_graph_vae.py --data-dir output/stage_04_pytorch --output-dir ./packed --num-epochs 100

# Step 6: Train comparison model
python src/06_comparison_trainer.py --backbone graphmae --graphs-dir output/stage_03_scene_graphs --repr-file packed/graph_representations.pt

# Step 7: Evaluate and visualize
python src/07_evaluate_and_visualize.py --graphs-dir output/stage_03_scene_graphs --repr-file packed/graph_representations.pt

Training Baseline Models

CNN Baseline (ResNet50)

python src/06_comparison_trainer.py \
    --backbone cnn \
    --image-root data/PP2/final_photo_dataset \
    --graphs-dir output/stage_03_scene_graphs \
    --epochs 100 \
    --batch-size 32

Vision Transformer Baseline

python src/06_comparison_trainer.py \
    --backbone vit \
    --image-root data/PP2/final_photo_dataset \
    --graphs-dir output/stage_03_scene_graphs \
    --epochs 100

CLIP Baseline

python src/06_comparison_trainer.py \
    --backbone clip \
    --image-root data/PP2/final_photo_dataset \
    --graphs-dir output/stage_03_scene_graphs \
    --epochs 100

GraphMAE Baseline

python src/06_comparison_trainer.py \
    --backbone graphmae \
    --graphs-dir output/stage_03_scene_graphs \
    --repr-file packed/graph_representations.pt \
    --graph-epochs 220

📈 Results

The model's performance is evaluated using multiple metrics:

  • Accuracy: Overall pairwise comparison accuracy
  • AUC-ROC: Area under the receiver operating characteristic curve
  • Category-wise Analysis: Per-category performance breakdown
  • Cross-validation: Robust evaluation across multiple splits

Results are saved to the result/ directory as JSON metrics and visualizations.
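
As a rough illustration of the first two metrics, the sketch below computes pairwise accuracy and AUC-ROC from predicted preference probabilities. The function names and the example predictions are hypothetical, and the evaluation script may rely on a library implementation instead. AUC is computed here as the probability that a positive pair receives a higher score than a negative one, with ties counted as 0.5.

```python
from typing import List


def pairwise_accuracy(probs: List[float], labels: List[int],
                      threshold: float = 0.5) -> float:
    """Fraction of pairs whose thresholded probability matches the label."""
    correct = sum((p >= threshold) == bool(y) for p, y in zip(probs, labels))
    return correct / len(labels)


def auc_roc(probs: List[float], labels: List[int]) -> float:
    """Probability that a positive pair outscores a negative one."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outputs: P(left image preferred) and the human choice
# (1 = left image chosen, 0 = right image chosen).
probs = [0.9, 0.7, 0.4, 0.6, 0.2]
labels = [1, 1, 0, 0, 0]
acc = pairwise_accuracy(probs, labels)
auc = auc_roc(probs, labels)
print(acc, auc)
```

In this toy example every positive pair outscores every negative pair, so the ranking is perfect even though one pair falls on the wrong side of the 0.5 threshold. That is why accuracy and AUC are worth reporting together.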

📁 Project Structure

structure_image/
├── src/                           # Source code
│   ├── 01_describe_pic.py         # Image description generation
│   ├── 02_merge_data.py           # Data merging utilities
│   ├── 03_build_scene_graphs.py   # Scene graph construction
│   ├── 04_convert_to_pytorch.py   # PyTorch data conversion
│   ├── 05_graph_vae.py            # GraphMAE pre-training
│   ├── 06_comparison_trainer.py   # Comparison model training
│   ├── 07_evaluate_and_visualize.py # Evaluation & visualization
│   ├── 08_radar.py                # Radar chart visualization
│   ├── 09_reasoning.py            # Perceptual reasoning analysis
│   └── 10_rel_visual.py           # Relationship visualization
├── config/                        # Configuration files
│   └── .env.example               # Environment configuration template
├── data/                          # Data directory (gitignored)
│   └── README.md                  # Data preparation guide
├── docs/                          # Additional documentation
├── output/                        # Generated outputs (gitignored)
├── result/                        # Model results (gitignored)
├── logs/                          # Training logs (gitignored)
├── .gitignore                     # Git ignore rules
├── LICENSE                        # MIT License
├── README.md                      # This file
└── requirements.txt               # Python dependencies

🔍 Key Implementation Details

Fair Model Comparison

All baseline models are trained under identical conditions to ensure fair comparison:

  • Training Configuration:

    • Epochs: 100 (with early stopping patience: 20)
    • Learning rate: 1e-4
    • Weight decay: 5e-5
    • Dropout: 0.1
    • Hidden dimensions: [512, 256, 128]
    • Batch size: 32
  • Training Strategy:

    • Fine-tuning: All pre-trained backbones are fine-tuned end-to-end (not frozen)
    • Optimizer: AdamW with gradient clipping
    • Loss function: Binary cross-entropy with logits
    • Data split: 70% train, 15% validation, 15% test
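
The combination of a Bradley-Terry comparison model with a binary cross-entropy loss on logits can be sketched as follows. In the standard Bradley-Terry formulation each image gets a scalar score, the probability that image i is preferred over image j is the sigmoid of the score difference, and that difference serves as the logit for the loss. The scores below are made-up numbers and the function names are hypothetical; the comparison head in `src/06_comparison_trainer.py` may differ in detail.

```python
import math


def bradley_terry_prob(score_i: float, score_j: float) -> float:
    """P(image i is preferred over image j) = sigmoid(s_i - s_j)."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def bce_with_logits(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy on a raw logit."""
    # Equals -[t * log(sigmoid(x)) + (1 - t) * log(1 - sigmoid(x))].
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# Hypothetical scores produced by a backbone for two images.
s_left, s_right = 1.2, 0.2
p_left = bradley_terry_prob(s_left, s_right)
loss_if_left_chosen = bce_with_logits(s_left - s_right, 1.0)
loss_if_right_chosen = bce_with_logits(s_left - s_right, 0.0)
print(round(p_left, 4), round(loss_if_left_chosen, 4), round(loss_if_right_chosen, 4))
```

Because only the score difference enters the loss, the scores are identifiable only up to an additive constant, which does not matter for ranking.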

GraphMAE Pre-training

  • Self-supervised masked graph reconstruction
  • Embedding dimension: 128
  • Extended training: 220 epochs for convergence
  • Node-level and graph-level representation learning
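
Masked graph reconstruction hides the features of some nodes and trains the model to recover them. The sketch below shows only the masking step and the scaled cosine error described in the original GraphMAE paper. The mask ratio, the exponent `gamma`, and the zero vector used as a mask token (GraphMAE uses a learnable token) are illustrative assumptions, not values confirmed for this repository, and the encoder and decoder networks are omitted.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def mask_nodes(features: List[Vector], mask_ratio: float,
               rng: random.Random) -> Tuple[List[Vector], List[int]]:
    """Replace a random subset of node feature vectors with a zero token."""
    num_masked = max(1, int(round(mask_ratio * len(features))))
    masked_ids = sorted(rng.sample(range(len(features)), num_masked))
    masked = [list(f) for f in features]
    for i in masked_ids:
        masked[i] = [0.0] * len(features[i])
    return masked, masked_ids


def scaled_cosine_error(original: Vector, reconstructed: Vector,
                        gamma: float = 2.0) -> float:
    """(1 - cosine similarity) ** gamma for one node."""
    dot = sum(a * b for a, b in zip(original, reconstructed))
    norm = (math.sqrt(sum(a * a for a in original))
            * math.sqrt(sum(b * b for b in reconstructed)))
    return (1.0 - dot / norm) ** gamma


rng = random.Random(0)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 2.0]]
masked, masked_ids = mask_nodes(features, mask_ratio=0.5, rng=rng)
# A perfect reconstruction has zero error; an orthogonal one has error 1.
perfect = scaled_cosine_error([1.0, 0.0], [2.0, 0.0])
orthogonal = scaled_cosine_error([1.0, 0.0], [0.0, 3.0])
print(masked_ids, perfect, orthogonal)
```

In training, the error would be averaged over the masked nodes only, so the model is scored on the features it could not see.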

📝 Citation

If you use this code in your research, please cite our CAADRIA 2026 paper (to appear). Until the proceedings are published, you can cite the project repository instead.

@inproceedings{liu2026pixels,
  title={From Pixels to Predicates: Structuring Urban Perception with Scene Graphs},
  author={Liu, Yunlong and Li, Shuyang and Liu, Pengyuan and Zhang, Yu and Stouffs, Rudi},
  booktitle={Proceedings of the 31st International Conference on Computer-Aided Architectural Design Research in Asia (CAADRIA 2026)},
  year={2026},
  note={To appear},
  url={https://github.com/Lylll9436/structure_image}
}

Note: The paper has been accepted for CAADRIA 2026. The citation information will be updated once the official proceedings are published.

Authors:

  • Yunlong Liu (Southeast University, China)
  • Shuyang Li (National University of Singapore / Singapore-ETH Centre)
  • Pengyuan Liu (University of Glasgow, United Kingdom)
  • Yu Zhang* (Southeast University, China)
  • Rudi Stouffs (National University of Singapore)

🀝 Contributing

This is a research project developed for academic purposes. Suggestions and discussions are welcome through issues and pull requests.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Scene graph generation powered by Google Gemini API
  • Graph neural network implementation based on PyTorch Geometric
  • Text embeddings using Sentence-BERT
  • Baseline models utilize pre-trained weights from ImageNet, CLIP (OpenAI), and other public sources

📧 Contact

For questions or collaboration opportunities, please open an issue or contact lyl_arch@seu.edu.cn.


Note: Large data files and trained model weights are excluded from this repository due to size constraints. Please follow the data preparation guide to set up your own datasets.
