CLIP-MHAdapter: Parameter-Efficient Multi-Head Self-Attention Adapter for Street-View Image Classification
This repository contains the MSc research project for CEGE0049, focused on global street scene classification using CLIP-based few-shot learning frameworks such as Dassl and CoOp. The goal is to classify scene conditions (weather, glare, and lighting) from street-level images in the Global Streetscapes dataset (https://doi.org/10.1016/j.isprsjprs.2024.06.023), leveraging vision-language models.
This repository implements CLIP-MHAdapter, a lightweight and parameter-efficient method for adapting CLIP to street-view image attribute classification. Instead of fully fine-tuning the large CLIP backbone, the approach introduces a multi-head self-attention adapter that refines patch-level token representations, enabling the model to better capture localized visual cues in cluttered urban scenes. Built on top of Dassl and CoOp, the framework leverages the Global StreetScapes dataset to classify scene conditions such as weather, glare, and lighting. With only ~1.38M trainable parameters (≈1% of full fine-tuning), CLIP-MHAdapter achieves competitive or superior performance compared to strong baselines, offering an efficient solution for fine-grained urban imagery understanding.
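To make the idea concrete, below is a minimal PyTorch sketch of a multi-head self-attention adapter with a residual blend, in the spirit of the method described above. It is illustrative only: the class name, layer sizes, and parameter names (MHAdapter, blend_ratio, num_heads) are assumptions, not the repository's exact code.

import torch
import torch.nn as nn

class MHAdapter(nn.Module):
    """Sketch of a multi-head self-attention adapter over CLIP patch tokens (illustrative)."""
    def __init__(self, dim: int = 512, num_heads: int = 8, blend_ratio: float = 0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.blend_ratio = blend_ratio  # how much of the refined tokens to mix back in

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim) patch-level features from the frozen CLIP encoder
        refined, _ = self.attn(tokens, tokens, tokens)
        refined = self.norm(refined)
        # Residual blend keeps the pretrained representation dominant when blend_ratio is small.
        return (1 - self.blend_ratio) * tokens + self.blend_ratio * refined

# Only the adapter (and a small classification head, not shown) would be trained;
# the CLIP backbone stays frozen.
adapter = MHAdapter(dim=512, num_heads=8, blend_ratio=0.2)
features = adapter(torch.randn(4, 196, 512))  # dummy batch of patch tokens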
This repository depends on:
- Git: Required for cloning repositories and running scripts via Bash (git clone, bash setup.sh, etc.).
- PyTorch: Core deep learning framework used for model training and inference.
- CLIP: A vision-language model (VLM) from OpenAI used for aligning images and text in a shared embedding space.
- Dassl.pytorch: A domain adaptation and generalization framework built on PyTorch that provides the training infrastructure.
- CoOp: A prompt-tuning method built on top of CLIP for few-shot learning.
- Hugging Face Hub: Used to download datasets directly through its API if required.
The complete experimental results are available on UCL OneDrive.
First, clone this repository to your local machine:
git clone https://github.com/Qi-YOU/CEGE0049-GSS-Dassl-CoOp.git
cd CEGE0049-GSS-Dassl-CoOp

Once cloned, make sure you have the following installed:
- Miniconda or Anaconda
- NVIDIA GPU with CUDA support (CUDA >= 12.8)
- Python 3.10
Install these dependencies using shallow clones (--depth 1) to save time and bandwidth by skipping full git history:
# Dassl
git clone --depth 1 https://github.com/KaiyangZhou/Dassl.pytorch.git

# CLIP
git clone --depth 1 https://github.com/openai/CLIP.git

# CoOp
git clone --depth 1 https://github.com/KaiyangZhou/CoOp.git

- For Linux: Make sure the setup script is executable and then run it:

  # Ensure the script is executable
  chmod +x scripts/setup_venv.sh

  # Run the script
  ./scripts/setup_venv.sh
- For Windows (via Git Bash or Conda Prompt): Open an Anaconda Prompt window, keep it in the base environment, and execute these commands:

  i. Ensure Git Bash is available in your PATH:

     # Add Git's Bash to PATH (adjust the path if Git is installed elsewhere)
     set PATH=C:\Program Files\Git\bin;%PATH%

  ii. Then verify Bash is available:

     bash --version

  iii. Make sure the setup script is executable and then run it:

     # Ensure the script is executable
     # chmod +x scripts/setup_venv.sh

     # Run the script with Git Bash
     bash scripts/setup_venv.sh
Note: Depending on your internet speed, the setup process may take 5–10 minutes to complete.
If you encounter errors while installing packages, please double-check your network connection and the availability of relevant Conda or PyPI channels.
This script may prompt you to press a key to continue at certain steps, giving you a moment to review your system status before proceeding.
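Once the setup script finishes, a quick sanity check run inside the newly created environment can confirm that the expected Python, PyTorch, and CUDA versions are visible. This snippet is illustrative only and is not part of the repository's scripts:

import sys
import torch

print("Python:", sys.version.split()[0])            # expect 3.10.x
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))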
Please follow the instructions on the official dataset repository wiki: https://huggingface.co/datasets/NUS-UAL/global-streetscapes
The following directories have to be downloaded:
manual_labels/ (approx. 23 GB)
├── train/
│ └── 8 CSV files with manual labels for contextual attributes
├── test/
│ └── 8 CSV files with manual labels for contextual attributes
└── img/
└── 7 tar.gz files containing images for training and testing
Place these directories under ../autodl-tmp and rename the manual_labels directory to global_street_scapes.
Important:
- After downloading the 7 .tar.gz files in the img/ directory, make sure to extract each archive (a small extraction helper is sketched after this list).
- Extraction will create folders named 1/ through 7/, each containing images in .jpeg format.
- The full dataset, after extraction, may occupy 20-30 GB of disk space.
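If you prefer not to extract the archives by hand, a small helper along these lines works. The path below is an assumption based on the layout described above; adjust it to your setup:

import tarfile
from pathlib import Path

# Assumed location of the renamed dataset directory.
img_dir = Path("../autodl-tmp/global_street_scapes/img")

for archive in sorted(img_dir.glob("*.tar.gz")):
    print(f"Extracting {archive.name} ...")
    with tarfile.open(archive) as tar:
        tar.extractall(path=img_dir)  # creates the 1/ ... 7/ image folders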
Once correctly downloaded and prepared, the resulting directory structure should look like:
global_street_scapes/
├── img/
│ ├── 1/
│ │ ├── xx.jpeg
│ │ └── ...
│ ├── 2/
│ ├── 3/
│ ├── 4/
│ ├── 5/
│ ├── 6/
│ └── 7/
├── train/
│ ├── glare.csv
│ ├── ...
│ └── weather.csv
└── test/
├── glare.csv
├── ...
└── weather.csv
where each directory named 1 through 7 contains images in .jpeg format.
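A quick, illustrative way to confirm the layout matches the structure above (the root path is an assumption; adjust it to where you placed the data):

from pathlib import Path

root = Path("../autodl-tmp/global_street_scapes")  # assumed dataset root

for i in range(1, 8):
    folder = root / "img" / str(i)
    n_images = len(list(folder.glob("*.jpeg"))) if folder.is_dir() else 0
    print(f"img/{i}: {n_images} .jpeg files")

for split in ("train", "test"):
    csvs = sorted((root / split).glob("*.csv"))
    print(f"{split}/: {len(csvs)} CSV files ->", [c.name for c in csvs])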
Before running the execution pipeline that utilizes CoOp, make sure the required files are correctly placed across the dependencies (CoOp/, CLIP/, etc.).
- For Linux/macOS:

  # Ensure the script is executable
  chmod +x scripts/sync_files.sh

  # Run the script
  ./scripts/sync_files.sh

- For Windows (Git Bash or Conda Prompt):

  # Ensure the script is executable
  # chmod +x scripts/sync_files.sh

  # Run the script
  bash scripts/sync_files.sh
This script verifies the working directory and copies necessary files into appropriate submodules.
This repo provides two automation scripts:
- run-grid-search.sh → Explore hyperparameter combinations for CLIP MHAdapter.
- run-comparison.sh → Benchmark CLIP MHAdapter against baseline models (ZeroshotCLIP, Linear Probe, CoOp, CLIP Adapter, ZeroR).
Both scripts produce experiment logs and summary files.
run-grid-search.sh runs over the following grid:
- Loss: ce
- Class Weighting: inverse, uniform (see the weighting sketch after this list)
- Blend Ratio: 0.2, 0.8
- Num Heads: 4, 8, 16
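To make the class-weighting options concrete: inverse up-weights rare classes using their training frequency, while uniform leaves the cross-entropy loss unweighted. The sketch below uses a common inverse-frequency formulation with dummy labels; the exact formula used by the repository may differ.

import torch
import torch.nn as nn

# Dummy labels standing in for a class-imbalanced training split.
labels = torch.tensor([0, 0, 0, 0, 1, 2, 2])

counts = torch.bincount(labels).float()
inverse_weights = counts.sum() / (len(counts) * counts)  # rarer classes get larger weights
uniform_weights = torch.ones_like(counts)                # equivalent to plain cross-entropy

ce_inverse = nn.CrossEntropyLoss(weight=inverse_weights)
ce_uniform = nn.CrossEntropyLoss(weight=uniform_weights)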
To conduct grid search:
bash run-grid-search.sh

- Outputs saved in:
  /root/autodl-tmp/results/<dataset>/clip-vitb16-mh_<heads>-ce-<weight>-br_<blend>
- Summary log:
  train_summary.txt
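For reference, the grid above expands to 1 × 2 × 2 × 3 = 12 runs per dataset. The following sketch only illustrates how the combinations map onto the output-directory naming pattern shown above; the shell script itself is the source of truth.

from itertools import product

losses = ["ce"]
weightings = ["inverse", "uniform"]
blend_ratios = [0.2, 0.8]
num_heads = [4, 8, 16]

dataset = "weather"  # e.g., glare, quality, weather
for loss, weight, blend, heads in product(losses, weightings, blend_ratios, num_heads):
    out_dir = (f"/root/autodl-tmp/results/{dataset}/"
               f"clip-vitb16-mh_{heads}-{loss}-{weight}-br_{blend}")
    print(out_dir)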
run-comparison.sh compares CLIP MHAdapter against several baselines under controlled settings.
- For CLIP MHAdapter, results are reported using its best macro-F1 configuration (macro-F1 is sketched after this list).
- For the baselines, a pre-defined hyperparameter search space is used to ensure fairness.
- ZeroR, Linear Probe and CLIP Adapter are evaluated without additional arguments, except for enabling class weighting.
- For CoOp, the number of context tokens (N_CTX) is restricted to 8 rather than the default value of 16, matching the other methods that do not modify the textual branch of the CLIP backbone, again to ensure a fair comparison.
- Runtime is logged for all methods.
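Macro-F1 averages per-class F1 scores with equal weight per class, which matters for imbalanced street-scene attributes. An illustrative computation with scikit-learn (dummy predictions; not tied to the repository's evaluation code):

from sklearn.metrics import f1_score

# Dummy ground-truth and predicted labels for a 3-class attribute (e.g., weather).
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
print(f"macro-F1: {macro_f1:.3f}")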
To conduct model comparison:
bash run-comparison.sh

- Outputs saved in:
  /root/autodl-tmp/results/<dataset>/<trainer>-clip-vitb16-...
- Summary log:
  /root/autodl-tmp/results/comparison_summary.txt
You can also run experiments manually without the helper scripts.
Below are the template commands and explanations for each placeholder.
CLIP MHAdapter
python CoOp/train.py \
--trainer CLIP_MHAdapter \
--dataset-config-file CoOp/configs/datasets/<DATASET>.yaml \
--config-file configs/vit_b16-adamw.yaml \
--output-dir /root/autodl-tmp/results/<DATASET>/clip-vitb16-mh_<HEADS>-ce-<WEIGHT>-br_<BLEND> \
--seed 42 \
TRAINER.LOSS.NAME ce \
TRAINER.LOSS.CLASS_WEIGHTING <inverse|uniform> \
MODEL.BLEND_RATIO <0.2|0.8> \
MODEL.NUM_HEADS <4|8|16>

- <DATASET> → one of the datasets (e.g., glare, quality, weather).
- <WEIGHT> → class weighting strategy: inverse or uniform.
- <BLEND> → blend ratio value: 0.2 or 0.8.
- <HEADS> → number of attention heads (4, 8, or 16).
- Example: MODEL.NUM_HEADS 8 will configure MHAdapter with 8 heads.
- All runs use a fixed random seed (42) for reproducibility.
- Switch between Linux and Windows paths inside the scripts before running.
- Best hyperparameters from run-grid-search.sh are used by run-comparison.sh for the proposed method CLIP-MHAdapter.
