This is the official repository for the paper "Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data".
Authors: Haozhe Si, Yuxuan Wan, Minh Do, Deepak Vasisht, Han Zhao, Hendrik F. Hamann.
This repository provides the implementation of the Low-rank Efficient Spatial-Spectral Vision Transformer (LESS ViT). LESS ViT is a scalable, efficient vision transformer designed specifically for multi-modal and hyperspectral geospatial raster data.
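As a rough intuition for the "low-rank" in the name: instead of attending over every spatial-spectral token pair (quadratic in the product of patches and bands), spatial and spectral attention can be computed separately and combined multiplicatively. The sketch below is our own conceptual illustration of that idea only; the pooling choices and shapes are assumptions, not the released implementation:

```python
import torch

def less_style_attention(x):
    """Conceptual low-rank spatial-spectral attention (illustrative only).

    x: (batch, spectral_tokens, spatial_tokens, dim)
    Full attention over all S*P tokens would cost O((S*P)^2); here spatial
    and spectral attention are computed separately and applied in sequence
    (a Kronecker-structured mixing), costing O(S^2 + P^2) attention maps.
    """
    b, s, p, d = x.shape
    scale = d ** -0.5

    # Spectral attention: pool over spatial positions, attend across bands.
    spec = x.mean(dim=2)                                                    # (b, s, d)
    a_spec = torch.softmax(spec @ spec.transpose(-1, -2) * scale, dim=-1)   # (b, s, s)

    # Spatial attention: pool over bands, attend across patches.
    spat = x.mean(dim=1)                                                    # (b, p, d)
    a_spat = torch.softmax(spat @ spat.transpose(-1, -2) * scale, dim=-1)   # (b, p, p)

    # Apply spectral mixing, then spatial mixing.
    out = torch.einsum('bst,btpd->bspd', a_spec, x)    # mix across bands
    out = torch.einsum('bpq,bsqd->bspd', a_spat, out)  # mix across patches
    return out

x = torch.randn(2, 12, 64, 32)   # 12 spectral groups, 64 spatial patches
y = less_style_attention(x)
print(y.shape)                   # torch.Size([2, 12, 64, 32])
```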
We also constructed GFM-Bench, a comprehensive benchmark for geospatial raster data that incorporates 3 classification datasets and 4 semantic segmentation datasets. For more detailed information about GFM-Bench, please see our paper and the GFM-Bench GitHub repository.
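If the benchmark datasets are distributed through the HuggingFace `datasets` library, loading a dataset would look like the following sketch; the dataset ID here is a placeholder, so please check the GFM-Bench repository for the actual identifiers:

```python
from datasets import load_dataset

# Hypothetical dataset ID -- consult the GFM-Bench repository for real IDs.
dataset = load_dataset("GFM-Bench/EuroSAT")
print(dataset)  # inspect available splits and features
```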
We pre-trained the LESS ViT model using Hyper-MAE on the SSL4EO-S12 dataset for 300 epochs (on 8 × NVIDIA A6000 GPUs).
To launch pre-training, run the `launch_train.sh` script:

```bash
bash launch_train.sh
```

Please refer to `GeospatialFM/finetune/args.py` for more detailed descriptions of the arguments used in the script.
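Hyper-MAE follows the masked-autoencoder recipe of reconstructing masked tokens from a small visible subset. As background, a generic MAE-style random-masking step looks like the sketch below; this is our illustration of the standard technique, not this repository's implementation:

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Generic MAE-style random masking (illustrative, not the repo's code).

    tokens: (batch, num_tokens, dim); keeps a random subset of tokens so
    the encoder only sees visible tokens, as in standard MAE pre-training.
    """
    b, n, d = tokens.shape
    n_keep = int(n * (1 - mask_ratio))

    noise = torch.rand(b, n)                    # random score per token
    ids_shuffle = torch.argsort(noise, dim=1)   # tokens with low scores are kept
    ids_keep = ids_shuffle[:, :n_keep]

    visible = torch.gather(
        tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    # Binary mask for the loss: 1 = masked (to be reconstructed), 0 = visible.
    mask = torch.ones(b, n)
    mask.scatter_(1, ids_keep, 0)
    return visible, mask, ids_shuffle

tokens = torch.randn(4, 196, 768)
visible, mask, _ = random_masking(tokens)
print(visible.shape, mask.sum(dim=1))  # (4, 49, 768); 147 masked per sample
```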
We fine-tuned and evaluated our pre-trained LESS ViT on GFM-Bench. For the detailed implementation of the fine-tuning experiments, please refer to Appendix C of our paper.
To launch fine-tuning experiments, run the following command:
```bash
python3 sweep_finetune.py \
    --dataset ${DATASET_NAME} \
    --root_dir ${ROOT_DIR} \
    --modal ${MODAL}
```
To launch linear probing experiments, run the following command:
```bash
python3 sweep_finetune.py \
    --dataset ${DATASET_NAME} \
    --root_dir ${ROOT_DIR} \
    --modal ${MODAL} \
    --lp
```
- `--dataset ${DATASET_NAME}`: Name of the dataset to use.
- `--root_dir ${ROOT_DIR}`: Directory to save checkpoints and results.
- `--modal ${MODAL}`: The fine-tuning modality (`radar`, `optical`, or `multi`). Note: currently only the BigEarthNet and DFC2020 datasets support `radar` or `multi` fine-tuning in GFM-Bench.
- `--lp`: Run linear probing instead of full fine-tuning (see the sketch below).
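For reference, linear probing conventionally means freezing the pre-trained encoder and training only the task head. A generic PyTorch sketch of that setup follows; the attribute name `head` is hypothetical, and the actual module names depend on this repository's model definition:

```python
import torch.nn as nn

def setup_linear_probe(model: nn.Module, head_name: str = "head"):
    """Freeze everything except the task head (illustrative sketch).

    `head_name` is a hypothetical attribute name; substitute the real
    classifier/segmentation head name from the model definition.
    """
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_name)
    trainable = [n for n, p in model.named_parameters() if p.requires_grad]
    print(f"Training {len(trainable)} parameter tensors: {trainable}")
```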
We will be uploading pre-trained model checkpoints soon. Stay tuned! 😀
If you find our project helpful, please cite our paper:
```bibtex
@misc{si2025scalablefoundationmodelmultimodal,
      title={Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data},
      author={Haozhe Si and Yuxuan Wan and Minh Do and Deepak Vasisht and Han Zhao and Hendrik F. Hamann},
      year={2025},
      eprint={2503.12843},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.12843},
}
```