This repo contains code and weights for A Spitting Image: Modular Superpixel Tokenization in Vision Transformers, accepted for MELEX, ECCVW 2024.
For an introduction to our work, visit the project webpage.
The package can currently be installed via:

```sh
# HTTPS
pip install git+https://github.com/dsb-ifi/SPiT.git

# SSH
pip install git+ssh://git@github.com/dsb-ifi/SPiT.git
```

You can load the Superpixel Transformer model easily via `torch.hub`:
```python
model = torch.hub.load(
    'dsb-ifi/spit',
    'spit_base_16',
    pretrained=True,
    source='github',
)
```

This will load the model and download the pretrained weights, which are stored in your local `torch.hub` directory.
If you prefer downloading weights manually, feel free to use:
| Model | Link | MD5 |
|---|---|---|
| SPiT-S16 | Manual Download | 8e899c846a75c51e1c18538db92efddf |
| SPiT-S16 (w. grad.) | Manual Download | e49be7009c639c0ccda4bd68ed34e5af |
| SPiT-B16 | Manual Download | 9d3483a4c6fdaf603ee6528824d48803 |
| SPiT-B16 (w. grad.) | Manual Download | 9394072a5d488977b1af05c02aa0d13c |
| ViT-S16 | Manual Download | 73af132e4bb1405b510a5eb2ea74cf22 |
| ViT-S16 (w. grad.) | Manual Download | b8e4f1f219c3baef47fc465eaef9e0d4 |
| ViT-B16 | Manual Download | ce45dcbec70d61d1c9f944e1899247f1 |
| ViT-B16 (w. grad.) | Manual Download | 1caa683ecd885347208b0db58118bf40 |
| RViT-B16 | Manual Download | 18c13af67d10f407c3321eb1ca5eb568 |
| RViT-B16 (w. grad.) | Manual Download | 50d25403adfd5a12d7cb07f7ebfced97 |
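To confirm a manual download is intact, you can compare its MD5 digest against the table above. A minimal stdlib sketch (the checkpoint filename is hypothetical; substitute your actual download path):

```python
import hashlib


def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file in streaming chunks,
    so large checkpoints are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example: verify a downloaded SPiT-B16 checkpoint (filename is illustrative).
# assert md5sum("spit_base_16.pth") == "9d3483a4c6fdaf603ee6528824d48803"
```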
We provide a Jupyter notebook as a sandbox for loading, evaluating, and extracting segmentations for the models.
Currently, the code features some slight modifications to streamline use of the RViT models. The original RViT models sampled partitions from a dataset of pre-computed Voronoi tessellations for training and evaluation. This is impractical for deployment, and we have yet to implement a CUDA kernel for computing Voronoi tessellations with lower memory overhead.
However, we have developed a fast implementation that generates tessellations on-the-fly with PCA trees [1], which mimic Voronoi tessellations reasonably well. There are, however, still some minor issues with the small-capacity RViT models. Consequently, the RViT-B16 models will perform marginally differently from the results reported in the paper. We appreciate the reader's patience in this matter.
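For intuition, a Voronoi tessellation of the pixel grid simply assigns each pixel to its nearest seed point. The following is a minimal NumPy sketch of that idea, not the repository's implementation (which uses the faster PCA-tree approximation):

```python
import numpy as np


def voronoi_partition(h: int, w: int, n_seeds: int, seed=None) -> np.ndarray:
    """Label each pixel of an (h, w) grid with the index of its nearest
    randomly sampled seed point, yielding a Voronoi tessellation."""
    rng = np.random.default_rng(seed)
    seeds = rng.uniform([0.0, 0.0], [h, w], size=(n_seeds, 2))

    # Pixel coordinates, flattened to (h*w, 2).
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float64)

    # Squared Euclidean distance from every pixel to every seed.
    d2 = ((coords[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1).reshape(h, w)
```

Note that this dense distance matrix is exactly the memory overhead mentioned above; a tree-based assignment avoids materialising it.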
Note that the RViT models are inherently stochastic, so different runs can yield different results. SPiT models can also yield slightly different results from run to run, due to nondeterministic behaviour in CUDA kernels.
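If exact repeatability matters for your experiments, a standard PyTorch mitigation (a generic recipe, not something the models enforce) is to seed all RNGs and request deterministic kernels:

```python
import torch

# Seed the CPU and CUDA RNGs so stochastic sampling repeats across runs.
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)

# Optionally ask for deterministic CUDA kernel variants where they exist.
# May be slower; warn_only avoids errors for ops with no deterministic path.
torch.use_deterministic_algorithms(True, warn_only=True)
```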
[1] Sproull, R.F. Refinements to nearest-neighbor searching in k-dimensional trees. Algorithmica 6, 579–589 (1991).
- Include foundational code and model weights.
- Add manual links with MD5 hash for manual weight download.
- Add module for loading models, and provide example notebook.
- Create temporary solution to on-line Voronoi tesselation.
- Add `hubconf.py` for PyTorch Hub compatibility.
- Add example for extracting attribution maps with Att.Flow and Proto.PCA.
- Add example for computing sufficiency and comprehensiveness.
- Add assets for computed attribution maps for XAI experiments.
- Add code and examples for salient segmentation.
If you find our work useful, please consider citing our paper.
```bibtex
@inproceedings{Aasan2024,
  title={A Spitting Image: Modular Superpixel Tokenization in Vision Transformers},
  author={Aasan, Marius and Kolbj\o{}rnsen, Odd and Schistad Solberg, Anne and Ram\'irez Rivera, Ad\'in},
  booktitle={{CVF/ECCV} Computer Vision -- {ECCVW} 2024 -- {MELEX}},
  year={2024},
  doi={10.1007/978-3-031-93806-1_11},
}
```

