Zhuoyuan Li1*, Jiahao Lu1*, Jiacheng Deng1, Hanzhi Chang1, Lifan Wu1, Yanzhe Liang1, Tianzhu Zhang1†
1University of Science and Technology of China
*Equal contribution  †Corresponding author
26/Jun/2025: Our paper is accepted by ICCV 2025. Congratulations!
11/Mar/2025: We release our paper on arXiv.
Installation
Start by cloning the repo:
git clone https://github.com/peoplelu/SAS.git
cd SAS
For Linux, you need to install libopenexr-dev before creating the environment.
sudo apt-get install libopenexr-dev
conda create -n SAS python=3.8
conda activate SAS
Step 1: Install PyTorch (we tested with PyTorch 2.1.0 and CUDA 11.8; other versions may also work):
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
Step 2: Install MinkowskiEngine:
conda install openblas-devel -c anaconda
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps \
--install-option="--blas_include_dirs=${CONDA_PREFIX}/include" \
--install-option="--blas=openblas"
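Optionally, you can sanity-check the MinkowskiEngine build with a short Python snippet (a minimal sketch that is not part of the original setup; the tensor values are arbitrary):
import torch
import MinkowskiEngine as ME

# Build a tiny sparse tensor: coordinates are (batch_index, x, y, z) integers.
coords = torch.IntTensor([[0, 0, 0, 0], [0, 1, 1, 1]])
feats = torch.rand(2, 3)
x = ME.SparseTensor(features=feats, coordinates=coords)
print(ME.__version__, x.F.shape)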
Step 3: Install torch-scatter for superpoint operations:
pip install torch-scatter
Step 4: Install the remaining dependencies:
pip install scipy open3d ftfy tensorboardx tqdm imageio plyfile opencv-python sharedarray
pip install git+https://github.com/openai/CLIP.git
Step 5: Install TensorFlow:
pip install tensorflow==2.13.1
Step 6: Install SAM:
pip install git+https://github.com/facebookresearch/segment-anything.git
Step 7: Install LSeg and SEEM
Please create two additional environments, lseg and seem, to install the dependencies for LSeg and SEEM. You can refer to their official repos for details.
Step 8: Install dependencies for Stable Diffusion:
pip install datasets diffusers timm transformers clip_interrogator
Dataset Preparation
We provide the pre-processed point features from LSeg and SEEM, the fused point features, and the constructed capabilities for the following datasets on Hugging Face:
- ScanNet
- Matterport3D
- nuScenes
Download the full pre-processed data (or choose specific folders to download):
git lfs install
git clone https://huggingface.co/datasets/Charlie839242/SAS
The structure of the pre-processed data (e.g., ScanNet) is as follows:
data
└── scannet
    ├── fused_feat
    │   └── scannet_multiview_fuse
    ├── point_feat
    │   ├── scannet_multiview_lseg
    │   └── scannet_multiview_seem
    └── vocabulary
        └── scannet_vocabulary
- "scannet_multiview_lseg" and "scannet_multiview_seem" store the 3D point features from LSeg and SEEM respectively.
- "scannet_vocabulary" contain the generated images and the constructed capabilities.
- "scannet_multiview_fuse" is the combination of "scannet_multiview_lseg" and "scannet_multiview_seem" with "scannet_vocabulary" as the guide.
You can also extract the 3D point features and obtain "scannet_multiview_lseg" and "scannet_multiview_seem" on your own.
This part of the code is included in "point_feat_extraction/lseg_feat". Follow the steps below to set it up:
- Download the LSeg weight demo_e200.ckpt and put it in the checkpoints folder.
- Download ADEChallengeData2016.zip from the link, unzip it, and place it in the dataset folder.
- Download the raw ScanNet 2D images and the ScanNet 3D data from OpenScene, and put them under the scannet folder:
wget https://cvg-data.inf.ethz.ch/openscene/data/scannet_processed/scannet_2d.zip
wget https://cvg-data.inf.ethz.ch/openscene/data/scannet_processed/scannet_3d.zip
The file structure is then as follows:
lseg_feat
├── checkpoints
│   └── demo_e200.ckpt
├── dataset
│   └── ADEChallengeData2016
│       ├── ...
│       ├── ...
│       └── ...
└── scannet
    ├── scannet_2d
    │   ├── ...
    │   ├── ...
    │   └── ...
    └── scannet_3d
        ├── ...
        ├── ...
        └── ...
Then execute the following commands to extract per-point ScanNet features from LSeg:
cd point_feat_extraction/lseg_feat
conda activate lseg
python fusion_scannet.py
This will generate the LSeg features in the "scannet_multiview_lseg" folder.
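Conceptually, the per-point LSeg features come from projecting each 3D point into the posed RGB frames, sampling the 2D feature map at the projected pixel, and averaging over the views in which the point is visible. The sketch below only illustrates this idea with made-up variable names and a simplified occlusion test; fusion_scannet.py (which follows OpenScene's multi-view fusion) is the actual implementation.
import torch

def fuse_multiview_features(points, feats_2d, poses, intrinsic, depths, thresh=0.05):
    # points: (N, 3) world coordinates; feats_2d: list of (C, H, W) per-view feature maps
    # poses: list of (4, 4) camera-to-world matrices; intrinsic: (3, 3); depths: list of (H, W)
    N, C = points.shape[0], feats_2d[0].shape[0]
    feat_sum, count = torch.zeros(N, C), torch.zeros(N, 1)
    homog = torch.cat([points, torch.ones(N, 1)], dim=1)               # (N, 4)
    for feat, pose, depth in zip(feats_2d, poses, depths):
        cam = (torch.inverse(pose) @ homog.T).T[:, :3]                 # world -> camera
        z = cam[:, 2]
        uv = (intrinsic @ cam.T).T                                     # pinhole projection
        u = (uv[:, 0] / z.clamp(min=1e-6)).round().long()
        v = (uv[:, 1] / z.clamp(min=1e-6)).round().long()
        H, W = depth.shape
        valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        # occlusion check: the projected depth should match the sensor depth
        valid &= (depth[v.clamp(0, H - 1), u.clamp(0, W - 1)] - z).abs() < thresh
        feat_sum[valid] += feat[:, v[valid], u[valid]].T
        count[valid] += 1
    return feat_sum / count.clamp(min=1)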
This part of the code is included in "point_feat_extraction/seem_feat". Follow the steps below to set it up:
- Download the SEEM checkpoint from the link and place it in the seem_feat folder.
- Download the raw ScanNet 2D images and the ScanNet 3D data from OpenScene, and put them under the scannet folder:
wget https://cvg-data.inf.ethz.ch/openscene/data/scannet_processed/scannet_2d.zip
wget https://cvg-data.inf.ethz.ch/openscene/data/scannet_processed/scannet_3d.zip
The file structure is then as follows:
seem_feat
├── seem_focall_v0.pt
└── scannet
    ├── scannet_2d
    │   ├── scene0000_00
    │   │   ├── color
    │   │   ├── depth
    │   │   ├── label
    │   │   └── pose
    │   └── scene0000_01
    │       ├── ...
    │       └── ...
    └── scannet_3d
        ├── ...
        ├── ...
        └── ...
First, execute the following commands to extract the panoptic segmentation results for each 2D image with SEEM:
cd point_feat_extraction/seem_feat
conda activate seem
python extract_seem_pano.py
python extract_seem_semantic.py
Now the file structure becomes:
seem_feat
└── scannet
    └── scannet_2d
        └── scene0000_00
            ├── color
            ├── depth
            ├── label
            ├── pose
            ├── sem_seg
            ├── sem_seg_img
            ├── pano_seg
            └── pano_seg_img
Second, execute the following code, which uses TAP to generate captions for the SEEM masks. Before this, download the TAP checkpoint and place it at TAP/models/tap_vit_h_v1_1.pkl.
conda create -n ta python=3.8
conda activate ta
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118
pip install packaging ninja
pip install flash-attn --no-build-isolation
pip install git+ssh://git@github.com/baaivision/tokenize-anything.git
cd point_feat_extraction/seem_feat
python TAP/infer.py
Finally, execute the following code to encode the extracted captions of each mask:
cd point_feat_extraction/seem_feat
python fusion_scannet.py
This will generate the SEEM features in the "scannet_multiview_seem" folder.
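For intuition, the SEEM branch assigns every mask a TAP caption and encodes that caption with the CLIP text encoder, so all points covered by a mask share the caption embedding. The snippet below is only an illustrative sketch with toy inputs and an assumed CLIP variant; fusion_scannet.py in seem_feat is the authoritative implementation.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)   # assumed CLIP variant, for illustration only

# captions: one TAP caption per SEEM mask; masks: (num_masks, num_points) boolean assignment
captions = ["a wooden chair", "a bookshelf full of books"]
masks = torch.zeros(len(captions), 1000, dtype=torch.bool)   # toy point-to-mask assignment

with torch.no_grad():
    tokens = clip.tokenize(captions).to(device)
    text_feat = model.encode_text(tokens).float()             # (num_masks, feat_dim)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# every point inherits the embedding of the mask it belongs to
point_feat = torch.zeros(1000, text_feat.shape[1])
for mask, feat in zip(masks, text_feat.cpu()):
    point_feat[mask] = feat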
Model Capability Construction
You can also synthesize images and obtain "scannet_vocabulary" on your own.
cd MCC
- Download the LSeg checkpoint and place it in the lseg_util folder.
- Download ADEChallengeData2016.zip from the link, unzip it, and place it in the lseg_util folder.
- Download the SEEM checkpoint from the link and place it in the seem_util folder.
- Download the SAM checkpoint from the link and place it in the sam_util folder.
python Stable_Diffusion/generate_any_class.py  # This will generate images in the synthesized_img folder
You can skip this step and directly use the provided vocabualry_embedding.py.
python lseg_util/generate_text_embedding.py  # This will generate "vocabualry_embedding.py"
conda activate lseg
python lseg_util/lseg_infer.py  # This will generate masks in the lseg_mask folder
conda activate seem
python seem_util/seem_infer.py  # This will generate masks in the seem_mask folder
python sam_util/generate_mask.py  # This will generate masks in the refined_mask folder
python miou/cal_miou.py --split=lseg  # This will generate mIoU results in the out folder and the capability folder
python miou/cal_miou.py --split=seem
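For reference, the capability of each 2D model is essentially its per-category mIoU on the synthesized images, measured against the SAM-refined masks. The snippet below is a simplified sketch of that per-category IoU computation with an invented data layout; cal_miou.py is the authoritative version.
import numpy as np

def per_category_miou(pred_masks, ref_masks):
    # pred_masks / ref_masks: dict mapping category name -> list of boolean mask arrays
    capability = {}
    for cat in ref_masks:
        ious = []
        for pred, ref in zip(pred_masks[cat], ref_masks[cat]):
            inter = np.logical_and(pred, ref).sum()
            union = np.logical_or(pred, ref).sum()
            ious.append(inter / union if union > 0 else 0.0)
        capability[cat] = float(np.mean(ious)) if ious else 0.0
    return capability

# e.g., capability_lseg = per_category_miou(lseg_masks, refined_masks)
#       capability_seem = per_category_miou(seem_masks, refined_masks)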
Feature Fusion
To integrate the LSeg features and the SEEM features of the ScanNet dataset using the constructed capability as the guide, execute the following command:
python feat_fusion/fusion_scannet.py
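As a rough illustration of what this step does, each point's LSeg and SEEM features are combined with weights derived from the per-category capabilities constructed above. The sketch below shows one simple weighting scheme with invented variable names; treat feat_fusion/fusion_scannet.py as the reference for the exact fusion rule.
import torch

def fuse_point_features(feat_lseg, feat_seem, cat_lseg, cat_seem, cap_lseg, cap_seem):
    # feat_lseg / feat_seem: (N, C) per-point features from the two 2D models
    # cat_lseg / cat_seem: (N,) predicted category ids used to look up capabilities
    # cap_lseg / cap_seem: (K,) per-category capability scores (e.g., mIoU)
    weights = torch.stack([cap_lseg[cat_lseg], cap_seem[cat_seem]], dim=1)
    weights = weights / weights.sum(dim=1, keepdim=True).clamp(min=1e-6)
    fused = weights[:, :1] * feat_lseg + weights[:, 1:] * feat_seem
    return fused / fused.norm(dim=1, keepdim=True).clamp(min=1e-6)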
Training
To extract superpoints for each scene in the ScanNet v2 dataset, first download the raw ScanNet v2 dataset to obtain the .ply file of each scene. The ScanNet v2 dataset structure is as follows:
superpoint_extraction
└── scannet_v2
    ├── intrinsics.txt
    ├── scene0000_00
    │   ├── label-filt
    │   ├── scene0000_00_2d-instance-filt.zip
    │   ├── scene0000_00_2d-instance.zip
    │   ├── scene0000_00_2d-label-filt.zip
    │   ├── scene0000_00_2d-label.zip
    │   ├── scene0000_00.aggregation.json
    │   ├── scene0000_00.txt
    │   ├── scene0000_00_vh_clean_2.0.010000.segs.json
    │   ├── scene0000_00_vh_clean_2.labels.ply
    │   ├── scene0000_00_vh_clean_2.ply
    │   ├── scene0000_00_vh_clean.aggregation.json
    │   ├── scene0000_00_vh_clean.ply
    │   └── scene0000_00_vh_clean.segs.json
    └── scene0000_01
        ├── ...
        ├── ...
        └── ...
Then build the cpp lib for superpoint extraction:
cd csrc && mkdir build && cd build
cmake .. \
-DCMAKE_PREFIX_PATH=`python -c 'import torch;print(torch.utils.cmake_prefix_path)'` \
-DPYTHON_INCLUDE_DIR=$(python -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())") \
-DPYTHON_LIBRARY=$(python -c "import distutils.sysconfig as sysconfig; print(sysconfig.get_config_var('LIBDIR'))") \
-DCMAKE_INSTALL_PREFIX=`python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())'`
make && make install # after install, please do not delete this folder (as we only create a symbolic link)Then execute the following command to extract superpoints. The superpoint-related code is built upon segmentator.
python superpoint_extraction/scannet_superpoint.py
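Each saved .pth file presumably stores the per-point superpoint assignment of a scene (inspect one file to confirm the exact layout). During training, such assignments are used together with torch-scatter to pool per-point quantities over superpoints, as in the minimal sketch below with toy values:
import torch
from torch_scatter import scatter_mean

# Toy example: 6 points grouped into 3 superpoints.
superpoint_ids = torch.tensor([0, 0, 1, 1, 2, 2])   # (N,) superpoint index per point
point_feat = torch.rand(6, 512)                     # (N, C) per-point features

# Average the point features inside each superpoint.
sp_feat = scatter_mean(point_feat, superpoint_ids, dim=0)   # (num_superpoints, C)

# Broadcast each pooled feature back to the points of its superpoint.
point_feat_sp = sp_feat[superpoint_ids]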
Make sure your data folder is as follows:
.
└── data
    └── scannet_3d
        └── scannet_3d
            ├── train
            ├── val
            ├── scannetv2_train.txt
            ├── scannetv2_val.txt
            ├── scannetv2_test.txt
            └── superpoint  # extracted superpoints
                ├── scene0000_00_vh_clean_2.pth
                ├── scene0000_01_vh_clean_2.pth
                └── ...
Then modify config/scannet/ours_lseg.yaml according to your own needs:
- data_root_2d_fused_feature: path to the fused LSeg and SEEM features
- data_root: path to the 3D ScanNet data and its superpoints
- checkpoint: the checkpoint used for the second training stage
- save_path: the path where training metrics are saved
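If you prefer to set these paths programmatically, a small script like the one below can be used. It assumes the config is plain YAML with exactly these top-level keys; all path values are placeholders, and rewriting the file this way will drop any comments in the YAML.
import yaml

cfg_path = "config/scannet/ours_lseg.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["data_root_2d_fused_feature"] = "data/scannet/fused_feat/scannet_multiview_fuse"  # placeholder
cfg["data_root"] = "data/scannet_3d"                                                  # placeholder
cfg["checkpoint"] = "exp/stage1/model/model_last.pth.tar"   # only needed for the second stage
cfg["save_path"] = "exp/stage1"                                                       # placeholder

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)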
Then execute the following command to start the first training stage:
sh run/distill_sp.sh exp/xxxx config/scannet/ours_lseg.yaml
After the first stage finishes, set "checkpoint" in the config to the last-epoch model from the first stage, then execute the following command to start the second training stage. Note that you can adjust the hyperparameters on your own.
sh run/distill_EMA.sh exp/xxxx config/scannet/ours_lseg.yaml
Evaluation
To evaluate the 2D features (from LSeg, from SEEM, or fused), set "data_root_2d_fused_feature" in the config to the 2D feature folder you want to test (e.g., data/scannet_multiview_fuse) and execute the following command:
sh run/eval.sh out/xxxx config/scannet/ours_lseg.yaml fusion
To evaluate the distilled model (from either the first or the second stage), set "model_path" in the config to the 3D model you want to test (e.g., exp/xxxx/model/model_best.pth.tar) and execute the following command:
sh run/eval.sh out/xxxx config/scannet/ours_lseg.yaml distill
We also release pretrained checkpoints on Hugging Face, including the ScanNet, Matterport3D, and nuScenes checkpoints. You can set "model_path" in the config to a downloaded checkpoint for direct evaluation.
TODO List
- Installation
- Pre-processed data
- Model capability construction
- The first stage of training
- The second stage of training
- Code for evaluation
- Extraction of superpoints
- Code for extraction of point features from LSeg and SEEM
- Release pretrained model
- Code and data for MatterPort3D
- Code and data for nuScenes
If you find our code or paper useful, please cite:
@article{li2025sas,
title={SAS: Segment Any 3D Scene with Integrated 2D Priors},
author={Li, Zhuoyuan and Lu, Jiahao and Deng, Jiacheng and Chang, Hanzhi and Wu, Lifan and Liang, Yanzhe and Zhang, Tianzhu},
journal={arXiv preprint arXiv:2503.08512},
year={2025}
}
Our code is built upon OpenScene. We thank the authors for their excellent work!
