Zhuoyuan Li1*, Jiahao Lu1*, Jiacheng Deng1, Hanzhi Chang1, Lifan Wu1, Yanzhe Liang1, Tianzhu Zhang1†
1University of Science and Technology of China
*Equal contribution  †Corresponding author
26/Jun/2025: Our paper is accepted by ICCV 2025. Congratulations!
11/Mar/2025: We release our paper on arXiv.
Installation
Start by cloning the repo:
git clone https://github.com/peoplelu/SAS.git
cd SAS
For Linux, you need to install libopenexr-dev before creating the environment.
sudo apt-get install libopenexr-dev
conda create -n SAS python=3.8
conda activate SAS
Step 1: Install PyTorch (we tested with PyTorch 2.1.0 and CUDA 11.8; other versions may also work):
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
Step 2: Install MinkowskiEngine:
conda install openblas-devel -c anaconda
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps \
--install-option="--blas_include_dirs=${CONDA_PREFIX}/include" \
--install-option="--blas=openblas"
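Optionally, you can sanity-check the MinkowskiEngine build with a short Python snippet (a minimal sketch that is not part of the original setup; the tensor values are arbitrary):
import torch
import MinkowskiEngine as ME

# Build a tiny sparse tensor: coordinates are (batch_index, x, y, z) integers.
coords = torch.IntTensor([[0, 0, 0, 0], [0, 1, 1, 1]])
feats = torch.rand(2, 3)
x = ME.SparseTensor(features=feats, coordinates=coords)
print(ME.__version__, x.F.shape)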
Step 3: Install torch-scatter for superpoint operations:
pip install torch-scatter
Step 4: Install the remaining dependencies:
pip install scipy open3d ftfy tensorboardx tqdm imageio plyfile opencv-python sharedarray
pip install git+https://github.com/openai/CLIP.git
Step 5: Install TensorFlow:
pip install tensorflow==2.13.1
Step 6: Install SAM:
pip install git+https://github.com/facebookresearch/segment-anything.git
Step 7: Install LSeg and SEEM
Please create two additional environments, lseg and seem, to install the dependencies for LSeg and SEEM. You can refer to their official repos for details.
Step 8: Install dependencies for Stable Diffusion:
pip install datasets diffusers timm transformers clip_interrogator
Dataset Preparation
We provide the pre-processed point features from LSeg and SEEM, the fused point features, and the constructed capabilities for the following datasets on Hugging Face:
- ScanNet
- Matterport3D
- nuScenes
Download the full pre-processed data (or choose specific folders to download):
git lfs install
git clone https://huggingface.co/datasets/Charlie839242/SAS
The structure of the pre-processed data (e.g., ScanNet) is as follows:
data
└── scannet
    ├── fused_feat
    │   └── scannet_multiview_fuse
    ├── point_feat
    │   ├── scannet_multiview_lseg
    │   └── scannet_multiview_seem
    └── vocabulary
        └── scannet_vocabulary
- "scannet_multiview_lseg" and "scannet_multiview_seem" store the 3D point features from LSeg and SEEM respectively.
- "scannet_vocabulary" contain the generated images and the constructed capabilities.
- "scannet_multiview_fuse" is the combination of "scannet_multiview_lseg" and "scannet_multiview_seem" with "scannet_vocabulary" as the guide.
You can also extract the 3D point features and obtain "scannet_multiview_lseg" and "scannet_multiview_seem" on your own.
This part of the code is included in "point_feat_extraction/lseg_feat". Follow the steps below to set it up:
- Download the LSeg weight demo_e200.ckpt and put it in the checkpoints folder.
- Download ADEChallengeData2016.zip from the link, unzip it, and place it in the dataset folder.
- Download the raw ScanNet 2D images and the ScanNet 3D data from OpenScene, and put them under the scannet folder:
wget https://cvg-data.inf.ethz.ch/openscene/data/scannet_processed/scannet_2d.zip
wget https://cvg-data.inf.ethz.ch/openscene/data/scannet_processed/scannet_3d.zip
The file structure is then as follows:
lseg_feat
├── checkpoints
│   └── demo_e200.ckpt
├── dataset
│   └── ADEChallengeData2016
│       ├── ...
│       ├── ...
│       └── ...
└── scannet
    ├── scannet_2d
    │   ├── ...
    │   ├── ...
    │   └── ...
    └── scannet_3d
        ├── ...
        ├── ...
        └── ...
Then execute the following commands to extract per-point ScanNet features from LSeg:
cd point_feat_extraction/lseg_feat
conda activate lseg
python fusion_scannet.py
This will generate the LSeg features in the "scannet_multiview_lseg" folder.
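Conceptually, the per-point LSeg features come from projecting each 3D point into the posed RGB frames, sampling the 2D feature map at the projected pixel, and averaging over the views in which the point is visible. The sketch below only illustrates this idea with made-up variable names and a simplified occlusion test; fusion_scannet.py (which follows OpenScene's multi-view fusion) is the actual implementation.
import torch

def fuse_multiview_features(points, feats_2d, poses, intrinsic, depths, thresh=0.05):
    # points: (N, 3) world coordinates; feats_2d: list of (C, H, W) per-view feature maps
    # poses: list of (4, 4) camera-to-world matrices; intrinsic: (3, 3); depths: list of (H, W)
    N, C = points.shape[0], feats_2d[0].shape[0]
    feat_sum, count = torch.zeros(N, C), torch.zeros(N, 1)
    homog = torch.cat([points, torch.ones(N, 1)], dim=1)               # (N, 4)
    for feat, pose, depth in zip(feats_2d, poses, depths):
        cam = (torch.inverse(pose) @ homog.T).T[:, :3]                 # world -> camera
        z = cam[:, 2]
        uv = (intrinsic @ cam.T).T                                     # pinhole projection
        u = (uv[:, 0] / z.clamp(min=1e-6)).round().long()
        v = (uv[:, 1] / z.clamp(min=1e-6)).round().long()
        H, W = depth.shape
        valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        # occlusion check: the projected depth should match the sensor depth
        valid &= (depth[v.clamp(0, H - 1), u.clamp(0, W - 1)] - z).abs() < thresh
        feat_sum[valid] += feat[:, v[valid], u[valid]].T
        count[valid] += 1
    return feat_sum / count.clamp(min=1)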
This part of the code is included in "point_feat_extraction/seem_feat". Follow the steps below to set it up:
- Download the SEEM checkpoint from the link and place it in the seem_feat folder.
- Download the raw ScanNet 2D images and the ScanNet 3D data from OpenScene, and put them under the scannet folder:
wget https://cvg-data.inf.ethz.ch/openscene/data/scannet_processed/scannet_2d.zip
wget https://cvg-data.inf.ethz.ch/openscene/data/scannet_processed/scannet_3d.zip
The file structure is then as follows:
seem_feat
├── seem_focall_v0.pt
└── scannet
    ├── scannet_2d
    │   ├── scene0000_00
    │   │   ├── color
    │   │   ├── depth
    │   │   ├── label
    │   │   └── pose
    │   └── scene0000_01
    │       ├── ...
    │       └── ...
    └── scannet_3d
        ├── ...
        ├── ...
        └── ...
First, execute the following commands to extract the panoptic segmentation results for each 2D image with SEEM:
cd point_feat_extraction/seem_feat
conda activate seem
python extract_seem_pano.py
python extract_seem_semantic.py
Now the file structure becomes:
seem_feat
└── scannet
    └── scannet_2d
        └── scene0000_00
            ├── color
            ├── depth
            ├── label
            ├── pose
            ├── sem_seg
            ├── sem_seg_img
            ├── pano_seg
            └── pano_seg_img
Second, execute the following code, which uses TAP to generate captions for the SEEM masks. Before this, download the TAP checkpoint and place it at TAP/models/tap_vit_h_v1_1.pkl.
conda create -n ta python=3.8
conda activate ta
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118
pip install packaging ninja
pip install flash-attn --no-build-isolation
pip install git+ssh://git@github.com/baaivision/tokenize-anything.git
cd point_feat_extraction/seem_feat
python TAP/infer.py
Finally, execute the following code to encode the extracted captions of each mask:
cd point_feat_extraction/seem_feat
python fusion_scannet.py
This will generate the SEEM features in the "scannet_multiview_seem" folder.
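For intuition, the SEEM branch assigns every mask a TAP caption and encodes that caption with the CLIP text encoder, so all points covered by a mask share the caption embedding. The snippet below is only an illustrative sketch with toy inputs and an assumed CLIP variant; fusion_scannet.py in seem_feat is the authoritative implementation.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)   # assumed CLIP variant, for illustration only

# captions: one TAP caption per SEEM mask; masks: (num_masks, num_points) boolean assignment
captions = ["a wooden chair", "a bookshelf full of books"]
masks = torch.zeros(len(captions), 1000, dtype=torch.bool)   # toy point-to-mask assignment

with torch.no_grad():
    tokens = clip.tokenize(captions).to(device)
    text_feat = model.encode_text(tokens).float()             # (num_masks, feat_dim)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# every point inherits the embedding of the mask it belongs to
point_feat = torch.zeros(1000, text_feat.shape[1])
for mask, feat in zip(masks, text_feat.cpu()):
    point_feat[mask] = feat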
Model Capability Construction
You can also synthesize images and obtain "scannet_vocabulary" on your own.
cd MCC
- Download the LSeg checkpoint and place it in the lseg_util folder.
- Download ADEChallengeData2016.zip from the link, unzip it, and place it in the lseg_util folder.
- Download the SEEM checkpoint from the link and place it in the seem_util folder.
- Download the SAM checkpoint from the link and place it in the sam_util folder.
python Stable_Diffusion/generate_any_class.py  # This will generate images in the synthesized_img folder
You can skip this step and directly use the provided vocabualry_embedding.py.
python lseg_util/generate_text_embedding.py  # This will generate "vocabualry_embedding.py"
conda activate lseg
python lseg_util/lseg_infer.py  # This will generate masks in the lseg_mask folder
conda activate seem
python seem_util/seem_infer.py  # This will generate masks in the seem_mask folder
python sam_util/generate_mask.py  # This will generate masks in the refined_mask folder
python miou/cal_miou.py --split=lseg  # This will generate mIoU results in the out folder and the capability folder
python miou/cal_miou.py --split=seem
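For reference, the capability of each 2D model is essentially its per-category mIoU on the synthesized images, measured against the SAM-refined masks. The snippet below is a simplified sketch of that per-category IoU computation with an invented data layout; cal_miou.py is the authoritative version.
import numpy as np

def per_category_miou(pred_masks, ref_masks):
    # pred_masks / ref_masks: dict mapping category name -> list of boolean mask arrays
    capability = {}
    for cat in ref_masks:
        ious = []
        for pred, ref in zip(pred_masks[cat], ref_masks[cat]):
            inter = np.logical_and(pred, ref).sum()
            union = np.logical_or(pred, ref).sum()
            ious.append(inter / union if union > 0 else 0.0)
        capability[cat] = float(np.mean(ious)) if ious else 0.0
    return capability

# e.g., capability_lseg = per_category_miou(lseg_masks, refined_masks)
#       capability_seem = per_category_miou(seem_masks, refined_masks)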
Feature Fusion
To integrate the LSeg features and the SEEM features of the ScanNet dataset using the constructed capability as the guide, execute the following command:
python feat_fusion/fusion_scannet.py
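As a rough illustration of what this step does, each point's LSeg and SEEM features are combined with weights derived from the per-category capabilities constructed above. The sketch below shows one simple weighting scheme with invented variable names; treat feat_fusion/fusion_scannet.py as the reference for the exact fusion rule.
import torch

def fuse_point_features(feat_lseg, feat_seem, cat_lseg, cat_seem, cap_lseg, cap_seem):
    # feat_lseg / feat_seem: (N, C) per-point features from the two 2D models
    # cat_lseg / cat_seem: (N,) predicted category ids used to look up capabilities
    # cap_lseg / cap_seem: (K,) per-category capability scores (e.g., mIoU)
    weights = torch.stack([cap_lseg[cat_lseg], cap_seem[cat_seem]], dim=1)
    weights = weights / weights.sum(dim=1, keepdim=True).clamp(min=1e-6)
    fused = weights[:, :1] * feat_lseg + weights[:, 1:] * feat_seem
    return fused / fused.norm(dim=1, keepdim=True).clamp(min=1e-6)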
Training
To extract superpoints for each scene in the ScanNet v2 dataset, first download the raw ScanNet v2 dataset to obtain the .ply file of each scene. The ScanNet v2 dataset structure is as follows:
superpoint_extraction
└── scannet_v2
    ├── intrinsics.txt
    ├── scene0000_00
    │   ├── label-filt
    │   ├── scene0000_00_2d-instance-filt.zip
    │   ├── scene0000_00_2d-instance.zip
    │   ├── scene0000_00_2d-label-filt.zip
    │   ├── scene0000_00_2d-label.zip
    │   ├── scene0000_00.aggregation.json
    │   ├── scene0000_00.txt
    │   ├── scene0000_00_vh_clean_2.0.010000.segs.json
    │   ├── scene0000_00_vh_clean_2.labels.ply
    │   ├── scene0000_00_vh_clean_2.ply
    │   ├── scene0000_00_vh_clean.aggregation.json
    │   ├── scene0000_00_vh_clean.ply
    │   └── scene0000_00_vh_clean.segs.json
    └── scene0000_01
        ├── ...
        ├── ...
        └── ...
Then build the cpp lib for superpoint extraction:
cd csrc && mkdir build && cd build
cmake .. \
-DCMAKE_PREFIX_PATH=`python -c 'import torch;print(torch.utils.cmake_prefix_path)'` \
-DPYTHON_INCLUDE_DIR=$(python -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())") \
-DPYTHON_LIBRARY=$(python -c "import distutils.sysconfig as sysconfig; print(sysconfig.get_config_var('LIBDIR'))") \
-DCMAKE_INSTALL_PREFIX=`python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())'`
make && make install # after install, please do not delete this folder (as we only create a symbolic link)Then execute the following command to extract superpoints. The superpoint-related code is built upon segmentator.
python superpoint_extraction/scannet_superpoint.py
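Each saved .pth file presumably stores the per-point superpoint assignment of a scene (inspect one file to confirm the exact layout). During training, such assignments are used together with torch-scatter to pool per-point quantities over superpoints, as in the minimal sketch below with toy values:
import torch
from torch_scatter import scatter_mean

# Toy example: 6 points grouped into 3 superpoints.
superpoint_ids = torch.tensor([0, 0, 1, 1, 2, 2])   # (N,) superpoint index per point
point_feat = torch.rand(6, 512)                     # (N, C) per-point features

# Average the point features inside each superpoint.
sp_feat = scatter_mean(point_feat, superpoint_ids, dim=0)   # (num_superpoints, C)

# Broadcast each pooled feature back to the points of its superpoint.
point_feat_sp = sp_feat[superpoint_ids]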
Make sure your data folder is as follows:
.
└── data
    └── scannet_3d
        └── scannet_3d
            ├── train
            ├── val
            ├── scannetv2_train.txt
            ├── scannetv2_val.txt
            ├── scannetv2_test.txt
            └── superpoint  # extracted superpoints
                ├── scene0000_00_vh_clean_2.pth
                ├── scene0000_01_vh_clean_2.pth
                └── ...
Then modify config/scannet/ours_lseg.yaml according to your own needs:
- data_root_2d_fused_feature: path to the fused LSeg and SEEM features
- data_root: path to the 3D ScanNet data and its superpoints
- checkpoint: the checkpoint used for the second training stage
- save_path: the path where training metrics are saved
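If you prefer to set these paths programmatically, a small script like the one below can be used. It assumes the config is plain YAML with exactly these top-level keys; all path values are placeholders, and rewriting the file this way will drop any comments in the YAML.
import yaml

cfg_path = "config/scannet/ours_lseg.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["data_root_2d_fused_feature"] = "data/scannet/fused_feat/scannet_multiview_fuse"  # placeholder
cfg["data_root"] = "data/scannet_3d"                                                  # placeholder
cfg["checkpoint"] = "exp/stage1/model/model_last.pth.tar"   # only needed for the second stage
cfg["save_path"] = "exp/stage1"                                                       # placeholder

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)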
Then execute the following command to start the first training stage:
sh run/distill_sp.sh exp/xxxx config/scannet/ours_lseg.yaml
After the first stage finishes, set "checkpoint" in the config to the last-epoch model from the first stage, then execute the following command to start the second training stage. Note that you can adjust the hyperparameters on your own.
sh run/distill_EMA.sh exp/xxxx config/scannet/ours_lseg.yaml
Evaluation
To evaluate the 2D features (from LSeg, from SEEM, or fused), set "data_root_2d_fused_feature" in the config to the 2D feature folder you want to test (e.g., data/scannet_multiview_fuse) and execute the following command:
sh run/eval.sh out/xxxx config/scannet/ours_lseg.yaml fusion
To evaluate the distilled model (from either the first or the second stage), set "model_path" in the config to the 3D model you want to test (e.g., exp/xxxx/model/model_best.pth.tar) and execute the following command:
sh run/eval.sh out/xxxx config/scannet/ours_lseg.yaml distill
We also release pretrained checkpoints on Hugging Face, including the ScanNet, Matterport3D, and nuScenes checkpoints. You can set "model_path" in the config to a downloaded checkpoint for direct evaluation.
TODO List
- Installation
- Pre-processed data
- Model capability construction
- The first stage of training
- The second stage of training
- Code for evaluation
- Extraction of superpoints
- Code for extraction of point features from LSeg and SEEM
- Release pretrained model
- Code and data for MatterPort3D
- Code and data for nuScenes
If you find our code or paper useful, please cite:
@article{li2025sas,
title={SAS: Segment Any 3D Scene with Integrated 2D Priors},
author={Li, Zhuoyuan and Lu, Jiahao and Deng, Jiacheng and Chang, Hanzhi and Wu, Lifan and Liang, Yanzhe and Zhang, Tianzhu},
journal={arXiv preprint arXiv:2503.08512},
year={2025}
}
Our code is built upon OpenScene. We thank the authors for their excellent work!
