0nandon/EmbodiedSplat

EmbodiedSplat 🛋️
Online Feed-Forward Semantic 3DGS
for Open-Vocabulary 3D Scene Understanding

Seungjun Lee · Zihan Wang · Yunsong Wang · Gim Hee Lee
National University of Singapore

CVPR 2026

PyTorch Lightning

Build and understand at once! Given over 300 streaming images, EmbodiedSplat reconstructs a whole-scene open-vocabulary 3DGS in an online manner, processing frames at up to 5-6 FPS. The reconstructed scene supports diverse perception tasks such as open-vocabulary 3D semantic segmentation, 2D-rendered semantic segmentation, and novel-view color synthesis with depth rendering.

Table of Contents
  1. TODO
  2. Installation
  3. Data Preparation
  4. Evaluation
  5. Acknowledgement
  6. Citation

News:

  • [2026/02/21] EmbodiedSplat is accepted to CVPR 2026 🔥. The code will be released before June.
  • [2026/05/19] The code and pretrained weights are released! 👊🏻

TODO

  • Release the code of EmbodiedSplat and pretrained weights
  • If time permits, we plan to add some updates (not to publish another paper, just for fun ☺️):
    • Replacing the reconstruction backbone, FreeSplat++, with the most recent pose-free online feed-forward 3DGS model.
    • Replacing the CLIP (OpenSeg, MaskAdapter) + SAM pipeline with stronger 2D VLMs such as SAM3.
    • Attaching an LLM to EmbodiedSplat, following the spirit of SplatTalk.
    • Adapting EmbodiedSplat to a real robot and releasing the code.

Installation

Dependencies 📝

The main dependencies of the project are the following:

python: 3.10
cuda: 11.8

You can set up a conda environment as follows:

conda create -n embodiedsplat python=3.10
conda activate embodiedsplat
conda install -c conda-forge libopenblas=0.3.31 openblas-devel=0.3.31

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

pip install "setuptools<81"
pip install -r requirements.txt --no-build-isolation

cd src/third_party/MinkowskiEngine
git checkout 02fc608bea4c0549b0a7b00ca1bf15dee4a0b228
python setup.py install --blas_include_dirs=${CONDA_PREFIX}/include --blas=openblas
cd ../../..  # return to the repository root; the paths below are relative to it

pip install --no-build-isolation src/model/encoder/submodules/simple-knn
pip install --no-build-isolation src/ops
pip install --no-build-isolation src/third_party/localagg
pip install --no-build-isolation git+https://github.com/JonathonLuiten/diff-gaussian-rasterization-w-depth
pip install --no-build-isolation src/third_party/langsplat-rasterization
pip install git+https://github.com/openai/CLIP.git

# If the evaluation code fails with a MinkowskiEngine error, run:
cd $CONDA_PREFIX/lib
ln -sf libopenblasp-r0.3.31.so libopenblas.so.0
ln -sf libopenblasp-r0.3.31.so libopenblas.so
cd {YOUR_PATH}
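
After installation, a quick sanity check can confirm that the interpreter and the PyTorch build match the versions listed under Dependencies. The following is a minimal sketch and not part of the repository: the helper name check_environment is hypothetical, and it only reports what it finds rather than enforcing anything.

```python
import sys


def check_environment(expected_python=(3, 10), expected_cuda="11.8"):
    """Report whether the interpreter and PyTorch build match the README versions.

    `check_environment` is a hypothetical helper, not part of EmbodiedSplat.
    """
    report = {
        "python_version": "{}.{}".format(*sys.version_info[:2]),
        "python_ok": tuple(sys.version_info[:2]) == tuple(expected_python),
        "torch_installed": False,
        "torch_cuda_build": None,
        "cuda_ok": False,
        "gpu_visible": False,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is missing; leave the torch-related fields at their defaults.
        return report

    report["torch_installed"] = True
    # CUDA version the wheel was built against (None for CPU-only builds).
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_ok"] = torch.version.cuda == expected_cuda
    report["gpu_visible"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If cuda_ok is False, the most likely cause is that the PyTorch wheel was installed from a different index URL than the cu118 one used above.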

Data Preparation

The test scenes from ScanNet and ScanNet++ and the pretrained weights are available here. You can download all the preprocessed data by running:

python download_data.py

After running the command above, two folders should be created:

  • pretrained: contains all the pretrained weights of EmbodiedSplat and the auxiliary 2D models.
  • dataset: contains all the test scenes and ground-truth annotations from ScanNet and ScanNet++.
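
To confirm that the download finished, you can check that both folders exist before running evaluation. This is a minimal sketch: the helper name missing_data_folders is hypothetical, and it assumes the folders are created in the directory where download_data.py was run.

```python
from pathlib import Path


def missing_data_folders(root=".", expected=("pretrained", "dataset")):
    """Return the expected folder names that do not exist under `root`.

    Hypothetical helper; the folder names come from the README,
    but their location relative to `root` is an assumption.
    """
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_data_folders()
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Both folders are present.")
```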

Evaluation

NOTE 📌 : We made a minor update to the inference strategy. As mentioned at the end of Sec. 7.2, we apply floater removal as a post-refinement step, following FreeSplat++. In the original paper, Gaussians identified as floaters are also excluded from semantic prediction on point clouds in Eq. 11. However, we empirically found that this exclusion degrades semantic performance, even though floater removal clearly improves rendered RGB quality. Hence, in the released code, floater Gaussians are excluded only during RGB rendering and are still used for semantic prediction. As a result, the evaluation results may be higher than the numbers reported in the paper.
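
The difference between the two uses of the floater flags can be illustrated with a small sketch. It is not the released implementation: the function name split_gaussians_by_task is hypothetical, Gaussians are stand-in Python objects, and the floater flags are assumed to be given (floater detection itself is not shown).

```python
def split_gaussians_by_task(gaussians, is_floater):
    """Return (rgb_set, semantic_set) following the released-code behavior.

    Hypothetical illustration: floaters are dropped for RGB rendering
    but kept for semantic prediction on point clouds.
    """
    if len(gaussians) != len(is_floater):
        raise ValueError("gaussians and is_floater must have the same length")
    # RGB rendering: exclude every Gaussian flagged as a floater.
    rgb_set = [g for g, flagged in zip(gaussians, is_floater) if not flagged]
    # Semantic prediction: keep all Gaussians, floaters included.
    semantic_set = list(gaussians)
    return rgb_set, semantic_set


if __name__ == "__main__":
    rgb, semantic = split_gaussians_by_task(["g0", "g1", "g2"], [False, True, False])
    print(len(rgb), len(semantic))
```

Under the paper's original strategy, semantic_set would be filtered the same way as rgb_set.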

NOTE 📌 : We support two inference strategies:

  • incremental: Among all past frames, we select the N=30 images with the smallest pose differences from the current frame and use them as reference frames.
  • online: We simply select the most recent N=30 frames, i.e., [t−30,t−1], and use them as reference frames for timestep t.

The default setting is incremental, but it can be changed to online by setting model.encoder.recon_mode=online in the config files under config/experiment. Both settings yield similar performance.
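
The two reference-frame selection rules can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the function names are hypothetical, poses are taken to be 4x4 camera-to-world matrices, and the pose-difference metric (translation distance plus rotation angle) is an assumption, since the exact metric is not specified here.

```python
import math


def pose_difference(pose_a, pose_b, rot_weight=1.0):
    """Assumed metric: translation distance plus weighted rotation angle (radians).

    Poses are 4x4 camera-to-world matrices given as nested lists.
    """
    trans = math.dist(
        [row[3] for row in pose_a[:3]],
        [row[3] for row in pose_b[:3]],
    )
    # trace(R_a^T R_b) equals the sum of element-wise products of the rotations.
    trace = sum(pose_a[i][j] * pose_b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return trans + rot_weight * math.acos(cos_angle)


def select_reference_frames(poses, t, mode="incremental", n=30):
    """Return sorted indices of up to `n` past frames used as references for frame `t`."""
    if mode == "online":
        # The most recent n frames: [t - n, t - 1], clipped at the first frame.
        return list(range(max(0, t - n), t))
    if mode == "incremental":
        # The n past frames whose poses are closest to the current pose.
        ranked = sorted(range(t), key=lambda i: pose_difference(poses[t], poses[i]))
        return sorted(ranked[:n])
    raise ValueError(f"unknown mode: {mode!r}")
```

The two rules differ mainly when the camera revisits a region: incremental can pick up frames from the earlier visit, whereas online only ever sees the immediately preceding frames.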

We provide evaluation scripts for diverse settings across ScanNet and ScanNet++, with options to enable or disable GT depth. All experiments are conducted on a single NVIDIA RTX 6000 Ada GPU (48GB).

Setting               EmbodiedSplat   EmbodiedSplat-fast
ScanNet               Here            Here
ScanNet, GT Depth     Here            Here
ScanNet++             Here            Here
ScanNet++, GT Depth   Here            Here

The generated semantic Gaussians are stored in the outputs_semantic folder and are subsequently used for evaluation on point clouds.

Acknowledgement

Our work is heavily inspired by the following works. We sincerely appreciate their great contributions!

Citation

If you find our code or paper useful, please cite:

@article{lee2026embodiedsplat,
  title={EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding},
  author={Lee, Seungjun and Wang, Zihan and Wang, Yunsong and Lee, Gim Hee},
  journal={arXiv preprint arXiv:2603.04254},
  year={2026}
}
