SpatialFusion-LM: Foundational Vision Meets SpatialLM

SpatialFusion-LM is a unified framework for spatial 3D scene understanding from monocular or stereo RGB input. It integrates depth estimation, differentiable 3D reconstruction, and spatial layout prediction using large language models.


[Demo GIF: Replica scene office3]
[Demo GIF: TUM scene office]

🖥️ Tested Configuration

SpatialFusion-LM has been tested on:

  • 🐧 Ubuntu: 24.04
  • 🧠 GPU: NVIDIA RTX A6000
  • ⚙️ CUDA: 12.8
  • 🧊 Environment: Docker container with GPU support

Other modern Ubuntu + CUDA setups may work, but this is the validated reference configuration.

A GPU with ≥ 24 GB of VRAM is recommended to ensure stable real-time inference and efficient handling of high-resolution inputs across all components.
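If you want to sanity-check your GPU before launching, a quick query like the one below prints the detected device and its VRAM. This is an illustrative snippet, assuming PyTorch is available in your environment; it is not part of this repo:

```python
# Illustrative VRAM check (assumes PyTorch is installed; not part of this repo).
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gib = props.total_memory / (1024 ** 3)
    print(f"{props.name}: {vram_gib:.1f} GiB VRAM")
    if vram_gib < 24:
        print("Warning: below the recommended 24 GB of VRAM.")
else:
    print("No CUDA device detected.")
```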

🚀 Quick Start

  1. Clone the repo:
     git clone --recursive https://github.com/jagennath-hari/SpatialFusion-LM.git && cd SpatialFusion-LM
  2. Download the model weights and sample dataset:
     bash scripts/download_weights.sh && bash scripts/download_sample.sh
  3. Run the demo inside Docker:
     bash run_container.sh
     ros2 launch llm_ros llm_demo.launch.py

Switch modes with the mode argument: mono, mono+, or stereo (e.g., ros2 launch llm_ros llm_demo.launch.py mode:=mono).

🎮 Supported Modes

| Mode | Input | Depth Estimator | Use Case |
| --- | --- | --- | --- |
| mono | RGB only | UniK3D (ViT-L) | Uncalibrated monocular |
| mono+ | RGB + camera intrinsics | UniK3D (ViT-L) | Calibrated monocular |
| stereo | Rectified left + right images + intrinsics + baseline | FoundationStereo (ViT-S) | Accurate stereo depth |

📖 Overview

SpatialFusion-LM is a unified framework for spatially grounded 3D scene understanding from monocular or stereo RGB input. It integrates learning-based depth estimation, differentiable point cloud reconstruction, and spatial language modeling into a modular ROS 2 pipeline. By combining geometric cues with linguistic priors, the system generates object-centric 3D layouts that support semantic reasoning, embodied navigation, and robot perception in real-world environments.

The architecture decouples 3D scene inference into three core stages: (1) neural depth prediction, (2) back-projection and point cloud generation, and (3) spatial layout prediction via large-scale language models trained for 3D relational reasoning. Notably, the spatial reasoning is performed over instantaneous point clouds reconstructed in the local camera frame, rather than accumulated global maps, enabling frame-wise layout estimation in dynamic or unstructured environments.
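For intuition, stage (2) is the standard pinhole back-projection: a pixel (u, v) with predicted depth Z lifts to X = (u - cx) Z / fx, Y = (v - cy) Z / fy in the camera frame. Below is a minimal NumPy sketch of this step; it is illustrative only, not the repository's implementation, and assumes the intrinsics fx, fy, cx, cy are given:

```python
# Minimal pinhole back-projection sketch (illustrative, not the repo's code).
import numpy as np

def backproject(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Lift an HxW metric depth map to an (H*W, 3) point cloud in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # per-pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```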

SpatialFusion-LM supports real-time inference, dataset extensibility, and structured logging through Rerun and ROS 2, making it suitable for research in vision-language grounding, scene reconstruction, and robotics.

🔧 Features

  • 📷 Supports monocular, monocular+ and stereo vision
  • 🔍 Neural depth estimation with metric 3D reconstruction
  • 🧱 Differentiable point cloud generation in the camera frame
  • 🧠 Language-conditioned spatial layout prediction
  • 🧩 Modular ROS 2 architecture (plug-and-play components)
  • 🌀 Real-time inference and visualization
  • 📊 Integrated logging via Rerun


🗃️ Download TUM and Replica ROS 2 datasets

This script will prompt you to select one or more datasets to download:

bash scripts/download_dataset.sh

⚙️ Launch Configuration Options

The llm_demo.launch.py file accepts the following arguments:

| Argument | Type | Description | Default |
| --- | --- | --- | --- |
| mode | string | Input mode: mono, mono+, or stereo | stereo |
| spatialLM | bool | Enable or disable layout prediction via SpatialLM | true |
| rerun | bool | Enable or disable logging to Rerun | true |
| rviz | bool | Enable or disable RViz visualization | true |

📸 Mono, 📷 Mono+, 📷 📷 Stereo?

                mode=?
                  │
   ┌──────────────┼──────────────┐
   │              │              │
 mono           mono+          stereo
   │              │              │
  rgb       rgb + camera    stereo pair
               intr.       + camera intr.
                             + baseline

🤖 Mode Descriptions

  • mono – Only RGB image is provided.
    UniK3D internally estimates camera intrinsics and uses them to predict metric (absolute) depth.
    While this enables 3D reconstruction without calibration, the accuracy depends on the quality of intrinsic estimation.
    🚀 Suitable for quick deployment or uncalibrated cameras.

  • mono+ – RGB image and accurate camera intrinsics are provided.
    UniK3D uses the supplied intrinsics to produce more accurate metric depth, with better scale alignment.
    🧪 Ideal for calibrated cameras (e.g., using /camera_info).

  • stereo – Left and right rectified images, intrinsics, and baseline are required.
    FoundationStereo uses a ViT-based architecture to predict dense disparity maps from stereo pairs.
    Metric depth is then computed from the stereo baseline and intrinsics (see the sketch below) and converted to a 3D point cloud.
    🛡️ This mode provides the most robust and accurate depth, especially in structured or texture-rich environments.
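To make the stereo computation above concrete: with focal length fx (in pixels), baseline B (in meters), and disparity d (in pixels), metric depth is Z = fx * B / d. A minimal sketch of this conversion (illustrative only; the FoundationStereo node performs it internally):

```python
# Disparity-to-depth conversion sketch (illustrative only, not the repo's code).
import numpy as np

def disparity_to_depth(disparity: np.ndarray, fx: float, baseline_m: float) -> np.ndarray:
    """Z = fx * B / d, with non-positive (invalid) disparities mapped to depth 0."""
    depth = np.zeros_like(disparity, dtype=np.float32)
    valid = disparity > 0
    depth[valid] = fx * baseline_m / disparity[valid]
    return depth

# Worked example: fx = 700 px, B = 0.12 m, d = 35 px  ->  Z = 700 * 0.12 / 35 = 2.4 m
```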

🖼️ Demo Gallery

Below are example configurations showing how SpatialFusion-LM behaves with different launch options.


📸 Mono 🧠 SpatialLM Disabled (mono, rerun)

ros2 launch llm_ros llm_demo.launch.py mode:=mono spatialLM:=false rerun:=true rviz:=false

[Demo GIF: SpatialFusion-LM performing monocular estimation and 3D reconstruction on TUM scene xyz]

📷 📷 Stereo 🧠 SpatialLM Disabled (stereo, rerun)

ros2 launch llm_ros llm_demo.launch.py mode:=stereo spatialLM:=false rerun:=true rviz:=false

[Demo GIF: SpatialFusion-LM performing stereo estimation and 3D reconstruction on indoor scene indoor_0]

🧪 Run with TUM Dataset

SpatialFusion-LM supports pre-recorded ROS 2 bags from the TUM RGB-D dataset. The llm_demo_tum.launch.py launch file is preconfigured and supports the mono and mono+ modes, depending on whether camera intrinsics are available.

ros2 launch llm_ros llm_demo_tum.launch.py \
  mode:=mono+ \
  bag_path:=/datasets/tum_office \
  spatialLM:=true \
  rerun:=true \
  rviz:=true

This assumes you have already downloaded the ROS 2 TUM dataset. If not, run the provided script scripts/download_dataset.sh.

[Demo GIF: TUM scene office (Mono+)]
[Demo GIF: TUM scene desk (Mono)]

🧪 Run with Replica Dataset

SpatialFusion-LM supports pre-recorded ROS 2 bags from the Replica dataset. The llm_demo_replica.launch.py launch file is preconfigured and supports the mono and mono+ modes, depending on whether camera intrinsics are available.

ros2 launch llm_ros llm_demo_replica.launch.py \
  mode:=mono+ \
  bag_path:=/datasets/replica_office2 \
  spatialLM:=true \
  rerun:=true \
  rviz:=true

This assumes you have already downloaded the ROS 2 Replica dataset. If not, run the provided script scripts/download_dataset.sh.

[Demo GIF: Replica scene office2 (Mono+)]
[Demo GIF: Replica scene room0 (Mono)]

🛠️ Using SpatialFusion-LM with Your Own ROS 2 Topics

To run SpatialFusion-LM on a live ROS 2 system or your own dataset:

1️⃣ Use llm.launch.py for direct topic-level control

This version of the launch file allows you to specify raw topic names directly (no bag playback or auto setup). Examples:

This simulates mode:=mono, since no rgb_info is provided.

ros2 launch llm_ros llm.launch.py \
  rgb_image:=/your_camera/image_rect \
  rerun:=true \
  spatialLM:=true

This simulates mode:=mono+, since rgb_info is provided.

ros2 launch llm_ros llm.launch.py \
  rgb_image:=/your_camera/image_rect \
  rgb_info:=/your_camera/camera_info \
  rerun:=true \
  spatialLM:=true

This simulates mode:=stereo, since left and right image topics, camera info topics, and a baseline (in meters) are provided. If you do not know your baseline, see the helper sketch after the example.

ros2 launch llm_ros llm.launch.py \
  left_image:=/stereo/left/image_rect \
  right_image:=/stereo/right/image_rect \
  left_info:=/stereo/left/camera_info \
  right_info:=/stereo/right/camera_info \
  baseline:=0.12 \
  rerun:=true \
  spatialLM:=true
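ROS's rectified stereo convention encodes the baseline in the right camera's projection matrix: P[0,3] = -fx * B, so B = -P[0,3] / P[0,0]. A small illustrative helper for recovering the value passed as baseline:= above (the function name is ours, not part of this repo):

```python
# Recover the stereo baseline from the RIGHT camera's CameraInfo (illustrative helper).
from sensor_msgs.msg import CameraInfo

def baseline_from_right_info(info: CameraInfo) -> float:
    fx = info.p[0]  # P[0,0]
    tx = info.p[3]  # P[0,3] = -fx * baseline for the right camera
    return -tx / fx
```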

2️⃣ Parameter Descriptions

You can list every launch argument with:

ros2 launch llm_ros llm.launch.py -s

| Parameter | Description | Default | ROS 2 Msg Type |
| --- | --- | --- | --- |
| rgb_image | RGB image topic | '' | sensor_msgs/msg/Image |
| rgb_info | RGB camera info topic | '' | sensor_msgs/msg/CameraInfo |
| left_image | Left stereo image topic | '' | sensor_msgs/msg/Image |
| right_image | Right stereo image topic | '' | sensor_msgs/msg/Image |
| left_info | Left camera info topic | '' | sensor_msgs/msg/CameraInfo |
| right_info | Right camera info topic | '' | sensor_msgs/msg/CameraInfo |
| baseline | Stereo camera baseline (meters) | 0.0 | float (launch param) |
| rerun | Enable Rerun logging | true | bool (launch param) |
| spatialLM | Enable 3D layout prediction via SpatialLM | true | bool (launch param) |

📤 Output Topics

These are the outputs published by core_node.py:

| Topic | Description | ROS 2 Msg Type |
| --- | --- | --- |
| /spatialLM/depth | Predicted depth map (1-channel float) | sensor_msgs/msg/Image |
| /spatialLM/cloud | Reconstructed 3D point cloud | sensor_msgs/msg/PointCloud2 |
| /spatialLM/image | RGB image with projected 3D layout | sensor_msgs/msg/Image |
| /spatialLM/boxes | Predicted 3D layout objects (e.g., boxes) | visualization_msgs/msg/MarkerArray |
| /tf | Transform tree (e.g., map → camera) | tf2_msgs/msg/TFMessage |
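To consume these outputs from your own node, a minimal rclpy subscriber might look like the following. The topic names and message types come from the table above; the node name, QoS depth, and callbacks are illustrative:

```python
# Minimal sketch: subscribing to SpatialFusion-LM outputs (illustrative, not part of this repo).
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image, PointCloud2
from visualization_msgs.msg import MarkerArray

class SpatialLMListener(Node):
    def __init__(self):
        super().__init__('spatiallm_listener')
        # Depth map: 1-channel float image, values in meters
        self.create_subscription(Image, '/spatialLM/depth', self.on_depth, 10)
        # Reconstructed point cloud in the camera frame
        self.create_subscription(PointCloud2, '/spatialLM/cloud', self.on_cloud, 10)
        # Predicted 3D layout boxes
        self.create_subscription(MarkerArray, '/spatialLM/boxes', self.on_boxes, 10)

    def on_depth(self, msg: Image):
        self.get_logger().info(f'depth frame {msg.width}x{msg.height}')

    def on_cloud(self, msg: PointCloud2):
        self.get_logger().info(f'cloud with {msg.width * msg.height} points')

    def on_boxes(self, msg: MarkerArray):
        self.get_logger().info(f'{len(msg.markers)} layout markers')

def main():
    rclpy.init()
    rclpy.spin(SpatialLMListener())

if __name__ == '__main__':
    main()
```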

⚖️ Depth Comparison

[Demo GIF: Comparison of predicted depth maps from stereo, mono+, and mono modes, respectively]

📈 Performance Benchmarks

Measured on a single NVIDIA RTX A6000 (48 GB VRAM) at 1920×1080 resolution. Input images are automatically resized to model-specific inference resolution, and values may vary depending on hardware, backend load, and ROS 2 message overhead.

Measured in headless mode, without Rerun or RViz.

| Mode | Inference Time (ms) | Avg FPS | VRAM Used | Backbone |
| --- | --- | --- | --- | --- |
| Mono | 169.6 | 5.85 | 4440 MiB | UniK3D (ViT-L) |
| Mono+ | 126.1 | 7.87 | 4460 MiB | UniK3D (ViT-L) |
| Stereo | 292.5 | 3.35 | 2126 MiB | FoundationStereo (ViT-S) |

Significantly more VRAM (~8192 MiB) is needed when spatialLM:=true.
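As a sanity check on these numbers, the FPS column is roughly the reciprocal of the per-frame latency; the measured values sit slightly below 1000/latency because of ROS 2 message-handling overhead:

```python
# FPS is approximately the reciprocal of per-frame latency (values from the table above).
for mode, ms in [("Mono", 169.6), ("Mono+", 126.1), ("Stereo", 292.5)]:
    print(f"{mode}: {1000.0 / ms:.2f} theoretical FPS")  # ~5.90, ~7.93, ~3.42 vs. measured 5.85, 7.87, 3.35
```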

🤝 Contributing

I welcome pull requests and suggestions! If you want to add a new dataset, model backend, or visualization utility, open an issue or fork this repo!

If you find a bug in the code, please report it to jh7454@nyu.edu

📖 Citation

If you find this code/work useful in your own research, please consider citing the following:

@inproceedings{wen2025stereo,
  title     = {{FoundationStereo}: Zero-Shot Stereo Matching},
  author    = {Wen, Bowen and Trepte, Matthew and Aribido, Joseph and Kautz, Jan and Gallo, Orazio and Birchfield, Stan},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025}
}
@inproceedings{piccinelli2025unik3d,
    title     = {{U}ni{K3D}: Universal Camera Monocular 3D Estimation},
    author    = {Piccinelli, Luigi and Sakaridis, Christos and Segu, Mattia and Yang, Yung-Hsu and Li, Siyuan and Abbeloos, Wim and Van Gool, Luc},
    booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year      = {2025}
}
@misc{spatiallm,
  title        = {SpatialLM: Large Language Model for Spatial Understanding},
  author       = {ManyCore Research Team},
  howpublished = {\url{https://github.com/manycore-research/SpatialLM}},
  year         = {2025}
}

📄 License

This software is released under the GNU General Public License v3.0 (GPL-3.0). You are free to use, modify, and distribute this code under the terms of the license, but derivative works must also be open-sourced under GPL-3.0.

🙏 Acknowledgement

This work integrates several powerful research papers, libraries, and open-source tools: