Flash Sculptor: Modular 3D Worlds from Objects



Yujia Hu, Songhua Liu, Xingyi Yang, and Xinchao Wang
Learning and Vision Lab, National University of Singapore

Demo

Abstract: Existing text-to-3D and image-to-3D models often struggle with complex scenes involving multiple objects and intricate interactions. Although some recent attempts have explored such compositional scenarios, they still require an extensive process of optimizing the entire layout, which is highly cumbersome if not infeasible at all. To overcome these challenges, we propose Flash Sculptor in this paper, a simple yet effective framework for compositional 3D scene/object reconstruction from a single image. At the heart of Flash Sculptor lies a divide-and-conquer strategy, which decouples compositional scene reconstruction into a sequence of sub-tasks, including handling the appearance, rotation, scale, and translation of each individual instance. Specifically, for rotation, we introduce a coarse-to-fine scheme that brings the best of both worlds--efficiency and accuracy--while for translation, we develop an outlier-removal-based algorithm that ensures robust and precise parameters in a single step, without any iterative optimization. Extensive experiments demonstrate that Flash Sculptor achieves at least a 3 times speedup over existing compositional 3D methods, while setting new benchmarks in compositional 3D reconstruction performance.

Our Pipeline:

Pipeline

💻 Requirements

  • Ubuntu 20.04
  • CUDA 12.2
  • Python 3.10.12
  • PyTorch 2.4.0

🔧 Installation

For complete installation instructions, please see INSTALL.md.

⚙️ Pretrained Models

Please see DOWNLOAD.md to download pretrained models.

🔦 Run

Follow these steps to get a composite 3D scene from a single image:

0. Prepare an image

Generate an image from a text prompt with:

python t2i.py --task_name [task_name] --prompt [prompt]

Alternatively, place your own 2D image at results/[task_name]/2DImage.png.

1. Segment the image

Run the following command to segment the image and obtain the bounding box, mask and label of each object:

python segment.py --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py --ram_checkpoint ram_swin_large_14m.pth --ram_plus_checkpoint ram_plus_swin_large_14m.pth --grounded_checkpoint groundingdino_swint_ogc.pth --sam_checkpoint sam_vit_h_4b8939.pth --sam_hq_checkpoint sam_hq_vit_h.pth --box_threshold 0.25 --text_threshold 0.2 --iou_threshold 0.5 --device "cuda" --task_name [task_name]
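Roughly speaking, the thresholds above control which detections survive: boxes below `--box_threshold` confidence are dropped, and among overlapping boxes, ones exceeding `--iou_threshold` overlap with a stronger box are suppressed. The following is a schematic sketch of that filtering logic (an illustrative assumption, not Grounded-SAM's actual code):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def filter_detections(dets, box_threshold=0.25, iou_threshold=0.5):
    """Keep confident boxes, then greedily suppress heavy overlaps."""
    dets = sorted((d for d in dets if d["score"] >= box_threshold),
                  key=lambda d: d["score"], reverse=True)
    kept = []
    for d in dets:
        if all(iou(d["box"], k["box"]) < iou_threshold for k in kept):
            kept.append(d)
    return kept

dets = [
    {"box": (0, 0, 10, 10), "score": 0.9},
    {"box": (1, 1, 10, 10), "score": 0.6},    # near-duplicate, suppressed
    {"box": (20, 20, 30, 30), "score": 0.3},
    {"box": (40, 40, 50, 50), "score": 0.1},  # below box_threshold
]
print(len(filter_detections(dets)))  # 2
```

Lowering `--box_threshold` keeps more (possibly spurious) objects; raising `--iou_threshold` tolerates more overlap between kept boxes.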

2. Reconstruct the background scene

First, recover the background by running:

cd Inpaint-Anything
python background_recover.py --task_name [task_name] --dilate_kernel_size 15 --lama_config ./lama/configs/prediction/default.yaml --lama_ckpt ./pretrained_models/big-lama
cd ..

Then, reconstruct the 3D background scene from it using:

cd VistaDream
python vistadream.py --task_name [task_name]
cd ..

3. Depth estimation

Estimate a depth map for the image by running:

cd ml-depth-pro
python run.py --task_name [task_name]
cd ..

4. Reconstruct single objects

First, inpaint the objects by:

python occlusion.py --task_name [task_name]
python inpaint.py --task_name [task_name]

Then, reconstruct the 3D point cloud of each object by:

cd TRELLIS
python trellis.py --task_name [task_name]

5. Combine the objects

First, determine the rotation by:

python rotation.py --task_name [task_name]
cd ..
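The coarse-to-fine rotation scheme described in the abstract can be sketched as a two-stage angle search: a coarse sweep over the full circle, then a fine sweep around the coarse winner. The step sizes and the scoring callable here are illustrative assumptions, not the repository's actual implementation:

```python
def coarse_to_fine_rotation(score, coarse_step=30.0, fine_step=1.0):
    """Find the angle (in degrees) maximizing `score` via a two-stage search.

    `score` is any callable mapping an angle to a similarity value, e.g.
    how well the rotated object's render matches the image crop.
    """
    # Coarse stage: scan the full circle with a large step.
    coarse = [c * coarse_step for c in range(int(360 / coarse_step))]
    best = max(coarse, key=score)
    # Fine stage: refine around the coarse winner with a small step.
    lo, hi = best - coarse_step, best + coarse_step
    n = int((hi - lo) / fine_step)
    fine = [lo + i * fine_step for i in range(n + 1)]
    return max(fine, key=score) % 360.0

# Toy score peaking at 137 degrees (for illustration only).
def toy_score(a):
    return -abs((a - 137.0 + 180.0) % 360.0 - 180.0)

print(coarse_to_fine_rotation(toy_score))  # 137.0
```

The coarse stage keeps the number of expensive score evaluations small, while the fine stage recovers the accuracy a dense sweep would give.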

Then, select points for depth alignment:

python select_points.py --task_name [task_name]

Finally, combine the objects together:

python combine_objects.py --task_name [task_name]
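The outlier-removal-based translation described in the abstract can be illustrated with a small sketch: per-point depth residuals between the scene and the object are filtered before taking the median, so a few bad depth estimates (e.g. on occlusion boundaries) cannot skew the placement, and no iterative optimization is needed. The function name and the IQR filtering rule are assumptions for illustration, not the repository's code:

```python
import statistics

def robust_offset(scene_depths, object_depths, k=1.5):
    """Estimate a single depth translation from noisy per-point pairs.

    Residuals outside k * IQR of the quartiles are treated as outliers
    and removed; the median of the remaining residuals then gives the
    translation in one step.
    """
    residuals = sorted(s - o for s, o in zip(scene_depths, object_depths))
    q1, _, q3 = statistics.quantiles(residuals, n=4)
    iqr = q3 - q1
    kept = [r for r in residuals if q1 - k * iqr <= r <= q3 + k * iqr]
    return statistics.median(kept)

scene = [1.9, 1.95, 2.0, 2.0, 2.0, 2.05, 2.1, 2.0, 2.0, 9.0]  # one bad depth
obj   = [0.5] * 10
print(robust_offset(scene, obj))  # 1.5 — the 9.0 outlier is discarded
```

A plain mean over the same residuals would be pulled toward the outlier; the filter-then-median scheme stays at the true offset.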

6. Combine with the background

python combine_scene.py --task_name [task_name]
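Conceptually, each reconstructed object is placed into the background by a similarity transform, p' = s · R · p + t, using the rotation, scale, and translation estimated in the previous steps. A minimal pure-Python sketch for a single point (restricted to a y-axis rotation for illustration; the actual scripts operate on full point clouds):

```python
import math

def place_point(p, angle_deg, scale, t):
    """Apply p' = s * R_y(angle) * p + t to a 3D point (x, y, z)."""
    a = math.radians(angle_deg)
    x, y, z = p
    rx = math.cos(a) * x + math.sin(a) * z   # rotate about the y axis
    rz = -math.sin(a) * x + math.cos(a) * z
    return (scale * rx + t[0], scale * y + t[1], scale * rz + t[2])

# Rotate 90 degrees about y, double the size, then shift by +1 in z.
print(place_point((1.0, 0.0, 0.0), 90.0, 2.0, (0.0, 0.0, 1.0)))
```

Because rotation, scale, and translation are solved separately (the divide-and-conquer strategy above), each factor of this transform comes from its own sub-task.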

🔎 Interactive Viewer

To view the combined 3D scene with an interactive viewer, replace [Point_Cloud_Path] below with the path to your 3D scene point cloud.

Windows

cd viewer_windows/bin
SIBR_gaussianViewer_app.exe -m [Point_Cloud_Path]

Ubuntu

First, install these dependencies:

# Dependencies
sudo apt install -y libglew-dev libassimp-dev libboost-all-dev libgtk-3-dev libopencv-dev libglfw3-dev libavdevice-dev libavcodec-dev libeigen3-dev libxxf86vm-dev libembree-dev
# Project setup
cd SIBR_viewers
cmake -Bbuild . -DCMAKE_BUILD_TYPE=Release # add -G Ninja to build faster
cmake --build build -j24 --target install
cd ..

To launch the viewer:

./<SIBR_install_dir>/bin/SIBR_gaussianViewer_app -m [Point_Cloud_Path]

Navigation in SIBR Viewer

The SIBR interface provides several ways to navigate the scene. By default, you start with an FPS navigator, which you can control with W, A, S, D, Q, E for camera translation and I, K, J, L, U, O for rotation. Alternatively, you can switch to a Trackball-style navigator (select it from the floating menu). You can also snap to a camera from the dataset with the Snap to button, or find the closest camera with Snap to closest. The floating menus also let you change the navigation speed. Use the Scaling Modifier to control the size of the displayed Gaussians, or to show the initial point cloud.

🔦 ToDo List

  • Release on arXiv.
  • Improve the code to support image resolutions other than (1024, 1024).
  • Interactive demos.

🤔 Limitations

  1. The segmentation results may need to be adjusted manually if the segmented objects are not exactly what is wanted.
  2. The inpainting module may occasionally produce suboptimal results.

💡 Citation

If you find this repo helpful, please consider citing:

@article{hu2025flashsculptormodular3d,
  title={Flash Sculptor: Modular 3D Worlds from Objects},
  author={Yujia Hu and Songhua Liu and Xingyi Yang and Xinchao Wang},
  journal={arXiv preprint arXiv:2504.06178},
  year={2025}
}

🔗 Related Projects

We thank the excellent open-source projects:

  • Grounded-Segment-Anything for the exceptional automatic segmentation performance;
  • Inpaint-Anything for the wonderful image inpainting performance;
  • VistaDream for the efficient and fast 3D scene generation;
  • Depth-Pro for accurate monocular depth estimation;
  • TRELLIS for the high-fidelity and fast single-object 3D generation;
  • StableDiffusion for its powerful image generation and inpainting capabilities;
  • 3D Gaussian Splatting for its groundbreaking approach to fast, high-quality 3D scene rendering and its SIBR real-time viewer, and DreamScene360 for its integration of the SIBR viewer.
