Flash Sculptor: Modular 3D Worlds from Objects
Yujia Hu, Songhua Liu, Xingyi Yang, and Xinchao Wang
Learning and Vision Lab, National University of Singapore
Abstract: Existing text-to-3D and image-to-3D models often struggle with complex scenes involving multiple objects and intricate interactions. Although some recent attempts have explored such compositional scenarios, they still require an extensive process of optimizing the entire layout, which is cumbersome at best and infeasible at worst. To overcome these challenges, we propose Flash Sculptor in this paper, a simple yet effective framework for compositional 3D scene/object reconstruction from a single image. At the heart of Flash Sculptor lies a divide-and-conquer strategy, which decouples compositional scene reconstruction into a sequence of sub-tasks, including handling the appearance, rotation, scale, and translation of each individual instance. Specifically, for rotation, we introduce a coarse-to-fine scheme that combines efficiency with accuracy, while for translation, we develop an outlier-removal-based algorithm that yields robust and precise parameters in a single step, without any iterative optimization. Extensive experiments demonstrate that Flash Sculptor achieves at least a 3x speedup over existing compositional 3D methods, while setting new benchmarks in compositional 3D reconstruction performance.
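The repository implements these steps in the scripts below; as a self-contained toy illustration of the two ideas named in the abstract, the following sketch recovers a single yaw angle by a coarse-to-fine grid search and a translation from the median of per-point offsets after IQR outlier removal. The IQR rule and all function names here are our stand-ins for illustration, not the repository's actual API or the paper's exact criterion.

```python
import numpy as np

def yaw_matrix(theta):
    """Rotation about the y (up) axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def coarse_to_fine_yaw(src, dst, coarse_steps=24, fine_steps=21):
    """Coarse grid over [0, 2*pi), then a fine grid around the best coarse angle."""
    def err(theta):
        return np.mean(np.sum((src @ yaw_matrix(theta).T - dst) ** 2, axis=1))
    coarse = np.linspace(0.0, 2.0 * np.pi, coarse_steps, endpoint=False)
    best = min(coarse, key=err)
    half = np.pi / coarse_steps  # one coarse half-step on each side
    fine = np.linspace(best - half, best + half, fine_steps)
    return min(fine, key=err)

def robust_translation(src, dst):
    """Median translation after dropping IQR outliers in the per-point offsets."""
    d = dst - src
    q1, q3 = np.percentile(d, 25, axis=0), np.percentile(d, 75, axis=0)
    iqr = q3 - q1
    keep = np.all((d >= q1 - 1.5 * iqr) & (d <= q3 + 1.5 * iqr), axis=1)
    return np.median(d[keep], axis=0)

# Toy check: recover a known yaw and translation of a random point cloud.
rng = np.random.default_rng(0)
src = rng.normal(size=(200, 3))
true_theta, true_t = 0.7, np.array([0.5, -1.0, 2.0])
dst = src @ yaw_matrix(true_theta).T + true_t

# Rotation is estimated on centered clouds so that translation drops out.
theta_est = coarse_to_fine_yaw(src - src.mean(axis=0), dst - dst.mean(axis=0))
t_est = robust_translation(src @ yaw_matrix(theta_est).T, dst)
print(theta_est, t_est)
```

The fine grid spans one coarse half-step on each side of the coarse winner, so a single refinement pass already brings the angular error below the fine-grid spacing without iterative optimization.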
Our Pipeline:
- Ubuntu 20.04
- CUDA 12.2
- Python 3.10.12
- PyTorch 2.4.0
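A quick, optional sanity check of the environment above (a convenience sketch of ours, not a repo script; the torch import is guarded since PyTorch may not be installed yet):

```python
import sys

def env_report():
    """Collect interpreter and (if present) PyTorch/CUDA versions."""
    info = {"python": sys.version.split()[0]}
    try:
        import torch  # installed later; see INSTALL.md
        info["torch"] = torch.__version__
        info["cuda"] = torch.version.cuda
    except ImportError:
        info["torch"] = None
    return info

print(env_report())
```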
For complete installation instructions, please see INSTALL.md.
Please see DOWNLOAD.md to download pretrained models.
Follow these steps to get a composite 3D scene from a single image:
Obtain an image using the following command:

```shell
python t2i.py --task_name [task_name] --prompt [prompt]
```

Or you can simply put your own 2D image as results/[task_name]/2DImage.png.
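If you bring your own image, it just needs to land at the path above; a tiny stdlib sketch (a hypothetical helper of ours, not part of the repo) that stages it:

```python
import os
import shutil

def stage_image(src_path: str, task_name: str, results_root: str = "results") -> str:
    """Copy a user-provided image to results/<task_name>/2DImage.png,
    where the rest of the pipeline expects to find it."""
    task_dir = os.path.join(results_root, task_name)
    os.makedirs(task_dir, exist_ok=True)
    dst = os.path.join(task_dir, "2DImage.png")
    shutil.copyfile(src_path, dst)
    return dst
```

For example, `stage_image("my_photo.png", "living_room")` places the file at `results/living_room/2DImage.png`.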
Run the following command to segment the image and obtain the bounding box, mask, and label of each object:

```shell
python segment.py --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py --ram_checkpoint ram_swin_large_14m.pth --ram_plus_checkpoint ram_plus_swin_large_14m.pth --grounded_checkpoint groundingdino_swint_ogc.pth --sam_checkpoint sam_vit_h_4b8939.pth --sam_hq_checkpoint sam_hq_vit_h.pth --box_threshold 0.25 --text_threshold 0.2 --iou_threshold 0.5 --device "cuda" --task_name [task_name]
```

First, recover the background by running:

```shell
cd Inpaint-Anything
python background_recover.py --task_name [task_name] --dilate_kernel_size 15 --lama_config ./lama/configs/prediction/default.yaml --lama_ckpt ./pretrained_models/big-lama
cd ..
```

Then, reconstruct its 3D scene using:
```shell
cd VistaDream
python vistadream.py --task_name [task_name]
cd ..
```

Next, estimate the depth of the scene:

```shell
cd ml-depth-pro
python run.py --task_name [task_name]
cd ..
```

First, inpaint the objects by:
```shell
python occlusion.py --task_name [task_name]
python inpaint.py --task_name [task_name]
```

Then, reconstruct the 3D point cloud of each object by:
```shell
cd TRELLIS
python trellis.py --task_name [task_name]
```

First, determine the rotation by:
```shell
python rotation.py --task_name [task_name]
cd ..
```

Then, select points for depth alignment:
```shell
python select_points.py --task_name [task_name]
```

Finally, combine the objects together:

```shell
python combine_objects.py --task_name [task_name]
python combine_scene.py --task_name [task_name]
```

To view the combined 3D scene with an interactive viewer:
Replace [Point_Cloud_Path] with the path of your 3D scene point cloud.

```shell
cd viewer_windows/bin
SIBR_gaussianViewer_app.exe -m [Point_Cloud_Path]
```

First, install these dependencies:
```shell
# Dependencies
sudo apt install -y libglew-dev libassimp-dev libboost-all-dev libgtk-3-dev libopencv-dev libglfw3-dev libavdevice-dev libavcodec-dev libeigen3-dev libxxf86vm-dev libembree-dev

# Project setup
cd SIBR_viewers
cmake -Bbuild . -DCMAKE_BUILD_TYPE=Release # add -G Ninja to build faster
cmake --build build -j24 --target install
cd ..
```
To launch the viewer:
```shell
./<SIBR_install_dir>/bin/SIBR_gaussianViewer_app -m [Point_Cloud_Path]
```
The SIBR interface provides several ways to navigate the scene. By default, you start with an FPS navigator, controlled with W, A, S, D, Q, E for camera translation and I, K, J, L, U, O for rotation. Alternatively, you can switch to a Trackball-style navigator (select it from the floating menu). You can also snap to a camera from the dataset with the Snap to button, or jump to the closest camera with Snap to closest. The floating menus also let you change the navigation speed. Use the Scaling Modifier to control the size of the displayed Gaussians, or show the initial point cloud.
- Release on arXiv.
- Improve the code to support image resolutions other than (1024, 1024).
- Interactive demos.
- The segmentation result may need to be manually adjusted if the segmented objects are not exactly what you want.
- The inpainting module may occasionally produce suboptimal results.
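For the first limitation, one crude but effective manual fix is editing the binary mask directly, e.g. zeroing out a wrongly included region. An illustrative NumPy sketch (not a repo utility):

```python
import numpy as np

def erase_region(mask, y0, y1, x0, x1):
    """Zero out a rectangular region of a binary (H, W) mask, e.g. when the
    segmenter grabbed part of a neighboring object."""
    out = mask.copy()  # leave the original mask untouched
    out[y0:y1, x0:x1] = 0
    return out

mask = np.ones((8, 8), dtype=np.uint8)
fixed = erase_region(mask, 0, 4, 0, 4)  # drop the top-left quadrant
print(int(fixed.sum()))  # 64 - 16 = 48
```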
If you find this repo helpful, please consider citing:

```bibtex
@article{hu2025flashsculptormodular3d,
  title={Flash Sculptor: Modular 3D Worlds from Objects},
  author={Yujia Hu and Songhua Liu and Xingyi Yang and Xinchao Wang},
  journal={arXiv preprint arXiv:2504.06178},
  year={2025}
}
```
We thank the excellent open-source projects:
- Grounded-Segment-Anything for the exceptional automatic segmentation performance;
- Inpaint-Anything for the wonderful image inpainting performance;
- VistaDream for the efficient and fast 3D scene generation;
- Depth-Pro for accurate monocular depth estimation;
- TRELLIS for the high-fidelity and fast single-object 3D generation;
- StableDiffusion for its powerful image generation and inpainting capabilities;
- 3D Gaussian Splatting and the SIBR real-time viewer for their groundbreaking approach to fast, high-quality 3D scene rendering, and DreamScene360 for its integration of the SIBR viewer.

