ludekcizinsky/master-thesis

🚧 I have finished the thesis, but I still need to polish this repo, so stay tuned.

About

We live and interact with a dynamic 3D world, yet we primarily capture it through 2D cameras that produce large collections of images and videos. This raises a central question: how can we leverage abundant 2D visual data to understand and reconstruct the 3D world in digital form? Accurate reconstruction of dynamic 3D scenes has many applications, from embodied intelligence to new ways of creating and interacting with content. This thesis explores this question by building on recent advances in feed-forward models for 3D reconstruction from images and videos, together with progress in neural rendering.

Specifically, we propose a hybrid framework for reconstructing human-centric dynamic 3D scenes from a single monocular video. Our key insight is that modern feed-forward methods are fast but can be inaccurate, while optimization-based neural rendering can achieve high-quality reconstructions under suitable capture conditions, albeit at higher computational cost. We therefore initialize the scene using off-the-shelf estimates of camera motion, SMPL-X parameters, instance masks, depth, and a canonical per-person 3D Gaussian Splatting (3DGS) representation. We then refine this initialization with a lightweight two-stage optimization that first improves pose parameters and then improves appearance by optimizing the explicit 3DGS. To densify multi-view supervision from a single input video, we synthesize additional training views and refine them with a diffusion-based model, which provides pseudo ground truth supervision for the second stage.
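The two-stage schedule described above (pose first, then appearance, each with the other held fixed) can be illustrated with a toy 1D analogue. This is only a sketch under strong assumptions, not the thesis implementation: the "scene" is a sum of 1D Gaussians, the "pose" is a single global shift, the "appearance" is one amplitude per Gaussian, and the function names, learning rates, and step counts are all hypothetical choices made for illustration.

```python
import numpy as np

def render(x, centers, shift, amps, sigma):
    """Toy renderer: sum of shifted 1D Gaussians (stand-in for splatting)."""
    u = x[:, None] - (centers[None, :] + shift)   # (P, N) offsets to each Gaussian
    basis = np.exp(-0.5 * (u / sigma) ** 2)       # (P, N) per-Gaussian footprint
    return basis @ amps, basis, u

def two_stage_refine(x, target, centers, shift, amps, sigma=0.4,
                     steps=200, lr_pose=1.0, lr_app=5.0):
    """Gradient descent on mean squared error, one parameter group per stage."""
    amps = amps.copy()
    # Stage 1: refine the pose (global shift) with appearance frozen.
    for _ in range(steps):
        image, basis, u = render(x, centers, shift, amps, sigma)
        residual = image - target
        d_image_d_shift = (basis * u / sigma ** 2) @ amps
        shift -= lr_pose * np.mean(2.0 * residual * d_image_d_shift)
    # Stage 2: refine the appearance (amplitudes) with the pose frozen.
    for _ in range(steps):
        image, basis, _ = render(x, centers, shift, amps, sigma)
        residual = image - target
        amps -= lr_app * (2.0 * basis.T @ residual) / x.size
    return shift, amps

# Synthetic target generated from known parameters (illustrative values only).
x = np.linspace(0.0, 10.0, 200)
centers = np.array([3.0, 5.0, 7.0])
true_shift, true_amps = 0.4, np.array([1.0, 0.5, 0.8])
target, _, _ = render(x, centers, true_shift, true_amps, 0.4)

# Start from a deliberately wrong initialization, as a feed-forward guess might be.
shift, amps = two_stage_refine(x, target, centers, 0.0, np.full(3, 0.8))
```

In this toy setting the shift can be recovered even while the amplitudes are still wrong, which is what makes the staged order workable here; whether the same holds in a real scene depends on the quality of the initialization.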

We evaluate our approach on three tasks. First, we assess novel view synthesis quality to demonstrate free-viewpoint rendering. Second, we evaluate pose estimation to measure human motion and interaction quality. Third, we evaluate human mesh reconstruction using TSDF fusion of rendered depth maps. Overall, our method achieves competitive results while being substantially faster to train than prior optimization-heavy pipelines. In our setup, training completes in tens of minutes per scene on a single GPU, whereas prior work reports runtimes on the order of hours to days.
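For readers unfamiliar with TSDF fusion, the following is a minimal NumPy sketch of the standard weighted-average update that fuses one depth map into a voxel grid. It is an illustration under assumptions, not the evaluation code used in the thesis: the function name `integrate`, the pinhole intrinsics, the grid extent, and the truncation distance are all hypothetical, and a real pipeline would fuse many rendered views and then extract a mesh from the zero level set (for example with marching cubes).

```python
import numpy as np

def integrate(tsdf, weight, voxel_xyz, depth, K, world_to_cam, trunc):
    """Fuse one depth map into a running TSDF, in place (weighted average)."""
    # Transform voxel centers from world to camera coordinates.
    cam = voxel_xyz @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    z = cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)           # avoid division by zero
    # Pinhole projection to the nearest pixel.
    u = np.round(K[0, 0] * cam[:, 0] / safe_z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * cam[:, 1] / safe_z + K[1, 2]).astype(int)
    h, w = depth.shape
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = np.zeros_like(z)
    d[valid] = depth[v[valid], u[valid]]
    sdf = d - z                                    # positive in front of the surface
    # Skip pixels without depth and voxels far behind the observed surface.
    valid &= (d > 0) & (sdf >= -trunc)
    new = np.clip(sdf / trunc, -1.0, 1.0)
    tsdf[valid] = (tsdf[valid] * weight[valid] + new[valid]) / (weight[valid] + 1.0)
    weight[valid] += 1.0

# Synthetic example: a fronto-parallel plane at depth 1.0, camera at the origin.
xs = ys = np.linspace(-0.2, 0.2, 9)
zs = np.linspace(0.5, 1.5, 41)
voxel_xyz = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), axis=-1).reshape(-1, 3)
tsdf = np.ones(len(voxel_xyz))
weight = np.zeros(len(voxel_xyz))
K = np.array([[64.0, 0.0, 32.0], [0.0, 64.0, 32.0], [0.0, 0.0, 1.0]])
depth = np.full((64, 64), 1.0)

integrate(tsdf, weight, voxel_xyz, depth, K, np.eye(4), trunc=0.1)
grid = tsdf.reshape(9, 9, 41)
wgrid = weight.reshape(9, 9, 41)
```

Along the central ray, `grid[4, 4, :]` changes sign at the voxel nearest depth 1.0, which is where a mesh extractor would place the surface. Voxels more than `trunc` behind the plane keep zero weight because this view never observes them.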

We believe our work is an important step towards democratizing dynamic 3D reconstruction from monocular videos, making it more accessible and practical for a wider range of applications. We hope this framework will inspire further research and enable new applications in computer vision and graphics.

Monocular 4D reconstruction of human-centric scenes