Building 4D Models of Objects and Scenes from Monocular Videos

PhD Thesis, Tech. Report CMU-RI-TR-23-54, July 2023

Abstract

This thesis studies how to infer the time-varying 3D structure of generic deformable objects and dynamic scenes from monocular videos. In a casual capture setup without sufficient sensor observations or rich 3D supervision, one must tackle the challenges of registration, scale ambiguity, and limited views. Inspired by analysis-by-synthesis, we set up an inverse graphics problem and solve it with generic data-driven priors. Inverse graphics models approximate the true generative process of a video with differentiable operations (e.g., differentiable rendering and physics simulation), allowing one to inject prior knowledge about the physical world. Generic data-driven priors (e.g., motion correspondences, pixel descriptors, and viewpoints) provide guidance for registering pixels to a canonical 3D space, which allows observations to be fused over time and across similar instances. Building upon these ideas, we develop methods that capture 4D models of deformable objects and dynamic scenes from in-the-wild video footage.
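To make the analysis-by-synthesis idea concrete, the sketch below shows a toy inverse-graphics loop in PyTorch: a canonical 3D point set plus per-frame deformation offsets is rendered into soft silhouettes by a differentiable projection, and both are fit to target masks by gradient descent. All names, shapes, and the Gaussian-splat renderer here are illustrative assumptions for this page, not the thesis implementation, and the data-driven priors that guide registration in the actual work are omitted.

# Toy analysis-by-synthesis sketch (illustrative only, not the thesis code).
# A canonical 3D point set and per-frame deformation offsets are optimized so
# that their differentiable silhouette renderings match target masks.
import torch

H, W, T, N = 64, 64, 8, 256  # image size, number of frames, number of points

# Learnable canonical shape (initialized in front of a toy camera) and per-frame offsets.
canonical = (torch.randn(N, 3) * 0.3 + torch.tensor([0.0, 0.0, 3.0])).requires_grad_()
deform = torch.zeros(T, N, 3, requires_grad=True)

def project(points):
    # Toy pinhole projection from 3D points (N, 3) to pixel coordinates (N, 2).
    z = points[:, 2:3].clamp(min=1.0)
    xy = points[:, :2] / z
    return (xy * 0.5 + 0.5) * torch.tensor([W - 1.0, H - 1.0])

# Pixel grid used to splat each projected point as an isotropic Gaussian,
# making the rendered silhouette differentiable w.r.t. point positions.
ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                        torch.arange(W, dtype=torch.float32), indexing="ij")
grid = torch.stack([xs, ys], dim=-1)                        # (H, W, 2)

def render_silhouette(points, sigma=1.5):
    uv = project(points)                                    # (N, 2)
    d2 = ((grid[None] - uv[:, None, None]) ** 2).sum(-1)    # (N, H, W) squared distances
    alpha = torch.exp(-d2 / (2 * sigma ** 2))               # per-point soft splats
    return 1.0 - torch.prod(1.0 - alpha, dim=0)             # soft union over points

# Toy targets: a blob translating over time, standing in for video object masks.
with torch.no_grad():
    gt = torch.randn(N, 3) * 0.2 + torch.tensor([0.0, 0.0, 3.0])
    targets = torch.stack([render_silhouette(gt + torch.tensor([0.05 * t, 0.0, 0.0]))
                           for t in range(T)])

# Inverse-graphics loop: render, compare against observations, backpropagate, update.
opt = torch.optim.Adam([canonical, deform], lr=1e-2)
for step in range(200):
    opt.zero_grad()
    loss = sum(torch.nn.functional.mse_loss(render_silhouette(canonical + deform[t]),
                                            targets[t]) for t in range(T))
    loss.backward()
    opt.step()
print("final silhouette loss:", float(loss))

A full system along the lines described above would additionally rely on the data-driven priors (motion correspondences, pixel descriptors, viewpoints) to guide registration to the canonical space, which this toy loop leaves out.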

BibTeX

@phdthesis{Yang-2023-137118,
  author = {Gengshan Yang},
  title = {Building 4D Models of Objects and Scenes from Monocular Videos},
  year = {2023},
  month = {July},
  school = {Carnegie Mellon University},
  address = {Pittsburgh, PA},
  number = {CMU-RI-TR-23-54},
  keywords = {4D Reconstruction from Videos, Inverse Graphics, Dynamic Scene Perception},
}