3D Video Models through Point Tracking, Reconstructing, and Forecasting
Abstract:
This thesis advances 3D video understanding by bridging reconstruction and dynamics forecasting from monocular video, with applications in robotics, autonomy, and immersive environments. We introduce a pipeline that translates 2D video into 4D scenes by combining object-centric tracking, learned 2D view-synthesis priors, and Gaussian splatting, recovering accurate geometry and motion even under occlusion. Building on this, we develop a multi-object joint optimization of dynamic 3D reconstructions that captures cross-object interactions and leverages object-centric generative priors, yielding state-of-the-art 4D reconstructions with consistent instance masks and temporally coherent appearance. Complementing this, we propose a scalable transformer-based diffusion model for forecasting 3D particle dynamics, enabling fast, action-conditioned planning via score-based guidance. Trained purely in simulation, the model generalizes to real-world objects reconstructed from monocular video and outperforms traditional planners in both accuracy and efficiency. Together, these contributions unify 3D reconstruction and generative dynamics modeling into an interactive framework for reasoning and control in complex physical scenes.
Thesis Committee Members:
Katerina Fragkiadaki, Chair
Kris Kitani
Shubham Tulsiani
Kosta Derpanis, York University
