
3D Video Models through Point Tracking, Reconstructing and Forecasting

PhD Thesis, Tech. Report CMU-RI-TR-25-79, August 2025

Abstract

From autonomous driving to household robotics and embodied AI, intelligent systems must build internal models of the world: mental representations that allow them to perceive their surroundings, reason about unobserved structure, and plan future interactions. Humans do this effortlessly from vision alone, but enabling machines to do the same from raw video remains a core challenge. Sparse viewpoints, occlusions, and complex dynamics make it difficult to recover mental models of scenes that are both geometrically accurate and physically actionable. This thesis addresses this challenge by developing a unified framework that allows agents to see the world, imagine its hidden structure and dynamics, and plan interactions within it, ultimately building mental world models from monocular videos.

To see and imagine the observed video in 4D, we develop DreamScene4D, an optimization-based pipeline that fuses learned object tracking priors and generative image priors with 4D Gaussian splatting to decompose video into object-centric 3D shapes and motion trajectories. By independently optimizing the 3D Gaussians of each object to minimize both the rendering error and a score distillation sampling (SDS) loss, and then composing the Gaussians post-optimization, our method yields accurate geometry and precise motion estimates, outperforming prior NeRF-based and Gaussian-splatting-based approaches on complex real-world videos.
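As a rough illustration of the per-object objective described above, the Python sketch below pairs a masked rendering loss on observed pixels with an SDS-style term from a frozen image diffusion prior. The render_object and sds_gradient helpers, tensor shapes, and loss weight are illustrative placeholders, not DreamScene4D's actual implementation.

import torch

def render_object(gaussians, frame_idx):
    # Placeholder for differentiable Gaussian splatting of a single object:
    # it ignores geometry and time and only exercises the optimization loop,
    # returning a 3 x 64 x 64 "render" that depends on the Gaussian features.
    return torch.sigmoid(gaussians["features"]).mean() * torch.ones(3, 64, 64)

def sds_gradient(rendered):
    # Placeholder for the score-distillation gradient from a frozen image
    # diffusion prior; in practice it comes from noising the render and
    # querying the denoiser with an object-specific prompt.
    return 0.01 * torch.randn_like(rendered)

def optimize_object(gaussians, frames, masks, steps=500, lr=1e-2, sds_weight=0.1):
    opt = torch.optim.Adam([gaussians["means"], gaussians["features"]], lr=lr)
    for step in range(steps):
        t = step % len(frames)
        rendered = render_object(gaussians, t)
        # Rendering loss on pixels where this object is visible in the video.
        render_loss = ((rendered - frames[t]) * masks[t]).abs().mean()
        # Standard SDS surrogate: multiply the render by the detached score
        # gradient so that backprop injects that gradient into the Gaussians.
        sds_loss = (rendered * sds_gradient(rendered).detach()).sum()
        loss = render_loss + sds_weight * sds_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return gaussians

# Toy usage with random data standing in for a tracked, masked object.
gaussians = {
    "means": torch.randn(1024, 3, requires_grad=True),     # 3D positions
    "features": torch.randn(1024, 3, requires_grad=True),  # per-Gaussian color
}
frames = [torch.rand(3, 64, 64) for _ in range(8)]          # per-frame RGB crops
masks = [torch.ones(1, 64, 64) for _ in range(8)]           # visibility masks
optimize_object(gaussians, frames, masks, steps=10)

In the actual pipeline, each object is optimized this way independently, and the resulting object Gaussians are composed into a single scene only after optimization.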

Although DreamScene4D's decomposition enables precise reconstruction and completion of individual objects, its post hoc composition can lead to inconsistencies in the spatial layout and depth ordering of the scene. To address this, we introduce GenMOJO, a joint multi-object optimization framework that unifies scene-level rendering with object-centric imagination. By optimizing all objects together to match full-scene observations, GenMOJO faithfully preserves inter-object spatial relationships and occlusion structure. Simultaneously, it applies score distillation sampling in local object frames to hallucinate unseen geometry and motion. This combination of global supervision and localized generative priors yields coherent, high-fidelity 4D representations that outperform per-object methods in both visual quality and trajectory accuracy.
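The contrast with the per-object recipe can be made concrete with a sketch of a joint objective of the kind this paragraph describes: a full-scene photometric term over a depth-ordered composite of all objects, plus per-object SDS terms in local frames. The alpha_composite and sds_gradient helpers and the loss weight are assumptions for illustration, not GenMOJO's released code.

import torch

def sds_gradient(image):
    # Placeholder score-distillation gradient from a frozen diffusion prior.
    return 0.01 * torch.randn_like(image)

def alpha_composite(layers):
    # Back-to-front "over" compositing of per-object RGBA renders; stands in
    # for depth-ordered rasterization of all objects into one scene image.
    out = torch.zeros(3, 64, 64)
    for rgb, alpha in layers:
        out = alpha * rgb + (1.0 - alpha) * out
    return out

def joint_loss(object_renders, local_renders, frame_rgb, sds_weight=0.1):
    # Global term: the composed scene must match the observed video frame,
    # which couples the objects' layout, relative scale, and occlusion order.
    scene_loss = (alpha_composite(object_renders) - frame_rgb).abs().mean()
    # Local term: score distillation applied per object, in its own frame,
    # hallucinates geometry and motion the video never shows.
    sds_loss = sum((r * sds_gradient(r).detach()).sum() for r in local_renders)
    return scene_loss + sds_weight * sds_loss

# Toy usage with random tensors standing in for differentiable renders.
object_renders = [(torch.rand(3, 64, 64, requires_grad=True), torch.rand(1, 64, 64))
                  for _ in range(3)]
local_renders = [rgb for rgb, _ in object_renders]
loss = joint_loss(object_renders, local_renders, torch.rand(3, 64, 64))
loss.backward()

Because the photometric term is computed on the composed image, gradients from layout and occlusion errors flow to every object at once, which is what keeps relative placement and depth ordering consistent across the scene.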

Finally, we present ParticleWorldDiffuser, a transformer-based diffusion model for forecasting 3D particle trajectories in learned scene representations. The action-conditioned variant predicts object displacements given control sequences, enabling model-predictive control, while the unconditional variant generates both action and object trajectories via score-based sampling for efficient, rollout-free planning. Trained in simulation but evaluated on real-world object reconstructions, our model generalizes across novel objects and scenes, demonstrating effective sim-to-real transfer.
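The sketch below shows the general shape of the action-conditioned variant described here: a transformer denoiser over flattened particle trajectories and a DDPM-style sampling loop that produces a rollout given a control sequence. The class, dimensions, and noise schedule are illustrative assumptions, not ParticleWorldDiffuser itself; the unconditional variant would additionally denoise the action channels.

import torch
import torch.nn as nn

class TrajectoryDenoiser(nn.Module):
    # Small transformer that predicts the noise added to a flattened particle
    # trajectory of shape (batch, horizon, num_particles * 3), conditioned on
    # the action sequence and the diffusion timestep.
    def __init__(self, traj_dim, action_dim, width=128):
        super().__init__()
        self.in_proj = nn.Linear(traj_dim + action_dim + 1, width)
        layer = nn.TransformerEncoderLayer(d_model=width, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out_proj = nn.Linear(width, traj_dim)

    def forward(self, noisy_traj, actions, t):
        # Broadcast the diffusion timestep to every step of the horizon.
        t_emb = t.float().view(-1, 1, 1).expand(-1, noisy_traj.shape[1], 1)
        x = torch.cat([noisy_traj, actions, t_emb], dim=-1)
        return self.out_proj(self.encoder(self.in_proj(x)))

@torch.no_grad()
def sample_trajectory(model, actions, traj_dim, steps=50):
    # Simplified DDPM ancestral sampling: start from Gaussian noise and
    # iteratively denoise into a trajectory consistent with the actions.
    b, horizon, _ = actions.shape
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(b, horizon, traj_dim)
    for t in reversed(range(steps)):
        eps = model(x, actions, torch.full((b,), t))
        x = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x  # predicted per-particle displacements over the horizon

# Toy usage: 128 particles (384 coordinates), a 7-dim action, 16-step horizon.
model = TrajectoryDenoiser(traj_dim=384, action_dim=7)
actions = torch.randn(1, 16, 7)
future = sample_trajectory(model, actions, traj_dim=384)

For model-predictive control, rollouts sampled under candidate action sequences can be scored by a task cost and the best sequence executed; the unconditional variant instead samples actions and trajectories jointly, which is what enables the rollout-free planning mentioned above.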

Together, these contributions form a cohesive pipeline for building mental world models from video observations. By grounding generative models in real video and validating their utility for physical prediction, this thesis takes a step toward embodied systems that can perceive and interact with the world through the lens of video, a capability essential for the next generation of intelligent agents.

BibTeX

@phdthesis{Chu-2025-148292,
author = {Wen-Hsuan Chu},
title = {3D Video Models through Point Tracking, Reconstructing and Forecasting},
year = {2025},
month = {August},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-25-79},
}