
Abstract:
Building robots that can perceive, reason, and act across a wide range of objects and environments remains a central goal in robotics. To achieve such generalization without relying on large amounts of task-specific data, a robot must be able to predict how the world will evolve in response to its actions. In this thesis, we investigate how to learn such future prediction models, or world models, across varying object and scene complexities. We hypothesize that a unified predictive representation can be learned that encodes both geometric and dynamic priors and can be reused for downstream tasks.
We begin with a single vector representation that captures the dynamics of entire scene point clouds in the context of autonomous driving. Despite the high-dimensional and multimodal nature of the input, our model encodes stochastic motion priors using a conditional variational autoencoder (CVAE), where the latent variable captures future scene evolution conditioned on past observations and current scene geometry. To model long-range dependencies, we incorporate recurrent temporal modeling in the latent space and use hierarchical spatial encoders to preserve structure and object-level motion across frames. This predictive representation proves effective for downstream detection and tracking tasks, highlighting its utility for multi-agent scene understanding.
Beyond rigid objects, we extend our framework to model deformable dynamics using 3D Gaussians, enabling the integration of dynamics learning and perception. Our proposed framework, PLOP (Particle Filtering for Learning Object Physics), learns deformable object dynamics from multi-view RGB-D videos by optimizing a learnable dynamics function over a structured 3D latent state space; a resampling mechanism based on Gaussian splitting and merging mitigates particle degeneracy. Once trained, the dynamics model can be used for downstream control via model-based reinforcement learning.
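To make the conditional latent-variable structure underlying the first framework concrete, the following is a minimal sketch of a CVAE-style future predictor in PyTorch: a conditional prior over the latent given the past, a posterior that additionally sees the encoded future during training, and a decoder that rolls the latent forward. All module names, dimensions, and the use of pre-encoded feature vectors rather than raw point clouds are illustrative assumptions, not the architecture used in this work.

# Minimal CVAE-style future predictor (illustrative sketch only; all names,
# sizes, and input representations are assumptions, not the thesis's model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentFuturePredictor(nn.Module):
    def __init__(self, obs_dim=256, latent_dim=32):
        super().__init__()
        # Conditional prior over the latent "future evolution" variable, given the past.
        self.prior_net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, 2 * latent_dim))
        # Posterior additionally conditions on the encoded future at train time.
        self.post_net = nn.Sequential(nn.Linear(2 * obs_dim, 128), nn.ReLU(),
                                      nn.Linear(128, 2 * latent_dim))
        # Decoder maps past encoding + sampled latent to the predicted future encoding.
        self.decoder = nn.Sequential(nn.Linear(obs_dim + latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, obs_dim))

    def _sample(self, stats):
        # Reparameterized sample from a diagonal Gaussian given stacked (mu, logvar).
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

    def forward(self, past_feat, future_feat):
        # Training pass: sample z from the posterior, reconstruct the future,
        # and regularize the posterior toward the conditional prior.
        z, mu_q, logvar_q = self._sample(
            self.post_net(torch.cat([past_feat, future_feat], dim=-1)))
        mu_p, logvar_p = self.prior_net(past_feat).chunk(2, dim=-1)
        pred = self.decoder(torch.cat([past_feat, z], dim=-1))
        recon = F.mse_loss(pred, future_feat)
        # KL divergence between two diagonal Gaussians: KL(q || p).
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                    - 1).sum(-1).mean()
        return recon + kl

    @torch.no_grad()
    def predict(self, past_feat):
        # Test-time prediction: the future is unobserved, so sample from the prior;
        # repeated sampling yields multiple plausible futures.
        z, _, _ = self._sample(self.prior_net(past_feat))
        return self.decoder(torch.cat([past_feat, z], dim=-1))

In this sketch, sampling from the learned prior at test time is what makes the prediction stochastic: drawing several latents for the same past observation produces a set of plausible future scene evolutions rather than a single deterministic rollout.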
Looking ahead, we propose learning generative object interaction priors from human demonstrations. We aim to scale object-centric dynamics learning by incorporating human-object interaction data and modeling multimodal object interactions across a variety of material types. These directions will further integrate semantics, geometry, and physical reasoning into a unified representation space that supports generalization across object categories and tasks.
Thesis Committee Members:
Kris Kitani (Chair)
David Held
Shubham Tulsiani
Brian Okorn (The Robotics and AI Institute)