We pursue three complementary directions. First, in the absence of foundational priors, we build such priors ourselves in a self-supervised manner via next-timestep prediction on sequences of 3D LiDAR sweeps of dynamic scenes. Importantly, we show that bottlenecking next-timestep prediction through a 4D representation is crucial. We find that the resulting forecasting model can be used for downstream motion planning for autonomous vehicles, substantially reducing collision rates.
Second, we capitalize on foundational priors in a zero-shot manner. We turn to large reconstruction models that predict per-pixel depth for images and videos. We use these to solve two underconstrained tasks: (1) tracking objects across occlusions in 2.5D, and (2) reconstructing dynamic scenes from sparse views. In both settings, we find that one can substantially outperform the prior state of the art by incorporating additional scene cues in the form of data-driven depth priors.
Third, we exploit foundational priors via finetuning. We specifically look at video diffusion models and reformulate amodal perception and dynamic novel-view synthesis as self-supervised tasks that video models excel at, namely inpainting. We find that finetuning video diffusion models is surprisingly lightweight in terms of both data and compute. This suggests that concepts akin to human visual perception are already embedded in foundation models, which only need to be “controlled” to perform other tasks.
Together, these contributions highlight how one can build, leverage, and adapt foundational priors for spatiotemporal perception in a scalable manner; this scalability is enabled by relying increasingly on internet-scale 2D data and by carefully designing self-supervised learning objectives.
Thesis Committee Members:
Deva Ramanan, Chair
Shubham Tulsiani
Katerina Fragkiadaki
Carl Vondrick, Columbia
Leonidas Guibas, Stanford & Google
