Towards 4D perception with foundational priors

PhD Thesis Defense

Tarasha Khurana, PhD Student, Robotics Institute, Carnegie Mellon University
Tuesday, September 23
12:00 pm to 1:30 pm
Newell-Simon Hall 3305
Towards 4D perception with foundational priors
Abstract:
As humans, we are constantly interacting with and observing a three-dimensional, dynamic world. Building this spatiotemporal or 4D understanding into vision algorithms is not straightforward, as there is orders of magnitude less 4D data than 2D images and videos. This underscores the need to find meaningful ways to exploit 2D data to realize 4D tasks. Recent advances in building “foundation models” — which have learnt generative and structural priors in a data-driven manner from internet-scale data — have given us access to these rich real-world priors for free. In this thesis, we investigate how one can tune these priors for 4D perception tasks such as amodal tracking and completion, dynamic reconstruction, and next-timestep prediction.

We pursue three complementary directions. First, in the absence of foundational priors, we build such priors ourselves in a self-supervised manner via the task of next-timestep prediction, using sequences of 3D LiDAR sweeps of dynamic scenes. Importantly, we show that bottlenecking next-timestep prediction with a 4D representation is crucial. We find that such a forecasting model can be used for downstream motion planning for autonomous vehicles, where it substantially reduces collision rates.

Second, we capitalize on foundational priors in a zero-shot manner. We turn to large reconstruction models that predict per-pixel depth for images and videos, and use them to solve two underconstrained tasks — (1) tracking objects across occlusions in 2.5D, and (2) reconstructing dynamic scenes from sparse views. In both settings, we find that one can do drastically better than the prior state of the art by using additional scene cues in the form of data-driven depth priors.

Third, we exploit foundational priors via finetuning. We specifically look at video diffusion models and reformulate amodal perception and dynamic novel-view synthesis into self-supervised tasks that video models are good at, i.e., inpainting. We find that it is surprisingly lightweight, in terms of data and compute, to finetune video diffusion models. This suggests that concepts similar to those in human visual perception are already embedded in foundation models, which only have to be “controlled” to perform other tasks.

Together, these contributions highlight how one can build, leverage, and adapt foundational priors for spatiotemporal perception in a scalable manner — a scale enabled by relying increasingly on internet-scale 2D data and carefully designing self-supervised objectives for learning.

Thesis Committee Members:

Deva Ramanan, Chair

Shubham Tulsiani

Katerina Fragkiadaki

Carl Vondrick, Columbia

Leonidas Guibas, Stanford & Google

Link to thesis draft