
Towards 4D Perception with Foundational Priors

PhD Thesis, Tech. Report, CMU-RI-TR-25-91, October 2025

Abstract

As humans, we are constantly interacting with and observing a three-dimensional dynamic world. Building this spatiotemporal, or 4D, understanding into vision algorithms is not straightforward, as there is orders of magnitude less 4D data than there are 2D images and videos. This underscores the need to find meaningful ways to exploit 2D data for 4D tasks. Recent advances in building "foundation models" -- which have learnt generative and structural priors in a data-driven manner from internet-scale data -- have given us access to these rich real-world priors for free. In this thesis, we investigate how one can tune these priors for 4D perception tasks such as amodal tracking and completion, dynamic reconstruction, and next-timestep prediction.

We pursue three complementary directions. First, in the absence of foundational priors, we build them ourselves in a self-supervised manner via the task of next-timestep prediction on sequences of 3D LiDAR sweeps of dynamic scenes. Importantly, we show that bottlenecking next-timestep prediction with a 4D representation is crucial. We find that such a forecasting model can be used for downstream motion planning in autonomous vehicles, where it substantially reduces collision rates.

Second, we capitalize on foundational priors in a zero-shot manner. We turn to large reconstruction models that predict per-pixel depth for images and videos, and use them to solve two underconstrained tasks: (1) tracking objects across occlusions in 2.5D, and (2) reconstructing dynamic scenes from sparse views. In both settings, we find that one can do drastically better than the prior state of the art by using additional scene cues in the form of data-driven depth priors.

Third, we exploit foundational priors via finetuning. We specifically look at video diffusion models and reformulate amodal perception and dynamic novel-view synthesis as self-supervised tasks that video models are already good at, i.e., inpainting. We find that finetuning video diffusion models is surprisingly lightweight in terms of data and compute. This suggests that concepts similar to human visual perception are already embedded in foundation models, which only have to be "controlled" to perform other tasks.

Together, these contributions highlight how one can build, leverage, and adapt foundational priors for spatiotemporal perception in a scalable manner -- a scale enabled by relying increasingly on internet-scale 2D data and by carefully designing self-supervised learning objectives.

BibTeX

@phdthesis{Khurana-2025-149120,
author = {Tarasha Khurana},
title = {Towards 4D Perception with Foundational Priors},
year = {2025},
month = {October},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-25-91},
keywords = {foundation models, 4D understanding, dynamic 3D understanding},
}