Consistent Modeling of 4D Scenes for Perception and Generation
Abstract:
A core challenge in vision is building representations that capture 3D scenes over time for perception and interactive generation. For accurate perception and plausible generation, we want consistency across views, time, and modalities. In this talk we explore consistency through the choice of representation, moving from dense grid formulations to entity-centric scenes that are easier to maintain across frames, and we extend that representation from perception to generation.
Our past work follows this shift within perception tasks. SOLOFusion uses a grid representation with long- and short-baseline temporal stereo for multi-camera 3D detection, improving foreground depth, but it neither groups pixels into entities nor models the background. ASCFormer performs depth estimation and completion via pixel–point affinity, grouping geometry coherently, but the grouping is geometric rather than semantic and remains static. DetMatch, together with our temporal follow-up, addresses semi-supervised 2D and 3D detection, aligning detections across modalities and video to produce consistent pseudo-labels and more stable tracklets, but it focuses on foreground entities and does not model the background. S2GO proposes a streaming query-based representation for semantic occupancy estimation that is entity-centric, temporal, and models both foreground and background: each persistent query decodes to semantic Gaussians, and the state is carried across frames and supports short-horizon future prediction. This gives us a single, stable representation suitable for both perception and generation.
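For illustration, a minimal sketch of such an entity-centric streaming representation follows. It is an assumption-laden simplification, not S2GO's actual implementation: persistent queries cross-attend to the current frame's features, carry their state to the next frame, and decode into semantic Gaussian parameters. The module names, dimensions, and GRU-based update are illustrative choices, not details from the paper.

import torch
import torch.nn as nn

class StreamingQueryScene(nn.Module):
    # Hypothetical sketch: a fixed set of persistent scene queries, updated frame by frame.
    def __init__(self, num_queries=256, dim=256, num_classes=18):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))   # initial query state
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.update = nn.GRUCell(dim, dim)                           # temporal carry-over
        # Each query decodes to one semantic Gaussian: mean (3), scale (3),
        # rotation quaternion (4), and per-class semantic logits.
        self.to_gaussian = nn.Linear(dim, 3 + 3 + 4 + num_classes)

    def forward(self, frame_feats, prev_state=None):
        # frame_feats: (B, N_tokens, dim) features of the current frame.
        B = frame_feats.shape[0]
        state = self.queries.expand(B, -1, -1) if prev_state is None else prev_state
        attended, _ = self.cross_attn(state, frame_feats, frame_feats)
        new_state = self.update(
            attended.reshape(-1, attended.shape[-1]),
            state.reshape(-1, state.shape[-1]),
        ).view(B, -1, attended.shape[-1])
        gaussians = self.to_gaussian(new_state)   # (B, num_queries, 10 + num_classes)
        return gaussians, new_state               # new_state is reused for the next frame

The point of the sketch is the data structure: the scene is a fixed set of query latents carried across frames rather than a dense grid recomputed per frame, which is what makes the same state usable for decoding, short-horizon prediction, and later for generation.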
We propose two projects that make this representation generative. First, static scene generation: a diffusion model over grounded queries that represent both foreground and background and decode into Gaussians, producing a complete semantic occupancy scene; this grounded latent representation enables intuitive, consistent control. Second, motion generation: a model that generates trajectories for the ego vehicle and foreground entities conditioned on the generated static scene, yielding coherent 4D rollouts and enabling interactive edits.
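As a rough illustration of how the two proposed stages fit together, the sketch below samples grounded query latents with a diffusion-style loop, decodes them into a static scene, and then rolls out trajectories conditioned on it. The denoiser, scene decoder, and motion model are placeholders, and the update rule is deliberately simplified (a real sampler would follow a proper noise schedule); none of this is the proposed implementation.

import torch

@torch.no_grad()
def generate_scene_and_motion(denoiser, scene_decoder, motion_model,
                              num_queries=256, dim=256, steps=50, horizon=12):
    # Stage 1: sample grounded query latents from noise (toy denoising loop).
    z = torch.randn(1, num_queries, dim)
    for t in reversed(range(steps)):
        t_batch = torch.full((1,), t, dtype=torch.long)
        eps = denoiser(z, t_batch)        # predicted noise at step t (placeholder model)
        z = z - eps / steps               # simplified update; a real sampler uses the schedule
    static_scene = scene_decoder(z)       # semantic Gaussians for foreground and background

    # Stage 2: trajectories for ego and foreground entities, conditioned on the scene.
    rollout, state = [], motion_model.init_state(static_scene)
    for _ in range(horizon):
        step, state = motion_model.step(state, static_scene)   # one timestep of motion
        rollout.append(step)
    return static_scene, rollout

The two-stage split mirrors the proposal: a grounded static scene is generated first, and motion is then conditioned on it, which is what keeps entity-level edits and re-rollouts tractable.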
Thesis Committee Members:
Kris Kitani (Chair)
Deva Ramanan
Shubham Tulsiani
Wei-Chiu Ma (Cornell University)
Link to Proposal Draft: Link
