Type: RI Ph.D. Thesis Proposal
Title: Scaling Representation Learning for Spatial Intelligence via Active Implicit Memory
Data-driven approaches have made progress in learning generalizable representations from large corpora, following the success of large language models; however, two fundamental challenges remain open.
First, on the learning paradigm: self-supervised objectives such as next-frame prediction are insufficient for visual streams, and nascent multi-view completion approaches remain fragile and prone to model collapse.
Second, on the architecture: current multi-view models maintain scene representations that are either fixed-size (limiting capacity), linearly growing (hitting compute walls), or managed by hand-crafted heuristics that reintroduce classical brittleness.
This thesis addresses both challenges through a unified framework in which bootstrapped representation learning, active implicit memory, and multi-task decoding are tightly coupled and mutually reinforcing.
The first half of this thesis establishes the foundations through three completed contributions with increasing scope.
AnyLoc demonstrates that foundation model features provide a universal substrate for visual place recognition across diverse environments without task-specific training, establishing the power of data-driven representation learning.
SplaTAM introduces covisibility-guided differentiable rendering for dense visual SLAM, providing the memory management concepts and self-supervised rendering signal that the proposed work extends.
MapAnything presents a unified feed-forward transformer that decodes twelve sub-tasks from a single factored representation at internet scale, validating that task and modeling paradigm are orthogonal.
The second half proposes Memor, the core contribution of this thesis.
Memor introduces active implicit memory whose capacity adapts to the complexity of the input stream, and unified multi-task decoding that recovers diverse spatial outputs from the resulting shared representation.
Memory management is learned end-to-end rather than governed by hand-crafted rules, and the diversity of decoded tasks jointly enriches the underlying representation.
Two application chapters extend Memor: one to holistic scene generation, where supervised pre-training prevents the model collapse that limits purely self-supervised approaches; the other to open-world semantic exploration, where the memory provides spatial context for language-guided navigation.
Together, these contributions advance a unified approach to scaling representation learning across a wide range of spatial intelligence tasks.
Thesis Committee Members:
Sebastian Scherer (Co-Chair)
Deva Ramanan (Co-Chair)
Shubham Tulsiani
Peter Kontschieder, Meta Reality Labs
