Takeo Kanade’s famous quip – that to infer geometry or motion from images, you must first know what in one image corresponds to what in another – has guided geometric vision for three decades.
Deep learning seemed to bypass this: methods in 2017–2019 lifted 2D to 3D using only reprojection loss, exploiting an implicit bias toward smooth solutions. But these methods didn’t scale – each category needed its own architecture. Transformers promised scalability, yet naively scaling them under 2D-only supervision often fails. The field concluded that scalable 3D learning requires massive 3D supervision (e.g., VGGT, Depth Anything).
This thesis asks: what went wrong, and can we recover 2D-only learning in the transformer era?
The answer: preserving correspondence, not adding supervision, is what unlocks scale. Transformers scale through selective attention, but 3D lifting requires preserving every correspondence; under standard architectures these goals can conflict. We resolve this with an architectural principle that preserves correspondence throughout the network, achieving a 12x improvement and matching fully supervised performance with zero 3D labels. The result is 2D-LFM (2D Lifting Foundation Model): a single model that lifts 45+ categories to 3D using only 2D observations. The framework also extends to template-free dense reconstruction (RAT4D).
This thesis shows that Kanade’s classical insights remain crucial in the modern foundation model era, and that understanding why naive scaling failed is what makes it possible to recover 2D-only learning.
Thesis Committee Members:
Katerina Fragkiadaki
Jason Saragih, Meta
