Efficient Visual Modeling with Adaptive Representations - Robotics Institute Carnegie Mellon University

PhD Thesis Defense

Rohan Choudhury, PhD Student
Robotics Institute, Carnegie Mellon University
Tuesday, January 13
2:30 pm to 4:00 pm
Newell-Simon Hall 3305
Efficient Visual Modeling with Adaptive Representations
Abstract: 

While image understanding, generation, and manipulation have matured rapidly in recent years, video remains challenging because its inputs are significantly larger. As a result, tasks such as generating long videos or understanding extended video sequences remain out of reach for current models because of the computational cost. This talk presents a series of works that address this issue by adapting ideas from video compression to accelerate visual model training and inference. I will first introduce Run-Length Tokenization (RLT), which modifies the vision transformer architecture to exploit temporal redundancy, enabling substantial speedups without compromising accuracy. Next, I will present FlowTok, which incorporates motion vectors to extend RLT to dynamic scenes, maintaining efficiency even under camera and object motion. I will then discuss Adaptive Patch Transformers (APT), which apply these principles to images by dynamically assigning larger patch sizes in low-complexity regions to reduce computation while preserving performance. I will next apply these principles to video generation and propose SkipSR, a cascaded generation framework that combines fast video super-resolution with cascaded diffusion models. Finally, I will introduce FPS-Bench, a benchmark that systematically evaluates the impact of frame rate and resolution on downstream video understanding tasks, offering insights into which aspects of fidelity truly matter for model performance. By unifying efficient video tokenization with scalable video synthesis and principled evaluation, this thesis enables significantly faster visual models in both understanding and generation tasks, unlocking further scaling.

Thesis Committee Members:

László A. Jeni (co-chair)
Kris M. Kitani (co-chair)
Jun-Yan Zhu
Rohit Girdhar (Meta GenAI)
Lu Jiang (ByteDance)
A draft of the thesis is available here: Thesis Draft