Accelerating Video Understanding and Generation at Scale - Robotics Institute Carnegie Mellon University

PhD Thesis Proposal

Rohan Choudhury, PhD Student
Robotics Institute, Carnegie Mellon University
Thursday, May 22
2:00 pm to 3:30 pm
NSH 4305
Accelerating Video Understanding and Generation at Scale

Abstract:
While image understanding, generation, and manipulation have matured rapidly in recent years, video remains challenging because its inputs are significantly larger. As a result, tasks such as generating long videos or understanding extended video sequences remain computationally out of reach for current models. This talk presents a series of works that address this issue by adapting ideas from video compression to accelerate visual model training and inference. I will first introduce Run-Length Tokenization (RLT), which modifies the vision transformer architecture to exploit temporal redundancy, enabling substantial speedups without compromising accuracy. Next, I will present FlowTok, which incorporates motion vectors to extend RLT to dynamic scenes, maintaining efficiency even under camera and object motion. I will then discuss Adaptive Patch Transformers (APT), which apply these principles to images by dynamically assigning larger patch sizes to low-complexity regions, reducing computation while preserving performance. Finally, I will highlight ongoing and future work in two directions: accelerating video generation and super-resolution through learned sparse attention, and developing a benchmark to assess which video understanding tasks truly benefit from high frame rates.

Thesis Committee Members:
László A. Jeni (co-chair)
Kris M. Kitani (co-chair)
Jun-Yan Zhu
Rohit Girdhar (Meta GenAI)
Lu Jiang (ByteDance)