ReferEverything: Towards Segmenting Everything We Can Speak of in Videos Using Text-to-Video Diffusion Models

Master's Thesis, Tech. Report CMU-RI-TR-25-58, August 2025

Abstract

This thesis presents REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our method unlocks the universal visual-language mapping learned by video diffusion models on Internet-scale data by fine-tuning them on small-scale Referring Object Segmentation datasets. Our key insight is to preserve the generative model's architecture in its entirety while shifting its objective from predicting noise to predicting mask latents. The resulting model can accurately segment and track rare and unseen objects, despite being trained only on object masks from a limited set of categories. It also generalizes effortlessly to non-object dynamic concepts, such as smoke, raindrops, or waves, as demonstrated in our newly introduced evaluation benchmark for Referring Video Process Segmentation (Ref-VPS). Our experiments show that REM performs on par with state-of-the-art approaches on in-domain datasets, like Ref-DAVIS, while outperforming them by up to 11 points in region similarity out-of-domain, leveraging the power of Internet-scale pre-training. Finally, REM excels at segmenting highly dynamic objects in challenging fight scenes from movies and animated shows. All video visualizations are available at https://refereverything.github.io/.
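To make the stated objective shift concrete, the following is a minimal sketch of how a latent diffusion training target might change from noise prediction to mask-latent prediction. It is illustrative only: the abstract does not specify the exact architecture, conditioning scheme, or noise schedule, so the names `denoiser`, `noised_latents`, `alpha_bar_t`, and the use of a simple MSE regression target are assumptions, not the thesis's exact formulation.

import torch
import torch.nn.functional as F

def noised_latents(z0, noise, alpha_bar_t):
    # Standard forward diffusion: z_t = sqrt(a_bar) * z0 + sqrt(1 - a_bar) * noise
    return alpha_bar_t.sqrt() * z0 + (1.0 - alpha_bar_t).sqrt() * noise

def pretraining_loss(denoiser, z_video, text_emb, t, alpha_bar_t):
    # Generative pre-training objective: the backbone predicts the added noise.
    noise = torch.randn_like(z_video)
    z_t = noised_latents(z_video, noise, alpha_bar_t)
    return F.mse_loss(denoiser(z_t, t, text_emb), noise)

def rem_finetuning_loss(denoiser, z_video, z_mask, text_emb, t, alpha_bar_t):
    # Fine-tuning objective in the spirit of the abstract: the same backbone,
    # with the same text conditioning, now regresses the latent encoding of the
    # referring mask (z_mask) instead of the noise.
    noise = torch.randn_like(z_video)
    z_t = noised_latents(z_video, noise, alpha_bar_t)
    return F.mse_loss(denoiser(z_t, t, text_emb), z_mask)

Under this reading, the full generative backbone is reused unchanged; only the regression target of the training loss differs between pre-training and fine-tuning.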

BibTeX

@mastersthesis{Bagchi-2025-148131,
author = {Anurag Bagchi},
title = {ReferEverything: Towards Segmenting Everything We Can Speak of in Videos Using Text-to-Video Diffusion Models},
year = {2025},
month = {August},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-25-58},
}