We learn pixel-level segmentations of objects from weakly tagged YouTube videos. Specifically, given a large collection of raw YouTube content, along with potentially noisy tags, we aim to automatically generate spatio-temporal masks for each object, such as "dog", without employing any pre-trained object detectors. We formulate this problem as learning weakly supervised classifiers for a set of independent spatio-temporal segments. The object seeds obtained using segment-level classifiers are further refined using graph cuts to generate high-precision object masks. Our results, obtained by training on a dataset of 20,000 YouTube videos weakly tagged into 15 classes, demonstrate automatic extraction of pixel-level object masks. Evaluated against a ground-truthed subset of 50,000 frames with pixel-level annotations, we confirm that our proposed methods can learn good object masks just by watching YouTube.
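To make the weak-supervision idea concrete, here is a minimal, hypothetical sketch (not the authors' actual system): every spatio-temporal segment inherits its video's tag as a noisy label, a simple linear classifier is trained on segment features, and high-scoring segments become object seeds. The feature vectors, perceptron training, and threshold are all illustrative assumptions; the graph-cut refinement stage is omitted.

```python
# Hypothetical sketch: segments inherit their video's tag as a noisy
# label; a perceptron scores segments; high-scoring ones become "seeds".
# (The actual system's features and classifier are not specified here.)

def train_segment_classifier(videos, epochs=20, lr=0.1):
    """videos: list of (tag_present, segments), where each segment is a
    feature vector and tag_present says whether the video carries the tag."""
    dim = len(videos[0][1][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for tag_present, segments in videos:
            y = 1.0 if tag_present else -1.0
            for x in segments:  # every segment inherits the video label
                score = sum(wi * xi for wi, xi in zip(w, x)) + b
                if y * score <= 0:  # misclassified: perceptron update
                    w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                    b += lr * y
    return w, b

def seed_segments(segments, w, b, threshold=0.0):
    """Indices of segments scoring above threshold: candidate object seeds."""
    return [i for i, x in enumerate(segments)
            if sum(wi * xi for wi, xi in zip(w, x)) + b > threshold]

# Toy data: tagged videos have segments with a high first feature.
videos = [
    (True,  [[0.9, 0.1], [0.8, 0.2]]),   # video tagged "dog"
    (False, [[0.1, 0.9], [0.2, 0.8]]),   # untagged video
]
w, b = train_segment_classifier(videos)
print(seed_segments([[0.85, 0.15], [0.15, 0.85]], w, b))  # → [0]
```

In the full pipeline these seeds would then serve as foreground/background initialization for a graph-cut energy minimization over pixels, yielding the final high-precision masks.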
This talk includes contributions from several interns, Googlers and faculty colleagues: G. Hartmann, M. Grundmann, J. Hoffman, D. Tang, D. Tsai, V. Kwatra, O. Madani, S. Vijayanarasimhan, J. Yagnik, I. Essa, J. Rehg.
Rahul Sukthankar is a scientist at Google Research and an adjunct research professor in Robotics at Carnegie Mellon. He was previously a senior principal researcher at Intel Labs (2003-2011), a senior research scientist at HP/Compaq Labs (2000-2003), and a research scientist at Just Research (1997-2000). Rahul received his Ph.D. in Robotics from Carnegie Mellon in 1997 and his B.S.E. in Computer Science from Princeton in 1991. His current research focuses on computer vision and machine learning, particularly in the areas of object recognition, video understanding, and information retrieval.