VASC Seminar: Ross Girshick
Object Detection: from Structured Models to Deep ConvNets and Back Again
Researcher, Microsoft Research, Redmond
October 20, 2014, 3:00 - 4:00, NSH 1507
It’s an exciting time in computer vision. We’re rapidly making progress on fundamental problems such as object recognition and human pose estimation. When I started my Ph.D. in 2007, the best object detection system could achieve a mean average precision (mAP) of only 21% on our standard benchmark dataset (PASCAL VOC 2007). In this talk I’ll describe two systems, one developed during my Ph.D. and the other during my postdoc, that have more than doubled object detection performance (to 59% mAP) over the last seven years.
The first system, Deformable Part Models (or DPM), is based on an elegant framework in which object categories are represented by a type of context-free grammar. These grammars allow object detectors to be specified recursively in terms of parts and subparts. Grammars can also naturally model object classes with variable structure and distinct subclasses. I will describe how we systematically improved object detection performance by increasing the structural sophistication of our detectors within this framework.
In the second part of the talk, I will describe a new approach to object detection that is already achieving remarkable results. This approach, Region-based Convolutional Neural Networks (“R-CNN”), applies a large convolutional network to image regions generated by a bottom-up segmentation algorithm. The key insight behind this work is that one can train a CNN on a large-scale image classification dataset (ImageNet) and then transfer the learned representation to the problem of object detection, where we are typically short on labeled training data. I will conclude by revisiting DPMs and showing that they are in fact a special case of a convolutional neural network.
Host: Abhinav Gupta
Ross Girshick is a Researcher at Microsoft Research in Redmond, WA. He completed his Ph.D. in computer vision at The University of Chicago under the supervision of Pedro Felzenszwalb in 2012. Following his Ph.D., he spent two wonderful years as a postdoctoral fellow working with Jitendra Malik and Trevor Darrell at UC Berkeley. Ross's main research interests are in computer vision, AI, and machine learning. His work focuses on building models for object detection and recognition that aim to incorporate the "right" biases so that machine learning algorithms can understand image content from moderate- to large-scale datasets. During Ross's Ph.D., he spent time as an intern at Microsoft Research Cambridge, UK working on human pose estimation from depth images. He also participated in several first-place entries into the PASCAL VOC object detection challenge, and in 2010 was awarded the PASCAL VOC "lifetime achievement" prize for his work on Deformable Part Models. His recent work on R-CNN showed, for the first time, that deep convolutional networks can dramatically outperform the previous state of the art at generic object category detection.