Contrastive self-supervised learning is a highly effective way of learning representations that are useful for, i.e., generalise to, a wide range of downstream vision tasks and datasets. In the first part of the talk, I will present MoCHi, our recently published contrastive self-supervised learning approach (NeurIPS 2020) that learns transferable representations faster by synthesising hard negatives. Models trained with MoCHi are a strong starting point for downstream tasks like object detection and segmentation and datasets like PASCAL VOC or MS-COCO. But how “far” are these datasets from the training data, and how many of the downstream concepts were actually also encountered during training? In the second part of the talk, I will present a novel benchmark for studying concept generalization in a principled way, i.e., the extent to which models trained on a set of (seen) visual concepts can be used to recognize a new set of (unseen) concepts. We argue that semantic relationships between seen and unseen concepts affect generalization performance and propose ImageNet-CoG, a novel benchmark on the extended ImageNet-21K dataset that can evaluate models trained on the ubiquitous ImageNet-1K dataset out-of-the-box. Defining the ImageNet-1K concepts as the seen concepts, we leverage the WordNet hierarchy to rank all unseen concepts from the ImageNet-21K dataset with respect to their semantic distance to the seen ones, and sample a sequence of datasets whose concepts are semantically more and more distant from ImageNet-1K. We analyse a number of publicly available models from supervised, semi-supervised and self-supervised (BYOL, SwAV, MoCo, SimCLR) approaches through the lens of concept generalization, and show how our benchmark uncovers a number of interesting insights.
Yannis Kalantidis is a research scientist at NAVER LABS Europe. He received his PhD on large-scale visual similarity search and clustering from the National Technical University of Athens in 2014. He was a postdoc and research scientist at Yahoo Research in San Francisco from 2015 until 2017, leading the visual similarity search project at Flickr and participating in the creation of the Visual Genome dataset. He then joined Facebook AI in Menlo Park in 2017 as a research scientist on the video understanding team, and his research interests expanded to video understanding and deep learning architecture modelling. He joined NAVER LABS Europe in March 2020. His research interests revolve around visual representation learning and more specifically self-supervised learning, continual and streaming learning, multi-modal learning, video understanding and vision & language. He also leads Computer Vision for Global Challenges (cv4gc.org), an initiative to bring the computer vision community closer to socially impactful tasks, datasets and applications for worldwide impact; CV4GC has organized workshops at top venues like CVPR and ICLR.
Sponsored in part by: Facebook Reality Labs Pittsburgh