Name: Contrastive View Predictive Learning with 3D-Bottlenecked RNNs
Start: 2019-05-03T11:00:00-04:00
End: 2019-05-03T12:00:00-04:00
Location: GHC 6115

PhD Speaking Qualifier

Adam Harley
Robotics Institute, Carnegie Mellon University

Friday, May 3
11:00 am to 12:00 pm
GHC 6115

Contrastive View Predictive Learning with 3D-Bottlenecked RNNs

Abstract:
In this talk, I will describe our recent work on neural architectures for visual recognition that use 3D neither as the input nor as the desired output space, but rather as the bottleneck of the learned representations. We consider embodied agents, equipped with these architectures, moving in otherwise static worlds; they learn 3D visual feature representations by estimating their egomotion, integrating input image sequences into a geometrically consistent 3D deep feature map, and predicting the visual outcomes of their (ego)motion by projecting this map from desired viewpoints. View prediction, despite being an appealing objective with connections to brain-like predictive learning, is hindered in practice by the multimodality of the image view to be predicted. To handle this difficulty, we propose contrastive ranking-based losses in place of maximum likelihood: instead of predicting pixels directly, our agent predicts 2D feature maps of the view under consideration and maximizes their pixel-wise correspondence to bottom-up extracted features of the ground-truth view using metric learning. In this way, we are able to use view prediction on much more realistic imagery than previously attempted. We demonstrate that the emergent self-supervised 3D feature representations are useful for 3D object detection and visual correspondence tasks, and vastly outperform representations obtained with 2D architectures. Our experiments show that a 3D object detector generalizes much better to a test set when co-trained with our contrastive view prediction objective. We argue that 3D feature representations generalize better than 2D ones (pursued in most current computer vision systems) because objects have canonical 3D appearance, scale, and motion, in contrast to their 2D projections.
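
To make the idea of a pixel-wise contrastive ranking loss concrete, here is a minimal sketch in plain Python. It is not the method presented in the talk: the abstract does not specify the loss, so the margin-based hinge form, the L2 normalization, the use of all other pixel locations as negatives, and the names `pixelwise_ranking_loss` and `margin` are assumptions made purely for illustration. Each predicted per-pixel feature is encouraged to lie closer to the ground-truth feature at the same location than to ground-truth features at other locations.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pixelwise_ranking_loss(pred, target, margin=0.5):
    """Hypothetical margin-based ranking loss over flattened feature maps.

    pred, target: lists of per-pixel feature vectors of equal length
    (an H x W map flattened to H*W entries). For every pixel i, the
    positive pair is (pred[i], target[i]); every target[j] with j != i
    serves as a negative. This choice of negatives is an assumption.
    """
    pred = [l2_normalize(p) for p in pred]
    target = [l2_normalize(t) for t in target]
    total, count = 0.0, 0
    for i, p in enumerate(pred):
        pos = sq_dist(p, target[i])
        for j, t in enumerate(target):
            if j == i:
                continue
            neg = sq_dist(p, t)
            # Hinge: penalize when the positive is not closer by at least `margin`.
            total += max(0.0, margin + pos - neg)
            count += 1
    return total / count if count else 0.0


# Two-pixel toy example: matching features versus swapped features.
aligned = pixelwise_ranking_loss([[1.0, 0.0], [0.0, 1.0]],
                                 [[1.0, 0.0], [0.0, 1.0]])
swapped = pixelwise_ranking_loss([[1.0, 0.0], [0.0, 1.0]],
                                 [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is zero when predicted and ground-truth features agree at every pixel and positive when they are swapped. Because the objective compares features rather than pixel intensities, the prediction is not forced to commit to a single rendering of the view, which is the motivation the abstract gives for replacing maximum likelihood.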

Committee:
Katerina Fragkiadaki (advisor)
Martial Hebert
Chris Atkeson
Xiaolong Wang
