Analogical Networks: Memory-Modulated In-Context 3D Parsing

Master's Thesis, Tech. Report, CMU-RI-TR-22-67, Robotics Institute, Carnegie Mellon University, December, 2022


Abstract

Deep neural networks have recently achieved excellent performance on numerous visual perception tasks. However, this performance generally requires access to a large number of training samples, so few-shot learning (i.e., the ability to adapt to new tasks using only a few labeled samples without forgetting the original distribution) remains a persistent challenge. In most existing work on fine-grained 3D parsing, a separate parametric neural model is trained to parse each semantic category, which hinders knowledge sharing across objects and few-shot generalization to novel categories. In this thesis, we present Analogical Networks, a model that casts fine-grained 3D visual parsing as analogical inference: instead of mapping input scenes to part labels, which is hard to adapt in a few-shot manner to novel inputs, our model retrieves related scenes from memory together with their corresponding part structures, and predicts analogous part structures in the input scene via an end-to-end learnable modulation mechanism. By conditioning on more than one memory and using these memories as in-context information, the model predicts compositions of structures that mix and match parts from different visual exemplars. This is a memory-inspired learning framework for perceptual parsing tasks that encodes domain knowledge explicitly in a vast collection of memories at different levels of abstraction, in addition to the knowledge implicitly encoded in model parameters. We show that Analogical Networks excel at few-shot 3D parsing, where instances of novel object categories are successfully parsed simply by expanding the model's memory, without any weight updates. Analogical Networks outperform existing state-of-the-art detection transformer models, as well as related meta-learning and few-shot learning techniques, at part segmentation. We also show that part correspondences emerge across memory and input scenes simply by training for a label-free segmentation objective, as a byproduct of the analogical inductive bias.
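
The retrieve-then-parse idea in the abstract can be illustrated with a toy sketch. This is not the thesis's model: the learned, end-to-end modulation mechanism is replaced here by plain cosine matching against hand-set feature vectors, and every name (`cosine`, `retrieve`, `parse`, the `memory` layout, the feature values) is a hypothetical choice made for illustration only. It shows just two things: labels for one input can be drawn from several retrieved exemplars, and a novel category can be handled by appending to memory with no other change.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(memory, scene_feature, k=2):
    """Return the k memory entries most similar to the input scene feature."""
    ranked = sorted(memory, key=lambda m: cosine(m["scene"], scene_feature), reverse=True)
    return ranked[:k]


def parse(memory, scene_feature, point_features, k=2):
    """Label each input point with the closest part prototype among the
    retrieved exemplars (a stand-in for the learned modulation step)."""
    retrieved = retrieve(memory, scene_feature, k)
    prototypes = [(label, feat) for m in retrieved for label, feat in m["parts"].items()]
    return [max(prototypes, key=lambda p: cosine(p[1], f))[0] for f in point_features]


# Hypothetical memory: each entry has a scene-level feature and labeled part prototypes.
memory = [
    {"name": "chair_a", "scene": [1.0, 0.0, 0.0],
     "parts": {"seat": [1.0, 0.0, 0.0], "leg": [0.0, 1.0, 0.0]}},
    {"name": "chair_b", "scene": [0.9, 0.1, 0.0],
     "parts": {"seat": [1.0, 0.0, 0.0], "back": [0.0, 0.0, 1.0]}},
]

# A new chair: "leg" is only available from chair_a and "back" only from chair_b,
# so the predicted structure mixes parts from both exemplars.
chair_labels = parse(memory, [1.0, 0.05, 0.0],
                     [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]])

# A novel category: before the memory is expanded, only chair parts can be predicted.
lamp_scene = [0.0, 0.1, 1.0]
lamp_points = [[0.9, 1.0, 0.0], [0.0, 0.9, 1.0]]
labels_before = parse(memory, lamp_scene, lamp_points)

# Expanding the memory with one labeled lamp; the functions above are untouched.
memory.append({"name": "lamp_a", "scene": [0.0, 0.0, 1.0],
               "parts": {"shade": [1.0, 1.0, 0.0], "pole": [0.0, 1.0, 1.0]}})
labels_after = parse(memory, lamp_scene, lamp_points)

print(chair_labels, labels_before, labels_after)
```

In this toy setting, appending one exemplar plays the role of "expanding the model's memory"; in the thesis the parser itself is a trained network, which this sketch does not attempt to reproduce.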

BibTeX

@mastersthesis{Singh-2022-134538,
author = {Mayank Singh},
title = {Analogical Networks: Memory-Modulated In-Context 3D Parsing},
year = {2022},
month = {December},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-22-67},
keywords = {3D Instance Segmentation, Visual Parsing, Memory Modulated Segmentation, Few-shot Learning},
}

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.