Object Pose Estimation without Direct Supervision

Brian Okorn

PhD Thesis, Tech. Report CMU-RI-TR-22-61, Robotics Institute, Carnegie Mellon University, September 2022


Abstract

Currently, robot manipulation is a special-purpose tool, restricted to isolated environments with a fixed set of objects. To make robot manipulation more general, robots need to be able to perceive and interact with a large number of objects in cluttered scenes. Traditionally, object pose has been used as a representation to facilitate these interactions. While object pose has many benefits, several limitations become apparent when we investigate how to train an object pose estimator. Training a pose estimator conventionally requires collecting a large dataset of annotated object images for supervision. Beyond the potential cost of this data collection, most pose estimators trained on such datasets do not account for uncertainty in pose predictions, nor do they generalize to novel objects outside of the training dataset. Further, the pose representation itself does not capture task-specific object interactions.

In this thesis, we explore different methods of alleviating these limitations of training object pose estimators. First, we develop methods that can predict the pose uncertainty induced by both our training distribution and the ambiguities caused by object occlusions and symmetries. The ability to predict this uncertainty allows the robot to better understand what it does and does not know about the object's position and orientation and how that may affect task completion. Second, we propose a method that can estimate the pose of objects that were unknown at training time. To solve this problem, we introduce a novel method for zero-shot object pose estimation in clutter that combines classical pose hypothesis generation with a learned scoring function. Third, we evaluate the convergence properties of learning pose estimation from relative pose annotations using gradient-based optimization methods. We find that naively using such supervision can lead to poor convergence. Building on this analysis, we develop a method to better leverage relative annotations when training pose estimators using gradient-based optimization. Finally, we develop a method to model the object-to-object relationships required for completing a task. Rather than separately estimating the pose of each object, we show how to learn, from a small number of demonstrations, a task-specific relative pose estimator that generalizes to novel objects. We find that such a formulation is naturally translationally equivariant and is able to focus on the components of each object that are key to completing the given task.

BibTeX

@phdthesis{Okorn-2022-133708,
author = {Brian Okorn},
title = {Object Pose Estimation without Direct Supervision},
year = {2022},
month = {September},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-22-61},
keywords = {Pose Estimation, Rotation Estimation, Self-Supervision, Uncertainty, Robotics, Computer Vision},
}
