Removing the i’s from i.i.d.: Testing generalization on hard datasets

Master's Thesis, Tech. Report, CMU-RI-TR-19-81, Robotics Institute, Carnegie Mellon University, December, 2019

Abstract

The last few years have seen the widespread success of over-parameterized deep learning models across applications with massive datasets. However, these models are often critiqued for assuming access to perfect data, that is, a large amount of clean, i.i.d.-sampled data. In real-world scenarios, neither of these assumptions holds entirely. We consider four arbitrary domains as examples of such scenarios: point cloud completion (distribution shift), visual dialog (dataset size and bias issues), meta-RL for control (noisy, high-variance, and sparse training signals), and a poaching prediction task (unstructured dataset with skew, noise, and distribution shift). Using these datasets, we show that data and priors are meant to complement each other in machine learning models, and that it is important to consider them jointly on a task-by-task basis for better generalization.

BibTeX

@mastersthesis{Gurumurthy-2019-118729,
author = {Swaminathan Gurumurthy},
title = {Removing the i’s from i.i.d : Testing generalization on hard datasets},
year = {2019},
month = {December},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-19-81},
keywords = {Machine learning, out-of-distribution, distribution shift, point cloud, dialog, vision, meta reinforcement learning},
}