
Unsupervised Learning of the 4D Audio-Visual World from Sparse Unconstrained Real-World Samples

PhD Thesis, Tech. Report CMU-RI-TR-21-02, Robotics Institute, Carnegie Mellon University, January 2021

Abstract

We humans can easily observe, explore, and analyze our four-dimensional (4D) audio-visual world, yet we struggle to share those observations, explorations, and analyses with others. In this thesis, our goal is to learn a computational representation of the 4D audio-visual world that can be: (1) estimated from sparse real-world observations; and (2) explored to create new experiences. We introduce the Computational Studio, an environment for observing, exploring, and creating the 4D audio-visual world that allows humans to communicate effectively with other humans and machines without loss of information. The Computational Studio enables non-experts to construct and creatively edit the 4D audio-visual world from sparse real-world samples. It has three essential components, framed by the following questions: (1) How can we densely observe the 4D visual world? (2) How can we communicate the audio-visual world using examples? (3) How can we interactively explore the audio-visual world?

The first part introduces capturing, browsing, and reconstructing the 4D visual world from sparse real-world multi-view samples. We bring together insights from classical image-based rendering and neural rendering approaches. Two components are crucial to our work: (1) fusing information from sparse multi-view samples to create dense 3D point clouds; and (2) fusing multi-view information to create new views. Although events are captured from discrete viewpoints, the proposed formulation enables dense 3D reconstruction and 4D visualization of dynamic events. It also allows us to move continuously through the space-time of an event: (1) freezing time and exploring 3D space; (2) freezing 3D space and moving through time; and (3) changing both time and 3D space simultaneously. Without any external information, our formulation yields a dense depth map and a foreground-background segmentation, which lets us efficiently track objects in a video. In turn, these properties allow us to edit videos and reveal occluded parts of the 3D scene, provided they are visible in at least one view.
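To make the multi-view fusion concrete, here is a minimal illustrative sketch in Python/NumPy of back-projecting per-view depth maps from calibrated cameras into a shared world frame and concatenating the results into one point cloud. The function names, and the assumption that per-view depth, intrinsics, and camera-to-world poses are available, are ours for illustration; this is not the thesis implementation, which builds on image-based and neural rendering.

import numpy as np

def backproject_depth(depth, K, cam_to_world):
    # Lift an (H, W) z-depth map into world-space 3D points.
    # K: (3, 3) camera intrinsics; cam_to_world: (4, 4) extrinsics.
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)   # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                                   # normalized camera rays
    pts_cam = rays * depth.reshape(-1, 1)                             # scale rays by depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]                            # world-space points

def fuse_views(depths, Ks, poses):
    # Concatenate back-projected points from all calibrated views into one cloud.
    return np.concatenate([backproject_depth(d, K, T)
                           for d, K, T in zip(depths, Ks, poses)], axis=0)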

The second part details example-based synthesis of the audio-visual world in an unsupervised manner. Example-based audio-visual synthesis allows us to express ourselves easily. In this part, we introduce Recycle-GAN, which combines spatial and temporal information via adversarial losses for unsupervised video retargeting. This representation allows us to translate content from one domain to another while preserving the style native to the target domain. We then extend our work to audio-visual synthesis using Exemplar Autoencoders. Our approach builds on simple autoencoders that project out-of-sample data onto the distribution of the training set. We use Exemplar Autoencoders to learn the voice, stylistic prosody (emotions and ambiance), and visual appearance of a specific target from exemplar speech. This work enables us to synthesize a natural voice for speech-impaired individuals and to perform zero-shot multi-lingual translation. Finally, we introduce PixelNN, a semi-parametric model that generates multiple outputs from a given input and a set of examples.
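As a rough illustration of how spatial translation and temporal prediction can be combined, the following PyTorch-style sketch shows a recycle-consistency term in the spirit of Recycle-GAN: two past frames are translated from domain X to domain Y, a temporal predictor estimates the next frame in Y, and the prediction is mapped back to X and compared against the true future frame. The module names G_XY, G_YX, and P_Y are placeholders, and the full objective also includes adversarial terms not shown here.

import torch
import torch.nn.functional as F

def recycle_loss(x_prev, x_curr, x_next, G_XY, G_YX, P_Y):
    # Illustrative sketch only, not the released Recycle-GAN code.
    y_prev, y_curr = G_XY(x_prev), G_XY(x_curr)            # translate past frames into domain Y
    y_next_pred = P_Y(torch.cat([y_prev, y_curr], dim=1))  # predict the next frame in domain Y
    x_next_recon = G_YX(y_next_pred)                       # map the prediction back to domain X
    return F.l1_loss(x_next_recon, x_next)                 # compare with the true future frame in X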

The third part introduces human-controllable representations that allow a user to interact with visual data and create new experiences on everyday computational devices. First, we introduce OpenShapes, which allows a user to interactively synthesize new images using a paint-brush and a drag-and-drop tool. We then present simple video-specific autoencoders that enable human-controllable video exploration. This exploration spans a wide variety of video-analytic tasks, including (but not limited to) spatial and temporal super-resolution, object removal, video textures, average-video exploration, associating various videos, video retargeting, and correspondence estimation within and across videos. Prior work has looked at each of these problems independently and proposed different formulations. We observe that a simple autoencoder trained from scratch on multiple frames of a specific video enables a large variety of video processing and editing tasks without being optimized for any single one. Finally, we present a framework for extracting a wide range of low-, mid-, and high-level semantic and geometric scene cues that can be understood and expressed by both humans and machines.
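To sketch the idea of a video-specific autoencoder, here is a minimal, illustrative PyTorch example that assumes the frames of a single video arrive as an (N, 3, H, W) tensor in [0, 1] (with H and W divisible by 4) and is trained from scratch with a plain reconstruction loss. The architecture and hyper-parameters are ours and far simpler than what the thesis explores; the point is only that no task-specific objective is used, and downstream editing tasks can operate on the learned latent space.

import torch
import torch.nn as nn

class VideoAutoencoder(nn.Module):
    # Tiny convolutional autoencoder fit to the frames of one specific video.
    def __init__(self, latent_channels=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, latent_channels, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_on_video(frames, epochs=100, lr=1e-3):
    # Fit the autoencoder to one video using only reconstruction error.
    model = VideoAutoencoder()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(model(frames), frames)
        opt.zero_grad(); loss.backward(); opt.step()
    return model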

The last part of this thesis extends our work on continual and streaming learning of the audio-visual world to visual-recognition tasks, given a few labeled examples and a (potentially) infinite stream of unlabeled examples. The Computational Studio is a first step towards unlocking the full degree of creative imagination, which today remains confined to the human mind by the limits of an individual's expressivity and skills. It has the potential to change the way we audio-visually communicate with other humans and machines.

Notes
Attached is the low-resolution version of the thesis. See the Dropbox link for the high-resolution version: https://www.dropbox.com/sh/qamifyhzxkgv626/AACDy4oFg-fHTz2L8TvCJz5ca?dl=0

BibTeX

@phdthesis{Bansal-2021-125848,
author = {Aayush Bansal},
title = {Unsupervised Learning of the 4D Audio-Visual World from Sparse Unconstrained Real-World Samples},
year = {2021},
month = {January},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-21-02},
keywords = {Unsupervised Learning, 4D Visualization, 3D Reconstruction, Audio-Visual Synthesis, Video Exploration, Image Synthesis, Scene Cues, Test-Time Training, Exemplar Learning, Nearest Neighbors, Adversarial Learning, Auto-Encoders, Continual and Streaming Learning, Human-Controllable Representations},
}