
PhD Thesis Defense

Emily Kim
PhD Student
Robotics Institute, Carnegie Mellon University
Thursday, August 28
1:00 pm to 2:30 pm
GHC 6121
Dataset-Driven and Generative Approaches to Domain Generalization in Human-Centric Vision
Abstract:
Human-centered computer vision technology relies heavily on large, diverse datasets, yet even the largest collections cannot fully capture the variability of human appearance, motion, and viewpoint. At the same time, collecting data from human subjects is time-consuming, labor-intensive, and raises privacy concerns. To overcome these challenges while maintaining efficiency, researchers increasingly turn to two complementary directions: dataset-driven approaches (synthetic data generation, domain adaptation, hybrid training, targeted fine-tuning) and generative refinement methods that improve model outputs when lightweight architectures cannot fully generalize.

This thesis explores both directions by presenting new datasets for human-centric vision models and addressing the models' limitations through synthetic augmentation, curriculum-based training, and single-step generative refinement. Together, these methods compensate for gaps in training diversity, improve generalization to unseen domains, and reduce the computational cost of training and inference. The approaches are applied to two applications: activity recognition (classifying human actions from sequences of frames) and 3D avatar generation (creating 3D avatars from a few subject images).

In the first part, we introduce REMAG, a dataset suite comprising both real and synthetic data across eleven action classes, captured from ground and drone cameras (Chapter 2). The synthetic portion is generated using four distinct methods, combining either traditional computer graphics (CG) or neural rendering with motion sources from either marker-based motion capture or 2D video-tracked motions. Through extensive experiments, we demonstrate that a two-step fine-tuning strategy—pre-training on high-quality synthetic data followed by fine-tuning on a small amount of real data—can match or even surpass the performance of models trained on a substantially larger real dataset, while also reducing training time. However, in real-world scenarios, collecting synchronized ground- and aerial-view data is often impractical due to the significant effort and resources required. To address this limitation, in Chapter 3 we investigate a domain adaptation setting in which no real aerial-view training data is available, examining how models can efficiently generalize to the real aerial-view domain when trained only with real ground-view and synthetic aerial-view data.

In the second part, we focus on training a 3D avatar generation model called Universal Avatars (UA) using a large-scale multi-view dataset of human heads called Ava-256 (Chapter 4). While this dataset allows the model to generate complete 3D avatars from input images and drive them with expression signals, lightweight architectures such as GP-Avatar, which enable real-time avatar synthesis from a single appearance image, struggle to preserve identity and expression consistency across diverse viewpoints. To address this tradeoff between efficiency and fidelity, Chapter 5 introduces TurboPortrait3D, a single-step diffusion method that refines coarse novel views generated by GP-Avatar. Unlike existing 3D-aware generative methods that rely on multi-step optimization, TurboPortrait3D produces sharper, identity-faithful, and 3D-consistent novel views in real time, demonstrating that generative refinement can effectively complement lightweight dataset-driven models.

Committee members: Jessica Hodgins (Chair), Fernando de la Torre, Jun-Yan Zhu, Julieta Martinez (Meta)