This thesis explores both directions by presenting new datasets for human-centric vision models and addressing the limitations of that data through synthetic augmentation, curriculum-based training, and single-step generative refinement. Together, these methods compensate for gaps in training diversity, improve generalization to unseen domains, and reduce the computational cost of training and inference. The approaches are applied to two applications: activity recognition (classifying human actions from sequences of frames) and 3D avatar generation (reconstructing a subject's 3D avatar from a few images).
In the first part, we introduce REMAG, a dataset suite comprising both real and synthetic data across eleven action classes, captured from ground and drone cameras (Chapter 2). The synthetic portion is generated using four distinct methods that combine either traditional computer graphics (CG) or neural rendering with motions sourced from either marker-based motion capture or 2D video tracking. Through extensive experiments, we demonstrate that a two-step fine-tuning strategy, in which a model is pre-trained on high-quality synthetic data and then fine-tuned on a small amount of real data, can match or even surpass the performance of models trained on a substantially larger real dataset, while also reducing training time. However, in real-world scenarios, collecting synchronized ground- and aerial-view data is often impractical due to the significant effort and resources required. To address this limitation, Chapter 3 investigates a domain adaptation setting in which no real aerial-view training data is available, examining how models trained only on real ground-view and synthetic aerial-view data can generalize efficiently to the real aerial-view domain.
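To make the two-step strategy concrete, the following minimal PyTorch-style sketch illustrates the training schedule of pre-training on synthetic data followed by fine-tuning on real data. The helper names (run_epochs, two_step_finetune), hyperparameters, and dataset objects are illustrative placeholders, not the exact configuration used in Chapter 2.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, Dataset


def run_epochs(model, loader, epochs, lr):
    """Standard supervised training loop for an action classifier."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for clips, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()


def two_step_finetune(model, synthetic_set: Dataset, real_set: Dataset):
    # Step 1: pre-train on the large synthetic portion of the dataset suite.
    run_epochs(model, DataLoader(synthetic_set, batch_size=16, shuffle=True),
               epochs=20, lr=1e-4)
    # Step 2: fine-tune on a small amount of real data with a lower learning rate.
    run_epochs(model, DataLoader(real_set, batch_size=16, shuffle=True),
               epochs=5, lr=1e-5)
    return model
```

The key design choice captured here is that the second stage reuses all weights from the synthetic pre-training and only adapts them briefly on real data, which is what allows a small real set to stand in for a much larger one.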
In the second part, we focus on training a 3D avatar generation model called Universal Avatars (UA) on Ava-256, a large-scale multi-view dataset of human heads (Chapter 4). While this dataset allows the model to generate complete 3D avatars from input images and drive them with expression signals, lightweight architectures such as GP-Avatar, which synthesize an avatar in real time from a single appearance image, struggle to preserve identity and expression consistency across diverse viewpoints. To address this tradeoff between efficiency and fidelity, Chapter 5 introduces TurboPortrait3D, a single-step diffusion method that refines the coarse novel views produced by GP-Avatar. Unlike existing 3D-aware generative methods that rely on multi-step optimization, TurboPortrait3D produces sharper, identity-faithful, and 3D-consistent novel views in real time, demonstrating that generative refinement can effectively complement lightweight dataset-driven models.
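The sketch below illustrates the general idea of single-step refinement: a coarse render from a lightweight generator is passed once through a conditioned denoising network instead of being optimized over a full multi-step diffusion schedule. The class name, the fixed_timestep parameter, and the conditioning interface are hypothetical placeholders; TurboPortrait3D's actual architecture and conditioning are detailed in Chapter 5.

```python
import torch
from torch import nn


class SingleStepRefiner(nn.Module):
    """Illustrative single-step refinement of a coarse novel view."""

    def __init__(self, denoiser: nn.Module, fixed_timestep: int = 250):
        super().__init__()
        self.denoiser = denoiser              # image-to-image denoising network
        self.fixed_timestep = fixed_timestep  # single, fixed noise level

    @torch.no_grad()
    def forward(self, coarse_view: torch.Tensor, identity_emb: torch.Tensor):
        # Treat the coarse render as a partially noised sample and denoise it
        # in one forward pass, conditioned on an identity embedding, rather
        # than iterating over the full diffusion schedule.
        t = torch.full((coarse_view.shape[0],), self.fixed_timestep,
                       device=coarse_view.device, dtype=torch.long)
        return self.denoiser(coarse_view, t, identity_emb)
```

Because the refiner runs a single forward pass per frame, it can keep pace with the real-time generator it corrects, which is what makes the efficiency-fidelity tradeoff tractable in this setting.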
