Abstract:
Salient foreground regions (e.g., vehicles, faces) occupy only ~10% of pixels, while less informative backgrounds (e.g., sky, trees) account for the remaining ~90%. This imbalance fundamentally limits visual perception:
(1) 2D discriminative tasks (e.g., domain-adaptive detection and segmentation) rely heavily on large background regions with high cross-domain variation, making domain adaptation difficult.
(2) 2D generative tasks (e.g., image-to-image translation) compress images into a latent space where small foreground regions receive even fewer spatial resources, making it difficult to preserve and reconstruct fine-grained details.
(3) 3D tasks (e.g., occupancy prediction) allocate substantial computation to less informative background regions, leading to significant latency and memory overhead.
In this work, we propose an efficient image warping framework with instance-level saliency to oversample salient foregrounds and undersample less informative backgrounds. Our method is model-agnostic, works with arbitrary saliency priors, requires no architecture modification, and introduces negligible computational, memory, and latency overhead. Experiments on domain-adaptive object detection and semantic segmentation, image-to-image translation (e.g., human and driving scene relighting, and driving scene translation), and 3D occupancy prediction demonstrate improved accuracy for discriminative and 3D tasks, and enhanced fidelity and realism for generative tasks, while maintaining high efficiency.
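To make the idea of oversampling salient regions concrete, the following is a minimal sketch and not the framework proposed in this work. It assumes a simple separable inverse-CDF resampling, where the saliency map is reduced to row and column marginals and output pixels are drawn more densely where those marginals are high. The names `inverse_cdf_coords`, `saliency_warp`, and the `floor` parameter are hypothetical and chosen for illustration only.

```python
import numpy as np

def inverse_cdf_coords(marginal, n_out, floor=0.1):
    """Map n_out evenly spaced output positions to input pixel coordinates.

    Positions are packed more densely where `marginal` is large.
    `floor` (an illustrative parameter) adds uniform mass so that
    background regions are undersampled rather than dropped entirely.
    """
    density = np.asarray(marginal, dtype=np.float64)
    total = density.sum()
    density = density / total if total > 0 else np.zeros_like(density)
    density = density + floor / density.size
    # Cumulative mass at each input pixel edge, normalized to [0, 1].
    cdf = np.concatenate([[0.0], np.cumsum(density)])
    cdf /= cdf[-1]
    edges = np.arange(density.size + 1, dtype=np.float64)
    # Invert the CDF at the centers of the output pixels.
    targets = (np.arange(n_out) + 0.5) / n_out
    coords = np.interp(targets, cdf, edges) - 0.5  # edge coords -> pixel-center coords
    return np.clip(coords, 0, density.size - 1)

def saliency_warp(image, saliency, floor=0.1):
    """Warp `image` (H x W or H x W x C) so salient rows/columns get more output pixels.

    Uses nearest-neighbor sampling for brevity; the output has the same
    size as the input. Returns the warped image and the sampling grids.
    """
    h, w = saliency.shape
    ys = inverse_cdf_coords(saliency.sum(axis=1), h, floor)
    xs = inverse_cdf_coords(saliency.sum(axis=0), w, floor)
    yi = np.rint(ys).astype(int)
    xi = np.rint(xs).astype(int)
    return image[yi][:, xi], ys, xs

if __name__ == "__main__":
    img = np.arange(64 * 64, dtype=np.float64).reshape(64, 64)
    sal = np.zeros((64, 64))
    sal[24:40, 24:40] = 1.0  # a box covering 25% of each axis
    out, ys, xs = saliency_warp(img, sal)
    in_box = np.sum((xs >= 23.5) & (xs < 39.5))
    print(out.shape, f"{in_box} of {xs.size} output columns sample the salient box")
```

Because the sampling grids `ys` and `xs` are monotone, the warp can be undone by interpolating predictions back onto the original pixel grid, which is one reason a sketch like this needs no change to the downstream model.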
Committee:
Prof. Srinivasa Narasimhan
Prof. Deva Ramanan
Prof. Shubham Tulsiani
Gaurav Parmar
