Unified Spherical Frontend: Towards Universal Distortion-Free Lens-Agnostic Rotation-Equivariant Perception
Abstract:
Modern perception increasingly relies on wide field-of-view cameras, yet standard convolutional networks still operate on planar pixel grids designed for pinhole imagery. By Gauss’s Theorema Egregium, Gaussian curvature is invariant under local isometry; because the sphere and the plane have different curvature, no map from the sphere to the plane preserves distances, so every planar map of a spherical signal introduces spatially varying distortion. Models trained on one lens therefore overfit to its specific distortion and degrade sharply under camera changes or in-plane rotations. Prior remedies fall short in opposite ways: spherical harmonic CNNs recover rotation-equivariance but remain computationally infeasible at image-scale resolution, while projection-based variants stay efficient but sacrifice equivariance. In this thesis, we investigate a single spatial-domain primitive, geodesic-distance convolution on the unit sphere, in two complementary roles: unifying efficiency with equivariance for lens-agnostic 2D perception, and extending the same geometric principle from the sphere to 3D Euclidean space.
First, we present the Unified Spherical Frontend (USF), a modular pipeline that replaces the planar frontend of any CNN with pixel-to-ray lifting, near-uniform spherical resampling, geodesic-distance convolution, and spherical pooling, while leaving the backbone untouched. Constraining kernels to depend only on geodesic distance makes the operation SO(3)-equivariant by construction, and because all geometric quantities are camera-specific constants, they are computed once and cached for near-zero runtime overhead. On Spherical MNIST, PANDORA panoramic detection with YOLOv11, and Stanford 2D-3D-S panoramic segmentation with DeepLab v3 and UNet, USF matches or exceeds planar performance, suffers less than 1% degradation under arbitrary SO(3) rotations without rotation augmentation, and generalizes zero-shot to lens types unseen during training.
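As a rough illustration of why a kernel that depends only on geodesic distance behaves equivariantly, the following NumPy sketch applies such a kernel to a sampled sphere and checks that rotating the sample points leaves the per-point outputs unchanged. It is not the USF implementation: the Fibonacci-lattice sampling, the Gaussian radial profile, the normalization, and all function names are assumptions chosen for the example.

```python
import numpy as np

def fibonacci_sphere(n):
    """Near-uniform unit vectors on S^2 (Fibonacci lattice; an assumed sampling choice)."""
    i = np.arange(n) + 0.5
    polar = np.arccos(1.0 - 2.0 * i / n)        # polar angle of each sample
    azimuth = np.pi * (1.0 + 5.0 ** 0.5) * i    # golden-ratio azimuth increments
    return np.stack([np.sin(polar) * np.cos(azimuth),
                     np.sin(polar) * np.sin(azimuth),
                     np.cos(polar)], axis=1)

def geodesic_conv(points, signal, kernel):
    """Weighted average of `signal`, with weights depending only on geodesic distance."""
    cos = np.clip(points @ points.T, -1.0, 1.0)  # cosine of the angle between rays
    dist = np.arccos(cos)                        # geodesic distance on the unit sphere
    weights = kernel(dist)
    return weights @ signal / weights.sum(axis=1)

def random_rotation(rng):
    """Random 3x3 rotation matrix from a QR decomposition (hypothetical helper)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))                  # fix the sign ambiguity of QR
    if np.linalg.det(q) < 0:                     # force det = +1 (proper rotation)
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)
kernel = lambda d: np.exp(-(d / 0.2) ** 2)       # assumed radial profile

pts = fibonacci_sphere(500)
sig = rng.normal(size=500)
R = random_rotation(rng)

out = geodesic_conv(pts, sig, kernel)
out_rot = geodesic_conv(pts @ R.T, sig, kernel)  # same signal carried on rotated rays

print(np.allclose(out, out_rot, atol=1e-9))
```

Because a rotation preserves every pairwise dot product, the distance matrix and hence the kernel weights are unchanged, so the output simply moves with the points. The distance matrix here depends only on the sample positions, which is consistent with the abstract's remark that such quantities can be computed once per camera and cached.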
Second, we extend the distance-only kernel design from the sphere S² to Euclidean ℝ³, yielding an SE(3)-equivariant volumetric convolution for 3D point cloud processing. The same principle transfers directly, suggesting a broader geometric primitive for equivariant learning on curved and flat manifolds alike.
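The same idea can be sketched for points in ℝ³: if the kernel depends only on Euclidean distance between points, the weights are unchanged by any rotation plus translation. The snippet below is a minimal illustration under assumed choices (dense pairwise distances, a Gaussian radial profile, scalar per-point features, hypothetical function names), not the volumetric convolution proposed in the thesis.

```python
import numpy as np

def distance_conv(points, features, kernel):
    """Aggregate per-point features with weights that depend only on Euclidean distance."""
    diff = points[:, None, :] - points[None, :, :]   # pairwise offsets, shape (n, n, 3)
    dist = np.linalg.norm(diff, axis=-1)             # pairwise distances, shape (n, n)
    weights = kernel(dist)
    return weights @ features / weights.sum(axis=1, keepdims=True)

def rotation_zx(a, b):
    """Rotation by angle a about z composed with rotation by angle b about x."""
    rz = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0,        0.0,       1.0]])
    rx = np.array([[1.0, 0.0,        0.0],
                   [0.0, np.cos(b), -np.sin(b)],
                   [0.0, np.sin(b),  np.cos(b)]])
    return rz @ rx

rng = np.random.default_rng(0)
kernel = lambda d: np.exp(-(d / 0.5) ** 2)           # assumed radial profile

cloud = rng.normal(size=(200, 3))                    # toy point cloud
feats = rng.normal(size=(200, 4))                    # scalar features, 4 channels

R = rotation_zx(0.7, -1.1)
t = np.array([2.0, -1.0, 0.5])
moved = cloud @ R.T + t                              # rigid motion of the whole cloud

out = distance_conv(cloud, feats, kernel)
out_moved = distance_conv(moved, feats, kernel)

print(np.allclose(out, out_moved, atol=1e-9))
```

For scalar features, equivariance reduces to the per-point outputs being unchanged when the cloud is moved rigidly, which is what the final check tests.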
Together, these contributions position geodesic-distance aggregation as a unifying foundation for distortion-free, lens-agnostic, rotation-equivariant perception across 2D and 3D, offering a principled alternative to the augmentation-heavy and spectral-transform-based approaches that dominate spherical deep learning today.
Committee:
Prof. László A. Jeni (co-chair)
Prof. Sebastian Scherer (co-chair)
Prof. Shubham Tulsiani
Mosam Dabhi
