Dynamic and Large-Scale 3D Reconstruction via Test-Time Optimization with Priors
Abstract
Methods for 3D reconstruction from images or videos typically fall into two categories: 1) scene-specific test-time optimization (often via differentiable rendering) and 2) large-scale data-driven priors (often in the form of feedforward neural networks). This thesis presents two works that combine both paradigms for dynamic and large-scale 3D reconstruction from casual images and videos.
First, we present DressRecon, which recovers dynamic human avatars from a single video, focusing on extremely loose clothing or handheld object interactions. Prior work in human reconstruction is either limited to tight clothing with no object interactions, or requires costly inputs such as calibrated multi-view captures or personalized template scans. Our key insight is to combine data-driven priors about articulated human body shape with video-specific test-time optimization designed to capture detailed clothing and hand-object interactions. We accomplish this by learning a neural implicit model that disentangles body and clothing deformations into separate motion model layers. To capture the subtle geometry of clothing, we leverage image-based priors such as human body pose, surface normals, and optical flow during optimization. The resulting neural fields can be extracted into time-consistent meshes, or further optimized as explicit 3D Gaussians for high-fidelity interactive rendering. On datasets with highly challenging clothing deformations and object interactions, DressRecon yields higher-fidelity 3D reconstructions than prior art. (Project page: https://jefftan969.github.io/dressrecon/)
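As a concrete illustration of the layered motion model, the sketch below shows one way disentangled deformation layers could be parameterized: a coarse body-articulation warp composed with a residual clothing warp, each a small MLP over canonical position and time. This is a minimal sketch under that assumption, not DressRecon's actual architecture; the names LayeredDeformation, body_layer, and clothing_layer are hypothetical.

import torch
import torch.nn as nn

class LayeredDeformation(nn.Module):
    # Hypothetical two-layer motion model: a coarse body warp composed
    # with a residual clothing warp, each an MLP over (point, time).
    def __init__(self, hidden=128):
        super().__init__()
        self.body_layer = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(), nn.Linear(hidden, 3))
        self.clothing_layer = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, x, t):
        # x: (N, 3) canonical points; t: (N, 1) normalized timestamps.
        xt = torch.cat([x, t], dim=-1)
        x_body = x + self.body_layer(xt)  # coarse body articulation
        xt_body = torch.cat([x_body, t], dim=-1)
        return x_body + self.clothing_layer(xt_body)  # residual clothing motion

points, times = torch.randn(1024, 3), torch.rand(1024, 1)
warped = LayeredDeformation()(points, times)  # (1024, 3) deformed points

Composing the layers in this order lets optimization attribute coarse motion to the body warp and reserve the residual layer for clothing- and object-specific deformation, mirroring the disentanglement described above.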
We also discuss TartanSplat, which focuses on large-scale 3D reconstruction from sparse-view aerial and ground images. Existing methods such as COLMAP and 3D Gaussian Splatting assume dense view coverage and often fail to reconstruct complete geometry from casually captured sparse-view images. Our key insight is to leverage priors from monocular depth models, aligning frame-specific dense reconstructions to a sparse global point cloud obtained via structure-from-motion. Our method produces an interactive 3D reconstruction from ~50 images of a large-scale outdoor scene, and achieved first place in the February 2025 challenge of the IARPA WRIVA program.
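One standard way to realize this alignment is a per-frame scale-and-shift least-squares fit between monocular depth and the sparse SfM depths at pixels where both are available. The sketch below assumes that formulation; the function name align_depth and the synthetic data are illustrative, not TartanSplat's actual implementation.

import numpy as np

def align_depth(mono_depth, sfm_depth, mask):
    # Fit per-frame scale s and shift b minimizing
    # || s * mono + b - sfm ||^2 over pixels covered by SfM points.
    d, z = mono_depth[mask], sfm_depth[mask]
    A = np.stack([d, np.ones_like(d)], axis=1)  # (M, 2) design matrix
    (s, b), *_ = np.linalg.lstsq(A, z, rcond=None)
    return s * mono_depth + b  # dense depth expressed in the SfM scale

# Synthetic check: recover a known scale/shift from sparse samples.
mono = np.random.rand(48, 64) + 0.5
sfm = 2.0 * mono + 0.1
mask = np.zeros_like(mono, dtype=bool)
mask[::8, ::8] = True
aligned = align_depth(mono, sfm, mask)  # ~= sfm at every pixel

Once each frame's dense depth is expressed in the global SfM scale, the per-frame point clouds can be fused into a single reconstruction, which is what lets sparse-view captures yield complete geometry.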
BibTeX
@mastersthesis{Tan-2025-148171,
author = {Jeff Tan},
title = {Dynamic and Large-Scale 3D Reconstruction via Test-Time Optimization with Priors},
year = {2025},
month = {July},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-25-75},
keywords = {3D Reconstruction, Dynamic Human Modeling},
}