Estimating and Generating Human Motions from Interactions

PhD Thesis, Tech. Report CMU-RI-TR-25-89, September 2025

Abstract

Modeling human motion is a fundamental topic in computer vision and robotics. Humans interact with the 3D physical world in complex ways, involving both changes in position (global motion) and articulation and deformation of the body (local motion). This thesis studies human motion in interactions with other humans, with environments, and with manipulated objects. We focus on the tasks of estimating and generating human motions, emphasizing the integration of diverse knowledge sources such as video, motion capture, and physics simulation.

We begin by examining human-human interactions. Using widely available video data, we study implicit interactions in which individuals navigate toward goals while avoiding collisions. We first address multi-object tracking and then progress to trajectory generation, covering both the estimation and the generation perspectives. For tracking, we start with learning-based methods and then revisit classic parametric filtering. To generate socially aware trajectories, we combine parametric priors with generative models to leverage inductive biases from data.
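
To make "classic parametric filtering" concrete, below is a minimal constant-velocity Kalman filter of the kind commonly used in tracking-by-detection pipelines. The matrices, noise values, and measurements are illustrative assumptions for this sketch, not the configuration used in the thesis.

import numpy as np

# Minimal constant-velocity Kalman filter for 2D trajectory tracking.
# All parameter values here are illustrative, not taken from the thesis.
dt = 1.0                                # time between frames
F = np.array([[1, 0, dt, 0],            # state transition: position += velocity * dt
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],             # we observe position (x, y) only
              [0, 1, 0, 0]], dtype=float)
Q = 0.01 * np.eye(4)                    # process noise covariance
R = 0.10 * np.eye(2)                    # measurement noise covariance

x = np.zeros(4)                         # state: [x, y, vx, vy]
P = np.eye(4)                           # state covariance

def kf_step(x, P, z):
    """One predict-update cycle given a new position measurement z."""
    x = F @ x                           # predict state
    P = F @ P @ F.T + Q                 # predict covariance
    y = z - H @ x                       # innovation
    S = H @ P @ H.T + R                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P

for z in [np.array([0.0, 0.0]), np.array([1.1, 0.9]), np.array([2.0, 2.1])]:
    x, P = kf_step(x, P, z)
print(x[:2], x[2:])                     # filtered position and velocity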

The second part of the thesis investigates human-scene interactions. As people frequently bend and articulate their bodies for daily tasks, we examine both local and global body motion. We utilize motion capture data to ensure visual realism in motion generation and employ physics simulation to enforce physical realism. We begin by validating the use of physics-based imitators for diverse motions. Subsequently, we place a human agent in a static scene and develop a reinforcement learning policy to generate physically grounded interactions guided by language instructions.
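
As an illustration of what such a language-conditioned policy can look like, here is a minimal PyTorch sketch that maps a simulator state and a language embedding to an action distribution. The architecture, dimensions, and names are assumptions made for the sketch, not the network used in the thesis.

import torch
import torch.nn as nn

class LanguageConditionedPolicy(nn.Module):
    """Hypothetical Gaussian policy conditioned on a language embedding."""
    def __init__(self, state_dim=64, lang_dim=512, action_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim + lang_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.mean = nn.Linear(256, action_dim)          # action mean
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state, lang_emb):
        h = self.backbone(torch.cat([state, lang_emb], dim=-1))
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

policy = LanguageConditionedPolicy()
state = torch.randn(1, 64)              # e.g., proprioception plus scene features
lang = torch.randn(1, 512)              # embedding of an instruction like "sit on the chair"
action = policy(state, lang).sample()   # e.g., joint targets for the simulated body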

In the third part, we extend our study to human motion during object manipulation in dynamic environments. Because human-object motion capture data are scarce, we focus on generating static hand-object grasps that generalize across a wide range of object shapes by leveraging large-scale object shape datasets. These grasps then guide a reinforcement learning policy that produces full-body motion for transporting an in-hand object in simulation.
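
To illustrate how a pregenerated grasp can guide such a policy, the sketch below shapes a reward around a target grasp pose and a transport goal. The function, weights, and names are hypothetical, not the reward used in the thesis.

import torch

def grasp_guidance_reward(hand_pose, target_grasp, obj_pos, goal_pos,
                          w_grasp=1.0, w_goal=0.5):
    """Illustrative reward: track a pregenerated grasp while moving the object.

    hand_pose/target_grasp: flattened hand joint configurations;
    obj_pos/goal_pos: object and goal positions in the simulator.
    """
    grasp_err = torch.norm(hand_pose - target_grasp, dim=-1)  # match the grasp
    goal_err = torch.norm(obj_pos - goal_pos, dim=-1)         # transport the object
    return -(w_grasp * grasp_err + w_goal * goal_err)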

Building on insights from the earlier chapters, we observe that generative models are effective and flexible for both motion estimation and motion generation, which motivates a unified model. We propose a diffusion model for human motion in which conditioning the denoising process lets the same model perform estimation. When conditioned on video, the model achieves motion estimation performance comparable to that of specialized estimation models.
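
To make the unified view concrete, the following schematic PyTorch sketch runs reverse (DDPM-style) diffusion over a motion vector with a conditioning input: an uninformative condition yields generation, while video features turn the same sampler into an estimator. The denoiser, noise schedule, and dimensions are assumptions for illustration, not the thesis's model.

import torch
import torch.nn as nn

T = 100
betas = torch.linspace(1e-4, 0.02, T)        # linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

class Denoiser(nn.Module):
    """Hypothetical noise predictor over a flattened motion representation."""
    def __init__(self, motion_dim=69, cond_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + cond_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, motion_dim),
        )

    def forward(self, x_t, t, cond):
        t_emb = t.float().view(-1, 1) / T     # scalar timestep embedding
        return self.net(torch.cat([x_t, cond, t_emb], dim=-1))

@torch.no_grad()
def sample(model, cond, motion_dim=69):
    """Reverse diffusion: the condition decides whether this generates or estimates."""
    x = torch.randn(cond.shape[0], motion_dim)
    for t in reversed(range(T)):
        eps = model(x, torch.full((cond.shape[0],), t), cond)
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

model = Denoiser()
video_feats = torch.randn(1, 256)             # hypothetical per-clip video embedding
motion = sample(model, video_feats)           # motion estimate conditioned on video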

BibTeX

@phdthesis{Cao-2025-148852,
  author   = {Jinkun Cao},
  title    = {Estimating and Generating Human Motions from Interactions},
  year     = {2025},
  month    = {September},
  school   = {Carnegie Mellon University},
  address  = {Pittsburgh, PA},
  number   = {CMU-RI-TR-25-89},
  keywords = {Computer Vision, Motion Tracking, Motion Generation},
}