
Learning Generalizable Robot Skills for Dynamic and Interactive Tasks

PhD Thesis, Tech. Report CMU-RI-TR-25-84, August 2025

Abstract

Recent years have seen increasing interest in developing robots capable of lifelong and reliable operation around humans in home and factory environments. Despite impressive recent progress on long-horizon tasks such as laundry folding, current efforts predominantly focus on quasi-static tasks in highly structured settings.
General-purpose robots that assist humans in daily tasks should be capable of performing a wider range of dynamic and dexterous tasks in unstructured environments, safely interacting with humans and their environment, and effectively learning from and adapting to new experiences.

One promising approach to building robots capable of assistive lifelong operation is to endow them with three core capabilities: a metric, semantic, and temporal (memory) understanding of the world; the ability to perform long-horizon reasoning and planning; and the capacity to execute real-time, closed-loop policies that are dexterous, reactive, and safe in dynamic environments.
Such an approach necessitates bridging a fundamental gap in the field. On the one hand, traditional approaches based on classical control provide strong guarantees of safety and optimality for real-time reactive control of dynamic tasks; however, they often rely on low-dimensional, fully observable states and lack semantic awareness. On the other hand, recent vision-language-action models capture rich world semantics and generalize to novel settings, but they are typically limited to quasi-static tasks and do not explicitly handle safety or adapt to dynamic changes in real time.
This thesis aims to bridge this gap by exploring methods at the intersection of modern control theory, robot learning, and multimodal foundation models, in order to learn generalizable and safety-aware robot policies capable of performing complex, dynamic, and interactive tasks in unstructured environments.

Towards this goal, I explore various approaches for leveraging strong structural, algorithmic, and semantic priors that enable both generalization and real-time reactive control of dynamic tasks. My works [1] and [2] focus on learning to perform dynamic pickup tasks, using dynamic-Graph Neural Networks (GNNs) as structural priors and differentiable control algorithms (iLQR) as algorithmic priors to learn compositional skills that operate over multiple dynamic modes and generalize to complex environments and novel goals.
[1] introduces a framework for learning differentiable optimal skills (LQR) for switching linear dynamical systems from expert demonstrations. The resulting control scheme predicts and accounts for discontinuities due to contact, reacts to unanticipated contact events, and generalizes to novel goal conditions and skill compositions.
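For intuition, the following is a minimal sketch (not the implementation from [1]) of finite-horizon LQR over a switching linear system: a known mode sequence selects the per-step dynamics, and the standard backward Riccati recursion yields time-varying feedback gains. [1] additionally learns the modes and dynamics from demonstrations and reacts to mode changes online; the matrices and mode sequence here are hypothetical placeholders.

import numpy as np

def switching_lqr(A_modes, B_modes, mode_seq, Q, R, Qf):
    # Backward Riccati recursion; returns one feedback gain K_t per time step.
    P = Qf.copy()
    gains = []
    for t in reversed(range(len(mode_seq))):
        A, B = A_modes[mode_seq[t]], B_modes[mode_seq[t]]
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]

# Example: a 1-D point mass with a free-space mode and a damped in-contact mode.
dt = 0.05
A_free = np.array([[1.0, dt], [0.0, 1.0]])
A_contact = np.array([[1.0, dt], [0.0, 0.8]])
B = np.array([[0.0], [dt]])
mode_seq = [0] * 10 + [1] * 10                  # free for 10 steps, then in contact
gains = switching_lqr([A_free, A_contact], [B, B], mode_seq,
                      Q=np.eye(2), R=np.array([[0.1]]), Qf=10.0 * np.eye(2))
u_0 = -gains[0] @ np.array([1.0, 0.0])          # feedback action at t = 0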
[2] extends this framework to non-linear multibody interactive systems by learning stable, locally linear dynamics models using dynamic-GNNs, which enables generalization to a novel number of objects and to interactions unseen during training.

MResT combines the semantic reasoning capabilities of pre-trained frozen vision-language models (used at low frequency) with the adaptiveness of fine-tuned smaller networks (used at high frequency) to enable zero-shot generalization to semantic scene variations as well as real-time closed-loop control of precise and dynamic tasks. This work focuses on learning generalizable multi-task policies using multi-resolution sensing in table-top settings for short-horizon tasks.
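The multi-resolution structure used in MResT can be pictured as a two-rate control loop in which a slow semantic model and a fast reactive policy run at different frequencies. The snippet below is only an illustration of this pattern under assumed interfaces (coarse_vlm, fine_policy), not the MResT code.

class MultiResolutionController:
    def __init__(self, coarse_vlm, fine_policy, coarse_every=30):
        self.coarse_vlm = coarse_vlm      # large frozen VLM, queried slowly (hypothetical callable)
        self.fine_policy = fine_policy    # small fine-tuned network, runs every step (hypothetical callable)
        self.coarse_every = coarse_every  # number of fast steps per slow (VLM) update
        self._coarse_feat = None
        self._step = 0

    def act(self, rgb_image, proprio, task_text):
        # Refresh the semantic context only every `coarse_every` steps.
        if self._step % self.coarse_every == 0 or self._coarse_feat is None:
            self._coarse_feat = self.coarse_vlm(rgb_image, task_text)
        self._step += 1
        # The fast policy closes the loop on fine-grained observations at every step.
        return self.fine_policy(self._coarse_feat, proprio)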
GraphEQA considers the long-horizon task of embodied question answering in large, unseen indoor environments. It constructs, in real time, a multi-modal memory comprising a 3D metric-semantic scene graph and task-relevant images, which grounds a VLM-based hierarchical planner to perform situated and semantically informed long-horizon exploration and planning.

Finally, VLTSafe combines the semantic safety reasoning capabilities of large pretrained VLMs with the safety guarantees of low-level reachability-based RL policies to enable safe control of dynamic tasks in cluttered environments. This approach uses a VLM to identify relevant task and safety constraints, which inform pretrained, parameterized reachability-based RL policies during execution.
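One way to picture the VLTSafe pipeline is as a least-restrictive safety filter: a VLM proposes task-relevant safety constraints, and a pretrained reachability-based policy, parameterized by those constraints, overrides the nominal task action only when the state approaches the unsafe set. The sketch below uses hypothetical names (identify_constraints, value_fn, safety_policy) and illustrates this general pattern rather than the actual method.

def safe_step(state, image, task_text, vlm, task_policy, safety_policy, value_fn, margin=0.0):
    # Hypothetical interfaces throughout; illustrative pattern only.
    constraints = vlm.identify_constraints(image, task_text)  # e.g. "keep away from the glass"
    u_task = task_policy(state, task_text)                    # nominal task action
    # value_fn approximates a reachability value function: a value above `margin`
    # means the constraint set can still be avoided from this state.
    if value_fn(state, constraints) > margin:
        return u_task                                          # nominal action is safe to apply
    return safety_policy(state, constraints)                   # otherwise fall back to the safe controller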

This thesis takes a step towards the development of generalist embodied agents that integrate the semantic understanding and generalizability of multimodal foundation models with robust, closed-loop control policies that ensure efficiency, dexterity, and safety, and it opens new avenues for future research in scalable, safe, and generalizable robot learning.

BibTeX

@phdthesis{Saxena-2025-148385,
author = {Saumya Saxena},
title = {Learning Generalizable Robot Skills for Dynamic and Interactive Tasks},
year = {2025},
month = {August},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-25-84},
keywords = {robot learning, foundation models for robotics, semantic safety, dynamic manipulation, semantic navigation, embodied AI, scene graphs, switching systems, graph neural networks, differentiable control},
}