Towards Scalable Real2Sim and Evaluations for VLAs - Robotics Institute Carnegie Mellon University

MSR Thesis Presentation

Yash Jangir
MSR Student, Robotics Institute, Carnegie Mellon University
Tuesday, April 21
10:00 am to 11:00 am
GHC 6115
Towards Scalable Real2Sim and Evaluations for VLAs

Abstract:
The evaluation of generalist robot policies, particularly Vision-Language-Action (VLA) models, is limited by the cost and scalability of real-world testing. This thesis introduces RobotArena ∞, a scalable benchmarking framework for large-scale simulation-based evaluation. Central to RobotArena ∞ is a fully automated reality-to-simulation (Real2Sim) pipeline that converts monocular video demonstrations from datasets such as Bridge, DROID, and RH20T into high-fidelity simulated environments. The pipeline integrates automated robot-camera calibration, 3D asset reconstruction, and system identification to align simulated dynamics with real-world behavior. To assess policy robustness, RobotArena ∞ applies controlled domain perturbations, including variations in background textures and object configurations. Policy performance is evaluated through two complementary mechanisms. First, automated VLM-guided scoring leverages vision-language models to reason jointly over visual observations and simulator states, producing structured estimates of task progress and success. Second, scalable human preference evaluation employs crowdsourced pairwise comparisons between policy rollouts, enabling the aggregation of human judgments into consistent global rankings of policy performance.
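To make the idea of aggregating pairwise human judgments into a global ranking concrete, here is a minimal sketch. The abstract does not say which aggregation model RobotArena ∞ uses; this example assumes a Bradley-Terry model fitted with the standard minorization-maximization update, and the policy names and vote counts are hypothetical.

```python
from collections import defaultdict


def bradley_terry(comparisons, iters=500, tol=1e-10):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Assumption: this is an illustrative aggregation model, not
    necessarily the one used by RobotArena. Returns a dict mapping
    each item to a strength; strengths sum to 1.
    """
    wins = defaultdict(float)         # total wins per item
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    items = set()
    for winner, loser in comparisons:
        if winner == loser:
            continue  # a self-comparison carries no information
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        items.update((winner, loser))

    items = sorted(items)
    if not items:
        return {}
    p = {i: 1.0 / len(items) for i in items}

    for _ in range(iters):
        new_p = {}
        for i in items:
            denom = 0.0
            for j in items:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    # Guard against two zero-strength items meeting.
                    denom += n_ij / max(p[i] + p[j], 1e-12)
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        total = sum(new_p.values())
        if total == 0:
            break  # no wins recorded anywhere; keep the previous estimate
        new_p = {i: v / total for i, v in new_p.items()}
        delta = max(abs(new_p[i] - p[i]) for i in items)
        p = new_p
        if delta < tol:
            break
    return p


def rank_policies(comparisons):
    """Return items ordered from strongest to weakest."""
    strengths = bradley_terry(comparisons)
    return sorted(strengths, key=strengths.get, reverse=True)


# Hypothetical votes: each tuple is (preferred rollout, other rollout).
votes = (
    [("policy_a", "policy_b")] * 8 + [("policy_b", "policy_a")] * 2
    + [("policy_b", "policy_c")] * 7 + [("policy_c", "policy_b")] * 3
    + [("policy_a", "policy_c")] * 9 + [("policy_c", "policy_a")] * 1
)
ranking = rank_policies(votes)
```

A model-based fit like this is one common way to turn sparse, noisy pairwise votes into a single ordering even when not every pair of policies has been compared equally often.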
We evaluate six state-of-the-art VLAs across hundreds of environments and 8,500+ human comparisons. Results show that while models perform well within training distributions, they struggle under distribution shifts. Automated VLM rankings closely match human preferences, supporting their use as scalable proxies. Existing benchmarks with limited diversity may overestimate policy capabilities. RobotArena ∞ provides a reproducible, extensible platform for rigorous evaluation of robotic foundation models, promoting standardized and trustworthy benchmarking for future generalist robots. Project page is available here.
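One simple way to quantify how closely an automated ranking matches a human-derived ranking is a rank correlation. The sketch below assumes Spearman's rho over two tie-free rankings of the same policies; the abstract does not state which agreement measure the thesis reports, and the policy names are hypothetical.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation between two orderings of the same items.

    Each ranking is a list ordered best to worst with no ties.
    Returns a value in [-1, 1]; 1 means identical order.
    """
    n = len(ranking_a)
    if n < 2:
        raise ValueError("need at least two items to correlate")
    if len(ranking_b) != n or len(set(ranking_a)) != n:
        raise ValueError("rankings must be tie-free and of equal length")
    if set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same items")

    position_b = {item: rank for rank, item in enumerate(ranking_b)}
    # Sum of squared rank differences across all items.
    d_squared = sum(
        (rank - position_b[item]) ** 2 for rank, item in enumerate(ranking_a)
    )
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical rankings of six policies from two evaluators.
human_ranking = ["p1", "p2", "p3", "p4", "p5", "p6"]
vlm_ranking = ["p1", "p3", "p2", "p4", "p5", "p6"]  # one adjacent swap
agreement = spearman_rho(human_ranking, vlm_ranking)
```

A value near 1 would indicate that the two evaluators order the policies almost identically, which is the kind of agreement needed before treating an automated score as a proxy for human judgment.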

Committee:
Dr. Katerina Fragkiadaki (co-chair)
Dr. Yonatan Bisk (co-chair)
Dr. Shubham Tulsiani
Mr. Ayush Jain