ICVSS Computer Vision for Spatial and Physical Intelligence

Stages of Robot Learning

Dinesh Jayaraman

University of Pennsylvania, US

Abstract

Robotics is currently chasing its 'ChatGPT moment': the quest for versatile, general-purpose agents capable of immediate deployment by non-experts in unstructured environments. In this lecture, we will study the landscape of pre-training and post-training processes involved in today's predominant approaches towards this goal (most prominently, 'vision-language-action' models). We will then explore emerging alternative approaches to overcome the relative data deficiency of the robotics domain, particularly work performed within my research group that explicitly leverages the general-purpose language understanding and coding abilities of today's frontier multi-modal language models to bypass traditional data bottlenecks. Finally, we will argue that a 'foundation model' for robotics is only as useful as it is efficiently deployable. We will explore 'Inference-Time Efficiency,' moving beyond training data to study how robots can dynamically allocate sensors, representation bits, energy, time, and computation on an as-needed basis to maintain high performance in resource-constrained physical bodies.