Seeing, Understanding, Doing: How Video Models Are Reshaping Robotics

Cordelia Schmid

INRIA, FR

Abstract

Computer vision has recently seen rapid progress in video understanding. Today, state-of-the-art models can generate descriptive captions, answer complex questions about long video sequences, and reconstruct dynamic environments in three dimensions. This lecture begins with an overview of these recent methods, explaining the capabilities of modern video-language and 3D vision models. We then bridge the gap to robotics, demonstrating how these advances enhance a robot's visual perception and spatial awareness. Furthermore, we show how world models trained on large amounts of video data enable robots to simulate physical outcomes and improve their performance on manipulation and planning tasks. We conclude by presenting recent examples of robots acquiring novel skills and dexterous behaviors directly by observing videos of humans performing everyday actions.