Foundation Models for 3D / 4D Scene Understanding and Content Creation
Leonidas Guibas
Stanford University, US
Abstract
In the last few years, large pre-trained models in the language and vision-language areas have shown impressive capabilities and emergent behaviors, even on tasks they were not specifically trained for. These so-called foundation models (FMs) have reshaped how we approach learning problems as we aim for the grand goal of artificial general intelligence (AGI). When it comes to 3D or 4D tasks, however -- tasks that involve spatial reasoning in 3D about geometry and motion -- the state of FM development is less clear. This is because current FMs are trained on vast web data that includes text, images, and videos -- but little 3D. It is important to assess the 3D / 4D awareness and capabilities of FMs and study how to improve them, as our world is 3D and (1) perceiving, reasoning, and acting in the real world requires 3D understanding, and (2) 3D consistency is crucial for realistic visual content generation. The obvious challenge is that the real 3D data we have is orders of magnitude less than what is available in the language and vision domains. Furthermore, 3D annotations are cumbersome to provide. This lecture will survey the state of the art on this front and also illustrate how well-designed collaborations between current FMs and 3D-aware agents can be synergistic in solving challenging 3D / 4D tasks.