ICVSS Computer Vision for Spatial and Physical Intelligence

Demystifying the impact of data for image and video understanding

Christoph Feichtenhofer

FAIR, Meta, USA

Abstract

This lecture will explore key ingredients for research on foundation models for image and video understanding, with a particular emphasis on the role of data. During the initial decade of the deep learning revolution, beginning with AlexNet, the visual recognition field primarily focused on creating innovative architectures while relying on fixed datasets like ImageNet for training. In recent years, in pursuit of the best benchmark numbers, the community has started to combine various training data sources, including distillation from black-box proprietary models. Nevertheless, the computer vision community rarely gives the effect of data the same attention as novel model design or training algorithms. This lecture will demystify the impact of large-scale data on image models trained via text supervision, the process of building multimodal large language models from scratch, the effect of synthetic data on vision foundation models, and the principle of a data engine for training image and video segmentation models.