Foundation models for video generation, editing, and personalization
Ishan Misra
GenAI, Meta, US
Abstract
Videos are a powerful source of supervision for training machine learning models. They capture spatio-temporal dynamics, object state changes, actions, camera motion, physics, and more. In this talk, I'll cover video generation models that leverage the rich predictive signal in videos for training. I'll discuss the fundamental building blocks needed to train large-scale video generation models, including their theory and practice. I'll then present our works on video generation and editing, Emu Video and FlowVid, which span different aspects of model scale, efficiency, and applications. Finally, I'll introduce our recent work, MovieGen, which achieves state-of-the-art video generation performance. MovieGen also enables users to generate high-quality videos from simple text inputs, personalize or edit them, and add audio. We establish scaling laws for video generation and show that these models scale similarly to Large Language Models: scaling training FLOPs, model size, and data together leads to improved performance.
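As a concrete illustration of the scaling-law claim above, the minimal sketch below fits a saturating power law of validation loss against training compute, the same general functional form used in LLM scaling-law studies. The function name, data points, and fitted values are illustrative assumptions, not MovieGen's methodology or measurements.

```python
# Illustrative sketch: fitting a saturating power law, loss ~ a * C^(-b) + c,
# to hypothetical (training compute, validation loss) measurements -- the same
# general functional form used in LLM scaling-law studies. The numbers below
# are synthetic placeholders, NOT results from MovieGen.
import numpy as np
from scipy.optimize import curve_fit

def power_law(compute, a, b, c):
    """Validation loss as a saturating power law in (normalized) compute."""
    return a * compute ** (-b) + c

# Hypothetical measurements: training FLOPs and the validation loss reached.
flops = np.array([1e19, 1e20, 1e21, 1e22, 1e23])
loss = np.array([2.10, 1.75, 1.52, 1.38, 1.29])

# Normalize compute to the smallest run so the fit is well-conditioned.
compute = flops / flops[0]
(a, b, c), _ = curve_fit(power_law, compute, loss, p0=[1.0, 0.2, 1.0])
print(f"fitted law: loss ~ {a:.2f} * (C / 1e19 FLOPs)^(-{b:.3f}) + {c:.2f}")

# Extrapolate to a 10x larger compute budget than the largest observed run.
print(f"predicted loss at 1e24 FLOPs: {power_law(1e24 / flops[0], a, b, c):.3f}")
```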