ICVSS Computer Vision for Spatial and Physical Intelligence

Generating images and videos with diffusion models

Vittorio Ferrari

Synthesia, UK

Abstract

Diffusion models have revolutionized image and video synthesis, enabling astonishing levels of realism and creative flexibility. We will start by diving into the fundamentals behind these models: the denoising diffusion process and transformers, which play a critical role by enabling conditioning on a text prompt, providing spatial attention within convolutional denoisers, and, in recent models, even forming the entire denoising core. We will examine the key components that make image generation models effective, including latent diffusion, classifier-free guidance, flow-matching training, and post-processing super-resolution models. We will also discuss extensions for video generation, such as spatio-temporal autoencoders, spatio-temporal attention, the typical “text-to-image + image-to-video” generation architecture, and the crucial role of multi-stage training strategies and their associated data curation efforts. The final part of the lecture will focus on controllability: how to steer generation beyond text prompts using auxiliary inputs such as reference images, edge maps, segmentation masks, and human pose maps. We will explore diverse methods for achieving this, e.g. noise manipulation, IP-Adapter, ControlNet, ReferenceNet, and recent unified transformer architectures that natively integrate multiple conditioning signals (e.g. OmniGen).
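
To make the classifier-free guidance component mentioned above concrete, below is a minimal Python sketch of a single guided denoising step. The denoiser function, the guidance scale value, and all array shapes are illustrative assumptions for this sketch, not the implementation of any particular model discussed in the lecture.

    # Minimal sketch of a single classifier-free guidance (CFG) denoising step.
    # Everything here is a toy illustration: `denoiser` stands in for a real
    # text-conditioned noise-prediction network (U-Net or transformer), and the
    # update rule omits the noise-schedule scaling a real sampler would apply.
    import numpy as np

    def denoiser(x_t, t, text_embedding):
        # Toy noise predictor; the timestep t is ignored in this illustration.
        if text_embedding is None:                # unconditional branch (empty prompt)
            return 0.1 * x_t
        return 0.1 * x_t + 0.01 * text_embedding  # conditional branch

    def cfg_step(x_t, t, text_embedding, guidance_scale=7.5):
        eps_uncond = denoiser(x_t, t, None)
        eps_cond = denoiser(x_t, t, text_embedding)
        # CFG: extrapolate the conditional prediction away from the unconditional one.
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        return x_t - eps                          # placeholder update step

    x = np.random.randn(64, 64, 4)                # a latent-space sample (latent diffusion)
    prompt_emb = np.random.randn(64, 64, 4)       # stand-in for a text embedding
    x = cfg_step(x, t=999, text_embedding=prompt_emb, guidance_scale=7.5)

A larger guidance scale pushes samples closer to the text prompt at the cost of diversity; setting it to 1.0 recovers plain conditional sampling.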