Video Probabilistic Diffusion Models in Projected Latent Space

Sihyun Yu1, Kihyuk Sohn2, Subin Kim1, Jinwoo Shin1

1KAIST       2Google Research

[Paper]            [Code]

Note: If some videos are not displayed properly, try refreshing the page!

Abstract


Despite the remarkable progress in deep generative models, synthesizing high-resolution and temporally coherent videos remains a challenge due to their high dimensionality and complex temporal dynamics along with large spatial variations. Recent works on diffusion models have shown their potential to solve this challenge, yet they suffer from severe computation and memory inefficiency that limits their scalability. To handle this issue, we propose a novel generative model for videos, coined projected latent video diffusion models (PVDM), a probabilistic diffusion model that learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources. Specifically, PVDM is composed of two components: (a) an autoencoder that projects a given video into 2D-shaped latent vectors that factorize the complex cubic structure of video pixels, and (b) a diffusion model architecture specialized for our new factorized latent space, together with a training/sampling procedure to synthesize videos of arbitrary length with a single model. Experiments on popular video generation datasets demonstrate the superiority of PVDM over previous video synthesis methods; e.g., PVDM obtains an FVD score of 639.7 on the UCF-101 long video (128 frames) generation benchmark, substantially improving upon the prior state-of-the-art score of 1773.4.
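To make the projection idea in (a) more concrete, the sketch below shows one simple way a video tensor of shape (B, C, T, H, W) can be collapsed into three 2D-shaped latent planes, one per axis pairing. This is only a toy illustration of the factorization described above: the class name `ToyTriplaneProjector`, the pooling-based projection, and all channel/shape choices are assumptions for exposition, not the trained PVDM autoencoder.

```python
import torch
import torch.nn as nn

class ToyTriplaneProjector(nn.Module):
    """Toy illustration: collapse a video tensor into three 2D-shaped latent
    maps by averaging over one axis at a time (time, height, width).
    The actual PVDM autoencoder learns this projection and reconstructs
    pixels from it; everything here is a simplified assumption."""

    def __init__(self, in_channels: int = 3, latent_channels: int = 4):
        super().__init__()
        # One lightweight conv head per projected plane (hypothetical choice).
        self.head_hw = nn.Conv2d(in_channels, latent_channels, 3, padding=1)  # plane over (H, W)
        self.head_tw = nn.Conv2d(in_channels, latent_channels, 3, padding=1)  # plane over (T, W)
        self.head_th = nn.Conv2d(in_channels, latent_channels, 3, padding=1)  # plane over (T, H)

    def forward(self, video: torch.Tensor):
        # video: (B, C, T, H, W) -> three 2D-shaped latents
        z_hw = self.head_hw(video.mean(dim=2))  # average over time   -> (B, c, H, W)
        z_tw = self.head_tw(video.mean(dim=3))  # average over height -> (B, c, T, W)
        z_th = self.head_th(video.mean(dim=4))  # average over width  -> (B, c, T, H)
        return z_hw, z_tw, z_th

if __name__ == "__main__":
    video = torch.randn(2, 3, 16, 64, 64)  # batch of 16-frame 64x64 clips
    planes = ToyTriplaneProjector()(video)
    print([p.shape for p in planes])
```

Because each latent is an ordinary 2D feature map, the diffusion model in (b) can operate on image-like inputs instead of full 3D video volumes, which is where the computation and memory savings come from.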


Main Result


Note: We provide both short and long videos generated by our method; the short baseline videos correspond to 16 consecutive frames subsampled from the long video examples. For a fair comparison, we use the qualitative results of the baselines (DIGAN and StyleGAN-V) from the official StyleGAN-V webpage.

Short video (16 frames)


UCF-101: PVDM (ours)
SkyTimelapse: PVDM (ours)

Long video (128 frames)


UCF-101: Real, DIGAN, StyleGAN-V, PVDM (ours)
SkyTimelapse: Real, DIGAN, StyleGAN-V, PVDM (ours)

Acknowledgement


SY thanks Jaehyung Kim, Jihoon Tack, and Younggyo Seo for their helpful discussions and comments. SY also acknowledges Ivan Skorokhodov for providing checkpoints and qualitative results of StyleGAN-V and other baselines. This webpage template was originally made by Subin Kim and Sihyun Yu for the NVP project.