
Kandinsky Video is a text-to-video generation model based on the FusionFrames architecture and the Kandinsky 3.0 text-to-image model. It consists of two main stages: keyframe generation and interpolation. Our approach to temporal conditioning allows us to generate videos with high-quality appearance, smoothness, and dynamics.

Abstract

Multimedia generation approaches occupy a prominent place in artificial intelligence research. Text-to-image models have achieved high-quality results over the last few years, whereas video synthesis methods have only recently started to develop. This paper presents a new two-stage latent diffusion text-to-video generation architecture based on a text-to-image diffusion model. The first stage performs keyframe synthesis to lay out the storyline of a video, while the second generates interpolation frames to make the motion of the scene and objects smooth. We compare several temporal conditioning approaches for keyframe generation. The results show the advantage of using separate temporal blocks over temporal layers in terms of metrics reflecting video generation quality and human preference. The design of our interpolation model significantly reduces computational costs compared to other masked frame interpolation approaches. Furthermore, we evaluate different configurations of the MoVQ-based video decoding scheme to improve consistency and achieve better PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our pipeline with existing solutions and achieve top-2 scores overall and top-1 among open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054.

Overall Pipeline


The encoded text prompt enters the U-Net keyframe generation model with temporal layers or blocks, and the sampled latent keyframes are then sent to the latent interpolation model, which predicts three interpolation frames between each pair of keyframes. A temporal MoVQ-GAN decoder is used to obtain the final video.
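
For illustration, below is a minimal sketch of this inference flow in PyTorch-style Python. The module names, the sample method, and the latent shapes are hypothetical placeholders, not the actual Kandinsky Video API.

import torch

# Hypothetical stage interfaces; names, methods, and shapes are illustrative only.
def generate_video(prompt, text_encoder, keyframe_unet, interpolation_unet, movq_decoder,
                   num_keyframes=8):
    # 1. Encode the text prompt once; it conditions both diffusion stages.
    text_embedding = text_encoder(prompt)

    # 2. Keyframe generation: sample latent keyframes with the temporal U-Net.
    keyframe_latents = keyframe_unet.sample(text_embedding, num_frames=num_keyframes)
    # keyframe_latents: (num_keyframes, C, H, W)

    # 3. Interpolation: predict three in-between latents for every pair of
    #    consecutive keyframes, raising the effective frame rate roughly 4x.
    frames = [keyframe_latents[0]]
    for left, right in zip(keyframe_latents[:-1], keyframe_latents[1:]):
        in_between = interpolation_unet.sample(
            text_embedding, cond_frames=torch.stack([left, right])
        )  # (3, C, H, W)
        frames.extend([*in_between, right])
    video_latents = torch.stack(frames)  # (4 * (num_keyframes - 1) + 1, C, H, W)

    # 4. Decode all latents to pixel space with the temporal MoVQ-GAN decoder.
    return movq_decoder(video_latents)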

Keyframes Generation with Temporal Conditioning


We examined two approaches to using temporal components in the pretrained T2I U-Net architecture from Kandinsky 3.0: the traditional approach of mixing spatial and temporal layers in one block (left) and our approach of allocating a separate temporal block (middle). All layers indicated in gray are not trained in the T2V architectures and are initialized with the weights of the T2I Kandinsky 3.0 model. NA and N in the left corner of the layers correspond to the presence of prenormalization layers with and without activation, respectively. For the different types of blocks we implemented different types of temporal attention and temporal convolution layers (left). We also implement different types of temporal conditioning. One of them is simple conditioning, where each pixel sees only its own value at different moments of time (1D layers). In 3D layers, a pixel can also see the values of its neighbors at different moments of time (right).
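
As a rough illustration of the separate temporal block idea, here is a minimal PyTorch sketch of a Conv1dAttn1d-style block, in which each pixel exchanges information only with itself across frames. The normalization, activations, and exact layer layout of the real model differ, so treat everything below as an assumption.

import torch
import torch.nn as nn

class TemporalBlock1D(nn.Module):
    """Sketch of a separate temporal block (Conv1dAttn1d variant, assumed layout).

    Spatial blocks from the pretrained T2I U-Net stay frozen; this block only
    mixes information along the time axis, independently for every pixel.
    """
    def __init__(self, channels, num_frames, heads=8):
        super().__init__()
        self.num_frames = num_frames
        self.norm_conv = nn.GroupNorm(32, channels)
        self.temporal_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.norm_attn = nn.GroupNorm(32, channels)
        self.temporal_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):
        # x: (batch * num_frames, C, H, W) -- frames flattened into the batch dim.
        bt, c, h, w = x.shape
        b = bt // self.num_frames

        # Temporal 1D convolution: each pixel sees only its own values over time.
        y = self.norm_conv(x)
        y = y.view(b, self.num_frames, c, h * w).permute(0, 3, 2, 1).reshape(b * h * w, c, self.num_frames)
        y = self.temporal_conv(y)
        y = y.reshape(b, h * w, c, self.num_frames).permute(0, 3, 2, 1).reshape(bt, c, h, w)
        x = x + y  # residual, so the pretrained spatial features are preserved

        # Temporal 1D attention over the frame axis, again per pixel.
        y = self.norm_attn(x)
        y = y.view(b, self.num_frames, c, h * w).permute(0, 3, 1, 2).reshape(b * h * w, self.num_frames, c)
        y, _ = self.temporal_attn(y, y, y)
        y = y.reshape(b, h * w, self.num_frames, c).permute(0, 2, 3, 1).reshape(bt, c, h, w)
        return x + y

A 3D variant would replace the per-pixel reshapes with temporal layers that also see a spatial neighborhood, e.g. a 3D convolution over (time, height, width).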

Video Frame Interpolation


The primary changes made to the T2I architecture are: (i) the input convolution now accepts three noisy latent inputs (interpolated frames) and two conditioning latent inputs (keyframes); (ii) the output convolution predicts three denoised latents; (iii) temporal convolutions have been introduced after each spatial convolution.
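
A minimal sketch of how changes (i) and (ii) might look in PyTorch is given below. The channel sizes, the class name, and the backbone argument are hypothetical, and the temporal convolutions of change (iii) are assumed to live inside that backbone.

import torch
import torch.nn as nn

class InterpolationIOConvs(nn.Module):
    """Sketch of the modified input/output convolutions (channel sizes assumed).

    The three noisy in-between latents and the two conditioning keyframe latents
    are concatenated along the channel axis, so a single U-Net forward pass
    denoises all three interpolated frames at once instead of running a masked
    frame-interpolation model frame by frame.
    """
    def __init__(self, latent_channels=4, model_channels=384):
        super().__init__()
        # (3 noisy + 2 conditioning) latents stacked on the channel dimension.
        self.conv_in = nn.Conv2d(5 * latent_channels, model_channels, 3, padding=1)
        # Predict the three denoised latents jointly.
        self.conv_out = nn.Conv2d(model_channels, 3 * latent_channels, 3, padding=1)

    def forward(self, noisy_latents, keyframes, backbone):
        # noisy_latents: (B, 3, C, H, W), keyframes: (B, 2, C, H, W)
        b, _, c, h, w = noisy_latents.shape
        x = torch.cat([noisy_latents, keyframes], dim=1).reshape(b, 5 * c, h, w)
        x = self.conv_in(x)
        x = backbone(x)  # U-Net body with temporal convs after each spatial conv
        x = self.conv_out(x)
        return x.reshape(b, 3, c, h, w)  # three denoised in-between latents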

Results for temporal conditioning


We propose an approach to temporal conditioning based on separate temporal blocks. We compared three types of temporal blocks with the temporal layers commonly used in previous works and found that our approach is advantageous in terms of frame quality, text alignment, and temporal consistency.

Comparison

Compared configurations: Conv1dAttn1dBlocks, Conv1dAttn3dBlocks, Conv3dAttn1dBlocks, Conv1dAttn1dLayers.

Other results

"A car moving on the road from the sea to the mountains" "A red car drifting, 4k video" "Chemistry laboratory, chemical explosion, 4k" "Erupting volcano raw power, molten lava, and the forces of the Earth"
"Luminescent jellyfish swims underwater, neon, 4k" "Majestic waterfalls in a lush rainforest power, mist, and biodiversity" "White ghost flies through a night clearing, 4k" "Wildlife migration herds on the move, crossing landscapes in harmony"
"Majestic humpback whale breaching power, grace, and ocean spectacle" "Evoke the sense of wonder in a time-lapse journey through changing seasons" "Explore the fascinating world of underwater creatures in a visually stunning sequence" "Polar ice caps the pristine wilderness of the Arctic and Antarctic"
"Rolling waves on a sandy beach relaxation, rhythm, and coastal beauty" "Sloth in slow motion deliberate movements, relaxation, and arboreal life" "Sunrise over a tranquil mountain landscape colors, serenity, and awakening" "Craft a heartwarming narrative showcasing the bond between a human and their loyal pet companion"

Acknowledgment

The authors express great gratitude to Igor Pavlov, Anastasia Lysenko, and Sergey Markov for their contribution to collecting and processing the video data, which made this study possible, and to Andrei Filatov for building the site.

BibTeX

@article{arkhipkin2023fusionframes,
  title     = {FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline},
  author    = {Arkhipkin, Vladimir and Shaheen, Zein and Vasilev, Viacheslav and Dakhova, Elizaveta and Kuznetsov, Andrey and Dimitrov, Denis},
  journal   = {arXiv preprint arXiv:2311.13073},
  year      = {2023}, 
}