Example generations for prompts such as: "Ladybug flies off the mushroom", "Boiling water makes a noise and splashes from a teapot into a cup of tea", "The ice cream drips quickly, the ice cream melts", "The fire is burning, professional video", "Abstraction, smooth movements".

Introduction

Our previous model, Kandinsky Video 1.0, divides the video generation process into two stages: first generating keyframes at a low FPS, and then creating interpolated frames between these keyframes to increase the FPS. In Kandinsky Video 1.1, we further break keyframe generation down into two extra steps: first, generating the initial frame of the video from the textual prompt using the text-to-image model Kandinsky 3.0, and then generating the subsequent keyframes conditioned on the textual prompt and the previously generated first frame. This approach ensures more consistent content across the frames and significantly enhances the overall video quality. Furthermore, it allows animating any input image as an additional feature.

Overall Pipeline


The overall pipeline includes the following steps (a code sketch of the full flow follows the list).

  1. The generation of the first frame from the textual prompt (Kandinsky 3.1).
  2. The generation of keyframes using the textual prompt and the previously generated first frame as input. Additional inputs include motion score and noise augmentation level.
  3. The generation of interpolated frames. Compared to the previous version, the interpolation model now uses additional temporal self-attention layers, which improves the visual quality of the interpolated frames.
  4. Decoding of the generated video using Sber MoVQ-GAN. We inserted temporal layers into the decoder and fine-tuned it to ensure more consistent details across the generated frames.
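For illustration, here is a minimal sketch of how these four stages could be chained at inference time. All model objects and names (t2i_model, keyframe_model, interpolation_model, movq_decoder, generate_video) are hypothetical placeholders and do not correspond to the actual Kandinsky Video API.

```python
# Hypothetical sketch of the four-stage inference pipeline described above.
import torch


def generate_video(prompt: str,
                   t2i_model,            # text-to-image model (first frame)
                   keyframe_model,       # text+image-conditioned keyframe generator
                   interpolation_model,  # temporal interpolation model
                   movq_decoder,         # MoVQ-GAN decoder with temporal layers
                   motion_score: float = 0.5,
                   noise_level: float = 0.05) -> torch.Tensor:
    # 1. Generate the first frame from the textual prompt.
    first_frame = t2i_model(prompt)

    # 2. Generate low-FPS keyframes conditioned on the prompt, the first frame,
    #    the desired motion score, and the noise-augmentation level.
    keyframes = keyframe_model(prompt,
                               first_frame=first_frame,
                               motion_score=motion_score,
                               noise_level=noise_level)

    # 3. Interpolate between keyframes to raise the FPS; the interpolation model
    #    applies temporal self-attention across frames.
    all_frames = interpolation_model(keyframes)

    # 4. Decode the frames into an RGB video with the fine-tuned MoVQ-GAN decoder.
    return movq_decoder(all_frames)
```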

Motion score conditioning and Noise Augmentation


Kandinsky Video 1.1 uses a motion score as a condition to control the dynamics in the video. During training, the motion score is computed with the optical flow model RAFT. During inference, we can vary the motion score passed to the model to obtain different dynamics in the generated video.

Frames in video datasets suffer from quality issues such as blur compared to the high-quality images in image generation datasets. During training, we use the first frame of the video as a condition to generate the remaining keyframes, whereas during inference the first frame is produced by the text-to-image model (Kandinsky 3.1). This creates a training-inference mismatch, which we mitigate with noise augmentation: during training, we add different levels of noise to the first frame before passing it as a condition; during inference, after generating the first frame, we add a small amount of random noise to it and pass the noise level as an additional condition to the network.
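The sketch below illustrates, under stated assumptions, how these two conditions could be prepared during training. The helper compute_optical_flow and the variable names are hypothetical and do not reflect the actual training code; the real pipeline uses RAFT for optical flow.

```python
# Minimal sketch of motion-score and noise-augmentation conditioning (assumed names).
import torch


def prepare_conditions(video: torch.Tensor,
                       compute_optical_flow,
                       max_noise_level: float = 0.3):
    """video: (T, C, H, W) clip of keyframes from the training set."""
    first_frame = video[0]

    # Motion score: average optical-flow magnitude between consecutive frames.
    flows = [compute_optical_flow(video[t], video[t + 1])
             for t in range(video.shape[0] - 1)]
    motion_score = torch.stack([f.norm(dim=0).mean() for f in flows]).mean()

    # Noise augmentation: corrupt the first frame with a random amount of
    # Gaussian noise and pass the noise level as an extra condition, so the
    # model learns to cope with imperfect (e.g. generated) first frames.
    noise_level = torch.rand(()) * max_noise_level
    noisy_first_frame = first_frame + noise_level * torch.randn_like(first_frame)

    return noisy_first_frame, noise_level, motion_score


# At inference time a small, fixed noise level would be applied to the
# generated first frame instead of a random one, e.g.:
#   noisy = first_frame + 0.05 * torch.randn_like(first_frame)
```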


Acknowledgment

The authors express great gratitude to Igor Pavlov, Anastasia Lysenko, and Sergey Markov for their contribution to collecting and processing the video data, which made this study possible, and to Andrei Filatov for building the site.

More results

Additional results are shown for the architecture variants Conv1dAttn1dBlocks, Conv1dAttn3dBlocks, Conv3dAttn1dBlocks, and Conv1dAttn1dLayers.

BibTeX

@article{arkhipkin2023fusionframes,
  title     = {FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline},
  author    = {Arkhipkin, Vladimir and Shaheen, Zein and Vasilev, Viacheslav and Dakhova, Elizaveta and Kuznetsov, Andrey and Dimitrov, Denis},
  journal   = {arXiv preprint arXiv:2311.13073},
  year      = {2023}, 
}