Example generations (prompts): Ladybug flies off the mushroom | Boiling water makes a noise and splashes from a teapot into a cup of tea | The ice cream drips quickly, the ice cream melts | The fire is burning, professional video | Abstraction, smooth movements
Our previous model, Kandinsky Video 1.0, divides the video generation process into two stages: first generating keyframes at a low FPS, and then creating interpolated frames between these keyframes to increase the FPS. In Kandinsky Video 1.1, we further break keyframe generation down into two steps: first, generating the initial frame of the video from the textual prompt using the text-to-image model Kandinsky 3.0, and then generating the subsequent keyframes conditioned on the textual prompt and the previously generated first frame. This approach yields more consistent content across frames and significantly improves overall video quality. It also makes it possible to animate any input image as an additional feature.
The overall pipeline includes the following steps (sketched in code below):

1. Generate the first frame from the textual prompt with the text-to-image model Kandinsky 3.0.
2. Generate keyframes at a low FPS, conditioned on the textual prompt and the first frame.
3. Generate interpolated frames between the keyframes to increase the FPS.
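A minimal sketch of how these stages compose is shown below. The stage functions are stand-ins operating on random arrays, not the actual Kandinsky Video 1.1 models or API; they only illustrate the data flow from the prompt to the first frame, the keyframes, and the final high-FPS video.

```python
import numpy as np

def text_to_image(prompt: str) -> np.ndarray:
    """Stand-in for Kandinsky 3.0 text-to-image: returns one RGB frame (H, W, 3)."""
    return np.random.rand(384, 640, 3)

def keyframe_generation(prompt: str, first_frame: np.ndarray,
                        motion_score: float, n_keyframes: int = 12) -> np.ndarray:
    """Stand-in for the keyframe model conditioned on the text and the first frame."""
    return np.stack([first_frame] * n_keyframes)   # (T, H, W, 3), low FPS

def frame_interpolation(keyframes: np.ndarray, factor: int = 3) -> np.ndarray:
    """Stand-in for the interpolation model that raises the FPS."""
    return np.repeat(keyframes, factor, axis=0)

prompt = "Ladybug flies off the mushroom"
first_frame = text_to_image(prompt)                         # step 1: first frame
keyframes = keyframe_generation(prompt, first_frame, 0.5)   # step 2: keyframes
video = frame_interpolation(keyframes)                      # step 3: high-FPS video
print(video.shape)  # (36, 384, 640, 3)
```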
Kandinsky Video 1.1 uses a motion score as a condition to control the dynamics of the video. During training, the motion score is computed with the RAFT optical flow model; at inference, we can vary the motion score passed to the model to obtain different amounts of motion in the generated video.

Frames in video datasets suffer from quality issues such as blur compared to the high-quality images in image generation datasets. During training, we use the first frame of the video as a condition for generating the remaining keyframes, while at inference the first frame is produced by the text-to-image model (Kandinsky 3.1). This creates a training-inference mismatch, which we mitigate with noise augmentation: during training, we add varying levels of noise to the first frame before passing it as a condition, and at inference, after generating the first frame, we add a small amount of random noise to it and pass the noise level to the network as an additional condition.
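Below is an illustrative sketch of these two conditioning signals. Only the use of RAFT (here the torchvision implementation) and the general mechanism come from the description above; reducing the flow field to its mean magnitude and the uniform noise schedule are assumptions made for the example.

```python
import torch
from torchvision.models.optical_flow import raft_small, Raft_Small_Weights

# Motion score: mean optical-flow magnitude between consecutive frames.
weights = Raft_Small_Weights.DEFAULT
raft = raft_small(weights=weights).eval()
preprocess = weights.transforms()

def motion_score(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, H, W) clip in [0, 1], with H and W divisible by 8."""
    prev, nxt = preprocess(frames[:-1], frames[1:])
    with torch.no_grad():
        flow = raft(prev, nxt)[-1]        # (T-1, 2, H, W), final refinement
    return flow.norm(dim=1).mean()        # scalar passed to the model as a condition

# Noise augmentation of the first-frame condition.
def noise_augment_train(first_frame: torch.Tensor, max_level: float = 0.5):
    """Training: corrupt the conditioning frame with a random noise level."""
    level = torch.rand(first_frame.shape[0]) * max_level
    noised = first_frame + level.view(-1, 1, 1, 1) * torch.randn_like(first_frame)
    return noised, level                  # the level is also fed to the network

def noise_augment_infer(first_frame: torch.Tensor, level: float = 0.05):
    """Inference: add a small fixed amount of noise to the generated first frame."""
    noised = first_frame + level * torch.randn_like(first_frame)
    return noised, torch.full((first_frame.shape[0],), level)

clip = torch.rand(4, 3, 128, 128)         # dummy low-FPS keyframes
print(float(motion_score(clip)))
noised_cond, noise_level = noise_augment_infer(clip[:1])
```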
Temporal conditioning variants compared: Conv1dAttn1dBlocks | Conv1dAttn3dBlocks | Conv3dAttn1dBlocks | Conv1dAttn1dLayers
@article{arkhipkin2023fusionframes,
title = {FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline},
author = {Arkhipkin, Vladimir and Shaheen, Zein and Vasilev, Viacheslav and Dakhova, Elizaveta and Kuznetsov, Andrey and Dimitrov, Denis},
journal = {arXiv preprint arXiv:2311.13073},
year = {2023},
}