Kandinsky 5.0

   
A family of diffusion models for Video & Image generation   

Summary

Kandinsky 5.0 T2V Lite

Kandinsky 5.0 T2V Lite is a lightweight video generation model (2B parameters) that ranks #1 among open-source models in its class. It outperforms larger models (Wan 2.2 T2V 5B and Wan 2.1 T2V 14B) and offers the best understanding of Russian concepts in the open-source ecosystem.

We provide 8 model variants, each optimized for different use cases:
  • SFT model — delivers the highest generation quality;
  • CFG-distilled — runs 2× faster;
  • Diffusion-distilled — enables low-latency generation with minimal quality loss;
  • Pretrain model — designed for fine-tuning by researchers and enthusiasts.
All models are available in two versions: for generating 5-second and 10-second videos.
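The practical difference between the variants is mostly how many DiT forward passes one generation costs. The sketch below makes that concrete; the variant keys and step counts are illustrative assumptions, not the repository's actual defaults or API. The ~2× speedup of the CFG-distilled model comes from folding classifier-free guidance (two forward passes per step: conditional + unconditional) into a single pass, while the diffusion-distilled model cuts the number of sampling steps themselves.

```python
# Hypothetical sketch of how the Kandinsky 5.0 T2V Lite variants differ at
# inference time. Names and numbers are illustrative, not the repo's API.

VARIANTS = {
    # variant:              sampling steps, DiT calls per step
    "sft":                  {"steps": 50, "passes_per_step": 2},  # cond + uncond (CFG)
    "cfg-distilled":        {"steps": 50, "passes_per_step": 1},  # guidance baked in -> ~2x faster
    "diffusion-distilled":  {"steps": 16, "passes_per_step": 1},  # few-step, low latency
    "pretrain":             {"steps": 50, "passes_per_step": 2},  # base for fine-tuning
}

def relative_cost(variant: str) -> int:
    """Rough number of DiT forward passes needed for one video."""
    cfg = VARIANTS[variant]
    return cfg["steps"] * cfg["passes_per_step"]

for name in VARIANTS:
    print(f"{name}: ~{relative_cost(name)} DiT forward passes")
```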

Overall pipeline


Kandinsky 5.0 – High-Level Architecture

Core paradigm

Kandinsky 5.0 is a latent diffusion pipeline trained with Flow Matching, with a Diffusion Transformer (DiT) as the main generative backbone.

  • Qwen2.5-VL and CLIP provide text embeddings.
  • The HunyuanVideo 3D VAE encodes/decodes video into a latent space.
  • The DiT conditions on text via cross-attention.
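As a concrete reference point, here is what the Flow Matching objective looks like for a DiT operating on VAE latents. This is a minimal, generic PyTorch sketch of the paradigm, assuming a velocity-prediction parameterization and a `dit(x_t, t, text_tokens, clip_emb)` signature; Kandinsky's actual training code and parameterization may differ.

```python
# Minimal flow-matching training step on video latents (a generic sketch of
# the paradigm, not Kandinsky's exact implementation).
import torch
import torch.nn.functional as F

def flow_matching_loss(dit, x0, text_tokens, clip_emb):
    """x0: clean video latents [B, C, T, H, W] from the 3D VAE."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)              # timestep in (0, 1)
    noise = torch.randn_like(x0)                     # x1 ~ N(0, I)
    t_ = t.view(b, 1, 1, 1, 1)
    x_t = (1.0 - t_) * x0 + t_ * noise               # linear interpolation path
    target_v = noise - x0                            # constant velocity along the path
    pred_v = dit(x_t, t, text_tokens, clip_emb)      # DiT predicts the velocity field
    return F.mse_loss(pred_v, target_v)
```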


Model Inputs

  1. Text
    • Embeddings from Qwen2.5-VL (transformer decoder).
    • Augmented with 1D Rotary Position Embeddings (RoPE).
    • Refined by the Linguistic Token Refiner with bidirectional attention, which prepares text tokens for cross-attention inside the DiT.
  2. Time
    • Diffusion step index.
    • Encoded with sinusoidal positional encoding + MLP.
    • Fused with a global CLIP text embedding of the video description.
  3. CLIP Text Embedding
    • A single global embedding of the full video description.
    • Provides semantic conditioning in addition to token-level embeddings.
  4. Visual
    • Latents from the HunyuanVideo 3D VAE.
    • Equipped with 3D Rotary Position Embeddings for spatial–temporal alignment.
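A minimal sketch of the time-conditioning path from items 2 and 3: a sinusoidal encoding of the diffusion step, passed through an MLP and fused with the global CLIP text embedding. The additive fusion, the adaLN remark, and the widths (1024 model dim, 768 CLIP dim) are assumptions for illustration, not the model's actual block sizes.

```python
# Sketch of the time-conditioning path: sinusoidal step encoding -> MLP,
# fused with the global CLIP text embedding. Sizes are assumed.
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """t: [B] diffusion step indices -> [B, dim] sinusoidal features."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class TimeCondition(nn.Module):
    def __init__(self, dim: int = 1024, clip_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.clip_proj = nn.Linear(clip_dim, dim)  # project CLIP embedding to model width
        self.dim = dim

    def forward(self, t: torch.Tensor, clip_emb: torch.Tensor) -> torch.Tensor:
        # Fuse step information with global text semantics; the resulting
        # vector would typically modulate DiT blocks (e.g., via adaLN).
        return self.mlp(sinusoidal_embedding(t, self.dim)) + self.clip_proj(clip_emb)

cond = TimeCondition()
vec = cond(torch.randint(0, 1000, (2,)), torch.randn(2, 768))
print(vec.shape)  # torch.Size([2, 1024])
```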


Text-to-Video

Distilled model

We emphasize generation capabilities in the following categories:

  • People
  • Cinematic Effects
  • Animation
  • Animals and nature
  • Dynamic scenes
  • Russian Culture Code
  • Short English captions
  • Transformations

Comparison with Other Models


Side-by-Side evaluation


The evaluation is based on the expanded prompts from the Movie Gen benchmark.

VBench results



Beta testing

You can apply to participate in the beta testing of Kandinsky Video Lite via the Telegram bot.

Authors

Project Leader: Denis Dimitrov
Team Leads: Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko
Core Contributors: Alexey Letunovskiy, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Dmitrii Mikhailov, Anna Averchenkova, Andrey Shutkin, Julia Agafonova, Olga Kim, Anastasiia Kargapoltseva, Nikita Kiselev
Contributors: Anna Dmitrienko, Anastasia Maltseva, Kirill Chernyshev, Ilia Vasiliev, Viacheslav Vasilev, Vladimir Polovnikov, Yury Kolabushin, Alexander Belykh, Mikhail Mamaev, Anastasia Aliaskina, Tatiana Nikulina, Polina Gavrilova

BibTeX


@misc{kandinsky2025,
    author = {Alexey Letunovskiy and Maria Kovaleva and Ivan Kirillov and Lev Novitskiy and
              Denis Koposov and Dmitrii Mikhailov and Anna Averchenkova and Andrey Shutkin and
              Julia Agafonova and Olga Kim and Anastasiia Kargapoltseva and Nikita Kiselev and
              Vladimir Arkhipkin and Vladimir Korviakov and Nikolai Gerasimenko and
              Denis Parkhomenko and Anna Dmitrienko and Anastasia Maltseva and
              Kirill Chernyshev and Ilia Vasiliev and Viacheslav Vasilev and
              Vladimir Polovnikov and Yury Kolabushin and Alexander Belykh and
              Mikhail Mamaev and Anastasia Aliaskina and Tatiana Nikulina and
              Polina Gavrilova and Denis Dimitrov},
    title = {Kandinsky 5.0: A family of diffusion models for Video \& Image generation},
    howpublished = {\url{https://github.com/ai-forever/Kandinsky-5}},
    year = 2025
}

@misc{mikhailov2025nablanablaneighborhoodadaptiveblocklevel,
      title={$\nabla$NABLA: Neighborhood Adaptive Block-Level Attention}, 
      author={Dmitrii Mikhailov and Aleksey Letunovskiy and Maria Kovaleva and Vladimir Arkhipkin
              and Vladimir Korviakov and Vladimir Polovnikov and Viacheslav Vasilev
              and Evelina Sidorova and Denis Dimitrov},
      year={2025},
      eprint={2507.13546},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.13546}, 
}