We present Kandinsky 3.1, the follow-up to Kandinsky 3.0: a large-scale text-to-image generation model based on latent diffusion. It continues the series of text-to-image Kandinsky models and reflects our progress towards higher quality and realism of image generation. We have enhanced the model and enriched it with a variety of useful features and modes that give users more opportunities to fully utilise the power of our new model.
Kandinsky 3.1 is a latent diffusion model whose full pipeline includes a text encoder for processing the user's prompt, a U-Net for predicting noise during the denoising (reverse) process, and a decoder for reconstructing the image from the generated latent. During U-Net training, the text encoder and image decoder were completely frozen. The whole model contains 11.9 billion parameters. For an extended description of the architecture, please refer to the technical report.
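A minimal sketch of this three-stage inference flow is shown below. The component names (`text_encoder`, `unet`, `decoder`) and the diffusers-style scheduler interface are illustrative assumptions, not the exact API of the released code.

```python
import torch

@torch.no_grad()
def generate(prompt, text_encoder, unet, decoder, scheduler,
             steps=50, latent_shape=(1, 4, 128, 128)):
    # 1. Encode the user prompt into conditioning embeddings (the text encoder is frozen).
    text_emb = text_encoder(prompt)

    # 2. Iteratively denoise a random latent with the U-Net (reverse diffusion process).
    latent = torch.randn(latent_shape)
    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:
        noise_pred = unet(latent, t, encoder_hidden_states=text_emb)
        latent = scheduler.step(noise_pred, t, latent).prev_sample

    # 3. Reconstruct the final image from the generated latent with the frozen decoder.
    return decoder(latent)
```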
Diffusion models struggle with fast image generation. To address this, we trained a Kandinsky Flash model based on the Adversarial Diffusion Distillation approach with some modifications: we trained the model on latents, which reduced the memory overhead, and removed the distillation loss, as it did not affect training.
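The sketch below illustrates one adversarial training step of this kind, with the objective computed directly on latents and no distillation term. The names `student_unet`, `discriminator` and the `scheduler.add_noise` call are illustrative assumptions rather than the actual training code.

```python
import torch
import torch.nn.functional as F

def flash_training_step(student_unet, discriminator, scheduler, real_latents, text_emb, t):
    # Corrupt real latents to timestep t and let the student denoise them in one step.
    noise = torch.randn_like(real_latents)
    noisy_latents = scheduler.add_noise(real_latents, noise, t)
    fake_latents = student_unet(noisy_latents, t, encoder_hidden_states=text_emb)

    # Hinge-style adversarial losses computed on latents: no pixel-space decoding is
    # needed, which is what keeps the memory overhead low.
    d_loss = (F.relu(1.0 - discriminator(real_latents, text_emb)).mean()
              + F.relu(1.0 + discriminator(fake_latents.detach(), text_emb)).mean())
    g_loss = -discriminator(fake_latents, text_emb).mean()
    return d_loss, g_loss
```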
The prompt plays a crucial role in text-to-image generation, so in Kandinsky 3.1 we decided to use a language model to improve user prompts. We used Intel's neural-chat-7b-v3-1 as the LLM with the following system prompt:
### System: You are a prompt engineer. Your mission is to expand prompts written by user. You should provide the best prompt for text to image generation in English.
### User:
{prompt}
### Assistant:
{answer of the model}
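A minimal sketch of this prompt beautification step with the template above, using the Hugging Face transformers library; the generation parameters are illustrative defaults, not the exact settings used in Kandinsky 3.1.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Intel/neural-chat-7b-v3-1"
SYSTEM = ("You are a prompt engineer. Your mission is to expand prompts written by user. "
          "You should provide the best prompt for text to image generation in English.")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="auto")

def beautify(prompt: str) -> str:
    # Fill the template from the section above and generate the expanded prompt.
    text = f"### System:\n{SYSTEM}\n### User:\n{prompt}\n### Assistant:\n"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Keep only the newly generated tokens (the answer of the model).
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(beautify("a cute corgi"))
```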
To improve the generation quality of the inpainting model, we additionally trained it on an object detection dataset. This allowed us to achieve more stable generation of objects.
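One plausible way to turn object-detection annotations into inpainting training masks is sketched below; this is an illustrative assumption about the data preparation, not the exact recipe used for Kandinsky 3.1.

```python
import numpy as np

def boxes_to_mask(height: int, width: int, boxes: list[tuple[int, int, int, int]]) -> np.ndarray:
    """Build a binary inpainting mask (1 = region to repaint) from (x1, y1, x2, y2) boxes."""
    mask = np.zeros((height, width), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 1.0
    return mask

# Example: mask out a single detected object in a 512x512 image.
mask = boxes_to_mask(512, 512, [(100, 150, 300, 400)])
```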
In the new Kandinsky 3.1 version, it is now possible to generate 4K-resolution images using the KandiSuperRes model. It is based on the Kandinsky 3.0 architecture with the following modifications: the model became pixel-based to avoid compression artefacts, a modified U-Net was used, and during training the diffusion model predicted the original image instead of noise. For more details, please check out the repository.
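A schematic training step illustrating x0-prediction in pixel space, as described above: the network is trained to predict the clean image rather than the added noise. The names (`sr_unet`, `scheduler.add_noise`) and the conditioning scheme are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sr_training_step(sr_unet, scheduler, clean_images, low_res_cond, t):
    # Forward diffusion: corrupt the clean high-resolution image at timestep t.
    noise = torch.randn_like(clean_images)
    noisy_images = scheduler.add_noise(clean_images, noise, t)

    # The model conditions on the low-resolution input and predicts the original
    # image (x0) directly, instead of the noise epsilon.
    pred_x0 = sr_unet(noisy_images, t, low_res_cond)
    return F.mse_loss(pred_x0, clean_images)
```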
| Dataset | Model | FID↓ | PSNR↑ | SSIM↑ | L1↓ |
|---|---|---|---|---|---|
| Wikidata 5k | Real-ESRGAN | 9.96 | 24.48 | 0.73 | 0.042 |
| | Stable Diffusion | 3.04 | 25.05 | 0.67 | 0.043 |
| | KandiSuperRes | 0.89 | 28.52 | 0.81 | 0.025 |
| RealSR(V3) | Real-ESRGAN | 73.26 | 23.12 | 0.72 | 0.061 |
| | Stable Diffusion | 47.79 | 24.85 | 0.67 | 0.049 |
| | KandiSuperRes | 47.37 | 25.05 | 0.75 | 0.046 |
| Set14 | Real-ESRGAN | 115.94 | 22.88 | 0.62 | 0.056 |
| | Stable Diffusion | 76.32 | 23.60 | 0.57 | 0.052 |
| | KandiSuperRes | 61.00 | 25.70 | 0.70 | 0.039 |
To allow image generation based on a reference image, we trained an IP-Adapter, which makes it possible to generate images guided by the style of a reference image and a text prompt, as well as to produce variations of an image. We also trained a HED-based ControlNet model. For more details, please check out the repository.
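A minimal sketch of preparing a HED edge map as conditioning for the HED-based ControlNet; the `controlnet_aux` usage is standard, while the pipeline call at the end is a hypothetical placeholder for the actual interface provided in the repository.

```python
from controlnet_aux import HEDdetector
from PIL import Image

# Load the HED soft-edge detector and extract an edge map from a reference image.
hed = HEDdetector.from_pretrained("lllyasviel/Annotators")
reference = Image.open("reference.png")
hed_map = hed(reference)  # soft edge map used as the ControlNet condition

# Hypothetical call into the Kandinsky 3.1 ControlNet pipeline (see the repository
# for the actual interface):
# image = controlnet_pipeline("a watercolor city skyline", control_image=hed_map)
```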
@misc{arkhipkin2023kandinsky,
title={Kandinsky 3.0 Technical Report},
author={Vladimir Arkhipkin and Andrei Filatov and Viacheslav Vasilev and Anastasia Maltseva and Said Azizov and Igor Pavlov and Julia Agafonova and Andrey Kuznetsov and Denis Dimitrov},
year={2023},
eprint={2312.03511},
archivePrefix={arXiv},
primaryClass={cs.CV}
}