Update README.md
README.md CHANGED
@@ -6,7 +6,7 @@ license: apache-2.0
 <br><br><br><br>
 
 <div align="center">
-<image src="https://
+<image src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/Mi3ugli7f1MNNVWC5gzMS.png" ></image>
 </div>
 
 <div align="center">
@@ -14,6 +14,8 @@ license: apache-2.0
 </div>
 
 
+
+
 <br><br><br><br>
 
 <table border="0" style="width: 200; text-align: left; margin-top: 20px;">
@@ -61,18 +63,20 @@ license: apache-2.0
 
 Kandinsky 4.0 T2V Flash is a latent-diffusion text-to-video generation model that can generate **12-second videos** at 480p resolution in **11 seconds** on a single NVIDIA H100 GPU. The pipeline consists of a 3D causal [CogVideoX](https://arxiv.org/pdf/2408.06072) VAE, the [T5-V1.1-XXL](https://huggingface.co/google/t5-v1_1-xxl) text embedder, and our trained MMDiT-like transformer model.
 
-<img src="https://github.com/ai-forever/Kandinsky-4/blob/main/assets/pipeline.png">
 
-
+<img src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/W09Zl2q-TbLYV4xUE3Bte.png">
 
-
+A serious problem for all diffusion models, and especially for video generation models, is generation speed. To solve it, we used the Latent Adversarial Diffusion Distillation (LADD) approach, originally proposed for distilling image generation models in the [article](https://arxiv.org/pdf/2403.12015) from Stability AI and tested by us when training the [Kandinsky 3.1](https://github.com/ai-forever/Kandinsky-3) image generation model. Distillation involves additional training of the diffusion model in a GAN pipeline, i.e. joint training of the diffusion generator with a discriminator.
 
+<img src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/A9l81FaWiBJl5Vx06XbIc.png">
 
 ## Architecture
 
 For training Kandinsky 4.0 T2V Flash we used the following diffusion transformer architecture, based on the MMDiT proposed in [Stable Diffusion 3](https://arxiv.org/pdf/2403.03206).
 
-
+
+
+<img src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/UjY8BqRUJ_H0lkgb_PKNY.png"> <img src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/fiMHO1CoR8JQjRXqXNE8k.png">
 
 For training the flash version we used the following discriminator architecture. The discriminator head structure resembles half of an MMDiT block.
 
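The pipeline described above (T5-V1.1-XXL text embedder, MMDiT-like transformer denoising in latent space, 3D causal CogVideoX-style VAE decoder) can be pictured with a minimal sketch. Everything below is a hypothetical illustration: the class, method names, latent shape, and step count are assumptions, not the released Kandinsky-4 code.

```python
# Illustrative composition of the three stages named in the README.
# All module interfaces here are hypothetical placeholders.
import torch
import torch.nn as nn

class T2VFlashSketch(nn.Module):
    def __init__(self, text_encoder: nn.Module, transformer: nn.Module,
                 vae_decoder: nn.Module, num_steps: int = 4):
        super().__init__()
        self.text_encoder = text_encoder  # frozen T5-V1.1-XXL encoder (prompt -> embeddings)
        self.transformer = transformer    # MMDiT-like denoiser operating on video latents
        self.vae_decoder = vae_decoder    # 3D causal VAE decoder (latents -> RGB frames)
        self.num_steps = num_steps        # the distilled "Flash" model needs only a few steps

    @torch.no_grad()
    def generate(self, token_ids: torch.Tensor,
                 latent_shape=(1, 16, 32, 60, 80)) -> torch.Tensor:
        text_emb = self.text_encoder(token_ids)     # (B, seq_len, dim) prompt embeddings
        latents = torch.randn(latent_shape)         # start from Gaussian noise in latent space
        for t in torch.linspace(1.0, 0.0, self.num_steps):
            t_batch = torch.full((latents.shape[0],), float(t))
            latents = self.transformer(latents, text_emb, t_batch)  # one denoising step
        return self.vae_decoder(latents)            # decode to video frames
```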
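The LADD paragraph describes joint training of the diffusion generator with a discriminator in latent space. A minimal sketch of one such adversarial training step follows, assuming hypothetical `generator` and `discriminator` call signatures and a non-saturating GAN loss; it omits the teacher-model details of the actual method.

```python
# Rough sketch of one LADD-style training step: the few-step student (generator)
# is trained jointly with a discriminator on latents. Signatures, the noising
# scheme, and the loss form are illustrative assumptions, not the actual training code.
import torch
import torch.nn.functional as F

def ladd_step(generator, discriminator, g_opt, d_opt, real_latents, text_emb):
    b = real_latents.shape[0]
    t = torch.rand(b, device=real_latents.device)        # random diffusion times in [0, 1)
    sigma = t.view(-1, 1, 1, 1, 1)                        # broadcast over (B, C, T, H, W) latents
    noisy = (1 - sigma) * real_latents + sigma * torch.randn_like(real_latents)

    # 1) Discriminator update: real latents vs. the student's denoised latents.
    with torch.no_grad():
        fake = generator(noisy, text_emb, t)
    d_loss = (F.softplus(-discriminator(real_latents, text_emb, t)).mean()
              + F.softplus(discriminator(fake, text_emb, t)).mean())
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Generator update: produce latents the discriminator accepts as real.
    fake = generator(noisy, text_emb, t)
    g_loss = F.softplus(-discriminator(fake, text_emb, t)).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```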
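For orientation on the MMDiT reference, the simplified block below keeps the defining property of that design: two parameter streams (video latent tokens and text tokens) joined by a single attention operation. Timestep modulation (adaLN) and the real model's dimensions are omitted, so treat it purely as a sketch.

```python
# Simplified two-stream MMDiT-style block: separate weights per modality, joint attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMDiTBlockSketch(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.norm_x, self.norm_c = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv_x, self.qkv_c = nn.Linear(dim, 3 * dim), nn.Linear(dim, 3 * dim)
        self.proj_x, self.proj_c = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.mlp_x = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                   nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_c = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                   nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, c: torch.Tensor):
        # x: video latent tokens (B, N_x, dim); c: text tokens (B, N_c, dim).
        # Each stream has its own QKV projection, but attention runs over the
        # concatenated sequence, so video and text tokens attend to each other.
        n_x = x.shape[1]
        qkv = torch.cat([self.qkv_x(self.norm_x(x)), self.qkv_c(self.norm_c(c))], dim=1)
        q, k, v = (t.unflatten(-1, (self.heads, -1)).transpose(1, 2)
                   for t in qkv.chunk(3, dim=-1))
        out = F.scaled_dot_product_attention(q, k, v).transpose(1, 2).flatten(2)
        x = x + self.proj_x(out[:, :n_x])
        c = c + self.proj_c(out[:, n_x:])
        return x + self.mlp_x(x), c + self.mlp_c(c)
```

One reading of "half of an MMDiT block" for the discriminator head is keeping only the latent-token stream (its projections and MLP) and dropping the text-side parameters; the figure above is authoritative on the exact structure.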