---
inference: false
tags:
- text-to-video
- text-to-image
pipeline_tag: text-to-video
datasets:
- TempoFunk/tempofunk-sdance
- TempoFunk/small
- TempoFunk/map
license: agpl-3.0
language: en
library_name: diffusers
---

# Make-A-Video SD JAX Model Card

**A latent diffusion model for text-to-video synthesis.**

**[Try it with an interactive demo on HuggingFace Spaces.](https://huggingface.co/spaces/TempoFunk/makeavid-sd-jax)**

Training code and the PyTorch and FLAX implementations are available here: <https://github.com/lopho/makeavid-sd-tpu>

This model extends an inpainting LDM image generation model ([Stable Diffusion v1.5 Inpaint](https://huggingface.co/runwayml/stable-diffusion-inpainting))
with temporal convolution and temporal self-attention ported from [Make-A-Video PyTorch](https://github.com/lucidrains/make-a-video-pytorch).
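
The temporal layers are factorized ("pseudo-3D"): each pretrained spatial layer is followed by a new layer that mixes information only along the frame axis. The snippet below is a minimal, illustrative `flax.linen` sketch of the convolutional variant, not the module used in the repository; the class name, feature count and initialization details are assumptions.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class PseudoConv3d(nn.Module):
    features: int

    @nn.compact
    def __call__(self, x):
        # x: (batch, frames, height, width, channels), channels-last as flax expects
        b, f, h, w, c = x.shape

        # 1) Spatial 2D convolution, applied to every frame independently.
        x = x.reshape(b * f, h, w, c)
        x = nn.Conv(self.features, kernel_size=(3, 3), padding='SAME')(x)
        x = x.reshape(b, f, h, w, self.features)

        # 2) Temporal 1D convolution along the frame axis for every spatial position.
        #    In Make-A-Video this layer starts out as an identity so the pretrained
        #    image model is preserved at initialization (identity init omitted here).
        x = x.transpose(0, 2, 3, 1, 4).reshape(b * h * w, f, self.features)
        x = nn.Conv(self.features, kernel_size=(3,), padding='SAME')(x)
        x = x.reshape(b, h, w, f, self.features).transpose(0, 3, 1, 2, 4)
        return x

# Example: 24 frames of 64 x 64 x 4 latents (512 x 512 images after the 8x VAE downscale).
latents = jnp.zeros((1, 24, 64, 64, 4))
module = PseudoConv3d(features=320)
params = module.init(jax.random.PRNGKey(0), latents)
out = module.apply(params, latents)  # (1, 24, 64, 64, 320)
```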

It was then fine-tuned for ~150k steps on a [dataset](https://huggingface.co/datasets/TempoFunk/tempofunk-sdance) of 10,000 videos themed around dance,
followed by an additional ~50k steps with [extra data](https://huggingface.co/datasets/TempoFunk/small) of generic videos mixed into the original set.

The model was initialized with weights pretrained by [lxj616](https://huggingface.co/lxj616/make-a-stable-diffusion-video-timelapse) on 286 timelapse video clips.

![](https://huggingface.co/spaces/TempoFunk/makeavid-sd-jax/resolve/main/example.gif)

## Table of Contents

- [Model Details](#model-details)
- [Uses](#uses)
- [Limitations](#limitations)
- [Training](#training)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
- [Hyperparameters](#hyperparameters)
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)

## Model Details

* **Developed by:** [Lopho](https://huggingface.co/lopho), [Chavinlo](https://huggingface.co/chavinlo)
* **Model type:** Diffusion-based text-to-video generation model
* **Language(s):** English
* **License:** (pending) GNU Affero General Public License 3.0
* **Further resources:** [Model implementation & training code](https://github.com/lopho/makeavid-sd-tpu), [Weights & Biases training statistics](https://wandb.ai/tempofunk/makeavid-sd-tpu)

## Uses

* Understanding limitations and biases of generative video models
* Development of educational or creative tools
* Artistic usage
* Whatever you want

## Limitations

* Limited knowledge of temporal concepts not seen during training (see the linked datasets)
* Flashing lights can emerge in generated videos, most likely because the training data is dominated by dance videos, which contain many scenes with bright, neon and flashing lights
* The model has only been trained on English captions and will not perform as well in other languages

## Training

### Training Data

* [S(mall)dance](https://huggingface.co/datasets/TempoFunk/tempofunk-sdance): 10,000 video-caption pairs of dancing videos (as encoded image latents, text embeddings and metadata)
* [small](https://huggingface.co/datasets/TempoFunk/small): 7,000 video-caption pairs of general videos (as encoded image latents, text embeddings and metadata)
* [Mapping](https://huggingface.co/datasets/TempoFunk/map): video source URLs for the above datasets

### Training Procedure

* From each video sample, a random range of 24 frames is selected
* Each video is encoded into latent representations of shape 4 x 24 x H/8 x W/8
* The latent of the first frame of each video is repeated along the frame dimension as additional guidance (referred to as the hint image)
* The hint latent and the video latent are stacked to produce a shape of 8 x 24 x H/8 x W/8
* The last input channel is reserved for masking purposes (not used during training, set to zero)
* Text prompts are encoded by the CLIP text encoder
* The video latents with added noise and the CLIP-encoded text prompts are fed into the UNet to predict the added noise
* The loss is the reconstruction objective between the added noise and the predicted noise via mean squared error (MSE / L2)
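
A minimal, illustrative sketch of this input assembly and loss in JAX is shown below. `unet_apply` and its arguments are placeholders rather than the repository's real API, and the noise scaling applied by the diffusion scheduler is omitted.

```python
import jax.numpy as jnp

def make_unet_input(video_latents: jnp.ndarray, noise: jnp.ndarray) -> jnp.ndarray:
    # video_latents: VAE-encoded frames of shape (4, 24, H/8, W/8)
    noisy = video_latents + noise  # in practice, noise is scaled by the scheduler for a sampled timestep
    # Hint image: the clean latent of the first frame, repeated over all 24 frames.
    hint = jnp.repeat(video_latents[:, :1], repeats=video_latents.shape[1], axis=1)
    # Stack noisy video latent and hint latent -> (8, 24, H/8, W/8).
    # The additional masking channel mentioned above is kept at zero and omitted here.
    return jnp.concatenate([noisy, hint], axis=0)

def loss_fn(params, video_latents, text_embeddings, timesteps, noise, unet_apply):
    unet_input = make_unet_input(video_latents, noise)
    # Placeholder UNet call: predict the noise that was added to the video latent.
    pred_noise = unet_apply(params, unet_input, timesteps, text_embeddings)
    # Reconstruction objective: MSE between added and predicted noise.
    return jnp.mean((pred_noise - noise) ** 2)
```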

### Hyperparameters

* **Batch size:** 1 x 4
* **Image size:** 512 x 512
* **Frame count:** 24
* **Schedule** (sketched in optax below):
  * 2 x 10 epochs: LR warmup for 2 epochs, then held constant at 5e-5 (10,000 samples per epoch)
  * 2 x 20 epochs: LR warmup for 2 epochs, then held constant at 5e-5 (10,000 samples per epoch)
  * 1 x 9 epochs: LR warmup for 1 epoch to 5e-5, then cosine annealing to 1e-8
  * Additional data mixed in, see [Training Data](#training-data)
  * 1 x 5 epochs: LR warmup for 1 epoch to 2.5e-5, then held constant (17,000 samples per epoch)
  * 1 x 5 epochs: LR warmup for 0.25 epochs to 5e-6, then cosine annealing to 2.5e-6 (17,000 samples per epoch)
  * Some restarts were required due to NaNs appearing in the gradient (see the training logs)
* **Total update steps:** ~200,000
* **Hardware:** 4 x TPUv4 (provided by Google Cloud for the [HuggingFace JAX/Diffusers Sprint Event](https://github.com/huggingface/community-events/tree/main/jax-controlnet-sprint))
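
The warmup + annealing stages above can be expressed with an optax schedule; the snippet below sketches only the final stage, and both the exact schedule construction and the optimizer are assumptions, not taken from the training repository.

```python
import optax

# Final stage: warm up for 0.25 epochs to 5e-6, then cosine-anneal to 2.5e-6
# over 5 epochs total (17,000 samples per epoch, batch size 4).
samples_per_epoch = 17_000
batch_size = 4
steps_per_epoch = samples_per_epoch // batch_size

schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=5e-6,
    warmup_steps=int(0.25 * steps_per_epoch),
    decay_steps=5 * steps_per_epoch,  # total length of the stage, including warmup
    end_value=2.5e-6,
)

# Optimizer choice is an assumption; the schedule is passed as the learning rate.
optimizer = optax.adamw(learning_rate=schedule)
```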

Training statistics are available at [Weights and Biases](https://wandb.ai/tempofunk/makeavid-sd-tpu).

## Acknowledgements

* [CompVis](https://github.com/CompVis/) for [Latent Diffusion Models](https://github.com/CompVis/latent-diffusion) + [Stable Diffusion](https://github.com/CompVis/stable-diffusion)
* [Meta AI's Make-A-Video](https://arxiv.org/abs/2209.14792) for the research on applying pseudo-3D convolution and attention to existing image models
* [Phil Wang](https://github.com/lucidrains) for the torch implementation of [Make-A-Video Pseudo3D convolution and attention](https://github.com/lucidrains/make-a-video-pytorch/)
* [lxj616](https://huggingface.co/lxj616) for the initial proof of feasibility of LDM + Make-A-Video

## Citation

```bibtex
@misc{TempoFunk2023,
    author = {Lopho and Chavinlo},
    title = {TempoFunk: Extending LDM models to Video},
    url = {https://github.com/lopho/makeavid-sd-tpu},
    month = {5},
    year = {2023}
}
```

---

*This model card was written by [Lopho](https://huggingface.co/lopho), [Chavinlo](https://huggingface.co/chavinlo) and [Julian Herrera](https://huggingface.co/puffy310), and is based on the [DALL-E Mini model card](https://huggingface.co/dalle-mini/dalle-mini).*