Commit 371d561 by lopho · 1 parent: fc5cfe4

Update README.md

  1. README.md +2 -2
README.md CHANGED
@@ -78,7 +78,7 @@ This model used weights pretrained by [lxj616](https://huggingface.co/lxj616/mak
  * Each video latent is encoded into latent representations of the shape 4 x 24 x H/8 x W/8
  * The latent of the first frame from each video is repeated along the frame dimension as additional guidance (referred to as hint image)
  * Hint latent and video latent are stacked to produce a shape of 8 x 24 x H/8 x W/8
- * The last input channel is preserved for maskin purposes (not used during training, set to zero)
  * Text prompts are encoded by the CLIP text encoder
  * Video latents with added noise and clip encoded text prompts are fed into the UNet to predict the added noise
  * Loss is the reconstruction objective between the added noise and the predicted noise via mean squared error (mse/l2)
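
Reviewer note: the tensor shapes described in the hunk above can be checked with a small NumPy sketch. This is an illustration only, not the repository's JAX training code. The function names, the channel order (video, then hint, then mask), the ninth all-zero mask channel, and the plain `video + noise` noising step are assumptions made for the example; a real noise scheduler scales both terms by timestep-dependent factors, and the UNet and CLIP text encoder are left out entirely.

```python
import numpy as np


def build_unet_input(noisy_video, clean_video):
    """Stack noisy video latents, hint latents, and a zero mask channel.

    Both inputs have shape (4, frames, h, w). The channel order used here
    is an assumption for illustration.
    """
    frames = clean_video.shape[1]
    # Hint: latent of the first frame, repeated along the frame axis.
    hint = np.repeat(clean_video[:, :1], frames, axis=1)
    # 4 video channels + 4 hint channels -> 8 channels.
    stacked = np.concatenate([noisy_video, hint], axis=0)
    # Assumed extra channel reserved for masking, all zeros during training.
    mask = np.zeros((1,) + stacked.shape[1:], dtype=stacked.dtype)
    return np.concatenate([stacked, mask], axis=0)


def mse_loss(noise, predicted_noise):
    """Mean squared error between the added and the predicted noise."""
    return float(np.mean((noise - predicted_noise) ** 2))


rng = np.random.default_rng(0)
H, W, FRAMES = 64, 64, 24
video = rng.standard_normal((4, FRAMES, H // 8, W // 8)).astype(np.float32)
noise = rng.standard_normal(video.shape).astype(np.float32)

# Placeholder noising step (assumption): no timestep-dependent scaling.
noisy = video + noise

unet_input = build_unet_input(noisy, video)
print(unet_input.shape)  # (9, 24, 8, 8)
```

With a 64 x 64 input, H/8 and W/8 are both 8, so the stacked hint and video latents occupy channels 0 to 7 and the zeroed mask channel sits last.
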
@@ -114,7 +114,7 @@ Trainig statistics are available at [Weights and Biases](https://wandb.ai/tempof
  ```bibtext
  @misc{TempoFunk2023,
  author = {Lopho, Carlos Chavez},
- title = {TempoFunk: Extending LDM models to Video},
  url = {https://github.com/lopho/makeavid-sd-tpu},
  month = {5},
  year = {2023}
 
  * Each video latent is encoded into latent representations of the shape 4 x 24 x H/8 x W/8
  * The latent of the first frame from each video is repeated along the frame dimension as additional guidance (referred to as hint image)
  * Hint latent and video latent are stacked to produce a shape of 8 x 24 x H/8 x W/8
+ * The last input channel is preserved for masking purposes (not used during training, set to zero)
  * Text prompts are encoded by the CLIP text encoder
  * Video latents with added noise and clip encoded text prompts are fed into the UNet to predict the added noise
  * Loss is the reconstruction objective between the added noise and the predicted noise via mean squared error (mse/l2)
 
  ```bibtext
  @misc{TempoFunk2023,
  author = {Lopho, Carlos Chavez},
+ title = {TempoFunk: Extending latent diffusion image models to Video},
  url = {https://github.com/lopho/makeavid-sd-tpu},
  month = {5},
  year = {2023}