Image / Video Gen
Image Generation Using Diffusion-Based Methods: Tips and Techniques for Stable Diffusion
Paper • 2208.11970 • Published
Note:
More theoretical reports: https://arxiv.org/pdf/2303.08797
Noise scheduler research: https://arxiv.org/pdf/2301.10972
Tutorial on Diffusion Models for Imaging and Vision
Paper • 2403.18103 • Published • 2
Denoising Diffusion Probabilistic Models
Paper • 2006.11239 • Published • 3
Denoising Diffusion Implicit Models
Paper • 2010.02502 • Published • 3
Progressive Distillation for Fast Sampling of Diffusion Models
Paper • 2202.00512 • Published • 1
Note:
1. Introduces v-prediction (v_pred). For the DDPM noise scheduler:
1.1 Definition: v = \sqrt{\bar{\alpha}_t} \epsilon - \sqrt{1-\bar{\alpha}_t} x_0
1.2 Conversion between epsilon prediction and velocity prediction: \epsilon_{pred} = \sqrt{\bar{\alpha}_t} v_{pred} + \sqrt{1-\bar{\alpha}_t} x_t
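The two identities in the note above can be checked numerically. The following is a minimal NumPy sketch, assuming the usual DDPM forward process x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon; the function names are illustrative and not taken from any library.

```python
import numpy as np

def v_from_eps_x0(eps, x0, alpha_bar_t):
    """v = sqrt(alpha_bar_t) * eps - sqrt(1 - alpha_bar_t) * x0."""
    return np.sqrt(alpha_bar_t) * eps - np.sqrt(1.0 - alpha_bar_t) * x0

def eps_from_v(v, x_t, alpha_bar_t):
    """eps = sqrt(alpha_bar_t) * v + sqrt(1 - alpha_bar_t) * x_t."""
    return np.sqrt(alpha_bar_t) * v + np.sqrt(1.0 - alpha_bar_t) * x_t

def x0_from_v(v, x_t, alpha_bar_t):
    """x0 = sqrt(alpha_bar_t) * x_t - sqrt(1 - alpha_bar_t) * v."""
    return np.sqrt(alpha_bar_t) * x_t - np.sqrt(1.0 - alpha_bar_t) * v

def add_noise(x0, eps, alpha_bar_t):
    """Assumed forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
```

Both conversions rely only on \bar{\alpha}_t + (1 - \bar{\alpha}_t) = 1, so the same code applies to any variance-preserving schedule.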
Flow Matching for Generative Modeling
Paper • 2210.02747 • Published • 1
simple diffusion: End-to-end diffusion for high resolution images
Paper • 2301.11093 • Published • 2
Note:
1. Uses v-prediction with an epsilon loss: the network outputs v, which is converted to an epsilon prediction before the loss is computed.
v_pred = uvit(z_t, logsnr_t)
eps_pred = sigma_t * z_t + alpha_t * v_pred
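A minimal NumPy sketch of the v-prediction, epsilon-loss combination in the note above. It assumes a variance-preserving parameterization in which alpha_t^2 = sigmoid(logsnr_t) and sigma_t^2 = sigmoid(-logsnr_t); `model` is a hypothetical callable standing in for the U-ViT, and the loss is an unweighted mean squared error.

```python
import numpy as np

def alpha_sigma_from_logsnr(logsnr):
    """Variance-preserving coefficients: alpha^2 = sigmoid(logsnr), sigma^2 = sigmoid(-logsnr)."""
    alpha = np.sqrt(1.0 / (1.0 + np.exp(-logsnr)))
    sigma = np.sqrt(1.0 / (1.0 + np.exp(logsnr)))
    return alpha, sigma

def eps_loss_with_v_model(model, x, eps, logsnr):
    """The model predicts v; the loss is computed on the implied epsilon prediction."""
    alpha, sigma = alpha_sigma_from_logsnr(logsnr)
    z_t = alpha * x + sigma * eps            # noised input
    v_pred = model(z_t, logsnr)              # hypothetical network call
    eps_pred = sigma * z_t + alpha * v_pred  # convert v-prediction to epsilon-prediction
    return np.mean((eps_pred - eps) ** 2)
```

A model that returns the true velocity alpha * eps - sigma * x drives this loss to zero, which is a quick sanity check for the conversion.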
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Paper • 2209.03003 • Published • 1
Scalable Diffusion Models with Transformers
Paper • 2212.09748 • Published • 17
Note:
1. adaLN-Zero: following the U-Net initialization strategy of zero-initializing the final convolutional layer in each block before any residual connection, DiT regresses, in addition to γ and β, dimension-wise scaling parameters α that are applied immediately before each residual connection within the DiT block.
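The effect of this initialization can be illustrated with a minimal NumPy sketch. It assumes a single residual branch whose inner function is a plain linear map standing in for attention or the MLP, and it applies the scale as (1 + γ), which is a convention choice rather than something stated in the note. The class and attribute names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm over the last axis, without learned affine parameters."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class AdaLNZeroBranch:
    """One residual branch with conditioning-dependent gamma, beta and gate alpha."""

    def __init__(self, dim, cond_dim, rng):
        # Zero-initialized modulation projection: gamma = beta = alpha = 0 at the start.
        self.W_mod = np.zeros((cond_dim, 3 * dim))
        self.b_mod = np.zeros(3 * dim)
        # Stand-in for the attention / MLP sub-layer (assumption for brevity).
        self.W_f = rng.normal(scale=0.02, size=(dim, dim))

    def __call__(self, x, c):
        # x: (batch, tokens, dim), c: (batch, cond_dim)
        gamma, beta, alpha = np.split(c @ self.W_mod + self.b_mod, 3, axis=-1)
        h = layer_norm(x) * (1.0 + gamma[:, None, :]) + beta[:, None, :]
        h = h @ self.W_f
        # alpha gates the branch immediately before the residual connection.
        return x + alpha[:, None, :] * h
```

With the zero-initialized projection, alpha is zero for every conditioning vector, so the branch starts out as the identity function.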
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
Paper • 2401.08740 • Published • 13
Note:
1. Generation process: the stochastic interpolant framework decouples the formulation of x_t from the forward SDE.
2. Model prediction: learn the velocity field v(x, t) and use it to express the score s(x, t) when sampling with an SDE.
3. The optimal choice of w_t always depends on both the model prediction and the interpolant.
4. Moves from a DiT model (discrete time, score prediction, VP interpolant) to a SiT model (continuous time, velocity prediction, linear interpolant).
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
Paper • 2408.12590 • Published • 35
Note:
1. Extends the 2D image-based VAE into a 3D VideoVAE with CausalConv3D.
2. Encodes a long video with a divide-and-merge strategy.
3. Caption model:
3.1 The temporal encoder is implemented with [Token Turing Machines](https://github.com/google-research/scenic/tree/main/scenic/projects/token_turing).
Classifier-Free Diffusion Guidance
Paper • 2207.12598 • Published • 1
Note:
1. Follow-up work: APG (https://arxiv.org/pdf/2410.02416)
1.1 Leaning more on the orthogonal component of the guidance update significantly attenuates the oversaturation side effect in generations while maintaining the quality-boosting benefits of CFG.
1.2 APG performs best when applied to the denoised predictions rather than the noise predictions.
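A minimal NumPy sketch of standard CFG next to a projection-based variant in the spirit of the APG note above. It assumes the guidance update (conditional minus unconditional prediction) is split into components parallel and orthogonal to the conditional prediction, with the parallel part down-weighted by a hypothetical factor `eta`. Any further components described in the APG paper are not reproduced here, and the function names are illustrative.

```python
import numpy as np

def cfg(pred_cond, pred_uncond, w):
    """Standard classifier-free guidance with scale w."""
    return pred_uncond + w * (pred_cond - pred_uncond)

def split_parallel_orthogonal(v, ref):
    """Split v, per sample, into parts parallel and orthogonal to ref."""
    b = v.shape[0]
    v_flat = v.reshape(b, -1)
    r_flat = ref.reshape(b, -1)
    r_unit = r_flat / (np.linalg.norm(r_flat, axis=1, keepdims=True) + 1e-12)
    par = (v_flat * r_unit).sum(axis=1, keepdims=True) * r_unit
    orth = v_flat - par
    return par.reshape(v.shape), orth.reshape(v.shape)

def projected_guidance(pred_cond, pred_uncond, w, eta=0.0):
    """Guidance that keeps the orthogonal update and scales the parallel one by eta."""
    diff = pred_cond - pred_uncond
    par, orth = split_parallel_orthogonal(diff, pred_cond)
    return pred_cond + (w - 1.0) * (orth + eta * par)
```

Setting `eta=1.0` recovers plain CFG, while `eta=0.0` uses only the orthogonal component. Following item 1.2, the inputs would be denoised predictions rather than noise predictions.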
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Paper • 2310.00426 • Published • 61
Note:
1. Training recipe:
- Initialize the T2I model with a low-cost class-conditional model;
- Pretrain on text-image pair data rich in information density;
- Fine-tune on data with superior aesthetic quality.
2. adaLN-single:
- One global set of shifts and scales is computed only at the first block and shared across all blocks, denoted shared_adaln_cond;
- A layer-specific trainable embedding, denoted adaln_cond, adaptively adjusts the scale and shift parameters in different blocks.
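The adaLN-single idea can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the PixArt-α implementation: the shared conditioning and the per-block embedding are combined by summation, and each block consumes six modulation vectors. The class name and the factor of six are assumptions made for the example.

```python
import numpy as np

class AdaLNSingle:
    """Shared modulation computed once, plus a small trainable table per block."""

    def __init__(self, num_blocks, dim, t_dim, rng):
        # Shared projection, evaluated once per forward pass (shared_adaln_cond).
        self.W_shared = rng.normal(scale=0.02, size=(t_dim, 6 * dim))
        # Layer-specific trainable embeddings (adaln_cond), one row per block.
        self.layer_embed = np.zeros((num_blocks, 6 * dim))

    def shared_cond(self, t_emb):
        # t_emb: (batch, t_dim) -> (batch, 6 * dim)
        return t_emb @ self.W_shared

    def block_params(self, shared_adaln_cond, block_idx):
        # Combine the shared signal with the block's own embedding (assumed: sum).
        cond = shared_adaln_cond + self.layer_embed[block_idx]
        return np.split(cond, 6, axis=-1)
```

Compared with a separate modulation MLP in every block, only the small embedding table grows with depth, which is where the parameter saving comes from.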
FreeInit: Bridging Initialization Gap in Video Diffusion Models
Paper • 2312.07537 • Published • 26
Note:
1. Gap between training and inference: the initial noises corrupted from real videos remain temporally correlated in the low-frequency band.
2. FreeInit procedure:
2.1 Initialize an independent Gaussian noise;
2.2 Run DDIM denoising to generate a clean video latent;
2.3 Obtain a noisy version of the video latent through forward diffusion;
2.4 Combine the low-frequency components of this noisy latent with the high-frequency components of fresh random Gaussian noise;
2.5 Repeat.
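Step 2.4 is the only non-standard operation in the procedure above. A minimal NumPy sketch follows, assuming a latent whose last three axes are (frames, height, width) and a hard spherical low-pass mask in normalized frequency space. The filter shape and the `cutoff` value are assumptions made for the example.

```python
import numpy as np

def lowpass_mask(shape, cutoff=0.25):
    """Binary low-pass mask over a 3D frequency grid (cycles per sample, unshifted layout)."""
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in shape], indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    return (radius <= cutoff).astype(float)

def mix_frequencies(noisy_latent, fresh_noise, cutoff=0.25):
    """Keep low frequencies of noisy_latent and high frequencies of fresh_noise."""
    axes = (-3, -2, -1)  # frames, height, width
    mask = lowpass_mask(noisy_latent.shape[-3:], cutoff)
    f_latent = np.fft.fftn(noisy_latent, axes=axes)
    f_noise = np.fft.fftn(fresh_noise, axes=axes)
    mixed = f_latent * mask + f_noise * (1.0 - mask)
    return np.fft.ifftn(mixed, axes=axes).real
```

In a full loop, the mixed tensor would replace the initial noise for the next DDIM pass (steps 2.2 to 2.5).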
black-forest-labs/FLUX.1-schnell
Text-to-Image • Updated • 631k • 3.24k
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Paper • 2403.03206 • Published • 61
Note: Known as SD-3.
1. Changes the distribution over t from uniform to one that gives more weight to intermediate timesteps by sampling them more frequently.
2. Uses a ratio of 50% original and 50% synthetic captions.
3. MM-DiT
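One way to realize item 1 is a logit-normal density over t, sketched below in NumPy. The `mean` and `std` defaults are illustrative assumptions, not values taken from the note.

```python
import numpy as np

def sample_t_logit_normal(n, mean=0.0, std=1.0, rng=None):
    """Draw t in (0, 1) as sigmoid(u) with u ~ Normal(mean, std)."""
    if rng is None:
        rng = np.random.default_rng()
    u = rng.normal(loc=mean, scale=std, size=n)
    return 1.0 / (1.0 + np.exp(-u))
```

With `mean=0`, the density peaks at t = 0.5 and thins out toward 0 and 1, so intermediate timesteps are visited more often than under uniform sampling.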
On the Importance of Noise Scheduling for Diffusion Models
Paper • 2301.10972 • Published • 1
Note:
1. When increasing the image size, the optimal noise schedule shifts towards a noisier one (due to increased redundancy in pixels). This is more important in video generation.
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
Paper • 2402.14797 • Published • 20
Note:
1. Argues that treating spatial and temporal modeling separably causes motion artifacts, temporal inconsistencies, or generation of dynamic images rather than videos with vivid motion.
2. Follow-up: Mind the Time (https://mint-video.github.io/src/MinT-paper.pdf)
2.1 Uses interval guidance in CFG to mitigate the oversaturation issue.
Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models
Paper • 2404.07724 • Published • 14
Note:
1. Guidance is harmful toward the beginning of the chain (high noise levels), largely unnecessary toward the end (low noise levels), and only beneficial in the middle.
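The observation above translates into a very small change to a CFG sampler. The following is a minimal sketch, assuming guidance is switched on only for noise levels inside an interval (sigma_lo, sigma_hi]. The parameter names and the convention for the interval bounds are assumptions made for the example.

```python
def guidance_weight(sigma, w, sigma_lo, sigma_hi):
    """Return w inside the interval (sigma_lo, sigma_hi], otherwise 1.0 (guidance off)."""
    return w if sigma_lo < sigma <= sigma_hi else 1.0

def guided_prediction(pred_cond, pred_uncond, sigma, w, sigma_lo, sigma_hi):
    """CFG with a noise-level-dependent weight; weight 1.0 returns pred_cond."""
    w_eff = guidance_weight(sigma, w, sigma_lo, sigma_hi)
    return pred_uncond + w_eff * (pred_cond - pred_uncond)
```

Outside the interval the unconditional branch does not affect the result, so a sampler could skip that forward pass entirely and save compute.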
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Paper • 2410.06940 • Published • 7
MarDini: Masked Autoregressive Diffusion for Video Generation at Scale
Paper • 2410.20280 • Published • 23
Note:
1. Spatio-temporal attention uses 2D RoPE over the spatial and temporal dimensions. Inserting a learnable [NEXT] token to differentiate image patches across rows is enough for the spatial layout, so 3D RoPE is not needed.
2. Dynamic-resolution training is not part of the main training stages. Instead, after convergence, fine-tuning the model for a few steps (10K-20K) with dynamic resolutions enables it.
In-Context LoRA for Diffusion Transformers
Paper • 2410.23775 • Published • 11
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens
Paper • 2410.13863 • Published • 37
Note:
1. Validation loss is a proxy for generation quality.
OminiControl: Minimal and Universal Control for Diffusion Transformer
Paper • 2411.15098 • Published • 55
Note:
1. Processes condition image tokens uniformly with text and noisy image tokens, integrating them into a unified sequence. Does not add the condition directly to the hidden states, because that constrains token interactions.
Open-Sora Plan: Open-Source Large Video Generation Model
Paper • 2412.00131 • Published • 33
Note:
1. Retains Full 3D Attention in the first and last two layers.
2. First trains a Full 3D Attention model on 256 × 256 images, then inherits the model weights and replaces Full 3D Attention with Skiparse Attention.
3. Adds slight Gaussian noise to the conditional images during fine-tuning to enhance generalization.
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
Paper • 2410.10629 • Published • 10
Note:
1. Removes the positional embedding in DiT and finds no quality loss.
2. AE-F32C32; skips 256px; gradually fine-tunes the model to 1024px, 2K, and 4K.
3. Replaces T5 with an LLM as the text encoder. Directly following the T5 approach, with the text embedding as key and value and the image tokens as query in cross-attention, makes training extremely unstable, with the training loss frequently becoming NaN.
genmo/mochi-1-preview
Text-to-Video • Updated • 40.5k • 1.15k
Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models
Paper • 2409.10695 • Published • 2
Note:
1. Token down-sampling at middle layers: reduces the sequence length of the image keys and values by four times in the middle layers, making the whole network resemble a traditional convolutional U-Net with only one level of down-sampling.
2. Improves the captioning conditions by generating multi-level captions to reduce dataset bias and prevent model overfitting.
3. Loops through the gradients of all model parameters and counts how many exceed a specific gradient-value threshold.
Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling
Paper • 2411.18664 • Published • 24
STIV: Scalable Text and Image Conditioned Video Generation
Paper • 2412.07730 • Published • 71
Note:
1. As the spatial resolution is scaled up, the model produces slow or nearly static motion.
2. Using causal temporal attention also results in a significant drop in both quality and total scores.
3. Interpolating the RoPE embeddings yields better VBench scores than extrapolating them.
4. Staleness appears when the model is scaled to 8B at resolutions >= 512, probably because the model more easily overfits to following the first frame.
Relay Diffusion: Unifying diffusion process across resolutions for image synthesis
Paper • 2309.03350 • Published
Note: pixel-wise noise + patch-wise noise
Lightricks/LTX-Video
Image-to-Video • Updated • 90.8k • 869
Note: https://arxiv.org/pdf/2501.00103
1. Moves the patchifying layer to the beginning of the VAE encoder.
2. Fuses the decoding and denoising steps.
3.1 L2 loss often produces blurry outputs;
3.2 Perceptual loss reduces blurriness.
4. RoPE with fractional coordinates normalized by predefined maximum coordinates works best.
RepVideo: Rethinking Cross-Layer Representation for Video Generation
Paper • 2501.08994 • Published • 13
Note:
1. As layer depth increases, the attention of each frame’s tokens becomes more concentrated on tokens from the same frame, with relatively weaker attention to tokens from other frames.
2. Enhances the model’s ability to interpret text prompts by employing multiple encoders to capture different layers of information, such as semantic-level and character-level understanding, thereby improving the alignment between generated content and textual descriptions.