Image / Video Gen
Image Generation Using Diffusion-Based Methods: Tips and Techniques for Stable Diffusion
Paper • 2208.11970 • Published
Note: More theoretical reports: https://arxiv.org/pdf/2303.08797
Tutorial on Diffusion Models for Imaging and Vision
Paper • 2403.18103 • Published • 2
Denoising Diffusion Probabilistic Models
Paper • 2006.11239 • Published • 3
Denoising Diffusion Implicit Models
Paper • 2010.02502 • Published • 3
Progressive Distillation for Fast Sampling of Diffusion Models
Paper • 2202.00512 • Published • 1
Note:
1. Introduces v-prediction (v_pred). For the DDPM noise scheduler, with $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$:
   1.1 Definition: $v = \sqrt{\bar{\alpha}_t}\, \epsilon - \sqrt{1-\bar{\alpha}_t}\, x_0$
   1.2 Conversion between epsilon prediction and velocity prediction: $\epsilon_{pred} = \sqrt{\bar{\alpha}_t}\, v_{pred} + \sqrt{1-\bar{\alpha}_t}\, x_t$
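The conversion in this note can be checked numerically. The sketch below is a scalar toy, not code from the paper: the function names and sample values are made up for illustration, and it assumes the usual DDPM forward process x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps.

```python
import math

def to_v(x0, eps, alpha_bar):
    """v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    return math.sqrt(alpha_bar) * eps - math.sqrt(1.0 - alpha_bar) * x0

def eps_from_v(x_t, v, alpha_bar):
    """Recover the noise: eps = sqrt(alpha_bar) * v + sqrt(1 - alpha_bar) * x_t."""
    return math.sqrt(alpha_bar) * v + math.sqrt(1.0 - alpha_bar) * x_t

def x0_from_v(x_t, v, alpha_bar):
    """Recover the clean sample: x0 = sqrt(alpha_bar) * x_t - sqrt(1 - alpha_bar) * v."""
    return math.sqrt(alpha_bar) * x_t - math.sqrt(1.0 - alpha_bar) * v

# Hypothetical scalar example values.
x0, eps, alpha_bar = 0.7, -1.3, 0.36
x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps
v = to_v(x0, eps, alpha_bar)
```

Both identities follow from sqrt(alpha_bar)^2 + sqrt(1 - alpha_bar)^2 = 1, so a model trained to predict v can be turned into an epsilon or x0 predictor at any timestep.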
Flow Matching for Generative Modeling
Paper • 2210.02747 • Published • 1
simple diffusion: End-to-end diffusion for high resolution images
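Paper • 2301.11093 • Published • 2
Note:
1. Trains with v-prediction and an epsilon loss: the network outputs v, which is converted to an epsilon prediction before the loss is computed.
   v_pred = uvit(z_t, logsnr_t)
   eps_pred = sigma_t * z_t + alpha_t * v_pred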
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Paper • 2209.03003 • Published • 1
MAGVIT: Masked Generative Video Transformer
Paper • 2212.05199 • Published
Note:
1. Inflation
   1.1 Use a central inflation method for the convolution layers: the 2D kernel fills the temporally central slice of a zero-filled 3D kernel.
   1.2 Replace the "same" (zero) padding in the convolution layers with reflect padding.
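A minimal sketch of the two inflation steps, using nested Python lists instead of a tensor library. The helper names are hypothetical, the temporal kernel size `kt` is assumed to be odd, and the padding helper only shows the 1D case.

```python
def inflate_central(kernel2d, kt):
    """Build a kt x kh x kw kernel that is zero except for the temporally central slice."""
    kh, kw = len(kernel2d), len(kernel2d[0])
    kernel3d = [[[0.0] * kw for _ in range(kh)] for _ in range(kt)]
    kernel3d[kt // 2] = [row[:] for row in kernel2d]  # copy rows to avoid aliasing
    return kernel3d

def reflect_pad_1d(seq, pad):
    """Reflect padding without repeating the edge value; requires pad < len(seq)."""
    left = seq[1:pad + 1][::-1]
    right = seq[-pad - 1:-1][::-1]
    return left + seq + right

kernel2d = [[1.0, 2.0], [3.0, 4.0]]
kernel3d = inflate_central(kernel2d, kt=3)
padded = reflect_pad_1d([1, 2, 3, 4], pad=2)
```

Because only the central temporal slice is non-zero, the inflated layer initially reproduces the 2D layer's output frame by frame.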
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Paper • 2310.05737 • Published • 4
Note:
1. Known as MAGVIT-2. Growing the vocabulary size can benefit generation quality.
2. Both reconstruction and generation consistently improve as the vocabulary size increases. The vocabulary is built from single-dimensional (binary) variables: each latent dimension is quantized to one bit by its sign, and the token index is the resulting binary number. For example, with a latent feature z \in R^{4}:
   [-1, 1, -2, 3] --> [0, 1, 0, 1] --> sum([0, 2^1, 0, 2^3]) --> 10
   [ 1, 1, 1, 3] --> [1, 1, 1, 1] --> sum([2^0, 2^1, 2^2, 2^3]) --> 15
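The worked example in this note can be reproduced in a few lines. This is a sketch of the index computation only, assuming each latent dimension maps to bit 1 when positive and bit 0 otherwise; `lfq_token_index` is a hypothetical name.

```python
def lfq_token_index(z):
    """Map a latent vector to a token index: bit i is 1 when z[i] > 0."""
    bits = [1 if value > 0 else 0 for value in z]
    return sum(bit << i for i, bit in enumerate(bits))

def lfq_vocab_size(num_dims):
    """Each dimension contributes one bit, so the vocabulary has 2 ** num_dims entries."""
    return 2 ** num_dims

first = lfq_token_index([-1, 1, -2, 3])   # bits [0, 1, 0, 1]
second = lfq_token_index([1, 1, 1, 3])    # bits [1, 1, 1, 1]
```

Adding one latent dimension doubles the vocabulary, which is how the vocabulary can grow without a learned embedding table for lookup.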
Scalable Diffusion Models with Transformers
Paper • 2212.09748 • Published • 16
Note:
1. adaLN-Zero: following the U-Net practice of zero-initializing the final convolutional layer in each block before the residual connection, DiT regresses, in addition to γ and β, dimension-wise scaling parameters α that are applied immediately before each residual connection within the DiT block. The α are initialized to zero, so each DiT block starts as the identity function.
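A toy sketch of one adaLN-Zero residual branch on a plain Python vector, to show why a zero α makes the block an identity at initialization. The function names, the `(1 + gamma)` scaling convention, and the stand-in sublayer are assumptions for illustration, not the DiT implementation.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_zero_branch(x, gamma, beta, alpha, sublayer):
    """Residual branch: x + alpha * sublayer(layer_norm(x) * (1 + gamma) + beta)."""
    modulated = [n * (1.0 + g) + b for n, g, b in zip(layer_norm(x), gamma, beta)]
    update = sublayer(modulated)
    return [xi + a * u for xi, a, u in zip(x, alpha, update)]

def toy_sublayer(h):
    """Stand-in for attention or the MLP; any function of h works here."""
    return [3.0 * v + 1.0 for v in h]

x = [0.5, -1.0, 2.0, 0.0]
# With alpha equal to zero, the sublayer output is gated away entirely.
out_at_init = adaln_zero_branch(x, [0.2] * 4, [0.1] * 4, [0.0] * 4, toy_sublayer)
# With a non-zero alpha, the branch contributes to the output.
out_trained = adaln_zero_branch(x, [0.2] * 4, [0.1] * 4, [0.5] * 4, toy_sublayer)
```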
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
Paper • 2401.08740 • Published • 12
Note:
1. Generation process: the stochastic interpolant framework decouples the formulation of x_t from the forward SDE.
2. Model prediction: learn the velocity field v(x, t) and use it to express the score s(x, t) when sampling with an SDE.
3. The optimal choice of w_t always depends on the model prediction and the interpolant.
4. The paper moves step by step from a DiT model (discrete time, score prediction, VP interpolant) to a SiT model (continuous time, velocity prediction, linear interpolant).
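For the linear interpolant, the score can be written in closed form from the velocity. The sketch below assumes the convention x_t = (1 - t) * x_data + t * eps (data at t = 0, noise at t = 1), under which the velocity is eps - x_data and the score is -E[eps | x_t] / t; the function name and the point-mass check are illustrative only.

```python
def score_from_velocity_linear(x, v, t):
    """Score for x_t = (1 - t) * x_data + t * eps, valid for 0 < t <= 1.

    Solving the interpolant and velocity equations for eps gives
    eps = (1 - t) * v + x, and the score is -eps / t.
    """
    return -((1.0 - t) * v + x) / t

# Point-mass check: a single data point, so the conditional expectations are exact.
x_data, eps, t = 2.0, 0.5, 0.4
x_t = (1.0 - t) * x_data + t * eps
velocity = eps - x_data
score = score_from_velocity_linear(x_t, velocity, t)
```

The division by t is why the score form becomes ill-conditioned near the data end of the path, while the velocity itself stays bounded.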
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
Paper • 2408.12590 • Published • 34
Note:
1. Extends the 2D image VAE into a 3D VideoVAE with CausalConv3D.
2. Encodes long videos with a divide-and-merge strategy.
3. Caption model
   3.1 The temporal encoder is implemented with [Token Turing Machines](https://github.com/google-research/scenic/tree/main/scenic/projects/token_turing).
Classifier-Free Diffusion Guidance
Paper • 2207.12598 • Published • 2
Note:
1. Follow-up work: APG (https://arxiv.org/pdf/2410.02416)
   1.1 APG splits the CFG update into components parallel and orthogonal to the conditional prediction. Relying more on the orthogonal component significantly attenuates the oversaturation side effect of high guidance scales while keeping the quality-boosting benefits of CFG.
   1.2 APG performs best when applied to the denoised predictions rather than the noise predictions.
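A sketch of the projection idea behind APG on plain Python vectors. It assumes the CFG update is the difference between the conditional and unconditional denoised predictions, split into a part parallel to the conditional prediction and a part orthogonal to it, with a hypothetical weight `eta` on the parallel part. Any further components of the paper's method are omitted, and `d_cond` is assumed to be non-zero.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def apg_guidance(d_cond, d_uncond, w, eta=0.0):
    """Guided prediction that down-weights the update component parallel to d_cond.

    eta = 1 recovers standard CFG; eta = 0 keeps only the orthogonal component.
    """
    diff = [c - u for c, u in zip(d_cond, d_uncond)]
    scale = dot(diff, d_cond) / dot(d_cond, d_cond)
    parallel = [scale * c for c in d_cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    return [c + (w - 1.0) * (o + eta * p)
            for c, o, p in zip(d_cond, orthogonal, parallel)]

# Hypothetical denoised predictions.
d_cond, d_uncond = [1.0, 2.0, 0.0], [0.5, 1.0, 1.0]
projected = apg_guidance(d_cond, d_uncond, w=3.0, eta=0.0)
plain_cfg = apg_guidance(d_cond, d_uncond, w=3.0, eta=1.0)
```

In this toy case the parallel component only rescales `d_cond`, which is the part associated with oversaturation, while the orthogonal component changes its direction.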
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Paper • 2310.00426 • Published • 61
Note:
1. Training recipe
   - Initialize the T2I model from a low-cost class-conditional model;
   - Pretrain on text-image pairs with high information density;
   - Fine-tune on data with superior aesthetic quality.
2. adaLN-single
   - One global set of shifts and scales, denoted shared_adaln_cond, is computed only at the first block and shared across all blocks;
   - A layer-specific trainable embedding, denoted adaln_cond, adaptively adjusts the scale and shift parameters in each block.
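The adaLN-single idea can be sketched as follows, assuming the shared conditioning and the per-block embedding are combined by summation. The six scalars stand in for the six modulation vectors of a block and all values are hypothetical.

```python
def block_modulation(shared_adaln_cond, adaln_cond):
    """Per-block scale/shift parameters: shared conditioning plus the block's own embedding."""
    return [s + e for s, e in zip(shared_adaln_cond, adaln_cond)]

# Computed once per timestep and reused by every block.
shared_adaln_cond = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

# One small trainable embedding per block (two blocks here).
per_block_embeddings = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.01, -0.02, 0.0, 0.0, 0.03, 0.0],
]

params_per_block = [block_modulation(shared_adaln_cond, e) for e in per_block_embeddings]
```

Compared with a separate conditioning MLP in every block, only the small per-block embeddings scale with depth.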
FreeInit: Bridging Initialization Gap in Video Diffusion Models
Paper • 2312.07537 • Published • 26
Note:
1. Gap between training and inference: initial noise obtained by corrupting real videos (training) remains temporally correlated in the low-frequency band, whereas inference starts from independent Gaussian noise.
2. FreeInit procedure
   2.1 Initialize independent Gaussian noise;
   2.2 Run DDIM denoising to generate a clean video latent;
   2.3 Obtain a noisy version of that latent through forward diffusion;
   2.4 Combine the low-frequency components of this noisy latent with the high-frequency components of fresh random Gaussian noise;
   2.5 Repeat from 2.2 with the combined noise.
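The frequency-mixing step of the procedure can be illustrated with a 1D toy along the time axis. This sketch uses a naive DFT and an ideal (hard cutoff) low-pass filter on a list of per-frame values; real video latents are multi-dimensional and the paper's filter may differ, so the function names and the `cutoff` parameter are assumptions.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * k / n) for f in range(n)).real / n
            for k in range(n)]

def mix_frequencies(noisy_latent, fresh_noise, cutoff):
    """Take frequencies up to `cutoff` from noisy_latent and the rest from fresh_noise."""
    n = len(noisy_latent)
    low, high = dft(noisy_latent), dft(fresh_noise)
    mixed = []
    for f in range(n):
        freq = min(f, n - f)  # distance from DC, so negative frequencies are paired
        mixed.append(low[f] if freq <= cutoff else high[f])
    return idft(mixed)

# Hypothetical per-frame values for one latent element across 8 frames.
noisy_latent = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
fresh_noise = [0.5, -0.5, 0.25, -0.25, 1.0, -1.0, 0.0, 0.0]
refined_noise = mix_frequencies(noisy_latent, fresh_noise, cutoff=1)
```

The refined noise keeps the slow temporal structure of the re-noised latent while its fast variation comes from fresh noise, which is then denoised again in the next iteration.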
black-forest-labs/FLUX.1-schnell
Text-to-Image • Updated • 1.96M • 2.85k
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Paper • 2403.03206 • Published • 57
Note: Known as SD-3.
1. Replaces the uniform distribution over t with one that gives more weight to intermediate timesteps by sampling them more frequently.
2. Uses a ratio of 50% original and 50% synthetic captions.
3. Introduces the MM-DiT architecture.
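One density that concentrates samples on intermediate timesteps is the logit-normal: draw a Gaussian value and pass it through a sigmoid. The sketch below is illustrative, and the function name and the default `mean` and `std` are assumptions rather than the paper's training configuration.

```python
import math
import random

def sample_logit_normal_t(rng, mean=0.0, std=1.0):
    """Sample t in (0, 1) by applying a sigmoid to a Gaussian draw."""
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))

rng = random.Random(0)
samples = [sample_logit_normal_t(rng) for _ in range(10000)]
# Under a uniform distribution this fraction would be about 0.5.
middle_fraction = sum(0.25 < t < 0.75 for t in samples) / len(samples)
```

Shifting `mean` moves the emphasis toward the data or noise end of the path, and `std` controls how sharply the middle is favoured.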
On the Importance of Noise Scheduling for Diffusion Models
Paper • 2301.10972 • Published • 1
Note:
1. As the image size increases, the optimal noise schedule shifts toward a noisier one (due to increased redundancy in pixels). This matters even more for video generation.
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
Paper • 2402.14797 • Published • 19
Note:
1. Argues that modeling space and time separably causes motion artifacts, temporal inconsistencies, or the generation of dynamic images rather than videos with vivid motion.
MotionCtrl: A Unified and Flexible Motion Controller for Video Generation
Paper • 2312.03641 • Published • 20
Note:
1. Motion Brush?
Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models
Paper • 2404.07724 • Published • 12
Note:
1. Guidance is harmful toward the beginning of the sampling chain (high noise levels), largely unnecessary toward the end (low noise levels), and only beneficial in the middle.
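A sketch of interval-limited guidance: classifier-free guidance is applied only when the current noise level lies inside a chosen interval, and the plain conditional prediction is used elsewhere. The function name and the interval bounds `sigma_lo` and `sigma_hi` are illustrative placeholders that would need tuning per model.

```python
def guided_prediction(d_cond, d_uncond, sigma, w, sigma_lo, sigma_hi):
    """Apply CFG with weight w only for sigma in (sigma_lo, sigma_hi]."""
    if not (sigma_lo < sigma <= sigma_hi):
        # Outside the interval: guidance weight 1, i.e. the conditional prediction.
        return list(d_cond)
    return [u + w * (c - u) for c, u in zip(d_cond, d_uncond)]

# Hypothetical predictions and interval.
d_cond, d_uncond = [1.0, 2.0], [0.5, 1.0]
early = guided_prediction(d_cond, d_uncond, sigma=50.0, w=3.0, sigma_lo=0.3, sigma_hi=5.0)
middle = guided_prediction(d_cond, d_uncond, sigma=1.0, w=3.0, sigma_lo=0.3, sigma_hi=5.0)
late = guided_prediction(d_cond, d_uncond, sigma=0.1, w=3.0, sigma_lo=0.3, sigma_hi=5.0)
```

Skipping guidance outside the interval also saves the unconditional forward pass at those steps.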
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Paper • 2410.06940 • Published • 4
Tora: Trajectory-oriented Diffusion Transformer for Video Generation
Paper • 2407.21705 • Published • 25
MarDini: Masked Autoregressive Diffusion for Video Generation at Scale
Paper • 2410.20280 • Published • 21
Note:
1. Spatio-temporal attention uses 2D RoPE for the spatial and temporal axes. Inserting a learnable [NEXT] token to separate image patches in different rows is enough to encode spatial layout, so 3D RoPE is not needed.
2. Dynamic-resolution training is not part of the main training stages. Instead, after convergence, fine-tuning the model for a few steps (10K-20K) with dynamic resolutions enables it.
Finite Scalar Quantization: VQ-VAE Made Simple
Paper • 2309.15505 • Published • 21
Note:
1. Known as FSQ.
2. Achieves high codebook utilization by design (almost 100%).
3. Before FSQ, most of the literature used unbounded scalar quantization, in which the range of integers is not limited by the encoder but only by constraining the entropy of the representation.
4. Vocabulary size: |C| = L^d for d channels with L levels each (in general, the product of the per-channel levels L_i).
5. A simple heuristic that performs well in all considered tasks: use L_i ≥ 5 ∀i.
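A simplified sketch of finite scalar quantization for inference: each channel is bounded with tanh, rounded to an integer level, and the per-channel levels are combined into one token index. It handles odd level counts only and leaves out the straight-through gradient used in training; the function names are hypothetical.

```python
import math

def fsq_quantize(z, levels):
    """Bound each channel with tanh and round to an integer in [-half, half] (odd levels only)."""
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) // 2
        codes.append(round(half * math.tanh(value)))
    return codes

def fsq_index(codes, levels):
    """Combine per-channel codes into a single index using a mixed-radix expansion."""
    index, basis = 0, 1
    for code, num_levels in zip(codes, levels):
        index += (code + (num_levels - 1) // 2) * basis
        basis *= num_levels
    return index

def codebook_size(levels):
    """Implicit codebook size: the product of the per-channel level counts."""
    return math.prod(levels)

levels = [5, 5, 5]                      # follows the L_i >= 5 heuristic
codes = fsq_quantize([10.0, -10.0, 0.0], levels)
token = fsq_index(codes, levels)
```

Since every combination of levels is a valid code, the codebook is implicit and no embedding table has to be learned.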
In-Context LoRA for Diffusion Transformers
Paper • 2410.23775 • Published • 10
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens
Paper • 2410.13863 • Published • 35
Note:
1. Validation loss is a proxy for generation quality.