LaViDa: A Large Diffusion Language Model for Multimodal Understanding Paper • 2505.16839 • Published May 22, 2025 • 12
Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection Paper • 2503.12271 • Published Mar 15, 2025 • 9
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows Paper • 2412.01169 • Published Dec 2, 2024 • 13
InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following Paper • 2312.06738 • Published Dec 11, 2023
Hierarchical Open-vocabulary Universal Image Segmentation Paper • 2307.00764 • Published Jul 3, 2023
Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data Paper • 2402.05892 • Published Feb 8, 2024
xT: Nested Tokenization for Larger Context in Large Images Paper • 2403.01915 • Published Mar 4, 2024 • 1
Aligning Diffusion Models by Optimizing Human Utility Paper • 2404.04465 • Published Apr 6, 2024 • 15
Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning Paper • 2212.14532 • Published Dec 30, 2022 • 1