UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing • Paper • 2503.12652 • Published Mar 16
CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling • Paper • 2502.00965 • Published Feb 3
GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing • Paper • 2505.11493 • Published May 16 • 3
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer • Paper • 2509.16197 • Published 29 days ago • 52
CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching • Paper • 2509.19300 • Published 25 days ago • 6
MOFI: Learning Image Representations from Noisy Entity Annotated Images • Paper • 2306.07952 • Published Jun 13, 2023 • 2
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models • Paper • 2404.07973 • Published Apr 11, 2024 • 32
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models • Paper • 2410.02740 • Published Oct 3, 2024 • 54
STIV: Scalable Text and Image Conditioned Video Generation • Paper • 2412.07730 • Published Dec 10, 2024 • 74
DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation • Paper • 2503.10618 • Published Mar 13 • 18