LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer
Abstract
LaTtE-Flow is a new architecture that unifies image understanding and generation with high performance and faster inference, using a Layerwise Timestep-Expert flow-based Transformer together with a Timestep-Conditioned Residual Attention mechanism.
Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to match the performance of models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose the Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep-Expert flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks while delivering competitive image generation quality with around 6x faster inference than recent unified multimodal models.
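To make the layerwise routing concrete, below is a minimal PyTorch sketch of the idea described in the abstract. This is not the authors' implementation: `TimestepExpertStack`, `euler_sample`, and all hyperparameters are hypothetical, and the velocity network is reduced to stock `nn.TransformerEncoderLayer` blocks. It only illustrates how the flow-matching time interval [0, 1) can be partitioned across groups of layers so that each sampling step activates a single group.

```python
import torch
import torch.nn as nn

class TimestepExpertStack(nn.Module):
    """Sketch of layerwise timestep experts: the Transformer's layers are
    partitioned into groups, and each group owns a contiguous slice of the
    flow-matching time interval [0, 1). Only one group runs per sampling
    step, so a K-step sampler touches each layer roughly once instead of
    running the full depth K times."""

    def __init__(self, num_layers=24, num_experts=6, dim=1024, heads=16):
        super().__init__()
        assert num_layers % num_experts == 0
        self.num_experts = num_experts
        layers_per_expert = num_layers // num_experts
        # Hypothetical per-expert stacks of standard Transformer blocks.
        self.experts = nn.ModuleList(
            nn.ModuleList(
                nn.TransformerEncoderLayer(dim, heads, batch_first=True)
                for _ in range(layers_per_expert)
            )
            for _ in range(num_experts)
        )

    def expert_index(self, t: float) -> int:
        # Map a timestep t in [0, 1) to the layer group owning that slice.
        return min(int(t * self.num_experts), self.num_experts - 1)

    def forward(self, x, t: float):
        # Route through only the layer group assigned to timestep t.
        for block in self.experts[self.expert_index(t)]:
            x = block(x)
        return x


@torch.no_grad()
def euler_sample(model, x, steps=6):
    """Illustrative Euler sampler for flow matching: the model output is
    treated as a velocity field v(x, t), and each step activates only the
    expert group for that timestep rather than the full network."""
    dt = 1.0 / steps
    for i in range(steps):
        t = i / steps
        v = model(x, t)      # velocity prediction from the active group
        x = x + dt * v       # Euler update along the flow
    return x


# Usage (shapes are illustrative): 6 sampling steps, one expert group each.
model = TimestepExpertStack()
x0 = torch.randn(2, 16, 1024)   # noise latents: (batch, tokens, dim)
latents = euler_sample(model, x0, steps=6)
```

Under these assumptions, a 6-step sampler executes each of the 24 layers about once in total, whereas a conventional flow Transformer would run all 24 layers at every step, which is the intuition behind the reported speedup.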
Community
This paper introduces a timestep-expert architecture into the flow-matching designs recently adopted by models such as Bagel, LMFusion, and Transfusion. The proposed architecture achieves faster inference and faster training convergence while exhibiting strong image generation and understanding capabilities.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation (2025)
- Fast Autoregressive Models for Continuous Latent Generation (2025)
- MADFormer: Mixed Autoregressive and Diffusion Transformers for Continuous Image Generation (2025)
- FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities (2025)
- Context-Aware Autoregressive Models for Multi-Conditional Image Generation (2025)
- HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation (2025)
- Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation (2025)