Remember Gemini and GPT-4o, both pitched as true multimodal models?
Now we have a paper describing an architecture that might actually achieve that!
Uni-MoE: a native multimodal, Unified Mixture of Experts (MoE) architecture.
Uni-MoE integrates multiple modalities (text, image, audio, video, speech) using modality-specific encoders and connectors for cohesive multimodal understanding.
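To make the encoder-plus-connector idea concrete, here is a minimal PyTorch sketch of a connector: it projects features from a frozen modality encoder into the LLM's token-embedding space so they can be concatenated with text embeddings. The class name, MLP shape, and dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ModalityConnector(nn.Module):
    """Sketch of a connector: maps frozen encoder features (e.g. CLIP image
    patches or Whisper speech frames) into the LLM's token-embedding space.
    Layer sizes are illustrative, not taken from the paper."""
    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, encoder_features: torch.Tensor) -> torch.Tensor:
        # encoder_features: (batch, num_tokens, encoder_dim)
        return self.proj(encoder_features)  # (batch, num_tokens, llm_dim)

# Usage: image tokens from a frozen CLIP-like encoder joined with text embeddings.
image_feats = torch.randn(2, 256, 1024)   # hypothetical encoder output
text_embeds = torch.randn(2, 32, 4096)    # hypothetical LLM text embeddings
connector = ModalityConnector(encoder_dim=1024, llm_dim=4096)
multimodal_sequence = torch.cat([connector(image_feats), text_embeds], dim=1)
```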
Training Strategy:
1) Cross-modality alignment training with diverse connectors.
2) Modality-specific expert training on cross-modality instruction data.
3) Tuning the Uni-MoE framework with Low-Rank Adaptation (LoRA) on mixed multimodal data (see the staging sketch below).
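A rough sketch of how the three stages might be wired up in training code: freeze everything, then unfreeze only the parameter group that a given stage trains. The module names and sizes below are hypothetical placeholders, not Uni-MoE's actual modules.

```python
import torch.nn as nn

# Hypothetical stand-ins for the parameter groups named above;
# the group names and shapes are illustrative only.
model = nn.ModuleDict({
    "connectors": nn.Linear(1024, 4096),  # modality connectors (stage 1)
    "experts":    nn.Linear(4096, 4096),  # modality-specific FFN experts (stage 2)
    "lora":       nn.Linear(4096, 8),     # LoRA adapters on experts/attention (stage 3)
})

def configure_stage(stage: int) -> None:
    """Freeze all parameters, then unfreeze only the group this stage trains."""
    for p in model.parameters():
        p.requires_grad = False
    group = {1: "connectors", 2: "experts", 3: "lora"}[stage]
    for p in model[group].parameters():
        p.requires_grad = True

configure_stage(1)  # stage 1: cross-modality alignment via the connectors
```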
Technical Details:
Modality-Specific Encoders: CLIP for images, Whisper for speech, BEATs for audio.
MoE-Based Blocks: shared self-attention layers, feed-forward network (FFN) experts, and sparse routers for token-level expert allocation (sketched after this list).
Efficient Training: uses LoRA to fine-tune the pre-trained experts and self-attention layers.
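A minimal sketch of such an MoE block's FFN path with LoRA-wrapped experts: a linear router scores the experts for each token, keeps the top-k, and mixes their outputs by the softmaxed routing weights, while each expert's base weights stay frozen and only the low-rank adapters train. All names, ranks, and sizes are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank update (W x + B A x).
    Rank is an illustrative choice, not the paper's setting."""
    def __init__(self, dim_in: int, dim_out: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(dim_in, dim_out)
        self.base.weight.requires_grad = False              # pre-trained weight stays frozen
        self.base.bias.requires_grad = False
        self.lora_a = nn.Linear(dim_in, rank, bias=False)   # trainable down-projection
        self.lora_b = nn.Linear(rank, dim_out, bias=False)  # trainable up-projection
        nn.init.zeros_(self.lora_b.weight)                  # start as a no-op update

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.lora_b(self.lora_a(x))

class SparseMoEFFN(nn.Module):
    """Token-level sparse MoE feed-forward block: the router picks the top-k
    experts per token and mixes their outputs. The shared self-attention layer
    that would precede this block is omitted for brevity."""
    def __init__(self, dim: int, hidden: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(LoRALinear(dim, hidden), nn.GELU(), LoRALinear(hidden, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq, dim = x.shape
        tokens = x.reshape(-1, dim)                          # route each token independently
        weights, indices = self.router(tokens).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(tokens[mask])
        return out.reshape(batch, seq, dim)

x = torch.randn(2, 16, 512)
y = SparseMoEFFN(dim=512, hidden=2048)(x)                    # (2, 16, 512)
```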
Uni-MoE outperforms traditional dense models on benchmarks like A-OKVQA, OK-VQA, VQAv2, MMBench, RACE-Audio, and the English High School Listening Test.