MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
Abstract
MoME, a novel framework integrating sparse Mixture-of-Experts into Matryoshka representation learning, enhances audio-visual speech recognition by dynamically adjusting capacity across scales and modalities, achieving state-of-the-art performance with fewer parameters.
Large language models (LLMs) have recently shown strong potential in audio-visual speech recognition (AVSR), but their high computational demands and sensitivity to token granularity limit their practicality in resource-constrained settings. Token compression methods can reduce inference cost, but they require fixing a compression rate in advance and produce a single fixed-length output, offering no flexibility to balance information density and efficiency at inference time. Matryoshka representation learning (MRL) addresses this by enabling a single model to operate across multiple token granularities, allowing compression rates to be adjusted dynamically. However, current MRL-based methods treat each scale independently during training, limiting cross-scale generalization, robustness at high compression, and interpretability. To overcome these limitations, we propose MoME (Mixture of Matryoshka Experts), a novel framework that integrates sparse Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. MoME augments a frozen LLM with top-k routed and shared experts, allowing dynamic capacity allocation across scales and modalities. A shared router promotes consistent expert activation across granularities, enabling compressed sequences to benefit from representations learned at lower compression. Experiments on LRS2 and LRS3 demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR, and VSR tasks, while requiring significantly fewer parameters and maintaining robustness under noise. MoME unifies the adaptability of MRL with the efficiency of MoE, offering a scalable and interpretable solution for resource-aware speech recognition.
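To make the architecture described above concrete, here is a minimal PyTorch sketch of an adapter with top-k routed experts plus an always-on shared expert attached to a frozen LLM via a residual connection. The expert form (bottleneck MLPs), sizes, expert count, and placement are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch (not the authors' code): top-k routed experts plus a shared
# expert, added to frozen LLM hidden states through a residual connection.
# Hidden sizes, expert count, and expert architecture are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoMEAdapterSketch(nn.Module):
    def __init__(self, d_model=1024, d_hidden=256, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Lightweight bottleneck experts trained alongside the frozen LLM weights.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        # Shared expert applied to every token regardless of routing.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )
        # One router reused across all Matryoshka compression scales.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                              # x: (batch, seq, d_model)
        logits = self.router(x)                        # (batch, seq, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        routed = torch.zeros_like(x)
        # Dense loop for clarity; real MoE layers dispatch tokens sparsely.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)
                routed = routed + mask * weights[..., k:k+1] * expert(x)
        return x + routed + self.shared_expert(x)      # residual update of LLM states
```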
Community
We introduce Mixture of Matryoshka Experts (MoME), which unifies Matryoshka Representation Learning with sparse Mixture-of-Experts for Audio-Visual Speech Recognition. MoME augments frozen LLMs with top-k routed and shared experts, enabling dynamic capacity allocation across modalities and granularities while capturing global, cross-modal, and scale-invariant knowledge.
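A hedged illustration of the shared-router idea follows: one router scores tokens at every Matryoshka granularity, so expert selection stays comparable as the sequence is compressed. Average pooling as the compression operator, the tensor shapes, and the expert count are assumptions for the sake of the example.

```python
# Sketch only: the same router is applied to nested Matryoshka granularities.
# Average pooling as the token compressor and all sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

router = nn.Linear(1024, 4)                  # one router shared by all scales
tokens = torch.randn(2, 96, 1024)            # e.g. fused audio-visual tokens

for rate in (1, 2, 4):                       # Matryoshka compression rates
    pooled = F.avg_pool1d(tokens.transpose(1, 2), kernel_size=rate).transpose(1, 2)
    top2 = router(pooled).topk(2, dim=-1).indices
    print(f"rate {rate}: {pooled.shape[1]} tokens, top-2 expert ids shape {tuple(top2.shape)}")
```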
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models (2025)
- Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion (2025)
- OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation (2025)
- VOX-KRIKRI: Unifying Speech and Language through Continuous Fusion (2025)
- GLAD: Global-Local Aware Dynamic Mixture-of-Experts for Multi-Talker ASR (2025)
- Growing Visual Generative Capacity for Pre-Trained MLLMs (2025)
- MoLEx: Mixture of LoRA Experts in Speech Self-Supervised Models for Audio Deepfake Detection (2025)