Papers
arxiv:2308.11596

SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

Published on Aug 22, 2023

Abstract

What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication

Community

  • Introduces SeamlessM4T (Massively Multilingual & Multimodal Machine Translation in a seamless manner): a single model for (speech or text)-to-(speech or text) across 100 languages; learns self-supervised speech representations with w2v-BERT 2.0 (from raw speech audio); builds the SeamlessAlign corpus (automatically aligned speech translations); achieves better BLEU scores than cascaded models (and is more general); also proposes the BLASER 2.0 evaluation metric (text-free S2ST evaluation); a minimal inference sketch follows this list. NLLB (No Language Left Behind) is text-to-text only (over 200 languages), while speech-to-speech translation (S2ST) coverage remains limited; cascaded pipelines (e.g., Whisper + NLLB + YourTTS) chain ASR (automatic speech recognition), T2TT, then TTS (text-to-speech), which does not scale.
  • Coverage: S2ST for 100-eng (100 languages into English) and eng-35, S2TT for 100-eng and eng-95, ASR for 96 languages, T2ST for 95-eng and eng-35, and T2TT for 95-eng and eng-95. Speech is richer (prosody, expressivity) and more uniform than text (which spans many scripts). Parallel data mining creates SeamlessAlign (470,000 hours of multimodal translations); language identification (LID) uses an ECAPA-TDNN architecture trained on VoxLingua107, plus a filtering step on top of LID (margin in probability); same text data as NLLB (via the Stopes library); web-crawled audio data (with a custom audio event detection/AED model); over-segmentation of audio into phrase/sentence-like chunks via voice activity detection (VAD). SONAR (sentence-level multimodal and language-agnostic representations): train a text embedding space, then extend it to speech via teacher-student distillation; a multilingual speech encoder is aligned with NLLB's multilingual text encoder and decoder into a shared, modality-agnostic space; best xsim mining scores (vs. LASER3 and LaBSE); teacher-student training choices: MSE loss, w2v-BERT 2.0 (over XLS-R), a 3-layer seq2seq decoder (instead of pooling), languages grouped by family; the speech encoder yields better BLEU than Whisper; mining runs nearest-neighbor search with FAISS (see the margin-scoring sketch after this list).
  • SeamlessM4T jointly optimizes X2T ((speech or text)-to-text) and UnitY (two-pass generation: text first, then speech); the multitask UnitY pipeline uses two encoders (a w2v-BERT 2.0 Conformer speech encoder and the SeamlessM4T-NLLB text encoder), an X2T transformer text decoder, a T2U (text-to-unit) transformer encoder, a unit decoder, and a unit vocoder (HiFi-GAN) for speech synthesis. Self-supervised pre-training of the speech encoder (w2v-BERT 2.0) on unlabeled speech audio (scalable to low-resource languages): contrastive learning of Gumbel vector quantization (GVQ) codebooks followed by masked prediction; the losses are contrastive (with codebook diversity), masked GVQ, and masked RPQ (random-projection quantizers). X2T is trained on S2TT data; data preparation covers preprocessing, pseudo-labeling, parallel data mining (to match ASR and T2TT scales), and filtering; the NLLB tokenizer is trained with SentencePiece using the BPE algorithm; T2TT follows the NLLB training pipeline implemented in Stopes. X2T fuses the speech encoder (w2v-BERT 2.0) with a length adaptor (M-adaptor), a text encoder, and a text decoder (NLLB/SeamlessM4T) into an end-to-end model trained with next-token-prediction losses for S2TT and T2TT plus knowledge distillation via KL divergence (see the distillation sketch after this list); stage 1 covers English ASR and into-English S2TT, stage 2 adds non-English ASR and from-English S2TT. S2ST stacks X2T with the transformer T2U encoder and unit decoder, followed by the HiFi-GAN unit vocoder; discrete acoustic units plus a multilingual vocoder.
  • Comparisons/benchmarks against cascaded models (BLEU scores): better than Whisper (including Large-v2) + NLLB and AudioPaLM on FLEURS S2TT; better than Whisper (+ NLLB) + YourTTS on FLEURS and CVSS S2ST; better than XLS-R and Whisper, and comparable with the (larger) AudioPaLM, on CoVoST 2 S2TT X-eng (and best on eng-X); better than NLLB, Whisper, and MMS models on FLEURS ASR and FLORES T2TT. Includes ablations (quantization, codebooks, masked prediction objective) and per-language analysis. Also evaluated with the modality-agnostic automatic metric BLASER 2.0 (built on SONAR embeddings); human evaluation follows the IWSLT evaluation campaign and WMT conference protocols. SeamlessM4T-Large is more robust than Whisper-Large-v2 (S2TT and ASR under music and natural noise): higher BLEU and lower WER (an ASR-BLEU sketch follows this list).
  • Bias and ethical evaluations at scale: toxicity detection on the HolisticBias dataset and gender bias analysis (via pronouns), with toxicity and bias broken down by category (ability, age, body, race, ethnicity, religion) across languages; a wordlist-style added-toxicity sketch follows this list. Also analyzes limitations and future work. The appendix covers fairseq2, data statistics, and the SeamlessM4T model card. From Meta AI and UC Berkeley.
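
As a concrete entry point, here is a minimal inference sketch against the open-sourced seamless_communication repo. The `Translator` class, the card names `seamlessM4T_large`/`vocoder_36langs`, and the `predict` signature follow the repo's initial README and may differ in later releases:

```python
import torch
from seamless_communication.models.inference import Translator

# Load the multitask model and the unit vocoder (card names from the initial release).
translator = Translator(
    "seamlessM4T_large",
    vocoder_name_or_card="vocoder_36langs",
    device=torch.device("cuda:0"),
)

# S2ST: translate input speech into French speech (the intermediate text comes along too).
translated_text, wav, sr = translator.predict("input.wav", "s2st", "fra")

# T2TT: plain text-to-text translation, English into French.
translated_text, _, _ = translator.predict("Hello, world!", "t2tt", "fra", src_lang="eng")
```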
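
The margin-based scoring behind the SeamlessAlign mining can be sketched with FAISS over SONAR-style sentence embeddings. This is a generic ratio-margin scorer, not the actual Stopes pipeline; `k=4` and the acceptance threshold are illustrative:

```python
import numpy as np
import faiss

def mine_pairs(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4):
    """Ratio-margin scoring for mining:
    margin(x, y) = cos(x, y) / ((avgNN(x) + avgNN(y)) / 2),
    where avgNN is the mean cosine similarity to the k nearest neighbors."""
    src = np.ascontiguousarray(src_emb, dtype="float32")
    tgt = np.ascontiguousarray(tgt_emb, dtype="float32")
    faiss.normalize_L2(src)  # after L2 normalization, inner product == cosine
    faiss.normalize_L2(tgt)

    src_index = faiss.IndexFlatIP(src.shape[1])
    src_index.add(src)
    tgt_index = faiss.IndexFlatIP(tgt.shape[1])
    tgt_index.add(tgt)

    sim_s2t, nn_s2t = tgt_index.search(src, k)  # k-NN of each source among targets
    sim_t2s, _ = src_index.search(tgt, k)       # k-NN of each target among sources

    avg_src = sim_s2t.mean(axis=1)
    avg_tgt = sim_t2s.mean(axis=1)

    best = nn_s2t[:, 0]  # closest target candidate per source
    margin = sim_s2t[:, 0] / ((avg_src + avg_tgt[best]) / 2.0)
    return best, margin  # keep pairs whose margin clears a tuned threshold (just above 1.0)
```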
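
A hedged sketch of the X2T fine-tuning objective described above: next-token cross-entropy on the reference plus token-level KL distillation from the T2TT teacher into the S2TT student. The `alpha` weighting and temperature `T` are illustrative, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def x2t_distillation_loss(student_logits, teacher_logits, target_ids, pad_id,
                          alpha=0.5, T=1.0):
    """student_logits/teacher_logits: (batch, seq, vocab); target_ids: (batch, seq)."""
    mask = target_ids.ne(pad_id)  # ignore padding positions

    # Next-token-prediction loss against the reference translation.
    nll = F.cross_entropy(student_logits.transpose(1, 2), target_ids,
                          reduction="none")[mask].mean()

    # Token-level KL between the (frozen) teacher and student output distributions.
    s_logp = F.log_softmax(student_logits / T, dim=-1)
    t_prob = F.softmax(teacher_logits.detach() / T, dim=-1)
    kl = F.kl_div(s_logp, t_prob, reduction="none").sum(-1)[mask].mean() * T * T

    return (1 - alpha) * nll + alpha * kl
```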
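
The ASR-BLEU numbers for S2ST can be reproduced schematically: transcribe the generated target speech with an ASR system, then score the transcripts against text references with sacreBLEU. The `transcribe` callable here is a stand-in (e.g., a Whisper wrapper), and the paper additionally normalizes transcripts before scoring:

```python
import sacrebleu

def asr_bleu(generated_wavs, references, transcribe):
    """ASR-BLEU: BLEU computed on ASR transcripts of the generated speech."""
    hypotheses = [transcribe(wav) for wav in generated_wavs]
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```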
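
The added-toxicity evaluation can be illustrated with a deliberately minimal wordlist check in the spirit of the paper's detectors: flag toxic items that appear in the translation but not in the source (real detectors use curated per-language toxicity lists and more careful matching):

```python
def added_toxicity(src_tokens, hyp_tokens, toxic_words):
    """Toxic items present in the hypothesis but absent from the source,
    i.e., toxicity 'added' by the translation model (wordlist-based sketch).
    Assumes toxic_words is a set of lowercased entries."""
    src_hits = {t.lower() for t in src_tokens} & toxic_words
    hyp_hits = {t.lower() for t in hyp_tokens} & toxic_words
    return hyp_hits - src_hits
```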

Links: Website, Blog (Publication post, Resource), arxiv, HuggingFace Space, GitHub (SONAR, fairseq2, stopes)
