Massively Multilingual Speech (MMS) - Finetuned ASR - ALL

This checkpoint is a model fine-tuned for multilingual ASR and is part of Facebook's Massively Multilingual Speech (MMS) project. It is based on the Wav2Vec2 architecture and uses adapter models to transcribe 1000+ languages. The checkpoint has 1 billion parameters and was fine-tuned from facebook/mms-1b on 1162 languages.

Example

This MMS checkpoint can be used with Transformers to transcribe audio in 1162 different languages. Let's look at a simple example.

First, we install transformers and a few other libraries:

pip install torch accelerate torchaudio datasets
pip install --upgrade transformers

Note: To use MMS you need transformers >= 4.30 installed. If version 4.30 is not yet available on PyPI, install transformers from source:

pip install git+https://github.com/huggingface/transformers.git

Next, we load a couple of audio samples via datasets. Make sure that the audio data is sampled at 16,000 Hz (16 kHz).

from datasets import load_dataset, Audio

# English
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]

# French
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "fr", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
fr_sample = next(iter(stream_data))["audio"]["array"]
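
For local WAV files, as opposed to the streamed dataset above, the sampling rate can be checked before transcription. The sketch below uses only Python's standard `wave` module; the helper names `wav_sampling_rate` and `needs_resampling` are hypothetical and not part of Transformers or datasets.

```python
import wave

TARGET_SAMPLING_RATE = 16_000  # Hz, the rate this checkpoint expects


def wav_sampling_rate(path: str) -> int:
    """Return the sampling rate (in Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, target: int = TARGET_SAMPLING_RATE) -> bool:
    """Return True if the file's rate differs from the target rate."""
    return wav_sampling_rate(path) != target
```

If `needs_resampling` returns True, the audio should be resampled to 16,000 Hz before it is passed to the processor, for example with the `cast_column` call shown above when the file is loaded through datasets.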

Next, we load the model and processor:

from transformers import Wav2Vec2ForCTC, AutoProcessor
import torch

model_id = "facebook/mms-1b-all"

processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

Now we process the audio data, pass it to the model, and transcribe the model output, just as we usually do for Wav2Vec2 models such as facebook/wav2vec2-base-960h:

inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

ids = torch.argmax(outputs, dim=-1)[0]
transcription = processor.decode(ids)
# 'joe keton disapproved of films and buster also had reservations about the media'
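
The last two steps above, taking the argmax per frame and calling processor.decode, amount to greedy CTC decoding. The following is a minimal pure-Python sketch of that idea, not the library's actual implementation. The toy vocabulary is hypothetical, and the blank token `<pad>` and word delimiter `|` are assumptions based on the usual Wav2Vec2 CTC convention.

```python
from itertools import groupby


def ctc_greedy_decode(ids, id_to_token, blank="<pad>", word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and join the rest into text."""
    # 1. Merge runs of identical consecutive ids into a single id.
    collapsed = [key for key, _ in groupby(ids)]
    # 2. Map ids to tokens and drop the CTC blank token.
    tokens = [id_to_token[i] for i in collapsed]
    tokens = [t for t in tokens if t != blank]
    # 3. Join characters and turn word delimiters into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary and per-frame argmax ids, for illustration only.
toy_vocab = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "l", 5: "o"}
frame_ids = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 2, 3]

print(ctc_greedy_decode(frame_ids, toy_vocab))  # hello he
```

The blank between the two runs of `l` is what lets the double letter in "hello" survive the collapsing step.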

We can now keep the same model in memory and simply switch out the language adapters by calling the convenient load_adapter() function for the model and set_target_lang() for the tokenizer. Both take the target language code as input, here "fra" for French.

processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

inputs = processor(fr_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

ids = torch.argmax(outputs, dim=-1)[0]
transcription = processor.decode(ids)
# "ce dernier est volé tout au long de l'histoire romaine"

In the same way, the language can be switched to any other supported language. To list the supported language codes, inspect the keys of the tokenizer vocabulary:

processor.tokenizer.vocab.keys()
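
Because the tokenizer and the model adapter have to be switched together, a small helper can keep the two in sync and reject unknown codes early. This is only a sketch: `switch_language` is a hypothetical name, not a Transformers function, and it relies solely on the three calls shown above (`tokenizer.vocab`, `set_target_lang()` and `load_adapter()`).

```python
def switch_language(processor, model, lang):
    """Point the tokenizer and the model at the same language adapter."""
    # The tokenizer vocabulary is keyed by language code (see above).
    if lang not in processor.tokenizer.vocab.keys():
        raise ValueError(f"Unsupported language code: {lang!r}")
    # Switch the output vocabulary first, then load the matching adapter.
    processor.tokenizer.set_target_lang(lang)
    model.load_adapter(lang)
```

With the processor and model loaded earlier, `switch_language(processor, model, "fra")` would replace the two separate calls in the French example.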

For more details, please have a look at the official docs.

Supported Languages

This model supports 1162 languages. The list below gives all supported languages of this checkpoint by ISO 639-3 code. You can find more details about the languages and their ISO 639-3 codes in the MMS Language Coverage Overview.


abi
abk
abp
aca
acd
ace
acf
ach
acn
acr
acu
ade
adh
adj
adx
aeu
afr
agd
agg
agn
agr
agu
agx
aha
ahk
aia
aka
akb
ake
akp
alj
alp
alt
alz
ame
amf
amh
ami
amk
ann
any
aoz
apb
apr
ara
arl
asa
asg
asm
ast
ata
atb
atg
ati
atq
ava
avn
avu
awa
awb
ayo
ayr
ayz
azb
azg
azj-script_cyrillic
azj-script_latin
azz
bak
bam
ban
bao
bas
bav
bba
bbb
bbc
bbo
bcc-script_arabic
bcc-script_latin
bcl
bcw
bdg
bdh
bdq
bdu
bdv
beh
bel
bem
ben
bep
bex
bfa
bfo
bfy
bfz
bgc
bgq
bgr
bgt
bgw
bha
bht
bhz
bib
bim
bis
biv
bjr
bjv
bjw
bjz
bkd
bkv
blh
blt
blx
blz
bmq
bmr
bmu
bmv
bng
bno
bnp
boa
bod
boj
bom
bor
bos
bov
box
bpr
bps
bqc
bqi
bqj
bqp
bre
bru
bsc
bsq
bss
btd
bts
btt
btx
bud
bul
bus
bvc
bvz
bwq
bwu
byr
bzh
bzi
bzj
caa
cab
cac-dialect_sanmateoixtatan
cac-dialect_sansebastiancoatan
cak-dialect_central
cak-dialect_santamariadejesus
cak-dialect_santodomingoxenacoj
cak-dialect_southcentral
cak-dialect_western
cak-dialect_yepocapa
cap
car
cas
cat
cax
cbc
cbi
cbr
cbs
cbt
cbu
cbv
cce
cco
cdj
ceb
ceg
cek
ces
cfm
cgc
che
chf
chv
chz
cjo
cjp
cjs
ckb
cko
ckt
cla
cle
cly
cme
cmn-script_simplified
cmo-script_khmer
cmo-script_latin
cmr
cnh
cni
cnl
cnt
coe
cof
cok
con
cot
cou
cpa
cpb
cpu
crh
crk-script_latin
crk-script_syllabics
crn
crq
crs
crt
csk
cso
ctd
ctg
cto
ctu
cuc
cui
cuk
cul
cwa
cwe
cwt
cya
cym
daa
dah
dan
dar
dbj
dbq
ddn
ded
des
deu
dga
dgi
dgk
dgo
dgr
dhi
did
dig
dik
dip
div
djk
dnj-dialect_blowowest
dnj-dialect_gweetaawueast
dnt
dnw
dop
dos
dsh
dso
dtp
dts
dug
dwr
dyi
dyo
dyu
dzo
eip
eka
ell
emp
enb
eng
enx
epo
ese
ess
est
eus
evn
ewe
eza
fal
fao
far
fas
fij
fin
flr
fmu
fon
fra
frd
fry
ful
gag-script_cyrillic
gag-script_latin
gai
gam
gau
gbi
gbk
gbm
gbo
gde
geb
gej
gil
gjn
gkn
gld
gle
glg
glk
gmv
gna
gnd
gng
gof-script_latin
gog
gor
gqr
grc
gri
grn
grt
gso
gub
guc
gud
guh
guj
guk
gum
guo
guq
guu
gux
gvc
gvl
gwi
gwr
gym
gyr
had
hag
hak
hap
hat
hau
hay
heb
heh
hif
hig
hil
hin
hlb
hlt
hne
hnn
hns
hoc
hoy
hrv
hsb
hto
hub
hui
hun
hus-dialect_centralveracruz
hus-dialect_westernpotosino
huu
huv
hvn
hwc
hye
hyw
iba
ibo
icr
idd
ifa
ifb
ife
ifk
ifu
ify
ign
ikk
ilb
ilo
imo
ina
inb
ind
iou
ipi
iqw
iri
irk
isl
ita
itl
itv
ixl-dialect_sangasparchajul
ixl-dialect_sanjuancotzal
ixl-dialect_santamarianebaj
izr
izz
jac
jam
jav
jbu
jen
jic
jiv
jmc
jmd
jpn
jun
juy
jvn
kaa
kab
kac
kak
kam
kan
kao
kaq
kat
kay
kaz
kbo
kbp
kbq
kbr
kby
kca
kcg
kdc
kde
kdh
kdi
kdj
kdl
kdn
kdt
kea
kek
ken
keo
ker
key
kez
kfb
kff-script_telugu
kfw
kfx
khg
khm
khq
kia
kij
kik
kin
kir
kjb
kje
kjg
kjh
kki
kkj
kle
klu
klv
klw
kma
kmd
kml
kmr-script_arabic
kmr-script_cyrillic
kmr-script_latin
kmu
knb
kne
knf
knj
knk
kno
kog
kor
kpq
kps
kpv
kpy
kpz
kqe
kqp
kqr
kqy
krc
kri
krj
krl
krr
krs
kru
ksb
ksr
kss
ktb
ktj
kub
kue
kum
kus
kvn
kvw
kwd
kwf
kwi
kxc
kxf
kxm
kxv
kyb
kyc
kyf
kyg
kyo
kyq
kyu
kyz
kzf
lac
laj
lam
lao
las
lat
lav
law
lbj
lbw
lcp
lee
lef
lem
lew
lex
lgg
lgl
lhu
lia
lid
lif
lin
lip
lis
lit
lje
ljp
llg
lln
lme
lnd
lns
lob
lok
lom
lon
loq
lsi
lsm
ltz
luc
lug
luo
lwo
lww
lzz
maa-dialect_sanantonio
maa-dialect_sanjeronimo
mad
mag
mah
mai
maj
mak
mal
mam-dialect_central
mam-dialect_northern
mam-dialect_southern
mam-dialect_western
maq
mar
maw
maz
mbb
mbc
mbh
mbj
mbt
mbu
mbz
mca
mcb
mcd
mco
mcp
mcq
mcu
mda
mdf
mdv
mdy
med
mee
mej
men
meq
met
mev
mfe
mfh
mfi
mfk
mfq
mfy
mfz
mgd
mge
mgh
mgo
mhi
mhr
mhu
mhx
mhy
mib
mie
mif
mih
mil
mim
min
mio
mip
miq
mit
miy
miz
mjl
mjv
mkd
mkl
mkn
mlg
mlt
mmg
mnb
mnf
mnk
mnw
mnx
moa
mog
mon
mop
mor
mos
mox
moz
mpg
mpm
mpp
mpx
mqb
mqf
mqj
mqn
mri
mrw
msy
mtd
mtj
mto
muh
mup
mur
muv
muy
mvp
mwq
mwv
mxb
mxq
mxt
mxv
mya
myb
myk
myl
myv
myx
myy
mza
mzi
mzj
mzk
mzm
mzw
nab
nag
nan
nas
naw
nca
nch
ncj
ncl
ncu
ndj
ndp
ndv
ndy
ndz
neb
new
nfa
nfr
nga
ngl
ngp
ngu
nhe
nhi
nhu
nhw
nhx
nhy
nia
nij
nim
nin
nko
nlc
nld
nlg
nlk
nmz
nnb
nno
nnq
nnw
noa
nob
nod
nog
not
npi
npl
npy
nso
nst
nsu
ntm
ntr
nuj
nus
nuz
nwb
nxq
nya
nyf
nyn
nyo
nyy
nzi
obo
oci
ojb-script_latin
ojb-script_syllabics
oku
old
omw
onb
ood
orm
ory
oss
ote
otq
ozm
pab
pad
pag
pam
pan
pao
pap
pau
pbb
pbc
pbi
pce
pcm
peg
pez
pib
pil
pir
pis
pjt
pkb
pls
plw
pmf
pny
poh-dialect_eastern
poh-dialect_western
poi
pol
por
poy
ppk
pps
prf
prk
prt
pse
pss
ptu
pui
pus
pwg
pww
pxm
qub
quc-dialect_central
quc-dialect_east
quc-dialect_north
quf
quh
qul
quw
quy
quz
qvc
qve
qvh
qvm
qvn
qvo
qvs
qvw
qvz
qwh
qxh
qxl
qxn
qxo
qxr
rah
rai
rap
rav
raw
rej
rel
rgu
rhg
rif-script_arabic
rif-script_latin
ril
rim
rjs
rkt
rmc-script_cyrillic
rmc-script_latin
rmo
rmy-script_cyrillic
rmy-script_latin
rng
rnl
roh-dialect_sursilv
roh-dialect_vallader
rol
ron
rop
rro
rub
ruf
rug
run
rus
sab
sag
sah
saj
saq
sas
sat
sba
sbd
sbl
sbp
sch
sck
sda
sea
seh
ses
sey
sgb
sgj
sgw
shi
shk
shn
sho
shp
sid
sig
sil
sja
sjm
sld
slk
slu
slv
sml
smo
sna
snd
sne
snn
snp
snw
som
soy
spa
spp
spy
sqi
sri
srm
srn
srp-script_cyrillic
srp-script_latin
srx
stn
stp
suc
suk
sun
sur
sus
suv
suz
swe
swh
sxb
sxn
sya
syl
sza
tac
taj
tam
tao
tap
taq
tat
tav
tbc
tbg
tbk
tbl
tby
tbz
tca
tcc
tcs
tcz
tdj
ted
tee
tel
tem
teo
ter
tes
tew
tex
tfr
tgj
tgk
tgl
tgo
tgp
tha
thk
thl
tih
tik
tir
tkr
tlb
tlj
tly
tmc
tmf
tna
tng
tnk
tnn
tnp
tnr
tnt
tob
toc
toh
tom
tos
tpi
tpm
tpp
tpt
trc
tri
trn
trs
tso
tsz
ttc
tte
ttq-script_tifinagh
tue
tuf
tuk-script_arabic
tuk-script_latin
tuo
tur
tvw
twb
twe
twu
txa
txq
txu
tye
tzh-dialect_bachajon
tzh-dialect_tenejapa
tzj-dialect_eastern
tzj-dialect_western
tzo-dialect_chamula
tzo-dialect_chenalho
ubl
ubu
udm
udu
uig-script_arabic
uig-script_cyrillic
ukr
umb
unr
upv
ura
urb
urd-script_arabic
urd-script_devanagari
urd-script_latin
urk
urt
ury
usp
uzb-script_cyrillic
uzb-script_latin
vag
vid
vie
vif
vmw
vmy
vot
vun
vut
wal-script_ethiopic
wal-script_latin
wap
war
waw
way
wba
wlo
wlx
wmw
wob
wol
wsg
wwa
xal
xdy
xed
xer
xho
xmm
xnj
xnr
xog
xon
xrb
xsb
xsm
xsr
xsu
xta
xtd
xte
xtm
xtn
xua
xuo
yaa
yad
yal
yam
yao
yas
yat
yaz
yba
ybb
ycl
ycn
yea
yka
yli
yor
yre
yua
yue-script_traditional
yuz
yva
zaa
zab
zac
zad
zae
zai
zam
zao
zaq
zar
zas
zav
zaw
zca
zga
zim
ziw
zlm
zmz
zne
zos
zpc
zpg
zpi
zpl
zpm
zpo
zpt
zpu
zpz
ztq
zty
zul
zyb
zyp
zza
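
The codes in the list take one of three shapes: a bare ISO 639-3 code (`fra`), a code with a script suffix (`azj-script_latin`), or a code with a dialect suffix (`cak-dialect_central`). The sketch below splits a code into those parts. The `LanguageCode` type and `parse_language_code` function are hypothetical helpers, and the format is inferred from the list itself rather than from a documented specification.

```python
from typing import NamedTuple, Optional


class LanguageCode(NamedTuple):
    iso: str                 # three-letter ISO 639-3 code, e.g. "azj"
    kind: Optional[str]      # "script", "dialect", or None for a bare code
    variant: Optional[str]   # e.g. "latin" or "central", or None


def parse_language_code(code: str) -> LanguageCode:
    """Split a code such as 'azj-script_latin' into its parts."""
    iso, sep, suffix = code.partition("-")
    if not sep:
        # Bare ISO 639-3 code with no script or dialect suffix.
        return LanguageCode(iso, None, None)
    kind, _, variant = suffix.partition("_")
    return LanguageCode(iso, kind, variant)


print(parse_language_code("azj-script_latin"))
# LanguageCode(iso='azj', kind='script', variant='latin')
```

The full code string, suffix included, is what gets passed to set_target_lang() and load_adapter(); the parsed parts are only useful for grouping or display.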

Model details

Developed by: Vineel Pratap et al.
Model type: Multi-Lingual Automatic Speech Recognition model
Language(s): 1000+ languages, see supported languages
License: CC-BY-NC 4.0 license
Num parameters: 1 billion
Audio sampling rate: 16,000 Hz

Cite as:

@article{pratap2023mms,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
  journal={arXiv},
  year={2023}
}