AudranB committed 44e4584 · unverified · initial commit
.gitattributes ADDED
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
linto_stt_fr_fastconformer.nemo filter=lfs diff=lfs merge=lfs -text
assets/wer_table.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
---
license: cc-by-4.0
datasets:
- mozilla-foundation/common_voice_17_0
- facebook/multilingual_librispeech
- facebook/voxpopuli
- datasets-CNRS/PFC
- datasets-CNRS/CFPP
- datasets-CNRS/CLAPI
- gigant/african_accented_french
- google/fleurs
- datasets-CNRS/lesvocaux
- datasets-CNRS/ACSYNT
- medkit/simsamu
language:
- fr
metrics:
- wer
base_model:
- nvidia/stt_fr_fastconformer_hybrid_large_pc
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- FastConformer
- CTC
- Transformer
- pytorch
- NeMo
library_name: nemo
model-index:
- name: linto_stt_fr_fastconformer
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-18-0
      type: mozilla-foundation/common_voice_18_0
      config: fr
      split: test
      args:
        language: fr
    metrics:
    - name: Test WER
      type: wer
      value: 9.10
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech
      type: facebook/multilingual_librispeech
      config: french
      split: test
      args:
        language: fr
    metrics:
    - name: Test WER
      type: wer
      value: 4.70
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: french
      split: test
      args:
        language: fr
    metrics:
    - name: Test WER
      type: wer
      value: 10.76
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: SUMM-RE
      type: linagora/SUMM-RE
      config: french
      split: test
      args:
        language: fr
    metrics:
    - name: Test WER
      type: wer
      value: 23.52
---
# LinTO STT French – FastConformer

<style>
img {
    display: inline;
}
</style>

[![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transducer_CTC-lightgrey#model-badge)](#model-architecture)
[![Model size](https://img.shields.io/badge/Params-115M-lightgrey#model-badge)](#model-architecture)
[![Language](https://img.shields.io/badge/Language-fr-lightgrey#model-badge)](#datasets)

---

## Overview

This model is a fine-tuned version of the [NVIDIA French FastConformer Hybrid Large model](https://huggingface.co/nvidia/stt_fr_fastconformer_hybrid_large_pc). It is a large (115M-parameter) hybrid ASR model trained with both **Transducer (default)** and **CTC** losses.

Compared to the base model, this version:
- Does **not** output punctuation or uppercase letters.
- Was trained on **9,500+ hours** of diverse, manually transcribed French speech.

---

## Performance

The evaluation code is available in the [ASR Benchmark repository](https://github.com/linagora-labs/asr_benchmark).

### Word Error Rate (WER)

WER was computed **without punctuation or uppercase letters**, and the datasets were cleaned beforehand.
The [SUMM-RE dataset](https://huggingface.co/datasets/linagora/SUMM-RE) is the only one used **exclusively for evaluation**, meaning neither model saw it during training.

Full evaluations can take a long time (especially for Whisper), so we used a subset of the test split for most datasets:
- 15% of CommonVoice
- 33% of Multilingual LibriSpeech
- 33% of SUMM-RE
- 33% of VoxPopuli

![WER table](https://huggingface.co/linagora/linto_stt_fr_fastconformer/resolve/main/assets/wer_table.png)
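
The normalization applied before scoring can be sketched as follows. This is a minimal illustration, not the exact cleaning script from the benchmark repository; `normalize` and `wer` are hypothetical pure-Python helpers written for this example.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase and strip punctuation, mirroring the scoring setup."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

ref = normalize("Bonjour, comment ça va ?")   # -> "bonjour comment ça va"
hyp = normalize("bonjour comment sa va")
print(f"WER: {wer(ref, hyp):.2%}")            # -> WER: 25.00%
```

In practice a library such as `jiwer` would typically handle both steps, but the sketch shows why lowercase, punctuation-free references are needed to compare this model fairly against punctuated outputs.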

### Real-Time Factor (RTF)

RTFX (the inverse of RTF) measures how many seconds of audio can be transcribed per second of processing time.

Evaluation setup:
- Hardware: laptop with an NVIDIA RTX 4090
- Input: 5 audio files (~2 minutes each) from the ACSYNT corpus

![RTF table](https://huggingface.co/linagora/linto_stt_fr_fastconformer/resolve/main/assets/rtf_table.png)
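
To make the metric concrete, here is how RTFX relates to RTF (the numbers below are illustrative only, not the measured results from the table):

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of compute (inverse of RTF)."""
    return audio_seconds / processing_seconds

# e.g. five ~2-minute files (600 s of audio) transcribed in 5 s of compute
speed = rtfx(600.0, 5.0)
print(speed)       # -> 120.0 (120 s of audio per second of compute)
print(1 / speed)   # RTF ~ 0.0083, i.e. well under real time
```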

---

## Usage

This model can be used with the [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) for both inference and fine-tuning.

```python
# Install NeMo first:
# pip install nemo_toolkit['all']

import nemo.collections.asr as nemo_asr

model_name = "linagora/linto_stt_fr_fastconformer"
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)

# Path to your 16 kHz mono-channel audio file
audio_path = "/path/to/your/audio/file"

# Transcribe with the default Transducer decoder
asr_model.transcribe([audio_path])

# (Optional) Switch to the CTC decoder
asr_model.change_decoding_strategy(decoder_type="ctc")

# (Optional) Transcribe with the CTC decoder
asr_model.transcribe([audio_path])
```
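
Since the model expects 16 kHz mono input, it can help to verify a WAV file before transcription. This is a sketch using only Python's built-in `wave` module (`check_wav` is a hypothetical helper, not part of NeMo); actual resampling or downmixing would be done with a tool such as ffmpeg, librosa, or torchaudio.

```python
import wave

def check_wav(path: str, expected_rate: int = 16000) -> bool:
    """Return True if the WAV file is mono at the expected sample rate."""
    with wave.open(path, "rb") as w:
        channels, rate = w.getnchannels(), w.getframerate()
    ok = channels == 1 and rate == expected_rate
    if not ok:
        print(f"{path}: {channels} ch @ {rate} Hz - "
              f"convert to 1 ch @ {expected_rate} Hz before transcribing")
    return ok
```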

## Datasets

The model was trained on over 9,500 hours of French speech, covering:
- Read and spontaneous speech
- Conversations and meetings
- Varied accents and audio conditions

![Datasets](https://huggingface.co/linagora/linto_stt_fr_fastconformer/resolve/main/assets/datasets_hours.png)

Datasets used (by size):
- YouTubeFr: an internally curated corpus of CC0-licensed French-language videos sourced from YouTube; will soon be available on the LeVoiceLab platform
- [YODAS](https://huggingface.co/datasets/espnet/yodas): fr000 subset
- [Multilingual LibriSpeech](https://www.openslr.org/94/): French subset
- [CommonVoice](https://commonvoice.mozilla.org/fr/datasets): French subset
- [ESLO](http://eslo.huma-num.fr/index.php)
- [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli): French subset
- [Multilingual TEDx](https://www.openslr.org/100/): French subset
- [TCOF](https://www.cnrtl.fr/corpus/tcof/)
- CTF-AR (Corpus de Conversations Téléphoniques en Français avec Accents Régionaux): will soon be available on the LeVoiceLab platform
- [PFC](https://www.ortolang.fr/market/corpora/pfc)
- [OFROM](https://ofrom.unine.ch/index.php?page=citations)
- CTFNN1 (Corpus de Conversations Téléphoniques en Français impliquant des accents Non-Natifs): will soon be available on the LeVoiceLab platform
- [CFPP2000](https://www.ortolang.fr/market/corpora/cfpp2000)
- [VOXFORGE](https://www.voxforge.org/)
- [CLAPI](http://clapi.ish-lyon.cnrs.fr/)
- [AfricanAccentedFrench](https://www.openslr.org/57/)
- [FLEURS](https://huggingface.co/datasets/google/fleurs): French subset
- [LesVocaux](https://www.ortolang.fr/market/corpora/lesvocaux/v0.0.1)
- LINAGORA_Meetings
- [CFPB](https://orfeo.ortolang.fr/annis-sample/cfpb/CFPB-1000-5.html)
- [ACSYNT](https://www.ortolang.fr/market/corpora/sldr000832)
- [PxSLU](https://arxiv.org/abs/2207.08292)
- [SimSamu](https://huggingface.co/datasets/medkit/simsamu)

## Limitations

- May struggle with rare vocabulary, heavy accents, or overlapping/multi-speaker audio.
- Outputs are lowercase only, with no punctuation, due to limitations in some training datasets.
- A future version may include casing and punctuation support.

## References

[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

[2] [Google SentencePiece Tokenizer](https://github.com/google/sentencepiece)

[3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

## Acknowledgements

Thanks to NVIDIA for providing the base model architecture and the NeMo framework.

## Licence

Licensed under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/).
assets/datasets_hours.png ADDED
assets/rtf_table.png ADDED
assets/wer_table.png ADDED

Git LFS Details

  • SHA256: 385bc228c799e2863243f0b821ea907b309b15b9588e839af22bb1fe49436c4e
  • Pointer size: 130 Bytes
  • Size of remote file: 91.5 kB
assets/wer_table_all.png ADDED
linto_stt_fr_fastconformer.nemo ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:a301520a2c81b0f453aab7147d7f8becc11a3052aec0b84431371638529b8e92
size 459233280