---
license: cc-by-4.0
datasets:
- mozilla-foundation/common_voice_17_0
- facebook/multilingual_librispeech
- facebook/voxpopuli
- datasets-CNRS/PFC
- datasets-CNRS/CFPP
- datasets-CNRS/CLAPI
- gigant/african_accented_french
- google/fleurs
- datasets-CNRS/lesvocaux
- datasets-CNRS/ACSYNT
- medkit/simsamu
language:
- fr
metrics:
- wer
base_model:
- nvidia/stt_fr_fastconformer_hybrid_large_pc
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- FastConformer
- CTC
- Transformer
- pytorch
- NeMo
library_name: nemo
model-index:
- name: linto_stt_fr_fastconformer
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-18-0
      type: mozilla-foundation/common_voice_18_0
      config: fr
      split: test
      args:
        language: fr
    metrics:
    - name: Test WER
      type: wer
      value: 8.96
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech
      type: facebook/multilingual_librispeech
      config: french
      split: test
      args:
        language: fr
    metrics:
    - name: Test WER
      type: wer
      value: 4.7
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: french
      split: test
      args:
        language: fr
    metrics:
    - name: Test WER
      type: wer
      value: 10.83
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: SUMM-RE
      type: linagora/SUMM-RE
      config: french
      split: test
      args:
        language: fr
    metrics:
    - name: Test WER
      type: wer
      value: 23.5
---
# LinTO STT French – FastConformer

<style>
img {
 display: inline;
}
</style>

[![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transducer_CTC-lightgrey#model-badge)](#model-architecture)  
[![Model size](https://img.shields.io/badge/Params-115M-lightgrey#model-badge)](#model-architecture)  
[![Language](https://img.shields.io/badge/Language-fr-lightgrey#model-badge)](#datasets)

---

## Overview

This model is a fine-tuned version of the [NVIDIA French FastConformer Hybrid Large model](https://huggingface.co/nvidia/stt_fr_fastconformer_hybrid_large_pc). 
It is a large (115M parameters) hybrid ASR model trained with both **Transducer (default)** and **CTC** losses.

Compared to the base model, this version:
- Does **not** include punctuation or uppercase letters.
- Was trained on **9,500+ hours** of diverse, manually transcribed French speech.

---

## Performance

The evaluation code is available in the [ASR Benchmark repository](https://github.com/linagora-labs/asr_benchmark).

### Word Error Rate (WER)

WER was computed **without punctuation or uppercase letters** and datasets were cleaned. 
The [SUMM-RE dataset](https://huggingface.co/datasets/linagora/SUMM-RE) is the only one used **exclusively for evaluation**, meaning neither model saw it during training.

Evaluations can take a long time (especially for Whisper), so we kept only segments longer than 1 second and used a subset of the test split for most datasets:
- 15% of CommonVoice: 2424 rows (3.9h)
- 33% of Multilingual LibriSpeech: 800 rows (3.3h)
- 33% of SUMM-RE: 1004 rows (2h). We selected only segments above 4 seconds to ensure quality.
- 33% of VoxPopuli: 678 rows (1.6h)
- Multilingual TEDx: 972 rows (1.5h)
- 50% of our internal YouTube corpus: 956 rows (1h)

![WER table](https://huggingface.co/linagora/linto_stt_fr_fastconformer/resolve/main/assets/wer_table.png)

As shown in the table above (lower is better), the model demonstrates robust performance across all datasets, consistently achieving results close to the best.
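
The scoring described above (lowercase, no punctuation) can be illustrated with a minimal sketch. It is not the code from the ASR Benchmark repository: the `normalize` and `word_error_rate` helpers and the exact normalization rules (for example, keeping apostrophes) are assumptions made for illustration.

```python
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase the text, drop punctuation, and split it into words.

    Illustrative normalization only: apostrophes are kept so that
    French elisions such as "l'heure" stay a single token.
    """
    text = text.lower()
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") and ch != "'" else ch
        for ch in text
    )
    return cleaned.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

With this normalization, a hypothesis that differs from the reference only in casing and punctuation scores a WER of 0.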

### Real-Time Factor (RTF)

RTFX (the inverse of RTF) measures how many seconds of audio can be transcribed per second of processing time.

Evaluation:
- Hardware: Laptop with NVIDIA RTX 4090
- Input: 5 audio files (~2 minutes each) from the ACSYNT corpus
- Higher is better

![RTF table](https://huggingface.co/linagora/linto_stt_fr_fastconformer/resolve/main/assets/rtf_table.png)
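
As a rough sketch of how RTFX can be measured, the helper below times a transcription call and divides the total audio duration by the elapsed time. The function names and the `transcribe` callable are hypothetical; the benchmark behind the table may handle warm-up, batching, and timing differently.

```python
import time
from typing import Callable, Sequence


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of processing (inverse of RTF)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def measure_rtfx(
    transcribe: Callable[[Sequence[str]], object],
    audio_paths: Sequence[str],
    audio_seconds: float,
) -> float:
    """Time one call to `transcribe` and return the resulting RTFX.

    `audio_seconds` is the total duration of the files in `audio_paths`,
    which the caller is assumed to know in advance.
    """
    start = time.perf_counter()
    transcribe(audio_paths)
    elapsed = time.perf_counter() - start
    return rtfx(audio_seconds, elapsed)
```

For example, transcribing 600 seconds of audio in 3 seconds gives an RTFX of 200, which corresponds to an RTF of 0.005.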

---

## Usage

This model can be used with the [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) for both inference and fine-tuning.

```python
# Install nemo
# !pip install nemo_toolkit['all']

import nemo.collections.asr as nemo_asr

model_name = "linagora/linto_stt_fr_fastconformer"
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)

# Path to your 16kHz mono-channel audio file
audio_path = "/path/to/your/audio/file"

# Transcribe with default transducer decoder
asr_model.transcribe([audio_path])

# (Optional) Switch to CTC decoder
asr_model.change_decoding_strategy(decoder_type="ctc")

# (Optional) Transcribe with CTC decoder
asr_model.transcribe([audio_path])
```
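
The usage snippet expects a 16 kHz mono audio file. As a minimal sketch, assuming the input is an uncompressed WAV file, the standard-library `wave` module can check this before transcription. The helper name `check_wav_format` is hypothetical and not part of NeMo.

```python
import wave


def check_wav_format(path: str, expected_rate: int = 16000, expected_channels: int = 1) -> bool:
    """Return True if the WAV file at `path` has the expected rate and channel count."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == expected_rate
            and wav.getnchannels() == expected_channels
        )
```

Files that fail this check would need to be resampled or downmixed with an audio tool of your choice before being passed to `transcribe`.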

## Training Details

The training code is available in the [nemo_asr_training repository](https://github.com/linagora-labs/nemo_asr_training).  
The full configuration used for fine-tuning is available [here](https://github.com/linagora-labs/nemo_asr_training/blob/main/fastconformer/yamls/nvidia_stt_fr_fastconformer_hybrid_large_pc.yaml).

### Hardware
- 1× NVIDIA H100 GPU (80 GB)

### Training Configuration
- Precision: BF16 mixed precision  
- Max training steps: 100,000  
- Gradient accumulation: 4 batches  

### Tokenizer
- Type: SentencePiece  
- Vocabulary size: 1,024 tokens

### Optimization
- Optimizer: `AdamW`
  - Learning rate: `1e-5`
  - Betas: `[0.9, 0.98]`
  - Weight decay: `1e-3`
- Scheduler: `CosineAnnealing`
  - Warmup steps: 10,000
  - Minimum learning rate: `1e-6`
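
To give a feel for the schedule these values describe, here is a minimal sketch of linear warmup followed by cosine decay. It assumes the warmup starts from zero and the decay spans the steps between warmup and the maximum step count; NeMo's `CosineAnnealing` implementation may differ in such details, so treat the function as illustrative only.

```python
import math


def learning_rate(
    step: int,
    base_lr: float = 1e-5,
    min_lr: float = 1e-6,
    warmup_steps: int = 10_000,
    max_steps: int = 100_000,
) -> float:
    """Linear warmup to `base_lr`, then cosine decay down to `min_lr`."""
    if step < warmup_steps:
        # Assumed: warmup ramps linearly from 0 to base_lr.
        return base_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate peaks at `1e-5` at step 10,000 and reaches `1e-6` at step 100,000.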

### Data Setup
- 6 duration buckets (ranging from 0.1s to 30s)  
- Batch sizes per bucket:
  - Bucket 1 (shortest segments): batch size 80  
  - Bucket 2: batch size 76  
  - Bucket 3: batch size 72  
  - Bucket 4: batch size 68  
  - Bucket 5: batch size 64  
  - Bucket 6 (longest segments): batch size 60
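
Combined with the gradient accumulation of 4 batches listed under Training Configuration, these batch sizes give a rough idea of how many segments contribute to one optimizer step. The sketch below assumes, for simplicity, that all accumulated batches come from the same bucket; in practice batches from different buckets are likely mixed, so the real figure falls somewhere between the two extremes. The names are hypothetical.

```python
GRAD_ACCUMULATION = 4  # batches accumulated per optimizer step
BUCKET_BATCH_SIZES = [80, 76, 72, 68, 64, 60]  # bucket 1 (shortest) to bucket 6 (longest)


def samples_per_optimizer_step(bucket: int, accumulation: int = GRAD_ACCUMULATION) -> int:
    """Segments seen per optimizer step if every accumulated batch comes from `bucket` (1-based)."""
    if not 1 <= bucket <= len(BUCKET_BATCH_SIZES):
        raise ValueError(f"bucket must be between 1 and {len(BUCKET_BATCH_SIZES)}")
    return BUCKET_BATCH_SIZES[bucket - 1] * accumulation
```

This puts one optimizer step at between 240 segments (bucket 6 only) and 320 segments (bucket 1 only).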

### Training datasets

The data were transformed, processed and converted using [NeMo tools from the SSAK repository](https://github.com/linagora-labs/ssak/tree/main/tools/nemo).

The model was trained on over 9,500 hours of French speech, covering:
- Read and spontaneous speech
- Conversations and meetings
- Varied accents and audio conditions

![Datasets](https://huggingface.co/linagora/linto_stt_fr_fastconformer/resolve/main/assets/datasets_hours.png)

Datasets Used (by size):
- YouTubeFr: an internally curated corpus of CC0-licensed French-language videos sourced from YouTube. It will soon be available on the LeVoiceLab platform.
- [YODAS](https://huggingface.co/datasets/espnet/yodas): fr000 subset
- [Multilingual LibriSpeech](https://www.openslr.org/94/): French subset
- [CommonVoice](https://commonvoice.mozilla.org/fr/datasets): French subset
- [ESLO](http://eslo.huma-num.fr/index.php)
- [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli): French subset
- [Multilingual TEDx](https://www.openslr.org/100/): French subset
- [TCOF](https://www.cnrtl.fr/corpus/tcof/)
- CTF-AR (Corpus de Conversations Téléphoniques en Français avec Accents Régionaux): will soon be available on the LeVoiceLab platform
- [PFC](https://www.ortolang.fr/market/corpora/pfc)
- [OFROM](https://ofrom.unine.ch/index.php?page=citations)
- CTFNN1 (Corpus de Conversations Téléphoniques en Français impliquant des accents Non-Natifs): will soon be available on the LeVoiceLab platform
- [CFPP2000](https://www.ortolang.fr/market/corpora/cfpp2000)
- [VOXFORGE](https://www.voxforge.org/)
- [CLAPI](http://clapi.ish-lyon.cnrs.fr/)
- [AfricanAccentedFrench](https://www.openslr.org/57/)
- [FLEURS](https://huggingface.co/datasets/google/fleurs): French subset
- [LesVocaux](https://www.ortolang.fr/market/corpora/lesvocaux/v0.0.1)
- LINAGORA_Meetings
- [CFPB](https://orfeo.ortolang.fr/annis-sample/cfpb/CFPB-1000-5.html)
- [ACSYNT](https://www.ortolang.fr/market/corpora/sldr000832)
- [PxSLU](https://arxiv.org/abs/2207.08292)
- [SimSamu](https://huggingface.co/datasets/medkit/simsamu)

## Limitations

- May struggle with rare vocabulary, heavy accents, or overlapping/multi-speaker audio.
- Outputs are lowercase only, with no punctuation, due to limitations in some training datasets.
- A future version may include casing and punctuation support.

## References

[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

[2] [Google SentencePiece Tokenizer](https://github.com/google/sentencepiece)

[3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

## Acknowledgements

Thanks to NVIDIA for providing the base model architecture and the NeMo framework.

## Licence

The model is released under a [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license, in line with the licensing of the original model it was fine-tuned from.