---
license: cc-by-4.0
datasets:
- mozilla-foundation/common_voice_17_0
- facebook/multilingual_librispeech
- facebook/voxpopuli
- datasets-CNRS/PFC
- datasets-CNRS/CFPP
- datasets-CNRS/CLAPI
- gigant/african_accented_french
- google/fleurs
- datasets-CNRS/lesvocaux
- datasets-CNRS/ACSYNT
- medkit/simsamu
language:
- fr
metrics:
- wer
base_model:
- nvidia/stt_fr_fastconformer_hybrid_large_pc
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- FastConformer
- CTC
- Transformer
- pytorch
- NeMo
library_name: nemo
model-index:
- name: linto_stt_fr_fastconformer
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-18-0
      type: mozilla-foundation/common_voice_18_0
      config: fr
      split: test
      args:
        language: fr
    metrics:
    - name: Test WER
      type: wer
      value: 8.96
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech
      type: facebook/multilingual_librispeech
      config: french
      split: test
      args:
        language: fr
    metrics:
    - name: Test WER
      type: wer
      value: 4.7
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: french
      split: test
      args:
        language: fr
    metrics:
    - name: Test WER
      type: wer
      value: 10.83
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: SUMM-RE
      type: linagora/SUMM-RE
      config: french
      split: test
      args:
        language: fr
    metrics:
    - name: Test WER
      type: wer
      value: 23.5
---

# LinTO STT French – FastConformer

[![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transducer_CTC-lightgrey#model-badge)](#model-architecture)
[![Model size](https://img.shields.io/badge/Params-115M-lightgrey#model-badge)](#model-architecture)
[![Language](https://img.shields.io/badge/Language-fr-lightgrey#model-badge)](#datasets)

---

## Overview

This model is a fine-tuned version of the
[NVIDIA French FastConformer Hybrid Large model](https://huggingface.co/nvidia/stt_fr_fastconformer_hybrid_large_pc).
It is a large (115M parameters) hybrid ASR model trained with both **Transducer (default)** and **CTC** losses.

Compared to the base model, this version:

- Does **not** output punctuation or uppercase letters.
- Was trained on **9,500+ hours** of diverse, manually transcribed French speech.

---

## Performance

The evaluation code is available in the [ASR Benchmark repository](https://github.com/linagora-labs/asr_benchmark).

### Word Error Rate (WER)

WER was computed **without punctuation or uppercase letters**, and the datasets were cleaned beforehand.
The [SUMM-RE dataset](https://huggingface.co/datasets/linagora/SUMM-RE) is the only one used **exclusively for evaluation**, meaning neither model saw it during training.

Evaluations can take a very long time (especially for Whisper), so we selected only segments with a duration over 1 second and used a subset of the test split for most datasets:

- 15% of CommonVoice: 2424 rows (3.9h)
- 33% of MultiLingual LibriSpeech: 800 rows (3.3h)
- 33% of SUMM-RE: 1004 rows (2h). We selected only segments above 4 seconds to ensure quality.
- 33% of VoxPopuli: 678 rows (1.6h)
- Multilingual TEDx: 972 rows (1.5h)
- 50% of our internal YouTube corpus: 956 rows (1h)

![WER table](https://huggingface.co/linagora/linto_stt_fr_fastconformer/resolve/main/assets/wer_table.png)

As shown in the table above (lower is better), the model demonstrates robust performance across all datasets, consistently achieving results close to the best.

### Real-Time Factor (RTF)

RTFX (the inverse of RTF) measures how many seconds of audio can be transcribed per second of processing time.
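
As a minimal sketch of the definition above, RTF and RTFX can be computed from an audio duration and a wall-clock processing time. The function names and the timing values below are hypothetical and chosen for illustration only; they are not measurements from this model.

```python
def rtf(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor: processing time per second of audio (lower is better)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio per second of processing (higher is better)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Hypothetical example: 600 s of audio transcribed in 6 s of processing time.
total_audio = 600.0
elapsed = 6.0
print(rtfx(total_audio, elapsed))  # 100.0
print(rtf(total_audio, elapsed))   # 0.01
```

An RTFX of 100 would mean that one second of processing covers 100 seconds of audio.
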

Evaluation:

- Hardware: Laptop with NVIDIA RTX 4090
- Input: 5 audio files (~2 minutes each) from the ACSYNT corpus
- Higher is better

![RTF table](https://huggingface.co/linagora/linto_stt_fr_fastconformer/resolve/main/assets/rtf_table.png)

---

## Usage

This model can be used with the [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) for both inference and fine-tuning.

```python
# Install NeMo
# !pip install nemo_toolkit['all']

import nemo.collections.asr as nemo_asr

model_name = "linagora/linto_stt_fr_fastconformer"
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)

# Path to your 16kHz mono-channel audio file
audio_path = "/path/to/your/audio/file"

# Transcribe with the default Transducer decoder
asr_model.transcribe([audio_path])

# (Optional) Switch to the CTC decoder
asr_model.change_decoding_strategy(decoder_type="ctc")

# (Optional) Transcribe with the CTC decoder
asr_model.transcribe([audio_path])
```

## Training Details

The training code is available in the [nemo_asr_training repository](https://github.com/linagora-labs/nemo_asr_training).
The full configuration used for fine-tuning is available [here](https://github.com/linagora-labs/nemo_asr_training/blob/main/fastconformer/yamls/nvidia_stt_fr_fastconformer_hybrid_large_pc.yaml).

### Hardware

- 1× NVIDIA H100 GPU (80 GB)

### Training Configuration

- Precision: BF16 mixed precision
- Max training steps: 100,000
- Gradient accumulation: 4 batches

### Tokenizer

- Type: SentencePiece
- Vocabulary size: 1,024 tokens

### Optimization

- Optimizer: `AdamW`
  - Learning rate: `1e-5`
  - Betas: `[0.9, 0.98]`
  - Weight decay: `1e-3`
- Scheduler: `CosineAnnealing`
  - Warmup steps: 10,000
  - Minimum learning rate: `1e-6`

### Data Setup

- 6 duration buckets (ranging from 0.1s to 30s)
- Batch sizes per bucket:
  - Bucket 1 (shortest segments): batch size 80
  - Bucket 2: batch size 76
  - Bucket 3: batch size 72
  - Bucket 4: batch size 68
  - Bucket 5: batch size 64
  - Bucket 6 (longest segments): batch size 60

### Training datasets

The data were transformed, processed and converted using [NeMo tools from the SSAK repository](https://github.com/linagora-labs/ssak/tree/main/tools/nemo).

The model was trained on over 9,500 hours of French speech, covering:

- Read and spontaneous speech
- Conversations and meetings
- Varied accents and audio conditions

![Datasets](https://huggingface.co/linagora/linto_stt_fr_fastconformer/resolve/main/assets/datasets_hours.png)

Datasets Used (by size):

- YouTubeFr: an internally curated corpus of CC0-licensed French-language videos sourced from YouTube.
  Will soon be available on the LeVoiceLab platform.
- [YODAS](https://huggingface.co/datasets/espnet/yodas): fr000 subset
- [Multilingual LibriSpeech](https://www.openslr.org/94/): French subset
- [CommonVoice](https://commonvoice.mozilla.org/fr/datasets): French subset
- [ESLO](http://eslo.huma-num.fr/index.php)
- [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli): French subset
- [Multilingual TEDx](https://www.openslr.org/100/): French subset
- [TCOF](https://www.cnrtl.fr/corpus/tcof/)
- CTF-AR (Corpus de Conversations Téléphoniques en Français avec Accents Régionaux): will soon be available on the LeVoiceLab platform
- [PFC](https://www.ortolang.fr/market/corpora/pfc)
- [OFROM](https://ofrom.unine.ch/index.php?page=citations)
- CTFNN1 (Corpus de Conversations Téléphoniques en Français impliquant des accents Non-Natifs): will soon be available on the LeVoiceLab platform
- [CFPP2000](https://www.ortolang.fr/market/corpora/cfpp2000)
- [VOXFORGE](https://www.voxforge.org/)
- [CLAPI](http://clapi.ish-lyon.cnrs.fr/)
- [AfricanAccentedFrench](https://www.openslr.org/57/)
- [FLEURS](https://huggingface.co/datasets/google/fleurs): French subset
- [LesVocaux](https://www.ortolang.fr/market/corpora/lesvocaux/v0.0.1)
- LINAGORA_Meetings
- [CFPB](https://orfeo.ortolang.fr/annis-sample/cfpb/CFPB-1000-5.html)
- [ACSYNT](https://www.ortolang.fr/market/corpora/sldr000832)
- [PxSLU](https://arxiv.org/abs/2207.08292)
- [SimSamu](https://huggingface.co/datasets/medkit/simsamu)

## Limitations

- May struggle with rare vocabulary, heavy accents, or overlapping/multi-speaker audio.
- Outputs are lowercase only, with no punctuation, due to limitations in some training datasets.
- A future version may include casing and punctuation support.

## References

[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

[2] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)

[3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

## Acknowledgements

Thanks to NVIDIA for providing the base model architecture and the NeMo framework.

## Licence

The model is released under a [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license, in line with the licensing of the original model it was fine-tuned from.