|
--- |
|
license: cc-by-4.0 |
|
datasets: |
|
- mozilla-foundation/common_voice_17_0 |
|
- facebook/multilingual_librispeech |
|
- facebook/voxpopuli |
|
- datasets-CNRS/PFC |
|
- datasets-CNRS/CFPP |
|
- datasets-CNRS/CLAPI |
|
- gigant/african_accented_french |
|
- google/fleurs |
|
- datasets-CNRS/lesvocaux |
|
- datasets-CNRS/ACSYNT |
|
- medkit/simsamu |
|
language: |
|
- fr |
|
metrics: |
|
- wer |
|
base_model: |
|
- nvidia/stt_fr_fastconformer_hybrid_large_pc |
|
pipeline_tag: automatic-speech-recognition |
|
tags: |
|
- automatic-speech-recognition |
|
- speech |
|
- audio |
|
- Transducer |
|
- FastConformer |
|
- CTC |
|
- Transformer |
|
- pytorch |
|
- NeMo |
|
library_name: nemo |
|
model-index: |
|
- name: linto_stt_fr_fastconformer |
|
results: |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: Common Voice 18.0
|
type: mozilla-foundation/common_voice_18_0 |
|
config: fr |
|
split: test |
|
args: |
|
language: fr |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 8.96 |
|
- task: |
|
name: Automatic Speech Recognition

type: automatic-speech-recognition
|
dataset: |
|
name: Multilingual LibriSpeech |
|
type: facebook/multilingual_librispeech |
|
config: french |
|
split: test |
|
args: |
|
language: fr |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 4.7 |
|
- task: |
|
name: Automatic Speech Recognition

type: automatic-speech-recognition
|
dataset: |
|
name: VoxPopuli
|
type: facebook/voxpopuli |
|
config: french |
|
split: test |
|
args: |
|
language: fr |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 10.83 |
|
- task: |
|
name: Automatic Speech Recognition

type: automatic-speech-recognition
|
dataset: |
|
name: SUMM-RE |
|
type: linagora/SUMM-RE |
|
config: french |
|
split: test |
|
args: |
|
language: fr |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 23.5 |
|
--- |
|
# LinTO STT French – FastConformer |
|
|
|
<style> |
|
img { |
|
display: inline; |
|
} |
|
</style> |
|
|
|
[](#model-architecture) |
|
[](#model-architecture) |
|
[](#datasets) |
|
|
|
--- |
|
|
|
## Overview |
|
|
|
This model is a fine-tuned version of the [NVIDIA French FastConformer Hybrid Large model](https://huggingface.co/nvidia/stt_fr_fastconformer_hybrid_large_pc). |
|
It is a large (115M parameters) hybrid ASR model trained with both **Transducer (default)** and **CTC** losses. |
|
|
|
Compared to the base model, this version: |
|
- Does **not** output punctuation or uppercase letters.
|
- Was trained on **9,500+ hours** of diverse, manually transcribed French speech. |
|
|
|
--- |
|
|
|
## Performance |
|
|
|
The evaluation code is available in the [ASR Benchmark repository](https://github.com/linagora-labs/asr_benchmark). |
|
|
|
### Word Error Rate (WER) |
|
|
|
WER was computed **without punctuation or uppercase letters**, and the evaluation datasets were cleaned beforehand.

The [SUMM-RE dataset](https://huggingface.co/datasets/linagora/SUMM-RE) is the only one used **exclusively for evaluation**, meaning none of the evaluated models saw it during training.
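For reference, the scoring is roughly as sketched below: both reference and hypothesis are lowercased and stripped of punctuation before WER is computed. The `jiwer` library and the cleaning regex are illustrative assumptions; the exact rules live in the benchmark repository linked above.

```python
# Illustrative WER computation with lowercasing and punctuation stripping.
import re

import jiwer  # assumption: pip install jiwer

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # drop punctuation, keep apostrophes
    return re.sub(r"\s+", " ", text).strip()

reference = "Bonjour, comment allez-vous ?"
hypothesis = "bonjour comment allez vous"
print(jiwer.wer(normalize(reference), normalize(hypothesis)))  # 0.0
```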
|
|
|
Evaluation can be time-consuming (especially for Whisper), so we kept only segments longer than 1 second and used a subset of the test split for most datasets:
|
- 15% of CommonVoice: 2424 rows (3.9h) |
|
- 33% of MultiLingual LibriSpeech: 800 rows (3.3h) |
|
- 33% of SUMM-RE: 1004 rows (2h). Only segments longer than 4 seconds were kept to ensure quality.
|
- 33% of VoxPopuli: 678 rows (1.6h) |
|
- Multilingual TEDx: 972 rows (1.5h) |
|
- 50% of our internal YouTube corpus: 956 rows (1h)
|
|
|
 |
|
|
|
As shown in the table above (lower is better), the model demonstrates robust performance across all datasets, consistently achieving results close to the best. |
|
|
|
### Real-Time Factor (RTF) |
|
|
|
RTFX (the inverse of RTF) measures how many seconds of audio can be transcribed per second of processing time. |
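As a quick sanity check on the definition (the timings below are made up for illustration):

```python
# RTFX = seconds of audio transcribed per second of wall-clock processing time.
audio_duration_s = 120.0   # a ~2-minute file, as in the ACSYNT evaluation
processing_time_s = 1.5    # hypothetical wall-clock transcription time

rtfx = audio_duration_s / processing_time_s
rtf = 1 / rtfx             # the classic Real-Time Factor

print(f"RTFX = {rtfx:.0f}, RTF = {rtf:.4f}")  # RTFX = 80, RTF = 0.0125
```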
|
|
|
Evaluation: |
|
- Hardware: Laptop with NVIDIA RTX 4090 |
|
- Input: 5 audio files (~2 minutes each) from the ACSYNT corpus |
|
- Higher is better |
|
|
|
 |
|
|
|
--- |
|
|
|
## Usage |
|
|
|
This model can be used with the [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) for both inference and fine-tuning. |
|
|
|
```python |
|
# Install NeMo:

# !pip install nemo_toolkit['all']
|
|
|
import nemo.collections.asr as nemo_asr |
|
|
|
model_name = "linagora/linto_stt_fr_fastconformer" |
|
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name) |
|
|
|
# Path to your 16 kHz mono-channel audio file
|
audio_path = "/path/to/your/audio/file" |
|
|
|
# Transcribe with the default Transducer decoder
|
asr_model.transcribe([audio_path]) |
|
|
|
# (Optional) Switch to CTC decoder |
|
asr_model.change_decoding_strategy(decoder_type="ctc") |
|
|
|
# (Optional) Transcribe with CTC decoder |
|
asr_model.transcribe([audio_path]) |
|
``` |
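The model expects 16 kHz mono audio. If your files are in another format or sample rate, a conversion step along these lines can be applied first (`librosa` and `soundfile` are assumptions here, not dependencies of the model):

```python
# Convert an arbitrary audio file to 16 kHz mono WAV before transcription.
import librosa
import soundfile as sf

waveform, sr = librosa.load("/path/to/your/audio/file", sr=16000, mono=True)
sf.write("audio_16k_mono.wav", waveform, sr)

asr_model.transcribe(["audio_16k_mono.wav"])
```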
|
|
|
## Training Details |
|
|
|
The training code is available in the [nemo_asr_training repository](https://github.com/linagora-labs/nemo_asr_training). |
|
The full configuration used for fine-tuning is available [here](https://github.com/linagora-labs/nemo_asr_training/blob/main/fastconformer/yamls/nvidia_stt_fr_fastconformer_hybrid_large_pc.yaml). |
|
|
|
### Hardware |
|
- 1× NVIDIA H100 GPU (80 GB) |
|
|
|
### Training Configuration |
|
- Precision: BF16 mixed precision |
|
- Max training steps: 100,000 |
|
- Gradient accumulation: 4 batches |
|
|
|
### Tokenizer |
|
- Type: SentencePiece |
|
- Vocabulary size: 1,024 tokens |
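Assuming the model has been loaded as `asr_model` (see Usage above) and exposes the standard NeMo tokenizer interface, the tokenizer can be inspected directly; a minimal sketch:

```python
# Inspect the SentencePiece tokenizer shipped with the model.
print(asr_model.tokenizer.vocab_size)  # expected: 1024
print(asr_model.tokenizer.text_to_ids("bonjour tout le monde"))  # subword ids
```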
|
|
|
### Optimization |
|
- Optimizer: `AdamW` |
|
- Learning rate: `1e-5` |
|
- Betas: `[0.9, 0.98]` |
|
- Weight decay: `1e-3` |
|
- Scheduler: `CosineAnnealing` |
|
- Warmup steps: 10,000 |
|
- Minimum learning rate: `1e-6` |
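For readers outside the NeMo config system, a roughly equivalent schedule can be sketched in plain PyTorch (`model` is a placeholder; the authoritative values are in the linked YAML config):

```python
# Approximate PyTorch equivalent of the optimizer and LR schedule above.
import math
import torch

base_lr, min_lr = 1e-5, 1e-6
warmup_steps, max_steps = 10_000, 100_000

optimizer = torch.optim.AdamW(
    model.parameters(), lr=base_lr, betas=(0.9, 0.98), weight_decay=1e-3
)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / warmup_steps                       # linear warmup
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays 1 -> 0
    return (min_lr + (base_lr - min_lr) * cosine) / base_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```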
|
|
|
### Data Setup |
|
- 6 duration buckets (ranging from 0.1s to 30s) |
|
- Batch sizes per bucket (see the sketch after this list):
|
- Bucket 1 (shortest segments): batch size 80 |
|
- Bucket 2: batch size 76 |
|
- Bucket 3: batch size 72 |
|
- Bucket 4: batch size 68 |
|
- Bucket 5: batch size 64 |
|
- Bucket 6 (longest segments): batch size 60 |
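The sketch below illustrates the idea: shorter segments go to buckets with larger batch sizes so that GPU memory usage stays roughly even across batches. The bucket boundaries are hypothetical; the real ones are defined in the training config linked above.

```python
# Illustration of duration bucketing (boundaries are hypothetical).
import bisect

bucket_upper_bounds_s = [1.0, 3.0, 6.0, 10.0, 20.0, 30.0]
bucket_batch_sizes = [80, 76, 72, 68, 64, 60]

def bucket_index(duration_s: float) -> int:
    return min(bisect.bisect_left(bucket_upper_bounds_s, duration_s),
               len(bucket_upper_bounds_s) - 1)

print(bucket_batch_sizes[bucket_index(2.5)])  # -> 76 (bucket 2)
```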
|
|
|
### Training datasets |
|
|
|
The data were transformed, processed, and converted using [NeMo tools from the SSAK repository](https://github.com/linagora-labs/ssak/tree/main/tools/nemo).
|
|
|
The model was trained on over 9,500 hours of French speech, covering: |
|
- Read and spontaneous speech |
|
- Conversations and meetings |
|
- Varied accents and audio conditions |
|
|
|
 |
|
|
|
Datasets used (by size):
|
- YouTubeFr: an internally curated corpus of CC0-licensed French-language videos sourced from YouTube; it will soon be available on the LeVoiceLab platform
|
- [YODAS](https://huggingface.co/datasets/espnet/yodas): fr000 subset |
|
- [Multilingual LibriSpeech](https://www.openslr.org/94/): french subset |
|
- [CommonVoice](https://commonvoice.mozilla.org/fr/datasets): french subset |
|
- [ESLO](http://eslo.huma-num.fr/index.php) |
|
- [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli): french subset |
|
- [Multilingual TEDx](https://www.openslr.org/100/): french subset |
|
- [TCOF](https://www.cnrtl.fr/corpus/tcof/) |
|
- CTF-AR (Corpus de Conversations Téléphoniques en Français avec Accents Régionaux): will soon be available on the LeVoiceLab platform
|
- [PFC](https://www.ortolang.fr/market/corpora/pfc) |
|
- [OFROM](https://ofrom.unine.ch/index.php?page=citations) |
|
- CTFNN1 (Corpus de Conversations Téléphoniques en Français impliquant des accents Non-Natifs): will soon be available on the LeVoiceLab platform
|
- [CFPP2000](https://www.ortolang.fr/market/corpora/cfpp2000) |
|
- [VOXFORGE](https://www.voxforge.org/) |
|
- [CLAPI](http://clapi.ish-lyon.cnrs.fr/) |
|
- [AfricanAccentedFrench](https://www.openslr.org/57/) |
|
- [FLEURS](https://huggingface.co/datasets/google/fleurs): french subset |
|
- [LesVocaux](https://www.ortolang.fr/market/corpora/lesvocaux/v0.0.1) |
|
- LINAGORA_Meetings |
|
- [CFPB](https://orfeo.ortolang.fr/annis-sample/cfpb/CFPB-1000-5.html) |
|
- [ACSYNT](https://www.ortolang.fr/market/corpora/sldr000832) |
|
- [PxSLU](https://arxiv.org/abs/2207.08292) |
|
- [SimSamu](https://huggingface.co/datasets/medkit/simsamu) |
|
|
|
## Limitations |
|
|
|
- May struggle with rare vocabulary, heavy accents, or overlapping/multi-speaker audio. |
|
- Outputs are lowercase only, with no punctuation, due to limitations in some training datasets. |
|
- A future version may add casing and punctuation support.
|
|
|
## References |
|
|
|
[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084) |
|
|
|
[2] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece) |
|
|
|
[3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) |
|
|
|
## Acknowledgements |
|
|
|
Thanks to NVIDIA for providing the base model architecture and the NeMo framework. |
|
|
|
## License
|
|
|
The model is released under a [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license, in line with the licensing of the original model it was fine-tuned from. |