metadata
license: cc-by-4.0
datasets:
  - mozilla-foundation/common_voice_17_0
  - facebook/multilingual_librispeech
  - facebook/voxpopuli
  - datasets-CNRS/PFC
  - datasets-CNRS/CFPP
  - datasets-CNRS/CLAPI
  - gigant/african_accented_french
  - google/fleurs
  - datasets-CNRS/lesvocaux
  - datasets-CNRS/ACSYNT
  - medkit/simsamu
language:
  - fr
metrics:
  - wer
base_model:
  - nvidia/stt_fr_fastconformer_hybrid_large_pc
pipeline_tag: automatic-speech-recognition
tags:
  - automatic-speech-recognition
  - speech
  - audio
  - Transducer
  - FastConformer
  - CTC
  - Transformer
  - pytorch
  - NeMo
library_name: nemo
model-index:
  - name: linto_stt_fr_fastconformer
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: common-voice-18-0
          type: mozilla-foundation/common_voice_18_0
          config: fr
          split: test
          args:
            language: fr
        metrics:
          - name: Test WER
            type: wer
            value: 8.96
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Multilingual LibriSpeech
          type: facebook/multilingual_librispeech
          config: french
          split: test
          args:
            language: fr
        metrics:
          - name: Test WER
            type: wer
            value: 4.7
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Vox Populi
          type: facebook/voxpopuli
          config: french
          split: test
          args:
            language: fr
        metrics:
          - name: Test WER
            type: wer
            value: 10.83
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: SUMM-RE
          type: linagora/SUMM-RE
          config: french
          split: test
          args:
            language: fr
        metrics:
          - name: Test WER
            type: wer
            value: 23.5

LinTO STT French – FastConformer


Overview

This model is a fine-tuned version of the NVIDIA French FastConformer Hybrid Large model. It is a large (115M parameters) hybrid ASR model trained with both Transducer (default) and CTC losses.

Compared to the base model, this version:

  • Does not include punctuation or uppercase letters.
  • Was trained on 9,500+ hours of diverse, manually transcribed French speech.

Performance

The evaluation code is available in the ASR Benchmark repository.

Word Error Rate (WER)

WER was computed on lowercase text with punctuation removed, and the datasets were cleaned. The SUMM-RE dataset is the only one used exclusively for evaluation, meaning neither model saw it during training.

Evaluations can be very long (especially for Whisper), so we selected only segments longer than 1 second and used a subset of the test split for most datasets (a scoring sketch follows the list below):

  • 15% of CommonVoice: 2424 rows (3.9h)
  • 33% of Multilingual LibriSpeech: 800 rows (3.3h)
  • 33% of SUMM-RE: 1004 rows (2h). We selected only segments above 4 seconds to ensure quality.
  • 33% of VoxPopuli: 678 rows (1.6h)
  • Multilingual TEDx: 972 rows (1.5h)
  • 50% of our internal YouTube corpus: 956 rows (1h)
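
For reference, here is a minimal sketch of the scoring step, assuming the jiwer package and a simple lowercase / punctuation-stripping normalization (the exact cleaning rules are those of the ASR Benchmark repository):

# Illustrative WER computation; the actual benchmark applies additional
# dataset-specific cleaning on top of this normalization.
import re
import string

import jiwer  # pip install jiwer

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(rf"[{re.escape(string.punctuation)}]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

references = ["Bonjour, comment allez-vous ?"]
hypotheses = ["bonjour comment allez vous"]

wer = jiwer.wer(
    [normalize(r) for r in references],
    [normalize(h) for h in hypotheses],
)
print(f"WER: {wer:.2%}")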

WER table

As shown in the table above (lower is better), the model demonstrates robust performance across all datasets, consistently achieving results close to the best.

Real-Time Factor (RTF)

RTFX (the inverse of RTF) measures how many seconds of audio can be transcribed per second of processing time.
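
As an illustration, RTFX can be measured from wall-clock transcription time; the sketch below assumes the soundfile package for reading the audio duration and loads the model as in the Usage section:

# Illustrative RTFX measurement: seconds of audio transcribed per second of compute.
import time

import soundfile as sf
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="linagora/linto_stt_fr_fastconformer"
)

audio_path = "/path/to/your/audio/file.wav"  # 16kHz mono-channel audio
audio_duration = sf.info(audio_path).duration  # seconds of audio

start = time.perf_counter()
asr_model.transcribe([audio_path])
elapsed = time.perf_counter() - start  # seconds of processing

print(f"RTFX = {audio_duration / elapsed:.1f}")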

Evaluation:

  • Hardware: Laptop with NVIDIA RTX 4090
  • Input: 5 audio files (~2 minutes each) from the ACSYNT corpus
  • Higher is better

RTF table


Usage

This model can be used with the NVIDIA NeMo Toolkit for both inference and fine-tuning.

# Install nemo
# !pip install nemo_toolkit['all']

import nemo.collections.asr as nemo_asr

model_name = "linagora/linto_stt_fr_fastconformer"
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)

# Path to your 16kHz mono-channel audio file
audio_path = "/path/to/your/audio/file"

# Transcribe with the default Transducer decoder
asr_model.transcribe([audio_path])

# (Optional) Switch to CTC decoder
asr_model.change_decoding_strategy(decoder_type="ctc")

# (Optional) Transcribe with CTC decoder
asr_model.transcribe([audio_path])

Training Details

The training code is available in the nemo_asr_training repository.
The full configuration used for fine-tuning is available here.

Hardware

  • 1× NVIDIA H100 GPU (80 GB)

Training Configuration

  • Precision: BF16 mixed precision
  • Max training steps: 100,000
  • Gradient accumulation: 4 batches
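
These settings roughly correspond to a PyTorch Lightning trainer configuration like the sketch below (illustrative values only; the full fine-tuning configuration is linked above):

# Illustrative PyTorch Lightning (2.x) trainer settings matching the values above.
import pytorch_lightning as pl

trainer = pl.Trainer(
    devices=1,
    accelerator="gpu",
    precision="bf16-mixed",     # BF16 mixed precision
    max_steps=100_000,          # max training steps
    accumulate_grad_batches=4,  # gradient accumulation over 4 batches
)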

Tokenizer

  • Type: SentencePiece
  • Vocabulary size: 1,024 tokens
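
For illustration, a comparable tokenizer can be trained with the sentencepiece library (the actual tokenizer was built with NeMo tooling; the file names and the BPE model type below are assumptions):

# Illustrative SentencePiece training with a 1,024-token vocabulary.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train_text.txt",       # placeholder: one transcript per line
    model_prefix="fr_tokenizer",  # placeholder output prefix
    vocab_size=1024,
    model_type="bpe",             # assumption: BPE, as commonly used with NeMo
)

sp = spm.SentencePieceProcessor(model_file="fr_tokenizer.model")
print(sp.encode("bonjour tout le monde", out_type=str))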

Optimization

  • Optimizer: AdamW
    • Learning rate: 1e-5
    • Betas: [0.9, 0.98]
    • Weight decay: 1e-3
  • Scheduler: CosineAnnealing
    • Warmup steps: 10,000
    • Minimum learning rate: 1e-6
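
In NeMo, these hyperparameters are typically expressed as an optimizer config; below is a sketch, assuming the standard setup_optimization API of NeMo models:

# Illustrative NeMo optimizer/scheduler config mirroring the values above.
from omegaconf import OmegaConf

optim_config = OmegaConf.create({
    "name": "adamw",
    "lr": 1e-5,
    "betas": [0.9, 0.98],
    "weight_decay": 1e-3,
    "sched": {
        "name": "CosineAnnealing",
        "warmup_steps": 10000,
        "min_lr": 1e-6,
    },
})

# asr_model.setup_optimization(optim_config=optim_config)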

Data Setup

  • 6 duration buckets (ranging from 0.1s to 30s)
  • Batch sizes per bucket:
    • Bucket 1 (shortest segments): batch size 80
    • Bucket 2: batch size 76
    • Bucket 3: batch size 72
    • Bucket 4: batch size 68
    • Bucket 5: batch size 64
    • Bucket 6 (longest segments): batch size 60
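
A sketch of how such duration bucketing maps an utterance to a batch size (the bucket boundaries below are illustrative; only the overall 0.1s-30s range and the per-bucket batch sizes come from the training setup):

# Illustrative duration bucketing: pick a batch size from an utterance duration.
import bisect

bucket_upper_bounds = [2.0, 5.0, 10.0, 15.0, 20.0, 30.0]  # seconds (assumed boundaries)
batch_sizes = [80, 76, 72, 68, 64, 60]                    # shortest -> longest bucket

def batch_size_for(duration_s: float) -> int:
    if not 0.1 <= duration_s <= 30.0:
        raise ValueError("duration outside the 0.1s-30s training range")
    return batch_sizes[bisect.bisect_left(bucket_upper_bounds, duration_s)]

print(batch_size_for(1.2))   # -> 80 (bucket 1)
print(batch_size_for(25.0))  # -> 60 (bucket 6)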

Training datasets

The data were transformed, processed, and converted using NeMo tools from the SSAK repository.

The model was trained on over 9,500 hours of French speech, covering:

  • Read and spontaneous speech
  • Conversations and meetings
  • Varied accents and audio conditions

Datasets

The training corpora (also listed in the metadata above) include Common Voice, Multilingual LibriSpeech, VoxPopuli, PFC, CFPP, CLAPI, African Accented French, FLEURS, LesVocaux, ACSYNT, and SimSamu.

Limitations

  • May struggle with rare vocabulary, heavy accents, or overlapping/multi-speaker audio.
  • Outputs are lowercase only, with no punctuation, due to limitations in some training datasets.
  • A future version may include casing and punctuation support.

References

[1] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

[2] Google Sentencepiece Tokenizer

[3] NVIDIA NeMo Toolkit

Acknowledgements

Thanks to NVIDIA for providing the base model architecture and the NeMo framework.

License

The model is released under a CC-BY-4.0 license, in line with the licensing of the original model it was fine-tuned from.