initial commit

Files changed (7)

.gitattributes +37 -0
README.md +226 -0
assets/datasets_hours.png +0 -0
assets/rtf_table.png +0 -0
assets/wer_table.png +3 -0
assets/wer_table_all.png +0 -0
linto_stt_fr_fastconformer.nemo +3 -0

.gitattributes ADDED

	@@ -0,0 +1,37 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+linto_stt_fr_fastconformer.nemo filter=lfs diff=lfs merge=lfs -text
+assets/wer_table.png filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,226 @@

+---
+license: cc-by-4.0
+datasets:
+- mozilla-foundation/common_voice_17_0
+- facebook/multilingual_librispeech
+- facebook/voxpopuli
+- datasets-CNRS/PFC
+- datasets-CNRS/CFPP
+- datasets-CNRS/CLAPI
+- gigant/african_accented_french
+- google/fleurs
+- datasets-CNRS/lesvocaux
+- datasets-CNRS/ACSYNT
+- medkit/simsamu
+language:
+- fr
+metrics:
+- wer
+base_model:
+- nvidia/stt_fr_fastconformer_hybrid_large_pc
+pipeline_tag: automatic-speech-recognition
+tags:
+- automatic-speech-recognition
+- speech
+- audio
+- Transducer
+- FastConformer
+- CTC
+- Transformer
+- pytorch
+- NeMo
+library_name: nemo
+model-index:
+- name: linto_stt_fr_fastconformer
+  results:
+  - task:
+      name: Automatic Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: common-voice-18-0
+      type: mozilla-foundation/common_voice_18_0
+      config: fr
+      split: test
+      args:
+        language: fr
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 9.10
+  - task:
+      name: Automatic Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: Multilingual LibriSpeech
+      type: facebook/multilingual_librispeech
+      config: french
+      split: test
+      args:
+        language: fr
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 4.70
+  - task:
+      name: Automatic Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: Vox Populi
+      type: facebook/voxpopuli
+      config: french
+      split: test
+      args:
+        language: fr
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 10.76
+  - task:
+      name: Automatic Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: SUMM-RE
+      type: linagora/SUMM-RE
+      config: french
+      split: test
+      args:
+        language: fr
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 23.52
+---
+# LinTO STT French – FastConformer
+<style>
+img {
+ display: inline;
+}
+</style>
+[![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transducer_CTC-lightgrey#model-badge)](#model-architecture)
+[![Model size](https://img.shields.io/badge/Params-115M-lightgrey#model-badge)](#model-architecture)
+[![Language](https://img.shields.io/badge/Language-fr-lightgrey#model-badge)](#datasets)
+---
+## Overview
+This model is a fine-tuned version of the [NVIDIA French FastConformer Hybrid Large model](https://huggingface.co/nvidia/stt_fr_fastconformer_hybrid_large_pc). It is a large (115M parameters) hybrid ASR model trained with both **Transducer (default)** and **CTC** losses.
+Compared to the base model, this version:
+- Does **not** include punctuation or uppercase letters.
+- Was trained on **9,500+ hours** of diverse, manually transcribed French speech.
+---
+## Performance
+The evaluation code is available in the [ASR Benchmark repository](https://github.com/linagora-labs/asr_benchmark).
+### Word Error Rate (WER)
+WER was computed **without punctuation or uppercase letters**, and the datasets were cleaned.
+The [SUMM-RE dataset](https://huggingface.co/datasets/linagora/SUMM-RE) is the only one used **exclusively for evaluation**, meaning neither model saw it during training.
+Evaluation can take a long time (especially for Whisper), so we used a subset of the test split for most datasets:
+- 15% of CommonVoice
+- 33% of Multilingual LibriSpeech
+- 33% of SUMM-RE
+- 33% of VoxPopuli
+![WER table](https://huggingface.co/linagora/linto_stt_fr_fastconformer/resolve/main/assets/wer_table.png)
+### Real-Time Factor (RTF)
+RTFX (the inverse of RTF) measures how many seconds of audio can be transcribed per second of processing time.
+Evaluation:
+- Hardware: Laptop with NVIDIA RTX 4090
+- Input: 5 audio files (~2 minutes each) from the ACSYNT corpus
+![RTF table](https://huggingface.co/linagora/linto_stt_fr_fastconformer/resolve/main/assets/rtf_table.png)
+---
+## Usage
+This model can be used with the [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) for both inference and fine-tuning.
+```python
+# Install nemo
+# !pip install nemo_toolkit['all']
+import nemo.collections.asr as nemo_asr
+model_name = "linagora/linto_stt_fr_fastconformer"
+asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)
+# Path to your 16kHz mono-channel audio file
+audio_path = "/path/to/your/audio/file"
+# Transcribe with the default Transducer decoder
+asr_model.transcribe([audio_path])
+# (Optional) Switch to CTC decoder
+asr_model.change_decoding_strategy(decoder_type="ctc")
+# (Optional) Transcribe with CTC decoder
+asr_model.transcribe([audio_path])
+```
+## Datasets
+The model was trained on over 9,500 hours of French speech, covering:
+- Read and spontaneous speech
+- Conversations and meetings
+- Varied accents and audio conditions
+![Datasets](https://huggingface.co/linagora/linto_stt_fr_fastconformer/resolve/main/assets/datasets_hours.png)
+Datasets Used (by size):
+- YouTubeFr: an internally curated corpus of CC0-licensed French-language videos sourced from YouTube. It will soon be available on the LeVoiceLab platform.
+- [YODAS](https://huggingface.co/datasets/espnet/yodas): fr000 subset
+- [Multilingual LibriSpeech](https://www.openslr.org/94/): French subset
+- [CommonVoice](https://commonvoice.mozilla.org/fr/datasets): French subset
+- [ESLO](http://eslo.huma-num.fr/index.php)
+- [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli): French subset
+- [Multilingual TEDx](https://www.openslr.org/100/): French subset
+- [TCOF](https://www.cnrtl.fr/corpus/tcof/)
+- CTF-AR (Corpus de Conversations Téléphoniques en Français avec Accents Régionaux): will soon be available on the LeVoiceLab platform
+- [PFC](https://www.ortolang.fr/market/corpora/pfc)
+- [OFROM](https://ofrom.unine.ch/index.php?page=citations)
+- CTFNN1 (Corpus de Conversations Téléphoniques en Français impliquant des accents Non-Natifs): will soon be available on the LeVoiceLab platform
+- [CFPP2000](https://www.ortolang.fr/market/corpora/cfpp2000)
+- [VOXFORGE](https://www.voxforge.org/)
+- [CLAPI](http://clapi.ish-lyon.cnrs.fr/)
+- [AfricanAccentedFrench](https://www.openslr.org/57/)
+- [FLEURS](https://huggingface.co/datasets/google/fleurs): French subset
+- [LesVocaux](https://www.ortolang.fr/market/corpora/lesvocaux/v0.0.1)
+- LINAGORA_Meetings
+- [CFPB](https://orfeo.ortolang.fr/annis-sample/cfpb/CFPB-1000-5.html)
+- [ACSYNT](https://www.ortolang.fr/market/corpora/sldr000832)
+- [PxSLU](https://arxiv.org/abs/2207.08292)
+- [SimSamu](https://huggingface.co/datasets/medkit/simsamu)
+## Limitations
+- May struggle with rare vocabulary, heavy accents, or overlapping/multi-speaker audio.
+- Outputs are lowercase only, with no punctuation, due to limitations in some training datasets.
+- A future version may include casing and punctuation support.
+## References
+[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
+[2] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)
+[3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
+## Acknowledgements
+Thanks to NVIDIA for providing the base model architecture and the NeMo framework.
+## Licence
+Licensed under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/).
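
The README above states that WER was computed without punctuation or uppercase letters. The following is a minimal sketch of that kind of scoring, assuming a simple regex-based normalizer; the actual cleaning rules are in the ASR Benchmark repository and may differ. `normalize` and `wer` are hypothetical helper names introduced here for illustration.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text, replace punctuation with spaces, and split into words.

    Illustrative rules only: apostrophes are kept so that French elisions
    such as "l'été" stay a single token.
    """
    text = text.lower()
    text = re.sub(r"[^\w']+", " ", text)
    return text.split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


# Punctuation and casing differences do not count as errors.
print(wer("Bonjour, le monde !", "bonjour le monde"))  # 0.0
```

Normalizing both reference and hypothesis in the same way keeps the comparison fair for a model that, like this one, emits lowercase text without punctuation.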

assets/datasets_hours.png ADDED

assets/rtf_table.png ADDED
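
The README describes RTFX as the inverse of RTF: seconds of audio transcribed per second of processing time. A small sketch of both quantities follows; the function names are hypothetical, and the numbers in the example are made-up placeholders, not the measurements shown in the table image.

```python
def rtf(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor: processing time per second of audio (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio per second of processing (higher is faster)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Placeholder example: 5 files of about 2 minutes each (600 s of audio),
# with a hypothetical total processing time of 12 s.
total_audio = 5 * 120.0
total_processing = 12.0
print(rtfx(total_audio, total_processing))  # 50.0
```

An RTFX above 1 means the model transcribes faster than real time on the given hardware.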

assets/wer_table.png ADDED

Git LFS Details

SHA256: 385bc228c799e2863243f0b821ea907b309b15b9588e839af22bb1fe49436c4e
Pointer size: 130 Bytes
Size of remote file: 91.5 kB

assets/wer_table_all.png ADDED

linto_stt_fr_fastconformer.nemo ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a301520a2c81b0f453aab7147d7f8becc11a3052aec0b84431371638529b8e92
+size 459233280
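
The three lines above form a Git LFS pointer: the spec version, the SHA-256 digest of the real file, and its size in bytes. Below is a minimal sketch, using only the Python standard library, of parsing such a pointer and checking a downloaded file against it. `parse_lfs_pointer` and `verify_download` are hypothetical helper names, not part of Git LFS or of this repository.

```python
import hashlib
import os


def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer into a dictionary."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def verify_download(path: str, pointer: dict[str, str]) -> bool:
    """Return True if the file at `path` matches the pointer's size and SHA-256."""
    algorithm, _, expected_digest = pointer["oid"].partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algorithm}")
    # Cheap check first: a size mismatch avoids hashing a large file.
    if os.path.getsize(path) != int(pointer["size"]):
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so a ~459 MB checkpoint is never fully in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_digest


POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:a301520a2c81b0f453aab7147d7f8becc11a3052aec0b84431371638529b8e92
size 459233280
"""
pointer = parse_lfs_pointer(POINTER_TEXT)
```

After downloading the checkpoint, `verify_download("linto_stt_fr_fastconformer.nemo", pointer)` would confirm that the local file is complete; the local path here is an assumption about where the file was saved.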