
NVIDIA Conformer-Transducer Large (ca-es)


Summary

"stt_ca-es_conformer_transducer_large" is an acoustic model based on "NVIDIA/stt_es_conformer_transducer_large", suitable for bilingual Catalan-Spanish Automatic Speech Recognition.

Model Description

This model transcribes speech and was fine-tuned on a bilingual ca-es dataset comprising 4,000 hours of audio. It is a "large" variant of Conformer-Transducer, with around 120 million parameters. We expanded its tokenizer vocabulary to 5.5k tokens to include lowercase characters, uppercase characters, and punctuation. See the model architecture section and the NeMo documentation for complete architecture details.

Intended Uses and Limitations

This model can be used for Automatic Speech Recognition (ASR) in Catalan and Spanish. It is intended to transcribe audio files in Catalan and Spanish to plain text with punctuation.

Installation

To use this model, install NVIDIA NeMo. We recommend you install it after you've installed the latest PyTorch version.

pip install "nemo_toolkit[all]"

For Inference

To transcribe audio in Catalan or in Spanish using this model, you can follow this example:

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("PraxySante/stt_ca-es_conformer_transducer_large_fine_tuned")
transcription = asr_model.transcribe(["file.wav"])[0].text
print(transcription)

Training Details

Training data

The model was fine-tuned on bilingual datasets in Catalan and Spanish, for a total of 4,000 hours.

Training procedure

This model is the result of fine-tuning the model "projecte-aina/stt_ca-es_conformer_transducer_large".

Results

The results were calculated on the validation dataset, a held-out split from the same data used for training. This split contains 174 hours of audio for each language.

Spanish: WER 0.08, CER 0.04
Catalan: WER 0.10, CER 0.05
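For reference, WER (word error rate) and CER (character error rate) are both Levenshtein edit distances normalized by the reference length, computed over words and characters respectively. The sketch below is an illustrative reimplementation of these metrics, not the exact evaluation code used for the figures above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

ref = "bon dia a tothom"
hyp = "bon dia a tots"
print(wer(ref, hyp))  # 1 substituted word out of 4 -> 0.25
```

A WER of 0.08 therefore means roughly 8 word-level edits (substitutions, insertions, or deletions) per 100 reference words.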

Model Updates

We plan to provide regular updates to this model and its documentation. Future releases may include more performant versions as we continue to improve training strategies, expand dataset coverage, and incorporate community feedback.


Evaluation results

Self-reported, on a combined test set (Parlament-Parla-v1, MLS, Voxpopuli, etc.):

  • WER (Spanish): 0.080
  • CER (Spanish): 0.040
  • WER (Catalan): 0.100
  • CER (Catalan): 0.050