whisper-large-v3-ca-punctuated-3370h


Model Description
Intended Uses and Limitations
How to Get Started with the Model
Training Details
Citation
Additional Information

Model Description

"whisper-large-v3-ca-punctuated-3370h" is an acoustic model for Automatic Speech Recognition in Catalan. It is the result of fine-tuning the model "openai/whisper-large-v3" on a combination of Catalan data from Common Voice 17.0 (2,659 hours) and 710 hours of data released by Projecte AINA (Barcelona, Spain), totalling 3,369 hours and 53 minutes.

A key advantage of this model is that it was trained on meticulously transcribed data that includes punctuation and capitalization. As a result, its transcriptions preserve these features and are more structured and readable than the output of ASR models that produce unpunctuated, uncased text.

Intended Uses and Limitations

This model can be used for Automatic Speech Recognition (ASR) in Catalan. The model is intended to transcribe audio files in Catalan to plain text with punctuation and capitalization.
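For a single recording, a minimal sketch using the transformers pipeline API might look like the following. The helper names choose_device and transcribe_file and the 30-second chunk length are illustrative assumptions, not part of the original card. Decoding a file path this way also assumes that ffmpeg is available on the system.

```python
MODEL_NAME = "langtech-veu/whisper-large-v3-ca-punctuated-3370h"


def choose_device(cuda_available: bool) -> str:
    """Return a device string: the first GPU if one is available, else the CPU."""
    return "cuda:0" if cuda_available else "cpu"


def transcribe_file(path: str) -> str:
    """Transcribe one audio file (hypothetical helper, not from the model card)."""
    # Imported lazily so this module can be loaded without the heavy dependencies.
    import torch
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=MODEL_NAME,
        device=choose_device(torch.cuda.is_available()),
        chunk_length_s=30,  # assumption: split long recordings into 30 s windows
    )
    # The pipeline returns a dict whose "text" field holds the transcription.
    return asr(path)["text"]
```

Calling transcribe_file with the path to a local Catalan recording should return the transcription as a string with punctuation and capitalization. The first call downloads the model weights, so it can take a while.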

How to Get Started with the Model

To see a functional version of this code, please see our Notebook. To invoke this model, substitute the instances of "projecte-aina/whisper-large-v3-ca-3catparla" with "langtech-veu/whisper-large-v3-ca-punctuated-3370h".

Installation

In order to use this model, you may install datasets and transformers:

Create a virtual environment:

python -m venv /path/to/venv

Activate the environment:

source /path/to/venv/bin/activate

Install the modules:

pip install datasets transformers
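After installation, a quick way to confirm that the modules are visible inside the virtual environment is to query their installed versions. This is only a sketch using the standard library; the helper names installed_version and report are made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report(packages=("datasets", "transformers")) -> dict:
    """Map each package name to its installed version (None when absent)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, found in report().items():
        print(f"{name}: {found or 'not installed'}")
```

If either line reports "not installed", the environment is probably not activated or the pip command ran against a different interpreter.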

For Inference

In order to transcribe audio in Catalan using this model, you can follow this example:

#Install Prerequisites
pip install torch
pip install datasets
pip install 'transformers[torch]'
pip install evaluate
pip install jiwer

#This code requires a GPU (CUDA)

#Note (November 2024): load_metric is no longer part of datasets.
#Use evaluate's load instead.

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

#Load the processor and model.
MODEL_NAME="langtech-veu/whisper-large-v3-ca-punctuated-3370h"
processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME).to("cuda")

#Load the dataset
from datasets import load_dataset, Audio
ds=load_dataset("projecte-aina/parlament_parla",split='test')

#Downsample to 16kHz
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

#Process the dataset
def map_to_pred(batch):
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    batch["reference"] = processor.tokenizer._normalize(batch['normalized_text'])

    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]
    
    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)
    
    return batch
    
#Do the evaluation
result = ds.map(map_to_pred)

#Compute the overall WER now.
from evaluate import load

wer = load("wer")
WER=100 * wer.compute(references=result["reference"], predictions=result["prediction"])
print(WER)
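The score printed above is a word error rate: the word-level edit distance between reference and prediction, divided by the number of reference words and multiplied by 100. The following pure-Python sketch illustrates that definition on a toy sentence pair. It is not the implementation used by evaluate or jiwer, and the function name and example strings are made up for illustration.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the reference prefix seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from the prediction
            insertion = current[j - 1] + 1  # extra word in the prediction
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Toy example (made-up strings): one substituted word out of four.
print(100 * word_error_rate("bon dia a tothom", "bon dia a tots"))  # prints 25.0
```

Because the comparison is made word by word, differences in case or punctuation count as errors unless both sides are normalized first, which is why the evaluation above normalizes the reference and the prediction before scoring.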

Training Details

Training data

The specific datasets used to create the model are:

Common Voice 17.0
3CatParla (soon to be published)

Training procedure

This model is the result of fine-tuning the model "openai/whisper-large-v3" by following this tutorial provided by Hugging Face.

Training Hyperparameters

language: Catalan
hours of training audio: 3369 hours and 53 minutes
learning rate: 1e-5
sample rate: 16000 Hz
train batch size: 32 (x4 GPUs)
gradient accumulation steps: 1
eval batch size: 32
save total limit: 4
max steps: 77660
warmup steps: 7766
eval steps: 7766
save steps: 7766
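A few quantities follow directly from the values listed above. The sketch below works them out. It assumes that "32 (x4 GPUs)" means a batch of 32 on each of 4 GPUs, which the card does not state explicitly.

```python
# Values copied from the hyperparameter list above.
PER_DEVICE_BATCH = 32
NUM_GPUS = 4
GRAD_ACCUM_STEPS = 1
MAX_STEPS = 77660
WARMUP_STEPS = 7766
EVAL_STEPS = 7766
SAVE_TOTAL_LIMIT = 4

# Assumption: "32 (x4 GPUs)" means a batch of 32 on each of 4 GPUs.
effective_batch = PER_DEVICE_BATCH * NUM_GPUS * GRAD_ACCUM_STEPS
warmup_fraction = WARMUP_STEPS / MAX_STEPS
num_evaluations = MAX_STEPS // EVAL_STEPS
samples_seen = effective_batch * MAX_STEPS

print(f"effective batch size: {effective_batch}")
print(f"warmup fraction: {warmup_fraction:.0%}")
print(f"evaluations/checkpoints during training: {num_evaluations}")
print(f"checkpoints kept on disk at most: {SAVE_TOTAL_LIMIT}")
print(f"training samples processed: {samples_seen}")
```

Under that assumption, the warmup covers the first tenth of training, and an evaluation and a checkpoint are produced at each of the ten 7766-step boundaries, of which only the most recent four checkpoints are retained.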

Citation

If this model contributes to your research, please cite the work:

@misc{mena2025whisperpunctuated,
      title={Acoustic Model in Catalan: whisper-large-v3-ca-punctuated-3370h.}, 
      author={Hernandez Mena, Carlos Daniel},
      organization={Barcelona Supercomputing Center},
      url={https://huggingface.co/langtech-veu/whisper-large-v3-ca-punctuated-3370h},
      year={2025}
}

Additional Information

Author

The fine-tuning process was performed in April 2025 at the Language Technologies Laboratory of the Barcelona Supercomputing Center by Carlos Daniel Hernández Mena.

Contact

For further information, please send an email to [email protected].

Copyright

License

Apache-2.0

Funding

This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

The training of the model was possible thanks to the computing time provided by Barcelona Supercomputing Center through MareNostrum 5.


Evaluation results

Self-reported WER:

Mozilla Common Voice 17.0 (Test), test set: 5.500
Mozilla Common Voice 17.0 (Dev), validation set: 5.070
CV Benchmark Catalan Accents (Balearic fem): 6.060
CV Benchmark Catalan Accents (Balearic male): 5.340
CV Benchmark Catalan Accents (Central fem): 4.080
CV Benchmark Catalan Accents (Central male): 4.560
CV Benchmark Catalan Accents (Northern fem): 4.210
CV Benchmark Catalan Accents (Northern male): 4.280
CV Benchmark Catalan Accents (Northwestern fem): 3.870
CV Benchmark Catalan Accents (Northwestern male): 4.860
