Whisper-Tiny Portuguese - Common Voice Only (Baseline)

This model is a fine-tuned version of openai/whisper-tiny for Portuguese automatic speech recognition (ASR). It was trained exclusively on Common Voice 17.0 Portuguese without any synthetic data augmentation, serving as the baseline for evaluating the impact of synthetic speech on the smallest Whisper architecture.

Purpose

This baseline model establishes the performance of the Whisper-Tiny architecture (39M parameters) using only real, crowdsourced speech data. It serves as a reference point to evaluate:

The effectiveness of synthetic data augmentation for the smallest model architecture
The fundamental capacity limitations of compact ASR models
How performance scales relative to the Small and Large-v3 models

Key Finding: Unlike Large-v3 models, which improve significantly with synthetic data, Tiny models benefit only marginally: the best synthetic configuration lowers Test WER on Common Voice by 1.39 percentage points (30.72% to 29.33%). The paper states: "This modest gain offers limited justification for the additional data filtering and preprocessing overhead."

Model Details

Property	Value
Base Model	openai/whisper-tiny
Language	Portuguese (pt)
Task	Automatic Speech Recognition (transcribe)
Parameters	39M
Training Data	Common Voice 17.0 Portuguese (Real Speech Only)
Total Training Samples	21,866
Sampling Rate	16 kHz

Evaluation Results

This Model (whisper-tiny-cv-only-pt)

Metric	Value
Validation Loss	0.4463
Validation WER	27.05%
Test WER (Common Voice)	30.72%
Test WER (MLS)	45.83%
Best Checkpoint	Step 250
Max Training Steps	430
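
The card does not state how WER was computed or which text normalization was applied before scoring. As a reminder of what the metric measures, here is a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. The function name and the example sentences are illustrative and are not taken from the evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("o gato dorme no sofá", "o gato come no sofá"))
```

A Test WER of 30.72% therefore means roughly three word-level errors for every ten reference words. Casing and punctuation handling can shift the figure noticeably, so comparisons are only meaningful under the same normalization.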

Comparison with Synthetic Data Augmentation (Whisper-Tiny Portuguese)

Training Data	Max Steps	Val Loss	Val WER	Test WER (CV)	Test WER (MLS)
Common Voice Only (Baseline)	430	0.4463	27.05%	30.72%	45.83%
High-Quality (q ≥ 0.8) + CV	575	0.4481	26.74%	29.33%	44.18%
Mid-High (q ≥ 0.5) + CV	805	0.4550	26.95%	30.11%	47.25%
All Synthetic + CV	860	0.4517	28.06%	29.84%	46.54%
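
To make the table easier to read, the sketch below recomputes each configuration's change against the baseline, both in percentage points and relative to the baseline WER. The numbers are copied from the table above; the variable and function names are illustrative.

```python
# (Test WER on Common Voice, Test WER on MLS), in percent, from the table above.
RESULTS = {
    "Common Voice Only (Baseline)": (30.72, 45.83),
    "High-Quality (q >= 0.8) + CV": (29.33, 44.18),
    "Mid-High (q >= 0.5) + CV": (30.11, 47.25),
    "All Synthetic + CV": (29.84, 46.54),
}
BASELINE = RESULTS["Common Voice Only (Baseline)"]


def change_vs_baseline(wer: float, baseline: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = wer - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 2)


for name, (cv, mls) in RESULTS.items():
    cv_abs, cv_rel = change_vs_baseline(cv, BASELINE[0])
    mls_abs, mls_rel = change_vs_baseline(mls, BASELINE[1])
    print(f"{name}: CV {cv_abs:+.2f} pp ({cv_rel:+.2f}%), MLS {mls_abs:+.2f} pp ({mls_rel:+.2f}%)")
```

Negative values are improvements. Only the high-quality subset lowers WER on both test sets; the other two configurations improve on Common Voice but are worse than the baseline on MLS.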

Key Performance Characteristics

Fastest training: Fewest steps (430) among all Tiny configurations
Smallest dataset: Only 21,866 samples (no synthetic augmentation)
Reference baseline: 30.72% Test WER on Common Voice
Limited cross-domain: 45.83% MLS WER (challenging for Tiny architecture)

Why Synthetic Data Provides Limited Benefit for Tiny Models

The paper explains this architectural limitation:

"The Tiny and Small variants of Whisper exhibit only marginal benefits from synthetic data augmentation, revealing the limitations imposed by reduced model capacity. For instance, the Portuguese Whisper-Tiny model achieves its lowest test WER of 29.33% using the high-quality filtered subset, an improvement of just 1.39 percentage points over the Common Voice baseline of 30.72%."

Key Insight: Compact models (39M params) struggle to disentangle subtle acoustic differences between natural and synthetic speech. The high-quality filtered variant improves Test WER on Common Voice by only 1.39 percentage points (about 4.5% relative), a modest gain that may not justify the additional data processing overhead.

Training Data

Dataset Composition

Source	Samples	Description
Common Voice 17.0 Portuguese	21,866	Real crowdsourced speech
Synthetic Data	0	No synthetic augmentation
Total	21,866

Training Procedure

Hyperparameters

Parameter	Value
Learning Rate	5e-5
Batch Size (Global)	256
Warmup Steps	200
Max Epochs	5
Precision	BF16
Optimizer	AdamW (fused)
Eval Steps	50
Metric for Best Model	eval_loss
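
The maximum step count reported in the results follows from these settings. The sketch below reproduces it, assuming the last partial batch of each epoch is kept. The learning-rate schedule is not stated in this card; the function assumes linear warmup followed by linear decay to zero, which is the Hugging Face Trainer default, so treat the per-step values as illustrative.

```python
import math

NUM_SAMPLES = 21_866       # Total Training Samples
GLOBAL_BATCH_SIZE = 256    # Batch Size (Global)
MAX_EPOCHS = 5
WARMUP_STEPS = 200
PEAK_LR = 5e-5

# Optimizer steps per epoch, counting the final partial batch.
steps_per_epoch = math.ceil(NUM_SAMPLES / GLOBAL_BATCH_SIZE)
total_steps = steps_per_epoch * MAX_EPOCHS


def learning_rate(step: int) -> float:
    """Assumed schedule: linear warmup to PEAK_LR, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - WARMUP_STEPS)


print(steps_per_epoch, total_steps)
print(f"warmup share of training: {WARMUP_STEPS / total_steps:.1%}")
print(f"assumed learning rate at step 250: {learning_rate(250):.2e}")
```

This gives 86 steps per epoch and 430 steps in total, matching the Max Training Steps value. Warmup covers close to half of the run, and the best checkpoint (step 250) falls shortly after warmup ends.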

Training Infrastructure

GPU: NVIDIA H200 (141 GB VRAM)
Operating System: Ubuntu 22.04
Framework: Hugging Face Transformers

Usage

Transcription Pipeline

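import torch
from transformers import pipeline

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-tiny-cv-only-pt",
    device=device,
)

# The pipeline accepts a path to an audio file and returns a dict with a "text" key.
result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])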

Direct Model Usage

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "yuriyvnv/whisper-tiny-cv-only-pt"
# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

# Whisper expects 16 kHz mono audio; librosa resamples to that rate on load.
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features.to(device)

# Generate token IDs, then decode them to text.
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Specifying Language

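# Set these on the loaded model before calling model.generate()
# to pin the output language and task instead of relying on auto-detection.
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"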

When to Use This Model

This model is a good fit for:

Maximum resource efficiency: Smallest model size (39M params)
Edge deployment: Limited memory and compute available
Fast inference: Smallest, and therefore fastest, of the Portuguese Whisper models in this collection
Baseline comparison: Reference for evaluating synthetic data impact on Tiny architecture
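
For edge deployment, a rough weights-only memory estimate can help with sizing. The sketch below multiplies the 39M parameter count by the bytes per parameter for a few common data types. The helper name is illustrative, and the estimate excludes activations, decoding caches, and framework overhead, so real memory use will be higher.

```python
# Bytes needed to store one parameter in each data type.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_megabytes(num_params: int, dtype: str) -> float:
    """Approximate size of the model weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_megabytes(39_000_000, dtype):.0f} MB")
```

Under these assumptions the weights take about 156 MB in float32 and about half that in a 16-bit format. Whether reduced precision or quantization preserves this model's accuracy has not been measured here.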

Consider alternatives based on your needs:

whisper-tiny-high-mixed-pt: Marginal improvement (29.33% vs 30.72% Test WER on Common Voice)
whisper-small-cv-only-pt: Better accuracy (13.87% Test WER on Common Voice)
whisper-large-v3-high-mixed-pt: Best accuracy (7.94% Test WER on Common Voice)

Model Size Comparison

Model	Params	Best Config	Test WER (CV)	Test WER (MLS)	Synthetic Benefit
Whisper-Tiny	39M	High-Quality + CV	29.33%	44.18%	Marginal (1.39 pp)
Whisper-Small	244M	CV Only	13.87%	30.69%	None/Negative
Whisper-Large-v3	1550M	High-Quality + CV	7.94%	12.41%	Significant (+32.6%)
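
As a quick way to read the scaling trend, the sketch below compares each model's best configuration against Whisper-Tiny's, using only the parameter counts and Common Voice Test WER values from the table above. The names are illustrative, and the ratios are simple arithmetic rather than a fitted scaling law.

```python
# (parameters in millions, best Test WER on Common Voice in percent), from the table above.
MODELS = {
    "Whisper-Tiny": (39, 29.33),
    "Whisper-Small": (244, 13.87),
    "Whisper-Large-v3": (1550, 7.94),
}
TINY_PARAMS, TINY_WER = MODELS["Whisper-Tiny"]


def scale_vs_tiny(params_m: float, wer: float) -> tuple[float, float]:
    """Return (parameter multiple of Tiny, factor by which WER is lower than Tiny's)."""
    return round(params_m / TINY_PARAMS, 1), round(TINY_WER / wer, 2)


for name, (params_m, wer) in MODELS.items():
    size_factor, wer_factor = scale_vs_tiny(params_m, wer)
    print(f"{name}: {size_factor}x parameters, {wer_factor}x lower WER than Tiny")
```

Going from Tiny to Large-v3 costs roughly 40 times the parameters for a WER that is a bit under 4 times lower, which is the trade-off to weigh when choosing between the models.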

Limitations

Lower accuracy: 30.72% Test WER on Common Voice (vs 7.94% for Large-v3)
Limited capacity: Cannot effectively leverage synthetic data
Domain specificity: Optimized for Common Voice-style speech
Cross-domain weakness: 45.83% MLS WER shows difficulty adapting

Citation

This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:

@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}

References

Base Model: openai/whisper-tiny
Training Data: mozilla-foundation/common_voice_17_0
Whisper Paper: Robust Speech Recognition via Large-Scale Weak Supervision
Motivating Research: Enhancing ASR with Semantic Audio Filtering (IEEE Access 2024)

License

Apache 2.0
