# ScreenTalk-xs: Fine-Tuned Whisper Model for Movie & TV Audio

## Model Details
- Model Name: ScreenTalk-xs
- Developed by: DataLabX
- Fine-tuned from: openai/whisper-small
- Language(s): English
- License: Apache-2.0
- Repository: Hugging Face Model Hub
## Model Description
ScreenTalk-xs is a fine-tuned version of OpenAI's whisper-small model, optimized for speech-to-text transcription of movie and TV show audio. It is trained specifically to improve automatic speech recognition (ASR) performance in dialogue-heavy scenarios.
### Key Features
- Optimized for movie & TV dialogue
- Robust to noisy environments
- Improved handling of long-form speech
- Memory-efficient fine-tuning with LoRA
## Uses

### Direct Use
- Speech-to-text transcription for movies, TV shows, and general spoken audio.
- Automatic subtitling & captioning for multimedia content.
- Voice-enabled applications such as AI assistants & transcription services.
### Downstream Use
- Can be used for improving ASR models in entertainment, media, and accessibility applications.
### Out-of-Scope Use
- Not optimized for real-time streaming ASR.
- May not generalize well to heavily accented speech outside its training dataset.
## Training Details

### Training Data
The model was fine-tuned using the ScreenTalk-XS dataset, a collection of transcribed movie & TV audio.
### Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| Learning Rate | 5e-5 |
| Batch Size | 6 |
| Gradient Accumulation | 4 |
| Epochs | 5 |
| LoRA Rank (r) | 4 |
| Optimizer | AdamW |
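For reference, the hyperparameters above map roughly onto the following Hugging Face `Seq2SeqTrainingArguments`. This is a minimal sketch rather than the actual training script; `output_dir`, `fp16`, and the save/evaluation settings are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of training arguments matching the table above.
# output_dir, fp16, and save settings are illustrative assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="./screentalk-xs",       # assumed output path
    learning_rate=5e-5,                 # from the table
    per_device_train_batch_size=6,      # batch size from the table
    gradient_accumulation_steps=4,      # from the table
    num_train_epochs=5,                 # from the table
    optim="adamw_torch",                # AdamW optimizer
    evaluation_strategy="epoch",        # per-epoch evaluation, matching the results table
    save_strategy="epoch",
    fp16=True,                          # assumed mixed precision on the T4 GPU
    predict_with_generate=True,         # generate transcripts for WER evaluation
)
```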
### Training Procedure
- Fine-tuned with LoRA to reduce memory consumption during training (a configuration sketch is shown below).
- Evaluated on a held-out validation set after each epoch to monitor WER (Word Error Rate).
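The sketch below shows how Whisper-small can be wrapped with a rank-4 LoRA adapter using the PEFT library. The `target_modules` choice (attention query/value projections), `lora_alpha`, and dropout are illustrative assumptions, not confirmed details of the actual run.

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the base model that ScreenTalk-xs was fine-tuned from.
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Rank-4 LoRA configuration (r=4 from the hyperparameter table).
# target_modules, lora_alpha, and lora_dropout are assumptions.
lora_config = LoraConfig(
    r=4,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
```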
## Evaluation

### Training Results

| Epoch | Training Loss | Validation Loss | WER (%) |
|---|---|---|---|
| 1 | 0.502400 | 0.333292 | 20.870653 |
| 2 | 0.244200 | 0.327987 | 20.580875 |
| 3 | 0.523600 | 0.325907 | 21.924394 |
| 4 | 0.445500 | 0.326386 | 20.508430 |
| 5 | 0.285700 | 0.327116 | 20.752107 |
- Best Model: Epoch 4, achieving the lowest WER of 20.51%.
- Performance degrades after epoch 4, suggesting overfitting.
### Test Results

| Model | WER (%) |
|---|---|
| Whisper-small (baseline) | 30.00 |
| ScreenTalk-xs (fine-tuned) | 27.00 |
### Key Observations

- Fine-tuning reduced WER from 30.00% to 27.00%.
- This corresponds to a 10% relative improvement in ASR accuracy.
- Tested on the ScreenTalk-XS test set.
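WER figures like those above can be reproduced with the Hugging Face `evaluate` library. The sketch below assumes you have lists of reference transcripts and model predictions for the test split; the example strings are hypothetical.

```python
import evaluate

# Load the word-error-rate metric.
wer_metric = evaluate.load("wer")

# Hypothetical reference transcripts and model outputs for the test split.
references = ["we need to leave before sunrise", "where did you hide the key"]
predictions = ["we need to leave before sunrise", "where did you hide a key"]

# evaluate returns a fraction; multiply by 100 to report WER as a percentage.
wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}%")
```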
## Technical Specifications

### Model Architecture
- Based on Whisper-small, a transformer-based sequence-to-sequence ASR model.
- Fine-tuned using LoRA to reduce memory footprint.
### Hardware & Compute Infrastructure
- Training Hardware: NVIDIA T4 GPU (16 GB)
- Training Time: ~5 hours
- Training Environment: PyTorch + Transformers (Hugging Face)
## How to Use

You can use this model for speech-to-text transcription with the Transformers `pipeline`:
```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="fj11/ScreenTalk-xs",
    device=0,  # run on GPU; use device=-1 for CPU
)

result = pipe("path/to/audio.wav")
print(result["text"])
```
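For long-form audio such as full episodes, the same pipeline can chunk the input and return segment timestamps, which is useful for subtitling and captioning. The `chunk_length_s=30` value and the example file name below are illustrative choices, not requirements of the model.

```python
from transformers import pipeline

# Chunked long-form transcription with segment-level timestamps.
pipe = pipeline(
    "automatic-speech-recognition",
    model="fj11/ScreenTalk-xs",
    chunk_length_s=30,  # split long audio into 30-second windows
    device=0,
)

result = pipe("path/to/episode.wav", return_timestamps=True)
for chunk in result["chunks"]:
    # Each chunk carries a (start, end) timestamp pair and its transcript.
    print(chunk["timestamp"], chunk["text"])
```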
## Citation
If you use this model, please cite:
```bibtex
@misc{DataLabX2025ScreenTalkXS,
  author    = {DataLabX},
  title     = {ScreenTalk-xs: ASR Model Fine-Tuned on Movie \& TV Audio},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/DataLabX/ScreenTalk-xs}
}
```