πŸ“Œ ScreenTalk-xs: Fine-Tuned Whisper Model for Movie & TV Audio

πŸ“œ Model Details

πŸ“Œ Model Description

ScreenTalk-xs is a fine-tuned version of OpenAI's whisper-small model, optimized for speech-to-text transcription of movie & TV show audio. It is trained specifically to improve ASR (Automatic Speech Recognition) performance in dialogue-heavy scenarios.

πŸ”Ή Key Features

  • πŸ“Ί Optimized for movie & TV dialogues
  • 🎀 Robust to noisy environments
  • πŸ” Improved handling of long-form speech
  • πŸš€ Memory-efficient LoRA fine-tuning

πŸš€ Uses

βœ… Direct Use

  • Speech-to-text transcription for movies, TV shows, and general spoken audio.
  • Automatic subtitling & captioning for multimedia content (see the timestamped sketch after this list).
  • Voice-enabled applications such as AI assistants & transcription services.
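
As a sketch of the captioning use case, the Transformers ASR pipeline can chunk long audio and return segment-level timestamps. The chunk length, the audio path, and the exact output formatting below are illustrative assumptions, not part of this model card:

```python
from transformers import pipeline

# Illustrative captioning sketch: chunked long-form decoding with timestamps.
# "path/to/audio.wav" is a placeholder input.
pipe = pipeline(
    "automatic-speech-recognition",
    model="fj11/ScreenTalk-xs",
    chunk_length_s=30,       # split long audio into ~30 s windows
    return_timestamps=True,  # emit (start, end) timestamps per segment
)

result = pipe("path/to/audio.wav")
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]  # end can be None on the final chunk
    print(f"[{start:.2f} -> {end or start:.2f}] {chunk['text']}")
```

Each printed line maps directly onto a subtitle cue, which is why segment timestamps matter for this use case.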

πŸ”Ή Downstream Use

  • Can serve as a starting point for further fine-tuning of ASR systems in entertainment, media, and accessibility applications.

❌ Out-of-Scope Use

  • Not optimized for real-time streaming ASR.
  • May not generalize well to heavily accented speech outside its training dataset.

πŸ›  Training Details

πŸ“Œ Training Data

The model was fine-tuned using the ScreenTalk-XS dataset, a collection of transcribed movie & TV audio.

πŸ“Œ Training Hyperparameters

| Hyperparameter        | Value |
|-----------------------|-------|
| Learning Rate         | 5e-5  |
| Batch Size            | 6     |
| Gradient Accumulation | 4     |
| Epochs                | 5     |
| LoRA Rank (r)         | 4     |
| Optimizer             | AdamW |
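
A minimal sketch of how these hyperparameters might map onto Hugging Face Seq2SeqTrainingArguments. The exact training script for this model is not published, so the output directory, fp16 flag, and generation-during-eval setting below are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./screentalk-xs",    # assumption: hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=6,
    gradient_accumulation_steps=4,   # effective batch size 6 * 4 = 24
    num_train_epochs=5,
    fp16=True,                       # assumption: mixed precision to fit a 16 GB T4
    predict_with_generate=True,      # decode during eval so WER can be computed
)
```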

πŸ“Œ Training Procedure

  • Fine-tuned with LoRA to reduce memory consumption while maintaining performance (see the configuration sketch after this list).
  • Evaluated on a held-out test set after each epoch to monitor WER (Word Error Rate).
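
A minimal LoRA configuration sketch using the peft library, with the rank from the table above. The alpha, dropout, and target modules are assumptions (the attention projections `q_proj`/`v_proj` are a common choice for Whisper), not confirmed details of this model's training:

```python
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=4,                                  # LoRA rank from the hyperparameter table
    lora_alpha=16,                        # assumption: not reported in the card
    target_modules=["q_proj", "v_proj"],  # assumption: common Whisper attention targets
    lora_dropout=0.05,                    # assumption
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only the low-rank adapters are trained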

πŸ“Š Evaluation

πŸ“Œ Training Results

| Epoch | Training Loss | Validation Loss | WER (%)   |
|-------|---------------|-----------------|-----------|
| 1     | 0.502400      | 0.333292        | 20.870653 |
| 2     | 0.244200      | 0.327987        | 20.580875 |
| 3     | 0.523600      | 0.325907        | 21.924394 |
| 4     | 0.445500      | 0.326386        | 20.508430 |
| 5     | 0.285700      | 0.327116        | 20.752107 |
  • Best model: epoch 4, achieving WER = 20.51%
  • WER rises again at epoch 5, suggesting overfitting beyond epoch 4.

πŸ“Œ Test Results

| Model                      | WER (%)  |
|----------------------------|----------|
| Whisper-small (baseline)   | 30.00    |
| ScreenTalk-xs (fine-tuned) | 27.00 βœ… |

πŸ” Key Observations

  • Fine-tuning reduced WER from 30.00% β†’ 27.00% 🎯
  • That is a 10% relative reduction in word error rate.
  • Tested on the held-out ScreenTalk-XS test set.
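
The WER numbers above can be reproduced with the Hugging Face `evaluate` library; the reference/prediction strings below are made-up examples for illustration:

```python
import evaluate

# Word Error Rate = (substitutions + deletions + insertions) / reference words
wer_metric = evaluate.load("wer")

references = ["hello there how are you"]
predictions = ["hello there how you"]  # one deletion out of five words

wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {100 * wer:.2f}%")        # 20.00%
```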

πŸ–₯️ Technical Specifications

πŸ“Œ Model Architecture

  • Based on Whisper-small, a transformer-based sequence-to-sequence ASR model.
  • Fine-tuned using LoRA to reduce memory footprint.
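
If you reproduce the LoRA fine-tuning yourself, the adapters can be folded back into the base weights for standard inference. A sketch assuming peft; "my-lora-adapter" is a hypothetical local path to adapters you trained, not a published artifact of this model:

```python
from peft import PeftModel
from transformers import WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# "my-lora-adapter" is a hypothetical path to your own trained adapters.
model = PeftModel.from_pretrained(base, "my-lora-adapter")
merged = model.merge_and_unload()  # fold the low-rank updates into the base weights

merged.save_pretrained("screentalk-xs-merged")  # plain Whisper checkpoint, no peft needed
```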

πŸ“Œ Hardware & Compute Infrastructure

  • Training Hardware: NVIDIA T4 GPU (16 GB)
  • Training Time: ~5 hours
  • Training Environment: PyTorch + Transformers (Hugging Face)

πŸ“– How to Use

You can use this model for speech-to-text transcription with the Transformers `pipeline`:

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="fj11/ScreenTalk-xs",
    device=0,  # GPU index; use device=-1 (or omit) to run on CPU
)

result = pipe("path/to/audio.wav")
print(result["text"])
```
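
For more control over decoding, the processor and model classes can be used directly. A sketch assuming the repo ships the standard Whisper processor files and that the input is 16 kHz mono float audio; the silent placeholder waveform is for illustration only:

```python
import numpy as np
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("fj11/ScreenTalk-xs")
model = WhisperForConditionalGeneration.from_pretrained("fj11/ScreenTalk-xs")

# Placeholder input: Whisper expects 16 kHz mono float audio,
# e.g. waveform, _ = librosa.load("path/to/audio.wav", sr=16000)
waveform = np.zeros(16000, dtype=np.float32)  # 1 s of silence

inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    ids = model.generate(inputs.input_features, max_new_tokens=128)

print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```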

πŸ“œ Citation

If you use this model, please cite:

```bibtex
@misc{DataLabX2025ScreenTalkXS,
  author    = {DataLabX},
  title     = {ScreenTalk-xs: ASR Model Fine-Tuned on Movie & TV Audio},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/DataLabX/ScreenTalk-xs}
}
```