πŸ“Œ ScreenTalk-xs: Fine-Tuned Whisper Model for Movie & TV Audio

πŸ“œ Model Details

πŸ“Œ Model Description

ScreenTalk-xs is a fine-tuned version of OpenAI's whisper-small model, optimized for speech-to-text transcription of movie & TV show audio. It is trained specifically to improve ASR (Automatic Speech Recognition) performance in dialogue-heavy scenarios.

πŸ”Ή Key Features

  • πŸ“Ί Optimized for movie & TV dialogues
  • 🎀 Robust to noisy environments
  • πŸ” Improved handling of long-form speech
  • πŸš€ Memory-efficient LoRA fine-tuning

πŸš€ Uses

βœ… Direct Use

  • Speech-to-text transcription for movies, TV shows, and general spoken audio.
  • Automatic subtitling & captioning for multimedia content (see the timestamped sketch after this list).
  • Voice-enabled applications such as AI assistants & transcription services.
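
As a sketch of the captioning use case, the Transformers ASR pipeline can chunk long audio and return segment-level timestamps. The chunk length, the audio path, and the exact output formatting below are illustrative assumptions, not part of this model card:

```python
from transformers import pipeline

# Illustrative captioning sketch: chunked long-form decoding with timestamps.
# "path/to/audio.wav" is a placeholder input.
pipe = pipeline(
    "automatic-speech-recognition",
    model="fj11/ScreenTalk-xs",
    chunk_length_s=30,       # split long audio into ~30 s windows
    return_timestamps=True,  # emit (start, end) timestamps per segment
)

result = pipe("path/to/audio.wav")
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]  # end can be None on the final chunk
    print(f"[{start:.2f} -> {end or start:.2f}] {chunk['text']}")
```

Each printed line maps directly onto a subtitle cue, which is why segment timestamps matter for this use case.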

πŸ”Ή Downstream Use

  • Can serve as a starting point for further fine-tuning of ASR systems in entertainment, media, and accessibility applications.

❌ Out-of-Scope Use

  • Not optimized for real-time streaming ASR.
  • May not generalize well to heavily accented speech outside its training dataset.

πŸ›  Training Details

πŸ“Œ Training Data

The model was fine-tuned using the ScreenTalk-XS dataset, a collection of transcribed movie & TV audio.

πŸ“Œ Training Hyperparameters

| Hyperparameter        | Value |
|-----------------------|-------|
| Learning Rate         | 5e-5  |
| Batch Size            | 6     |
| Gradient Accumulation | 4     |
| Epochs                | 5     |
| LoRA Rank (r)         | 4     |
| Optimizer             | AdamW |
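
A minimal sketch of how these hyperparameters might map onto Hugging Face Seq2SeqTrainingArguments. The exact training script for this model is not published, so the output directory, fp16 flag, and generation-during-eval setting below are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./screentalk-xs",    # assumption: hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=6,
    gradient_accumulation_steps=4,   # effective batch size 6 * 4 = 24
    num_train_epochs=5,
    fp16=True,                       # assumption: mixed precision to fit a 16 GB T4
    predict_with_generate=True,      # decode during eval so WER can be computed
)
```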

πŸ“Œ Training Procedure

  • Fine-tuned with LoRA to reduce memory consumption while maintaining performance (see the configuration sketch after this list).
  • Evaluated on a held-out test set after each epoch to monitor WER (Word Error Rate).
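
A minimal LoRA configuration sketch using the peft library, with the rank from the table above. The alpha, dropout, and target modules are assumptions (the attention projections `q_proj`/`v_proj` are a common choice for Whisper), not confirmed details of this model's training:

```python
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=4,                                  # LoRA rank from the hyperparameter table
    lora_alpha=16,                        # assumption: not reported in the card
    target_modules=["q_proj", "v_proj"],  # assumption: common Whisper attention targets
    lora_dropout=0.05,                    # assumption
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only the low-rank adapters are trained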

πŸ“Š Evaluation

πŸ“Œ Training Results

| Epoch | Training Loss | Validation Loss | WER (%)   |
|-------|---------------|-----------------|-----------|
| 1     | 0.502400      | 0.333292        | 20.870653 |
| 2     | 0.244200      | 0.327987        | 20.580875 |
| 3     | 0.523600      | 0.325907        | 21.924394 |
| 4     | 0.445500      | 0.326386        | 20.508430 |
| 5     | 0.285700      | 0.327116        | 20.752107 |
  • Best model: epoch 4, achieving WER = 20.51%
  • WER rises again at epoch 5, suggesting overfitting beyond epoch 4.

πŸ“Œ Test Results

| Model                      | WER (%)  |
|----------------------------|----------|
| Whisper-small (baseline)   | 30.00    |
| ScreenTalk-xs (fine-tuned) | 27.00 βœ… |

πŸ” Key Observations

  • Fine-tuning reduced WER from 30.00% β†’ 27.00% 🎯
  • That is a 10% relative reduction in word error rate.
  • Tested on the held-out ScreenTalk-XS test set.
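
The WER numbers above can be reproduced with the Hugging Face `evaluate` library; the reference/prediction strings below are made-up examples for illustration:

```python
import evaluate

# Word Error Rate = (substitutions + deletions + insertions) / reference words
wer_metric = evaluate.load("wer")

references = ["hello there how are you"]
predictions = ["hello there how you"]  # one deletion out of five words

wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {100 * wer:.2f}%")        # 20.00%
```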

πŸ–₯️ Technical Specifications

πŸ“Œ Model Architecture

  • Based on Whisper-small, a transformer-based sequence-to-sequence ASR model.
  • Fine-tuned using LoRA to reduce memory footprint.
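
If you reproduce the LoRA fine-tuning yourself, the adapters can be folded back into the base weights for standard inference. A sketch assuming peft; "my-lora-adapter" is a hypothetical local path to adapters you trained, not a published artifact of this model:

```python
from peft import PeftModel
from transformers import WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# "my-lora-adapter" is a hypothetical path to your own trained adapters.
model = PeftModel.from_pretrained(base, "my-lora-adapter")
merged = model.merge_and_unload()  # fold the low-rank updates into the base weights

merged.save_pretrained("screentalk-xs-merged")  # plain Whisper checkpoint, no peft needed
```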

πŸ“Œ Hardware & Compute Infrastructure

  • Training Hardware: NVIDIA T4 GPU (16 GB)
  • Training Time: ~5 hours
  • Training Environment: PyTorch + Transformers (Hugging Face)

πŸ“– How to Use

You can use this model for speech-to-text transcription with the Transformers `pipeline`:

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="fj11/ScreenTalk-xs",
    device=0,  # GPU index; use device=-1 (or omit) to run on CPU
)

result = pipe("path/to/audio.wav")
print(result["text"])
```
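
For more control over decoding, the processor and model classes can be used directly. A sketch assuming the repo ships the standard Whisper processor files and that the input is 16 kHz mono float audio; the silent placeholder waveform is for illustration only:

```python
import numpy as np
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("fj11/ScreenTalk-xs")
model = WhisperForConditionalGeneration.from_pretrained("fj11/ScreenTalk-xs")

# Placeholder input: Whisper expects 16 kHz mono float audio,
# e.g. waveform, _ = librosa.load("path/to/audio.wav", sr=16000)
waveform = np.zeros(16000, dtype=np.float32)  # 1 s of silence

inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    ids = model.generate(inputs.input_features, max_new_tokens=128)

print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```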

πŸ“œ Citation

If you use this model, please cite:

```bibtex
@misc{DataLabX2025ScreenTalkXS,
  author    = {DataLabX},
  title     = {ScreenTalk-xs: ASR Model Fine-Tuned on Movie & TV Audio},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/DataLabX/ScreenTalk-xs}
}
```