# videoloc/seamless-translation

## Model Description
This is a SeamlessTranslation model that processes audio and text inputs with translation awareness to predict Time To Edit (TTE) for subtitle segments. Given an audio segment and its corresponding subtitle text, the model predicts how much time (in seconds) would be required to edit/refine that subtitle segment, taking into account whether the subtitle is translated or original content.
The model extends the basic SeamlessM4T architecture with a translation feature that helps distinguish original from translated subtitle content, improving TTE prediction accuracy across five languages (English, French, Spanish, Italian, and German) with various translation pairs between them.
## Key Features
- Translation-Aware Processing: Distinguishes between original and translated content
- Multimodal Processing: Simultaneously processes audio (16kHz) and text inputs
- Frozen Encoders: Uses pre-trained SeamlessM4T encoders (frozen for stability)
- Enhanced Architecture: Adds translation embedding to basic model
- TTE Prediction: Predicts editing time required for subtitle segments
- Direct Output: Raw time values in seconds for immediate use
## Model Architecture
The model extends the basic SeamlessM4T architecture with translation awareness:
**Audio Processing:**
- SeamlessM4T speech encoder (frozen) processes raw audio input
- Audio projection layer maps speech encoder output to 1024 dimensions
- Mean pooling over sequence length to get fixed-size audio embedding
**Text Processing:**
- SeamlessM4T text encoder (frozen) processes tokenized text input
- Text projection layer maps text encoder output to 1024 dimensions
- Mean pooling over sequence length to get fixed-size text embedding
**Translation Feature Processing:**
- Binary translation flag (0/1) indicating original vs translated content
- Translation projection layer maps binary input to 64 dimensions
- Learned embedding helps model distinguish translation effects
**Feature Fusion:**
- Audio, text, and translation embeddings are concatenated (2112 total dimensions)
- Simple concatenation without complex cross-modal interactions
**Regression Head:**
- Multi-layer perceptron: 2112 → 1024 → 512 → 256 → 1
- ReLU activations and dropout for regularization
- Single output for TTE prediction (regression, in seconds)
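Putting the pieces above together, the fusion and regression head can be summarized with a minimal sketch. The module and layer names here are illustrative assumptions; the actual implementation ships in `modeling_seamless_translation.py` (see Quick Start below).

```python
import torch
import torch.nn as nn

class TranslationFusionHead(nn.Module):
    """Illustrative sketch of the fusion + regression head described above."""

    def __init__(self, hidden_size=1024, translation_dim=64, dropout=0.1):
        super().__init__()
        # Binary flag (0/1) -> learned 64-dim translation embedding
        self.translation_proj = nn.Linear(1, translation_dim)
        # 1024 (audio) + 1024 (text) + 64 (translation) = 2112 fused dims
        # (dropout rate is an assumption, not taken from the model card)
        self.regressor = nn.Sequential(
            nn.Linear(2 * hidden_size + translation_dim, 1024), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 1),  # single TTE value in seconds
        )

    def forward(self, audio_emb, text_emb, is_translation):
        # audio_emb, text_emb: (batch, 1024) after projection + mean pooling
        trans_emb = self.translation_proj(is_translation.float().unsqueeze(-1))
        fused = torch.cat([audio_emb, text_emb, trans_emb], dim=-1)  # (batch, 2112)
        return self.regressor(fused).squeeze(-1)  # (batch,) TTE predictions
```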
## Quick Start

### Installation

```bash
pip install transformers torch torchaudio huggingface_hub
```

### Basic Usage
```python
from huggingface_hub import hf_hub_download
import torch
import numpy as np
import importlib.util

# Load model - custom architecture requires importing the model class
model_file = hf_hub_download(repo_id="videoloc/seamless-translation", filename="modeling_seamless_translation.py")
spec = importlib.util.spec_from_file_location("modeling_seamless_translation", model_file)
modeling_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling_module)

# Now load the model using the custom class
config = modeling_module.SeamlessTranslationConfig.from_pretrained("videoloc/seamless-translation")
model = modeling_module.HFSeamlessTranslation.from_pretrained("videoloc/seamless-translation")

# Load the data collator (included in this repo)
collator_file = hf_hub_download(repo_id="videoloc/seamless-translation", filename="data_collator.py")
spec = importlib.util.spec_from_file_location("data_collator", collator_file)
collator_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(collator_module)

# Initialize data collator
data_collator = collator_module.DataCollatorSimpleSeamless(
    processor="facebook/hf-seamless-m4t-medium",
    max_audio_length_sec=8.0,
    max_text_length=256,
)

# Prepare your data with translation information
your_data = [
    {
        'raw_audio': np.random.randn(16000 * 5),  # 5 seconds at 16kHz
        'raw_text': "Your subtitle text here",
        'is_translation': 1,  # 1 for translated content, 0 for original
    }
]

# Process and run inference
batch = data_collator(your_data)
model.eval()
with torch.no_grad():
    outputs = model(**batch)
    tte_prediction = outputs.logits.item()

print(f"Predicted Time To Edit (TTE): {tte_prediction:.2f} seconds")
```
## Model Details
- Base Model: SeamlessM4T (facebook/hf-seamless-m4t-medium)
- Audio Encoder: Frozen SeamlessM4T speech encoder
- Text Encoder: Frozen SeamlessM4T text encoder
- Hidden Size: 1024
- Translation Embedding: 64 dimensions
- Audio Input: 16kHz
- Translation Input: Binary flag (0/1)
- Output: Single regression value (TTE in seconds)
- Task: Subtitle editing time prediction
## Data Format
Your input data should be a list of dictionaries with:
- `raw_audio`: NumPy array of audio samples (16kHz sampling rate)
- `raw_text`: String of subtitle text
- `is_translation`: Binary flag (1 for translated, 0 for original content)
- `labels`: Target TTE values in seconds (optional, for training)
Example:
```python
data = [
    {
        'raw_audio': audio_samples,  # shape: (num_samples,) at 16kHz
        'raw_text': "Subtitle text content",
        'is_translation': 1,         # 1 = translated, 0 = original
        'labels': 2.5,               # optional TTE target value in seconds
    }
]
```
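To build `raw_audio` from a real file rather than synthetic samples, a loader along these lines works (a sketch assuming `torchaudio` and a hypothetical file `segment_0001.wav`; the collator expects a 1-D 16kHz array as in the example above):

```python
import torchaudio

def load_raw_audio(path, target_sr=16000):
    """Load an audio file and return a 1-D 16kHz NumPy array for `raw_audio`."""
    waveform, sr = torchaudio.load(path)  # (channels, num_samples)
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    return waveform.mean(dim=0).numpy()  # downmix to mono

data = [{
    'raw_audio': load_raw_audio("segment_0001.wav"),  # hypothetical file
    'raw_text': "Subtitle text content",
    'is_translation': 0,
}]
```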
## Performance Metrics
- Best Eval RMSE: 33.34
## Training Details
- Base Model: facebook/hf-seamless-m4t-medium
- Model Type: seamless_with_translation
- Epochs: 10
- Batch Size (Train): 32
- Batch Size (Eval): 64
- Learning Rate: 1.2e-4
- LR Scheduler: cosine_with_restarts
- Warmup Ratio: 0.05
- Weight Decay: 0.001
- Optimizer: AdamW (torch)
- Max Grad Norm: 1.0
- FP16: True
- Early Stopping Patience: 5
- Audio Max Length: 8.0 seconds
- Text Max Length: 256 tokens
- Sample Rate: 16kHz
- Translation Feature: Binary flag (0/1)
- Normalization: None (raw values)
- Dataset Split: 80/20 train/test
- Random Seed: 42
- Metric: RMSE (lower is better)
## Training Configuration
The model was trained with the following specifications:
- Dataset: Multimodal audio-subtitle pairs with translation annotations (5 languages: EN, FR, ES, IT, DE)
- Train/Test Split: 80/20 with random seed 42
- Audio Processing: 16kHz sampling, max 8.0 seconds, no offset
- Text Processing: Max 256 tokens
- Translation Feature: Binary flag indicating original vs translated content
- Normalization: None (raw TTE values in seconds)
- Caching: Audio segments cached and compressed for efficiency
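For reference, the hyperparameters above map onto `transformers.TrainingArguments` roughly as follows. This is a sketch, not the exact training script; it assumes a standard `Trainer` setup whose `compute_metrics` function reports an `rmse` key, and the output path is hypothetical.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="seamless-translation-tte",  # hypothetical output path
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=1.2e-4,
    lr_scheduler_type="cosine_with_restarts",
    warmup_ratio=0.05,
    weight_decay=0.001,
    optim="adamw_torch",
    max_grad_norm=1.0,
    fp16=True,
    seed=42,
    eval_strategy="epoch",  # `evaluation_strategy` on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="rmse",  # assumes compute_metrics returns an "rmse" key
    greater_is_better=False,       # lower RMSE is better
)

# Stop if eval RMSE fails to improve for 5 consecutive evaluations
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
```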
## Usage Notes
- This is the translation-aware variant; for the basic model without translation features, see `seamless-basic`, and for language pair embeddings, see `seamless-langpairs` (both under Related Models below)
- Model expects 16kHz audio input (automatically resampled by the data collator)
- The translation flag significantly impacts predictions; the sketch after this list shows how to compare both settings
- No feature normalization is applied; outputs are raw TTE predictions in seconds
- Optimized for subtitle editing time estimation tasks
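Because the translation flag feeds a learned embedding, flipping it on the same segment shifts the prediction. A quick way to see this, reusing `model` and `data_collator` from the Quick Start above:

```python
import numpy as np
import torch

# Same audio and text, scored once as original and once as translated
segment = {
    'raw_audio': np.random.randn(16000 * 5),  # replace with real 16kHz audio
    'raw_text': "Your subtitle text here",
}

model.eval()
with torch.no_grad():
    for flag in (0, 1):
        batch = data_collator([{**segment, 'is_translation': flag}])
        tte = model(**batch).logits.item()
        print(f"is_translation={flag}: predicted TTE = {tte:.2f}s")
```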
## Limitations
- Requires translation annotation in training data
- Designed for TTE prediction, not general audio-text matching
- Performance may vary on out-of-domain content
- Requires specific data preprocessing (use included data collator)
## Related Models
- `seamless-basic`: Basic audio+text model without translation features
- `seamless-langpairs`: Includes language pair embeddings for fine-grained multilingual control
- `seamless-crossattention`: Advanced cross-modal attention mechanisms for sophisticated audio-text interactions