videoloc/seamless-langpairs

Model Description

SeamlessLanguagePairs is a multimodal regression model that predicts Time To Edit (TTE) for subtitle segments. Given an audio segment and its corresponding subtitle text, it predicts how much time (in seconds) would be required to edit or refine that segment, taking into account both whether the subtitle is a translation and which language pair is involved.

The model extends the SeamlessM4T architecture with a translation feature and language pair embeddings, which gives it the finest-grained language conditioning in this model family. It covers 5 languages (English, French, Spanish, Italian, and German) and 21 translation pairs between them (e.g., EN→FR, ES→DE, IT→EN).

Key Features

Language Pair Embeddings: Fine-grained control for 21 language pairs plus "other"
Translation-Aware Processing: Distinguishes between original and translated content
Multimodal Processing: Simultaneously processes audio (16kHz) and text inputs
Frozen Encoders: Uses pre-trained SeamlessM4T encoders (frozen for stability)
Enhanced Architecture: Adds both translation and language pair embeddings
TTE Prediction: Predicts editing time required for subtitle segments
Direct Output: Raw time values in seconds for immediate use

Model Architecture

The model extends the basic SeamlessM4T architecture with both translation and language pair awareness:

Audio Processing:
- SeamlessM4T speech encoder (frozen) processes raw audio input
- Audio projection layer maps speech encoder output to 1024 dimensions
- Mean pooling over sequence length to get fixed-size audio embedding
Text Processing:
- SeamlessM4T text encoder (frozen) processes tokenized text input
- Text projection layer maps text encoder output to 1024 dimensions
- Mean pooling over sequence length to get fixed-size text embedding
Translation Feature Processing:
- Binary translation flag (0/1) indicating original vs translated content
- Translation projection layer maps binary input to 32 dimensions
- Learned embedding helps model distinguish translation effects
Language Pair Processing:
- Categorical language pair ID (0-20 for specific language combinations, 21 for "other")
- Language pair embedding layer maps IDs to 64-dimensional vectors
- Captures language-specific temporal alignment patterns
Feature Fusion:
- Audio, text, translation, and language pair embeddings are concatenated (1024 + 1024 + 32 + 64 = 2144 total dimensions)
- Simple concatenation without complex cross-modal interactions
Regression Head:
- Multi-layer perceptron: 2144 → 1024 → 512 → 256 → 1
- ReLU activations and dropout for regularization
- Single output for TTE prediction (regression, in seconds)
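The fusion and regression stages described above can be sketched in PyTorch as follows. This is a minimal illustration, not the code shipped in this repo: the class name `FusionRegressionHead`, the helper `masked_mean_pool`, the dropout rate, the use of an attention mask during pooling, and the embedding table size of 22 (IDs 0-20 plus 21 for "other") are all assumptions. The inputs are taken to be the already projected and pooled 1024-dimensional audio and text embeddings.

```python
import torch
import torch.nn as nn


def masked_mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average over the sequence axis, ignoring padded positions.

    hidden: (batch, seq_len, dim); mask: (batch, seq_len) with 1 = real position.
    """
    mask = mask.unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)


class FusionRegressionHead(nn.Module):
    """Hypothetical stand-in for the fusion and regression stages."""

    def __init__(self, hidden=1024, trans_dim=32, pair_dim=64,
                 num_pair_ids=22, dropout=0.1):
        super().__init__()
        # Binary flag -> 32 dimensions
        self.translation_proj = nn.Linear(1, trans_dim)
        # Language pair ID -> 64 dimensions
        self.pair_embedding = nn.Embedding(num_pair_ids, pair_dim)
        fused = 2 * hidden + trans_dim + pair_dim  # 1024 + 1024 + 32 + 64 = 2144
        self.mlp = nn.Sequential(
            nn.Linear(fused, 1024), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 1),
        )

    def forward(self, audio_emb, text_emb, is_translation, language_pair_id):
        trans = self.translation_proj(is_translation.float().unsqueeze(-1))
        pair = self.pair_embedding(language_pair_id)
        # Plain concatenation, no cross-modal interaction
        fused = torch.cat([audio_emb, text_emb, trans, pair], dim=-1)
        return self.mlp(fused).squeeze(-1)  # (batch,) TTE in seconds


head = FusionRegressionHead().eval()
with torch.no_grad():
    tte = head(torch.randn(2, 1024), torch.randn(2, 1024),
               torch.tensor([1, 0]), torch.tensor([5, 21]))
print(tte.shape)
```

Concatenation keeps the head cheap to train on top of frozen encoders, since only the projections, the two small embeddings, and the MLP carry trainable weights.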

Quick Start

Installation

pip install transformers torch torchaudio huggingface_hub

Basic Usage

from huggingface_hub import hf_hub_download
import torch
import numpy as np
import importlib.util

# Load model - custom architecture requires importing the model class
model_files = hf_hub_download(repo_id="videoloc/seamless-langpairs", filename="modeling_seamless_langpairs.py")
spec = importlib.util.spec_from_file_location("modeling_seamless_langpairs", model_files)
modeling_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling_module)

# Now load the model using the custom class
config = modeling_module.SeamlessLanguagePairsConfig.from_pretrained("videoloc/seamless-langpairs")
model = modeling_module.HFSeamlessLanguagePairs.from_pretrained("videoloc/seamless-langpairs")

# Load the data collator (included in this repo)
collator_file = hf_hub_download(repo_id="videoloc/seamless-langpairs", filename="data_collator.py")
spec = importlib.util.spec_from_file_location("data_collator", collator_file)
collator_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(collator_module)

# Initialize data collator
data_collator = collator_module.DataCollatorSimpleSeamless(
    processor="facebook/hf-seamless-m4t-medium",
    max_audio_length_sec=8.0,
    max_text_length=256
)

# Prepare your data with translation and language pair information
your_data = [
    {
        'raw_audio': np.random.randn(16000 * 5),  # 5 seconds at 16kHz
        'raw_text': "Your subtitle text here",
        'is_translation': 1,       # 1 for translated content, 0 for original
        'language_pair_id': 5,     # 0-20 for specific language pairs
    }
]

# Process and run inference
batch = data_collator(your_data)
model.eval()
with torch.no_grad():
    outputs = model(**batch)
    # .item() works here because the batch holds a single segment
    tte_prediction = outputs.logits.item()

print(f"Predicted Time To Edit (TTE): {tte_prediction:.2f} seconds")
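The example above reads one prediction with `.item()`, which only works for a single-segment batch. For several segments at once, a helper along these lines may be useful. It is a sketch that assumes `outputs.logits` has shape `(batch, 1)` or `(batch,)`, which this card does not state; `logits_to_seconds` is a hypothetical name, and the array below is a stand-in for `outputs.logits.cpu().numpy()`.

```python
import numpy as np


def logits_to_seconds(logits) -> list:
    """Flatten a (batch, 1) or (batch,) array of predictions into a list of floats."""
    return np.asarray(logits, dtype=np.float64).reshape(-1).tolist()


# Stand-in values for a three-segment batch (not real model output).
fake_logits = np.array([[12.5], [3.0], [47.25]])
for i, seconds in enumerate(logits_to_seconds(fake_logits)):
    print(f"segment {i}: {seconds:.2f} s")
```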

Model Details

Base Model: SeamlessM4T (facebook/hf-seamless-m4t-medium)
Audio Encoder: Frozen SeamlessM4T speech encoder
Text Encoder: Frozen SeamlessM4T text encoder
Hidden Size: 1024
Translation Embedding: 32 dimensions
Language Pair Embedding: 64 dimensions
Number of Language Pairs: 21 (plus "other")
Audio Input: 16kHz
Translation Input: Binary flag (0/1)
Language Pair Input: Categorical ID (0-20 for specific pairs, 21 for "other")
Output: Single regression value (TTE in seconds)
Task: Subtitle editing time prediction

Supported Language Pairs

The model supports 21 specific translation pairs between 5 languages:

Languages: English (EN), French (FR), Spanish (ES), Italian (IT), German (DE)

Translation Pairs: Pairs are directional, so EN→FR and FR→EN are distinct (other examples: ES→IT, DE→ES). The 21 supported pairs are the most frequent ones in the training data. Language pair IDs 0-20 identify these specific translation directions, and ID 21 is reserved for "other" pairs.
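A lookup from a directional pair to its ID, with a fallback to "other", could look like the sketch below. The ID assignment shown is hypothetical: this card does not list which pair maps to which ID, so `EXAMPLE_PAIR_TO_ID` must be replaced with the table used during training. Only the value 21 for "other" comes from the text above.

```python
OTHER_PAIR_ID = 21  # per the text above: ID 21 is reserved for "other" pairs

# Hypothetical ID assignment for illustration only; the real table is not
# listed in this card and must come from the training code.
EXAMPLE_PAIR_TO_ID = {
    ("EN", "FR"): 0,
    ("FR", "EN"): 1,
    ("ES", "IT"): 2,
    ("DE", "ES"): 3,
}


def language_pair_id(source: str, target: str, table=EXAMPLE_PAIR_TO_ID) -> int:
    """Return the ID for a directional pair, or the "other" ID if it is unknown."""
    return table.get((source.upper(), target.upper()), OTHER_PAIR_ID)


print(language_pair_id("en", "fr"))  # a pair present in the example table
print(language_pair_id("EN", "JA"))  # falls back to the "other" ID
```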

Data Format

Your input data should be a list of dictionaries with:

raw_audio: NumPy array of audio samples (16kHz sampling rate)
raw_text: String of subtitle text
is_translation: Binary flag (1 for translated, 0 for original content)
language_pair_id: Integer ID (0-20 for a specific language pair, 21 for "other")
labels: Target TTE values in seconds (optional, for training)

Example:

data = [
    {
        'raw_audio': audio_samples,  # shape: (num_samples,) at 16kHz
        'raw_text': "Subtitle text content",
        'is_translation': 1,     # 1 = translated, 0 = original
        'language_pair_id': 5,   # 0-20 for language pairs
        'labels': 2.5  # optional TTE target value in seconds
    }
]
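Before handing records to the data collator, it can help to check them against the format above. The following is a minimal sketch, not part of this repo: `validate_record` is a hypothetical helper, and the upper bound of 21 for `language_pair_id` assumes that the "other" ID is accepted as input.

```python
import numpy as np

REQUIRED_KEYS = ("raw_audio", "raw_text", "is_translation", "language_pair_id")
MAX_PAIR_ID = 21  # assumption: 0-20 for specific pairs, 21 for "other"


def validate_record(record: dict) -> list:
    """Return a list of problems found in one input record (empty if none)."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in record]
    if problems:
        return problems
    audio = np.asarray(record["raw_audio"])
    if audio.ndim != 1 or audio.size == 0:
        problems.append("raw_audio must be a non-empty 1-D array of samples")
    if not isinstance(record["raw_text"], str) or not record["raw_text"].strip():
        problems.append("raw_text must be a non-empty string")
    if record["is_translation"] not in (0, 1):
        problems.append("is_translation must be 0 or 1")
    pair_id = record["language_pair_id"]
    if not isinstance(pair_id, int) or not 0 <= pair_id <= MAX_PAIR_ID:
        problems.append(f"language_pair_id must be an integer in 0-{MAX_PAIR_ID}")
    return problems


record = {
    "raw_audio": np.zeros(16000 * 2),  # 2 seconds of silence at 16kHz
    "raw_text": "Subtitle text content",
    "is_translation": 1,
    "language_pair_id": 5,
}
print(validate_record(record))
```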

Performance Metrics

Best Eval RMSE: 33.34 seconds (computed on raw, unnormalized TTE values)
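RMSE is expressed in the same unit as the target, so with raw TTE labels it reads directly as seconds. As a reminder of how the metric is computed, here is a small standard-library sketch; the numbers are toy values, not taken from the evaluation set.

```python
import math


def rmse(predictions, targets) -> float:
    """Root mean squared error between two equal-length sequences of seconds."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Toy predictions and labels in seconds.
print(rmse([10.0, 50.0], [13.0, 46.0]))
```

Because errors are squared before averaging, a few segments with very long editing times can dominate the score.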

Training Details

Base Model: facebook/hf-seamless-m4t-medium
Model Type: seamless_lang_pairs
Epochs: 10
Batch Size (Train): 32
Batch Size (Eval): 64
Learning Rate: 1.2e-4
LR Scheduler: cosine_with_restarts
Warmup Ratio: 0.05
Weight Decay: 0.001
Optimizer: AdamW (torch)
Max Grad Norm: 1.0
FP16: True
Early Stopping Patience: 5
Audio Max Length: 8.0 seconds
Text Max Length: 256 tokens
Sample Rate: 16kHz
Translation Feature: Binary flag (0/1)
Language Pairs: 21 pairs + other
Language Pair Embedding: 64 dimensions
Normalization: None (raw values)
Dataset Split: 80/20 train/test
Random Seed: 42
Metric: RMSE (lower is better)
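For anyone reproducing the run, the hyperparameters above can be collected as keyword arguments. The key names below follow `transformers.TrainingArguments`; that the Hugging Face Trainer was used is an assumption suggested by values such as `cosine_with_restarts`, not something this card states. Whether the batch sizes are per device or global is also not stated.

```python
# Keyword names follow transformers.TrainingArguments (assumed training setup).
training_kwargs = {
    "num_train_epochs": 10,
    "per_device_train_batch_size": 32,  # assumption: listed size is per device
    "per_device_eval_batch_size": 64,
    "learning_rate": 1.2e-4,
    "lr_scheduler_type": "cosine_with_restarts",
    "warmup_ratio": 0.05,
    "weight_decay": 0.001,
    "optim": "adamw_torch",
    "max_grad_norm": 1.0,
    "fp16": True,
    "seed": 42,
}
early_stopping_patience = 5

# Possible use, given a GPU and an output directory of your choice:
#   TrainingArguments(output_dir="out", **training_kwargs)
#   EarlyStoppingCallback(early_stopping_patience=early_stopping_patience)
```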

Training Configuration

The model was trained with the following specifications:

Dataset: Multimodal audio-subtitle pairs with translation and language pair annotations (5 languages: EN, FR, ES, IT, DE with 21 pairs)
Train/Test Split: 80/20 with random seed 42
Audio Processing: 16kHz sampling, max 8.0 seconds, no offset
Text Processing: Max 256 tokens
Translation Feature: Binary flag indicating original vs translated content
Language Pairs: 21 translation pairs from 5 languages (EN, FR, ES, IT, DE) plus "other" category
Normalization: None (raw TTE values in seconds)
Caching: Audio segments cached and compressed for efficiency
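The audio limits above translate into a fixed sample budget: 8.0 seconds at 16kHz is 128,000 samples. The sketch below clips a waveform to that budget from the start of the segment ("no offset"). It is an assumption that the included data collator truncates this way; `clip_audio` is a hypothetical helper for illustration.

```python
import numpy as np

SAMPLE_RATE = 16_000
MAX_AUDIO_SEC = 8.0


def clip_audio(samples: np.ndarray, max_sec: float = MAX_AUDIO_SEC,
               sample_rate: int = SAMPLE_RATE) -> np.ndarray:
    """Keep at most max_sec seconds from the start of a 1-D waveform."""
    max_samples = int(max_sec * sample_rate)  # 8.0 s * 16000 Hz = 128000
    return samples[:max_samples]


long_clip = np.zeros(SAMPLE_RATE * 10)   # 10 seconds
short_clip = np.zeros(SAMPLE_RATE * 5)   # 5 seconds
print(clip_audio(long_clip).shape, clip_audio(short_clip).shape)
```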

Usage Notes

This variant conditions on both the translation flag and the language pair ID, in addition to audio and text
For simpler models, see seamless-basic (audio and text only) or seamless-translation (audio, text, and translation flag)
The model expects 16kHz audio input (automatically resampled by the data collator)
Both the translation flag and the language pair ID significantly impact predictions, so supply accurate values for each
Language pair embeddings capture language-specific temporal patterns
No feature normalization is applied; the model outputs raw TTE predictions in seconds
Optimized for fine-grained subtitle editing time estimation tasks

Limitations

Requires both translation and language pair annotations in training data
Language pair embeddings are dataset-specific (top 21 pairs from training)
Designed for TTE prediction, not general audio-text matching
Performance may vary on out-of-domain content and unseen language pairs
Requires specific data preprocessing (use included data collator)

Related Models

seamless-basic: Basic audio+text model without translation or language features
seamless-translation: Includes translation awareness but no language pair embeddings
seamless-crossattention: Advanced cross-modal attention mechanisms for sophisticated audio-text interactions
