Seq2Seq German-English Translation Model

A sequence-to-sequence neural machine translation model that translates German text to English, built using PyTorch with LSTM encoder-decoder architecture.

Model Description

This model implements the classic seq2seq architecture from Sutskever et al. (2014) for German-English translation:

  • Encoder: 2-layer LSTM that processes German input sequences
  • Decoder: 2-layer LSTM that generates English output sequences
  • Training Strategy: Teacher forcing during training, autoregressive generation during inference
  • Vocabulary: 30k German words, 25k English words
  • Dataset: Trained on 2M sentence pairs from WMT19 (subset of full 35M dataset)

Model Architecture

German Input → Embedding → LSTM Encoder → Context Vector → LSTM Decoder → Embedding → English Output

Hyperparameters (a PyTorch sketch using these values follows the list):

  • Embedding size: 256
  • Hidden size: 512
  • LSTM layers: 2 (both encoder/decoder)
  • Dropout: 0.3
  • Batch size: 64
  • Learning rate: 0.0003
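
The snippet below is a minimal, illustrative sketch of how these pieces could fit together in PyTorch with the hyperparameters above. Class and variable names are illustrative, not the exact ones in this repository (see src/models/ for the real implementation), and padding_idx=0 is an assumption about the vocabularies.

import random

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=30_000, emb_size=256, hidden_size=512, num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx=0)  # assumption: <PAD> = 0
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers, batch_first=True, dropout=dropout)

    def forward(self, src):                          # src: (batch, src_len) of German token ids
        embedded = self.embedding(src)               # (batch, src_len, emb_size)
        _, (hidden, cell) = self.lstm(embedded)      # final states act as the context vector
        return hidden, cell

class Decoder(nn.Module):
    def __init__(self, vocab_size=25_000, emb_size=256, hidden_size=512, num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx=0)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers, batch_first=True, dropout=dropout)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, hidden, cell):          # token: (batch, 1), one step at a time
        embedded = self.embedding(token)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        return self.out(output.squeeze(1)), hidden, cell   # logits: (batch, vocab_size)

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        hidden, cell = self.encoder(src)
        token = trg[:, 0:1]                          # <START> token of the target sequence
        outputs = []
        for t in range(1, trg.size(1)):
            logits, hidden, cell = self.decoder(token, hidden, cell)
            outputs.append(logits)
            # Teacher forcing: sometimes feed the ground-truth word, otherwise the model's own guess.
            use_teacher = random.random() < teacher_forcing_ratio
            token = trg[:, t:t+1] if use_teacher else logits.argmax(dim=1, keepdim=True)
        return torch.stack(outputs, dim=1)           # (batch, trg_len - 1, vocab_size)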

Training Data

  • Dataset: WMT19 German-English Translation Task
  • Size: 2M sentence pairs (filtered subset)
  • Preprocessing: Sentences filtered by length (5-50 tokens)
  • Tokenization: Custom word-level tokenizer with special tokens (<PAD>, <UNK>, <START>, <END>)
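
As a rough illustration of the word-level scheme (the real tokenizers ship in the .pkl files; this sketch is not the repository's implementation):

from collections import Counter

class WordTokenizer:
    """Minimal word-level tokenizer with the special tokens listed above."""

    def __init__(self, max_vocab_size):
        self.specials = ["<PAD>", "<UNK>", "<START>", "<END>"]
        self.max_vocab_size = max_vocab_size
        self.word2id, self.id2word = {}, {}

    def build_vocab(self, sentences):
        counts = Counter(w for s in sentences for w in s.lower().split())
        keep = self.max_vocab_size - len(self.specials)
        vocab = self.specials + [w for w, _ in counts.most_common(keep)]
        self.word2id = {w: i for i, w in enumerate(vocab)}
        self.id2word = {i: w for w, i in self.word2id.items()}

    def encode(self, sentence):
        unk = self.word2id["<UNK>"]
        ids = [self.word2id.get(w, unk) for w in sentence.lower().split()]
        return [self.word2id["<START>"]] + ids + [self.word2id["<END>"]]

    def decode(self, ids):
        words = [self.id2word.get(i, "<UNK>") for i in ids]
        return " ".join(w for w in words if w not in ("<PAD>", "<START>", "<END>"))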

Performance

Training Results (5 epochs):

  • Initial Training Loss: 4.0949 → Final: 3.1843 (a drop of 0.91, roughly a 22% reduction)
  • Initial Validation Loss: 4.1918 → Final: 3.8537 (a drop of 0.34, roughly an 8% reduction)
  • Training Device: Apple Silicon (MPS)

Usage

Quick Start

# This is a custom PyTorch model, not a Transformers model
# Download the files and use with the provided inference script

import requests
from pathlib import Path

# Download model files
base_url = "https://huggingface.co/sumitdotml/seq2seq-de-en/resolve/main"
files = ["best_model.pt", "german_tokenizer.pkl", "english_tokenizer.pkl"]

for file in files:
    response = requests.get(f"{base_url}/{file}", timeout=60)
    response.raise_for_status()  # fail fast on HTTP errors instead of writing an error page to disk
    Path(file).write_bytes(response.content)
    print(f"Downloaded {file}")

Translation Examples

# Interactive mode
python inference.py --interactive

# Single translation
python inference.py --sentence "Hallo, wie geht es dir?" --verbose

# Demo mode
python inference.py

Example Translations:

  • "Das ist ein gutes Buch." β†’ "this is a good idea."
  • "Wo ist der Bahnhof?" β†’ "where is the <UNK>"
  • "Ich liebe Deutschland." β†’ "i share."

Files Included

  • best_model.pt: PyTorch model checkpoint (trained weights + architecture)
  • german_tokenizer.pkl: German vocabulary and tokenization logic
  • english_tokenizer.pkl: English vocabulary and tokenization logic
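
The files can be loaded with pickle and torch.load. Whether the checkpoint holds a full model object or a state_dict depends on how train.py saved it, so this sketch only loads and inspects the contents:

import pickle

import torch

# Tokenizers: pickled objects carrying the vocabularies and encode/decode logic.
with open("german_tokenizer.pkl", "rb") as f:
    german_tokenizer = pickle.load(f)
with open("english_tokenizer.pkl", "rb") as f:
    english_tokenizer = pickle.load(f)

# Checkpoint: load on CPU first, then move to "mps"/"cuda" as available.
# On newer PyTorch you may need torch.load(..., weights_only=False) if the
# checkpoint contains non-tensor Python objects.
checkpoint = torch.load("best_model.pt", map_location="cpu")

# Inspect the object to see whether it is a full nn.Module or a dict of weights/config.
print(type(checkpoint))
print(checkpoint.keys() if isinstance(checkpoint, dict) else checkpoint)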

Installation & Setup

  1. Clone the repository:

    git clone https://github.com/sumitdotml/seq2seq
    cd seq2seq
    
  2. Set up environment:

    uv venv && source .venv/bin/activate  # or python -m venv .venv
    uv pip install torch requests tqdm    # or pip install torch requests tqdm
    
  3. Download model:

    python scripts/download_pretrained.py
    
  4. Start translating:

    python scripts/inference.py --interactive
    

Model Architecture Details

The model uses a custom implementation with these components:

  • Encoder (src/models/encoder.py): LSTM-based encoder with embedding layer
  • Decoder (src/models/decoder.py): LSTM-based decoder with attention-free architecture
  • Seq2Seq (src/models/seq2seq.py): Main model combining encoder-decoder with generation logic
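
At inference time the decoder runs autoregressively: it is fed its own previous prediction until <END> is produced or a length limit is hit. A minimal greedy-decoding sketch, reusing the illustrative classes and tokenizer attributes from the sketches above (not the repository's exact API):

import torch

@torch.no_grad()
def translate(model, de_tok, en_tok, sentence, max_len=50, device="cpu"):
    model.eval()
    src = torch.tensor([de_tok.encode(sentence)], device=device)   # (1, src_len)
    hidden, cell = model.encoder(src)                              # context vector

    token = torch.tensor([[en_tok.word2id["<START>"]]], device=device)
    end_id = en_tok.word2id["<END>"]
    output_ids = []

    for _ in range(max_len):
        logits, hidden, cell = model.decoder(token, hidden, cell)
        next_id = logits.argmax(dim=1).item()                      # greedy: most likely next word
        if next_id == end_id:
            break
        output_ids.append(next_id)
        token = torch.tensor([[next_id]], device=device)

    return en_tok.decode(output_ids)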

Limitations

  • Vocabulary constraints: Limited to 30k German / 25k English words
  • Training data: Only 2M sentence pairs (vs 35M in full WMT19)
  • No attention mechanism: Basic encoder-decoder without attention
  • Simple tokenization: Word-level tokenization without subword units
  • Translation quality: Suitable for basic phrases, struggles with complex sentences

Training Details

Environment:

  • Framework: PyTorch 2.0+
  • Device: Apple Silicon (MPS acceleration)
  • Training length: 5 epochs
  • Validation strategy: Hold-out validation set

Optimization (a training-loop sketch follows the list):

  • Optimizer: Adam (lr=0.0003)
  • Loss function: CrossEntropyLoss (ignoring padding)
  • Gradient clipping: 1.0
  • Scheduler: StepLR (step_size=3, gamma=0.5)
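
A sketch of a training loop consistent with these settings (Adam at 3e-4, padding ignored in the loss, gradients clipped at 1.0, StepLR halving the learning rate every 3 epochs). model and train_loader are assumed to exist, and PAD_ID = 0 is an assumption about the vocabulary:

import torch
import torch.nn as nn

PAD_ID = 0  # assumption: <PAD> maps to index 0

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)   # padding positions contribute nothing to the loss
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

for epoch in range(5):
    model.train()
    for src, trg in train_loader:                      # batches of (German ids, English ids)
        src, trg = src.to(device), trg.to(device)
        logits = model(src, trg)                       # (batch, trg_len - 1, vocab) via teacher forcing
        loss = criterion(logits.reshape(-1, logits.size(-1)), trg[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping at 1.0
        optimizer.step()
    scheduler.step()                                   # halve the learning rate every 3 epochs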

Reproduce Training

# Full training pipeline
python scripts/data_preparation.py      # Download WMT19 data
python src/data/tokenization.py        # Build vocabularies  
python scripts/train.py                # Train model

# For full dataset training, modify data_preparation.py:
# use_full_dataset = True  # Line 133-134

Citation

If you use this model, please cite:

@misc{seq2seq-de-en,
  author = {sumitdotml},
  title = {German-English Seq2Seq Translation Model},
  year = {2025},
  url = {https://huggingface.co/sumitdotml/seq2seq-de-en},
  note = {PyTorch implementation of sequence-to-sequence translation}
}

References

  • Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems 27 (NeurIPS 2014). https://arxiv.org/abs/1409.3215

License

MIT License - See repository for full license text.

Contact

For questions about this model or training code, please open an issue in the GitHub repository.
