Seq2Seq German-English Translation Model

A sequence-to-sequence neural machine translation model that translates German text to English, built using PyTorch with LSTM encoder-decoder architecture.

Model Description

This model implements the classic seq2seq architecture from Sutskever et al. (2014) for German-English translation:

  • Encoder: 2-layer LSTM that processes German input sequences
  • Decoder: 2-layer LSTM that generates English output sequences
  • Training Strategy: Teacher forcing during training, autoregressive generation during inference
  • Vocabulary: 30k German words, 25k English words
  • Dataset: Trained on 2M sentence pairs from WMT19 (subset of full 35M dataset)

Model Architecture

German Input → Embedding → LSTM Encoder → Context Vector → LSTM Decoder → Embedding → English Output

Hyperparameters (a PyTorch sketch using these values follows the list):

  • Embedding size: 256
  • Hidden size: 512
  • LSTM layers: 2 (both encoder/decoder)
  • Dropout: 0.3
  • Batch size: 64
  • Learning rate: 0.0003
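
The snippet below is a minimal, illustrative sketch of how these pieces could fit together in PyTorch with the hyperparameters above. Class and variable names are illustrative, not the exact ones in this repository (see src/models/ for the real implementation), and padding_idx=0 is an assumption about the vocabularies.

import random

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=30_000, emb_size=256, hidden_size=512, num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx=0)  # assumption: <PAD> = 0
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers, batch_first=True, dropout=dropout)

    def forward(self, src):                          # src: (batch, src_len) of German token ids
        embedded = self.embedding(src)               # (batch, src_len, emb_size)
        _, (hidden, cell) = self.lstm(embedded)      # final states act as the context vector
        return hidden, cell

class Decoder(nn.Module):
    def __init__(self, vocab_size=25_000, emb_size=256, hidden_size=512, num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx=0)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers, batch_first=True, dropout=dropout)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, hidden, cell):          # token: (batch, 1), one step at a time
        embedded = self.embedding(token)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        return self.out(output.squeeze(1)), hidden, cell   # logits: (batch, vocab_size)

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        hidden, cell = self.encoder(src)
        token = trg[:, 0:1]                          # <START> token of the target sequence
        outputs = []
        for t in range(1, trg.size(1)):
            logits, hidden, cell = self.decoder(token, hidden, cell)
            outputs.append(logits)
            # Teacher forcing: sometimes feed the ground-truth word, otherwise the model's own guess.
            use_teacher = random.random() < teacher_forcing_ratio
            token = trg[:, t:t+1] if use_teacher else logits.argmax(dim=1, keepdim=True)
        return torch.stack(outputs, dim=1)           # (batch, trg_len - 1, vocab_size)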

Training Data

  • Dataset: WMT19 German-English Translation Task
  • Size: 2M sentence pairs (filtered subset)
  • Preprocessing: Sentences filtered by length (5-50 tokens)
  • Tokenization: Custom word-level tokenizer with special tokens (<PAD>, <UNK>, <START>, <END>)
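
As a rough illustration of the word-level scheme (the real tokenizers ship in the .pkl files; this sketch is not the repository's implementation):

from collections import Counter

class WordTokenizer:
    """Minimal word-level tokenizer with the special tokens listed above."""

    def __init__(self, max_vocab_size):
        self.specials = ["<PAD>", "<UNK>", "<START>", "<END>"]
        self.max_vocab_size = max_vocab_size
        self.word2id, self.id2word = {}, {}

    def build_vocab(self, sentences):
        counts = Counter(w for s in sentences for w in s.lower().split())
        keep = self.max_vocab_size - len(self.specials)
        vocab = self.specials + [w for w, _ in counts.most_common(keep)]
        self.word2id = {w: i for i, w in enumerate(vocab)}
        self.id2word = {i: w for w, i in self.word2id.items()}

    def encode(self, sentence):
        unk = self.word2id["<UNK>"]
        ids = [self.word2id.get(w, unk) for w in sentence.lower().split()]
        return [self.word2id["<START>"]] + ids + [self.word2id["<END>"]]

    def decode(self, ids):
        words = [self.id2word.get(i, "<UNK>") for i in ids]
        return " ".join(w for w in words if w not in ("<PAD>", "<START>", "<END>"))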

Performance

Training Results (5 epochs):

  • Initial Training Loss: 4.0949 → Final: 3.1843 (a drop of 0.91, roughly a 22% reduction)
  • Initial Validation Loss: 4.1918 → Final: 3.8537 (a drop of 0.34, roughly an 8% reduction)
  • Training Device: Apple Silicon (MPS)

Usage

Quick Start

# This is a custom PyTorch model, not a Transformers model
# Download the files and use with the provided inference script

import requests
from pathlib import Path

# Download model files
base_url = "https://huggingface.co/sumitdotml/seq2seq-de-en/resolve/main"
files = ["best_model.pt", "german_tokenizer.pkl", "english_tokenizer.pkl"]

for file in files:
    response = requests.get(f"{base_url}/{file}", timeout=60)
    response.raise_for_status()  # fail fast on HTTP errors instead of writing an error page to disk
    Path(file).write_bytes(response.content)
    print(f"Downloaded {file}")

Translation Examples

# Interactive mode
python inference.py --interactive

# Single translation
python inference.py --sentence "Hallo, wie geht es dir?" --verbose

# Demo mode
python inference.py

Example Translations:

  • "Das ist ein gutes Buch." β†’ "this is a good idea."
  • "Wo ist der Bahnhof?" β†’ "where is the <UNK>"
  • "Ich liebe Deutschland." β†’ "i share."

Files Included

  • best_model.pt: PyTorch model checkpoint (trained weights + architecture)
  • german_tokenizer.pkl: German vocabulary and tokenization logic
  • english_tokenizer.pkl: English vocabulary and tokenization logic
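
The files can be loaded with pickle and torch.load. Whether the checkpoint holds a full model object or a state_dict depends on how train.py saved it, so this sketch only loads and inspects the contents:

import pickle

import torch

# Tokenizers: pickled objects carrying the vocabularies and encode/decode logic.
with open("german_tokenizer.pkl", "rb") as f:
    german_tokenizer = pickle.load(f)
with open("english_tokenizer.pkl", "rb") as f:
    english_tokenizer = pickle.load(f)

# Checkpoint: load on CPU first, then move to "mps"/"cuda" as available.
# On newer PyTorch you may need torch.load(..., weights_only=False) if the
# checkpoint contains non-tensor Python objects.
checkpoint = torch.load("best_model.pt", map_location="cpu")

# Inspect the object to see whether it is a full nn.Module or a dict of weights/config.
print(type(checkpoint))
print(checkpoint.keys() if isinstance(checkpoint, dict) else checkpoint)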

Installation & Setup

  1. Clone the repository:

    git clone https://github.com/sumitdotml/seq2seq
    cd seq2seq
    
  2. Set up environment:

    uv venv && source .venv/bin/activate  # or python -m venv .venv
    uv pip install torch requests tqdm    # or pip install torch requests tqdm
    
  3. Download model:

    python scripts/download_pretrained.py
    
  4. Start translating:

    python scripts/inference.py --interactive
    

Model Architecture Details

The model uses a custom implementation with these components:

  • Encoder (src/models/encoder.py): LSTM-based encoder with embedding layer
  • Decoder (src/models/decoder.py): LSTM-based decoder with attention-free architecture
  • Seq2Seq (src/models/seq2seq.py): Main model combining encoder-decoder with generation logic
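
At inference time the decoder runs autoregressively: it is fed its own previous prediction until <END> is produced or a length limit is hit. A minimal greedy-decoding sketch, reusing the illustrative classes and tokenizer attributes from the sketches above (not the repository's exact API):

import torch

@torch.no_grad()
def translate(model, de_tok, en_tok, sentence, max_len=50, device="cpu"):
    model.eval()
    src = torch.tensor([de_tok.encode(sentence)], device=device)   # (1, src_len)
    hidden, cell = model.encoder(src)                              # context vector

    token = torch.tensor([[en_tok.word2id["<START>"]]], device=device)
    end_id = en_tok.word2id["<END>"]
    output_ids = []

    for _ in range(max_len):
        logits, hidden, cell = model.decoder(token, hidden, cell)
        next_id = logits.argmax(dim=1).item()                      # greedy: most likely next word
        if next_id == end_id:
            break
        output_ids.append(next_id)
        token = torch.tensor([[next_id]], device=device)

    return en_tok.decode(output_ids)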

Limitations

  • Vocabulary constraints: Limited to 30k German / 25k English words
  • Training data: Only 2M sentence pairs (vs 35M in full WMT19)
  • No attention mechanism: Basic encoder-decoder without attention
  • Simple tokenization: Word-level tokenization without subword units
  • Translation quality: Suitable for basic phrases, struggles with complex sentences

Training Details

Environment:

  • Framework: PyTorch 2.0+
  • Device: Apple Silicon (MPS acceleration)
  • Training length: 5 epochs
  • Validation strategy: Hold-out validation set

Optimization (a training-loop sketch follows the list):

  • Optimizer: Adam (lr=0.0003)
  • Loss function: CrossEntropyLoss (ignoring padding)
  • Gradient clipping: 1.0
  • Scheduler: StepLR (step_size=3, gamma=0.5)
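
A sketch of a training loop consistent with these settings (Adam at 3e-4, padding ignored in the loss, gradients clipped at 1.0, StepLR halving the learning rate every 3 epochs). model and train_loader are assumed to exist, and PAD_ID = 0 is an assumption about the vocabulary:

import torch
import torch.nn as nn

PAD_ID = 0  # assumption: <PAD> maps to index 0

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)   # padding positions contribute nothing to the loss
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

for epoch in range(5):
    model.train()
    for src, trg in train_loader:                      # batches of (German ids, English ids)
        src, trg = src.to(device), trg.to(device)
        logits = model(src, trg)                       # (batch, trg_len - 1, vocab) via teacher forcing
        loss = criterion(logits.reshape(-1, logits.size(-1)), trg[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping at 1.0
        optimizer.step()
    scheduler.step()                                   # halve the learning rate every 3 epochs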

Reproduce Training

# Full training pipeline
python scripts/data_preparation.py      # Download WMT19 data
python src/data/tokenization.py        # Build vocabularies  
python scripts/train.py                # Train model

# For full dataset training, modify data_preparation.py:
# use_full_dataset = True  # Line 133-134

Citation

If you use this model, please cite:

@misc{seq2seq-de-en,
  author = {sumitdotml},
  title = {German-English Seq2Seq Translation Model},
  year = {2025},
  url = {https://huggingface.co/sumitdotml/seq2seq-de-en},
  note = {PyTorch implementation of sequence-to-sequence translation}
}

References

  • Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems 27 (NeurIPS 2014). https://arxiv.org/abs/1409.3215

License

MIT License - See repository for full license text.

Contact

For questions about this model or training code, please open an issue in the GitHub repository.
