Model Name: Helsinki-NLP/opus-mt-synthetic-en-uk
Model Overview
This model is the synthetic baseline (transformer-base) for the English-Ukrainian language pair of our paper "Scaling Low-Resource MT via Synthetic Data Generation with LLMs". The training data was generated by forward-translating English Europarl with GPT-4o; the approach aims to improve MT performance for underrepresented languages by supplementing traditional datasets with high-quality, LLM-generated translations.
The goal of this model is to provide a baseline for MT tasks, demonstrating the potential of synthetic data to enhance translation capabilities for languages with limited existing resources.
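The full prompting and filtering pipeline is described in the paper; purely as an illustration, forward translation with an LLM can look like the sketch below. The openai client calls are real, but the prompt wording, settings, and helper function are assumptions, not the authors' code.

```python
# Illustrative sketch of forward translation with GPT-4o.
# The prompt and helper below are assumptions, not the paper's pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def forward_translate(english_sentences, target_language="Ukrainian"):
    """Turn monolingual English text into synthetic parallel data."""
    pairs = []
    for src in english_sentences:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": f"Translate the following sentence into {target_language}. "
                            "Reply with the translation only."},
                {"role": "user", "content": src},
            ],
        )
        pairs.append((src, response.choices[0].message.content.strip()))
    return pairs

# Each (source, translation) pair becomes one synthetic training example.
pairs = forward_translate(["The committee approved the proposal."])
```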
For the detailed methodology, see the full paper: https://arxiv.org/abs/2505.14423.
Supported Language Pair:
- English ↔ Ukrainian
Evaluation
The quality of the generated synthetic data was evaluated using both automatic metrics (such as COMET and ChrF) and human evaluations. The evaluation shows that the synthetic data generally performs well for low-resource languages, with significant gains observed when using the data in downstream MT training. Below are the evaluation results on FLORES+:
| Language Pair | ChrF Score | COMET Score |
|---|---|---|
| English ↔ Basque | 53.00 | 81.51 |
| English ↔ Scottish Gaelic | 51.10 | 78.04 |
| English ↔ Icelandic | 49.91 | 80.16 |
| English ↔ Georgian | 49.49 | 80.72 |
| English ↔ Macedonian | 57.72 | 82.24 |
| English ↔ Somali | 45.10 | 78.15 |
| English ↔ Ukrainian | 51.71 | 78.89 |
The results demonstrate that synthetic data provides strong baseline performance across all language pairs, with the highest ChrF and COMET scores for Macedonian, which is relatively less low-resource than the other languages in the set.
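The card does not pin down the exact evaluation scripts; as a rough guide, scores like those above can be computed with sacrebleu (ChrF) and Unbabel's COMET toolkit. A minimal sketch follows, in which the file names and the wmt22-comet-da checkpoint are assumptions rather than the paper's setup.

```python
# Rough sketch: scoring system output with ChrF and COMET.
# File names and the COMET checkpoint are assumptions, not the paper's setup.
import sacrebleu
from comet import download_model, load_from_checkpoint

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

sources    = read_lines("flores_devtest.en")  # English source sentences
hypotheses = read_lines("system_output.uk")   # model translations
references = read_lines("flores_devtest.uk")  # Ukrainian references

# ChrF via sacrebleu; it expects a list of reference streams, hence the nesting.
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"ChrF: {chrf.score:.2f}")

# Reference-based COMET; wmt22-comet-da is a common checkpoint choice.
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
result = comet_model.predict(data, batch_size=8, gpus=0)
print(f"COMET: {100 * result.system_score:.2f}")  # scaled to match the 0-100 table values
```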
Usage
You can use this model to generate translations with the following code:
```python
from transformers import MarianMTModel, MarianTokenizer

# Load the pre-trained model and tokenizer
model_name = "Helsinki-NLP/opus-mt-synthetic-en-uk"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Example source texts (English)
source_texts = ["Hello, how are you?", "Good morning!", "What is your name?"]

# Tokenize the input texts
inputs = tokenizer(source_texts, return_tensors="pt", padding=True, truncation=True)

# Generate translations; passing the full inputs keeps the attention mask,
# so padded positions in the batch are ignored
translated_ids = model.generate(**inputs)

# Decode the generated tokens to get the translated text
translated_texts = tokenizer.batch_decode(translated_ids, skip_special_tokens=True)

# Print the translations
for src, tgt in zip(source_texts, translated_texts):
    print(f"Source: {src} => Translated: {tgt}")
```
For the given English sentences, the output might look something like this:
```
Source: Hello, how are you? => Translated: Привіт, як справи?
Source: Good morning! => Translated: Доброго ранку!
Source: What is your name? => Translated: Яке ваше ім'я?
```
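If you prefer the high-level API, the standard transformers translation pipeline should also work with this checkpoint (a convenience sketch, not part of the original card):

```python
# Convenience alternative: the high-level transformers translation pipeline.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-synthetic-en-uk")
for out in translator(["Hello, how are you?", "Good morning!"]):
    print(out["translation_text"])
```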
Citation Information
```bibtex
@article{degibert2025scaling,
  title={Scaling Low-Resource MT via Synthetic Data Generation with LLMs},
  author={de Gibert, Ona and Attieh, Joseph and Vahtola, Teemu and Aulamo, Mikko and Li, Zihao and V{\'a}zquez, Ra{\'u}l and Hu, Tiancheng and Tiedemann, J{\"o}rg},
  journal={arXiv preprint arXiv:2505.14423},
  year={2025}
}
```