Model Name: Helsinki-NLP/opus-mt-synthetic-en-uk
Model Overview
This model is the synthetic baseline (transformer-base) for the English-Ukrainian language pair of our paper "Scaling Low-Resource MT via Synthetic Data Generation with LLMs". The training data was generated by forward-translating English Europarl with GPT-4o; the approach aims to improve MT performance for underrepresented languages by supplementing traditional datasets with high-quality, LLM-generated translations.
The goal of this model is to provide a baseline for MT tasks, demonstrating the potential of synthetic data to enhance translation capabilities for languages with limited existing resources.
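The full prompting and filtering pipeline is described in the paper; purely as an illustration, forward translation with an LLM can look like the sketch below. The openai client calls are real, but the prompt wording, settings, and helper function are assumptions, not the authors' code.

```python
# Illustrative sketch of forward translation with GPT-4o.
# The prompt and helper below are assumptions, not the paper's pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def forward_translate(english_sentences, target_language="Ukrainian"):
    """Turn monolingual English text into synthetic parallel data."""
    pairs = []
    for src in english_sentences:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": f"Translate the following sentence into {target_language}. "
                            "Reply with the translation only."},
                {"role": "user", "content": src},
            ],
        )
        pairs.append((src, response.choices[0].message.content.strip()))
    return pairs

# Each (source, translation) pair becomes one synthetic training example.
pairs = forward_translate(["The committee approved the proposal."])
```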
For the detailed methodology, see the full paper: https://arxiv.org/abs/2505.14423.
Supported Language Pair:
- English ↔ Ukrainian
Evaluation
The quality of the generated synthetic data was evaluated using both automatic metrics (such as COMET and ChrF) and human evaluations. The evaluation shows that the synthetic data generally performs well for low-resource languages, with significant gains observed when using the data in downstream MT training. Below are the evaluation results on FLORES+:
| Language Pair | ChrF Score | COMET Score |
|---|---|---|
| English ↔ Basque | 53.00 | 81.51 |
| English ↔ Scottish Gaelic | 51.10 | 78.04 |
| English ↔ Icelandic | 49.91 | 80.16 |
| English ↔ Georgian | 49.49 | 80.72 |
| English ↔ Macedonian | 57.72 | 82.24 |
| English ↔ Somali | 45.10 | 78.15 |
| English ↔ Ukrainian | 51.71 | 78.89 |
The results demonstrate that synthetic data provides strong baseline performance across all language pairs, with the highest ChrF and COMET scores for Macedonian, which is relatively less low-resource than the other languages in the set.
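The card does not pin down the exact evaluation scripts; as a rough guide, scores like those above can be computed with sacrebleu (ChrF) and Unbabel's COMET toolkit. A minimal sketch follows, in which the file names and the wmt22-comet-da checkpoint are assumptions rather than the paper's setup.

```python
# Rough sketch: scoring system output with ChrF and COMET.
# File names and the COMET checkpoint are assumptions, not the paper's setup.
import sacrebleu
from comet import download_model, load_from_checkpoint

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

sources    = read_lines("flores_devtest.en")  # English source sentences
hypotheses = read_lines("system_output.uk")   # model translations
references = read_lines("flores_devtest.uk")  # Ukrainian references

# ChrF via sacrebleu; it expects a list of reference streams, hence the nesting.
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"ChrF: {chrf.score:.2f}")

# Reference-based COMET; wmt22-comet-da is a common checkpoint choice.
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
result = comet_model.predict(data, batch_size=8, gpus=0)
print(f"COMET: {100 * result.system_score:.2f}")  # scaled to match the 0-100 table values
```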
Usage
You can use this model to generate translations with the following code:
```python
from transformers import MarianMTModel, MarianTokenizer

# Load the pre-trained model and tokenizer
model_name = "Helsinki-NLP/opus-mt-synthetic-en-uk"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Example source texts (English)
source_texts = ["Hello, how are you?", "Good morning!", "What is your name?"]

# Tokenize the input texts
inputs = tokenizer(source_texts, return_tensors="pt", padding=True, truncation=True)

# Generate translations; passing the full inputs keeps the attention mask,
# so padded positions in the batch are ignored
translated_ids = model.generate(**inputs)

# Decode the generated tokens to get the translated text
translated_texts = tokenizer.batch_decode(translated_ids, skip_special_tokens=True)

# Print the translations
for src, tgt in zip(source_texts, translated_texts):
    print(f"Source: {src} => Translated: {tgt}")
```
For the given English sentences, the output might look something like this:
```
Source: Hello, how are you? => Translated: Привіт, як справи?
Source: Good morning! => Translated: Доброго ранку!
Source: What is your name? => Translated: Яке ваше ім'я?
```
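If you prefer the high-level API, the standard transformers translation pipeline should also work with this checkpoint (a convenience sketch, not part of the original card):

```python
# Convenience alternative: the high-level transformers translation pipeline.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-synthetic-en-uk")
for out in translator(["Hello, how are you?", "Good morning!"]):
    print(out["translation_text"])
```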
Citation Information
```bibtex
@article{degibert2025scaling,
  title={Scaling Low-Resource MT via Synthetic Data Generation with LLMs},
  author={de Gibert, Ona and Attieh, Joseph and Vahtola, Teemu and Aulamo, Mikko and Li, Zihao and V{\'a}zquez, Ra{\'u}l and Hu, Tiancheng and Tiedemann, J{\"o}rg},
  journal={arXiv preprint arXiv:2505.14423},
  year={2025}
}
```