license: apache-2.0
language:
- en
- zh
- ru
- uk
- cs
- ja
- ko
pipeline_tag: translation
Marco-MT-Algharb
This repository contains Algharb, the system submitted by the Marco Translation Team of Alibaba International Digital Commerce (AIDC) to the WMT 2025 General Machine Translation Shared Task.
Introduction
The Algharb system is a large translation model built on the Qwen3-14B foundation model. It is designed for high-quality translation across 13 language directions and demonstrates state-of-the-art performance. Our approach centers on a multi-stage refinement pipeline that systematically improves translation fluency and faithfulness.
Usage
The model expects a specific instruction format for translation. The following example shows how to construct the prompt and run generation with the vLLM library for efficient inference.
1. Dependencies
First, ensure you have the necessary libraries installed:
pip install torch transformers==4.55.0 vllm==0.10.0
2. Prompt Format and Decoding
The input text is formatted into a fixed prompt template, and the vLLM engine generates translations from it. For our hybrid decoding strategy, we generate multiple candidates (n > 1) per source for later re-ranking. The prompt template is:
"Human: Please translate the following text into {target_language}: \n{source_text}<|im_end|>\nAssistant:"
Here is a complete Python example:
from vllm import LLM, SamplingParams
# --- 1. Load the Model (vLLM loads the tokenizer along with it) ---
model_path = "path/to/your/algharb_model"
llm = LLM(model=model_path)
# --- 2. Define Source Text and Target Language ---
source_text = "This paper presents the Algharb system, our submission to the WMT 2025."
source_lang_code = "en_XX" # Not used in prompt, for tracking
target_lang_code = "zh_CN"
# Helper dictionary to map language codes to full names for the prompt
lang_name_map = {
    "zh_CN": "chinese",
    "ko_KR": "korean",
    "ja_JP": "japanese",
    "ar_EG": "arabic",
    "cs_CZ": "czech",
    "ru_RU": "russian",
    "uk_UA": "ukraine",
    "et_EE": "estonian",
    "bho_IN": "bhojpuri",
    "sr_Latn_RS": "serbian",
    "de_DE": "german"
}
target_language_name = lang_name_map.get(target_lang_code, "the target language")
# --- 3. Construct the Prompt ---
prompt = (
    f"Human: Please translate the following text into {target_language_name}: \n"
    f"{source_text}<|im_end|>\n"
    f"Assistant:"
)
prompts_to_generate = [prompt]
print("Formatted Prompt:\n", prompt)
# --- 4. Define Sampling Parameters ---
# n candidates are sampled per prompt so they can be re-ranked afterwards.
sampling_params = SamplingParams(
    n=100,
    temperature=1.0,
    top_p=1.0,
    max_tokens=512
)
# --- 5. Generate Translations ---
outputs = llm.generate(prompts_to_generate, sampling_params)
# --- 6. Process and Print Results ---
# The 'outputs' list contains one item for each prompt.
for output in outputs:
    prompt_used = output.prompt
    print(f"\n--- Candidates for source: '{source_text}' ---")
    # Each output object contains 'n' generated sequences.
    for i, candidate in enumerate(output.outputs):
        generated_text = candidate.text.strip()
        print(f"Candidate {i+1}: {generated_text}")
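Before re-ranking, the sampled candidates have to be written to disk. The sketch below is one possible way to do that, not part of the released system. It assumes the re-ranker expects one source sentence per line in the source file and the candidates for each source on consecutive lines in the samples file, with the same number of candidates for every source. The helper names `write_mbr_inputs` and `_one_line` are hypothetical.

```python
from pathlib import Path


def _one_line(text):
    # Collapse line breaks and repeated whitespace so the text fits on one line.
    return " ".join(text.split())


def write_mbr_inputs(sources, candidates_per_source, src_path, samples_path):
    """Write sources (one per line) and their candidates (consecutive lines).

    Assumed layout: the i-th block of `num_samples` lines in `samples_path`
    holds the candidates for the i-th line of `src_path`.
    """
    if not sources or len(sources) != len(candidates_per_source):
        raise ValueError("need one non-empty candidate list per source")

    num_samples = len(candidates_per_source[0])
    sample_lines = []
    for candidates in candidates_per_source:
        if len(candidates) != num_samples:
            raise ValueError("every source needs the same number of candidates")
        sample_lines.extend(_one_line(c) for c in candidates)

    Path(src_path).write_text(
        "".join(_one_line(s) + "\n" for s in sources), encoding="utf-8"
    )
    Path(samples_path).write_text(
        "".join(line + "\n" for line in sample_lines), encoding="utf-8"
    )
    return num_samples
```

With the vLLM results from the example, the candidate lists could be collected as `[[c.text for c in o.outputs] for o in outputs]` and passed in together with the list of source texts. The returned count is the value to give the re-ranker as its number of samples.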
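3. Apply MBR Decoding
Re-rank the sampled candidates with the comet-mbr command from the unbabel-comet package. Here src.txt holds the source sentences and mbr_sample_100.txt holds the 100 candidates sampled for each of them:
comet-mbr -s src.txt -t mbr_sample_100.txt -o mbr_trans.txt --num_samples 100 --gpus 1 --qe_model Unbabel/wmt22-cometkiwi-da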
Note: Word alignment for MBR reranking will be available soon.
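For readers unfamiliar with MBR decoding, the selection rule can be sketched in a few lines: each candidate is scored by its average utility against all other candidates, and the highest-scoring one is returned. The sketch below uses a toy token-overlap F1 as the utility purely for illustration. It is an assumption standing in for the COMET models used by the command above, and the function names `overlap_f1` and `mbr_select` are hypothetical.

```python
from collections import Counter


def overlap_f1(hypothesis, reference):
    """Toy utility: F1 over whitespace tokens (a stand-in for a COMET score)."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    common = sum((hyp & ref).values())
    if common == 0:
        return 0.0
    precision = common / sum(hyp.values())
    recall = common / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=overlap_f1):
    """Return the candidate with the highest average utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    best, best_score = None, float("-inf")
    for i, hyp in enumerate(candidates):
        # Treat every other candidate as a pseudo-reference for this one.
        others = [ref for j, ref in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / len(others)
        if score > best_score:
            best, best_score = hyp, score
    return best


samples = ["good morning everyone", "good morning everyone", "bad evening"]
print(mbr_select(samples))
```

The pairwise loop costs a number of utility calls quadratic in the number of candidates, which is why sample counts such as 100 per source are typically scored in batches on a GPU when the utility is a neural metric.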
License
This model is licensed under the Apache License, Version 2.0 (https://www.apache.org/licenses/LICENSE-2.0.txt, SPDX-License-Identifier: Apache-2.0).
Disclaimer
We used compliance-checking algorithms during training to ensure, to the best of our ability, that the trained model is compliant. Because the data is complex and language models are used in many different scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.