---
language:
- ko
pipeline_tag: translation
---

# Marco-MT-Algharb

This repository contains the system description paper for Algharb, the submission from the Marco Translation Team of Alibaba International Digital Commerce (AIDC) to the WMT 2025 General Machine Translation Shared Task.

## Introduction

The Algharb system is a large translation model built on the Qwen3-14B foundation model. It is designed for high-quality translation across 13 diverse language directions and demonstrates state-of-the-art performance. Our approach centers on a multi-stage refinement pipeline that systematically enhances translation fluency and faithfulness. In the WMT 2025 evaluation, Algharb significantly outperformed strong proprietary models such as GPT-4o and Claude 3.7 Sonnet, achieving the top score in every submitted language pair.

## System Architecture & Methodology

The core of Algharb is its progressive training and decoding pipeline, which includes three key stages:

1. **Two-Step Supervised Fine-Tuning (SFT):** We first fine-tune the model on high-quality, rigorously cleaned parallel data. We then use data distillation, leveraging a powerful teacher model (DeepSeek-V3) to regenerate and learn from the data that was initially filtered out, expanding data coverage without sacrificing quality.

2. **Two-Step Reinforcement Learning (RL):** To align the model with human preferences, we first apply Contrastive Preference Optimization (CPO; a schematic form of its objective is shown after this list). We then introduce a novel dynamic multi-reward optimization method that combines external quality metrics with the model's own reward signal, allowing it to internalize the principles of high-quality translation.

3. **Hybrid Decoding Strategy:** To mitigate common omission errors, we developed a decoding algorithm that integrates a word-alignment-based penalty into the Minimum Bayes Risk (MBR) re-ranking framework. This ensures the final output is not only fluent but also lexically faithful to the source text. A minimal sketch of this re-ranking step appears at the end of the Usage section.
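
For orientation, CPO as introduced by Xu et al. (2024) pairs a reference-free preference term with a likelihood term on the preferred translation. The form below follows that paper and is not an Algharb-specific formulation:

$$
\mathcal{L}_{\text{CPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[\log \sigma\big(\beta \log \pi_\theta(y_w \mid x) - \beta \log \pi_\theta(y_l \mid x)\big) + \log \pi_\theta(y_w \mid x)\Big]
$$

where $x$ is the source sentence, $y_w$ and $y_l$ are the preferred and dispreferred translations, $\sigma$ is the sigmoid function, and $\beta$ is a scaling hyperparameter.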

## Usage

The model expects a specific instruction format for translation. The following example demonstrates how to construct the prompt and perform generation using the vLLM library for efficient inference.

### 1. Dependencies

First, ensure you have the necessary libraries installed:

```bash
pip install torch transformers vllm
```

### 2. Prompt Format and Decoding

The core of the process involves formatting the input text into a specific prompt template and then using the vLLM engine to generate translations. For our hybrid decoding strategy, we generate multiple candidates (n > 1) for later re-ranking.

The prompt template is:

```python
f"Human: Please translate the following text into {target_language}: \n{source_text}<|im_end|>\nAssistant:"
```
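
For instance, with the target language name "chinese" and the sample sentence from the complete example below, the filled-in prompt is:

```text
Human: Please translate the following text into chinese: 
This paper presents the Algharb system, our submission to the WMT 2025.<|im_end|>
Assistant:
```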

Here is a complete Python example:

```python
from vllm import LLM, SamplingParams

# --- 1. Load Model and Tokenizer ---
# Replace with the actual path to your fine-tuned Algharb model
model_path = "path/to/your/algharb_model"
llm = LLM(model=model_path)

# --- 2. Define Source Text and Target Language ---
source_text = "This paper presents the Algharb system, our submission to the WMT 2025."
source_lang_code = "en_XX"  # Not used in the prompt; kept for bookkeeping
target_lang_code = "zh_CN"

# Helper dictionary mapping language codes to the full names used in the prompt
lang_name_map = {
    "zh_CN": "chinese",
    "ko_KR": "korean",
    "ja_JP": "japanese",
    "ar_EG": "arabic",  # Note: the paper uses 'arz'; this might need adjustment
    "cs_CZ": "czech",
    "ru_RU": "russian",
    "uk_UA": "ukrainian",
    "et_EE": "estonian",
    "bho_IN": "bhojpuri",
    "sr_Latn_RS": "serbian",
    "de_DE": "german",
}

target_language_name = lang_name_map.get(target_lang_code, "the target language")

# --- 3. Construct the Prompt ---
prompt = (
    f"Human: Please translate the following text into {target_language_name}: \n"
    f"{source_text}<|im_end|>\n"
    f"Assistant:"
)

prompts_to_generate = [prompt]
print("Formatted Prompt:\n", prompt)

# --- 4. Configure Sampling Parameters for MBR ---
# We generate n candidates for our hybrid MBR decoding,
# sampling with temperature=1.0 for diversity.
sampling_params = SamplingParams(
    n=10,            # Number of candidate translations to generate
    temperature=1.0,
    top_p=1.0,
    max_tokens=512,  # Adjust as needed
)

# --- 5. Generate Translations ---
outputs = llm.generate(prompts_to_generate, sampling_params)

# --- 6. Process and Print Results ---
# The 'outputs' list contains one item per prompt,
# and each item holds the n generated candidates.
for output in outputs:
    print(f"\n--- Candidates for source: '{source_text}' ---")
    for i, candidate in enumerate(output.outputs):
        generated_text = candidate.text.strip()
        print(f"Candidate {i+1}: {generated_text}")

# The generated candidates can now be passed to the hybrid MBR
# re-ranking process described in the paper.
```
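
The re-ranking step itself is not included in this repository. As a rough orientation only, the sketch below shows the general shape of penalized MBR selection: each candidate is scored by its average utility against the other candidates, minus a weighted omission penalty. The token-level F1 utility, the `uncovered_fraction` inputs (assumed to be precomputed by an external word aligner), and the weight `lambda_penalty` are all illustrative stand-ins, not the paper's actual metric, aligner, or weighting.

```python
from collections import Counter

def token_f1(hyp: str, ref: str) -> float:
    """Toy pairwise utility; the paper's pipeline would use a stronger metric here."""
    hyp_toks, ref_toks = hyp.split(), ref.split()
    overlap = sum((Counter(hyp_toks) & Counter(ref_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_toks)
    recall = overlap / len(ref_toks)
    return 2 * precision * recall / (precision + recall)

def mbr_rerank(candidates: list[str],
               uncovered_fraction: list[float],
               lambda_penalty: float = 0.5) -> str:
    """Select the candidate maximizing average pairwise utility minus an
    alignment-based omission penalty. uncovered_fraction[i] is the share of
    source words left unaligned in candidate i, assumed to come from an
    external word aligner."""
    best, best_score = candidates[0], float("-inf")
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        utility = sum(token_f1(hyp, o) for o in others) / max(len(others), 1)
        score = utility - lambda_penalty * uncovered_fraction[i]
        if score > best_score:
            best, best_score = hyp, score
    return best

# Usage with the candidates generated above (alignment penalties are dummies here):
# candidates = [c.text.strip() for c in outputs[0].outputs]
# best = mbr_rerank(candidates, uncovered_fraction=[0.0] * len(candidates))
```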