hwang233 committed
Commit c3322cf · verified · 1 Parent(s): 8cfd81e

Update README.md

Files changed (1): README.md +108 -1

README.md CHANGED
@@ -10,4 +10,111 @@ language:
   - ko
   pipeline_tag: translation
   ---
- new repo
+
+ # Marco-MT-Algharb
+
+ This repository contains the system description paper for Algharb, the submission from the Marco Translation Team of Alibaba International Digital Commerce (AIDC) to the WMT 2025 General Machine Translation Shared Task.
+
+ ## Introduction
+
+ The Algharb system is a large translation model built on the Qwen3-14B foundation. It is designed for high-quality translation across 13 diverse language directions and demonstrates state-of-the-art performance. Our approach centers on a multi-stage refinement pipeline that systematically enhances translation fluency and faithfulness. In the WMT 2025 evaluation, Algharb significantly outperformed strong proprietary models such as GPT-4o and Claude 3.7 Sonnet, achieving the top score in every submitted language pair.
+
+ ## System Architecture & Methodology
+
+ The core of Algharb is its progressive training and decoding pipeline, which comprises three key stages:
+
+ 1. **Two-Step Supervised Fine-Tuning (SFT):** We first fine-tune the model on high-quality, rigorously cleaned parallel data. We then use data distillation, leveraging a powerful teacher model (DeepSeek-V3) to regenerate and learn from the data that was initially filtered out, expanding data coverage without sacrificing quality (see the first sketch after this list).
+
+ 2. **Two-Step Reinforcement Learning (RL):** To align the model with human preferences, we first apply Contrastive Preference Optimization (CPO). We then introduce a novel dynamic multi-reward optimization method that combines external quality metrics with the model's own reward signal, allowing it to internalize the principles of high-quality translation (see the second sketch after this list).
+
+ 3. **Hybrid Decoding Strategy:** To mitigate common omission errors, we developed a decoding algorithm that integrates a word-alignment-based penalty into the Minimum Bayes Risk (MBR) re-ranking framework. This ensures the final output is not only fluent but also lexically faithful to the source text (a sketch of this step closes the Usage section below).
+
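+ The sketch below illustrates the data flow of the two-step SFT stage under stated assumptions: `quality_score`, `teacher_translate`, and the `threshold` value are hypothetical stand-ins, since the paper's exact filtering criteria and teacher prompting are not reproduced here.
+
+ ```python
+ # Hypothetical sketch of the two-step SFT data flow: keep parallel pairs
+ # that pass a quality filter, and have a teacher model (e.g., DeepSeek-V3)
+ # re-translate the sources of the pairs that were filtered out.
+ def build_sft_corpus(pairs, quality_score, teacher_translate, threshold=0.8):
+     corpus = []
+     for src, tgt in pairs:
+         if quality_score(src, tgt) >= threshold:
+             corpus.append((src, tgt))  # clean pair: keep as-is
+         else:
+             # regenerate the target instead of discarding the pair
+             corpus.append((src, teacher_translate(src)))
+     return corpus
+ ```
+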
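+ For the first RL step, a minimal sketch of the CPO objective (Xu et al., 2024) follows; the `beta` value is illustrative, and the dynamic multi-reward extension described above is not modeled here.
+
+ ```python
+ import torch.nn.functional as F
+
+ def cpo_loss(logp_chosen, logp_rejected, beta=0.1):
+     """logp_chosen / logp_rejected: policy log-probs of the preferred
+     and dispreferred translations (1-D tensors over a batch)."""
+     # Preference term: push the preferred translation above the rejected one.
+     pref = -F.logsigmoid(beta * (logp_chosen - logp_rejected)).mean()
+     # NLL regularizer on the preferred translation keeps generation fluent.
+     nll = -logp_chosen.mean()
+     return pref + nll
+ ```
+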
+ ## Usage
+
+ The model expects a specific instruction format for translation. The following example demonstrates how to construct the prompt and perform generation using the vllm library for efficient inference.
+
+ ### 1. Dependencies
+
+ First, ensure you have the necessary libraries installed:
+
+ ```bash
+ pip install torch transformers vllm
+ ```
+
+ ### 2. Prompt Format and Decoding
+
+ The core of the process involves formatting the input text into a specific prompt template and then using the vllm engine to generate translations. For our hybrid decoding strategy, we generate multiple candidates (n > 1) for later re-ranking. The prompt template is:
+
+ ```python
+ f"Human: Please translate the following text into {target_language}: \n{source_text}<|im_end|>\nAssistant:"
+ ```
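+
+ For example, with `target_language` set to "chinese", the rendered prompt for the English sentence used below is:
+
+ ```text
+ Human: Please translate the following text into chinese: 
+ This paper presents the Algharb system, our submission to the WMT 2025.<|im_end|>
+ Assistant:
+ ```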
+
+ Here is a complete Python example:
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ # --- 1. Load Model and Tokenizer ---
+ # Replace with the actual path to your fine-tuned Algharb model
+ model_path = "path/to/your/algharb_model"
+ llm = LLM(model=model_path)
+
+ # --- 2. Define Source Text and Target Language ---
+ source_text = "This paper presents the Algharb system, our submission to the WMT 2025."
+ source_lang_code = "en_XX"  # Not used in the prompt; kept for tracking
+ target_lang_code = "zh_CN"
+
+ # Helper dictionary to map language codes to full names for the prompt
+ lang_name_map = {
+     "zh_CN": "chinese",
+     "ko_KR": "korean",
+     "ja_JP": "japanese",
+     "ar_EG": "arabic",  # Note: the paper uses 'arz'; this might need adjustment
+     "cs_CZ": "czech",
+     "ru_RU": "russian",
+     "uk_UA": "ukrainian",
+     "et_EE": "estonian",
+     "bho_IN": "bhojpuri",
+     "sr_Latn_RS": "serbian",
+     "de_DE": "german"
+ }
+
+ target_language_name = lang_name_map.get(target_lang_code, "the target language")
+
+ # --- 3. Construct the Prompt ---
+ prompt = (
+     f"Human: Please translate the following text into {target_language_name}: \n"
+     f"{source_text}<|im_end|>\n"
+     f"Assistant:"
+ )
+
+ prompts_to_generate = [prompt]
+ print("Formatted Prompt:\n", prompt)
+
+ # --- 4. Configure Sampling Parameters for MBR ---
+ # We generate n candidates for our hybrid MBR decoding;
+ # temperature=1.0 yields diverse samples for re-ranking.
+ sampling_params = SamplingParams(
+     n=10,            # number of candidate translations to generate
+     temperature=1.0,
+     top_p=1.0,
+     max_tokens=512   # adjust as needed
+ )
+
+ # --- 5. Generate Translations ---
+ outputs = llm.generate(prompts_to_generate, sampling_params)
+
+ # --- 6. Process and Print Results ---
+ # The 'outputs' list contains one item for each prompt.
+ for output in outputs:
+     print(f"\n--- Candidates for source: '{source_text}' ---")
+
+     # Each output object contains 'n' generated sequences.
+     for i, candidate in enumerate(output.outputs):
+         generated_text = candidate.text.strip()
+         print(f"Candidate {i+1}: {generated_text}")
+
+ # The generated candidates can now be passed to the
+ # hybrid MBR re-ranking process described in the paper.
+ ```
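+
+ As a final step, the candidates can be re-ranked. The sketch below is an illustrative version of MBR re-ranking with an omission penalty: the token-overlap `utility`, the dictionary-based `omission_penalty` (with its `lexicon` argument), and the `alpha` weight are hypothetical stand-ins for the learned quality metric and word aligner used in the actual system.
+
+ ```python
+ # Illustrative MBR re-ranking with an omission penalty; the utility and
+ # penalty below are simple stand-ins, not the paper's exact components.
+ def utility(hyp: str, ref: str) -> float:
+     """Token-overlap F1 as a cheap proxy for a learned quality metric."""
+     h, r = set(hyp.split()), set(ref.split())
+     if not h or not r:
+         return 0.0
+     overlap = len(h & r)
+     p, q = overlap / len(h), overlap / len(r)
+     return 2 * p * q / (p + q) if p + q else 0.0
+
+ def omission_penalty(source: str, hyp: str, lexicon: dict) -> float:
+     """Fraction of covered source words whose dictionary translation is
+     missing from the hypothesis; `lexicon` stands in for a word aligner."""
+     words = [w for w in source.split() if w in lexicon]
+     if not words:
+         return 0.0
+     missing = sum(1 for w in words if lexicon[w] not in hyp)
+     return missing / len(words)
+
+ def mbr_rerank(candidates, source, lexicon, alpha=0.5):
+     """Pick the candidate with the best average utility against all
+     other candidates, minus the weighted omission penalty."""
+     best_idx, best_score = 0, float("-inf")
+     for i, hyp in enumerate(candidates):
+         others = [c for j, c in enumerate(candidates) if j != i]
+         mbr = sum(utility(hyp, o) for o in others) / max(len(others), 1)
+         score = mbr - alpha * omission_penalty(source, hyp, lexicon)
+         if score > best_score:
+             best_idx, best_score = i, score
+     return candidates[best_idx]
+
+ # e.g., with the candidates generated above (empty lexicon = no penalty):
+ # best = mbr_rerank([c.text.strip() for c in outputs[0].outputs],
+ #                   source_text, lexicon={})
+ ```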