怀羽 committed

Commit 41f2d92 · 1 Parent(s): 9858d70

update repo

Files changed (2):
  1. README.md +89 -22
  2. marco_mt_label.png +0 -0
README.md CHANGED
 

The Algharb system is a large translation model built on the Qwen3-14B foundation model. It is designed for high-quality translation across 13 diverse language directions and demonstrates state-of-the-art performance. Our approach is centered on a multi-stage refinement pipeline that systematically enhances translation fluency and faithfulness.

Supported language pairs:

| Language pair | Direction |
|---|---|
| en2zh | English to Chinese |
| en2ja | English to Japanese |
| en2ko | English to Korean |
| en2ar | English to Arabic |
| en2et | English to Estonian |
| en2sr_latin | English to Serbian (Latin script) |
| en2ru | English to Russian |
| en2uk | English to Ukrainian |
| en2cs | English to Czech |
| en2bho | English to Bhojpuri |
| cs2uk | Czech to Ukrainian |
| cs2de | Czech to German |
| ja2zh | Japanese to Chinese |

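The pair identifiers in the table follow a `src2tgt` naming convention. As a quick illustration (the `split_pair` helper is ours, not part of the repo), the source and target codes can be recovered by splitting on the first `2`:

```python
def split_pair(pair: str) -> tuple[str, str]:
    """Split a pair id like 'en2zh' or 'en2sr_latin' into
    (source, target) language codes at the first '2'."""
    src, tgt = pair.split("2", 1)
    return src, tgt

print(split_pair("en2zh"))        # ('en', 'zh')
print(split_pair("en2sr_latin"))  # ('en', 'sr_latin')
```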
## Usage

The model expects a specific instruction format for translation. The following example demonstrates how to construct the prompt and perform generation with the `vllm` library for efficient inference.

### 1. Dependencies

First, ensure you have the necessary libraries installed:
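A minimal requirements sketch for the examples below, assuming the `comet-mbr` CLI ships with the `unbabel-comet` package:

```
vllm
unbabel-comet
```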
 
Here is a complete Python example:

```python
from vllm import LLM, SamplingParams

# --- 1. Load the model ---
model_path = "path/to/your/algharb_model"
llm = LLM(model=model_path)

# --- 2. Define the source text and target language ---
source_text = "This paper presents the Algharb system, our submission to the WMT 2025."
source_lang_code = "en"  # not used in the prompt; kept for tracking
target_lang_code = "zh"

# Map language codes to the full names used in the prompt
lang_name_map = {
    "en": "english",
    "zh": "chinese",
    "ko": "korean",
    "ja": "japanese",
    "ar": "arabic",
    "cs": "czech",
    "ru": "russian",
    "uk": "ukraine",
    "et": "estonian",
    "bho": "bhojpuri",
    "sr_latin": "serbian",
    "de": "german",
}

target_language_name = lang_name_map.get(target_lang_code, "the target language")

# --- 3. Construct the prompt ---
prompt = (
    f"Human: Please translate the following text into {target_language_name}: \n"
    f"{source_text}<|im_end|>\n"
    f"Assistant:"
)

prompts_to_generate = [prompt]
print("Formatted Prompt:\n", prompt)

# --- 4. Near-greedy sampling for a single best translation ---
sampling_params = SamplingParams(
    n=1,
    temperature=0.001,
    top_p=0.001,
    max_tokens=512
)

# --- 5. Generate and print the translation ---
outputs = llm.generate(prompts_to_generate, sampling_params)

for output in outputs:
    generated_text = output.outputs[0].text.strip()
    print(f"translation: {generated_text}")
```
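For translating many sentences in one batch, the prompt construction above can be factored into a small helper (`build_prompt` is our name, not part of the repo); `vllm` will then process the whole prompt list efficiently:

```python
def build_prompt(source_text: str, target_language_name: str) -> str:
    """Build the instruction prompt for one source sentence,
    using the same format as the example above."""
    return (
        f"Human: Please translate the following text into {target_language_name}: \n"
        f"{source_text}<|im_end|>\n"
        f"Assistant:"
    )

sentences = ["Hello, world.", "How are you?"]
prompts = [build_prompt(s, "chinese") for s in sentences]
print(prompts[0])
```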

## Apply MBR decoding

First, run random sampling to generate 100 candidate translations per source sentence:

```python
from vllm import LLM, SamplingParams

model_path = "path/to/your/algharb_model"
llm = LLM(model=model_path)

source_text = "This paper presents the Algharb system, our submission to the WMT 2025."
source_lang_code = "en"  # not used in the prompt; kept for tracking
target_lang_code = "zh"

lang_name_map = {
    "en": "english",
    "zh": "chinese",
    "ko": "korean",
    "ja": "japanese",
    "ar": "arabic",
    "cs": "czech",
    "ru": "russian",
    "uk": "ukraine",
    "et": "estonian",
    "bho": "bhojpuri",
    "sr_latin": "serbian",
    "de": "german",
}

target_language_name = lang_name_map.get(target_lang_code, "the target language")

prompt = (
    f"Human: Please translate the following text into {target_language_name}: \n"
    f"{source_text}<|im_end|>\n"
    f"Assistant:"
)

prompts_to_generate = [prompt]
print("Formatted Prompt:\n", prompt)

# Sample 100 diverse candidates per prompt
sampling_params = SamplingParams(
    n=100,
    temperature=1,
    top_p=1,
    max_tokens=512
)

outputs = llm.generate(prompts_to_generate, sampling_params)

# The 'outputs' list contains one item for each prompt.
for output in outputs:
    prompt_used = output.prompt
    for i, candidate in enumerate(output.outputs):
        generated_text = candidate.text.strip()
        print(f"Candidate {i+1}: {generated_text}")
```
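`comet-mbr` reads one source sentence per line from `src.txt` and, with `--num_samples 100`, expects the 100 candidates for each source as consecutive lines of `mbr_sample_100.txt`. A minimal sketch for flattening the `vllm` outputs into that layout (`write_mbr_files` is our helper name, and the file layout is our assumption about the CLI):

```python
def write_mbr_files(sources, outputs,
                    src_path="src.txt", cand_path="mbr_sample_100.txt"):
    """Write one source per line, and all sampled candidates for each
    source as consecutive lines. Newlines inside texts are replaced
    with spaces so the line counts stay aligned."""
    with open(src_path, "w", encoding="utf-8") as f_src, \
         open(cand_path, "w", encoding="utf-8") as f_cand:
        for src, output in zip(sources, outputs):
            f_src.write(src.replace("\n", " ") + "\n")
            for cand in output.outputs:
                f_cand.write(cand.text.strip().replace("\n", " ") + "\n")
```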

Then select the final translation from the sampled candidates with COMET-based MBR decoding:

```bash
comet-mbr -s src.txt -t mbr_sample_100.txt -o mbr_trans.txt --num_samples 100 --gpus 1 --qe_model Unbabel/wmt22-cometkiwi-da
```
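For intuition, MBR decoding picks the candidate with the highest expected utility against all other candidates; `comet-mbr` scores that utility with COMET-family metrics. Here is a toy sketch with a token-overlap utility standing in for COMET (function names and the metric are ours, for illustration only):

```python
def mbr_select(candidates, utility):
    """Return the candidate maximizing average utility vs. the others."""
    best, best_score = None, float("-inf")
    for hyp in candidates:
        score = sum(utility(hyp, ref) for ref in candidates if ref is not hyp)
        score /= max(len(candidates) - 1, 1)
        if score > best_score:
            best, best_score = hyp, score
    return best

def overlap(a, b):
    # Toy utility: token-set F1 (a crude stand-in for COMET).
    sa, sb = set(a.split()), set(b.split())
    if not sa or not sb:
        return 0.0
    inter = len(sa & sb)
    p, r = inter / len(sa), inter / len(sb)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

cands = ["the cat sat", "a cat sat down", "dogs run fast"]
print(mbr_select(cands, overlap))  # → 'the cat sat'
```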
marco_mt_label.png CHANGED