AIDC-AI/Marco-MT-Algharb

怀羽 committed on Sep 23

Commit

41f2d92

1 Parent(s): 9858d70

update repo

Files changed (2)

README.md +89 -22
marco_mt_label.png +0 -0

README.md CHANGED

@@ -26,10 +26,28 @@ This repository contains the system for Algharb, the submission from the Marco T
 The Algharb system is a large translation model built on the Qwen3-14B foundation model. It is designed for high-quality translation across 13 diverse language directions and demonstrates state-of-the-art performance. Our approach centers on a multi-stage refinement pipeline that systematically improves translation fluency and faithfulness.
 ## Usage
 The model expects a specific instruction format for translation. The following example shows how to construct the prompt and run generation with the vLLM library for efficient inference.
 ### 1. Dependencies
 First, ensure you have the necessary libraries installed:
@@ -51,33 +69,85 @@ Here is a complete Python example:
 ```python
 from vllm import LLM, SamplingParams
-# --- 1. Load Model and Tokenizer ---
 model_path = "path/to/your/algharb_model"
 llm = LLM(model=model_path)
-# --- 2. Define Source Text and Target Language ---
 source_text = "This paper presents the Algharb system, our submission to the WMT 2025."
-source_lang_code = "en_XX" # Not used in prompt, for tracking
-target_lang_code = "zh_CN"
-# Helper dictionary to map language codes to full names for the prompt
 lang_name_map = {
-    "zh_CN": "chinese",
-    "ko_KR": "korean",
-    "ja_JP": "japanese",
-    "ar_EG": "arabic",
-    "cs_CZ": "czech",
-    "ru_RU": "russian",
-    "uk_UA": "ukraine",
-    "et_EE": "estonian",
-    "bho_IN": "bhojpuri",
-    "sr_Latn_RS": "serbian",
-    "de_DE": "german"
 }
 target_language_name = lang_name_map.get(target_lang_code, "the target language")
-# --- 3. Construct the Prompt ---
 prompt = (
     f"Human: Please translate the following text into {target_language_name}: \n"
     f"{source_text}<|im_end|>\n"
@@ -89,15 +159,13 @@ print("Formatted Prompt:\n", prompt)
 sampling_params = SamplingParams(
     n=100,
-    temperature=1.0,
-    top_p=1.0,
     max_tokens=512
 )
-# --- 5. Generate Translations ---
 outputs = llm.generate(prompts_to_generate, sampling_params)
-# --- 6. Process and Print Results ---
 # The 'outputs' list contains one item for each prompt.
 for output in outputs:
     prompt_used = output.prompt
@@ -109,7 +177,6 @@ for output in outputs:
         print(f"Candidate {i+1}: {generated_text}")
 ```
-### 3. Apply MBR decoding
 ```bash
 comet-mbr -s src.txt -t mbr_sample_100.txt -o mbr_trans.txt --num_samples 100 --gpus 1 --qe_model Unbabel/wmt22-cometkiwi-da
 ```

 The Algharb system is a large translation model built on the Qwen3-14B foundation model. It is designed for high-quality translation across 13 diverse language directions and demonstrates state-of-the-art performance. Our approach centers on a multi-stage refinement pipeline that systematically improves translation fluency and faithfulness.
+Supported language pairs:
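+| Language pair | Direction | Chinese name |
+|---|---|---|
+| en2zh | English to Chinese | 英语到中文 |
+| en2ja | English to Japanese | 英语到日语 |
+| en2ko | English to Korean | 英语到韩语 |
+| en2ar | English to Arabic | 英语到阿拉伯语 |
+| en2et | English to Estonian | 英语到爱沙尼亚语 |
+| en2sr_latin | English to Serbian (Latin script) | 英语到塞尔维亚语(拉丁化) |
+| en2ru | English to Russian | 英语到俄语 |
+| en2uk | English to Ukrainian | 英语到乌克兰语 |
+| en2cs | English to Czech | 英语到捷克语 |
+| en2bho | English to Bhojpuri | 英语到博杰普尔语 |
+| cs2uk | Czech to Ukrainian | 捷克语到乌克兰语 |
+| cs2de | Czech to German | 捷克语到德语 |
+| ja2zh | Japanese to Chinese | 日语到中文 |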
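
The pair codes in the table can be split into a source and a target code and fed into the prompt template used in this README's usage example. The following is a minimal sketch, assuming every pair follows the `src2tgt` naming shown in the table; `split_pair` and `build_prompt` are hypothetical helper names, not part of the repository, and the language names mirror the `lang_name_map` from the README.

```python
from __future__ import annotations

# Language names as they appear in the README's lang_name_map.
LANG_NAMES = {
    "en": "english", "zh": "chinese", "ko": "korean", "ja": "japanese",
    "ar": "arabic", "cs": "czech", "ru": "russian", "uk": "ukraine",
    "et": "estonian", "bho": "bhojpuri", "sr_latin": "serbian", "de": "german",
}


def split_pair(pair: str) -> tuple[str, str]:
    """Split a pair code such as 'en2sr_latin' into ('en', 'sr_latin')."""
    src, sep, tgt = pair.partition("2")
    if not sep or src not in LANG_NAMES or tgt not in LANG_NAMES:
        raise ValueError(f"unrecognized language pair: {pair!r}")
    return src, tgt


def build_prompt(source_text: str, target_lang_code: str) -> str:
    """Format the instruction prompt exactly as in the usage example."""
    target_language_name = LANG_NAMES[target_lang_code]
    return (
        f"Human: Please translate the following text into {target_language_name}: \n"
        f"{source_text}<|im_end|>\n"
        f"Assistant:"
    )
```

Unlike the README example, which falls back to "the target language" for unknown codes, this sketch raises an error so that a mistyped pair is caught before generation.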
 ## Usage
 The model expects a specific instruction format for translation. The following example shows how to construct the prompt and run generation with the vLLM library for efficient inference.
 ### 1. Dependencies
 First, ensure you have the necessary libraries installed:
 ```python
 from vllm import LLM, SamplingParams
+model_path = "path/to/your/algharb_model"
+llm = LLM(model=model_path)
+source_text = "This paper presents the Algharb system, our submission to the WMT 2025."
+source_lang_code = "en"
+target_lang_code = "zh"
+lang_name_map = {
+    "en": "english",
+    "zh": "chinese",
+    "ko": "korean",
+    "ja": "japanese",
+    "ar": "arabic",
+    "cs": "czech",
+    "ru": "russian",
+    "uk": "ukraine",
+    "et": "estonian",
+    "bho": "bhojpuri",
+    "sr_latin": "serbian",
+    "de": "german",
+}
+target_language_name = lang_name_map.get(target_lang_code, "the target language")
+prompt = (
+    f"Human: Please translate the following text into {target_language_name}: \n"
+    f"{source_text}<|im_end|>\n"
+    f"Assistant:"
+)
+prompts_to_generate = [prompt]
+print("Formatted Prompt:\n", prompt)
+sampling_params = SamplingParams(
+    n=1,
+    temperature=0.001,
+    top_p=0.001,
+    max_tokens=512
+)
+outputs = llm.generate(prompts_to_generate, sampling_params)
+for output in outputs:
+    generated_text = output.outputs[0].text.strip()
+    print(f"translation: {generated_text}")
+```
+## Apply MBR decoding
+First, generate candidate translations with random sampling:
+```python
+from vllm import LLM, SamplingParams
 model_path = "path/to/your/algharb_model"
 llm = LLM(model=model_path)
 source_text = "This paper presents the Algharb system, our submission to the WMT 2025."
+source_lang_code = "en"
+target_lang_code = "zh"
 lang_name_map = {
+    "en": "english",
+    "zh": "chinese",
+    "ko": "korean",
+    "ja": "japanese",
+    "ar": "arabic",
+    "cs": "czech",
+    "ru": "russian",
+    "uk": "ukraine",
+    "et": "estonian",
+    "bho": "bhojpuri",
+    "sr_latin": "serbian",
+    "de": "german",
 }
 target_language_name = lang_name_map.get(target_lang_code, "the target language")
 prompt = (
     f"Human: Please translate the following text into {target_language_name}: \n"
     f"{source_text}<|im_end|>\n"
 sampling_params = SamplingParams(
     n=100,
+    temperature=1,
+    top_p=1,
     max_tokens=512
 )
 outputs = llm.generate(prompts_to_generate, sampling_params)
 # The 'outputs' list contains one item for each prompt.
 for output in outputs:
     prompt_used = output.prompt
         print(f"Candidate {i+1}: {generated_text}")
 ```
 ```bash
 comet-mbr -s src.txt -t mbr_sample_100.txt -o mbr_trans.txt --num_samples 100 --gpus 1 --qe_model Unbabel/wmt22-cometkiwi-da
 ```
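
The sampling script prints its candidates, while the `comet-mbr` command reads them from `src.txt` and `mbr_sample_100.txt`. The following is a minimal sketch of writing those two files, under the assumption that `comet-mbr` expects `--num_samples` consecutive candidate lines for each source line; `write_mbr_inputs` is a hypothetical helper, not part of the repository or of COMET.

```python
from pathlib import Path
from typing import List, Sequence


def write_mbr_inputs(
    sources: Sequence[str],
    candidates: Sequence[Sequence[str]],
    src_path: str,
    cand_path: str,
    num_samples: int,
) -> None:
    """Write one source per line and num_samples candidate lines per source."""
    if len(sources) != len(candidates):
        raise ValueError("need exactly one candidate list per source")

    def one_line(text: str) -> str:
        # Collapse whitespace runs (including newlines) into single spaces
        # so every source and candidate occupies exactly one line.
        return " ".join(text.split())

    cand_lines: List[str] = []
    for cands in candidates:
        if len(cands) != num_samples:
            raise ValueError(f"expected {num_samples} candidates, got {len(cands)}")
        cand_lines.extend(one_line(c) for c in cands)

    src_lines = [one_line(s) for s in sources]
    Path(src_path).write_text("\n".join(src_lines) + "\n", encoding="utf-8")
    Path(cand_path).write_text("\n".join(cand_lines) + "\n", encoding="utf-8")
```

With vLLM, the candidate lists could be collected as `[[c.text for c in out.outputs] for out in outputs]` and passed in with `num_samples=100` to match the command above.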

marco_mt_label.png CHANGED