ModelSpace committed on
Commit 731c350 · verified · 1 Parent(s): cdb5e2e

Update README.md

Files changed (1)
  1. README.md +9 -12
README.md CHANGED
@@ -16,13 +16,12 @@ library_name: transformers
16
17   ### Model Description
18
19 - GemmaX2-28-9B-Pretrain is a language model that results from continual pretraining of Gemma2-9B on a mix of 56 billion tokens of monolingual and parallel data in 28 different languages: Arabic, Bengali, Czech, German, English, Spanish, Persian, French, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Burmese, Dutch, Polish, Portuguese, Russian, Thai, Tagalog, Turkish, Urdu, Vietnamese, Chinese.
20
21 - GemmaX2-28-9B-v0.1 is the first model in the series. Compared to the current open-source state-of-the-art (SOTA) models, it achieves optimal translation performance across 28 languages, even reaching performance comparable to GPT-4 and Google Translate, indicating it has achieved translation capabilities on par with industry standards.
22
23   - **Developed by:** Xiaomi
24 - - **Model type:** A 9B parameter model based on Gemma2. We obtained GemmaX2-28-9B-Pretrain by continuing pre-training on a large amount of monolingual and parallel data. Afterward, GemmaX2-28-9B-v0.1 was derived through supervised fine-tuning on a small set of high-quality instruction data.
25 - - **Language(s) (NLP):** Arabic, Bengali, Czech, German, English, Spanish, Persian, French, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Burmese, Dutch, Polish, Portuguese, Russian, Thai, Tagalog, Turkish, Urdu, Vietnamese, Chinese.
26   - **License:** gemma
27
28   ### Model Source
  ### Model Source
@@ -33,10 +32,6 @@ GemmaX2-28-9B-v0.1 is the first model in the series. Compared to the current ope
33
34   ![Experimental Result](main.png)
35
36 - ## Limitations
37 -
38 - GemmaX2-28-9B-v0.1 supports only the 28 most commonly used languages and does not guarantee strong translation performance for other languages. Additionally, we will continue to improve GemmaX2-28-9B's translation performance, and future models will be released in due course.
39 -
40
41
42   ## Run the model
@@ -56,9 +51,6 @@ outputs = model.generate(**inputs, max_new_tokens=50)
56   print(tokenizer.decode(outputs[0], skip_special_tokens=True))
57   ```
58
59 - ### Training Data
60 -
61 - We collected monolingual data from [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) and [MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400). For parallel data, we collected all Chinese-centric and English-centric parallel datasets from the [OPUS](https://opus.nlpl.eu/) collection up to August 2024 and applied a series of filtering steps, such as language detection, semantic duplication filtering, and quality filtering.
62
63   ## Citation
64
@@ -72,4 +64,9 @@ We collected monolingual data from [CulturaX](https://huggingface.co/datasets/uo
72   primaryClass={cs.CL},
73   url={https://arxiv.org/abs/2502.02481},
74   }
75 - ```

16
17   ### Model Description
18
19 + GemmaX2-28-9B-v0.1 is an LLM-based translation model. It was finetuned from GemmaX2-28-9B-Pretrain, a language model developed through continual pretraining of Gemma2-9B on a mix of 56 billion tokens of monolingual and parallel data across 28 languages. For more details, see our paper: [Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study](https://arxiv.org/abs/2502.02481).
20
21
22   - **Developed by:** Xiaomi
23 + - **Model type:** GemmaX2-28-9B-Pretrain is obtained by continually pretraining Gemma2-9B on a large amount of monolingual and parallel data. Subsequently, GemmaX2-28-9B-v0.1 is derived through supervised finetuning on a small set of high-quality translation instruction data.
24 + - **Languages:** Arabic, Bengali, Czech, German, English, Spanish, Persian, French, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Burmese, Dutch, Polish, Portuguese, Russian, Thai, Tagalog, Turkish, Urdu, Vietnamese, Chinese.
25   - **License:** gemma
26
27   ### Model Source
 
32
33   ![Experimental Result](main.png)
34
35
36
37   ## Run the model
 
51   print(tokenizer.decode(outputs[0], skip_special_tokens=True))
52   ```
53
54
55   ## Citation
56
 
64   primaryClass={cs.CL},
65   url={https://arxiv.org/abs/2502.02481},
66   }
67 + ```
68 +
69 +
70 + ## Limitations
71 +
72 + GemmaX2-28-9B-v0.1 supports only the 28 most commonly used languages and does not guarantee strong translation performance for other languages. Additionally, we will continue to improve GemmaX2-28-9B's translation performance, and future models will be released in due course.
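
Note on the Limitations paragraph: since the model is stated to support only the 28 listed languages, a caller could check the language pair before building a request. The sketch below is a minimal illustration and is not part of this commit or the README. `SUPPORTED_LANGUAGES` copies the language list from the model card; the function `build_translation_prompt` and its prompt template are hypothetical and may differ from the format the model was actually finetuned on.

```python
# Languages copied from the model card's "Languages" field (28 entries).
SUPPORTED_LANGUAGES = frozenset({
    "Arabic", "Bengali", "Czech", "German", "English", "Spanish", "Persian",
    "French", "Hebrew", "Hindi", "Indonesian", "Italian", "Japanese", "Khmer",
    "Korean", "Lao", "Malay", "Burmese", "Dutch", "Polish", "Portuguese",
    "Russian", "Thai", "Tagalog", "Turkish", "Urdu", "Vietnamese", "Chinese",
})


def build_translation_prompt(text: str, source: str, target: str) -> str:
    """Build a translation prompt, rejecting unsupported languages.

    Hypothetical helper: the template below is an assumption made for
    illustration, not a format confirmed by the model card.
    """
    for language in (source, target):
        if language not in SUPPORTED_LANGUAGES:
            raise ValueError(
                f"{language!r} is not one of the 28 supported languages"
            )
    # Assumed template: instruction line, source text, then an open target slot.
    return f"Translate this from {source} to {target}:\n{source}: {text}\n{target}:"
```

The returned string could then be passed to the tokenizer in the README's "Run the model" snippet. Failing early with `ValueError` keeps an unsupported pair from silently producing a low-quality translation.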