ModelSpace gpengzhi committed
Commit c7a0ecd · verified · 1 Parent(s): 88890b1

Update README.md (#2)


- Update README.md (77801cdae137de4d6460708c2bf94455742f5357)


Co-authored-by: Pengzhi Gao <[email protected]>

Files changed (1): README.md (+36, -10)
README.md CHANGED
@@ -9,6 +9,35 @@ base_model:
 - ModelSpace/GemmaX2-28-2B-Pretrain
 pipeline_tag: translation
 library_name: transformers
+language:
+- ar
+- bn
+- cs
+- de
+- en
+- es
+- fa
+- fr
+- he
+- hi
+- id
+- it
+- ja
+- km
+- ko
+- lo
+- ms
+- my
+- nl
+- pl
+- pt
+- ru
+- th
+- tl
+- tr
+- ur
+- vi
+- zh
 ---
 
 # Model Card for GemmaX2-28
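
The `language:` block added in this hunk is the Hub's standard YAML front matter, listing the 28 supported languages as ISO 639-1 codes so the model shows up under the Hub's language filters. As a minimal sketch of what the metadata enables (the `language` and `pipeline_tag` parameters of `list_models` are assumed from recent `huggingface_hub` releases, not from this commit):

```python
# Sketch only: assumes a recent huggingface_hub where list_models()
# accepts `language` (ISO 639-1 code) and `pipeline_tag` filters.
from huggingface_hub import HfApi

api = HfApi()
for m in api.list_models(language="km", pipeline_tag="translation", limit=5):
    print(m.id)  # translation models tagged for Khmer, e.g. GemmaX2-28
```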
@@ -23,7 +52,7 @@ GemmaX2-28-2B-v0.1 is the model version of GemmaX2-28-2B-Pretrain after SFT.
 
 - **Developed by:** Xiaomi
 - **Model type:** A 2B parameter model based on Gemma2; we obtained GemmaX2-28-2B-Pretrain by continuing pre-training on a large amount of monolingual and parallel data. Afterward, GemmaX2-28-2B-v0.1 was derived through supervised fine-tuning on a small set of high-quality instruction data.
-- **Language(s) (NLP):** Arabic, Bengali, Czech, German, English, Spanish, Persian, French, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Burmese, Dutch, polish, Portuguese, Russian, Thai, Tagalog, Turkish, Urdu, Vietnamese, Chinese.
+- **Language(s):** Arabic, Bengali, Czech, German, English, Spanish, Persian, French, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Burmese, Dutch, Polish, Portuguese, Russian, Thai, Tagalog, Turkish, Urdu, Vietnamese, Chinese.
 - **License:** gemma
 
 ### Model Source
@@ -34,11 +63,6 @@ GemmaX2-28-2B-v0.1 is the model version of GemmaX2-28-2B-Pretrain after SFT.
 
 ![Experimental Result](main.png)
 
-## Limitations
-
-GemmaX2-28-2B-v0.1 supports only the 28 most commonly used languages and does not guarantee powerful translation performance for other languages. Additionally, we will continue to improve GemmaX2-28-2B's translation performance, and future models will be released in due course.
-
-
 
 ## Run the model
 
@@ -57,9 +81,6 @@ outputs = model.generate(**inputs, max_new_tokens=50)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 
-### Training Data
-
-We collected monolingual data from [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) and [MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400). For parallel data, we collected all Chinese-centric and English-centric parallel datasets from the [OPUS](https://opus.nlpl.eu/) collection up to August 2024 and applied a series of filtering steps, such as language detection, semantic deduplication, quality filtering, and more.
 
 ## Citation
 
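The hunk above shows only the tail of the README's "Run the model" snippet. For reference, a self-contained version would look roughly like the sketch below; the loading lines and the "Translate this from X to Y" prompt template are assumptions based on the GemmaX2 paper and the standard transformers API, not lines copied from this diff.

```python
# Minimal sketch of the README's inference snippet; the prompt template
# is an assumption from the GemmaX2 paper, not part of this commit.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ModelSpace/GemmaX2-28-2B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

text = "Translate this from Chinese to English:\nChinese: 我爱机器翻译\nEnglish:"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```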
 
@@ -73,4 +94,9 @@ We collected monolingual data from [CulturaX](https://huggingface.co/datasets/uo
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2502.02481},
 }
-```
+```
+
+
+## Limitations
+
+GemmaX2-28-2B-v0.1 supports only the 28 most commonly used languages and does not guarantee powerful translation performance for other languages. Additionally, we will continue to improve GemmaX2-28-2B's translation performance, and future models will be released in due course.
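
The Training Data paragraph removed earlier in this diff cites language detection among the parallel-data filtering steps. Purely as an illustration of that one step, and not the authors' actual pipeline, a minimal filter using fastText's public lid.176 language-ID model could look like this:

```python
# Illustrative sketch only, not the authors' pipeline. Assumes fastText's
# public lid.176.bin language-ID model is present in the working directory.
import fasttext

lid = fasttext.load_model("lid.176.bin")

def keep_pair(src, tgt, src_lang, tgt_lang, threshold=0.9):
    """Keep a parallel pair only if both sides are detected as the expected language."""
    for text, lang in ((src, src_lang), (tgt, tgt_lang)):
        labels, probs = lid.predict(text.replace("\n", " "))  # predict() rejects newlines
        if labels[0] != f"__label__{lang}" or probs[0] < threshold:
            return False
    return True

print(keep_pair("我爱机器翻译", "I love machine translation.", "zh", "en"))
```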
 
 
 
 
 
 