utrobinmv committed on
Commit
952a7ff
·
1 Parent(s): 5d3dfc6

add readme

Files changed (1)
  1. README.md +113 -0
README.md ADDED
@@ -0,0 +1,113 @@
1
+ ---
2
+ language:
3
+ - ru
4
+ - zh
5
+ - en
6
+ tags:
7
+ - translation
8
+ license: apache-2.0
9
+ datasets:
10
+ - ccmatrix
11
+ metrics:
12
+ - sacrebleu
13
+ widget:
14
+ - example_title: translate zh-ru
15
+ text: >
16
+ translate to ru: 开发的目的是为用户提供个人同步翻译。
17
+ - example_title: translate ru-en
18
+ text: >
19
+ translate to en: Цель разработки — предоставить пользователям личного синхронного переводчика.
20
+ - example_title: translate en-ru
21
+ text: >
22
+ translate to ru: The purpose of the development is to provide users with a personal synchronized interpreter.
23
+ - example_title: translate en-zh
24
+ text: >
25
+ translate to zh: The purpose of the development is to provide users with a personal synchronized interpreter.
26
+ - example_title: translate zh-en
27
+ text: >
28
+ translate to en: 开发的目的是为用户提供个人同步解释器。
29
+ - example_title: translate ru-zh
30
+ text: >
31
+ translate to zh: Цель разработки — предоставить пользователям личного синхронного переводчика.
32
+ ---
33
+
34
+ # m2m English, Russian and Chinese multilingual machine translation
35
+
36
+ This model is a conventional M2M transformer that works in multitask mode, translating into whichever target language is requested. It is tuned specifically for the pairs ru-zh, zh-ru, en-zh, zh-en, en-ru, and ru-en.
37
+
38
+ The model translates directly between any pair of Russian, Chinese, and English. To select the target language, prefix the input with 'translate to <lang>:'. The source language does not need to be specified, and the source text may mix several languages.
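
The prefix convention described above can be sketched as a small helper. This is an illustration only: `build_prompt` and `SUPPORTED_TARGETS` are hypothetical names, not part of the model or of `transformers`, and the trailing space after the colon follows the usage examples in this README.

```python
SUPPORTED_TARGETS = ("ru", "zh", "en")  # target codes used in this README


def build_prompt(text: str, target_lang: str) -> str:
    """Prepend the 'translate to <lang>: ' prefix to the source text.

    Hypothetical helper for illustration; not part of the model's API.
    """
    if target_lang not in SUPPORTED_TARGETS:
        raise ValueError(f"unsupported target language: {target_lang!r}")
    return f"translate to {target_lang}: {text}"


# Only the target language is named; the source language is never passed in.
prompts = [build_prompt("Hello, world!", lang) for lang in ("ru", "zh")]
print(prompts)
```

The resulting strings are what would be handed to the tokenizer in place of a hand-written `prefix + text`.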
39
+
40
+
41
+
42
+ Fine-tuned from the base model: utrobinmv/m2m_translate_en_ru_zh_large_4096
43
+
44
+ This version of the model was trained on noisier data with a noise reduction function.
45
+ The model can also insert punctuation marks when they are missing from the source text, which is useful when translating the output of ASR models.
46
+
47
+ The model has learned to translate small Markdown files while preserving the markup and HTML tags.
48
+
49
+
50
+
51
+ Example: translating Russian to Chinese
52
+
53
+ ```python
54
+ from transformers import M2M100ForConditionalGeneration, AutoTokenizer
55
+
56
+ device = 'cuda'  # or 'cpu' to translate on the CPU
57
+
58
+ model_name = 'utrobinmv/m2m_translate_en_ru_zh_large_4096'
59
+ model = M2M100ForConditionalGeneration.from_pretrained(model_name)
60
+ model.eval()
61
+ model.to(device)
62
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
63
+
64
+ prefix = 'translate to zh: '
65
+ src_text = prefix + "Съешь ещё этих мягких французских булок."
66
+
67
+ # translate Russian to Chinese
68
+ input_ids = tokenizer(src_text, return_tensors="pt")
69
+
70
+ generated_tokens = model.generate(**input_ids.to(device))
71
+
72
+ result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
73
+ print(result)
74
+ # 再吃这些法国的甜蜜的面包。
75
+ ```
76
+
77
+
78
+
79
+ Example: translating Chinese to Russian
80
+
81
+ ```python
82
+ from transformers import M2M100ForConditionalGeneration, AutoTokenizer
83
+
84
+ device = 'cuda'  # or 'cpu' to translate on the CPU
85
+
86
+ model_name = 'utrobinmv/m2m_translate_en_ru_zh_large_4096'
87
+ model = M2M100ForConditionalGeneration.from_pretrained(model_name)
88
+ model.eval()
89
+ model.to(device)
90
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
91
+
92
+ prefix = 'translate to ru: '
93
+ src_text = prefix + "再吃这些法国的甜蜜的面包。"
94
+
95
+ # translate Chinese to Russian
96
+ input_ids = tokenizer(src_text, return_tensors="pt")
97
+
98
+ generated_tokens = model.generate(**input_ids.to(device))
99
+
100
+ result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
101
+ print(result)
102
+ # Съешьте этот сладкий хлеб из Франции.
103
+ ```
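
The two examples above differ only in the prefix, so the shared steps can be wrapped in one function. The following is a sketch rather than the author's code: `translate` is a hypothetical helper, and `padding=True` is an addition for batching several sentences that the examples above do not use. It assumes the tokenizer defines a padding token and that `model` and `tokenizer` were loaded as shown above.

```python
def translate(texts, target_lang, model, tokenizer, device="cpu"):
    """Translate a list of strings into target_lang ('ru', 'zh' or 'en').

    Hypothetical helper: `model` and `tokenizer` are expected to be the
    objects returned by `from_pretrained` in the examples above.
    """
    # Same prefix convention as in the single-sentence examples.
    prompts = [f"translate to {target_lang}: {text}" for text in texts]
    # padding=True lets sentences of different lengths share one batch.
    batch = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
    generated_tokens = model.generate(**batch)
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
```

With the model loaded, a call such as `translate(["Съешь ещё этих мягких французских булок."], "zh", model, tokenizer, device)` would return a list with one translated string per input.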
104
+
105
+
106
+
108
+
109
+
110
+
111
+ ## Languages covered
112
+
113
+ Russian (ru_RU), Chinese (zh_CN), English (en_US)