radinplaid committed · Commit 385fd2e · verified · 1 Parent(s): 602a850

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,3 +1,108 @@
- ---
- license: cc-by-4.0
- ---
+ ---
+ language:
+ - en
+ - is
+ tags:
+ - translation
+ license: cc-by-4.0
+ datasets:
+ - quickmt/quickmt-train.is-en
+ model-index:
+ - name: quickmt-is-en
+   results:
+   - task:
+       name: Translation isl-eng
+       type: translation
+       args: isl-eng
+     dataset:
+       name: flores101-devtest
+       type: flores_101
+       args: isl_Latn eng_Latn devtest
+     metrics:
+     - name: BLEU
+       type: bleu
+       value: 34.76
+     - name: CHRF
+       type: chrf
+       value: 60.13
+     - name: COMET
+       type: comet
+       value: 85.39
+ ---
+
+
+ # `quickmt-is-en` Neural Machine Translation Model
+
+ `quickmt-is-en` is a reasonably fast and reasonably accurate neural machine translation model for translating Icelandic (`is`) into English (`en`).
+
+
+ ## Try it on our Hugging Face Space
+
+ Give it a try before downloading: https://huggingface.co/spaces/quickmt/QuickMT-Demo
+
+
+ ## Model Information
+
+ * Trained using [`eole`](https://github.com/eole-nlp/eole)
+ * 200M-parameter transformer 'big' with 8 encoder layers and 2 decoder layers
+ * Separate 32k SentencePiece vocabularies for source and target
+ * Exported to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format for fast inference
+ * The PyTorch model (for use with [`eole`](https://github.com/eole-nlp/eole)) is available in the `eole-model` folder of this repository
+
+ See the `eole` model configuration in this repository for further details, and the `eole-model` folder for the raw `eole` (PyTorch) model.
+
+
+ ## Usage with `quickmt`
+
+ If you want GPU inference, install the Nvidia CUDA toolkit first.
+
+ Next, install the `quickmt` Python library and download the model:
+
+ ```bash
+ git clone https://github.com/quickmt/quickmt.git
+ pip install ./quickmt/
+
+ quickmt-model-download quickmt/quickmt-is-en ./quickmt-is-en
+ ```
+
+ Finally, use the model in Python:
+
+ ```python
+ from quickmt import Translator
+
+ # Auto-detects GPU; set device="cpu" to force CPU inference
+ t = Translator("./quickmt-is-en/", device="auto")
+
+ # Translate - set beam_size=1 for faster (but lower-quality) translation
+ sample_text = 'Dr. Ehud Ur, læknaprófessor við Dalhousie-háskólann í Halifax í Nova Scotia og formaður klínískrar vísindadeildar Kanadíska sykursýkissambandsins, minnti á að rannsóknin væri rétt nýhafin.'
+
+ t(sample_text, beam_size=5)
+ ```
+
+ > 'Dr. Ehud Ur, a medical professor at Dalhousie University in Halifax, Nova Scotia, and chair of the clinical science department of the Canadian Diabetes Association, recalled that the study had just begun.'
+
+ ```python
+ # Get alternative translations by sampling
+ # You can pass any CTranslate2 `translate_batch` arguments
+ t([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
+ ```
+
+ > 'Dr Ehud Ur, a medical professor at Dalhousie University in Halifax, Nova Scotia and chair of the clinical science section of the Canadian Diabetes Union, mentioned that the investigation was just beginning.'
+
+ The model is in CTranslate2 format and the tokenizers are SentencePiece models, so you can use `ctranslate2` directly instead of going through `quickmt`, as in the sketch below. It should also be possible to use this model with e.g. [LibreTranslate](https://libretranslate.com/), which also uses `ctranslate2` and `sentencepiece`. A model in safetensors format for use with `eole` is also provided.
+
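+ For reference, here is a minimal sketch of driving the exported model with `ctranslate2` and `sentencepiece` directly (an illustration, not the `quickmt` implementation; it assumes the downloaded folder keeps the `model.bin`, `src.spm.model` and `tgt.spm.model` files added in this commit):
+
+ ```python
+ import ctranslate2
+ import sentencepiece as spm
+
+ # Load the CTranslate2 model and the SentencePiece tokenizers
+ translator = ctranslate2.Translator("./quickmt-is-en/", device="auto")
+ src_sp = spm.SentencePieceProcessor(model_file="./quickmt-is-en/src.spm.model")
+ tgt_sp = spm.SentencePieceProcessor(model_file="./quickmt-is-en/tgt.spm.model")
+
+ # Tokenize the source, translate, then detokenize the best hypothesis
+ tokens = src_sp.encode("Rannsóknin er rétt nýhafin.", out_type=str)
+ results = translator.translate_batch([tokens], beam_size=5)
+ print(tgt_sp.decode(results[0].hypotheses[0]))
+ ```
+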
+
+ ## Metrics
+
+ `bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) (`isl_Latn` → `eng_Latn`). `comet22` is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32.
+
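+ A rough sketch of the corresponding library calls (the file names are hypothetical: `sys.en` holds model output, `ref.en` the references, `src.is` the sources, one sentence per line):
+
+ ```python
+ import sacrebleu
+ from comet import download_model, load_from_checkpoint
+
+ # Hypothetical file names, one sentence per line
+ hyps = open("sys.en").read().splitlines()
+ refs = open("ref.en").read().splitlines()
+ srcs = open("src.is").read().splitlines()
+
+ print(sacrebleu.corpus_bleu(hyps, [refs]).score)  # bleu
+ print(sacrebleu.corpus_chrf(hyps, [refs]).score)  # chrf2 (sacrebleu's default chrF)
+
+ # comet22 scores are 0-1; the table reports them scaled by 100
+ comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
+ data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
+ print(100 * comet.predict(data, batch_size=32).system_score)
+ ```
+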
+ | Model                            |   bleu |   chrf2 |   comet22 |   Time (s) |
+ |:---------------------------------|-------:|--------:|----------:|-----------:|
+ | quickmt/quickmt-is-en            |  34.76 |   60.13 |     85.39 |       1.22 |
+ | Helsinki-NLP/opus-mt-is-en       |  25.91 |   52.03 |     79.99 |       3.50 |
+ | facebook/nllb-200-distilled-600M |  30.13 |   54.77 |     82.23 |      21.30 |
+ | facebook/nllb-200-distilled-1.3B |  33.71 |   57.73 |     84.71 |      37.21 |
+ | facebook/m2m100_418M             |  20.38 |   46.47 |     70.95 |      18.80 |
+ | facebook/m2m100_1.2B             |  28.89 |   54.54 |     81.09 |      34.72 |
+
config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "add_source_bos": false,
+   "add_source_eos": false,
+   "bos_token": "<s>",
+   "decoder_start_token": "<s>",
+   "eos_token": "</s>",
+   "layer_norm_epsilon": 1e-06,
+   "multi_query_attention": false,
+   "unk_token": "<unk>"
+ }
eole-config.yaml ADDED
@@ -0,0 +1,101 @@
+ ## IO
+ save_data: data
+ overwrite: True
+ seed: 1234
+ report_every: 100
+ valid_metrics: ["BLEU"]
+ tensorboard: true
+ tensorboard_log_dir: tensorboard
+
+ ### Vocab
+ src_vocab: is.eole.vocab
+ tgt_vocab: en.eole.vocab
+ src_vocab_size: 32000
+ tgt_vocab_size: 32000
+ vocab_size_multiple: 8
+ share_vocab: false
+ n_sample: 0
+
+ data:
+   corpus_1:
+     path_src: hf://quickmt/quickmt-train.is-en/is
+     path_tgt: hf://quickmt/quickmt-train.is-en/en
+     path_sco: hf://quickmt/quickmt-train.is-en/sco
+     weight: 9
+   corpus_2:
+     path_src: hf://quickmt/newscrawl2024-en-backtranslated-is/is
+     path_tgt: hf://quickmt/newscrawl2024-en-backtranslated-is/en
+     path_sco: hf://quickmt/newscrawl2024-en-backtranslated-is/sco
+     weight: 5
+   valid:
+     path_src: valid.is
+     path_tgt: valid.en
+
+ transforms: [sentencepiece, filtertoolong]
+ transforms_configs:
+   sentencepiece:
+     src_subword_model: "is.spm.model"
+     tgt_subword_model: "en.spm.model"
+   filtertoolong:
+     src_seq_length: 256
+     tgt_seq_length: 256
+
+ training:
+   # Run configuration
+   model_path: quickmt-is-en-eole-model
+   keep_checkpoint: 4
+   train_steps: 60000
+   save_checkpoint_steps: 5000
+   valid_steps: 5000
+
+   # Train on a single GPU
+   world_size: 1
+   gpu_ranks: [0]
+
+   # Batching
+   batch_type: "tokens"
+   batch_size: 6000
+   valid_batch_size: 2048
+   batch_size_multiple: 8
+   accum_count: [20]
+   accum_steps: [0]
+
+   # Optimizer & Compute
+   compute_dtype: "fp16"
+   optim: "adamw"
+   #use_amp: False
+   learning_rate: 3.0
+   warmup_steps: 5000
+   decay_method: "noam"
+   adam_beta2: 0.998
+
+   # Data loading
+   bucket_size: 128000
+   num_workers: 4
+   prefetch_factor: 32
+
+   # Hyperparams
+   dropout_steps: [0]
+   dropout: [0.1]
+   attention_dropout: [0.1]
+   max_grad_norm: 0
+   label_smoothing: 0.1
+   average_decay: 0.0001
+   param_init_method: xavier_uniform
+   normalization: "tokens"
+
+ model:
+   architecture: "transformer"
+   share_embeddings: false
+   share_decoder_embeddings: true
+   hidden_size: 1024
+   encoder:
+     layers: 8
+   decoder:
+     layers: 2
+   heads: 8
+   transformer_ff: 4096
+   embeddings:
+     word_vec_size: 1024
+     position_encoding_type: "SinusoidalInterleaved"
+
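For context, a minimal sketch of launching training with a config like this. It assumes the `eole` CLI from the repository linked in the README is installed and accepts `train -config <yaml>` as in its quickstart (an assumption, not verified here), and that the vocab and SentencePiece files the config references already exist:

```python
# Minimal sketch: launch eole training with the config above.
# Assumptions: the `eole` entry point is on PATH and accepts
# `train -config <yaml>`; run from the directory holding the config.
import subprocess

subprocess.run(["eole", "train", "-config", "eole-config.yaml"], check=True)
```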
eole-model/config.json ADDED
@@ -0,0 +1,143 @@
+ {
+   "n_sample": 0,
+   "tgt_vocab_size": 32000,
+   "tgt_vocab": "en.eole.vocab",
+   "tensorboard_log_dir_dated": "tensorboard/Nov-24_22-33-35",
+   "valid_metrics": [
+     "BLEU"
+   ],
+   "src_vocab_size": 32000,
+   "save_data": "data",
+   "share_vocab": false,
+   "overwrite": true,
+   "report_every": 100,
+   "tensorboard": true,
+   "seed": 1234,
+   "src_vocab": "is.eole.vocab",
+   "vocab_size_multiple": 8,
+   "tensorboard_log_dir": "tensorboard",
+   "transforms": [
+     "sentencepiece",
+     "filtertoolong"
+   ],
+   "training": {
+     "warmup_steps": 5000,
+     "label_smoothing": 0.1,
+     "attention_dropout": [
+       0.1
+     ],
+     "decay_method": "noam",
+     "model_path": "quickmt-is-en-eole-model",
+     "compute_dtype": "torch.float16",
+     "dropout": [
+       0.1
+     ],
+     "normalization": "tokens",
+     "dropout_steps": [
+       0
+     ],
+     "param_init_method": "xavier_uniform",
+     "train_steps": 100000,
+     "adam_beta2": 0.998,
+     "max_grad_norm": 0.0,
+     "batch_type": "tokens",
+     "accum_count": [
+       20
+     ],
+     "learning_rate": 3.0,
+     "num_workers": 0,
+     "accum_steps": [
+       0
+     ],
+     "bucket_size": 128000,
+     "average_decay": 0.0001,
+     "batch_size": 6000,
+     "gpu_ranks": [
+       0
+     ],
+     "prefetch_factor": 32,
+     "save_checkpoint_steps": 5000,
+     "world_size": 1,
+     "optim": "adamw",
+     "keep_checkpoint": 4,
+     "batch_size_multiple": 8,
+     "valid_batch_size": 2048,
+     "valid_steps": 5000
+   },
+   "transforms_configs": {
+     "sentencepiece": {
+       "src_subword_model": "${MODEL_PATH}/is.spm.model",
+       "tgt_subword_model": "${MODEL_PATH}/en.spm.model"
+     },
+     "filtertoolong": {
+       "src_seq_length": 256,
+       "tgt_seq_length": 256
+     }
+   },
+   "data": {
+     "corpus_1": {
+       "weight": 9,
+       "path_src": "train.is",
+       "path_tgt": "train.en",
+       "path_align": null,
+       "transforms": [
+         "sentencepiece",
+         "filtertoolong"
+       ]
+     },
+     "corpus_2": {
+       "weight": 5,
+       "path_src": "/home/mark/mt/data/newscrawl.backtrans.is",
+       "path_tgt": "/home/mark/mt/data/newscrawl.2024.en",
+       "path_align": null,
+       "transforms": [
+         "sentencepiece",
+         "filtertoolong"
+       ]
+     },
+     "valid": {
+       "path_src": "valid.is",
+       "path_tgt": "valid.en",
+       "path_align": null,
+       "transforms": [
+         "sentencepiece",
+         "filtertoolong"
+       ]
+     }
+   },
+   "model": {
+     "position_encoding_type": "SinusoidalInterleaved",
+     "hidden_size": 1024,
+     "architecture": "transformer",
+     "share_decoder_embeddings": true,
+     "heads": 8,
+     "share_embeddings": false,
+     "transformer_ff": 4096,
+     "encoder": {
+       "position_encoding_type": "SinusoidalInterleaved",
+       "hidden_size": 1024,
+       "n_positions": null,
+       "layers": 8,
+       "src_word_vec_size": 1024,
+       "encoder_type": "transformer",
+       "heads": 8,
+       "transformer_ff": 4096
+     },
+     "embeddings": {
+       "position_encoding_type": "SinusoidalInterleaved",
+       "tgt_word_vec_size": 1024,
+       "src_word_vec_size": 1024,
+       "word_vec_size": 1024
+     },
+     "decoder": {
+       "position_encoding_type": "SinusoidalInterleaved",
+       "hidden_size": 1024,
+       "n_positions": null,
+       "layers": 2,
+       "tgt_word_vec_size": 1024,
+       "decoder_type": "transformer",
+       "heads": 8,
+       "transformer_ff": 4096
+     }
+   }
+ }
eole-model/en.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ac985ba45c9ec783ae106ecde3c5873db2c14e4a1e76086e1eaf7d48295e9b0f
+ size 800209
eole-model/is.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:538f374f5558509c152305b8efbea6cc87daa58cfd52dea3bb962c0ad908c797
+ size 814659
eole-model/model.00.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:604248809bbf7091982bf258f138b66759ff1f1bbc8ddbd63d352565074f5bde
+ size 840314816
eole-model/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:af874d90330cc235279656d6780eed25689bbcfd8467926a1adce65340c778f8
+ size 409915789
source_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
src.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:538f374f5558509c152305b8efbea6cc87daa58cfd52dea3bb962c0ad908c797
+ size 814659
target_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
tgt.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ac985ba45c9ec783ae106ecde3c5873db2c14e4a1e76086e1eaf7d48295e9b0f
+ size 800209