radinplaid committed · Commit 385fd2e · verified · 1 Parent(s): 602a850

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,3 +1,108 @@
- ---
- license: cc-by-4.0
- ---
+ ---
+ language:
+ - en
+ - is
+ tags:
+ - translation
+ license: cc-by-4.0
+ datasets:
+ - quickmt/quickmt-train.is-en
+ model-index:
+ - name: quickmt-is-en
+   results:
+   - task:
+       name: Translation isl-eng
+       type: translation
+       args: isl-eng
+     dataset:
+       name: flores101-devtest
+       type: flores_101
+       args: isl_Latn eng_Latn devtest
+     metrics:
+     - name: BLEU
+       type: bleu
+       value: 34.76
+     - name: CHRF
+       type: chrf
+       value: 60.13
+     - name: COMET
+       type: comet
+       value: 85.39
+ ---
+
+
+ # `quickmt-is-en` Neural Machine Translation Model
+
+ `quickmt-is-en` is a reasonably fast and reasonably accurate neural machine translation model for translating Icelandic (`is`) into English (`en`).
+
+
+ ## Try it on our Hugging Face Space
+
+ Give it a try before downloading: https://huggingface.co/spaces/quickmt/QuickMT-Demo
+
+
+ ## Model Information
+
+ * Trained using [`eole`](https://github.com/eole-nlp/eole)
+ * 200M-parameter transformer 'big' with 8 encoder layers and 2 decoder layers
+ * Separate 32k SentencePiece vocabularies for source and target
+ * Exported to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format for fast inference
+ * The PyTorch model (for use with [`eole`](https://github.com/eole-nlp/eole)) is available in the `eole-model` folder of this repository
+
+ See the `eole` model configuration in this repository for further details, and the `eole-model` folder for the raw `eole` (PyTorch) model.
+
+
+ ## Usage with `quickmt`
+
+ If you want GPU inference, install the Nvidia CUDA toolkit first.
+
+ Next, install the `quickmt` Python library and download the model:
+
+ ```bash
+ git clone https://github.com/quickmt/quickmt.git
+ pip install ./quickmt/
+
+ quickmt-model-download quickmt/quickmt-is-en ./quickmt-is-en
+ ```
+
+ Finally, use the model in Python:
+
+ ```python
+ from quickmt import Translator
+
+ # Auto-detects GPU; set device="cpu" to force CPU inference
+ t = Translator("./quickmt-is-en/", device="auto")
+
+ # Translate - set beam_size=1 for faster (but lower-quality) translation
+ sample_text = 'Dr. Ehud Ur, læknaprófessor við Dalhousie-háskólann í Halifax í Nova Scotia og formaður klínískrar vísindadeildar Kanadíska sykursýkissambandsins, minnti á að rannsóknin væri rétt nýhafin.'
+
+ t(sample_text, beam_size=5)
+ ```
+
+ > 'Dr. Ehud Ur, a medical professor at Dalhousie University in Halifax, Nova Scotia, and chair of the clinical science department of the Canadian Diabetes Association, recalled that the study had just begun.'
+
+ ```python
+ # Get alternative translations by sampling
+ # You can pass any CTranslate2 `translate_batch` arguments
+ t([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
+ ```
+
+ > 'Dr Ehud Ur, a medical professor at Dalhousie University in Halifax, Nova Scotia and chair of the clinical science section of the Canadian Diabetes Union, mentioned that the investigation was just beginning.'
+
+ The model is in CTranslate2 format and the tokenizers are SentencePiece models, so you can use `ctranslate2` directly instead of going through `quickmt`, as in the sketch below. It should also be possible to use this model with e.g. [LibreTranslate](https://libretranslate.com/), which also uses `ctranslate2` and `sentencepiece`. A model in safetensors format for use with `eole` is also provided.
+
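+ For reference, here is a minimal sketch of driving the exported model with `ctranslate2` and `sentencepiece` directly (an illustration, not the `quickmt` implementation; it assumes the downloaded folder keeps the `model.bin`, `src.spm.model` and `tgt.spm.model` files added in this commit):
+
+ ```python
+ import ctranslate2
+ import sentencepiece as spm
+
+ # Load the CTranslate2 model and the SentencePiece tokenizers
+ translator = ctranslate2.Translator("./quickmt-is-en/", device="auto")
+ src_sp = spm.SentencePieceProcessor(model_file="./quickmt-is-en/src.spm.model")
+ tgt_sp = spm.SentencePieceProcessor(model_file="./quickmt-is-en/tgt.spm.model")
+
+ # Tokenize the source, translate, then detokenize the best hypothesis
+ tokens = src_sp.encode("Rannsóknin er rétt nýhafin.", out_type=str)
+ results = translator.translate_batch([tokens], beam_size=5)
+ print(tgt_sp.decode(results[0].hypotheses[0]))
+ ```
+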
+
+ ## Metrics
+
+ `bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) (`isl_Latn` → `eng_Latn`). `comet22` is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32.
+
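+ A rough sketch of the corresponding library calls (the file names are hypothetical: `sys.en` holds model output, `ref.en` the references, `src.is` the sources, one sentence per line):
+
+ ```python
+ import sacrebleu
+ from comet import download_model, load_from_checkpoint
+
+ # Hypothetical file names, one sentence per line
+ hyps = open("sys.en").read().splitlines()
+ refs = open("ref.en").read().splitlines()
+ srcs = open("src.is").read().splitlines()
+
+ print(sacrebleu.corpus_bleu(hyps, [refs]).score)  # bleu
+ print(sacrebleu.corpus_chrf(hyps, [refs]).score)  # chrf2 (sacrebleu's default chrF)
+
+ # comet22 scores are 0-1; the table reports them scaled by 100
+ comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
+ data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
+ print(100 * comet.predict(data, batch_size=32).system_score)
+ ```
+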
+ | Model                            |   bleu |   chrf2 |   comet22 |   Time (s) |
+ |:---------------------------------|-------:|--------:|----------:|-----------:|
+ | quickmt/quickmt-is-en            |  34.76 |   60.13 |     85.39 |       1.22 |
+ | Helsinki-NLP/opus-mt-is-en       |  25.91 |   52.03 |     79.99 |       3.50 |
+ | facebook/nllb-200-distilled-600M |  30.13 |   54.77 |     82.23 |      21.30 |
+ | facebook/nllb-200-distilled-1.3B |  33.71 |   57.73 |     84.71 |      37.21 |
+ | facebook/m2m100_418M             |  20.38 |   46.47 |     70.95 |      18.80 |
+ | facebook/m2m100_1.2B             |  28.89 |   54.54 |     81.09 |      34.72 |
+
config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "add_source_bos": false,
+   "add_source_eos": false,
+   "bos_token": "<s>",
+   "decoder_start_token": "<s>",
+   "eos_token": "</s>",
+   "layer_norm_epsilon": 1e-06,
+   "multi_query_attention": false,
+   "unk_token": "<unk>"
+ }
eole-config.yaml ADDED
@@ -0,0 +1,101 @@
+ ## IO
+ save_data: data
+ overwrite: True
+ seed: 1234
+ report_every: 100
+ valid_metrics: ["BLEU"]
+ tensorboard: true
+ tensorboard_log_dir: tensorboard
+
+ ### Vocab
+ src_vocab: is.eole.vocab
+ tgt_vocab: en.eole.vocab
+ src_vocab_size: 32000
+ tgt_vocab_size: 32000
+ vocab_size_multiple: 8
+ share_vocab: false
+ n_sample: 0
+
+ data:
+   corpus_1:
+     path_src: hf://quickmt/quickmt-train.is-en/is
+     path_tgt: hf://quickmt/quickmt-train.is-en/en
+     path_sco: hf://quickmt/quickmt-train.is-en/sco
+     weight: 9
+   corpus_2:
+     path_src: hf://quickmt/newscrawl2024-en-backtranslated-is/is
+     path_tgt: hf://quickmt/newscrawl2024-en-backtranslated-is/en
+     path_sco: hf://quickmt/newscrawl2024-en-backtranslated-is/sco
+     weight: 5
+   valid:
+     path_src: valid.is
+     path_tgt: valid.en
+
+ transforms: [sentencepiece, filtertoolong]
+ transforms_configs:
+   sentencepiece:
+     src_subword_model: "is.spm.model"
+     tgt_subword_model: "en.spm.model"
+   filtertoolong:
+     src_seq_length: 256
+     tgt_seq_length: 256
+
+ training:
+   # Run configuration
+   model_path: quickmt-is-en-eole-model
+   keep_checkpoint: 4
+   train_steps: 60000
+   save_checkpoint_steps: 5000
+   valid_steps: 5000
+
+   # Train on a single GPU
+   world_size: 1
+   gpu_ranks: [0]
+
+   # Batching
+   batch_type: "tokens"
+   batch_size: 6000
+   valid_batch_size: 2048
+   batch_size_multiple: 8
+   accum_count: [20]
+   accum_steps: [0]
+
+   # Optimizer & Compute
+   compute_dtype: "fp16"
+   optim: "adamw"
+   #use_amp: False
+   learning_rate: 3.0
+   warmup_steps: 5000
+   decay_method: "noam"
+   adam_beta2: 0.998
+
+   # Data loading
+   bucket_size: 128000
+   num_workers: 4
+   prefetch_factor: 32
+
+   # Hyperparams
+   dropout_steps: [0]
+   dropout: [0.1]
+   attention_dropout: [0.1]
+   max_grad_norm: 0
+   label_smoothing: 0.1
+   average_decay: 0.0001
+   param_init_method: xavier_uniform
+   normalization: "tokens"
+
+ model:
+   architecture: "transformer"
+   share_embeddings: false
+   share_decoder_embeddings: true
+   hidden_size: 1024
+   encoder:
+     layers: 8
+   decoder:
+     layers: 2
+   heads: 8
+   transformer_ff: 4096
+   embeddings:
+     word_vec_size: 1024
+     position_encoding_type: "SinusoidalInterleaved"
+
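For context, a minimal sketch of launching training with a config like this. It assumes the `eole` CLI from the repository linked in the README is installed and accepts `train -config <yaml>` as in its quickstart (an assumption, not verified here), and that the vocab and SentencePiece files the config references already exist:

```python
# Minimal sketch: launch eole training with the config above.
# Assumptions: the `eole` entry point is on PATH and accepts
# `train -config <yaml>`; run from the directory holding the config.
import subprocess

subprocess.run(["eole", "train", "-config", "eole-config.yaml"], check=True)
```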
eole-model/config.json ADDED
@@ -0,0 +1,143 @@
+ {
+   "n_sample": 0,
+   "tgt_vocab_size": 32000,
+   "tgt_vocab": "en.eole.vocab",
+   "tensorboard_log_dir_dated": "tensorboard/Nov-24_22-33-35",
+   "valid_metrics": [
+     "BLEU"
+   ],
+   "src_vocab_size": 32000,
+   "save_data": "data",
+   "share_vocab": false,
+   "overwrite": true,
+   "report_every": 100,
+   "tensorboard": true,
+   "seed": 1234,
+   "src_vocab": "is.eole.vocab",
+   "vocab_size_multiple": 8,
+   "tensorboard_log_dir": "tensorboard",
+   "transforms": [
+     "sentencepiece",
+     "filtertoolong"
+   ],
+   "training": {
+     "warmup_steps": 5000,
+     "label_smoothing": 0.1,
+     "attention_dropout": [
+       0.1
+     ],
+     "decay_method": "noam",
+     "model_path": "quickmt-is-en-eole-model",
+     "compute_dtype": "torch.float16",
+     "dropout": [
+       0.1
+     ],
+     "normalization": "tokens",
+     "dropout_steps": [
+       0
+     ],
+     "param_init_method": "xavier_uniform",
+     "train_steps": 100000,
+     "adam_beta2": 0.998,
+     "max_grad_norm": 0.0,
+     "batch_type": "tokens",
+     "accum_count": [
+       20
+     ],
+     "learning_rate": 3.0,
+     "num_workers": 0,
+     "accum_steps": [
+       0
+     ],
+     "bucket_size": 128000,
+     "average_decay": 0.0001,
+     "batch_size": 6000,
+     "gpu_ranks": [
+       0
+     ],
+     "prefetch_factor": 32,
+     "save_checkpoint_steps": 5000,
+     "world_size": 1,
+     "optim": "adamw",
+     "keep_checkpoint": 4,
+     "batch_size_multiple": 8,
+     "valid_batch_size": 2048,
+     "valid_steps": 5000
+   },
+   "transforms_configs": {
+     "sentencepiece": {
+       "src_subword_model": "${MODEL_PATH}/is.spm.model",
+       "tgt_subword_model": "${MODEL_PATH}/en.spm.model"
+     },
+     "filtertoolong": {
+       "src_seq_length": 256,
+       "tgt_seq_length": 256
+     }
+   },
+   "data": {
+     "corpus_1": {
+       "weight": 9,
+       "path_src": "train.is",
+       "path_tgt": "train.en",
+       "path_align": null,
+       "transforms": [
+         "sentencepiece",
+         "filtertoolong"
+       ]
+     },
+     "corpus_2": {
+       "weight": 5,
+       "path_src": "/home/mark/mt/data/newscrawl.backtrans.is",
+       "path_tgt": "/home/mark/mt/data/newscrawl.2024.en",
+       "path_align": null,
+       "transforms": [
+         "sentencepiece",
+         "filtertoolong"
+       ]
+     },
+     "valid": {
+       "path_src": "valid.is",
+       "path_tgt": "valid.en",
+       "path_align": null,
+       "transforms": [
+         "sentencepiece",
+         "filtertoolong"
+       ]
+     }
+   },
+   "model": {
+     "position_encoding_type": "SinusoidalInterleaved",
+     "hidden_size": 1024,
+     "architecture": "transformer",
+     "share_decoder_embeddings": true,
+     "heads": 8,
+     "share_embeddings": false,
+     "transformer_ff": 4096,
+     "encoder": {
+       "position_encoding_type": "SinusoidalInterleaved",
+       "hidden_size": 1024,
+       "n_positions": null,
+       "layers": 8,
+       "src_word_vec_size": 1024,
+       "encoder_type": "transformer",
+       "heads": 8,
+       "transformer_ff": 4096
+     },
+     "embeddings": {
+       "position_encoding_type": "SinusoidalInterleaved",
+       "tgt_word_vec_size": 1024,
+       "src_word_vec_size": 1024,
+       "word_vec_size": 1024
+     },
+     "decoder": {
+       "position_encoding_type": "SinusoidalInterleaved",
+       "hidden_size": 1024,
+       "n_positions": null,
+       "layers": 2,
+       "tgt_word_vec_size": 1024,
+       "decoder_type": "transformer",
+       "heads": 8,
+       "transformer_ff": 4096
+     }
+   }
+ }
eole-model/en.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ac985ba45c9ec783ae106ecde3c5873db2c14e4a1e76086e1eaf7d48295e9b0f
+ size 800209
eole-model/is.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:538f374f5558509c152305b8efbea6cc87daa58cfd52dea3bb962c0ad908c797
+ size 814659
eole-model/model.00.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:604248809bbf7091982bf258f138b66759ff1f1bbc8ddbd63d352565074f5bde
+ size 840314816
eole-model/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:af874d90330cc235279656d6780eed25689bbcfd8467926a1adce65340c778f8
+ size 409915789
source_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
src.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:538f374f5558509c152305b8efbea6cc87daa58cfd52dea3bb962c0ad908c797
+ size 814659
target_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
tgt.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ac985ba45c9ec783ae106ecde3c5873db2c14e4a1e76086e1eaf7d48295e9b0f
+ size 800209