---
datasets:
- cobrayyxx/FLEURS_ID-EN
language:
- id
- en
metrics:
- bleu
- chrf
base_model:
- facebook/nllb-200-distilled-600M
pipeline_tag: translation
---
## Model description

This model is a fine-tuned version of [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M), trained on the Indonesian-English [CoVoST2](https://huggingface.co/datasets/cobrayyxx/COVOST2_ID-EN) dataset.

## Intended uses & limitations

This model translates Indonesian transcriptions (or other Indonesian text) into English.

## How to Use
The steps below show how to run the model with [CTranslate2](https://github.com/OpenNMT/CTranslate2), the inference engine that also powers Faster-Whisper, for example as the translation stage after a Faster-Whisper transcription.
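For quick experimentation, the model can also be loaded directly with the `transformers` library. The snippet below is a minimal sketch, assuming the repository ships the standard NLLB tokenizer files; the example sentence is only a placeholder.

```
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "cobrayyxx/nllb-indo-en-covost2"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="ind_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Placeholder Indonesian sentence; replace with your own input
text = "Selamat pagi, apa kabar?"
inputs = tokenizer(text, return_tensors="pt")

# Force English ("eng_Latn") as the target language
output_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
    max_length=128,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```

For batched inference, the CTranslate2 workflow below is usually faster: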
1. Convert the model into the CTranslate2 format with float16 quantization.
   ```
   !ct2-transformers-converter --model cobrayyxx/nllb-indo-en-covost2 --quantization float16 --output_dir ct2/ct2-nllb-indo-en-float16
   ```
2. Load the converted model using the `ctranslate2` library.
   ```
    import os

    import ctranslate2

    ct2_model_name = "ct2-nllb-indo-en-float16"

    ct_model_path = os.path.join("ct2", ct2_model_name)
    # Use device="cpu" if no GPU is available
    translator = ctranslate2.Translator(ct_model_path, device="cuda")
   ```
3. Download the SentencePiece model
    ```
    !wget https://s3.amazonaws.com/opennmt-models/nllb-200/flores200_sacrebleu_tokenizer_spm.model
    ```
4. Load the SentencePiece model.
    ```
    import os

    import sentencepiece as spm

    # Path to the tokenizer downloaded in the previous step (current directory here)
    sp_model_path = os.path.join(".", "flores200_sacrebleu_tokenizer_spm.model")

    sp = spm.SentencePieceProcessor()
    sp.load(sp_model_path)
    ```
5. Now the loaded model can be used for translation.
   ```
    src_lang = "ind_Latn"
    tgt_lang = "eng_Latn"

    beam_size = 5

    # Replace with your own list of Indonesian sentences
    source_sentences = ["Selamat pagi, apa kabar?"]

    source_sents = [sent.strip() for sent in source_sentences]
    target_prefix = [[tgt_lang]] * len(source_sents)

    # Split the source sentences into subwords and add the NLLB language and EOS tokens
    source_sents_subworded = sp.encode_as_pieces(source_sents)
    source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]

    # Translate the source sentences
    translations = translator.translate_batch(source_sents_subworded,
                                              batch_type="tokens",
                                              max_batch_size=2024,
                                              beam_size=beam_size,
                                              target_prefix=target_prefix)
    translations = [translation.hypotheses[0] for translation in translations]

    # Merge the subwords back into plain text and strip the leading language token
    translations_desubword = sp.decode(translations)
    translations_desubword = [sent[len(tgt_lang):].strip() for sent in translations_desubword]
    print(translations_desubword)
   ```
  
    Note: If you hit a kernel error every time you run the code above, you have to install `nvidia-cublas` and `nvidia-cudnn`:
    
    ```
    apt update
    apt install libcudnn9-cuda-12
    ```
  
    and install the libraries using pip ([read the documentation for more](https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file#gpu)):
    ```
    pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
  
    export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
    ```
    Special thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for her help in resolving this.
   
## Training procedure

### Training Results

| Epoch | Training Loss | Validation Loss | BLEU |
|-------|--------------|----------------|------|
| 1     | 0.119100     | 0.048539       | 60.267190 |
| 2     | 0.020900     | 0.044844       | 59.821654 |
| 3     | 0.014600     | 0.048637       | 60.185481 |
| 4     | 0.007200     | 0.052005       | 60.150045 |
| 5     | 0.005100     | 0.054909       | 59.938441 |
| 6     | 0.004500     | 0.056668       | 60.032409 |
| 7     | 0.003800     | 0.058903       | 60.176242 |
| 8     | 0.002900     | 0.059880       | 60.168394 |
| 9     | 0.002400     | 0.060914       | 60.280851 |

## Model Evaluation

The baseline and fine-tuned models were evaluated with the BLEU and ChrF++ metrics on the validation dataset.
The fine-tuned model improves over the baseline by roughly 7.4 BLEU and 5.5 ChrF++ points.

| Model      |  BLEU | ChrF++ |
|------------|------:|-------:|
| Baseline   | 50.91 |  68.1  |
| Fine-tuned | 58.3  |  73.62 |
### Evaluation details
- BLEU: Measures n-gram overlap between the predicted and reference translations.
- ChrF++: Uses character n-grams (plus word unigrams and bigrams), making it particularly suitable for morphologically rich languages.
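
For reference, both metrics can be computed with the `sacrebleu` library. This is a minimal sketch with made-up sentences; the exact evaluation script behind the numbers above is not included here.

```
import sacrebleu

# Hypothetical model outputs and references, only to illustrate the metric calls
hypotheses = ["The weather is very nice today."]
references = [["The weather is nice today."]]  # one inner list per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 gives ChrF++

print(f"BLEU = {bleu.score:.2f}, ChrF++ = {chrf.score:.2f}")
```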

# Credits
Huge thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for mentoring me.