---
license: apache-2.0
library_name: peft
base_model: Qwen/Qwen2.5-1.5B-Instruct
datasets:
- atlasia/darija_english
tags:
- llama-factory
- lora
- generated_from_trainer
---
# Darija-English Translator

This model is a fine-tuned version of [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) on the `darija_finetune_train` dataset. It is designed to translate text from Moroccan Darija (a dialect of Arabic) to English.

## Model Details

- **Library**: PEFT
- **License**: Apache 2.0
- **Base Model**: Qwen/Qwen2.5-1.5B-Instruct
- **Tags**: `llama-factory`, `lora`, `generated_from_trainer`

## How to Use

You can load the base model with the `transformers` library and attach the LoRA adapter (this requires the `peft` package to be installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Define model and adapter identifiers
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
finetuned_model_id = "ELhadratiOth/darija-english-translater"

# Load the base model; device_map="auto" places it on GPU if one is available
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype="auto",
)

# Load the fine-tuned LoRA adapter (requires peft)
model.load_adapter(finetuned_model_id)

# Load the tokenizer from the base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

def translate_darija(text):
    messages = [
        {"role": "system", "content": "You are a professional NLP data parser. Follow the provided task and output scheme for consistency."},
        {"role": "user", "content": f"## Task:\n{text}\n\n## English Translation:"}
    ]

    text_input = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text_input], return_tensors="pt").to(model.device)

    # Greedy decoding; temperature has no effect when do_sample=False, so it is omitted
    generated_ids = model.generate(**model_inputs, max_new_tokens=1024, do_sample=False)
    # Strip the prompt tokens so only the generated translation is decoded
    generated_ids = generated_ids[:, model_inputs.input_ids.shape[1]:]
    translation = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    return translation

# Example usage
query = "Your Darija text here"
response = translate_darija(query)
print(response)
```
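
Alternatively, if the repository ships a standard `adapter_config.json`, PEFT can resolve and load the base model for you in a single call. A minimal sketch, assuming a standard PEFT adapter layout:

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# AutoPeftModelForCausalLM reads the adapter config and pulls in the
# referenced base model automatically
model = AutoPeftModelForCausalLM.from_pretrained(
    "ELhadratiOth/darija-english-translater",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
```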

## Training Details

### Hyperparameters
- **Learning Rate**: 0.0001
- **Batch Size**:
  - Train: 1
  - Eval: 1
- **Seed**: 42
- **Distributed Training**: Multi-GPU
- **Number of Devices**: 2
- **Gradient Accumulation Steps**: 4
- **Total Train Batch Size**: 8
- **Total Eval Batch Size**: 2
- **Optimizer**: AdamW (betas=(0.9,0.999), epsilon=1e-08)
- **LR Scheduler**: Cosine
- **Warmup Ratio**: 0.1
- **Epochs**: 10
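
For reference, these settings map onto `transformers.TrainingArguments` roughly as follows. This is an illustrative sketch only; the actual run used LLaMA-Factory, and the output path below is hypothetical:

```python
from transformers import TrainingArguments

# Approximate reproduction of the reported hyperparameters
training_args = TrainingArguments(
    output_dir="darija-english-translator",  # hypothetical output path
    learning_rate=1e-4,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,   # 1 per device x 2 GPUs x 4 steps = total batch size 8
    num_train_epochs=10,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    optim="adamw_torch",             # AdamW with default betas=(0.9, 0.999), eps=1e-8
)
```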

### Framework Versions
- PEFT: 0.12.0
- Transformers: 4.49.0
- PyTorch: 2.5.1+cu121
- Datasets: 3.2.0
- Tokenizers: 0.21.0
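
To check your environment against these versions, a quick sanity-check snippet (nearby versions will usually work; exact matches are not strictly required):

```python
import peft, transformers, torch, datasets, tokenizers

# Versions the adapter was trained with
expected = {
    "peft": "0.12.0",
    "transformers": "4.49.0",
    "torch": "2.5.1+cu121",
    "datasets": "3.2.0",
    "tokenizers": "0.21.0",
}
for name, mod in [("peft", peft), ("transformers", transformers),
                  ("torch", torch), ("datasets", datasets),
                  ("tokenizers", tokenizers)]:
    print(f"{name}: installed {mod.__version__}, trained with {expected[name]}")
```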