---
datasets:
- atlasia/darija_english
---
# Darija-English Translator
This model is a fine-tuned version of Qwen/Qwen2.5-1.5B-Instruct on the darija_finetune_train
dataset, trained as a LoRA adapter with LLaMA-Factory. It translates text from Moroccan Darija (a dialect of Arabic spoken in Morocco) into English.
## Model Details
- Library: PEFT
- License: Apache 2.0
- Base Model: Qwen/Qwen2.5-1.5B-Instruct
- Tags: llama-factory, lora, generated_from_trainer
## How to Use

You can load the base model with the `transformers` library and attach the fine-tuned LoRA adapter (this requires `peft` to be installed):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Define model and adapter identifiers
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
finetuned_model_id = "ELhadratiOth/darija-english-translater"

# Load the base model; device_map="auto" places it on GPU if one is available
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype="auto",
)

# Load the fine-tuned LoRA adapter on top of the base model
model.load_adapter(finetuned_model_id)

# Load the tokenizer of the base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

def translate_darija(text):
    messages = [
        {"role": "system", "content": "You are a professional NLP data parser. Follow the provided task and output scheme for consistency."},
        {"role": "user", "content": f"## Task:\n{text}\n\n## English Translation:"},
    ]
    text_input = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text_input], return_tensors="pt").to(model.device)
    # Greedy decoding (do_sample=False); passing the full inputs also supplies the attention mask
    generated_ids = model.generate(**model_inputs, max_new_tokens=1024, do_sample=False)
    # Strip the prompt tokens so only the generated translation is decoded
    new_tokens = generated_ids[:, model_inputs.input_ids.shape[1]:]
    translation = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0]
    return translation

# Example usage
query = "Your Darija text here"
response = translate_darija(query)
print(response)
```
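
Alternatively, you can attach the adapter through the `peft` API directly, which also lets you merge the LoRA weights into the base model for standalone inference. The following is a minimal sketch, assuming the repository contains a standard PEFT LoRA adapter; the merge step is optional.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
finetuned_model_id = "ELhadratiOth/darija-english-translater"

# Load the base model, then wrap it with the LoRA adapter via peft
base_model = AutoModelForCausalLM.from_pretrained(base_model_id, device_map="auto", torch_dtype="auto")
model = PeftModel.from_pretrained(base_model, finetuned_model_id)

# Optional: merge the adapter weights into the base model for faster inference
model = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
```

After merging, the model behaves like a regular `transformers` model and can be saved with `save_pretrained` if you want a standalone checkpoint.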
## Training Details

### Hyperparameters
- Learning Rate: 0.0001
- Batch Size:
  - Train: 1
  - Eval: 1
- Seed: 42
- Distributed Training: Multi-GPU
- Number of Devices: 2
- Gradient Accumulation Steps: 4
- Total Train Batch Size: 8
- Total Eval Batch Size: 2
- Optimizer: AdamW (betas=(0.9,0.999), epsilon=1e-08)
- LR Scheduler: Cosine
- Warmup Ratio: 0.1
- Epochs: 10
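
For reference, the hyperparameters above map roughly onto the following `transformers` `TrainingArguments`. This is only an illustrative sketch: the actual run was configured through LLaMA-Factory, and `output_dir` below is a hypothetical placeholder.

```python
from transformers import TrainingArguments

# Illustrative mapping of the listed hyperparameters; not the original LLaMA-Factory config.
training_args = TrainingArguments(
    output_dir="darija-english-translater",  # hypothetical placeholder
    learning_rate=1e-4,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=10,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    optim="adamw_torch",
)
```

With 2 GPUs, a per-device batch size of 1 and 4 gradient accumulation steps, the effective (total) training batch size is 1 × 2 × 4 = 8, matching the value listed above.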
### Framework Versions
- PEFT: 0.12.0
- Transformers: 4.49.0
- PyTorch: 2.5.1+cu121
- Datasets: 3.2.0
- Tokenizers: 0.21.0