English-Hindi Colloquial Translator
Overview
This project fine-tunes a Gemma-2-9B model for English-to-Hindi and Hindi-to-English translation, using Unsloth for efficient fine-tuning and inference. The dataset consists of parallel English-Hindi sentence pairs formatted in an instruction-based structure. The model is adapted with LoRA (Low-Rank Adaptation) and loaded in 4-bit quantization for reduced memory consumption. The implementation leverages Hugging Face Transformers, TRL (Transformer Reinforcement Learning), and bitsandbytes for efficient training and inference.
Features
- Dataset Preprocessing: Converts CSV dataset into an instruction-based format for fine-tuning.
- Fine-Tuning with LoRA: Uses Unsloth's LoRA implementation for optimized low-rank adaptation.
- Quantization Support: Reduces memory usage using 4-bit quantization.
- Optimized Training: Utilizes Gradient Checkpointing, AdamW Optimizer, and Flash Attention 2 for memory-efficient training.
- Inference and Transliteration: Supports both translation and Romanized Hindi (Hinglish) conversion using indic-transliteration.
Installation
To set up the required environment, execute the following commands:
pip install pip3-autoremove
pip-autoremove torch torchvision torchaudio -y
pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121
pip install unsloth datasets bitsandbytes
pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
For Flash Attention 2 support (NVIDIA GPUs with Compute Capability ≥ 8.0):
pip install --no-deps packaging ninja einops "flash-attn>=2.6.3"
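To verify the build after installation, an optional quick check:
python -c "import flash_attn; print(flash_attn.__version__)"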
Dataset Preparation
The dataset should be stored in a CSV file (english_to_hindi.csv) with the following structure:
english,hindi
"Did you see that goal? That was a beauty, eh?","क्या आपने उस गोल को देखा? वह एक सुंदर था, है ना?"
Convert CSV to JSON Format
The script converts the dataset into a structured JSON format, producing one example per translation direction for each sentence pair:
list_ds = convert_csv_to_json_format("/content/english_to_hindi.csv")
Example output:
[
  {
    "instruction": "Translate this to English",
    "input": "क्या आपने उस गोल को देखा? वह एक सुंदर था, है ना?",
    "output": "Did you see that goal? That was a beauty, eh?"
  },
  {
    "instruction": "Translate this to Hindi",
    "input": "Did you see that goal? That was a beauty, eh?",
    "output": "क्या आपने उस गोल को देखा? वह एक सुंदर था, है ना?"
  }
]
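The convert_csv_to_json_format helper is not shown above; a minimal sketch of how it could be implemented (the pandas dependency and the english/hindi column names are assumptions based on the CSV structure above):

import pandas as pd

def convert_csv_to_json_format(csv_path):
    # Read the parallel corpus; columns are assumed to be "english" and "hindi".
    df = pd.read_csv(csv_path)
    records = []
    for _, row in df.iterrows():
        # Hindi -> English direction
        records.append({
            "instruction": "Translate this to English",
            "input": row["hindi"],
            "output": row["english"],
        })
        # English -> Hindi direction
        records.append({
            "instruction": "Translate this to Hindi",
            "input": row["english"],
            "output": row["hindi"],
        })
    return records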
Model Training
Load Pretrained Model
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-9b",
    max_seq_length=2048,   # maximum context length for training examples
    dtype=None,            # auto-detect (bfloat16 on Ampere+, else float16)
    load_in_4bit=True,     # 4-bit quantization via bitsandbytes
)
Apply LoRA Fine-Tuning
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,                        # 0 is the optimized setting in Unsloth
    bias="none",
    use_gradient_checkpointing="unsloth",  # memory-efficient checkpointing
    random_state=3407,
    use_rslora=False,                      # rank-stabilized LoRA disabled
    loftq_config=None,
)
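The SFTTrainer below expects a dataset with a text field, and the inference snippet later uses an alpaca_prompt template; neither is defined in this README. A minimal sketch of both, assuming the Alpaca-style template used in the Unsloth example notebooks:

from datasets import Dataset

# Alpaca-style prompt template (assumption: same template as Unsloth's examples).
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # appended so the model learns when to stop

def to_text(example):
    example["text"] = alpaca_prompt.format(
        example["instruction"], example["input"], example["output"]
    ) + EOS_TOKEN
    return example

# list_ds comes from convert_csv_to_json_format above.
dataset = Dataset.from_list(list_ds).map(to_text)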
Training Configuration
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./english-hindi-colloquial",
    per_device_train_batch_size=4,
    num_train_epochs=10,
    learning_rate=3e-4,
    weight_decay=0.01,
    logging_steps=1,
    evaluation_strategy="steps",  # requires an eval_dataset to be passed to the trainer
    eval_steps=2,
    save_steps=2,
    push_to_hub=True,
    hub_model_id="Amrutha23345/english_to_hindi_colloquial_translator",
    fp16=True,                    # use bf16=True instead on Ampere+ GPUs
    gradient_checkpointing=True,
)
Start Training
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    args=training_args,
)

trainer.train()
trainer.save_model()
trainer.push_to_hub()
Inference
Text Generation
from transformers import TextStreamer

FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference mode

inputs = tokenizer([
    # Instruction string matches the training data ("Translate this to English");
    # the response field is left empty for the model to complete.
    alpaca_prompt.format("Translate this to English", "क्या आपने उस गोल को देखा? वह एक सुंदर था, है ना?", "")
], return_tensors="pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128)
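To capture the generated text as a string rather than streaming it (for example, to feed a Hindi output into the transliteration step below), a minimal sketch; the "### Response:" split assumes the Alpaca template shown earlier:

outputs = model.generate(**inputs, max_new_tokens=128)
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
# The decoded string includes the prompt; the model's answer follows "### Response:"
response = decoded.split("### Response:")[-1].strip()
print(response)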
Transliteration (Hindi to Hinglish)
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate
hindi_response = "क्या आपने उस गोल को देखा? वह एक सुंदर था, है ना?"
hinglish_response = transliterate(hindi_response, sanscript.DEVANAGARI, sanscript.ITRANS)
print("Hinglish Output:", hinglish_response)
Example Output (normalized):
Hinglish Output: kya aapne us goal ko dekha? vah ek sundar tha, hai na?
Note that the raw ITRANS scheme capitalizes long vowels (e.g., kyA, dekhA), so a light post-processing step is needed to produce natural Hinglish like the above.
Performance Metrics
The script tracks GPU memory usage:
import torch

# Peak GPU memory reserved during training, in GB
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
print(f"Peak reserved memory = {used_memory} GB.")
Model Deployment
Push to Hugging Face Hub
trainer.push_to_hub()
The model can be accessed on Hugging Face: Amrutha23345/english_to_hindi_colloquial_translator
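After pushing, the fine-tuned adapters can be loaded back for inference. A minimal sketch, assuming Unsloth resolves the pushed LoRA adapter against its base model:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Amrutha23345/english_to_hindi_colloquial_translator",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable faster inference mode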
Conclusion
This project successfully fine-tunes a Gemma-2-9B model for English-Hindi translation, leveraging LoRA, Unsloth, and Flash Attention 2 for optimized training. The implementation supports both translation and Romanized Hindi (Hinglish) conversion, making it useful for real-world applications such as chatbots, NLP services, and educational tools.
Future Improvements
- Expand dataset to cover colloquial and dialect-based Hindi.
- Implement RLHF (Reinforcement Learning from Human Feedback) for improved translation quality.
- Optimize deployment using ONNX for edge-device inference.
References
- Base model: google/gemma-2-9b