English-Hindi Colloquial Translator
Overview
This project fine-tunes a Gemma-2-9B model for English-to-Hindi and Hindi-to-English translation, using Unsloth for efficient fine-tuning and inference. The dataset consists of parallel English-Hindi sentence pairs formatted in an instruction-based structure. The model is adapted with LoRA (Low-Rank Adaptation) and loaded in 4-bit quantization for reduced memory consumption. The implementation leverages Hugging Face Transformers, TRL (Transformer Reinforcement Learning), and bitsandbytes for efficient training and inference.
Features
- Dataset Preprocessing: Converts CSV dataset into an instruction-based format for fine-tuning.
- Fine-Tuning with LoRA: Uses Unsloth's LoRA implementation for optimized low-rank adaptation.
- Quantization Support: Reduces memory usage using 4-bit quantization.
- Optimized Training: Utilizes Gradient Checkpointing, AdamW Optimizer, and Flash Attention 2 for memory-efficient training.
- Inference and Transliteration: Supports both translation and Romanized Hindi (Hinglish) conversion using indic-transliteration.
Installation
To set up the required environment, execute the following commands:
pip install pip3-autoremove
pip-autoremove torch torchvision torchaudio -y
pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121
pip install unsloth datasets bitsandbytes
pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
For Flash Attention 2 support (NVIDIA GPUs with Compute Capability ≥ 8.0):
pip install --no-deps packaging ninja einops "flash-attn>=2.6.3"
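To verify the build after installation, an optional quick check:
python -c "import flash_attn; print(flash_attn.__version__)"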
Dataset Preparation
The dataset should be stored in a CSV file (english_to_hindi.csv) with the following structure:
english,hindi
"Did you see that goal? That was a beauty, eh?","क्या आपने उस गोल को देखा? वह एक सुंदर था, है ना?"
Convert CSV to JSON Format
The script converts the dataset into a structured JSON format, producing one example per translation direction for each sentence pair:
list_ds = convert_csv_to_json_format("/content/english_to_hindi.csv")
Example output:
[
  {
    "instruction": "Translate this to English",
    "input": "क्या आपने उस गोल को देखा? वह एक सुंदर था, है ना?",
    "output": "Did you see that goal? That was a beauty, eh?"
  },
  {
    "instruction": "Translate this to Hindi",
    "input": "Did you see that goal? That was a beauty, eh?",
    "output": "क्या आपने उस गोल को देखा? वह एक सुंदर था, है ना?"
  }
]
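The convert_csv_to_json_format helper is not shown above; a minimal sketch of how it could be implemented (the pandas dependency and the english/hindi column names are assumptions based on the CSV structure above):

import pandas as pd

def convert_csv_to_json_format(csv_path):
    # Read the parallel corpus; columns are assumed to be "english" and "hindi".
    df = pd.read_csv(csv_path)
    records = []
    for _, row in df.iterrows():
        # Hindi -> English direction
        records.append({
            "instruction": "Translate this to English",
            "input": row["hindi"],
            "output": row["english"],
        })
        # English -> Hindi direction
        records.append({
            "instruction": "Translate this to Hindi",
            "input": row["english"],
            "output": row["hindi"],
        })
    return records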
Model Training
Load Pretrained Model
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-9b",
    max_seq_length=2048,   # maximum context length for training examples
    dtype=None,            # auto-detect (bfloat16 on Ampere+, else float16)
    load_in_4bit=True,     # 4-bit quantization via bitsandbytes
)
Apply LoRA Fine-Tuning
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,                        # 0 is the optimized setting in Unsloth
    bias="none",
    use_gradient_checkpointing="unsloth",  # memory-efficient checkpointing
    random_state=3407,
    use_rslora=False,                      # rank-stabilized LoRA disabled
    loftq_config=None,
)
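The SFTTrainer below expects a dataset with a text field, and the inference snippet later uses an alpaca_prompt template; neither is defined in this README. A minimal sketch of both, assuming the Alpaca-style template used in the Unsloth example notebooks:

from datasets import Dataset

# Alpaca-style prompt template (assumption: same template as Unsloth's examples).
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # appended so the model learns when to stop

def to_text(example):
    example["text"] = alpaca_prompt.format(
        example["instruction"], example["input"], example["output"]
    ) + EOS_TOKEN
    return example

# list_ds comes from convert_csv_to_json_format above.
dataset = Dataset.from_list(list_ds).map(to_text)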
Training Configuration
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./english-hindi-colloquial",
    per_device_train_batch_size=4,
    num_train_epochs=10,
    learning_rate=3e-4,
    weight_decay=0.01,
    logging_steps=1,
    evaluation_strategy="steps",  # requires an eval_dataset to be passed to the trainer
    eval_steps=2,
    save_steps=2,
    push_to_hub=True,
    hub_model_id="Amrutha23345/english_to_hindi_colloquial_translator",
    fp16=True,                    # use bf16=True instead on Ampere+ GPUs
    gradient_checkpointing=True,
)
Start Training
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    args=training_args,
)

trainer.train()
trainer.save_model()
trainer.push_to_hub()
Inference
Text Generation
from transformers import TextStreamer

FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference mode

inputs = tokenizer([
    # Instruction string matches the training data ("Translate this to English");
    # the response field is left empty for the model to complete.
    alpaca_prompt.format("Translate this to English", "क्या आपने उस गोल को देखा? वह एक सुंदर था, है ना?", "")
], return_tensors="pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128)
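To capture the generated text as a string rather than streaming it (for example, to feed a Hindi output into the transliteration step below), a minimal sketch; the "### Response:" split assumes the Alpaca template shown earlier:

outputs = model.generate(**inputs, max_new_tokens=128)
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
# The decoded string includes the prompt; the model's answer follows "### Response:"
response = decoded.split("### Response:")[-1].strip()
print(response)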
Transliteration (Hindi to Hinglish)
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate
hindi_response = "क्या आपने उस गोल को देखा? वह एक सुंदर था, है ना?"
hinglish_response = transliterate(hindi_response, sanscript.DEVANAGARI, sanscript.ITRANS)
print("Hinglish Output:", hinglish_response)
Example Output (normalized):
Hinglish Output: kya aapne us goal ko dekha? vah ek sundar tha, hai na?
Note that the raw ITRANS scheme capitalizes long vowels (e.g., kyA, dekhA), so a light post-processing step is needed to produce natural Hinglish like the above.
Performance Metrics
The script tracks GPU memory usage:
import torch

# Peak GPU memory reserved during training, in GB
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
print(f"Peak reserved memory = {used_memory} GB.")
Model Deployment
Push to Hugging Face Hub
trainer.push_to_hub()
The model can be accessed on Hugging Face: Amrutha23345/english_to_hindi_colloquial_translator
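After pushing, the fine-tuned adapters can be loaded back for inference. A minimal sketch, assuming Unsloth resolves the pushed LoRA adapter against its base model:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Amrutha23345/english_to_hindi_colloquial_translator",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable faster inference mode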
Conclusion
This project successfully fine-tunes a Gemma-2-9B model for English-Hindi translation, leveraging LoRA, Unsloth, and Flash Attention 2 for optimized training. The implementation supports both translation and Romanized Hindi (Hinglish) conversion, making it useful for real-world applications such as chatbots, NLP services, and educational tools.
Future Improvements
- Expand dataset to cover colloquial and dialect-based Hindi.
- Implement RLHF (Reinforcement Learning from Human Feedback) for improved translation quality.
- Optimize deployment using ONNX for edge-device inference.
References
- Base model: google/gemma-2-9b