IndicPhi-mini: Adapting Phi-mini-MoE to Indic Languages with Curated Data
🌍Introduction:
Large Language Models (LLMs) have achieved remarkable progress in tasks like translation, dialogue systems, and reasoning. However, these advances have not equally benefited Indic languages. The scarcity of high-quality datasets and limited adaptation efforts have left a noticeable gap, making LLMs less useful for hundreds of millions of Indic language speakers.
To address this, we introduce IndicPhi-mini – a fine-tuned version of Microsoft’s Phi-mini-MoE model, specifically adapted for Indic languages. We curated one of the largest multilingual Indic corpora to date and fine-tuned the model using efficient techniques like QLoRA and LoRA adapters.
Both our fine-tuned model and the curated dataset will be open-sourced soon on Hugging Face for reproducibility and community use:
🤖Model Card: https://huggingface.co/SandLogicTechnologies/IndicPhi-mini
📂Dataset Card: https://huggingface.co/datasets/SandLogicTechnologies/Indic_Chat_Dataset
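Once released, both artifacts should be loadable with the standard Hugging Face libraries. The snippet below is a minimal usage sketch assuming the repository IDs linked above and the usual transformers/datasets APIs; the split name and prompting style are assumptions and may differ at release time.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository IDs taken from the links above (release pending).
MODEL_ID = "SandLogicTechnologies/IndicPhi-mini"
DATA_ID = "SandLogicTechnologies/Indic_Chat_Dataset"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

dataset = load_dataset(DATA_ID, split="train")  # split name is an assumption

# Simple generation check; tokenizer.apply_chat_template may be preferable for chat prompts.
prompt = "भारत की राजधानी क्या है?"  # "What is the capital of India?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```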
📚Data Curation:
One of the biggest challenges in adapting LLMs for Indic languages is the lack of high-quality, diverse data. To address this, we curated and cleaned one of the largest Indic conversational datasets to date.
📥Raw Data Collection:
- Sources: 53 open datasets (mostly from Hugging Face)
- Initial size: ~561M samples
- Languages covered: 13 Indic languages (Hindi, Kannada, Telugu, Tamil, Marathi, Malayalam, Gujarati, Bengali, Odia, Punjabi, Assamese, Sinhala, Urdu)
- Domains: General knowledge, translation corpora, instruction datasets, dialogue/chat, and some code datasets
🛠️Processing Pipeline:
We applied a three-stage curation pipeline to turn raw data into clean, instruction-style conversations (a simplified code sketch of these steps follows the list):
Manual Filtering
- Removed irrelevant or noisy subsets.
- Example: Hindi text-gen corpus reduced from 4.46M → 3.23M after pruning malformed entries.
Automated Preprocessing
- Deduplication: Eliminated duplicates & near-duplicates.
- Language Identification: Verified text belongs to the target language.
- Minimum Length Filtering: Removed incomplete or extremely short entries.
- Unicode & Formatting Normalization: Standardized punctuation, spaces, and encoding.
Format Conversion
- Converted all data into UltraChat-style schema (JSON with user–assistant turns).
- Ensured multi-turn, instruction-following consistency across languages.
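The sketch below illustrates how these automated steps can fit together. It is not our exact pipeline: the length threshold is an example value, langdetect stands in for whichever language-ID model is used, near-duplicate detection is reduced to exact matching, and the JSON keys mirror the common UltraChat-style `messages` layout.

```python
import json
import unicodedata

from langdetect import detect  # stand-in for the actual language-ID model

MIN_CHARS = 20  # example threshold for minimum-length filtering

def normalize(text: str) -> str:
    """Unicode & formatting normalization: NFC form, collapsed whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())

def clean_pairs(raw_pairs, target_lang="hi"):
    """raw_pairs: iterable of (instruction, response) strings for one language."""
    seen = set()
    for instruction, response in raw_pairs:
        instruction, response = normalize(instruction), normalize(response)
        # Minimum-length filtering: drop incomplete or truncated entries.
        if len(instruction) < MIN_CHARS or len(response) < MIN_CHARS:
            continue
        # Deduplication (exact match on the normalized pair; near-dup logic omitted).
        key = (instruction, response)
        if key in seen:
            continue
        seen.add(key)
        # Language identification: keep only rows in the target language.
        try:
            if detect(response) != target_lang:
                continue
        except Exception:
            continue
        # Format conversion: UltraChat-style user/assistant turns.
        yield {"messages": [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": response},
        ]}

if __name__ == "__main__":
    raw = [("भारत की राजधानी क्या है?", "भारत की राजधानी नई दिल्ली है, जो देश का प्रशासनिक केंद्र है।")]
    with open("cleaned.jsonl", "w", encoding="utf-8") as f:
        for row in clean_pairs(raw, target_lang="hi"):
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```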
✅Final Clean Dataset:
- Size: ~29M high-quality samples (after filtering from 561M).
- Coverage across all 13 languages, with per-language counts reflecting available resources (e.g., Hindi 4.63M, Kannada 3.54M, Tamil 3.86M, Malayalam 2.81M, Urdu ~58K).
- Conversational format ensures strong alignment for instruction-tuning.
🏋️Training Details:
To make fine-tuning feasible on a single A100 GPU, we used QLoRA (Quantized LoRA), which combines 4-bit quantization with parameter-efficient fine-tuning; a configuration sketch follows the setup list below.
Training Setup:
- Hardware: 1 × NVIDIA A100 80GB
- Precision: QLoRA (NF4 4-bit)
- Batching: Effective batch size 256 (32 per device × 8 gradient accumulation steps)
- Steps: 8,500 training steps
- Optimizer: AdamW (8-bit)
- Learning Rate Schedule: Cosine decay + 1,000 warmup steps
- Final Training Loss: 0.4805
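In code, the setup above maps roughly onto the following transformers/bitsandbytes configuration. This is a hedged sketch rather than the exact training script: the base-model Hub ID, peak learning rate, bf16 compute, and double quantization are assumptions not stated above.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)

# QLoRA: load the frozen base model in NF4 4-bit via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,   # assumption; common QLoRA default
    bnb_4bit_use_double_quant=True,          # assumption
)

BASE_MODEL = "microsoft/Phi-mini-MoE-instruct"  # assumed Hub ID for the base model
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)

# Hyperparameters as listed above: effective batch 256 (32 x 8 accumulation),
# 8,500 steps, 8-bit AdamW, cosine decay with 1,000 warmup steps.
training_args = TrainingArguments(
    output_dir="indicphi-mini-qlora",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,
    max_steps=8500,
    optim="adamw_bnb_8bit",
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    learning_rate=2e-4,            # assumption; peak LR is not listed above
    bf16=True,                     # assumption
    gradient_checkpointing=True,
    logging_steps=50,
)
```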
LoRA Adapter Configuration:
We trained adapters on the attention projection and feed-forward layers:
- q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Rank (r): 128
- Alpha: 128
- Dropout: 0
- Gradient Checkpointing: Enabled to reduce memory usage
This setup allowed us to run large batches in 4-bit precision while keeping GPU memory requirements manageable.
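In peft terms, the adapter configuration above corresponds roughly to the sketch below, reusing the quantized `model` from the previous snippet; it mirrors the listed hyperparameters but is not the verbatim training code.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the 4-bit model for training and enable gradient checkpointing.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # feed-forward projections
    ],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```

From here, the adapted model and the training arguments sketched earlier go into a standard Trainer (or trl's SFTTrainer) in the usual QLoRA recipe.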
📊Evaluation and Results:
To evaluate the effectiveness of fine-tuning, we benchmarked both the base Phi-mini-MoE-instruct model and our fine-tuned version on two widely used Indic benchmarks (a minimal evaluation sketch follows the list):
- sarvamai/arc-challenge-indic – an Indic adaptation of the ARC Challenge dataset for reasoning tasks.
- sarvamai/mmlu-indic – an Indic adaptation of the MMLU dataset for knowledge and domain understanding across 9 Indian languages.
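Both benchmarks are multiple-choice, so accuracy can be computed by scoring every option with the model and picking the highest-likelihood one; normalized accuracy applies the same procedure after dividing each score by the length of the choice. The sketch below illustrates this for arc-challenge-indic. The field names (question, choices, answerKey), the split name, and the absence of a per-language config argument are assumptions about the dataset layout.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "SandLogicTechnologies/IndicPhi-mini"  # or the base Phi-mini-MoE-instruct checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype=torch.bfloat16)
model.eval()

def choice_logprob(question: str, choice: str) -> float:
    """Sum of token log-probabilities of `choice` given a simple question prompt."""
    prompt = f"Question: {question}\nAnswer:"
    # Re-tokenizing the prompt alone is an approximation at the token boundary.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + choice, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits.float()
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for positions 1..T-1
    targets = full_ids[0, prompt_len:]                     # answer tokens only
    return log_probs[prompt_len - 1 :].gather(1, targets.unsqueeze(1)).sum().item()

# A per-language config name may also be required; "validation" split is an assumption.
ds = load_dataset("sarvamai/arc-challenge-indic", split="validation")
correct = 0
for ex in ds:
    scores = [choice_logprob(ex["question"], c) for c in ex["choices"]["text"]]
    pred = ex["choices"]["label"][int(torch.tensor(scores).argmax())]
    correct += int(pred == ex["answerKey"])
print(f"accuracy = {correct / len(ds):.4f}")
```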
Across both benchmarks, the fine-tuned model consistently outperformed the base model, with absolute gains of 3–4 percentage points in accuracy. On arc-challenge-indic, which tests reasoning-heavy queries, the model demonstrated improved problem-solving ability, reflected in both raw and normalized accuracy. Similarly, on mmlu-indic, which emphasizes factual and domain knowledge, the model achieved notable improvements, indicating enhanced cross-lingual generalization.
📊 Overall Benchmark Results
| Dataset | Metric | Base Model | IndicPhi-mini (Fine-tuned) | Improvement |
|---|---|---|---|---|
| arc-challenge-indic | Accuracy | 21.03 | 24.46 | +3.43 |
| arc-challenge-indic | Acc. Norm | 24.69 | 28.86 | +4.17 |
| mmlu-indic | Accuracy | 27.47 | 30.95 | +3.48 |
These results show that fine-tuning not only boosts overall accuracy but also helps the model better adapt to the nuances of Indic languages. Importantly, the improvements are consistent across both task types (reasoning and knowledge recall), highlighting the robustness of the approach.
🌐Per-Language Performance:
For evaluation, we first computed accuracy for each of the 9 Indic languages individually and then took the average across all languages to obtain the overall benchmark scores reported earlier.
The per-language analysis shows that all languages consistently benefited from fine-tuning, with improvements of a similar scale. This indicates that the fine-tuned model generalizes well across diverse Indic languages, rather than favoring only a subset.
Such uniform gains across languages strengthen the reliability of the model for multilingual applications, ensuring that performance improvements are not biased toward specific linguistic families.
The detailed per-language results are presented in the following table.
📊 Per-Language Results on sarvamai/arc-challenge-indic
| Language | Base Model Accuracy | Base Model Acc. Norm | IndicPhi-mini Accuracy | IndicPhi-mini Acc. Norm |
|---|---|---|---|---|
| Hindi | 22.61 | 26.17 | 25.15 | 29.40 |
| Kannada | 20.96 | 25.83 | 23.64 | 30.01 |
| Tamil | 20.78 | 24.61 | 24.28 | 29.11 |
| Telugu | 20.70 | 26.00 | 24.00 | 30.45 |
| Bengali | 21.91 | 25.04 | 25.22 | 29.80 |
| Gujarati | 18.17 | 21.30 | 22.87 | 26.40 |
| Malayalam | 22.26 | 23.91 | 25.64 | 27.01 |
| Marathi | 19.65 | 25.22 | 23.41 | 28.98 |
| Odia | 22.26 | 24.17 | 26.01 | 28.65 |
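As a quick consistency check, the overall arc-challenge-indic scores in the summary table are simply the unweighted means of the per-language accuracies above:

```python
base = [22.61, 20.96, 20.78, 20.70, 21.91, 18.17, 22.26, 19.65, 22.26]
tuned = [25.15, 23.64, 24.28, 24.00, 25.22, 22.87, 25.64, 23.41, 26.01]
print(round(sum(base) / len(base), 2))    # 21.03
print(round(sum(tuned) / len(tuned), 2))  # 24.47 (reported as 24.46; rounding)
```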
📊 Per-Language Results on sarvamai/mmlu-indic
| Language | Base Model Accuracy | IndicPhi-mini Accuracy |
|---|---|---|
| Hindi | 30.34 | 33.99 |
| Kannada | 26.53 | 30.01 |
| Tamil | 27.58 | 30.75 |
| Telugu | 26.07 | 30.98 |
| Bengali | 28.37 | 32.01 |
| Gujarati | 26.47 | 29.29 |
| Malayalam | 28.28 | 32.45 |
| Marathi | 27.92 | 30.11 |
| Odia | 25.69 | 28.97 |
🏁Conclusion:
This work demonstrates that fine-tuning Phi-mini-MoE on a carefully curated Indic corpus can significantly improve its performance on multilingual reasoning and knowledge tasks. By aggregating diverse datasets and standardizing them into a clean instruction-following format, we enabled the model to better capture the nuances of Indic languages. The use of efficient techniques such as QLoRA quantization and LoRA adapters ensured that this adaptation remained practical on limited hardware. Consistent gains on benchmarks highlight the value of curated quality data and lightweight fine-tuning strategies in equipping compact models with stronger Indic language capabilities.