IndicPhi-mini: Adapting Phi-mini-MoE to Indic Languages with Curated Data
🌍Introduction:
Large Language Models (LLMs) have achieved remarkable progress in tasks like translation, dialogue systems, and reasoning. However, these advances have not equally benefited Indic languages. The scarcity of high-quality datasets and limited adaptation efforts have left a noticeable gap, making LLMs less useful for hundreds of millions of Indic language speakers.
To address this, we introduce IndicPhi-mini – a fine-tuned version of Microsoft’s Phi-mini-MoE model, specifically adapted for Indic languages. We curated one of the largest multilingual Indic corpora to date and fine-tuned the model using efficient techniques like QLoRA and LoRA adapters.
Both our fine-tuned model and the curated dataset will be open-sourced soon on Hugging Face for reproducibility and community use:
🤖Model Card: https://huggingface.co/SandLogicTechnologies/IndicPhi-mini
📂Dataset Card: https://huggingface.co/datasets/SandLogicTechnologies/Indic_Chat_Dataset
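Once released, both artifacts should be loadable with the standard Hugging Face libraries. The snippet below is a minimal usage sketch assuming the repository IDs linked above and the usual transformers/datasets APIs; the split name and prompting style are assumptions and may differ at release time.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository IDs taken from the links above (release pending).
MODEL_ID = "SandLogicTechnologies/IndicPhi-mini"
DATA_ID = "SandLogicTechnologies/Indic_Chat_Dataset"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

dataset = load_dataset(DATA_ID, split="train")  # split name is an assumption

# Simple generation check; tokenizer.apply_chat_template may be preferable for chat prompts.
prompt = "भारत की राजधानी क्या है?"  # "What is the capital of India?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```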
📚Data Curation:
One of the biggest challenges in adapting LLMs for Indic languages is the lack of high-quality, diverse data. To address this, we curated and cleaned one of the largest Indic conversational datasets to date.
📥Raw Data Collection:
- Sources: 53 open datasets (mostly from Hugging Face)
- Initial size: ~561M samples
- Languages covered: 13 Indic languages (Hindi, Kannada, Telugu, Tamil, Marathi, Malayalam, Gujarati, Bengali, Odia, Punjabi, Assamese, Sinhala, Urdu)
- Domains: General knowledge, translation corpora, instruction datasets, dialogue/chat, and some code datasets
🛠️Processing Pipeline:
We applied a three-stage curation pipeline to turn raw data into clean, instruction-style conversations (a simplified code sketch of these steps follows the list):
Manual Filtering
- Removed irrelevant or noisy subsets.
- Example: Hindi text-gen corpus reduced from 4.46M → 3.23M after pruning malformed entries.
Automated Preprocessing
- Deduplication: Eliminated duplicates & near-duplicates.
- Language Identification: Verified text belongs to the target language.
- Minimum Length Filtering: Removed incomplete or extremely short entries.
- Unicode & Formatting Normalization: Standardized punctuation, spaces, and encoding.
Format Conversion
- Converted all data into UltraChat-style schema (JSON with user–assistant turns).
- Ensured multi-turn, instruction-following consistency across languages.
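The sketch below illustrates how these automated steps can fit together. It is not our exact pipeline: the length threshold is an example value, langdetect stands in for whichever language-ID model is used, near-duplicate detection is reduced to exact matching, and the JSON keys mirror the common UltraChat-style `messages` layout.

```python
import json
import unicodedata

from langdetect import detect  # stand-in for the actual language-ID model

MIN_CHARS = 20  # example threshold for minimum-length filtering

def normalize(text: str) -> str:
    """Unicode & formatting normalization: NFC form, collapsed whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())

def clean_pairs(raw_pairs, target_lang="hi"):
    """raw_pairs: iterable of (instruction, response) strings for one language."""
    seen = set()
    for instruction, response in raw_pairs:
        instruction, response = normalize(instruction), normalize(response)
        # Minimum-length filtering: drop incomplete or truncated entries.
        if len(instruction) < MIN_CHARS or len(response) < MIN_CHARS:
            continue
        # Deduplication (exact match on the normalized pair; near-dup logic omitted).
        key = (instruction, response)
        if key in seen:
            continue
        seen.add(key)
        # Language identification: keep only rows in the target language.
        try:
            if detect(response) != target_lang:
                continue
        except Exception:
            continue
        # Format conversion: UltraChat-style user/assistant turns.
        yield {"messages": [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": response},
        ]}

if __name__ == "__main__":
    raw = [("भारत की राजधानी क्या है?", "भारत की राजधानी नई दिल्ली है, जो देश का प्रशासनिक केंद्र है।")]
    with open("cleaned.jsonl", "w", encoding="utf-8") as f:
        for row in clean_pairs(raw, target_lang="hi"):
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```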
✅Final Clean Dataset:
- Size: ~29M high-quality samples (after filtering from 561M).
- Coverage across all 13 languages, with per-language counts reflecting available resources (e.g., Hindi 4.63M, Kannada 3.54M, Tamil 3.86M, Malayalam 2.81M, Urdu ~58K).
- Conversational format ensures strong alignment for instruction-tuning.
🏋️Training Details:
To make fine-tuning feasible on a single A100 GPU, we used QLoRA (Quantized LoRA), which combines 4-bit quantization with parameter-efficient fine-tuning; a configuration sketch follows the setup list below.
Training Setup:
- Hardware: 1 × NVIDIA A100 80GB
- Precision: QLoRA (NF4 4-bit)
- Batching: Effective batch size 256 (32 per device × 8 gradient accumulation steps)
- Steps: 8,500 training steps
- Optimizer: AdamW (8-bit)
- Learning Rate Schedule: Cosine decay + 1,000 warmup steps
- Final Training Loss: 0.4805
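In code, the setup above maps roughly onto the following transformers/bitsandbytes configuration. This is a hedged sketch rather than the exact training script: the base-model Hub ID, peak learning rate, bf16 compute, and double quantization are assumptions not stated above.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)

# QLoRA: load the frozen base model in NF4 4-bit via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,   # assumption; common QLoRA default
    bnb_4bit_use_double_quant=True,          # assumption
)

BASE_MODEL = "microsoft/Phi-mini-MoE-instruct"  # assumed Hub ID for the base model
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)

# Hyperparameters as listed above: effective batch 256 (32 x 8 accumulation),
# 8,500 steps, 8-bit AdamW, cosine decay with 1,000 warmup steps.
training_args = TrainingArguments(
    output_dir="indicphi-mini-qlora",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,
    max_steps=8500,
    optim="adamw_bnb_8bit",
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    learning_rate=2e-4,            # assumption; peak LR is not listed above
    bf16=True,                     # assumption
    gradient_checkpointing=True,
    logging_steps=50,
)
```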
LoRA Adapter Configuration:
We trained adapters on the attention projection and feed-forward layers:
- q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Rank (r): 128
- Alpha: 128
- Dropout: 0
- Gradient Checkpointing: Enabled to reduce memory usage
This setup allowed us to run large batches in 4-bit precision while keeping GPU memory requirements manageable.
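In peft terms, the adapter configuration above corresponds roughly to the sketch below, reusing the quantized `model` from the previous snippet; it mirrors the listed hyperparameters but is not the verbatim training code.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the 4-bit model for training and enable gradient checkpointing.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # feed-forward projections
    ],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```

From here, the adapted model and the training arguments sketched earlier go into a standard Trainer (or trl's SFTTrainer) in the usual QLoRA recipe.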
📊Evaluation and Results:
To evaluate the effectiveness of fine-tuning, we benchmarked both the base Phi-mini-MoE-instruct model and our fine-tuned version on two widely used Indic benchmarks (a minimal evaluation sketch follows the list):
- sarvamai/arc-challenge-indic – an Indic adaptation of the ARC Challenge dataset for reasoning tasks.
- sarvamai/mmlu-indic – an Indic adaptation of the MMLU dataset for knowledge and domain understanding across 9 Indian languages.
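Both benchmarks are multiple-choice, so accuracy can be computed by scoring every option with the model and picking the highest-likelihood one; normalized accuracy applies the same procedure after dividing each score by the length of the choice. The sketch below illustrates this for arc-challenge-indic. The field names (question, choices, answerKey), the split name, and the absence of a per-language config argument are assumptions about the dataset layout.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "SandLogicTechnologies/IndicPhi-mini"  # or the base Phi-mini-MoE-instruct checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype=torch.bfloat16)
model.eval()

def choice_logprob(question: str, choice: str) -> float:
    """Sum of token log-probabilities of `choice` given a simple question prompt."""
    prompt = f"Question: {question}\nAnswer:"
    # Re-tokenizing the prompt alone is an approximation at the token boundary.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + choice, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits.float()
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for positions 1..T-1
    targets = full_ids[0, prompt_len:]                     # answer tokens only
    return log_probs[prompt_len - 1 :].gather(1, targets.unsqueeze(1)).sum().item()

# A per-language config name may also be required; "validation" split is an assumption.
ds = load_dataset("sarvamai/arc-challenge-indic", split="validation")
correct = 0
for ex in ds:
    scores = [choice_logprob(ex["question"], c) for c in ex["choices"]["text"]]
    pred = ex["choices"]["label"][int(torch.tensor(scores).argmax())]
    correct += int(pred == ex["answerKey"])
print(f"accuracy = {correct / len(ds):.4f}")
```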
Across both benchmarks, the fine-tuned model consistently outperformed the base model, with absolute gains of 3–4 percentage points in accuracy. On arc-challenge-indic, which tests reasoning-heavy queries, the model demonstrated improved problem-solving ability, reflected in both raw and normalized accuracy. Similarly, on mmlu-indic, which emphasizes factual and domain knowledge, the model achieved notable improvements, indicating enhanced cross-lingual generalization.
📊 Overall Benchmark Results
| Dataset | Metric | Base Model | IndicPhi-mini (Fine-tuned) | Improvement |
|---|---|---|---|---|
| arc-challenge-indic | Accuracy | 21.03 | 24.46 | +3.43 |
| arc-challenge-indic | Acc. Norm | 24.69 | 28.86 | +4.17 |
| mmlu-indic | Accuracy | 27.47 | 30.95 | +3.48 |
These results show that fine-tuning not only boosts overall accuracy but also helps the model better adapt to the nuances of Indic languages. Importantly, the improvements are consistent across both task types (reasoning and knowledge recall), highlighting the robustness of the approach.
🌐Per-Language Performance:
For evaluation, we first computed accuracy for each of the 9 Indic languages individually and then took the average across all languages to obtain the overall benchmark scores reported earlier.
The per-language analysis shows that all languages consistently benefited from fine-tuning, with improvements of a similar scale. This indicates that the fine-tuned model generalizes well across diverse Indic languages, rather than favoring only a subset.
Such uniform gains across languages strengthen the reliability of the model for multilingual applications, ensuring that performance improvements are not biased toward specific linguistic families.
The detailed per-language results are presented in the following table.
📊 Per-Language Results on sarvamai/arc-challenge-indic
| Language | Base Model Accuracy | Base Model Acc. Norm | IndicPhi-mini Accuracy | IndicPhi-mini Acc. Norm |
|---|---|---|---|---|
| Hindi | 22.61 | 26.17 | 25.15 | 29.40 |
| Kannada | 20.96 | 25.83 | 23.64 | 30.01 |
| Tamil | 20.78 | 24.61 | 24.28 | 29.11 |
| Telugu | 20.70 | 26.00 | 24.00 | 30.45 |
| Bengali | 21.91 | 25.04 | 25.22 | 29.80 |
| Gujarati | 18.17 | 21.30 | 22.87 | 26.40 |
| Malayalam | 22.26 | 23.91 | 25.64 | 27.01 |
| Marathi | 19.65 | 25.22 | 23.41 | 28.98 |
| Odia | 22.26 | 24.17 | 26.01 | 28.65 |
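As a quick consistency check, the overall arc-challenge-indic scores in the summary table are simply the unweighted means of the per-language accuracies above:

```python
base = [22.61, 20.96, 20.78, 20.70, 21.91, 18.17, 22.26, 19.65, 22.26]
tuned = [25.15, 23.64, 24.28, 24.00, 25.22, 22.87, 25.64, 23.41, 26.01]
print(round(sum(base) / len(base), 2))    # 21.03
print(round(sum(tuned) / len(tuned), 2))  # 24.47 (reported as 24.46; rounding)
```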
📊 Per-Language Results on sarvamai/mmlu-indic
| Language | Base Model Accuracy | IndicPhi-mini Accuracy |
|---|---|---|
| Hindi | 30.34 | 33.99 |
| Kannada | 26.53 | 30.01 |
| Tamil | 27.58 | 30.75 |
| Telugu | 26.07 | 30.98 |
| Bengali | 28.37 | 32.01 |
| Gujarati | 26.47 | 29.29 |
| Malayalam | 28.28 | 32.45 |
| Marathi | 27.92 | 30.11 |
| Odia | 25.69 | 28.97 |
🏁Conclusion:
This work demonstrates that fine-tuning Phi-mini-MoE on a carefully curated Indic corpus can significantly improve its performance on multilingual reasoning and knowledge tasks. By aggregating diverse datasets and standardizing them into a clean instruction-following format, we enabled the model to better capture the nuances of Indic languages. The use of efficient techniques such as QLoRA quantization and LoRA adapters ensured that this adaptation remained practical on limited hardware. Consistent gains on benchmarks highlight the value of curated quality data and lightweight fine-tuning strategies in equipping compact models with stronger Indic language capabilities.