pruthwik/ilid-muril-model

Model Details

Model Description

This model is a MuRIL-based model fine-tuned for the language identification task using the Hugging Face transformers library.

  • Developed by: Pruthwik Mishra, Yash Ingle
  • Funded by: SVNIT, Surat
  • Model type: BERT-based sequence classifier (238M parameters, F32, Safetensors)
  • License: MIT
  • Finetuned from model: google/muril-base-cased

Uses

The model can be used directly to identify English and the Indian languages listed below.

How to Get Started with the Model

"""Language Identification using fine-tuned model."""
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import TextClassificationPipeline
from datasets import Dataset
import torch

# this is an cased muril base model
tokenizer_model = "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model)
device = torch.device('cuda:0')


model = AutoModelForSequenceClassification.from_pretrained("pruthwik/ilid-muril-model")


def preprocess_function(examples):
    """Preprocess function for processing the data."""
    tokenized_inputs = tokenizer(examples['text'], truncation=True, max_length=256, padding='max_length')
    return tokenized_inputs


index_to_label_dict = {0: 'asm', 1: 'ben', 2: 'brx', 3: 'doi', 4: 'eng', 5: 'gom', 6: 'guj', 7: 'hin', 8: 'kan', 9: 'kas', 10: 'mai', 11: 'mal', 12: 'mar', 13: 'mni_Beng', 14: 'mni_Mtei', 15: 'npi', 16: 'ory', 17: 'pan', 18: 'san', 19: 'sat', 20: 'snd_Arab', 21: 'snd_Deva', 22: 'tam', 23: 'tel', 24: 'urd'}
test_texts = ["Hello, how are you?", "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।", "आनी हो एक गंभीर मूर्खपणा.", "ਮਨੁੱਖੀ ਦਿਮਾਗ਼ ਦੀ ਕਾਢ ਨੇ ਭਾਵੇਂ ਸਭ ਕੁਝ ਸੌਖਾ ਕਰ ਦਿੱਤਾ ਹੈ ਪਰ ਫਿਰ ਵੀ ਸਭ ਕੁਝ ਸਮਝਣਾ ਜਾਂ ਕਰਨਾ ਨਿਯਮਾਂ ਵਿੱਚ ਬੱਝਾ ਪਿਆ ਹੈ।", "માં વરસાદનું પાણી મોટા જથ્થામાં જમીનની નીચે જ ઉતરી જાય છે।", "କିନ୍ତୁ ପୁଅ, ତୁମେ ଛୋଟ।"]
test_dataset_raw = Dataset.from_dict({'text': test_texts})
# load the index to label dictionary from pickle file
num_labels = len(index_to_label_dict)
print(f'Number of labels: {num_labels}')
# create the tokenized dataset
test_tokenized_dataset = test_dataset_raw.map(preprocess_function, batched=True)
test_tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
# Load the model from the specified directory
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=0)
# Save the predictions to the output file
predictions_test = pipe(test_tokenized_dataset['text'], truncation=True, max_length=256)
actual_labels_test = []
for prediction in predictions_test:
    pred_label = prediction['label']
    pred_index = pred_label.split('_')[1]
    actual_labels_test.append(index_to_label_dict[int(pred_index)])
print(actual_labels_test)
"""Output:
['eng', 'hin', 'gom', 'pan', 'guj', 'ory']
"""

Label Indices For Languages

  • 0: asm (Assamese)
  • 1: ben (Bengali)
  • 2: brx (Bodo)
  • 3: doi (Dogri)
  • 4: eng (English)
  • 5: gom (Konkani)
  • 6: guj (Gujarati)
  • 7: hin (Hindi)
  • 8: kan (Kannada)
  • 9: kas (Kashmiri)
  • 10: mai (Maithili)
  • 11: mal (Malayalam)
  • 12: mar (Marathi)
  • 13: mni_Beng (Manipuri in Bengali Script)
  • 14: mni_Mtei (Manipuri in Meitei Script)
  • 15: npi (Nepali)
  • 16: ory (Oriya/Odia)
  • 17: pan (Punjabi)
  • 18: san (Sanskrit)
  • 19: sat (Santali)
  • 20: snd_Arab (Sindhi in Perso-Arabic Script)
  • 21: snd_Deva (Sindhi in Devanagari Script)
  • 22: tam (Tamil)
  • 23: tel (Telugu)
  • 24: urd (Urdu)

Downstream Use

The model can be integrated into any pipeline that requires language identification for the supported languages.
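A minimal sketch of such an integration, assuming the hypothetical route_by_language helper below (the abbreviated label map is for illustration only; the full mapping appears above):

"""Hypothetical sketch: dispatch text to downstream handling by language."""
from transformers import pipeline

# checkpoint names are taken from this card; everything else is illustrative
lang_id = pipeline("text-classification", model="pruthwik/ilid-muril-model",
                   tokenizer="google/muril-base-cased")
index_to_label_dict = {4: 'eng', 7: 'hin'}  # abbreviated; use the full table above


def route_by_language(text: str) -> str:
    """Return the language code for a text so a caller can dispatch on it."""
    prediction = lang_id(text, truncation=True, max_length=256)[0]
    pred_index = int(prediction['label'].split('_')[1])
    return index_to_label_dict[pred_index]


print(route_by_language("Hello, how are you?"))  # expected: 'eng'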

Out-of-Scope Use

The model may not work for languages other than English and the supported Indian languages.

Limitations

The model may not perform well on very low-resource languages such as Manipuri (in the Meitei script), Sindhi, and Maithili.
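One possible mitigation, sketched below under the assumption that the pipeline's softmax scores are a usable confidence signal (the 0.5 cutoff is an arbitrary, untuned assumption):

"""Sketch: fall back to 'unknown' when the prediction confidence is low."""
from transformers import pipeline

lang_id = pipeline("text-classification", model="pruthwik/ilid-muril-model",
                   tokenizer="google/muril-base-cased")
THRESHOLD = 0.5  # assumed cutoff, not tuned on any data


def identify_or_unknown(text: str) -> str:
    """Return the predicted label, or 'unknown' below the score threshold."""
    prediction = lang_id(text, truncation=True, max_length=256)[0]
    # prediction['score'] is the softmax probability of the top class
    if prediction['score'] < THRESHOLD:
        return 'unknown'
    return prediction['label']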


Training Details

Training Data

Train and dev splits of the ILID (Indian Language Identification Dataset) corpus (see Size below).

Training Procedure

Training consists of fine-tuning the MuRIL base cased model for 10 epochs with the hyperparameters listed below; a sketch of the corresponding Trainer configuration follows the list.

Training Hyperparameters

  • Training regime: fp32
  • Training Batch Size: 32
  • Evaluation Batch Size: 32
  • Learning Rate: 2e-5
  • Weight Decay: 0.01
  • Epochs: 10
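A hedged sketch of a Hugging Face Trainer configuration matching these hyperparameters (the output directory and the commented-out dataset variables are assumptions; the card does not publish the actual training script):

"""Illustrative fine-tuning setup matching the hyperparameters above."""
from transformers import AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments

# 25 classes: English plus the 24 Indian language/script labels listed above
model = AutoModelForSequenceClassification.from_pretrained(
    "google/muril-base-cased", num_labels=25)

training_args = TrainingArguments(
    output_dir="ilid-muril-model",  # assumed output location
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
)

# train_dataset / eval_dataset would be the tokenized ILID train/dev splits
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()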

Size

The ILID (Indian Language Identification Dataset) corpus, containing 250K examples.

Evaluation

The model is evaluated on the created corpus and on the Bhasha-Abhijnaanam benchmark.

Testing Data, Factors & Metrics

Testing Data

Test split of the ILID corpus.

Metrics

F1-score
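As a hedged illustration (the card reports an averaged F1 but does not state the averaging scheme, so macro averaging is assumed here, and the label arrays are made-up placeholders):

"""Illustrative F1 computation; gold and predicted labels are placeholders."""
from sklearn.metrics import f1_score

y_true = ['eng', 'hin', 'guj', 'ory']  # gold labels (made up)
y_pred = ['eng', 'hin', 'guj', 'pan']  # model predictions (made up)

# macro-averaged F1 treats every language class equally
print(f1_score(y_true, y_pred, average='macro'))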

Results

0.96 F1-score on average.

Summary

This model is a MuRIL-based model fine-tuned for the language identification task. It can identify English and all 22 official Indian languages.

Compute Infrastructure

The model was trained on a single NVIDIA H100 GPU with 94 GB of memory.

Citation

BibTeX:

@misc{ingle2025ilidnativescriptlanguage,
      title={ILID: Native Script Language Identification for Indian Languages}, 
      author={Yash Ingle and Pruthwik Mishra},
      year={2025},
      eprint={2507.11832},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.11832}, 
}