# MuRIL_WR: MuRIL Telugu Sentiment Classification Model (With Rationale)

## Model Overview
MuRIL_WR is a Telugu sentiment classification model built on MuRIL (Multilingual Representations for Indian Languages), a transformer-based BERT model pre-trained specifically for 17 languages, including Telugu and English.
The "WR" in the model name stands for "With Rationale", indicating that this model is trained using both sentiment labels and human-annotated rationales from the TeSent_Benchmark-Dataset.
## Model Details
- Architecture: MuRIL (BERT-base for Indian languages, pre-trained on 17 languages)
- Pretraining Data: Large corpus of Telugu sentences drawn from the web, religious texts, news data, etc.
- Pretraining Objectives: Masked Language Modeling (MLM) and Translation Language Modeling (TLM) tasks
- Fine-tuning Data: TeSent_Benchmark-Dataset, using both sentence-level sentiment labels (positive, negative, neutral) and rationale annotations
- Task: Sentence-level sentiment classification (3-way)
- Rationale Usage: Human-annotated rationales serve as additional supervision during fine-tuning ("WR" = With Rationale)
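The rationale supervision described above can be sketched as a joint objective: a standard sentence-level classification loss combined with a token-level loss that pushes the model's per-token importance scores toward the human-annotated rationale mask. The shapes, random inputs, and the trade-off weight `alpha` below are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of rationale-supervised training ("WR"):
# a sentence-level cross-entropy loss is combined with a token-level
# loss matching predicted importance scores to human rationale masks.
batch, seq_len, num_labels = 2, 8, 3

sent_logits = torch.randn(batch, num_labels)   # sentence-level predictions
token_scores = torch.randn(batch, seq_len)     # per-token importance scores
labels = torch.tensor([0, 2])                  # gold sentiment labels (0/1/2)
rationale = torch.randint(0, 2, (batch, seq_len)).float()  # 1 = rationale token

cls_loss = F.cross_entropy(sent_logits, labels)
rat_loss = F.binary_cross_entropy_with_logits(token_scores, rationale)

alpha = 0.5  # assumed trade-off weight between the two objectives
loss = cls_loss + alpha * rat_loss
print(float(loss))
```

In practice `token_scores` would come from the model (e.g., a linear head over MuRIL's token embeddings) rather than random tensors, and `alpha` would be tuned on validation data.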
## Intended Use
- Primary Use: Benchmarking Telugu sentiment classification on the TeSent_Benchmark-Dataset, especially as a baseline for models trained with and without rationales
- Research Setting: Recommended for academic research in low-resource and explainable NLP settings, especially for informal, social media, or conversational Telugu data
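For a sense of how such a model is used, here is a minimal inference sketch with the Hugging Face `transformers` library. It assumes the public base checkpoint `google/muril-base-cased`; the fine-tuned MuRIL_WR weights themselves are not assumed to be released, so the 3-way classification head here is freshly initialized and its outputs are illustrative only:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed setup: the public MuRIL base checkpoint with a fresh 3-way head.
# Replace MODEL_NAME with the actual MuRIL_WR checkpoint if available.
MODEL_NAME = "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=3,
    id2label={0: "negative", 1: "neutral", 2: "positive"},
    label2id={"negative": 0, "neutral": 1, "positive": 2},
)

# "This movie is very good" in Telugu
inputs = tokenizer("ఈ సినిమా చాలా బాగుంది", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred = model.config.id2label[int(logits.argmax(dim=-1))]
print(pred)  # arbitrary here, since the head is untrained
```

With the actual fine-tuned weights loaded, `pred` would be the model's sentiment decision for the input sentence.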
## Why MuRIL?
MuRIL is pre-trained specifically on Indian languages and offers a better grasp of Telugu morphology and syntax than general multilingual models such as mBERT and XLM-R.
Because its pre-training data favors informal web text, MuRIL is especially effective for informal, social media, and conversational NLP tasks in Telugu; performance may be lower on formal or classical Telugu.
## Performance and Limitations
Strengths:
- Superior understanding of Telugu compared to general multilingual models
- Excels in informal, web, or conversational Telugu sentiment tasks
- Provides explicit rationales for predictions, aiding explainability
- Strong baseline for Telugu sentiment classification
Limitations:
- May underperform on formal or classical Telugu tasks due to pre-training corpus
- Applicability is limited to Telugu; not well suited to highly formal text processing
- Requires sufficient labeled Telugu data and rationale annotations for best performance
## Training Data
- Dataset: TeSent_Benchmark-Dataset
- Data Used: The Content (Telugu sentence), Label (sentiment label), and Rationale (human-annotated rationale) columns are used for MuRIL_WR training
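To make the three columns concrete, here is a miniature, hand-made stand-in for the training data with pandas. The rows, file format, and label strings are hypothetical examples; the real TeSent_Benchmark-Dataset's file names and exact schema are not assumed here:

```python
from io import StringIO
import pandas as pd

# Hypothetical two-row miniature of the columns used for MuRIL_WR training.
# Content = Telugu sentence, Label = sentiment, Rationale = annotated span.
csv = StringIO(
    "Content,Label,Rationale\n"
    "ఈ సినిమా చాలా బాగుంది,positive,చాలా బాగుంది\n"
    "సేవ దారుణంగా ఉంది,negative,దారుణంగా\n"
)
df = pd.read_csv(csv)

texts = df["Content"].tolist()       # model inputs
labels = df["Label"].tolist()        # classification targets
rationales = df["Rationale"].tolist()  # rationale supervision ("WR")
print(len(texts), labels)
```

A "without rationale" baseline would use only `texts` and `labels`; MuRIL_WR additionally consumes `rationales` during fine-tuning.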
## Language Coverage
- Language: Telugu (`te`)
- Model Scope: This implementation and evaluation focus strictly on Telugu sentiment classification
## Citation and More Details
For detailed experimental setup, evaluation metrics, and comparisons with rationale-based models, please refer to our paper.
## License
Released under CC BY 4.0.