|
--- |
|
license: apache-2.0 |
|
language: |
|
- ar |
|
pipeline_tag: text-classification |
|
datasets: |
|
- labr |
|
widget: |
|
- text: من أفضل الكتب التي قرأتها في هذا العام |
|
example_title: Positive |
|
- text: الكتاب سيء، لا أنصح أحد بقراءته أبدا |
|
example_title: Negative |
|
- text: لا يمكنك الجزم بشيء حول هذا الكتاب |
|
example_title: Neutral |
|
metrics: |
|
- precision |
|
- recall |
|
- f1 |
|
library_name: transformers |
|
tags: |
|
- sentiment-analysis
|
--- |
|
|
|
# Introduction |
|
This model predicts whether the sentiment of a text is Positive, Neutral, or Negative.
|
It is a fine-tuned version of [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) on the [labr](https://huggingface.co/datasets/labr) dataset.
|
|
|
# Data |
|
The data used is [labr](https://huggingface.co/datasets/labr), an Arabic book reviews dataset. |
|
The sentiment label is derived from the star rating of each review, as follows:
|
|
|
| Number of stars | Sentiment |
|
|-----------------|-----------| |
|
| 1-2 | Negative | |
|
| 3 | Neutral | |
|
| 4-5 | Positive | |
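In code, and consistent with the training script below, this mapping can be written as a simple dictionary. Note that labr's `label` field is the 0-indexed star rating, so 0 corresponds to 1 star:

```python
# Map labr's 0-indexed star ratings (0 == 1 star) to sentiment labels.
rate_to_sentiment = {
    0: "Negative",  # 1 star
    1: "Negative",  # 2 stars
    2: "Neutral",   # 3 stars
    3: "Positive",  # 4 stars
    4: "Positive",  # 5 stars
}
```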
|
|
|
# Training |
|
Using the pre-trained Arabic model [MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) as a base, we fine-tuned it for a classification task.
|
Training was run for 3 epochs using the Hugging Face `Trainer` on Google Colab.
|
This is a proof-of-concept experiment, so the training hyperparameters were not optimized.
|
|
|
# Evaluation |
|
The model was evaluated on the [labr](https://huggingface.co/datasets/labr) test set, using the same preprocessing steps as in training.

Note that the following scores are macro averages.
|
| Metric | Score | |
|
|-----------------|-----------| |
|
| Precision | 0.663 | |
|
| Recall | 0.662 | |
|
| F1 | 0.66 | |
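For reference, macro-averaged scores like these can be computed with scikit-learn. A minimal sketch, where the label lists are illustrative placeholders rather than the actual test-set predictions:

```python
from sklearn.metrics import precision_recall_fscore_support

# Illustrative placeholders; in practice these are the class indices of
# the labr test set references and the model's predictions.
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 1, 2, 1, 0]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
```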
|
|
|
# Using the model |
|
To use the model in your code, follow the Hugging Face instructions, or use the `pipeline` API:
|
```python |
|
from transformers import pipeline |
|
|
|
pipe = pipeline("text-classification", model="AbdallahNasir/book-review-sentiment-classification") |
|
result = pipe("من أفضل الكتب التي قرأتها في هذا العام") |
|
print(result) |
|
``` |
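The pipeline returns a list with one dictionary per input, holding the predicted label and its score. If you prefer not to use the `pipeline` helper, here is a minimal sketch of direct inference with the standard `transformers` classes:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "AbdallahNasir/book-review-sentiment-classification"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Tokenize, run the model, and take the most probable class.
inputs = tokenizer("من أفضل الكتب التي قرأتها في هذا العام", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1)
print(model.config.id2label[probs.argmax().item()], probs.max().item())
```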
|
|
|
# Training code |
|
Running the following code should reproduce the results above. You can run it in Google Colab; please use a GPU runtime so the training finishes quickly.
|
|
|
```python |
|
# Notebook only: |
|
!pip install transformers[torch] datasets |
|
|
|
# Download and load the data |
|
import datasets |
|
dataset = datasets.load_dataset("labr") |
|
|
|
# Transform the ratings into Sentiment |
|
POSITIVE = "Positive" |
|
NEUTRAL = "Neutral" |
|
NEGATIVE = "Negative" |
|
rate_to_sentiment = {0: NEGATIVE, 1: NEGATIVE, 2: NEUTRAL, 3: POSITIVE, 4: POSITIVE}  # labr labels are 0-indexed star ratings (0 == 1 star)
|
dataset = dataset.map(lambda example: {"sentiment": rate_to_sentiment[example["label"]]}, remove_columns=["label"]) |
|
dataset = dataset.rename_column("sentiment", "label") |
|
class_names = [POSITIVE, NEUTRAL, NEGATIVE] |
|
num_classes = len(class_names) |
|
dataset = dataset.cast_column('label', datasets.ClassLabel(num_classes=num_classes, names=class_names)) |
|
|
|
# Download and load the pre-trained model and tokenizer |
|
from transformers import AutoModelForSequenceClassification, AutoTokenizer |
|
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2") |
|
model = AutoModelForSequenceClassification.from_pretrained("UBC-NLP/MARBERTv2", num_labels=3) |
|
|
|
# Tokenize data for training |
|
def tokenize_function(examples): |
|
    return tokenizer(examples["text"], truncation=True, max_length=512, return_attention_mask=True, return_length=True)
|
tokenized_datasets = dataset.map(tokenize_function, batched=False, num_proc=16) |
|
|
|
# Define data collator, useful for training and batching. |
|
from transformers import DataCollatorWithPadding |
|
data_collator = DataCollatorWithPadding(tokenizer=tokenizer) |
|
|
|
# Define the training arguments
|
from transformers import TrainingArguments, Trainer |
|
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch") |
|
|
|
|
trainer = Trainer( |
|
model, |
|
training_args, |
|
train_dataset=tokenized_datasets["train"], |
|
eval_dataset=tokenized_datasets["test"], |
|
data_collator=data_collator, |
|
tokenizer=tokenizer, |
|
) |
|
|
|
# Train and save |
|
trainer.train() |
|
trainer.save_model("final_output") |
|
``` |
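As written, the `Trainer` only reports the evaluation loss at each epoch. To also log the macro-averaged metrics reported above during training, a `compute_metrics` function can be passed when constructing the `Trainer` (`Trainer(..., compute_metrics=compute_metrics)`). A minimal sketch using scikit-learn:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def compute_metrics(eval_pred):
    # The Trainer passes a (logits, labels) pair for the evaluation set.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="macro", zero_division=0
    )
    return {"precision": precision, "recall": recall, "f1": f1}
```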
|
|
|
##### Keywords |
|
* sentiment analysis |
|
* Arabic
|
* book reviews |