---
license: apache-2.0
language:
- ar
pipeline_tag: text-classification
datasets:
- labr
widget:
- text: من أفضل الكتب التي قرأتها في هذا العام
  example_title: Positive
- text: الكتاب سيء، لا أنصح أحد بقراءته أبدا
  example_title: Negative
- text: لا يمكنك الجزم بشيء حول هذا الكتاب
  example_title: Neutral
metrics:
- precision
- recall
- f1
library_name: transformers
tags:
- code
- sentiment analysis
- sentiment-analysis
---

# Introduction
This model predicts whether the sentiment of a text is Positive, Neutral, or Negative.
It is a fine-tuned version of [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) on the [labr](https://huggingface.co/datasets/labr) dataset.

# Data
The data used is [labr](https://huggingface.co/datasets/labr), a dataset of Arabic book reviews.
The sentiment label is derived from the number of stars given in each review.

| Number of stars | Sentiment |
|-----------------|-----------|
| 1-2             | Negative  |
| 3               | Neutral   |
| 4-5             | Positive  |
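
For reference, here is a minimal sketch of the mapping described in the table, written in terms of raw star ratings (1–5); the labr `label` field stores these shifted to 0–4, which is handled by the mapping in the Training code section below:

```python
# Minimal sketch of the stars-to-sentiment mapping from the table above.
def stars_to_sentiment(stars: int) -> str:
    if stars <= 2:
        return "Negative"
    if stars == 3:
        return "Neutral"
    return "Positive"

assert stars_to_sentiment(1) == "Negative"
assert stars_to_sentiment(3) == "Neutral"
assert stars_to_sentiment(5) == "Positive"
```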

# Training
Using the Arabic pre-trained [MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) as a base, we fine-tuned the model for a classification task.
Training ran for 3 epochs using the Hugging Face Trainer on Google Colab.
This is a POC experiment, so the training hyper-parameters were not optimized.
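
The full script in the Training code section below relies on the Trainer defaults, which amount to 3 training epochs. An equivalent, more explicit configuration would look roughly like this (a sketch, not the exact arguments used):

```python
# Sketch of the (non-optimized) training setup; num_train_epochs=3 simply
# makes the Trainer default explicit.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test-trainer",
    num_train_epochs=3,
    evaluation_strategy="epoch",
)
```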

# Evaluation
The model was evaluated on the test set of [labr](https://huggingface.co/datasets/labr), using the same preprocessing steps as in training.
Please note that the following scores are macro averages.

| Metric    | Score |
|-----------|-------|
| Precision | 0.663 |
| Recall    | 0.662 |
| F1        | 0.66  |
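
The original evaluation script is not included here; the following is a minimal sketch of how comparable numbers could be reproduced with scikit-learn. It assumes the published model's `id2label` maps predictions to the `Positive`/`Neutral`/`Negative` strings; if not, the predicted labels need to be translated first.

```python
# Minimal evaluation sketch (not the original script).
import datasets
from sklearn.metrics import classification_report
from transformers import pipeline

pipe = pipeline("text-classification", model="AbdallahNasir/book-review-sentiment-classification")

# Same rating-to-sentiment mapping as in the Training code section below.
rate_to_sentiment = {0: "Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Positive"}
test = datasets.load_dataset("labr", split="test")

y_true = [rate_to_sentiment[label] for label in test["label"]]
y_pred = [out["label"] for out in pipe(test["text"], truncation=True, max_length=512, batch_size=32)]

# The "macro avg" row corresponds to the precision/recall/F1 reported above.
print(classification_report(y_true, y_pred, digits=3))
```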

# Using the model
To use the model in your code, follow the Hugging Face instructions, or use the `pipeline` API directly:
```python
from transformers import pipeline

pipe = pipeline("text-classification", model="AbdallahNasir/book-review-sentiment-classification")
result = pipe("من أفضل الكتب التي قرأتها في هذا العام")
print(result)
```
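
The pipeline returns a list with one dictionary per input, each containing a predicted label and a confidence score; the exact label strings depend on the `id2label` mapping stored in the model config.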

# Training code
Running the following code reproduces the results above. You can run it in Google Colab; please use a GPU runtime to finish training quickly.

```python
# Notebook only:
!pip install transformers[torch] datasets

# Download and load the data
import datasets
dataset = datasets.load_dataset("labr")

# Transform the ratings into Sentiment
POSITIVE = "Positive"
NEUTRAL = "Neutral"
NEGATIVE = "Negative"
rate_to_sentiment = {0: NEGATIVE, 1: NEGATIVE, 2: NEUTRAL, 3: POSITIVE, 4: POSITIVE}
dataset = dataset.map(lambda example: {"sentiment": rate_to_sentiment[example["label"]]}, remove_columns=["label"])
dataset = dataset.rename_column("sentiment", "label")
class_names = [POSITIVE, NEUTRAL, NEGATIVE]  
num_classes = len(class_names)
dataset = dataset.cast_column('label', datasets.ClassLabel(num_classes=num_classes, names=class_names))

# Download and load the pre-trained model and tokenizer
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForSequenceClassification.from_pretrained("UBC-NLP/MARBERTv2", num_labels=3)

# Tokenize data for training
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512, return_attention_mask=True, return_length=True)
tokenized_datasets = dataset.map(tokenize_function, batched=False, num_proc=16)

# Define data collator, useful for training and batching.
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Define the training arguments
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

# Create the trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Train and save
trainer.train()
trainer.save_model("final_output")
```

##### Keywords
* sentiment analysis
* arabic
* book reviews