---
license: apache-2.0
language:
- ar
pipeline_tag: text-classification
datasets:
- labr
widget:
- text: من أفضل الكتب التي قرأتها في هذا العام
example_title: Positive
- text: الكتاب سيء، لا أنصح أحد بقراءته أبدا
example_title: Negative
- text: لا يمكنك الجزم بشيء حول هذا الكتاب
example_title: Neutral
metrics:
- precision
- recall
- f1
library_name: transformers
tags:
- code
- sentiment analysis
- sentiment-analysis
---
# Introduction
This model predicts whether the sentiment of a text is Positive, Neutral, or Negative.
It is a fine-tuned version of [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) on [labr](https://huggingface.co/datasets/labr).
# Data
The data used is [labr](https://huggingface.co/datasets/labr), an Arabic book reviews dataset.
The sentiment label is derived from the number of stars given in each review.
| Number of stars | Sentiment |
|-----------------|-----------|
| 1-2 | Negative |
| 3 | Neutral |
| 4-5 | Positive |
# Training
Using the Arabic pre-trained [MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) as a base, we fine-tuned the model for a classification task.
Training ran for 3 epochs using the Hugging Face `Trainer` on Google Colab.
This is a POC experiment, so the training hyper-parameters were not optimized; the full script is given in the Training code section below.
# Evaluation
The model was evaluated on the test set of [labr](https://huggingface.co/datasets/labr), using the same preprocessing steps as in training.
Please note that the following results are macro averages.
| Metric | Score |
|-----------------|-----------|
| Precision | 0.663 |
| Recall | 0.662 |
| F1 | 0.66 |
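The snippet below is a minimal sketch of how such an evaluation could be reproduced with the published model. It is not the exact script used above: it assumes the pipeline returns the string labels `Positive`/`Neutral`/`Negative` and uses scikit-learn to compute the macro-averaged metrics.
```python
from datasets import load_dataset
from sklearn.metrics import precision_recall_fscore_support
from transformers import pipeline

# Same star-to-sentiment mapping as in the training script below.
rate_to_sentiment = {0: "Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Positive"}

test_set = load_dataset("labr", split="test")
true_labels = [rate_to_sentiment[label] for label in test_set["label"]]

pipe = pipeline("text-classification", model="AbdallahNasir/book-review-sentiment-classification")
predictions = pipe(list(test_set["text"]), truncation=True, max_length=512, batch_size=32)
pred_labels = [prediction["label"] for prediction in predictions]

precision, recall, f1, _ = precision_recall_fscore_support(true_labels, pred_labels, average="macro")
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
```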
# Using the model
To use the model in your code, follow the Hugging Face instructions, or use the snippet below:
```python
from transformers import pipeline
pipe = pipeline("text-classification", model="AbdallahNasir/book-review-sentiment-classification")
result = pipe("من أفضل الكتب التي قرأتها في هذا العام")
print(result)
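# result is a list with one dict per input, e.g. [{'label': 'Positive', 'score': 0.99}]
# (the exact score will vary).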
```
# Training code
The following code should reproduce the results above. You can run it in Google Colab; please use a GPU runtime to finish the training quickly.
```python
# Notebook only:
!pip install transformers[torch] datasets
# Download and load the data
import datasets
dataset = datasets.load_dataset("labr")
# Transform the ratings into Sentiment
POSITIVE = "Positive"
NEUTRAL = "Neutral"
NEGATIVE = "Negative"
rate_to_sentiment = {0: NEGATIVE, 1: NEGATIVE, 2: NEUTRAL, 3: POSITIVE, 4: POSITIVE}
dataset = dataset.map(lambda example: {"sentiment": rate_to_sentiment[example["label"]]}, remove_columns=["label"])
dataset = dataset.rename_column("sentiment", "label")
class_names = [POSITIVE, NEUTRAL, NEGATIVE]
num_classes = len(class_names)
dataset = dataset.cast_column('label', datasets.ClassLabel(num_classes=num_classes, names=class_names))
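# The ClassLabel order defines the integer ids: Positive -> 0, Neutral -> 1, Negative -> 2.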
# Download and load the pre-trained model and tokenizer
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForSequenceClassification.from_pretrained("UBC-NLP/MARBERTv2", num_labels=3)
# Tokenize data for training
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, return_length=True, return_attention_mask=True, max_length=512)
tokenized_datasets = dataset.map(tokenize_function, batched=False, num_proc=16)
# Define data collator, useful for training and batching.
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Defining training args
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
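# No hyper-parameters are overridden here, so the Trainer defaults apply
# (3 epochs, learning rate 5e-5), matching the POC setup described above.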
trainer = Trainer(
model,
training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["test"],
data_collator=data_collator,
tokenizer=tokenizer,
)
# Train and save
trainer.train()
trainer.save_model("final_output")
```
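After training, `final_output` contains both the model and the tokenizer (the `Trainer` saves the tokenizer passed to it), so it can presumably be reloaded directly into a pipeline:
```python
from transformers import pipeline

# Load the fine-tuned model and tokenizer saved by trainer.save_model above.
pipe = pipeline("text-classification", model="final_output")
print(pipe("من أفضل الكتب التي قرأتها في هذا العام"))
# Note: the script above does not set id2label, so the reloaded model will likely report
# generic class names (LABEL_0 = Positive, LABEL_1 = Neutral, LABEL_2 = Negative,
# following the ClassLabel order defined earlier).
```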
##### Keywords
* sentiment analysis
* arabic
* book reviews