Commit 77a13ef · Parent: 7a9dd99

Update README.md

README.md CHANGED
---

# Introduction

This model predicts whether the sentiment of a text is Positive, Neutral, or Negative.
It is a fine-tuned version of [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) on [labr](https://huggingface.co/datasets/labr).

# Data

The data used is [labr](https://huggingface.co/datasets/labr), an Arabic book reviews dataset.
The sentiment is obtained from the number of stars given by each review.

| Number of stars | Sentiment |
|-----------------|-----------|
| 1-2             | Negative  |
| 3               | Neutral   |
| 4-5             | Positive  |
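
For reference, a quick way to peek at the raw data, assuming the `datasets` library. labr's `label` column holds the rating minus one (0-4 for 1-5 stars), which is what the mapping in the training code below implies:

```python
# A quick look at the raw labr data (assumes the `datasets` library).
# The "label" column holds the rating minus one (0-4 for 1-5 stars),
# which is why the training code below maps 0-1 -> Negative,
# 2 -> Neutral, and 3-4 -> Positive.
import datasets

labr = datasets.load_dataset("labr", split="train")
print(labr[0]["label"], labr[0]["text"][:80])
```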

# Training

Using the Arabic pre-trained [MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) as a base, we fine-tuned the model for a classification task.
Training ran for 3 epochs using the Hugging Face Trainer on Google Colab; the full script is in the Training code section below.
This is a proof-of-concept experiment, so the training hyper-parameters were not optimized.

# Evaluation

The model was evaluated on the test set from [labr](https://huggingface.co/datasets/labr), using the same preprocessing steps as in training.
Note that the scores below are macro averages.

| Metric    | Score |
|-----------|-------|
| Precision | 0.663 |
| Recall    | 0.662 |
| F1        | 0.66  |
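
The evaluation script itself is not part of this commit. As a sketch (an assumption, not the author's actual tooling), macro-averaged scores of this kind can be computed with scikit-learn:

```python
# Hypothetical evaluation sketch (the actual script is not in this commit).
# Macro averaging computes each metric per class, then takes the unweighted mean.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 1, 2, 2]  # placeholder gold class ids
y_pred = [0, 1, 1, 2, 0]  # placeholder predicted class ids
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
```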

# Using the model

Once trained (or downloaded from the Hub), the model can be used for inference with the standard `transformers` pipeline, as sketched below.
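
A minimal sketch, assuming the `final_output` checkpoint directory produced by the training script below; substitute the model's Hub id for a published checkpoint:

```python
# Minimal inference sketch, assuming the "final_output" directory saved by
# the training code below (a published Hub id would work the same way).
from transformers import pipeline

classifier = pipeline("text-classification", model="final_output")
print(classifier("كتاب رائع أنصح بقراءته"))  # an Arabic review: "a great book, I recommend reading it"
# Without an id2label mapping the labels print as "LABEL_0" etc.;
# see the note after the training code for attaching readable names.
```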

# Training code

The following code reproduces the results reported above. You can run it in Google Colab; please use a GPU runtime so training finishes quickly.

```python
# Notebook only:
!pip install transformers[torch] datasets

# Download and load the data
import datasets

dataset = datasets.load_dataset("labr")

# Transform the 0-4 star ratings into sentiment classes
POSITIVE = "Positive"
NEUTRAL = "Neutral"
NEGATIVE = "Negative"
rate_to_sentiment = {0: NEGATIVE, 1: NEGATIVE, 2: NEUTRAL, 3: POSITIVE, 4: POSITIVE}
dataset = dataset.map(lambda example: {"sentiment": rate_to_sentiment[example["label"]]}, remove_columns=["label"])
dataset = dataset.rename_column("sentiment", "label")
class_names = [POSITIVE, NEUTRAL, NEGATIVE]
num_classes = len(class_names)
dataset = dataset.cast_column("label", datasets.ClassLabel(num_classes=num_classes, names=class_names))

# Download and load the pre-trained model and tokenizer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForSequenceClassification.from_pretrained("UBC-NLP/MARBERTv2", num_labels=3)

# Tokenize the data for training
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, return_length=True, return_attention_mask=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=False, num_proc=16)

# Define the data collator, which pads each batch dynamically
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Define the training arguments: evaluate once per epoch;
# the number of epochs is the TrainingArguments default of 3
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Train and save
trainer.train()
trainer.save_model("final_output")
```
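
One optional tweak, not in the original script: the saved checkpoint carries no `id2label` mapping, so downstream pipelines print generic label names. A sketch of attaching readable names, assuming the class order produced by the `ClassLabel` cast above (Positive=0, Neutral=1, Negative=2):

```python
# Optional (not in the original script): attach readable label names so the
# inference pipeline prints "Positive"/"Neutral"/"Negative" instead of "LABEL_0".
# Assumes the ClassLabel order above: Positive=0, Neutral=1, Negative=2.
id2label = {i: name for i, name in enumerate(class_names)}
label2id = {name: i for i, name in enumerate(class_names)}
model = AutoModelForSequenceClassification.from_pretrained(
    "UBC-NLP/MARBERTv2", num_labels=3, id2label=id2label, label2id=label2id
)
```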