Commit 77a13ef · Parent: 7a9dd99

Update README.md

README.md CHANGED
---

# Introduction

This model predicts whether the sentiment of a text is Positive, Neutral, or Negative.
It is a fine-tuned version of [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) on [labr](https://huggingface.co/datasets/labr).

# Data

The data used is [labr](https://huggingface.co/datasets/labr), an Arabic book reviews dataset.
The sentiment is obtained from the number of stars given by each review.

| Number of stars | Sentiment |
|-----------------|-----------|
| 1-2             | Negative  |
| 3               | Neutral   |
| 4-5             | Positive  |
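
For reference, a quick way to peek at the raw data, assuming the `datasets` library. labr's `label` column holds the rating minus one (0-4 for 1-5 stars), which is what the mapping in the training code below implies:

```python
# A quick look at the raw labr data (assumes the `datasets` library).
# The "label" column holds the rating minus one (0-4 for 1-5 stars),
# which is why the training code below maps 0-1 -> Negative,
# 2 -> Neutral, and 3-4 -> Positive.
import datasets

labr = datasets.load_dataset("labr", split="train")
print(labr[0]["label"], labr[0]["text"][:80])
```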

# Training

Using the Arabic pre-trained [MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) as a base, we fine-tuned the model for a classification task.
Training ran for 3 epochs using the Hugging Face Trainer on Google Colab; the full script is in the Training code section below.
This is a proof-of-concept experiment, so the training hyper-parameters were not optimized.

# Evaluation

The model was evaluated on the test set from [labr](https://huggingface.co/datasets/labr), using the same preprocessing steps as in training.
Note that the scores below are macro averages.

| Metric    | Score |
|-----------|-------|
| Precision | 0.663 |
| Recall    | 0.662 |
| F1        | 0.66  |
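
The evaluation script itself is not part of this commit. As a sketch (an assumption, not the author's actual tooling), macro-averaged scores of this kind can be computed with scikit-learn:

```python
# Hypothetical evaluation sketch (the actual script is not in this commit).
# Macro averaging computes each metric per class, then takes the unweighted mean.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 1, 2, 2]  # placeholder gold class ids
y_pred = [0, 1, 1, 2, 0]  # placeholder predicted class ids
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
```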

# Using the model

Once trained (or downloaded from the Hub), the model can be used for inference with the standard `transformers` pipeline, as sketched below.
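
A minimal sketch, assuming the `final_output` checkpoint directory produced by the training script below; substitute the model's Hub id for a published checkpoint:

```python
# Minimal inference sketch, assuming the "final_output" directory saved by
# the training code below (a published Hub id would work the same way).
from transformers import pipeline

classifier = pipeline("text-classification", model="final_output")
print(classifier("كتاب رائع أنصح بقراءته"))  # an Arabic review: "a great book, I recommend reading it"
# Without an id2label mapping the labels print as "LABEL_0" etc.;
# see the note after the training code for attaching readable names.
```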

# Training code

The following code reproduces the results reported above. You can run it in Google Colab; please use a GPU runtime so training finishes quickly.

```python
# Notebook only:
!pip install transformers[torch] datasets

# Download and load the data
import datasets

dataset = datasets.load_dataset("labr")

# Transform the 0-4 star ratings into sentiment classes
POSITIVE = "Positive"
NEUTRAL = "Neutral"
NEGATIVE = "Negative"
rate_to_sentiment = {0: NEGATIVE, 1: NEGATIVE, 2: NEUTRAL, 3: POSITIVE, 4: POSITIVE}
dataset = dataset.map(lambda example: {"sentiment": rate_to_sentiment[example["label"]]}, remove_columns=["label"])
dataset = dataset.rename_column("sentiment", "label")
class_names = [POSITIVE, NEUTRAL, NEGATIVE]
num_classes = len(class_names)
dataset = dataset.cast_column("label", datasets.ClassLabel(num_classes=num_classes, names=class_names))

# Download and load the pre-trained model and tokenizer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForSequenceClassification.from_pretrained("UBC-NLP/MARBERTv2", num_labels=3)

# Tokenize the data for training
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, return_length=True, return_attention_mask=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=False, num_proc=16)

# Define the data collator, which pads each batch dynamically
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Define the training arguments: evaluate once per epoch;
# the number of epochs is the TrainingArguments default of 3
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Train and save
trainer.train()
trainer.save_model("final_output")
```
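
One optional tweak, not in the original script: the saved checkpoint carries no `id2label` mapping, so downstream pipelines print generic label names. A sketch of attaching readable names, assuming the class order produced by the `ClassLabel` cast above (Positive=0, Neutral=1, Negative=2):

```python
# Optional (not in the original script): attach readable label names so the
# inference pipeline prints "Positive"/"Neutral"/"Negative" instead of "LABEL_0".
# Assumes the ClassLabel order above: Positive=0, Neutral=1, Negative=2.
id2label = {i: name for i, name in enumerate(class_names)}
label2id = {name: i for i, name in enumerate(class_names)}
model = AutoModelForSequenceClassification.from_pretrained(
    "UBC-NLP/MARBERTv2", num_labels=3, id2label=id2label, label2id=label2id
)
```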