AbdallahNasir committed
Commit 77a13ef · 1 Parent(s): 7a9dd99

Update README.md

Files changed (1)
  1. README.md +76 -2
README.md CHANGED
@@ -15,10 +15,12 @@ widget:
 ---
 
 # Introduction
- This model predicts the sentiment of a text if it is Positive, Neutral, or Negative. This model is a finetune version of [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) on [labr](https://huggingface.co/datasets/labr).
 
 # Data
- The data used is [labr](https://huggingface.co/datasets/labr), an Arabic book reviews dataset. The sentiment is obtained from the number of stars given by each review.
 
 | Number of stars | Sentiment |
 |-----------------|-----------|
@@ -26,3 +28,75 @@ The data used is [labr](https://huggingface.co/datasets/labr), an Arabic book re
 | 3 | Neutral |
 | 4-5 | Positive |
 
 ---
 
 # Introduction
+ This model predicts whether the sentiment of a text is Positive, Neutral, or Negative.
+ This model is a fine-tuned version of [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) on [labr](https://huggingface.co/datasets/labr).
 
 # Data
+ The data used is [labr](https://huggingface.co/datasets/labr), an Arabic book reviews dataset.
+ The sentiment is derived from the number of stars given in each review.
 
 | Number of stars | Sentiment |
 |-----------------|-----------|
 | 3 | Neutral |
 | 4-5 | Positive |
 
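+ For illustration, the mapping in the table can be written as a small Python dictionary keyed by the raw star rating; the actual preprocessing used for training, which works on the dataset's 0-based labels, appears in the Training code section below.
+
+ ```python
+ # Star rating (1-5) to sentiment, as in the table above.
+ stars_to_sentiment = {1: "Negative", 2: "Negative", 3: "Neutral", 4: "Positive", 5: "Positive"}
+ print(stars_to_sentiment[4])  # Positive
+ ```
+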
+ # Training
+ Using the Arabic pre-trained [MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) as a base, we fine-tuned the model for a sequence classification task.
+ Training ran for 3 epochs using the Hugging Face Trainer on Google Colab; the full script is given in the Training code section below.
+ This is a proof-of-concept experiment, so the training hyper-parameters were not tuned.
+
+ # Evaluation
+ The model was evaluated on the test set of [labr](https://huggingface.co/datasets/labr), using the same preprocessing steps as in training.
+ Please note that the following results are macro averages.
+
+ | Metric | Score |
+ |-----------------|-----------|
+ | Precision | 0.663 |
+ | Recall | 0.662 |
+ | F1 | 0.66 |
+
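+ The exact evaluation script is not part of this commit; the sketch below shows one way to reproduce macro-averaged scores, assuming the `trainer` and `tokenized_datasets` objects from the Training code section further down and that scikit-learn is installed.
+
+ ```python
+ # A sketch (not the original evaluation script): macro-averaged
+ # precision/recall/F1 on the labr test split with scikit-learn.
+ import numpy as np
+ from sklearn.metrics import precision_recall_fscore_support
+
+ # `trainer` and `tokenized_datasets` come from the training code below.
+ output = trainer.predict(tokenized_datasets["test"])
+ preds = np.argmax(output.predictions, axis=-1)
+
+ precision, recall, f1, _ = precision_recall_fscore_support(
+     output.label_ids, preds, average="macro"
+ )
+ print(precision, recall, f1)
+ ```
+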
+ # Using the model
+
+ The fine-tuned model can be loaded with the `transformers` library and run as a text-classification pipeline, as sketched below.
+
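+ A minimal usage sketch with the `pipeline` API. The repository id below is a placeholder for wherever this checkpoint is published, and the Arabic review is only an illustrative input.
+
+ ```python
+ from transformers import pipeline
+
+ # Placeholder repository id: replace with the actual name of this model on the Hub.
+ classifier = pipeline("text-classification", model="<namespace>/<this-model>")
+
+ # An illustrative Arabic book review ("a wonderful and enjoyable book").
+ print(classifier("كتاب رائع وممتع"))
+ # -> a list like [{"label": ..., "score": ...}]
+ ```
+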
+ # Training code
+ Following this code, you should get the same results I got. You can run it in Google Colab; please use a GPU runtime so that training finishes quickly.
+
+ ```python
+ # Notebook only: install the required libraries.
+ !pip install transformers[torch] datasets
+
+ # Download and load the data.
+ import datasets
+ dataset = datasets.load_dataset("labr")
+
+ # Transform the ratings into sentiment labels.
+ # In labr, labels 0-4 correspond to 1-5 stars.
+ POSITIVE = "Positive"
+ NEUTRAL = "Neutral"
+ NEGATIVE = "Negative"
+ rate_to_sentiment = {0: NEGATIVE, 1: NEGATIVE, 2: NEUTRAL, 3: POSITIVE, 4: POSITIVE}
+ dataset = dataset.map(lambda example: {"sentiment": rate_to_sentiment[example["label"]]}, remove_columns=["label"])
+ dataset = dataset.rename_column("sentiment", "label")
+ class_names = [POSITIVE, NEUTRAL, NEGATIVE]
+ num_classes = len(class_names)
+ dataset = dataset.cast_column('label', datasets.ClassLabel(num_classes=num_classes, names=class_names))
+
+ # Download and load the pre-trained model and tokenizer.
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+ tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
+ model = AutoModelForSequenceClassification.from_pretrained("UBC-NLP/MARBERTv2", num_labels=3)
+
+ # Tokenize the data for training.
+ def tokenize_function(examples):
+     return tokenizer(examples["text"], truncation=True, return_length=True, return_attention_mask=True, max_length=512)
+ tokenized_datasets = dataset.map(tokenize_function, batched=False, num_proc=16)
+
+ # Define the data collator, which pads each batch to a common length.
+ from transformers import DataCollatorWithPadding
+ data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
+
+ # Define the training arguments: defaults (including 3 epochs) plus per-epoch evaluation.
+ from transformers import TrainingArguments, Trainer
+ training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
+
+ trainer = Trainer(
+     model,
+     training_args,
+     train_dataset=tokenized_datasets["train"],
+     eval_dataset=tokenized_datasets["test"],
+     data_collator=data_collator,
+     tokenizer=tokenizer,
+ )
+
+ # Train and save the final model.
+ trainer.train()
+ trainer.save_model("final_output")
+ ```
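+
+ One optional addition, not part of the original training run: passing `id2label`/`label2id` when loading the base model stores readable sentiment names in the saved config, so the checkpoint reports "Positive"/"Neutral"/"Negative" at inference time instead of generic `LABEL_i` names.
+
+ ```python
+ # Optional (not in the original script): attach readable label names to the config.
+ id2label = {i: name for i, name in enumerate(class_names)}
+ label2id = {name: i for i, name in enumerate(class_names)}
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "UBC-NLP/MARBERTv2", num_labels=num_classes, id2label=id2label, label2id=label2id
+ )
+ ```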