# PoliBERT-MY Model Card

- Model Name: PoliBERT-MY
- Base Model: BERT-base (uncased)
- Task: Multi-label, multi-class classification of Malaysian political texts
- Output: For each input text, one of four labels (unknown, negative, neutral, positive) for each of 12 topics.
## Model Overview
PoliBERT-MY is a fine-tuned BERT-base model designed to classify political documents and news articles from Malaysia. It outputs predictions on 12 distinct topics:
- Democracy
- Economy
- Race
- Leadership
- Development
- Corruption
- Political Instability
- Safety
- Administration
- Education
- Religion
- Environment
For each topic, the model assigns one of four sentiment labels: unknown, negative, neutral, or positive.
## Intended Use
- Political Analysis: Extracts topic-specific sentiment from Malaysian news articles and online comments.
- Media Monitoring: Automatically categorizes news and social media content to identify political trends and biases.
- Research: Serves as a case study for multi-label, multi-class classification in a politically sensitive domain.
## Data Sources
The training data was aggregated from multiple sources:
Data Source | N | Labeling Method |
---|---|---|
English Newspaper | 5912 | BERT (MyPoliBERT-ver03) |
English Newspaper Comments (Facebook) | 8471 | BERT (MyPoliBERT-ver03) |
Malay Newspaper | 5254 | OpenAI API (translated to English, then classified) |
Chinese Newspaper | 2480 | OpenAI API (translated to English, then classified) |
Tamil Newspaper | 1512 | OpenAI API (translated to English, then classified) |
Reddit | 20000 | BERT (MyPoliBERT-ver03) |
Manifesto BN | 98 | OpenAI API |
Manifesto PH | 180 | OpenAI API |
Manifesto PN | 15 | OpenAI API |
Synthetic Data | 4124 | OpenAI API |
- NOTE: The originally aggregated dataset drew on all of the sources above (English newspapers, Facebook comments, Malay, Chinese, and Tamil newspapers, Reddit, party manifestos, and synthetic data) and contained some noise and misclassifications; after removing the noisy entries, 47,966 clean data points were used for training.
## Labeling Method Details
### BERT-based Labeling
- Method: For the primarily English-language news articles and Facebook comments, labels were produced with an existing fine-tuned BERT classifier (see the sketch below).
- Implementation: The YagiASAFAS/MyPoliBERT-ver03 model was used to classify the texts directly.
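A minimal sketch of this direct-classification step, assuming the checkpoint exposes a standard sequence-classification head whose 48 logits are laid out topic-major (12 topics × 4 sentiments); the topic ordering in the code is illustrative:

```python
# Hedged sketch: label English texts with the YagiASAFAS/MyPoliBERT-ver03 checkpoint.
# The (12, 4) topic-major logit layout and the topic order are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

TOPICS = ["democracy", "economy", "race", "leadership", "development", "corruption",
          "instability", "safety", "administration", "education", "religion", "environment"]
LABELS = ["unknown", "negative", "neutral", "positive"]

tokenizer = AutoTokenizer.from_pretrained("YagiASAFAS/MyPoliBERT-ver03")
model = AutoModelForSequenceClassification.from_pretrained("YagiASAFAS/MyPoliBERT-ver03").eval()

def label_text(text: str) -> dict:
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # assumed shape: (1, 48)
    per_topic = logits.view(len(TOPICS), len(LABELS))
    picks = per_topic.argmax(dim=-1)               # best sentiment index per topic
    return {topic: LABELS[i] for topic, i in zip(TOPICS, picks.tolist())}
```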
### OpenAI API Labeling
- Method: For non-English news articles (Malay, Chinese, Tamil), texts were first translated into English and then labeled.
- Process:
  - Translation: A translation prompt was used to convert non-English texts into English.
  - Classification: After translation, a classification prompt was used to assign labels.
- Additional Details: Labels were generated with the OpenAI API (gpt-4o-mini) in a human-in-the-loop workflow: candidate prompts were engineered iteratively, and the prompt that produced the most accurate labels was selected, as in the sketch below.
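A minimal sketch of the translate-then-classify flow with the OpenAI Python client; the prompt texts and the sample Malay sentence are illustrative stand-ins, not the engineered prompts that were actually selected:

```python
# Hedged sketch of the two-step OpenAI labeling flow; prompts are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def translate_to_english(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Translate the user's text into English. Output only the translation."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content

def classify(text_en: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "For each of the 12 topics (democracy, economy, race, leadership, development, "
                "corruption, political instability, safety, administration, education, religion, "
                "environment) output one of: unknown, negative, neutral, positive, "
                "as a JSON object keyed by topic.")},
            {"role": "user", "content": text_en},
        ],
    )
    return resp.choices[0].message.content

# Example: a Malay sentence ("The government must tackle corruption more transparently.")
labels_json = classify(translate_to_english("Kerajaan mesti menangani rasuah dengan lebih telus."))
```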
### Synthetic Data via Data Augmentation
- Method: Synthetic data was generated to balance the dataset by augmenting underrepresented labels or sentiments.
- Implementation: The OpenAI API (again with human-in-the-loop prompt engineering) was used to generate artificial examples for topic/sentiment combinations that are absent or scarce in the original dataset, as sketched below. This synthetic data was then mixed with the original data to improve label balance.
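A sketch of what such augmentation can look like; the prompt wording and the target topic/sentiment pair are assumptions for illustration:

```python
# Hedged sketch of prompt-based augmentation for a scarce topic/sentiment pair.
from openai import OpenAI

client = OpenAI()

def generate_synthetic(topic: str, sentiment: str, n: int = 5) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (f"Write {n} short, realistic reader comments on Malaysian politics "
                        f"expressing a {sentiment} view of {topic}. One comment per line."),
        }],
    )
    return resp.choices[0].message.content.splitlines()

extra_rows = generate_synthetic("environment", "negative")  # illustrative target pair
```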
## Training Details
Hyperparameters:
- Learning Rate: 5e-05
- Train Batch Size: 16
- Eval Batch Size: 16
- Seed: 42
- Gradient Accumulation Steps: 4 (Total Train Batch Size = 16 × 4 = 64)
- Optimizer: ADAMW_TORCH (betas=(0.9, 0.999), epsilon=1e-08)
- LR Scheduler Type: Linear
- LR Warmup Steps: 500
- Number of Epochs: 5
- Mixed Precision Training: Native AMP
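For reference, these settings map onto Hugging Face `TrainingArguments` roughly as follows (`output_dir` is a placeholder, not taken from the card; the betas and epsilon shown are the `adamw_torch` defaults):

```python
# The hyperparameters above expressed as TrainingArguments; output_dir is a placeholder.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="polibert-my",           # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,      # effective train batch size: 16 * 4 = 64
    num_train_epochs=5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    optim="adamw_torch",                # AdamW with default betas=(0.9, 0.999), eps=1e-8
    seed=42,
    fp16=True,                          # native AMP mixed precision
)
```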
Label Imbalance Correction:
A correction factor was computed for each topic based on the number of non-'unknown' samples to mitigate label imbalance. The correction weight for each topic was calculated as:

`weight(topic) = (average non-'unknown' count across topics) / (non-'unknown' count for the topic)`
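A minimal sketch of this computation; the counts are placeholders, not the real dataset statistics:

```python
# Per-topic correction weights from the formula above; counts are made-up placeholders.
non_unknown_counts = {"democracy": 21000, "leadership": 34000, "environment": 6000}

avg = sum(non_unknown_counts.values()) / len(non_unknown_counts)
weights = {topic: avg / n for topic, n in non_unknown_counts.items()}
# Topics with fewer non-'unknown' samples receive weights > 1 and thus count more in the loss.
```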
## Evaluation Results
The model achieved the following results on the evaluation set:
- Loss: 0.1928
- Democracy: F1 = 0.9556, Accuracy = 0.9574
- Economy: F1 = 0.9352, Accuracy = 0.9381
- Race: F1 = 0.9569, Accuracy = 0.9580
- Leadership: F1 = 0.8411, Accuracy = 0.8457
- Development: F1 = 0.9222, Accuracy = 0.9269
- Corruption: F1 = 0.9611, Accuracy = 0.9627
- Instability: F1 = 0.9462, Accuracy = 0.9492
- Safety: F1 = 0.9213, Accuracy = 0.9258
- Administration: F1 = 0.9367, Accuracy = 0.9412
- Education: F1 = 0.9661, Accuracy = 0.9678
- Religion: F1 = 0.9590, Accuracy = 0.9598
- Environment: F1 = 0.9808, Accuracy = 0.9821
- Overall: F1 = 0.9402, Accuracy = 0.9429
## Training Results by Epoch
Training Loss | Epoch | Step | Validation Loss | Democracy F1 | Democracy Accuracy | Economy F1 | Economy Accuracy | Race F1 | Race Accuracy | Leadership F1 | Leadership Accuracy | Development F1 | Development Accuracy | Corruption F1 | Corruption Accuracy | Instability F1 | Instability Accuracy | Safety F1 | Safety Accuracy | Administration F1 | Administration Accuracy | Education F1 | Education Accuracy | Religion F1 | Religion Accuracy | Environment F1 | Environment Accuracy | Overall F1 | Overall Accuracy |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.2762 | 1.0 | 600 | 0.2618 | 0.9216 | 0.9410 | 0.8961 | 0.9121 | 0.9179 | 0.9339 | 0.7244 | 0.7770 | 0.8460 | 0.8856 | 0.9274 | 0.9416 | 0.8918 | 0.9236 | 0.8792 | 0.8998 | 0.8800 | 0.9163 | 0.9518 | 0.9588 | 0.9355 | 0.9454 | 0.9718 | 0.9757 | 0.8953 | 0.9176 |
0.2 | 2.0 | 1200 | 0.2052 | 0.9428 | 0.9518 | 0.9226 | 0.9292 | 0.9507 | 0.9542 | 0.7889 | 0.8134 | 0.8957 | 0.9128 | 0.9551 | 0.9587 | 0.9396 | 0.9465 | 0.9130 | 0.9185 | 0.9296 | 0.9375 | 0.9648 | 0.9664 | 0.9558 | 0.9577 | 0.9799 | 0.9817 | 0.9282 | 0.9357 |
0.1426 | 3.0 | 1800 | 0.1916 | 0.9538 | 0.9574 | 0.9318 | 0.9351 | 0.9564 | 0.9582 | 0.8296 | 0.8378 | 0.9163 | 0.9235 | 0.9586 | 0.9591 | 0.9468 | 0.9484 | 0.9200 | 0.9230 | 0.9331 | 0.9393 | 0.9648 | 0.9673 | 0.9582 | 0.9589 | 0.9826 | 0.9838 | 0.9377 | 0.9410 |
0.103 | 4.0 | 2400 | 0.1908 | 0.9548 | 0.9579 | 0.9348 | 0.9364 | 0.9570 | 0.9582 | 0.8368 | 0.8416 | 0.9214 | 0.9261 | 0.9615 | 0.9627 | 0.9460 | 0.9491 | 0.9209 | 0.9253 | 0.9370 | 0.9418 | 0.9675 | 0.9690 | 0.9602 | 0.9607 | 0.9809 | 0.9820 | 0.9399 | 0.9426 |
0.0838 | 4.9921 | 2995 | 0.1928 | 0.9556 | 0.9574 | 0.9352 | 0.9381 | 0.9569 | 0.9580 | 0.8411 | 0.8457 | 0.9222 | 0.9269 | 0.9611 | 0.9627 | 0.9462 | 0.9492 | 0.9213 | 0.9258 | 0.9367 | 0.9412 | 0.9661 | 0.9678 | 0.9590 | 0.9598 | 0.9808 | 0.9821 | 0.9402 | 0.9429 |
## Usage
### Inference
- Input: English text (or text translated into English)
- Output: A JSON object with 12 keys (one for each topic) containing one of the labels: unknown, negative, neutral, or positive.
- The model selects the sentiment with the highest probability for each topic.
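An end-to-end inference sketch along these lines, again assuming a topic-major (12 × 4) logit layout; the sample sentence is illustrative:

```python
# Hedged inference sketch for PoliBERT-MY: argmax sentiment per topic, emitted as JSON.
import json
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

TOPICS = ["democracy", "economy", "race", "leadership", "development", "corruption",
          "instability", "safety", "administration", "education", "religion", "environment"]
LABELS = ["unknown", "negative", "neutral", "positive"]

tokenizer = AutoTokenizer.from_pretrained("YagiASAFAS/PoliBERT-MY")
model = AutoModelForSequenceClassification.from_pretrained("YagiASAFAS/PoliBERT-MY").eval()

text = "The new budget prioritises rural schools and teacher training."  # illustrative input
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.view(len(TOPICS), len(LABELS)).softmax(dim=-1)
preds = probs.argmax(dim=-1)  # highest-probability sentiment per topic
print(json.dumps({t: LABELS[i] for t, i in zip(TOPICS, preds.tolist())}, indent=2))
```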
### Fine-Tuning with Best Hyperparameters
After hyperparameter search, update your training arguments using the best hyperparameters and reinitialize the Trainer:
```python
# After hyperparameter search:
best_run = trainer.hyperparameter_search(direction='maximize', hp_space=hp_space, n_trials=5)
print('Best hyperparameters:', best_run.hyperparameters)

# Update TrainingArguments accordingly
training_args.learning_rate = best_run.hyperparameters['learning_rate']
training_args.num_train_epochs = best_run.hyperparameters['num_train_epochs']
training_args.gradient_accumulation_steps = best_run.hyperparameters['gradient_accumulation_steps']

# Reinitialize the Trainer with the updated arguments
trainer = CustomTrainer(
    model_init=model_init,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[early_stopping_callback],
    label_weights_dict=label_weights,
)
trainer.train()
```
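The snippet above assumes an `hp_space` function is already defined for the search backend; a minimal Optuna-style sketch (with illustrative ranges, not the ones used for this model) might look like:

```python
# Hypothetical Optuna search space for trainer.hyperparameter_search; ranges are illustrative.
def hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 3, 6),
        "gradient_accumulation_steps": trial.suggest_categorical(
            "gradient_accumulation_steps", [2, 4, 8]
        ),
    }
```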