CBDC-Sentiment: A Domain-Specific BERT for CBDC-Related Sentiment Analysis

CBDC-Sentiment is a 3-class (negative / neutral / positive) sentence-level BERT classifier built for Central Bank Digital Currency (CBDC) communications. It is trained to identify overall sentiment in central-bank-style text such as consultations, speeches, reports, and reputable news.

Base Model: bilalzafar/CentralBank-BERT — CentralBank-BERT is a domain-adapted BERT base (uncased), pretrained on 66M+ tokens across 2M+ sentences from central-bank speeches published via the Bank for International Settlements (1996–2024). It is optimized for masked-token prediction within the specialized domains of monetary policy, financial regulation, and macroeconomic communication, enabling better contextual understanding of central-bank discourse and financial narratives.
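
As a quick illustration, the base model can be probed directly with masked-token prediction. A minimal sketch using the fill-mask pipeline (the example sentence is invented):

from transformers import pipeline

# Probe the domain-adapted base model with masked-token prediction
fill = pipeline("fill-mask", model="bilalzafar/CentralBank-BERT")

# An illustrative central-bank-style sentence; [MASK] is BERT's mask token
for pred in fill("The central bank raised the policy [MASK] by 25 basis points."):
    print(pred["token_str"], round(pred["score"], 4))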

Training data: The dataset consists of 2,405 custom, manually annotated sentences related to Central Bank Digital Currencies (CBDCs), extracted from BIS speeches. The class distribution is neutral: 1,068 (44.41%), positive: 1,026 (42.66%), and negative: 311 (12.93%). The data is split row-wise, stratified by label, into train: 1,924, validation: 240, and test: 241 examples.
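
The stated counts can be reproduced with a standard two-stage stratified split, as in the sketch below (the file name, column names, and seed are assumptions, not released artifacts):

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical CSV with "sentence" and "label" columns (2,405 rows)
df = pd.read_csv("cbdc_sentences.csv")

# 80% train, then the remaining 20% halved into validation and test
# (1,924 / 240 / 241 rows), stratified by label at each stage
train_df, temp_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.5, stratify=temp_df["label"], random_state=42
)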

Intended usage: Use this model to classify sentence-level sentiment in CBDC texts (reports, consultations, speeches, research notes, reputable news). It is domain-specific and not intended for generic or informal sentiment tasks.


Preprocessing & class imbalance

Sentences were lowercased (no stemming or lemmatization) and tokenized with the base tokenizer from bilalzafar/CentralBank-BERT (max_length=320, truncation, and dynamic padding via DataCollatorWithPadding). To address class imbalance, training applied Focal Loss (γ=1.0) with class weights computed from the train split (class_weight="balanced") in the loss, plus a WeightedRandomSampler with √(inverse-frequency) per-sample weights.
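
A minimal sketch of this setup, assuming PyTorch and scikit-learn (the label array, label-index mapping, and column name are illustrative placeholders):

import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.data import WeightedRandomSampler
from sklearn.utils.class_weight import compute_class_weight
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bilalzafar/CentralBank-BERT")
collator = DataCollatorWithPadding(tokenizer)  # dynamic padding per batch

def tokenize(batch):
    # Lowercasing is handled by the uncased tokenizer itself
    return tokenizer(batch["sentence"], truncation=True, max_length=320)

train_labels = np.array([0, 1, 2, 1, 2, 1, 0, 2, 1])  # placeholder (0=neg, 1=neu, 2=pos)

# "balanced" class weights from the train split
class_weights = torch.tensor(
    compute_class_weight("balanced", classes=np.arange(3), y=train_labels),
    dtype=torch.float,
)

def focal_loss(logits, labels, gamma=1.0):
    # Class-weighted focal loss: down-weights easy examples by (1 - p_t)^gamma
    log_probs = F.log_softmax(logits, dim=-1)
    pt = log_probs.gather(1, labels.unsqueeze(1)).squeeze(1).exp()
    ce = F.nll_loss(log_probs, labels, weight=class_weights, reduction="none")
    return ((1.0 - pt) ** gamma * ce).mean()

# sqrt(inverse-frequency) per-sample weights for the sampler
counts = np.bincount(train_labels, minlength=3)
sample_weights = np.sqrt(1.0 / counts[train_labels])
sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(sample_weights),
    replacement=True,
)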


Training procedure

Training used bilalzafar/CentralBank-BERT as the base model with a 3-label AutoModelForSequenceClassification head. Optimization used AdamW (the Hugging Face Trainer default) with a learning rate of 2e-5, batch size 16 (train and eval), and up to 8 epochs with early stopping (patience=2); the best checkpoint came around epoch 6. A warmup ratio of 0.06, weight decay of 0.01, and fp16 precision were applied. Runs were seeded (42) and executed on Google Colab (T4 GPU).
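
This configuration maps onto the Hugging Face Trainer roughly as follows (a sketch; the output path is illustrative, and train_ds/val_ds stand for the tokenized splits from the preprocessing step):

from transformers import (AutoModelForSequenceClassification, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "bilalzafar/CentralBank-BERT", num_labels=3
)

args = TrainingArguments(
    output_dir="cbdc-sentiment",   # illustrative output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=8,
    warmup_ratio=0.06,
    weight_decay=0.01,
    fp16=True,
    seed=42,
    eval_strategy="epoch",         # "evaluation_strategy" on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,   # required for early stopping
)

# The focal loss and weighted sampler from the preprocessing step would be
# wired in via a Trainer subclass overriding compute_loss and get_train_dataloader.
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=val_ds,
#                   data_collator=collator,
#                   callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
# trainer.train()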


Evaluation

On the validation split (~10% of data), the model achieved accuracy 0.8458, macro-F1 0.8270, and weighted-F1 0.8453. On the held-out test split (~10%), performance was accuracy 0.8216, macro-F1 0.8121, and weighted-F1 0.8216.
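
The aggregate metrics can be recomputed from model predictions with scikit-learn (the label lists below are placeholders for illustration):

from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 1, 2]  # gold labels (placeholder)
y_pred = [0, 1, 2, 2, 2]  # model predictions (placeholder)

print("accuracy   :", accuracy_score(y_true, y_pred))
print("macro-F1   :", f1_score(y_true, y_pred, average="macro"))
print("weighted-F1:", f1_score(y_true, y_pred, average="weighted"))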

Per-class (test):

| Class    | Precision | Recall | F1     | Support |
|----------|-----------|--------|--------|---------|
| negative | 0.8214    | 0.7419 | 0.7797 | 31      |
| neutral  | 0.7857    | 0.8224 | 0.8037 | 107     |
| positive | 0.8614    | 0.8447 | 0.8529 | 103     |

Note: On the entire annotated dataset (in-domain evaluation, no hold-out), the model reaches ~0.95 accuracy / weighted-F1. These should be considered upper bounds; the test split above is the main reference for generalization.


Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "bilalzafar/cbdc-sentiment"

# Load the fine-tuned tokenizer and classification model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    padding=True,
    top_k=1  # return only the top prediction
)

text = "CBDCs will revolutionize payment systems and improve financial inclusion."
print(classifier(text))
# Example output: [{'label': 'positive', 'score': 0.9789}]
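
The pipeline also accepts a list of sentences for batch scoring (the examples below are invented):

sentences = [
    "A retail CBDC could increase disintermediation risks for commercial banks.",
    "The committee has not yet taken a position on issuing a digital currency.",
]
print(classifier(sentences))  # one top-1 prediction (a one-element list) per input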