CentralBank-BERT: Domain-Adaptive Masked Language Model for Central Bank Communication

CentralBank-BERT is a domain-adapted masked language model initialized from bert-base-uncased and further pretrained on more than 66 million tokens across over 2 million sentences extracted from central bank speeches published by the Bank for International Settlements (1996–2024).

This model is specifically optimized for masked token prediction within the highly specialized domains of monetary policy, financial regulation, and macroeconomic communication, enabling deeper contextual understanding of central banking discourse and financial narratives.
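
For quick experimentation, the model can be used with the standard `fill-mask` pipeline from `transformers`. The sketch below is illustrative; the example sentence is not taken from the evaluation set.

```python
from transformers import pipeline

# Load the domain-adapted MLM as a fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="bilalzafar/CentralBank-BERT")

# Illustrative central-bank-style sentence.
sentence = "The central bank raised the policy [MASK] by 25 basis points."

for pred in fill_mask(sentence, top_k=5):
    print(f"{pred['token_str']:>12}  score={pred['score']:.4f}")
```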


Dataset Summary

  • Source: BIS Central Bank Speeches (1996–2024)
  • Total Speeches: 19,609
  • MLM Sentences: 2,087,615 (~2.09M)
  • Total Tokens: 66,359,113 (~66.36M)
  • Avg. Tokens per Sentence: 31.79

Tokenizer

  • Type: BertTokenizerFast
  • Base Model: bert-base-uncased
  • Vocabulary Size: 30,522
  • Max Sequence Length: 128
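
A minimal tokenization sketch, assuming the tokenizer is loaded from this repository; it reflects the settings listed above (30,522-entry WordPiece vocabulary, truncation at 128 tokens).

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bilalzafar/CentralBank-BERT")
print(tokenizer.vocab_size)  # 30522

# Encode a sentence with the same maximum length used during pretraining.
enc = tokenizer(
    "Inflation expectations remain well anchored over the medium term.",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print(enc["input_ids"].shape)
```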

Model Configuration

  • Architecture: BertForMaskedLM
  • Initialized From: bert-base-uncased
  • Total Parameters: 109,514,298 (~109.5M)
  • Trainable Parameters: 109,514,298
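
The parameter counts above can be verified after loading the model; a short sketch:

```python
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bilalzafar/CentralBank-BERT")

# Should match the ~109.5M figure reported above.
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total={total:,} trainable={trainable:,}")
```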

Training Details

  • Epochs: 1
  • Batch Size (per device): 16
  • Gradient Accumulation Steps: 2
  • Effective Batch Size: 32
  • MLM Probability: 15%
  • Device: NVIDIA Tesla P100 (Kaggle)
  • Mixed Precision (fp16): Yes
  • Training Duration: ~8 hrs 18 mins
  • Start: 2025-07-19 17:17
  • End: 2025-07-20 01:35
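
A condensed training sketch using the hyperparameters listed above. Data loading and sentence splitting are omitted (see the notebook for the full pipeline); `train_dataset` is a placeholder for the tokenized BIS sentences.

```python
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Dynamic masking with the 15% MLM probability used during pretraining.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="cb-bert-mlm",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,   # effective batch size 32
    fp16=True,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,     # tokenized BIS sentences (placeholder)
)
trainer.train()
```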

Evaluation Results

Perplexity

| Model       | Perplexity |
|-------------|------------|
| bert-base   | 13.06      |
| cb-bert-mlm | 4.66       |

Lower perplexity indicates a better fit to domain-specific central bank language.
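
Perplexity here is the exponential of the mean masked-LM cross-entropy loss on a held-out split. A minimal sketch, assuming `trainer` is configured as in the training sketch above and `eval_dataset` is a tokenized held-out split:

```python
import math

# Perplexity = exp(mean masked-LM cross-entropy loss) on the evaluation split.
metrics = trainer.evaluate(eval_dataset=eval_dataset)
print(f"Perplexity: {math.exp(metrics['eval_loss']):.2f}")
```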

Manual Masked Sentence Evaluation

Manual evaluation on 20 BIS-style sentences showed a strong match rate, with most mismatches being acceptable financial synonyms—demonstrating the model's contextual understanding in domain-specific language.

Top-K Accuracy

The model achieved over 90% Top-20 accuracy, indicating robust masked token recovery in financial text. For full results and the accuracy curve, refer to the notebook cb-bert-mlm.ipynb.
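
A sketch of the Top-K evaluation idea (not the notebook's exact code): mask a target word, take the model's top-k candidate tokens, and check whether the original token is recovered. This simple version assumes the target word is a single WordPiece token.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bilalzafar/CentralBank-BERT")
model = BertForMaskedLM.from_pretrained("bilalzafar/CentralBank-BERT").eval()

def top_k_hit(sentence: str, target: str, k: int = 20) -> bool:
    """Return True if the masked target word is among the top-k predictions."""
    text = sentence.replace(target, tokenizer.mask_token, 1)
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    top_ids = logits.topk(k).indices.tolist()
    return tokenizer.convert_tokens_to_ids(target) in top_ids

# Illustrative check on a single sentence/word pair.
print(top_k_hit("The committee decided to raise interest rates.", "rates", k=20))
```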


Notebook: Training, Evaluation & Results

The full training pipeline, including data preprocessing, tokenizer setup, model training, evaluation, and result visualizations, is documented in the notebook cb-bert-mlm.ipynb. This notebook includes actual outputs from the training run, perplexity comparisons, manual masked sentence evaluations, and Top-K accuracy analysis—ensuring full transparency and reproducibility of the model development process.


Model Files

  • model.safetensors: Trained model weights
  • config.json: Model architecture and hyperparameters
  • tokenizer.json: Serialized tokenizer
  • vocab.txt: Vocabulary file
  • tokenizer_config.json: Tokenizer configuration
  • special_tokens_map.json: Special tokens mapping
  • training_args.bin: Training arguments used during pretraining

This model repository includes all essential files required to load, evaluate, or fine-tune the cb-bert-mlm model using Hugging Face's transformers library. These components are necessary to ensure full compatibility with the original training environment and to support seamless deployment or transfer learning.
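
A minimal loading sketch showing how these files are consumed by `transformers`; `./CentralBank-BERT` is a placeholder for a local clone of this repository (the hub id `bilalzafar/CentralBank-BERT` works interchangeably).

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Uses tokenizer.json, vocab.txt, tokenizer_config.json, special_tokens_map.json
tokenizer = AutoTokenizer.from_pretrained("./CentralBank-BERT")

# Uses config.json and model.safetensors
model = AutoModelForMaskedLM.from_pretrained("./CentralBank-BERT")
```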

