CentralBank-BERT: Domain-Adaptive Masked Language Model for Central Bank Communication
CentralBank-BERT is a domain-adapted masked language model based on `bert-base-uncased`, pretrained on more than 66 million tokens across over 2 million sentences extracted from central bank speeches published by the Bank for International Settlements (1996–2024). It is optimized for masked token prediction in the specialized domains of monetary policy, financial regulation, and macroeconomic communication, enabling deeper contextual understanding of central banking discourse and financial narratives.
Dataset Summary
- Source: BIS Central Bank Speeches (1996–2024)
- Total Speeches: 19,609
- MLM Sentences: 2,087,615 (~2.09M)
- Total Tokens: 66,359,113 (~66.36M)
- Avg. Tokens per Sentence: 31.79
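Corpus statistics of this kind can be reproduced with a short script along the following lines; note that `speeches.csv` and its `text` column are placeholder names rather than files in this repository, and the actual preprocessing steps are documented in the notebook.

```python
# Rough sketch for reproducing the corpus statistics above.
# "speeches.csv" and its "text" column are placeholders; the actual
# preprocessing pipeline is in cb-bert-mlm.ipynb.
import pandas as pd
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
speeches = pd.read_csv("speeches.csv")

sentences = [s for text in speeches["text"] for s in sent_tokenize(str(text))]
token_counts = [len(tokenizer.tokenize(s)) for s in sentences]

print(f"Speeches:              {len(speeches):,}")
print(f"MLM sentences:         {len(sentences):,}")
print(f"Total tokens:          {sum(token_counts):,}")
print(f"Avg. tokens/sentence:  {sum(token_counts) / len(sentences):.2f}")
```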
Tokenizer
- Type: `BertTokenizerFast`
- Base Model: `bert-base-uncased`
- Vocabulary Size: 30,522
- Max Sequence Length: 128
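A minimal sketch of these tokenizer settings in use (the example sentence is illustrative):

```python
# Minimal sketch: unmodified bert-base-uncased WordPiece tokenizer,
# with sentences truncated/padded to 128 tokens for MLM pretraining.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoded = tokenizer(
    "The central bank raised its policy rate by 25 basis points.",
    truncation=True,
    padding="max_length",
    max_length=128,
    return_tensors="pt",
)
print(tokenizer.vocab_size)        # 30522
print(encoded["input_ids"].shape)  # torch.Size([1, 128])
```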
Model Configuration
- Architecture: `BertForMaskedLM`
- Initialized From: `bert-base-uncased`
- Total Parameters: 109,514,298 (~109.5M)
- Trainable Parameters: 109,514,298
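Since the model is a standard `BertForMaskedLM` initialized from `bert-base-uncased` with no layers frozen, the parameter counts can be verified directly:

```python
# Sketch: verify parameter counts for a BertForMaskedLM initialized
# from bert-base-uncased (no layers frozen).
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters:     {total:,}")      # 109,514,298
print(f"Trainable parameters: {trainable:,}")  # 109,514,298
```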
Training Details
- Epochs: 1
- Batch Size (per device): 16
- Gradient Accumulation Steps: 2
- Effective Batch Size: 32
- MLM Probability: 15%
- Device: NVIDIA Tesla P100 (Kaggle)
- Mixed Precision (fp16): Yes
- Training Duration: ~8 hrs 18 mins
  - Start: 2025-07-19 17:17
  - End: 2025-07-20 01:35
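A sketch of the training setup implied by these settings, assuming a pre-tokenized dataset of BIS sentences (`train_dataset` below is a placeholder; the exact pipeline is in the notebook):

```python
# Sketch of the MLM pretraining setup described above; train_dataset is a
# placeholder for the pre-tokenized BIS sentence dataset.
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="cb-bert-mlm",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,  # effective batch size of 32
    fp16=True,                      # mixed precision on the Tesla P100
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,    # placeholder: pre-tokenized BIS sentences
    data_collator=collator,
)
trainer.train()
```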
Evaluation Results
Perplexity
| Model | Perplexity |
|---|---|
| bert-base | 13.06 |
| cb-bert-mlm | 4.66 |
The substantially lower perplexity of `cb-bert-mlm` indicates a much better fit to domain-specific central bank language than the general-purpose baseline.
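Perplexity here follows the standard definition for masked language models: the exponential of the mean cross-entropy loss over masked tokens on a held-out set. A sketch, reusing the `Trainer` from the training sketch above (`eval_dataset` is a placeholder):

```python
# Sketch: MLM perplexity = exp(mean masked-token cross-entropy loss).
import math

metrics = trainer.evaluate(eval_dataset=eval_dataset)  # placeholder held-out split
print(f"Perplexity: {math.exp(metrics['eval_loss']):.2f}")
```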
Manual Masked Sentence Evaluation
Manual evaluation on 20 BIS-style sentences showed a strong match rate, with most mismatches being acceptable financial synonyms, demonstrating the model's contextual understanding of domain-specific language.
Top-K Accuracy
The model achieved over 90% Top-20 accuracy, indicating robust masked-token recovery in financial text. For full results and the accuracy curve, refer to the notebook `cb-bert-mlm.ipynb`.
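A sketch of how such a Top-K check can be computed; `masked_examples` is a placeholder list of (masked sentence, original token) pairs, and the model path is a placeholder for the downloaded repository:

```python
# Sketch of a Top-K accuracy check: mask one token per sentence and test
# whether the original token is among the model's K most likely predictions.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

model_path = "path/to/cb-bert-mlm"  # placeholder: local directory or Hub repo id
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = BertForMaskedLM.from_pretrained(model_path).eval()

def top_k_hit(masked_sentence: str, original_token: str, k: int = 20) -> bool:
    inputs = tokenizer(masked_sentence, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits
    top_ids = logits[0, mask_pos].topk(k).indices.tolist()
    return tokenizer.convert_tokens_to_ids(original_token) in top_ids

# masked_examples: placeholder list of (sentence containing [MASK], original token)
hits = [top_k_hit(sentence, token) for sentence, token in masked_examples]
print(f"Top-20 accuracy: {sum(hits) / len(hits):.1%}")
```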
Notebook: Training, Evaluation & Results
The full training pipeline, including data preprocessing, tokenizer setup, model training, evaluation, and result visualizations, is documented in the notebook `cb-bert-mlm.ipynb`. The notebook includes actual outputs from the training run, perplexity comparisons, manual masked-sentence evaluations, and Top-K accuracy analysis, ensuring full transparency and reproducibility of the model development process.
Model Files
- `model.safetensors`: Trained model weights
- `config.json`: Model architecture and hyperparameters
- `tokenizer.json`: Serialized tokenizer
- `vocab.txt`: Vocabulary file
- `tokenizer_config.json`: Tokenizer configuration
- `special_tokens_map.json`: Special tokens mapping
- `training_args.bin`: Training arguments used during pretraining
This model repository includes all essential files required to load, evaluate, or fine-tune the `cb-bert-mlm` model using Hugging Face's `transformers` library. These components ensure full compatibility with the original training environment and support seamless deployment or transfer learning.
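A minimal usage sketch with the `fill-mask` pipeline; the model path and the example sentence are illustrative:

```python
# Minimal usage sketch; replace the path with the local model directory
# or the Hub repo id under which the files above are published.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="path/to/cb-bert-mlm")
for pred in fill_mask("The central bank decided to raise the policy [MASK] by 25 basis points."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```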