# bge-centralbank
**bge-centralbank** is a domain-adapted Sentence Transformer developed to assess semantic similarity in central bank-related texts. It is based on [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) and adapted through both unsupervised pretraining and supervised fine-tuning.
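The model can be used like any other Sentence Transformer. A minimal usage sketch (the example sentences are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

# Load the fine-tuned model from the Hugging Face Hub
model = SentenceTransformer("hugom123/bge-centralbank")

sentences = [
    "The central bank raised its policy rate by 25 basis points.",
    "Monetary tightening was announced to curb inflation.",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity between the two sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))
```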
## Training Setup
### 1. Pretraining
The model was first pretrained with TSDAE (Transformer-based Sequential Denoising Auto-Encoder) on 177,842 title–abstract pairs drawn from the [peS2o](https://huggingface.co/datasets/allenai/peS2o) dataset, a lightweight, pre-cleaned subset of the full S2ORC corpus. The domain-specific corpus was constructed by filtering on keywords relevant to macroeconomics, monetary policy, and financial markets. This step helps the model capture the language structures and terminology common in central bank literature.
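A sketch of this stage with the `sentence-transformers` TSDAE utilities is shown below. The keyword filter, hyperparameters, and checkpoint name are illustrative assumptions, not the exact training configuration:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, datasets, losses

# Placeholder corpus: the actual run used 177,842 keyword-filtered
# title-abstract texts from peS2o.
train_sentences = [
    "Monetary policy transmission under the zero lower bound ...",
    "Central bank communication and bond market volatility ...",
]

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# TSDAE: the dataset wrapper corrupts each text (token deletion, via nltk),
# and the loss trains an encoder-decoder to reconstruct the original text,
# which adapts the encoder to the domain without labels.
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, drop_last=True)
train_loss = losses.DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=1,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)
model.save("bge-centralbank-tsdae")  # checkpoint name assumed
```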
### 2. Supervised Fine-tuning
Fine-tuning was conducted on a subset of the [sentence-transformers/s2orc](https://huggingface.co/datasets/sentence-transformers/s2orc) dataset. All title–abstract pairs whose abstract contains the term "central bank" were selected, yielding 15,513 positive examples (label = 1). Each pair represents a real paper, with a matching title and abstract.
An equal number of negative examples (label = 0) was generated by randomly mismatching titles and abstracts from unrelated papers.
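A sketch of how such a labeled set can be built (the subset and column names follow the dataset card; the exact filtering and mismatching procedure is an assumption):

```python
import random
from datasets import load_dataset
from sentence_transformers import InputExample

ds = load_dataset("sentence-transformers/s2orc", "title-abstract-pair", split="train")

# Positives: real title-abstract pairs whose abstract mentions "central bank"
positives = ds.filter(lambda row: "central bank" in row["abstract"].lower())
titles, abstracts = positives["title"], positives["abstract"]

examples = [InputExample(texts=[t, a], label=1.0) for t, a in zip(titles, abstracts)]

# Negatives: pair each title with a randomly drawn abstract from another paper
rng = random.Random(42)
mismatched = abstracts[:]
rng.shuffle(mismatched)
examples += [
    InputExample(texts=[t, m], label=0.0)
    for t, m, a in zip(titles, mismatched, abstracts)
    if m != a  # skip the rare case where the shuffle keeps the true abstract
]
rng.shuffle(examples)
```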
A separate validation set of 2,000 title–abstract pairs (1,000 positive, 1,000 negative) was held out during training. The remaining 29,026 examples were used for supervised training with `CosineSimilarityLoss` in the [sentence-transformers](https://www.sbert.net) framework.
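A sketch of the fine-tuning step, continuing from the TSDAE checkpoint and the `examples` list built above (batch size, epochs, and warmup are assumptions):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("bge-centralbank-tsdae")  # TSDAE checkpoint, path assumed

train_loader = DataLoader(examples, shuffle=True, batch_size=16)

# CosineSimilarityLoss drives cosine(title, abstract) toward the pair's
# label: 1.0 for true pairs, 0.0 for mismatched ones
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=1,
    warmup_steps=100,
    show_progress_bar=True,
)
model.save("bge-centralbank")
```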
## Evaluation
The model was evaluated on the held-out validation set. Compared to the base model, bge-centralbank achieved a substantially lower average similarity score on negative pairs (0.6230 → 0.1177), showing an improved ability to distinguish semantically unrelated text. It also achieved a higher point-biserial correlation between labels and similarity scores (0.7933 → 0.9025), indicating better alignment with the binary STS labels.
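Both metrics can be recomputed with a short script; a sketch, with placeholder validation pairs standing in for the 2,000 held-out examples:

```python
import numpy as np
from scipy.stats import pointbiserialr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("hugom123/bge-centralbank")

# (title, abstract, label) triples; the real evaluation uses the held-out set
val_pairs = [
    ("Inflation targeting in small open economies", "We study how central banks ...", 1),
    ("A theory of bank runs", "This paper analyzes coral reef ecology ...", 0),
]
titles, abstracts, labels = map(list, zip(*val_pairs))

emb_t = model.encode(titles, convert_to_tensor=True, normalize_embeddings=True)
emb_a = model.encode(abstracts, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity of each title with the abstract it was paired with
sims = util.pairwise_cos_sim(emb_t, emb_a).cpu().numpy()
labels = np.array(labels)

print("mean similarity on negative pairs:", sims[labels == 0].mean())
print("point-biserial r:", pointbiserialr(labels, sims).correlation)
```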