# bge-centralbank
**bge-centralbank** is a domain-adapted Sentence Transformer developed to assess semantic similarity in central bank-related texts. It is based on [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) and adapted through both unsupervised pretraining and supervised fine-tuning.
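The model can be used like any other Sentence Transformer. A minimal usage sketch (the example sentences are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

# Load the fine-tuned model from the Hugging Face Hub
model = SentenceTransformer("hugom123/bge-centralbank")

sentences = [
    "The central bank raised its policy rate by 25 basis points.",
    "Monetary tightening was announced to curb inflation.",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity between the two sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))
```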
## Training Setup
### 1. Pretraining
The model was first pretrained with TSDAE (Transformer-based Sequential Denoising Auto-Encoder) on 177,842 title–abstract pairs drawn from the [peS2o](https://huggingface.co/datasets/allenai/peS2o) dataset, a lightweight, pre-cleaned subset of the full S2ORC corpus. The domain-specific corpus was constructed by filtering on keywords relevant to macroeconomics, monetary policy, and financial markets. This step helps the model capture the language structures and terminology common in central bank literature.
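A sketch of this stage with the `sentence-transformers` TSDAE utilities is shown below. The keyword filter, hyperparameters, and checkpoint name are illustrative assumptions, not the exact training configuration:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, datasets, losses

# Placeholder corpus: the actual run used 177,842 keyword-filtered
# title-abstract texts from peS2o.
train_sentences = [
    "Monetary policy transmission under the zero lower bound ...",
    "Central bank communication and bond market volatility ...",
]

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# TSDAE: the dataset wrapper corrupts each text (token deletion, via nltk),
# and the loss trains an encoder-decoder to reconstruct the original text,
# which adapts the encoder to the domain without labels.
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, drop_last=True)
train_loss = losses.DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=1,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)
model.save("bge-centralbank-tsdae")  # checkpoint name assumed
```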
### 2. Supervised Fine-tuning
Fine-tuning was conducted on a subset of the [sentence-transformers/s2orc](https://huggingface.co/datasets/sentence-transformers/s2orc) dataset. All title–abstract pairs whose abstract contains the term "central bank" were selected, yielding 15,513 positive examples (label = 1). Each pair represents a real paper, with a matching title and abstract.
An equal number of negative examples (label = 0) was generated by randomly mismatching titles and abstracts from unrelated papers.
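A sketch of how such a labeled set can be built (the subset and column names follow the dataset card; the exact filtering and mismatching procedure is an assumption):

```python
import random
from datasets import load_dataset
from sentence_transformers import InputExample

ds = load_dataset("sentence-transformers/s2orc", "title-abstract-pair", split="train")

# Positives: real title-abstract pairs whose abstract mentions "central bank"
positives = ds.filter(lambda row: "central bank" in row["abstract"].lower())
titles, abstracts = positives["title"], positives["abstract"]

examples = [InputExample(texts=[t, a], label=1.0) for t, a in zip(titles, abstracts)]

# Negatives: pair each title with a randomly drawn abstract from another paper
rng = random.Random(42)
mismatched = abstracts[:]
rng.shuffle(mismatched)
examples += [
    InputExample(texts=[t, m], label=0.0)
    for t, m, a in zip(titles, mismatched, abstracts)
    if m != a  # skip the rare case where the shuffle keeps the true abstract
]
rng.shuffle(examples)
```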
A separate validation set of 2,000 title–abstract pairs (1,000 positive, 1,000 negative) was held out during training. The remaining 29,026 examples were used for supervised training with `CosineSimilarityLoss` in the [sentence-transformers](https://www.sbert.net) framework.
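A sketch of the fine-tuning step, continuing from the TSDAE checkpoint and the `examples` list built above (batch size, epochs, and warmup are assumptions):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("bge-centralbank-tsdae")  # TSDAE checkpoint, path assumed

train_loader = DataLoader(examples, shuffle=True, batch_size=16)

# CosineSimilarityLoss drives cosine(title, abstract) toward the pair's
# label: 1.0 for true pairs, 0.0 for mismatched ones
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=1,
    warmup_steps=100,
    show_progress_bar=True,
)
model.save("bge-centralbank")
```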
## Evaluation
The model was evaluated on the held-out validation set. Compared to the base model, bge-centralbank achieved a substantially lower average similarity score on negative pairs (0.6230 → 0.1177), showing an improved ability to distinguish semantically unrelated text. It also achieved a higher point-biserial correlation between labels and similarity scores (0.7933 → 0.9025), indicating better alignment with the binary STS labels.
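Both metrics can be recomputed with a short script; a sketch, with placeholder validation pairs standing in for the 2,000 held-out examples:

```python
import numpy as np
from scipy.stats import pointbiserialr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("hugom123/bge-centralbank")

# (title, abstract, label) triples; the real evaluation uses the held-out set
val_pairs = [
    ("Inflation targeting in small open economies", "We study how central banks ...", 1),
    ("A theory of bank runs", "This paper analyzes coral reef ecology ...", 0),
]
titles, abstracts, labels = map(list, zip(*val_pairs))

emb_t = model.encode(titles, convert_to_tensor=True, normalize_embeddings=True)
emb_a = model.encode(abstracts, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity of each title with the abstract it was paired with
sims = util.pairwise_cos_sim(emb_t, emb_a).cpu().numpy()
labels = np.array(labels)

print("mean similarity on negative pairs:", sims[labels == 0].mean())
print("point-biserial r:", pointbiserialr(labels, sims).correlation)
```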