JobBGE-small-en-v1.5: Job-to-Job Matching with BAAI/bge-small-en-v1.5

Top-performing model on TalentCLEF 2025 Task A. Use it for multilingual job title matching across English, Spanish, German, and Chinese.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-small-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
  • Training Datasets:
    • full_en
    • full_de
    • full_es
    • full_zh
    • mix

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
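
The architecture above is a BertModel followed by CLS-token pooling and L2 normalization. As a sanity check, a minimal sketch with the plain transformers library should roughly reproduce the same embeddings (it assumes the checkpoint is hosted as pj-mathematician/JobBGE-small-en-v1.5, the repository this card belongs to):

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

model_id = "pj-mathematician/JobBGE-small-en-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

titles = ["data scientist", "machine learning engineer"]
batch = tokenizer(titles, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# CLS pooling: take the first token's hidden state, then L2-normalize,
# mirroring the Pooling(cls) + Normalize() modules shown above.
embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
print(embeddings.shape)  # torch.Size([2, 384])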

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("pj-mathematician/JobBGE-small-en-v1.5")
# Run inference
sentences = [
    'Volksvertreter',     # elected representative
    'Parlamentarier',     # member of parliament
    'Oberbürgermeister',  # lord mayor
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
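
In a job-matching setting you typically embed one query title and a pool of candidate titles, then rank the candidates by cosine similarity. A minimal sketch (the query and candidate titles below are made up for illustration):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("pj-mathematician/JobBGE-small-en-v1.5")

query = "software developer"
candidates = ["software engineer", "desarrollador de software", "Softwareentwickler", "nurse", "软件工程师"]

query_emb = model.encode([query])
candidate_embs = model.encode(candidates)

# Cosine similarities between the query and every candidate title
scores = model.similarity(query_emb, candidate_embs)[0]
for title, score in sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {title}")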

Evaluation

Metrics

Information Retrieval

| Metric               | full_en | full_es | full_de | full_zh | mix_es | mix_de | mix_zh |
|:---------------------|--------:|--------:|--------:|--------:|-------:|-------:|-------:|
| cosine_accuracy@1    | 0.6571  | 0.1243  | 0.2956  | 0.3495  | 0.4113 | 0.2943 | 0.0971 |
| cosine_accuracy@20   | 0.9905  | 1.0     | 0.9212  | 0.7379  | 0.7613 | 0.65   | 0.3586 |
| cosine_accuracy@50   | 0.9905  | 1.0     | 0.9655  | 0.8252  | 0.8523 | 0.7608 | 0.4901 |
| cosine_accuracy@100  | 0.9905  | 1.0     | 0.9754  | 0.8544  | 0.9121 | 0.8508 | 0.6002 |
| cosine_accuracy@150  | 0.9905  | 1.0     | 0.9852  | 0.9029  | 0.9418 | 0.8898 | 0.6613 |
| cosine_accuracy@200  | 0.9905  | 1.0     | 0.9852  | 0.9417  | 0.9548 | 0.9204 | 0.7062 |
| cosine_precision@1   | 0.6571  | 0.1243  | 0.2956  | 0.3495  | 0.4113 | 0.2943 | 0.0971 |
| cosine_precision@20  | 0.5024  | 0.4897  | 0.4246  | 0.1733  | 0.0892 | 0.0731 | 0.0314 |
| cosine_precision@50  | 0.308   | 0.3179  | 0.2814  | 0.0944  | 0.0418 | 0.0361 | 0.0185 |
| cosine_precision@100 | 0.1863  | 0.1986  | 0.1801  | 0.0589  | 0.0229 | 0.0206 | 0.0116 |
| cosine_precision@150 | 0.1322  | 0.1469  | 0.1362  | 0.0458  | 0.0159 | 0.0147 | 0.0087 |
| cosine_precision@200 | 0.103   | 0.1179  | 0.1105  | 0.0385  | 0.0122 | 0.0116 | 0.0071 |
| cosine_recall@1      | 0.068   | 0.0031  | 0.0111  | 0.0273  | 0.1565 | 0.1109 | 0.0329 |
| cosine_recall@20     | 0.5385  | 0.3221  | 0.2614  | 0.1766  | 0.6594 | 0.5344 | 0.2091 |
| cosine_recall@50     | 0.726   | 0.4638  | 0.3835  | 0.2393  | 0.7705 | 0.6585 | 0.3054 |
| cosine_recall@100    | 0.8329  | 0.5438  | 0.4677  | 0.2863  | 0.8472 | 0.7525 | 0.3835 |
| cosine_recall@150    | 0.8745  | 0.5825  | 0.5183  | 0.3287  | 0.8825 | 0.8026 | 0.4309 |
| cosine_recall@200    | 0.9057  | 0.6147  | 0.5517  | 0.3631  | 0.9051 | 0.8418 | 0.4715 |
| cosine_ndcg@1        | 0.6571  | 0.1243  | 0.2956  | 0.3495  | 0.4113 | 0.2943 | 0.0971 |
| cosine_ndcg@20       | 0.6845  | 0.5385  | 0.4601  | 0.2468  | 0.5117 | 0.3919 | 0.1385 |
| cosine_ndcg@50       | 0.704   | 0.5012  | 0.4229  | 0.2394  | 0.542  | 0.4256 | 0.1656 |
| cosine_ndcg@100      | 0.7589  | 0.5147  | 0.4371  | 0.2619  | 0.5588 | 0.4462 | 0.1835 |
| cosine_ndcg@150      | 0.7774  | 0.5348  | 0.4629  | 0.2787  | 0.5656 | 0.4561 | 0.1931 |
| cosine_ndcg@200      | 0.7893  | 0.5505  | 0.4797  | 0.2919  | 0.5697 | 0.4632 | 0.2007 |
| cosine_mrr@1         | 0.6571  | 0.1243  | 0.2956  | 0.3495  | 0.4113 | 0.2943 | 0.0971 |
| cosine_mrr@20        | 0.8103  | 0.5515  | 0.4896  | 0.4485  | 0.4979 | 0.3779 | 0.1522 |
| cosine_mrr@50        | 0.8103  | 0.5515  | 0.4909  | 0.4515  | 0.501  | 0.3815 | 0.1564 |
| cosine_mrr@100       | 0.8103  | 0.5515  | 0.4911  | 0.4519  | 0.5018 | 0.3827 | 0.158  |
| cosine_mrr@150       | 0.8103  | 0.5515  | 0.4912  | 0.4523  | 0.5021 | 0.3831 | 0.1585 |
| cosine_mrr@200       | 0.8103  | 0.5515  | 0.4912  | 0.4525  | 0.5021 | 0.3832 | 0.1588 |
| cosine_map@1         | 0.6571  | 0.1243  | 0.2956  | 0.3495  | 0.4113 | 0.2943 | 0.0971 |
| cosine_map@20        | 0.5418  | 0.4028  | 0.3236  | 0.147   | 0.4264 | 0.3097 | 0.0875 |
| cosine_map@50        | 0.5327  | 0.3422  | 0.2644  | 0.1267  | 0.4338 | 0.3174 | 0.093  |
| cosine_map@100       | 0.5657  | 0.3395  | 0.2576  | 0.1326  | 0.436  | 0.3199 | 0.095  |
| cosine_map@150       | 0.5734  | 0.3478  | 0.2669  | 0.1352  | 0.4366 | 0.3207 | 0.0957 |
| cosine_map@200       | 0.5772  | 0.3534  | 0.2722  | 0.1368  | 0.4368 | 0.3212 | 0.0961 |
| cosine_map@500       | 0.5814  | 0.3631  | 0.2833  | 0.1407  | 0.4373 | 0.3219 | 0.0971 |
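
These numbers come from Sentence Transformers' information-retrieval evaluation over per-language query/corpus splits. A minimal sketch of how such metrics can be computed with InformationRetrievalEvaluator (the toy queries, corpus, and relevance judgments below are placeholders, not the TalentCLEF data; the card reports k values up to 200, plus map@500, on the full corpora):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("pj-mathematician/JobBGE-small-en-v1.5")

# Toy data: query id -> job title, corpus id -> job title,
# and query id -> set of relevant corpus ids.
queries = {"q1": "software developer"}
corpus = {"c1": "software engineer", "c2": "nurse"}
relevant_docs = {"q1": {"c1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    accuracy_at_k=[1, 2],
    precision_recall_at_k=[1, 2],
    ndcg_at_k=[1, 2],
    mrr_at_k=[1, 2],
    map_at_k=[1, 2],
    name="toy_full_en",
)
results = evaluator(model)
print(results)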

Training Details

Training Datasets

full_en

  • Dataset: full_en
  • Size: 28,880 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    |         | anchor                                          | positive                                         |
    |:--------|:------------------------------------------------|:-------------------------------------------------|
    | type    | string                                          | string                                           |
    | details | min: 3 tokens, mean: 5.0 tokens, max: 10 tokens | min: 3 tokens, mean: 5.01 tokens, max: 13 tokens |
  • Samples:
    | anchor                      | positive                     |
    |:----------------------------|:-----------------------------|
    | air commodore               | flight lieutenant            |
    | command and control officer | flight officer               |
    | air commodore               | command and control officer  |
  • Loss: GISTEmbedLoss with these parameters:
    {'guide': SentenceTransformer(
      (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
      (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
      (2): Normalize()
    ), 'temperature': 0.01, 'margin_strategy': 'absolute', 'margin': 0.0}
    
full_de

  • Dataset: full_de
  • Size: 23,023 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    |         | anchor                                            | positive                                           |
    |:--------|:--------------------------------------------------|:---------------------------------------------------|
    | type    | string                                            | string                                             |
    | details | min: 3 tokens, mean: 11.05 tokens, max: 45 tokens | min: 3 tokens, mean: 11.43 tokens, max: 45 tokens  |
  • Samples:
    | anchor               | positive                                 |
    |:---------------------|:-----------------------------------------|
    | Staffelkommandantin  | Kommodore                                |
    | Luftwaffenoffizierin | Luftwaffenoffizier/Luftwaffenoffizierin  |
    | Staffelkommandantin  | Luftwaffenoffizierin                     |
  • Loss: GISTEmbedLoss with these parameters:
    {'guide': SentenceTransformer(
      (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
      (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
      (2): Normalize()
    ), 'temperature': 0.01, 'margin_strategy': 'absolute', 'margin': 0.0}
    
full_es

  • Dataset: full_es
  • Size: 20,724 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    |         | anchor                                            | positive                                           |
    |:--------|:--------------------------------------------------|:---------------------------------------------------|
    | type    | string                                            | string                                             |
    | details | min: 3 tokens, mean: 12.95 tokens, max: 50 tokens | min: 3 tokens, mean: 12.57 tokens, max: 50 tokens  |
  • Samples:
    | anchor                 | positive                      |
    |:-----------------------|:------------------------------|
    | jefe de escuadrón      | instructor                    |
    | comandante de aeronave | instructor de simulador       |
    | instructor             | oficial del Ejército del Aire |
  • Loss: GISTEmbedLoss with these parameters:
    {'guide': SentenceTransformer(
      (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
      (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
      (2): Normalize()
    ), 'temperature': 0.01, 'margin_strategy': 'absolute', 'margin': 0.0}
    
full_zh

  • Dataset: full_zh
  • Size: 30,401 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    |         | anchor                                           | positive                                         |
    |:--------|:-------------------------------------------------|:-------------------------------------------------|
    | type    | string                                           | string                                           |
    | details | min: 4 tokens, mean: 8.36 tokens, max: 20 tokens | min: 4 tokens, mean: 8.95 tokens, max: 27 tokens |
  • Samples:
    | anchor   | positive       |
    |:---------|:---------------|
    | 技术总监 | 技术和运营总监 |
    | 技术总监 | 技术主管       |
    | 技术总监 | 技术艺术总监   |
  • Loss: GISTEmbedLoss with these parameters:
    {'guide': SentenceTransformer(
      (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
      (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
      (2): Normalize()
    ), 'temperature': 0.01, 'margin_strategy': 'absolute', 'margin': 0.0}
    
mix

  • Dataset: mix
  • Size: 21,760 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    |         | anchor                                           | positive                                           |
    |:--------|:-------------------------------------------------|:----------------------------------------------------|
    | type    | string                                           | string                                             |
    | details | min: 2 tokens, mean: 5.65 tokens, max: 14 tokens | min: 2 tokens, mean: 10.08 tokens, max: 30 tokens  |
  • Samples:
    | anchor                       | positive                                            |
    |:-----------------------------|:----------------------------------------------------|
    | technical manager            | Technischer Direktor für Bühne, Film und Fernsehen  |
    | head of technical            | directora técnica                                   |
    | head of technical department | 技术艺术总监                                        |
  • Loss: GISTEmbedLoss with these parameters:
    {'guide': SentenceTransformer(
      (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
      (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
      (2): Normalize()
    ), 'temperature': 0.01, 'margin_strategy': 'absolute', 'margin': 0.0}
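
All five datasets are trained with GISTEmbedLoss, which uses a frozen guide model to down-weight in-batch negatives that are actually similar to the anchor. A minimal sketch of the loss construction (the card does not name the guide checkpoint; paraphrase-multilingual-MiniLM-L12-v2 is used here only as a placeholder matching the printed 384-dimensional, mean-pooling, 128-token guide architecture):

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import GISTEmbedLoss

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
# Placeholder guide model: the card only shows the guide's architecture, not its name.
guide = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Parameters mirror the loss dump above; margin_strategy/margin require a recent
# sentence-transformers release (the card lists v4.1.0).
loss = GISTEmbedLoss(
    model=model,
    guide=guide,
    temperature=0.01,
    margin_strategy="absolute",
    margin=0.0,
)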
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • gradient_accumulation_steps: 2
  • num_train_epochs: 5
  • warmup_ratio: 0.05
  • log_on_each_node: False
  • fp16: True
  • dataloader_num_workers: 4
  • ddp_find_unused_parameters: True
  • batch_sampler: no_duplicates

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 2
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.05
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: False
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: True
  • dataloader_num_workers: 4
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: True
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
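
As a rough sketch, the hyperparameters above map onto the Sentence Transformers v3+ trainer API as follows (the datasets, guide model, and output path are placeholders rather than the exact training script used for this model):

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import GISTEmbedLoss
from sentence_transformers.training_args import BatchSamplers, MultiDatasetBatchSamplers

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
guide = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")  # placeholder guide

# Tiny placeholder datasets standing in for full_en/full_de/full_es/full_zh/mix.
train_datasets = {
    "full_en": Dataset.from_dict({"anchor": ["air commodore", "command and control officer"],
                                  "positive": ["flight lieutenant", "flight officer"]}),
    "mix": Dataset.from_dict({"anchor": ["technical manager", "head of technical"],
                              "positive": ["Technischer Direktor für Bühne, Film und Fernsehen", "directora técnica"]}),
}
loss = GISTEmbedLoss(model=model, guide=guide, temperature=0.01)
losses = {name: loss for name in train_datasets}

# Subset of the listed hyperparameters; the real run also used eval_strategy="steps"
# with per-language IR evaluators, fp16, and DDP settings.
args = SentenceTransformerTrainingArguments(
    output_dir="output/jobbge-small-en-v1.5",  # placeholder path
    num_train_epochs=5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    gradient_accumulation_steps=2,
    warmup_ratio=0.05,
    dataloader_num_workers=4,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.PROPORTIONAL,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_datasets,
    loss=losses,
)
trainer.train()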

Training Logs

| Epoch  | Step | Training Loss | full_en_cosine_ndcg@200 | full_es_cosine_ndcg@200 | full_de_cosine_ndcg@200 | full_zh_cosine_ndcg@200 | mix_es_cosine_ndcg@200 | mix_de_cosine_ndcg@200 | mix_zh_cosine_ndcg@200 |
|:-------|-----:|--------------:|------------------------:|------------------------:|------------------------:|------------------------:|-----------------------:|-----------------------:|-----------------------:|
| -1     | -1   | -             | 0.7322 | 0.4690 | 0.3853 | 0.2723 | 0.3209 | 0.2244 | 0.0919 |
| 0.0021 | 1    | 23.8878       | -      | -      | -      | -      | -      | -      | -      |
| 0.2058 | 100  | 7.2098        | -      | -      | -      | -      | -      | -      | -      |
| 0.4115 | 200  | 4.2635        | 0.7800 | 0.5132 | 0.4268 | 0.2798 | 0.4372 | 0.2996 | 0.1447 |
| 0.6173 | 300  | 4.1931        | -      | -      | -      | -      | -      | -      | -      |
| 0.8230 | 400  | 3.73          | 0.7863 | 0.5274 | 0.4451 | 0.2805 | 0.4762 | 0.3455 | 0.1648 |
| 1.0309 | 500  | 3.3569        | -      | -      | -      | -      | -      | -      | -      |
| 1.2366 | 600  | 3.6464        | 0.7868 | 0.5372 | 0.4540 | 0.2813 | 0.5063 | 0.3794 | 0.1755 |
| 1.4424 | 700  | 3.0772        | -      | -      | -      | -      | -      | -      | -      |
| 1.6481 | 800  | 3.114         | 0.7906 | 0.5391 | 0.4576 | 0.2832 | 0.5221 | 0.4047 | 0.1779 |
| 1.8539 | 900  | 2.9246        | -      | -      | -      | -      | -      | -      | -      |
| 2.0617 | 1000 | 2.7479        | 0.7873 | 0.5423 | 0.4631 | 0.2871 | 0.5323 | 0.4143 | 0.1843 |
| 2.2675 | 1100 | 3.049         | -      | -      | -      | -      | -      | -      | -      |
| 2.4733 | 1200 | 2.6137        | 0.7878 | 0.5418 | 0.4685 | 0.2870 | 0.5470 | 0.4339 | 0.1932 |
| 2.6790 | 1300 | 2.8607        | -      | -      | -      | -      | -      | -      | -      |
| 2.8848 | 1400 | 2.7071        | 0.7889 | 0.5465 | 0.4714 | 0.2891 | 0.5504 | 0.4362 | 0.1944 |
| 3.0926 | 1500 | 2.7012        | -      | -      | -      | -      | -      | -      | -      |
| 3.2984 | 1600 | 2.7423        | 0.7882 | 0.5471 | 0.4748 | 0.2868 | 0.5542 | 0.4454 | 0.1976 |
| 3.5041 | 1700 | 2.5316        | -      | -      | -      | -      | -      | -      | -      |
| 3.7099 | 1800 | 2.6344        | 0.7900 | 0.5498 | 0.4763 | 0.2857 | 0.5639 | 0.4552 | 0.1954 |
| 3.9156 | 1900 | 2.4983        | -      | -      | -      | -      | -      | -      | -      |
| 4.1235 | 2000 | 2.5423        | 0.7894 | 0.5499 | 0.4786 | 0.2870 | 0.5644 | 0.4576 | 0.1974 |
| 4.3292 | 2100 | 2.5674        | -      | -      | -      | -      | -      | -      | -      |
| 4.5350 | 2200 | 2.6237        | 0.7899 | 0.5502 | 0.4802 | 0.2843 | 0.5674 | 0.4607 | 0.1993 |
| 4.7407 | 2300 | 2.3776        | -      | -      | -      | -      | -      | -      | -      |
| 4.9465 | 2400 | 2.1116        | 0.7893 | 0.5505 | 0.4797 | 0.2919 | 0.5697 | 0.4632 | 0.2007 |

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 4.1.0
  • Transformers: 4.51.3
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.6.0
  • Datasets: 3.5.0
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

GISTEmbedLoss

@misc{solatorio2024gistembed,
    title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
    author={Aivin V. Solatorio},
    year={2024},
    eprint={2402.16829},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}