SentenceTransformer based on BAAI/bge-m3
This is a sentence-transformers model finetuned from BAAI/bge-m3 on the stsb-indo-edu dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: BAAI/bge-m3
- Maximum Sequence Length: 8192 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
- Training Dataset: stsb-indo-edu
- Language: id
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
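Because the model ends with a Normalize() module, every embedding it produces is L2-normalized, so cosine similarity reduces to a plain dot product. A minimal sketch to confirm this (the example sentence is arbitrary):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("Pustekhan-ITB/indoedubert-bge-m3-exp2")

# The trailing Normalize() module should make every vector unit-length,
# so its L2 norm is expected to be ~1.0.
emb = model.encode(["Sebuah kalimat contoh."])
print(np.linalg.norm(emb[0]))
```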
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("Pustekhan-ITB/indoedubert-bge-m3-exp2")
# Run inference
sentences = [
'Jawaban singkatnya adalah: kita terbuat dari "materi" yang disumbangkan oleh banyak bintang.',
'Sangat tidak mungkin bahwa kita terbuat dari benda-benda yang hanya terbuat dari satu bintang.',
'Sebuah band sedang bermain di atas panggung.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
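Beyond pairwise similarity, the same embeddings can back a small semantic-search setup. The query and corpus below are illustrative only; model.similarity defaults to cosine similarity for this model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Pustekhan-ITB/indoedubert-bge-m3-exp2")

# Hypothetical query and corpus, for illustration only
query = "Bagaimana siswa SD melestarikan budaya daerah?"
corpus = [
    "Pelajaran menari daerah membantu siswa SD melestarikan kebudayaan lokal.",
    "Sebuah band sedang bermain di atas panggung.",
    "Guru memberikan bimbingan belajar tambahan sebelum ujian.",
]

query_emb = model.encode([query])
corpus_emb = model.encode(corpus)

# Cosine similarity between the query and every corpus sentence
scores = model.similarity(query_emb, corpus_emb)  # shape: (1, 3)
best = int(scores[0].argmax())
print(corpus[best], float(scores[0][best]))
```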
Evaluation
Metrics
Semantic Similarity
- Datasets: stsb-indo-edu-dev and stsb-indo-edu-test
- Evaluated with EmbeddingSimilarityEvaluator
Metric | stsb-indo-edu-dev | stsb-indo-edu-test |
---|---|---|
pearson_cosine | 0.8433 | 0.8443 |
spearman_cosine | 0.858 | 0.8603 |
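These scores come from EmbeddingSimilarityEvaluator. A sketch of how they could be reproduced, assuming the stsb-indo-edu dataset exposes a validation split with the sentence1/sentence2/score columns described below; the dataset repository id here is an assumption:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction

model = SentenceTransformer("Pustekhan-ITB/indoedubert-bge-m3-exp2")

# Assumed repository id and split name for the stsb-indo-edu dataset
eval_ds = load_dataset("Pustekhan-ITB/stsb-indo-edu", split="validation")

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=eval_ds["sentence1"],
    sentences2=eval_ds["sentence2"],
    scores=eval_ds["score"],
    main_similarity=SimilarityFunction.COSINE,
    name="stsb-indo-edu-dev",
)
print(evaluator(model))  # Pearson/Spearman cosine scores
```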
Training Details
Training Dataset
stsb-indo-edu
- Dataset: stsb-indo-edu at 2c5aa12
- Size: 6,198 training samples
- Columns: sentence1, sentence2, and score
- Approximate statistics based on the first 1000 samples:
| | sentence1 | sentence2 | score |
|---|---|---|---|
| type | string | string | float |
| details | min: 6 tokens, mean: 10.95 tokens, max: 28 tokens | min: 6 tokens, mean: 10.81 tokens, max: 30 tokens | min: 0.0, mean: 0.46, max: 1.0 |
- Samples:
| sentence1 | sentence2 | score |
|---|---|---|
| Pelajaran menari daerah membantu siswa SD melestarikan kebudayaan lokal | Tarian ini sering dipentaskan saat perayaan hari besar | 0.76 |
| Sebelum ujian sekolah, guru memberikan bimbingan belajar tambahan secara gratis | Upaya ini agar seluruh siswa siap menghadapi ujian | 0.85 |
| Beberapa SD terletak di daerah pegunungan, sehingga siswa harus berjalan kaki cukup jauh | Ini melatih kemandirian dan fisik yang kuat | 0.63 |
- Loss: CoSENTLoss with these parameters: { "scale": 20.0, "similarity_fct": "pairwise_cos_sim" }
Evaluation Dataset
stsb-indo-edu
- Dataset: stsb-indo-edu at 2c5aa12
- Size: 1,536 evaluation samples
- Columns: sentence1, sentence2, and score
- Approximate statistics based on the first 1000 samples:
| | sentence1 | sentence2 | score |
|---|---|---|---|
| type | string | string | float |
| details | min: 5 tokens, mean: 15.96 tokens, max: 44 tokens | min: 6 tokens, mean: 15.97 tokens, max: 47 tokens | min: 0.0, mean: 0.42, max: 1.0 |
- Samples:
| sentence1 | sentence2 | score |
|---|---|---|
| Seorang pria dengan topi keras sedang menari. | Seorang pria yang mengenakan topi keras sedang menari. | 1.0 |
| Seorang anak kecil sedang menunggang kuda. | Seorang anak sedang menunggang kuda. | 0.95 |
| Seorang pria sedang memberi makan seekor tikus kepada seekor ular. | Pria itu sedang memberi makan seekor tikus kepada ular. | 1.0 |
- Loss: CoSENTLoss with these parameters: { "scale": 20.0, "similarity_fct": "pairwise_cos_sim" }
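Both splits feed CoSENTLoss directly: the two sentence columns are embedded and the float score supervises their pairwise cosine similarity. A minimal sketch of constructing the loss with the parameters listed above:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CoSENTLoss
from sentence_transformers.util import pairwise_cos_sim

model = SentenceTransformer("BAAI/bge-m3")

# scale=20.0 and pairwise cosine similarity match the parameters listed above
loss = CoSENTLoss(model, scale=20.0, similarity_fct=pairwise_cos_sim)
```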
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 32
- num_train_epochs: 1
- warmup_ratio: 0.1
- fp16: True
- batch_sampler: no_duplicates
All Hyperparameters
Click to expand
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 32
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 1
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
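A condensed training sketch that wires the non-default hyperparameters into SentenceTransformerTrainingArguments; the dataset repository id and split names are assumptions, not part of this card:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CoSENTLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("BAAI/bge-m3")
# Assumed dataset location and split names
dataset = load_dataset("Pustekhan-ITB/stsb-indo-edu")

args = SentenceTransformerTrainingArguments(
    output_dir="indoedubert-bge-m3",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="steps",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    loss=CoSENTLoss(model),
)
trainer.train()
```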
Training Logs
Epoch | Step | Training Loss | Validation Loss | stsb-indo-edu-dev_spearman_cosine | stsb-indo-edu-test_spearman_cosine |
---|---|---|---|---|---|
-1 | -1 | - | - | 0.8096 | - |
0.5155 | 100 | 6.0081 | 5.7898 | 0.8580 | - |
-1 | -1 | - | - | - | 0.8603 |
Framework Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.48.3
- PyTorch: 2.5.1
- Accelerate: 1.3.0
- Datasets: 3.3.0
- Tokenizers: 0.21.0
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
CoSENTLoss
@online{kexuefm-8847,
title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
author={Su Jianlin},
year={2022},
month={Jan},
url={https://kexue.fm/archives/8847},
}