SentenceTransformer based on denaya/indoSBERT-large

This is a sentence-transformers model finetuned from denaya/indoSBERT-large on the bps-publication-pos-neg-pairs dataset. It maps sentences & paragraphs to a 256-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: denaya/indoSBERT-large
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 256 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: bps-publication-pos-neg-pairs

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 1024, 'out_features': 256, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
)
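
To sanity-check these components after loading, here is a minimal sketch (using the same Hub id as the Usage section below):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("yahyaabd/allstats-semantic-base-v1-3")
print(model)                                     # Transformer -> Pooling -> Dense stack, as listed above
print(model.get_sentence_embedding_dimension())  # 256, set by the final Dense projection
print(model.max_seq_length)                      # 256 tokens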

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-semantic-base-v1-3")
# Run inference
sentences = [
    'Studi efisiensi industri manufaktr',
    'Statistik Potensi Desa Provinsi Maluku 2011',
    'Klasifikasi Baku Komoditas Indonesia (KBKI) 2012 Buku 4',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 256]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
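
Since semantic search is among the intended uses, here is a hedged sketch built on the library's util.semantic_search helper; the corpus below simply reuses the example titles above and is illustrative only:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("yahyaabd/allstats-semantic-base-v1-3")
corpus = [
    "Statistik Potensi Desa Provinsi Maluku 2011",
    "Klasifikasi Baku Komoditas Indonesia (KBKI) 2012 Buku 4",
    "Indikator Ekonomi Juli 2023",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("Studi efisiensi industri manufaktur", convert_to_tensor=True)

# Rank the corpus by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
print(hits[0])  # [{'corpus_id': ..., 'score': ...}, ...]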

Evaluation

Metrics

Semantic Similarity

| Metric          | allstats-semantic-base-v1-eval | allstat-semantic-base-v1-test |
|:----------------|:------------------------------:|:-----------------------------:|
| pearson_cosine  | 0.9659                         | 0.9592                        |
| spearman_cosine | 0.7842                         | 0.7818                        |
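
These metric names match the output of the library's EmbeddingSimilarityEvaluator. A minimal sketch of running such an evaluation yourself follows; the sentence pairs and gold scores are placeholders, not the actual evaluation split:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("yahyaabd/allstats-semantic-base-v1-3")
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=[
        "Indikator Ekonomi Juli 2023",
        "Data konversi GKG ke beras tahun 2012",
        "Statistik air bersih Indonesia periode 2014-2019",
    ],
    sentences2=[
        "Indikator Ekonomi Juli 2023",
        "NERACA ENERGI INDONESIA 2017-2021",
        "Profil Usaha Konstruksi Perorangan Provinsi Kalimantan Utara, 2022",
    ],
    scores=[1.0, 0.0, 0.0],  # placeholder gold similarities in [0, 1]
    name="allstats-semantic-base-v1-eval",
)
print(evaluator(model))  # dict with ..._pearson_cosine and ..._spearman_cosine keys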

Training Details

Training Dataset

bps-publication-pos-neg-pairs

  • Dataset: bps-publication-pos-neg-pairs at 46a5cb7
  • Size: 23,478 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:

    |         | query                                             | doc                                               | label                  |
    |:--------|:--------------------------------------------------|:--------------------------------------------------|:-----------------------|
    | type    | string                                            | string                                            | int                    |
    | details | min: 5 tokens, mean: 11.84 tokens, max: 24 tokens | min: 5 tokens, mean: 10.77 tokens, max: 28 tokens | 0: ~72.40%, 1: ~27.60% |
  • Samples:

    | query                                                                   | doc                                                                                 | label |
    |:------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:------|
    | Direktori perusahaan perantara keuangan bukan koperasi tahun 2006 (SE)  | Tinjauan Regional Berdasarkan PDRB Kabupaten/Kota 2018-2022, Buku 2 Pulau Jawa-Bali | 0     |
    | Informasi lengkap tentang PPLS 2011                                     | Indeks Harga Perdagangan Besar Indonesia tahun 2005                                 | 0     |
    | Data konversi GKG ke beras tahun 2012                                   | Indikator Ekonomi Juli 2023                                                         | 0     |
  • Loss: ContrastiveLoss with these parameters:
    {
        "distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE",
        "margin": 0.5,
        "size_average": true
    }
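
For reference, this configuration corresponds to the following construction in Sentence Transformers (a sketch; the actual training script is not part of this card):

from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.losses import SiameseDistanceMetric

model = SentenceTransformer("denaya/indoSBERT-large")
loss = losses.ContrastiveLoss(
    model=model,
    distance_metric=SiameseDistanceMetric.COSINE_DISTANCE,
    margin=0.5,
    size_average=True,
)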
    

Evaluation Dataset

bps-publication-pos-neg-pairs

  • Dataset: bps-publication-pos-neg-pairs at 46a5cb7
  • Size: 5,031 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:

    |         | query                                             | doc                                               | label                  |
    |:--------|:--------------------------------------------------|:--------------------------------------------------|:-----------------------|
    | type    | string                                            | string                                            | int                    |
    | details | min: 5 tokens, mean: 11.97 tokens, max: 24 tokens | min: 5 tokens, mean: 10.76 tokens, max: 32 tokens | 0: ~72.70%, 1: ~27.30% |
  • Samples:

    | query                                                                    | doc                                                                                  | label |
    |:-------------------------------------------------------------------------|:--------------------------------------------------------------------------------------|:------|
    | Informasi angka tanaman berkhasiat ogbat dan tanaman hias di tahun 2005  | Tinjauan Regional Berdasarkan PDRB Kabupaten/Kota 2010-2013 - Buku 2 Pulau Jawa-Bali | 0     |
    | Informasi lengkap statistik horsikultura tahun 2020                      | NERACA ENERGI INDONESIA 2017-2021                                                    | 0     |
    | Statistik air bersih Indonesia periode 2014-2019                         | Profil Usaha Konstruksi Perorangan Provinsi Kalimantan Utara, 2022                   | 0     |
  • Loss: ContrastiveLoss with these parameters:
    {
        "distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE",
        "margin": 0.5,
        "size_average": true
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • num_train_epochs: 8
  • warmup_ratio: 0.1
  • fp16: True
  • load_best_model_at_end: True
  • eval_on_start: True
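
In Sentence Transformers 3.x these map onto SentenceTransformerTrainingArguments roughly as follows (a sketch; output_dir is an assumption):

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="allstats-semantic-base-v1-3",  # hypothetical output directory
    eval_strategy="steps",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=8,
    warmup_ratio=0.1,
    fp16=True,
    load_best_model_at_end=True,
    eval_on_start=True,
)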

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 8
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
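
Putting the pieces together, here is a hedged end-to-end training sketch with SentenceTransformerTrainer; the dataset repo id (yahyaabd/bps-publication-pos-neg-pairs) and the split names are assumptions inferred from the dataset name and revision above:

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.losses import SiameseDistanceMetric

model = SentenceTransformer("denaya/indoSBERT-large")
dataset = load_dataset("yahyaabd/bps-publication-pos-neg-pairs", revision="46a5cb7")  # repo id is an assumption

# Loss and arguments as documented in the sections above
loss = losses.ContrastiveLoss(model, distance_metric=SiameseDistanceMetric.COSINE_DISTANCE, margin=0.5, size_average=True)
args = SentenceTransformerTrainingArguments(
    output_dir="allstats-semantic-base-v1-3",  # hypothetical
    eval_strategy="steps",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=8,
    warmup_ratio=0.1,
    fp16=True,
    load_best_model_at_end=True,
    eval_on_start=True,
)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],  # split names are assumptions
    eval_dataset=dataset["test"],
    loss=loss,
)
trainer.train()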

Training Logs

| Epoch  | Step | Training Loss | Validation Loss | allstats-semantic-base-v1-eval_spearman_cosine | allstat-semantic-base-v1-test_spearman_cosine |
|:------:|:----:|:-------------:|:---------------:|:----------------------------------------------:|:---------------------------------------------:|
| 0      | 0    | -             | 0.0053          | 0.7770                                         | -                                             |
| 0.5450 | 200  | 0.0023        | 0.0005          | 0.7842                                         | -                                             |
| 1.0899 | 400  | 0.0005        | 0.0002          | 0.7842                                         | -                                             |
| 1.6349 | 600  | 0.0002        | 0.0002          | 0.7842                                         | -                                             |
| 2.1798 | 800  | 0.0001        | 0.0001          | 0.7842                                         | -                                             |
| 2.7248 | 1000 | 0.0001        | 0.0001          | 0.7842                                         | -                                             |
| 3.2698 | 1200 | 0.0           | 0.0001          | 0.7842                                         | -                                             |
| 3.8147 | 1400 | 0.0           | 0.0001          | 0.7842                                         | -                                             |
| 4.3597 | 1600 | 0.0           | 0.0001          | 0.7842                                         | -                                             |
| 4.9046 | 1800 | 0.0           | 0.0001          | 0.7842                                         | -                                             |
| 5.4496 | 2000 | 0.0           | 0.0001          | 0.7842                                         | -                                             |
| 5.9946 | 2200 | 0.0           | 0.0001          | 0.7842                                         | -                                             |
| 6.5395 | 2400 | 0.0           | 0.0001          | 0.7842                                         | -                                             |
| 7.0845 | 2600 | 0.0           | 0.0001          | 0.7842                                         | -                                             |
| 7.6294 | 2800 | 0.0           | 0.0001          | 0.7842                                         | -                                             |
| -1     | -1   | -             | -               | -                                              | 0.7818                                        |
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.4.0
  • Transformers: 4.48.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

ContrastiveLoss

@inproceedings{hadsell2006dimensionality,
    author={Hadsell, R. and Chopra, S. and LeCun, Y.},
    booktitle={2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)},
    title={Dimensionality Reduction by Learning an Invariant Mapping},
    year={2006},
    volume={2},
    number={},
    pages={1735-1742},
    doi={10.1109/CVPR.2006.100}
}