SentenceTransformer based on keepitreal/vietnamese-sbert

This is a sentence-transformers model finetuned from keepitreal/vietnamese-sbert on the json dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: keepitreal/vietnamese-sbert
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json
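
These properties can be checked directly on the loaded model; a minimal sketch using the standard SentenceTransformer accessors:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("NghiBuine/search-ecommerce-product-model")
print(model.max_seq_length)                      # 256
print(model.get_sentence_embedding_dimension())  # 768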

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
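
Since the pooling module performs plain attention-masked mean pooling, the checkpoint can also be used from the transformers library directly. A minimal sketch, assuming the repository loads with AutoModel as sentence-transformers checkpoints usually do:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NghiBuine/search-ecommerce-product-model")
model = AutoModel.from_pretrained("NghiBuine/search-ecommerce-product-model")

encoded = tokenizer(
    ["áo sơ mi nữ tay dài họa tiết caro"],
    padding=True, truncation=True, max_length=256, return_tensors="pt",
)
with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state  # (batch, seq_len, 768)

# Mean pooling: average token embeddings, ignoring padding via the attention mask
mask = encoded["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(embeddings.shape)  # torch.Size([1, 768])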

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("NghiBuine/search-ecommerce-product-model")
# Run inference
sentences = [
    'Áo Sơ Mi Nữ Tay Dài Họa Tiết Caro',  # Women's Long-Sleeve Checkered Shirt
    'áo sơ mi nữ tay dài họa tiết caro nhỏ cổ điển trẻ trung',  # women's long-sleeve shirt, small classic check pattern, youthful
    'các thiết kế như cạp cao tôn dáng, gấu tua rua, chi tiết rách nhẹ',  # designs such as a flattering high waist, fringed hems, light distressing
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
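
For the e-commerce search use case, encode the query and the product catalog separately and rank products by similarity. A minimal sketch with a hypothetical three-item catalog:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("NghiBuine/search-ecommerce-product-model")

# Hypothetical catalog entries, for illustration only
products = [
    "áo sơ mi nữ tay dài họa tiết caro nhỏ cổ điển trẻ trung",
    "giày thể thao nữ chunky sneaker hồng pastel đế cao 4.5cm",
    "bộ xếp hình gỗ 3d động vật rừng cho bé",
]
query = "áo sơ mi caro nữ"  # "women's checkered shirt"

product_embeddings = model.encode(products)
query_embedding = model.encode(query)

# model.similarity uses the model's similarity function (cosine here)
scores = model.similarity(query_embedding, product_embeddings)  # shape [1, 3]
best = int(scores.argmax())
print(products[best], float(scores[0, best]))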

Evaluation

Metrics

Information Retrieval (dimension 768)

Metric Value
cosine_accuracy@1 0.0
cosine_accuracy@3 0.0093
cosine_accuracy@5 0.0278
cosine_accuracy@10 0.2778
cosine_precision@1 0.0
cosine_precision@3 0.0031
cosine_precision@5 0.0056
cosine_precision@10 0.0278
cosine_recall@1 0.0
cosine_recall@3 0.0093
cosine_recall@5 0.0278
cosine_recall@10 0.2778
cosine_ndcg@10 0.0919
cosine_mrr@10 0.0393
cosine_map@100 0.0495

Information Retrieval (dimension 512)

Metric Value
cosine_accuracy@1 0.0
cosine_accuracy@3 0.0093
cosine_accuracy@5 0.0093
cosine_accuracy@10 0.2593
cosine_precision@1 0.0
cosine_precision@3 0.0031
cosine_precision@5 0.0019
cosine_precision@10 0.0259
cosine_recall@1 0.0
cosine_recall@3 0.0093
cosine_recall@5 0.0093
cosine_recall@10 0.2593
cosine_ndcg@10 0.0854
cosine_mrr@10 0.0363
cosine_map@100 0.0475

Information Retrieval (dimension 256)

Metric Value
cosine_accuracy@1 0.0093
cosine_accuracy@3 0.0185
cosine_accuracy@5 0.0185
cosine_accuracy@10 0.2685
cosine_precision@1 0.0093
cosine_precision@3 0.0062
cosine_precision@5 0.0037
cosine_precision@10 0.0269
cosine_recall@1 0.0093
cosine_recall@3 0.0185
cosine_recall@5 0.0185
cosine_recall@10 0.2685
cosine_ndcg@10 0.095
cosine_mrr@10 0.0462
cosine_map@100 0.0564

Information Retrieval (dimension 128)

Metric Value
cosine_accuracy@1 0.0
cosine_accuracy@3 0.0
cosine_accuracy@5 0.0093
cosine_accuracy@10 0.2407
cosine_precision@1 0.0
cosine_precision@3 0.0
cosine_precision@5 0.0019
cosine_precision@10 0.0241
cosine_recall@1 0.0
cosine_recall@3 0.0
cosine_recall@5 0.0093
cosine_recall@10 0.2407
cosine_ndcg@10 0.0777
cosine_mrr@10 0.0319
cosine_map@100 0.0428

Information Retrieval (dimension 64)

Metric Value
cosine_accuracy@1 0.0
cosine_accuracy@3 0.0
cosine_accuracy@5 0.0
cosine_accuracy@10 0.2222
cosine_precision@1 0.0
cosine_precision@3 0.0
cosine_precision@5 0.0
cosine_precision@10 0.0222
cosine_recall@1 0.0
cosine_recall@3 0.0
cosine_recall@5 0.0
cosine_recall@10 0.2222
cosine_ndcg@10 0.0711
cosine_mrr@10 0.0288
cosine_map@100 0.0395
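
The five tables report the same retrieval benchmark at the five Matryoshka dimensions, from 768 down to 64; their cosine_ndcg@10 values match the dim_768 … dim_64 columns of the saved-checkpoint row in the training logs below. Because of the Matryoshka training, embeddings can be truncated to any of these sizes at load time, trading a little accuracy for smaller vectors. A minimal sketch:

from sentence_transformers import SentenceTransformer

# Truncate embeddings to one of the trained Matryoshka sizes (768/512/256/128/64)
model = SentenceTransformer("NghiBuine/search-ecommerce-product-model", truncate_dim=256)
embeddings = model.encode(["áo sơ mi nữ tay dài họa tiết caro"])
print(embeddings.shape)  # (1, 256)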

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 972 training samples
  • Columns: positive and anchor
  • Approximate statistics based on the first 972 samples:
    • positive: string; min 4, mean 11.73, max 37 tokens
    • anchor: string; min 6, mean 15.29, max 41 tokens
  • Samples:
    positive → anchor
    • Giày Thể Thao Nữ Chunky Sneaker Hồng Pastel → đế cao 4.5cm giúp hack dáng tăng chiều cao vẫn thoải mái
    • Bộ Xếp Hình Gỗ 3D Động Vật Rừng → rèn luyện kỹ năng nhận dạng hình khối và phát triển khả năng quan sát
    • Rubik Mirror Cube Biến Hình 3x3 → vỏ Rubik mạ gương vàng bóng nổi bật và sang trọng
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
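
In code, this corresponds to wrapping MultipleNegativesRankingLoss in MatryoshkaLoss so the ranking loss is applied at every truncated embedding size. A minimal sketch of how such a loss is typically constructed:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("keepitreal/vietnamese-sbert")
inner = MultipleNegativesRankingLoss(model)
# Apply the ranking loss at every truncated size; weights default to 1 each
loss = MatryoshkaLoss(model, inner, matryoshka_dims=[768, 512, 256, 128, 64])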
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 32
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-05
  • num_train_epochs: 4
  • bf16: True
  • load_best_model_at_end: True
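
With sentence-transformers 4.x these settings map onto SentenceTransformerTrainingArguments; a minimal sketch (output_dir is a placeholder):

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="search-ecommerce-product-model",  # placeholder path
    eval_strategy="epoch",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=16,  # effective batch size: 32 * 16 = 512
    learning_rate=2e-5,
    num_train_epochs=4,
    bf16=True,
    load_best_model_at_end=True,
)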

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 16
  • eval_accumulation_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch   Step  dim_768  dim_512  dim_256  dim_128  dim_64   (all values: cosine_ndcg@10)
0.5161  1     0.0934   0.0877   0.0900   0.0803   0.0773
1.5484  3     0.0905   0.0836   0.0923   0.0782   0.0708
2.0645  4     0.0919   0.0854   0.0950   0.0777   0.0711   ← saved checkpoint
  • The marked row (bold in the original card) denotes the saved checkpoint; its values match the evaluation tables above.

Framework Versions

  • Python: 3.11.9
  • Sentence Transformers: 4.1.0
  • Transformers: 4.41.2
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.7.0
  • Datasets: 2.19.1
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}