WebFAQ: A Multilingual Collection of Natural Q&A Datasets for Dense Retrieval
Paper: [arXiv:2502.20936](https://arxiv.org/abs/2502.20936)
This is a sentence-transformers model based on XLM-RoBERTa and trained on question-answer pairs from the WebFAQ dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model architecture:

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
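For reference, an equivalent two-module stack (XLM-R encoder followed by mean pooling) could be assembled by hand. This is a minimal sketch; the `xlm-roberta-base` checkpoint name is an assumption, since the card only states the architecture:

```python
from sentence_transformers import SentenceTransformer, models

# XLM-R encoder truncating inputs at 512 tokens, as in module (0) above.
word_embedding = models.Transformer("xlm-roberta-base", max_seq_length=512)
# Mean pooling over token embeddings yields the 768-dimensional sentence vector,
# matching module (1) above.
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding, pooling])
```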
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("PaDaS-Lab/xlm-roberta-base-msmarco-webfaq")

# Run inference
sentences = [
    # German: "Does the lid of the TipBox have to be opened for autoclaving?"
    'Muss der Deckel der TipBox beim Autoklavieren geöffnet werden?',
    # German: "No, that is not necessary. The new TipBox can be autoclaved closed at 121°C."
    'Nein, das ist nicht notwendig. Die neue TipBox kann bei 121°C im geschlossenen Zustand autoklaviert werden.',
    # Persian: "Lumps on the testicle may indicate a problem with the testicles.
    # They may be caused by an injury or may be a serious medical condition."
    'برآمدگی های بیضه ممکن است نشان دهنده مشکلی در بیضه ها باشد. ممکن است به دلیل صدمه ای به وجود آمده یا ممکن است یک مشکل پزشکی جدی باشد.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```
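Since semantic search is among the intended uses, the embeddings can also back a small retrieval loop. A minimal sketch, with the query and corpus invented for illustration (note that the query and documents may be in different languages):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("PaDaS-Lab/xlm-roberta-base-msmarco-webfaq")

# Hypothetical FAQ corpus to search over.
corpus = [
    "Nein, das ist nicht notwendig. Die neue TipBox kann bei 121°C im geschlossenen Zustand autoklaviert werden.",
    "Tigerspin verzichtet auf eine mobile App.",
]
query = "Can the TipBox be autoclaved while closed?"

corpus_embeddings = model.encode(corpus)
query_embedding = model.encode([query])

# Rank the corpus by cosine similarity to the query and print the best hit.
scores = model.similarity(query_embedding, corpus_embeddings)  # shape [1, len(corpus)]
best = scores.argmax().item()
print(corpus[best])
```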
The model was trained on question-answer pairs with two string columns, `sentence_0` (the question) and `sentence_1` (the answer):

| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |

Samples:

| sentence_0 | sentence_1 |
|---|---|
| Hat myTime ein großes Produktsortiment? *(Does myTime have a large product range?)* | Das Sortiment von myTime umfasst mehr als 13.000 Lebensmittel. Du findest alle Produkte, die du auch im Supermarkt findest, darunter Obst und Gemüse, trockene Lebensmittel wie Pasta und Reis, Backwaren, Snacks und Tiefkühlkost. Auch Getränke wie Kaffee, Alkohol und Soda findest du im Online-Supermarkt. *(myTime's range comprises more than 13,000 grocery items. You will find everything you would also find in a supermarket, including fruit and vegetables, dry goods such as pasta and rice, baked goods, snacks, and frozen food. Beverages such as coffee, alcohol, and soda are also available in the online supermarket.)* |
| Gibt es eine Tigerspin App? *(Is there a Tigerspin app?)* | Tigerspin verzichtet auf eine mobile App. Wenn Sie ein paar Runden spielen möchten, öffnen Sie einfach die Webseite des Casinos und starten die Spiele im Browser. *(Tigerspin does without a mobile app. If you want to play a few rounds, simply open the casino's website and start the games in your browser.)* |
| Bietet ihr auch maschinelle Übersetzungen an? Wenn ja, wann eignet sich diese und wann nicht? *(Do you also offer machine translation? If so, when is it suitable and when not?)* | Maschinelle Übersetzungen sind ein spannendes Thema, auch aktuell bei techtrans. Unter maschineller Übersetzung (MÜ) versteht man die automatisierte Übertragung eines Ausgangstextes in die Zielsprache mittels einer sogenannten Übersetzungsengine. Eine solche Engine kann nach regelbasierten, statistischen oder neuronalen Prinzipien aufgebaut sein. *(Machine translation is an exciting topic, currently also at techtrans. Machine translation (MT) refers to the automated transfer of a source text into the target language by means of a so-called translation engine. Such an engine can be built on rule-based, statistical, or neural principles.)* |
The model was trained with `MultipleNegativesRankingLoss` with these parameters:

```json
{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
```
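For intuition, MultipleNegativesRankingLoss treats each (question, answer) pair in a batch as a positive and every other answer in the batch as an in-batch negative. A minimal sketch of the computation under these parameters (not the library implementation, which is more general):

```python
import torch
import torch.nn.functional as F

def mnrl(anchors: torch.Tensor, positives: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    # Cosine similarity of every anchor against every positive in the batch,
    # scaled by 20.0 as in the configuration above.
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    scores = scale * anchors @ positives.T  # [batch, batch]
    # Row i should rank its own positive (column i) above all other columns,
    # which act as negatives; cross-entropy over rows enforces this.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```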
The following non-default hyperparameters were used (a training sketch follows the full list below):

- `per_device_train_batch_size`: 128
- `per_device_eval_batch_size`: 128
- `num_train_epochs`: 1
- `fp16`: True
- `multi_dataset_batch_sampler`: round_robin

All hyperparameters, defaults included:

```
overwrite_output_dir: False
do_predict: False
eval_strategy: no
prediction_loss_only: True
per_device_train_batch_size: 128
per_device_eval_batch_size: 128
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 1
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters: 
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin
```
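Pieced together from these settings, a training run could look roughly as follows. This is a sketch only: the base checkpoint, the in-memory stand-in dataset, and the output path are illustrative assumptions, not taken from this card:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Assumed base checkpoint; the card only documents the final architecture.
model = SentenceTransformer("xlm-roberta-base")

# Hypothetical stand-in for the (question, answer) training pairs.
train_dataset = Dataset.from_dict({
    "sentence_0": ["Gibt es eine Tigerspin App?"],
    "sentence_1": ["Tigerspin verzichtet auf eine mobile App."],
})

args = SentenceTransformerTrainingArguments(
    output_dir="xlm-roberta-base-msmarco-webfaq",  # placeholder path
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=1,
    fp16=True,
    multi_dataset_batch_sampler="round_robin",  # only matters with multiple datasets
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),  # scale=20.0, cos_sim by default
)
trainer.train()
```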
Training loss logged during the run:

| Epoch | Step | Training Loss |
|---|---|---|
| 0.025 | 500 | 0.1999 |
| 0.05 | 1000 | 0.0279 |
| 0.075 | 1500 | 0.0234 |
| 0.1 | 2000 | 0.0203 |
| 0.125 | 2500 | 0.0179 |
| 0.15 | 3000 | 0.0171 |
| 0.175 | 3500 | 0.0153 |
| 0.2 | 4000 | 0.015 |
| 0.225 | 4500 | 0.0143 |
| 0.25 | 5000 | 0.014 |
| 0.275 | 5500 | 0.0128 |
| 0.3 | 6000 | 0.013 |
| 0.325 | 6500 | 0.0129 |
| 0.35 | 7000 | 0.0124 |
| 0.375 | 7500 | 0.012 |
| 0.4 | 8000 | 0.0121 |
| 0.425 | 8500 | 0.0115 |
| 0.45 | 9000 | 0.0113 |
| 0.475 | 9500 | 0.0106 |
| 0.5 | 10000 | 0.0107 |
| 0.525 | 10500 | 0.011 |
| 0.55 | 11000 | 0.0108 |
| 0.575 | 11500 | 0.0103 |
| 0.6 | 12000 | 0.0097 |
| 0.625 | 12500 | 0.01 |
| 0.65 | 13000 | 0.0104 |
| 0.675 | 13500 | 0.0096 |
| 0.7 | 14000 | 0.0096 |
| 0.725 | 14500 | 0.0097 |
| 0.75 | 15000 | 0.0097 |
| 0.775 | 15500 | 0.0089 |
| 0.8 | 16000 | 0.0089 |
| 0.825 | 16500 | 0.0091 |
| 0.85 | 17000 | 0.0085 |
| 0.875 | 17500 | 0.0084 |
| 0.9 | 18000 | 0.0089 |
| 0.925 | 18500 | 0.0087 |
| 0.95 | 19000 | 0.0087 |
| 0.975 | 19500 | 0.0088 |
| 1.0 | 20000 | 0.0089 |
Citation

Sentence Transformers:

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title = {Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author = {Reimers, Nils and Gurevych, Iryna},
  booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
  month = {11},
  year = {2019},
  publisher = {Association for Computational Linguistics},
  url = {https://arxiv.org/abs/1908.10084},
}
```

MultipleNegativesRankingLoss:

```bibtex
@misc{henderson2017efficient,
  title = {Efficient Natural Language Response Suggestion for Smart Reply},
  author = {Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
  year = {2017},
  eprint = {1705.00652},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}
```

WebFAQ:

```bibtex
@misc{dinzinger2025webfaq,
  title = {WebFAQ: A Multilingual Collection of Natural Q&A Datasets for Dense Retrieval},
  author = {Michael Dinzinger and Laura Caspari and Kanishka Ghosh Dastidar and Jelena Mitrović and Michael Granitzer},
  year = {2025},
  eprint = {2502.20936},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}
```