# Model Card for safora/persian-e5-large-scientific-retriever

## Model Description
This model is a fine-tuned version of safora/persian-science-qa-e5-large, specifically optimized for high-performance information retrieval in the Persian scientific domain. It is designed to be a core component of a Retrieval-Augmented Generation (RAG) system, where it excels at identifying the most relevant documents from a large corpus in response to a user's query.
This model was fine-tuned to address a common problem in RAG systems: the retrieval of documents that are thematically related but factually incorrect. By training on a rigorously cleaned dataset of "hard negatives," this model learns to be more precise and discriminative, significantly improving the quality of the context provided to a generative model.
## Intended Uses & Limitations
This model is intended for embedding Persian text in retrieval tasks. Given a query, it finds the most relevant scientific abstracts or documents in a corpus via semantic search. For example:
```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned retriever
model = SentenceTransformer('safora/persian-e5-large-scientific-retriever')

# Encode Persian sentences into dense vectors
sentences = ["این یک نمونه جمله است", "این جمله دیگری است"]
embeddings = model.encode(sentences)
print(embeddings)
```
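In a retrieval setting, the same model can embed a query and a corpus and rank passages by cosine similarity. The sketch below uses sentence-transformers' `util.semantic_search` with placeholder Persian text; note that some E5-family models expect "query: " and "passage: " prefixes, so check the base model's card before relying on raw text.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('safora/persian-e5-large-scientific-retriever')

# Placeholder corpus of scientific abstracts
corpus = [
    "چکیده اول درباره یادگیری ماشین",
    "چکیده دوم درباره زیست‌شناسی مولکولی",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Embed the query and retrieve the top-5 most similar passages
query = "کاربرد یادگیری ماشین چیست؟"
query_embedding = model.encode(query, convert_to_tensor=True)

hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], hit["score"])
```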
While the model is highly effective for scientific text, its performance on general-purpose or conversational text may not exceed that of the original base model.
## Fine-Tuning Methodology
The performance of this model is a direct result of a meticulous data-centric fine-tuning process.
### Source Data and Training Dataset
The model was fine-tuned on a custom-built dataset of 1,016 triplets, created from a corpus of Persian scientific documents. This dataset, named retriever_finetuning_triplets.jsonl, is also available on the Hugging Face Hub at safora/persian-scientific-qa-triplets.
The creation of this dataset involved a multi-stage pipeline:
1. **Heuristic Filtering:** An initial set of 10,000+ generated question-answer pairs was filtered based on length, format, and language rules.
2. **Semantic Validation:** A cross-encoder (safora/reranker-xlm-roberta-large) was used to validate the semantic relevance between questions and their source abstracts. Pairs with a relevance score below 0.85 were discarded, resulting in a high-confidence set of positive pairs (see the filtering-and-validation sketch after this list).
3. **Hard-Negative Mining:** For each high-confidence question, we searched the entire corpus to find the most similar but incorrect documents. These "hard negatives" are crucial for teaching the model fine-grained distinctions. This process transformed the positive pairs into a robust triplet dataset of (query, positive_passage, negative_passage) (see the mining sketch after this list).
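To make the first two stages concrete, here is a minimal sketch of heuristic filtering followed by cross-encoder validation. The length rule is an illustrative stand-in for the full set of filtering rules, and `candidate_pairs` is placeholder data; only the 0.85 threshold and the reranker checkpoint come from the description above.

```python
from sentence_transformers import CrossEncoder

# Placeholder candidate question-answer pairs
candidate_pairs = [
    {"question": "نمونه پرسش علمی", "abstract": "چکیده مرتبط با پرسش"},
]

# Stage 1: simple heuristic filter (illustrative length rule only)
filtered = [p for p in candidate_pairs if 10 <= len(p["question"]) <= 300]

# Stage 2: semantic validation with the cross-encoder reranker
reranker = CrossEncoder('safora/reranker-xlm-roberta-large')
scores = reranker.predict([(p["question"], p["abstract"]) for p in filtered])
validated = [p for p, s in zip(filtered, scores) if s >= 0.85]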
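And a sketch of the hard-negative mining stage: embed the corpus, retrieve the most similar passages for each validated question, and keep the highest-ranked passage that is not the gold abstract as the negative. The `top_k` value and the data are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer('safora/persian-science-qa-e5-large')

# Placeholder corpus and (question, gold-abstract-index) pairs
corpus = ["چکیده الف", "چکیده ب", "چکیده ج"]
questions = [("پرسش مرتبط با چکیده الف", 0)]

corpus_embeddings = encoder.encode(corpus, convert_to_tensor=True)

triplets = []
for query, gold_idx in questions:
    query_embedding = encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)[0]
    # The most similar passage that is NOT the gold abstract is the hard negative
    negatives = [h["corpus_id"] for h in hits if h["corpus_id"] != gold_idx]
    if negatives:
        triplets.append((query, corpus[gold_idx], corpus[negatives[0]]))
```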
### Training Procedure
The model was fine-tuned using the sentence-transformers library with a MultipleNegativesRankingLoss function. We split the triplet dataset into a 90% training set and a 10% evaluation set to monitor for overfitting and save the best-performing checkpoint.
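A minimal training sketch using the classic sentence-transformers fit API is shown below. The batch size, epoch count, and warmup steps are illustrative assumptions; only the base checkpoint, the triplet format, and the loss function come from this card.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('safora/persian-science-qa-e5-large')

# Each InputExample holds (query, positive_passage, negative_passage)
train_examples = [
    InputExample(texts=["پرسش نمونه", "چکیده مرتبط", "چکیده نامرتبط اما مشابه"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,            # illustrative; not the published hyperparameters
    warmup_steps=100,
)
```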
## Evaluation Results
A rigorous comparative evaluation was conducted between this fine-tuned model and the original safora/persian-science-qa-e5-large base model on a held-out test set. The results demonstrate a dramatic and consistent improvement across all standard information retrieval metrics.
| Model            | Accuracy@1 | Recall@5 | MRR@10 | MAP@100 |
|:-----------------|-----------:|---------:|-------:|--------:|
| Base Model       |     0.7255 |   0.9118 | 0.8167 |  0.8178 |
| Fine-Tuned Model |     0.8431 |   1.0000 | 0.9216 |  0.9216 |
The most critical result for RAG applications is the Recall@5 score of 1.0: for every query in the held-out test set, the correct document appeared in the top 5 retrieved results. This means the generative component of a RAG system consistently received the correct context during evaluation.
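For readers who want to reproduce this style of evaluation, metrics like those above can be computed with sentence-transformers' `InformationRetrievalEvaluator`. The sketch below uses toy placeholder data; the actual held-out test set is described above.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer('safora/persian-e5-large-scientific-retriever')

# Toy placeholder data: query ids, document ids, and gold relevance labels
queries = {"q1": "پرسش نمونه"}
corpus = {"d1": "چکیده مرتبط", "d2": "چکیده نامرتبط"}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries,
    corpus,
    relevant_docs,
    accuracy_at_k=[1],
    precision_recall_at_k=[5],
    mrr_at_k=[10],
    map_at_k=[100],
)
evaluator(model)
```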
## Citation
If you use this model in your research or application, please cite the work:
```bibtex
@misc{safora_persian_sci_retriever_2025,
  author    = {Safora Jolfaei},
  title     = {A High-Performance Embedding Model for Persian Scientific Information Retrieval},
  year      = {2025},
  publisher = {Hugging Face},
  journal   = {Hugging Face Hub},
  url       = {https://huggingface.co/safora/persian-e5-large-scientific-retriever}
}
```