Model Card for safora/persian-e5-large-scientific-retriever

Model Description

This model is a fine-tuned version of safora/persian-science-qa-e5-large, specifically optimized for high-performance information retrieval in the Persian scientific domain. It is designed to be a core component of a Retrieval-Augmented Generation (RAG) system, where it excels at identifying the most relevant documents from a large corpus in response to a user's query.

This model was fine-tuned to address a common problem in RAG systems: the retrieval of documents that are thematically related but factually incorrect. By training on a rigorously cleaned dataset of "hard negatives," this model learns to be more precise and discriminative, significantly improving the quality of the context provided to a generative model.

Intended Uses & Limitations

This model is intended to be used for embedding Persian text for retrieval tasks. Given a query, it can be used to find the most relevant scientific abstracts or documents from a corpus using semantic search.

from sentence_transformers import SentenceTransformer

sentences = ["این یک نمونه جمله است", "این جمله دیگری است"]

# Load the fine-tuned retriever and encode the sentences into dense embeddings
model = SentenceTransformer('safora/persian-e5-large-scientific-retriever')
embeddings = model.encode(sentences)
print(embeddings)
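
For retrieval, queries and candidate documents are embedded separately and ranked by cosine similarity. The snippet below is a minimal sketch of that workflow; the corpus and query strings are illustrative placeholders.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('safora/persian-e5-large-scientific-retriever')

# Illustrative corpus of scientific abstracts (placeholders)
corpus = [
    "چکیده مقاله اول ...",
    "چکیده مقاله دوم ...",
]
query = "پرسش نمونه درباره یک موضوع علمی"

# Encode the corpus once, then encode each incoming query
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank documents by cosine similarity and keep the top hits
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])
```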

While highly effective for scientific text, its performance on general-purpose or conversational text may not be superior to the original base model.

Fine-Tuning Methodology
The performance of this model is a direct result of a meticulous data-centric fine-tuning process.

Source Data and Training Dataset
The model was fine-tuned on a custom-built dataset of 1,016 triplets, created from a corpus of Persian scientific documents. This dataset, named retriever_finetuning_triplets.jsonl, is also available on the Hugging Face Hub at safora/persian-scientific-qa-triplets.
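
The triplets can be loaded directly with the datasets library; the sketch below assumes a single train split, so inspect the first record to confirm the available fields before use.

```python
from datasets import load_dataset

# Load the triplet dataset from the Hugging Face Hub (split name is an assumption)
triplets = load_dataset("safora/persian-scientific-qa-triplets", split="train")

# Inspect one record to see the available fields
print(triplets[0])
```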

The creation of this dataset involved a multi-stage pipeline:

Heuristic Filtering: An initial set of 10,000+ generated question-answer pairs was filtered based on length, format, and language rules.

Semantic Validation: A cross-encoder (safora/reranker-xlm-roberta-large) was used to validate the semantic relevance between questions and their source abstracts. Pairs with a relevance score below 0.85 were discarded, resulting in a high-confidence set of positive pairs.
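
A minimal sketch of this filtering step, assuming the cross-encoder returns a relevance score on a 0–1 scale for each (question, abstract) pair:

```python
from sentence_transformers import CrossEncoder

# Cross-encoder used to validate question/abstract relevance
reranker = CrossEncoder("safora/reranker-xlm-roberta-large")

def keep_pair(question: str, abstract: str, threshold: float = 0.85) -> bool:
    """Keep only pairs whose relevance score clears the threshold (sketch)."""
    score = reranker.predict([(question, abstract)])[0]
    return score >= threshold
```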

Hard-Negative Mining: For each high-confidence question, we searched the entire corpus to find the most similar but incorrect documents. These "hard negatives" are crucial for teaching the model fine-grained distinctions. This process transformed the positive pairs into a robust triplet dataset of (query, positive_passage, negative_passage).
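
The mining code itself is not published; the sketch below shows one common way to implement this step, embedding the corpus with the base model and taking the highest-ranked abstract that is not the true source as the hard negative.

```python
from sentence_transformers import SentenceTransformer, util

# Base model used for mining; this choice is an assumption of the sketch
base = SentenceTransformer("safora/persian-science-qa-e5-large")

def mine_hard_negative(query, positive_idx, corpus, corpus_embeddings, top_k=10):
    """Return the highest-ranked abstract that is NOT the true source (sketch)."""
    query_embedding = base.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]
    for hit in hits:
        if hit["corpus_id"] != positive_idx:
            return corpus[hit["corpus_id"]]
    return None

# Precompute the corpus embeddings once:
# corpus_embeddings = base.encode(corpus, convert_to_tensor=True)
```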

Training Procedure
The model was fine-tuned using the sentence-transformers library with a MultipleNegativesRankingLoss function. We split the triplet dataset into a 90% training set and a 10% evaluation set to monitor for overfitting and save the best-performing checkpoint.
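
A minimal sketch of this setup with the sentence-transformers fit API; the hyperparameters and triplet field names shown are illustrative, not the values actually used.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("safora/persian-science-qa-e5-large")

# Assumed input: the 90% training split, one dict per triplet (field names assumed)
train_triplets = [
    {"query": "...", "positive_passage": "...", "negative_passage": "..."},
]
train_examples = [
    InputExample(texts=[t["query"], t["positive_passage"], t["negative_passage"]])
    for t in train_triplets
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,          # illustrative value
    warmup_steps=100,  # illustrative value
    output_path="persian-e5-large-scientific-retriever",
)
```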

Evaluation Results
A rigorous comparative evaluation was conducted between this fine-tuned model and the original safora/persian-science-qa-e5-large base model on a held-out test set. The results demonstrate a dramatic and consistent improvement across all standard information retrieval metrics.

| Model            |   Accuracy@1 |   Recall@5 |   MRR@10 |   MAP@100 |
|:-----------------|-------------:|-----------:|---------:|----------:|
| Base Model       |       0.7255 |     0.9118 |   0.8167 |    0.8178 |
| Fine-Tuned Model |       0.8431 |     1.0000 |   0.9216 |    0.9216 |



The most critical result for RAG applications is the Recall@5 score of 1.0: the correct document was retrieved within the top 5 results for every query in the held-out test set. This ensures the generative component of a RAG system consistently receives the correct context.
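
These metrics correspond to what the sentence-transformers InformationRetrievalEvaluator reports; the snippet below is a hedged sketch of such an evaluation (the actual evaluation script is not published, and the queries, corpus, and relevance judgments shown are placeholders).

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("safora/persian-e5-large-scientific-retriever")

# Placeholder held-out test set: query ids, corpus ids, and relevance judgments
queries = {"q1": "پرسش نمونه"}
corpus = {"d1": "چکیده مرتبط", "d2": "چکیده نامرتبط"}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries, corpus, relevant_docs,
    accuracy_at_k=[1], precision_recall_at_k=[5], mrr_at_k=[10], map_at_k=[100],
)
print(evaluator(model))
```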

Citation
If you use this model in your research or application, please cite the work:


@misc{safora_persian_sci_retriever_2025,
  author    = {Safora Jolfaei},
  title     = {A High-Performance Embedding Model for Persian Scientific Information Retrieval},
  year      = {2025},
  publisher = {Hugging Face},
  journal   = {Hugging Face Hub},
  url       = {https://huggingface.co/safora/persian-e5-large-scientific-retriever}
}