
Large Language Models (LLMs) have become pivotal in powering scientific question answering across modern search engines, yet the robustness of their evaluation remains largely underexplored. To address this gap, we introduce YESciEval, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to reduce optimism bias in LLM evaluators. YESciEval provides a comprehensive library for evaluating the quality of synthesized scientific answers using predefined rubrics and LLM-based judge models. The framework lets you assess answers against key criteria with pretrained judges and parse LLM outputs into structured JSON for detailed analysis.

YESciEval-BioASQ-Llama-3.1-8B is a biomedical-domain judge fine-tuned on the BioASQ dataset from the BioASQ challenge.
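
If you prefer to load the judge model directly with the Hugging Face transformers library rather than through the yescieval wrapper shown under Usage below, a minimal sketch follows. It assumes the repository hosts standard causal-LM weights and that your account has access to it; replace the token placeholder with your own.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SciKnowOrg/YESciEval-BioASQ-Llama-3.1-8B"

# A gated or private repository may require a Hugging Face access token
tokenizer = AutoTokenizer.from_pretrained(model_id, token="your_huggingface_token")
model = AutoModelForCausalLM.from_pretrained(model_id, token="your_huggingface_token")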

Usage

First, install the YESciEval library via pip:

pip install yescieval

Get started with YESciEval in just a few lines of code. This guide demonstrates how to prepare the inputs, load a judge, and instantiate a rubric to evaluate an answer.

from yescieval import Readability, BioASQAutoJudge

# Sample papers in the format {"title": "abstract", ...}
papers = {
    "A Study on AI": "This paper discusses recent advances in artificial intelligence, including deep learning.",
    "Machine Learning Basics": "An overview of supervised learning methods such as decision trees and SVMs.",
    "Neural Networks Explained": "Explains backpropagation and gradient descent for training networks.",
    "Ethics in AI": "Explores ethical concerns in automated decision-making systems.",
    "Applications of AI in Healthcare": "Details how AI improves diagnostics and personalized medicine."
}

# Question and synthesized answer
question = "How is AI used in modern healthcare systems?"
answer = (
    "AI is being used in healthcare for diagnosing diseases, predicting patient outcomes, "
    "and assisting in treatment planning. It also supports personalized medicine and medical imaging."
)

# Step 1: Create a rubric
rubric = Readability(papers=papers, question=question, answer=answer)

# Step 2: Load a judge model
judge = BioASQAutoJudge()
judge.from_pretrained(token="your_huggingface_token")

# Step 3: Evaluate the answer
result = judge.evaluate(rubric=rubric)
print("Raw Evaluation Output:")
print(result)
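
The framework is described as parsing LLM outputs into structured JSON. As a minimal sketch, assuming the raw evaluation output is a JSON-formatted string, you can parse it with the standard json module and fall back to the raw text if parsing fails.

import json

# Parse the raw judge output into structured JSON (assumes JSON-formatted output)
try:
    parsed = json.loads(result)
    print("Parsed Evaluation Output:")
    print(json.dumps(parsed, indent=2))
except (TypeError, json.JSONDecodeError):
    # Fall back to the raw text if the output is not valid JSON
    print(result)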

Nine evaluation rubrics are defined as part of the YESciEval framework and are available through yescieval. The following example shows how to import them (a sketch that evaluates an answer against each rubric follows):

from yescieval import (
    Informativeness, Correctness, Completeness,
    Coherence, Relevancy, Integration,
    Cohesion, Readability, Conciseness,
)
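
To score an answer against several rubrics at once, you can loop over the rubric classes. This sketch assumes each rubric takes the same papers, question, and answer arguments as Readability in the example above, and reuses the judge loaded there.

# Evaluate one answer against all nine rubrics (assumes a shared constructor signature)
rubric_classes = [
    Informativeness, Correctness, Completeness,
    Coherence, Relevancy, Integration,
    Cohesion, Readability, Conciseness,
]

for rubric_cls in rubric_classes:
    rubric = rubric_cls(papers=papers, question=question, answer=answer)
    result = judge.evaluate(rubric=rubric)
    print(f"{rubric_cls.__name__}: {result}")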

A complete list of rubrics is available on the YESciEval 📚 Rubrics page. For more detailed documentation, visit https://yescieval.readthedocs.io

Citation

If you find our work helpful, please consider citing it:

@article{d2025yescieval,
      title={YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering},
      author={D'Souza, Jennifer and Giglou, Hamed Babaei and M{\"u}nch, Quentin},
      journal={arXiv preprint arXiv:2505.14279},
      year={2025}
}