Large Language Models (LLMs) have become pivotal in powering scientific question-answering across modern search engines, yet their evaluation robustness remains largely underexplored. To address this gap, we introduce YESciEval — an open-source framework that leverages fine-grained rubric-based assessments combined with reinforcement learning to reduce optimism bias in LLM evaluators. YESciEval provides a comprehensive library for evaluating the quality of synthesized scientific answers using predefined rubrics and sophisticated LLM-based judgment models. This framework enables you to assess answers on key criteria by utilizing pretrained judges and parsing LLM outputs into structured JSON formats for detailed analysis.
YESciEval-BioASQ-Llama-3.1-8B is a biomedical-domain judge model fine-tuned on the BioASQ dataset from the BioASQ challenge.
Usage
First, install the YESciEval library via pip:
pip install yescieval
Get started with YESciEval in just a few lines of code. The example below shows how to prepare the inputs, load a judge, and instantiate a rubric to evaluate an answer.
from yescieval import Readability, BioASQAutoJudge
# Sample papers in the format {"title": "abstract", ...}
papers = {
"A Study on AI": "This paper discusses recent advances in artificial intelligence, including deep learning.",
"Machine Learning Basics": "An overview of supervised learning methods such as decision trees and SVMs.",
"Neural Networks Explained": "Explains backpropagation and gradient descent for training networks.",
"Ethics in AI": "Explores ethical concerns in automated decision-making systems.",
"Applications of AI in Healthcare": "Details how AI improves diagnostics and personalized medicine."
}
# Question and synthesized answer
question = "How is AI used in modern healthcare systems?"
answer = (
"AI is being used in healthcare for diagnosing diseases, predicting patient outcomes, "
"and assisting in treatment planning. It also supports personalized medicine and medical imaging."
)
# Step 1: Create a rubric
rubric = Readability(papers=papers, question=question, answer=answer)
# Step 2: Load a judge model
judge = BioASQAutoJudge()
judge.from_pretrained(token="your_huggingface_token")
# Step 3: Evaluate the answer
result = judge.evaluate(rubric=rubric)
print("Raw Evaluation Output:")
print(result)
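As noted above, YESciEval parses LLM judge outputs into structured JSON for detailed analysis. The snippet below is a minimal post-processing sketch: it assumes the raw result is a JSON-formatted string, and the field names "rating" and "rationale" are illustrative placeholders, not the library's fixed schema.
import json

# Illustrative post-processing of the raw judge output.
# Assumes `result` is a JSON-formatted string; the field names below
# ("rating", "rationale") are placeholders, not a guaranteed schema.
try:
    parsed = json.loads(result)
    print("Rubric score:", parsed.get("rating"))
    print("Rationale:", parsed.get("rationale"))
except (json.JSONDecodeError, TypeError):
    # Fall back to the raw text if the output is not valid JSON.
    print("Could not parse output as JSON:", result)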
A total of nine evaluation rubrics are defined as part of the YESciEval framework and can be used via yescieval. The following example shows how to import them in your code:
from yescieval import (
    Informativeness, Correctness, Completeness,
    Coherence, Relevancy, Integration,
    Cohesion, Readability, Conciseness,
)
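The same judge can score an answer against several rubrics in one pass. The sketch below reuses the judge, papers, question, and answer from the earlier example and assumes that each rubric class shares the (papers, question, answer) constructor used by Readability above.
# Sketch: evaluate one answer against multiple rubrics with the judge
# loaded earlier. Assumes each rubric class takes (papers, question, answer).
rubrics = {
    "Correctness": Correctness(papers=papers, question=question, answer=answer),
    "Completeness": Completeness(papers=papers, question=question, answer=answer),
    "Conciseness": Conciseness(papers=papers, question=question, answer=answer),
}

for name, rubric in rubrics.items():
    print(f"--- {name} ---")
    print(judge.evaluate(rubric=rubric))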
A complete list of rubrics is available on the YESciEval 📚 Rubrics page. For more detailed documentation, visit https://yescieval.readthedocs.io
Citation
If you find our work helpful, please consider citing it:
@article{d2025yescieval,
title={YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering},
author={D'Souza, Jennifer and Giglou, Hamed Babaei and M{\"u}nch, Quentin},
journal={arXiv preprint arXiv:2505.14279},
year={2025}
}
Model tree for SciKnowOrg/YESciEval-BioASQ-Llama-3.1-8B
Base model: meta-llama/Llama-3.1-8B
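For reference, the judge checkpoint can also be loaded directly with Hugging Face Transformers. The sketch below is a hedged fallback, not the recommended path: it assumes the repository ships full causal-LM weights (if it is instead distributed as a PEFT/LoRA adapter on top of meta-llama/Llama-3.1-8B, load it with the peft library). The yescieval judge API above remains the supported way to run evaluations.
# Hedged sketch: direct loading with transformers, assuming full model weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SciKnowOrg/YESciEval-BioASQ-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id, token="your_huggingface_token")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # requires the `accelerate` package
    token="your_huggingface_token",
)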