EC-RAFT: Automated Generation of Clinical Trial Eligibility Criteria
Model Description
EC-RAFT is a Retrieval-Augmented Fine-Tuning (RAFT) model built on LLaMA-3.1-8B-Instruct.
It is designed to automatically generate structured, high-quality clinical trial eligibility criteria (EC) directly from trial titles and descriptions.
EC-RAFT integrates domain-specific retrieval with synthesized intermediate reasoning steps, enabling it to produce clinically relevant and contextually appropriate EC sets.
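A minimal inference sketch is shown below, assuming the weights are published on the Hugging Face Hub; the repository id `biodatlab/ec-raft` and the prompt wording are illustrative assumptions, so check the repository files for the canonical chat template and prompt format.

```python
# Minimal inference sketch (repository id and prompt are illustrative assumptions).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "biodatlab/ec-raft"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# The model takes only the trial title and description as input.
messages = [{
    "role": "user",
    "content": (
        "Generate the eligibility criteria for the following clinical trial.\n"
        "Title: A Phase 2 Study of Drug X in Adults With Type 2 Diabetes\n"
        "Description: A randomized, double-blind, placebo-controlled study ..."
    ),
}]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Greedy decoding is shown here for reproducibility; sampling parameters can be adjusted if more varied EC drafts are desired.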
Fine-tuning Details
- Original Model: LLaMA-3.1-8B-Instruct
- Datasets used for fine-tuning:
- ClinicalTrials.gov (267,347 trials, 2000–2024), available as biodatlab/ec-raft-dataset
- Retrieval corpus constructed using the SciNCL model (see the retrieval sketch after this list)
- Intermediate reasoning steps R generated using Gemini-1.5-flash-002
- Fine-tuning method:
- Retrieval-Augmented Fine-Tuning (RAFT)
- Low-Rank Adaptation (LoRA)
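The sketch below illustrates only the retrieval component, assuming the publicly released SciNCL checkpoint `malteos/scincl`, [CLS] pooling, and cosine similarity; the corpus fields, pooling strategy, and top-k used to build the EC-RAFT retrieval corpus may differ.

```python
# Retrieval sketch: embed trials with SciNCL and rank corpus trials by cosine similarity.
# The checkpoint id "malteos/scincl", CLS pooling, and top-k value are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

SCINCL_ID = "malteos/scincl"  # public SciNCL checkpoint (assumed)

tokenizer = AutoTokenizer.from_pretrained(SCINCL_ID)
encoder = AutoModel.from_pretrained(SCINCL_ID)
encoder.eval()

def embed(texts):
    """Encode trial title + description strings into dense vectors ([CLS] pooling)."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return torch.nn.functional.normalize(out.last_hidden_state[:, 0], dim=-1)

corpus = [
    "Trial A title. Trial A brief description ...",
    "Trial B title. Trial B brief description ...",
]
query = ["New trial title. New trial description ..."]

scores = embed(query) @ embed(corpus).T           # cosine similarity
topk = scores.topk(k=min(2, len(corpus)), dim=-1)
print(topk.indices.tolist(), topk.values.tolist())  # trials to attach to the prompt
```

In a RAFT-style setup, the top-ranked trials retrieved this way are appended to the training and inference prompts alongside the target trial's title and description.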
Model Performance
Evaluated on a held-out ClinicalTrials.gov test split:
| Metric | Score |
|---|---|
| BERTScore (semantic similarity) | 86.23 |
| Precision (LLM-guided evaluation) | 78.84% |
| Recall (LLM-guided evaluation) | 75.89% |
| Mean LLM-as-a-Judge score (0–3) | 1.7150 |
| Mean Pair-BERTScore | 67.76 |
- Outperforms zero-shot LLaMA-3.1 and Gemini-1.5-flash baselines
- Outperforms fine-tuned LLaMA and Meditron baselines
- Clinically validated: LLM-as-a-Judge scores highly correlated with human physician evaluation
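For reference, here is a minimal sketch of a BERTScore-style comparison between a generated EC set and its registry reference, using the `bert-score` package; the embedding model, rescaling, and aggregation behind the reported numbers are not specified here and may differ.

```python
# BERTScore sketch: compare a generated EC set against the registry reference.
# The underlying embedding model and rescaling used for the reported scores may differ.
from bert_score import score

generated = ["Inclusion Criteria: Adults aged 18-75 with type 2 diabetes ..."]
reference = ["Inclusion Criteria: Age 18 to 75 years; documented type 2 diabetes ..."]

P, R, F1 = score(generated, reference, lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```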
Intended Use
- Assist researchers, trial designers, and sponsors in drafting clinical trial eligibility criteria.
- Automate EC generation to reduce manual effort and improve consistency.
- Support clinical trial design transparency and quality.
- Enable integration with trial registry platforms, clinical trial matching systems, and EC recommendation tools.
Limitations
- Requires human validation of generated EC before clinical use.
- Trained on public ClinicalTrials.gov data, so it may not generalize well to:
- Rare or novel diseases
- Specialized or non-standard trial designs
- Non-public trial data
- Optimized for English-language clinical trials.
- As with any LLM-based system, risks include hallucination, subtle errors, and domain shifts.
- Evaluation metrics (BERTScore, LLM-as-a-Judge) are proxies, not full substitutes for domain-expert review.
Acknowledgments
This model was developed with support from:
- RAVIS Technology (feedback and collaboration)
- Faculty of Medicine Ramathibodi Hospital
- NSTDA Supercomputer Center (ThaiSC), Project #pv814001
We also acknowledge the broader open-source community, whose tools and prior work on RAFT, SciNCL, LoRA, LLaMA-3, and biomedical NLP made this project possible.