EC-RAFT: Automated Generation of Clinical Trial Eligibility Criteria
Model Description
EC-RAFT is a Retrieval-Augmented Fine-Tuning (RAFT) model built on LLaMA-3.1-8B-Instruct.
It is designed to automatically generate structured, high-quality clinical trial eligibility criteria (EC) directly from trial titles and descriptions.
EC-RAFT integrates domain-specific retrieval with synthesized intermediate reasoning steps, enabling it to produce clinically relevant and contextually appropriate EC sets.
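A minimal inference sketch is shown below, assuming the weights are published on the Hugging Face Hub; the repository id `biodatlab/ec-raft` and the prompt wording are illustrative assumptions, so check the repository files for the canonical chat template and prompt format.

```python
# Minimal inference sketch (repository id and prompt are illustrative assumptions).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "biodatlab/ec-raft"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# The model takes only the trial title and description as input.
messages = [{
    "role": "user",
    "content": (
        "Generate the eligibility criteria for the following clinical trial.\n"
        "Title: A Phase 2 Study of Drug X in Adults With Type 2 Diabetes\n"
        "Description: A randomized, double-blind, placebo-controlled study ..."
    ),
}]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Greedy decoding is shown here for reproducibility; sampling parameters can be adjusted if more varied EC drafts are desired.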
Fine-tuning Details
- Original Model: LLaMA-3.1-8B-Instruct
- Datasets used for fine-tuning:
- ClinicalTrials.gov (267,347 trials, 2000–2024), available as biodatlab/ec-raft-dataset
- Retrieval corpus constructed using the SciNCL model (see the retrieval sketch after this list)
- Intermediate reasoning steps R generated using Gemini-1.5-flash-002
- Fine-tuning method:
- Retrieval-Augmented Fine-Tuning (RAFT)
- Low-Rank Adaptation (LoRA)
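The sketch below illustrates only the retrieval component, assuming the publicly released SciNCL checkpoint `malteos/scincl`, [CLS] pooling, and cosine similarity; the corpus fields, pooling strategy, and top-k used to build the EC-RAFT retrieval corpus may differ.

```python
# Retrieval sketch: embed trials with SciNCL and rank corpus trials by cosine similarity.
# The checkpoint id "malteos/scincl", CLS pooling, and top-k value are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

SCINCL_ID = "malteos/scincl"  # public SciNCL checkpoint (assumed)

tokenizer = AutoTokenizer.from_pretrained(SCINCL_ID)
encoder = AutoModel.from_pretrained(SCINCL_ID)
encoder.eval()

def embed(texts):
    """Encode trial title + description strings into dense vectors ([CLS] pooling)."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return torch.nn.functional.normalize(out.last_hidden_state[:, 0], dim=-1)

corpus = [
    "Trial A title. Trial A brief description ...",
    "Trial B title. Trial B brief description ...",
]
query = ["New trial title. New trial description ..."]

scores = embed(query) @ embed(corpus).T           # cosine similarity
topk = scores.topk(k=min(2, len(corpus)), dim=-1)
print(topk.indices.tolist(), topk.values.tolist())  # trials to attach to the prompt
```

In a RAFT-style setup, the top-ranked trials retrieved this way are appended to the training and inference prompts alongside the target trial's title and description.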
Model Performance
Evaluated on a held-out ClinicalTrials.gov test split:
| Metric | Score |
|---|---|
| BERTScore (semantic similarity) | 86.23 |
| Precision (LLM-guided evaluation) | 78.84% |
| Recall (LLM-guided evaluation) | 75.89% |
| Mean LLM-as-a-Judge score (0–3) | 1.7150 |
| Mean Pair-BERTScore | 67.76 |
- Outperforms zero-shot LLaMA-3.1 and Gemini-1.5-flash baselines
- Outperforms fine-tuned LLaMA and Meditron baselines
- Clinically validated: LLM-as-a-Judge scores highly correlated with human physician evaluation
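For reference, here is a minimal sketch of a BERTScore-style comparison between a generated EC set and its registry reference, using the `bert-score` package; the embedding model, rescaling, and aggregation behind the reported numbers are not specified here and may differ.

```python
# BERTScore sketch: compare a generated EC set against the registry reference.
# The underlying embedding model and rescaling used for the reported scores may differ.
from bert_score import score

generated = ["Inclusion Criteria: Adults aged 18-75 with type 2 diabetes ..."]
reference = ["Inclusion Criteria: Age 18 to 75 years; documented type 2 diabetes ..."]

P, R, F1 = score(generated, reference, lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```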
Intended Use
- Assist researchers, trial designers, and sponsors in drafting clinical trial eligibility criteria.
- Automate EC generation to reduce manual effort and improve consistency.
- Support clinical trial design transparency and quality.
- Enable integration with trial registry platforms, clinical trial matching systems, and EC recommendation tools.
Limitations
- Requires human validation of generated EC before clinical use.
- Trained on public ClinicalTrials.gov data, so it may not generalize well to:
- Rare or novel diseases
- Specialized or non-standard trial designs
- Non-public trial data
- Optimized for English-language clinical trials.
- As with any LLM-based system, risks include hallucination, subtle errors, and domain shifts.
- Evaluation metrics (BERTScore, LLM-as-a-Judge) are proxies, not full substitutes for domain-expert review.
Acknowledgments
This model was developed with support from:
- RAVIS Technology (feedback and collaboration)
- Faculty of Medicine Ramathibodi Hospital
- NSTDA Supercomputer Center (ThaiSC), Project #pv814001
We also acknowledge the broader open-source community, whose tools and prior work on RAFT, SciNCL, LoRA, LLaMA-3, and biomedical NLP made this project possible.