# 🧪 Phi-2 + LoRA Fine-Tuned on PubMedQA (pqa_labeled)
This model is a LoRA-adapted version of [microsoft/phi-2](https://huggingface.co/microsoft/phi-2), fine-tuned using [Unsloth](https://github.com/unslothai/unsloth) on the `pqa_labeled` subset of the [PubMedQA](https://pubmedqa.github.io/) dataset.
## 🧠 Task
Given a biomedical research question and an abstract, the model answers one of:
- `yes`
- `no`
- `maybe`

This is framed as a 3-class classification problem for biomedical question answering.
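Because the task is solved by generation rather than a classification head, the label is read off from the model's completion. Below is a minimal sketch of one possible prompt format; the template itself is an assumption (the exact template used in fine-tuning is not documented here), while the question/abstract fields follow PubMedQA's schema:

```python
# Hypothetical prompt template; the actual template used during
# fine-tuning may differ.
def build_prompt(question: str, abstract: str) -> str:
    return (
        "Answer the biomedical question with yes, no, or maybe.\n\n"
        f"Abstract: {abstract}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```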
## ✅ Summary
This work demonstrates the feasibility of using a general-purpose text generation model (Phi-2), augmented with lightweight LoRA adapters, to perform medical question classification on the PubMedQA dataset.
Instead of using specialized biomedical models or classification heads, we reframe the task as text generation: the model is prompted to generate `yes`, `no`, or `maybe`. Despite its small size and lack of biomedical pretraining, Phi-2 fine-tuned via LoRA on just ~1,000 examples achieves a competitive 62.2% test accuracy.
This approach bridges generative language modeling and medical classification, offering a reproducible, efficient, and surprisingly strong baseline, particularly valuable in low-resource or edge settings.
**Key Highlights:**
- ✅ General-purpose Phi-2 used without biomedical pretraining
- ✅ Uses generative decoding instead of classification heads
- ✅ Efficient LoRA fine-tuning with just ~1k examples
- ✅ Achieves 62.2% accuracy on the PubMedQA test set
- ✅ Low compute footprint (4-bit support)
## 📊 Dataset Split
- Total entries: 1000
- Train: 800 (80%)
- Validation: 100 (10%)
- Test: 100 (10%)
The validation set was used for best-checkpoint selection; the results below are on the unseen test set.
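For reference, here is a minimal sketch of producing an 80/10/10 split of `pqa_labeled` with the 🤗 `datasets` library; the seed and the exact splitting procedure are assumptions, not necessarily this card's recipe:

```python
from datasets import load_dataset

# pqa_labeled contains exactly 1,000 expert-annotated examples.
ds = load_dataset("pubmed_qa", "pqa_labeled", split="train")

# 80/10/10 split; the seed here is illustrative, not from the card.
split = ds.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]
print(len(train_ds), len(val_ds), len(test_ds))  # 800 100 100
```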
## 📈 Evaluation Metrics
Final model performance:
| Label | Precision | Recall | F1 Score | Support |
|-------|-----------|--------|----------|---------|
| yes   | 0.70      | 0.95   | 0.81     | 59      |
| no    | 1.00      | 0.07   | 0.13     | 29      |
| maybe | 0.19      | 0.30   | 0.23     | 10      |
- Accuracy: 62.24%
- Macro F1: 0.39
- Weighted F1: 0.55
✅ Notably strong at `yes` predictions. Future work will target improvements on `no` and `maybe`.
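The table above follows the format of scikit-learn's `classification_report`; a sketch of computing it from gold and predicted labels (the label lists below are illustrative placeholders, not the actual evaluation data):

```python
from sklearn.metrics import classification_report

# Illustrative placeholders: in practice, y_true comes from the test split
# and y_pred from parsing the model's generated answers.
y_true = ["yes", "no", "maybe", "yes"]
y_pred = ["yes", "yes", "maybe", "yes"]

print(classification_report(y_true, y_pred, labels=["yes", "no", "maybe"], digits=2))
```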
## 🏋️ Training Configuration
- Framework: PyTorch + PEFT + Unsloth
- Base Model: `microsoft/phi-2`
- PEFT Method: LoRA
- Epochs: 33 (best checkpoint selected on validation loss)
- Batch Size: 16
- Learning Rate: 2e-4
- LoRA Config: `r = 16`, `alpha = 16`, `target_modules = ["q_proj", "v_proj"]`
- Loss Curve: validation loss improved steadily from 1.94 → 1.30
- NEFTune noise alpha: 5
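A minimal sketch of how this configuration might be wired together with Unsloth and TRL's `SFTTrainer`. Only the hyperparameters listed above come from this card; `train_ds`/`val_ds` (prompt-formatted splits), the `text` column name, and the output/save settings are assumptions, and exact argument names vary across `trl` versions:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load the base model in 4-bit to keep the memory footprint low.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "microsoft/phi-2",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Attach LoRA adapters to the attention projections, per the config above.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "v_proj"],
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_ds,      # assumed: prompt-formatted pqa_labeled split
    eval_dataset = val_ds,
    dataset_text_field = "text",   # assumed column name
    args = TrainingArguments(
        output_dir = "phi2-pubmedqa-lora",
        num_train_epochs = 33,
        per_device_train_batch_size = 16,
        learning_rate = 2e-4,
        neftune_noise_alpha = 5,         # NEFTune, as listed above
        evaluation_strategy = "epoch",
        save_strategy = "epoch",
        load_best_model_at_end = True,   # best checkpoint by validation loss
    ),
)
trainer.train()
```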
## 💾 Usage
### With Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base model, then attach the LoRA adapter
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
model = PeftModel.from_pretrained(model, "ShahzebKhoso/phi2-pubmedqa-lora")
tokenizer = AutoTokenizer.from_pretrained("ShahzebKhoso/phi2-pubmedqa-lora")
```
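A minimal inference sketch; the prompt template is an assumption (see the Task section) and the question text is a placeholder:

```python
prompt = (
    "Answer the biomedical question with yes, no, or maybe.\n\n"
    "Abstract: <paste the abstract here>\n\n"
    "Question: <paste the research question here>\n"
    "Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=3, do_sample=False)

# Decode only the newly generated tokens after the prompt.
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer.strip())  # expected: "yes", "no", or "maybe"
```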
### With Unsloth
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "ShahzebKhoso/phi2-pubmedqa-lora",
    max_seq_length = 2048,
    dtype = None,          # auto-detect
    load_in_4bit = True,   # low-memory 4-bit loading
)
FastLanguageModel.for_inference(model)
```
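`for_inference` switches the model into Unsloth's faster inference mode; generation then works the same way as in the Transformers example above.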
## ⚠️ Limitations
- May overpredict `yes` answers due to class imbalance.
- Performance on `no` and `maybe` is weak (F1 of 0.13 and 0.23, respectively) and needs further refinement.
- Not suitable for clinical use or medical decision-making.
## 📚 Citation
```bibtex
@misc{shahzebkhoso2025phi2pubmedqa,
  title={Fine-tuning Phi-2 on PubMedQA with LoRA},
  author={Shahzeb Khoso},
  year={2025},
  howpublished={\url{https://huggingface.co/shahzebkhoso/phi2-pubmedqa-lora}},
}
```
## ✨ Acknowledgements

- Built with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face [PEFT](https://github.com/huggingface/peft)
- Base model: [microsoft/phi-2](https://huggingface.co/microsoft/phi-2)
- Dataset: [PubMedQA](https://pubmedqa.github.io/) (`pqa_labeled`)