# 🧪 Phi-2 + LoRA Fine-Tuned on PubMedQA (pqa_labeled)
This model is a LoRA-adapted version of [microsoft/phi-2](https://huggingface.co/microsoft/phi-2), fine-tuned using [Unsloth](https://github.com/unslothai/unsloth) on the `pqa_labeled` subset of the [PubMedQA](https://pubmedqa.github.io/) dataset.
## 🧠 Task
Given a biomedical research question and an abstract, the model answers one of:
- `yes`
- `no`
- `maybe`

This is framed as a 3-class classification problem for biomedical question answering.
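Because the task is solved by generation rather than a classification head, the label is read off from the model's completion. Below is a minimal sketch of one possible prompt format; the template itself is an assumption (the exact template used in fine-tuning is not documented here), while the question/abstract fields follow PubMedQA's schema:

```python
# Hypothetical prompt template; the actual template used during
# fine-tuning may differ.
def build_prompt(question: str, abstract: str) -> str:
    return (
        "Answer the biomedical question with yes, no, or maybe.\n\n"
        f"Abstract: {abstract}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```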
## ✅ Summary
This work demonstrates the feasibility of using a general-purpose text generation model (Phi-2), augmented with lightweight LoRA adapters, to perform medical question classification on the PubMedQA dataset.
Instead of using specialized biomedical models or classification heads, we reframe the task as text generation: the model is prompted to generate `yes`, `no`, or `maybe`. Despite its small size and lack of biomedical pretraining, Phi-2 fine-tuned via LoRA on just ~1,000 examples achieves a competitive 62.2% test accuracy.
This approach bridges generative language modeling and medical classification, offering a reproducible, efficient, and surprisingly strong baseline, particularly valuable in low-resource or edge settings.
**Key Highlights:**
- ✅ General-purpose Phi-2 used without biomedical pretraining
- ✅ Uses generative decoding instead of classification heads
- ✅ Efficient LoRA fine-tuning with just ~1k examples
- ✅ Achieves 62.2% accuracy on the PubMedQA test set
- ✅ Low compute footprint (4-bit support)
## 📊 Dataset Split
- Total entries: 1000
- Train: 800 (80%)
- Validation: 100 (10%)
- Test: 100 (10%)
The validation set was used for best-checkpoint selection; the results below are on the unseen test set.
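For reference, here is a minimal sketch of producing an 80/10/10 split of `pqa_labeled` with the 🤗 `datasets` library; the seed and the exact splitting procedure are assumptions, not necessarily this card's recipe:

```python
from datasets import load_dataset

# pqa_labeled contains exactly 1,000 expert-annotated examples.
ds = load_dataset("pubmed_qa", "pqa_labeled", split="train")

# 80/10/10 split; the seed here is illustrative, not from the card.
split = ds.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]
print(len(train_ds), len(val_ds), len(test_ds))  # 800 100 100
```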
## 📈 Evaluation Metrics
Final model performance:
| Label | Precision | Recall | F1 Score | Support |
|-------|-----------|--------|----------|---------|
| yes   | 0.70      | 0.95   | 0.81     | 59      |
| no    | 1.00      | 0.07   | 0.13     | 29      |
| maybe | 0.19      | 0.30   | 0.23     | 10      |
- Accuracy: 62.24%
- Macro F1: 0.39
- Weighted F1: 0.55
✅ Notably strong at `yes` predictions. Future work will target improvements on `no` and `maybe`.
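The table above follows the format of scikit-learn's `classification_report`; a sketch of computing it from gold and predicted labels (the label lists below are illustrative placeholders, not the actual evaluation data):

```python
from sklearn.metrics import classification_report

# Illustrative placeholders: in practice, y_true comes from the test split
# and y_pred from parsing the model's generated answers.
y_true = ["yes", "no", "maybe", "yes"]
y_pred = ["yes", "yes", "maybe", "yes"]

print(classification_report(y_true, y_pred, labels=["yes", "no", "maybe"], digits=2))
```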
## 🏋️ Training Configuration
- Framework: PyTorch + PEFT + Unsloth
- Base Model: `microsoft/phi-2`
- PEFT Method: LoRA
- Epochs: 33 (best checkpoint selected on validation loss)
- Batch Size: 16
- Learning Rate: 2e-4
- LoRA Config: `r = 16`, `alpha = 16`, `target_modules = ["q_proj", "v_proj"]`
- Loss Curve: validation loss improved steadily from 1.94 → 1.30
- NEFTune noise alpha: 5
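A minimal sketch of how this configuration might be wired together with Unsloth and TRL's `SFTTrainer`. Only the hyperparameters listed above come from this card; `train_ds`/`val_ds` (prompt-formatted splits), the `text` column name, and the output/save settings are assumptions, and exact argument names vary across `trl` versions:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load the base model in 4-bit to keep the memory footprint low.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "microsoft/phi-2",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Attach LoRA adapters to the attention projections, per the config above.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "v_proj"],
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_ds,      # assumed: prompt-formatted pqa_labeled split
    eval_dataset = val_ds,
    dataset_text_field = "text",   # assumed column name
    args = TrainingArguments(
        output_dir = "phi2-pubmedqa-lora",
        num_train_epochs = 33,
        per_device_train_batch_size = 16,
        learning_rate = 2e-4,
        neftune_noise_alpha = 5,         # NEFTune, as listed above
        evaluation_strategy = "epoch",
        save_strategy = "epoch",
        load_best_model_at_end = True,   # best checkpoint by validation loss
    ),
)
trainer.train()
```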
## 💾 Usage
### With Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base model, then attach the LoRA adapter
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
model = PeftModel.from_pretrained(model, "ShahzebKhoso/phi2-pubmedqa-lora")
tokenizer = AutoTokenizer.from_pretrained("ShahzebKhoso/phi2-pubmedqa-lora")
```
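A minimal inference sketch; the prompt template is an assumption (see the Task section) and the question text is a placeholder:

```python
prompt = (
    "Answer the biomedical question with yes, no, or maybe.\n\n"
    "Abstract: <paste the abstract here>\n\n"
    "Question: <paste the research question here>\n"
    "Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=3, do_sample=False)

# Decode only the newly generated tokens after the prompt.
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer.strip())  # expected: "yes", "no", or "maybe"
```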
### With Unsloth
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "ShahzebKhoso/phi2-pubmedqa-lora",
    max_seq_length = 2048,
    dtype = None,          # auto-detect
    load_in_4bit = True,   # low-memory 4-bit loading
)
FastLanguageModel.for_inference(model)
```
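`for_inference` switches the model into Unsloth's faster inference mode; generation then works the same way as in the Transformers example above.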
## ⚠️ Limitations
- May overpredict `yes` answers due to class imbalance.
- Performance on `no` and `maybe` is weak (F1 of 0.13 and 0.23, respectively) and needs further refinement.
- Not suitable for clinical use or medical decision-making.
## 📚 Citation
```bibtex
@misc{shahzebkhoso2025phi2pubmedqa,
  title={Fine-tuning Phi-2 on PubMedQA with LoRA},
  author={Shahzeb Khoso},
  year={2025},
  howpublished={\url{https://huggingface.co/shahzebkhoso/phi2-pubmedqa-lora}},
}
```
## ✨ Acknowledgements

- Built with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face [PEFT](https://github.com/huggingface/peft)
- Base model: [microsoft/phi-2](https://huggingface.co/microsoft/phi-2)
- Dataset: [PubMedQA](https://pubmedqa.github.io/) (`pqa_labeled`)