πŸ§ͺ Phi-2 + LoRA Fine-Tuned on PubMedQA (pqa_labeled)

This model is a LoRA-adapted version of microsoft/phi-2, fine-tuned using Unsloth on the pqa_labeled subset of the PubMedQA dataset.

🧠 Task

Given a biomedical research question and an abstract, the model answers:

  • yes
  • no
  • maybe

This is framed as a 3-class classification problem for biomedical question answering.

βœ… Summary

This work demonstrates the feasibility of using a general-purpose text generation model (Phi-2), augmented with lightweight LoRA adapters, to perform medical question classification on the PubMedQA dataset.

Instead of using specialized biomedical models or classification heads, we reframe the task as text generation β€” prompting the model to generate "yes", "no", or "maybe" answers. Despite its small size and lack of biomedical pretraining, the Phi-2 model, fine-tuned via LoRA on just ~1,000 examples, achieves a competitive 62.2% test accuracy.

This approach bridges the gap between generative language modeling and medical classification, offering a reproducible, efficient, and surprisingly strong baseline β€” particularly valuable for low-resource, edge, or clinical applications.
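
A minimal sketch of this generative framing (the prompt template and label parsing below are illustrative assumptions, not necessarily the exact ones used in training):

def build_prompt(question: str, abstract: str) -> str:
    # Hypothetical prompt layout: context first, then the question,
    # then a constrained answer slot.
    return (
        f"Abstract: {abstract}\n"
        f"Question: {question}\n"
        "Answer (yes/no/maybe):"
    )

def parse_answer(generated: str) -> str:
    # Map free-form generated text back onto the three labels.
    text = generated.strip().lower()
    for label in ("yes", "no", "maybe"):
        if text.startswith(label):
            return label
    return "maybe"  # fallback for off-format output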

Key Highlights:

  • βœ… General-purpose Phi-2 used without biomedical pretraining
  • βœ… Uses generative decoding instead of classification heads
  • βœ… Efficient LoRA fine-tuning with just ~1k examples
  • βœ… Achieves 62.2% accuracy on PubMedQA test set
  • βœ… Low compute footprint (4-bit support)

πŸ“Š Dataset Split

  • Total entries: 1000
  • Train: 800 (80%)
  • Validation: 100 (10%)
  • Test: 100 (10%)

The validation set was used for best-checkpoint selection; the results below are on the held-out test set.
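
A minimal sketch of reproducing an 80/10/10 split with the datasets library (the hub path and seed are assumptions):

from datasets import load_dataset

# pqa_labeled contains the 1,000 expert-annotated PubMedQA examples.
ds = load_dataset("qiaojin/PubMedQA", "pqa_labeled", split="train")
split = ds.train_test_split(test_size=0.2, seed=42)               # 800 train / 200 holdout
holdout = split["test"].train_test_split(test_size=0.5, seed=42)  # 100 val / 100 test
train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]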

πŸ“ˆ Evaluation Metrics

Final model performance:

Label   Precision  Recall  F1 Score  Support
yes     0.70       0.95    0.81      59
no      1.00       0.07    0.13      29
maybe   0.19       0.30    0.23      10
  • Accuracy: 62.24%
  • Macro F1: 0.39
  • Weighted F1: 0.55

βœ… The model is notably strong on "yes" predictions; future work will target improvements on "no" and "maybe".
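
These metrics can be reproduced with scikit-learn once the generated answers are parsed into labels (a minimal sketch with placeholder predictions):

from sklearn.metrics import accuracy_score, classification_report

# Placeholder lists; in practice y_true comes from the test split and
# y_pred from the model's parsed generations.
y_true = ["yes", "no", "maybe", "yes"]
y_pred = ["yes", "yes", "maybe", "yes"]

print(classification_report(y_true, y_pred, labels=["yes", "no", "maybe"], digits=2))
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2%}")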


πŸ‹οΈ Training Configuration

  • Framework: PyTorch + PEFT + Unsloth
  • Base Model: microsoft/phi-2
  • PEFT Method: LoRA
  • Epochs: 33 (best checkpoint selected by validation loss)
  • Batch Size: 16
  • Learning Rate: 2e-4
  • LoRA Config:
    • r: 16
    • alpha: 16
    • target_modules: ["q_proj", "v_proj"]
  • Loss curve: Validation loss improved steadily from 1.94 β†’ 1.30
  • NEFTune noise alpha: 5
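
A minimal sketch of how this configuration maps onto Unsloth and Hugging Face TrainingArguments (the values mirror the list above; everything else is an assumption about the original training script):

from unsloth import FastLanguageModel
from transformers import TrainingArguments

# Load the 4-bit base model and attach LoRA adapters (r=16, alpha=16).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "microsoft/phi-2",
    max_seq_length = 2048,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "v_proj"],
)

args = TrainingArguments(
    output_dir = "phi2-pubmedqa-lora",
    per_device_train_batch_size = 16,
    learning_rate = 2e-4,
    num_train_epochs = 33,
    neftune_noise_alpha = 5,  # NEFTune noise injection
)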

πŸ’Ύ Usage

With Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load base + LoRA adapter
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
model = PeftModel.from_pretrained(model, "ShahzebKhoso/phi2-pubmedqa-lora")
tokenizer = AutoTokenizer.from_pretrained("ShahzebKhoso/phi2-pubmedqa-lora")
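
A minimal generation sketch (the prompt template is an assumption; match the format used during fine-tuning):

# Hypothetical prompt; fill in a real question and abstract.
prompt = "Abstract: ...\nQuestion: ...\nAnswer (yes/no/maybe):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=3)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))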

With Unsloth

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "ShahzebKhoso/phi2-pubmedqa-lora",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)
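
After FastLanguageModel.for_inference(model), the adapter can be queried with the same model.generate(...) call shown in the Transformers example above.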

⚠️ Limitations

  • May overpredict "yes" answers due to class imbalance.
  • "no" and "maybe" classes need further refinement.
  • Not suitable for clinical use or medical decision-making.

πŸ“š Citation

@misc{shahzebkhoso2025phi2pubmedqa,
  title={Fine-tuning Phi-2 on PubMedQA with LoRA},
  author={Shahzeb Khoso},
  year={2025},
  howpublished={\url{https://huggingface.co/ShahzebKhoso/phi2-pubmedqa-lora}},
}
