---
license: apache-2.0
base_model:
  - microsoft/Phi-3.5-vision-instruct
tags:
  - medical-vision-language-model
  - chest-xray
  - preference-optimization
metrics:
  - accuracy
  - f1
language:
  - en
pipeline_tag: visual-question-answering
library_name: transformers
---
# CheX-Phi3.5V — Preference-Optimised Vision-Language Model for Chest X-ray Interpretation
CheX-Phi3.5V is a vision–language model (VLM) that answers clinical questions about chest radiographs and generates structured findings reports.
Built on Phi-3.5 Vision-Instruct (4.2 B parameters), it applies Direct Preference Optimization (DPO) and contrastive rejection learning to achieve fine-grained medical reasoning while suppressing hallucinations.
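For orientation, the sketch below shows the DPO objective in PyTorch-style code. It is illustrative only; the function name, tensor shapes, and the `beta = 0.1` default are assumptions, not the repository's actual training code.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Illustrative DPO objective. Each argument is a tensor of summed
    log-probabilities of a chosen/rejected answer under the trainable
    policy or the frozen reference model; beta scales the implicit reward."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximise the margin between preferred and dispreferred answers
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

In practice, off-the-shelf implementations such as TRL's `DPOTrainer` provide this objective; the sketch only shows what is being optimised.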
## Key Features
| Aspect | Description | 
|---|---|
| Modality | Single-image chest radiography (frontal & lateral) | 
| Tasks | Visual Question Answering (VQA) & Findings generation | 
| Backbone | Phi-3.5 Vision (4.2 B) with an enhanced visual projection layer | 
| Optimisation | 2-stage SFT → DPO + contrastive rejection learning | 
| License | Apache 2.0 — free for research and commercial use | 
## Quick Start
```python
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id  = "remove4anonymous/CheX-Phi-3.5-vision-instruct-DPO"
# Phi-3.5-vision checkpoints ship custom modelling code, hence trust_remote_code
model     = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# The processor expects a PIL image rather than a file path
image = Image.open("example_frontal.jpg")

inputs = processor(
    images=image,
    text="Question: What abnormalities are present?\nAnswer:",
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(generated[0], skip_special_tokens=True))
```
### Dependencies
```bash
pip install "transformers>=4.41.0" timm accelerate
```

For batch inference or a Streamlit demo, see the scripts in the GitHub repo.
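Until you reach those scripts, a minimal sketch of batch-style inference is to loop over a folder of radiographs, reusing `model` and `processor` from Quick Start; the `radiographs/` folder and the question string are placeholders.

```python
from pathlib import Path
from PIL import Image

question = "Question: What abnormalities are present?\nAnswer:"
answers = {}
for path in sorted(Path("radiographs").glob("*.jpg")):  # placeholder folder
    inputs = processor(images=Image.open(path), text=question,
                       return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    answers[path.name] = processor.decode(output[0], skip_special_tokens=True)
```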
## Available Checkpoints
| HF Repo | Stage | Recommended Use | 
|---|---|---|
| CheX-Phi3.5-vision-instruct-DPO | DPO | Production / evaluation | 
| CheX-Phi3.5-vision-instruct-SFT | SFT (phase 2) | Further preference tuning | 
| Phi-3.5-vision-instruct | Base | Custom fine-tuning | 
## Training Data & Procedure
| Stage | Data & Size | Objective | 
|---|---|---|
| SFT | 120 k QA triplets (MIMIC-EXT VQA) | Supervised instruction tuning | 
| DPO | 30 k preference-paired QA | Direct Preference Optimization | 
| Contrastive | 250 k unlabelled MIMIC-CXR images | Rejection learning to curb hallucinations | 
Hardware: 8 × A100 80 GB • FP16 • DeepSpeed ZeRO-3. Total steps ≈ 2.2 M.
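To make the DPO stage concrete, one preference-paired example could look like the record below; the field names, file path, and answer texts are hypothetical and may not match the schema used by the training scripts.

```python
# Hypothetical DPO preference record (field names are illustrative)
preference_example = {
    "image": "images/frontal_0001.jpg",  # placeholder path to the radiograph
    "prompt": "Question: Is there a pleural effusion?\nAnswer:",
    "chosen": "Yes. A small left-sided pleural effusion is present.",
    "rejected": "No acute cardiopulmonary abnormality is seen.",  # dispreferred answer
}
```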
## Evaluation
| Dataset | Split | Metric | Score | 
|---|---|---|---|
| MIMIC-CXR VQA | test | Accuracy | 0.894 | 
| OpenI CXR-QA | test | BLEU-4 | 79.4 | 
| Radiologist Turing Test | 200 cases | Pass rate | 61 % | 
Evaluation scripts are provided in `stage3_evaluate_mimic_ext_vqa.sh`.
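For a quick local sanity check (not a re-implementation of that script), an exact-match accuracy helper could be as simple as the sketch below; the official evaluation may normalise answers differently.

```python
def exact_match_accuracy(predictions, references):
    """Case-insensitive exact match over paired answer strings."""
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / max(len(references), 1)

print(exact_match_accuracy(["pleural effusion", "no"], ["Pleural effusion", "yes"]))  # 0.5
```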
## Ethical & Safety Considerations
- Clinical usage — Outputs are assistive only; a certified radiologist must confirm findings.
- Bias — Training data are skewed towards North-American adult populations; performance may degrade on paediatric or non-Western cohorts.
- Privacy — MIMIC-CXR is fully de-identified; the model does not memorise PHI.
- Hallucinations — Contrastive rejection reduces but does not eliminate false positives; apply confidence thresholds (see the sketch after this list).
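One simple way to apply such a threshold, reusing `model` and `inputs` from Quick Start, is to gate on the mean token log-probability of the generated answer; the 0.5 probability cut-off below is an arbitrary example, not a validated operating point.

```python
import math

outputs = model.generate(**inputs, max_new_tokens=128,
                         return_dict_in_generate=True, output_scores=True)
# Per-token log-probabilities of the generated answer
token_scores = model.compute_transition_scores(outputs.sequences, outputs.scores,
                                               normalize_logits=True)
mean_logprob = token_scores.float().mean().item()
if math.exp(mean_logprob) < 0.5:  # arbitrary example threshold
    print("Low-confidence answer; flag for radiologist review.")
```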
## Known Limitations
- No generalisation to CT, MRI, or ultrasound modalities.
- Sensitive to extreme noise & portable AP projections.
- Knowledge cutoff: March 2023; newly described conditions may be missed.
## Resources
- Code & training scripts — https://github.com/remove4anonymous/CheX-Phi35V
- Data utilities — `tools/generate_visual_prompt.py`
- Demo script — `demo.py`
## Citation
```bibtex
@misc{liu2025chexphi35v,
  title        = {CheX-Phi3.5V: Preference-Optimised Vision-Language Model for Chest X-ray Interpretation},
  author       = {Liu, Xiao and Others},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/remove4anonymous/CheX-Phi-3.5-vision-instruct-DPO}}
}
```
If you use CheX-Phi3.5V, please cite us and consider sharing your downstream results!