---
license: apache-2.0
base_model:
- microsoft/Phi-3.5-vision-instruct
tags:
- medical-vision-language-model
- chest-xray
- preference-optimization
metrics:
- accuracy
- f1
language:
- en
pipeline_tag: visual-question-answering
library_name: transformers
---
# CheX-Phi3.5V — Preference-Optimised Vision-Language Model for Chest X-ray Interpretation
**CheX-Phi3.5V** is a vision–language model (VLM) that answers clinical questions about chest radiographs and generates structured findings reports.
Built on **Phi-3.5-vision-instruct (4.2 B parameters)**, it combines **Direct Preference Optimization (DPO)** with **contrastive rejection learning** to achieve fine-grained medical reasoning while suppressing hallucinations.
---
## Key Features
| Aspect | Description |
|-----------------|------------------------------------------------------------------------------------------------|
| **Modality** | Single-image chest radiography (frontal & lateral) |
| **Tasks** | Visual Question Answering (VQA) & Findings generation |
| **Backbone**    | Phi-3.5-vision-instruct (4.2 B) with an enhanced visual projection layer                        |
| **Optimisation**| 2-stage SFT → DPO + contrastive rejection learning |
| **License** | Apache 2.0 — free for research **and** commercial use |
---
## Quick Start
```python
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "remove4anonymous/CheX-Phi-3.5-vision-instruct-DPO"

# The Phi-3.5-vision backbone ships custom modelling code, so trust_remote_code is required.
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# The processor expects a loaded image, not a file path.
image = Image.open("example_frontal.jpg").convert("RGB")
inputs = processor(
    images=image,
    text="Question: What abnormalities are present?\nAnswer:",
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(generated[0], skip_special_tokens=True))
```
> **Dependencies**: `pip install "transformers>=4.41.0" timm accelerate`
> For batch inference or a Streamlit demo, see the scripts in the [GitHub repo](https://github.com/remove4anonymous/CheX-Phi35V).
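If you only need to run the checkpoint over a handful of studies without pulling the repo scripts, a minimal sequential loop like the sketch below is enough. The folder name and question string are placeholders, and `model` / `processor` are the objects created in the Quick Start above.

```python
from pathlib import Path

from PIL import Image

# Reuses `model` and `processor` from the Quick Start; the folder path is a placeholder.
question = "Question: What abnormalities are present?\nAnswer:"

for path in sorted(Path("studies").glob("*.jpg")):
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, text=question, return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=128)
    print(f"{path.name}: {processor.decode(generated[0], skip_special_tokens=True)}")
```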
---
## Available Checkpoints
| HF Repo | Stage | Recommended Use |
| ------------------------------------------------------------------------------------------------------------- | ------------- | ------------------------- |
| [`CheX-Phi3.5-vision-instruct-DPO`](https://huggingface.co/remove4anonymous/CheX-Phi-3.5-vision-instruct-DPO) | **DPO** | Production / evaluation |
| [`CheX-Phi3.5-vision-instruct-SFT`](https://huggingface.co/remove4anonymous/CheX-Phi-3.5-vision-instruct-SFT) | SFT (phase 2) | Further preference tuning |
| [`Phi-3.5-vision-instruct`](https://huggingface.co/remove4anonymous/Phi-3.5-vision-instruct) | Base | Custom fine-tuning |
---
## Training Data & Procedure
| Stage | Data & Size | Objective |
| --------------- | ----------------------------------- | ----------------------------------------- |
| **SFT** | 120 k QA triplets (`MIMIC-EXT VQA`) | Supervised instruction tuning |
| **DPO** | 30 k preference-paired QA | Direct Preference Optimization |
| **Contrastive** | 250 k unlabelled MIMIC-CXR images | Rejection learning to curb hallucinations |
*Hardware*: 8 × A100 80 GB • FP16 • DeepSpeed ZeRO-3
*Total steps* ≈ 2.2 M.
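For readers new to the preference stage, the snippet below is a minimal sketch of the standard DPO objective, computed from summed per-sequence log-probabilities under the trainable policy and the frozen reference model. It illustrates the loss only; the project's actual training scripts live in the GitHub repo.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over a batch of preference pairs.

    Each tensor has shape (batch,) and holds the summed log-probability of the
    chosen / rejected answer under the policy or the frozen reference model.
    `beta` scales how far the policy may drift from the reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(margin): small when the chosen answer out-scores the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy check: a policy that already prefers the chosen answer gets a low loss.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-18.0]),
                torch.tensor([-14.0]), torch.tensor([-15.0]))
print(round(loss.item(), 4))
```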
---
## Evaluation
| Dataset | Split | Metric | Score |
| ----------------------- | --------- | --------- | --------- |
| MIMIC-CXR VQA | test | Accuracy | **0.894** |
| OpenI CXR-QA | test | BLEU-4 | **79.4** |
| Radiologist Turing Test | 200 cases | Pass rate | 61 % |
Evaluation scripts are provided in `stage3_evaluate_mimic_ext_vqa.sh`.
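As a quick sanity check before running the official script, closed-ended VQA accuracy can be approximated by normalised exact match over the generated answers. The scoring rule below is an illustrative assumption, not the implementation used in `stage3_evaluate_mimic_ext_vqa.sh`.

```python
import re

def normalise(text: str) -> str:
    """Lower-case and strip punctuation / surplus whitespace before comparing answers."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9 ]", "", text.lower())).strip()

def exact_match_accuracy(predictions, references) -> float:
    """Fraction of (prediction, reference) answer pairs that match after normalisation."""
    hits = sum(normalise(p) == normalise(r) for p, r in zip(predictions, references))
    return hits / max(len(references), 1)

print(exact_match_accuracy(["No acute findings."], ["no acute findings"]))  # 1.0
```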
---
## Ethical & Safety Considerations
* **Clinical usage** — Outputs are *assistive* only; a certified radiologist must confirm findings.
* **Bias** — Training data are skewed towards North American adult populations; the model may underperform on paediatric or non-Western cohorts.
* **Privacy** — MIMIC-CXR is fully de-identified; the model does not memorise PHI.
* **Hallucinations** — Contrastive rejection reduces but does not eliminate false positives; gate low-confidence answers behind a threshold (see the sketch below).
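One lightweight way to apply such a threshold, reusing `model`, `processor`, and `inputs` from the Quick Start, is to average the per-token log-probabilities that `generate` can return. The cut-off value below is purely illustrative and should be calibrated on held-out data.

```python
# Illustrative cut-off; calibrate on a validation set before any clinical use.
MIN_MEAN_LOGPROB = -0.5

out = model.generate(
    **inputs,
    max_new_tokens=128,
    return_dict_in_generate=True,
    output_scores=True,
)
# Log-probability of each generated token under the model.
token_logprobs = model.compute_transition_scores(
    out.sequences, out.scores, normalize_logits=True
)
mean_logprob = token_logprobs[0].mean().item()

answer = processor.decode(out.sequences[0], skip_special_tokens=True)
if mean_logprob < MIN_MEAN_LOGPROB:
    print(f"[low confidence, refer to a radiologist] {answer}")
else:
    print(answer)
```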
---
## Known Limitations
1. No generalisation to CT, MRI, or ultrasound modalities.
2. Sensitive to extreme noise & portable AP projections.
3. Knowledge cutoff = Mar 2023; newly described conditions may be missed.
---
## Resources
* **Code & training scripts** — [https://github.com/remove4anonymous/CheX-Phi35V](https://github.com/remove4anonymous/CheX-Phi35V)
* **Data utilities** — `tools/generate_visual_prompt.py`
* **Demo notebook** — `demo.py`
---
## Citation
```bibtex
@misc{liu2025chexphi35v,
title = {CheX-Phi3.5V: Preference-Optimised Vision-Language Model for Chest X-ray Interpretation},
author = {Liu, Xiao and others},
year = {2025},
howpublished = {\url{https://huggingface.co/remove4anonymous/CheX-Phi-3.5-vision-instruct-DPO}}
}
```
> If you use **CheX-Phi3.5V**, please cite us and consider sharing your downstream results!