---
license: apache-2.0
base_model:
- microsoft/Phi-3.5-vision-instruct
tags:
- medical-vision-language-model
- chest-xray
- preference-optimization
metrics:
- accuracy
- f1
language:
- en
pipeline_tag: visual-question-answering
library_name: transformers
---
# CheX-Phi3.5V — Preference-Optimised Vision-Language Model for Chest X-ray Interpretation
**CheX-Phi3.5V** is a vision–language model (VLM) that answers clinical questions about chest radiographs and generates structured findings reports.
Built on **Phi-3.5-vision-instruct (4.2 B parameters)**, it combines **Direct Preference Optimization (DPO)** with **contrastive rejection learning** to achieve fine-grained medical reasoning while suppressing hallucinations.
---
## Key Features
| Aspect | Description |
|-----------------|------------------------------------------------------------------------------------------------|
| **Modality** | Single-image chest radiography (frontal & lateral) |
| **Tasks** | Visual Question Answering (VQA) & Findings generation |
| **Backbone**    | Phi-3.5-vision-instruct (4.2 B) with an enhanced visual projection layer                        |
| **Optimisation**| 2-stage SFT → DPO + contrastive rejection learning |
| **License** | Apache 2.0 — free for research **and** commercial use |
---
## Quick Start
```python
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "remove4anonymous/CheX-Phi-3.5-vision-instruct-DPO"

# The Phi-3.5-vision backbone ships custom modelling code, so trust_remote_code is required.
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# The processor expects a loaded image, not a file path.
image = Image.open("example_frontal.jpg").convert("RGB")
inputs = processor(
    images=image,
    text="Question: What abnormalities are present?\nAnswer:",
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(generated[0], skip_special_tokens=True))
```
> **Dependencies**: `pip install "transformers>=4.41.0" timm accelerate`
> For batch inference or a Streamlit demo, see the scripts in the [GitHub repo](https://github.com/remove4anonymous/CheX-Phi35V).
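If you only need to run the checkpoint over a handful of studies without pulling the repo scripts, a minimal sequential loop like the sketch below is enough. The folder name and question string are placeholders, and `model` / `processor` are the objects created in the Quick Start above.

```python
from pathlib import Path

from PIL import Image

# Reuses `model` and `processor` from the Quick Start; the folder path is a placeholder.
question = "Question: What abnormalities are present?\nAnswer:"

for path in sorted(Path("studies").glob("*.jpg")):
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, text=question, return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=128)
    print(f"{path.name}: {processor.decode(generated[0], skip_special_tokens=True)}")
```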
---
## Available Checkpoints
| HF Repo | Stage | Recommended Use |
| ------------------------------------------------------------------------------------------------------------- | ------------- | ------------------------- |
| [`CheX-Phi3.5-vision-instruct-DPO`](https://huggingface.co/remove4anonymous/CheX-Phi-3.5-vision-instruct-DPO) | **DPO** | Production / evaluation |
| [`CheX-Phi3.5-vision-instruct-SFT`](https://huggingface.co/remove4anonymous/CheX-Phi-3.5-vision-instruct-SFT) | SFT (phase 2) | Further preference tuning |
| [`Phi-3.5-vision-instruct`](https://huggingface.co/remove4anonymous/Phi-3.5-vision-instruct) | Base | Custom fine-tuning |
---
## Training Data & Procedure
| Stage | Data & Size | Objective |
| --------------- | ----------------------------------- | ----------------------------------------- |
| **SFT** | 120 k QA triplets (`MIMIC-EXT VQA`) | Supervised instruction tuning |
| **DPO** | 30 k preference-paired QA | Direct Preference Optimization |
| **Contrastive** | 250 k unlabelled MIMIC-CXR images | Rejection learning to curb hallucinations |
*Hardware*: 8 × A100 80 GB • FP16 • DeepSpeed ZeRO-3
*Total steps* ≈ 2.2 M.
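For readers new to the preference stage, the snippet below is a minimal sketch of the standard DPO objective, computed from summed per-sequence log-probabilities under the trainable policy and the frozen reference model. It illustrates the loss only; the project's actual training scripts live in the GitHub repo.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over a batch of preference pairs.

    Each tensor has shape (batch,) and holds the summed log-probability of the
    chosen / rejected answer under the policy or the frozen reference model.
    `beta` scales how far the policy may drift from the reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(margin): small when the chosen answer out-scores the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy check: a policy that already prefers the chosen answer gets a low loss.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-18.0]),
                torch.tensor([-14.0]), torch.tensor([-15.0]))
print(round(loss.item(), 4))
```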
---
## Evaluation
| Dataset | Split | Metric | Score |
| ----------------------- | --------- | --------- | --------- |
| MIMIC-CXR VQA | test | Accuracy | **0.894** |
| OpenI CXR-QA | test | BLEU-4 | **79.4** |
| Radiologist Turing Test | 200 cases | Pass rate | 61 % |
Evaluation scripts are provided in `stage3_evaluate_mimic_ext_vqa.sh`.
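As a quick sanity check before running the official script, closed-ended VQA accuracy can be approximated by normalised exact match over the generated answers. The scoring rule below is an illustrative assumption, not the implementation used in `stage3_evaluate_mimic_ext_vqa.sh`.

```python
import re

def normalise(text: str) -> str:
    """Lower-case and strip punctuation / surplus whitespace before comparing answers."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9 ]", "", text.lower())).strip()

def exact_match_accuracy(predictions, references) -> float:
    """Fraction of (prediction, reference) answer pairs that match after normalisation."""
    hits = sum(normalise(p) == normalise(r) for p, r in zip(predictions, references))
    return hits / max(len(references), 1)

print(exact_match_accuracy(["No acute findings."], ["no acute findings"]))  # 1.0
```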
---
## Ethical & Safety Considerations
* **Clinical usage** — Outputs are *assistive* only; a certified radiologist must confirm findings.
* **Bias** — Training data are skewed towards North American adult populations; the model may underperform on paediatric or non-Western cohorts.
* **Privacy** — MIMIC-CXR is fully de-identified; the model does not memorise PHI.
* **Hallucinations** — Contrastive rejection reduces but does not eliminate false positives; gate low-confidence answers behind a threshold (see the sketch below).
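One lightweight way to apply such a threshold, reusing `model`, `processor`, and `inputs` from the Quick Start, is to average the per-token log-probabilities that `generate` can return. The cut-off value below is purely illustrative and should be calibrated on held-out data.

```python
# Illustrative cut-off; calibrate on a validation set before any clinical use.
MIN_MEAN_LOGPROB = -0.5

out = model.generate(
    **inputs,
    max_new_tokens=128,
    return_dict_in_generate=True,
    output_scores=True,
)
# Log-probability of each generated token under the model.
token_logprobs = model.compute_transition_scores(
    out.sequences, out.scores, normalize_logits=True
)
mean_logprob = token_logprobs[0].mean().item()

answer = processor.decode(out.sequences[0], skip_special_tokens=True)
if mean_logprob < MIN_MEAN_LOGPROB:
    print(f"[low confidence, refer to a radiologist] {answer}")
else:
    print(answer)
```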
---
## Known Limitations
1. No generalisation to CT, MRI, or ultrasound modalities.
2. Sensitive to extreme noise & portable AP projections.
3. Knowledge cutoff = Mar 2023; newly described conditions may be missed.
---
## Resources
* **Code & training scripts** — [https://github.com/remove4anonymous/CheX-Phi35V](https://github.com/remove4anonymous/CheX-Phi35V)
* **Data utilities** — `tools/generate_visual_prompt.py`
* **Demo notebook** — `demo.py`
---
## Citation
```bibtex
@misc{liu2025chexphi35v,
title = {CheX-Phi3.5V: Preference-Optimised Vision-Language Model for Chest X-ray Interpretation},
author = {Liu, Xiao and others},
year = {2025},
howpublished = {\url{https://huggingface.co/remove4anonymous/CheX-Phi-3.5-vision-instruct-DPO}}
}
```
> If you use **CheX-Phi3.5V**, please cite us and consider sharing your downstream results!