# Model Card for EIRA-0.2

*Bridging Text and Medical Imagery for Accurate Multimodal QA*

This model integrates a Llama-2 text backbone with a BLIP vision backbone to perform context-aware question answering over medical images and text.
## Model Details

### Model Description

EIRA-0.2 is a fine-tuned multimodal model designed to answer free-form questions about medical images (e.g., radiographs, histology slides) in conjunction with accompanying text. Internally, it uses:

- A text decoder based on meta-llama/Llama-2-7b-hf, fine-tuned for medical QA.
- A vision encoder based on Salesforce/blip-image-captioning-base, fine-tuned to extract descriptive features from medical imagery.
- A fusion module that cross-attends between vision features and text embeddings to generate coherent, context-aware answers (a hedged sketch of this fusion step follows the list).
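The fusion implementation is not published with the card; the snippet below is a minimal PyTorch sketch of how cross-attention between projected BLIP vision features and Llama-2 hidden states could be wired. The class name `VisionTextFusion` and the dimensions (4096 for Llama-2-7B, 768 for a ViT-base vision encoder) are illustrative assumptions, not the released code.

```python
import torch
import torch.nn as nn

class VisionTextFusion(nn.Module):
    """Minimal sketch of a cross-attention fusion block (not the released implementation)."""

    def __init__(self, text_dim: int = 4096, vision_dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Project vision features into the text hidden size so they can be attended to.
        self.vision_proj = nn.Linear(vision_dim, text_dim)
        self.cross_attn = nn.MultiheadAttention(embed_dim=text_dim, num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_hidden: torch.Tensor, vision_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, text_len, text_dim); vision_feats: (batch, num_patches, vision_dim)
        vision_kv = self.vision_proj(vision_feats)
        attended, _ = self.cross_attn(query=text_hidden, key=vision_kv, value=vision_kv)
        # Residual connection keeps the original text representation intact.
        return self.norm(text_hidden + attended)

# Example with random tensors standing in for real encoder outputs.
fusion = VisionTextFusion()
text_hidden = torch.randn(1, 512, 4096)    # stand-in for Llama-2 hidden states
vision_feats = torch.randn(1, 197, 768)    # stand-in for BLIP patch embeddings
fused = fusion(text_hidden, vision_feats)  # (1, 512, 4096)
```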
- Developed by: BockBharath
- Shared by: Shashidhar Sarvi and Sharvary H H
- Model type: Multimodal sequence-to-sequence QA
- Language(s): English
- License: MIT
- Finetuned from: meta-llama/Llama-2-7b-hf, Salesforce/blip-image-captioning-base
### Model Sources
- Repository: https://github.com/BockBharath/EIRA-0.2
- Demo: https://huggingface.co/BockBharath/EIRA-0.2
## Uses

### Direct Use

EIRA-0.2 can be used out of the box as a Hugging Face pipeline for image-text-to-text question answering. It is intended for:
- Clinical decision support by generating explanations of medical images.
- Educational tools for medical students reviewing imaging cases.
### Downstream Use
- Further fine‑tuning on specialty subdomains (e.g., dermatology, pathology) to improve domain performance.
- Integration into telemedicine platforms to assist remote diagnostics.
### Out-of-Scope Use
- Unsupervised generation of medical advice without expert oversight.
- Non‑medical domains (the model’s vision backbone is specialized on medical imaging).
## Bias, Risks, and Limitations
EIRA‑0.2 was trained on a curated set of medical textbooks and annotated imaging cases; it may underperform on rare pathologies or demographic groups under‑represented in the training data. Hallucination risk exists if the image context is ambiguous or incomplete.
### Recommendations
- Always validate model outputs with a qualified medical professional.
- Use in conjunction with structured reporting tools to mitigate hallucinations.
## How to Get Started with the Model

```python
from transformers import pipeline

# Load the multimodal QA pipeline
eira = pipeline(
    task="image-text-to-text",
    model="BockBharath/EIRA-0.2",
    device=0,  # set to -1 for CPU
)

# Example inputs
image_path = "chest_xray.png"
question = "What abnormality is visible in the left lung?"

# Run inference
answer = eira(images=image_path, text=question)

print("Answer:", answer[0]["generated_text"])
```
Inputs:

- `image`: file path or `PIL.Image` of any size (automatically resized to 224×224).
- `text`: the question as a string.

Output: a list of dicts with key `"generated_text"` containing the answer string.
## Training Details

### Training Data

- Sources: 500+ medical imaging cases (X-rays, CT, MRI) paired with expert Q&A, and 100 clinical chapters from open-access medical textbooks.
- Preprocessing (a minimal code sketch follows this list):
  - Images resized to 224×224 and normalized to ImageNet statistics.
  - Text tokenized with the Llama tokenizer, max length 512 tokens.
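The preprocessing code itself is not published; the snippet below is a hedged reconstruction of the stated steps using standard torchvision and transformers utilities. The helper name `preprocess_example` is hypothetical.

```python
from PIL import Image
from torchvision import transforms
from transformers import AutoTokenizer

# Resize to 224x224 and normalize with ImageNet statistics, as described above.
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Llama tokenizer with a 512-token cap (gated repo: requires accepting the Llama 2 license).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def preprocess_example(image_path: str, question: str):
    """Illustrative helper: turn one (image, question) pair into model-ready tensors."""
    pixel_values = image_transform(Image.open(image_path).convert("RGB")).unsqueeze(0)
    text_inputs = tokenizer(question, max_length=512, truncation=True,
                            padding="max_length", return_tensors="pt")
    return pixel_values, text_inputs
```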
### Training Procedure

- Mixed-precision (fp16) fine-tuning.
- Hardware: single NVIDIA T4 GPU on Kaggle.
- Batch size: 16 per GPU.
- Learning rate: 3e-5 with linear warmup over 500 steps.
- Epochs: 5.
- Total time: ~48 hours (see the configuration sketch after this list).
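For orientation, the hyperparameters listed above map roughly onto a Hugging Face `TrainingArguments` configuration such as the one below. This is an illustrative reconstruction assuming the standard `Trainer` stack; `output_dir` and the logging/saving settings are arbitrary.

```python
from transformers import TrainingArguments

# Reconstruction of the reported settings: fp16, batch size 16,
# lr 3e-5 with linear schedule and 500 warmup steps, 5 epochs.
training_args = TrainingArguments(
    output_dir="eira-0.2-finetune",   # arbitrary output path
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=5,
    fp16=True,
    logging_steps=50,
    save_strategy="epoch",
)
```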
## Evaluation

### Testing Data, Factors & Metrics

- Test set: 100 unseen imaging cases, each with 3 expert-provided QA pairs.
- Metrics (a scoring sketch using the `evaluate` library follows this list):
  - Exact Match (EM) on key findings: 72.4%
  - BLEU-4 for answer fluency: 0.38
  - ROUGE-L for content overlap: 0.46
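The evaluation script is not included with the card; the snippet below shows one plausible way to compute the same three metrics with the `evaluate` library, assuming `predictions` and `references` are lists of answer strings. The toy lists here are placeholders, not the actual test set.

```python
import evaluate

# Placeholder predictions/references; replace with model outputs and expert answers.
predictions = ["mild cardiomegaly with clear lung fields"]
references = ["mild cardiomegaly with clear lung fields"]

exact_match = evaluate.load("exact_match")
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

em_score = exact_match.compute(predictions=predictions, references=references)["exact_match"]
bleu4 = bleu.compute(predictions=predictions, references=references, max_order=4)["bleu"]
rouge_l = rouge.compute(predictions=predictions, references=references)["rougeL"]

print(f"EM: {em_score:.3f}  BLEU-4: {bleu4:.3f}  ROUGE-L: {rouge_l:.3f}")
```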
### Results

| Metric      | Score |
|-------------|-------|
| Exact Match | 72.4% |
| BLEU-4      | 0.38  |
| ROUGE-L     | 0.46  |
#### Subgroup Analysis
Performance on chest X‑rays vs. histology slides:
- Chest X‑ray EM: 75.1%
- Histology EM: 68.0%
## Environmental Impact
- Hardware Type: NVIDIA T4 GPU
- Training Hours: ~48
- Compute Region: us‑central1
- Estimated CO₂eq: ~6 kg (using ML CO₂ impact calculator)
## Technical Specifications

### Model Architecture and Objective

- Text backbone: 7B-parameter Llama 2 decoder-only language model.
- Vision backbone: BLIP ViT image encoder with a transformer head.
- Fusion: cross-attention layers interleaved with decoder blocks.
- Objective: minimize cross-entropy on ground-truth answer tokens (see the loss sketch below).
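As a concrete illustration of this objective, the sketch below computes token-level cross-entropy on shifted answer labels, i.e. the standard causal-LM loss. The tensor shapes and the helper name `answer_cross_entropy` are assumptions for illustration, not EIRA's training code.

```python
import torch
import torch.nn.functional as F

def answer_cross_entropy(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size) from the fused decoder;
    # labels: (batch, seq_len) ground-truth answer token ids, -100 on masked positions.
    # Shift so that position t predicts token t+1 (standard causal-LM objective).
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # exclude padding/prompt tokens from the loss
    )
```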
### Compute Infrastructure
- Hardware: Single NVIDIA T4 GPU (16 GB VRAM)
- Software: PyTorch 2.0, Transformers 4.x, Accelerate
## Citation

If you use this model, please cite:

```bibtex
@misc{bockbharath2025eira02,
  title={EIRA-0.2: Multimodal Medical QA with Llama-2 and BLIP},
  author={BockBharath},
  year={2025},
  howpublished={\url{https://huggingface.co/BockBharath/EIRA-0.2}}
}
```

BockBharath. (2025). *EIRA-0.2: Multimodal Medical QA with Llama-2 and BLIP*. Retrieved from https://huggingface.co/BockBharath/EIRA-0.2
## Model Card Authors
- BockBharath
- EIRA Project Team (Sharvary H H, Shashidhar Sarvi)
## Model Card Contact
For questions or feedback, please open an issue on the GitHub repository.