# Model Card for EIRA-0.2

*Bridging Text and Medical Imagery for Accurate Multimodal QA*

This model integrates a Llama-2 text backbone with a BLIP vision backbone to perform context-aware question answering over medical images and text.
## Model Details

### Model Description

EIRA-0.2 is a fine-tuned multimodal model designed to answer free-form questions about medical images (e.g., radiographs, histology slides) in conjunction with accompanying text. Internally, it uses:

- A text decoder based on meta-llama/Llama-2-7b-hf, fine-tuned for medical QA.
- A vision encoder based on Salesforce/blip-image-captioning-base, fine-tuned to extract descriptive features from medical imagery.
- A fusion module that cross-attends between vision features and text embeddings to generate coherent, context-aware answers (a hedged sketch of this fusion step follows the list).
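The fusion implementation is not published with the card; the snippet below is a minimal PyTorch sketch of how cross-attention between projected BLIP vision features and Llama-2 hidden states could be wired. The class name `VisionTextFusion` and the dimensions (4096 for Llama-2-7B, 768 for a ViT-base vision encoder) are illustrative assumptions, not the released code.

```python
import torch
import torch.nn as nn

class VisionTextFusion(nn.Module):
    """Minimal sketch of a cross-attention fusion block (not the released implementation)."""

    def __init__(self, text_dim: int = 4096, vision_dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Project vision features into the text hidden size so they can be attended to.
        self.vision_proj = nn.Linear(vision_dim, text_dim)
        self.cross_attn = nn.MultiheadAttention(embed_dim=text_dim, num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_hidden: torch.Tensor, vision_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, text_len, text_dim); vision_feats: (batch, num_patches, vision_dim)
        vision_kv = self.vision_proj(vision_feats)
        attended, _ = self.cross_attn(query=text_hidden, key=vision_kv, value=vision_kv)
        # Residual connection keeps the original text representation intact.
        return self.norm(text_hidden + attended)

# Example with random tensors standing in for real encoder outputs.
fusion = VisionTextFusion()
text_hidden = torch.randn(1, 512, 4096)    # stand-in for Llama-2 hidden states
vision_feats = torch.randn(1, 197, 768)    # stand-in for BLIP patch embeddings
fused = fusion(text_hidden, vision_feats)  # (1, 512, 4096)
```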
- Developed by: BockBharath
- Shared by: Shashidhar Sarvi and Sharvary H H
- Model type: Multimodal sequence-to-sequence QA
- Language(s): English
- License: MIT
- Finetuned from: meta-llama/Llama-2-7b-hf, Salesforce/blip-image-captioning-base
### Model Sources
- Repository: https://github.com/BockBharath/EIRA-0.2
- Demo: https://huggingface.co/BockBharath/EIRA-0.2
## Uses

### Direct Use

EIRA-0.2 can be used out of the box as a Hugging Face pipeline for image-text-to-text question answering. It is intended for:
- Clinical decision support by generating explanations of medical images.
- Educational tools for medical students reviewing imaging cases.
### Downstream Use
- Further fine‑tuning on specialty subdomains (e.g., dermatology, pathology) to improve domain performance.
- Integration into telemedicine platforms to assist remote diagnostics.
### Out-of-Scope Use
- Unsupervised generation of medical advice without expert oversight.
- Non‑medical domains (the model’s vision backbone is specialized on medical imaging).
## Bias, Risks, and Limitations
EIRA‑0.2 was trained on a curated set of medical textbooks and annotated imaging cases; it may underperform on rare pathologies or demographic groups under‑represented in the training data. Hallucination risk exists if the image context is ambiguous or incomplete.
### Recommendations
- Always validate model outputs with a qualified medical professional.
- Use in conjunction with structured reporting tools to mitigate hallucinations.
## How to Get Started with the Model

```python
from transformers import pipeline

# Load the multimodal QA pipeline
eira = pipeline(
    task="image-text-to-text",
    model="BockBharath/EIRA-0.2",
    device=0,  # set to -1 for CPU
)

# Example inputs
image_path = "chest_xray.png"
question = "What abnormality is visible in the left lung?"

# Run inference
answer = eira(images=image_path, text=question)

print("Answer:", answer[0]["generated_text"])
```
Inputs:

- `image`: file path or `PIL.Image` of any size (automatically resized to 224×224).
- `text`: the question as a string.

Output: a list of dicts with key `"generated_text"` containing the answer string.
## Training Details

### Training Data

- Sources: 500+ medical imaging cases (X-rays, CT, MRI) paired with expert Q&A, and 100 clinical chapters from open-access medical textbooks.
- Preprocessing (a minimal code sketch follows this list):
  - Images resized to 224×224 and normalized to ImageNet statistics.
  - Text tokenized with the Llama tokenizer, max length 512 tokens.
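The preprocessing code itself is not published; the snippet below is a hedged reconstruction of the stated steps using standard torchvision and transformers utilities. The helper name `preprocess_example` is hypothetical.

```python
from PIL import Image
from torchvision import transforms
from transformers import AutoTokenizer

# Resize to 224x224 and normalize with ImageNet statistics, as described above.
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Llama tokenizer with a 512-token cap (gated repo: requires accepting the Llama 2 license).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def preprocess_example(image_path: str, question: str):
    """Illustrative helper: turn one (image, question) pair into model-ready tensors."""
    pixel_values = image_transform(Image.open(image_path).convert("RGB")).unsqueeze(0)
    text_inputs = tokenizer(question, max_length=512, truncation=True,
                            padding="max_length", return_tensors="pt")
    return pixel_values, text_inputs
```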
### Training Procedure

- Mixed-precision (fp16) fine-tuning.
- Hardware: single NVIDIA T4 GPU on Kaggle.
- Batch size: 16 per GPU.
- Learning rate: 3e-5 with linear warmup over 500 steps.
- Epochs: 5.
- Total time: ~48 hours (see the configuration sketch after this list).
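For orientation, the hyperparameters listed above map roughly onto a Hugging Face `TrainingArguments` configuration such as the one below. This is an illustrative reconstruction assuming the standard `Trainer` stack; `output_dir` and the logging/saving settings are arbitrary.

```python
from transformers import TrainingArguments

# Reconstruction of the reported settings: fp16, batch size 16,
# lr 3e-5 with linear schedule and 500 warmup steps, 5 epochs.
training_args = TrainingArguments(
    output_dir="eira-0.2-finetune",   # arbitrary output path
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=5,
    fp16=True,
    logging_steps=50,
    save_strategy="epoch",
)
```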
## Evaluation

### Testing Data, Factors & Metrics

- Test set: 100 unseen imaging cases, each with 3 expert-provided QA pairs.
- Metrics (a scoring sketch using the `evaluate` library follows this list):
  - Exact Match (EM) on key findings: 72.4%
  - BLEU-4 for answer fluency: 0.38
  - ROUGE-L for content overlap: 0.46
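The evaluation script is not included with the card; the snippet below shows one plausible way to compute the same three metrics with the `evaluate` library, assuming `predictions` and `references` are lists of answer strings. The toy lists here are placeholders, not the actual test set.

```python
import evaluate

# Placeholder predictions/references; replace with model outputs and expert answers.
predictions = ["mild cardiomegaly with clear lung fields"]
references = ["mild cardiomegaly with clear lung fields"]

exact_match = evaluate.load("exact_match")
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

em_score = exact_match.compute(predictions=predictions, references=references)["exact_match"]
bleu4 = bleu.compute(predictions=predictions, references=references, max_order=4)["bleu"]
rouge_l = rouge.compute(predictions=predictions, references=references)["rougeL"]

print(f"EM: {em_score:.3f}  BLEU-4: {bleu4:.3f}  ROUGE-L: {rouge_l:.3f}")
```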
### Results

| Metric      | Score |
|-------------|-------|
| Exact Match | 72.4% |
| BLEU-4      | 0.38  |
| ROUGE-L     | 0.46  |
#### Subgroup Analysis
Performance on chest X‑rays vs. histology slides:
- Chest X‑ray EM: 75.1%
- Histology EM: 68.0%
## Environmental Impact
- Hardware Type: NVIDIA T4 GPU
- Training Hours: ~48
- Compute Region: us‑central1
- Estimated CO₂eq: ~6 kg (using ML CO₂ impact calculator)
## Technical Specifications

### Model Architecture and Objective

- Text backbone: 7B-parameter Llama 2 decoder-only language model.
- Vision backbone: BLIP ViT image encoder with a transformer head.
- Fusion: cross-attention layers interleaved with decoder blocks.
- Objective: minimize cross-entropy on ground-truth answer tokens (see the loss sketch below).
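As a concrete illustration of this objective, the sketch below computes token-level cross-entropy on shifted answer labels, i.e. the standard causal-LM loss. The tensor shapes and the helper name `answer_cross_entropy` are assumptions for illustration, not EIRA's training code.

```python
import torch
import torch.nn.functional as F

def answer_cross_entropy(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size) from the fused decoder;
    # labels: (batch, seq_len) ground-truth answer token ids, -100 on masked positions.
    # Shift so that position t predicts token t+1 (standard causal-LM objective).
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # exclude padding/prompt tokens from the loss
    )
```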
### Compute Infrastructure
- Hardware: Single NVIDIA T4 GPU (16 GB VRAM)
- Software: PyTorch 2.0, Transformers 4.x, Accelerate
## Citation

If you use this model, please cite:

```bibtex
@misc{bockbharath2025eira02,
  title={EIRA-0.2: Multimodal Medical QA with Llama-2 and BLIP},
  author={BockBharath},
  year={2025},
  howpublished={\url{https://huggingface.co/BockBharath/EIRA-0.2}}
}
```

BockBharath. (2025). *EIRA-0.2: Multimodal Medical QA with Llama-2 and BLIP*. Retrieved from https://huggingface.co/BockBharath/EIRA-0.2
## Model Card Authors
- BockBharath
- EIRA Project Team (Sharvary H H, Shashidhar Sarvi)
## Model Card Contact
For questions or feedback, please open an issue on the GitHub repository.