Model Card for EIRA-0.2

Bridging Text and Medical Imagery for Accurate Multimodal QA

This model integrates a Llama‑2 text backbone with a BLIP vision backbone to perform context‑aware question answering over medical images and text.

Model Details

Model Description

EIRA‑0.2 is a fine‑tuned multimodal model designed to answer free‑form questions about medical images (e.g., radiographs, histology slides) in conjunction with accompanying text. Internally, it uses:

  • A text decoder based on meta‑llama/Llama‑2‑7b‑hf (a decoder‑only causal language model), fine‑tuned for medical QA.

  • A vision encoder based on Salesforce/blip-image-captioning-base, fine‑tuned to extract descriptive features from medical imagery.

  • A fusion module that cross‑attends between vision features and text embeddings to generate coherent, context‑aware answers.

  • Developed by: BockBharath

  • Shared by: Shashidhar Sarvi and Sharvary H H

  • Model type: Multimodal Sequence‑to‑Sequence QA

  • Language(s): English

  • License: MIT

  • Finetuned from: meta‑llama/Llama‑2‑7b‑hf, Salesforce/blip-image-captioning-base

Uses

Direct Use

EIRA‑0.2 can be used out of the box through the Hugging Face image-text-to-text pipeline for question answering over medical images and accompanying text. It is intended for:

  • Clinical decision support by generating explanations of medical images.
  • Educational tools for medical students reviewing imaging cases.

Downstream Use

  • Further fine‑tuning on specialty subdomains (e.g., dermatology, pathology) to improve domain performance.
  • Integration into telemedicine platforms to assist remote diagnostics.

Out-of-Scope Use

  • Unsupervised generation of medical advice without expert oversight.
  • Non‑medical domains (the model’s vision backbone is specialized for medical imaging).

Bias, Risks, and Limitations

EIRA‑0.2 was trained on a curated set of medical textbooks and annotated imaging cases; it may underperform on rare pathologies or demographic groups under‑represented in the training data. Hallucination risk exists if the image context is ambiguous or incomplete.

Recommendations

  • Always validate model outputs with a qualified medical professional.
  • Use in conjunction with structured reporting tools to mitigate hallucinations.

How to Get Started with the Model

from transformers import pipeline

# Load the multimodal QA pipeline
eira = pipeline(
    task="image-text-to-text",
    model="BockBharath/EIRA-0.2",
    device=0  # set to -1 for CPU
)

# Example inputs
image_path = "chest_xray.png"
question = "What abnormality is visible in the left lung?"

# Run inference
answer = eira(images=image_path, text=question)
print("Answer:", answer[0]["generated_text"])

Input shapes:

  • image: file path or PIL.Image of variable size (automatically resized to 224×224).
  • text: string question.

Output: List of dicts with key "generated_text" containing the answer string.
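
A PIL.Image can be passed in place of a file path. The snippet below is a small illustrative variation on the example above (the question shown is hypothetical); it reuses the eira pipeline object already defined:

from PIL import Image

# Equivalent call with an in-memory image instead of a file path
image = Image.open("chest_xray.png").convert("RGB")
answer = eira(images=image, text="Is there evidence of pleural effusion?")
print("Answer:", answer[0]["generated_text"])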

Training Details

Training Data

  • Sources: 500+ medical imaging cases (X‑rays, CT, MRI) paired with expert Q&A, and 100 clinical chapters from open‑access medical textbooks.
  • Preprocessing:
    • Images resized to 224×224; normalized to ImageNet statistics.
    • Text tokenized with Llama tokenizer, max length 512 tokens.
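
The preprocessing above can be approximated with torchvision transforms and the Llama tokenizer. This is a sketch of the stated steps, not the released training code:

from PIL import Image
from torchvision import transforms
from transformers import AutoTokenizer

# Stated preprocessing: 224×224 resize, ImageNet normalization,
# and Llama tokenization truncated to 512 tokens.
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

pixel_values = image_transform(Image.open("chest_xray.png").convert("RGB"))  # (3, 224, 224)
tokens = tokenizer("What abnormality is visible in the left lung?",
                   truncation=True, max_length=512, return_tensors="pt")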

Training Procedure

  • Mixed‑precision (fp16) fine‑tuning.
  • Hardware: Single NVIDIA T4 GPU on Kaggle.
  • Batch size: 16 (per GPU)
  • Learning rate: 3e‑5 with linear warmup over 500 steps.
  • Epochs: 5
  • Total time: ~48 hours
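
These hyperparameters map directly onto a standard Hugging Face Trainer configuration. The sketch below shows that mapping only; the model, dataset, and collator names are placeholders rather than the released training script:

from transformers import TrainingArguments, Trainer

# Hyperparameters taken from the list above; everything else is illustrative.
training_args = TrainingArguments(
    output_dir="eira-0.2-finetune",
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    warmup_steps=500,
    num_train_epochs=5,
    fp16=True,                    # mixed-precision fine-tuning
    lr_scheduler_type="linear",   # linear schedule with warmup
    logging_steps=50,
)

trainer = Trainer(
    model=model,                  # placeholder: the fused Llama-2 + BLIP model
    args=training_args,
    train_dataset=train_dataset,  # placeholder: paired image/QA examples
    data_collator=collator,       # placeholder: multimodal collator
)
trainer.train()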

Evaluation

Testing Data, Factors & Metrics

  • Test set: 100 unseen imaging cases with 3 expert‑provided QA pairs each.
  • Metrics:
    • Exact Match (EM) on key findings: 72.4%
    • BLEU‑4 for answer fluency: 0.38
    • ROUGE‑L for content overlap: 0.46
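
These scores can be recomputed with the Hugging Face evaluate library. The sketch below uses a single illustrative prediction/reference pair, not the held-out test set:

import evaluate

# Metrics reported in this card: exact match, BLEU-4, ROUGE-L.
exact_match = evaluate.load("exact_match")
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

predictions = ["consolidation in the left lower lobe"]
references = ["left lower lobe consolidation"]

print(exact_match.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=references, max_order=4))
print(rouge.compute(predictions=predictions, references=references))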

Results

| Metric      | Score |
|-------------|-------|
| Exact Match | 72.4% |
| BLEU‑4      | 0.38  |
| ROUGE‑L     | 0.46  |

Subgroup Analysis

Performance on chest X‑rays vs. histology slides:

  • Chest X‑ray EM: 75.1%
  • Histology EM: 68.0%

Environmental Impact

  • Hardware Type: NVIDIA T4 GPU
  • Training Hours: ~48
  • Compute Region: us‑central1
  • Estimated CO₂eq: ~6 kg (via the ML CO₂ Impact calculator)

Technical Specifications

Model Architecture and Objective

  • Text backbone: 7B‑parameter Llama 2 decoder‑only transformer.
  • Vision backbone: BLIP ViT image encoder with a transformer head.
  • Fusion: Cross‑attention layers interleaved with decoder blocks.
  • Objective: Minimize cross‑entropy on ground‑truth answers.
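
The fusion design can be illustrated with a minimal PyTorch sketch. The hidden sizes (4096 for the Llama‑2‑7B decoder, 768 for the BLIP ViT image encoder) and the single-block structure are assumptions for illustration, not the exact released implementation:

import torch.nn as nn

# Sketch of one cross-attention fusion block: text hidden states attend
# over projected vision features, with a residual connection and LayerNorm.
class CrossAttentionFusion(nn.Module):
    def __init__(self, text_dim=4096, vision_dim=768, num_heads=8):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, text_dim)  # align feature widths
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_states, vision_features):
        # text_states:     (batch, text_len, text_dim)        from the Llama decoder
        # vision_features: (batch, num_patches, vision_dim)   from the BLIP encoder
        vision = self.vision_proj(vision_features)
        attended, _ = self.cross_attn(query=text_states, key=vision, value=vision)
        return self.norm(text_states + attended)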

Compute Infrastructure

  • Hardware: Single NVIDIA T4 GPU (16 GB VRAM)
  • Software: PyTorch 2.0, Transformers 4.x, Accelerate

Citation

If you use this model, please cite:

@misc{bockbharath2025eira02,
  title={EIRA-0.2: Multimodal Medical QA with Llama-2 and BLIP},
  author={BockBharath},
  year={2025},
  howpublished={\url{https://huggingface.co/BockBharath/EIRA-0.2}}
}

BockBharath. (2025). EIRA-0.2: Multimodal Medical QA with Llama-2 and BLIP. Retrieved from https://huggingface.co/BockBharath/EIRA-0.2

Model Card Authors

  • BockBharath
  • EIRA Project Team (Sharvary H H, Shashidhar Sarvi)

Model Card Contact

For questions or feedback, please open an issue on the GitHub repository.
