---
license: mit
tags:
  - visual-question-answering
  - VQA
  - vilt
  - transformer
  - vision-language
  - inclusive-ai
datasets:
  - vizwiz
language:
  - en
library_name: transformers
pipeline_tag: visual-question-answering
model-index:
  - name: vilt-vqa-vizwiz
    results:
      - task:
          type: visual-question-answering
          name: Visual Question Answering
        dataset:
          name: VizWiz
          type: vizwiz
        metrics:
          - name: Accuracy
            type: accuracy
            value: 29.01
          - name: BLEU-1
            type: bleu
            value: 0.3017
---

# ViLT VQA (Fine-tuned on VizWiz)

This model is a fine-tuned version of ViLT (Vision-and-Language Transformer) on the VizWiz dataset—a collection of real-world visual questions submitted by blind and visually impaired users.

ViLT is a lightweight and efficient vision-language model that processes text tokens and image patch embeddings jointly in a single transformer encoder, without a separate deep visual feature extractor (such as a CNN backbone or a region-based detector). This design results in faster inference and lower computational cost.
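
To illustrate the single-stream design, the minimal sketch below (a blank dummy image and an arbitrary 384×384 size are used purely for illustration) shows that the processor packs text tokens and raw pixel values into one set of inputs for the same encoder:

```python
from transformers import ViltProcessor
from PIL import Image

processor = ViltProcessor.from_pretrained("Zagarsuren/vilt-finetuned-vizwiz")

# A blank RGB image stands in for a real photo; the size is arbitrary here.
dummy_image = Image.new("RGB", (384, 384))
encoding = processor(dummy_image, "What is in the picture?", return_tensors="pt")

# Text tokens and raw pixel values are fed to the same encoder in one pass;
# there is no separate visual feature-extraction stage.
print(list(encoding.keys()))
```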

## Model Details

- Base Model: [dandelin/vilt-b32-finetuned-vqa](https://huggingface.co/dandelin/vilt-b32-finetuned-vqa) (see the label-space check after this list)
- Fine-tuned on: a sample of the VizWiz VQA dataset
- Framework: Hugging Face Transformers (PyTorch)
- Use Case: assistive VQA systems for accessibility and inclusion
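
The answer vocabulary of the classification head normally comes from the base checkpoint's config; whether fine-tuning changed the label set is not stated here, so the minimal sketch below simply checks it directly:

```python
from transformers import ViltForQuestionAnswering

model = ViltForQuestionAnswering.from_pretrained("Zagarsuren/vilt-finetuned-vizwiz")

# Size of the answer vocabulary used by the classification head.
print(model.config.num_labels)

# A few example labels from the id-to-answer mapping stored in the config.
print([model.config.id2label[i] for i in range(5)])
```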

## Intended Use

This model is designed for visual question answering in practical, assistive settings, such as answering questions about photos taken by blind and low-vision users. It is suitable for low-latency deployments where inference speed is critical; for quick prototyping, see the pipeline sketch below.
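
A minimal sketch using the high-level Transformers pipeline API (the image path is a placeholder you must replace, and it assumes the checkpoint loads cleanly into the visual-question-answering pipeline):

```python
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="Zagarsuren/vilt-finetuned-vizwiz")

# "photo.jpg" is a placeholder; point it at a real image file.
result = vqa(image="photo.jpg", question="What colour is the jacket?")
print(result)  # list of {"answer": ..., "score": ...} dicts, highest score first
```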

## Example Usage

```python
from transformers import ViltProcessor, ViltForQuestionAnswering
from PIL import Image
import requests

# Load the processor (tokenizer + image processor) and the fine-tuned model.
processor = ViltProcessor.from_pretrained("Zagarsuren/vilt-finetuned-vizwiz")
model = ViltForQuestionAnswering.from_pretrained("Zagarsuren/vilt-finetuned-vizwiz")

# Any RGB image works; the URL below is a placeholder.
image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw).convert("RGB")
question = "What colour is the jacket?"

# Encode the image and question together, run the model, and decode the top answer.
encoding = processor(image, question, return_tensors="pt")
outputs = model(**encoding)
predicted_answer = model.config.id2label[outputs.logits.argmax(-1).item()]
print(predicted_answer)
```
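
Continuing from the snippet above, the logits can also be turned into ranked answers with confidence scores (a minimal sketch; `model` and `encoding` are the objects created above, and the choice of top 5 is arbitrary):

```python
import torch

with torch.no_grad():
    logits = model(**encoding).logits

# Convert logits to probabilities and list the five most likely answers.
probs = logits.softmax(dim=-1)[0]
top = torch.topk(probs, k=5)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{model.config.id2label[idx]}: {score:.3f}")
```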

## Evaluation Results

| Metric              | Score    |
|---------------------|----------|
| Accuracy            | 29.01%   |
| BLEU-1              | 0.3017   |
| Avg. response time  | 13.93 ms |

ViLT offers faster inference than larger vision-language models (e.g. Florence-2), making it well suited to edge deployment and resource-constrained environments; the sketch below shows one way to measure latency on your own hardware. However, its accuracy is comparatively lower on unanswerable questions and complex reasoning tasks.
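
A rough latency check (a minimal sketch; it reuses `model` and `encoding` from the usage example above, the 50-iteration count is arbitrary, and the 13.93 ms figure reported here will not necessarily be reproduced on other hardware):

```python
import time
import torch

model.eval()
with torch.no_grad():
    # Warm-up pass so one-off setup costs are not counted.
    model(**encoding)

    runs = 50
    start = time.perf_counter()
    for _ in range(runs):
        model(**encoding)
    elapsed_ms = (time.perf_counter() - start) * 1000 / runs

print(f"Average latency: {elapsed_ms:.2f} ms per question")
```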

## Limitations

- Weaker performance on complex and compositional reasoning
- Struggles with the low-quality or cluttered images typical of VizWiz
- May produce unreliable answers for ambiguous or unanswerable questions (see the confidence-threshold sketch after this list)
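
One pragmatic mitigation for the last point is to treat low-confidence predictions as unanswerable. A minimal sketch (the 0.3 threshold is an illustrative value, not a tuned one; `model` and `encoding` come from the usage example above):

```python
import torch

with torch.no_grad():
    probs = model(**encoding).logits.softmax(dim=-1)[0]

best_score, best_idx = probs.max(dim=-1)

# Fall back to an explicit "unanswerable" response when confidence is low.
THRESHOLD = 0.3  # illustrative value only; tune on a validation split
if best_score.item() < THRESHOLD:
    print("unanswerable (low confidence)")
else:
    print(model.config.id2label[best_idx.item()], round(best_score.item(), 3))
```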

## Citation

If you use this model, please cite:

```bibtex
@misc{sukhbaatar2025visionaidvqa,
  title={VisionAid-VQA: Inclusive Visual Question Answering Using Deep Learning and Multimodal Attention Mechanisms},
  author={Zagarsuren Sukhbaatar},
  year={2025},
  url={https://huggingface.co/Zagarsuren/vilt-finetuned-vizwiz}
}
```

## License

This model is released under the MIT License.