πŸ¦™πŸ“· MicroLLaVA-Qwen3-0.6B-base-siglip2-so400m

A compact yet competitive open-source vision-language model trained from scratch on a single RTX 4090.

This is a ~1B-parameter model that performs on par with the original LLaVA-1.5-7B on VQAv2, making it a serious candidate for visual question answering tasks, both for everyday users and for researchers focused on efficient multimodal architectures.

It is also well suited for Edge AI applications, such as on-device visual question answering, thanks to its small size and fast inference performance.


πŸ“Œ Model Summary

keeeeenw/MicroLLaVA-Qwen3-0.6B-base-siglip2-so400m combines the strengths of:

  • Qwen3-0.6B (base) as the language model
  • SigLIP2 so400m as the vision encoder

Trained using a modified TinyLLaVA Factory fork, this model reaches 78.52% overall on VQAv2 test-dev:

  • πŸ₯ˆ Outperforming Zhang199/TinyLLaVA-Qwen2-0.5B-SigLIP: 72.33%
  • πŸ₯‰ On par with LLaVA-1.5-7B: ~78.5%

See https://huggingface.co/Zhang199/TinyLLaVA-Qwen2-0.5B-SigLIP#result for the evaluation results reported by the original TinyLLaVA and LLaVA authors.


πŸ–ΌοΈ Example Inference

Code:


# Tested with transformers 4.55.2, torch 2.8.0, and torchvision 0.23.0.
# Older versions of transformers do not support the Qwen3 architecture.
# pip install transformers torch torchvision

from transformers import AutoTokenizer, AutoModelForCausalLM

hf_path = 'keeeeenw/MicroLlava-Qwen3-0.6B-base-siglip2-so400m'
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
# model.cuda()  # optional; without CUDA, inference takes ~78 seconds on my AMD 5950 CPU
config = model.config
tokenizer = AutoTokenizer.from_pretrained(
    hf_path,
    use_fast=False,
    model_max_length=config.tokenizer_model_max_length,
    padding_side=config.tokenizer_padding_side)

prompt = "What are the things I should be cautious about when I visit here?"
image_url = "https://llava-vl.github.io/static/images/view.jpg"
output_text, generation_time = model.chat(prompt=prompt,
                                          image=image_url,
                                          tokenizer=tokenizer)

print('model output:', output_text)
print('running time:', generation_time)

Prompt:
🧠 "What are the things I should be cautious about when I visit here?"

Image:
(the image at image_url above: a wooden pier on a lake)

Model Output: When visiting this wooden pier on the lake, there are a few things to be cautious about. First, be aware of the water depth, as the water may be shallow or choppy, which could pose a risk of falling into the water or getting caught in a strong current. Second, be mindful of the weather conditions, as the image shows a cloudy day, which might indicate the possibility of rain or strong winds. These conditions can make the experience more challenging and potentially dangerous. Lastly, be cautious about the presence of any wildlife in the area, as the lake and its surroundings may attract various animals. Always follow safety guidelines and be respectful of the natural environment when visiting this picturesque location.
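

The chat() helper shown above can be reused directly for follow-up questions about the same image. A minimal sketch building on the snippet above (the question strings below are just illustrative):

# Reusing the same model, tokenizer, and image_url from the snippet above.
# Each call runs an independent generation for one question.
follow_up_prompts = [
    "Describe this place in one sentence.",
    "Is this scene more likely a lake or an ocean?",
]

for question in follow_up_prompts:
    answer, elapsed = model.chat(prompt=question,
                                 image=image_url,
                                 tokenizer=tokenizer)
    print('Q:', question)
    print('A:', answer)
    print('time:', elapsed)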


βš™οΈ Training Details

  • 🧠 Total parameters: ~1B
  • πŸ–₯️ Hardware: Single NVIDIA RTX 4090 (24GB VRAM)
  • ⏱️ Total training time: ~24 hours
    • Stage 1 (pretraining): ~8 hours
    • Stage 2 (fine-tuning): ~12 hours
  • 🧾 Training method: Follows the official TinyLLaVA two-stage instructions (see the conceptual sketch after this list)
    • ⚠️ ocr_vqa was omitted during fine-tuning due to dataset access issues (to be re-added in the next training run)
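
For readers new to the LLaVA-style two-stage recipe, here is a conceptual sketch in plain PyTorch. The module names, sizes, and learning rates are placeholders, not the actual TinyLLaVA Factory code: stage 1 pretrains only the vision-language connector with the vision encoder and LLM frozen, while stage 2 unfreezes the language model for instruction fine-tuning (the vision encoder is kept frozen in this sketch).

# Conceptual sketch of the LLaVA-style two-stage recipe.
# Hypothetical modules and learning rates, not the TinyLLaVA Factory code.
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(768, 768)   # stands in for SigLIP2 so400m
        self.connector = nn.Linear(768, 1024)       # projector trained in stage 1
        self.llm = nn.Linear(1024, 1024)            # stands in for Qwen3-0.6B

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

model = ToyVLM()

# Stage 1 (pretraining): align vision features with the LLM embedding space.
set_trainable(model.vision_encoder, False)
set_trainable(model.llm, False)
set_trainable(model.connector, True)
stage1_params = [p for p in model.parameters() if p.requires_grad]
stage1_opt = torch.optim.AdamW(stage1_params, lr=1e-3)   # illustrative value

# Stage 2 (fine-tuning): unfreeze the LLM and train it with the connector.
set_trainable(model.llm, True)
stage2_params = [p for p in model.parameters() if p.requires_grad]
stage2_opt = torch.optim.AdamW(stage2_params, lr=2e-5)   # illustrative value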

✨ Key Contributions

  1. βœ… Validated Qwen3-0.6B + SigLIP2 as an efficient, high-performance combination for vision-language tasks
  2. πŸ› οΈ Upgraded TinyLLaVA to support Qwen3 models and latest PyTorch/transformers
  3. 🧩 Created a new Qwen3 chat template (an illustrative sketch follows this list)
    β†’ qwen3_base_template.py
  4. βš™οΈ Performed hyperparameter tuning for optimal Qwen3 + SigLIP2 performance
  5. πŸš€ Released standalone Hugging Face inference supportβ€”no need to install TinyLLaVA
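
Regarding contribution 3: the shipped template lives in qwen3_base_template.py. As a rough illustration only, and an assumption rather than the actual file, Qwen-family chat models generally use ChatML-style role markers, so a multimodal prompt could be assembled along these lines:

# Illustrative ChatML-style prompt assembly (assumption only; see
# qwen3_base_template.py in the repository for the template actually used).
IMAGE_TOKEN = "<image>"  # placeholder replaced by vision features downstream

def build_prompt(user_message: str,
                 system_message: str = "You are a helpful assistant.") -> str:
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{IMAGE_TOKEN}\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(build_prompt("What are the things I should be cautious about when I visit here?"))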

πŸ“Š VQAv2 Evaluation (test-dev)

Question Type    Accuracy
Yes/No           91.56%
Number           65.69%
Other            70.28%
Overall          78.52%

Note: Evaluation was performed on the VQAv2 test-dev set; it is unclear whether the prior models were evaluated on test-dev or the full test set. This model is currently being evaluated on the full test set and on additional benchmarks.
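
For context on how these numbers are produced: VQAv2 test-dev is scored by the official evaluation server, which accepts a JSON list of question_id/answer records. A minimal sketch of that submission format (the question IDs and answers below are made up):

# Minimal sketch of a VQAv2 evaluation-server submission file:
# a JSON list of {"question_id", "answer"} records (IDs here are made up).
import json

predictions = [
    {"question_id": 262148000, "answer": "yes"},
    {"question_id": 262148001, "answer": "2"},
]

with open("vqav2_testdev_results.json", "w") as f:
    json.dump(predictions, f)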


πŸ”œ Upcoming Work

This model is currently undergoing evaluation on:

  • VQAv2 test set
  • GQA
  • SQA
  • TextVQA
  • MM-VET
  • POPE
  • MME
  • MMMU

Stay tuned for updates!


🧾 Citation

If you find this model helpful, please consider citing or referencing this repo:

@misc{wang2025microllava,
  title        = {MicroLLaVA: a TinyLLaVA based VLM with MicroLlama 300M for single GPU training},
  author       = {Zixiao Ken Wang},
  year         = {2025},
  url          = {https://huggingface.co/keeeeenw/MicroLlava, https://huggingface.co/keeeeenw/MicroLlava-Qwen3-0.6B-base-siglip2-so400m}
}

Please also support my release of https://huggingface.co/keeeeenw/MicroLlava, which builds on my own https://huggingface.co/keeeeenw/MicroLlama for its language capabilities.
