Model Card for LLaVA_MORE-gemma_2_9b-finetuning
LLaVA-MORE is a new family of Multimodal Large Language Models (MLLMs) that integrates recent language models with diverse visual backbones. This specific model, LLaVA_MORE-gemma_2_9b-finetuning, is fine-tuned on LLaVA-Instruct-665K using gemma-2-9b-it as the LLM backbone and a CLIP-based visual backbone. It is designed to evaluate multimodal reasoning, generation, and instruction following tasks.
🔥 LLaVA-MORE 🔥
A Comparative Study of LLMs and Visual Backbones
for Enhanced Visual Instruction Tuning
Citation
BibTeX:
@inproceedings{cocchi2025llava,
title={{LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning}},
author={Cocchi, Federico and Moratelli, Nicholas and Caffagni, Davide and Sarto, Sara and Baraldi, Lorenzo and Cornia, Marcella and Cucchiara, Rita},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops},
year={2025}
}
Model Details
Model Description
Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. LLaVA-MORE is a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, a unified training protocol is employed consistently across all architectures. This analysis systematically explores both small- and medium-scale LLMs (including Phi-4, LLaMA-3.1, and Gemma-2) to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the LLM impact, a comprehensive study of various visual encoders (CLIP-based, DINOv2, SigLIP, and SigLIP2) is conducted.
This specific model, LLaVA_MORE-gemma_2_9b-finetuning, is a variant fine-tuned on LLaVA-Instruct-665K, using google/gemma-2-9b-it as its LLM backbone and openai/clip-vit-large-patch14-336 as its visual backbone.
- Developed by: Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia and Rita Cucchiara (AImageLab, University of Modena and Reggio Emilia)
- Funded by: PNRR-M4C2 project FAIR - Future Artificial Intelligence Research and PNRR project ITSERR - Italian Strengthening of ESFRI RI Resilience
- Shared by: AImageLab
- Model type: Multimodal Large Language Model (MLLM)
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: google/gemma-2-9b-it (LLM backbone), openai/clip-vit-large-patch14-336 (Visual Backbone)
Model Sources
- Repository: https://github.com/aimagelab/LLaVA-MORE
- Paper: LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
- Project Page: https://aimagelab.ing.unimore.it/imagelab
- Demo: https://huggingface.co/spaces/aimagelab/LLaVA-MORE-8B
- Hugging Face Collection: https://huggingface.co/collections/aimagelab/llava-more-66aa6c49167e190bf27e7be4
Uses
Direct Use
LLaVA-MORE models are intended for direct use in various multimodal tasks (a loading sketch follows the list below), including:
- Visual instruction tuning.
- Multimodal reasoning and question answering (VQA).
- Image-to-text generation based on visual input and textual prompts.
- Comparative studies on the performance of different LLMs and visual backbones for MLLMs.
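For these direct-use cases, the sketch below shows how the checkpoint could be queried through the LLaVA-style Python API that LLaVA-MORE builds on. The module paths (llava.mm_utils, llava.eval.run_llava), the eval_model helper, and the Hugging Face repo ID aimagelab/LLaVA_MORE-gemma_2_9b-finetuning are assumptions taken from the upstream LLaVA codebase and this card; check the GitHub repository for the exact interface.

```python
# Minimal inference sketch, assuming the LLaVA-MORE repository and its requirements
# are installed. Module paths and the eval_model helper follow the upstream LLaVA
# codebase that LLaVA-MORE extends; the exact entry points here may differ.
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = "aimagelab/LLaVA_MORE-gemma_2_9b-finetuning"  # assumed Hugging Face repo ID
prompt = "Describe the content of this image."
image_file = "https://llava-vl.github.io/static/images/view.jpg"  # local path or URL

# The upstream helper expects a simple namespace of generation arguments.
args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
})()

eval_model(args)  # prints the model's answer to stdout
```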
Downstream Use
LLaVA-MORE provides a solid foundation for further research and development in multimodal AI. Researchers and developers can use this family of models as a base for fine-tuning on specific downstream tasks or integrating into larger applications requiring visual instruction following and multimodal understanding.
Out-of-Scope Use
As with all large language models, LLaVA-MORE models may:
- Produce hallucinations or factual inaccuracies.
- Exhibit biases present in the training data.
- Generate harmful, offensive, or inappropriate content.
- Not perform optimally on tasks or domains significantly different from their training data.
The models are not intended for use in safety-critical applications without thorough human review and oversight.
Bias, Risks, and Limitations
The models are trained on large-scale datasets that may contain societal biases, stereotypes, or harmful content. Users should be aware of these potential biases and exercise caution when deploying the model in sensitive applications. Performance may vary across different visual and linguistic contexts. The paper also highlights inconsistencies in training data and evaluation protocols across prior work, which it addresses with a unified training protocol; nevertheless, the inherent limitations of MLLMs still apply.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. It is recommended to carefully evaluate the model's outputs for their specific use case and consider implementing additional safeguards or human oversight, especially in high-stakes scenarios. Understanding the limitations arising from the training data and model architecture is crucial.
Training Details
Training Data
LLaVA-MORE models are trained in two stages:
- Pretraining Stage: Uses data from LCS-558K.
- Finetuning Stage: Uses data from LLaVA-Instruct-665K.
Training Procedure
The training protocol is unified and applied consistently across all architectures, designed for distributed training on HPC facilities with a SLURM scheduler.
Preprocessing
Preprocessing details can be found in the original GitHub repository.
Training Hyperparameters
The models are generally trained using float16 or bfloat16 mixed precision (as indicated by torch_dtype: float16 in config.json and typical LLM training practices). Specific hyperparameters are available in the released bash scripts on the GitHub repository.
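As a quick way to confirm the released precision and architecture, the snippet below reads the checkpoint's config.json from the Hub; the repo ID is an assumption based on this card's title.

```python
# Inspect the dtype and architecture recorded in the released config.json.
# The Hugging Face repo ID is an assumption based on this model card's title.
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(
    repo_id="aimagelab/LLaVA_MORE-gemma_2_9b-finetuning",
    filename="config.json",
)
with open(config_path) as f:
    config = json.load(f)

print(config.get("torch_dtype"))    # expected: "float16"
print(config.get("architectures"))  # expected: ["LlavaGemmaForCausalLM"]
```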
Speeds, Sizes, Times
Not explicitly detailed; training was performed on HPC resources (see Compute Infrastructure below).
Evaluation
Testing Data, Factors & Metrics
Testing Data
The models are evaluated on a range of multimodal datasets, including Text-VQA, Science-QA, AI2D, SEED-vid, SEED-all, SEED-img, MMMU, MMBench-Cn, MMBench-En, POPE, GQA, MME-P, and MME-C.
Factors
The evaluation considers different LLM backbones (Phi-4, LLaMA-3.1, Gemma-2) and various visual encoders (CLIP-based, DINOv2, SigLIP, SigLIP2) as factors.
Metrics
Performance metrics vary by dataset and typically include accuracy, score, or other task-specific metrics as presented in the benchmark table.
Results
The performance of LLaVA-MORE models compared to other LLaVA versions across these multimodal benchmarks is reported in the results table of the GitHub repository.
Latest Updates
- [2025/07/22] 📚 LLaVA-MORE has been accepted at the "What is Next in Multimodal Foundation Models?" Workshop @ ICCV 2025
- [2025/05/22] Check out our latest paper
- [2025/03/18] 🔥 LLaVA-MORE 8B is now available on Ollama!
- [2024/08/16] 📌 Improved LLaVA-MORE 8B model, considering advanced image backbones.
- [2024/08/01] 🔥 First release of our LLaVA-MORE 8B, based on LLaMA 3.1.
- [2024/08/01] 🔎 If you are interested in this area of research, check out our survey on the revolution of Multimodal LLMs, recently published at ACL (Findings).
- [2024/08/01] 📚 Check out the latest research from AImageLab.
Installation
To create the conda environment named more, use the following instructions. This environment provides all the packages needed to run the code (training and evaluation) in this repository.
conda create -n more python==3.8.16
conda activate more
pip install -r requirements.txt
Note that the requirements are heavily inspired by the original LLaVA repository.
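As a quick sanity check after installation, the following snippet verifies that PyTorch and Transformers import correctly and that a GPU is visible; it assumes both packages are among the pinned requirements.

```python
# Post-installation sanity check (illustrative; assumes torch and transformers
# are pinned in requirements.txt, as in the upstream LLaVA repository).
import torch
import transformers

print("PyTorch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```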
Environmental Impact
Computational work is supported by CINECA using high-performance computing resources. This work is supported by the PNRR-M4C2 project FAIR - Future Artificial Intelligence Research and by the PNRR project ITSERR - Italian Strengthening of ESFRI RI Resilience.
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). Specific details on hardware, hours used, cloud provider, compute region, and carbon emitted are not provided.
Technical Specifications
Model Architecture and Objective
The model uses the LlavaGemmaForCausalLM architecture. Its objective is to enable multimodal reasoning, generation, and instruction following by integrating visual backbones with large language models, specifically focusing on comparing different LLM and visual encoder choices.
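To make this composition concrete, below is a toy, self-contained sketch of the LLaVA-style design (visual features → MLP projector → language model). All class names, layer choices, and dimensions are illustrative placeholders, not the actual LlavaGemmaForCausalLM implementation.

```python
# Illustrative sketch of the LLaVA-style composition described above: visual features
# from a frozen image encoder are projected into the LLM embedding space and prepended
# to the text token embeddings before decoding. Everything here is a toy placeholder.
import torch
import torch.nn as nn


class ToyLlavaStyleModel(nn.Module):
    def __init__(self, vision_dim=64, llm_dim=128, vocab_size=1000):
        super().__init__()
        # Two-layer MLP projector, as in LLaVA-style connectors.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        # Tiny stand-in for the decoder-only LLM backbone (Gemma-2-9B in the real model).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, visual_feats, input_ids):
        # visual_feats: (batch, num_patches, vision_dim), e.g. CLIP ViT patch features.
        visual_tokens = self.projector(visual_feats)
        text_tokens = self.text_embed(input_ids)
        hidden = self.llm(torch.cat([visual_tokens, text_tokens], dim=1))
        return self.lm_head(hidden)


model = ToyLlavaStyleModel()
logits = model(torch.randn(1, 16, 64), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 1000]): (batch, visual + text tokens, vocab)
```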
Compute Infrastructure
Training was performed on HPC facilities with a SLURM scheduler, specifically using resources from CINECA.
Hardware
High-performance computing resources were utilized.
Software
The project's requirements.txt specifies the necessary Python packages.
Acknowledgments
We thank the LLaVA team for open-sourcing a modular codebase to extend and train different models within the LLaVA family. We are also happy users of the lmms-eval library, which has significantly reduced the evaluation time of our checkpoints across different datasets.
Model Card Authors
Niels (Hugging Face Community Science Team)
Model Card Contact
AImageLab (via GitHub issues on the repository)