Model Card for LLaVA_MORE-gemma_2_9b-finetuning
LLaVA-MORE is a new family of Multimodal Large Language Models (MLLMs) that integrates recent language models with diverse visual backbones. This specific model, LLaVA_MORE-gemma_2_9b-finetuning, is fine-tuned on LLaVA-Instruct-665K using gemma-2-9b-it as the LLM backbone and a CLIP-based visual backbone. It is designed to evaluate multimodal reasoning, generation, and instruction following tasks.
🔥 LLaVA-MORE 🔥
A Comparative Study of LLMs and Visual Backbones
for Enhanced Visual Instruction Tuning
Citation
BibTeX:
@inproceedings{cocchi2025llava,
title={{LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning}},
author={Cocchi, Federico and Moratelli, Nicholas and Caffagni, Davide and Sarto, Sara and Baraldi, Lorenzo and Cornia, Marcella and Cucchiara, Rita},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops},
year={2025}
}
Model Details
Model Description
Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. LLaVA-MORE is a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, a unified training protocol is employed consistently across all architectures. This analysis systematically explores both small- and medium-scale LLMs (including Phi-4, LLaMA-3.1, and Gemma-2) to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the LLM impact, a comprehensive study of various visual encoders (CLIP-based, DINOv2, SigLIP, and SigLIP2) is conducted.
This specific model, LLaVA_MORE-gemma_2_9b-finetuning, is a variant fine-tuned on LLaVA-Instruct-665K, using google/gemma-2-9b-it as its LLM backbone and openai/clip-vit-large-patch14-336 as its visual backbone.
- Developed by: Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia and Rita Cucchiara (AImageLab, University of Modena and Reggio Emilia)
- Funded by: PNRR-M4C2 project FAIR - Future Artificial Intelligence Research and PNRR project ITSERR - Italian Strengthening of ESFRI RI Resilience
- Shared by: AImageLab
- Model type: Multimodal Large Language Model (MLLM)
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: google/gemma-2-9b-it (LLM backbone), openai/clip-vit-large-patch14-336 (Visual Backbone)
Model Sources
- Repository: https://github.com/aimagelab/LLaVA-MORE
- Paper: LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
- Project Page: https://aimagelab.ing.unimore.it/imagelab
- Demo: https://huggingface.co/spaces/aimagelab/LLaVA-MORE-8B
- Hugging Face Collection: https://huggingface.co/collections/aimagelab/llava-more-66aa6c49167e190bf27e7be4
Uses
Direct Use
LLaVA-MORE models are intended for direct use in various multimodal tasks (a loading sketch follows the list below), including:
- Visual instruction tuning.
- Multimodal reasoning and question answering (VQA).
- Image-to-text generation based on visual input and textual prompts.
- Comparative studies on the performance of different LLMs and visual backbones for MLLMs.
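For these direct-use cases, the sketch below shows how the checkpoint could be queried through the LLaVA-style Python API that LLaVA-MORE builds on. The module paths (llava.mm_utils, llava.eval.run_llava), the eval_model helper, and the Hugging Face repo ID aimagelab/LLaVA_MORE-gemma_2_9b-finetuning are assumptions taken from the upstream LLaVA codebase and this card; check the GitHub repository for the exact interface.

```python
# Minimal inference sketch, assuming the LLaVA-MORE repository and its requirements
# are installed. Module paths and the eval_model helper follow the upstream LLaVA
# codebase that LLaVA-MORE extends; the exact entry points here may differ.
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = "aimagelab/LLaVA_MORE-gemma_2_9b-finetuning"  # assumed Hugging Face repo ID
prompt = "Describe the content of this image."
image_file = "https://llava-vl.github.io/static/images/view.jpg"  # local path or URL

# The upstream helper expects a simple namespace of generation arguments.
args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
})()

eval_model(args)  # prints the model's answer to stdout
```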
Downstream Use
LLaVA-MORE provides a solid foundation for further research and development in multimodal AI. Researchers and developers can use this family of models as a base for fine-tuning on specific downstream tasks or integrating into larger applications requiring visual instruction following and multimodal understanding.
Out-of-Scope Use
As with all large language models, LLaVA-MORE models may:
- Produce hallucinations or factual inaccuracies.
- Exhibit biases present in the training data.
- Generate harmful, offensive, or inappropriate content.
- Not perform optimally on tasks or domains significantly different from their training data.
The models are not intended for use in safety-critical applications without thorough human review and oversight.
Bias, Risks, and Limitations
The models are trained on large-scale datasets that may contain societal biases, stereotypes, or harmful content. Users should be aware of these potential biases and exercise caution when deploying the model in sensitive applications. Performance may vary across different visual and linguistic contexts. The paper also highlights inconsistencies in training data and evaluation protocols across prior work, which it addresses with a unified training protocol; nevertheless, the inherent limitations of MLLMs still apply.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. It is recommended to carefully evaluate the model's outputs for their specific use case and consider implementing additional safeguards or human oversight, especially in high-stakes scenarios. Understanding the limitations arising from the training data and model architecture is crucial.
Training Details
Training Data
LLaVA-MORE models are trained in two stages:
- Pretraining Stage: Uses data from LCS-558K.
- Finetuning Stage: Uses data from LLaVA-Instruct-665K.
Training Procedure
The training protocol is unified and applied consistently across all architectures, designed for distributed training on HPC facilities with a SLURM scheduler.
Preprocessing
Preprocessing details can be found in the original GitHub repository.
Training Hyperparameters
The models are generally trained using float16 or bfloat16 mixed precision (as indicated by torch_dtype: float16 in config.json and typical LLM training practices). Specific hyperparameters are available in the released bash scripts on the GitHub repository.
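As a quick way to confirm the released precision and architecture, the snippet below reads the checkpoint's config.json from the Hub; the repo ID is an assumption based on this card's title.

```python
# Inspect the dtype and architecture recorded in the released config.json.
# The Hugging Face repo ID is an assumption based on this model card's title.
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(
    repo_id="aimagelab/LLaVA_MORE-gemma_2_9b-finetuning",
    filename="config.json",
)
with open(config_path) as f:
    config = json.load(f)

print(config.get("torch_dtype"))    # expected: "float16"
print(config.get("architectures"))  # expected: ["LlavaGemmaForCausalLM"]
```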
Speeds, Sizes, Times
Not explicitly detailed; training was performed on HPC resources (see Compute Infrastructure below).
Evaluation
Testing Data, Factors & Metrics
Testing Data
The models are evaluated on a range of multimodal datasets, including Text-VQA, Science-QA, AI2D, SEED-vid, SEED-all, SEED-img, MMMU, MMBench-Cn, MMBench-En, POPE, GQA, MME-P, and MME-C.
Factors
The evaluation considers different LLM backbones (Phi-4, LLaMA-3.1, Gemma-2) and various visual encoders (CLIP-based, DINOv2, SigLIP, SigLIP2) as factors.
Metrics
Performance metrics vary by dataset and typically include accuracy, score, or other task-specific metrics as presented in the benchmark table.
Results
The performance of LLaVA-MORE models compared to other LLaVA versions across these multimodal benchmarks is reported in the results table of the GitHub repository.
Latest Updates
- [2025/07/22] 📚 LLaVA-MORE has been accepted at the "What is Next in Multimodal Foundation Models?" Workshop @ ICCV 2025
- [2025/05/22] Check out our latest paper
- [2025/03/18] 🔥 LLaVA-MORE 8B is now available on Ollama!
- [2024/08/16] 📌 Improved LLaVA-MORE 8B model, considering advanced image backbones.
- [2024/08/01] 🔥 First release of our LLaVA-MORE 8B, based on LLaMA 3.1.
- [2024/08/01] 🔎 If you are interested in this area of research, check out our survey on the revolution of Multimodal LLMs, recently published at ACL (Findings).
- [2024/08/01] 📚 Check out the latest research from AImageLab.
Installation
To create the conda environment named more, use the following instructions. This environment provides all the packages needed to run the code (training and evaluation) in this repository.
conda create -n more python==3.8.16
conda activate more
pip install -r requirements.txt
Note that the requirements are heavily inspired by the original LLaVA repository.
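As a quick sanity check after installation, the following snippet verifies that PyTorch and Transformers import correctly and that a GPU is visible; it assumes both packages are among the pinned requirements.

```python
# Post-installation sanity check (illustrative; assumes torch and transformers
# are pinned in requirements.txt, as in the upstream LLaVA repository).
import torch
import transformers

print("PyTorch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```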
Environmental Impact
Computational work is supported by CINECA using high-performance computing resources. This work is supported by the PNRR-M4C2 project FAIR - Future Artificial Intelligence Research and by the PNRR project ITSERR - Italian Strengthening of ESFRI RI Resilience.
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). Specific details on hardware, hours used, cloud provider, compute region, and carbon emitted are not provided.
Technical Specifications
Model Architecture and Objective
The model uses the LlavaGemmaForCausalLM architecture. Its objective is to enable multimodal reasoning, generation, and instruction following by integrating visual backbones with large language models, specifically focusing on comparing different LLM and visual encoder choices.
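To make this composition concrete, below is a toy, self-contained sketch of the LLaVA-style design (visual features → MLP projector → language model). All class names, layer choices, and dimensions are illustrative placeholders, not the actual LlavaGemmaForCausalLM implementation.

```python
# Illustrative sketch of the LLaVA-style composition described above: visual features
# from a frozen image encoder are projected into the LLM embedding space and prepended
# to the text token embeddings before decoding. Everything here is a toy placeholder.
import torch
import torch.nn as nn


class ToyLlavaStyleModel(nn.Module):
    def __init__(self, vision_dim=64, llm_dim=128, vocab_size=1000):
        super().__init__()
        # Two-layer MLP projector, as in LLaVA-style connectors.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        # Tiny stand-in for the decoder-only LLM backbone (Gemma-2-9B in the real model).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, visual_feats, input_ids):
        # visual_feats: (batch, num_patches, vision_dim), e.g. CLIP ViT patch features.
        visual_tokens = self.projector(visual_feats)
        text_tokens = self.text_embed(input_ids)
        hidden = self.llm(torch.cat([visual_tokens, text_tokens], dim=1))
        return self.lm_head(hidden)


model = ToyLlavaStyleModel()
logits = model(torch.randn(1, 16, 64), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 1000]): (batch, visual + text tokens, vocab)
```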
Compute Infrastructure
Training was performed on HPC facilities with a SLURM scheduler, specifically using resources from CINECA.
Hardware
High-performance computing resources were utilized.
Software
The project's requirements.txt specifies the necessary Python packages.
Acknowledgments
We thank the LLaVA team for open-sourcing a modular codebase to extend and train different models within the LLaVA family. We are also happy users of the lmms-eval library, which has significantly reduced the evaluation time of our checkpoints across different datasets.
Model Card Authors
Niels (Hugging Face Community Science Team)
Model Card Contact
AImageLab (via GitHub issues on the repository)