nielsr HF Staff committed on
Commit 186e6bc · verified · 1 Parent(s): 524a9ed

Improve model card for LLaVA_MORE-phi_4-finetuning


This PR significantly improves the model card for `aimagelab/LLaVA_MORE-phi_4-finetuning` by:

* Adding detailed metadata including the `license`, `pipeline_tag`, `base_model`, and relevant `tags` for better discoverability and functionality on the Hub.
* Populating the model card content with a comprehensive description derived from the paper abstract and GitHub README.
* Providing direct links to the paper, the GitHub repository, the project website, the Hugging Face model collection, and a demo space.
* Including detailed performance benchmarks, training data and procedure, acknowledgments, and academic citation.
* Adding a robust "How to Get Started" section with a runnable Python code example that demonstrates model inference, including the necessary image preprocessing steps.

This update will make the model much more accessible, informative, and usable for the Hugging Face community.

Files changed (1)
  1. README.md +209 -143
README.md CHANGED
@@ -1,199 +1,265 @@
1
  ---
2
  library_name: transformers
3
- tags: []
4
  ---
5
 
6
- # Model Card for Model ID
7
 
8
- <!-- Provide a quick summary of what the model is/does. -->
9
 
 
10
 
 
 
 
11
 
12
  ## Model Details
13
 
14
  ### Model Description
15
 
16
- <!-- Provide a longer summary of what this model is. -->
17
-
18
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
 
20
- - **Developed by:** [More Information Needed]
21
- - **Funded by [optional]:** [More Information Needed]
22
- - **Shared by [optional]:** [More Information Needed]
23
- - **Model type:** [More Information Needed]
24
- - **Language(s) (NLP):** [More Information Needed]
25
- - **License:** [More Information Needed]
26
- - **Finetuned from model [optional]:** [More Information Needed]
27
 
28
- ### Model Sources [optional]
29
 
30
- <!-- Provide the basic links for the model. -->
31
-
32
- - **Repository:** [More Information Needed]
33
- - **Paper [optional]:** [More Information Needed]
34
- - **Demo [optional]:** [More Information Needed]
35
 
36
  ## Uses
37
 
38
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
-
40
  ### Direct Use
41
 
42
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
-
44
- [More Information Needed]
45
-
46
- ### Downstream Use [optional]
47
-
48
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
-
50
- [More Information Needed]
51
 
52
  ### Out-of-Scope Use
53
 
54
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
-
56
- [More Information Needed]
57
 
58
  ## Bias, Risks, and Limitations
59
 
60
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
-
62
- [More Information Needed]
63
 
64
  ### Recommendations
65
 
66
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
-
68
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
 
70
  ## How to Get Started with the Model
71
 
72
- Use the code below to get started with the model.
73
-
74
- [More Information Needed]
75
 
76
  ## Training Details
77
 
78
  ### Training Data
79
 
80
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
-
82
- [More Information Needed]
83
 
84
  ### Training Procedure
85
 
86
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
-
88
- #### Preprocessing [optional]
89
-
90
- [More Information Needed]
91
-
92
-
93
- #### Training Hyperparameters
94
-
95
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
-
97
- #### Speeds, Sizes, Times [optional]
98
-
99
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
-
101
- [More Information Needed]
102
 
103
  ## Evaluation
104
 
105
- <!-- This section describes the evaluation protocols and provides the results. -->
106
-
107
- ### Testing Data, Factors & Metrics
108
-
109
- #### Testing Data
110
-
111
- <!-- This should link to a Dataset Card if possible. -->
112
-
113
- [More Information Needed]
114
-
115
- #### Factors
116
-
117
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
-
119
- [More Information Needed]
120
-
121
- #### Metrics
122
-
123
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
-
125
- [More Information Needed]
126
-
127
- ### Results
128
-
129
- [More Information Needed]
130
-
131
- #### Summary
132
-
133
-
134
-
135
- ## Model Examination [optional]
136
-
137
- <!-- Relevant interpretability work for the model goes here -->
138
-
139
- [More Information Needed]
140
-
141
- ## Environmental Impact
142
-
143
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
-
145
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
-
147
- - **Hardware Type:** [More Information Needed]
148
- - **Hours used:** [More Information Needed]
149
- - **Cloud Provider:** [More Information Needed]
150
- - **Compute Region:** [More Information Needed]
151
- - **Carbon Emitted:** [More Information Needed]
152
-
153
- ## Technical Specifications [optional]
154
-
155
- ### Model Architecture and Objective
156
-
157
- [More Information Needed]
158
-
159
- ### Compute Infrastructure
160
-
161
- [More Information Needed]
162
-
163
- #### Hardware
164
-
165
- [More Information Needed]
166
-
167
- #### Software
168
-
169
- [More Information Needed]
170
-
171
- ## Citation [optional]
172
-
173
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
 
175
- **BibTeX:**
176
 
177
- [More Information Needed]
 
 
178
 
179
- **APA:**
180
 
181
- [More Information Needed]
182
 
183
- ## Glossary [optional]
184
 
185
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
 
187
- [More Information Needed]
188
 
189
- ## More Information [optional]
190
 
191
- [More Information Needed]
192
 
193
- ## Model Card Authors [optional]
 
194
 
195
- [More Information Needed]
196
 
197
- ## Model Card Contact
 
198
 
199
- [More Information Needed]
1
  ---
2
  library_name: transformers
3
+ pipeline_tag: image-text-to-text
4
+ license: apache-2.0
5
+ tags:
6
+ - multimodal
7
+ - vision-language-model
8
+ - llava
9
+ - instruction-tuned
10
+ - phi-4
11
+ - vqa
12
+ base_model: microsoft/Phi-4-mini-instruct
13
  ---
14
 
15
+ # Model Card for LLaVA_MORE-phi_4-finetuning
16
 
17
+ <div align="center">
18
+ <img src="https://huggingface.co/aimagelab/LLaVA_MORE-phi_4-finetuning/resolve/main/images/image_no_back.png" width="200" height="200">
19
+ <h1> 🔥 LLaVA-MORE 🔥
20
+
21
+ A Comparative Study of LLMs and Visual Backbones <br>for Enhanced Visual Instruction Tuning
22
+ </h1>
23
+ </div>
24
 
25
+ This model is part of the **LLaVA-MORE** family of Multimodal Large Language Models (MLLMs), presented in the paper [LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning](https://huggingface.co/papers/2503.15621).
26
 
27
+ LLaVA-MORE integrates recent language models with diverse visual backbones. It employs a unified training protocol applied consistently across all architectures to ensure fair comparisons and systematically explore the trade-offs between model size, architecture, and performance. This model, `LLaVA_MORE-phi_4-finetuning`, uses **Phi-4 Instruct** as its LLM backbone and is finetuned on the [LLaVA-Instruct-665K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) dataset.
28
+
29
+ It is designed for multimodal reasoning, generation, and instruction following, and provides insights into the design of more effective MLLMs.
30
 
31
  ## Model Details
32
 
33
  ### Model Description
34
 
35
+ This is a checkpoint from the LLaVA-MORE family of MLLMs. It integrates the **Phi-4 Instruct** Large Language Model with a visual backbone (specifically, `openai/clip-vit-large-patch14-336` as per `config.json`). It has been finetuned on the `LLaVA-Instruct-665K` dataset. The project aims to provide a reproducible evaluation framework to guide future model development by systematically studying the impact of different LLMs and visual encoders, as well as factors like image resolution and pre-training datasets.
 
 
36
 
37
+ - **Developed by:** Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia and Rita Cucchiara (AImageLab, University of Modena and Reggio Emilia).
38
+ - **Model type:** Multimodal Large Language Model (MLLM) / Vision-Language Model
39
+ - **Language(s):** English
40
+ - **License:** Apache-2.0
41
+ - **Finetuned from model:** `microsoft/Phi-4-mini-instruct`
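+ If you want to check the visual backbone reported above yourself, the snippet below reads the checkpoint's `config.json` directly from the Hub. This is a minimal sketch; the `mm_vision_tower` key name is an assumption based on the usual LLaVA-style configuration layout.
+
+ ```python
+ # Minimal sketch: inspect the checkpoint configuration to confirm the visual backbone.
+ # The "mm_vision_tower" key is an assumption based on LLaVA-style config files.
+ import json
+
+ from huggingface_hub import hf_hub_download
+
+ config_path = hf_hub_download(
+     repo_id="aimagelab/LLaVA_MORE-phi_4-finetuning",
+     filename="config.json",
+ )
+ with open(config_path) as f:
+     cfg = json.load(f)
+
+ print(cfg.get("architectures"))    # model class registered for this checkpoint
+ print(cfg.get("mm_vision_tower"))  # expected: openai/clip-vit-large-patch14-336
+ ```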
 
 
42
 
43
+ ### Model Sources
44
 
45
+ - **Repository:** [https://github.com/aimagelab/LLaVA-MORE](https://github.com/aimagelab/LLaVA-MORE)
46
+ - **Paper:** [https://huggingface.co/papers/2503.15621](https://huggingface.co/papers/2503.15621)
47
+ - **Project Website:** [https://aimagelab.ing.unimore.it/imagelab](https://aimagelab.ing.unimore.it/imagelab)
48
+ - **Hugging Face Collection:** [LLaVA-MORE Models](https://huggingface.co/collections/aimagelab/llava-more-66aa6c49167e190bf27e7be4)
49
+ - **Hugging Face Demo:** [https://huggingface.co/spaces/aimagelab/LLaVA-MORE](https://huggingface.co/spaces/aimagelab/LLaVA-MORE)
50
 
51
  ## Uses
52
 
 
 
53
  ### Direct Use
54
 
55
+ This model is intended for various multimodal reasoning, generation, and instruction-following tasks. It can be used to process visual inputs in conjunction with textual prompts to generate informative and relevant text responses. Typical applications include visual question answering, image captioning, and conversational AI involving images.
56
 
57
  ### Out-of-Scope Use
58
 
59
+ This model is not intended for generating harmful content, engaging in misinformation, or being deployed in applications without proper human oversight. As an AI model, it may hallucinate or produce factually incorrect information. It should not be used in safety-critical applications without thorough domain-specific evaluation and mitigation strategies.
 
 
60
 
61
  ## Bias, Risks, and Limitations
62
 
63
+ Given that the model is trained on large datasets, it may inherit biases present in the data, leading to biased outputs. Potential risks include generating offensive, inaccurate, or harmful content. Like all generative models, it may also hallucinate or provide factually incorrect information.
 
 
64
 
65
  ### Recommendations
66
 
67
+ Users should be aware of the inherent biases and limitations of MLLMs. It is recommended to apply human review to outputs, especially in sensitive applications. Further research and evaluation are needed to fully understand and mitigate potential societal impacts.
 
 
68
 
69
  ## How to Get Started with the Model
70
 
71
+ To get started with inference, you can use the `transformers` library along with the provided `run_llava.py` script from the project's GitHub repository or integrate it directly using Python as shown below.
72
+
73
+ First, install the necessary packages as described in the [GitHub Installation section](https://github.com/aimagelab/LLaVA-MORE#installation):
74
+ ```bash
75
+ conda create -n more python==3.8.16
76
+ conda activate more
77
+ pip install -r requirements.txt # Refer to the GitHub repo for the exact requirements.txt
78
+ ```
79
+
80
+ **Using the `run_llava.py` script (recommended for full functionality):**
81
+
82
+ ```bash
83
+ cd ~/LLaVA-MORE # Navigate to the cloned LLaVA-MORE repository
84
+ source activate more
85
+ export PYTHONPATH=.
86
+
87
+ model_path=aimagelab/LLaVA_MORE-phi_4-finetuning # Adjust to the specific model path
88
+ model_architecture=llava_phi # Based on config.json
89
+ conversation=phi_4 # this may vary with the tokenizer config; check the original LLaVA-MORE code for the best match
90
+
91
+ export HF_TOKEN=hf_read_token # Replace with your Hugging Face read token if needed
92
+ export TOKENIZER_PATH=$model_path
93
+
94
+ python -u src/llava/eval/run_llava.py --model-path $model_path --model-architecture $model_architecture --conv-mode $conversation
95
+ ```
96
+
97
+ **Direct Python Inference Example:**
98
+
99
+ This example demonstrates how to load the model and perform a simple inference. Note that the image preprocessing part, essential for LLaVA-style models, is included for a complete example.
100
+
101
+ ```python
102
+ import numpy as np
103
+ import torch
104
+ import torchvision.transforms as T
105
+ from PIL import Image
106
+ from torchvision.transforms.functional import InterpolationMode
107
+ from transformers import AutoModel, AutoTokenizer
108
+
109
+ IMAGENET_MEAN = (0.485, 0.456, 0.406)
110
+ IMAGENET_STD = (0.229, 0.224, 0.225)
111
+
112
+ def build_transform(input_size):
113
+ """Builds the image transformation pipeline."""
114
+ MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
115
+ transform = T.Compose([
116
+ T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
117
+ T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
118
+ T.ToTensor(),
119
+ T.Normalize(mean=MEAN, std=STD)
120
+ ])
121
+ return transform
122
+
123
+ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
124
+ """Finds the target aspect ratio closest to the image's aspect ratio."""
125
+ best_ratio_diff = float('inf')
126
+ best_ratio = (1, 1)
127
+ area = width * height
128
+ for ratio in target_ratios:
129
+ target_aspect_ratio = ratio[0] / ratio[1]
130
+ ratio_diff = abs(aspect_ratio - target_aspect_ratio)
131
+ if ratio_diff < best_ratio_diff:
132
+ best_ratio_diff = ratio_diff
133
+ best_ratio = ratio
134
+ elif ratio_diff == best_ratio_diff:
135
+ if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
136
+ best_ratio = ratio
137
+ return best_ratio
138
+
139
+ def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
140
+ """Dynamically preprocesses images for multi-scale input."""
141
+ orig_width, orig_height = image.size
142
+ aspect_ratio = orig_width / orig_height
143
+
144
+ target_ratios = set(
145
+ (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
146
+ i * j <= max_num and i * j >= min_num)
147
+ target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
148
+
149
+ target_aspect_ratio = find_closest_aspect_ratio(
150
+ aspect_ratio, target_ratios, orig_width, orig_height, image_size)
151
+
152
+ target_width = image_size * target_aspect_ratio[0]
153
+ target_height = image_size * target_aspect_ratio[1]
154
+ blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
155
+
156
+ resized_img = image.resize((target_width, target_height))
157
+ processed_images = []
158
+ for i in range(blocks):
159
+ box = (
160
+ (i % (target_width // image_size)) * image_size,
161
+ (i // (target_width // image_size)) * image_size,
162
+ ((i % (target_width // image_size)) + 1) * image_size,
163
+ ((i // (target_width // image_size)) + 1) * image_size
164
+ )
165
+ split_img = resized_img.crop(box)
166
+ processed_images.append(split_img)
167
+ assert len(processed_images) == blocks
168
+ if use_thumbnail and len(processed_images) != 1:
169
+ thumbnail_img = image.resize((image_size, image_size))
170
+ processed_images.append(thumbnail_img)
171
+ return processed_images
172
+
173
+ def load_image(image_path, input_size=448, max_num=12):
174
+ """Loads and preprocesses an image."""
175
+ image = Image.open(image_path).convert('RGB')
176
+ transform = build_transform(input_size=input_size)
177
+ images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
178
+ pixel_values = [transform(image) for image in images]
179
+ pixel_values = torch.stack(pixel_values)
180
+ return pixel_values
181
+
182
+ # Load model and tokenizer
183
+ model_id = "aimagelab/LLaVA_MORE-phi_4-finetuning" # This specific model
184
+ model = AutoModel.from_pretrained(
185
+ model_id,
186
+ torch_dtype=torch.bfloat16, # Or torch.float16 if bfloat16 is not supported by your GPU
187
+ low_cpu_mem_usage=True,
188
+ trust_remote_code=True
189
+ ).eval().cuda() # Move model to GPU
190
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, use_fast=False)
191
+
192
+ # Example Usage
193
+ # Replace 'path/to/your_image.jpg' with a valid image file on your system
194
+ # Or download an example image: e.g., !wget -P ./examples/images/ https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-finetuning/resolve/main/images/plot.png
195
+ pixel_values = load_image('./examples/images/plot.png', max_num=6).to(torch.bfloat16).cuda()
196
+ generation_config = dict(max_new_tokens=1024, do_sample=True)
197
+
198
+ question = "Describe the image in detail."
199
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
200
+ print(f'User: {question}\nAssistant: {response}')
202
+ ```
203
 
204
  ## Training Details
205
 
206
  ### Training Data
207
 
208
+ The LLaVA-MORE models are typically trained in two stages:
209
+ - **Pretraining:** On the [LCS-558K dataset](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain).
210
+ - **Finetuning:** On the [LLaVA-Instruct-665K dataset](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K).
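+ As a quick way to inspect the second-stage data, the hedged sketch below downloads the instruction-tuning annotations from the Hub; the filename `llava_v1_5_mix665k.json` is an assumption about the dataset repository layout, so check the dataset card if it has changed.
+
+ ```python
+ # Minimal sketch: fetch and inspect the finetuning annotations.
+ # The filename "llava_v1_5_mix665k.json" is an assumption about the repo layout.
+ import json
+
+ from huggingface_hub import hf_hub_download
+
+ annotations_path = hf_hub_download(
+     repo_id="liuhaotian/LLaVA-Instruct-150K",
+     filename="llava_v1_5_mix665k.json",
+     repo_type="dataset",
+ )
+ with open(annotations_path) as f:
+     samples = json.load(f)
+
+ print(len(samples))       # number of instruction-tuning samples
+ print(samples[0].keys())  # typically: 'id', 'image', 'conversations'
+ ```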
211
 
212
  ### Training Procedure
213
 
214
+ The training employs a unified protocol consistently applied across all architectures to ensure fair comparisons and enhance reproducibility. The project publicly releases the source code and bash scripts for distributed training on HPC facilities with a SLURM scheduler. More details on the training procedure and hyperparameters can be found in the [Training section of the GitHub repository](https://github.com/aimagelab/LLaVA-MORE#training).
215
 
216
  ## Evaluation
217
 
218
+ ### Benchmarks and Comparisons on Instruction Multimodal Datasets in the Literature
219
 
220
+ The table below presents the performance of LLaVA-MORE variants, including this model, compared to other LLaVA versions across various multimodal datasets. For the most up-to-date and complete evaluation results, please refer to the [Performance section in the GitHub repository](https://github.com/aimagelab/LLaVA-MORE#performance).
221
 
222
+ <div align="center">
223
+ <img src="https://huggingface.co/aimagelab/LLaVA_MORE-phi_4-finetuning/resolve/main/images/plot.png" width="500">
224
+ </div>
225
 
226
+ <div align="center">
227
 
228
+ | Model Name | Text-VQA* | Science-QA | AI2D | SEED-vid | SEED-all | SEED-img | MMMU | MMBench-Cn | MMBench-En | POPE | GQA | MME-P | MME-C |
229
+ |----------------------|:----------: |:------------:|:------:|:----------:|:----------:|:----------:|:------:|:------------:|:------------:|:------:|:-----:|:--------:|:-------:|
230
+ | LLaVA-v1.5-7B | 58.2 | 69.0 | 56.4 | 42.0 | 61.6 | 66.8 | 34.2 | 56.5 | 65.3 | 85.6 | 62.4 | 1474.3 | 314.6 |
231
+ | LLaVA-v1.5-LLaMA3-8B | 57.6 | 74.2 | 60.7 | 42.0 | 64.3 | 70.1 | 37.3 | 65.4 | 70.3 | 85.4 | 63.5 | 1544.4 | 330.3 |
232
+ | **LLaVA-v1.5-LLaMA3_1-8B** | 58.4 | 76.3 | 61.8 | 42.4 | 64.1 | 69.8 | 39.4 | **68.2** | 72.4 | 85.1 | 63.6 | 1531.5 | **353.3** |
233
+ | **LLaVA-v1.5-LLaMA3_1-8B-S2** | 60.9 | 76.7 | 62.2 | 42.3 | 64.2 | 69.9 | 38.7 | 65.8 | 71.1 | 86.5 | 64.5 | **1563.8** | 293.2 |
234
+ | **LLaVA-v1.5-LLaMA3_1-8B-siglip** | 62.1 | **77.5** | 63.6 | **46.1** | 65.8 | 71.0 | 39.8 | **68.2** | **73.1** | 86.1 | 64.6 | 1531.0 | 315.4 |
235
+ | **LLaVA-v1.5-LLaMA3_1-8B-S2-siglip** | 63.5 | 77.1 | 62.7 | 44.7 | 65.5 | 71.0 | **40.0** | 68.0 | 71.8 | 86.0 | 64.9 | 1541.4 | 336.4 |
236
+ | **LLaVA-v1.5-Phi_4-4B** | 54.0 | 71.3 | 61.1 | 42.3 | 63.5 | 69.1 | 38.8 | 64.2 | 69.2 | 85.9 | 62.1 | 1372.2 | 281.1 |
237
+ | **LLaVA-v1.5-gemma_2-9B** | 60.7 | 75.4 | 64.8 | 44.1 | 64.5 | 69.9 | 37.9 | 65.9 | 71.9 | **86.8** | 64.2 | 1522.5 | 307.5 |
238
+ | **LLaVA-v1.5-gemma_2-9B-siglip2** | **66.7** | 76.2 | **65.3** | 46.0 | **67.5** | **73.1** | 38.7 | 68.0 | 72.0 | 86.1 | **65.6** | 1510.9 | 308.2 |
239
+ | **LLaVA-v1.5-Distill-LLaMA-8B** | 56.3 | 74.5 | 58.8 | 43.5 | 63.5 | 68.6 | 38.1 | 66.8 | 61.3 | 85.1 | 63.0 | 1495.1 | 295.0 |
240
 
 
241
 
242
+ </div>
243
 
244
+ \* TextVQA results are computed with OCR tokens included in the input prompt. **Models in bold are LLaVA-MORE variants.**
245
 
246
+ ## Checkpoints
247
 
248
+ For a complete list of all LLaVA-MORE checkpoints, you can refer to the [Hugging Face model collection](https://huggingface.co/collections/aimagelab/llava-more-66aa6c49167e190bf27e7be4).
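+ If you prefer to enumerate the checkpoints programmatically, the sketch below lists the items in that collection with `huggingface_hub` (assuming a recent version that exposes `get_collection`).
+
+ ```python
+ # Minimal sketch: list every item in the LLaVA-MORE collection.
+ # Assumes a recent huggingface_hub release that provides get_collection.
+ from huggingface_hub import get_collection
+
+ collection = get_collection("aimagelab/llava-more-66aa6c49167e190bf27e7be4")
+ for item in collection.items:
+     print(item.item_type, item.item_id)
+ ```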
249
 
250
+ ## Acknowledgments
251
+ We thank the [LLaVA](https://github.com/haotian-liu/LLaVA.git) team for open-sourcing a modular codebase to extend and train different models within the LLaVA family. We are also happy users of the [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval.git) library, which has significantly reduced the evaluation time of our checkpoints across different datasets.
252
 
253
+ We also thank [CINECA](https://www.hpc.cineca.it/systems/hardware/leonardo/) for the availability of high-performance computing resources used to train LLaVA-MORE. This work is supported by the PNRR-M4C2 project [FAIR - Future Artificial Intelligence Research](https://fondazione-fair.it/) and by the PNRR project [ITSERR - Italian Strengthening of Esfri RI Resilience](https://www.itserr.it/).
254
 
255
+ ## Citation
256
+ If you make use of our work, please cite our paper:
257
 
258
+ ```bibtex
259
+ @inproceedings{cocchi2025llava,
260
+ title={{LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning}},
261
+ author={Cocchi, Federico and Moratelli, Nicholas and Caffagni, Davide and Sarto, Sara and Baraldi, Lorenzo and Cornia, Marcella and Cucchiara, Rita},
262
+ booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops},
263
+ year={2025}
264
+ }
265
+ ```