Update README.md
Modify table to represent true parameter number, move disclaimer to the top
README.md
CHANGED
@@ -11,12 +11,17 @@ library_name: transformers
 **Model Summary:**
 granite-vision-3.1-2b-preview is a compact and efficient vision-language model, specifically designed for visual document understanding, enabling automated content extraction from tables, charts, infographics, plots, diagrams, and more. The model was trained on a meticulously curated instruction-following dataset, comprising diverse public datasets and synthetic datasets tailored to support a wide range of document understanding and general image tasks. It was trained by fine-tuning a Granite large language model (https://huggingface.co/ibm-granite/granite-3.1-2b-instruct) with both image and text modalities.

+_Note:_
+
+We denote our model as Granite-Vision-3.1-2B-Preview, where the version (3.1) and size (2B) of the base large language model
+are explicitly indicated. However, when considering the integrated vision encoder and projector, the total parameter count of our
+model increases to 3 billion parameters.

 **Evaluations:**

 We evaluated Granite Vision 3.1 alongside other vision-language models (VLMs) in the 1B-4B parameter range using the standard llms-eval benchmark. The evaluation spanned multiple public benchmarks, with particular emphasis on document understanding tasks while also including general visual question-answering benchmarks.

-| | Molmo-E (1B) | InternVL2 (2B) | Phi3v (4B) | Phi3.5v (4B) | Granite Vision
+| | Molmo-E (1B) | InternVL2 (2B) | Phi3v (4B) | Phi3.5v (4B) | Granite Vision (3B) |
 |-----------|--------------|----------------|-------------|------------|------------|
 | **Document benchmarks** |
 | DocVQA | 0.66 | 0.87 | 0.87 | **0.88** | **0.88** |
@@ -158,12 +163,6 @@ The architecture of granite-vision-3.1-2b-preview consists of the following comp

 We built upon LlaVA (https://llava-vl.github.io) to train our model. We use multi-layer encoder features and a denser grid resolution in AnyRes to enhance the model's ability to understand nuanced visual content, which is essential for accurately interpreting document images.

-_Note:_
-
-We denote our model as Granite-Vision-3.1-2B-Preview, where the version (3.1) and size (2B) of the base large language model
-are explicitly indicated. However, when considering the integrated vision encoder and projector, the total parameter count of our
-model increases to 3 billion parameters.
-

 **Training Data:**

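The note moved to the top of the card clarifies the naming: "2B" refers only to the base Granite language model, while the full model with the vision encoder and projector totals roughly 3 billion parameters, which is why the results table now reads "Granite Vision (3B)". Below is a minimal sketch of how that breakdown could be checked; it assumes the checkpoint loads through transformers' `AutoModelForVision2Seq` and exposes LLaVA-Next-style submodule names (`language_model`, `vision_tower`, `multi_modal_projector`), which may not match the actual implementation.

```python
# Sketch: check the ~3B total parameter count described in the note.
# Assumes the checkpoint loads via AutoModelForVision2Seq and uses
# LLaVA-Next-style attribute names; these are assumptions, not the card's code.
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("ibm-granite/granite-vision-3.1-2b-preview")

def count(module):
    # Total number of parameters in a module.
    return sum(p.numel() for p in module.parameters())

print(f"total:     {count(model) / 1e9:.2f}B")                       # expected to be ~3B per the note
print(f"LLM:       {count(model.language_model) / 1e9:.2f}B")        # granite-3.1-2b-instruct backbone
print(f"vision:    {count(model.vision_tower) / 1e9:.2f}B")          # vision encoder
print(f"projector: {count(model.multi_modal_projector) / 1e9:.2f}B") # projector
```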
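The architecture paragraph kept by this change mentions multi-layer encoder features. As a purely illustrative sketch of that idea, and not the model's actual code, one can request all hidden states from a vision encoder and concatenate a few selected layers before they would be handed to the projector; the CLIP checkpoint and layer indices here are placeholders.

```python
# Hypothetical sketch of "multi-layer encoder features": concatenate hidden states
# from several vision-encoder layers instead of using only the final layer.
# The encoder checkpoint and layer indices are illustrative, not the ones used
# by granite-vision-3.1-2b-preview.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.new("RGB", (336, 336), "white")  # placeholder image
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    out = encoder(pixel_values, output_hidden_states=True)

# Pick a few intermediate layers and concatenate them along the feature dimension;
# in a LLaVA-style model the combined features would then go through the projector.
layers = [-24, -12, -1]  # illustrative choice
features = torch.cat([out.hidden_states[i] for i in layers], dim=-1)
print(features.shape)  # (1, num_patches + 1, hidden_size * len(layers))
```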
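For readers coming to this card for the document-understanding use cases described in the summary, a minimal usage sketch is shown below. The model class, chat-template call, and placeholder image path are assumptions based on how recent transformers releases handle LLaVA-Next-style checkpoints; the snippet on the model card itself may differ.

```python
# Minimal sketch (not from the card): ask granite-vision-3.1-2b-preview a question
# about a document image. Assumes a recent transformers release with chat-template
# support for image inputs; API details may differ.
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "ibm-granite/granite-vision-3.1-2b-preview"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id).to(device)

# Placeholder inputs: a local chart image and a document-understanding question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "chart.png"},
            {"type": "text", "text": "What is the highest value shown in this chart?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```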