aarbelle committed
Commit 191f434 · verified · 1 Parent(s): fdb5d25

Update README.md


Modify table to represent true parameter number, move disclaimer to the top

Files changed (1): README.md (+6, -7)
README.md CHANGED

@@ -11,12 +11,17 @@ library_name: transformers
 **Model Summary:**
 granite-vision-3.1-2b-preview is a compact and efficient vision-language model, specifically designed for visual document understanding, enabling automated content extraction from tables, charts, infographics, plots, diagrams, and more. The model was trained on a meticulously curated instruction-following dataset, comprising diverse public datasets and synthetic datasets tailored to support a wide range of document understanding and general image tasks. It was trained by fine-tuning a Granite large language model (https://huggingface.co/ibm-granite/granite-3.1-2b-instruct) with both image and text modalities.
 
+ _Note:_
+
+ We denote our model as Granite-Vision-3.1-2B-Preview, where the version (3.1) and size (2B) of the base large language model
+ are explicitly indicated. However, when considering the integrated vision encoder and projector, the total parameter count of our
+ model increases to 3 billion parameters.
 
 **Evaluations:**
 
 We evaluated Granite Vision 3.1 alongside other vision-language models (VLMs) in the 1B-4B parameter range using the standard llms-eval benchmark. The evaluation spanned multiple public benchmarks, with particular emphasis on document understanding tasks while also including general visual question-answering benchmarks.
 
-| | Molmo-E (1B) | InternVL2 (2B) | Phi3v (4B) | Phi3.5v (4B) | Granite Vision 3.1 (2B) |
+| | Molmo-E (1B) | InternVL2 (2B) | Phi3v (4B) | Phi3.5v (4B) | Granite Vision (3B) |
 |-----------|--------------|----------------|-------------|------------|------------|
 | **Document benchmarks** |
 | DocVQA | 0.66 | 0.87 | 0.87 | **0.88** | **0.88** |

@@ -158,12 +163,6 @@ The architecture of granite-vision-3.1-2b-preview consists of the following components
 
 We built upon LlaVA (https://llava-vl.github.io) to train our model. We use multi-layer encoder features and a denser grid resolution in AnyRes to enhance the model's ability to understand nuanced visual content, which is essential for accurately interpreting document images.
 
- _Note:_
-
- We denote our model as Granite-Vision-3.1-2B-Preview, where the version (3.1) and size (2B) of the base large language model
- are explicitly indicated. However, when considering the integrated vision encoder and projector, the total parameter count of our
- model increases to 3 billion parameters.
-
 
 **Training Data:**
 
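
The substance of this change is that the comparison table now reports the model's full parameter count (roughly 3B once the vision encoder and projector are added to the 2B language model) rather than the 2B of the base LLM alone. A minimal sketch of how that figure could be sanity-checked is below; the Hub repo id and the generic `AutoModelForVision2Seq` loading path are assumptions, not something stated in this commit.

```python
# Minimal sketch: load the checkpoint and count parameters to confirm the ~3B total.
# Assumptions: the repo id below, and that the checkpoint loads through the generic
# AutoModelForVision2Seq interface; adjust if a specific model class is required.
import torch
from transformers import AutoModelForVision2Seq

model_id = "ibm-granite/granite-vision-3.1-2b-preview"  # assumed Hub repo id
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# The sum covers the whole stack: vision encoder + projector + language model.
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total / 1e9:.2f}B")  # expected to land around 3B
```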
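
The architecture note in the second hunk mentions using multi-layer encoder features ahead of the projector. As a rough, illustrative-only sketch of that general idea (the layer count, dimensions, and two-layer MLP projector below are invented for the example, not the model's actual configuration), hidden states from several vision-encoder layers can be concatenated and then projected into the LLM embedding space:

```python
# Illustrative sketch of multi-layer encoder features: concatenate hidden states
# from several vision-encoder layers, then project them into the LLM embedding
# space. All shapes and the projector design here are invented for the example.
import torch
import torch.nn as nn

class MultiLayerProjector(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int, num_layers_used: int):
        super().__init__()
        # Map the concatenated per-layer features to the LLM hidden size.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim * num_layers_used, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: one (batch, num_patches, vision_dim) tensor per selected layer.
        return self.proj(torch.cat(hidden_states, dim=-1))

# Dummy tensors standing in for three selected encoder layers.
layers = [torch.randn(1, 576, 1152) for _ in range(3)]
projector = MultiLayerProjector(vision_dim=1152, llm_dim=2048, num_layers_used=3)
image_embeds = projector(layers)  # shape (1, 576, 2048), ready to interleave with text tokens
```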