Update README.md
Modify table to represent true parameter number, move disclaimer to the top
README.md
CHANGED
@@ -11,12 +11,17 @@ library_name: transformers
 **Model Summary:**
 granite-vision-3.1-2b-preview is a compact and efficient vision-language model, specifically designed for visual document understanding, enabling automated content extraction from tables, charts, infographics, plots, diagrams, and more. The model was trained on a meticulously curated instruction-following dataset, comprising diverse public datasets and synthetic datasets tailored to support a wide range of document understanding and general image tasks. It was trained by fine-tuning a Granite large language model (https://huggingface.co/ibm-granite/granite-3.1-2b-instruct) with both image and text modalities.

+_Note:_
+
+We denote our model as Granite-Vision-3.1-2B-Preview, where the version (3.1) and size (2B) of the base large language model
+are explicitly indicated. However, when considering the integrated vision encoder and projector, the total parameter count of our
+model increases to 3 billion parameters.

 **Evaluations:**

 We evaluated Granite Vision 3.1 alongside other vision-language models (VLMs) in the 1B-4B parameter range using the standard llms-eval benchmark. The evaluation spanned multiple public benchmarks, with particular emphasis on document understanding tasks while also including general visual question-answering benchmarks.

-| | Molmo-E (1B) | InternVL2 (2B) | Phi3v (4B) | Phi3.5v (4B) | Granite Vision
+| | Molmo-E (1B) | InternVL2 (2B) | Phi3v (4B) | Phi3.5v (4B) | Granite Vision (3B) |
 |-----------|--------------|----------------|-------------|------------|------------|
 | **Document benchmarks** |
 | DocVQA | 0.66 | 0.87 | 0.87 | **0.88** | **0.88** |
@@ -158,12 +163,6 @@ The architecture of granite-vision-3.1-2b-preview consists of the following comp

 We built upon LlaVA (https://llava-vl.github.io) to train our model. We use multi-layer encoder features and a denser grid resolution in AnyRes to enhance the model's ability to understand nuanced visual content, which is essential for accurately interpreting document images.

-_Note:_
-
-We denote our model as Granite-Vision-3.1-2B-Preview, where the version (3.1) and size (2B) of the base large language model
-are explicitly indicated. However, when considering the integrated vision encoder and projector, the total parameter count of our
-model increases to 3 billion parameters.
-

 **Training Data:**

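The note moved to the top of the card clarifies the naming: "2B" refers only to the base Granite language model, while the full model with the vision encoder and projector totals roughly 3 billion parameters, which is why the results table now reads "Granite Vision (3B)". Below is a minimal sketch of how that breakdown could be checked; it assumes the checkpoint loads through transformers' `AutoModelForVision2Seq` and exposes LLaVA-Next-style submodule names (`language_model`, `vision_tower`, `multi_modal_projector`), which may not match the actual implementation.

```python
# Sketch: check the ~3B total parameter count described in the note.
# Assumes the checkpoint loads via AutoModelForVision2Seq and uses
# LLaVA-Next-style attribute names; these are assumptions, not the card's code.
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("ibm-granite/granite-vision-3.1-2b-preview")

def count(module):
    # Total number of parameters in a module.
    return sum(p.numel() for p in module.parameters())

print(f"total:     {count(model) / 1e9:.2f}B")                       # expected to be ~3B per the note
print(f"LLM:       {count(model.language_model) / 1e9:.2f}B")        # granite-3.1-2b-instruct backbone
print(f"vision:    {count(model.vision_tower) / 1e9:.2f}B")          # vision encoder
print(f"projector: {count(model.multi_modal_projector) / 1e9:.2f}B") # projector
```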
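The architecture paragraph kept by this change mentions multi-layer encoder features. As a purely illustrative sketch of that idea, and not the model's actual code, one can request all hidden states from a vision encoder and concatenate a few selected layers before they would be handed to the projector; the CLIP checkpoint and layer indices here are placeholders.

```python
# Hypothetical sketch of "multi-layer encoder features": concatenate hidden states
# from several vision-encoder layers instead of using only the final layer.
# The encoder checkpoint and layer indices are illustrative, not the ones used
# by granite-vision-3.1-2b-preview.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.new("RGB", (336, 336), "white")  # placeholder image
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    out = encoder(pixel_values, output_hidden_states=True)

# Pick a few intermediate layers and concatenate them along the feature dimension;
# in a LLaVA-style model the combined features would then go through the projector.
layers = [-24, -12, -1]  # illustrative choice
features = torch.cat([out.hidden_states[i] for i in layers], dim=-1)
print(features.shape)  # (1, num_patches + 1, hidden_size * len(layers))
```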
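For readers coming to this card for the document-understanding use cases described in the summary, a minimal usage sketch is shown below. The model class, chat-template call, and placeholder image path are assumptions based on how recent transformers releases handle LLaVA-Next-style checkpoints; the snippet on the model card itself may differ.

```python
# Minimal sketch (not from the card): ask granite-vision-3.1-2b-preview a question
# about a document image. Assumes a recent transformers release with chat-template
# support for image inputs; API details may differ.
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "ibm-granite/granite-vision-3.1-2b-preview"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id).to(device)

# Placeholder inputs: a local chart image and a document-understanding question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "chart.png"},
            {"type": "text", "text": "What is the highest value shown in this chart?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```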