---
license: mit
pipeline_tag: image-feature-extraction
base_model: TokenFD
base_model_relation: finetune
---

<h1 style="color: black;">A Token-level Text Image Foundation Model for Document Understanding</h1>

[\[GitHub\]](https://github.com/Token-family/TokenFD) [\[Paper\]](https://arxiv.org/pdf/2503.02304) [\[Project Page\]](https://token-family.github.io/project_page/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/TongkunGuan/TokenFD) [\[Quick Start\]](#quick-start)

</center>

</center>

We are excited to announce the release of **`TokenFD`**, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenFD, we also devise a high-quality data production pipeline that constructs the first token-level image-text dataset, **`TokenIT`**, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation and its exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenFD to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks.

<center>

The comparisons with other visual foundation models:

| Model | Granularity | Pretraining Data | #Images | #Annotations |
| :---: | :---: | :---: | :---: | :---: |
| [CLIP](https://github.com/openai/CLIP) | image-level | WIT400M | 400M | 0.4B |
| [DINO](https://github.com/facebookresearch/dino) | image-level | ImageNet | 14M | - |
| [SAM](https://github.com/facebookresearch/SAM) | pixel-level | SA1B | 11M | 1.1B |
| **TokenFD** | **token-level** | **TokenIT** | **20M** | **1.8B** |

<!-- ## TokenFD -->
<h2 style="color: #4CAF50;">TokenFD</h2>

### Model Architecture

An overview of the proposed TokenFD, where the token-level image features and token-level language features are aligned within the same semantic space. This "image-as-text" alignment seamlessly facilitates user-interactive applications, including text segmentation, retrieval, and visual question answering.
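
To make this alignment concrete, here is a minimal sketch in plain PyTorch. The shapes, names, and random tensors are illustrative stand-ins for real TokenFD outputs, not the model's actual API:

```python
import torch

# Stand-ins for aligned features: a 32x32 grid of visual tokens and
# 5 BPE text tokens, both living in a shared 2048-dim semantic space.
image_tokens = torch.randn(1024, 2048)  # token-level image features
text_tokens = torch.randn(5, 2048)      # token-level language features

# L2-normalize both sides so the dot product is a cosine similarity,
# mirroring the normalization used in the Quick Start snippet below.
image_tokens = image_tokens / image_tokens.norm(dim=-1, keepdim=True)
text_tokens = text_tokens / text_tokens.norm(dim=-1, keepdim=True)

# (5, 1024): each row scores one text token against every image token;
# reshaped to the feature grid, a row acts as a soft segmentation mask.
similarity = text_tokens @ image_tokens.t()
masks = similarity.reshape(-1, 32, 32)
```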

### Model Cards

In the following table, we provide all models [🤗 link] of the TokenFD series.

| Model Name | Description |
| :-----------------------: | :-------------------------------------------------------------------: |
| TokenFD-4096-English | Feature dimension is 4096; supports interaction with English text. |
| TokenFD-4096-Chinese | Feature dimension is 4096; supports interaction with Chinese text. |
| TokenFD-2048-Bilingual | Feature dimension is 2048; supports interaction with both English and Chinese text. |
| TokenFD-4096-English-seg | Based on `TokenFD-4096-English`, with background noise filtered out; use the prompt ' ' to get a highlighted background. |

### Quick Start

> \[!Warning\]
> Note: In our experience, the `TokenFD-2048-Bilingual` series is better suited for building MLLMs than the `-seg` version.

```python
import os
from transformers import AutoTokenizer
from internvl.model.internvl_chat import InternVLChatConfig, InternVisionModel
from utils import post_process, generate_similiarity_map, load_model, load_image

checkpoint = 'TongkunGuan/TokenFD_4096_English_seg'
image_path = './demo_images/0000000.png'
input_query = '11/12/2020'
out_dir = 'results'

# ... (model/tokenizer loading and image preprocessing are omitted in this excerpt)
all_bpe_strings = [tokenizer.decode(input_id) for input_id in input_ids]

"""Obtaining similarity"""
vit_embeds = model.forward_TokenFD(pixel_values.to(model.device))  # (vit_batch_size, 16*16, 2048)
vit_embeds_local, resized_size = post_process(vit_embeds, target_aspect_ratio)
token_features = vit_embeds_local / vit_embeds_local.norm(dim=-1, keepdim=True)
input_embedings = input_embeds / input_embeds.norm(dim=-1, keepdim=True)
# ... (similarity-map generation and saving are omitted in this excerpt)
```
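
The elided final step turns these normalized features into a per-query similarity heatmap. A minimal continuation sketch in plain PyTorch (the tensor shapes and the 448x448 target size are assumptions; the repo's own `generate_similiarity_map` helper may differ):

```python
import torch
import torch.nn.functional as F

# Stand-ins for the tensors computed above: a 32x32 grid of visual tokens
# and 7 BPE tokens of the query, all L2-normalized in a 2048-dim space.
resized_size = (32, 32)
token_features = torch.randn(resized_size[0] * resized_size[1], 2048)
input_embedings = torch.randn(7, 2048)
token_features = token_features / token_features.norm(dim=-1, keepdim=True)
input_embedings = input_embedings / input_embedings.norm(dim=-1, keepdim=True)

# Cosine similarity of every BPE token against every visual token: (7, 1024).
similarity = input_embedings @ token_features.t()

# Fold the visual axis back into the 2D feature grid and upsample so the
# map can be overlaid on the input image as a heatmap.
h, w = resized_size
sim_map = similarity.reshape(-1, 1, h, w)
sim_map = F.interpolate(sim_map, size=(448, 448), mode='bilinear', align_corners=False)
```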

The evaluation is divided into three key categories:
(1) text retrieval;
(2) image segmentation;
(3) visual question answering;

This approach allows us to assess the representation quality of TokenFD.
Please refer to our technical report for more details.

#### text retrieval

<!-- ## TokenVL -->
<h2 style="color: #4CAF50;">TokenVL</h2>

We employ TokenFD as the visual foundation model and further develop an MLLM, named TokenVL, tailored for document understanding.
Following the previous training paradigm, TokenVL also includes two stages:

**Stage 1: LLM-guided Token Alignment Training for text parsing tasks.**