TongkunGuan committed on
Commit 3e49e40 · verified · 1 Parent(s): 44d226f

Update README.md

Files changed (1)
  1. README.md +19 -19
README.md CHANGED
@@ -1,7 +1,7 @@
 ---
 license: mit
 pipeline_tag: image-feature-extraction
-base_model: TokenOCR
+base_model: TokenFD
 base_model_relation: finetune
 ---
 
@@ -10,7 +10,7 @@ base_model_relation: finetune
 <h1 style="color: black;">A Token-level Text Image Foundation Model for Document Understanding</h1>
 
 
-[\[📂 GitHub\]](https://github.com/Token-family/TokenOCR) [\[📖 Paper\]]() [\[🆕 Blog\]]() [\[🤗 HF Demo\]](https://huggingface.co/spaces/TongkunGuan/TokenOCR) [\[🚀 Quick Start\]](#quick-start)
+[\[📂 GitHub\]](https://github.com/Token-family/TokenFD) [\[📖 Paper\]](https://arxiv.org/pdf/2503.02304) [\[🆕 Project_page\]](https://token-family.github.io/project_page/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/TongkunGuan/TokenFD) [\[🚀 Quick Start\]](#quick-start)
 
 </center>
 
@@ -28,12 +28,12 @@ base_model_relation: finetune
 
 </center>
 
-We are excited to announce the release of **`TokenOCR`**, the first token-level visual foundation model specifically tailored for text-image-related tasks,
-designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR,
+We are excited to announce the release of **`TokenFD`**, the first token-level visual foundation model specifically tailored for text-image-related tasks,
+designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenFD,
 we also devise a high-quality data production pipeline that constructs the first token-level image-text dataset,
 **`TokenIT`**, comprising 20 million images and 1.8 billion token-mask pairs.
 Furthermore, leveraging this foundation with exceptional image-as-text capability,
-we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks.
+we seamlessly replace previous VFMs with TokenFD to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks.
 
 <center>
 
@@ -68,16 +68,16 @@ The comparisons with other visual foundation models:
 | [CLIP](https://github.com/openai/CLIP) | image-level | WIT400M | 400M | 0.4B |
 | [DINO](https://github.com/facebookresearch/dino) | image-level | ImageNet | 14M | - |
 | [SAM](https://github.com/facebookresearch/SAM) | pixel-level | SA1B | 11M | 1.1B |
-| **TokenOCR** | **token-level** | **TokenIT** | **20M** | **1.8B** |
+| **TokenFD** | **token-level** | **TokenIT** | **20M** | **1.8B** |
 
 
-<!-- ## TokenOCR
+<!-- ## TokenFD
 -->
-<h2 style="color: #4CAF50;">TokenOCR</h2>
+<h2 style="color: #4CAF50;">TokenFD</h2>
 
 ### Model Architecture
 
-An overview of the proposed TokenOCR, where the token-level image features and token-level language
+An overview of the proposed TokenFD, where the token-level image features and token-level language
 features are aligned within the same semantic space. This “image-as-text” alignment seamlessly facilitates user-interactive
 applications, including text segmentation, retrieval, and visual question answering.
 
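Concretely, this alignment amounts to plain cosine similarity between L2-normalized token-level image features and token-level text embeddings. A minimal sketch of the idea (tensor names and shapes are illustrative, not the repo's API):

```python
import torch

# Illustrative shapes: N patch tokens from the image encoder, T BPE tokens from a query,
# and a shared feature dimension d (2048 or 4096 depending on the checkpoint).
image_tokens = torch.randn(1024, 2048)  # token-level image features
text_tokens = torch.randn(5, 2048)      # token-level language embeddings

# L2-normalize both sides so the dot product is cosine similarity.
image_tokens = image_tokens / image_tokens.norm(dim=-1, keepdim=True)
text_tokens = text_tokens / text_tokens.norm(dim=-1, keepdim=True)

# (T, N): one response map over the image tokens per text token. Thresholding a map
# yields a text segmentation mask; pooling it yields retrieval or VQA scores.
similarity = text_tokens @ image_tokens.t()
```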
@@ -87,19 +87,19 @@ applications, including text segmentation, retrieval, and visual question answering.
 
 ### Model Cards
 
-In the following table, we provide all models [🤗 link] of the TokenOCR series.
+In the following table, we provide all models [🤗 link] of the TokenFD series.
 
 | Model Name | Description |
 | :-----------------------: | :-------------------------------------------------------------------: |
-| TokenOCR-4096-English | feature dimension is 4096; supports interaction with English text. |
-| TokenOCR-4096-Chinese | feature dimension is 4096; supports interaction with Chinese text. |
-| TokenOCR-2048-Bilingual | feature dimension is 2048; supports interaction with English and Chinese text. |
-| TokenOCR-4096-English-seg | Based on `TokenOCR-4096-English`, with background noise filtered out. You can use the prompt ' ' to get a highlighted background. |
+| TokenFD-4096-English | feature dimension is 4096; supports interaction with English text. |
+| TokenFD-4096-Chinese | feature dimension is 4096; supports interaction with Chinese text. |
+| TokenFD-2048-Bilingual | feature dimension is 2048; supports interaction with English and Chinese text. |
+| TokenFD-4096-English-seg | Based on `TokenFD-4096-English`, with background noise filtered out. You can use the prompt ' ' to get a highlighted background. |
 
 ### Quick Start
 
 > \[!Warning\]
-> 🚨 Note: In our experience, the `TokenOCR-2048-Bilingual` series is better suited for building MLLMs than the `-seg` version.
+> 🚨 Note: In our experience, the `TokenFD-2048-Bilingual` series is better suited for building MLLMs than the `-seg` version.
 
 ```python
 import os
@@ -109,7 +109,7 @@ from transformers import AutoTokenizer
 from internvl.model.internvl_chat import InternVLChatConfig, InternVisionModel
 from utils import post_process, generate_similiarity_map, load_model, load_image
 
-checkpoint = 'TongkunGuan/TokenOCR_4096_English_seg'
+checkpoint = 'TongkunGuan/TokenFD_4096_English_seg'
 image_path = './demo_images/0000000.png'
 input_query = '11/12/2020'
 out_dir = 'results'
@@ -137,7 +137,7 @@ all_bpe_strings = [tokenizer.decode(input_id) for input_id in input_ids]
 
 
 """Obtaining similarity """
-vit_embeds = model.forward_tokenocr(pixel_values.to(model.device)) #(vit_batch_size, 16*16, 2048)
+vit_embeds = model.forward_TokenFD(pixel_values.to(model.device)) #(vit_batch_size, 16*16, 2048)
 vit_embeds_local, resized_size = post_process(vit_embeds, target_aspect_ratio)
 token_features = vit_embeds_local / vit_embeds_local.norm(dim=-1, keepdim=True)
 input_embedings = input_embeds / input_embeds.norm(dim=-1, keepdim=True)
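The snippet stops before the similarity map is rendered. A hedged continuation, assuming `resized_size` is the (H, W) patch grid returned by `post_process`; the repo's `generate_similiarity_map` helper presumably wraps a similar computation:

```python
import torch.nn.functional as F

# Cosine similarity between every BPE token and every image token: (num_bpe_tokens, H*W).
similarity = input_embedings @ token_features.t()

# Fold each token's response back into the 2-D patch grid (assumed layout).
H, W = resized_size
similarity_maps = similarity.reshape(-1, H, W)

# Upsample the first token's map for visualization; the factor is illustrative only.
heatmap = F.interpolate(similarity_maps[:1].unsqueeze(0), scale_factor=16.0, mode='bilinear')[0, 0]
```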
@@ -161,7 +161,7 @@ The evaluation is divided into two key categories:
 (2) image segmentation;
 (3) visual question answering;
 
-This approach allows us to assess the representation quality of TokenOCR.
+This approach allows us to assess the representation quality of TokenFD.
 Please refer to our technical report for more details.
 
 #### text retrieval
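A simple scoring rule for this task is to rank candidate strings by the peak response their token embeddings produce over the image tokens; a sketch under that assumption, not necessarily the paper's exact protocol:

```python
import torch

def retrieval_score(text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> float:
    """Score one candidate string against one image: average, over its BPE tokens,
    of the best-matching image token's cosine similarity."""
    text_tokens = text_tokens / text_tokens.norm(dim=-1, keepdim=True)
    image_tokens = image_tokens / image_tokens.norm(dim=-1, keepdim=True)
    sim = text_tokens @ image_tokens.t()          # (T, N)
    return sim.max(dim=-1).values.mean().item()   # peak response per token, averaged
```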
@@ -193,7 +193,7 @@ Please refer to our technical report for more details.
 <!-- ## TokenVL -->
 <h2 style="color: #4CAF50;">TokenVL</h2>
 
-We employ TokenOCR as the visual foundation model and further develop an MLLM, named TokenVL, tailored for document understanding.
+We employ TokenFD as the visual foundation model and further develop an MLLM, named TokenVL, tailored for document understanding.
 Following the previous training paradigm, TokenVL also includes two stages:
 
 **Stage 1: LLM-guided Token Alignment Training for text parsing tasks.**