---
license: mit
pipeline_tag: image-feature-extraction
base_model: TokenFD
base_model_relation: finetune
---

<h1 style="color: black;">A Token-level Text Image Foundation Model for Document Understanding</h1>

[\[GitHub\]](https://github.com/Token-family/TokenFD) [\[Paper\]](https://arxiv.org/pdf/2503.02304) [\[Project Page\]](https://token-family.github.io/project_page/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/TongkunGuan/TokenFD) [\[Quick Start\]](#quick-start)

</center>

</center>

We are excited to announce the release of **`TokenFD`**, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenFD, we also devise a high-quality data production pipeline that constructs the first token-level image-text dataset, **`TokenIT`**, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation and its exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenFD to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks.

<center>

The comparisons with other visual foundation models:

| Model | Granularity | Pretraining Data | #Images | #Annotations |
| :---: | :---: | :---: | :---: | :---: |
| [CLIP](https://github.com/openai/CLIP) | image-level | WIT400M | 400M | 0.4B |
| [DINO](https://github.com/facebookresearch/dino) | image-level | ImageNet | 14M | - |
| [SAM](https://github.com/facebookresearch/SAM) | pixel-level | SA1B | 11M | 1.1B |
| **TokenFD** | **token-level** | **TokenIT** | **20M** | **1.8B** |

<!-- ## TokenFD -->
<h2 style="color: #4CAF50;">TokenFD</h2>

### Model Architecture

An overview of the proposed TokenFD, where the token-level image features and token-level language features are aligned within the same semantic space. This "image-as-text" alignment seamlessly facilitates user-interactive applications, including text segmentation, retrieval, and visual question answering.
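
To make this alignment concrete, here is a minimal sketch in plain PyTorch. The shapes, names, and random tensors are illustrative stand-ins for real TokenFD outputs, not the model's actual API:

```python
import torch

# Stand-ins for aligned features: a 32x32 grid of visual tokens and
# 5 BPE text tokens, both living in a shared 2048-dim semantic space.
image_tokens = torch.randn(1024, 2048)  # token-level image features
text_tokens = torch.randn(5, 2048)      # token-level language features

# L2-normalize both sides so the dot product is a cosine similarity,
# mirroring the normalization used in the Quick Start snippet below.
image_tokens = image_tokens / image_tokens.norm(dim=-1, keepdim=True)
text_tokens = text_tokens / text_tokens.norm(dim=-1, keepdim=True)

# (5, 1024): each row scores one text token against every image token;
# reshaped to the feature grid, a row acts as a soft segmentation mask.
similarity = text_tokens @ image_tokens.t()
masks = similarity.reshape(-1, 32, 32)
```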

### Model Cards

In the following table, we provide all models [🤗 link] of the TokenFD series.

| Model Name | Description |
| :-----------------------: | :-------------------------------------------------------------------: |
| TokenFD-4096-English | Feature dimension is 4096; supports interaction with English text. |
| TokenFD-4096-Chinese | Feature dimension is 4096; supports interaction with Chinese text. |
| TokenFD-2048-Bilingual | Feature dimension is 2048; supports interaction with both English and Chinese text. |
| TokenFD-4096-English-seg | Based on `TokenFD-4096-English`, with background noise filtered out; use the prompt ' ' to get a highlighted background. |

### Quick Start

> \[!Warning\]
> Note: In our experience, the `TokenFD-2048-Bilingual` series is better suited for building MLLMs than the `-seg` version.

```python
import os
from transformers import AutoTokenizer
from internvl.model.internvl_chat import InternVLChatConfig, InternVisionModel
from utils import post_process, generate_similiarity_map, load_model, load_image

checkpoint = 'TongkunGuan/TokenFD_4096_English_seg'
image_path = './demo_images/0000000.png'
input_query = '11/12/2020'
out_dir = 'results'

# ... (model/tokenizer loading and image preprocessing are omitted in this excerpt)
all_bpe_strings = [tokenizer.decode(input_id) for input_id in input_ids]

"""Obtaining similarity"""
vit_embeds = model.forward_TokenFD(pixel_values.to(model.device))  # (vit_batch_size, 16*16, 2048)
vit_embeds_local, resized_size = post_process(vit_embeds, target_aspect_ratio)
token_features = vit_embeds_local / vit_embeds_local.norm(dim=-1, keepdim=True)
input_embedings = input_embeds / input_embeds.norm(dim=-1, keepdim=True)
# ... (similarity-map generation and saving are omitted in this excerpt)
```
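
The elided final step turns these normalized features into a per-query similarity heatmap. A minimal continuation sketch in plain PyTorch (the tensor shapes and the 448x448 target size are assumptions; the repo's own `generate_similiarity_map` helper may differ):

```python
import torch
import torch.nn.functional as F

# Stand-ins for the tensors computed above: a 32x32 grid of visual tokens
# and 7 BPE tokens of the query, all L2-normalized in a 2048-dim space.
resized_size = (32, 32)
token_features = torch.randn(resized_size[0] * resized_size[1], 2048)
input_embedings = torch.randn(7, 2048)
token_features = token_features / token_features.norm(dim=-1, keepdim=True)
input_embedings = input_embedings / input_embedings.norm(dim=-1, keepdim=True)

# Cosine similarity of every BPE token against every visual token: (7, 1024).
similarity = input_embedings @ token_features.t()

# Fold the visual axis back into the 2D feature grid and upsample so the
# map can be overlaid on the input image as a heatmap.
h, w = resized_size
sim_map = similarity.reshape(-1, 1, h, w)
sim_map = F.interpolate(sim_map, size=(448, 448), mode='bilinear', align_corners=False)
```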

The evaluation is divided into three key categories:
(1) text retrieval;
(2) image segmentation;
(3) visual question answering;

This approach allows us to assess the representation quality of TokenFD.
Please refer to our technical report for more details.

#### text retrieval

<!-- ## TokenVL -->
<h2 style="color: #4CAF50;">TokenVL</h2>

We employ TokenFD as the visual foundation model and further develop an MLLM, named TokenVL, tailored for document understanding.
Following the previous training paradigm, TokenVL also includes two stages:

**Stage 1: LLM-guided Token Alignment Training for text parsing tasks.**