Visually Guided Generative Text-Layout Pre-training for Document Intelligence

ViTLP is a generative foundation model for document intelligence, proposed in the paper Visually Guided Generative Text-Layout Pre-training for Document Intelligence. We provide the pre-trained checkpoint ViTLP-medium (380M). The pre-trained ViTLP model can natively perform OCR text localization and recognition.

Demo on Document Text Recognition & Localization

The ViTLP inference and demo code is available at https://github.com/Veason-silverbullet/ViTLP.

ocr-demo-1.png

ocr-demo-2.png

Preset FAQ

  • Why ViTLP-medium (380M)?

When I started this project, it was on the eve of the LLM era (more precisely, of ChatGPT). ViTLP-base, the model presented in our paper, is actually a rather small pre-trained model. We know that ViTLP is expected to scale up in this LLM era. However, pre-training scale is commonly constrained by computation resources and the size of the pre-training dataset. Under these constraints, ViTLP-medium (380M) is the largest scale we can support so far.

Besides, this model scale also brings inference benefits in speed and memory usage. Typically, OCR on one page of a document image takes 5 to 10 seconds on an Nvidia 4090, which is comparable to or faster than most OCR engines and LLMs.

Note

ViTLP is pronounced /ˈvai·tlp/ (vital). The first version of our paper was submitted to OpenReview in June 2023.
