jakep-allenai committed
Commit 24221ab · verified · 1 Parent(s): fb52726

Update README.md

Files changed (1)
  1. README.md +82 -7
README.md CHANGED
@@ -72,13 +72,88 @@ This model expects as input a single document image, rendered such that the long
  The prompt must then contain the additional metadata from the document, and the easiest way to generate this
  is to use the methods provided by the [olmOCR toolkit](https://github.com/allenai/olmocr).
 
-
- ## Manual Usage
-
- If you must run the model as a one-off, please follow the instructions below.
-
- Note: It is important to keep the prompt and image dimensions exactly as specified, or else performance may drop from the benchmark numbers we report.
-
+ ## Manual Prompting
+
+ If you want to prompt this model manually instead of using the [olmOCR toolkit](https://github.com/allenai/olmocr), please see the code below.
+
+ In normal usage, the olmOCR toolkit builds the prompt by rendering the PDF page and
+ extracting relevant text blocks and image metadata. To duplicate that, you will need to
+
+ ```bash
+ pip install olmocr
+ ```
+
+ and then run the following sample code.
+
+ ```python
+ import torch
+ import base64
+ import urllib.request
+
+ from io import BytesIO
+ from PIL import Image
+ from transformers import AutoProcessor, Qwen2_5VLForConditionalGeneration
+
+ from olmocr.data.renderpdf import render_pdf_to_base64png
+ from olmocr.prompts import build_no_anchoring_v4_yaml_prompt
+
+ # Initialize the model
+ model = Qwen2_5VLForConditionalGeneration.from_pretrained("allenai/olmOCR-7B-1025", torch_dtype=torch.bfloat16).eval()
+ processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model.to(device)
+
+ # Grab a sample PDF
+ urllib.request.urlretrieve("https://molmo.allenai.org/paper.pdf", "./paper.pdf")
+
+ # Render page 1 to an image
+ image_base64 = render_pdf_to_base64png("./paper.pdf", 1, target_longest_image_dim=1288)
+
+ # Build the full prompt
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "text", "text": build_no_anchoring_v4_yaml_prompt()},
+             {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
+         ],
+     }
+ ]
+
+ # Apply the chat template and processor
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ main_image = Image.open(BytesIO(base64.b64decode(image_base64)))
+
+ inputs = processor(
+     text=[text],
+     images=[main_image],
+     padding=True,
+     return_tensors="pt",
+ )
+ inputs = {key: value.to(device) for (key, value) in inputs.items()}
+
+ # Generate the output
+ output = model.generate(
+     **inputs,
+     temperature=0.8,
+     max_new_tokens=50,
+     num_return_sequences=1,
+     do_sample=True,
+ )
+
+ # Decode the output
+ prompt_length = inputs["input_ids"].shape[1]
+ new_tokens = output[:, prompt_length:]
+ text_output = processor.tokenizer.batch_decode(
+     new_tokens, skip_special_tokens=True
+ )
+
+ print(text_output)
+ # ['{"primary_language":"en","is_rotation_valid":true,"rotation_correction":0,"is_table":false,"is_diagram":false,"natural_text":"Molmo and PixMo:\\nOpen Weights and Open Data\\nfor State-of-the']
+ ```
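In the sample above, generation stops after 50 new tokens, so the printed JSON object is cut off mid-string. As a minimal sketch of pulling the plain text back out of a finished response (assuming you raise `max_new_tokens` high enough, e.g. to several thousand, for the model to complete the object), you could do something like this:

```python
import json

# Assumes text_output[0] holds a complete JSON object, i.e. generation was
# re-run with a max_new_tokens large enough to avoid truncation.
response = json.loads(text_output[0])

print(response["primary_language"])  # e.g. "en"
print(response["natural_text"])      # the plain text extracted from the page
```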
 
  ## License and use