Is there a max token limit for this? My OCR output always seems to end abruptly.
It's the same as in the original SmolVLM implementation: 8192 tokens. If you can share your example, please do.
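To tell whether an output was cut off by the token budget rather than ending on its own, one option is to check whether the generated tokens contain the end-of-sequence token. This is a minimal sketch, not part of transformers or docling: `hit_token_limit` is a hypothetical helper, and where the end-of-sequence token id comes from (tokenizer or generation config) is an assumption that depends on the model.

```python
from typing import Sequence


def hit_token_limit(
    new_token_ids: Sequence[int],
    eos_token_id: int,
    max_new_tokens: int,
) -> bool:
    """Return True if generation likely stopped because the budget ran out.

    new_token_ids: only the newly generated tokens (prompt already trimmed).
    eos_token_id: the model's end-of-sequence token id (assumed known).
    max_new_tokens: the value that was passed to generate().
    """
    ended_with_eos = eos_token_id in new_token_ids
    used_full_budget = len(new_token_ids) >= max_new_tokens
    return used_full_budget and not ended_with_eos


# Toy ids for illustration only; real ids come from the model output.
print(hit_token_limit([5, 6, 7, 8], eos_token_id=2, max_new_tokens=4))
print(hit_token_limit([5, 6, 2], eos_token_id=2, max_new_tokens=4))
```

With a script that trims the prompt off the generated ids, the first argument would be the trimmed row converted to a plain list, for example `trimmed_generated_ids[0].tolist()`. If the helper returns False and the output still looks incomplete, the cause is probably somewhere other than the token limit.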
Sure, here is an example with a single-image extraction:
import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Load images
image = load_image("/content/2.jpg")
# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
).to(DEVICE)
# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."},
        ],
    },
]
# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)
# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()
# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)
# create a docling document
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
# export as any format
# HTML
# doc.save_as_html(output_file)
# MD
print(doc.export_to_markdown())
Only half of it gets extracted.
No problem. Actually, this helped us catch a bug: it seems the conversion to DoclingDocument didn't populate the caption. The caption is in the prediction, though, so we will make a fix.
thanks very much!
@jinoooooooooo can you share your notebook setup or script? For my use case, my docs are similar to what you pasted above, but the results are very bad.
Up to a certain length it works fine; after that, the same part keeps getting repeated. @asnassar
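When reporting this kind of looping output, it can help to confirm the repetition programmatically and note where it occurs. The following is a minimal sketch in plain Python; `has_repeating_tail` and its default thresholds are hypothetical choices for illustration, not part of docling or transformers.

```python
def has_repeating_tail(text: str, min_unit: int = 20, min_repeats: int = 3) -> bool:
    """Return True if `text` ends with one chunk repeated back to back.

    min_unit: shortest chunk length (in characters) that counts as a loop.
    min_repeats: how many consecutive copies must appear at the end.
    """
    n = len(text)
    # Try every candidate chunk length that could fit min_repeats times.
    for unit in range(min_unit, n // min_repeats + 1):
        chunk = text[n - unit:]
        if text.endswith(chunk * min_repeats):
            return True
    return False


looping = "intro " + "abc def ghi jkl mno pqr " * 4
clean = "SmolDocling converts a page image into DocTags markup for downstream export."
print(has_repeating_tail(looping))
print(has_repeating_tail(clean))
```

The check only looks at the end of the string, so it flags outputs that are still looping when generation stops. It scans every candidate chunk length, which is quadratic in the worst case but fine for single-page outputs.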
@jinoooooooooo
We fixed the issue. I suggest you update the docling-core package, and it should work now.
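To confirm which docling-core version is installed before and after upgrading (for example with `pip install -U docling-core`), a small standard-library check is enough. This is a sketch; `installed_version` is a hypothetical helper name, and the thread does not state which release contains the fix.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("docling-core:", installed_version("docling-core"))
```

In a notebook, restart the runtime after upgrading so the new version is actually imported.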
@kasatgaurav
If possible, please open a separate issue here or on https://github.com/docling-project/docling/issues with an example so we can fix this in the upcoming checkpoint.