
is there a max token limit for this? my ocr always seems to end abruptly

#1
by jinoooooooooo - opened

Docling org

It's the same as in the original SmolVLM implementation: 8192 tokens. If you can share your example, please do.
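One way to tell whether an output that "ends abruptly" actually ran into the token budget is to compare the number of newly generated tokens with `max_new_tokens` and look at the last token. This is only a rough sketch: the helper name `generation_was_truncated` is hypothetical, and which end-of-sequence token id applies to your setup is an assumption you would need to check.

```python
from typing import Optional, Sequence


def generation_was_truncated(
    new_token_ids: Sequence[int],
    max_new_tokens: int,
    eos_token_id: Optional[int],
) -> bool:
    """Return True if generation likely stopped because the token budget ran out.

    `new_token_ids` holds only the newly generated ids (prompt already removed).
    """
    # Fewer tokens than the budget: generation stopped for some other reason.
    if len(new_token_ids) < max_new_tokens:
        return False
    # The budget was used up. Treat it as truncated unless the very last
    # token happens to be the end-of-sequence token.
    return eos_token_id is None or new_token_ids[-1] != eos_token_id


# Hypothetical usage with the tensors from a generate() call:
#   ids = trimmed_generated_ids[0].tolist()
#   generation_was_truncated(ids, 8192, eos_id)  # eos_id: assumed to be known
```

If this returns False, the model finished on its own and the missing text is more likely a display or conversion issue than a token limit.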

sure. sharing an example with a single image extraction

import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image = load_image("/content/2.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()

# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)
# create a docling document
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)

# export as any format
# HTML
# doc.save_as_html(output_file)
# MD
print(doc.export_to_markdown())

2.jpg
this is a sample image

sample extraction

image.png

only half of it gets extracted

Docling org

I think you just need to resize your terminal; the output is overflowing. You could also save the markdown output to a text file for inspection!

Screenshot 2025-03-17 at 18.39.56.png
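Saving the output to a file, as suggested above, could look roughly like this. The helper name `save_text` and the file name `output.md` are placeholders chosen for illustration; `doc.export_to_markdown()` is the call already used in the script earlier in this thread.

```python
from pathlib import Path


def save_text(text: str, output_path: str) -> Path:
    """Write text to output_path as UTF-8, creating parent folders if needed."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


# Hypothetical usage with the document from the script above:
#   save_text(doc.export_to_markdown(), "output.md")
#   save_text(doctags, "output.doctags.txt")  # raw prediction, for comparison
```

Opening the file in an editor avoids any terminal wrapping or scrollback limits, and keeping the raw DocTags next to the markdown makes it easier to see whether something is missing from the prediction or only from the export.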

my bad, i see the whole output now, but the text above the table has been skipped, any idea why this might happen?

image.png

Docling org

No problem. Actually, this helped catch a bug: it seems the conversion to DoclingDocument didn't populate the caption. The caption is in the prediction, though, so we will make a fix.

thanks very much!

@jinoooooooooo can you share your notebook setup or script? For my use case, my docs are similar to what you pasted above, but the results are very bad.

Up to a certain length it works fine; after that, the same part keeps getting repeated. @asnassar

Docling org

@jinoooooooooo we fixed the issue. I suggest you update the docling-core package, and it should work now.
@kasatgaurav if possible, please open a separate issue here or on https://github.com/docling-project/docling/issues with an example so we can fix this in the upcoming checkpoint.
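To confirm which docling-core version is installed before and after updating, a small check along these lines may help. The helper name `installed_version` is hypothetical, and the thread does not say which release contains the fix, so no minimum version is assumed here.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string, or None when docling-core is not installed.
print(installed_version("docling-core"))
```

Upgrading itself is typically done with `pip install -U docling-core`; restart the notebook kernel afterwards so the new version is actually imported.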

asnassar changed discussion status to closed