is there a max token limit for this? my ocr always seems to end abruptly

by jinoooooooooo - opened 11 days ago

Discussion

jinoooooooooo

11 days ago

is there a max token limit for this? my ocr always seems to end abruptly

asnassar

Docling org 11 days ago

It's the same as the original SmolVLM implementation: 8192 tokens. If you can share your example, please do.

jinoooooooooo

11 days ago

sure. sharing an example with a single image extraction

import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
# Attached sample image: https://cdn-uploads.huggingface.co/production/uploads/61ebdb79592a25e6c39bc13f/CVbjQ6FiFpyWc9zShFins.jpeg
image = load_image("/content/2.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()

# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)
# create a docling document
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)

# export as any format
# HTML
# doc.save_as_html(output_file)
# MD
print(doc.export_to_markdown())
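
One way to tell whether output really stopped at the token limit, rather than just looking cut off, is to check whether generation used the whole `max_new_tokens` budget without ever emitting a stop token. The following is a minimal sketch, not part of the script above: `hit_token_limit` is a hypothetical helper, and which ids count as stop tokens is an assumption you would need to confirm against the model's generation config.

```python
def hit_token_limit(new_token_ids, max_new_tokens, stop_token_ids):
    """Return True if generation likely ended because the budget ran out.

    new_token_ids: generated ids for ONE sequence, with the prompt removed.
    max_new_tokens: the budget passed to generate().
    stop_token_ids: set of ids that end generation (assumption: supplied by caller).
    """
    used_full_budget = len(new_token_ids) >= max_new_tokens
    saw_stop_token = any(t in stop_token_ids for t in new_token_ids)
    return used_full_budget and not saw_stop_token


# Toy ids only; 0 plays the role of the stop token here.
print(hit_token_limit([5, 6, 7], max_new_tokens=3, stop_token_ids={0}))  # True
print(hit_token_limit([5, 0], max_new_tokens=3, stop_token_ids={0}))     # False
```

With the script above, the first argument would be something like `trimmed_generated_ids[0].tolist()` and the budget would be 8192. If the check returns False, the model stopped on its own and any missing text has another cause.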

jinoooooooooo

11 days ago

this is a sample image

jinoooooooooo

11 days ago

sample extraction

jinoooooooooo

11 days ago

only half of it gets extracted

asnassar

Docling org 11 days ago

I think you just need to resize your terminal; the output is overflowing. You could also save the Markdown output to a text file for inspection!
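
Saving the output to a file, as suggested, could look like the sketch below. `save_text` and the file name `output.md` are hypothetical names chosen for illustration; only the standard library is used, so nothing here depends on a particular docling-core version.

```python
from pathlib import Path


def save_text(text: str, path: str) -> int:
    """Write text to path as UTF-8 and return the number of characters written."""
    return Path(path).write_text(text, encoding="utf-8")
```

In the script shared earlier in this thread, the call would be `save_text(doc.export_to_markdown(), "output.md")`, after which the file can be opened in any editor without terminal wrapping or scrollback limits getting in the way.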

jinoooooooooo

11 days ago

my bad, i see the whole output now, but the text above the table has been skipped, any idea why this might happen?

asnassar

Docling org 11 days ago

No problem. Actually, this helped catch a bug: it seems the conversion to DoclingDocument didn't populate the caption. The caption is in the prediction, though, so we will make a fix.

jinoooooooooo

11 days ago

thanks very much!

kasatgaurav

11 days ago

edited 11 days ago

@jinoooooooooo can you share your notebook setup or script? For my use case, my docs are similar to what you have pasted above, but the results are very bad.

kasatgaurav

11 days ago

Up to a certain length it works fine; after that, the same part keeps getting repeated. @asnassar
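
When reporting this kind of looping, it can help to show where the repetition starts. The sketch below is a diagnostic only, not a fix and not something the maintainers proposed: `trailing_repeat` is a hypothetical helper that checks whether a sequence (a list of token ids, or a plain string) ends in one block repeated several times.

```python
def trailing_repeat(seq, max_period=50, min_repeats=3):
    """Return (period, repeats) if seq ends in a block of length `period`
    repeated at least `min_repeats` times back to back, else None.
    The smallest such period is reported."""
    n = len(seq)
    for period in range(1, min(max_period, n // min_repeats) + 1):
        block = seq[n - period:]
        repeats = 1
        # Walk backwards one block at a time while the blocks keep matching.
        while (repeats + 1) * period <= n and \
                seq[n - (repeats + 1) * period: n - repeats * period] == block:
            repeats += 1
        if repeats >= min_repeats:
            return period, repeats
    return None


print(trailing_repeat(list("abc") + list("xy") * 4))  # (2, 4)
print(trailing_repeat(list("abcdefg")))               # None
```

A non-None result on the generated ids suggests the model fell into a loop, and the index `len(seq) - period * repeats` marks roughly where the loop begins, which is useful context for a bug report.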

asnassar

Docling org 11 days ago

@jinoooooooooo we fixed the issue. I suggest you update the docling-core package, and it should work now.
@kasatgaurav if possible, please open a separate issue here or on https://github.com/docling-project/docling/issues with an example, so we can fix this in the upcoming checkpoint.

asnassar changed discussion status to closed 11 days ago
