Batch inference
I noticed that when doing batch inference, if the images in the batch are roughly the same size, the model produces results similar to single inference. But when the sizes differ, the results for the smaller images always look worse than the single-inference results. Could this be caused by padding tokens? Which padding token is recommended for batch inference?
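To make the padding question concrete: every row in a batch is padded up to the longest sample, so with mixed sizes the smaller image's row ends up mostly padding. A quick way to count the pad tokens per sample, assuming the standard Qwen2.5-VL-style processor (file names here are placeholders):

from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("nanonets/Nanonets-OCR-s")

def chat(img_path):
    # same chat-template pattern as the batch code further down
    messages = [{"role": "user", "content": [
        {"type": "image", "image": img_path},
        {"type": "text", "text": "Extract the text from the above document."},
    ]}]
    return processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# hypothetical paths: one large page and one small page
paths = ["big_page.png", "small_page.png"]
batch = processor(
    text=[chat(p) for p in paths],
    images=[Image.open(p) for p in paths],
    padding=True,
    return_tensors="pt",
)
# zeros in the attention mask are pad positions; the small image's row has many
print((batch["attention_mask"] == 0).sum(dim=1))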
We use the same padding token mentioned in the config here. I don't think it's a padding-token issue unless you are sending some other token manually.
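You can verify what your local copy is actually using with the standard transformers attributes:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("nanonets/Nanonets-OCR-s")
# the tokenizer inside the processor supplies the pad token when padding=True
print(processor.tokenizer.pad_token, processor.tokenizer.pad_token_id)
print(processor.tokenizer.padding_side)  # which side of the sequence gets padded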
No, I'm using the default padding token. Can you think of any other reason that would make batch inference perform worse than single inference? Thank you.
Batch inference should perform the same as single inference. Do you have any reproducible code? I can look into it.
Here is the code I use for batch inference:
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor, AutoTokenizer

# nanonet model
path_nanonet = "nanonets/Nanonets-OCR-s"
model_nanonet = AutoModelForImageTextToText.from_pretrained(
    path_nanonet,
    torch_dtype=torch.float32,
    device_map="cuda",
    attn_implementation="eager",
)
model_nanonet.eval()
tokenizer_nanonet = AutoTokenizer.from_pretrained(path_nanonet)
processor_nanonet = AutoProcessor.from_pretrained(path_nanonet)

PROMPT = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""

def batch_ocr_page_with_nanonets(image_path_list, batch_size=4, max_new_tokens=2048):
    """
    OCR multiple pages in batches using nanonets/Nanonets-OCR-s
    - matches single-image quality by aligning template, padding, and image preproc
    """
    results = []
    for i in range(0, len(image_path_list), batch_size):
        batch_paths = image_path_list[i:i + batch_size]
        texts, images = [], []
        for p in batch_paths:
            img = Image.open(p)
            messages = [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": [
                    {"type": "image", "image": p},
                    {"type": "text", "text": PROMPT},
                ]},
            ]
            # build the chat-formatted prompt string for this image
            t = processor_nanonet.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
            texts.append(t)
            images.append(img)
        # tokenize all prompts and preprocess all images together, padding to the longest
        batch = processor_nanonet(
            text=texts, images=images, padding=True, return_tensors="pt"
        )
        batch = batch.to(model_nanonet.device)
        out = model_nanonet.generate(
            **batch,
            do_sample=False,
            max_new_tokens=max_new_tokens,
        )
        # strip the prompt tokens, keeping only the newly generated ones
        gen_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(batch["input_ids"], out)]
        batch_texts = processor_nanonet.batch_decode(
            gen_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )
        results.extend(batch_texts)
    return results
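And this is how I call it (the file names here are just placeholders):

pages = ["section5.png", "section6.png"]  # placeholder paths
outputs = batch_ocr_page_with_nanonets(pages, batch_size=2)
for path, text in zip(pages, outputs):
    print(f"--- {path} ---\n{text}\n")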
I ran batch inference on these 2 images
I ran inference with batch_size = 1 and 2. The results for the section 6 image look quite similar either way. But the batch-inference result for the section 5 image is just the single character 'V' (the batch_size = 1 result looks accurate, though).
Appreciate your help!
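One experiment that might help isolate this (just a guess, not a confirmed fix): decoder-only models generally want left padding during batched generation, and if the tokenizer defaults to right padding, the shorter (smaller-image) sample has pad tokens sitting between its prompt and the tokens being generated. Forcing left padding is a one-line test:

# untested hypothesis: pad on the left so no pad tokens sit between the
# prompt and the generated tokens for the shorter sample
processor_nanonet.tokenizer.padding_side = "left"
results = batch_ocr_page_with_nanonets(["section5.png", "section6.png"], batch_size=2)

The prompt-stripping slice in the function still works with left padding, since all padded rows share the same input length.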