layoutlmv3-xfund

This model is a fine-tuned version of microsoft/layoutlmv3-base on an unknown dataset. It achieves the following results on the evaluation set:

Loss: 0.6625
Precision: 0.7711
Recall: 0.8476
F1: 0.8075
Accuracy: 0.8030
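
As a quick consistency check, the reported F1 should equal the harmonic mean of the reported precision and recall. The sketch below assumes that standard definition; the card does not say which metric library produced the numbers, and the helper name `f1_score` is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the evaluation results listed above.
precision, recall = 0.7711, 0.8476
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.4f}")  # prints "F1 = 0.8075"
```

The recomputed value agrees with the reported F1 to four decimal places.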

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 3e-05
train_batch_size: 2
eval_batch_size: 2
seed: 42
optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9,0.999) and epsilon=1e-08; no additional optimizer arguments
lr_scheduler_type: linear
num_epochs: 5
mixed_precision_training: Native AMP
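
To make the linear schedule concrete, here is a minimal sketch of how the learning rate would decay from 3e-05 to zero. It assumes no warmup (the card lists no warmup setting) and takes the total of 2610 optimizer steps from the training results table; `linear_lr` is a hypothetical helper, not part of any library.

```python
def linear_lr(step: int, base_lr: float = 3e-5, total_steps: int = 2610,
              warmup_steps: int = 0) -> float:
    """Learning rate under a linear schedule with optional linear warmup.

    warmup_steps=0 is an assumption: the card does not list a warmup setting.
    """
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Learning rate at each epoch boundary (522 optimizer steps per epoch).
for epoch in range(6):
    step = epoch * 522
    print(f"epoch {epoch}: step {step:4d}, lr = {linear_lr(step):.2e}")
```

Under these assumptions the rate falls by one fifth of its initial value per epoch and reaches zero at step 2610.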

Training results

Training Loss	Epoch	Step	Validation Loss	Precision	Recall	F1	Accuracy
0.7142	1.0	522	0.7296	0.6225	0.7066	0.6619	0.7212
0.5881	2.0	1044	0.6032	0.6841	0.8100	0.7417	0.7688
0.4179	3.0	1566	0.5904	0.7204	0.8222	0.7679	0.7858
0.3507	4.0	2088	0.6088	0.7600	0.8458	0.8006	0.7979
0.2618	5.0	2610	0.6625	0.7711	0.8476	0.8075	0.8030
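
The table shows two criteria disagreeing: validation loss is lowest at epoch 3, while F1 keeps improving through epoch 5. The sketch below, with rows copied from the table, picks the best epoch under each criterion. The headline metrics at the top of this card match the epoch 5 row.

```python
# Rows copied from the training results table:
# (epoch, train_loss, val_loss, precision, recall, f1, accuracy)
rows = [
    (1, 0.7142, 0.7296, 0.6225, 0.7066, 0.6619, 0.7212),
    (2, 0.5881, 0.6032, 0.6841, 0.8100, 0.7417, 0.7688),
    (3, 0.4179, 0.5904, 0.7204, 0.8222, 0.7679, 0.7858),
    (4, 0.3507, 0.6088, 0.7600, 0.8458, 0.8006, 0.7979),
    (5, 0.2618, 0.6625, 0.7711, 0.8476, 0.8075, 0.8030),
]

best_by_loss = min(rows, key=lambda r: r[2])  # lowest validation loss
best_by_f1 = max(rows, key=lambda r: r[5])    # highest F1

print(f"Lowest validation loss: epoch {best_by_loss[0]} ({best_by_loss[2]})")
print(f"Highest F1: epoch {best_by_f1[0]} ({best_by_f1[5]})")
```

Training loss keeps falling while validation loss rises after epoch 3, which may indicate the start of overfitting even though the entity-level metrics still improve.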

Inference

# Install the Python wrapper (the leading "!" runs a shell command from a notebook cell)
!pip install pytesseract pillow

# Install the Tesseract engine on a Debian/Ubuntu-based system (like Colab)
!sudo apt install -y tesseract-ocr

import os

# Optional debugging aid: makes CUDA kernel launches synchronous so an error
# surfaces at the call that caused it. Set it before any CUDA work starts, and
# remove it for normal runs because it slows down GPU execution.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
from transformers import AutoProcessor, AutoModelForTokenClassification
from PIL import Image, ImageDraw, ImageFont
import pytesseract

# --- Scale pixel boxes [x0, y0, x1, y1] to the 0-1000 range LayoutLMv3 expects ---
def normalize_bbox(bbox, width, height):
    return [
        int(1000 * min(max(bbox[0] / width, 0), 1)),
        int(1000 * min(max(bbox[1] / height, 0), 1)),
        int(1000 * min(max(bbox[2] / width, 0), 1)),
        int(1000 * min(max(bbox[3] / height, 0), 1))
    ]

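# --- 1. Load the Fine-Tuned Model and Processor ---
MODEL_ID = "nnul/layoutlmv3-xfund"

print("Loading processor...")
# apply_ocr=False because words and boxes come from pytesseract below; the
# processor raises an error if boxes are passed while its built-in OCR is enabled.
processor = AutoProcessor.from_pretrained(MODEL_ID, apply_ocr=False)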
print("Loading model...")
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)

print("Moving model to device...")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
print("Model moved successfully.")

# --- 2. Load the Image ---
image_path = "your_image.png"
image = Image.open(image_path).convert("RGB")
width, height = image.size

# --- 3. Perform OCR and NORMALIZE Bounding Boxes ---
print("Performing OCR...")
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
words = []
unnormalized_boxes = []
normalized_boxes = []

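for i in range(len(data['text'])):
    # Depending on the pytesseract version, confidence arrives as an int, a float,
    # or a numeric string (-1 marks non-word rows), so parse it with float().
    if float(data['conf'][i]) > 30 and data['text'][i].strip() != '':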
        word = data['text'][i]
        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        
        actual_box = [x, y, x + w, y + h]
        unnormalized_boxes.append(actual_box)

        normalized_box = normalize_bbox(actual_box, width, height)
        normalized_boxes.append(normalized_box)
        
        words.append(word)

print(f"OCR found {len(words)} words.")

# --- 4. Manually Preprocess and Predict ---
print("Preprocessing inputs...")
encoding = processor(
    image,
    words,
    boxes=normalized_boxes,
    return_tensors="pt",
    truncation=True
)

print("Moving inputs to device...")
for k, v in encoding.items():
    encoding[k] = v.to(device)

print("Running inference...")
with torch.no_grad():
    outputs = model(**encoding)

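logits = outputs.logits
predictions_indices = logits.argmax(-1).squeeze(0).tolist()

# --- 5. Map Token Predictions Back to Words ---
# A word may be split into several sub-word tokens; keep only the label of each
# word's first token. Special tokens have word_id None and are skipped.
# With truncation=True, words beyond the model's maximum length get no prediction.
word_ids = encoding.word_ids()
previous_word_id = None
word_predictions = []
for idx, word_id in enumerate(word_ids):
    if word_id is not None and word_id != previous_word_id:
        label_id = predictions_indices[idx]
        word_predictions.append(model.config.id2label[label_id])
    previous_word_id = word_id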

def visualize_predictions(image, words, boxes, predictions):
    label2color = {
        "B-QUESTION": "blue", "I-QUESTION": "blue",
        "B-ANSWER": "green", "I-ANSWER": "green",
        "B-HEADER": "orange", "I-HEADER": "orange",
        "O": "gray"
    }
    draw_image = image.copy()
    draw = ImageDraw.Draw(draw_image)
    try:
        font = ImageFont.truetype("arial.ttf", 12)
    except IOError:
        font = ImageFont.load_default()
    for word, box, label in zip(words, boxes, predictions):
        color = label2color.get(label, 'red')
        draw.rectangle(box, outline=color, width=2)
        entity_type = label.split('-')[1] if '-' in label else 'OTHER'
        if entity_type != 'OTHER':
            draw.text((box[0], box[1] - 10), entity_type, fill=color, font=font)
    return draw_image

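print("Visualizing results...")
visualized_image = visualize_predictions(image, words, unnormalized_boxes, word_predictions)
# display() is available only in notebooks (IPython); in a plain script use visualized_image.show()
display(visualized_image)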
visualized_image.save("result_visualization_manual.png")
print("Saved visualization to result_visualization_manual.png")
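
The script above stops at one label per word. For form understanding it is often more useful to merge consecutive words into entity spans. The following is a minimal sketch of that grouping, assuming the BIO label scheme implied by the `label2color` mapping; `group_entities` is a hypothetical helper, and the sample input is illustrative rather than real model output.

```python
def group_entities(words, labels):
    """Merge consecutive BIO-tagged words into (entity_type, text) spans."""
    entities = []
    current_type, current_words = None, []
    for word, label in zip(words, labels):
        prefix, _, entity_type = label.partition("-")
        # A span starts on "B-", or on an "I-" tag that does not continue the open span.
        starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)
        if prefix == "O" or starts_new:
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        if prefix in ("B", "I"):
            current_type = entity_type
            current_words.append(word)
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities

# Illustrative input only; these are not real model predictions.
sample_words = ["Name:", "Jane", "Doe", "Date", "of", "birth"]
sample_labels = ["B-QUESTION", "B-ANSWER", "I-ANSWER",
                 "B-QUESTION", "I-QUESTION", "I-QUESTION"]
print(group_entities(sample_words, sample_labels))
# prints [('QUESTION', 'Name:'), ('ANSWER', 'Jane Doe'), ('QUESTION', 'Date of birth')]
```

In the inference script this could be called as `group_entities(words, word_predictions)`; because it iterates with `zip`, words that were truncated away and have no prediction are simply ignored.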

Framework versions

Transformers 4.52.4
Pytorch 2.6.0+cu124
Datasets 3.6.0
Tokenizers 0.21.1
