layoutlmv3-xfund

This model is a fine-tuned version of microsoft/layoutlmv3-base on an unknown dataset. It achieves the following results on the evaluation set, which match the final (epoch 5) row of the training results:

  • Loss: 0.6625
  • Precision: 0.7711
  • Recall: 0.8476
  • F1: 0.8075
  • Accuracy: 0.8030

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9,0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 5
  • mixed_precision_training: Native AMP
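
To make the learning-rate settings concrete, the following is a minimal sketch of what a linear schedule implies for the values above. It assumes zero warmup steps and a decay to exactly zero at the final step, neither of which is stated in this card; the step counts are taken from the Step column of the training results. The function name linear_lr is hypothetical and used only for illustration.

```python
# Sketch of a linear learning-rate schedule under the listed hyperparameters.
# ASSUMPTIONS (not stated in the card): zero warmup steps and a decay to
# exactly 0 at the last optimizer step.

PEAK_LR = 3e-05          # learning_rate from the card
STEPS_PER_EPOCH = 522    # the Step column advances by 522 per epoch
NUM_EPOCHS = 5           # num_epochs from the card
TOTAL_STEPS = STEPS_PER_EPOCH * NUM_EPOCHS  # equals the final Step value


def linear_lr(step: int, peak_lr: float = PEAK_LR, total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate after `step` optimizer steps, linear decay, no warmup."""
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / total_steps


if __name__ == "__main__":
    for epoch in range(NUM_EPOCHS + 1):
        step = epoch * STEPS_PER_EPOCH
        print(f"after epoch {epoch}: step={step:4d} lr={linear_lr(step):.2e}")
```

Under these assumptions the learning rate falls by one fifth of its peak value per epoch. If the run used warmup or gradient accumulation, the actual curve would differ.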

Training results

Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy
0.7142        | 1.0   | 522  | 0.7296          | 0.6225    | 0.7066 | 0.6619 | 0.7212
0.5881        | 2.0   | 1044 | 0.6032          | 0.6841    | 0.8100 | 0.7417 | 0.7688
0.4179        | 3.0   | 1566 | 0.5904          | 0.7204    | 0.8222 | 0.7679 | 0.7858
0.3507        | 4.0   | 2088 | 0.6088          | 0.7600    | 0.8458 | 0.8006 | 0.7979
0.2618        | 5.0   | 2610 | 0.6625          | 0.7711    | 0.8476 | 0.8075 | 0.8030
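
The card does not say how the metrics are computed, but the F1 column is consistent with the usual harmonic mean of precision and recall. The sketch below recomputes F1 from the reported precision and recall of each epoch; the helper name f1_score is an illustrative choice, and small differences are expected because the inputs are already rounded to four decimals.

```python
# (precision, recall, reported F1) per epoch, copied from the training results.
ROWS = [
    (0.6225, 0.7066, 0.6619),
    (0.6841, 0.8100, 0.7417),
    (0.7204, 0.8222, 0.7679),
    (0.7600, 0.8458, 0.8006),
    (0.7711, 0.8476, 0.8075),
]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for epoch, (p, r, reported) in enumerate(ROWS, start=1):
    computed = f1_score(p, r)
    print(f"epoch {epoch}: computed F1={computed:.4f}  reported F1={reported:.4f}")
```

Note that validation loss is lowest at epoch 3 (0.5904) and rises afterwards, while precision, recall, F1, and accuracy keep improving through epoch 5.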

Inference

# Install the Python wrappers (the leading "!" runs a shell command in a notebook such as Colab)
!pip install pytesseract pillow

# Install the Tesseract engine on a Debian/Ubuntu-based system (like Colab)
!sudo apt install -y tesseract-ocr

import os  # For setting environment variable
import torch
from transformers import AutoProcessor, AutoModelForTokenClassification
from PIL import Image, ImageDraw, ImageFont
import pytesseract

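# --- Debugging only: makes CUDA kernel launches synchronous so errors are reported at the failing call ---
# It must be set before the first CUDA call and slows GPU inference; remove it for normal use.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"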

# --- Bounding-box normalization ---
# LayoutLMv3 expects boxes as [x0, y0, x1, y1] scaled to the 0-1000 range.
def normalize_bbox(bbox, width, height):
    return [
        int(1000 * min(max(bbox[0] / width, 0), 1)),
        int(1000 * min(max(bbox[1] / height, 0), 1)),
        int(1000 * min(max(bbox[2] / width, 0), 1)),
        int(1000 * min(max(bbox[3] / height, 0), 1)),
    ]

# --- 1. Load your Fine-Tuned Model and Processor ---
MODEL_ID = "nnul/layoutlmv3-xfund"

print("Loading processor...")
processor = AutoProcessor.from_pretrained(MODEL_ID)
print("Loading model...")
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)

print("Moving model to device...")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
print("Model moved successfully.")

# --- 2. Load the Image ---
image_path = "your_image.png"
image = Image.open(image_path).convert("RGB")
width, height = image.size

# --- 3. Perform OCR and NORMALIZE Bounding Boxes ---
print("Performing OCR...")
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
words = []
unnormalized_boxes = []
normalized_boxes = []

for i in range(len(data['text'])):
    # pytesseract may report confidence as a float or a float string, so parse it with float()
    if float(data['conf'][i]) > 30 and data['text'][i].strip() != '':
        word = data['text'][i]
        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        
        actual_box = [x, y, x + w, y + h]
        unnormalized_boxes.append(actual_box)

        normalized_box = normalize_bbox(actual_box, width, height)
        normalized_boxes.append(normalized_box)
        
        words.append(word)

print(f"OCR found {len(words)} words.")

# --- 4. Manually Preprocess and Predict ---
print("Preprocessing inputs...")
encoding = processor(
    image,
    words,
    boxes=normalized_boxes,
    return_tensors="pt",
    truncation=True
)

print("Moving inputs to device...")
for k, v in encoding.items():
    encoding[k] = v.to(device)

print("Running inference...")
with torch.no_grad():
    outputs = model(**encoding)

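logits = outputs.logits
# Highest-scoring label ID per token (batch size is 1, so squeeze() drops the batch dimension)
predictions_indices = logits.argmax(-1).squeeze().tolist()

# Map token-level predictions back to words: keep the label of each word's first sub-token
# and skip special tokens (word_id is None). Words cut off by truncation get no prediction.
word_ids = encoding.word_ids()
previous_word_id = None
word_predictions = []
for idx, word_id in enumerate(word_ids):
    if word_id is not None and word_id != previous_word_id:
        label_id = predictions_indices[idx]
        word_predictions.append(model.config.id2label[label_id])
    previous_word_id = word_id

# --- 5. Visualize the Predictions ---
def visualize_predictions(image, words, boxes, predictions):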
    label2color = {
        "B-QUESTION": "blue", "I-QUESTION": "blue",
        "B-ANSWER": "green", "I-ANSWER": "green",
        "B-HEADER": "orange", "I-HEADER": "orange",
        "O": "gray"
    }
    draw_image = image.copy()
    draw = ImageDraw.Draw(draw_image)
    try:
        font = ImageFont.truetype("arial.ttf", 12)
    except IOError:
        font = ImageFont.load_default()
    for word, box, label in zip(words, boxes, predictions):
        color = label2color.get(label, 'red')
        draw.rectangle(box, outline=color, width=2)
        entity_type = label.split('-')[1] if '-' in label else 'OTHER'
        if entity_type != 'OTHER':
            draw.text((box[0], box[1] - 10), entity_type, fill=color, font=font)
    return draw_image

print("Visualizing results...")
visualized_image = visualize_predictions(image, words, unnormalized_boxes, word_predictions)
display(visualized_image)  # display() is available in notebooks (IPython/Colab) only
visualized_image.save("result_visualization_manual.png")
print("Saved visualization to result_visualization_manual.png")
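
Two steps in the script above can be checked without a model or OCR engine: scaling pixel boxes to the 0-1000 range, and collapsing token-level labels to one label per word. The following is a dependency-free sketch of both. The page size, box, word IDs, and labels are made-up illustration values, and first_subtoken_labels is a hypothetical helper name that mirrors the loop over word_ids rather than a library function.

```python
from typing import List, Optional


def normalize_bbox(bbox: List[int], width: int, height: int) -> List[int]:
    """Scale a pixel box [x0, y0, x1, y1] to 0-1000, clamping to the page."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * min(max(x0 / width, 0), 1)),
        int(1000 * min(max(y0 / height, 0), 1)),
        int(1000 * min(max(x1 / width, 0), 1)),
        int(1000 * min(max(y1 / height, 0), 1)),
    ]


def first_subtoken_labels(
    word_ids: List[Optional[int]], token_labels: List[str]
) -> List[str]:
    """Keep the label of each word's first sub-token; skip special tokens (None)."""
    labels = []
    previous = None
    for word_id, label in zip(word_ids, token_labels):
        if word_id is not None and word_id != previous:
            labels.append(label)
        previous = word_id
    return labels


# Hypothetical 800x600 page with one word box given in pixels.
print(normalize_bbox([200, 150, 400, 300], width=800, height=600))

# A box that spills past the page edges is clamped to the valid range.
print(normalize_bbox([-10, 0, 900, 600], width=800, height=600))

# Hypothetical tokenization: special token, word 0 (1 token),
# word 1 (2 tokens), special token.
word_ids = [None, 0, 1, 1, None]
token_labels = ["O", "B-QUESTION", "I-QUESTION", "O", "O"]
print(first_subtoken_labels(word_ids, token_labels))
```

Taking the first sub-token's label is one common convention. Other choices, such as a majority vote over a word's sub-tokens, are possible and may give different word-level labels.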

Framework versions

  • Transformers 4.52.4
  • Pytorch 2.6.0+cu124
  • Datasets 3.6.0
  • Tokenizers 0.21.1