
πŸ“„ LayoutLMv3 Fine-Tuned on FUNSD for Key-Value Pair Extraction

Model Details

  • Developed by: nnul
  • Model type: LayoutLMv3 (microsoft/layoutlmv3-base)
  • Language(s): English
  • License: Apache 2.0
  • Fine-tuned from: microsoft/layoutlmv3-base

This model is a fine-tuned version of LayoutLMv3 on the FUNSD dataset. It has been trained for the task of form understanding, specifically token classification for extracting structured information from scanned forms (e.g., questions and answers in a key-value format).


Model Description

The model performs token-level classification, labeling each token as one of:

  • QUESTION
  • ANSWER
  • HEADER
  • O (other)

It takes as input a scanned form image and its OCR-extracted tokens and bounding boxes.
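LayoutLMv3 expects each bounding box in a 0-1000 coordinate space normalized to the page size, so pixel coordinates from an OCR engine must be rescaled first. A minimal sketch of that rescaling; the helper name `normalize_box` is illustrative, not part of the model:

```python
def normalize_box(box, width, height):
    """Scale a pixel-space box [x0, y0, x1, y1] to LayoutLMv3's 0-1000 space."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

# A box on a 1000x800-pixel page
print(normalize_box([100, 100, 300, 200], 1000, 800))  # [100, 125, 300, 250]
```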


Uses

Direct Use

  • Key-value pair extraction from scanned documents
  • Form understanding
  • Preprocessing step for document-based QA, autofill, or RPA systems

Downstream Use

  • Automating information extraction from forms
  • Fine-tuning on custom form datasets (insurance, tax, invoices, etc.)

Out-of-Scope Use

  • Documents not structured like forms
  • Non-English documents (the model was not trained on multilingual data)
  • Highly noisy OCR output (e.g., handwritten text)

Bias, Risks, and Limitations

  • Biased toward the structure and layout of FUNSD forms (U.S.-centric, clean typewritten documents).
  • May perform poorly on handwritten or low-quality scans.
  • Assumes accurate OCR input.

How to Get Started

import torch
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
from PIL import Image

# Load model and processor; apply_ocr=False because we supply our own words and boxes
model = LayoutLMv3ForTokenClassification.from_pretrained("nnul/layoutlmv3-finetuned-funsd")
processor = LayoutLMv3Processor.from_pretrained("nnul/layoutlmv3-finetuned-funsd", apply_ocr=False)

# Load the form image plus OCR-extracted tokens and their bounding boxes
# (boxes must be in the 0-1000 normalized coordinate space LayoutLMv3 expects)
image = Image.open("your_form.jpg").convert("RGB")
words = ["Name", ":", "John", "Doe"]
boxes = [[100, 100, 150, 120], [155, 100, 160, 120], [165, 100, 220, 120], [225, 100, 270, 120]]

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding)
predictions = outputs.logits.argmax(-1)
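The raw prediction ids can be mapped back to label names via `model.config.id2label` and then grouped into key-value pairs. A minimal sketch of that grouping, assuming the simplified label set listed above (no B-/I- prefixes) and tokens in reading order; `pair_key_values` is an illustrative helper, not part of the model:

```python
def pair_key_values(words, labels):
    """Group consecutive QUESTION tokens with the ANSWER tokens that follow them."""
    pairs, key, value = [], [], []
    for word, label in zip(words, labels):
        if label == "QUESTION":
            if key and value:  # a completed key-value pair precedes this new key
                pairs.append((" ".join(key), " ".join(value)))
                key, value = [], []
            key.append(word)
        elif label == "ANSWER":
            value.append(word)
    if key and value:
        pairs.append((" ".join(key), " ".join(value)))
    return pairs

words = ["Name", ":", "John", "Doe", "Date", ":", "2024"]
labels = ["QUESTION", "QUESTION", "ANSWER", "ANSWER", "QUESTION", "QUESTION", "ANSWER"]
print(pair_key_values(words, labels))  # [('Name :', 'John Doe'), ('Date :', '2024')]
```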

Training Details

Training Data

  • FUNSD Dataset
  • 199 forms (149 training, 50 test), annotated with token-level BIO labels

Training Hyperparameters

  • Epochs: 7
  • Learning rate: default
  • Batch size: 2
  • Optimizer: AdamW
  • Training time: ~5 minutes on A100 (Colab)

Evaluation

Label          Precision   Recall   F1-Score   Support
ANSWER         0.90        0.93     0.92       817
HEADER         0.67        0.64     0.66       119
QUESTION       0.91        0.94     0.93       1077
Micro Avg      0.90        0.92     0.91       2013
Macro Avg      0.83        0.84     0.83       2013
Weighted Avg   0.89        0.92     0.91       2013
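The averages follow the standard definitions: macro F1 is the unweighted mean of per-label F1, and weighted F1 weights each label by its support. A quick check using the rounded per-label figures from the table (the macro value lands at 0.84 here only because the table was computed from unrounded scores):

```python
# Per-label F1 and support, as reported in the table above
f1 = {"ANSWER": 0.92, "HEADER": 0.66, "QUESTION": 0.93}
support = {"ANSWER": 817, "HEADER": 119, "QUESTION": 1077}

macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[k] * support[k] for k in f1) / sum(support.values())

print(round(macro_f1, 4), round(weighted_f1, 4))  # 0.8367 0.91
```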

Environmental Impact

Parameter        Value
Hardware Used    NVIDIA A100 GPU (Colab)
Training Time    ~5 minutes
Cloud Provider   Google Colab
Carbon Emitted   Negligible

Citation

@misc{layoutlmv3-funsd,
  title={LayoutLMv3 Fine-tuned on FUNSD},
  author={nnul},
  year={2025},
  howpublished={\url{https://huggingface.co/nnul/layoutlmv3-finetuned-funsd}},
  note={Fine-tuned LayoutLMv3 for key-value extraction from forms}
}