Document QA Model

This is a document question-answering model fine-tuned from microsoft/layoutlmv3-base. It takes a document image together with OCR output from PaddleOCR (text and bounding boxes) and answers questions about structured information in the document layout.

Model Details

Model Description

Model Name: document-qa-model
Base Model: microsoft/layoutlmv3-base
Fine-tuned by: Lakshya Singh (solo contributor)
Languages: English, Spanish, French, German, Italian
License: Apache-2.0 (as declared for this fine-tuned model; the base model microsoft/layoutlmv3-base is itself published under CC BY-NC-SA 4.0)
Intended Use: Extract answers to structured queries from scanned documents
Funding: None; this project was completed independently.

Model Sources

Repository: GitHub, lakshya-rawat/document-qa-model
Trained on: Adapted version of nielsr/docvqa_1200_examples
Model metrics: See

Uses

Direct Use

This model can be used for:

Question answering on document images (PDFs, invoices, utility bills)
Information extraction that combines OCR text with layout-aware understanding

Out-of-Scope Use

Conversational or multi-turn QA
Images without OCR output: the model needs OCR tokens and bounding boxes as input

Training Details

Dataset

The dataset consisted of:

Images of utility bills and documents
OCR data with bounding boxes (from PaddleOCR)
Queries in English, Spanish, and Chinese
Answer spans with match scores and positions
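
The card does not say how answer spans and match scores were computed. As a rough sketch only, one common approach is to slide a window over the OCR words and keep the window most similar to the ground-truth answer. The function name `find_answer_span`, the `min_score` threshold, and the use of `difflib` are assumptions for illustration, not the author's actual pipeline.

```python
from difflib import SequenceMatcher


def find_answer_span(words, answer, min_score=0.8):
    """Return (start, end, score) for the window of OCR words that best
    matches `answer`, or None if no window reaches `min_score`.

    `start` and `end` are inclusive word indices. Hypothetical helper.
    """
    target = answer.lower().split()
    if not target or not words:
        return None
    n = len(target)
    target_text = " ".join(target)
    best = None
    # Compare every window of n consecutive OCR words with the answer text.
    for start in range(0, len(words) - n + 1):
        window = " ".join(w.lower() for w in words[start:start + n])
        score = SequenceMatcher(None, window, target_text).ratio()
        if best is None or score > best[2]:
            best = (start, start + n - 1, score)
    if best is None or best[2] < min_score:
        return None
    return best


if __name__ == "__main__":
    ocr_words = ["Total", "amount", "due:", "$42.10", "by", "March"]
    print(find_answer_span(ocr_words, "$42.10"))
    print(find_answer_span(ocr_words, "amount due"))
```

A fuzzy ratio rather than exact string equality tolerates small OCR errors such as a trailing colon, at the cost of occasionally accepting a near miss.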

Training Procedure

Preprocessing: PaddleOCR was used to extract tokens, positions, and structure
Model: LayoutLMv3-base
Epochs: 4
Learning rate schedule: shown in the loss and learning rate chart under Training Metrics
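
LayoutLMv3 expects each word's bounding box as [x0, y0, x1, y1] scaled to a 0-1000 range. The sketch below shows one way OCR output could be converted into that form. It assumes the classic PaddleOCR result shape, where each line is a four-point polygon followed by a (text, confidence) pair, and it assigns the line's box to every word in the line, which is a simplification. The function names are hypothetical.

```python
def normalize_box(quad, width, height):
    """Convert a 4-point polygon in pixels to [x0, y0, x1, y1] on a 0-1000 scale."""
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]

    def scale(value, size):
        # Clamp so slightly out-of-image OCR coordinates stay in range.
        return max(0, min(1000, int(1000 * value / size)))

    return [
        scale(min(xs), width),
        scale(min(ys), height),
        scale(max(xs), width),
        scale(max(ys), height),
    ]


def ocr_lines_to_words_and_boxes(ocr_lines, width, height):
    """Split each OCR line into words; every word reuses its line's box.

    Assumed input shape: [(quad, (text, confidence)), ...].
    """
    words, boxes = [], []
    for quad, (text, _confidence) in ocr_lines:
        box = normalize_box(quad, width, height)
        for word in text.split():
            words.append(word)
            boxes.append(box)
    return words, boxes


if __name__ == "__main__":
    lines = [([[100, 50], [300, 50], [300, 100], [100, 100]], ("Total due", 0.98))]
    print(ocr_lines_to_words_and_boxes(lines, width=1000, height=500))
```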

Training Metrics

[Figure: F1 score on the validation set]
[Figure: loss and learning rate chart]

Evaluation

Metrics Used

F1 score
Match score of predicted spans
Token overlap vs ground truth
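
The card does not give the exact metric definitions. As an illustration, token-overlap F1 is often computed in the SQuAD style shown below: lowercase both strings, split on whitespace, and take the harmonic mean of token precision and recall. The function name `token_f1` and the simple whitespace tokenisation are assumptions and may differ from the evaluation code actually used.

```python
from collections import Counter


def token_f1(prediction, truth):
    """SQuAD-style token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = truth.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("42.10 USD", "42.10"))
```

In this example the prediction contains the reference plus one extra token, so precision is 0.5, recall is 1.0, and F1 is 2/3.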

Summary

The model performs well on document-style QA tasks, especially with:

Clearly structured OCR results
Document types similar to utility bills, invoices, and forms

How to Use

Available on my GitHub: lakshya-rawat/document-qa-model
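
Until you consult the repository, the following is a minimal inference sketch rather than the author's documented usage. It assumes the fine-tuned weights and processor configuration have been saved to a local directory (`model_dir`, a hypothetical path), that the checkpoint loads with the Hugging Face `transformers` class `LayoutLMv3ForQuestionAnswering`, and that `words` and `boxes` come from your own OCR step with boxes already scaled to 0-1000. The span selection in `best_span` is a simple exhaustive search and does not exclude question tokens.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Pick the (start, end) token pair with the highest combined logit,
    subject to start <= end and a maximum span length."""
    best, best_score = (0, 0), float("-inf")
    for start, start_logit in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


def answer_question(model_dir, image_path, question, words, boxes):
    """Run extractive QA on one page. `model_dir` is a hypothetical local path."""
    # Heavy dependencies are imported here so best_span stays usable alone.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, LayoutLMv3ForQuestionAnswering

    # apply_ocr=False because words and boxes are supplied by the caller.
    processor = AutoProcessor.from_pretrained(model_dir, apply_ocr=False)
    model = LayoutLMv3ForQuestionAnswering.from_pretrained(model_dir)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    encoding = processor(
        image, question, words, boxes=boxes,
        truncation=True, max_length=512, return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**encoding)

    start, end = best_span(
        outputs.start_logits[0].tolist(), outputs.end_logits[0].tolist()
    )
    answer_ids = encoding["input_ids"][0][start:end + 1]
    return processor.tokenizer.decode(answer_ids, skip_special_tokens=True).strip()


if __name__ == "__main__":
    # Toy logits only; no model is loaded here.
    print(best_span([0.1, 2.0, 0.3], [0.0, 0.5, 3.0]))
```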