---
library_name: transformers
tags:
- document-question-answering
- layoutlmv3
- ocr
- document-understanding
- paddleocr
- multilingual
- layout-aware
- lakshya-singh
license: apache-2.0
language:
- en
base_model:
- microsoft/layoutlmv3-base
datasets:
- nielsr/docvqa_1200_examples
---

# Document QA Model

This is a fine-tuned **document question-answering model** based on `layoutlmv3-base`. It is trained on OCR output (via PaddleOCR) and answers questions about structured information in a document's layout.

---

## Model Details

### Model Description

- **Model Name:** `document-qa-model`
- **Base Model:** [`microsoft/layoutlmv3-base`](https://huggingface.co/microsoft/layoutlmv3-base)
- **Fine-tuned by:** Lakshya Singh (solo contributor)
- **Languages:** English, Spanish, French, German, Italian
- **License:** Apache-2.0 (inherited from the base model)
- **Intended Use:** Extracting answers to structured queries from scanned documents
- **Funding:** None; this project was completed independently.
---

## Model Sources

- **Repository:** [GitHub](https://github.com/Lakshyasinghrawat12/DocumentQA-lakshya-rawat-document-qa-model)
- **Trained on:** An adapted version of [`nielsr/docvqa_1200_examples`](https://huggingface.co/datasets/nielsr/docvqa_1200_examples)
- **Model metrics:**

  ![training_history.png](https://cdn-uploads.huggingface.co/production/uploads/66a7331438fbd9075584523f/MtMe5CZy3wb2nEG1wTRMc.png)

---

## Uses

### Direct Use

This model can be used for:

- Question answering on document images (PDFs, invoices, utility bills)
- Information extraction using OCR and layout-aware understanding

### Out-of-Scope Use

- Not suitable for conversational QA
- Not suitable for images without OCR-processed text

---

## Training Details

### Dataset

The dataset consisted of:

- **Images** of utility bills and other documents
- **OCR data** with bounding boxes (from PaddleOCR)
- **Queries** in English, Spanish, and Chinese
- **Answer spans** with match scores and positions

### Training Procedure

- Preprocessing: PaddleOCR was used to extract tokens, positions, and document structure
- Model: LayoutLMv3-base
- Epochs: 4
- Learning rate schedule: shown in the chart below

### Training Metrics

Validation F1, loss, and learning rate over training:

![training_history.png](https://cdn-uploads.huggingface.co/production/uploads/66a7331438fbd9075584523f/MtMe5CZy3wb2nEG1wTRMc.png)

---

## Evaluation

### Metrics Used

- F1 score
- Match score of predicted spans
- Token overlap against ground truth

### Summary

The model performs well on document-style QA tasks, especially with:

- Clearly structured OCR results
- Document types similar to utility bills, invoices, and forms

---

## How to Use

The full training and inference code is available on [GitHub](https://github.com/Lakshyasinghrawat12/DocumentQA-lakshya-rawat-document-qa-model).
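As a minimal sketch (not taken from the repository), inference with externally supplied OCR results could look like the following. The checkpoint path, image file name, and example words/boxes are placeholders, and the code assumes the fine-tuned weights were saved in the standard `transformers` format with an extractive QA head (`LayoutLMv3ForQuestionAnswering`). LayoutLMv3 expects bounding boxes normalized to a 0-1000 coordinate space, so a small helper converts PaddleOCR's pixel coordinates first.

```python
from PIL import Image
import torch
from transformers import AutoProcessor, LayoutLMv3ForQuestionAnswering


def normalize_box(box, width, height):
    """Scale an absolute pixel box [x0, y0, x1, y1] to the 0-1000
    coordinate range that LayoutLMv3 expects."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


def answer_question(model_dir, image_path, question, words, pixel_boxes):
    """Run extractive QA over one document image.

    `words` and `pixel_boxes` come from an external OCR engine such as
    PaddleOCR, so the processor is loaded with apply_ocr=False.
    """
    processor = AutoProcessor.from_pretrained(
        "microsoft/layoutlmv3-base", apply_ocr=False
    )
    model = LayoutLMv3ForQuestionAnswering.from_pretrained(model_dir)

    image = Image.open(image_path).convert("RGB")
    boxes = [normalize_box(b, *image.size) for b in pixel_boxes]

    # Question goes in as `text`, OCR words as `text_pair`.
    encoding = processor(
        image, question, words, boxes=boxes, truncation=True, return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(**encoding)

    # Decode the highest-scoring start/end span back to text.
    start = outputs.start_logits.argmax(-1).item()
    end = outputs.end_logits.argmax(-1).item()
    return processor.tokenizer.decode(
        encoding["input_ids"][0][start : end + 1]
    ).strip()
```

A call might look like `answer_question("path/to/document-qa-model", "invoice.png", "What is the total due?", words, pixel_boxes)`, where `words` and `pixel_boxes` are the token strings and `[x0, y0, x1, y1]` pixel boxes returned by PaddleOCR for that image.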