---
base_model: google/paligemma-3b-ft-docvqa-896  
library_name: peft  
license: apache-2.0  
datasets:  
- cmarkea/table-vqa  
language:  
- fr  
- en  
pipeline_tag: visual-question-answering  
---

## Model Description

**paligemma-3b-ft-tablevqa-896-lora** is a fine-tuned version of the **[google/paligemma-3b-ft-docvqa-896](https://huggingface.co/google/paligemma-3b-ft-docvqa-896)** model,
trained on the **[table-vqa](https://huggingface.co/datasets/cmarkea/table-vqa)** dataset published by Crédit Mutuel Arkéa. The fine-tuning uses
**LoRA** (Low-Rank Adaptation), which trains only a small set of low-rank adapter weights, sharply reducing the memory and compute cost of fine-tuning while maintaining performance. The model runs
in bfloat16 precision, making it well suited to resource-constrained environments.

This model is designed for multilingual environments (French and English) and specializes in table-based visual question answering (VQA). It is well suited to
extracting information from tables in documents, with applications in financial reporting, data analysis, and administrative document processing.
It was fine-tuned over 7 days on a single A100 40GB GPU.

## Key Features

- **Language:** Multilingual capabilities, optimized for French and English.
- **Model Type:** Multi-modal (image-text-to-text).
- **Precision:** bfloat16 for resource efficiency.
- **Training Duration:** 7 days on a single A100 40GB GPU.
- **Fine-Tuning Method:** LoRA (Low-Rank Adaptation).
- **Domain:** Table-based visual question answering.

## Model Architecture

This model was built on top of **[google/paligemma-3b-ft-docvqa-896](https://huggingface.co/google/paligemma-3b-ft-docvqa-896)**, using its pre-trained multi-modal
capabilities to process both text and images (e.g., document tables). LoRA keeps fine-tuning lightweight: the base weights stay frozen and only small low-rank adapter
matrices are trained, which preserves accuracy while letting the model specialize in tasks such as table understanding and VQA. A sketch of such a configuration is shown below.
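
For reference, a LoRA fine-tune of this kind is typically set up with `peft` along the following lines. This is a minimal sketch: the rank, scaling factor, and target modules below are illustrative assumptions, not the hyperparameters actually used for this model.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import PaliGemmaForConditionalGeneration

# Load the frozen base model in bfloat16
base = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-ft-docvqa-896",
    torch_dtype=torch.bfloat16,
)

# Illustrative LoRA setup; r, lora_alpha, and target_modules are
# assumptions, not the exact values used for this model.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters train
```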

## Usage

You can use this model for visual question answering with table-based data by following the steps below:

```python
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "cmarkea/paligemma-3b-ft-tablevqa-896-lora"

# Sample table image for inference (signed dataset URL; if it has expired, substitute any table image)
url = "https://datasets-server.huggingface.co/cached-assets/cmarkea/table-vqa/--/c26968da3346f92ab6bfc5fec85592f8250e23f5/--/default/train/22/image/image.jpg?Expires=1728915081&Signature=Zkrd9ZWt5b9XtY0UFrgfrTuqo58DHWIJ00ZwXAymmL-mrwqnWWmiwUPelYOOjPZZdlP7gAvt96M1PKeg9a2TFm7hDrnnRAEO~W89li~AKU2apA81M6AZgwMCxc2A0xBe6rnCPQumiCGD7IsFnFVwcxkgMQXyNEL7bEem6cT0Cief9DkURUDCC-kheQY1hhkiqLLUt3ITs6o2KwPdW97EAQ0~VBK1cERgABKXnzPfAImnvjw7L-5ZXCcMJLrvuxwgOQ~DYPs456ZVxQLbTxuDwlxvNbpSKoqoAQv0CskuQwTFCq2b5MOkCCp9zoqYJxhUhJ-aI3lhyIAjmnsL4bhe6A__&Key-Pair-Id=K3EI6M078Z3AC3"
image = Image.open(requests.get(url, stream=True).raw)

# Load the fine-tuned model and processor
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map=device,
).eval()

processor = AutoProcessor.from_pretrained("google/paligemma-3b-ft-docvqa-896")

# Input prompt for table VQA
prompt = "How many rows are in this table?"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Generate the answer
input_len = model_inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
```
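
Because the repository ships a PEFT adapter (note `library_name: peft` above), you can also attach the adapter to the base model explicitly. A minimal sketch, assuming `peft` is installed:

```python
import torch
from peft import PeftModel
from transformers import PaliGemmaForConditionalGeneration

# Load the frozen base model, then apply the LoRA adapter on top of it
base = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-ft-docvqa-896",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(
    base, "cmarkea/paligemma-3b-ft-tablevqa-896-lora"
).eval()
```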


## Performance

The model's performance was evaluated on 200 question-answer pairs drawn from 100 tables in the test set of the
**[table-vqa](https://huggingface.co/datasets/cmarkea/table-vqa)** dataset. For each table, two pairs were selected: one in French and one in English.

To evaluate the model's responses, the **[LLM-as-Juries](https://arxiv.org/abs/2404.18796)** framework was employed with three judge models (GPT-4o, Gemini 1.5 Pro,
and Claude 3.5 Sonnet). Each answer was graded on a 0-to-5 scale with criteria tailored to the VQA context.
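
Concretely, each answer receives one 0-to-5 grade per judge; a simple way to aggregate such grades is to average over judges and then over samples. A minimal sketch with made-up scores (the exact grading prompts and aggregation used for this evaluation are not reproduced here):

```python
# Hypothetical grades: scores[judge][i] is the 0-5 grade judge gave to sample i.
scores = {
    "gpt-4o": [4, 5, 3],
    "gemini-1.5-pro": [4, 4, 3],
    "claude-3.5-sonnet": [5, 5, 4],
}

n_samples = len(next(iter(scores.values())))
# Average the three judges per sample, then average over samples.
per_sample = [
    sum(judge[i] for judge in scores.values()) / len(scores)
    for i in range(n_samples)
]
mean_score = sum(per_sample) / n_samples
print(f"mean jury score: {mean_score:.2f} / 5")
```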

Here’s a visualization of the results:

![constellation](https://i.postimg.cc/y6tkYg9F/constellation-03.png)

In comparison, this model outperforms **[HuggingFaceM4/Idefics3-8B-Llama3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3)** in both accuracy and efficiency,
despite having far fewer parameters.


## Citation

```bibtex
@online{AgDePaligemmaTabVQA,
  AUTHOR = {Tom Agonnoude and Cyrile Delestre},
  URL = {https://huggingface.co/cmarkea/paligemma-tablevqa-896-lora},
  YEAR = {2024},
  KEYWORDS = {Multimodal, VQA, Table Understanding, LoRA},
}
```