license: apache-2.0
language:
- en
datasets:
- unsloth/LaTeX_OCR
library_name: unsloth
model_name: Qwen2-VL-7B-Instruct with LoRA (Equation-to-LaTeX)
---

# Qwen2-VL: Equation Image → LaTeX with LoRA + Unsloth

- **Developed by:** Mayank022
- **License:** apache-2.0
- **Finetuned from model:** unsloth/qwen2-vl-7b-instruct-unsloth-bnb-4bit

This qwen2_vl model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

Fine-tune [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), a vision-language model, to convert equation images into LaTeX code using the [Unsloth](https://github.com/unslothai/unsloth) framework and LoRA adapters.

## Project Objective

Train an equation-to-LaTeX transcriber from a pre-trained multimodal model. The model learns to read rendered math equations and generate the corresponding LaTeX.



---

[Source code on GitHub](https://github.com/Mayankpratapsingh022/Finetuning-LLMs/tree/main/Qwen_2_VL_Multimodel_LLM_Finetuning)

## Dataset

- [`unsloth/LaTeX_OCR`](https://huggingface.co/datasets/unsloth/LaTeX_OCR) – image-LaTeX pairs of printed mathematical expressions.
- ~68K train / ~7K test samples.
- Example:
  - Image: 
  - Target: `R - { \frac { 1 } { 2 } } ( \nabla \Phi ) ^ { 2 } - { \frac { 1 } { 2 } } \nabla ^ { 2 } \Phi = 0 .`
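
Before fine-tuning, each image/LaTeX pair is converted into the chat format that vision SFT expects. A minimal sketch, assuming the dataset's `image` and `text` columns; the `convert_to_conversation` helper follows Unsloth's vision examples and is illustrative rather than the exact notebook code:

```python
from datasets import load_dataset

# Each sample pairs a rendered equation (PIL image) with its LaTeX source string
dataset = load_dataset("unsloth/LaTeX_OCR", split="train")

instruction = "Write the LaTeX representation for this image."

def convert_to_conversation(sample):
    # User turn: instruction + equation image; assistant turn: the target LaTeX
    return {"messages": [
        {"role": "user", "content": [
            {"type": "text", "text": instruction},
            {"type": "image", "image": sample["image"]},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": sample["text"]},
        ]},
    ]}

converted_dataset = [convert_to_conversation(sample) for sample in dataset]
```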

---

## Tech Stack

| Component | Description |
|-----------|-------------|
| Qwen2-VL | Multimodal vision-language model (7B) by Alibaba |
| Unsloth | Fast & memory-efficient training |
| LoRA (via PEFT) | Parameter-efficient fine-tuning |
| 4-bit Quantization | Enabled by `bitsandbytes` |
| Datasets, HF Hub | For loading/saving models & datasets |

---

## Setup

```bash
pip install unsloth unsloth_zoo peft trl datasets accelerate bitsandbytes xformers==0.0.29.post3 sentencepiece protobuf hf_transfer triton
```

---

## Training (Jupyter Notebook)

Refer to: `Qwen2__VL_image_to_latext.ipynb`

Steps:

1. Load Qwen2-VL (`load_in_4bit=True`)
2. Load dataset via `datasets.load_dataset("unsloth/LaTeX_OCR")`
3. Apply LoRA adapters
4. Use `SFTTrainer` from Unsloth to fine-tune
5. Save adapters or merged model

- LoRA rank used: `r=16`
- LoRA alpha: `16`
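
A condensed sketch of these steps with Unsloth's vision API is shown below. The LoRA rank and alpha match the values above; the remaining hyperparameters are illustrative values in the style of Unsloth's vision examples, not necessarily the notebook's exact settings, and `converted_dataset` is the chat-formatted data from the Dataset section.

```python
from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

# 1. Load the 4-bit base model (step 2, dataset conversion, is shown in the Dataset section)
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/qwen2-vl-7b-instruct-unsloth-bnb-4bit",
    load_in_4bit=True,
)

# 3. Attach LoRA adapters (r=16, lora_alpha=16 as noted above)
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    r=16,
    lora_alpha=16,
)

# 4. Fine-tune with SFTTrainer and the vision data collator
FastVisionModel.for_training(model)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=converted_dataset,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        max_steps=60,
        output_dir="output",
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_seq_length=2048,
    ),
)
trainer.train()

# 5. Save the LoRA adapters (a merged 16-bit export is also possible)
model.save_pretrained("output")
tokenizer.save_pretrained("output")
```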

---

## Inference
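
The snippet below assumes the fine-tuned `model` and `tokenizer` (Unsloth returns the Qwen2-VL processor as `tokenizer`) are already loaded. If the adapters were saved to the `output/` folder from the training step, loading them might look like this:

```python
from unsloth import FastVisionModel

# Load the saved LoRA adapters on top of the 4-bit base model
model, tokenizer = FastVisionModel.from_pretrained("output", load_in_4bit=True)
FastVisionModel.for_inference(model)  # switch Unsloth into inference mode
```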

```python
from PIL import Image

image = Image.open("equation.png")
prompt = "Write the LaTeX representation for this image."

# Pair the image with the instruction via the chat template
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

---

## Evaluation

- Exact Match Accuracy: ~90%+
- Strong generalization to complex equations and symbols
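
One way to compute such an exact-match score on the test split is sketched below; `predict_latex` is a hypothetical wrapper around the inference code above, and the whitespace normalization is an assumption rather than the notebook's exact metric.

```python
from datasets import load_dataset

test_dataset = load_dataset("unsloth/LaTeX_OCR", split="test")

def normalize(latex: str) -> str:
    # Collapse whitespace so pure formatting differences do not count as errors
    return " ".join(latex.split())

def exact_match(num_samples: int = 500) -> float:
    correct = 0
    for sample in test_dataset.select(range(num_samples)):
        prediction = predict_latex(sample["image"])  # hypothetical: runs model.generate as above
        correct += int(normalize(prediction) == normalize(sample["text"]))
    return correct / num_samples

print(f"Exact match: {exact_match():.2%}")
```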

---

## Results

| Metric | Value |
|--------|-------|
| Exact match | ~90–92% |
| LoRA parameters | < 1% of model weights |
| Training time | ~20–40 min on an A100 |
| Model size | 7B (4-bit) |

---

## Future Work

- Extend to handwritten formulas (e.g., the CROHME dataset)
- Add LaTeX syntax validation or auto-correction
- Build a lightweight Gradio/Streamlit demo interface

---

## Folder Structure

```
.
├── Qwen2__VL_image_to_latext.ipynb   # Training Notebook
├── output/                           # Saved fine-tuned model
└── README.md
```

---