Qwen2.5_3B_VL_PDF_ROTATION_DETECTION_MK1
State-of-the-Art Rotation Detection (Binary Classification: Flipped / Not Flipped)
We have successfully fine-tuned Qwen2.5-VL 3B for robust binary rotation detection, specifically targeting document page orientation (flipped vs. correctly oriented).
The model was trained on approximately 8GB of scanned PDF page image data, comprising 12,000 annotated samples.
This fine-tuned model is purpose-built for document-specific rotation detection, rather than general-purpose image classification.
It significantly enhances automated document processing pipelines, especially in scenarios where end-users may inadvertently scan pages upside down.
Such orientation issues often lead to suboptimal performance in downstream OCR systems and Vision-Language Models (VLMs).
By integrating this model, organizations can improve data quality and consistency in document workflows, enabling more accurate and efficient information extraction.
The model responds with 'Yes' if the page is flipped and 'No' if it is not.
Metric | Base | Fine-Tuned | Absolute Gain | Relative Improvement |
---|---|---|---|---|
Precision | 66.22% | 100.00% | +33.78 pp | +51.0% |
Recall | 14.80% | 100.00% | +85.20 pp | +575.7% |
F1 Score | 24.20% | 100.00% | +75.80 pp | +313.2% |
Accuracy | 69.30% | 100.00% | +30.70 pp | +44.3% |
Base Model Performance (Eval Set 1200 Samples):
{'precision': 0.6622, 'recall': 0.148, 'f1_score': 0.242, 'accuracy': 0.693}
Fine-Tuned Model Performance (Eval Set, 1200 Samples):
Despite being trained exclusively on 996×996 pixel images of PDF pages, the model demonstrates improved performance when evaluated on higher-resolution inputs.
996x996: {'precision': 1.0, 'recall': 0.9756, 'f1_score': 0.9877, 'accuracy': 0.9922}
1992x1992: {'precision': 1.0, 'recall': 1.0, 'f1_score': 1.0, 'accuracy': 1.0}
Fine-Tuned Model Performance (Real-World German Test Set, 1,095 Pages): real-world PDFs from the intended deployment domain, excluded from training and general validation, with empty pages removed.
996x996: {'precision': 1.0, 'recall': 0.9982, 'f1_score': 0.9991, 'accuracy': 0.9991}
1992x1992: {'precision': 1.0, 'recall': 1.0, 'f1_score': 1.0, 'accuracy': 1.0}
Even though the training dataset is mostly English, the fine-tuned model performs even better on the German test set.
This suggests good performance across Latin-script languages, which has yet to be evaluated.
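The exact evaluation script is not part of this card; the sketch below shows one plausible way to compute the metrics reported above for this binary task, assuming lists of gold labels and model answers where flipped pages are the positive class. The variable names are illustrative only.

# Illustrative metric computation for the flipped / not-flipped task.
# y_true / y_pred are booleans (True = flipped); this is not the exact
# evaluation code behind the numbers above.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [True, False, True, False]   # gold orientation labels
y_pred = [True, False, False, False]  # model answers mapped to booleans ("Yes" -> True)

print({
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1_score": f1_score(y_true, y_pred),
    "accuracy": accuracy_score(y_true, y_pred),
})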
Image Size
Tokens per image size:
- 448x448: 400 tokens
- 996x996: 1400 tokens
- 1992x1992: 5200 tokens
996x996 seemed to be the best trade-off in terms of speed and quality during training.
Classification quality does benefit from being presented with 1992x1992 images, though.
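How pages are rendered and resized is not prescribed by this card; the sketch below is one minimal way to bring an already-rendered page image to the chosen resolution and base64-encode it for the request shown under USAGE. The file name, the helper name (encode_page), and the JPEG choice are assumptions.

# Minimal sketch: resize a rendered page image to the chosen resolution and
# base64-encode it for the chat request. File name and helper name are
# placeholders, not part of the model's API.
import base64
from io import BytesIO

from PIL import Image

def encode_page(path: str, size: int = 996) -> str:
    img = Image.open(path).convert("RGB").resize((size, size))
    buf = BytesIO()
    img.save(buf, format="JPEG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")

image_base64 = encode_page("page_001.jpg", size=996)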
Efficiency:
Capable of processing multiple pages per second on consumer hardware such as the RTX 4060 Ti.
Hardware: RTX 4060 Ti (90% memory utilization)
Inference Engine: vLLM
Image Size | Speed | F1 |
---|---|---|
448 × 448 | 10.5 pages/s | 0.4787 |
996 × 996 | 2.5 pages/s | 0.9991 |
1992 × 1992 | 0.4 pages/s | 1.0 |
Image encoding is the main bottleneck and limits parallelism to 2–4 concurrent requests (996x996 on the consumer setup).
Generations are short and KV-cache usage is low (~2% for 996x996).
Faster GPUs will help more than additional GPU memory.
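Because the bottleneck is image encoding rather than the KV cache, a small client-side pool of a few concurrent requests is typically enough to keep the GPU busy. The sketch below is an assumption about how batching could look; it reuses the encode_page helper sketched above and the call_vlm function from the USAGE section below, and max_workers is a tuning knob rather than a measured optimum.

# Sketch: keep only a few requests in flight, matching the 2-4 request
# parallelism limit noted above. call_vlm / encode_page are the helper
# sketches elsewhere in this card; max_workers is an assumption.
from concurrent.futures import ThreadPoolExecutor

def classify_pages(paths, prompt, base_url="http://localhost:8000", max_workers=3):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(
            lambda p: call_vlm(base_url, prompt, encode_page(p)),
            paths,
        ))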
Training Hyperparams
Image Dimensions: 996x996
batch_size=16
learning_rate=2e-5
max_grad_norm=0.5
warmup_ratio=0.03
weight_decay=0.01
epochs=1
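The exact fine-tuning stack is not specified here; as an illustration, the hyperparameters above map onto Hugging Face TrainingArguments roughly as follows (output_dir is a placeholder).

# Illustrative mapping of the hyperparameters above onto Hugging Face
# TrainingArguments; this is not the actual fine-tuning script.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen25_vl_rotation_mk1",  # placeholder
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    max_grad_norm=0.5,
    warmup_ratio=0.03,
    weight_decay=0.01,
    num_train_epochs=1,
)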
Dataset used: https://www.kaggle.com/datasets/manisha717/dataset-of-pdf-files
Artificially augmented to increase dataset size and generalization ability.
50/50 flip ratio (half of the pages rotated by 180 degrees).
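The flip augmentation itself is straightforward to reproduce: each page is rotated by 180 degrees with 50% probability and labelled accordingly. The snippet below is a sketch of that idea, not the exact augmentation script used for training.

# Sketch of the 50/50 flip augmentation: rotate a page by 180 degrees with
# probability 0.5 and record the label ("Yes" = flipped). Not the exact
# augmentation code used to build the training set.
import random

from PIL import Image

def augment(img: Image.Image) -> tuple[Image.Image, str]:
    if random.random() < 0.5:
        return img.rotate(180), "Yes"
    return img, "No"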
USAGE:
Prompt:
# Prompt to use
prompt = '''You are given an image of a document page.
Your task is to determine whether the page is upside down (flipped by 180 degrees).
Ignore small rotations or skew.
Answer with 'Yes' if the page is flipped, and 'No' if it is oriented correctly.'''
# Example call to a vLLM OpenAI-compatible server
import traceback
from typing import Any, Dict, Optional

import requests


def call_vlm(
    base_url: str,
    prompt: str,
    image_base64: str,
    max_tokens: int = 3,
    headers: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """
    Call the VLM with a text prompt and a base64-encoded page image.
    """
    payload = {
        "model": "AlioLeuchtmann/Qwen2.5_3B_VL_PDF_ROTATION_DETECTION_MK1",
        "messages": [  # See https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct for other content types.
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    try:
        response = requests.post(
            f"{base_url}/v1/chat/completions",
            headers=headers,
            json=payload,
            timeout=120,
        )
        response.raise_for_status()
        return response.json()
    except Exception:
        traceback.print_exc()
        return {}
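Since the model is trained to answer with a single 'Yes' or 'No', the response can be reduced to a boolean. The snippet below is a small usage sketch; the server URL and the encode_page helper are assumptions, while the model name and prompt come from this card.

# Usage sketch: classify one page and map the one-word answer to a boolean.
# The server URL and encode_page helper are assumptions, not a fixed API.
result = call_vlm(
    base_url="http://localhost:8000",
    prompt=prompt,
    image_base64=encode_page("page_001.jpg"),
)
answer = result.get("choices", [{}])[0].get("message", {}).get("content", "")
is_flipped = answer.strip().lower().startswith("yes")
print("Flipped" if is_flipped else "Not flipped")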
Base model: Qwen/Qwen2.5-VL-3B-Instruct