Qwen2.5_3B_VL_PDF_ROTATION_DETECTION_MK1

State-of-the-Art Rotation Detection (Binary Classification: Flipped / Not Flipped)

We have successfully fine-tuned Qwen2.5-VL 3B for robust binary rotation detection, specifically targeting document page orientation (flipped vs. correctly oriented).
The model was trained on approximately 8GB of scanned PDF page image data, comprising 12,000 annotated samples.

This fine-tuned model is purpose-built for document-specific rotation detection, rather than general-purpose image classification.
It significantly enhances automated document processing pipelines, especially in scenarios where end-users may inadvertently scan pages upside down.
Such orientation issues often lead to suboptimal performance in downstream OCR systems and Vision-Language Models (VLMs).

By integrating this model, organizations can improve data quality and consistency in document workflows, enabling more accurate and efficient information extraction.

The model responds with "Yes" if the page is flipped and "No" if it is not.

| Metric | Base | Fine-Tuned | Absolute Gain | Relative Improvement |
| --- | --- | --- | --- | --- |
| Precision | 66.22% | 100.00% | +33.78 pp | +51.0% |
| Recall | 14.80% | 100.00% | +85.20 pp | +575.7% |
| F1 Score | 24.20% | 100.00% | +75.80 pp | +313.2% |
| Accuracy | 69.30% | 100.00% | +30.70 pp | +44.3% |

Base Model Performance (Eval Set 1200 Samples):

{'precision': 0.6622, 'recall': 0.148, 'f1_score': 0.242, 'accuracy': 0.693}
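
For reference, metrics in this format can be computed with scikit-learn, treating a flipped page ("Yes") as the positive class. This is a sketch with placeholder labels, not the actual evaluation script:

    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

    # 1 = flipped ("Yes"), 0 = correctly oriented ("No") -- placeholder labels
    y_true = [1, 0, 1, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 0]

    metrics = {
        "precision": round(precision_score(y_true, y_pred), 4),
        "recall": round(recall_score(y_true, y_pred), 4),
        "f1_score": round(f1_score(y_true, y_pred), 4),
        "accuracy": round(accuracy_score(y_true, y_pred), 4),
    }
    print(metrics)  # same format as the result dictionaries in this section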

Fine-Tuned Model Performance (Eval Set, 1200 Samples):

Despite being trained exclusively on 996x996 pixel images of PDF pages, the model demonstrates improved performance when evaluated on higher-resolution inputs.

996x996: {'precision': 1.0, 'recall': 0.9756, 'f1_score': 0.9877, 'accuracy': 0.9922}
1992x1992: {'precision': 1.0, 'recall': 1.0, 'f1_score': 1.0, 'accuracy': 1.0}

Real-world test set: 1,095 pages (German) from real-world PDFs within the intended deployment domain (excluded from training and general validation; empty pages removed):
996x996: {'precision': 1.0, 'recall': 0.9982, 'f1_score': 0.9991, 'accuracy': 0.9991}
1992x1992: {'precision': 1.0, 'recall': 1.0, 'f1_score': 1.0, 'accuracy': 1.0}

Even though the training dataset is mostly English, the fine-tuned model performs even better on the German test set.
This suggests good performance across Latin-script languages, though this has yet to be evaluated.

Image Size

Tokens per image size:

  • 448x448: 400 tokens
  • 996x996: 1400 tokens
  • 1992x1992: 5200 tokens

996x996 seemed to be the best trade-off between speed and quality during training.
Classification quality does benefit from 1992x1992 images, though.
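
A minimal sketch of preparing a page at the chosen resolution, assuming Pillow and a page already rendered to an image file. The file name and JPEG quality are illustrative; the card does not prescribe a specific preprocessing pipeline:

    import base64
    import io

    from PIL import Image

    def encode_page(path: str, size: int = 996) -> str:
        """Resize a rendered page image to size x size and return it as a base64 JPEG string."""
        img = Image.open(path).convert("RGB")
        img = img.resize((size, size), Image.Resampling.LANCZOS)
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=90)
        return base64.b64encode(buf.getvalue()).decode("utf-8")

    # 996x996 is the speed/quality sweet spot; 1992x1992 maximizes classification quality.
    image_base64 = encode_page("page_001.jpg", size=996)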

Efficiency:

Capable of processing multiple pages per second on consumer hardware such as an RTX 4060 Ti.

Hardware: RTX 4060 Ti (90% GPU memory utilization)
Inference Engine: vLLM

| Image Size | Speed | F1 |
| --- | --- | --- |
| 448 × 448 | 10.5 pages/s | 0.4787 |
| 996 × 996 | 2.5 pages/s | 0.9991 |
| 1992 × 1992 | 0.4 pages/s | 1.0 |

Image encoding is the main bottleneck and limits parallelism to 2–4 concurrent requests (at 996x996 on the consumer setup).
Generations are short and KV-cache usage is low (~2% at 996x996).
Faster GPUs will therefore help more than additional GPU memory.
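
Because the encoder caps useful parallelism at roughly 2–4 in-flight requests at 996x996, a small client-side worker pool is enough to saturate the GPU. A minimal sketch, assuming a local vLLM OpenAI-compatible server at http://localhost:8000 and pages already encoded as base64 JPEG (URL and function names are illustrative):

    import concurrent.futures

    import requests

    VLLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed local server
    MODEL_ID = "AlioLeuchtmann/Qwen2.5_3B_VL_PDF_ROTATION_DETECTION_MK1"

    def classify_page(image_base64: str, prompt: str) -> bool:
        """Return True if the model answers 'Yes' (page is flipped)."""
        payload = {
            "model": MODEL_ID,
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}},
                ],
            }],
            "max_tokens": 3,
            "temperature": 0.0,
        }
        resp = requests.post(VLLM_URL, json=payload, timeout=120)
        resp.raise_for_status()
        answer = resp.json()["choices"][0]["message"]["content"]
        return answer.strip().lower().startswith("yes")

    def classify_batch(pages_base64: list[str], prompt: str, workers: int = 4) -> list[bool]:
        """Classify many pages with a small worker pool (2-4 workers matches the encoder bottleneck)."""
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(lambda b64: classify_page(b64, prompt), pages_base64))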

Training Hyperparams

Image Dimensions: 996x996
batch_size=16
learning_rate=2e-5
max_grad_norm=0.5
warmup_ratio=0.03
weight_decay=0.01
epochs=1
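
The training framework is not specified in this card; purely as an illustration, the hyperparameters above map onto Hugging Face `transformers` `TrainingArguments` as follows (a sketch, not the actual training script; `output_dir` and `bf16` are assumptions):

    from transformers import TrainingArguments

    # Illustration only: the card does not state which trainer was used.
    training_args = TrainingArguments(
        output_dir="qwen25_vl_rotation_mk1",  # hypothetical output path
        per_device_train_batch_size=16,       # batch_size=16
        learning_rate=2e-5,
        max_grad_norm=0.5,
        warmup_ratio=0.03,
        weight_decay=0.01,
        num_train_epochs=1,
        bf16=True,                            # assumed, matching the BF16 weights
    )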

Dataset used: https://www.kaggle.com/datasets/manisha717/dataset-of-pdf-files
Artificially augmented to increase dataset size and generalization ability.
50/50 flip ratio.
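
A minimal sketch of this kind of augmentation, assuming Pillow (the 50/50 flip ratio is from the card; the exact pipeline, label strings, and any resizing are assumptions):

    import random

    from PIL import Image

    def augment_page(img: Image.Image) -> tuple[Image.Image, str]:
        """Flip a page image by 180 degrees with 50% probability and return (image, label)."""
        if random.random() < 0.5:
            return img.rotate(180), "Yes"  # flipped
        return img, "No"                   # correctly oriented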

USAGE:

Prompt:


    # Prompt to use 
    prompt = '''You are given an image of a document page.
Your task is to determine whether the page is upside down (flipped by 180 degrees).
Ignore small rotations or skew.
Answer with 'Yes' if the page is flipped, and 'No' if it is oriented correctly.'''

    # Example call to a vLLM OpenAI-compatible server.
    # The method below belongs to a small client class that holds a
    # requests.Session (self.session), the server URL (self.base_url)
    # and the request headers (self.headers).
    import traceback
    from typing import Any, Dict

    def call_vlm(
        self,
        prompt: str,
        image_base64: str,
        max_tokens: int = 3
    ) -> Dict[str, Any]:
        """
        Call the VLM with a text prompt and a base64-encoded page image.
        Returns the parsed JSON response, or an empty dict on failure.
        """
        payload = {
            "model": "AlioLeuchtmann/Qwen2.5_3B_VL_PDF_ROTATION_DETECTION_MK1",
            "messages": [  # See https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct for other content types.
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}}
                    ]
                }
            ],
            "max_tokens": max_tokens,
            "temperature": 0.0,
        }

        try:
            response = self.session.post(
                f"{self.base_url}/v1/chat/completions",
                headers=self.headers,
                json=payload,
                timeout=120
            )
            response.raise_for_status()
            return response.json()
        except Exception:
            # Log the full traceback but keep the pipeline running.
            traceback.print_exc()
            return {}
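
A hypothetical way to interpret the response, where `client` stands for an instance of the class containing `call_vlm`:

    # Hypothetical usage: `client` is an instance of the class defining call_vlm.
    response = client.call_vlm(prompt, image_base64)
    answer = response.get("choices", [{}])[0].get("message", {}).get("content", "")
    is_flipped = answer.strip().lower().startswith("yes")  # True if the page is upside down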