---
license: apache-2.0
language:
  - en
  - zh
base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
  - text-generation-inference
  - trl
  - ocr
  - vision-language
  - reasoning
  - grounded-visual-reasoning
  - sft
  - grpo
  - no-thinking
  - code
  - thinking=0
---

![1.png](1.png)

# Enesidaon-VLR-7B-no-Thinking

Enesidaon-VLR-7B-no-Thinking is an experimental, high-fidelity vision-language reasoning model designed for fine-grained multimodal comprehension. Built on top of Qwen2.5-VL-7B-Instruct, it strengthens image captioning, sampled video reasoning, and detailed document understanding. Unlike standard approaches, it explicitly grounds its textual reasoning steps in visual coordinates, enabling precise and explainable multimodal reasoning. The model is trained with supervised fine-tuning (SFT) on visually grounded reasoning traces and further optimized with GRPO reinforcement learning, resulting in strong chain-of-thought reasoning without overthinking or unnecessary hallucination.

## Key Enhancements

  • Visually-Grounded Reasoning and Explanation: Explicitly anchors reasoning chains to image regions and document elements for transparent, explainable multimodal outputs.
  • Advanced Image Captioning: Produces context-aware, detailed captions with grounded reasoning for improved visual understanding.
  • Sampled Video Reasoning: Handles long-duration video inputs with temporal reasoning for content summarization and QA.
  • Context-Aware Document Analysis: Excels in document retrieval, structured and unstructured content extraction, and analytical content recognition.
  • Fine-Grained Visual Grounding: Enhanced capability for multimodal linking across charts, tables, and graphical elements with spatial grounding.
  • Reinforcement-Learned Reasoning: Trained with GRPO to incentivize accurate, grounded reasoning aligned with visual cues.
  • State-of-the-Art Benchmarking: Competitive results on OCR, visual QA, and reasoning tasks including DocVQA, MathVista, RealWorldQA, and MTVQA.

## Quick Start with Transformers

```python
# Requires: pip install transformers accelerate qwen-vl-utils
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model with automatic dtype selection and device placement
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Enesidaon-VLR-7B-no-Thinking", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/Enesidaon-VLR-7B-no-Thinking")

# A single-turn conversation with one image and a text instruction
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image with reasoning."},
        ],
    }
]

# Build the chat prompt and collect the vision inputs referenced in the messages
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, strip the prompt tokens from the output, and decode
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
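
Because the model is trained to anchor its reasoning in visual coordinates, you can prompt it to cite the image regions behind each step and recover them for inspection. The sketch below reuses the `model` and `processor` from the Quick Start; the prompt wording, the sample question, and the `[x1, y1, x2, y2]` output convention are illustrative assumptions rather than a documented output schema, so adjust the parsing to whatever format the checkpoint actually emits.

```python
import re

# Hypothetical grounded-reasoning prompt: ask the model to cite a pixel-space
# bounding box [x1, y1, x2, y2] for every reasoning step. The coordinate format
# is an assumption for this sketch, not a guaranteed output schema.
grounded_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {
                "type": "text",
                "text": (
                    "Answer the question and, for each reasoning step, cite the "
                    "supporting image region as [x1, y1, x2, y2] pixel coordinates. "
                    "Question: What is the person on the beach doing?"
                ),
            },
        ],
    }
]

text = processor.apply_chat_template(grounded_messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(grounded_messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)
response = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
)[0]

# Pull out any [x1, y1, x2, y2] boxes cited in the grounded reasoning.
boxes = [
    tuple(map(int, m.groups()))
    for m in re.finditer(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]", response)
]
print(response)
print("Cited regions:", boxes)
```

Overlaying the recovered boxes on the input image (for example with PIL's `ImageDraw`) is a quick way to check that the cited regions actually support the stated reasoning.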

## Intended Use

This model is intended for:

  • Grounded visual reasoning with spatially-aligned chain-of-thought explanations.
  • Accurate, explainable image captioning and video reasoning (a sampled-video sketch follows this list).
  • Multimodal document analysis with visually-referenced reasoning steps.
  • Analytical content recognition, table/chart interpretation, and structured extraction.
  • Multilingual reasoning over documents and visual scenes for global applications.
  • Educational and enterprise solutions requiring step-by-step reasoning transparency.
  • Robotic and mobile device automation with vision-guided contextual decision-making.
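
For the sampled video reasoning use case above, here is a minimal sketch, assuming a recent `qwen-vl-utils` and reusing the `model` and `processor` from the Quick Start. The video path, frame rate, and pixel budget are placeholders to adapt to your input.

```python
# Minimal sampled-video sketch; "file:///path/to/video.mp4" is a placeholder path.
video_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",  # placeholder
                "max_pixels": 360 * 420,  # cap per-frame resolution to save memory
                "fps": 1.0,               # sample one frame per second
            },
            {"type": "text", "text": "Summarize the key events in this video."},
        ],
    }
]

text = processor.apply_chat_template(video_messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(video_messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
)[0])
```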

## Limitations

  • May require high memory for long videos and complex document inputs (see the memory-saving sketch after this list).
  • Performance can degrade with extremely low-resolution or heavily occluded images.
  • Not fully optimized for real-time inference on low-resource edge devices.
  • Visual token configurations significantly impact grounded reasoning performance.
  • Some rare cases of reasoning drift or incomplete grounding.
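
If memory is the constraint, a common mitigation (a sketch, not the only configuration) is to load the checkpoint in bfloat16 with FlashAttention-2 when the `flash-attn` package is installed, and to cap the visual token budget through the processor's `min_pixels`/`max_pixels`. The pixel budgets below are illustrative and trade resolution, and therefore grounding fidelity, for memory.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

model_id = "prithivMLmods/Enesidaon-VLR-7B-no-Thinking"

# bfloat16 + FlashAttention-2 (requires the flash-attn package) reduces memory
# use and speeds up attention over long visual token sequences.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# Bound the number of visual tokens per image: min_pixels/max_pixels are
# illustrative budgets that trade resolution for memory.
processor = AutoProcessor.from_pretrained(
    model_id,
    min_pixels=256 * 28 * 28,
    max_pixels=1024 * 28 * 28,
)
```

Lowering `max_pixels` shrinks the number of visual tokens per image, which is usually the dominant memory cost for dense documents and long sampled videos.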

## References

  • YaRN: Efficient Context Window Extension of Large Language Models
  • Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
  • Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
  • A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy