|
---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
- zh
tags:
- text-generation-inference
- science
- universal truth
- chamaeleontis
- reason
- vl
pipeline_tag: image-text-to-text
library_name: transformers
---
|
|
# **Chamaeleontis-7B-Reason-rVL** |
|
|
|
> The **Chamaeleontis-7B-Reason-rVL** model is a general-purpose multimodal reasoning model built on **Qwen2.5-VL-7B-Instruct**, optimized for **image and video frame understanding**, **contextual analysis**, and **natural language reasoning**. It excels at extracting structured insights from both static visuals and video streams through step-by-step logical interpretation and visual common sense grounding. |
|
|
|
> Chamaeleontis: a reasoning-focused Qwen2.5-VL model for visual understanding and chain-of-thought (CoT) reasoning.
|
|
|
--- |
|
|
|
## Key Capabilities |
|
|
|
* **Visual Frame Reasoning**: Understands and interprets key frames from videos or images, attending to spatial and physical relationships.
* **Chain-of-Thought Natural Language Output**: Generates thoughtful, logically connected answers to visual queries.
* **General-Purpose Multimodal Insight**: Applicable across diverse domains for both still-image and sequential video analysis.
* **Custom Frame Selection Logic**: Applies smart frame sampling to select high-relevance visual inputs during training and inference (a minimal sampling sketch follows this list).
* **Common Sense Visual Comprehension**: Supports physical reasoning, object-interaction detection, and real-world logic estimation from visuals.
* **Q&A Over Visual Inputs**: Suited to question answering, summarization, and temporal information extraction from visual sources.
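
The card does not specify the exact frame-selection logic, but the underlying idea is to feed the model a small set of representative frames rather than every frame. Below is a minimal, illustrative sketch of uniform frame sampling with OpenCV; the helper name `sample_frames`, the default of 8 frames, and the use of OpenCV are assumptions for illustration, not part of this model's pipeline.

```python
# Illustrative only: a minimal uniform frame-sampling helper using OpenCV.
# The exact sampling logic used by this model is not specified in the card.
import cv2

def sample_frames(video_path: str, num_frames: int = 8):
    """Return `num_frames` evenly spaced BGR frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```

In practice, relevance-aware strategies (e.g., scoring frames by scene change) can replace the even spacing above, but uniform sampling is the simplest baseline.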
|
|
|
--- |
|
|
|
## Quick Start with Transformers |
|
|
|
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model with automatic dtype selection and device placement
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Chamaeleontis-7B-Reason-rVL", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/Chamaeleontis-7B-Reason-rVL")

# Build a chat message containing a video and a text prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/your/video.mp4"},
            {"type": "text", "text": "Briefly describe the physical events and interactions in this video using logical steps."},
        ],
    }
]

# Apply the chat template and extract the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens from the output before decoding
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
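
The quick start above uses a video input; single images follow the same pattern with an `"image"` content entry, per the standard Qwen2.5-VL message format. The file path and question below are placeholders.

```python
# Single-image query: same pipeline, with an "image" entry instead of "video".
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image.jpg"},
            {"type": "text", "text": "Explain, step by step, what is physically happening in this image."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens before decoding
print(processor.batch_decode(output[:, inputs.input_ids.shape[1]:], skip_special_tokens=True))
```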
|
|
|
--- |
|
|
|
## Use Cases |
|
|
|
* **Image and Video Q&A**: Answers natural language queries with logical, visually grounded responses.
* **Physical Event Interpretation**: Understands object motion, cause-and-effect dynamics, and temporal interactions.
* **Visual Insight Summarization**: Extracts the core narrative and insights from multimedia content.
* **Educational Content Understanding**: Supports reasoning-based tasks in educational and research video analysis.
* **Multimodal Reasoning**: Merges visual and textual understanding to support complex interpretation tasks.
|
|
|
--- |
|
|
|
## Limitations |
|
|
|
* **Experimental Architecture**: While robust in general use, the model may need fine-tuning for domain-specific datasets.
* **Inference Overhead**: Long or high-resolution video processing carries high GPU memory requirements.
* **Frame Sampling Bias**: Output quality depends on the relevance and spacing of sampled frames.
* **Context Window Boundaries**: Very long inputs may require segmentation or hierarchical reasoning strategies (a chunked-inference sketch follows this list).
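
As an illustration of the segmentation strategy mentioned above, one workable pattern is to split a long clip into fixed-size frame segments, query each independently, and then reason over the per-segment answers. This sketch reuses `model` and `processor` from the quick start; the frame paths, the segment size of 16, and the `ask` helper are illustrative assumptions, not part of the model's API (Qwen2.5-VL's `qwen_vl_utils` accepts a list of frame images as a video input).

```python
# Hypothetical sketch: hierarchical reasoning over a long clip by querying
# fixed-size frame segments, then aggregating the per-segment answers.
# Frame paths, segment size, and the `ask` helper are assumptions.

def ask(video_frames, question):
    """Run one video-QA query; `video_frames` is a list of frame image paths."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "video", "video": video_frames},
            {"type": "text", "text": question},
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    return processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]

# Query 16-frame segments independently, then review the partial answers together.
frame_paths = [f"frames/frame_{i:04d}.jpg" for i in range(64)]  # pre-extracted frames
segments = [frame_paths[i:i + 16] for i in range(0, len(frame_paths), 16)]
notes = [ask(seg, "Describe the physical events in this segment.") for seg in segments]
print("\n".join(notes))
```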