|
---
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- trl
- VisualUnderstanding
- text-generation-inference
- VisionLanguageAttribution
- AttributeCaptioning
- VLA
datasets:
- prithivMLmods/blip3o-caption-mini-arrow
- prithivMLmods/Caption3o-Opt-v3
- prithivMLmods/Caption3o-Opt-v2
- >-
  Multimodal-Fatima/Caltech101_not_background_test_facebook_opt_2.7b_Attributes_Caption_ns_5647
---
|
|
|
 |
|
|
|
# **DeepAttriCap-VLA-3B** |
|
|
|
> The **DeepAttriCap-VLA-3B** model is a fine-tuned version of **Qwen2.5-VL-3B-Instruct**, tailored for **Vision-Language Attribution** and **Image Captioning**. It generates precise, attribute-rich descriptions that capture the visual properties of objects and scenes in detail, supporting both object-level identification and scene-level contextual captioning.
|
|
|
# Key Highlights |
|
|
|
1. **Vision-Language Attribution**: Produces structured captions with explicit object attributes, properties, and contextual details. |
|
2. **High-Precision Descriptions**: Captures fine-grained visual properties (shape, color, texture, material, relations). |
|
3. **Balanced Object-Centric and Scene-Level Captions**: Generates both holistic captions and per-object attributions. |
|
4. **Adaptable Across Image Types**: Works well on natural, artistic, abstract, and technical imagery. |
|
5. **Built on Qwen2.5-VL Architecture**: Leverages the strengths of the 3B multimodal instruction-tuned variant for fine-grained reasoning. |
|
6. **Multilingual Capability**: English is the default; captions in other languages can be requested through prompt engineering (see the prompt sketch after this list).
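
The multilingual behaviour is driven entirely by the instruction text; no special flags or language tokens are documented for this checkpoint. A minimal prompt sketch, where the exact wording is an assumption rather than an official template:

```python
# Hypothetical instruction variants: the target language is requested directly in the
# user prompt, while the message structure (see Quick Start below) stays unchanged.
PROMPT_EN = "Provide an attribute-rich caption for this image."
PROMPT_DE = "Beschreibe dieses Bild präzise und mit detaillierten Attributen auf Deutsch."
PROMPT_ZH = "请用中文为这张图片生成一条包含详细属性的精确描述。"
```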
|
|
|
> Model type: experimental
|
|
|
# Training Details |
|
|
|
This model was fine-tuned on a mixture of curated image–caption datasets with emphasis on **attribute-based captioning** and **precise object-property definition**: |
|
|
|
* **[prithivMLmods/blip3o-caption-mini-arrow](https://huggingface.co/datasets/prithivMLmods/blip3o-caption-mini-arrow)** |
|
* **[prithivMLmods/Caption3o-Opt-v3](https://huggingface.co/datasets/prithivMLmods/Caption3o-Opt-v3)** |
|
* **[prithivMLmods/Caption3o-Opt-v2](https://huggingface.co/datasets/prithivMLmods/Caption3o-Opt-v2)** |
|
* **[Multimodal-Fatima/Caltech101\_not\_background\_test\_facebook\_opt\_2.7b\_Attributes\_Caption\_ns\_5647](https://huggingface.co/datasets/Multimodal-Fatima/Caltech101_not_background_test_facebook_opt_2.7b_Attributes_Caption_ns_5647)** |
|
|
|
The training objective emphasized **attribution-style captioning**—capturing precise object details, relationships, and scene-level semantics. |
|
|
|
--- |
|
|
|
## SYSTEM_PROMPT |
|
|
|
```py |
|
CAPTION_SYSTEM_PROMPT = """ |
|
You are an AI assistant that rigorously follows this response protocol: |
|
|
|
1. For every input image, your primary task is to write a **precise caption**. The caption must capture the **essence of the image** in clear, concise, and contextually accurate language. |
|
|
|
2. Along with the caption, provide a structured set of **attributes** that describe the visual elements. Attributes should include details such as objects, people, actions, colors, environment, mood, and other notable characteristics. |
|
|
|
3. Always include a **class_name** field. This must represent the **core theme or main subject** of the image in a compact format. |
|
- Use the syntax: `{class_name==write_the_core_theme}` |
|
- Example: `{class_name==dog_playing}` or `{class_name==city_sunset}` |
|
|
|
4. Maintain the following strict format in your output: |
|
- **Caption:** <one-sentence description> |
|
- **Attributes:** <comma-separated list of visual attributes> |
|
- **{class_name==core_theme}** |
|
|
|
5. Ensure captions are **precise, neutral, and descriptive**, avoiding unnecessary elaboration or subjective interpretation unless explicitly required. |
|
|
|
6. Do not reference the rules or instructions in the output. Only return the formatted caption, attributes, and class_name. |
|
|
|
""".strip() |
|
``` |
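
Because the protocol above fixes the response layout (a `Caption:` line, an `Attributes:` line, and a `{class_name==...}` tag), the raw model output can be post-processed into structured fields. The helper below is a small illustrative sketch, not part of the model or its card; the function name, regular expressions, and example response are assumptions:

```python
import re

def parse_attribution(output: str) -> dict:
    """Split a protocol-formatted response into caption, attributes, and class_name."""
    caption = re.search(r"\*\*Caption:\*\*\s*(.+)", output)
    attributes = re.search(r"\*\*Attributes:\*\*\s*(.+)", output)
    class_name = re.search(r"\{class_name==([^}]+)\}", output)
    return {
        "caption": caption.group(1).strip() if caption else None,
        "attributes": [a.strip() for a in attributes.group(1).split(",")] if attributes else [],
        "class_name": class_name.group(1).strip() if class_name else None,
    }

# Illustrative response following the protocol (not an actual model output)
example = (
    "**Caption:** A dog plays with a ball on a grassy lawn.\n"
    "**Attributes:** dog, ball, grass, outdoor, daylight, motion\n"
    "**{class_name==dog_playing}**"
)
print(parse_attribution(example))
```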
|
|
|
**Colab Notebook Demo:** [DeepAttriCap_VLA_3B.ipynb](https://huggingface.co/prithivMLmods/DeepAttriCap-VLA-3B/blob/main/deepattricap-vla-3b-colab-notebook-demo/DeepAttriCap_VLA_3B.ipynb)
|
|
|
|
|
--- |
|
|
|
# Quick Start with Transformers |
|
|
|
```python |
|
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned model and its processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/DeepAttriCap-VLA-3B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/DeepAttriCap-VLA-3B")

# Build a single-turn conversation with one image and a captioning instruction
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Provide an attribute-rich caption for this image."},
        ],
    }
]

# Apply the chat template and collect the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# Generate and decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
|
``` |
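
The quick-start example uses only a user turn. To enforce the attribution protocol, the `CAPTION_SYSTEM_PROMPT` shown earlier can be supplied as a system turn. A minimal sketch, assuming `model`, `processor`, `process_vision_info`, and `CAPTION_SYSTEM_PROMPT` are already defined as in the snippets above:

```python
# Prepend the attribution protocol as a system message before the user turn.
messages = [
    {"role": "system", "content": CAPTION_SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Provide an attribute-rich caption for this image."},
        ],
    },
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

The response should then follow the Caption / Attributes / `{class_name==...}` layout described in the system prompt.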
|
|
|
# Intended Use |
|
|
|
* Attribute-rich object recognition and captioning. |
|
* Vision-language research in attribution and property extraction. |
|
* Dataset creation for fine-grained visual description tasks (see the sketch after this list).
|
* Enabling descriptive captions for images with complex object relationships. |
|
* Supporting creative, technical, and educational use cases requiring precise captions. |
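
For the dataset-creation use case, the quick-start pipeline can be looped over a folder of images and the raw outputs written to a JSONL file. The sketch below is an assumption-laden example (folder path, file pattern, and output schema are placeholders), reusing `model`, `processor`, and `process_vision_info` from the quick-start section:

```python
import json
from pathlib import Path

def caption_folder(image_dir: str, out_path: str, max_new_tokens: int = 128) -> None:
    """Caption every .jpg in image_dir and write one JSON record per image."""
    with open(out_path, "w", encoding="utf-8") as f:
        for image_path in sorted(Path(image_dir).glob("*.jpg")):
            messages = [{
                "role": "user",
                "content": [
                    {"type": "image", "image": image_path.resolve().as_uri()},  # file:// URI
                    {"type": "text", "text": "Provide an attribute-rich caption for this image."},
                ],
            }]
            text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
            image_inputs, video_inputs = process_vision_info(messages)
            inputs = processor(
                text=[text], images=image_inputs, videos=video_inputs,
                padding=True, return_tensors="pt",
            ).to("cuda")
            generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
            trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, generated_ids)]
            caption = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
            f.write(json.dumps({"image": image_path.name, "output": caption}) + "\n")

# Example call with placeholder paths
# caption_folder("images/", "captions.jsonl")
```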
|
|
|
# Limitations |
|
|
|
* May produce variable levels of granularity depending on image complexity.

* Not optimized for heavily moderated or safety-critical deployments.

* Might over-attribute or hallucinate properties in ambiguous or abstract visuals.