---
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- trl
- VisualUnderstanding
- text-generation-inference
- VisionLanguageAttribution
- AttributeCaptioning
- VLA
datasets:
- prithivMLmods/blip3o-caption-mini-arrow
- prithivMLmods/Caption3o-Opt-v3
- prithivMLmods/Caption3o-Opt-v2
- >-
  Multimodal-Fatima/Caltech101_not_background_test_facebook_opt_2.7b_Attributes_Caption_ns_5647
---

![2.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/EUorMi4zOONUl9USQzBRp.png)

# **DeepAttriCap-VLA-3B**

> The **DeepAttriCap-VLA-3B** model is a fine-tuned version of **Qwen2.5-VL-3B-Instruct**, tailored for **Vision-Language Attribution** and **Image Captioning**. This variant is designed to generate precise, attribute-rich descriptions that detail the visual properties of objects and scenes, supporting both object-level identification and contextual captioning.

# Key Highlights

1. **Vision-Language Attribution**: Produces structured captions with explicit object attributes, properties, and contextual details.
2. **High-Precision Descriptions**: Captures fine-grained visual properties (shape, color, texture, material, relations).
3. **Balanced Object-Centric and Scene-Level Captions**: Generates both holistic captions and per-object attributions.
4. **Adaptable Across Image Types**: Works well on natural, artistic, abstract, and technical imagery.
5. **Built on Qwen2.5-VL Architecture**: Leverages the strengths of the 3B multimodal instruction-tuned variant for fine-grained reasoning.
6. **Multilingual Capability**: English is the default, with multilingual captioning enabled through prompt engineering (see the example below).
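
For example, a caption in another language can be requested directly in the user turn. The snippet below is a minimal sketch using the same message format as the Quick Start section; the French request is purely illustrative.

```python
# Illustrative only: ask for the caption in another language via the user prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Provide an attribute-rich caption for this image in French."},
        ],
    }
]
```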

> model type: experimental

# Training Details

This model was fine-tuned on a mixture of curated image–caption datasets with emphasis on **attribute-based captioning** and **precise object-property definition**:

* **[prithivMLmods/blip3o-caption-mini-arrow](https://huggingface.co/datasets/prithivMLmods/blip3o-caption-mini-arrow)**
* **[prithivMLmods/Caption3o-Opt-v3](https://huggingface.co/datasets/prithivMLmods/Caption3o-Opt-v3)**
* **[prithivMLmods/Caption3o-Opt-v2](https://huggingface.co/datasets/prithivMLmods/Caption3o-Opt-v2)**
* **[Multimodal-Fatima/Caltech101\_not\_background\_test\_facebook\_opt\_2.7b\_Attributes\_Caption\_ns\_5647](https://huggingface.co/datasets/Multimodal-Fatima/Caltech101_not_background_test_facebook_opt_2.7b_Attributes_Caption_ns_5647)**

The training objective emphasized **attribution-style captioning**—capturing precise object details, relationships, and scene-level semantics.

---

## SYSTEM_PROMPT

```py
CAPTION_SYSTEM_PROMPT = """
You are an AI assistant that rigorously follows this response protocol:

1. For every input image, your primary task is to write a **precise caption**. The caption must capture the **essence of the image** in clear, concise, and contextually accurate language.

2. Along with the caption, provide a structured set of **attributes** that describe the visual elements. Attributes should include details such as objects, people, actions, colors, environment, mood, and other notable characteristics.

3. Always include a **class_name** field. This must represent the **core theme or main subject** of the image in a compact format.  
   - Use the syntax: `{class_name==write_the_core_theme}`  
   - Example: `{class_name==dog_playing}` or `{class_name==city_sunset}`  

4. Maintain the following strict format in your output:
   - **Caption:** <one-sentence description>  
   - **Attributes:** <comma-separated list of visual attributes>  
   - **{class_name==core_theme}**

5. Ensure captions are **precise, neutral, and descriptive**, avoiding unnecessary elaboration or subjective interpretation unless explicitly required.

6. Do not reference the rules or instructions in the output. Only return the formatted caption, attributes, and class_name.

""".strip()
```
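
To apply this protocol at inference time, the prompt can be passed as a system message ahead of the user turn. This is a minimal sketch assuming the standard Qwen2.5-VL chat template; the full generation pipeline is shown in the Quick Start section below.

```python
# Sketch: prepend CAPTION_SYSTEM_PROMPT as a system message before the user turn.
messages = [
    {"role": "system", "content": CAPTION_SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Caption this image."},
        ],
    },
]
```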

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://huggingface.co/prithivMLmods/DeepAttriCap-VLA-3B/blob/main/deepattricap-vla-3b-colab-notebook-demo/DeepAttriCap_VLA_3B.ipynb)


---

# Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned checkpoint and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/DeepAttriCap-VLA-3B", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/DeepAttriCap-VLA-3B")

# Build a single-turn chat request with one image and a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Provide an attribute-rich caption for this image."},
        ],
    }
]

# Apply the chat template and collect the vision inputs referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
).to("cuda")  # assumes a CUDA device is available

# Generate, then strip the prompt tokens from the output before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
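
When the `CAPTION_SYSTEM_PROMPT` above is used, the decoded output is expected to follow the `Caption / Attributes / {class_name==...}` layout. The helper below is an illustrative sketch for extracting those fields; the exact markers (including the bold asterisks) depend on how closely the model follows the format, so the regexes are kept deliberately tolerant.

```python
import re

def parse_attribution(text: str) -> dict:
    """Illustrative parser for the Caption / Attributes / class_name output format."""
    caption = re.search(r"\*{0,2}Caption:\*{0,2}\s*(.+)", text)
    attributes = re.search(r"\*{0,2}Attributes:\*{0,2}\s*(.+)", text)
    class_name = re.search(r"\{class_name==([^}]+)\}", text)
    return {
        "caption": caption.group(1).strip() if caption else None,
        "attributes": [a.strip(" *") for a in attributes.group(1).split(",")] if attributes else [],
        "class_name": class_name.group(1).strip() if class_name else None,
    }

print(parse_attribution(output_text[0]))
```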

# Intended Use

* Attribute-rich object recognition and captioning.
* Vision-language research in attribution and property extraction.
* Dataset creation for fine-grained visual description tasks.
* Enabling descriptive captions for images with complex object relationships.
* Supporting creative, technical, and educational use cases requiring precise captions.

# Limitations

* May produce variable levels of granularity depending on image complexity.
* Not optimized for highly censored or safety-critical deployments.
* Might over-attribute or hallucinate properties in ambiguous or abstract visuals.