  - prithivMLmods/Caption3o-Opt-v2
  - >-
    Multimodal-Fatima/Caltech101_not_background_test_facebook_opt_2.7b_Attributes_Caption_ns_5647
---

![2.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/EUorMi4zOONUl9USQzBRp.png)

# **DeepAttriCap-VLA-3B**

> The **DeepAttriCap-VLA-3B** model is a fine-tuned version of **Qwen2.5-VL-3B-Instruct**, tailored for **Vision-Language Attribution** and **Image Captioning**. This variant is designed to generate precise, attribute-rich descriptions that define the visual properties of objects and scenes in detail, ensuring both object-level identification and contextual captioning.

# Key Highlights

1. **Vision-Language Attribution**: Produces structured captions with explicit object attributes, properties, and contextual details.
2. **High-Precision Descriptions**: Captures fine-grained visual properties (shape, color, texture, material, relations).
3. **Balanced Object-Centric and Scene-Level Captions**: Generates both holistic captions and per-object attributions.
4. **Adaptable Across Image Types**: Works well on natural, artistic, abstract, and technical imagery.
5. **Built on Qwen2.5-VL Architecture**: Leverages the strengths of the 3B multimodal instruction-tuned variant for fine-grained reasoning.
6. **Multilingual Capability**: English is the default, with multilingual captioning enabled through prompt engineering.

# Training Details

This model was fine-tuned on a mixture of curated image–caption datasets with emphasis on **attribute-based captioning** and **precise object-property definition**:

* **[prithivMLmods/blip3o-caption-mini-arrow](https://huggingface.co/datasets/prithivMLmods/blip3o-caption-mini-arrow)**
* **[prithivMLmods/Caption3o-Opt-v3](https://huggingface.co/datasets/prithivMLmods/Caption3o-Opt-v3)**
* **[prithivMLmods/Caption3o-Opt-v2](https://huggingface.co/datasets/prithivMLmods/Caption3o-Opt-v2)**
* **[Multimodal-Fatima/Caltech101\_not\_background\_test\_facebook\_opt\_2.7b\_Attributes\_Caption\_ns\_5647](https://huggingface.co/datasets/Multimodal-Fatima/Caltech101_not_background_test_facebook_opt_2.7b_Attributes_Caption_ns_5647)**

The training objective emphasized **attribution-style captioning**: capturing precise object details, relationships, and scene-level semantics.

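To get a feel for the caption style the model was tuned toward, the datasets above can be pulled with the `datasets` library. The snippet below is a minimal sketch: the `train` split name is an assumption, and the column names differ per dataset, so inspect the features before relying on them.

```python
from datasets import load_dataset

# Load one of the fine-tuning datasets listed above.
# NOTE: the "train" split is an assumption; check the dataset card if loading fails.
ds = load_dataset("prithivMLmods/Caption3o-Opt-v2", split="train")

# Inspect the schema and a single record to see how attribute-rich the captions are.
print(ds.features)
print(ds[0])
```
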
# Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned checkpoint; device_map="auto" places the weights on the available GPU(s).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/DeepAttriCap-VLA-3B", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/DeepAttriCap-VLA-3B")

# A single-turn chat message containing one image and the captioning instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Provide an attribute-rich caption for this image."},
        ],
    }
]

# Render the chat template and extract the vision inputs referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# Generate, then strip the prompt tokens so only the new caption is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

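Only the user message needs to change to steer the output; the loading, templating, and generation steps above stay the same. The prompts below are illustrative (they are not prompts documented for this checkpoint) and cover per-object attribution and the multilingual captioning mentioned in the key highlights.

```python
# Drop-in replacements for the `messages` list in the example above.
demo_image = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"

# Per-object attribution (illustrative prompt):
attribution_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": demo_image},
            {"type": "text", "text": "List each object in the image with its color, material, shape, and spatial relation to the other objects."},
        ],
    }
]

# Multilingual captioning via prompt engineering (illustrative German prompt; quality may vary by language):
multilingual_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": demo_image},
            {"type": "text", "text": "Beschreibe dieses Bild auf Deutsch mit detaillierten Objektattributen."},
        ],
    }
]
```
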
# Intended Use

* Attribute-rich object recognition and captioning.
* Vision-language research in attribution and property extraction.
* Dataset creation for fine-grained visual description tasks (see the sketch after this list).
* Enabling descriptive captions for images with complex object relationships.
* Supporting creative, technical, and educational use cases requiring precise captions.

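As a concrete example of the dataset-creation use case, the sketch below captions a folder of images and writes one JSONL record per image. It follows the Quick Start pipeline; the directory, glob pattern, prompt, and output filename are placeholders, and images are processed one at a time for simplicity rather than throughput.

```python
import json
from pathlib import Path

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and processor as in the Quick Start example.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/DeepAttriCap-VLA-3B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/DeepAttriCap-VLA-3B")

# Placeholder paths: point image_dir at your own images.
image_dir = Path("./images")
out_path = Path("captions.jsonl")

with out_path.open("w", encoding="utf-8") as f:
    for image_path in sorted(image_dir.glob("*.jpg")):
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": str(image_path)},
                    {"type": "text", "text": "Provide an attribute-rich caption for this image."},
                ],
            }
        ]
        text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        image_inputs, video_inputs = process_vision_info(messages)
        inputs = processor(
            text=[text], images=image_inputs, videos=video_inputs,
            padding=True, return_tensors="pt",
        ).to("cuda")

        # Generate and keep only the newly generated caption tokens.
        generated_ids = model.generate(**inputs, max_new_tokens=128)
        trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
        caption = processor.batch_decode(trimmed, skip_special_tokens=True)[0]

        # One JSON record per image: {"image": ..., "caption": ...}
        f.write(json.dumps({"image": image_path.name, "caption": caption}) + "\n")
```
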
# Limitations

* May produce variable levels of granularity depending on image complexity.
* Not optimized for highly censored or safety-critical deployments.
* Might over-attribute or hallucinate properties in ambiguous or abstract visuals.