---
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: slimm
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

# Model Card for CoMP-MM-1B

<!-- Provide a quick summary of what the model is/does. -->
CoMP-MM-1B is a large multimodal model (LMM) that supports **native image resolution inputs**, composed of [CoMP-SigLIP](https://huggingface.co/SliMM-X/CoMP-SigLIP-So400M) and [Qwen2.5](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct).

## Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/SliMM-X/CoMP-MM
- **Paper:** https://arxiv.org/abs/2503.18931
- **Project Page:** https://slimm-x.github.io/comp

## How to Get Started with the Model

Install the [GitHub repository](https://github.com/SliMM-X/CoMP-MM), then use the code below to get started with the model.

```python
# this is very similar to qwen2-vl
from slimm.model.processor import SliMMQwen2VLProcessor
from slimm.model.slimm import SliMMForConditionalGeneration
from slimm.model.utils_vl import process_vision_info

model_path = "SliMM-X/CoMP-MM-1B"

model = SliMMForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="cuda"
)
processor = SliMMQwen2VLProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://slimm-x.github.io/comp/figs/teaser.png",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
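
Because the processor pads the text inputs (`padding=True`) and `batch_decode` already operates on a batch, the same pipeline can handle several conversations in one call. The snippet below is a minimal batched-inference sketch that continues from the example above; it assumes `process_vision_info` accepts a list of conversations the way its Qwen2-VL counterpart does, and the second prompt is only an illustrative placeholder.

```python
# Batched inference: a minimal sketch, assuming process_vision_info accepts a
# list of conversations (as in Qwen2-VL). The prompts below are examples.
messages_batch = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://slimm-x.github.io/comp/figs/teaser.png"},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://slimm-x.github.io/comp/figs/teaser.png"},
                {"type": "text", "text": "List the main components shown in this figure."},
            ],
        }
    ],
]

texts = [
    processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
    for m in messages_batch
]
image_inputs, video_inputs = process_vision_info(messages_batch)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
))
```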

## Citation

**BibTeX:**

```bibtex
@article{comp2025,
  title={CoMP: Continual Multimodal Pre-training for Vision Foundation Models},
  author={Chen, Yitong and Meng, Lingchen and Peng, Wujian and Wu, Zuxuan and Jiang, Yu-Gang},
  year={2025},
  journal={arXiv preprint arXiv:2503.18931},
}
```