---
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: slimm
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

# Model Card for CoMP-MM-1B

<!-- Provide a quick summary of what the model is/does. -->
CoMP-MM-1B is a large multimodal model (LMM) that supports **native image resolution inputs**, composed of [CoMP-SigLIP](https://huggingface.co/SliMM-X/CoMP-SigLIP-So400M) and [Qwen2.5](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct).

## Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/SliMM-X/CoMP-MM
- **Paper:** https://arxiv.org/abs/2503.18931
- **Project Page:** https://slimm-x.github.io/comp

## How to Get Started with the Model

Install the [GitHub repository](https://github.com/SliMM-X/CoMP-MM), then use the code below to get started with the model.

```python
# this is very similar to qwen2-vl
from slimm.model.processor import SliMMQwen2VLProcessor
from slimm.model.slimm import SliMMForConditionalGeneration
from slimm.model.utils_vl import process_vision_info

model_path = "SliMM-X/CoMP-MM-1B"

model = SliMMForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="cuda"
)
processor = SliMMQwen2VLProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://slimm-x.github.io/comp/figs/teaser.png",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
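
Because the processor pads the text inputs (`padding=True`) and `batch_decode` already operates on a batch, the same pipeline can handle several conversations in one call. The snippet below is a minimal batched-inference sketch that continues from the example above; it assumes `process_vision_info` accepts a list of conversations the way its Qwen2-VL counterpart does, and the second prompt is only an illustrative placeholder.

```python
# Batched inference: a minimal sketch, assuming process_vision_info accepts a
# list of conversations (as in Qwen2-VL). The prompts below are examples.
messages_batch = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://slimm-x.github.io/comp/figs/teaser.png"},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://slimm-x.github.io/comp/figs/teaser.png"},
                {"type": "text", "text": "List the main components shown in this figure."},
            ],
        }
    ],
]

texts = [
    processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
    for m in messages_batch
]
image_inputs, video_inputs = process_vision_info(messages_batch)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
))
```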

## Citation

**BibTeX:**

```bibtex
@article{comp2025,
  title={CoMP: Continual Multimodal Pre-training for Vision Foundation Models},
  author={Chen, Yitong and Meng, Lingchen and Peng, Wujian and Wu, Zuxuan and Jiang, Yu-Gang},
  year={2025},
  journal={arXiv preprint arXiv:2503.18931},
}
```