nielsr (HF staff) committed
Commit c8795aa · verified · 1 Parent(s): 9195047

Add model card

This PR adds a model card with essential metadata for improved discoverability. The card includes the pipeline tag, library name, license, and links to the paper, code, and model zoo. It also opens with a concise model description so readers can quickly understand what the model does.

Files changed (1)
  1. README.md +50 -2
README.md CHANGED
@@ -1,3 +1,51 @@
- This repository contains the model of the paper [ViSpeak: Visual Instruction Feedback in Streaming Videos](https://arxiv.org/abs/2503.12769).
-
- Code: https://github.com/HumanMLLM/ViSpeak
+ ---
+ library_name: transformers
+ license: apache-2.0
+ pipeline_tag: video-text-to-text
+ ---
+
+ # ViSpeak: Visual Instruction Feedback in Streaming Videos
+
+ ViSpeak is a state-of-the-art streaming video understanding Large Multi-modal Model (LMM) that achieves GPT-4o-level performance on various benchmarks. It is designed for the novel task of Visual Instruction Feedback, in which the model responds to visual instructions that appear in streaming videos.
+
+ [ViSpeak Paper](https://arxiv.org/abs/2503.12769) | [ViSpeak-Bench](https://github.com/HumanMLLM/ViSpeak-Bench) | [ViSpeak Models](https://huggingface.co/fushh7) | [ViSpeak-Instruct Dataset](https://huggingface.co/datasets/fushh7/ViSpeak-Instruct) (Coming Soon) | [ViSpeak-Bench Dataset](https://huggingface.co/datasets/fushh7/ViSpeak-Bench) (Coming Soon)
+
+ <p align="center">
+ <img src="./asset/task_example.jpg" width="100%" height="100%">
+ </p>
+
+ ## Model Zoo
+
+ - Our checkpoints are available on Hugging Face and ModelScope:
+
+ | Model | Hugging Face Link | ModelScope Link | Size |
+ | :--------: | :-------------------------------------------------------: | :-----------------------------------------------------------: | :--: |
+ | ViSpeak-s2 | [Hugging Face](https://huggingface.co/fushh7/ViSpeak-s2) | [ModelScope](https://modelscope.cn/models/fushh7/ViSpeak-s2) | 7B |
+ | ViSpeak-s3 | [Hugging Face](https://huggingface.co/fushh7/ViSpeak-s3) | [ModelScope](https://modelscope.cn/models/fushh7/ViSpeak-s3) | 7B |
+
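+ As a convenience, here is a minimal sketch (not part of the official instructions) of pulling a checkpoint from the Hugging Face Hub with `huggingface_hub`; the local directory is an arbitrary example:
+
+ ```python
+ # Sketch: fetch a ViSpeak checkpoint from the Hugging Face Hub.
+ # Requires `pip install huggingface_hub`; local_dir is an arbitrary choice.
+ from huggingface_hub import snapshot_download
+
+ ckpt_dir = snapshot_download(
+     repo_id="fushh7/ViSpeak-s3",           # or "fushh7/ViSpeak-s2"
+     local_dir="./checkpoints/ViSpeak-s3",  # any local path works
+ )
+ print(f"Checkpoint downloaded to: {ckpt_dir}")
+ ```
+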
+ - Our model is built upon [VITA 1.5](https://github.com/VITA-MLLM/VITA). To use the model, users should:
+   - Download the [audio encoder](https://huggingface.co/VITA-MLLM/VITA-1.5) and the [visual encoder](https://huggingface.co/OpenGVLab/InternViT-300M-448px)
+   - Modify the `config.json` file so that it points to the downloaded encoders (see the sketch below)
+
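+ A minimal sketch of that setup, assuming the encoders are fetched with `huggingface_hub` and that `config.json` stores the encoder locations under path-valued keys; the key names used below are placeholders, so check them against the shipped `config.json`:
+
+ ```python
+ # Sketch: download the two encoders and point the ViSpeak config at them.
+ # The keys "mm_audio_encoder" and "mm_vision_tower" are assumptions, not
+ # confirmed names; use whatever path-valued keys your config.json contains.
+ import json
+ from huggingface_hub import snapshot_download
+
+ # The card's audio-encoder link points at the VITA-1.5 repo, which ships the encoder.
+ audio_dir = snapshot_download("VITA-MLLM/VITA-1.5", local_dir="./encoders/VITA-1.5")
+ vision_dir = snapshot_download("OpenGVLab/InternViT-300M-448px", local_dir="./encoders/InternViT-300M-448px")
+
+ cfg_path = "./checkpoints/ViSpeak-s3/config.json"  # checkpoint path from the step above
+ with open(cfg_path) as f:
+     cfg = json.load(f)
+
+ cfg["mm_audio_encoder"] = audio_dir   # placeholder key name
+ cfg["mm_vision_tower"] = vision_dir   # placeholder key name
+
+ with open(cfg_path, "w") as f:
+     json.dump(cfg, f, indent=2)
+ ```
+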
+ ## Demo
+
+ ```bash
+ python demo.py --model_path /your/path/to/ViSpeak-s3/ --video demo.mp4
+ ```
+
+ - Example Output:
+
+ ```
+ Welcome! I'm ready to help you with any issue you might have. 3.9792746113989637
+ ```
+
+ **(The rest of the content from the original README follows here, including Experimental Results, Evaluating on MLLM Benchmarks, Training, Citation, Statement, Related Works, and Acknowledgement.)**
+
+ ## 📜 License
+
+ - Our models and code are under the Apache License 2.0.
+ - Our self-collected videos are under the [**CC BY-NC-SA 4.0**](https://creativecommons.org/licenses/by-nc-nd/4.0/) license.