Add model card
This PR adds a model card with essential metadata for improved discoverability. The card includes the pipeline tag, library name, license, and links to the paper, code, and model zoo. It also provides a concise model description at the beginning for quick understanding.
README.md
CHANGED
@@ -1,3 +1,51 @@
---
library_name: transformers
license: apache-2.0
pipeline_tag: video-text-to-text
---

# ViSpeak: Visual Instruction Feedback in Streaming Videos

ViSpeak is a state-of-the-art streaming video understanding Large Multi-modal Model (LMM) that achieves GPT-4o-level performance on a range of benchmarks. It is designed for the novel task of Visual Instruction Feedback, which requires the model to respond to visual instructions that appear within streaming videos.

[ViSpeak Paper](https://arxiv.org/abs/2503.12769) | [ViSpeak-Bench](https://github.com/HumanMLLM/ViSpeak-Bench) | [ViSpeak Models](https://huggingface.co/fushh7) | [ViSpeak-Instruct Dataset](https://huggingface.co/datasets/fushh7/ViSpeak-Instruct) (Coming Soon) | [ViSpeak-Bench Dataset](https://huggingface.co/datasets/fushh7/ViSpeak-Bench) (Coming Soon)

<p align="center">
  <img src="./asset/task_example.jpg" width="100%" height="100%">
</p>

## Model Zoo

- Our checkpoints are available on Hugging Face and ModelScope:

| Model | HuggingFace Link | ModelScope Link | Size |
| :--------: | :----------------------------------------------------------: | :---------------------------------------------------------------------------: | :--: |
| ViSpeak-s2 | [Huggingface](https://huggingface.co/fushh7/ViSpeak-s2) | [ModelScope](https://modelscope.cn/models/fushh7/ViSpeak-s2) | 7B |
| ViSpeak-s3 | [Huggingface](https://huggingface.co/fushh7/ViSpeak-s3) | [ModelScope](https://modelscope.cn/models/fushh7/ViSpeak-s3) | 7B |
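
To fetch a checkpoint locally, one option is `huggingface_hub`. This is a minimal sketch; the `local_dir` path is just an example, not a layout the repo prescribes:

```python
# Minimal sketch: download a ViSpeak checkpoint from the Hugging Face Hub.
# Requires `pip install huggingface_hub`; the local_dir below is an arbitrary example path.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="fushh7/ViSpeak-s3",           # or "fushh7/ViSpeak-s2"
    local_dir="./checkpoints/ViSpeak-s3",  # example destination
)
print(f"Checkpoint downloaded to {local_path}")
```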

- Our model is built upon [VITA 1.5](https://github.com/VITA-MLLM/VITA). To use the model, users should:
  - download the [audio encoder](https://huggingface.co/VITA-MLLM/VITA-1.5) and the [visual encoder](https://huggingface.co/OpenGVLab/InternViT-300M-448px), and
  - modify the `config.json` file so that it points to the downloaded encoders (see the sketch below).

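The exact keys in `config.json` depend on the VITA 1.5 codebase; the snippet below is only a hedged sketch of the kind of edit involved, and the key names `mm_vision_tower` and `mm_audio_encoder` are assumptions to verify against your checkpoint's actual `config.json`:

```python
# Hedged sketch: point the ViSpeak config at the locally downloaded encoders.
# The keys "mm_vision_tower" and "mm_audio_encoder" are assumed from the VITA 1.5
# codebase; check your checkpoint's config.json for the actual key names.
import json
from pathlib import Path

config_path = Path("./checkpoints/ViSpeak-s3/config.json")  # example path
config = json.loads(config_path.read_text())

config["mm_vision_tower"] = "/your/path/to/InternViT-300M-448px"
config["mm_audio_encoder"] = "/your/path/to/VITA-1.5-audio-encoder"

config_path.write_text(json.dumps(config, indent=2))
```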

## Demo

```bash
python demo.py --model_path /your/path/to/ViSpeak-s3/ --video demo.mp4
```

- Example Output:

```
Welcome! I'm ready to help you with any issue you might have. 3.9792746113989637
```

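For programmatic use outside `demo.py`, a checkpoint tagged with `library_name: transformers` can typically be instantiated through the `Auto*` classes with `trust_remote_code=True`. The sketch below rests on that assumption; `demo.py` remains the documented entry point, and the actual inference interface (processors, streaming loop) may differ:

```python
# Hedged sketch: instantiate the checkpoint via transformers, assuming the repo ships
# custom modeling code loadable with trust_remote_code. This only builds the model;
# see demo.py for the documented streaming-inference pipeline.
from transformers import AutoConfig, AutoModel

model_path = "/your/path/to/ViSpeak-s3/"  # same path as in the demo command
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    config=config,
    trust_remote_code=True,
    device_map="auto",  # requires `accelerate`; drop to load on a single device
)
model.eval()
```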

**(The rest of the content from the original README follows here, including Experimental Results, Evaluating on MLLM Benchmarks, Training, Citation, Statement, Related Works, and Acknowledgement.)**

## 📜 License

- Our models and code are under the Apache License 2.0.
- Our self-collected videos are under the [**CC BY-NC-SA 4.0**](https://creativecommons.org/licenses/by-nc-nd/4.0/) license.