nielsr (HF staff) committed
Commit c8795aa · verified · 1 Parent(s): 9195047

Add model card

This PR adds a model card with essential metadata for improved discoverability. The card includes the pipeline tag, library name, license, and links to the paper, code, and model zoo. It also opens with a concise model description so readers can quickly understand what the model does.

Files changed (1)
  1. README.md +50 -2
README.md CHANGED
@@ -1,3 +1,51 @@
- This repository contains the model of the paper [ViSpeak: Visual Instruction Feedback in Streaming Videos](https://arxiv.org/abs/2503.12769).
-
- Code: https://github.com/HumanMLLM/ViSpeak
+ ---
+ library_name: transformers
+ license: apache-2.0
+ pipeline_tag: video-text-to-text
+ ---
+
+ # ViSpeak: Visual Instruction Feedback in Streaming Videos
+
+ ViSpeak is a state-of-the-art streaming video understanding Large Multi-modal Model (LMM) that achieves GPT-4o-level performance on various benchmarks. It is designed for the novel task of Visual Instruction Feedback, in which the model responds to visual instructions that appear in streaming videos.
+
+ [ViSpeak Paper](https://arxiv.org/abs/2503.12769) | [ViSpeak-Bench](https://github.com/HumanMLLM/ViSpeak-Bench) | [ViSpeak Models](https://huggingface.co/fushh7) | [ViSpeak-Instruct Dataset](https://huggingface.co/datasets/fushh7/ViSpeak-Instruct) (Coming Soon) | [ViSpeak-Bench Dataset](https://huggingface.co/datasets/fushh7/ViSpeak-Bench) (Coming Soon)
+
+ <p align="center">
+ <img src="./asset/task_example.jpg" width="100%" height="100%">
+ </p>
+
+ ## Model Zoo
+
+ - Our checkpoints are available on Hugging Face and ModelScope:
+
+ | Model | Hugging Face Link | ModelScope Link | Size |
+ | :--------: | :-------------------------------------------------------: | :-----------------------------------------------------------: | :--: |
+ | ViSpeak-s2 | [Hugging Face](https://huggingface.co/fushh7/ViSpeak-s2) | [ModelScope](https://modelscope.cn/models/fushh7/ViSpeak-s2) | 7B |
+ | ViSpeak-s3 | [Hugging Face](https://huggingface.co/fushh7/ViSpeak-s3) | [ModelScope](https://modelscope.cn/models/fushh7/ViSpeak-s3) | 7B |
+
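+ As a convenience, here is a minimal sketch (not part of the official instructions) of pulling a checkpoint from the Hugging Face Hub with `huggingface_hub`; the local directory is an arbitrary example:
+
+ ```python
+ # Sketch: fetch a ViSpeak checkpoint from the Hugging Face Hub.
+ # Requires `pip install huggingface_hub`; local_dir is an arbitrary choice.
+ from huggingface_hub import snapshot_download
+
+ ckpt_dir = snapshot_download(
+     repo_id="fushh7/ViSpeak-s3",           # or "fushh7/ViSpeak-s2"
+     local_dir="./checkpoints/ViSpeak-s3",  # any local path works
+ )
+ print(f"Checkpoint downloaded to: {ckpt_dir}")
+ ```
+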
+ - Our model is built upon [VITA 1.5](https://github.com/VITA-MLLM/VITA). To use the model, users should:
+   - Download the [audio encoder](https://huggingface.co/VITA-MLLM/VITA-1.5) and the [visual encoder](https://huggingface.co/OpenGVLab/InternViT-300M-448px)
+   - Modify the `config.json` file so that it points to the downloaded encoders (see the sketch below)
+
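+ A minimal sketch of that setup, assuming the encoders are fetched with `huggingface_hub` and that `config.json` stores the encoder locations under path-valued keys; the key names used below are placeholders, so check them against the shipped `config.json`:
+
+ ```python
+ # Sketch: download the two encoders and point the ViSpeak config at them.
+ # The keys "mm_audio_encoder" and "mm_vision_tower" are assumptions, not
+ # confirmed names; use whatever path-valued keys your config.json contains.
+ import json
+ from huggingface_hub import snapshot_download
+
+ # The card's audio-encoder link points at the VITA-1.5 repo, which ships the encoder.
+ audio_dir = snapshot_download("VITA-MLLM/VITA-1.5", local_dir="./encoders/VITA-1.5")
+ vision_dir = snapshot_download("OpenGVLab/InternViT-300M-448px", local_dir="./encoders/InternViT-300M-448px")
+
+ cfg_path = "./checkpoints/ViSpeak-s3/config.json"  # checkpoint path from the step above
+ with open(cfg_path) as f:
+     cfg = json.load(f)
+
+ cfg["mm_audio_encoder"] = audio_dir   # placeholder key name
+ cfg["mm_vision_tower"] = vision_dir   # placeholder key name
+
+ with open(cfg_path, "w") as f:
+     json.dump(cfg, f, indent=2)
+ ```
+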
+ ## Demo
+
+ ```bash
+ python demo.py --model_path /your/path/to/ViSpeak-s3/ --video demo.mp4
+ ```
+
+ - Example Output:
+
+ ```
+ Welcome! I'm ready to help you with any issue you might have. 3.9792746113989637
+ ```
+
+ **(The rest of the content from the original README follows here, including Experimental Results, Evaluating on MLLM Benchmarks, Training, Citation, Statement, Related Works, and Acknowledgement.)**
+
+ ## 📜 License
+
+ - Our models and code are under the Apache License 2.0.
+ - Our self-collected videos are under the [**CC BY-NC-SA 4.0**](https://creativecommons.org/licenses/by-nc-nd/4.0/) license.