lixin4ever
committed on
Update README.md
README.md
@@ -39,6 +39,7 @@ tags:
 
 
 ## 🌎 Model Zoo
+### Vision-Only Checkpoints
 | Model Name | Type | Visual Encoder | Language Decoder | # Training Frames |
 |:-------------------|:--------------:|:----------------|:------------------|:----------------------:|
 | [VideoLLaMA2-7B-Base](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-7B-Base) | Base | [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) | [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) | 8 |
@@ -52,6 +53,7 @@ tags:
 | [VideoLLaMA2.1-7B-16F-Base](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2.1-7B-16F-Base) | Base | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |
 | [VideoLLaMA2.1-7B-16F](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2.1-7B-16F) | Chat | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |
 
+### Audio-Visual Checkpoints
 | Model Name | Type | Visual Encoder | Audio Encoder | Language Decoder | # Training Frames |
 |:-------------------|:--------------:|:----------------|:----------------|:------------------|:----------------------:|
 | [VideoLLaMA2.1-7B-AV](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2.1-7B-AV) (**This Checkpoint**) | Chat | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Fine-tuned BEATs_iter3+(AS2M)(cpt2)](https://1drv.ms/u/s!AqeByhGUtINrgcpj8ujXH1YUtxooEg?e=E9Ncea) | [VideoLLaMA2.1-7B-16F](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2.1-7B-16F) | 8 |
@@ -62,8 +64,6 @@ tags:
 ### Multi-Choice Video QA & Video Captioning
 <p><img src="https://cdn-uploads.huggingface.co/production/uploads/63913b120cf6b11c487ca31d/Z81Dl2MeVlg8wLbYOyTvI.png" width="800" "/></p>
 
-
-
 ### Open-Ended Video QA
 <p><img src="https://cdn-uploads.huggingface.co/production/uploads/63913b120cf6b11c487ca31d/UoAr7SjbPSPe1z23HBsUh.png" width="800" "/></p>
 
@@ -75,7 +75,7 @@ tags:
 
 
 
-## 🤖 Inference with VideoLLaMA2
+## 🤖 Inference with VideoLLaMA2-AV
 ```python
 import sys
 sys.path.append('./')