lixin4ever committed
Commit 4c84984 · verified · 1 Parent(s): c7e14fd

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -39,6 +39,7 @@ tags:
 
 
  ## 🌎 Model Zoo
+ ### Vision-Only Checkpoints
  | Model Name | Type | Visual Encoder | Language Decoder | # Training Frames |
  |:-------------------|:--------------:|:----------------|:------------------|:----------------------:|
  | [VideoLLaMA2-7B-Base](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-7B-Base) | Base | [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) | [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) | 8 |
@@ -52,6 +53,7 @@ tags:
  | [VideoLLaMA2.1-7B-16F-Base](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2.1-7B-16F-Base) | Base | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |
  | [VideoLLaMA2.1-7B-16F](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2.1-7B-16F) | Chat | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |
 
+ ### Audio-Visual Checkpoints
  | Model Name | Type | Visual Encoder | Audio Encoder | Language Decoder | # Training Frames |
  |:-------------------|:--------------:|:----------------|:----------------|:------------------|:----------------------:|
  | [VideoLLaMA2.1-7B-AV](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2.1-7B-AV) (**This Checkpoint**) | Chat | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Fine-tuned BEATs_iter3+(AS2M)(cpt2)](https://1drv.ms/u/s!AqeByhGUtINrgcpj8ujXH1YUtxooEg?e=E9Ncea) | [VideoLLaMA2.1-7B-16F](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2.1-7B-16F) | 8 |
@@ -62,8 +64,6 @@ tags:
  ### Multi-Choice Video QA & Video Captioning
  <p><img src="https://cdn-uploads.huggingface.co/production/uploads/63913b120cf6b11c487ca31d/Z81Dl2MeVlg8wLbYOyTvI.png" width="800" "/></p>
 
-
-
  ### Open-Ended Video QA
  <p><img src="https://cdn-uploads.huggingface.co/production/uploads/63913b120cf6b11c487ca31d/UoAr7SjbPSPe1z23HBsUh.png" width="800" "/></p>
 
@@ -75,7 +75,7 @@ tags:
 
 
 
- ## 🤖 Inference with VideoLLaMA2
+ ## 🤖 Inference with VideoLLaMA2-AV
  ```python
  import sys
  sys.path.append('./')