Zhuoning committed on
Commit 3046045 · verified · 1 Parent(s): 2f56928

Update README.md

Files changed (1)
  1. README.md +37 -12
README.md CHANGED
@@ -1,11 +1,34 @@
  # 🎯 General Video Embedder (GVE)

  > **One Embedder for All Video Retrieval Scenarios**
- > Queries of text, image, video, or any combination modalities — GVE understands them all for representations, **zero-shot**, **without in-domain training**.

  GVE is the first video embedding model that **generalizes across 9 abilities, including 3 diverse retrieval tasks and 6 domains** — from coarse text-to-video to fine-grained spatial/temporal queries, composed (text+image) queries, and long-context retrieval — all evaluated on our new **Universal Video Retrieval Benchmark (UVRB)**.

- Built on **Qwen2.5-VL** and trained only with **LoRA** with 13M collected and synthesized multimodal data, GVE achieves **SOTA zero-shot performance** than competitors.

  ---

@@ -60,11 +83,9 @@ Built on **Qwen2.5-VL** and trained only with **LoRA** with 13M collected and sy
  1. Loading model

  ```python
- from transformers import AutoModel, AutoProcessor
-
- model_path = '.'
- model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map='auto', low_cpu_mem_usage=True, torch_dtype='bfloat16')
- processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, add_eos_token=True)
  processor.tokenizer.padding_side = 'left'
  ```

@@ -111,9 +132,13 @@ embedding = F.normalize(outputs['last_hidden_state'][:, -1, :], p=2, dim=1)
  ## 📚 Citation

  ```bibtex
- @inproceedings{guo2025general-video-embedding,
- title={Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum},
- author={Zhuoning Guo, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie and Xiaowen Chu},
- year={2025}
  }
- ```

+ ---
+ language: en
+ license: apache-2.0
+ library_name: transformers
+ tags:
+ - pytorch
+ - video
+ - retrieval
+ - embedding
+ - multimodal
+ - qwen2.5-vl
+ pipeline_tag: sentence-similarity
+ datasets:
+ - Alibaba-NLP/UVRB
+ - Vividbot/vast-2m-vi
+ - TempoFunk/webvid-10M
+ - OpenGVLab/InternVid
+ metrics:
+ - recall
+ base_model:
+ - Qwen/Qwen2.5-VL-7B-Instruct
+ ---
+
  # 🎯 General Video Embedder (GVE)

  > **One Embedder for All Video Retrieval Scenarios**
+ > Queries of text, image, video, or any combination of modalities — GVE represents them all, zero-shot, without in-domain training.

  GVE is the first video embedding model that **generalizes across 9 abilities, including 3 diverse retrieval tasks and 6 domains** — from coarse text-to-video to fine-grained spatial/temporal queries, composed (text+image) queries, and long-context retrieval — all evaluated on our new **Universal Video Retrieval Benchmark (UVRB)**.

+ Built on **Qwen2.5-VL** and trained only with LoRA on **13M** collected and synthesized multimodal data, GVE achieves **SOTA zero-shot performance**, outperforming competitors.

  ---

  1. Loading model

  ```python
+ import torch
+ from transformers import AutoModel, AutoProcessor
+
+ model_path = 'Alibaba-NLP/GVE-7B'
+ model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map='auto', low_cpu_mem_usage=True, torch_dtype=torch.bfloat16)
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, use_fast=True)
  processor.tokenizer.padding_side = 'left'
  ```

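The README later pools embeddings with `embedding = F.normalize(outputs['last_hidden_state'][:, -1, :], p=2, dim=1)` (quoted in the citation hunk's context line). A minimal NumPy sketch of that last-token pooling plus L2 normalization on dummy data; the array shapes here are hypothetical stand-ins for the model's real output:

```python
import numpy as np

# Dummy batch of hidden states: (batch=2, seq_len=5, hidden=8). Shapes are
# hypothetical; the real model returns outputs['last_hidden_state'].
last_hidden_state = np.random.rand(2, 5, 8).astype(np.float32)

# With padding_side='left', the final sequence position holds the last real
# token, so last-token pooling is a simple index at -1.
pooled = last_hidden_state[:, -1, :]  # shape (2, 8)

# L2-normalize each row, the NumPy equivalent of F.normalize(..., p=2, dim=1).
embedding = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```

Because each row is unit-norm, query-to-video cosine similarity then reduces to a plain dot product.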
 
  ## 📚 Citation

  ```bibtex
+ @misc{guo2025gve,
+ title={Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum},
+ author={Zhuoning Guo and Mingxin Li and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Xiaowen Chu},
+ year={2025},
+ eprint={2510.27571},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV},
+ url={https://arxiv.org/abs/2510.27571},
  }
+ ```