Zhuoning committed on
Commit 3046045 · verified · 1 Parent(s): 2f56928

Update README.md

Files changed (1)
  1. README.md +37 -12
README.md CHANGED
@@ -1,11 +1,34 @@
  # 🎯 General Video Embedder (GVE)

  > **One Embedder for All Video Retrieval Scenarios**
- > Queries of text, image, video, or any combination modalities — GVE understands them all for representations, **zero-shot**, **without in-domain training**.

  GVE is the first video embedding model that **generalizes across 9 abilities, including 3 diverse retrieval tasks and 6 domains** — from coarse text-to-video to fine-grained spatial/temporal queries, composed (text+image) queries, and long-context retrieval — all evaluated on our new **Universal Video Retrieval Benchmark (UVRB)**.

- Built on **Qwen2.5-VL** and trained only with **LoRA** with 13M collected and synthesized multimodal data, GVE achieves **SOTA zero-shot performance** than competitors.

  ---

@@ -60,11 +83,9 @@ Built on **Qwen2.5-VL** and trained only with **LoRA** with 13M collected and sy
  1. Loading model

  ```python
- from transformers import AutoModel, AutoProcessor
-
- model_path = '.'
- model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map='auto', low_cpu_mem_usage=True, torch_dtype='bfloat16')
- processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, add_eos_token=True)
  processor.tokenizer.padding_side = 'left'
  ```

@@ -111,9 +132,13 @@ embedding = F.normalize(outputs['last_hidden_state'][:, -1, :], p=2, dim=1)
  ## 📚 Citation

  ```bibtex
- @inproceedings{guo2025general-video-embedding,
- title={Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum},
- author={Zhuoning Guo, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie and Xiaowen Chu},
- year={2025}
  }
- ```

+ ---
+ language: en
+ license: apache-2.0
+ library_name: transformers
+ tags:
+ - pytorch
+ - video
+ - retrieval
+ - embedding
+ - multimodal
+ - qwen2.5-vl
+ pipeline_tag: sentence-similarity
+ datasets:
+ - Alibaba-NLP/UVRB
+ - Vividbot/vast-2m-vi
+ - TempoFunk/webvid-10M
+ - OpenGVLab/InternVid
+ metrics:
+ - recall
+ base_model:
+ - Qwen/Qwen2.5-VL-7B-Instruct
+ ---
+
  # 🎯 General Video Embedder (GVE)

  > **One Embedder for All Video Retrieval Scenarios**
+ > Queries of text, image, video, or any combination of modalities — GVE represents them all, zero-shot, without in-domain training.

  GVE is the first video embedding model that **generalizes across 9 abilities, including 3 diverse retrieval tasks and 6 domains** — from coarse text-to-video to fine-grained spatial/temporal queries, composed (text+image) queries, and long-context retrieval — all evaluated on our new **Universal Video Retrieval Benchmark (UVRB)**.

+ Built on **Qwen2.5-VL** and trained only with LoRA on **13M** collected and synthesized multimodal data, GVE achieves **SOTA zero-shot performance**, outperforming competitors.

  ---

  1. Loading model

  ```python
+ import torch
+ from transformers import AutoModel, AutoProcessor
+
+ model_path = 'Alibaba-NLP/GVE-7B'
+ model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map='auto', low_cpu_mem_usage=True, torch_dtype=torch.bfloat16)
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, use_fast=True)
  processor.tokenizer.padding_side = 'left'
  ```

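The README later pools embeddings with `embedding = F.normalize(outputs['last_hidden_state'][:, -1, :], p=2, dim=1)` (quoted in the citation hunk's context line). A minimal NumPy sketch of that last-token pooling plus L2 normalization on dummy data; the array shapes here are hypothetical stand-ins for the model's real output:

```python
import numpy as np

# Dummy batch of hidden states: (batch=2, seq_len=5, hidden=8). Shapes are
# hypothetical; the real model returns outputs['last_hidden_state'].
last_hidden_state = np.random.rand(2, 5, 8).astype(np.float32)

# With padding_side='left', the final sequence position holds the last real
# token, so last-token pooling is a simple index at -1.
pooled = last_hidden_state[:, -1, :]  # shape (2, 8)

# L2-normalize each row, the NumPy equivalent of F.normalize(..., p=2, dim=1).
embedding = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```

Because each row is unit-norm, query-to-video cosine similarity then reduces to a plain dot product.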
 
  ## 📚 Citation

  ```bibtex
+ @misc{guo2025gve,
+ title={Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum},
+ author={Zhuoning Guo and Mingxin Li and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Xiaowen Chu},
+ year={2025},
+ eprint={2510.27571},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV},
+ url={https://arxiv.org/abs/2510.27571},
  }
+ ```