update README.md

Files changed:
- .gitattributes +2 -0
- README.md +69 -0
- assets/performance_of_tarsier2.png +3 -0
- assets/tarsier2_training_dataset.png +3 -0
.gitattributes (CHANGED)

```diff
@@ -44,3 +44,5 @@ config.json filter=lfs diff=lfs merge=lfs -text
 model.safetensors.index.json filter=lfs diff=lfs merge=lfs -text
 processor_config.json filter=lfs diff=lfs merge=lfs -text
 vocab.json filter=lfs diff=lfs merge=lfs -text
+assets/performance_of_tarsier2.png filter=lfs diff=lfs merge=lfs -text
+assets/tarsier2_training_dataset.png filter=lfs diff=lfs merge=lfs -text
```
README.md (CHANGED)

@@ -1,3 +1,72 @@
---
license: apache-2.0
tags:
- video LLM
---


# Tarsier Model Card
## Introduction
We propose Tarsier2-7B(-0115) as the latest member of the Tarsier series. Tarsier2-7B sets new state-of-the-art results across 16 public video understanding benchmarks, spanning tasks such as video captioning, video question-answering, video grounding, and hallucination tests. On the Tarsier series' signature task, detailed video description, Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in both automatic metrics and human evaluation.

Compared to [Tarsier-7B](https://huggingface.co/omni-research/Tarsier-7b), Tarsier2-7B is comprehensively upgraded in both its base model ([Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)) and its **training data & stages**:
- Pre-train: We scale up the training data to 40M video-text pairs, increasing both volume and diversity.
- SFT: Fine-grained temporal alignment is performed during supervised fine-tuning.
- DPO: We use model-based sampling to automatically construct preference data and apply DPO training for optimization (a minimal sketch of the DPO objective follows this list).

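The DPO stage above is described only at a high level; purely as a non-authoritative illustration of what training on automatically constructed (chosen, rejected) preference pairs involves, the sketch below implements the standard DPO objective in PyTorch. The function and tensor names are illustrative and are not taken from the Tarsier2 codebase.

```python
# Minimal sketch of the standard DPO objective on preference pairs.
# NOT the Tarsier2 training code; it only illustrates pushing the policy to
# prefer the "chosen" caption over the "rejected" one relative to a frozen
# reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Inputs are summed log-probabilities of each response under the
    trainable policy and the frozen reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # A larger margin between chosen and rejected log-ratios lowers the loss.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up log-probabilities for a batch of two pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -15.0]),
    policy_rejected_logps=torch.tensor([-14.0, -15.5]),
    ref_chosen_logps=torch.tensor([-13.0, -15.2]),
    ref_rejected_logps=torch.tensor([-13.5, -15.4]),
)
print(loss.item())
```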
## Model details
- Base Model: [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
- Training Data:
  - Pre-train: Over 40M samples mixing video, image, and text data, with 20.4M open-source and 19.8M in-house, detailed as follows:
<div align="center">
<img src="assets/tarsier2_training_dataset.png" width="100%">
<br>Figure 1: Summary of datasets used in the pre-training stage of Tarsier2.
</div>
  - Post-train: 150K human-annotated detailed video descriptions for SFT and 20K automatically sampled and filtered preference pairs for DPO.

**Model date:**
Tarsier2-7B was trained in December 2024.

**Paper or resources for more information:**
- Online demo: https://huggingface.co/spaces/omni-research/Tarsier2-7b
- GitHub repo: https://github.com/bytedance/tarsier/tree/tarsier2
- Paper: https://arxiv.org/abs/2501.07888
- Leaderboard: https://tarsier-vlm.github.io/

## Performance
Tarsier2-7B excels in various video understanding tasks, including video captioning, video question-answering, video grounding, and hallucination tests.
<div align="center">
<img src="assets/performance_of_tarsier2.png" width="100%">
<br>Figure 2: Performance comparison of Tarsier2 with previous SOTA models at 7B-scale and GPT-4o.
</div>

## License
Tarsier2-7B is released under the same license as Qwen/Qwen2-VL-7B-Instruct (Apache 2.0).

## Intended use
**Primary intended uses:**
The primary use of Tarsier is research on large multimodal models, especially video description.

**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

## How to Use
See the usage instructions at https://github.com/bytedance/tarsier?tab=readme-ov-file#usage. A rough loading sketch is included below for orientation.

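The repository linked above documents the supported inference path. Purely as a hedged sketch of what loading a Qwen2-VL-based checkpoint typically looks like with Hugging Face `transformers`, the snippet below assumes the checkpoint id `omni-research/Tarsier2-7b`, the `Qwen2VLForConditionalGeneration`/`AutoProcessor` classes, and the `qwen_vl_utils` helper from the Qwen2-VL examples; none of these are confirmed by this card, so prefer the official instructions.

```python
# Hedged sketch only: assumes Tarsier2-7B loads like its Qwen2-VL base model.
# The checkpoint id and preprocessing helper below are assumptions, not the
# officially documented Tarsier usage (see the GitHub link above).
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper used in Qwen2-VL examples

model_id = "omni-research/Tarsier2-7b"  # assumed repository id
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Qwen2-VL-style chat message: one video plus a detailed-description prompt.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4"},
        {"type": "text", "text": "Describe the video in detail."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens before decoding the generated description.
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```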

**Where to send questions or comments about the model:**
https://github.com/bytedance/tarsier/issues

## Citation
If you find our work helpful, feel free to cite us as:
```BibTeX
@misc{yuan2025tarsier2advancinglargevisionlanguage,
      title={Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding},
      author={Liping Yuan and Jiawei Wang and Haomiao Sun and Yuchen Zhang and Yuan Lin},
      year={2025},
      eprint={2501.07888},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.07888},
}
```

assets/performance_of_tarsier2.png (ADDED, stored via Git LFS)

assets/tarsier2_training_dataset.png (ADDED, stored via Git LFS)