0nejiawei committed
Commit 5c7766b · 1 parent: 9ef5051

update README.md

.gitattributes CHANGED
@@ -44,3 +44,5 @@ config.json filter=lfs diff=lfs merge=lfs -text
  model.safetensors.index.json filter=lfs diff=lfs merge=lfs -text
  processor_config.json filter=lfs diff=lfs merge=lfs -text
  vocab.json filter=lfs diff=lfs merge=lfs -text
+ assets/performance_of_tarsier2.png filter=lfs diff=lfs merge=lfs -text
+ assets/tarsier2_training_dataset.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,72 @@
  ---
  license: apache-2.0
+ tags:
+ - video LLM
  ---
+
+
+ # Tarsier Model Card
+ ## Introduction
+ We propose Tarsier2-7B(-0115) as the latest member of the Tarsier series. Tarsier2-7B sets new state-of-the-art results across 16 public video understanding benchmarks, spanning tasks such as video captioning, video question answering, video grounding, and hallucination tests. On the Tarsier series' signature task, detailed video description, Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in both automatic metrics and human evaluation.
+
+ Compared to [Tarsier-7B](https://huggingface.co/omni-research/Tarsier-7b), Tarsier2-7B is comprehensively upgraded in its base model ([Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)) and its **training data & stages**:
+ - Pre-train: We scale the training data up to 40M video-text pairs, expanding both volume and diversity.
+ - SFT: Fine-grained temporal alignment is performed during supervised fine-tuning.
+ - DPO: Preference data is automatically constructed via model-based sampling and used for DPO training.
+
+ ## Model details
+ - Base Model: [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
+ - Training Data:
+   - Pre-train: Over 40M samples mixing video, image and text data, with 20.4M open-source and 19.8M in-house, detailed as follows:
+   <div align="center">
+   <img src="assets/tarsier2_training_dataset.png" width="100%">
+   <br>Figure 1: Summary of datasets used in the pre-training stage of Tarsier2.
+   </div>
+   - Post-train: 150K human-annotated detailed video descriptions for SFT and 20K automatically sampled and filtered preference pairs for DPO.
+
+ **Model date:**
+ Tarsier2-Recap-7b was trained in December 2024.
+
+ **Paper or resources for more information:**
+ - Online demo: https://huggingface.co/spaces/omni-research/Tarsier2-7b
+ - GitHub repo: https://github.com/bytedance/tarsier/tree/tarsier2
+ - Paper: https://arxiv.org/abs/2501.07888
+ - Leaderboard: https://tarsier-vlm.github.io/
+
+ ## Performance
+ Tarsier2-7B excels in a wide range of video understanding tasks, including video captioning, video question answering, video grounding, and hallucination tests.
+ <div align="center">
+ <img src="assets/performance_of_tarsier2.png" width="100%">
+ <br>Figure 2: Performance comparison of Tarsier2 with previous SOTA models at the 7B scale and GPT-4o.
+ </div>
+
+ ## License
+ This model is released under the license of Qwen/Qwen2-VL-7B-Instruct (Apache 2.0).
+
+ ## Intended use
+ **Primary intended uses:**
+ The primary use of Tarsier is research on large multimodal models, especially video description.
+
+ **Primary intended users:**
+ The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
+
+ ## How to Use
+ See the usage guide at https://github.com/bytedance/tarsier?tab=readme-ov-file#usage. A minimal sketch is shown below.
+
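+ As an unofficial quick-start, the sketch below loads the model through the standard Qwen2-VL interface in Hugging Face `transformers`, which is an assumption based on the Qwen2-VL-7B-Instruct base; the repo id, video path, and prompt are illustrative, and the GitHub usage guide remains the authoritative reference.
+ ```python
+ # Hedged sketch: assumes Tarsier2-7B loads via the Qwen2-VL classes in
+ # transformers (its base model is Qwen2-VL-7B-Instruct). The repo id,
+ # video path and prompt are illustrative placeholders.
+ import torch
+ from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
+ from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils
+
+ model_id = "omni-research/Tarsier2-7b"  # assumed repo id
+ model = Qwen2VLForConditionalGeneration.from_pretrained(
+     model_id, torch_dtype=torch.bfloat16, device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # Chat-style request asking for a detailed video description.
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "video", "video": "file:///path/to/video.mp4"},
+         {"type": "text", "text": "Describe the video in detail."},
+     ],
+ }]
+
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text], images=image_inputs, videos=video_inputs,
+     padding=True, return_tensors="pt",
+ ).to(model.device)
+
+ output_ids = model.generate(**inputs, max_new_tokens=512)
+ # Drop the prompt tokens before decoding the generated description.
+ trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
+ print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
+ ```
+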
+ **Where to send questions or comments about the model:**
+ https://github.com/bytedance/tarsier/issues
+
60
+ ## Citation
61
+ If you find our work helpful, feel free to cite us as:
62
+ ```BibTeX
63
+ @misc{yuan2025tarsier2advancinglargevisionlanguage,
64
+ title={Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding},
65
+ author={Liping Yuan and Jiawei Wang and Haomiao Sun and Yuchen Zhang and Yuan Lin},
66
+ year={2025},
67
+ eprint={2501.07888},
68
+ archivePrefix={arXiv},
69
+ primaryClass={cs.CV},
70
+ url={https://arxiv.org/abs/2501.07888},
71
+ }
72
+ ```
assets/performance_of_tarsier2.png ADDED

Git LFS Details

  • SHA256: 7a62fb8ef75c1fc3fe58442fed54ba33f7feca90701be820acbe15cf347edd9f
  • Pointer size: 131 Bytes
  • Size of remote file: 758 kB
assets/tarsier2_training_dataset.png ADDED

Git LFS Details

  • SHA256: 830fb46e82a53b179de5d7035c4b6d1e125cf4f21087da5f0a9ebf2dea669641
  • Pointer size: 131 Bytes
  • Size of remote file: 744 kB