update README.md

Files changed:
- .gitattributes +2 -0
- README.md +69 -0
- assets/performance_of_tarsier2.png +3 -0
- assets/tarsier2_training_dataset.png +3 -0
.gitattributes (CHANGED)

```diff
@@ -44,3 +44,5 @@ config.json filter=lfs diff=lfs merge=lfs -text
 model.safetensors.index.json filter=lfs diff=lfs merge=lfs -text
 processor_config.json filter=lfs diff=lfs merge=lfs -text
 vocab.json filter=lfs diff=lfs merge=lfs -text
+assets/performance_of_tarsier2.png filter=lfs diff=lfs merge=lfs -text
+assets/tarsier2_training_dataset.png filter=lfs diff=lfs merge=lfs -text
```
README.md (CHANGED)

@@ -1,3 +1,72 @@
---
license: apache-2.0
tags:
- video LLM
---


# Tarsier Model Card
## Introduction
We propose Tarsier2-7B(-0115) as the latest member of the Tarsier series. Tarsier2-7B sets new state-of-the-art results across 16 public video understanding benchmarks, spanning tasks such as video captioning, video question-answering, video grounding, and hallucination tests. On the Tarsier series' signature task, detailed video description, Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in both automatic metrics and human evaluation.

Compared to [Tarsier-7B](https://huggingface.co/omni-research/Tarsier-7b), Tarsier2-7B is comprehensively upgraded in both its base model ([Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)) and its **training data & stages**:
- Pre-train: We scale up the training data to 40M video-text pairs, increasing both volume and diversity.
- SFT: Fine-grained temporal alignment is performed during supervised fine-tuning.
- DPO: We use model-based sampling to automatically construct preference data and apply DPO training for optimization (a minimal sketch of the DPO objective follows this list).

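The DPO stage above is described only at a high level; purely as a non-authoritative illustration of what training on automatically constructed (chosen, rejected) preference pairs involves, the sketch below implements the standard DPO objective in PyTorch. The function and tensor names are illustrative and are not taken from the Tarsier2 codebase.

```python
# Minimal sketch of the standard DPO objective on preference pairs.
# NOT the Tarsier2 training code; it only illustrates pushing the policy to
# prefer the "chosen" caption over the "rejected" one relative to a frozen
# reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Inputs are summed log-probabilities of each response under the
    trainable policy and the frozen reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # A larger margin between chosen and rejected log-ratios lowers the loss.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up log-probabilities for a batch of two pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -15.0]),
    policy_rejected_logps=torch.tensor([-14.0, -15.5]),
    ref_chosen_logps=torch.tensor([-13.0, -15.2]),
    ref_rejected_logps=torch.tensor([-13.5, -15.4]),
)
print(loss.item())
```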
## Model details
- Base Model: [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
- Training Data:
  - Pre-train: Over 40M samples mixing video, image, and text data, with 20.4M open-source and 19.8M in-house, detailed as follows:
<div align="center">
<img src="assets/tarsier2_training_dataset.png" width="100%">
<br>Figure 1: Summary of datasets used in the pre-training stage of Tarsier2.
</div>
  - Post-train: 150K human-annotated detailed video descriptions for SFT and 20K automatically sampled and filtered preference pairs for DPO.

**Model date:**
Tarsier2-7B was trained in December 2024.

**Paper or resources for more information:**
- Online demo: https://huggingface.co/spaces/omni-research/Tarsier2-7b
- GitHub repo: https://github.com/bytedance/tarsier/tree/tarsier2
- Paper: https://arxiv.org/abs/2501.07888
- Leaderboard: https://tarsier-vlm.github.io/

## Performance
Tarsier2-7B excels in various video understanding tasks, including video captioning, video question-answering, video grounding, and hallucination tests.
<div align="center">
<img src="assets/performance_of_tarsier2.png" width="100%">
<br>Figure 2: Performance comparison of Tarsier2 with previous SOTA models at 7B-scale and GPT-4o.
</div>

## License
Tarsier2-7B is released under the same license as Qwen/Qwen2-VL-7B-Instruct (Apache 2.0).

## Intended use
**Primary intended uses:**
The primary use of Tarsier is research on large multimodal models, especially video description.

**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

## How to Use
See the usage instructions at https://github.com/bytedance/tarsier?tab=readme-ov-file#usage. A rough loading sketch is included below for orientation.

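The repository linked above documents the supported inference path. Purely as a hedged sketch of what loading a Qwen2-VL-based checkpoint typically looks like with Hugging Face `transformers`, the snippet below assumes the checkpoint id `omni-research/Tarsier2-7b`, the `Qwen2VLForConditionalGeneration`/`AutoProcessor` classes, and the `qwen_vl_utils` helper from the Qwen2-VL examples; none of these are confirmed by this card, so prefer the official instructions.

```python
# Hedged sketch only: assumes Tarsier2-7B loads like its Qwen2-VL base model.
# The checkpoint id and preprocessing helper below are assumptions, not the
# officially documented Tarsier usage (see the GitHub link above).
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper used in Qwen2-VL examples

model_id = "omni-research/Tarsier2-7b"  # assumed repository id
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Qwen2-VL-style chat message: one video plus a detailed-description prompt.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4"},
        {"type": "text", "text": "Describe the video in detail."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens before decoding the generated description.
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```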

**Where to send questions or comments about the model:**
https://github.com/bytedance/tarsier/issues

## Citation
If you find our work helpful, feel free to cite us as:
```BibTeX
@misc{yuan2025tarsier2advancinglargevisionlanguage,
      title={Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding},
      author={Liping Yuan and Jiawei Wang and Haomiao Sun and Yuchen Zhang and Yuan Lin},
      year={2025},
      eprint={2501.07888},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.07888},
}
```

assets/performance_of_tarsier2.png (ADDED, stored via Git LFS)

assets/tarsier2_training_dataset.png (ADDED, stored via Git LFS)