zijian.kang committed · Commit 0b68fad · Parent: a20f777 · update

README.md (CHANGED)
SAIL-VL is a state-of-the-art vision-language model (VLM) developed by the Bytedance Douyin Content Team. The goal of SAIL-VL is to develop a high-performance vision-language model that is easy to deploy on mobile devices and remains accessible and affordable for a broad audience. Through careful tuning of data and training recipes, SAIL-VL demonstrates that even a small VLM can benefit significantly from data scaling. Our model outperforms Qwen2-VL, InternVL2, and even recent SoTA models of comparable size.

In short, SAIL-VL is a foundational VLM for vision-language applications. We welcome you to explore its capabilities, and feel free to contact us with any questions or opportunities.
## News 🚀🚀🚀

- 2025-4-16: 📖 We released our powerful v1.5 series models; check them out at [🤗SAIL-VL-1.5-2B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-1.5-2B) and [🤗SAIL-VL-1.5-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-1.5-8B) ~
- 2025-2-19: 📖 We released our 8B model; check it out at [🤗SAIL-VL-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B) ~
- 2025-1-10: 📖 We released our paper on arXiv: [Scalable Vision Language Model Training via High Quality Data Curation](https://arxiv.org/abs/2501.05952)
| Model | ViT | LLM | Adapter | Token Merge | Resolution |
| --- | --- | --- | --- | --- | --- |
| [🤗SAIL-VL-1.5-2B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-1.5-2B) | [🤗AimV2-Huge](https://huggingface.co/apple/aimv2-huge-patch14-448) | [🤗Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | 2-layer MLP | 2x2 | 448x448xN |
| [🤗SAIL-VL-1.5-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-1.5-8B) | [🤗InternViT-300M](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [🤗Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | 2-layer MLP | 2x2 | 448x448xN |
| [🤗SAIL-VL-2B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-2B) | [🤗InternViT-300M](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [🤗Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | 2-layer MLP | 2x2 | 448x448xN |
| [🤗SAIL-VL-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B) | [🤗InternViT-300M](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [🤗Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | 2-layer MLP | 2x2 | 448x448xN |
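As a rough illustration of what the "Token Merge 2x2" and "Resolution 448x448xN" columns imply for the visual sequence length, the sketch below counts tokens per image tile. This is our own back-of-the-envelope arithmetic, not the authors' code; it assumes a ViT patch size of 14 (as suggested by the `InternViT-300M-448px` and `aimv2-huge-patch14-448` model names) and a pixel-shuffle-style merge that folds each 2x2 patch neighborhood into one token.

```python
def visual_token_count(resolution=448, patch_size=14, merge=2, num_tiles=1):
    """Estimate visual tokens fed to the LLM per image (hypothetical helper)."""
    # Patches per side produced by the ViT: 448 // 14 = 32
    side = resolution // patch_size
    # Patch tokens per tile before merging: 32 * 32 = 1024
    tokens = side * side
    # A 2x2 token merge concatenates each 2x2 neighborhood into a single
    # token, cutting the count by merge**2 (the MLP adapter then projects it)
    merged = tokens // (merge * merge)
    # "448x448xN" means the image is split into N tiles of 448x448
    return merged * num_tiles

print(visual_token_count())             # 256 tokens for a single tile
print(visual_token_count(num_tiles=4))  # 1024 tokens for N = 4 tiles
```

Under these assumptions, each 448x448 tile contributes 256 visual tokens after merging, which is why the 2x2 merge matters for keeping sequence lengths deployable on mobile hardware.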
### Training Recipes Overview

SAIL-VL benefits from high-quality data and carefully curated training recipes. We find that data quality, data quantity, and the design of the curriculum training pipeline are all crucial for model performance. With the proper design and data, the model's capacity scales effectively with data expansion at all stages, leading to enhanced performance.