Blue-skyyy committed
Commit 6b85e55 · verified · 1 Parent(s): f33304a

Update README.md

Files changed (1): README.md +1 -1
README.md CHANGED
@@ -12,7 +12,7 @@ base_model:
 
 SAIL-VL is a state-of-the-art vision-language model (VLM) developed by the Bytedance Douyin Content Team. The goal of SAIL-VL is to develop a high-performance vision-language model that is easy to deploy on mobile devices and remains accessible and affordable for a broad audience. Through careful tuning of data and training recipes, SAIL-VL demonstrates that even a small VLM can benefit significantly from data scaling.
 
-SAIL-VL V1.5 is the brand-new version of our model, incorporating advanced techniques for higher performance. For visual encoding, we use the stronger AIM-V2 ViT as our vision encoder, introduce a progressive training strategy for warmup, and apply a visual token scaling strategy during inference. During training, we introduce an adaptive stream packing strategy to support higher throughput and longer sequences. Finally, we add more conversation and reasoning data, filter out noisy data, and add a new training stage for videos. With all these updates, our model outperforms recent SoTA models of comparable size: InternVL-3-2B, Ovis2-2B, and even Qwen2.5-VL-3B.
+SAIL-VL V1.5 is the brand-new version of our model, incorporating advanced techniques for higher performance. For visual encoding, we use the stronger SAILViT-Huge ViT as our vision encoder, introduce a progressive training strategy for warmup, and apply a visual token scaling strategy during inference. During training, we introduce an adaptive stream packing strategy to support higher throughput and longer sequences. Finally, we add more conversation and reasoning data, filter out noisy data, and add a new training stage for videos. With all these updates, our model outperforms recent SoTA models of comparable size: InternVL-3-2B, Ovis2-2B, and even Qwen2.5-VL-3B.
 
 Please enjoy our model and feel free to contact us with any questions or opportunities.
 