Weiyun1025 committed on
Commit 0686106 · verified · 1 Parent(s): 1d5671b

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -17,7 +17,7 @@ tags:
 
 # InternVL3-9B
 
-[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](TBD)
+[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479)
 
 [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
 
@@ -64,9 +64,9 @@ Notably, in InternVL3, we integrate the [Variable Visual Position Encoding (V2PE
 
 ### Native Multimodal Pre-Training
 
-We propose a [Native Multimodal Pre-Training](TBD) approach that consolidates language and vision learning into a single pre-training stage.
+We propose a [Native Multimodal Pre-Training](https://huggingface.co/papers/2504.10479) approach that consolidates language and vision learning into a single pre-training stage.
 In contrast to standard paradigms that first train a language-only model and subsequently adapt it to handle additional modalities, our method interleaves multimodal data (e.g., image-text, video-text, or image-text interleaved sequences) with large-scale textual corpora. This unified training scheme allows the model to learn both linguistic and multimodal representations simultaneously, ultimately enhancing its capability to handle vision-language tasks without the need for separate alignment or bridging modules.
-Please see [our paper](TBD) for more details.
+Please see [our paper](https://huggingface.co/papers/2504.10479) for more details.
 
 ### Supervised Fine-Tuning
 
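
The Native Multimodal Pre-Training paragraph changed above describes interleaving multimodal data (image-text, video-text, or interleaved sequences) with large-scale text-only corpora inside a single pre-training stage. As a rough illustration only, not the InternVL3 implementation, the sketch below shows what mixing two such streams into one training stream could look like; the function name, record format, and mixing ratio are all assumptions for the example.

```python
# Illustrative sketch only (not the InternVL3 code): interleave multimodal
# samples with text-only data in one stream, so linguistic and multimodal
# representations are learned in a single pre-training stage.
import random
from typing import Dict, Iterator


def interleave_streams(
    text_stream: Iterator[Dict],        # e.g. {"text": ...}
    multimodal_stream: Iterator[Dict],  # e.g. {"image": ..., "text": ...}
    multimodal_ratio: float = 0.5,      # hypothetical mixing ratio, not from the paper
    seed: int = 0,
) -> Iterator[Dict]:
    """Yield one mixed stream of samples; stops when either source runs out."""
    rng = random.Random(seed)
    while True:
        source = multimodal_stream if rng.random() < multimodal_ratio else text_stream
        try:
            yield next(source)
        except StopIteration:
            return


# Toy usage: both kinds of samples feed the same model and loss in one stage,
# with no separate alignment or bridging phase afterwards.
text_data = iter([{"text": f"text sample {i}"} for i in range(4)])
mm_data = iter([{"image": f"img_{i}.jpg", "text": f"caption {i}"} for i in range(4)])
for sample in interleave_streams(text_data, mm_data):
    pass  # a unified forward/backward pass would consume `sample` here
```

The point of the sketch is the single mixed stream: both data types pass through the same model and objective in one stage, which is what lets the approach skip a separate vision-language alignment step.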