Model

We used the same Vision Transformer architecture ViT-L/14@336px as CLIP.
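To make the `ViT-L/14@336px` notation concrete, the following minimal sketch works out how many tokens such a model processes per image. It assumes square, non-overlapping 14x14 patches on a 336x336 input and a single prepended class token, as in CLIP's ViT; the function name `vit_sequence_length` is purely illustrative and is not part of any released code.

```python
# Token-count arithmetic for a ViT with square, non-overlapping patches.
# Assumption: one class token is prepended, as in CLIP's ViT.

def vit_sequence_length(image_size: int, patch_size: int, class_token: bool = True) -> int:
    """Return the number of tokens the transformer sees for one image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches + (1 if class_token else 0)


print(vit_sequence_length(336, 14))  # 24 * 24 patches + 1 class token = 577
print(vit_sequence_length(224, 14))  # 16 * 16 patches + 1 class token = 257
```

Moving from 224px to 336px input more than doubles the sequence length, which is the main cost of the higher resolution.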

Data

Our model was trained on publicly available image-caption data from the LAION400M and COYO700M datasets.

Performance and Limitations

A. MLLMs Evaluation Results

In our experiments, we replaced the CLIP vision tower in LLaVA-NeXT with the MLCD model to evaluate MLCD within Multimodal Large Language Models (MLLMs). For the language model, we used Qwen2.5-7B. MLCD scores higher than CLIP on 15 of the 17 benchmarks in the table below; CLIP is slightly ahead on POPE and TextVQA_val.

Vision Tower	MLCD (ViT_L_14_336px)	CLIP (ViT_L_14_336px)
LLM	Qwen2.5-7B	Qwen2.5-7B
AI2D	76.98	73.15
ScienceQA_img	78.09	76.35
GQA	64.17	63.31
InfoVQA_val	43.48	38.88
MMBench_cn_dev	74.83	72.51
MMBench_en_dev	76.37	74.57
MME(cognition)	432	384
MME(perception)	1598	1512
SeedBench	68.20	66.80
SeedBench_img	73.75	72.72
MMStar	50.98	48.98
MMMU	44.30	44.20
OCRBench	531.00	525.00
ChartQA	67.84	66.52
DocVQA_val	76.46	75.21
POPE	88.69	88.83
TextVQA_val	61.69	62.47
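The comparison above can be summarized programmatically. The sketch below copies the scores from the table and reports which vision tower is ahead on each benchmark. Because the benchmarks use different scales (for example, MME and OCRBench are not percentages), it uses relative rather than absolute differences. The helper names `split_wins` and `relative_gain` are illustrative only.

```python
# Scores copied from the table above as (MLCD, CLIP) pairs.
scores = {
    "AI2D": (76.98, 73.15),
    "ScienceQA_img": (78.09, 76.35),
    "GQA": (64.17, 63.31),
    "InfoVQA_val": (43.48, 38.88),
    "MMBench_cn_dev": (74.83, 72.51),
    "MMBench_en_dev": (76.37, 74.57),
    "MME(cognition)": (432, 384),
    "MME(perception)": (1598, 1512),
    "SeedBench": (68.20, 66.80),
    "SeedBench_img": (73.75, 72.72),
    "MMStar": (50.98, 48.98),
    "MMMU": (44.30, 44.20),
    "OCRBench": (531.00, 525.00),
    "ChartQA": (67.84, 66.52),
    "DocVQA_val": (76.46, 75.21),
    "POPE": (88.69, 88.83),
    "TextVQA_val": (61.69, 62.47),
}


def split_wins(table):
    """Return (benchmarks where MLCD is higher, benchmarks where CLIP is higher)."""
    mlcd = [name for name, (m, c) in table.items() if m > c]
    clip = [name for name, (m, c) in table.items() if c > m]
    return mlcd, clip


def relative_gain(mlcd_score, clip_score):
    """Relative difference of MLCD over CLIP, in percent."""
    return 100.0 * (mlcd_score - clip_score) / clip_score


mlcd_wins, clip_wins = split_wins(scores)
largest = max(scores, key=lambda name: relative_gain(*scores[name]))

print(f"MLCD higher on {len(mlcd_wins)} of {len(scores)} benchmarks")
print(f"CLIP higher on: {', '.join(clip_wins)}")
print(f"Largest relative gain: {largest} ({relative_gain(*scores[largest]):.1f}%)")
```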

B. Linear Probe Evaluation Results

The table below compares the CLIP and MLCD models, both using the ViT_L_14_336px architecture, under linear probe evaluation on 25 datasets. A linear probe freezes the pre-trained model's weights and trains only a linear classifier on top, which measures how well the learned representations transfer to each task. MLCD scores higher on 19 of the 25 datasets and on the average.

Dataset	MLCD (ViT_L_14_336px)	CLIP (ViT_L_14_336px)
AVG	87.15	85.35
Food101	96.21	95.90
CIFAR-10	99.36	97.90
CIFAR-100	93.69	87.40
Birdsnap	88.18	79.90
SUN397	87.96	82.20
Stanford Cars	95.16	91.50
FGVC Aircraft	86.38	71.60
Describable Textures Dataset	86.70	83.00
Oxford-IIIT Pets	96.27	95.10
Caltech-101	97.92	96.00
Flowers102	99.58	99.20
MNIST	98.67	99.20
STL-10	99.28	99.70
EuroSAT	99.06	98.10
RESISC45	95.48	94.90
GTSRB	92.32	92.40
KITTI	75.39	69.20
Country211	38.12	46.40
PatchCamelyon	88.00	85.60
UCF101	92.86	92.00
Kinetics-700	73.35	73.00
CLEVR	64.40	60.30
Hateful Memes	72.00	77.30
SST-2	76.33	80.50
ImageNet	86.30	85.40
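The linear probe protocol described above the table can be illustrated with a small, self-contained sketch. It is not the evaluation code behind these numbers: the frozen "encoder" is a hypothetical stand-in (fixed weights plus `tanh`), the data are two synthetic Gaussian blobs, and the probe is a plain logistic regression trained with full-batch gradient descent. The point is only that the encoder is never updated; just the linear classifier's weights `w` and bias `b` are learned.

```python
import math
import random

# Hypothetical frozen "encoder": fixed weights followed by tanh.
# It stands in for a pre-trained vision tower and is never updated.
FROZEN_W = ((1.0, 0.5), (-0.5, 1.0), (0.8, -0.3))


def encode(x):
    """Map a 2-D input to 3 frozen features."""
    return [math.tanh(w0 * x[0] + w1 * x[1]) for w0, w1 in FROZEN_W]


def make_data(n_per_class, rng):
    """Toy binary task: Gaussian blobs centred at (-2, -2) and (2, 2)."""
    data = []
    for label, centre in ((0, -2.0), (1, 2.0)):
        for _ in range(n_per_class):
            data.append(((rng.gauss(centre, 0.5), rng.gauss(centre, 0.5)), label))
    return data


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(w, b, f):
    return sum(wi * fi for wi, fi in zip(w, f)) + b


def train_probe(features, labels, lr=0.5, epochs=200):
    """Fit the linear probe (logistic regression) on frozen features."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in zip(features, labels):
            err = sigmoid(logit(w, b, f)) - y
            for j in range(dim):
                grad_w[j] += err * f[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, features, labels):
    hits = sum((logit(w, b, f) > 0) == bool(y) for f, y in zip(features, labels))
    return hits / len(labels)


rng = random.Random(0)
train, test = make_data(100, rng), make_data(100, rng)
# Features are computed once with the frozen encoder; only w and b are trained.
train_f = [encode(x) for x, _ in train]
test_f = [encode(x) for x, _ in test]
w, b = train_probe(train_f, [y for _, y in train])
probe_accuracy = accuracy(w, b, test_f, [y for _, y in test])
print(f"linear probe test accuracy: {probe_accuracy:.3f}")
```

In a real evaluation, `encode` would be replaced by a forward pass through the frozen vision model, and the probe would be trained per dataset on the extracted image features.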

C. Limitations

Higher-resolution models tend to perform better on OCR-related tasks. We are currently training such models and will release them soon.

Acknowledgments

We would like to express our gratitude to Xie Yin and Yumeng Wang for their significant contributions to the experimental validation in MLLMs.

Model size: 304M params (Safetensors, tensor type F32)
