SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

VLAA-Thinker-Qwen2VL-7B is a vision-language model fine-tuned on the VLAA-Thinking dataset. As described in the accompanying paper, it leverages a combination of supervised fine-tuning (SFT) and reinforcement learning (RL) to improve the reasoning capabilities of large vision-language models (LVLMs). The model excels at multimodal reasoning tasks, achieving state-of-the-art performance on the OpenCompass Multimodal Reasoning Leaderboard as of April 7th, 2025.

🌐 Project Page • 📄 Arxiv • 💻 Code

Both VLAA-Thinker-Qwen2.5-3B and VLAA-Thinker-Qwen2.5-7B achieve SOTA performance on the OpenCompass Multimodal Reasoning Leaderboard as of April 7th, 2025.


Quick Start 🚀

Inference

Run `python inference.py`. Note that our model was trained with a system prompt; please ensure it is included at inference time.
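The exact prompt ships with `inference.py`; as a minimal sketch (the prompt text, file name, and message layout below are placeholders, not the released prompt), a Qwen2-VL-style chat request with a leading system turn could be assembled like this:

```python
# Placeholder system prompt -- use the actual prompt from inference.py.
SYSTEM_PROMPT = "You are a helpful assistant that reasons step by step."

def build_messages(image_path: str, question: str) -> list[dict]:
    """Assemble Qwen2-VL-style chat messages with a leading system turn."""
    return [
        # The system turn comes first so the chat template applies it.
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        },
    ]

messages = build_messages("demo.jpg", "How many objects are in the image?")
```

The resulting `messages` list can then be passed through the processor's chat template before generation.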

Dataset Download

Run `bash ./utils/download_dataset.sh`, specifying the dataset root as an absolute path. The dataset should be organized as follows:

β”œβ”€β”€ VLAA-Thinking-SFT-126K.json
β”œβ”€β”€ VLAA-Thinking-GRPO-25K.json
└── images
    β”œβ”€β”€ allava_laion
    β”œβ”€β”€ arxivqa
    β”œβ”€β”€ chartqa
    β”œβ”€β”€ clevr_math
    β”œβ”€β”€ coco
    β”‚   └── train2017
    β”œβ”€β”€ docvqa
    β”œβ”€β”€ geoqa170k
    β”œβ”€β”€ synthesis
    β”œβ”€β”€ vg
    β”‚   β”œβ”€β”€ VG_100K
    β”‚   └── VG_100K_2
    └── vizwiz
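Assuming each entry in the SFT JSON stores an image path relative to the `images` directory above (a guess about the schema, not confirmed by the release; the record below is purely illustrative), resolving a record's image against the dataset root might look like:

```python
from pathlib import Path

# Placeholder absolute dataset root -- replace with your own path.
DATASET_ROOT = Path("/abs/path/to/VLAA-Thinking")

# Hypothetical record shape; the real schema of VLAA-Thinking-SFT-126K.json
# may differ. This only illustrates joining relative paths onto the root.
sample = {"image": "coco/train2017/example.jpg", "conversations": []}

def resolve_image(record: dict, root: Path) -> Path:
    """Join a record's relative image path onto <root>/images."""
    return root / "images" / record["image"]

print(resolve_image(sample, DATASET_ROOT))
```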

Training

Code coming soon!


Model size: 8.29B params · Safetensors · BF16