---
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- multimodal
- image caption
- captioning
datasets:
- internlm/CapRL-2M
---

# CapRL-3B

πŸ“–<a href="https://arxiv.org/abs/2509.22647">Paper</a> | 🏠<a href="https://github.com/InternLM/CapRL">GitHub</a> | πŸ€—<a href="https://huggingface.co/internlm/CapRL-3B">CapRL-3B Model</a> | πŸ€—<a href="https://huggingface.co/yuhangzang/CapRL-InternVL3.5-8B">CapRL-InternVL3.5-8B Model</a> | πŸ€—<a href="https://huggingface.co/datasets/internlm/CapRL-2M">CapRL-2M Dataset</a>

πŸ€—<a href="https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189">CapRL Collection</a> | πŸ€—<a href="https://huggingface.co/papers/2509.22647">Daily Paper</a> | πŸ€—<a href="https://huggingface.co/mradermacher/CapRL-3B-GGUF">CapRL-3B-GGUF</a> | πŸ€—<a href="https://huggingface.co/mradermacher/CapRL-3B-i1-GGUF">CapRL-3B-i1-GGUF</a>


Now you can try out CapRL-3B with your own images🎨!&nbsp;&nbsp;&nbsp;&nbsp;➑️&nbsp;&nbsp;&nbsp;&nbsp;[🌈CapRL Space](https://huggingface.co/spaces/yuhangzang/caprl)


When selecting between the available CapRL models, it's essential to consider the trade-off between performance and computational cost.
This guide will help you choose the most suitable model for your specific needs:
|Model|Parameters|Strength|
|-|-|-|
|πŸ€—[CapRL-3B](https://huggingface.co/internlm/CapRL-3B)|3B|Speed, Efficiency|
|πŸ€—[CapRL-InternVL3.5-8B](https://huggingface.co/yuhangzang/CapRL-InternVL3.5-8B)|8B|High Performance, Advanced Captioning Ability|

## πŸ“’ News
We are working on even stronger base models and upgrading our training recipe β€” stay tuned!
- πŸ”₯ [10/15/2025] The total downloads of the CapRL-related [models and dataset](https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189) reached 6,000 within just 20 days!
- πŸš€ [10/15/2025] We are excited to announce the release of **[CapRL-InternVL3.5-8B](https://huggingface.co/internlm/CapRL-InternVL3.5-8B)**, whose image captioning capability surpasses that of Qwen2.5-VL-72B!
- πŸš€ [10/15/2025] Thanks to [mradermacher](https://huggingface.co/mradermacher) for the valuable contribution! [CapRL-3B-GGUF](https://huggingface.co/mradermacher/CapRL-3B-GGUF) provides static quants, and [CapRL-3B-i1-GGUF](https://huggingface.co/mradermacher/CapRL-3B-i1-GGUF) provides weighted/imatrix quants.
- πŸš€ [10/15/2025] We release [QA curation code](https://github.com/InternLM/CapRL).
- πŸš€ [09/25/2025] We release **CapRL** repository, [CapRL-3B model](https://huggingface.co/internlm/CapRL-3B), [evaluation code](https://github.com/InternLM/CapRL) and [dataset](https://huggingface.co/datasets/internlm/CapRL-2M).

## Introduction
We are excited to introduce [CapRL-3B](https://huggingface.co/internlm/CapRL-3B), a lightweight 3B image captioner that achieves perception capabilities comparable to Qwen2.5-VL-72B.

This is the first study to apply Reinforcement Learning with Verifiable Rewards (RLVR) to the
open-ended and subjective task of image captioning. Unlike traditional Supervised Fine-Tuning, which
can lead to models memorizing a limited set of annotated captions, our method allows the model to
explore and generate a broader range of creative and general descriptions.
CapRL is a new training paradigm featuring a decoupled two-stage pipeline. The first
stage uses LVLMs to generate rich and accurate captions. The second stage then evaluates
caption quality by having a vision-free LLM answer questions about the image based solely on the caption. We also built a dedicated QA
curation pipeline to ensure the quality of the questions and answers used in the second stage.
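
To make the second stage concrete, the reward for a sampled caption can be viewed as the accuracy of a text-only LLM answering the curated questions from the caption alone. The sketch below is only a schematic illustration of this idea, not the released training code; `answer_with_llm` is a hypothetical helper wrapping such an LLM.

```python
# Schematic sketch of the CapRL reward signal (not the released implementation).
# A caption is scored by how many curated QA pairs a vision-free LLM answers
# correctly when the caption is its only source of visual information.
def caption_reward(caption: str, qa_pairs: list[dict], answer_with_llm) -> float:
    if not qa_pairs:
        return 0.0
    correct = 0
    for qa in qa_pairs:
        # `answer_with_llm` is a hypothetical callable: (context, question) -> answer string.
        prediction = answer_with_llm(context=caption, question=qa["question"])
        correct += int(prediction.strip().lower() == qa["answer"].strip().lower())
    return correct / len(qa_pairs)
```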

By employing the CapRL training framework, initializing with the Qwen2.5-VL-3B model, and using a carefully 
filtered 75K QA dataset as the training set, we obtained a highly capable captioner, [CapRL-3B](https://huggingface.co/internlm/CapRL-3B).

<p align="center">
  <img src="./assets/teaser.png"  width="750"/>
</p>
<p align="center">
  <img src="./assets/performance_update.png" width="750"/>
</p>

## Key Features
* **Remarkable visual understanding of charts, infographics, and documents**: [CapRL-3B](https://huggingface.co/internlm/CapRL-3B) achieves perception accuracy and visual information coverage comparable to Qwen2.5-VL-72B.
* **Well-organized output**: The outputs of CapRL-3B are relatively well-structured, making them clear and easy to understand.
* **Detailed descriptions for natural images**: The outputs of [CapRL-3B](https://huggingface.co/internlm/CapRL-3B) cover the valid visual information comprehensively while containing fewer hallucinations.

## Usage
If you want to use **[CapRL-3B](https://huggingface.co/internlm/CapRL-3B)** for captioning, you can directly follow the same inference approach as the [Qwen2.5-VL series](https://github.com/QwenLM/Qwen3-VL/tree/d2240f11656bfe404b9ba56db4e51cd09f522ff1).
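
For a quick local test without a serving stack, the snippet below follows the standard Qwen2.5-VL Transformers usage. It is a minimal sketch that assumes the `qwen-vl-utils` helper package is installed and that the image path is replaced with your own.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "internlm/CapRL-3B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "/path/to/local/image.png"},
            {"type": "text", "text": "Please describe the image in detail."},
        ],
    }
]

# Build the chat prompt and collect the vision inputs expected by the processor.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=1024)
# Strip the prompt tokens before decoding the caption.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```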

We recommend using **vLLM** to speed up inference.



### Start an OpenAI API Service

Run the command below to start an OpenAI-compatible API service:

```bash
vllm serve "/PATH/CapRL-3B" \
    --trust-remote-code \
    --tensor-parallel-size=1 \
    --pipeline-parallel-size=1 \
    --gpu-memory-utilization=0.95 \
    --served-model-name=caprl \
    --port 8000 \
    --host 0.0.0.0
```
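
Once the server reports it is ready, you can optionally confirm it is reachable. The endpoint below is part of the standard OpenAI-compatible API exposed by vLLM and should list `caprl` as an available model.

```bash
# Optional sanity check: list the models served by the OpenAI-compatible endpoint.
curl http://localhost:8000/v1/models
```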

Then you can call the chat API as shown below (see the [OpenAI API protocol documentation](https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images) for more details):
```python
import base64
from openai import OpenAI

# Point the OpenAI client at vLLM's OpenAI-compatible API server.
openai_api_key = "EMPTY"  # vLLM does not validate the key by default
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Encode a local image as a base64 data URL.
image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
base64_image = f"data:image/png;base64,{encoded_image_text}"

max_tokens = 4096  # upper bound on the length of the generated caption

chat_response = client.chat.completions.create(
    model="caprl",  # must match --served-model-name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": base64_image},
                },
                {"type": "text", "text": "Please describe the image in detail."},
            ],
        },
    ],
    temperature=1.0,
    max_tokens=max_tokens,
    top_p=1.0,
    extra_body={
        "repetition_penalty": 1.0,
    },
)
print("Caption:", chat_response.choices[0].message.content)
```



## Cases
<p align="center">
  <img src="./assets/comparison.png"  width="750"/>
</p>

<p align="center">
  <img src="./assets/info_caprl.png"  width="750"/>
</p>

<p align="center">
  <img src="./assets/info_caprl2.png"  width="750"/>
</p>
<p align="center">
  <img src="./assets/natural_caprl.png"  width="750"/>
</p>