Svngoku committed on
Commit 75c0fca · verified · 1 Parent(s): 061e0ab

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,301 @@
1
+ ---
2
+ base_model:
3
+ - jinaai/jina-reranker-m0
4
+ ---
5
+ # jinaai/jina-reranker-m0 (Quantized)
6
+
7
+ ## Description
8
+ This model is a quantized version of the original model [`jinaai/jina-reranker-m0`](https://huggingface.co/jinaai/jina-reranker-m0).
9
+
10
+ It was quantized to 4-bit with the BitsAndBytes library using the [bnb-my-repo](https://huggingface.co/spaces/bnb-community/bnb-my-repo) space.
11
+
12
+ ## Quantization Details
13
+ - **Quantization Type**: int4
14
+ - **bnb_4bit_quant_type**: fp4
15
+ - **bnb_4bit_use_double_quant**: True
16
+ - **bnb_4bit_compute_dtype**: bfloat16
17
+ - **bnb_4bit_quant_storage**: int8
18
+
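+ Since these settings are stored in this repo's `config.json`, the checkpoint can be loaded directly with `transformers`. A minimal sketch (the repo id below is a placeholder for this upload; `bitsandbytes`, `accelerate`, and a CUDA GPU are assumed):
+
+ ```python
+ from transformers import AutoModel
+
+ # The 4-bit bitsandbytes settings are picked up from config.json automatically.
+ model = AutoModel.from_pretrained(
+     "<this-repo-id>",        # placeholder: wherever this quantized upload lives
+     trust_remote_code=True,  # needed for the custom JinaVLForRanking class
+     device_map="auto",
+ )
+ model.eval()
+ ```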
19
+
20
+
21
+ # 📄 Original Model Information
22
+
23
+
24
+ ---
25
+ pipeline_tag: text-classification
26
+ tags:
27
+ - vidore
28
+ - reranker
29
+ - qwen2_vl
30
+ language:
31
+ - multilingual
32
+ base_model:
33
+ - Qwen/Qwen2-VL-2B-Instruct
34
+ inference: false
35
+ license: cc-by-nc-4.0
36
+ library_name: transformers
37
+ ---
38
+
39
+ <br><br>
40
+
41
+ <p align="center">
42
+ <img src="https://huggingface.co/datasets/jinaai/documentation-images/resolve/main/logo.webp" alt="Jina AI: Your Search Foundation, Supercharged!" width="150px">
43
+ </p>
44
+
45
+ <p align="center">
46
+ <b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
47
+ </p>
48
+
49
+ [Blog](https://jina.ai/news/jina-reranker-m0-multilingual-multimodal-document-reranker) | [API](https://jina.ai/reranker) | [AWS](https://aws.amazon.com/marketplace/pp/prodview-ctlpeffe5koac?sr=0-1&ref_=beagle&applicationId=AWSMPContessa) | [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.jina-reranker-m0) | Arxiv (coming soon)
50
+
51
+
52
+ # jina-reranker-m0: Multilingual Multimodal Document Reranker
53
+
54
+ ## Intended Usage & Model Info
55
+
56
+ **jina-reranker-m0** is our new **multilingual multimodal reranker** for ranking visual documents: it accepts a query alongside a collection of visually rich document images, including pages with text, figures, tables, infographics, and varied layouts, spanning multiple domains and over 29 languages.
57
+ It outputs a ranked list of documents ordered by their relevance to the input query. Compared to `jina-reranker-v2-base-multilingual`, `jina-reranker-m0` also improves text reranking for multilingual content, long documents, and code searching tasks.
58
+
59
+ ## Architecture
60
+
61
+ **jina-reranker-m0** is built on a decoder-only vision language model architecture, specifically:
62
+
63
+ - **Base model**: `Qwen2-VL-2B-Instruct`, utilizing its vision encoder, projection layer, and language model
64
+ - **Adaptation**: Fine-tuned the language model with LoRA (Low-Rank Adaptation) techniques
65
+ - **Output layer**: Post-trained MLP head to generate ranking scores measuring query-document relevance
66
+ - **Training objective**: Optimized with pairwise and listwise ranking losses to produce discriminative relevance scores
67
+
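+ As a rough illustration of the scoring head and pairwise objective, here is a short PyTorch sketch. It is an assumption-laden toy, not the released implementation: the real head's depth and activation are not specified here, and only `hidden_size=1536` follows this repo's `config.json`.
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class RankingHead(nn.Module):
+     """Illustrative MLP head scoring a (query, document) pair from the
+     decoder's hidden states; not the exact released architecture."""
+     def __init__(self, hidden_size: int = 1536):
+         super().__init__()
+         self.mlp = nn.Sequential(
+             nn.Linear(hidden_size, hidden_size),
+             nn.GELU(),
+             nn.Linear(hidden_size, 1),
+         )
+
+     def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
+         # Score each sequence from its last-token representation.
+         return self.mlp(last_hidden[:, -1, :]).squeeze(-1)
+
+ def pairwise_loss(pos: torch.Tensor, neg: torch.Tensor, margin: float = 1.0):
+     # Standard margin loss: positive documents should outscore negatives.
+     return torch.clamp(margin - (pos - neg), min=0).mean()
+ ```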
68
+ This represents a significant architectural shift from our previous cross-encoder models:
69
+
70
+ | | **jina-reranker-m0** | **jina-reranker-v2** |
71
+ |----------------------------------|--------------------------------------|-------------------------------------|
72
+ | **Architecture** | Vision Language Model | Cross-Encoder |
73
+ | **Base model** | Qwen2-VL-2B | Jina-XLM-RoBERTa |
74
+ | **Parameters** | 2.4 B | 278 M |
75
+ | **Max context length** | 10,240 tokens (query + document) | 8,192 tokens |
76
+ | **Image processing** | 768 × 28 × 28 patches (dynamic resolution) | ❌ |
77
+ | **Multilingual support** | 29+ languages | Multiple languages |
78
+ | **Tasks supported** | Text2Text, Text2Image,<br>Image2Text, Text2Mixed | Text2Text |
79
+
80
+ ## Capabilities
81
+
82
+ - **Multimodal Understanding**: Processes both textual and visual content, including pages with mixed text, figures, tables, and various layouts
83
+ - **Long Context Processing**: Handles up to 10K tokens, enabling reranking of lengthy documents
84
+ - **Dynamic Image Resolution**: Supports images from 56×56 pixels up to 4K resolution with dynamic patch processing
85
+ - **Multilingual Support**: Effectively reranks content across 29+ languages, including bidirectional language pairs
86
+ - **Zero-shot Domain Transfer**: Performs well on unseen domains and document types without specific fine-tuning
87
+ - **Code Search**: Enhanced capabilities for programming language search and technical document ranking
88
+
89
+
90
+ Compared to `jina-reranker-v2-base-multilingual`, `jina-reranker-m0` significantly improves text reranking for multilingual content, long documents, and code searching tasks, while adding powerful new capabilities for visual document understanding.
91
+
92
+ # Usage
93
+
94
+ 1. The easiest way to use `jina-reranker-m0` is to call Jina AI's [Reranker API](https://jina.ai/reranker/).
95
+
96
+ ```bash
97
+ curl -X POST \
98
+ https://api.jina.ai/v1/rerank \
99
+ -H "Content-Type: application/json" \
100
+ -H "Authorization: Bearer JINA_API_KEY" \
101
+ -d '{
102
+ "model": "jina-reranker-m0",
103
+ "query": "slm markdown",
104
+ "documents": [
105
+ {
106
+ "image": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
107
+ },
108
+ {
109
+ "image": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
110
+ },
111
+ {
112
+ "image": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/wired-preview.png"
113
+ },
114
+ {
115
+ "text": "We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy -- making it an ideal tool for grounding large language models. The models effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Intensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements."
116
+ },
117
+ {
118
+ "image": "https://jina.ai/blog-banner/using-deepseek-r1-reasoning-model-in-deepsearch.webp"
119
+ },
120
+ {
121
+ "text": "数据提取么?为什么不用正则啊,你用正则不就全解决了么?"
122
+ },
123
+ {
124
+ "text": "During the California Gold Rush, some merchants made more money selling supplies to miners than the miners made finding gold."
125
+ },
126
+ {
127
+ "text": "Die wichtigsten Beiträge unserer Arbeit sind zweifach: Erstens führen wir eine neuartige dreistufige Datensynthese-Pipeline namens Draft-Refine-Critique ein, die durch iterative Verfeinerung hochwertige Trainingsdaten generiert; und zweitens schlagen wir eine umfassende Trainingsstrategie vor, die kontinuierliches Vortraining zur Längenerweiterung, überwachtes Feintuning mit spezialisierten Kontrollpunkten, direkte Präferenzoptimierung (DPO) und iteratives Self-Play-Tuning kombiniert. Um die weitere Forschung und Anwendung der strukturierten Inhaltsextraktion zu erleichtern, ist das Modell auf Hugging Face öffentlich verfügbar."
128
+ }
129
+ ],
130
+ "return_documents": false
131
+ }'
132
+ ```
133
+ You will receive a JSON response with the relevance scores for each document in relation to the query. The response will look like this:
134
+
135
+ ```json
136
+ {
137
+ "model":"jina-reranker-m0",
138
+ "usage": {
139
+ "total_tokens":2813
140
+ },
141
+ "results":[
142
+ {
143
+ "index":1,
144
+ "relevance_score":0.9310624287463884
145
+ },
146
+ {
147
+ "index":4,
148
+ "relevance_score":0.8982678574191957
149
+ },
150
+ {
151
+ "index":0,
152
+ "relevance_score":0.890233167219021
153
+ },
154
+ ...
155
+ ]
156
+ }
157
+ ```
158
+ The `relevance_score` field indicates the relevance of each document to the query, with higher scores indicating greater relevance.
159
+
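+ The same request can be made from Python with `requests` (a minimal sketch assuming `JINA_API_KEY` is set in the environment):
+
+ ```python
+ import os
+ import requests
+
+ resp = requests.post(
+     "https://api.jina.ai/v1/rerank",
+     headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
+     json={
+         "model": "jina-reranker-m0",
+         "query": "slm markdown",
+         "documents": [
+             {"image": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"},
+             {"text": "During the California Gold Rush, some merchants made more money selling supplies to miners than the miners made finding gold."},
+         ],
+         "return_documents": False,
+     },
+ )
+ resp.raise_for_status()
+ for r in resp.json()["results"]:
+     print(r["index"], r["relevance_score"])
+ ```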
160
+
161
+ 2. You can also use the `transformers` library to interact with the model programmatically.
162
+
163
+ Before you start, install the `transformers` library:
164
+
165
+ ```bash
166
+ pip install "transformers>=4.47.3"
167
+ ```
168
+
169
+ If you run the model on a GPU that supports FlashAttention-2 (as of 2024.9.12, this includes Ampere, Ada, and Hopper GPUs such as the A100, RTX 3090, RTX 4090, and H100), also install `flash-attn`:
170
+
171
+ ```bash
172
+ pip install flash-attn --no-build-isolation
173
+ ```
174
+
175
+ And then use the following code snippet to load the model:
176
+
177
+ ```python
178
+ from transformers import AutoModel
179
+
180
+ # comment out the flash_attention_2 line if you don't have a compatible GPU
181
+ model = AutoModel.from_pretrained(
182
+ 'jinaai/jina-reranker-m0',
183
+ torch_dtype="auto",
184
+ trust_remote_code=True,
185
+ attn_implementation="flash_attention_2"
186
+ )
187
+
188
+ model.to('cuda') # or 'cpu' if no GPU is available
189
+ model.eval()
190
+ ```
191
+
192
+ Now you can use the model's `compute_score` function to compute relevance scores for a query and a list of documents. The function takes a list of query-document pairs and returns a list of scores, one per document; higher scores indicate greater relevance.
193
+
194
+ **A. Visual Documents Reranking**
195
+
196
+ For image documents, you can use the following code snippet:
197
+ ```python
198
+ # Example query and documents
199
+ query = "slm markdown"
200
+ documents = [
201
+ "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png",
202
+ "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png",
203
+ "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/wired-preview.png",
204
+ "https://jina.ai/blog-banner/using-deepseek-r1-reasoning-model-in-deepsearch.webp"
205
+ ]
206
+
207
+ # construct sentence pairs
208
+ image_pairs = [[query, doc] for doc in documents]
209
+
210
+ scores = model.compute_score(image_pairs, max_length=2048, doc_type="image")
211
+ # [0.49375027418136597, 0.7889736890792847, 0.47813892364501953, 0.5210812091827393]
212
+ ```
213
+
214
+ **B. Textual Documents Reranking**
215
+
216
+ ```python
217
+ query = "slm markdown"
218
+ documents = [
219
+ "We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy -- making it an ideal tool for grounding large language models. The models effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Intensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.",
220
+ "数据提取么?为什么不用正则啊,你用正则不就全解决了么?",
221
+ "During the California Gold Rush, some merchants made more money selling supplies to miners than the miners made finding gold.",
222
+ "Die wichtigsten Beiträge unserer Arbeit sind zweifach: Erstens führen wir eine neuartige dreistufige Datensynthese-Pipeline namens Draft-Refine-Critique ein, die durch iterative Verfeinerung hochwertige Trainingsdaten generiert; und zweitens schlagen wir eine umfassende Trainingsstrategie vor, die kontinuierliches Vortraining zur Längenerweiterung, überwachtes Feintuning mit spezialisierten Kontrollpunkten, direkte Präferenzoptimierung (DPO) und iteratives Self-Play-Tuning kombiniert. Um die weitere Forschung und Anwendung der strukturierten Inhaltsextraktion zu erleichtern, ist das Modell auf Hugging Face öffentlich verfügbar.",
223
+ ]
224
+
225
+ # construct sentence pairs
226
+ text_pairs = [[query, doc] for doc in documents]
227
+
228
+ scores = model.compute_score(text_pairs, max_length=1024, doc_type="text")
229
+ ```
230
+
231
+ The scores will be a list of floats, where each float represents the relevance score of the corresponding document to the query. Higher scores indicate higher relevance.
232
+ For instance, the returned scores in this case will be:
233
+ ```bash
234
+ [0.6839263439178467, 0.4432148039340973, 0.5904013514518738, 0.45481112599372864]
235
+ ```
236
+
237
+ **C. Image Querying for Textual Documents**
238
+
239
+ The model also supports querying textual documents with an image query. You can use the following code snippet:
240
+
241
+ ```python
242
+ query = "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
243
+
244
+ documents = [
245
+ "We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy -- making it an ideal tool for grounding large language models. The models effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Intensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.",
246
+ "数据提取么?为什么不用正则啊,你用正则不就全解决了么?",
247
+ "During the California Gold Rush, some merchants made more money selling supplies to miners than the miners made finding gold.",
248
+ "Die wichtigsten Beiträge unserer Arbeit sind zweifach: Erstens führen wir eine neuartige dreistufige Datensynthese-Pipeline namens Draft-Refine-Critique ein, die durch iterative Verfeinerung hochwertige Trainingsdaten generiert; und zweitens schlagen wir eine umfassende Trainingsstrategie vor, die kontinuierliches Vortraining zur Längenerweiterung, überwachtes Feintuning mit spezialisierten Kontrollpunkten, direkte Präferenzoptimierung (DPO) und iteratives Self-Play-Tuning kombiniert. Um die weitere Forschung und Anwendung der strukturierten Inhaltsextraktion zu erleichtern, ist das Modell auf Hugging Face öffentlich verfügbar.",
249
+ ]
250
+ # pairs are still [query, document]; here the query is an image and the documents are text
251
+ image_pairs = [[query, doc] for doc in documents]
252
+ scores = model.compute_score(image_pairs, max_length=2048, query_type="image", doc_type="text")
253
+
254
+ # [0.98099285364151, 0.7701883316040039, 0.5637142062187195, 0.9308615922927856]
255
+ ```
256
+
257
+ **D. Image Querying for Image Documents**
258
+
259
+ The model also supports querying image documents with an image query. You can use the following code snippet:
260
+
261
+ ```python
262
+ query = "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
263
+
264
+ documents = [
265
+ "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png",
266
+ "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png",
267
+ "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/wired-preview.png",
268
+ "https://jina.ai/blog-banner/using-deepseek-r1-reasoning-model-in-deepsearch.webp"
269
+ ]
270
+
271
+ image_pairs = [[query, doc] for doc in documents]
272
+ scores = model.compute_score(image_pairs, max_length=2048, doc_type="image", query_type='image')
273
+ # [0.6275860667228699, 0.9922324419021606, 0.8090347051620483, 0.7941296100616455]
274
+ ```
275
+
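+ Note that `compute_score` returns scores in input order; to obtain the ranking itself, sort the documents by score. A small sketch reusing the variables from the snippets above:
+
+ ```python
+ # Highest-scoring documents first.
+ ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
+ for doc, score in ranked:
+     print(f"{score:.4f}  {doc}")
+ ```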
276
+ # Model Performance
277
+
280
+ We conduct extensive evaluations on the performance of the model across various visual retrieval benchmarks.
281
+
282
+ ![Model performance comparison across benchmarks](https://jina-ai-gmbh.ghost.io/content/images/size/w1600/2025/04/all-benchmarks--6-.png)
283
+
284
+ As shown in the figure above, the performance of the `jina-reranker-m0` on `ViDoRe`, `MBEIR`, and `Winoground` visual retrieval benchmarks showcases its capabilities across diverse multimodal retrieval tasks spanning multiple domains and languages. Each dot represents performance scores for different types of visual documents. The boxplots illustrate the distribution of these scores, with the highlighted numbers indicating the average (mean) performance.
285
+
286
+ We also evaluate the performance of the `jina-reranker-m0` across four text-to-text reranking benchmarks. Each benchmark may include multiple datasets, languages, or tasks, represented by individual dots inside the boxplot. The boxplot shows the distribution of these scores, with the highlighted number showing the average (mean) performance. While most benchmarks use NDCG@10 as their performance metric, MKQA uses recall@10 instead, as MKQA's annotation data doesn't support NDCG calculation (the official evaluation uses recall, which determines document relevance through heuristics).
287
+
288
+ ![Model performance comparison across text-to-text benchmarks](https://jina-ai-gmbh.ghost.io/content/images/size/w1600/2025/04/model-perf-boxplot--13-.png)
289
+
290
+ For complete benchmark results, please refer to the [online results table](https://docs.google.com/spreadsheets/d/1KrCD7l0lhzMkyg3z-gEDmymxe4Eun9Z-C0kU3_cxw7Q/edit?usp=sharing).
291
+
292
+
293
+
294
+
295
+ # Contact
296
+
297
+ Join our [Discord community](https://discord.jina.ai/) and chat with other community members about ideas.
298
+
299
+ # License
300
+
301
+ `jina-reranker-m0` is listed on AWS & Azure. If you need to use it beyond those platforms or on-premises within your company, note that the model is licensed under CC BY-NC 4.0. For commercial usage inquiries, feel free to [contact us](https://jina.ai/contact-sales/).
added_tokens.json ADDED
@@ -0,0 +1,16 @@
1
+ {
2
+ "<|box_end|>": 151649,
3
+ "<|box_start|>": 151648,
4
+ "<|endoftext|>": 151643,
5
+ "<|im_end|>": 151645,
6
+ "<|im_start|>": 151644,
7
+ "<|image_pad|>": 151655,
8
+ "<|object_ref_end|>": 151647,
9
+ "<|object_ref_start|>": 151646,
10
+ "<|quad_end|>": 151651,
11
+ "<|quad_start|>": 151650,
12
+ "<|video_pad|>": 151656,
13
+ "<|vision_end|>": 151653,
14
+ "<|vision_pad|>": 151654,
15
+ "<|vision_start|>": 151652
16
+ }
config.json ADDED
@@ -0,0 +1,66 @@
1
+ {
2
+ "_name_or_path": "jinaai/jina-reranker-m0",
3
+ "architectures": [
4
+ "Qwen2VLModel"
5
+ ],
6
+ "attention_dropout": 0.0,
7
+ "auto_map": {
8
+ "AutoModel": "jinaai/jina-reranker-m0--modeling.JinaVLForRanking"
9
+ },
10
+ "bos_token_id": 151643,
11
+ "eos_token_id": 151645,
12
+ "hidden_act": "silu",
13
+ "hidden_size": 1536,
14
+ "image_token_id": 151655,
15
+ "initializer_range": 0.02,
16
+ "intermediate_size": 8960,
17
+ "max_position_embeddings": 32768,
18
+ "max_window_layers": 28,
19
+ "model_type": "qwen2_vl",
20
+ "num_attention_heads": 12,
21
+ "num_hidden_layers": 28,
22
+ "num_key_value_heads": 2,
23
+ "quantization_config": {
24
+ "_load_in_4bit": true,
25
+ "_load_in_8bit": false,
26
+ "bnb_4bit_compute_dtype": "bfloat16",
27
+ "bnb_4bit_quant_storage": "int8",
28
+ "bnb_4bit_quant_type": "fp4",
29
+ "bnb_4bit_use_double_quant": true,
30
+ "llm_int8_enable_fp32_cpu_offload": false,
31
+ "llm_int8_has_fp16_weight": false,
32
+ "llm_int8_skip_modules": null,
33
+ "llm_int8_threshold": 6.0,
34
+ "load_in_4bit": true,
35
+ "load_in_8bit": false,
36
+ "quant_method": "bitsandbytes"
37
+ },
38
+ "rms_norm_eps": 1e-06,
39
+ "rope_scaling": {
40
+ "mrope_section": [
41
+ 16,
42
+ 24,
43
+ 24
44
+ ],
45
+ "rope_type": "default",
46
+ "type": "default"
47
+ },
48
+ "rope_theta": 1000000.0,
49
+ "sliding_window": 32768,
50
+ "tie_word_embeddings": true,
51
+ "torch_dtype": "bfloat16",
52
+ "transformers_version": "4.49.0",
53
+ "use_cache": false,
54
+ "use_sliding_window": false,
55
+ "video_token_id": 151656,
56
+ "vision_config": {
57
+ "hidden_size": 1536,
58
+ "in_chans": 3,
59
+ "model_type": "qwen2_vl",
60
+ "spatial_patch_size": 14
61
+ },
62
+ "vision_end_token_id": 151653,
63
+ "vision_start_token_id": 151652,
64
+ "vision_token_id": 151654,
65
+ "vocab_size": 151936
66
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:56cfb86c8eeb47c0e03f481d7d15e76f2655506fe9a6a76d369540793dab7ab0
3
+ size 1143320463
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>"
16
+ ],
17
+ "eos_token": {
18
+ "content": "<|im_end|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "pad_token": {
25
+ "content": "<|endoftext|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ }
31
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:091aa7594dc2fcfbfa06b9e3c22a5f0562ac14f30375c13af7309407a0e67b8a
3
+ size 11420371
tokenizer_config.json ADDED
@@ -0,0 +1,145 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "151643": {
5
+ "content": "<|endoftext|>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "151644": {
13
+ "content": "<|im_start|>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "151645": {
21
+ "content": "<|im_end|>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ },
28
+ "151646": {
29
+ "content": "<|object_ref_start|>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": true
35
+ },
36
+ "151647": {
37
+ "content": "<|object_ref_end|>",
38
+ "lstrip": false,
39
+ "normalized": false,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": true
43
+ },
44
+ "151648": {
45
+ "content": "<|box_start|>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false,
50
+ "special": true
51
+ },
52
+ "151649": {
53
+ "content": "<|box_end|>",
54
+ "lstrip": false,
55
+ "normalized": false,
56
+ "rstrip": false,
57
+ "single_word": false,
58
+ "special": true
59
+ },
60
+ "151650": {
61
+ "content": "<|quad_start|>",
62
+ "lstrip": false,
63
+ "normalized": false,
64
+ "rstrip": false,
65
+ "single_word": false,
66
+ "special": true
67
+ },
68
+ "151651": {
69
+ "content": "<|quad_end|>",
70
+ "lstrip": false,
71
+ "normalized": false,
72
+ "rstrip": false,
73
+ "single_word": false,
74
+ "special": true
75
+ },
76
+ "151652": {
77
+ "content": "<|vision_start|>",
78
+ "lstrip": false,
79
+ "normalized": false,
80
+ "rstrip": false,
81
+ "single_word": false,
82
+ "special": true
83
+ },
84
+ "151653": {
85
+ "content": "<|vision_end|>",
86
+ "lstrip": false,
87
+ "normalized": false,
88
+ "rstrip": false,
89
+ "single_word": false,
90
+ "special": true
91
+ },
92
+ "151654": {
93
+ "content": "<|vision_pad|>",
94
+ "lstrip": false,
95
+ "normalized": false,
96
+ "rstrip": false,
97
+ "single_word": false,
98
+ "special": true
99
+ },
100
+ "151655": {
101
+ "content": "<|image_pad|>",
102
+ "lstrip": false,
103
+ "normalized": false,
104
+ "rstrip": false,
105
+ "single_word": false,
106
+ "special": true
107
+ },
108
+ "151656": {
109
+ "content": "<|video_pad|>",
110
+ "lstrip": false,
111
+ "normalized": false,
112
+ "rstrip": false,
113
+ "single_word": false,
114
+ "special": true
115
+ }
116
+ },
117
+ "additional_special_tokens": [
118
+ "<|im_start|>",
119
+ "<|im_end|>",
120
+ "<|object_ref_start|>",
121
+ "<|object_ref_end|>",
122
+ "<|box_start|>",
123
+ "<|box_end|>",
124
+ "<|quad_start|>",
125
+ "<|quad_end|>",
126
+ "<|vision_start|>",
127
+ "<|vision_end|>",
128
+ "<|vision_pad|>",
129
+ "<|image_pad|>",
130
+ "<|video_pad|>"
131
+ ],
132
+ "bos_token": null,
133
+ "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}",
134
+ "clean_up_tokenization_spaces": false,
135
+ "eos_token": "<|im_end|>",
136
+ "errors": "replace",
137
+ "extra_special_tokens": {},
138
+ "model_max_length": 32768,
139
+ "pad_token": "<|endoftext|>",
140
+ "padding_side": "left",
141
+ "processor_class": "Qwen2VLProcessor",
142
+ "split_special_tokens": false,
143
+ "tokenizer_class": "Qwen2Tokenizer",
144
+ "unk_token": null
145
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff