KARAKURI VL 32B Thinking 2507 Experimental

Note: This is an experimental model that generates reasoning traces within <think> tags before providing final answers. The model may occasionally produce incomplete responses or unclosed tags.

Model Details

Model Description

Developed by: KARAKURI Inc.
Model type: Vision-Language Model
Languages: Japanese and English
License: Apache 2.0
Finetuned from model: KARAKURI VL 32B Instruct 2507
Contact: For questions and comments about the model, please email [email protected]
Demo: https://vl.karakuri.cc/

Usage

Recommended System Prompt

We strongly recommend using the system prompt below, which was used during reinforcement learning. It helps stabilize the model's behavior and makes the model more likely to close its <think> tags properly.

Important Notes:

If you customize the system prompt for your use case, use the prompt below as the base.
Depending on the system prompt, a response may end without closing the <think> tag.
The content within <think> tags is the model's internal reasoning process.

あなたは、ユーザーの意図を深く理解し、多角的な視点から考察し、具体的で実践的な情報を提供することを目指す、高度なAIアシスタントです。

あなたの応答は、以下の2つの主要な部分で構成されます。

1. **思考プロセス (<think>タグ内):**
    - ユーザーの質問や要求の核心を特定します。
    - 関連情報や考慮すべき点を網羅的に洗い出します。
    - 問題を解決するための複数のアプローチや選択肢を検討し、それぞれの利点と欠点を比較考察します（必要な場合）。
    - **深く時間をかけて考察し**、様々な視点や可能性を検討してください。急がずに、丁寧な思考を心がけてください。
    - 結論に至るまでの論理的なステップを、段階的かつ明確に記述します。思考の深さを示すために、なぜそのように考えるのか、どのような前提に基づいているのかも適宜含めてください。
    - **必要に応じて、異なる角度から検証したり、提案内容の妥当性を確認したりしてください。**

2. **ユーザーへの最終回答:**
    - **注意：ユーザーには最終回答のみが提供され、思考プロセスは見えません。したがって、最終回答は思考プロセスの要約ではなく、それ単体で自己完結した内容である必要があります。**
    - 思考プロセスで得られた洞察に基づき、ユーザーにとって最も価値のある情報を提供します。
    - 回答は、明確で、構造化され、理解しやすい言葉遣いを心がけてください。
    - 単に情報を提供するだけでなく、ユーザーが次にとるべき行動を具体的にイメージできるよう、実践的なアドバイスや提案を含めるように努めてください。
    - 常に親切で、丁寧なコミュニケーションを心がけてください。
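One way to customize the prompt while keeping the recommended text as the base is to append use-case instructions after it. The sketch below is a minimal illustration, not a documented method: the helper name `customize_system_prompt`, the choice to append rather than edit in place, and the example instruction are all assumptions.

```python
def customize_system_prompt(base_prompt: str, extra_instructions: str) -> str:
    """Append use-case-specific instructions to the recommended base prompt.

    Hypothetical helper: the base prompt is kept verbatim and the extra
    instructions are added as a final paragraph.
    """
    extra = extra_instructions.strip()
    if not extra:
        # Nothing to add: return the base prompt unchanged.
        return base_prompt
    return f"{base_prompt.rstrip()}\n\n{extra}"


# Placeholder only: paste the full recommended system prompt here.
base_prompt = "<recommended system prompt>"

# Illustrative instruction; replace with the needs of your own use case.
system_prompt = customize_system_prompt(
    base_prompt, "Keep the final answer under 200 words."
)
print(system_prompt)
```

Because a modified prompt may make unclosed <think> tags more likely, it is worth checking a few responses after any customization.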

Use in 🤗 Transformers

First, install the required dependencies:

pip install transformers accelerate "qwen-vl-utils[decord]==0.0.8"

Then, use the following code to load the model and generate responses:

from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

model_name = "karakuri-ai/karakuri-vl-32b-thinking-2507-exp"
model = AutoModelForImageTextToText.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

system_prompt = """あなたは、ユーザーの意図を深く理解し、多角的な視点から考察し、具体的で実践的な情報を提供することを目指す、高度なAIアシスタントです。

あなたの応答は、以下の2つの主要な部分で構成されます。

1. **思考プロセス (<think>タグ内):**
    - ユーザーの質問や要求の核心を特定します。
    - 関連情報や考慮すべき点を網羅的に洗い出します。
    - 問題を解決するための複数のアプローチや選択肢を検討し、それぞれの利点と欠点を比較考察します（必要な場合）。
    - **深く時間をかけて考察し**、様々な視点や可能性を検討してください。急がずに、丁寧な思考を心がけてください。
    - 結論に至るまでの論理的なステップを、段階的かつ明確に記述します。思考の深さを示すために、なぜそのように考えるのか、どのような前提に基づいているのかも適宜含めてください。
    - **必要に応じて、異なる角度から検証したり、提案内容の妥当性を確認したりしてください。**

2. **ユーザーへの最終回答:**
    - **注意：ユーザーには最終回答のみが提供され、思考プロセスは見えません。したがって、最終回答は思考プロセスの要約ではなく、それ単体で自己完結した内容である必要があります。**
    - 思考プロセスで得られた洞察に基づき、ユーザーにとって最も価値のある情報を提供します。
    - 回答は、明確で、構造化され、理解しやすい言葉遣いを心がけてください。
    - 単に情報を提供するだけでなく、ユーザーが次にとるべき行動を具体的にイメージできるよう、実践的なアドバイスや提案を含めるように努めてください。
    - 常に親切で、丁寧なコミュニケーションを心がけてください。"""

messages = [
    {
        "role": "system",
        "content": system_prompt,
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

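# Inference: generate the output.
# The model writes its <think> trace before the answer, so a small budget such
# as 128 tokens can cut the response off mid-reasoning. 2048 is an
# illustrative value; adjust it to your task.
generated_ids = model.generate(**inputs, max_new_tokens=2048)
# Drop the prompt tokens so that only newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]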
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
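Since the reasoning trace and the final answer arrive in one string, and the closing tag may be missing, it can help to separate them after decoding. The following is a minimal sketch rather than part of the model's tooling: the helper name `split_think` is hypothetical, and it assumes the decoded text contains literal `<think>` and `</think>` tags.

```python
def split_think(text: str) -> tuple[str, str, bool]:
    """Split decoded output into (reasoning, answer, complete).

    `complete` is False when a <think> tag was opened but never closed,
    in which case everything after the tag is treated as reasoning and
    the answer is empty. Text before the opening tag is ignored.
    """
    open_tag, close_tag = "<think>", "</think>"
    start = text.find(open_tag)
    if start == -1:
        # No reasoning trace found: treat the whole text as the answer.
        return "", text.strip(), True
    body = text[start + len(open_tag):]
    end = body.find(close_tag)
    if end == -1:
        # Unclosed tag, e.g. generation stopped at the token limit.
        return body.strip(), "", False
    return body[:end].strip(), body[end + len(close_tag):].strip(), True


# With the generation code above, one would call:
#     reasoning, answer, complete = split_think(output_text[0])
reasoning, answer, complete = split_think("<think>Check the image.</think>A dog on a beach.")
print(answer)
```

When `complete` is False, retrying with a larger `max_new_tokens` value is a reasonable first step.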

Training Details

Training Infrastructure

Hardware: The model was trained on 20 Amazon EC2 trn1.32xlarge instances (nodes).
Software: Training used code based on neuronx-nemo-megatron.

Acknowledgments

This work was supported by the Ministry of Economy, Trade and Industry (METI) and the New Energy and Industrial Technology Development Organization (NEDO) through the Generative AI Accelerator Challenge (GENIAC).

Citation

@misc{karakuri_vl_32b_thinking_2507_exp,
    author       = { {KARAKURI} {Inc.} },
    title        = { {KARAKURI} {VL} 32{B} {Thinking} 2507 {Experimental} },
    year         = { 2025 },
    url          = { https://huggingface.co/karakuri-ai/karakuri-vl-32b-thinking-2507-exp },
    publisher    = { {Hugging Face} },
    journal      = { {Hugging Face} repository }
}