teowu committed

Commit 5012bd6 · verified · 1 Parent(s): b39889c

Update README.md

Files changed (1)
1. README.md +4 -1
README.md CHANGED
@@ -63,7 +63,6 @@ The model adopts an MoE language model, a native-resolution visual encoder (Moon
 > - For **Thinking models**, it is recommended to use `Temperature = 0.6`.
 > - For **Instruct models**, it is recommended to use `Temperature = 0.2`.
 
-
 ## Performance
 
 As an efficient model, Kimi-VL can robustly handle diverse tasks (fine-grained perception, math, college-level problems, OCR, agent, etc.) across a broad spectrum of input forms (single-image, multi-image, video, long-document, etc.).
@@ -132,6 +131,10 @@ Full comparison (GPT-4o included for reference):
 
 ### Inference with 🤗 Hugging Face Transformers
 
+> [!Note]
+> Recommended prompt for OS agent tasks (expected output is a point):
+> - `Please observe the screenshot, please locate the following elements with action and point.<instruction> [YOUR INSTRUCTION]`
+
 We introduce how to use our model at the inference stage with the transformers library. It is recommended to use python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.
 
 ```python
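
Putting the pieces together, the sketch below shows one way the added OS-agent prompt and the `Temperature = 0.2` recommendation for Instruct models could be used with the `transformers` inference path the README describes. It is a minimal sketch, not the code from this commit: the checkpoint id `moonshotai/Kimi-VL-A3B-Instruct`, the screenshot path, and the instruction text are assumed placeholders.

```python
# Minimal sketch (not from this commit): querying Kimi-VL with the
# recommended OS-agent prompt. The checkpoint id, screenshot path, and
# instruction below are hypothetical placeholders.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Instruct"  # assumed checkpoint id
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,  # the repo ships custom modeling/processing code
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# OS-agent prompt format from the added note; the instruction is a placeholder.
prompt = (
    "Please observe the screenshot, please locate the following elements "
    "with action and point.<instruction> open the Settings app"
)
image = Image.open("screenshot.png")  # placeholder path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "screenshot.png"},
            {"type": "text", "text": prompt},
        ],
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)

# Temperature = 0.2 per the README's recommendation for Instruct models.
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]  # strip the prompt tokens
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```

For a Thinking model, the same call would use `temperature=0.6`, per the recommendation quoted in the first hunk above.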