## Model Details

## Training Datasets

  • Pretrain: LLaVA 595k
  • Fine-tune: LLaVA 665k

## Evaluation

So far, we have evaluated RWKV7 SigLIP2 on four benchmarks proposed for instruction-following LMMs. More benchmark results will be released soon.

### Benchmarks

| Encoder | LLM | VQAv2 | TextVQA | GQA | ScienceQA |
|---------|-----|-------|---------|-----|-----------|
| SigLIP2 | RWKV7-0.4B | 72.04 | 38.75 | 55.52 | 43.32 |
### Inference

```python
from infer.worldmodel import Worldinfer
from PIL import Image

llm_path = 'WorldRWKV/RWKV7-0.4B-siglip2/rwkv-0'  # local model path
encoder_path = 'google/siglip2-base-patch16-384'
encoder_type = 'siglip'

model = Worldinfer(model_path=llm_path, encoder_type=encoder_type, encoder_path=encoder_path)

# Load the image and make sure it is in RGB mode before encoding.
img_path = './docs/03-Confusing-Pictures.jpg'
image = Image.open(img_path).convert('RGB')

text = '\x16User: What is unusual about this image?\x17Assistant:'

result = model.generate(text, image)
print(result)
```
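The prompt above wraps the user turn in the control characters `\x16` and `\x17`. A minimal sketch of a helper that builds prompts in this format, assuming (based on the example above; check the WorldRWKV repo for the authoritative template) that every user turn follows the same `\x16User: ...\x17Assistant:` pattern:

```python
# Sketch of a prompt builder for the chat format shown above.
# Assumption: \x16 opens a user turn and \x17 closes it, immediately
# followed by the "Assistant:" tag where generation begins.
def build_prompt(question: str) -> str:
    return f"\x16User: {question}\x17Assistant:"

prompt = build_prompt("What is unusual about this image?")
print(repr(prompt))
```

`build_prompt` is a hypothetical convenience wrapper, not part of the `infer.worldmodel` API; the raw string in the inference example works just as well.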
    