We fine-tuned google/siglip2-base-patch16-256 on WAON, a large-scale, high-quality Japanese image-text pair dataset. The resulting model achieves state-of-the-art performance on WAON-Bench, a Japanese cultural image classification benchmark.
It also achieves the best scores among the compared models on the Japanese benchmarks (Recruit and WAON-Bench).
| Model | Params | XM3600 | ImageNet | Recruit | WAON-Bench | Avg |
|---|---|---|---|---|---|---|
| siglip2-base-patch16-256 (fine-tuned on WAON) | 375M | 73.75 | 49.61 | 83.14 | 94.97 | 75.37 |
| siglip2-base-patch16-256 (fine-tuned on ReLAION) | 375M | 72.39 | 47.38 | 81.65 | 92.99 | 73.60 |
| siglip2-base-patch16-256 | 375M | 38.28 | 48.12 | 76.98 | 87.81 | 62.80 |
| clip-japanese-base | 196M | 78.00 | 48.90 | 81.65 | 90.05 | 74.65 |
| siglip-base-patch16-256-mult | 371M | 43.22 | 53.26 | 75.10 | 89.25 | 65.21 |
| Japanese Stable CLIP ViT-L-16 | 414M | 66.03 | 55.97 | 71.29 | 82.03 | 68.83 |
| LAION-CLIP-ViT-H-14 | 1193M | 72.64 | 47.67 | 70.62 | 85.88 | 69.20 |
Here is a sample code snippet for zero-shot image classification:
```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

ckpt = "llm-jp/waon-siglip2-base-patch16-256"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

url = "https://upload.wikimedia.org/wikipedia/commons/5/58/Shiba_inu_taiki.jpg"
image = Image.open(requests.get(url, stream=True, headers={"User-Agent": "Mozilla/5.0"}).raw).convert("RGB")

candidate_labels = ["柴犬", "日本猫", "うさぎ"]

# IMPORTANT: pass padding="max_length" and max_length=64, since the model was trained with these settings
inputs = processor(text=candidate_labels, images=image, padding="max_length", max_length=64, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
for i, label in enumerate(candidate_labels):
    print(f"prob that image is '{label}': {probs[0][i]:.2%}")
# prob that image is '柴犬': 96.57%
# prob that image is '日本猫': 0.03%
# prob that image is 'うさぎ': 0.00%
```
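The same checkpoint can also be used to extract image and text embeddings, for example for Japanese image-text retrieval. The sketch below continues from the snippet above (reusing `model`, `processor`, and `image`); the Japanese captions are illustrative examples, not taken from WAON.

```python
# Illustrative captions (hypothetical examples, not from the WAON dataset)
texts = ["草の上に座っている柴犬", "東京タワーの夜景"]

# Encode the texts and the image separately
text_inputs = processor(text=texts, padding="max_length", max_length=64, return_tensors="pt").to(model.device)
image_inputs = processor(images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)
    image_embeds = model.get_image_features(**image_inputs)

# L2-normalize and compute cosine similarity between the image and each caption
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
similarity = image_embeds @ text_embeds.T  # shape: (1, num_texts); higher means a closer match
print(similarity)
```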
For more information, please read the SigLIP2 documentation.
Citation:

```bibtex
@misc{sugiura2025waonlargescalehighqualityjapanese,
      title={WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models},
      author={Issa Sugiura and Shuhei Kurita and Yusuke Oda and Daisuke Kawahara and Yasuo Okabe and Naoaki Okazaki},
      year={2025},
      eprint={2510.22276},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.22276},
}
```
Base model: google/siglip2-base-patch16-256