We fine-tuned google/siglip2-base-patch16-256 on WAON, a large-scale, high-quality Japanese image-text pair dataset. The resulting model achieves state-of-the-art performance on WAON-Bench, a Japanese cultural image classification benchmark.
It also achieves the best scores among the compared models on the Japanese benchmarks (Recruit and WAON-Bench).
| Model | Params | XM3600 | ImageNet | Recruit | WAON-Bench | Avg |
|---|---|---|---|---|---|---|
| siglip2-base-patch16-256 (fine-tuned on WAON) | 375M | 73.75 | 49.61 | 83.14 | 94.97 | 75.37 |
| siglip2-base-patch16-256 (fine-tuned on ReLAION) | 375M | 72.39 | 47.38 | 81.65 | 92.99 | 73.60 |
| siglip2-base-patch16-256 | 375M | 38.28 | 48.12 | 76.98 | 87.81 | 62.80 |
| clip-japanese-base | 196M | 78.00 | 48.90 | 81.65 | 90.05 | 74.65 |
| siglip-base-patch16-256-mult | 371M | 43.22 | 53.26 | 75.10 | 89.25 | 65.21 |
| Japanese Stable CLIP ViT-L-16 | 414M | 66.03 | 55.97 | 71.29 | 82.03 | 68.83 |
| LAION-CLIP-ViT-H-14 | 1193M | 72.64 | 47.67 | 70.62 | 85.88 | 69.20 |
Here is a sample code snippet for zero-shot image classification:
```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

ckpt = "llm-jp/waon-siglip2-base-patch16-256"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

url = "https://upload.wikimedia.org/wikipedia/commons/5/58/Shiba_inu_taiki.jpg"
image = Image.open(requests.get(url, stream=True, headers={"User-Agent": "Mozilla/5.0"}).raw).convert("RGB")

candidate_labels = ["柴犬", "日本猫", "うさぎ"]

# IMPORTANT: pass padding="max_length" and max_length=64, since the model was trained with these settings
inputs = processor(text=candidate_labels, images=image, padding="max_length", max_length=64, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
for i, label in enumerate(candidate_labels):
    print(f"prob that image is '{label}': {probs[0][i]:.2%}")
# prob that image is '柴犬': 96.57%
# prob that image is '日本猫': 0.03%
# prob that image is 'うさぎ': 0.00%
```
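The same checkpoint can also be used to extract image and text embeddings, for example for Japanese image-text retrieval. The sketch below continues from the snippet above (reusing `model`, `processor`, and `image`); the Japanese captions are illustrative examples, not taken from WAON.

```python
# Illustrative captions (hypothetical examples, not from the WAON dataset)
texts = ["草の上に座っている柴犬", "東京タワーの夜景"]

# Encode the texts and the image separately
text_inputs = processor(text=texts, padding="max_length", max_length=64, return_tensors="pt").to(model.device)
image_inputs = processor(images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)
    image_embeds = model.get_image_features(**image_inputs)

# L2-normalize and compute cosine similarity between the image and each caption
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
similarity = image_embeds @ text_embeds.T  # shape: (1, num_texts); higher means a closer match
print(similarity)
```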
For more information, please read the SigLIP2 documentation.
Citation:

```bibtex
@misc{sugiura2025waonlargescalehighqualityjapanese,
      title={WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models},
      author={Issa Sugiura and Shuhei Kurita and Yusuke Oda and Daisuke Kawahara and Yasuo Okabe and Naoaki Okazaki},
      year={2025},
      eprint={2510.22276},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.22276},
}
```
Base model: google/siglip2-base-patch16-256