Improve model card: Add pipeline tag, paper link, and usage example

#1
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +63 -19
README.md CHANGED
@@ -1,25 +1,69 @@
  ---
- license: apache-2.0
  datasets:
  - zhixiangwei/VLM-150M
  language:
  - en
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: zero-shot-image-classification
+ tags:
+ - clip
+ - zero-shot-image-classification
+ - image-retrieval
  ---
- Pretraining HQ-CLIP-B-16 on VLM-150M.
-
- |Dataset|Performance|
- |:----------------|---------:|
- | ImageNet 1k | 0.70556 |
- | ImageNet V2 | 0.6308 |
- | ImageNet-A | 0.391067 |
- | ImageNet-O | 0.4295 |
- | ImageNet-R | 0.801367 |
- | ImageNet Sketch | 0.573189 |
- | ObjectNet | 0.606439 |
- | IN-shifts | 0.57206 |
- | VTAB | 0.575571 |
- | MSCOCO | 0.521573 |
- | Flickr30k | 0.7786 |
- | WinoGAViL | 0.528097 |
- | Retrieval | 0.609423 |
- | Avg. | 0.585715 |
+
+ # HQ-CLIP: High-Quality CLIP Models
+
+ This repository hosts the **HQ-CLIP-B-16** model, presented in the paper [HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models](https://huggingface.co/papers/2507.22431).
+
+ HQ-CLIP is a new family of CLIP models trained on `VLM-150M`, a high-quality image-text dataset refined by a novel LVLM-driven data pipeline. The pipeline uses Large Vision-Language Models (LVLMs) to process images together with their raw alt-text and generate rich, multi-grained annotations (long and short positive/negative descriptions and tags). Trained on this refined data, HQ-CLIP achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding at a comparable training data scale, even surpassing standard CLIP models trained on datasets 10x larger.
+
+ - **Project Page**: [https://zxwei.site/hqclip/](https://zxwei.site/hqclip/)
+ - **Code & Data**: [https://github.com/zhixiangwei/HQ-CLIP](https://github.com/zhixiangwei/HQ-CLIP)
+
+ ## Performance
+
+ Performance of HQ-CLIP-B-16 on standard benchmarks:
+
+ | Dataset | Performance |
+ | :-------------- | ----------: |
+ | ImageNet 1k | 0.70556 |
+ | ImageNet V2 | 0.6308 |
+ | ImageNet-A | 0.391067 |
+ | ImageNet-O | 0.4295 |
+ | ImageNet-R | 0.801367 |
+ | ImageNet Sketch | 0.573189 |
+ | ObjectNet | 0.606439 |
+ | IN-shifts | 0.57206 |
+ | VTAB | 0.575571 |
+ | MSCOCO | 0.521573 |
+ | Flickr30k | 0.7786 |
+ | WinoGAViL | 0.528097 |
+ | Retrieval | 0.609423 |
+ | Avg. | 0.585715 |
+
+ ## Usage
+
+ You can use this model for zero-shot image classification with the 🤗 Transformers library:
+
+ ```python
+ from PIL import Image
+ import requests
+ from transformers import CLIPProcessor, CLIPModel
+
+ # Load the model and its paired processor
+ model = CLIPModel.from_pretrained("zhixiangwei/hqclip-B-16")
+ processor = CLIPProcessor.from_pretrained("zhixiangwei/hqclip-B-16")
+
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ candidate_labels = ["a photo of a cat", "a photo of a dog", "a photo of a squirrel"]
+ inputs = processor(text=candidate_labels, images=image, return_tensors="pt", padding=True)
+
+ outputs = model(**inputs)
+ logits_per_image = outputs.logits_per_image  # image-text similarity scores
+ probs = logits_per_image.softmax(dim=1)  # softmax over the candidate labels gives probabilities
+
+ print(f"Probabilities: {probs}")
+ # Expected: tensor([[0.9996, 0.0004, 0.0000]]) - indicating it's very likely a cat
+ ```
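+
+ Alternatively, the checkpoint can be queried through the `zero-shot-image-classification` pipeline. The snippet below is a minimal sketch and assumes the same `zhixiangwei/hqclip-B-16` repo id as above loads cleanly with the pipeline API:
+
+ ```python
+ from transformers import pipeline
+
+ # Build a zero-shot image classification pipeline from the checkpoint
+ classifier = pipeline("zero-shot-image-classification", model="zhixiangwei/hqclip-B-16")
+
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ results = classifier(url, candidate_labels=["a photo of a cat", "a photo of a dog", "a photo of a squirrel"])
+
+ # Each result is a dict with a label and its score
+ for result in results:
+     print(f"{result['label']}: {result['score']:.4f}")
+ ```
+
+ The pipeline returns the candidate labels sorted by score, so the first entry is the best match.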