Improve model card: Add pipeline tag, paper link, and usage example
#1 opened by nielsr (HF Staff)

README.md CHANGED
@@ -1,25 +1,69 @@
 ---
-license: apache-2.0
 datasets:
 - zhixiangwei/VLM-150M
 language:
 - en
+license: apache-2.0
+library_name: transformers
+pipeline_tag: zero-shot-image-classification
+tags:
+- clip
+- zero-shot-image-classification
+- image-retrieval
 ---

# HQ-CLIP: High-Quality CLIP Models

This repository hosts the **HQ-CLIP-B-16** model, presented in the paper [HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models](https://huggingface.co/papers/2507.22431).

HQ-CLIP is a family of CLIP models trained on `VLM-150M`, a high-quality image-text dataset refined by an LVLM-driven data pipeline. The pipeline uses Large Vision-Language Models (LVLMs) to process each image together with its raw alt-text and generate rich, multi-grained annotations (long and short descriptions, positive and negative, plus tags). Trained on this refined data, HQ-CLIP achieves state-of-the-art results in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding at a comparable training data scale, and even surpasses standard CLIP models trained on datasets 10x larger.

**Project Page**: [https://zxwei.site/hqclip/](https://zxwei.site/hqclip/)
**Code & Data**: [https://github.com/zhixiangwei/HQ-CLIP](https://github.com/zhixiangwei/HQ-CLIP)

## Performance

The model's performance on various benchmarks:

| Dataset         | Performance |
| :-------------- | ----------: |
| ImageNet 1k     | 0.70556     |
| ImageNet V2     | 0.6308      |
| ImageNet-A      | 0.391067    |
| ImageNet-O      | 0.4295      |
| ImageNet-R      | 0.801367    |
| ImageNet Sketch | 0.573189    |
| ObjectNet       | 0.606439    |
| IN-shifts       | 0.57206     |
| VTAB            | 0.575571    |
| MSCOCO          | 0.521573    |
| Flickr30k       | 0.7786      |
| WinoGAViL       | 0.528097    |
| Retrieval       | 0.609423    |
| Avg.            | 0.585715    |

## Usage

You can use this model for zero-shot image classification with the 🤗 Transformers library:

```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("zhixiangwei/hqclip-B-16")
processor = CLIPProcessor.from_pretrained("zhixiangwei/hqclip-B-16")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

candidate_labels = ["a photo of a cat", "a photo of a dog", "a photo of a squirrel"]
inputs = processor(text=candidate_labels, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)      # softmax over the candidate labels

print(f"Probabilities: {probs}")
# Expected: tensor([[0.9996, 0.0004, 0.0000]]) - indicating it's very likely a cat
```
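
Since the card sets `pipeline_tag: zero-shot-image-classification`, the same checkpoint should also be usable through the high-level `pipeline` API. The snippet below is a minimal sketch under that assumption (same repo id and labels as above), not an official example from the model authors:

```python
from transformers import pipeline

# Assumes the repo ships a standard CLIP config and processor,
# as implied by the card's pipeline_tag.
classifier = pipeline(
    "zero-shot-image-classification",
    model="zhixiangwei/hqclip-B-16",
)

results = classifier(
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    candidate_labels=["a photo of a cat", "a photo of a dog", "a photo of a squirrel"],
)
print(results)  # list of {"score": ..., "label": ...} dicts, sorted by score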
```
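
The card also reports cross-modal retrieval results (MSCOCO, Flickr30k). For retrieval-style use you can embed images and texts separately and rank them by cosine similarity; the sketch below is generic CLIP usage with the same repo id and made-up captions, not the authors' evaluation code:

```python
import torch
import requests
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("zhixiangwei/hqclip-B-16")
processor = CLIPProcessor.from_pretrained("zhixiangwei/hqclip-B-16")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
captions = ["two cats lying on a couch", "a plate of food", "a city skyline at night"]

with torch.no_grad():
    image_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=captions, return_tensors="pt", padding=True))

# L2-normalize and rank captions by cosine similarity to the image
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T  # shape (1, num_captions)
print(similarity)  # the matching caption should get the highest score
```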