Improve model card: Add pipeline tag, paper link, and usage example

#1
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +63 -19
README.md CHANGED
@@ -1,25 +1,69 @@
  ---
- license: apache-2.0
  datasets:
  - zhixiangwei/VLM-150M
  language:
  - en
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: zero-shot-image-classification
+ tags:
+ - clip
+ - zero-shot-image-classification
+ - image-retrieval
  ---
- Pretraining HQ-CLIP-B-16 on VLM-150M.
-
- |Dataset|Performance|
- |:----------------|---------:|
- | ImageNet 1k | 0.70556 |
- | ImageNet V2 | 0.6308 |
- | ImageNet-A | 0.391067 |
- | ImageNet-O | 0.4295 |
- | ImageNet-R | 0.801367 |
- | ImageNet Sketch | 0.573189 |
- | ObjectNet | 0.606439 |
- | IN-shifts | 0.57206 |
- | VTAB | 0.575571 |
- | MSCOCO | 0.521573 |
- | Flickr30k | 0.7786 |
- | WinoGAViL | 0.528097 |
- | Retrieval | 0.609423 |
- | Avg. | 0.585715 |
+
+ # HQ-CLIP: High-Quality CLIP Models
+
+ This repository hosts the **HQ-CLIP-B-16** model, presented in the paper [HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models](https://huggingface.co/papers/2507.22431).
+
+ HQ-CLIP is a new family of CLIP models trained on `VLM-150M`, a high-quality image-text dataset refined by a novel LVLM-driven data pipeline. The pipeline uses Large Vision-Language Models (LVLMs) to process images together with their raw alt-text and generate rich, multi-grained annotations (long and short positive/negative descriptions and tags). Trained on this refined data, HQ-CLIP achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding at a comparable training data scale, even surpassing standard CLIP models trained on datasets 10x larger.
+
+ - **Project Page**: [https://zxwei.site/hqclip/](https://zxwei.site/hqclip/)
+ - **Code & Data**: [https://github.com/zhixiangwei/HQ-CLIP](https://github.com/zhixiangwei/HQ-CLIP)
+
+ ## Performance
+
+ Performance of HQ-CLIP-B-16 on standard benchmarks:
+
+ | Dataset | Performance |
+ | :-------------- | ----------: |
+ | ImageNet 1k | 0.70556 |
+ | ImageNet V2 | 0.6308 |
+ | ImageNet-A | 0.391067 |
+ | ImageNet-O | 0.4295 |
+ | ImageNet-R | 0.801367 |
+ | ImageNet Sketch | 0.573189 |
+ | ObjectNet | 0.606439 |
+ | IN-shifts | 0.57206 |
+ | VTAB | 0.575571 |
+ | MSCOCO | 0.521573 |
+ | Flickr30k | 0.7786 |
+ | WinoGAViL | 0.528097 |
+ | Retrieval | 0.609423 |
+ | Avg. | 0.585715 |
+
+ ## Usage
+
+ You can use this model for zero-shot image classification with the 🤗 Transformers library:
+
+ ```python
+ from PIL import Image
+ import requests
+ from transformers import CLIPProcessor, CLIPModel
+
+ # Load the model and its paired processor
+ model = CLIPModel.from_pretrained("zhixiangwei/hqclip-B-16")
+ processor = CLIPProcessor.from_pretrained("zhixiangwei/hqclip-B-16")
+
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ candidate_labels = ["a photo of a cat", "a photo of a dog", "a photo of a squirrel"]
+ inputs = processor(text=candidate_labels, images=image, return_tensors="pt", padding=True)
+
+ outputs = model(**inputs)
+ logits_per_image = outputs.logits_per_image  # image-text similarity scores
+ probs = logits_per_image.softmax(dim=1)  # softmax over the candidate labels gives probabilities
+
+ print(f"Probabilities: {probs}")
+ # Expected: tensor([[0.9996, 0.0004, 0.0000]]) - indicating it's very likely a cat
+ ```
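+
+ Alternatively, the checkpoint can be queried through the `zero-shot-image-classification` pipeline. The snippet below is a minimal sketch and assumes the same `zhixiangwei/hqclip-B-16` repo id as above loads cleanly with the pipeline API:
+
+ ```python
+ from transformers import pipeline
+
+ # Build a zero-shot image classification pipeline from the checkpoint
+ classifier = pipeline("zero-shot-image-classification", model="zhixiangwei/hqclip-B-16")
+
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ results = classifier(url, candidate_labels=["a photo of a cat", "a photo of a dog", "a photo of a squirrel"])
+
+ # Each result is a dict with a label and its score
+ for result in results:
+     print(f"{result['label']}: {result['score']:.4f}")
+ ```
+
+ The pipeline returns the candidate labels sorted by score, so the first entry is the best match.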