kellempxt committed
Commit 4a0ac25 Β· verified Β· 1 Parent(s): 1a291a6

Delete read.yaml

Files changed (1): read.yaml +0 -127
read.yaml DELETED
@@ -1,127 +0,0 @@
# Long-CLIP
This repository is the official implementation of Long-CLIP.

**Long-CLIP: Unlocking the Long-Text Capability of CLIP**\
[Beichen Zhang](https://beichenzbc.github.io), [Pan Zhang](https://panzhang0212.github.io/), [Xiaoyi Dong](https://lightdxy.github.io/), [Yuhang Zang](https://yuhangzang.github.io/), [Jiaqi Wang](https://myownskyw7.github.io/)

## πŸ’‘ Highlights
- πŸ”₯ **Long input length** Increases the maximum text input length of CLIP from **77** to **248** tokens.
- πŸ”₯ **Strong performance** Improves the R@5 of long-caption text-image retrieval by **20%** and of traditional text-image retrieval by **6%**.
- πŸ”₯ **Plug-and-play** Can be directly applied in **any work** that requires long-text capability.

## πŸ“œ News
πŸš€ [2024/7/3] Our paper has been accepted by ***ECCV2024***.

πŸš€ [2024/7/3] We release the code for using Long-CLIP in ***SDXL***. For details, please refer to `SDXL/SDXL.md`.

πŸš€ [2024/5/21] We update the paper and checkpoints after fixing a bug in DDP, and add results on Urban-1k. Special thanks to @MajorDavidZhang for finding and fixing this bug in DDP! Fine-tuning now takes only ***0.5*** hours on *8 GPUs*!

πŸš€ [2024/5/21] Urban-1k, a scaled-up version of the Urban-200 dataset from the paper, has been released on this [page](https://huggingface.co/datasets/BeichenZhang/Urban1k).

πŸš€ [2024/4/1] The training code is released!

πŸš€ [2024/3/25] The inference code and models ([LongCLIP-B](https://huggingface.co/BeichenZhang/LongCLIP-B) and [LongCLIP-L](https://huggingface.co/BeichenZhang/LongCLIP-L)) are released!

πŸš€ [2024/3/25] The [paper](https://arxiv.org/abs/2403.15378) is released!

- ## πŸ‘¨β€πŸ’» Todo
29
- - [x] Training code for Long-CLIP based on OpenAI-CLIP
30
- - [x] Evaluation code for Long-CLIP
31
- - [x] evaluation code for zero-shot classification and text-image retrieval tasks.
32
- - [x] Usage example of Long-CLIP
33
- - [x] Checkpoints of Long-CLIP
34
-
35
-
36
- ## πŸ› οΈ Usage
37
-
38
- ### Installation
39
-
40
- Our model is based on [CLIP](https://github.com/openai/CLIP), please prepare environment for CLIP.
41
-
42
-
43
### How to use

Please first clone our [repo](https://github.com/beichenzbc/Long-CLIP) from GitHub by running the following command.

```shell
git clone https://github.com/beichenzbc/Long-CLIP.git
cd Long-CLIP
```

Then, download the checkpoints of our model, [LongCLIP-B](https://huggingface.co/BeichenZhang/LongCLIP-B) and/or [LongCLIP-L](https://huggingface.co/BeichenZhang/LongCLIP-L), and place them under `./checkpoints`.

```python
from model import longclip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the checkpoint downloaded above (LongCLIP-B here).
model, preprocess = longclip.load("./checkpoints/longclip-B.pt", device=device)

text = longclip.tokenize(["A man is crossing the street with a red car parked nearby.", "A man is driving a car in an urban scene."]).to(device)
image = preprocess(Image.open("./img/demo.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Image-to-text similarity, turned into probabilities over the two captions.
    logits_per_image = image_features @ text_features.T
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)
```
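The snippet above scores one image against two short captions. As a further illustration, the sketch below (not part of the official scripts) ranks a small folder of images against a single long caption using the same `longclip` API; the `./img/*.png` glob and the caption text are placeholders, and the features are L2-normalized so the scores are cosine similarities.

```python
import glob

import torch
from PIL import Image

from model import longclip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = longclip.load("./checkpoints/longclip-B.pt", device=device)

# One long caption (Long-CLIP accepts up to 248 tokens) and a placeholder image folder.
caption = ("A man in a dark jacket is crossing a quiet street at dusk, "
           "with a red car parked at the curb and brightly lit shop windows behind him.")
image_paths = sorted(glob.glob("./img/*.png"))

with torch.no_grad():
    text_features = model.encode_text(longclip.tokenize([caption]).to(device))
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    image_features = model.encode_image(images)

    # L2-normalize so the dot product below is cosine similarity.
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    scores = (image_features @ text_features.T).squeeze(-1)  # one score per image

for rank, idx in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(f"{rank}. {image_paths[idx]}  score={scores[idx].item():.4f}")
```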

### Evaluation
#### Zero-shot classification

To run zero-shot classification on the ImageNet dataset, run the following command after preparing the data:
```shell
cd eval/classification/imagenet
python imagenet.py
```
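The `imagenet.py` script handles the full benchmark protocol; for intuition only, here is a minimal, self-contained sketch of zero-shot classification with Long-CLIP on a hand-picked label set. The labels and the `"a photo of a {label}"` template are illustrative assumptions, not necessarily what the evaluation script uses.

```python
import torch
from PIL import Image

from model import longclip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = longclip.load("./checkpoints/longclip-B.pt", device=device)

# Illustrative label set and prompt template; imagenet.py defines the real ones.
labels = ["cat", "dog", "car"]
prompts = longclip.tokenize([f"a photo of a {label}" for label in labels]).to(device)
image = preprocess(Image.open("./img/demo.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (image_features @ text_features.T).softmax(dim=-1)

print({label: round(p, 4) for label, p in zip(labels, probs.squeeze(0).tolist())})
```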

Similarly, run the following commands for the CIFAR datasets:
```shell
cd eval/classification/cifar
python cifar10.py  # CIFAR-10
python cifar100.py  # CIFAR-100
```

#### Retrieval
To run text-image retrieval on COCO2017 or Flickr30k, run the following commands after preparing the data:
```shell
cd eval/retrieval
python coco.py  # COCO2017
python flickr30k.py  # Flickr30k
```
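For reference, the R@5 figures quoted in the highlights are recall@5: the fraction of queries whose ground-truth match appears among the top five retrieved items. Below is a minimal sketch of that metric, under the assumption that the similarity matrix is arranged so the ground-truth image for text query `i` is column `i` (the actual evaluation scripts may organize this differently).

```python
import torch

def recall_at_k(similarity: torch.Tensor, k: int = 5) -> float:
    """similarity[i, j] = score between text query i and image j;
    assumes the ground-truth image for query i sits at column i."""
    topk = similarity.topk(k, dim=1).indices                 # [num_queries, k]
    targets = torch.arange(similarity.size(0)).unsqueeze(1)  # [num_queries, 1]
    hits = (topk == targets).any(dim=1)
    return hits.float().mean().item()

# Toy example with random scores; the eval scripts build the real matrix
# from encode_text / encode_image outputs.
sim = torch.randn(100, 100)
print(f"R@5 = {recall_at_k(sim, k=5):.3f}")
```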

### Training
Please refer to `train/train.md` for training details.

## ⭐ Demos
### Long-CLIP-SDXL
<p align="center"> <a>
<img src="./img/demo_SDXL.png" width="900" />
</a> </p>

### Long-caption text-image retrieval
<p align="center"> <a>
<img src="./img/retrieval.png" width="900" />
</a> </p>

### Plug-and-play text-to-image generation
<p align="center"> <a>
<img src="./img/generation.png" width="900" />
</a> </p>

## Citation
If you find our work helpful for your research, please consider citing it:
```bibtex
@article{zhang2024longclip,
  title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
  author={Beichen Zhang and Pan Zhang and Xiaoyi Dong and Yuhang Zang and Jiaqi Wang},
  journal={arXiv preprint arXiv:2403.15378},
  year={2024}
}
```