# Long-CLIP
This repository is the official implementation of Long-CLIP.

**Long-CLIP: Unlocking the Long-Text Capability of CLIP**\
[Beichen Zhang](https://beichenzbc.github.io), [Pan Zhang](https://panzhang0212.github.io/), [Xiaoyi Dong](https://lightdxy.github.io/), [Yuhang Zang](https://yuhangzang.github.io/), [Jiaqi Wang](https://myownskyw7.github.io/)

## πŸ’‘ Highlights
- πŸ”₯ **Long Input Length** Increases the maximum input length of CLIP from **77** to **248** tokens.
- πŸ”₯ **Strong Performance** Improves R@5 for long-caption text-image retrieval by **20%** and for traditional text-image retrieval by **6%**.
- πŸ”₯ **Plug-and-Play** Can be directly applied in **any work** that requires long-text capability.


## πŸ“œ News
πŸš€ [2024/7/3] Our paper has been accepted to ***ECCV 2024***.

πŸš€ [2024/7/3] We release the code for using Long-CLIP in ***SDXL***. For details, please refer to `SDXL/SDXL.md`.

πŸš€ [2024/5/21] We have updated the paper and checkpoints after fixing a bug in DDP, and added results on Urban-1k. Special thanks to @MajorDavidZhang for finding and fixing this bug! Fine-tuning now takes only ***0.5*** hours on *8 GPUs*!

πŸš€ [2024/5/21] Urban-1k, a scaled-up version of the Urban-200 dataset from the paper, has been released on this [page](https://huggingface.co/datasets/BeichenZhang/Urban1k).

πŸš€ [2024/4/1] The training code is released!

πŸš€ [2024/3/25] The inference code and models ([LongCLIP-B](https://huggingface.co/BeichenZhang/LongCLIP-B) and [LongCLIP-L](https://huggingface.co/BeichenZhang/LongCLIP-L)) are released!

πŸš€ [2024/3/25] The [paper](https://arxiv.org/abs/2403.15378) is released!

## πŸ‘¨β€πŸ’» Todo
- [x] Training code for Long-CLIP based on OpenAI-CLIP
- [x] Evaluation code for Long-CLIP
  - [x] Evaluation code for zero-shot classification and text-image retrieval tasks
- [x] Usage example of Long-CLIP
- [x] Checkpoints of Long-CLIP


## πŸ› οΈ Usage

### Installation

Our model is based on [CLIP](https://github.com/openai/CLIP); please set up the environment following the CLIP repository's instructions.
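
For reference, here is a minimal sketch of an environment setup in the spirit of the CLIP instructions (the environment name and package versions are assumptions; adjust them to your CUDA setup):

```shell
# Assumed setup following the OpenAI CLIP instructions; versions are illustrative.
conda create -n longclip python=3.9 -y
conda activate longclip
pip install torch torchvision   # pick a build matching your CUDA version
pip install ftfy regex tqdm     # tokenizer and utility dependencies used by CLIP
```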


### How to use

Please first clone our [repo](https://github.com/beichenzbc/Long-CLIP) from GitHub by running the following commands.

```shell
git clone https://github.com/beichenzbc/Long-CLIP.git
cd Long-CLIP
```

Then, download the checkpoints of our model, [LongCLIP-B](https://huggingface.co/BeichenZhang/LongCLIP-B) and/or [LongCLIP-L](https://huggingface.co/BeichenZhang/LongCLIP-L), and place them under `./checkpoints`.
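
For example, one possible way to fetch LongCLIP-B is with the `huggingface-cli` tool (this assumes the checkpoint file is named `longclip-B.pt`, as in the usage example below; you can also download it manually from the model page):

```shell
pip install -U huggingface_hub   # provides the huggingface-cli tool
mkdir -p checkpoints
huggingface-cli download BeichenZhang/LongCLIP-B longclip-B.pt --local-dir ./checkpoints
```

With the checkpoint in place, the model can be used as follows: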

```python
from model import longclip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = longclip.load("./checkpoints/longclip-B.pt", device=device)

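# Tokenize the captions; Long-CLIP accepts long inputs of up to 248 tokens (vs. 77 in the original CLIP).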
text = longclip.tokenize(["A man is crossing the street with a red car parked nearby.", "A man is driving a car in an urban scene."]).to(device)
image = preprocess(Image.open("./img/demo.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

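# Rank the two captions for the image via feature dot products.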
logits_per_image = image_features @ text_features.T
probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)
```
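
The example above ranks the captions with raw feature dot products. If you prefer scale-invariant scores, a minimal optional variant is to L2-normalize the features and inspect cosine similarities instead:

```python
# Optional: compare the image and captions via cosine similarity.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
cosine_sim = (image_features @ text_features.T).cpu().numpy()

print("Cosine similarities:", cosine_sim)
```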

### Evaluation
#### Zero-shot classification

To run zero-shot classification on the ImageNet dataset, run the following commands after preparing the data:
```shell
cd eval/classification/imagenet
python imagenet.py
```

Similarly, run the following commands for the CIFAR datasets:
```shell
cd eval/classification/cifar
python cifar10.py   # CIFAR-10
python cifar100.py  # CIFAR-100
```

#### Retrieval
To run text-image retrieval on COCO2017 or Flickr30k, run the following commands after preparing the data:
```shell
cd eval/retrieval
python coco.py       # COCO2017
python flickr30k.py  # Flickr30k
```
### Training
Please refer to `train/train.md` for training details.

## ⭐ Demos
### Long-CLIP-SDXL
<p align="center"> <a>
<img src="./img/demo_SDXL.png" width="900" />
</a> </p>

### Long-caption text-image retrieval
<p align="center"> <a>
<img src="./img/retrieval.png" width="900" />
</a> </p>

### Plug-and-Play text-to-image generation
<p align="center"> <a>
<img src="./img/generation.png" width="900" />
</a> </p>


## Citation
If you find our work helpful for your research, please consider giving a citation:
```
@article{zhang2024longclip,
  title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
  author={Beichen Zhang and Pan Zhang and Xiaoyi Dong and Yuhang Zang and Jiaqi Wang},
  journal={arXiv preprint arXiv:2403.15378},
  year={2024}
}
```
