---
license: apache-2.0
---

# Model Details

Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. It was introduced in "[Perception Encoder: The best visual embeddings are hidden inside the network](https://TBC)".

**Model Developer**: Meta

**Model Overview**: Perception Encoder (PE) is a family of large-scale vision encoder models with state-of-the-art performance on a large variety of vision tasks. By using a robust contrastive pretraining recipe and finetuning on synthetically aligned videos, PE not only outperforms all existing models on classification and retrieval, but also internally produces strong, general features that scale to downstream tasks. PE unlocks the ability for large-scale contrastive pretraining to transfer to downstream tasks with alignment tuning that capitalizes on those general features.

<img src="https://huggingface.co/facebook/PE-Core-G14-448/resolve/main/docs/pe_image1.png" style="width: 100%; margin: 0 auto; display: block;" />

| Scale | Tower | Params | Width | Depth | MLP | Heads | CLIP Dim | Resolution | Patch Size | Text Context Length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **B** | Vision | 0.09B | 768 | 12 | 3072 | 12 | 1024 | 384 | 16 | 32 |
| | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 384 | 16 | 32 |
| **L** | Vision | 0.32B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
| | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
| **G** | Vision | 1.88B | 1536 | 50 | 8960 | 16 | 1280 | 392 | 14 | 72 |
| | Text | 0.47B | 1280 | 24 | 5120 | 20 | 1280 | 392 | 14 | 72 |

# How to use

## PE codebase

We provide the pretraining code at https://github.com/meta-ai-research-fair/occhi.git:

```shell
git clone https://github.com/meta-ai-research-fair/occhi.git
cd occhi

conda create --name occhi-env python=3.12
conda activate occhi-env

# Install PyTorch
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124

# We use torchcodec for decoding videos into PyTorch tensors
conda install ffmpeg -c conda-forge
pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124

pip install -e .
```
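
After installation, a quick sanity check can confirm the environment is usable (a minimal sketch; the top-level module name `occhi` is assumed from the repository name):

```python
# Sanity check: verify PyTorch, CUDA, and the package import.
# NOTE: the module name `occhi` is assumed from the repo name above.
import torch
import occhi  # noqa: F401

print(torch.__version__)          # expect 2.5.1
print(torch.cuda.is_available())  # expect True on a CUDA machine
```
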
## Image and Text Feature Extraction with a Trained Model :robot:

```python
import torch
from PIL import Image

from occhi.vision_encoder.factory import create_model_and_transforms, get_tokenizer

model_name = 'PEv1-G-14'
pretrained = 'ckpts/pev1_gs14_448_rc2.pt'

model, _, preprocess = create_model_and_transforms(
    model_name,
    pretrained=pretrained,
)
model = model.cuda()
tokenizer = get_tokenizer(model_name)

image = preprocess(Image.open("docs/cat.png")).unsqueeze(0).cuda()
text = tokenizer(["a diagram", "a dog", "a cat"]).cuda()

with torch.no_grad(), torch.autocast("cuda"):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[0.0, 0.0, 1.0]]
```
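
Since PE is also trained for video understanding, per-frame features can be pooled into a clip-level embedding. The sketch below is only an illustration under assumptions: it reuses `encode_image` on individual frames and mean-pools the results; the repository may provide a dedicated video pathway instead.

```python
# Hypothetical clip-level embedding by mean-pooling frame features.
# Assumes `model` and `preprocess` from the snippet above; the repo's
# actual video API (if any) may differ.
import torch
from PIL import Image

frame_paths = ["docs/frame_0.png", "docs/frame_1.png", "docs/frame_2.png"]  # placeholder frames
frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).cuda()

with torch.no_grad(), torch.autocast("cuda"):
    frame_features = model.encode_image(frames)                 # (T, D)
    frame_features /= frame_features.norm(dim=-1, keepdim=True)
    video_features = frame_features.mean(dim=0, keepdim=True)   # (1, D)
    video_features /= video_features.norm(dim=-1, keepdim=True)
```
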

You can find more details in the GitHub repo.

# Evaluation

We evaluate the pretrained PE models on zero-shot image and video classification and retrieval tasks.

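As an illustration of the zero-shot protocol (a standard CLIP-style sketch, not the repository's exact evaluation harness), classification reduces to nearest-neighbor matching between image embeddings and prompt embeddings:

```python
# Standard CLIP-style zero-shot classification sketch (not the repo's
# exact evaluation harness). Assumes `model`, `preprocess`, and
# `tokenizer` from the snippets above.
import torch

classnames = ["dog", "cat", "car"]  # placeholder label set
prompts = tokenizer([f"a photo of a {c}" for c in classnames]).cuda()

with torch.no_grad(), torch.autocast("cuda"):
    classifier = model.encode_text(prompts)
    classifier /= classifier.norm(dim=-1, keepdim=True)

def predict(pil_image):
    """Return the index of the best-matching class for one PIL image."""
    x = preprocess(pil_image).unsqueeze(0).cuda()
    with torch.no_grad(), torch.autocast("cuda"):
        feat = model.encode_image(x)
        feat /= feat.norm(dim=-1, keepdim=True)
        return int((feat @ classifier.T).argmax(dim=-1))
```
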
## Zero-Shot Image Results

<img src="https://huggingface.co/facebook/PE-Core-G14-448/resolve/main/docs/pe_zeroshot_image.png" style="width: 100%; margin: 0;" />

## Zero-Shot Video Results

<img src="https://huggingface.co/facebook/PE-Core-G14-448/resolve/main/docs/pe_zeroshot_video.png" style="width: 90%; margin: 0;" />

# Citation

If you find our code useful for your research, please consider citing:

```bibtex
@article{PE,
  title={Perception Encoder},
  author={},
  journal={arXiv:xxx.xxxxx},
  year={2025}
}
```