---
license: apache-2.0
---
# LLaVE-7B

## Model Summary

LLaVE-7B is a 7B-parameter multimodal embedding model built on LLaVA-OneVision-7B, with a context window of 4K tokens.

- **Repository:** [LLaVE](https://github.com/DeepLearnXMU/LLaVE)
- **Paper:** [LLaVE](arxiv.org/)

## Train/Eval Data
- Train data: https://huggingface.co/datasets/TIGER-Lab/MMEB-train
- Eval data: https://huggingface.co/datasets/TIGER-Lab/MMEB-eval (see the download sketch below)
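
To fetch the MMEB data locally, a minimal sketch using `huggingface_hub.snapshot_download` (which mirrors any dataset repo on the Hub) is shown below; the `local_dir` paths are illustrative placeholders, and the layout expected by the LLaVE training/evaluation scripts is described in the GitHub repository.

```python
# Minimal sketch: mirror the MMEB train/eval dataset repos to local folders.
# Assumes huggingface_hub is installed (it is a dependency of transformers);
# the local_dir values below are placeholders, not paths the LLaVE scripts require.
from huggingface_hub import snapshot_download

train_path = snapshot_download(
    repo_id="TIGER-Lab/MMEB-train",
    repo_type="dataset",
    local_dir="data/MMEB-train",
)
eval_path = snapshot_download(
    repo_id="TIGER-Lab/MMEB-eval",
    repo_type="dataset",
    local_dir="data/MMEB-eval",
)
print("Train data at:", train_path)
print("Eval data at:", eval_path)
```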

## Use

### Intended use

The model can embed text, single images, multi-image inputs, and videos.

### Quick Start

First, clone our GitHub repository and install the package:
```bash
git clone https://github.com/DeepLearnXMU/LLaVE
cd LLaVE
pip install -e ".[train]"
```

We provide a simple embedding example below. For more details, please refer to our [GitHub](https://github.com/DeepLearnXMU/LLaVE) repository.

```python
# pip install git+https://github.com/DeepLearnXMU/LLaVE

import torch
import copy
from PIL import Image
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, process_images

pretrained = "zhibinlan/LLaVE-7B"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
# Add any other arguments you want to pass through llava_model_args.
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
model.eval()

# Image + Text -> Text
image = Image.open("figures/example.jpg")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
conv_template = "qwen_1_5"  # Make sure you use the correct chat template for different models

question = DEFAULT_IMAGE_TOKEN + " Represent the given image with the following question: What is in the image"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], "\n")
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
attention_mask = input_ids.ne(tokenizer.pad_token_id)
image_sizes = [image.size]
query_embed = model.encode_multimodal_embeddings(input_ids, attention_mask=attention_mask, images=image_tensor, image_sizes=image_sizes)

target_string = "A cat and a dog"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], target_string)
conv.append_message(conv.roles[1], "\n")
target_string = conv.get_prompt()
target_input_ids = tokenizer(target_string, return_tensors="pt").input_ids.to(device)
attention_mask = target_input_ids.ne(tokenizer.pad_token_id)
target_embed = model.encode_multimodal_embeddings(target_input_ids, attention_mask=attention_mask)

print("A cat and a dog similarity score: ", query_embed @ target_embed.T)
# 7B: A cat and a dog similarity score: tensor([[0.6240]])

neg_string = "A cat and a tiger"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], neg_string)
conv.append_message(conv.roles[1], "\n")
neg_string = conv.get_prompt()
neg_input_ids = tokenizer(neg_string, return_tensors="pt").input_ids.to(device)
attention_mask = neg_input_ids.ne(tokenizer.pad_token_id)
neg_embed = model.encode_multimodal_embeddings(neg_input_ids, attention_mask=attention_mask)
print("A cat and a tiger similarity score: ", query_embed @ neg_embed.T)
# 7B: A cat and a tiger similarity score: tensor([[0.4543]])

# Text -> Image
pos_string = "Find me an everyday image that matches the given caption: A cat and a dog."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], pos_string)
conv.append_message(conv.roles[1], "\n")
pos_string = conv.get_prompt()
pos_input_ids = tokenizer(pos_string, return_tensors="pt").input_ids.to(device)
attention_mask = pos_input_ids.ne(tokenizer.pad_token_id)
pos_query_embed = model.encode_multimodal_embeddings(pos_input_ids, attention_mask=attention_mask)

target = DEFAULT_IMAGE_TOKEN + " Represent the given image."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], target)
conv.append_message(conv.roles[1], "\n")
prompt_target = conv.get_prompt()
target_input_ids = tokenizer_image_token(prompt_target, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
attention_mask = target_input_ids.ne(tokenizer.pad_token_id)
target_image_sizes = [image.size]
target_embed = model.encode_multimodal_embeddings(target_input_ids, attention_mask=attention_mask, images=image_tensor, image_sizes=target_image_sizes)

print("A cat and a dog image similarity score: ", pos_query_embed @ target_embed.T)
# 7B: A cat and a dog image similarity score: tensor([[0.5347]])

neg_string = "Find me an everyday image that matches the given caption: A cat and a tiger."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], neg_string)
conv.append_message(conv.roles[1], "\n")
neg_string = conv.get_prompt()
neg_input_ids = tokenizer(neg_string, return_tensors="pt").input_ids.to(device)
attention_mask = neg_input_ids.ne(tokenizer.pad_token_id)
neg_query_embed = model.encode_multimodal_embeddings(neg_input_ids, attention_mask=attention_mask)

print("A cat and a tiger image similarity score: ", neg_query_embed @ target_embed.T)
# 7B: A cat and a tiger image similarity score: tensor([[0.4001]])
```
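
The printed scores above are plain dot products that stay within [-1, 1], which suggests `encode_multimodal_embeddings` returns (approximately) L2-normalized vectors; the card does not state this explicitly, so the sketch below normalizes before scoring, which is a no-op if the embeddings are already unit-length. It reuses `target_embed` (the embedded example image), `pos_query_embed`, and `neg_query_embed` from the end of the example to rank the two text queries against the image.

```python
import torch
import torch.nn.functional as F

# Stack the candidate query embeddings into one matrix: row 0 is the
# "cat and a dog" query, row 1 is the "cat and a tiger" query.
candidates = torch.cat([pos_query_embed, neg_query_embed], dim=0)  # (2, dim)

# Cosine similarity between the image embedding and each candidate query.
scores = F.normalize(target_embed, dim=-1) @ F.normalize(candidates, dim=-1).T  # (1, 2)
best = torch.argmax(scores, dim=-1)

print("scores:", scores)
print("best matching query index:", best.item())  # expected: 0 ("A cat and a dog")
```

In a real retrieval setting you would encode each candidate once, keep the stacked embedding matrix around, and reuse it for every incoming query.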

## Hardware & Software
- **GPUs:** 16 × Ascend 910B (64GB) (for whole-model training)
- **Orchestration:** [Huggingface Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)
- **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)

## Citation
```
@article{lan2025llave,
  title={LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning},
  author={Lan, Zhibin and Niu, Liqiang and Meng, Fandong and Zhou, Jie and Su, Jinsong},
  journal={arXiv preprint},
  year={2025}
}
```