Does the model support outputting embeddings for a single image?
I want to cluster some images, so I need an embedding for each image. I have tried several times on my own without success. My code and the resulting error message are below.
from src.model import MMEBModel
from src.arguments import ModelArguments
from src.model_utils import load_processor
from PIL import Image
import torch
# 1. Initialize the model arguments
model_args = ModelArguments(
    model_name='/root/autodl-tmp/Qwen/Qwen2-VL-2B-Instruct',
    checkpoint_path='/root/autodl-tmp/TIGER-Lab/VLM2Vec-Qwen2VL-2B',
    pooling='last',
    normalize=True,
    model_backbone='qwen2_vl',
    lora=True
)
processor = load_processor(model_args)
model = MMEBModel.load(model_args)
model = model.to('cuda', dtype=torch.bfloat16)
model.eval()
# 3. Load a single image
image = Image.open('figures/example.jpg').convert('RGB')
# 4. Preprocess the image (key point: text is set to an empty string or None)
inputs = processor(text="", images=image, return_tensors="pt")
# 5. Move the inputs to CUDA
inputs = {k: v.unsqueeze(0).to('cuda') for k, v in inputs.items()}
# 6. Run the model forward pass and extract the embedding
with torch.no_grad():
    image_embedding = model(qry=inputs)["qry_reps"]  # this is the image's embedding vector
print("Image Embedding Shape:", image_embedding.shape)
print("Image Embedding:", image_embedding)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)
The error message is the last line above. Sorry that the information is a bit disorganized.
Hi @YoloBird, thanks for your interest in our work! Yes, it definitely works for single-image embedding. However, I believe you need to include the image special token in the text in step 4:
inputs = processor(text="<|image_pad|>", images=image, return_tensors="pt")
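A plausible reason for the dtype error, offered as a guess rather than a confirmed diagnosis: an empty string tokenizes to zero token IDs, and PyTorch creates an empty tensor as float32 by default, which the embedding layer then rejects. The sketch below guards against passing empty text by making sure the image token from the reply is always present. `build_image_prompt` and `IMAGE_TOKEN` are hypothetical names used for illustration; they are not part of the VLM2Vec repository.

```python
from typing import Optional

# Hypothetical helper, not part of the VLM2Vec repository.
IMAGE_TOKEN = "<|image_pad|>"  # image special token quoted in the reply above


def build_image_prompt(instruction: Optional[str] = "",
                       image_token: str = IMAGE_TOKEN) -> str:
    """Return prompt text that contains the image token exactly once.

    The token is prepended when it is missing, so the text handed to the
    processor is never empty. An optional instruction follows the token.
    """
    text = (instruction or "").strip()
    count = text.count(image_token)
    if count > 1:
        raise ValueError(f"expected at most one {image_token!r}, found {count}")
    if count == 1:
        # The caller already placed the token; leave the text unchanged.
        return text
    # Token missing: prepend it, adding a space only when text follows.
    return f"{image_token} {text}" if text else image_token
```

The result could then replace the empty string in step 4, for example `processor(text=build_image_prompt(), images=image, return_tensors="pt")`. Whether adding an instruction after the token changes embedding quality is not covered in this thread.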