InternViT-6B + QLLaMA, can be used for image-text retrieval like CLIP

by vitvit - opened Jan 7

Discussion

vitvit

Jan 7

Can you provide an example? (using text and image)

czczup

OpenGVLab org Jan 7

Hi, please see the quick start section in the model card.

https://huggingface.co/OpenGVLab/InternVL-14B-224px#quick-start

vitvit

Jan 7

It is not clear. It specifies how to load image encoder but not the fext encoder

dch239

Jan 25

I agree with vitvit. Is there a way to we get CLIP like embeddings out of the model that could be indexed to a vector database to be searched upon later?

czczup

OpenGVLab org Mar 3

Below is a complete example that shows how to load the model and obtain the image and text embeddings separately. Note that the prefix 'summarize:' for the text input and setting tokenizer.pad_token_id = 0 are necessary—omitting these may lead to abnormal results.

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor, AutoTokenizer

# 1. Load the model, image processor, and tokenizer
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).cuda().eval()

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    use_fast=False,
    add_eos_token=True
)
tokenizer.pad_token_id = 0  # Set pad_token_id to 0; necessary to avoid abnormal results

# 2. Prepare input data
# Load an image and convert it to RGB
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# Prepare text input with the necessary prefix
text = "summarize: a photo of a red panda"
input_ids = tokenizer(text, return_tensors='pt', max_length=80,
                      truncation=True, padding='max_length').input_ids.cuda()

# 3. Get image and text embeddings separately
# Get image embeddings using the model's encode_image method
image_embeds = model.encode_image(pixel_values)

# Get text embeddings using the model's encode_text method
text_embeds = model.encode_text(input_ids)

# Print the shapes of the embeddings as an example
print("Image embeddings shape:", image_embeds.shape)
print("Text embeddings shape:", text_embeds.shape)

Explanation

Model Loading:
The model is loaded using AutoModel.from_pretrained with the trust_remote_code=True flag to load custom model code. The model is then moved to the GPU and set to evaluation mode.
Image Processing:
The CLIPImageProcessor preprocesses the image (converted to RGB) to generate pixel_values, which are then moved to the GPU.
Text Processing:
The AutoTokenizer tokenizes the input text. Note that the prefix 'summarize:' is added to the text (as required by the model), and tokenizer.pad_token_id is explicitly set to 0. Both steps are crucial for correct processing.
Embedding Extraction:
The model's encode_image and encode_text methods are used to obtain the CLIP-style embeddings. These normalized embeddings can be used directly for vector indexing or similarity search.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment