How to input a video?

#16
by maltoseflower - opened

According to the model card, Phi-4-mm is able to handle video input. However, I cannot find out how to do it. Can anyone help me?


I don't think videos are supported... at least that's how I understand the model card. Only images and audio are possible, but not videos.

@maltoseflower Currently, Phi-4-mm supports multiple images as input, so if you convert a video into frames, it will work. But it does not support multi-image + audio + text as input at the same time.

So, as of now, it's not possible to directly feed in a video, right? But if we want to work with videos, we can extract frames and use them as multiple images instead. That approach should work, right?

@n3xt1lxs yes.

@donniems Do you have example code for converting a video into multiple images and then using them for generation?

When I try 4 images as multi-image input, why is the GPU VRAM usage so high, 21 GB+ (tried on an A30)? Is that normal?

import os
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_path = "microsoft/Phi-4-multimodal-instruct"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation='flash_attention_2',  # requires the flash-attn package
)

generation_config = GenerationConfig.from_pretrained(model_path)

image_dir = "./extracted_frames"

image_files = [
    f for f in os.listdir(image_dir) 
    if f.endswith(".png") or f.endswith(".jpg")
]

images = []
placeholder = ""

for i, filename in enumerate(sorted(image_files), start=1):
    image_path = os.path.join(image_dir, filename)
    img = Image.open(image_path)
    images.append(img)
    
    placeholder += f"<|image_{i}|>"

messages = [
    {
        "role": "user",
        "content": (
            placeholder 
            + "Please describe or summarize the content of these images."
        )
    }
]

prompt = processor.tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)

inputs = processor(prompt, images=images, return_tensors='pt').to('cuda:0')

generation_args = {
    "max_new_tokens": 512,
    "temperature": 0.5,
    "do_sample": True,
}

generate_ids = model.generate(
    **inputs, 
    **generation_args, 
    generation_config=generation_config,
)

generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]

response = processor.batch_decode(
    generate_ids, 
    skip_special_tokens=True, 
    clean_up_tokenization_spaces=False
)[0]

print(response)
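
For the frame-extraction step assumed by image_dir = "./extracted_frames" above, here is a minimal sketch using OpenCV (the video path, sampling interval, and output file names are illustrative choices, not something prescribed by the model card):

import os
import cv2  # pip install opencv-python

video_path = "input.mp4"          # hypothetical input video
output_dir = "./extracted_frames"
frame_interval = 30               # keep roughly one frame per second for a 30 fps video

os.makedirs(output_dir, exist_ok=True)

cap = cv2.VideoCapture(video_path)
frame_idx = 0
saved = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    if frame_idx % frame_interval == 0:
        # zero-padded names keep sorted(image_files) in temporal order
        cv2.imwrite(os.path.join(output_dir, f"frame_{saved:04d}.jpg"), frame)
        saved += 1
    frame_idx += 1
cap.release()
print(f"Saved {saved} frames to {output_dir}")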

The GPU memory usage is related to the number of frames and the resolution of each frame. If you want to reduce GPU memory usage, you can try reducing the number of frames or the resolution.
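
For example, one way to apply that advice to the script above is to keep only every Nth frame and downscale each frame before building the images list. A sketch that reuses image_files, image_dir, and Image from the script; the stride of 2 and the 448-pixel cap are arbitrary values I picked, not recommendations from the model card:

frame_stride = 2        # keep every 2nd frame to halve the frame count
max_side = 448          # downscale so the longer side is at most 448 px

images = []
placeholder = ""

selected_files = sorted(image_files)[::frame_stride]
for i, filename in enumerate(selected_files, start=1):
    img = Image.open(os.path.join(image_dir, filename))
    # shrink large frames before handing them to the processor
    scale = max_side / max(img.size)
    if scale < 1:
        img = img.resize((int(img.width * scale), int(img.height * scale)))
    images.append(img)
    placeholder += f"<|image_{i}|>"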
