How to input a video?

#16
by maltoseflower - opened

According to the model card, Phi-4-mm is able to handle video input. However, I cannot find out how to do it. Can anyone help me?


I don't think videos are supported... at least that's how I understand the model card. Only images and audio are possible, but not videos.

@maltoseflower Currently, Phi-4-mm supports multiple images as input, so if you convert a video into frames, it will work. But it does not support multi-image + audio + text as input at the same time.

So, as of now, it's not possible to directly feed in a video, right? But if we want to work with videos, we can extract frames and use them as multiple images instead. That approach should work, right?

@n3xt1lxs yes.

@donniems Do you have example code for converting a video into multiple images and then using them for generation?

When I try 4 images as multi-image input, why is the GPU VRAM usage so high, 21 GB+ (tried on an A30)? Is that normal?

import os
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_path = "microsoft/Phi-4-multimodal-instruct"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation='flash_attention_2',  # requires the flash-attn package
)

generation_config = GenerationConfig.from_pretrained(model_path)

image_dir = "./extracted_frames"

image_files = [
    f for f in os.listdir(image_dir) 
    if f.endswith(".png") or f.endswith(".jpg")
]

images = []
placeholder = ""

for i, filename in enumerate(sorted(image_files), start=1):
    image_path = os.path.join(image_dir, filename)
    img = Image.open(image_path)
    images.append(img)
    
    placeholder += f"<|image_{i}|>"

messages = [
    {
        "role": "user",
        "content": (
            placeholder 
            + "Please describe or summarize the content of these images."
        )
    }
]

prompt = processor.tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)

inputs = processor(prompt, images=images, return_tensors='pt').to('cuda:0')

generation_args = {
    "max_new_tokens": 512,
    "temperature": 0.5,
    "do_sample": True,
}

generate_ids = model.generate(
    **inputs, 
    **generation_args, 
    generation_config=generation_config,
)

generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]

response = processor.batch_decode(
    generate_ids, 
    skip_special_tokens=True, 
    clean_up_tokenization_spaces=False
)[0]

print(response)
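
For the frame-extraction step assumed by image_dir = "./extracted_frames" above, here is a minimal sketch using OpenCV (the video path, sampling interval, and output file names are illustrative choices, not something prescribed by the model card):

import os
import cv2  # pip install opencv-python

video_path = "input.mp4"          # hypothetical input video
output_dir = "./extracted_frames"
frame_interval = 30               # keep roughly one frame per second for a 30 fps video

os.makedirs(output_dir, exist_ok=True)

cap = cv2.VideoCapture(video_path)
frame_idx = 0
saved = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    if frame_idx % frame_interval == 0:
        # zero-padded names keep sorted(image_files) in temporal order
        cv2.imwrite(os.path.join(output_dir, f"frame_{saved:04d}.jpg"), frame)
        saved += 1
    frame_idx += 1
cap.release()
print(f"Saved {saved} frames to {output_dir}")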

The GPU memory usage is related to the number of frames and the resolution of each frame. If you want to reduce GPU memory usage, you can try reducing the number of frames or the resolution.
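
For example, one way to apply that advice to the script above is to keep only every Nth frame and downscale each frame before building the images list. A sketch that reuses image_files, image_dir, and Image from the script; the stride of 2 and the 448-pixel cap are arbitrary values I picked, not recommendations from the model card:

frame_stride = 2        # keep every 2nd frame to halve the frame count
max_side = 448          # downscale so the longer side is at most 448 px

images = []
placeholder = ""

selected_files = sorted(image_files)[::frame_stride]
for i, filename in enumerate(selected_files, start=1):
    img = Image.open(os.path.join(image_dir, filename))
    # shrink large frames before handing them to the processor
    scale = max_side / max(img.size)
    if scale < 1:
        img = img.resize((int(img.width * scale), int(img.height * scale)))
    images.append(img)
    placeholder += f"<|image_{i}|>"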
