How to input a video?
According to the model card, Phi-4-MM is able to handle video. However, I cannot find out how to do this. Can anyone help me?
I don't think videos are supported... at least that's how I understand the model card. Only pictures and audio are possible, but not videos.
@maltoseflower Currently, Phi-4-MM supports multiple images as input, so if you convert a video into frames, it will work (see the frame-extraction sketch below). However, it does not support multi-image + audio + text as input at the same time.
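For reference, here is a minimal sketch of pulling evenly spaced frames out of a video with OpenCV. The `extract_frames` helper, the file paths, and the choice of 8 frames are just illustrative assumptions, not something from the model card.

import os
import cv2  # pip install opencv-python

def extract_frames(video_path, out_dir, num_frames=8):
    """Save `num_frames` evenly spaced frames from the video as PNGs."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // num_frames, 1)
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * step)
        ok, frame = cap.read()
        if not ok:
            break
        # Zero-padded names keep the frames in order when sorted later.
        cv2.imwrite(os.path.join(out_dir, f"frame_{i:03d}.png"), frame)
    cap.release()

extract_frames("input.mp4", "./extracted_frames", num_frames=8)

The resulting PNGs can then be fed to the model as multiple images, as in the example further down in this thread.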
So, as of now, it's not possible to directly embed a video, right? But if we want to work with videos, we can extract frames and use them as multiple images instead. This method should work, right?
When I try 4 images as multi-image input, why is the GPU VRAM usage so high, 21GB+ (tried on an A30)? Is that normal?
import os
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_path = "microsoft/Phi-4-multimodal-instruct"

# Load the processor and model (flash_attention_2 requires a compatible GPU).
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation='flash_attention_2',
)
generation_config = GenerationConfig.from_pretrained(model_path)

# Collect the extracted video frames (sorted lexicographically, so
# zero-padded filenames like frame_001.png keep the original order).
image_dir = "./extracted_frames"
image_files = [
    f for f in os.listdir(image_dir)
    if f.endswith(".png") or f.endswith(".jpg")
]

# Build the list of images and the matching <|image_i|> placeholders
# that tell the model where each image appears in the prompt.
images = []
placeholder = ""
for i, filename in enumerate(sorted(image_files), start=1):
    image_path = os.path.join(image_dir, filename)
    img = Image.open(image_path)
    images.append(img)
    placeholder += f"<|image_{i}|>"

messages = [
    {
        "role": "user",
        "content": (
            placeholder
            + "Please describe or summarize the content of these images."
        ),
    }
]

# Turn the chat messages into a single prompt string.
prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = processor(prompt, images=images, return_tensors='pt').to('cuda:0')

generation_args = {
    "max_new_tokens": 512,
    "temperature": 0.5,
    "do_sample": True,
}

generate_ids = model.generate(
    **inputs,
    **generation_args,
    generation_config=generation_config,
)

# Strip the prompt tokens so only the newly generated answer is decoded.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(response)
The GPU memory usage is related to the number of frames and the resolution of each frame. If you want to reduce GPU memory, you can try reducing the number of frames or their resolution (see the sketch below).
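As a minimal sketch of that idea, you could cap the frame count and downscale each frame with PIL before building the placeholders and calling the processor. The `shrink` helper, `MAX_FRAMES`, and `MAX_SIDE` values here are arbitrary example choices, not recommended settings.

from PIL import Image

MAX_FRAMES = 4   # keep only the first N frames (example value)
MAX_SIDE = 448   # downscale so the longest side is at most this many pixels

def shrink(img, max_side=MAX_SIDE):
    """Downscale a PIL image while preserving its aspect ratio."""
    scale = max_side / max(img.size)
    if scale < 1.0:
        new_size = (int(img.width * scale), int(img.height * scale))
        img = img.resize(new_size, Image.LANCZOS)
    return img

# Apply before building the <|image_i|> placeholders / calling the processor.
images = [shrink(img) for img in images[:MAX_FRAMES]]

Fewer and smaller frames mean fewer image tokens, which is what drives the VRAM usage up in the multi-image case.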