Model Card for Qwen-VL-Narrator
Qwen-VL-Narrator is an expert model for understanding video clips from film and TV dramas, designed to generate fine-grained descriptions of characters, scenes, and filming techniques. It can be applied to scenarios such as video retrieval, summarization, understanding, and fine-grained annotation.
Please try:
- HuggingFace Demo
- ModelScope Demo (Chinese only)
Highlights
- Small Model Size: The model is fine-tuned from Qwen2-VL 7B, allowing for easy deployment on a single H20, L20, or even an RTX 5090 GPU.
- High-Quality Video Descriptions: Thanks to the diversity of its training samples, the model can provide more detailed video descriptions than previous models, demonstrating excellent performance for precise and comprehensive annotation.
- Integration with Workflows: The model can be integrated into film and television production workflows, providing summary information for video clips to other modules and enabling capabilities like long-video consolidation and structured output (see the sketch after this list).
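As a hypothetical illustration of the long-video consolidation step, the sketch below merges per-clip descriptions into one time-ordered summary. The ClipNote schema and the consolidate helper are assumptions for this example, with the per-clip descriptions assumed to come from the inference code in "How to Use" below.

from dataclasses import dataclass

@dataclass
class ClipNote:
    start_s: float    # clip start time in seconds
    end_s: float      # clip end time in seconds
    description: str  # Qwen-VL-Narrator's description of this clip

def consolidate(notes: list[ClipNote]) -> str:
    """Merge per-clip descriptions into one time-ordered summary of a long video."""
    return "\n".join(
        f"[{n.start_s:07.2f}-{n.end_s:07.2f}] {n.description}"
        for n in sorted(notes, key=lambda n: n.start_s)
    )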
Model Capabilities
Qwen-VL-Narrator's core capabilities include:
- Character Understanding: The model can describe the appearance and demeanor of characters in detail, including but not limited to: facial features, body shape, clothing, actions, and expressions.
- Scene Understanding: The model can describe the environment and setting in detail, including but not limited to: location, lighting, props, and atmosphere.
- Storytelling: The model can describe events and actions in the video in detail, and can describe character dialogue based on subtitles.
- Technical Analysis: The model can analyze professional filmmaking techniques in detail, including but not limited to: camera movement, composition, color, staging, and transitions.
Showcase
For more use cases, such as structured output and image description, please use the API we provide.
How to Use
Below is a minimal example for video inference. For better inference performance, deploy the model via vLLM or SGLang; a hypothetical client sketch follows.
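A minimal client sketch, assuming the model has been served with "vllm serve xiaosu-zhu/Qwen-VL-Narrator" (an OpenAI-compatible API on localhost:8000). The video_url content part and the placeholder URL are assumptions here; whether video inputs are accepted this way depends on your vLLM version's multimodal support.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="xiaosu-zhu/Qwen-VL-Narrator",
    messages=[{
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": "https://url.for.a.video"}},
            {"type": "text", "text": "Describe this clip."},
        ],
    }],
    max_tokens=1536,
)
print(response.choices[0].message.content)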
pip install "transformers>=4.45.0" accelerate qwen-vl-utils[decord]
This model is fine-tuned from Qwen2-VL, and its usage is the same as the original model's. To run inference, provide the video content and use the following prompt.
The recommended video length is within 1 minute. Recommended video parameters:
{ "max_pixels": 784 * 441, "fps": 2.0, "max_frames": 96, "min_frames": 16 }
# Code sample adapted from https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "xiaosu-zhu/Qwen-VL-Narrator", torch_dtype="auto", device_map="auto"
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# import torch
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "xiaosu-zhu/Qwen-VL-Narrator",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )
# default processor
processor = AutoProcessor.from_pretrained("xiaosu-zhu/Qwen-VL-Narrator")
# The same annotation prompt is used for all of the input variants below.
prompt = r"""Requirements:
- A precise description of the characters' appearance, clothing, actions, and expressions, including an analysis of their race/skin color.
- A detailed analysis of the set design, atmosphere, props, and environment.
- An objective and accurate presentation of the video's plot and narrative (with inferences aided by subtitles).
- An analysis of filming techniques, including camera language, shot types, and focus.
- Artistic processing including emotions/intentions is prohibited. Output only objective descriptions.
Output Format: Integrate the above content into a single, natural, and fluent paragraph describing the video clip. The description must be logically coherent and clear."""

# Messages containing an image list as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "max_pixels": 784 * 441,
                # For a frame list, set fps to the rate at which the frames were sampled.
                "fps": 2.0,
            },
            {"type": "text", "text": prompt},
        ],
    }
]
# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 784 * 441,
                "fps": 2.0,
                "max_frames": 96,
                "min_frames": 16,
            },
            {"type": "text", "text": prompt},
        ],
    }
]
# Messages containing a video URL and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://url.for.a.video",
                "max_pixels": 784 * 441,
                "fps": 2.0,
                "max_frames": 96,
                "min_frames": 16,
            },
            {"type": "text", "text": prompt},
        ],
    }
]
# In Qwen2.5-VL, frame rate information is also input into the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,  # includes the fps metadata needed for temporal alignment
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=1536)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Use Cases
Qwen-VL-Narrator can be applied in various domains to automate video analysis and support downstream applications.
- Content Indexing and Search: Create detailed, searchable metadata for large video archives, making it easy for users to find specific scenes, characters, or shots.
- Pre-production and Scripting: Analyze raw footage to quickly generate video summaries or production scripts.
- Automated Audio Description: Automatically generate audio descriptions for visually impaired audiences, providing accessible content.
- Video Generation Data Annotation: Provide video-text annotation data for video generation models, achieving high-quality video-text alignment and enhancing instruction-following ability; a sketch follows this list.
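As a hypothetical sketch of this annotation use case, the snippet below writes one JSON line per clip; narrate stands in for the inference code from "How to Use", and the video/caption schema is an assumption, not a format defined by this card.

import json

def export_annotations(video_paths: list[str], out_path: str) -> None:
    """Write video-caption pairs as JSONL for video-generation training data."""
    with open(out_path, "w", encoding="utf-8") as f:
        for path in video_paths:
            caption = narrate(path)  # hypothetical: run Qwen-VL-Narrator on one clip
            f.write(json.dumps({"video": path, "caption": caption}, ensure_ascii=False) + "\n")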
Limitations and Bias
This model, like all large language and vision models, has limitations.
- Due to biases and quality issues in the training data, the model's output may not be completely accurate and may contain hallucinations.
- The quality of descriptions may vary depending on the video's type, style, and content complexity.
- Due to the architectural limitations of Qwen2-VL, the model cannot process or describe audio.
- When the input video duration exceeds 1 minute, description quality may decline. Please segment and preprocess videos according to your workflow; a simple segmentation sketch follows.
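For example, one simple way to pre-segment long footage into roughly 60-second chunks is ffmpeg's segment muxer. The sketch below uses stream copy, so boundaries snap to keyframes and segment lengths are approximate; the paths and chunk length are placeholders.

import subprocess

def segment_video(src: str, out_pattern: str = "clip_%03d.mp4", seconds: int = 60) -> None:
    """Split a long video into roughly `seconds`-long chunks without re-encoding."""
    subprocess.run([
        "ffmpeg", "-i", src,
        "-c", "copy", "-map", "0",
        "-f", "segment", "-segment_time", str(seconds),
        "-reset_timestamps", "1",
        out_pattern,
    ], check=True)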
Contributors
Xiaosu Zhu, Sijia Cai, Bing Deng and Jieping Ye @ Data to Intelligence Lab, Alibaba Cloud
This model was made possible through close collaboration within the Data to Intelligence Lab at Alibaba Cloud.
