--- |
|
|
base_model: |
|
|
- openbmb/MiniCPM-Llama3-V-2_5 |
|
|
datasets: |
|
|
- MBZUAI/VideoInstruct-100K |
|
|
- Share14/ShareGemini |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
pipeline_tag: video-text-to-text |
|
|
tags: |
|
|
- MiniCPM-V |
|
|
- finetune |
|
|
- MLLM |
|
|
--- |
|
|
|
|
|
# Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation |
|
|
|
|
|
<p> |
|
|
💻 <a href="https://github.com/VITA-MLLM/Sparrow">GitHub</a>   |   📑 <a href="https://arxiv.org/pdf/2411.19951">Paper</a>
|
|
</p> |
|
|
|
|
|
|
|
|
## Model Summary |
|
|
|
|
|
This model is part of the [Sparrow](https://github.com/VITA-MLLM/Sparrow) project. It is a video-LLM fine-tuned from the image-LLM
|
|
[MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5). |
|
|
|
|
|
**Abstract:** |
|
|
Recent years have seen the success of Multimodal Large Language Models (MLLMs) in the domain of vision understanding. The success of these models can largely be attributed to the dominant scaling law, which states that larger parameter sizes and data volumes contribute to better performance. Notably, data scaling has been primarily driven by automatic data pipelines, which focus on the self-instruction of LLMs. This paradigm has been taken for granted for quite some time, but the effectiveness of scaling with such data has long been understudied. In this context, this work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. Our primary study approach involves fine-tuning pre-trained image-LLMs with video data and examining learning efficiency through data scaling. Results from our preliminary experiments reveal a low learning efficiency phenomenon when simply scaling up video data samples, which, through our probing, can be ascribed to a lack of instruction diversity. Aiming at this issue, we propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data. Mixing these synthetic samples with the video data enables a more efficient training scheme. Through comprehensive experiments, we demonstrate that our proposed method achieves performance comparable to or even superior to that of baselines trained with significantly more samples. Meanwhile, we find that incorporating these synthetic samples can enhance the performance of long video understanding without requiring training on long video data.
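
To make the augmentation idea concrete, below is a minimal, hypothetical sketch of text-to-image augmentation: the context of a pure-text instruction sample is rendered into a handful of image frames so that the sample structurally resembles a video. The function name, chunking, and rendering details are illustrative assumptions, not the authors' exact synthesis pipeline, which is described in the paper and the GitHub repository.

```python
# Hypothetical sketch: render a text instruction's context into image "frames"
# so that a pure-text sample can be mixed with video training data.
from PIL import Image, ImageDraw


def text_to_pseudo_video(context: str, num_frames: int = 4, size=(448, 448)):
    """Split `context` into chunks and render each chunk as one image frame."""
    chunk_len = max(1, (len(context) + num_frames - 1) // num_frames)
    chunks = [context[i:i + chunk_len] for i in range(0, len(context), chunk_len)]
    frames = []
    for chunk in chunks[:num_frames]:
        img = Image.new("RGB", size, color="white")
        draw = ImageDraw.Draw(img)
        # Naive wrapping at ~40 characters per rendered line (default PIL font).
        wrapped = "\n".join(chunk[j:j + 40] for j in range(0, len(chunk), 40))
        draw.text((10, 10), wrapped, fill="black")
        frames.append(img)
    return frames


# The resulting frames, paired with the original instruction and response,
# form a "video-like" sample that can be mixed into the video training data.
frames = text_to_pseudo_video("A long passage taken from a text instruction dataset ...")
```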
|
|
|
|
|
## Sample Usage |
|
|
|
|
|
This model is designed for video-language understanding. You can load it with the `transformers` library; make sure `trust_remote_code=True` is set so the model's custom code is loaded. For video input, you typically provide a list of image frames (PIL Images). The snippet below is a sketch: the exact loading classes and processor arguments depend on the model's remote code, so consult the [GitHub repository](https://github.com/VITA-MLLM/Sparrow) for the reference inference pipeline.
|
|
|
|
|
**Prerequisites:** |
|
|
You might need `decord` to easily load video frames. Install it via `pip install decord`. |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
from PIL import Image
import torch
import numpy as np
from decord import VideoReader, cpu  # For video loading

# Load model and processor
model_id = "VITA-MLLM/Sparrow-Llama3-V-2_5"  # Replace with the actual model ID if different
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Use bfloat16 for better performance/memory
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# --- Example: Load video frames ---
video_path = "path/to/your/video.mp4"  # <--- IMPORTANT: Replace with your video file path!
video_frames = []
try:
    vr = VideoReader(video_path, ctx=cpu(0))
    # Sample at most 32 frames uniformly for demonstration
    total_frames = len(vr)
    num_frames_to_sample = min(total_frames, 32)
    frame_indices = np.linspace(0, total_frames - 1, num_frames_to_sample, dtype=int)
    video_frames = [Image.fromarray(vr[i].asnumpy()) for i in frame_indices]
    print(f"Loaded {len(video_frames)} frames from {video_path}")
except Exception as e:
    print(f"Could not load video from {video_path}: {e}")
    print("Using placeholder images for demonstration. Please provide a valid video file.")
    video_frames = [Image.new("RGB", (224, 224), color="blue")] * 4  # Fallback to placeholder images

# --- Prepare prompt with video frames ---
# The <video> placeholder marks video/image input for this model.
# It is repeated once for each image frame provided.
messages = [
    {"role": "user", "content": "<video>" * len(video_frames) + "\nDescribe this video in detail."}
]

# Apply chat template and tokenize inputs.
# Note: the exact keyword arguments (e.g. `video=`) and the names of the returned
# tensors depend on this model's remote processing code; adjust them if your
# processor exposes a different interface.
inputs = processor.apply_chat_template(
    messages,
    video=video_frames,  # Pass the list of PIL Images here
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)

# Move inputs to the appropriate device (e.g., GPU)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# --- Generate response ---
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image_pixel_values=inputs["image_pixel_values"],  # Essential for vision inputs
        max_new_tokens=256,  # Adjust as needed
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

# Decode and print the output
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
# Clean up any chat-template artifacts preceding the assistant turn
response = response.split('<|start_header_id|>assistant<|end_header_id|>')[-1].strip()

print("\nGenerated Response:")
print(response)
```
|
|
|
|
|
## License |
|
|
|
|
|
#### Model License |
|
|
|
|
|
* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License. |
|
|
* The usage of MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md). |
|
|
* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, they are also available for free commercial use.
|
|
|
|
|
|
|
|
#### Statement |
|
|
* As an LLM, MiniCPM-Llama3-V 2.5 generates content by learning from a large amount of text, but it cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-Llama3-V 2.5 does not represent the views and positions of the model developers.
|
|
* We will not be liable for any problems arising from the use of the MiniCPM-V open-source model, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misguidance, misuse, dissemination, or improper use of the model.
|
|
|
|
|
|
|
|
## Training Dataset
|
|
- 100K video instruction samples from Video-ChatGPT (VideoInstruct-100K)
|
|
- 100K video caption samples from ShareGemini
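
For reference, a training mixture that combines these real video samples with Sparrow's synthetic text-derived samples might be organized along the following lines. This is a hypothetical, LLaVA-style layout for illustration only; the field names are assumptions, and the actual schema is defined by the Sparrow training code on GitHub.

```python
# Hypothetical illustration of a mixed training list; field names are
# assumptions, not the repository's actual schema.
mixed_samples = [
    {   # real video sample (VideoInstruct-100K / ShareGemini)
        "video": "videos/clip_0001.mp4",
        "conversations": [
            {"from": "human", "value": "<video>\nWhat is the person doing?"},
            {"from": "gpt", "value": "The person is assembling a bookshelf."},
        ],
    },
    {   # synthetic sample: pure-text instruction rendered as image frames
        "frames": ["synthetic/sample_42_f0.png", "synthetic/sample_42_f1.png"],
        "conversations": [
            {"from": "human", "value": "<video>\nSummarize the passage shown in the frames."},
            {"from": "gpt", "value": "The passage explains ..."},
        ],
    },
]
```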