---
license: mit
base_model:
- openbmb/MiniCPM-V-2_6
---
<div align="center">
<h1 style="margin: 0">
<img src="assets/logo.png" style="width:1.5em; vertical-align: middle; display: inline-block; margin: 0" alt="Logo">
<span style="vertical-align: middle; display: inline-block; margin: 0"><b>CaReBench: A Fine-grained Benchmark for Video Captioning and Retrieval</b></span>
</h1>
<p style="margin: 0">
Yifan Xu, <a href="https://scholar.google.com/citations?user=evR3uR0AAAAJ">Xinhao Li</a>, Yichun Yang, Desen Meng, Rui Huang, <a href="https://scholar.google.com/citations?user=HEuN8PcAAAAJ">Limin Wang</a>
</p>
<p align="center">
🤗 <a href="https://huggingface.co/MCG-NJU/CaRe-7B">Model</a>&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;🤗 <a href="https://huggingface.co/datasets/MCG-NJU/CaReBench">Data</a>&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;📑 <a href="https://arxiv.org/pdf/2501.00513">Paper</a>
</p>
</div>
## 📝 Introduction
This checkpoint is MiniCPM-V 2.6 trained with *Retrieval Adaptation*, which adapts the model for video-text retrieval. Refer to [our paper](https://arxiv.org/pdf/2501.00513) for details.
## Usage
Loading this checkpoint directly from the Hugging Face Hub remote path has not been tested. It is **recommended** to download the checkpoint to your local environment first to avoid potential issues; a download sketch follows.
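A minimal sketch of the local download using the `huggingface_hub` library; the `repo_id` and `local_dir` values below are illustrative assumptions, so substitute this model's actual repository id and your own path:

```python
# Sketch: fetch the full checkpoint for local loading via huggingface_hub.
# The repo_id and local_dir here are assumptions, not confirmed values.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id='MCG-NJU/MiniCPM-V-2_6-RA',               # assumed repo id
    local_dir='path/to/checkpoints/MiniCPM-V-2_6-RA',  # example target path
)
```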
### For Retrieval Tasks
The snippet below assumes the helper modules `utils.video` and `models.modeling_encoders` from the CaReBench codebase are available on your Python path.
```python
from torch.nn.functional import cosine_similarity

from models.modeling_encoders import AutoEncoder
from utils.video import read_frames_decord

# Load the retrieval-adapted encoder from the local checkpoint directory.
encoder = AutoEncoder.from_pretrained('path/to/checkpoints/MiniCPM-V-2_6-RA')

# Sample 32 frames from the demo video; unsqueeze(0) adds a batch dimension.
frames = read_frames_decord(video_path='assets/demo.mp4', num_frames=32)
text = "This video features a man slicing tomatoes in the kitchen."

vision_emb = encoder.encode_vision(frames.unsqueeze(0))
text_emb = encoder.encode_text(text)

print(f'Vision embedding shape: {vision_emb.shape}')
print(f'Text embedding shape: {text_emb.shape}')
print(f'Cosine similarity: {cosine_similarity(vision_emb, text_emb)}')
```
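For retrieval over multiple candidates, the same embeddings can be scored in a batch. A minimal sketch, assuming `encode_text` returns one `(1, D)` embedding per call, as the shapes printed above suggest; the candidate captions are made up for illustration:

```python
import torch
from torch.nn.functional import cosine_similarity

# Hypothetical candidate captions to rank against the video embedding.
captions = [
    "This video features a man slicing tomatoes in the kitchen.",
    "A dog runs across a snowy field.",
    "Two people play chess in a park.",
]

# Stack per-caption embeddings into an (N, D) matrix, then score all of
# them against the single (1, D) vision embedding via broadcasting.
text_embs = torch.cat([encoder.encode_text(c) for c in captions], dim=0)
scores = cosine_similarity(vision_emb, text_embs)  # shape (N,)

best = int(scores.argmax())
print(f'Best caption: {captions[best]!r} (similarity {scores[best]:.4f})')
```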