---
language:
- en
base_model:
- OpenGVLab/InternVL2_5-8B
tags:
- video
- emotion
---
# Libra-Emo Model

**A Multimodal Large Language Model for Fine-Grained Negative Emotion Detection**

This is the official model release of Libra-Emo, a multimodal large language model for fine-grained negative emotion detection. The model is built upon [InternVL 2.5](https://github.com/OpenGVLab/InternVL) and fine-tuned on our [Libra-Emo Dataset](https://huggingface.co/datasets/caskcsg/Libra-Emo).

## Model Description
Libra-Emo Model is designed to understand and analyze emotions in video content. It can:
- Recognize **13** fine-grained emotion categories (see the label list below)
- Provide detailed **explanations** for emotion classifications
- Process both visual and textual (subtitle) information
- Handle real-world video scenarios with complex emotional expressions
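For reference, the 13 labels used in the prompts later in this card can be kept as a plain constant. The snippet below is only an illustration (the `EMOTION_LABELS` name is ours, not part of the released code); the label names are copied verbatim from the prompt in the usage example.

```python
# The 13 emotion labels Libra-Emo is prompted with (see the usage example below).
EMOTION_LABELS = [
    "happy", "excited", "angry", "disgusted", "hateful",
    "surprised", "amazed", "frustrated", "sad", "fearful",
    "despairful", "ironic", "neutral",
]
```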

## Usage
### Environment Setup
Our model is tested with CUDA 12.1. To set up the environment:
```bash
# Create and activate conda environment
conda create -n libra-emo python=3.10
conda activate libra-emo
# Clone and install InternVL dependencies
git clone https://github.com/OpenGVLab/InternVL.git
cd InternVL
pip install -r requirements/internvl_chat.txt
```
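As an optional sanity check, you can confirm that PyTorch sees your GPU and that `decord` (used by the video-loading code below) imports correctly; if `decord` is missing from your environment, install it with `pip install decord`.

```bash
# Optional sanity check: CUDA visibility and decord availability
python -c "import torch, decord; print(torch.__version__, torch.cuda.is_available())"
```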
### Usage Example
Here's a complete example of how to use the Libra-Emo model for video emotion analysis:
```python
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
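
# Per-frame preprocessing: convert to RGB, bicubic-resize to a square, and normalize with ImageNet statistics.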
def build_transform(input_size):
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)
transform = T.Compose(
[
T.Lambda(lambda img: img.convert("RGB") if img.mode != "RGB" else img),
T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
T.ToTensor(),
T.Normalize(mean=MEAN, std=STD),
]
)
return transform
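
# Choose the tiling grid (columns x rows) whose aspect ratio is closest to the input image's aspect ratio.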
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
best_ratio_diff = float("inf")
best_ratio = (1, 1)
area = width * height
for ratio in target_ratios:
target_aspect_ratio = ratio[0] / ratio[1]
ratio_diff = abs(aspect_ratio - target_aspect_ratio)
if ratio_diff < best_ratio_diff:
best_ratio_diff = ratio_diff
best_ratio = ratio
elif ratio_diff == best_ratio_diff:
if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
best_ratio = ratio
return best_ratio
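
# Split an image into up to max_num tiles of size image_size x image_size, optionally appending a global thumbnail.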
def dynamic_preprocess(
image, min_num=1, max_num=12, image_size=448, use_thumbnail=False
):
orig_width, orig_height = image.size
aspect_ratio = orig_width / orig_height
# calculate the existing image aspect ratio
target_ratios = set(
(i, j)
for n in range(min_num, max_num + 1)
for i in range(1, n + 1)
for j in range(1, n + 1)
if i * j <= max_num and i * j >= min_num
)
target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
# find the closest aspect ratio to the target
target_aspect_ratio = find_closest_aspect_ratio(
aspect_ratio, target_ratios, orig_width, orig_height, image_size
)
# calculate the target width and height
target_width = image_size * target_aspect_ratio[0]
target_height = image_size * target_aspect_ratio[1]
blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
# resize the image
resized_img = image.resize((target_width, target_height))
processed_images = []
for i in range(blocks):
box = (
(i % (target_width // image_size)) * image_size,
(i // (target_width // image_size)) * image_size,
((i % (target_width // image_size)) + 1) * image_size,
((i // (target_width // image_size)) + 1) * image_size,
)
# split the image
split_img = resized_img.crop(box)
processed_images.append(split_img)
assert len(processed_images) == blocks
if use_thumbnail and len(processed_images) != 1:
thumbnail_img = image.resize((image_size, image_size))
processed_images.append(thumbnail_img)
return processed_images
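
# Load a single image file and return its preprocessed tiles as a stacked tensor
# (not used by the video pipeline below, but handy for image inputs).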
def load_image(image_file, input_size=448, max_num=12):
image = Image.open(image_file).convert("RGB")
transform = build_transform(input_size=input_size)
images = dynamic_preprocess(
image, image_size=input_size, use_thumbnail=True, max_num=max_num
)
pixel_values = [transform(image) for image in images]
pixel_values = torch.stack(pixel_values)
return pixel_values
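
# Evenly sample num_segments frame indices across the clip (bounds are given in seconds),
# taking the midpoint of each segment.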
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
if bound:
start, end = bound[0], bound[1]
else:
start, end = -100000, 100000
start_idx = max(first_idx, round(start * fps))
end_idx = min(round(end * fps), max_frame)
seg_size = float(end_idx - start_idx) / num_segments
frame_indices = np.array(
[
int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
for idx in range(num_segments)
]
)
return frame_indices
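
# Decode the video with decord, sample frames, and tile/normalize each frame;
# returns concatenated pixel values and the tile count per frame.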
def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
max_frame = len(vr) - 1
fps = float(vr.get_avg_fps())
pixel_values_list, num_patches_list = [], []
transform = build_transform(input_size=input_size)
frame_indices = get_index(
bound, fps, max_frame, first_idx=0, num_segments=num_segments
)
for frame_index in frame_indices:
img = Image.fromarray(vr[frame_index].asnumpy()).convert("RGB")
img = dynamic_preprocess(
img, image_size=input_size, use_thumbnail=True, max_num=max_num
)
pixel_values = [transform(tile) for tile in img]
pixel_values = torch.stack(pixel_values)
num_patches_list.append(pixel_values.shape[0])
pixel_values_list.append(pixel_values)
pixel_values = torch.cat(pixel_values_list)
return pixel_values, num_patches_list

# Step 1: load the model
# If you have an 80G A100 GPU, you can put the entire model on a single GPU.
model_path = "caskcsg/Libra-Emo-8B"
model = AutoModel.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True,
device_map="cuda:0"
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(
model_path, trust_remote_code=True, use_fast=False
)

# Step 2: load the video
video_path = "your_video_path" # change to your video path
pixel_values, num_patches_list = load_video(
video_path, num_segments=16, max_num=1
)
pixel_values = pixel_values.to(torch.bfloat16).to(model.device)
video_prefix = "".join(
[f"Frame-{i+1}: <image>\n" for i in range(len(num_patches_list))]
)

# Step 3: set the question
subtitle = None  # optional: set to the video's subtitle string, or leave as None if there is no subtitle
if subtitle is None:
question = (
video_prefix
+ "The above is a video. Please accurately identify the emotional label expressed by the people in the video. Emotional labels include should be limited to: happy, excited, angry, disgusted, hateful, surprised, amazed, frustrated, sad, fearful, despairful, ironic, neutral. The output format should be:\n[label]\n[explanation]"
)
else:
question = (
video_prefix
+ f"The above is a video. The video's subtitle is '{subtitle}', which maybe the words spoken by the person. Please accurately identify the emotional label expressed by the people in the video. Emotional labels include should be limited to: happy, excited, angry, disgusted, hateful, surprised, amazed, frustrated, sad, fearful, despairful, ironic, neutral. The output format should be:\n[label]\n[explanation]"
)

# Step 4: generate the response
response, history = model.chat(
tokenizer,
pixel_values,
question,
dict(max_new_tokens=512, do_sample=False),
num_patches_list=num_patches_list,
history=None,
return_history=True,
)
print(response)
```
The model will output the emotion label and explanation in the following format:
```
[label]
[explanation]
```
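If you need the prediction programmatically, a minimal parsing sketch is shown below. It assumes the model follows the two-line format above and reuses the `response` variable from the example; `parse_response` is an illustrative helper of ours, not part of the released code.

```python
def parse_response(response: str):
    """Split the model output into (label, explanation) following the documented two-line format."""
    lines = [line.strip() for line in response.strip().splitlines() if line.strip()]
    label = lines[0].strip("[]").lower() if lines else ""
    explanation = " ".join(lines[1:]) if len(lines) > 1 else ""
    return label, explanation

label, explanation = parse_response(response)
print("label:", label)
print("explanation:", explanation)
```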
**Note**: If you aim to obtain emotion labels more quickly without requiring explanations, consider reducing the `max_new_tokens` value in the generation configuration.
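For example, a small token budget is usually enough for the label line alone. The snippet below reuses the variables from the example above; the specific value of 32 is our choice, not a setting prescribed by the authors.

```python
# Label-only decoding: a small max_new_tokens is enough for the first line of the output.
generation_config = dict(max_new_tokens=32, do_sample=False)
response = model.chat(
    tokenizer,
    pixel_values,
    question,
    generation_config,
    num_patches_list=num_patches_list,
)
print(response)
```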

## Performance Comparison
We evaluate our models on the [Libra-Emo Bench](https://huggingface.co/datasets/caskcsg/Libra-Emo), comparing with both closed-source and open-source models. The evaluation metrics include accuracy and F1 scores for all emotions (13 classes) and negative emotions (8 classes).
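For reference, these metrics can be computed with standard scikit-learn calls as sketched below; this is our illustration of the metric definitions, not the authors' released evaluation script, and `y_true` / `y_pred` are placeholder label lists.

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholder gold and predicted labels; replace with real outputs on Libra-Emo Bench.
y_true = ["angry", "sad", "neutral", "ironic"]
y_pred = ["angry", "sad", "happy", "ironic"]

print("Accuracy:   ", accuracy_score(y_true, y_pred))
print("Macro-F1:   ", f1_score(y_true, y_pred, average="macro"))
print("Weighted-F1:", f1_score(y_true, y_pred, average="weighted"))
```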
### Performance Comparison of MLLMs on Libra-Emo Bench
| **Model** | **Accuracy** | **Macro-F1** | **Weighted-F1** | **Accuracy (Neg)** | **Macro-F1 (Neg)** | **Weighted-F1 (Neg)** |
|:--------------------------|:------------:|:------------:|:---------------:|:------------------:|:------------------:|:---------------------:|
| ***Closed-Source Models*** | | | | | | |
| Gemini-2.0-Flash | **65.67** | **63.98** | **64.51** | 65.00 | 62.97 | 63.86 |
| Gemini-1.5-Flash | 64.41 | 62.36 | 62.52 | 61.32 | 58.85 | 58.74 |
| GPT-4o | 62.99 | 63.56 | 63.32 | **67.89** | **67.54** | **67.89** |
| Claude-3.5-Sonnet | 52.13 | 48.38 | 49.38 | 49.47 | 49.32 | 50.50 |
| ***Open-Source Models*** | | | | | | |
| LLaVA-Video-7B-Qwen2 | 33.39 | 30.14 | 31.25 | 22.11 | 25.55 | 26.65 |
| MiniCPM-o 2.6 (8B) | 42.83 | 40.23 | 40.26 | 40.53 | 37.29 | 38.00 |
| Qwen2.5-VL-7B | 47.56 | 44.18 | 43.68 | 41.32 | 39.07 | 38.50 |
| NVILA-8B | 41.89 | 35.92 | 36.01 | 42.89 | 32.83 | 33.88 |
| Phi-3.5-vision-instruct | 53.39 | 51.23 | 51.16 | **52.89** | **49.97** | **49.98** |
| InternVL-2.5-1B | 23.46 | 17.33 | 18.14 | 22.11 | 16.48 | 17.26 |
| InternVL-2.5-2B | 25.98 | 22.31 | 22.19 | 30.79 | 24.97 | 24.59 |
| InternVL-2.5-4B | 42.99 | 39.58 | 38.81 | 37.89 | 38.78 | 38.55 |
| InternVL-2.5-8B | **54.96** | **51.42** | **51.64** | 50.53 | 47.07 | 47.22 |
| ***Fine-Tuned on Libra-Emo*** | | | | | | |
| Libra-Emo-1B | 53.54 (↑30.08) | 49.44 (↑32.11) | 50.19 (↑32.05) | 46.84 (↑24.73) | 41.53 (↑25.05) | 42.25 (↑24.99) |
| Libra-Emo-2B | 56.38 (↑30.40) | 53.60 (↑31.29) | 53.90 (↑31.71) | 50.26 (↑19.47) | 48.79 (↑23.82) | 48.91 (↑24.32) |
| Libra-Emo-4B | 65.20 (↑22.21) | 64.12 (↑24.54) | 64.41 (↑25.60) | 60.79 (↑22.90) | 61.30 (↑22.52) | 61.61 (↑23.06) |
| **Libra-Emo-8B** | **71.18 (↑16.22)** | **70.51 (↑19.09)** | **70.71 (↑19.07)** | **70.53 (↑20.00)** | **69.94 (↑22.87)** | **70.14 (↑22.92)** |
### Key Findings
1. Our Libra-Emo models significantly outperform their InternVL base models, with absolute gains ranging from roughly 16 to 32 points in accuracy and F1 scores.
2. The 8B version achieves the best performance, reaching 71.18% accuracy and 70.51% macro-F1 score on all emotions.
3. For negative emotions, our models show strong performance with up to 70.53% accuracy and 70.14% weighted-F1 score.
4. The performance scales well with model size, showing consistent improvements from 1B to 8B parameters.
> **Note**: Our technical report with detailed methodology and analysis will be released soon.