🎭 Libra-Emo Model

A Multimodal Large Language Model for Fine-Grained Negative Emotion Detection

This is the official model release of Libra-Emo, a multimodal large language model for fine-grained negative emotion detection. The model is built upon InternVL 2.5 and fine-tuned on our Libra-Emo Dataset.

πŸ“ Model Description

Libra-Emo Model is designed to understand and analyze emotions in video content. It can:

  • Recognize 13 fine-grained emotion categories (see the label list sketched after this list)
  • Provide detailed explanations for emotion classifications
  • Process both visual and textual (subtitle) information
  • Handle real-world video scenarios with complex emotional expressions
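
For reference, the 13 labels used in the prompts later in this card can be collected in a small Python constant. This is only a convenience sketch; the name EMOTION_LABELS is ours and not part of the model API:

# The 13 fine-grained emotion labels used in the prompts below.
# (Convenience constant for downstream code; the name is our own choice.)
EMOTION_LABELS = [
    "happy", "excited", "angry", "disgusted", "hateful",
    "surprised", "amazed", "frustrated", "sad", "fearful",
    "despairful", "ironic", "neutral",
]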

🚀 Usage

Environment Setup

Our model is tested with CUDA 12.1. To set up the environment:

# Create and activate conda environment
conda create -n libra-emo python=3.10
conda activate libra-emo

# Clone and install InternVL dependencies
git clone https://github.com/OpenGVLab/InternVL.git
cd InternVL
pip install -r requirements/internvl_chat.txt
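
After installation, a quick sanity check can confirm that PyTorch sees your GPU and that the libraries used in the example below import cleanly. This snippet is optional and not part of the official setup:

# Optional environment sanity check.
import decord
import torch
import torchvision
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__)
print("transformers:", transformers.__version__)
print("decord:", decord.__version__)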

Usage Example

Here's a complete example of how to use Libra-Emo Model for video emotion analysis:

import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

# Standard ImageNet-style preprocessing: convert to RGB, resize to a square tile, normalize.
def build_transform(input_size):
    MEAN = (0.485, 0.456, 0.406)
    STD = (0.229, 0.224, 0.225)
    transform = T.Compose(
        [
            T.Lambda(lambda img: img.convert("RGB") if img.mode != "RGB" else img),
            T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
            T.ToTensor(),
            T.Normalize(mean=MEAN, std=STD),
        ]
    )
    return transform

# Choose the tiling grid (columns, rows) whose aspect ratio best matches the input image.
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float("inf")
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

# Split an image into up to `max_num` square tiles of size `image_size`,
# optionally appending a thumbnail of the whole image.
def dynamic_preprocess(
    image, min_num=1, max_num=12, image_size=448, use_thumbnail=False
):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j)
        for n in range(min_num, max_num + 1)
        for i in range(1, n + 1)
        for j in range(1, n + 1)
        if i * j <= max_num and i * j >= min_num
    )
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size
    )

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size,
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

# Load a single image file into a stacked tensor of preprocessed tiles
# (not used in the video example below, kept for completeness).
def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert("RGB")
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(
        image, image_size=input_size, use_thumbnail=True, max_num=max_num
    )
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# Sample `num_segments` evenly spaced frame indices, optionally limited to a
# (start, end) time bound in seconds.
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array(
        [
            int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
            for idx in range(num_segments)
        ]
    )
    return frame_indices

# Decode the video with decord, sample frames, and preprocess each frame into tiles.
def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(
        bound, fps, max_frame, first_idx=0, num_segments=num_segments
    )
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert("RGB")
        img = dynamic_preprocess(
            img, image_size=input_size, use_thumbnail=True, max_num=max_num
        )
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list
    

# Step 1: load the model
# If you have an 80G A100 GPU, you can put the entire model on a single GPU.
model_path = "caskcsg/Libra-Emo-1B"
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="cuda:0"
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(
    model_path, trust_remote_code=True, use_fast=False
)

# Step 2: load the video
video_path = "your_video_path" # change to your video path
pixel_values, num_patches_list = load_video(
    video_path, num_segments=16, max_num=1
)
pixel_values = pixel_values.to(torch.bfloat16).to(model.device)
# Build one "<image>" placeholder per sampled frame for the prompt.
video_prefix = "".join(
    [f"Frame-{i+1}: <image>\n" for i in range(len(num_patches_list))]
)

# Step 3: set the question
subtitle = None  # optional: set this to the video's subtitle string, or leave as None if there is no subtitle
if subtitle is None:
    question = (
        video_prefix
        + "The above is a video. Please accurately identify the emotional label expressed by the people in the video. Emotional labels include should be limited to: happy, excited, angry, disgusted, hateful, surprised, amazed, frustrated, sad, fearful, despairful, ironic, neutral. The output format should be:\n[label]\n[explanation]"
    )
else:
    question = (
        video_prefix
        + f"The above is a video. The video's subtitle is '{subtitle}', which maybe the words spoken by the person. Please accurately identify the emotional label expressed by the people in the video. Emotional labels include should be limited to: happy, excited, angry, disgusted, hateful, surprised, amazed, frustrated, sad, fearful, despairful, ironic, neutral. The output format should be:\n[label]\n[explanation]"
    )

# Step 4: generate the response
generation_config = dict(max_new_tokens=512, do_sample=False)
response, history = model.chat(
    tokenizer,
    pixel_values,
    question,
    generation_config,
    num_patches_list=num_patches_list,
    history=None,
    return_history=True,
)
print(response)

The model will output the emotion label and explanation in the following format:

[label]
[explanation]

Note: If you aim to obtain emotion labels more quickly without requiring explanations, consider reducing the max_new_tokens value in the generation configuration.
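
If you want to post-process the response programmatically, here is a minimal parsing sketch. It assumes the first non-empty line of the response carries the label and the remaining lines carry the explanation; the helper name parse_emotion_response is ours:

# Minimal parsing of the "[label]\n[explanation]" output produced above.
def parse_emotion_response(response: str):
    lines = [line.strip() for line in response.strip().splitlines() if line.strip()]
    if not lines:
        return None, ""
    label = lines[0].strip("[]").lower()       # e.g. "angry"
    explanation = " ".join(lines[1:]).strip()  # may be empty if explanations are skipped
    return label, explanation

label, explanation = parse_emotion_response(response)
print(label, "->", explanation)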

📊 Performance Comparison

We evaluate our models on the Libra-Emo Bench, comparing with both closed-source and open-source models. The evaluation metrics are accuracy, macro-F1, and weighted-F1, computed over all 13 emotion classes and over the 8 negative-emotion classes.

Performance Comparison of MLLMs on Libra-Emo Bench

| Model | Accuracy | Macro-F1 | Weighted-F1 | Accuracy (Neg) | Macro-F1 (Neg) | Weighted-F1 (Neg) |
|---|---|---|---|---|---|---|
| Closed-Source Models | | | | | | |
| Gemini-2.0-Flash | 65.67 | 63.98 | 64.51 | 65.00 | 62.97 | 63.86 |
| Gemini-1.5-Flash | 64.41 | 62.36 | 62.52 | 61.32 | 58.85 | 58.74 |
| GPT-4o | 62.99 | 63.56 | 63.32 | 67.89 | 67.54 | 67.89 |
| Claude-3.5-Sonnet | 52.13 | 48.38 | 49.38 | 49.47 | 49.32 | 50.50 |
| Open-Source Models | | | | | | |
| LLaVA-Video-7B-Qwen2 | 33.39 | 30.14 | 31.25 | 22.11 | 25.55 | 26.65 |
| MiniCPM-o 2.6 (8B) | 42.83 | 40.23 | 40.26 | 40.53 | 37.29 | 38.00 |
| Qwen2.5-VL-7B | 47.56 | 44.18 | 43.68 | 41.32 | 39.07 | 38.50 |
| NVILA-8B | 41.89 | 35.92 | 36.01 | 42.89 | 32.83 | 33.88 |
| Phi-3.5-vision-instruct | 53.39 | 51.23 | 51.16 | 52.89 | 49.97 | 49.98 |
| InternVL-2.5-1B | 23.46 | 17.33 | 18.14 | 22.11 | 16.48 | 17.26 |
| InternVL-2.5-2B | 25.98 | 22.31 | 22.19 | 30.79 | 24.97 | 24.59 |
| InternVL-2.5-4B | 42.99 | 39.58 | 38.81 | 37.89 | 38.78 | 38.55 |
| InternVL-2.5-8B | 54.96 | 51.42 | 51.64 | 50.53 | 47.07 | 47.22 |
| Fine-Tuned on Libra-Emo | | | | | | |
| Libra-Emo-1B | 53.54 (↑30.08) | 49.44 (↑32.11) | 50.19 (↑32.05) | 46.84 (↑24.73) | 41.53 (↑25.05) | 42.25 (↑24.99) |
| Libra-Emo-2B | 56.38 (↑30.40) | 53.60 (↑31.29) | 53.90 (↑31.71) | 50.26 (↑19.47) | 48.79 (↑23.82) | 48.91 (↑24.32) |
| Libra-Emo-4B | 65.20 (↑22.21) | 64.12 (↑24.54) | 64.41 (↑25.60) | 60.79 (↑22.90) | 61.30 (↑22.52) | 61.61 (↑23.06) |
| Libra-Emo-8B | 71.18 (↑16.22) | 70.51 (↑19.09) | 70.71 (↑19.07) | 70.53 (↑20.00) | 69.94 (↑22.87) | 70.14 (↑22.92) |

(Values in parentheses for the fine-tuned models are absolute gains over the InternVL-2.5 base model of the same size.)
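
For reference, the reported numbers are standard classification metrics. The sketch below shows how they can be computed with scikit-learn, assuming you have gold and predicted label lists; the toy data and the 8-label NEGATIVE_LABELS subset are our own assumptions, not an official evaluation script:

from sklearn.metrics import accuracy_score, f1_score

# Toy inputs; replace with the benchmark's gold and predicted labels.
gold = ["angry", "sad", "happy", "ironic", "neutral"]
pred = ["angry", "fearful", "happy", "ironic", "sad"]

def report(name, y_true, y_pred):
    print(name)
    print("  Accuracy:   ", accuracy_score(y_true, y_pred))
    print("  Macro-F1:   ", f1_score(y_true, y_pred, average="macro"))
    print("  Weighted-F1:", f1_score(y_true, y_pred, average="weighted"))

report("All 13 classes", gold, pred)

# Assumed 8-class negative subset; restrict to samples whose gold label is negative.
NEGATIVE_LABELS = {"angry", "disgusted", "hateful", "frustrated",
                   "sad", "fearful", "despairful", "ironic"}
neg = [(g, p) for g, p in zip(gold, pred) if g in NEGATIVE_LABELS]
report("Negative emotions (8 classes)", [g for g, _ in neg], [p for _, p in neg])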

Key Findings

  1. Our Libra-Emo models significantly outperform their InternVL-2.5 base models, with absolute gains of up to roughly 32 points in accuracy and F1 scores.
  2. The 8B version achieves the best performance, reaching 71.18% accuracy and 70.51% macro-F1 score on all emotions.
  3. For negative emotions, our models show strong performance with up to 70.53% accuracy and 70.14% weighted-F1 score.
  4. The performance scales well with model size, showing consistent improvements from 1B to 8B parameters.

Note: Our technical report with detailed methodology and analysis will be released soon.
