RESA-mix-7B

RESA-mix-7B is a multimodal large language model (MLLM) built on the RESA-7B architecture and fine-tuned with additional training data from LLaVA-Next to enhance multimodal reasoning. The model was developed as part of the paper:
πŸ“„ Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation.

🧠 Model Overview

  • Base architecture: LLaVA
  • Visual encoder: openai/clip-vit-large-patch14-336
    • To replicate the RESA-R-mix-7B variant, replace the visual encoder with chs20/fare4-clip.
  • Language model: Vicuna-7B
  • Fine-tuning method: Supervised Fine-Tuning (SFT)
  • Enhancement: Chain-of-Thought (CoT) prefix generation over a combined dataset
  • New data: mixed with 10k additional samples from LLaVA-Next, with CoT generated by GPT-4o
  • Total SFT samples: VLGuard-CoT + LLaVA-Next-CoT (~12k)
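
The visual-encoder swap mentioned above happens at fine-tuning time. A hedged sketch, assuming the standard LLaVA training script and its `--vision_tower` flag; the exact RESA launch arguments, base-model path, dataset file, and output directory are not given in this card and are shown here only as placeholders:

```shell
# Hypothetical LLaVA-style SFT launch (flags follow the public LLaVA
# training scripts; the exact RESA recipe is an assumption).
# RESA-mix-7B uses --vision_tower openai/clip-vit-large-patch14-336;
# swapping in chs20/fare4-clip yields the RESA-R-mix-7B variant.
deepspeed llava/train/train_mem.py \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --version v1 \
    --data_path ./vlguard_llavanext_cot_12k.json \
    --vision_tower chs20/fare4-clip \
    --output_dir ./checkpoints/resa-r-mix-7b
```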

πŸ§ͺ Training Details

  • Datasets:
    • VLGuard-CoT (CoT generated with GPT-4o)
    • LLaVA-Next 10k (also CoT-augmented with GPT-4o)
  • Training Objective: Multimodal instruction tuning with CoT-augmented supervision
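
CoT-augmented supervision, as described above, amounts to prefixing each target response with a GPT-4o-generated reasoning chain. A minimal sketch of how such an SFT record might be assembled; the field names, the LLaVA-style conversation format, and the `Answer:` separator are assumptions for illustration, not taken from this card:

```python
def build_cot_record(image_path, instruction, cot, answer):
    """Assemble a LLaVA-style SFT sample whose target response is a
    chain-of-thought prefix followed by the final answer."""
    return {
        "image": image_path,
        "conversations": [
            # The <image> token marks where visual features are injected.
            {"from": "human", "value": f"<image>\n{instruction}"},
            # CoT prefix and final answer form a single training target.
            {"from": "gpt", "value": f"{cot}\n\nAnswer: {answer}"},
        ],
    }

record = build_cot_record(
    "examples/0001.jpg",
    "Describe the ethical risks shown in this image.",
    "The image shows private data being shared without consent, "
    "which raises privacy concerns.",
    "The main ethical risk is a violation of privacy.",
)
print(record["conversations"][1]["value"])
```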

πŸ” Intended Use

RESA-mix-7B is designed for:

  • Evaluating MLLM alignment and safety with diverse multimodal instruction datasets
  • Enhancing multimodal reasoning via GPT-4o-guided CoT
  • Studying the trade-off between general-purpose and safety-aligned data on model generalization and trustworthiness

πŸš€ Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("yfwang22/RESA-mix-7B")
model = AutoModelForCausalLM.from_pretrained("yfwang22/RESA-mix-7B")

# Text-only illustration: the model is multimodal, so for image inputs
# the prompt must be paired with an image processed by the visual
# encoder (see the LLaVA codebase for the full multimodal pipeline).
inputs = tokenizer("Describe the ethical risks shown in this image.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Model size: 7.06B parameters (BF16, Safetensors)