RESA-mix-7B

RESA-mix-7B is a multimodal large language model (MLLM) built on the RESA-7B architecture and fine-tuned with additional training data from LLaVA-Next to enhance multimodal reasoning. The model was developed as part of the paper:
πŸ“„ Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation.

🧠 Model Overview

  • Base architecture: LLaVA
  • Visual encoder: openai/clip-vit-large-patch14-336
    • To replicate the RESA-R-mix-7B variant, replace the visual encoder with chs20/fare4-clip.
  • Language model: Vicuna-7B
  • Fine-tuning method: Supervised Fine-Tuning (SFT)
  • Enhancement: Chain-of-Thought (CoT) prefix generation over a combined dataset
  • New data: mixed with 10k additional samples from LLaVA-Next, with CoT generated by GPT-4o
  • Total SFT samples: VLGuard-CoT + LLaVA-Next-CoT (~12k)
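
The visual-encoder swap mentioned above happens at fine-tuning time. A hedged sketch, assuming the standard LLaVA training script and its `--vision_tower` flag; the exact RESA launch arguments, base-model path, dataset file, and output directory are not given in this card and are shown here only as placeholders:

```shell
# Hypothetical LLaVA-style SFT launch (flags follow the public LLaVA
# training scripts; the exact RESA recipe is an assumption).
# RESA-mix-7B uses --vision_tower openai/clip-vit-large-patch14-336;
# swapping in chs20/fare4-clip yields the RESA-R-mix-7B variant.
deepspeed llava/train/train_mem.py \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --version v1 \
    --data_path ./vlguard_llavanext_cot_12k.json \
    --vision_tower chs20/fare4-clip \
    --output_dir ./checkpoints/resa-r-mix-7b
```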

πŸ§ͺ Training Details

  • Datasets:
    • VLGuard-CoT (CoT generated with GPT-4o)
    • LLaVA-Next 10k (also CoT-augmented with GPT-4o)
  • Training Objective: Multimodal instruction tuning with CoT-augmented supervision
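
CoT-augmented supervision, as described above, amounts to prefixing each target response with a GPT-4o-generated reasoning chain. A minimal sketch of how such an SFT record might be assembled; the field names, the LLaVA-style conversation format, and the `Answer:` separator are assumptions for illustration, not taken from this card:

```python
def build_cot_record(image_path, instruction, cot, answer):
    """Assemble a LLaVA-style SFT sample whose target response is a
    chain-of-thought prefix followed by the final answer."""
    return {
        "image": image_path,
        "conversations": [
            # The <image> token marks where visual features are injected.
            {"from": "human", "value": f"<image>\n{instruction}"},
            # CoT prefix and final answer form a single training target.
            {"from": "gpt", "value": f"{cot}\n\nAnswer: {answer}"},
        ],
    }

record = build_cot_record(
    "examples/0001.jpg",
    "Describe the ethical risks shown in this image.",
    "The image shows private data being shared without consent, "
    "which raises privacy concerns.",
    "The main ethical risk is a violation of privacy.",
)
print(record["conversations"][1]["value"])
```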

πŸ” Intended Use

RESA-mix-7B is designed for:

  • Evaluating MLLM alignment and safety with diverse multimodal instruction datasets
  • Enhancing multimodal reasoning via GPT-4o-guided CoT
  • Studying the trade-off between general-purpose and safety-aligned data on model generalization and trustworthiness

πŸš€ Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("yfwang22/RESA-mix-7B")
model = AutoModelForCausalLM.from_pretrained("yfwang22/RESA-mix-7B")

# Text-only illustration: the model is multimodal, so for image inputs
# the prompt must be paired with an image processed by the visual
# encoder (see the LLaVA codebase for the full multimodal pipeline).
inputs = tokenizer("Describe the ethical risks shown in this image.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Model size: 7.06B parameters (BF16, Safetensors)