RESA-7B

RESA-7B is a multimodal large language model (MLLM) developed as part of the paper:
πŸ“„ Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation.
This model builds upon the LLaVA architecture, incorporating reasoning enhancement via chain-of-thought (CoT) supervision using VLGuard data.

🧠 Model Overview

  • Base architecture: LLaVA
  • Visual encoder: openai/clip-vit-large-patch14-336
    • To replicate the full RESA-R-7B variant as described in the paper, replace the visual encoder with chs20/fare4-clip.
  • Language model: Vicuna-7B
  • Fine-tuning method: Supervised Fine-Tuning (SFT)
  • Enhancement: Chain-of-Thought (CoT) prefixes are prepended to VLGuard's multimodal instruction data to improve reasoning in safety-sensitive scenarios.
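The components above can be summarized as a small configuration sketch. This is illustrative only: the repo id for Vicuna (`lmsys/vicuna-7b-v1.5`) and the dictionary keys are assumptions, not the actual training configuration; only the encoder ids come from this card.

```python
# Illustrative component configuration (keys and the Vicuna repo id are
# assumptions; encoder ids are from the model card).
RESA_7B_CONFIG = {
    "base_architecture": "llava",
    "vision_tower": "openai/clip-vit-large-patch14-336",
    "language_model": "lmsys/vicuna-7b-v1.5",  # "Vicuna-7B" in the card
    "finetuning": "sft",
}

def robust_variant(config: dict) -> dict:
    """RESA-R-7B: the same recipe with the robust FARE CLIP encoder."""
    return {**config, "vision_tower": "chs20/fare4-clip"}
```

Swapping only the vision tower, as `robust_variant` does, is the single change the card describes for reproducing the RESA-R-7B variant.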

πŸ§ͺ Training Details

  • Dataset: VLGuard (2k samples)
  • Augmentation: Each sample was prepended with a GPT-4o-generated CoT reasoning prefix.
  • Fine-tuning: Conducted on the full 2k augmented samples with supervised instruction-following loss.
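The augmentation step can be sketched as follows. This is a minimal, hypothetical sketch: the field names (`instruction`, `response`) and the prefix wording are illustrative assumptions, not the actual VLGuard schema or the prompts used in the paper.

```python
# Hedged sketch of CoT augmentation: prepend a reasoning prefix to the
# target response before supervised fine-tuning. Field names are
# illustrative, not the real VLGuard schema.
COT_MARKER = "Let's reason step by step about whether this request is safe.\n"

def augment_with_cot(sample: dict, cot_reasoning: str) -> dict:
    """Prepend a (in the paper, GPT-4o-generated) CoT reasoning prefix
    to the supervised target of one instruction-following sample."""
    return {
        "instruction": sample["instruction"],
        "response": COT_MARKER + cot_reasoning + "\n" + sample["response"],
    }
```

The instruction itself is left unchanged; only the supervised target gains the reasoning prefix, so the model learns to emit the reasoning before its final answer.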

πŸ” Intended Use

RESA-7B is designed for research in the evaluation and alignment of multimodal models, particularly in safety-critical scenarios. It is useful for:

  • Safety and trustworthiness evaluation
  • Reasoning in multimodal question answering
  • Studying the effect of CoT augmentation on MLLM behavior

πŸš€ Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("yfwang22/RESA-7B")
model = AutoModelForCausalLM.from_pretrained("yfwang22/RESA-7B")

# Example input (text-only; image inputs additionally require the LLaVA
# image preprocessing pipeline and image-token handling)
input_text = "Describe the safety concern in the given image."

inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
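Since the language model is Vicuna-7B, prompts for LLaVA-style checkpoints typically follow the LLaVA v1 conversation template, with an `<image>` placeholder where the image features are spliced in. A hedged sketch of building such a prompt (the exact template this checkpoint was trained with may differ):

```python
# Sketch of a LLaVA v1 / Vicuna-style prompt; the precise system prompt
# and template used to train this checkpoint are assumptions here.
def build_llava_prompt(question: str, with_image: bool = True) -> str:
    """Format a user question in the common LLaVA v1 conversation style."""
    image_token = "<image>\n" if with_image else ""
    return f"USER: {image_token}{question} ASSISTANT:"
```

Ending the prompt with `ASSISTANT:` cues the model to generate its answer directly after the turn marker.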
Model size: 7.06B params (BF16, Safetensors)