RESA-7B

RESA-7B is a multimodal large language model (MLLM) developed as part of the paper:
πŸ“„ Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation.
This model builds upon the LLaVA architecture, incorporating reasoning enhancement via chain-of-thought (CoT) supervision using VLGuard data.

🧠 Model Overview

  • Base architecture: LLaVA
  • Visual encoder: openai/clip-vit-large-patch14-336
    • To replicate the full RESA-R-7B variant as described in the paper, replace the visual encoder with chs20/fare4-clip.
  • Language model: Vicuna-7B
  • Fine-tuning method: Supervised Fine-Tuning (SFT)
  • Enhancement: Chain-of-Thought (CoT) prefixes are prepended to VLGuard's multimodal instruction data to improve reasoning in safety-sensitive scenarios.
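The components above can be summarized as a small configuration sketch. This is illustrative only: the repo id for Vicuna (`lmsys/vicuna-7b-v1.5`) and the dictionary keys are assumptions, not the actual training configuration; only the encoder ids come from this card.

```python
# Illustrative component configuration (keys and the Vicuna repo id are
# assumptions; encoder ids are from the model card).
RESA_7B_CONFIG = {
    "base_architecture": "llava",
    "vision_tower": "openai/clip-vit-large-patch14-336",
    "language_model": "lmsys/vicuna-7b-v1.5",  # "Vicuna-7B" in the card
    "finetuning": "sft",
}

def robust_variant(config: dict) -> dict:
    """RESA-R-7B: the same recipe with the robust FARE CLIP encoder."""
    return {**config, "vision_tower": "chs20/fare4-clip"}
```

Swapping only the vision tower, as `robust_variant` does, is the single change the card describes for reproducing the RESA-R-7B variant.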

πŸ§ͺ Training Details

  • Dataset: VLGuard (2k samples)
  • Augmentation: Each sample was prepended with a GPT-4o-generated CoT reasoning prefix.
  • Fine-tuning: Conducted on the full 2k augmented samples with supervised instruction-following loss.
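The augmentation step can be sketched as follows. This is a minimal, hypothetical sketch: the field names (`instruction`, `response`) and the prefix wording are illustrative assumptions, not the actual VLGuard schema or the prompts used in the paper.

```python
# Hedged sketch of CoT augmentation: prepend a reasoning prefix to the
# target response before supervised fine-tuning. Field names are
# illustrative, not the real VLGuard schema.
COT_MARKER = "Let's reason step by step about whether this request is safe.\n"

def augment_with_cot(sample: dict, cot_reasoning: str) -> dict:
    """Prepend a (in the paper, GPT-4o-generated) CoT reasoning prefix
    to the supervised target of one instruction-following sample."""
    return {
        "instruction": sample["instruction"],
        "response": COT_MARKER + cot_reasoning + "\n" + sample["response"],
    }
```

The instruction itself is left unchanged; only the supervised target gains the reasoning prefix, so the model learns to emit the reasoning before its final answer.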

πŸ” Intended Use

RESA-7B is designed for research in the evaluation and alignment of multimodal models, particularly in safety-critical scenarios. It is useful for:

  • Safety and trustworthiness evaluation
  • Reasoning in multimodal question answering
  • Studying the effect of CoT augmentation on MLLM behavior

πŸš€ Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("yfwang22/RESA-7B")
model = AutoModelForCausalLM.from_pretrained("yfwang22/RESA-7B")

# Example input (text-only; image inputs additionally require the LLaVA
# image preprocessing pipeline and image-token handling)
input_text = "Describe the safety concern in the given image."

inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
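Since the language model is Vicuna-7B, prompts for LLaVA-style checkpoints typically follow the LLaVA v1 conversation template, with an `<image>` placeholder where the image features are spliced in. A hedged sketch of building such a prompt (the exact template this checkpoint was trained with may differ):

```python
# Sketch of a LLaVA v1 / Vicuna-style prompt; the precise system prompt
# and template used to train this checkpoint are assumptions here.
def build_llava_prompt(question: str, with_image: bool = True) -> str:
    """Format a user question in the common LLaVA v1 conversation style."""
    image_token = "<image>\n" if with_image else ""
    return f"USER: {image_token}{question} ASSISTANT:"
```

Ending the prompt with `ASSISTANT:` cues the model to generate its answer directly after the turn marker.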
Model size: 7.06B params (BF16, Safetensors)