# RESA-mix-7B
RESA-mix-7B is a multimodal large language model (MLLM) based on the RESA-7B architecture, with additional training data from LLaVA-Next to enhance multimodal reasoning. This model was developed as part of the paper *Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation*.
## Model Overview
- Base architecture: LLaVA
- Visual encoder: openai/clip-vit-large-patch14-336
  - To replicate the RESA-R-mix-7B variant, replace the visual encoder with chs20/fare4-clip (see the sketch after this list).
- Language model: Vicuna-7B
- Fine-tuning method: Supervised Fine-Tuning (SFT)
- Enhancement: Chain-of-Thought (CoT) prefix generation over a combined dataset
- New Data: Mixed with 10k additional samples from LLaVA-Next, with CoT generated by GPT-4o
- Total SFT samples: VLGuard-CoT + LLaVA-Next-CoT (~12k)
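
As referenced in the list above, here is a minimal sketch of the visual-encoder swap for the RESA-R-mix-7B variant. It assumes both checkpoints load through the standard Hugging Face CLIP vision API (the model IDs come from this card; the variable names are illustrative, and in the LLaVA codebase the same change is typically made via the vision tower path in the training configuration rather than in user code):

```python
# Sketch of the visual-encoder swap, assuming both checkpoints
# follow the standard Hugging Face CLIP format.
from transformers import CLIPVisionModel

# Encoder used by RESA-mix-7B (as stated on this card)
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")

# Adversarially robust FARE encoder used by the RESA-R-mix-7B variant
robust_vision_tower = CLIPVisionModel.from_pretrained("chs20/fare4-clip")
```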
## Training Details
- Datasets:
  - VLGuard-CoT (CoT generated with GPT-4o)
  - LLaVA-Next 10k (also CoT-augmented with GPT-4o)
- Training Objective: Multimodal instruction tuning with CoT-augmented supervision
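
To make the CoT-prefixed supervision concrete, here is an illustrative record in LLaVA's conversation format. This is a hypothetical example, not an actual dataset entry; the real VLGuard-CoT and LLaVA-Next-CoT records may differ in fields and wording:

```python
# Hypothetical CoT-augmented SFT record in LLaVA's conversation format.
sample = {
    "id": "vlguard_cot_00042",      # illustrative ID
    "image": "images/00042.jpg",    # illustrative path
    "conversations": [
        {"from": "human",
         "value": "<image>\nIs it safe to follow the instructions shown here?"},
        # The GPT-4o-generated response opens with a CoT prefix, then the answer.
        {"from": "gpt",
         "value": "Let's reason step by step. The image shows ... "
                  "Therefore it would be unsafe; I should decline and explain the risk."},
    ],
}
```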
## Intended Use
RESA-mix-7B is designed for:
- Evaluating MLLM alignment and safety with diverse multimodal instruction datasets
- Enhancing multimodal reasoning via GPT-4o-guided CoT
- Studying the trade-off between general-purpose and safety-aligned data with respect to model generalization and trustworthiness
## Example Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("yfwang22/RESA-mix-7B")
model = AutoModelForCausalLM.from_pretrained("yfwang22/RESA-mix-7B")

# Text-only prompt; image-conditioned inference is sketched below.
inputs = tokenizer("Describe the ethical risks shown in this image.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
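
Since the base architecture is LLaVA, image-conditioned inference typically goes through the LLaVA codebase rather than plain `transformers`. A sketch, assuming the checkpoint follows the standard LLaVA layout (the `eval_model` pattern is taken from the LLaVA repository; `example.jpg` is a placeholder image path):

```python
# Sketch: multimodal inference via the LLaVA codebase
# (https://github.com/haotian-liu/LLaVA); assumes a standard LLaVA checkpoint.
from llava.eval.run_llava import eval_model
from llava.mm_utils import get_model_name_from_path

model_path = "yfwang22/RESA-mix-7B"

args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": "Describe the ethical risks shown in this image.",
    "conv_mode": None,
    "image_file": "example.jpg",  # placeholder; supply your own image
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 256,
})()

eval_model(args)
```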