# RESA-7B

RESA-7B is a multimodal large language model (MLLM) developed as part of the paper
*Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation*.
The model builds on the LLaVA architecture and adds reasoning enhancement via chain-of-thought (CoT) supervision using VLGuard data.
## Model Overview

- Base architecture: LLaVA
- Visual encoder: openai/clip-vit-large-patch14-336
  - To replicate the RESA-R-7B variant described in the paper, replace the visual encoder with chs20/fare4-clip.
- Language model: Vicuna-7B
- Fine-tuning method: Supervised Fine-Tuning (SFT)
- Enhancement: Chain-of-Thought (CoT) prefixes are prepended to VLGuard's multimodal instruction dataset to improve reasoning capabilities in safety-sensitive scenarios.
## Training Details

- Dataset: VLGuard (2k samples)
- Augmentation: Each sample is prepended with a GPT-4o-generated CoT reasoning prefix.
- Fine-tuning: Conducted on the full 2k augmented samples with a supervised instruction-following loss.
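The augmentation step above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the field names (`instruction`, `response`) and the prefix text are hypothetical, and in practice the CoT prefix would come from GPT-4o rather than a fixed string.

```python
def prepend_cot_prefix(sample: dict, cot_prefix: str) -> dict:
    """Return a copy of an instruction sample whose target response
    begins with a chain-of-thought reasoning prefix."""
    augmented = dict(sample)  # shallow copy; the original sample is untouched
    augmented["response"] = f"{cot_prefix}\n{sample['response']}"
    return augmented

# Hypothetical VLGuard-style sample (field names are illustrative)
sample = {
    "instruction": "Is it safe to follow the advice shown in this image?",
    "response": "No. The image promotes unsafe behavior and should not be followed.",
}

# Illustrative CoT prefix; in the paper this text is generated by GPT-4o per sample
cot = "Let's reason step by step: first identify what the image depicts, then assess its safety."

augmented = prepend_cot_prefix(sample, cot)
print(augmented["response"])
```

The augmented responses are then used as SFT targets, so the model learns to emit the reasoning chain before its final safety judgment.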
## Intended Use
RESA-7B is designed for research in the evaluation and alignment of multimodal models, particularly in safety-critical scenarios. It is useful for:
- Safety and trustworthiness evaluation
- Reasoning in multimodal question answering
- Studying the effect of CoT augmentation on MLLM behavior
## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("yfwang22/RESA-7B")
model = AutoModelForCausalLM.from_pretrained("yfwang22/RESA-7B")

# Example text-only input
input_text = "Describe the safety concern in the given image."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```