RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language

Project Page: https://bashlab.github.io/raven_project/ β€’ Code: https://github.com/BASHLab/RAVEN


Abstract

Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning -- each stage targeting a distinct challenge in multimodal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio-Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multimodal QA benchmarks show that RAVEN achieves accuracy gains of up to 14.5% on egocentric and 8.0% on exocentric tasks over state-of-the-art multimodal large language models. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming state-of-the-art baselines by 50.23%. Our code and dataset are available via the project page and repository linked above.
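For intuition only, below is a minimal PyTorch sketch of query-conditioned cross-modal gating in the spirit of QuART. The class, layer names, and scoring function are our own simplifications and assumptions, not the released implementation; see the repository for the actual module.

```python
import torch
import torch.nn as nn

class QueryGatedFusion(nn.Module):
    """Toy query-conditioned gating: one scalar relevance score per token,
    used to scale tokens before fusion. Illustrative sketch, not RAVEN's QuART code."""

    def __init__(self, dim):
        super().__init__()
        # Hypothetical projections; names are not from the RAVEN codebase.
        self.query_proj = nn.Linear(dim, dim)
        self.token_proj = nn.Linear(dim, dim)

    def forward(self, query, modality_tokens):
        # query: (B, D) pooled question embedding
        # modality_tokens: list of (B, N_m, D) token sequences, e.g. [audio, video, sensor]
        q = self.query_proj(query).unsqueeze(1)                 # (B, 1, D)
        gated = []
        for tokens in modality_tokens:
            k = self.token_proj(tokens)                         # (B, N_m, D)
            scores = (k * q).sum(dim=-1) / k.shape[-1] ** 0.5   # (B, N_m): scalar relevance per token
            gate = torch.sigmoid(scores).unsqueeze(-1)          # (B, N_m, 1)
            gated.append(tokens * gate)                         # amplify informative tokens, damp distractors
        return torch.cat(gated, dim=1)                          # query-weighted token stream for fusion

# Quick shape check with random tensors
fusion = QueryGatedFusion(dim=256)
question = torch.randn(2, 256)
audio, video, sensor = (torch.randn(2, n, 256) for n in (16, 64, 8))
print(fusion(question, [audio, video, sensor]).shape)  # torch.Size([2, 88, 256])
```

The sigmoid gate leaves token counts unchanged and simply rescales each token by its question-conditioned relevance, mirroring the amplify-or-suppress behavior the abstract attributes to QuART.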


πŸš€ Main Results

Comparison of RAVEN and prior MLLMs on exocentric open-ended video QA (MSVD-QA, MSRVTT-QA, ActivityNet-QA) and audio-visual QA (AVSD, MUSIC-QA) benchmarks. Best and second-best scores are in $\textbf{Bold}$ and $\underline{\text{underline}}$. $^*$ indicates scores reproduced by us.

Comparison of RAVEN with MLLMs on the EgoThink (Reasoning) and AVS-QA benchmarks. RAVEN outperforms prior MLLMs across metrics and excels in reasoning. $\textbf{Bold}$ and $\underline{\text{underline}}$ indicate the best and second-best scores.


πŸ“ AVS-QA Dataset

The train and test splits of AVS-QA are provided here.
More details are available here.

πŸ› οΈ Requirements and Installation

Basic Dependencies:

  • Python >= 3.8
  • PyTorch >= 2.2.0
  • CUDA Version >= 11.8
  • transformers == 4.40.0 (for reproducing paper results)
  • tokenizers == 0.19.1
Then install the remaining dependencies:

cd RAVEN
pip install -r requirements.txt
pip install flash-attn==2.5.8 --no-build-isolation
pip install opencv-python==4.5.5.64
apt-get update && apt-get install ffmpeg libsm6 libxext6 -y
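
As a quick sanity check of the pinned versions above, the following snippet (a convenience sketch, not part of the official setup) prints the installed versions:

```python
# Convenience check of the versions pinned above; not part of the official setup.
import torch
import transformers
import tokenizers

print("PyTorch:", torch.__version__)                  # expect >= 2.2.0
print("CUDA available:", torch.cuda.is_available())   # CUDA >= 11.8 is required
print("transformers:", transformers.__version__)      # 4.40.0 to reproduce paper results
print("tokenizers:", tokenizers.__version__)          # 0.19.1
```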

πŸ€ Model Zoo

| Model Name | Modal Type |
|------------|------------|
| RAVEN-7B-AV | AV |
| RAVEN-7B-AVS | AVS |

πŸ€– Sample Usage

CUDA_VISIBLE_DEVICES=0 python inference.py --model-path=<MODEL PATH> --modal-type=<MODAL TYPE>

Replace <MODEL PATH> with the path to a downloaded checkpoint and <MODAL TYPE> with that checkpoint's modal type from the Model Zoo above.

πŸ‘ Acknowledgement

The codebase of RAVEN is adapted from VideoLLaMA2, and we are grateful for their contribution.

Model size: 8.6B params (Safetensors, BF16)