Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models Paper • 2310.05863 • Published Oct 9, 2023
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization Paper • 2410.06682 • Published Oct 9, 2024
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models Paper • 2506.15220 • Published Jun 18
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model Paper • 2502.11775 • Published Feb 17