VideoMathQA VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos MBZUAI/VideoMathQA Viewer • Updated Jun 6 • 2.1k • 523 • 7
CASS Large-scale dataset and model suite for cross-architecture GPU code transpilation between CUDA and HIP at both source and assembly levels MBZUAI/cass Viewer • Updated May 28 • 135k • 226 • 2 MBZUAI/cass-bench Viewer • Updated Apr 27 • 40 • 79 • 2
BiMediX2 BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities MBZUAI/BiMediX2-8B-hf Image-Text-to-Text • 8B • Updated Jun 3 • 153 • 1 MBZUAI/BiMediX2-8B Image-Text-to-Text • 8B • Updated Dec 15, 2024 • 61 • 6 MBZUAI/BiMediX2-8B-Bi Image-Text-to-Text • 8B • Updated Dec 15, 2024 • 5 MBZUAI/BiMediX2-70B Image-Text-to-Text • 71B • Updated Dec 15, 2024 • 52 • 1
VideoGPT+ VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding MBZUAI/VideoGPT-plus_Phi3-mini-4k Updated Jun 17, 2024 • 6 MBZUAI/VideoGPT-plus_Phi3-mini-4k_Pretrain Updated Jun 17, 2024 • 1 MBZUAI/VCGBench-Diverse Updated Jul 1, 2024 • 80 • 3 MBZUAI/VCG-plus_112K Viewer • Updated Jun 17, 2024 • 139k • 102 • 7
Video-ChatGPT "Video-ChatGPT" is a video conversation model capable of generating meaningful conversation about videos. MBZUAI/Video-ChatGPT-7B Visual Question Answering • Updated Jun 8, 2023 • 42 MBZUAI/VideoInstruct-100K Viewer • Updated Sep 29, 2023 • 100k • 159 • 43
PALO PALO: A Polyglot Large Multimodal Model for 5B People MBZUAI/PALO-13B Text Generation • Updated Mar 25, 2024 • 6 MBZUAI/MobilePALO-1.7B Text Generation • Updated Mar 25, 2024 • 6 MBZUAI/PALO-7B Text Generation • Updated Mar 25, 2024 • 10 MBZUAI/multilingual-llava-bench-in-the-wild Preview • Updated Mar 3, 2024 • 97 • 3
GeoChat GeoChat is the first grounded Large Vision Language Model, specifically tailored to Remote Sensing (RS) scenarios. MBZUAI/geochat-7B Text Generation • Updated Mar 1, 2024 • 1.11k • 22 MBZUAI/GeoChat-Bench Preview • Updated Mar 5, 2024 • 53 • 3 MBZUAI/GeoChat_Instruct Updated Mar 5, 2024 • 187 • 18
NADI 2025 Sub-task 3 datasets Official training and dev datasets for NADI 2025 Subtask 3 (Diacritic Restoration) Shared Task MBZUAI/ArVoice Viewer • Updated 24 days ago • 46.2k • 730 • 14 MBZUAI/ClArTTS Viewer • Updated Feb 25 • 9.71k • 463 • 14 MBZUAI/TunSwitch Viewer • Updated May 28 • 5.25k • 10 • 1 herwoww/mdaspc Viewer • Updated May 28 • 65.8k • 36
GeoPixel Pixel Grounding Large Multimodal Model in Remote Sensing MBZUAI/GeoPixel-7B Updated Feb 20 • 874 • 5 MBZUAI/GeoPixel-7B-RES Updated Feb 20 • 398 • 2 GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing Paper • 2501.13925 • Published Jan 23 • 8 MBZUAI/GeoPixelD Viewer • Updated Feb 26 • 18.7k • 92 • 3
ArTST - Arabic Text Speech Transformer Open-source project for Arabic Speech Recognition and Generation MBZUAI/artst_asr Automatic Speech Recognition • 0.2B • Updated Mar 6 • 83 • 2 MBZUAI/speecht5_tts_clartts_ar Text-to-Speech • Updated Feb 23, 2024 • 1.11k • 20 ArtstTTS (Space) • 5 ArtstASR (Space) • 3
GLaMM Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks. MBZUAI/GLaMM-FullScope Text Generation • Updated Apr 27, 2024 • 105 • 6 MBZUAI/GranD Updated Apr 17, 2024 • 105 • 13 MBZUAI/GranD-f Preview • Updated Mar 21, 2024 • 120 • 11 MBZUAI/GLaMM-GranD-Pretrained Text Generation • Updated Dec 26, 2023 • 118k • 3
LLaVA++ (LLaMA-3 and Phi-3-Mini) Extending Visual Capabilities of LLaVA with LLaMA-3 and Phi-3 LLaVA++ (LLaMA-3-V) (Space) • Generate answers using a text-based model • 33 LLaVA++ (Phi-3-V) (Space) • Start and control a conversational model server • 24 MBZUAI/LLaVA-Phi-3-mini-4k-instruct Text Generation • 4B • Updated Apr 27, 2024 • 469 • 22 MBZUAI/LLaVA-Meta-Llama-3-8B-Instruct-FT Text Generation • 8B • Updated Apr 27, 2024 • 71 • 12
MobiLlama Collection of MobiLlama Language Models. MBZUAI/MobiLlama-05B Text Generation • Updated Feb 28, 2024 • 417 • 41 MBZUAI/MobiLlama-1B Text Generation • 1B • Updated Feb 28, 2024 • 74 • 18 MBZUAI/MobiLlama-05B-Chat Text Generation • Updated Feb 28, 2024 • 30 • 17 MBZUAI/MobiLlama-08B Text Generation • Updated Feb 28, 2024 • 11 • 6
SatMAE++ Collection of ViT models trained using the SatMAE++ approach. mubashir04/checkpoint_ViT-L_finetune_fmow_rgb Updated Mar 26, 2024 • 1 mubashir04/checkpoint_ViT-L_finetune_fmow_sentinel Updated Mar 26, 2024 • 2 mubashir04/checkpoint_ViT-L_pretrain_fmow_rgb Updated Mar 26, 2024 mubashir04/checkpoint_ViT-L_pretrain_fmow_sentinel Updated Mar 26, 2024